The reCAPTCHA service that helps Web sites tell humans and computers apart has been acquired by Google. Started by Carnegie Mellon University computer science professor Luis von Ahn, the company feeds out millions of distorted images a day that deter malicious or commercially motivated automated behavior. Von Ahn and colleagues came up with the term CAPTCHA (a contrived acronym for "Completely Automated Public Turing test to tell Computers and Humans Apart"), wrote an early paper on the topic, and have continued to advance the academic and practical elements.
A CAPTCHA is a puzzle that humans can solve relatively easily but that stymies computers; it's used as a type of Turing test. Alan Turing's famous test involved two parties, one a computer, attempting to convince a human interlocutor of their respective humanity. With CAPTCHAs, an automated system poses what are typically hard problems in artificial intelligence, still mostly centered on machine vision and text recognition, to ferret out faux folk.
The reCAPTCHA approach is particularly interesting, because it relies on a large base of scanned words that have failed separate attempts at optical character recognition by two different systems. The source scans include a commercial project for The New York Times to turn its vast archives into text. Such OCR-resistant words are perfect puzzles for humans, and they're helpful for fixing what are called "suspect" words in OCR. reCAPTCHA currently processes five to seven million words a day through nearly 40 million CAPTCHAs.
(The secret of how reCAPTCHA gets machines to act as arbiters of human intelligence? The system always provides two words, one of which is known and drawn from a huge database. Anyone who types the known word correctly is presumed to be human, and that person's reading of the unknown word is recorded. The unknown word is shown to several people, most of whom must provide the same response, before the word is deemed solved.)
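For the curious, the two-word trick can be sketched in a few lines of Python. This is only an illustration of the idea described above, not reCAPTCHA's actual code: the names `UnknownWord` and `check_response` are hypothetical, and the thresholds `MIN_VOTES` and `MIN_SHARE` are assumptions standing in for "several people" and "most of whom."

```python
from collections import Counter
from dataclasses import dataclass, field

# Assumed thresholds; the real service's values are not given in this article.
MIN_VOTES = 3      # "several people" must have answered
MIN_SHARE = 0.5    # "most" of them must agree (strictly more than half)


@dataclass
class UnknownWord:
    """A scanned word that OCR could not read, plus the human guesses so far."""
    image_id: str
    votes: Counter = field(default_factory=Counter)

    def solved(self):
        """Return the agreed reading, or None if there is no consensus yet."""
        total = sum(self.votes.values())
        if total < MIN_VOTES:
            return None
        answer, count = self.votes.most_common(1)[0]
        return answer if count / total > MIN_SHARE else None


def normalize(text):
    """Compare answers case-insensitively, ignoring surrounding spaces."""
    return text.strip().lower()


def check_response(known_answer, typed_known, unknown, typed_unknown):
    """Grade one two-word challenge.

    The known word decides pass or fail. Only when it is typed correctly
    is the reading of the unknown word counted as a vote.
    """
    if normalize(typed_known) != normalize(known_answer):
        return False
    unknown.votes[normalize(typed_unknown)] += 1
    return True
```

In this sketch, three people who pass the known word and type "upon," "upon," and "up0n" for the unknown one would settle it as "upon," while anyone who flubs the known word fails the test and contributes no vote.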
Google uses CAPTCHAs to prevent automated creation of accounts, automated comment spam in Blogger, and general havoc. TidBITS uses CAPTCHAs to protect email addresses of authors and restrict TipBITS submissions to real people. (Our comment verification process relies on email to determine whether or not the submitter is a person.)
I recently wrote an extensive article for The Economist about CAPTCHAs, focusing on reCAPTCHA. The reason for a spotlight on its efforts, even as I described the scope of the field, was that reCAPTCHA is widely admired for having come up with a clever and still-functional method of finding things humans do well. There are many other such categories, but OCR resistance is low-hanging fruit.
The use of reCAPTCHA at Google will be, von Ahn says in his inaugural Google Blog post, "not only to increase fraud and spam protection for Google products but also to improve our books and newspaper scanning process."
That's right! We're all going to have a hand in Google's book conversion project, one word at a time.