Keeping Robots Out of Your Corner of the Net
Search engines and search tools have become ubiquitous on the Internet. People flock to search engine sites to find information quickly, and the information available comes with startling breadth and depth (see Kirk McElhearn’s article in TidBITS 333).
For instance, I just searched AltaVista for "watermelon." I’ve barely scratched the surface of my search results, but I’ve already read about the status of the Texas watermelon crop, scanned an article about preparing watermelon (along with nutritional information), and visited a Web page devoted to Cezanne’s painting, "Still Life with Watermelon and Pomegranates."
Indexing Robots — Search engines acquire much of their information through robots. Also known as spiders or crawlers, robots traverse the Web, looking for and recording information. Robots typically start with URLs that seem like a reasonable starting spot, such as a URL submitted by a user, a page having lots of links, or the top level of a site. A robot accesses the initial page and then recursively accesses all pages linked to from that page. The robot might also check out all pages that it can find on a particular server. After accessing a page, the robot works with the search engine to index portions of the page, perhaps the title, some or all of the text, specific keywords, or other tagged elements.
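The traversal described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the `LINKS` table is a hypothetical miniature Web held in memory, `crawl` is a made-up helper name, and the sketch uses a queue rather than literal recursion so that each page is visited exactly once.

```python
from collections import deque

# Hypothetical miniature "Web": each page maps to the pages it links to.
LINKS = {
    "/": ["/fruit.html", "/about.html"],
    "/fruit.html": ["/watermelon.html", "/"],
    "/about.html": [],
    "/watermelon.html": ["/fruit.html"],
}


def crawl(start, links):
    """Return every page reachable from `start`, in the order visited."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)  # a real robot would fetch and index the page here
        for target in links.get(page, []):
            if target not in seen:  # skip pages already scheduled
                seen.add(target)
                queue.append(target)
    return order


print(crawl("/", LINKS))
```

Tracking the pages already seen is what keeps the robot from looping forever when two pages link to each other, as the fruit and watermelon pages do here.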
One topic that deserves attention, however, is how to prevent search engines from indexing individual Web pages or Usenet news postings. Conventions exist to keep robots out of specially marked Web pages or news postings, though compliance with these conventions is purely voluntary on the part of each robot. So far, mainstream search engines appear to respect them.
Hey You, Get Out of My Site — Using the Robots Exclusion Protocol, you can ask robots to ignore Web pages that you don’t want indexed. For example, you might want to store club meeting minutes on the Web without having those minutes show up in search engines. You could, of course, set up a password system, but that might be a more complicated solution than you wish to implement. You might also have a site whose pages change so frequently that there’s no point in a robot attempting to index them.
To tell robots to go away, you place a robots.txt file at the root level of a Web site. Using a specific syntax, this file tells robots that they should keep out of certain (or all) sections of the server. If you want to set up such a file, I recommend reading the World Wide Web Robots, Wanderers, and Spiders page.
As a brief example, though, to ask all robots to keep out of a directory called watermelon, your robots.txt file might look like this:
User-agent: *
Disallow: /watermelon/
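One way to check how a well-behaved robot would read such a file is Python's standard `urllib.robotparser` module. The sketch below is an illustration under stated assumptions: the two-line file uses the standard `User-agent` and `Disallow` syntax, while the robot name `ExampleBot`, the host `www.example.com`, and the `allowed` helper are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt asking every robot to stay out of /watermelon/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /watermelon/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse the text directly, no network access


def allowed(path, agent="ExampleBot"):
    """Report whether `agent` may fetch `path` (host name is a placeholder)."""
    return parser.can_fetch(agent, "http://www.example.com" + path)


print(allowed("/watermelon/crop.html"))  # inside the excluded directory
print(allowed("/index.html"))            # everywhere else
```

The first call reports that the page is off limits and the second that it is fair game. Nothing here stops a robot that never consults the file.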
If you don’t have enough control over your server to set up a robots.txt file, you can try adding a META tag to the head section of an HTML document. For instance, a tag like this:
<META NAME="ROBOTS" CONTENT="NOINDEX">
tells robots not to index that particular page. Or, a tag like this:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
tells robots not to follow links on the page. Support for the META tag among robots is more sporadic than support for the Robots Exclusion Protocol, although most of the major Web indexes currently support it. Information on the robot META tag can be found in the Spidering BOF (Birds of a Feather) Report.
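To see how a robot might read these tags, here is a minimal sketch using Python's standard `html.parser` module. The class name `RobotsMetaParser` and the helper `robots_directives` are hypothetical, and the sketch assumes that several values in one CONTENT attribute are separated by commas. Real robots may parse pages differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <META NAME="ROBOTS"> tags in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Assumption: multiple values are comma-separated.
        for item in content.split(","):
            if item.strip():
                self.directives.add(item.strip().lower())


def robots_directives(html):
    """Return the set of robots directives found in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<HTML><HEAD><META NAME="ROBOTS" CONTENT="NOINDEX"></HEAD><BODY></BODY></HTML>'
found = robots_directives(page)
print("noindex" in found, "nofollow" in found)
```

A cooperative robot would skip indexing when it finds `noindex` and leave the page's links alone when it finds `nofollow`.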
Private News — To keep the fingers of search engines out of your Usenet news postings, you can create an "X-no-archive" line in your postings’ headers:
X-no-archive: yes
Although common news clients, such as NewsWatcher, permit you to add an X-no-archive line to the headers of your news postings, you aren’t completely out of luck if your client doesn’t permit you to do so. At least one engine, Deja News, will ignore your posting if you make the following text the first line of text in the body of your message:
X-no-archive: yes
In addition, if you inquire personally, Deja News will remove your posts from their archive; to ask, send email to <[email protected]>.
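Both ways of marking a posting can be sketched with Python's standard `email.message` module. This is an illustration only: the helper name `build_post`, the address, newsgroup, and subject are all made up, and the sketch assumes the conventional value `yes` for the X-no-archive line. It builds the message text without posting anything.

```python
from email.message import EmailMessage


def build_post(body, header_ok=True):
    """Build a posting that asks archives to skip it.

    header_ok=True  -> add an X-no-archive header line.
    header_ok=False -> fall back to making it the first line of the body,
                       for clients that cannot add custom headers.
    """
    msg = EmailMessage()
    msg["From"] = "poster@example.com"   # placeholder address
    msg["Newsgroups"] = "misc.test"      # placeholder newsgroup
    msg["Subject"] = "Watermelon recipes"
    if header_ok:
        msg["X-no-archive"] = "yes"      # assumed conventional value
        msg.set_content(body)
    else:
        msg.set_content("X-no-archive: yes\n" + body)
    return msg


print(build_post("Chill before serving.")["X-no-archive"])
print(build_post("Chill before serving.", header_ok=False).get_content())
```

Whether an archive honors either form is up to the archive, so treat both as requests rather than guarantees.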
Assumption of Non-Privacy — Confusion regarding privacy and Internet indexing systems usually stems from the assumption (made by most search engines) that all information they find is public unless marked otherwise.
Many Internet veterans have no problem with the search engines’ assumption that all information is public, since much of the material has always been available one way or another. However, some new Internet users find the practice startlingly invasive. For these users, it’s like being told that every phone call they made during the last year was recorded by a private company, which is now giving away those conversations to anyone who asks.
The long-term memory of these search engines makes the ramifications of their behavior larger than ever. Though Digital’s AltaVista search engine currently only remembers the last few months of Usenet, Deja News has archives going back to early 1995, and repeatedly claims that it wants to index all the way back to Usenet’s inception in 1979, where possible. In 1979, how many Usenet users could have known about the X-no-archive tag? Furthermore, though the robot and archive exclusion standards may help keep your material out of major, high-profile indexes, there are indexing and archiving systems out there that respect no such rules.
If you’re highly concerned about the privacy of your email and Usenet postings, check out anonymous remailers and PGP, a controversial strong encryption program from Phil Zimmermann. Both topics are beyond the scope of this article.
If you’re not particularly concerned about privacy, remember all the same that your words on the Internet may become immortal: anything you write on Usenet will be archived somewhere for eternity, and anything you publish on the Web will be indexed somewhere. Choose your words with care. You may have to stand behind them in a future situation that you cannot currently imagine.
In the future, as privacy becomes a larger issue on the Internet horizon, we can probably expect commercial and consumer newsreaders and publishing tools to tout "privacy compatibility" as a feature. No doubt newsreaders will soon come pre-configured to insert X-no-archive headers by default, and Web authoring programs will come with preferences to insert robot META tags and create robots.txt files automatically. However, these features will not alter the fundamental assumption of Internet indexing tools: everything is public.