Question: How do I make a Web site searchable? Beverley Neff writes: "I was on a Web page recently that had a search form to search that Web site. They were using Excite, but when I went to the Excite home page I couldn’t find any information. Could you tell me more about what is required to make a Web site searchable?"
Answer: Adding search capabilities for a particular Web site, as opposed to an Internet-wide search engine like Excite, requires three separate pieces: an index builder, a search engine to examine the index, and an interface to compose queries that are sent to the search engine.
Internet-wide search engines use programs called spiders or robots to trace links, but a search system for a particular site can read the files directly out of the local directory structure. The index builder essentially creates a concordance of all the Web pages you point it at – that is, a list in which every word found on any page appears just once, along with pointers noting which pages that word appears on.
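A concordance of this kind is easy to sketch in a few lines of Python. The code below is only an illustration, not how any particular product works: the function names, the sample page names, and the choice to pass pages in as a dictionary (rather than reading them from a directory) are all assumptions made to keep the example short.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_index(pages):
    """Map each distinct word to the set of page names it appears on.

    `pages` is a dict of page name -> page text, a hypothetical stand-in
    for files read out of the site's directory structure.
    """
    index = defaultdict(set)
    for name, text in pages.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

# Made-up sample pages for illustration.
pages = {
    "index.html": "Gold prices and the luster of gold",
    "about.html": "About this site",
}
index = build_index(pages)
print(sorted(index["gold"]))  # ['index.html']
```

Each word is stored once as a key, no matter how often it occurs, and the set attached to it plays the role of the pointers back to pages.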
You can add some complexity by using an index builder that also stores proximity information. Each word then has not only a list of the pages it appears on, but also a matrix of the other words it appears near and the pages on which that happens. These matrices make it easier to find complex matches, like "all pages that contain gold and luster within 10 words of each other."
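One way to picture proximity searching is sketched below. Note that this sketch does not build the word-by-word matrix described above. Instead it records the position of every word on every page, a common alternative that answers the same "within 10 words" question. All names and sample pages here are hypothetical.

```python
import re
from collections import defaultdict

def build_positional_index(pages):
    """Map word -> page name -> list of positions of that word on the page."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in pages.items():
        words = re.findall(r"[a-z0-9']+", text.lower())
        for pos, word in enumerate(words):
            index[word][name].append(pos)
    return index

def near(index, a, b, distance=10):
    """Return the pages on which words a and b occur within `distance` words."""
    pages_a = index.get(a, {})
    pages_b = index.get(b, {})
    hits = set()
    # Only pages containing both words can possibly match.
    for name in pages_a.keys() & pages_b.keys():
        if any(abs(i - j) <= distance
               for i in pages_a[name] for j in pages_b[name]):
            hits.add(name)
    return hits

# Made-up sample pages: the words are close on one page, far apart on the other.
pages = {
    "a.html": "the luster of pure gold",
    "b.html": "gold " + "filler " * 20 + "luster",
}
index = build_positional_index(pages)
print(near(index, "gold", "luster"))  # {'a.html'}
```

Comparing every pair of positions is the simplest approach, not the fastest. A real index builder would presumably do something smarter for common words.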
The index is built as a separate step, often overnight. For sites that change frequently, some software allows incremental additions to an index; other software requires the entire index to be rebuilt each time. For large sites – think Microsoft – enormous resources have to be devoted to processing and storing the indexes.
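To make the difference between incremental updates and full rebuilds concrete, here is a minimal sketch of an incremental update. It assumes each page carries some kind of modification stamp (a file's modification time would do), and it re-indexes only pages whose stamp has changed since the last run. The function name and data layout are invented for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())

def update_index(index, seen, pages):
    """Re-index only the pages whose modification stamp changed.

    index: dict of word -> set of page names (updated in place)
    seen:  dict of page name -> stamp recorded at the last indexing run
    pages: dict of page name -> (stamp, text) for the site as it is now
    Returns the sorted list of page names that were re-indexed.
    """
    changed = [n for n, (stamp, _) in pages.items() if seen.get(n) != stamp]
    removed = [n for n in seen if n not in pages]
    stale = set(changed) | set(removed)
    # Drop old entries that point at changed or deleted pages.
    for word in list(index):
        index[word] -= stale
        if not index[word]:
            del index[word]
    # Add fresh entries for the changed pages only.
    for name in changed:
        stamp, text = pages[name]
        for word in tokenize(text):
            index.setdefault(word, set()).add(name)
        seen[name] = stamp
    for name in removed:
        del seen[name]
    return sorted(changed)

index, seen = {}, {}
first = update_index(index, seen, {"a.html": (1, "gold luster"),
                                   "b.html": (1, "silver")})
second = update_index(index, seen, {"a.html": (2, "gold coin"),
                                    "b.html": (1, "silver")})
print(first)   # ['a.html', 'b.html']
print(second)  # ['a.html']
```

On the second run only a.html is read again. Software that always rebuilds would in effect start with an empty index and an empty record of seen pages every time.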
The search engine handles queries, reading entries from the index and trying to find pages that match the conditions. The interface to the search engine is often a Web form that explains how search queries have to be formulated. If you look at AltaVista, you can see both extremes. Their simple search lets you enter a number of keywords. If you enter just keywords, it finds all pages on which any of the words appear. Putting a plus sign in front of a word restricts the matches to pages on which that word appears.
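The simple search rules just described (match any keyword, but require any word marked with a plus sign) can be sketched as follows. This is an illustration of those two rules only, not AltaVista's actual implementation; it ignores ranking entirely, and the index contents are made up.

```python
def simple_search(index, query):
    """Return pages matching a keyword query with optional +required words.

    index: dict of word -> set of page names
    Plain keywords match pages containing any of them; a word written as
    +word must appear on every page returned.
    """
    terms = query.lower().split()
    required = [t[1:] for t in terms if t.startswith("+") and len(t) > 1]
    optional = [t for t in terms if not t.startswith("+")]
    # Start with every page that contains at least one of the words.
    candidates = set()
    for word in required + optional:
        candidates |= index.get(word, set())
    # Then keep only the pages that contain every required word.
    for word in required:
        candidates &= index.get(word, set())
    return candidates

# A made-up index for illustration.
index = {
    "gold": {"a.html", "b.html"},
    "luster": {"a.html"},
    "silver": {"c.html"},
}
print(sorted(simple_search(index, "gold silver")))     # ['a.html', 'b.html', 'c.html']
print(sorted(simple_search(index, "+luster gold")))    # ['a.html']
```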
AltaVista’s advanced search lets you construct queries that include proximity and Boolean operators like AND, OR, and NOT. So you can write a query like "adam NEAR engst AND NOT tidbits" to match all pages that contain Adam Engst’s name (where Adam and Engst are within a few words of each other) but don’t contain the word TidBITS anywhere. (Capitalization is usually ignored.)
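A query like the one above boils down to set operations on the pages each word appears on. The sketch below evaluates such queries strictly left to right, with no parentheses, no operator precedence, and no error handling for malformed queries. Treating "within a few words" as five words is an assumption, as are all the function names and sample pages. It shows the idea, not AltaVista's real query language.

```python
import re
from collections import defaultdict

def build_positions(pages):
    """Map word -> page name -> list of positions of that word on the page."""
    index = defaultdict(dict)
    for name, text in pages.items():
        for pos, word in enumerate(re.findall(r"[a-z0-9']+", text.lower())):
            index[word].setdefault(name, []).append(pos)
    return index

def pages_with(index, word):
    """Return the set of pages on which a word appears."""
    return set(index.get(word, {}))

def near(index, a, b, distance):
    """Return pages on which a and b occur within `distance` words."""
    return {name for name in pages_with(index, a) & pages_with(index, b)
            if any(abs(i - j) <= distance
                   for i in index[a][name] for j in index[b][name])}

def search(index, query, distance=5):
    """Evaluate words joined by NEAR, AND, OR, and AND NOT, left to right."""
    tokens = query.split()
    items, i = [], 0
    # First pass: turn words and "x NEAR y" groups into sets of pages.
    while i < len(tokens):
        if i + 2 < len(tokens) and tokens[i + 1].upper() == "NEAR":
            items.append(near(index, tokens[i].lower(),
                              tokens[i + 2].lower(), distance))
            i += 3
        elif tokens[i].upper() in ("AND", "OR", "NOT"):
            items.append(tokens[i].upper())
            i += 1
        else:
            items.append(pages_with(index, tokens[i].lower()))
            i += 1
    # Second pass: combine the sets with the Boolean operators.
    result, i = items[0], 1
    while i < len(items):
        if items[i] == "AND" and items[i + 1] == "NOT":
            result, i = result - items[i + 2], i + 3
        elif items[i] == "AND":
            result, i = result & items[i + 1], i + 2
        elif items[i] == "OR":
            result, i = result | items[i + 1], i + 2
        else:
            raise ValueError("unexpected token in query")
    return result

# Made-up sample pages for illustration.
pages = {
    "bio.html": "Adam C. Engst writes about the Macintosh",
    "issue.html": "Adam Engst publishes TidBITS every week",
    "far.html": "adam " + "word " * 10 + "engst",
}
index = build_positions(pages)
print(search(index, "adam NEAR engst AND NOT tidbits"))  # {'bio.html'}
```

Because both the query words and the page text are lowercased before comparison, capitalization is ignored, as the text above notes is usual.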
Indexing and search software can be expensive, and there are products available for every platform. Unix systems often rely on the currently freeware Excite for Web Servers (EWS), which also runs under Windows NT. Another cross-platform solution is Maxum’s Phantom, which runs on both Windows NT and the Macintosh. [GF]