Series: Search Engine Shootout
All the details of TidBITS's contest to find a searchable article database
Article 1 of 3 in series
For some time, we've been lamenting the fact that TidBITS doesn't have a good, full-text, search engine. Years ago, Ephraim Vishniac set up an excellent WAIS source for TidBITS, but that was when Thinking Machines ran the public WAIS server on their Connection Machine. That service eventually went away, and several attempts were made to replace it. The current search engine is run by Sensei Consulting in Australia, and although it's welcome, we often hear of troubles accessing it. In addition, searches return entire issues, rather than articles, so you must also search within the returned issue.
A variety of searching tools that run on Macs have appeared over the years, but we've never had the proper combination of time, hardware, and experience to put them through their paces. So, we've come up with a different method for evaluating these pieces of software - we're going to have a search tool shootout!
We have a number of goals in mind. First, we want to pull out the best search tools for the Macintosh among the numerous contenders. Second, we want to let the creators of these programs strut their stuff. Third, we want to provide a way for people to search TidBITS easily.
Who Can Participate? Anyone can participate, although we expect that those who have written search tools will be the most interested, since this will give them a chance to show off in a real-world test that will be useful to thousands of people. If, however, you're a consultant and specialize in setting up Macintosh-based search tools, you're welcome to compete.
What's the Test? Once everyone who has expressed interest in participating has contacted our Managing Editor Jeff Carlson at <firstname.lastname@example.org>, we'll provide access to all back issues of TidBITS, in HTML format. No pansying around here - the competition will use the contents of over 360 issues of TidBITS, about 11 MB of text covering the last seven years. Once everyone has the issues, they can set up their search engines. We don't have anywhere near enough Macs to host this, so contestants will have to provide their own hardware and Internet connection. Technical questions regarding our format or other issues can be directed to me at <email@example.com>.
Specification -- No contest would be complete without rules. All entries:
- Must offer full-text search capabilities of all TidBITS issues.
- Must be made with and run on a single Macintosh running the Mac OS.
- Must be accessible via the Web.
- Must automatically integrate new issues every week.
- Must return results at an article level (articles all start with <H2> tags).
- Must display results using HTML source from TidBITS issues, including hot links.
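Because the rules say every article starts with an <H2> tag, the article-level requirement comes down to cutting each issue's HTML at those tags. Here is a minimal sketch of that idea in Python. The function name `split_articles` and the sample markup are illustrative assumptions, not part of the contest materials, and real issues may need more careful HTML handling than a regular expression provides.

```python
import re

# Assumption: every article begins at an <H2> tag, as the contest rules state.
H2_OPEN = re.compile(r"<h2\b", re.IGNORECASE)
H2_TITLE = re.compile(r"<h2\b[^>]*>(.*?)</h2>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def split_articles(issue_html):
    """Return a list of (title, html) pairs, one per <H2>-delimited article.

    Anything before the first <H2> (the issue header) is dropped.
    The HTML of each article is kept untouched, so links survive.
    """
    starts = [m.start() for m in H2_OPEN.finditer(issue_html)]
    articles = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(issue_html)
        chunk = issue_html[start:end]
        m = H2_TITLE.match(chunk)
        # Strip any inline tags from the heading to get a plain-text title.
        title = TAG.sub("", m.group(1)).strip() if m else ""
        articles.append((title, chunk))
    return articles


# Made-up sample issue, for illustration only.
sample = (
    "<P>Issue header</P>"
    "<H2>First Article</H2><P>Body with a <A HREF=\"x.html\">link</A>.</P>"
    "<H2>Second Article</H2><P>More text.</P>"
)

for title, html in split_articles(sample):
    print(title, len(html))
```

Keeping each chunk's original HTML, rather than converting it to plain text, is what would let an entry display results with the hot links intact.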
In addition, these bonus items could be included and will improve an entry's chance of winning:
- Sorting results by date or relevance
- Low cost
- Short setup time
- Other additional features, such as suggesting alternative sites to search if a search comes up empty
The Time Frame -- We don't expect contestants to drop everything and start working on this full time - in fact, we'd prefer to hear things in the best entries like "Yeah, I whipped this off while I was waiting for my pizza to arrive." The Macintosh is about ease-of-use, and we hope that it won't be difficult to set up these systems. Here are the dates to watch:
- 17-Mar-97: Deadline for entering the contest.
- 21-Apr-97: Deadline for completing entries. Judging starts.
- 12-May-97: Winner announced.
How Will We Judge? Implementation details are up to the people participating in the shootout, but we have guidelines that contestants should keep in mind. All of the specifications should be met, although we won't disqualify entries for not meeting all of them (other than the Mac and Web requirements, which aren't negotiable).
- Compliance with the specifications
- Speed of searching, independent of connection speed
- Attractive and usable interface for the search page
- Attractive and readable results pages
- Cost and setup time
- Additional features
The Prizes -- Obviously, a contest requires prizes, and we'll reward the winning entry (or entries) with the main thing we have - exposure to an estimated 150,000 Macintosh users. We plan to write about the shootout, looking at each entry and concentrating on the best of the crop. Then, assuming everything works out, we'll implement the best solution on our servers for everyone to use, giving that entry full credit and significant exposure. Other contestants can continue to host their searchable archives of TidBITS as a real-world demonstration of what their software can do, and we'll link to those who keep the archive up-to-date with new issues.
Article 2 of 3 in series
The deadlines for our TidBITS Search Engine Shootout contest announced in TidBITS-368 have come and gone, and it's time to share the results. To begin, we want to thank each and every entrant personally. These folks put tremendous effort into creating search engines that would serve the Macintosh community, and for that alone they all deserve kudos. Overall, the quality of the search engines was great, and we enjoyed reading about how the entries were constructed.
In this week's article, we're going to spotlight each entrant and provide comments about each search engine. Then, next week, after we've had more time to chat with the top entrants, we'll announce the winner (or winners, if necessary). Feel free to visit the sites (listed below in no particular order), but don't worry if you can't connect - because some entries are running on personal machines, they may not be available full time. You can also refer to TidBITS-368 for the contest criteria.
Scott Ribe & WebServer 4D -- By far the snappiest entry came from Scott Ribe, who wrote a text indexing extension that works with MDG's $295 WebServer 4D to provide a blindingly fast, full-text search engine for TidBITS. Although Scott had to write the code, which took a few weeks (and it's still relatively hard-wired to TidBITS, but he plans to generalize it for commercial release), the setup seems simple, with the text indexing extension looking for TidBITS issues in a specific drop folder.
We liked this entry quite a bit, in large part thanks to its speed. It has a relatively spartan results page, with the issue number and the article title, but I imagine it could fairly easily add the author, or perhaps the first line of the article to a summary list. Results are sorted by reverse chronological order, and Scott plans relevance ranking for a future release. The search finds articles containing all the search terms, and although you can search for issue dates, neither Boolean nor phrase searching is available. Oddly, it also can't handle hyphenated words, like "Ashton-Tate". [ACE]
Ethan Benatan, Frontier & Phantom -- Ethan Benatan came up with a creative, highly functional solution for searching TidBITS issues: using Userland Frontier, Ethan wrote a scheduled script that uses Fetch to download new TidBITS issues, and (when a new issue appears) breaks it up into articles and saves the resulting files in a local directory. Each night, Maxum's Phantom adds any new files to its cumulative index, while continuously serving as a CGI to handle queries from users. Frontier also uses Eudora Light to send status reports. Phantom is about $300, while Frontier and other components have little or no cost.
The result is a spiffy TidBITS search engine, offering word-stemming, Boolean and phonetic searching capabilities from Phantom, plus "convenience" features for searching just 1996 or 1997 TidBITS issues, searching only URLs or headers, detailed or compact results formats, and relevancy-ranked search results (expressed in percentages). To our delight, Ethan went to the extra effort of breaking MailBITS up into separate articles so they can be matched individually. Although the detailed search results are marred by navigation links showing up in the three-line previews, all in all, Ethan's effort is outstanding. [GD]
Andrew Warner & FoxPro -- You don't hear much about the Mac version of FoxPro since Microsoft purchased Fox back in 1992 (see TidBITS-113). But, it's still out there, and Andrew Warner has shown that it can still perform. This search engine was written entirely in FoxPro and is highly customizable. It reads TidBITS issues from a drop folder, and provides dynamic headers and footers. The system includes a file parsing program that reads the HTML of each issue and parses them into separate articles. Then, Phdbase, a text searching library add-on for FoxPro/Mac, does the indexing.
Since Andrew had to run this on his personal machine, we couldn't do much testing in the time available. Boolean and phrase searching (via quotes) were available, and you could limit the searches to specific fields (such as article title or, hypothetically, date) as well. Andrew didn't spend much time on this solution, but he said he could easily add or modify many features, given more time. The results list included the article title and issue date, and articles displayed relatively well, with an occasional glitch or inappropriate search hit. [ACE]
Ole, David, FileMaker & Frontier -- Ole Saalmann and David Weingart harnessed Userland Frontier not only as a CGI engine for returning search results, but also as a parser and scheduled retriever for new TidBITS issues. Frontier scripts grab TidBITS issues, break them into articles, and store them in a simple FileMaker Pro database. When search requests come in from users, Frontier tells FileMaker what to search for, then returns the results in HTML.
Ole and David's project offers a pleasing AltaVista-like interface, detailed and compact results pages (plus an Advanced Search option with some Boolean and phrase-searching operations, plus searches in article titles, issue ranges, and date ranges). Although the service displays some HTML oddities and doesn't offer relevancy ranking for articles, it's speedy, offers excellent search results pages, and has a particularly elegant scripting setup on the Web server. [GD]
Duane Bemister & WebSonar -- Duane Bemister created his entry using Virginia Systems' WebSonar Professional. Products in the WebSonar line make it possible to search large quantities of documents via the Web, and those documents can be in many different formats, making it possible to place documents online without converting them to HTML.
Although WebSonar offers many sophisticated options, it suffers under the burden of so many possibilities that casual users may become discouraged with the complex menu- and toolbar-driven interface. Further, WebSonar uses a page metaphor, so search results don't appear as discrete articles. WebSonar represents a powerful tool, but we aren't convinced that casual searchers will wish to devote the mental cycles necessary to jump its learning curve. [TJE]
David, Curt & Apple e.g. -- We received two entries that used Apple e.g., a CGI (currently freely available and in beta) from Apple that adds search features to Macintosh-based Web sites. Technically speaking, Apple e.g. uses technology from Apple formerly codenamed the V-Twin text indexing engine, but now saddled with the rather dull appellation of Apple Information Access Toolkit. From a backend standpoint, we like the way both entries integrate Apple e.g. with TidBITS, and we also like the user experience. It's easy to find articles, and the results list gives a relevancy score for each found article. Plus, there's a feature for checking off particularly relevant documents in a results list, and then finding similar articles to those checked. We were rather impressed at how well that feature works.
The first entry, created by David Clatfelter, gives results in table or text format. Table format uses graphics to create a relevancy score fill bar and gives information about each found article. Unfortunately, the information begins with a jumble of text from the top of the issue containing the found article. The text format uses asterisks to indicate a relevancy score and gives the title of the issue in which the found article resides.
Curt Stevens submitted the second Apple e.g. entry. Users can choose from full or compact format for viewing results. Full format returns a list of found articles, each with a fill bar indicating a relevancy score. After the score, each entry begins with the article title, and includes the first few lines of the article, making it easy to determine if the article is of interest. Compact format is much like David's text format, except it lists the article's title instead of the title of the issue that contains the article. Overall, we are impressed with the performance and possibilities of Apple e.g. and plan to take a closer look. [TJE]
Jacque Landman Gay & LiveCard -- When I wrote about LiveCard, the $150 CGI from Royal Software, in TidBITS-338 I mostly noted its ability to put HyperCard stacks on the Web with little or no modification. Little did I expect one of the most noted members of the HyperCard community would use it as the basis for a TidBITS search engine.
LiveCard acts as an intermediary between a Macintosh Web server and Jacque's custom HyperCard stack that indexes issues, performs searches, and reports results. LiveCard presents a simple search form for entering up to three sets of search terms. Quoted phrases can be used, and Boolean search options are available. Search results are displayed as a list of article titles, and clicking a title takes users to the appropriate location in a TidBITS issue. Although HyperCard is sometimes maligned as a CGI engine in comparison to Frontier or compiled solutions, this LiveCard tool searches more than 10 MB of TidBITS articles and returns search results with surprising speed (and my server, where it's temporarily being hosted, isn't particularly fast). Although this search engine doesn't let users restrict searches to particular ranges of dates or issues and only presents a bare-bones results listing, it's a surprisingly smooth effort given the small amount of time Jacque was able to put into it, and an apt demonstration of the kinds of Web services that can be produced with off-the-shelf authoring software (especially since LiveCard is included in Apple's HyperCard 2.3.5 Value Bundle). [GD]
Glen Stewart & WarpSearch -- Glen Stewart's WarpSearch CGI works differently from most of the other entrants. Other solutions usually index the entire TidBITS archive, which makes for fast searches, but requires weekly additions to the index and can use a fair amount of disk space. In contrast, WarpSearch just searches the entire archive each time. That might sound slow, but it still manages to search the 10 MB of TidBITS issues at roughly 700K per second.
WarpSearch only allows phrase searches, and no Boolean or multiple non-contiguous word searches. The results list provides the issue name, the size of the issue, the modified date, and the number of matches in that issue. Unfortunately, it doesn't break articles out of the overall issues, sometimes returns unintelligible issues, and because it uses text from our setext files rather than the HTML versions, the found text doesn't look as good as it could. [ACE]
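The trade-off between indexing the archive and scanning it on every query can be sketched in a few lines of Python. This is only an illustration of the two general approaches, not a description of how WarpSearch or any other entry is actually implemented. The article IDs, the sample text, and the function names are all made up.

```python
import re
from collections import defaultdict

# Toy archive; the IDs and text are invented for illustration.
ARTICLES = {
    "a1": "RAM Doubler review and memory tips",
    "a2": "Searching TidBITS with a full-text index",
    "a3": "A review of text indexing tools",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(articles):
    """Inverted index: word -> set of article IDs containing it.

    Built once up front; it costs disk space (or memory) and must be
    updated whenever a new issue arrives.
    """
    index = defaultdict(set)
    for art_id, text in articles.items():
        for word in tokenize(text):
            index[word].add(art_id)
    return index


def indexed_search(index, query):
    """Articles containing every query word, found by index lookups."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


def scan_search(articles, phrase):
    """Phrase search that rereads every article on each query.

    No index to build or maintain, but the work grows with archive size.
    """
    needle = phrase.lower()
    return {art_id for art_id, text in articles.items()
            if needle in text.lower()}


index = build_index(ARTICLES)
print(sorted(indexed_search(index, "text review")))
print(sorted(scan_search(ARTICLES, "RAM Doubler")))
```

The scanning version handles exact phrases naturally, since it sees the original text. The indexed version answers multi-word queries without touching the articles again, at the price of keeping the index current.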
Nisus Software & GIA -- Although Nisus Software's GIA (Guided Information Access) technology isn't precisely a full-text search engine, we decided to let them compete anyway. GIA provides keyword-based live filtering, so as you select keywords from a predefined list, the lists of matching TidBITS articles and available keywords both shrink. Selecting additional keywords decreases the number of articles and keywords until you've narrowed the search to a manageable set of articles. The hardest part of setting up a keyword system is selecting the keywords, and the system seemed to work best for relatively broad searches. Looking for a specific article was sometimes frustrating if necessary keywords weren't present.
I continue to be impressed with the possibilities of GIA, but its reality lags. Nisus Software has implemented GIA entirely in Java, and although we used it with different Java VMs (including Internet Explorer on a PC), it was continually plagued by interface glitches. Some can no doubt be easily fixed, but others may be more basic to Java or current tools. In the end, although GIA is fascinating technology, it doesn't meet the shootout criteria, since the server doesn't currently run on a Mac, and it's not providing a full-text search. [ACE]
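The keyword-narrowing behavior described for GIA, where each selected keyword shrinks both the list of matching articles and the list of keywords still on offer, can be sketched as follows. This is a guess at the general technique only: the article doesn't describe GIA's data model, and the article titles, keywords, and the `narrow` function are hypothetical.

```python
# Hypothetical keyword assignments, invented for illustration.
KEYWORDS = {
    "Article A": {"modem", "review", "hardware"},
    "Article B": {"modem", "software"},
    "Article C": {"review", "software"},
}


def narrow(tagged, selected):
    """Return (matching articles, keywords still worth offering).

    An article matches if it carries every selected keyword. The
    remaining keywords are those found on the matching articles,
    minus the ones already selected.
    """
    selected = set(selected)
    matches = {title for title, tags in tagged.items() if selected <= tags}
    remaining = set()
    for title in matches:
        remaining |= tagged[title]
    return matches, remaining - selected


matches, remaining = narrow(KEYWORDS, ["modem"])
print(sorted(matches), sorted(remaining))
```

A scheme like this can only find what the keyword list anticipates, which fits the observation that specific searches were frustrating when the necessary keywords weren't present.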
Roger McNab & NZDL -- Roger McNab at the University of Waikato integrated the text of TidBITS issues with the search engine of the New Zealand Digital Library (NZDL). The NZDL enables users to search specific collections of documents (including Project Gutenberg, FAQ Archives, others only available in PostScript or TeX formats), and permits ranked or Boolean queries, additional search options, and compact results pages that identify article titles and authors.
Although the NZDL archive is functional, useful, and offers an attractive query interface, it also violates one of our contest's ground rules: it doesn't run on a Macintosh. Although core portions of the project are written in Perl and the author doesn't anticipate problems with a Macintosh port, the simple fact is that a Mac version doesn't yet exist. [GD]
Tune In Next Week -- There you have our contest entrants - tune in next week for more details on our favorites and the eventual winner or winners.
Article 3 of 3 in series
First, a correction. While developing search engines for the TidBITS Search Engine Shootout, some entrants sent more than one URL as they changed configurations, or temporarily used different servers as test machines. The URL we gave last week for Glen Stewart's WarpSearch entry was such a temporary location, set up only for the duration of the Shootout. You can check out WarpSearch reliably at the following URL:
Last week in TidBITS-379 we introduced you to all the entrants and promised we'd make a decision this week. It hasn't been easy. Of our 11 entrants, all of whom submitted excellent entries, four stood out.
- Scott Ribe and WebServer 4D
- Ethan Benatan, Frontier and Phantom
- Ole Saalmann and David Weingart, Frontier and FileMaker Pro
- Curt Stevens and Apple e.g.
The Criteria -- We had hoped that one of the entries would obviously rise to the top, but we had no such luck. So, we came up with some refined criteria for comparing our top four entries. These criteria are:
- Ease of use for the end user
- Searching power for the end user
- Ease of setup and maintenance for us
- Searching speed
- Setup cost
- "Hit by a bus" survivability (I'll explain this later)
- Overall accuracy of results
We are aware that Apple has not yet shipped a final Telepathy extension, so we're sure some of the comments below can easily be addressed by the developers. We've tried to take that flexibility into account, but overall, we judged what we saw.
Also keep in mind that we didn't evaluate these search engines for which is generically the best. We instead chose which would be the best solution for TidBITS. That's likely to be different from anything you may want a search engine to do, so if you want to build your own Mac OS-based search engine, you should investigate these technologies more closely (and check last week's article for others that might suit your purposes).
Ease of Use -- Obviously, a search engine should be easy to use, because otherwise people will avoid it. This criterion is often at odds with the next one, which rates searching power, since the more options, the more complex the interface and the results list inevitably become. Ethan's Phantom-based entry has more options on its main search page than the rest, lowering its ease of use slightly. Some of us like AltaVista's interface, and familiarity on the Web is a good thing, so Ole and David's Frontier and FileMaker entry gets points for providing both simple and advanced search forms. Curt's Apple e.g. entry and Scott's WebServer 4D entry have dead simple interfaces, which is good.
All our entrants provide results at the article level (and Ethan gets extra points for breaking out MailBITS separately), although Curt links to the article within the full issue rather than breaking the articles out as individual files. Curt's technique forces people to download a full issue each time but provides context around the article in question and makes it easy to scan other articles in the same issue. Ole and David straddle the fence by breaking the articles out and also pointing into the full issue on our Web site, which is good for an independent search engine, but less important for something we'd run ourselves.
A final part of the ease of use criterion is the results page. The results should be attractive, easy to scan quickly, and sorted well. Ole and David score points for their homage to AltaVista but display results newest first, whereas Ethan and Curt both take advantage of relevance sorting. Ethan's results list unfortunately includes the text from the navigation bar in the summary text, but that's probably easily rectified. Scott's results page does chronological sorting (relevance is slated for a later release) and uses a simple table with the issue number and article title, but no summary text, which makes it more difficult to determine which article you might want. I suspect that's fixable.
Both Ethan and Curt include a field for a new search in the results list, and Ethan puts the search terms in the field. Apple e.g.'s option to find similar documents is more flexible than Phantom's, since you can select multiple articles by clicking multiple More checkboxes, whereas you can only find documents similar to a single hit in Phantom's results list.
Although we're splitting hairs here, since all four are easy to use, we give the ease of use award to Curt Stevens and Apple e.g. for the combination of a simple interface and a clear and attractive results list.
Ease of Use: Curt Stevens and Apple e.g.
Searching Power -- Sometimes you want to find information that's not easily identified with a word or two. For that, you need additional flexibility and power in the search engine. You may know roughly when an article was published, or you may know how a word starts or how it sounds but not know how to spell it properly. Ethan's Phantom-based entry wins hands down when it comes to searching power, which is the trade-off for losing a bit on simplicity of interface. Phantom provides Boolean searching, phonetic searching, word stemming, searching within certain HTML tags, and some level of date range searching. Ole and David's Frontier/FileMaker entry offers an advanced search that provides Boolean searching, title searches, issue number searches, and date range searches, which are quite useful. Curt's Apple e.g. solution and Scott's WebServer 4D entry offer little in the way of this sort of flexibility, although you can throw parts of dates (like the last two digits of the year) into the search string to improve granularity.
The capability to find similar documents is useful for narrowing searches. It's provided by both Ethan and Curt via Phantom and Apple e.g., and both seem to do a good job at it. Overall, we found that Apple e.g. had a better interface for finding similar documents, but it's not enough to compete with Phantom's searching flexibility.
Searching Power: Ethan Benatan, Frontier and Phantom
Ease of Setup and Maintenance -- This category is difficult to judge, because we neither set up nor attempted to administer all of the contest entries. However, based on what we know of the tools involved and what we know of our existing tools, we can make some assumptions.
Ole and David and Ethan use Frontier to suck in new TidBITS issues, parse them into articles (and MailBITS, in Ethan's case), and then turn them over to the database engine (FileMaker Pro and Phantom, respectively). Ole and David also use Frontier as the CGI to communicate between the Web server and FileMaker Pro, whereas Phantom acts as both the indexer and the Web server. Using Frontier offers significant flexibility, but may suffer in ease of setup - scripting solutions seldom have well-designed graphical interfaces. Similarly, although the flexibility is there, changes require programming, and although both Geoff Duncan and Matt Neuburg are capable of that, the rest of us at TidBITS aren't. Since we're small, we try to keep overlapping skill sets so anyone can step in for anyone else if necessary.
Scott and Curt both look in a drop folder for new issues of TidBITS to index, which is an ideal solution for us, because it's easy for us to modify our existing distribution automation to put a copy of the issue in a folder. Curt's Apple e.g. entry is probably the best here, since we believe we can point it at our existing folder of TidBITS issues, whereas Scott's WebServer 4D entry currently deletes the original from the drop folder after importing it. We're sure that's an easy thing to change if necessary.
Ease of Setup and Maintenance: Curt Stevens and Apple e.g.
Speed -- Overall, we didn't notice that any of the entries were particularly slow, and speed wouldn't have entered our consciousness in a big way if it hadn't been for Scott Ribe's WebServer 4D entry. Everyone else seemed roughly similar (and since there are lots of variables in how fast something works on the Web, we ignored occasional differences), but Scott's entry was blindingly fast, so much so that I ended up using it a few times in the last few weeks because I knew it would be the quickest to send results back. There's not much else to say about this criterion, but wow!
Speed: Scott Ribe and WebServer 4D
Cost -- Again, it's difficult to estimate the cost of setting up one of these search engines since we already have some of the necessary equipment and software. For those of you interested in setting up a similar server from scratch, we'll rough out the costs as we understand them.
- Scott's entry requires the $295 WebServer 4D from MDG, and he said that he hopes to sell the custom text indexing extension he's writing for this purpose for somewhere in the $100 to $200 range. It achieves its blinding speed on a Quadra 800 with a PPC upgrade card, which is about as slow as Power Macs get, so CPU power isn't much of an issue, nor is disk space or speed. RAM is useful though, and Scott recommends a system with 48 MB.
- Ethan's entry uses Maxum's Phantom running in stand-alone mode, so it doesn't even require an additional Web server. Phantom is the major cost at $395, although Ethan's setup also uses the free Frontier and the free Eudora Light (for reports). Currently, Ethan's entry runs on a 32 MB PowerBase 180 from Power Computing.
- Ole and David's entry uses the free Frontier, Chris Hawk's free Quid Pro Quo as the Web server, and Claris's FileMaker Pro, which costs roughly $200. To avoid buying FileMaker Pro, Ole and David say that you could use their Frontier suite with other databases. Ole and David's entry was hosted on two separate machines; the main one we pointed at turned out to use a 68040 and 20 MB of RAM, so hardware shouldn't be a problem for their solution.
- Curt's entry uses Apple e.g., which is free, although it does require a Web server such as StarNine's WebSTAR, which we use, or the free Quid Pro Quo. It's running on an Apple Workgroup Server 8150/110 with 40 MB (10 MB for Apple e.g.). That's a 100 MHz PowerPC 601 - not a particularly fast machine. The bottom line comes down to the fact that if you have a Power Mac, you wouldn't have to spend any money to get Apple e.g. up and running.
Cost: Curt Stevens and Apple e.g.
Hit by a Bus -- As I noted before, TidBITS is a small organization, and as with any small organization, we worry about what TidBITS would do if something terrible (such as being hit by a bus) were to happen to one of us. As such, we avoid situations where any one of us is the only person who could perform an important task - if that person were to die in a freak gardening accident, that task would be difficult to continue. So, in thinking about which search engines to adopt, we considered the ramifications of the hit by a bus scenario for each one.
Curt's Apple e.g. entry would seem to be the obvious winner, but for one wee problem: it's currently a custom job. Curt works at Apple on Apple e.g., and he modified Apple e.g. to understand that TidBITS issues have more than one article in them. So, unless Curt's custom changes are rolled into the public version of Apple e.g. and maintained (which is the plan), Curt becomes our weak link. And, given Apple's recent troubles, Apple e.g.'s future in general is something of a question mark.
Scott's entry suffers some of the same problems, given that the bulk of the work is his custom text indexing extension, which is currently hard-coded to certain aspects of TidBITS. If we were to change something about our format, and Scott had been abducted by space aliens, we'd be in trouble. Also, although WebServer 4D is obviously performing well, MDG is a small company in what can be a hard market.
Interestingly, although we marked Ethan and Ole and David's entries down slightly for ease of setup because they're based in large part on Frontier, they both do better in this category because of that. Frontier may not be the sort of thing that some of us have ever been able to wrap our heads around, but many people know it and could help in case of emergency. Ethan also uses Phantom, and Maxum seems like a solid company that is unlikely to disappear or drop Phantom. Ole and David rely on FileMaker Pro, and given that it's the most popular database on the Macintosh, it's a good bet that it will be around forever with plenty of people who know how to use it.
Ethan edges out Ole and David by a hair here, if only because he seems to rely on Frontier a little bit less, which means finding someone who could fix a problem in his code would be slightly easier.
Hit by a Bus: Ethan Benatan, Frontier and Phantom
Overall Accuracy -- There's nothing worse than not being able to find something you know exists thanks to some quirk in a search engine. Geoff Duncan was a software tester in a previous lifetime, and he briefly hammered on all of the entrants with deliberately stressful and unusual searches. I'll let him report on which ones fared well.
Fortunately, the four final entrants all provide essentially correct and functional search results. Simple targeted tests for known items - the word "emporia," for instance, which until now only appeared in one TidBITS issue - worked correctly in all engines; similarly, Boolean functions plus issue and date restrictions appeared to function correctly where they were offered. Stress tests for large (or huge) results lists and simultaneous queries were also handled properly. However, some more complex (or more naive) queries occasionally generated mixed hits or unexpected results lists. After isolating the search engines' behaviors, I tried to figure out how quirks might impact real users.
Both Curt with Apple e.g. and Ethan with Phantom sort search results by perceived relevance, which proves both a strength and a weakness. On one hand, they both tend to let the most appropriate articles float to the top of a results list, which is obviously useful. On the other hand, relevancy ranking tends to break down with (perhaps unwittingly) vague queries. Apple e.g. casts a wide net, routinely finding more than 100 matches for simple queries ("RAM Doubler review"). The top-most matches are fine, but subsequent matches can appear random at first glance and yet carry a comparatively high relevancy. Phantom, conversely, throws away the chaff: the same query turns up just three items, the first of which is right on target, and the other two of which mention all the terms but (appropriately) have single-digit relevancy. Phantom does a similarly good job narrowing down results with other generally phrased searches.
Neither Scott's entry nor Ole and David's offers relevancy ranking; instead, both sort results from most to least recent. However (and this is probably fixable), Ole and David's entry sometimes returns duplicate hits in early TidBITS issues, with some early hits appearing at the top of the results list, then repeated later in correct sort order. More often than not, trying to access these duplicated entries returns an error. Scott's entry doesn't suffer from result duplication, but it does ignore URLs, which (judging from TidBITS email) are frequently sought items.
So, although Apple e.g. provides more advanced features for finding articles similar to ones in a results list, for pure accuracy and relevancy of results, I give the nod to Phantom.
Accuracy: Ethan Benatan, Frontier and Phantom
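To make the difference between relevance-ranked and date-sorted results concrete, here is a minimal Python sketch. It is not how Apple e.g., Phantom, or any of the entries actually compute relevance; the scoring function, the article records, and all names below are assumptions made up purely for illustration. The toy score is the percentage of an article's words that are query terms, and articles missing any query term are dropped. Repeated hits for the same article are also filtered out before sorting.

```python
from datetime import date

def score(text, query):
    """Toy relevance (an assumption, not any entrant's method):
    percentage of the article's words that are query terms.
    Returns 0 unless every query term appears at least once."""
    words = text.lower().split()
    terms = query.lower().split()
    if not words or not all(term in words for term in terms):
        return 0
    hits = sum(words.count(term) for term in terms)
    return round(100 * hits / len(words))

def matches(articles, query):
    """Return (score, article) pairs, skipping repeated article ids."""
    seen = set()
    found = []
    for article in articles:
        if article["id"] in seen:
            continue  # drop duplicate hits for the same article
        seen.add(article["id"])
        relevance = score(article["text"], query)
        if relevance > 0:
            found.append((relevance, article))
    return found

def by_relevance(articles, query):
    """Highest toy score first."""
    ranked = sorted(matches(articles, query), key=lambda hit: hit[0], reverse=True)
    return [article for _, article in ranked]

def by_recency(articles, query):
    """Most recent article first, ignoring the score."""
    ranked = sorted(matches(articles, query), key=lambda hit: hit[1]["date"], reverse=True)
    return [article for _, article in ranked]

# Hypothetical articles, invented for this example only.
articles = [
    {"id": 1, "date": date(1996, 1, 15),
     "text": "ram doubler review ram doubler tested"},
    {"id": 2, "date": date(1997, 6, 2),
     "text": "a long roundup that mentions a ram doubler review in passing among many other things"},
    {"id": 3, "date": date(1997, 3, 10), "text": "modem news"},
    {"id": 1, "date": date(1996, 1, 15),
     "text": "ram doubler review ram doubler tested"},  # duplicate hit
]

query = "ram doubler review"
print([a["id"] for a in by_relevance(articles, query)])
print([a["id"] for a in by_recency(articles, query)])
```

With this made-up data, the focused older article leads the relevance-ranked list, while the newer article that only mentions the terms in passing leads the date-sorted list. That is the tradeoff described above, in miniature.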
Quantitative Ratings -- As a final method of differentiating the search engines, I asked everyone at TidBITS to list these four search engines in order of overall preference. I figured that would help include any intangibles that might have slipped through the criteria above. I then took the ratings and assigned points: one point for the first choice, two for second, three for third, and four for fourth. Next, I added the points for each entrant and ranked the entrants accordingly, with the lowest total winning (like in the cross-country races I ran in high school and college). With five people voting, the scores could range between 5 and 20. Here's how it came out:
- 6 points: Curt Stevens and Apple e.g.
- 13 points: Ethan Benatan, Frontier and Phantom
- 14 points: Ole Saalmann and David Weingart, Frontier and FileMaker
- 17 points: Scott Ribe and WebServer 4D
Quantitative Ratings: Curt Stevens and Apple e.g.
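For readers who want to run the same kind of tally, the scoring method above can be sketched in a few lines of Python. The individual ballots from the TidBITS staff aren't published here, so the ballots and the entrant labels "A" through "D" below are hypothetical placeholders, as is the function name rank_sum. Only the point scheme (one point for first choice through four for fourth, lowest total wins) comes from the description above.

```python
from collections import defaultdict

def rank_sum(ballots):
    """Add up each entrant's placements across all ballots.

    Each ballot lists entrants from most to least preferred, so the
    first name earns 1 point, the second 2, and so on. The result is
    sorted with the lowest (best) total first.
    """
    totals = defaultdict(int)
    for ballot in ballots:
        for place, entrant in enumerate(ballot, start=1):
            totals[entrant] += place
    return sorted(totals.items(), key=lambda item: item[1])

# Hypothetical ballots from five voters; NOT the actual TidBITS votes.
ballots = [
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
    ["B", "A", "D", "C"],
    ["A", "B", "C", "D"],
    ["A", "C", "B", "D"],
]

for entrant, points in rank_sum(ballots):
    print(f"{points} points: {entrant}")
```

One handy sanity check: with five voters each handing out 1 + 2 + 3 + 4 = 10 points, the totals must always add up to 50, which the published scores (6 + 13 + 14 + 17) do.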
And in the End... I feel terrible having to single out a winner. All four entrants have done a fabulous job. Scott knocked our socks off with the raw speed of his search engine - keep an eye out for when he releases the commercial version of his text indexing extension. Ethan showed how he could use Frontier to enhance Phantom's already impressive capabilities. Ethan also says he's looking for work soon - someone give this man a job! Ole and David wanted to make sure Frontier got the exposure it deserves, and they put together a great resource despite not knowing each other and living on different continents. They're a tribute to the spirit of the Internet. Curt wanted to show what Apple's free Apple e.g. could do, and frankly, Apple can use all the impressive technology demonstrations it can muster.
In our eyes, then, they're all winners. But we don't need to run four separate search engines ourselves, so we plan to implement Curt's Apple e.g. solution first because, all other things being equal, it seems to be the easiest to merge into our existing setup. Should we run into problems, we'll next test both Ethan's and Ole and David's solutions. It will probably be easier to try Ethan's solution, since it doesn't have to integrate with our existing Web server. However, Ole and David's solution might dovetail nicely with some other work that Geoff is doing with keyword indexing. The final option would be Scott's WebServer 4D solution, solely because it involves acquiring, installing, and learning several new pieces of software. There's no overall problem in that, just the reality of how much time and bandwidth we have to learn new things.
Thanks again to all of our entrants!