TidBITS Staff 12 May 1997

Shootout at the Searching Corral

The deadlines for our TidBITS Search Engine Shootout contest announced in TidBITS-368 have come and gone, and it’s time to share the results. To begin, we want to thank each and every entrant personally. These folks put tremendous effort into creating search engines that would serve the Macintosh community, and for that alone they all deserve kudos. Overall, the quality of the search engines was great, and we enjoyed reading about how the entries were constructed.

In this week’s article, we’re going to spotlight each entrant and provide comments about each search engine. Then, next week, after we’ve had more time to chat with the top entrants, we’ll announce the winner (or winners, if necessary). Feel free to visit the sites (listed below in no particular order), but don’t worry if you can’t connect – because some entries are running on personal machines, they may not be available full time. You can also refer to TidBITS-368 for the contest criteria.

Scott Ribe & WebServer 4D — By far the snappiest entry came from Scott Ribe, who wrote a text indexing extension that works with MDG’s $295 WebServer 4D to provide a blindingly fast, full-text search engine for TidBITS. Although Scott had to write the code, which took a few weeks (and it’s still relatively hard-wired to TidBITS, but he plans to generalize it for commercial release), the setup seems simple, with the text indexing extension looking for TidBITS issues in a specific drop folder.

<http://www.mdg.com/>

We liked this entry quite a bit, in large part thanks to its speed. It has a relatively spartan results page, with the issue number and the article title, but I imagine it could fairly easily add the author, or perhaps the first line of the article, to a summary list. Results are sorted in reverse chronological order, and Scott plans relevance ranking for a future release. The search finds articles containing all the search terms, and although you can search for issue dates, neither Boolean nor phrase searching is available. Oddly, it also can’t handle hyphenated words, like "Ashton-Tate". [ACE]

<http://38.254.39.13/tidbits_archive/>
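
For readers curious about what "all the search terms, newest first" means in practice, here is a minimal Python sketch of that behavior. It is not Scott's code, which we haven't seen; the sample records, the field layout, and the function name search_all_terms are all assumptions made up for illustration.

```python
from datetime import date

# Hypothetical records: (issue number, issue date, title, text).
# None of this is real TidBITS data.
ARTICLES = [
    (1, date(1995, 1, 2), "Old Modem News", "modem speeds and prices"),
    (2, date(1996, 6, 3), "Modem Roundup", "fast modem reviews and prices"),
    (3, date(1997, 2, 3), "Search Contest", "search engine contest rules"),
]

def search_all_terms(articles, query):
    """Return (issue, title) for articles containing every query word,
    newest issue first. Matching is on whole lowercase words."""
    terms = query.lower().split()
    hits = [a for a in articles
            if all(t in a[3].lower().split() for t in terms)]
    # Reverse chronological order, as on the results page described above.
    hits.sort(key=lambda a: a[1], reverse=True)
    return [(a[0], a[2]) for a in hits]

print(search_all_terms(ARTICLES, "modem prices"))
```

An article that contains only some of the words is dropped, which is why this style of search behaves like an implicit AND without offering full Boolean queries.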

Ethan Benatan, Frontier & Phantom — Ethan Benatan came up with a creative, highly functional solution for searching TidBITS issues: using Userland Frontier, Ethan wrote a scheduled script that uses Fetch to download new TidBITS issues, and (when a new issue appears) breaks it up into articles and saves the resulting files in a local directory. Each night, Maxum’s Phantom adds any new files to its cumulative index, while continuously serving as a CGI to handle queries from users. Frontier also uses Eudora Light to send status reports. Phantom is about $300, while Frontier and other components have little or no cost.

<http://www.maxum.com/Phantom/>

<http://www.scripting.com/Frontier/>

The result is a spiffy TidBITS search engine, offering word-stemming, Boolean and phonetic searching capabilities from Phantom, plus "convenience" features for searching just 1996 or 1997 TidBITS issues, searching only URLs or headers, detailed or compact results formats, and relevancy-ranked search results (expressed in percentages). To our delight, Ethan went to the extra effort of breaking MailBITS up into separate articles so they can be matched individually. Although the detailed search results are marred by navigation links showing up in the three-line previews, all in all, Ethan’s effort is outstanding. [GD]

<http://anacardium.bio.pitt.edu:8080/>

Andrew Warner & FoxPro — You don’t hear much about the Mac version of FoxPro since Microsoft purchased Fox back in 1992 (see TidBITS-113). But it’s still out there, and Andrew Warner has shown that it can still perform. This search engine was written entirely in FoxPro and is highly customizable. It reads TidBITS issues from a drop folder, and provides dynamic headers and footers. The system includes a file parsing program that reads the HTML of each issue and splits it into separate articles. Then, Phdbase, a text searching library add-on for FoxPro/Mac, does the indexing.

<http://www.microsoft.com/vfoxpro/vf_xplat.htm>

Since Andrew had to run this on his personal machine, we couldn’t do much testing in the time available. Boolean and phrase searching (via quotes) were available, and you could limit the searches to specific fields (such as article title or, hypothetically, date) as well. Andrew didn’t spend much time on this solution, but he said he could easily add or modify many features, given more time. The results list included the article title and issue date, and articles displayed relatively well, with an occasional glitch or inappropriate search hit. [ACE]

<http://agency.arnoldcom.com/aw.search2.html>

Ole, David, FileMaker & Frontier — Ole Saalmann and David Weingart harnessed Userland Frontier not only as a CGI engine for returning search results, but also as a parser and scheduled retriever for new TidBITS issues. Frontier scripts grab TidBITS issues, break them into articles, and store them in a simple FileMaker Pro database. When search requests come in from users, Frontier tells FileMaker what to search for, then returns the results in HTML.

<http://www.scripting.com/Frontier/>

<http://www.claris.com/products/claris/filemakerpro/filemakerpro.html>

Ole and David’s project offers a pleasing AltaVista-like interface, detailed and compact results pages, and an Advanced Search option with some Boolean and phrase-searching operations, plus searches in article titles, issue ranges, and date ranges. Although the service displays some HTML oddities and doesn’t offer relevancy ranking for articles, it’s speedy, offers excellent search results pages, and has a particularly elegant scripting setup on the Web server. [GD]

<http://www.gilbert.org/searchBITs.fcgi>

Duane Bemister & WebSonar — Duane Bemister created his entry using Virginia Systems’ WebSonar Professional. Products in the WebSonar line make it possible to search large quantities of documents via the Web, and those documents can be in many different formats, making it possible to place documents online without converting them to HTML.

Although WebSonar offers many sophisticated options, it suffers under the burden of so many possibilities that casual users may become discouraged with the complex menu- and toolbar-driven interface. Further, WebSonar uses a page metaphor, so search results don’t appear as discrete articles. WebSonar represents a powerful tool, but we aren’t convinced that casual searchers will wish to devote the mental cycles necessary to climb its learning curve. [TJE]

<http://www.websonar.com/websonarcom/tidbits_challenge.html>

David, Curt & Apple e.g. — We received two entries that used Apple e.g., a CGI (currently freely available and in beta) from Apple that adds search features to Macintosh-based Web sites. Technically speaking, Apple e.g. uses technology from Apple formerly codenamed the V-Twin text indexing engine, but now saddled with the rather dull appellation of Apple Information Access Toolkit. From a backend standpoint, we like the way both entries integrate Apple e.g. with TidBITS, and we also like the user experience. It’s easy to find articles, and the results list gives a relevancy score for each found article. Plus, there’s a feature for checking off particularly relevant documents in a results list, and then finding similar articles to those checked. We were rather impressed at how well that feature works.

<http://cybertech.apple.com/apple_eg.html>

The first entry, created by David Clatfelter, gives results in table or text format. Table format uses graphics to create a relevancy score fill bar and gives information about each found article. Unfortunately, the information begins with a jumble of text from the top of the issue containing the found article. The text format uses asterisks to indicate a relevancy score and gives the title of the issue in which the found article resides.

<http://idoseek.ucr.edu/cgi-bin/appleeg/eg.acgi>

Curt Stevens submitted the second Apple e.g. entry. Users can choose from full or compact format for viewing results. Full format returns a list of found articles, each with a fill bar indicating a relevancy score. After the score, each entry begins with the article title, and includes the first few lines of the article, making it easy to determine if the article is of interest. Compact format is much like David’s text format, except it lists the article’s title instead of the title of the issue containing the article. Overall, we are impressed with the performance and possibilities of Apple e.g. and plan to take a closer look. [TJE]

<http://17.255.9.121:8080/TidBITS.acgi>

Jacque Landman Gay & LiveCard — When I wrote about LiveCard, the $150 CGI from Royal Software, in TidBITS-338, I mostly noted its ability to put HyperCard stacks on the Web with little or no modification. Little did I expect that one of the most noted members of the HyperCard community would use it as the basis for a TidBITS search engine.

<http://www.quibble.com/HyperActive/LiveCard.acgi>

LiveCard acts as an intermediary between a Macintosh Web server and Jacque’s custom HyperCard stack that indexes issues, performs searches, and reports results. LiveCard presents a simple search form for entering up to three sets of search terms. Quoted phrases can be used, and Boolean search options are available. Search results are displayed as a list of article titles, and clicking a title takes users to the appropriate location in a TidBITS issue. Although HyperCard is sometimes maligned as a CGI engine in comparison to Frontier or compiled solutions, this LiveCard tool searches more than 10 MB of TidBITS articles and returns search results with surprising speed (and my server, where it’s temporarily being hosted, isn’t particularly fast). Although this search engine doesn’t let users restrict searches to particular ranges of dates or issues and only presents a bare-bones results listing, it’s a surprisingly smooth effort given the small amount of time Jacque was able to put into it, and an apt demonstration of the kinds of Web services that can be produced with off-the-shelf authoring software (especially since LiveCard is included in Apple’s HyperCard 2.3.5 Value Bundle). [GD]

<http://www.interedu.com/royalsoftware/descriptions/LiveCard.html>

<http://hypercard.apple.com/>

Glen Stewart & WarpSearch — Glen Stewart’s WarpSearch CGI works differently from most of the other entrants. Other solutions usually index the entire TidBITS archive, which makes for fast searches, but requires weekly additions to the index and can use a fair amount of disk space. In contrast, WarpSearch just searches the entire archive each time. That might sound slow, but it still manages to search the 10 MB of TidBITS issues at roughly 700K per second.
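
To make the tradeoff concrete, here is a minimal Python sketch contrasting the two approaches: scanning every article on each query versus looking words up in a prebuilt index. It is not WarpSearch's code or any entrant's; the sample texts and function names are assumptions invented for illustration. The last function works out the arithmetic behind the figure above: at roughly 700K per second, one pass over 10 MB (taking 1 MB as 1,024K) would take a little under 15 seconds.

```python
from collections import defaultdict

# Hypothetical stand-ins for article texts; not real TidBITS content.
ARTICLES = {
    "issue-001/a": "FoxPro for the Mac still performs",
    "issue-002/a": "Searching the archive with Frontier",
    "issue-002/b": "HyperCard stacks on the Web",
}

def scan_search(articles, phrase):
    """Brute force: read every article on every query. No index to
    maintain, and phrase matching comes for free."""
    needle = phrase.lower()
    return sorted(name for name, text in articles.items()
                  if needle in text.lower())

def build_index(articles):
    """Indexed approach: map each word to the articles containing it.
    This costs disk space and must be redone as new issues arrive."""
    index = defaultdict(set)
    for name, text in articles.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

def index_search(index, word):
    """Look up one word in the prebuilt index without rereading text."""
    return sorted(index.get(word.lower(), set()))

def full_scan_seconds(archive_kb, kb_per_second):
    """Time for one pass over the whole archive at a given scan rate."""
    return archive_kb / kb_per_second

print(scan_search(ARTICLES, "the archive"))
print(index_search(build_index(ARTICLES), "hypercard"))
print(round(full_scan_seconds(10 * 1024, 700), 1))
```

The scan needs nothing but the issue files themselves, while the index answers a query without touching the article text at all, which is where the indexed entries get their speed.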

WarpSearch only allows phrase searches, and no Boolean or multiple non-contiguous word searches. The results list provides the issue name, the size of the issue, the modified date, and the number of matches in that issue. Unfortunately, it doesn’t break articles out of the overall issues, sometimes returns unintelligible issues, and because it uses text from our setext files rather than the HTML versions, the found text doesn’t look as good as it could. [ACE]

<http://stewart-3.pnet.msen.com/cgi/warpsearch/warpsearch.html>

Nisus Software & GIA — Although Nisus Software’s GIA (Guided Information Access) technology isn’t precisely a full-text search engine, we decided to let them compete anyway. GIA provides keyword-based live filtering, so as you select keywords from a predefined list, the lists of matching TidBITS articles and available keywords both shrink. Selecting additional keywords decreases the number of articles and keywords until you’ve narrowed the search to a manageable set of articles. The hardest part of setting up a keyword system is selecting the keywords, and the system seemed to work best for relatively broad searches. Looking for a specific article was sometimes frustrating if necessary keywords weren’t present.
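
The narrowing behavior described above can be sketched in a few lines of Python. This is only an illustration of the general idea of keyword filtering, not Nisus Software's implementation (which is written in Java and which we haven't seen); the article titles, keywords, and the function name narrow are all hypothetical.

```python
# Hypothetical articles tagged with hypothetical keywords.
TAGGED = {
    "Article A": {"modems", "internet", "reviews"},
    "Article B": {"internet", "email"},
    "Article C": {"internet", "reviews", "browsers"},
}

def narrow(tagged, selected):
    """Return (matching articles, keywords still worth offering).

    An article matches if it carries every selected keyword. The
    remaining keywords are those found on the matching articles,
    minus the ones already selected."""
    chosen = set(selected)
    matches = {title: kws for title, kws in tagged.items() if chosen <= kws}
    remaining = set().union(*matches.values()) - chosen
    return sorted(matches), sorted(remaining)

print(narrow(TAGGED, []))
print(narrow(TAGGED, ["internet", "reviews"]))
```

Each added keyword can only shrink both lists, and an article that was never tagged with the keyword you have in mind can't be reached at all, which matches the frustration we ran into when looking for a specific article.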

I continue to be impressed with the possibilities of GIA, but its reality lags. Nisus Software has implemented GIA entirely in Java, and although we used it with different Java VMs (including Internet Explorer on a PC), it was continually plagued by interface glitches. Some can no doubt be easily fixed, but others may be more basic to Java or current tools. In the end, although GIA is fascinating technology, it doesn’t meet the shootout criteria, since the server doesn’t currently run on a Mac, and it’s not providing a full-text search. [ACE]

<http://www.infoclick.com/gia/gia6/TidBits1.html>

Roger McNab & NZDL — Roger McNab at the University of Waikato integrated the text of TidBITS issues with the search engine of the New Zealand Digital Library (NZDL). The NZDL enables users to search specific collections of documents (including Project Gutenberg, FAQ Archives, and others available only in PostScript or TeX formats), and permits ranked or Boolean queries, additional search options, and compact results pages that identify article titles and authors.

Although the NZDL archive is functional, useful, and offers an attractive query interface, it also violates one of our contest’s ground rules: it doesn’t run on a Macintosh. Although core portions of the project are written in Perl and the author doesn’t anticipate problems with a Macintosh port, the simple fact is that a Mac version doesn’t yet exist. [GD]

<http://www.cs.waikato.ac.nz/~nzdl/tbc/>

Tune In Next Week — There you have our contest entrants – tune in next week for more details on our favorites and the eventual winner or winners.
