Matt Neuburg 12 July 2004

The Simple Brilliance of Webstractor

Sometimes a new idea is so simple, you can’t believe no one’s thought of it before. Sometimes a simple idea is so ingenious, it feels magical. When an application embodies a new idea of that sort, you may not realize right away what it does: it lives just outside your accustomed paradigms, so at first you keep trying to see it as something it isn’t, like a child stuffing a square peg in a round hole.

Softchaos’s Webstractor is like that. It’s not big; it’s not complicated; it doesn’t feel particularly powerful or revolutionary; but it’s not quite like anything you’ve ever seen before, either. It’s small, simple, new, and downright brilliant. When you do grasp what it does, you’re amazed for an instant, as if someone had splashed water in your face. Then the instant passes (the water evaporates, the sun is warm, the day is bright) and you simply go back to your same old life as if nothing had happened – except that it isn’t quite the same old life, because now you’re using Webstractor. But it feels like the same old life, because you’re using Webstractor automatically, without thinking, as if it had always been part of your life.

That’s what I’ve been trying to say about Webstractor all along. It’s an old friend – an old friend you’ve never met before.

<http://www.softchaos.com/products/webstractor.html>

The Two Faces of Janus — What does Webstractor do? Well, for one thing, it’s a document-based application that can surf the Web. A Webstractor document starts out as a collection of Web pages that you’ve visited using Webstractor as your browser. The window is divided into two sections: the upper part is a list of the Web pages collected in this document, and when you click on a listing, that Web page is displayed in the lower part of the window.

Now, you might say: So what? Other programs I’ve reviewed in TidBITS, such as NoteTaker and DEVONthink, can be used as Web browsers. But Webstractor is not merely browsing the Web; it’s storing every Web page it renders, complete with any images and other secondary information such as frames and linked CSS and JavaScript pages. This means that the entire Web page is now captured in your Webstractor document, and can be viewed again later without using the network at all. A Web page stored in a Webstractor document is like an Internet Explorer "Web archive": it encapsulates the whole page, for offline reading. This alone is extremely welcome, because Safari doesn’t make Web archives. And remember, for Webstractor, such archives aren’t afterthoughts you create with Save As; every page you view using Webstractor is archived automatically into your document.

<https://tidbits.com/getbits.acgi?tbart=07584>

<https://tidbits.com/getbits.acgi?tbart=07575>

But there’s more. It turns out that a Webstractor document has two faces. The collection of Web pages is one face (called Browse mode). The other face (called Edit mode) is a single narrative that you’ve created by stringing some or all of those Web pages together, possibly in edited form. When I say "in edited form," I mean that you’re able to modify the content of a Web page as represented in the document’s Edit mode (the same Web page as represented in Browse mode lives on unaltered). The key moves that you can make in editing a page are things such as selecting a stretch of text and cropping so that the rest of the document is eliminated; highlighting a stretch of text (that is, giving it a bright yellow background); changing some text’s font, size, or color; and of course adding and deleting text.

I say "of course" as if these abilities were obvious; but in fact they are just plain jaw-dropping amazing. It is this transformation of a Web page into an editable thing that constitutes the simple, magical brilliance of Webstractor. The first time you see it, you can’t imagine how it is even possible.

Consider, for example, something like the TidBITS home page. It’s laid out in an elaborate way. It has a header with an image and a couple of form fields, then a complicated four-column table, then a series of two-column tables, then a footer. Yet this Web page can be transformed by Webstractor into an editable thing. This editable thing looks like the original page; but, behind the scenes, it has been divided into a collection of separately editable "text frames." The header image is a text frame; the form fields are a text frame; the first three columns in the opening table are text frames, and in the fourth column (the Take Control ad) nearly every line is a separate text frame of its own; then each column of each two-column table is a text frame; and the footer is five text frames (the four links and the copyright notice).

<http://www.tidbits.com/>

The reason for this approach is that a Web page like this, with its arrangement of tables and images, is too complicated to be represented in a simple RTF-based TextEdit type of word-processing window; but each of the text frames into which the page has been broken down behind the scenes is sufficiently simple. In effect, each text frame of the editable page is a separate stretch of RTF, with the text frames laid out to look like the original Web page. So now you can edit some spot in a text frame; when you’re done editing, the entire page re-renders itself. You can also eliminate whole text frames, so as to leave just the part of the page you’re interested in; when you do, you can reformat the remaining frame so that instead of being narrow (because it was once one column of a multi-column table) it is the full width of the page.

The point of this aspect of Webstractor is not merely to let you store and edit Web pages; it is to let you string the edited pages into a single document. You end up with what I earlier called a single narrative, like a multi-page word-processing document. The content of this narrative comes from the original Web pages, ordered and edited by you. You can also insert new material of your own, corresponding to no Web page at all (that is, you simply insert some completely original styled-text content). In the final narrative, you don’t necessarily even see the divisions between the original Web pages; a "page" in the narrative is a piece-of-paper page, not a Web page, and you can close up the gaps between Web pages so that the material is repaginated into a single seamless flow.

The Why and Wherefore — You’re probably now asking: "Okay, but why would I want to do that? What would I want to use Webstractor for?" My advice is not to ask that question, because your answer will almost certainly be wrong; you won’t be able to second-guess yourself, to predict your own behavior. Instead, just use it. Let yourself go. Webstractor is so easy and obvious, you’ll instantly find yourself doing automatically with it whatever it is that needs to be done.

For example, I’m currently studying Microsoft Word 2004. So, when I notice that a Web site is discussing problems or features of this version of Word that I might want to remember, I navigate to it in my Webstractor document devoted to Word 2004. In the case of something like MacFixIt, I’m not interested in the whole Web page; I just want the part that’s talking about Word, so in Edit mode I crop out everything else. In Browse mode, this Webstractor document is a collection of Web pages, but in Edit mode it’s a terse series of statements about Word 2004 that I can refer to later on.

<http://www.macfixit.com/>

In other cases, you might not bother with Edit mode at all. You might simply collect some Web pages to form a Webstractor document, just because a bunch of stored Web pages is a lot faster and simpler to read through than having to save a bunch of URLs and navigate to the actual sites in your browser – not to mention that you can do it without going online. I did just that during the weekend before I wrote the TidBITS article on the URL scheme security exploit. I scoured the Internet for information, using Webstractor as my browser. Webstractor captures every page visited, so I ended up with several dozen Web pages in my Webstractor document, of which only about a dozen were really germane, so I deleted the others. Then on Monday I wrote the article, glancing through the stored Web pages as I did so; and then I threw the document away, it having served its purpose.

<https://tidbits.com/getbits.acgi?tbart=07680>

It’s also hard to resist using the power to edit imported Web pages just for fun, to "hack" your own personal version of someone else’s Web page. Luckily you can’t export the result to HTML; it lives solely within your Webstractor document. But you can print it or export it as a PDF, and of course you can take a screen shot of it…

<http://www.tidbits.com/matt/downloads/NotMicrosoft.tiff>

The PDF export option is worthy of additional mention, since it provides an easy way to share a Webstractor document with anyone else, such as an editor who may want to review the sites you used when developing an article. And if that editor wanted to check those sites for updates, you could send her the original Webstractor document with its live updating capability.

Bells and Whistles — This article can’t describe Webstractor completely, but there honestly isn’t that much more to it than I’ve outlined. Just a few further points deserve separate mention.

When you reload a stored Web page in Webstractor, if the page has changed on the Web, the new version is stored as a separate item (different versions of a single Web page are nicely listed hierarchically, with the date and time, at the top of the Browse mode window). This means Webstractor can be used to maintain successive states of a Web page, as I did in my MacFixIt example above.
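For the technically curious, the store-a-new-version-only-if-changed behavior can be sketched in a few lines of Python. This is purely an illustration of the idea, not Webstractor's actual implementation, which I haven't seen; the class name `PageHistory`, its `store` method, and the example URL are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


class PageHistory:
    """Hypothetical store keeping a dated list of snapshots per URL."""

    def __init__(self):
        # url -> list of (timestamp, content), oldest first
        self.versions = defaultdict(list)

    def store(self, url, content, when):
        """Record content as a new version only if it differs from the latest one.

        Returns True if a new version was added, False if nothing changed.
        """
        snapshots = self.versions[url]
        if snapshots and snapshots[-1][1] == content:
            return False
        snapshots.append((when, content))
        return True


history = PageHistory()
history.store("http://example.com/", "first draft", datetime(2004, 7, 10, 9, 0))
history.store("http://example.com/", "first draft", datetime(2004, 7, 11, 9, 0))  # unchanged
history.store("http://example.com/", "second draft", datetime(2004, 7, 12, 9, 0))

# Print the versions grouped under their URL, each with its date and time.
for url, snapshots in history.versions.items():
    print(url)
    for when, _content in snapshots:
        print("   ", when.strftime("%Y-%m-%d %H:%M"))
```

Comparing the new content against only the most recent snapshot is one simple choice; a real application might compare hashes or ignore trivial differences.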

The Links Inspector is a utility window listing all the links in the current Web page, divided into several useful categories. Naturally you can navigate any link from here, adding it to your document.

A simple but useful Find capability works rather as in Preview: a drawer opens, you type a term into a search field and hit Return, and all matches are listed, with a little context, in a table. You click on a listing in the table to navigate to that occurrence in your document. The same drawer can be used to perform find-and-replace in the Edit mode portion of your document.
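A find-with-context listing of this kind is easy to picture in code. The following Python sketch is only an illustration under my own assumptions (a document modeled as a list of title and text pairs, a case-insensitive search, a fixed number of context characters); the function name `find_with_context` is hypothetical and nothing here reflects how Webstractor really does it.

```python
def find_with_context(pages, term, width=20):
    """Return (page_title, offset, snippet) for each case-insensitive match.

    pages is a list of (title, text) pairs. The snippet includes up to
    `width` characters on either side of the match. Offsets assume that
    lowercasing does not change the text's length (true for ASCII text).
    """
    results = []
    needle = term.lower()
    if not needle:
        return results
    for title, text in pages:
        haystack = text.lower()
        start = haystack.find(needle)
        while start != -1:
            lo = max(0, start - width)
            hi = min(len(text), start + len(needle) + width)
            results.append((title, start, text[lo:hi]))
            # Continue searching after the end of this match.
            start = haystack.find(needle, start + len(needle))
    return results


pages = [
    ("Page A", "Word 2004 crashes. Word again."),
    ("Page B", "Nothing relevant here."),
]
for title, offset, snippet in find_with_context(pages, "word", width=5):
    print(title, offset, repr(snippet))
```

Each result carries the offset of the match, which is what lets a click on a listing jump to that occurrence.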

The manual is a Help Viewer document; it’s rather superficial and incomplete (there’s an entire menu whose purpose is nowhere explained, for example). And why, oh why, can’t online help authors be bothered to supply decent navigational links between pages?

<http://www.macdevcenter.com/pub/a/mac/2004/03/30/online_help.html>

Conclusion — There’s very little not to like in Webstractor. It has a tendency to put up the "spinning pizza of death" from time to time, but it isn’t actually dying – it’s just performing some time-consuming process, and I’m sure that in future versions this will be made to take less time or will be sluffed off to a thread. The price (about $80, depending on the pound-dollar exchange rate) seems a bit steep – it’s higher than DEVONthink or NoteTaker – but that’s a trade-off you’ll have to judge for yourself, and you can easily do so, since a demo version is available for download (note that Webstractor requires Mac OS X 10.3 or later).

<http://www.softchaos.com/downloads/>

Perhaps you’ll use Webstractor simply to make up for Safari’s inability to save Internet Explorer-like Web archives; perhaps you’ll use it to assemble parts of Web pages as a vast set of notes for some research project. In any case, you’ll surely find it easy, fun, intuitive, and darned clever.
