Verifying Web Links in PDF Files
Our ebooks have tons of Web links in them, and for a long time, one of the most tedious production tasks was verifying that the links the author had added to the manuscript were still valid. In an effort to simplify this task, I came up with the following process.
Unfortunately for those trying to replicate it, my process relies on an expensive plug-in, the $699 Aerialist Pro from ARTS PDF. I initially purchased Aerialist Pro because it can generate PDF links from page numbers to the associated pages in a PDF; I used it to link up all the page numbers in the index of the ebook version of iPhoto ’08: Visual QuickStart Guide. That task would have taken many hours using the astonishingly bad linking tool in Acrobat Professional 8, so I was able to justify the price. On the Mac, Aerialist Pro runs only in Acrobat Professional 7, so I was glad I kept that version around, and copies still seem to be available via Amazon.com.
Aerialist Pro has other useful features, including the capability to produce a report listing all external links, which gave me what I needed to develop the rest of my process. (Unhappily, another Aerialist Pro feature that I would love to use – the capability to set link properties like zoom level and appearance en masse – turns out to have a bug that causes problems with documents viewed in Continuous mode. ARTS PDF has confirmed the bug, and I hope they fix it, along with enabling Aerialist Pro to work inside Acrobat Professional 8.)
Aerialist Pro’s external link report is itself a PDF, so my first step is to save the report from Acrobat as a plain text file, called Dependency Report.txt (the extension isn’t optional). But in the end, I need a .html file, so I set up Noodlesoft’s Hazel to look for a file called Dependency Report.txt in a specific folder, rename it uniquely and with a .html extension, and open it in BBEdit.
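For readers without Hazel, the renaming half of that rule can be approximated with a short script. This is only a sketch of the idea, not a description of how Hazel works: the function name claim_report and the timestamp naming scheme are my own assumptions, and it does not watch the folder or open the result in BBEdit.

```python
import time
from pathlib import Path


def claim_report(folder, name="Dependency Report.txt"):
    """Rename the saved report to a uniquely named .html file.

    Returns the new path, or None if no report is waiting in the folder.
    The timestamp-based name is an assumption, not Hazel's own scheme.
    """
    source = Path(folder) / name
    if not source.exists():
        return None
    stamp = time.strftime("%Y-%m-%d %H%M%S")
    target = source.with_name("{} {}.html".format(source.stem, stamp))
    counter = 1
    # Two reports saved within the same second get a numeric suffix.
    while target.exists():
        target = source.with_name(
            "{} {} ({}).html".format(source.stem, stamp, counter)
        )
        counter += 1
    source.rename(target)
    return target
```

Calling claim_report with the path of the folder where Acrobat saves the report returns the renamed file's path, which can then be opened in any editor.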
Once I have the file in BBEdit, I run a text factory that takes the rather plain output from Aerialist Pro, strips out the cosmetic parts, and turns all of the links into proper HTML links (A tags with HREF attributes). It’s a lot of grep pattern matching, and while it wasn’t trivial to create, it wasn’t all that hard.
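The same transformation can be roughed out in a few lines of Python for anyone who doesn't use BBEdit. This is a minimal sketch, not a port of my text factory: it assumes each external link appears in the report as a bare URL, since the exact layout of Aerialist Pro's output isn't reproduced here, and the names URL_PATTERN, extract_links, and links_to_html are made up for illustration.

```python
import html
import re

# Assumption: links show up in the report as bare http, https, ftp,
# or mailto URLs, separated from other text by whitespace.
URL_PATTERN = re.compile(r"""(?:https?|ftp)://[^\s<>"']+|mailto:[^\s<>"']+""")


def extract_links(report_text):
    """Return the unique URLs in the text, in order of first appearance."""
    seen = set()
    links = []
    for match in URL_PATTERN.finditer(report_text):
        # Drop punctuation that usually belongs to the sentence, not the URL.
        url = match.group(0).rstrip(".,;:)")
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


def links_to_html(report_text, title="Dependency Report"):
    """Build a bare-bones HTML page containing one anchor per link."""
    items = "\n".join(
        '<li><a href="{0}">{0}</a></li>'.format(html.escape(url, quote=True))
        for url in extract_links(report_text)
    )
    return (
        "<html><head><title>{}</title></head>\n"
        "<body>\n<ul>\n{}\n</ul>\n</body></html>\n"
    ).format(html.escape(title), items)
```

Everything that isn't a URL is simply discarded, which stands in for stripping the cosmetic parts of the report. Duplicates are collapsed so each link is checked only once.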
The next trick is to check all the links. After much searching and testing, I found a $25 utility called Braxton’s Link Tester (BLT) that does a nice job of checking links and reporting back on which ones have problems. After running the BBEdit text factory and saving the file, I drag the file’s proxy icon (the little icon in the title bar of every document window; click, hold, and drag it as though you were dragging the file’s Finder icon) to BLT’s Dock icon. In BLT, I then click the Check Links button and go do something else for a few minutes while it visits all the links.
What I like about BLT is that it’s easy to deselect the green checkmark tab that shows all the good links, since I don’t care about those, and focus in on broken links (for the screenshot below, I left the good ones showing). BLT goes beyond a simple thumbs-up/thumbs-down display, identifying failed links, forbidden links, links that time out, links forbidden by robots.txt, server errors, email links that must be verified manually, and protocols that BLT doesn’t recognize.
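To give a sense of what sorting links into categories like these involves, here is a rough Python sketch using only the standard library. It is not how BLT works internally, which I don't know. The category names and the functions classify_status, precheck, and check_link are assumptions of mine, and the sketch makes no attempt to honor robots.txt.

```python
import socket
import urllib.error
import urllib.request
from urllib.parse import urlparse


def classify_status(code):
    """Map an HTTP status code to a coarse category."""
    if 200 <= code < 400:
        return "ok"
    if code in (401, 403):
        return "forbidden"
    if 500 <= code < 600:
        return "server error"
    return "failed"


def precheck(url):
    """Categorize links that can't be fetched over HTTP; None otherwise."""
    scheme = urlparse(url).scheme.lower()
    if scheme == "mailto":
        return "verify manually"
    if scheme not in ("http", "https"):
        return "unrecognized protocol"
    return None


def check_link(url, timeout=10):
    """Visit one link and return its category."""
    category = precheck(url)
    if category:
        return category
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 503.
        return classify_status(err.code)
    except (socket.timeout, TimeoutError):
        return "timed out"
    except urllib.error.URLError as err:
        if isinstance(err.reason, (socket.timeout, TimeoutError)):
            return "timed out"
        return "failed"
    except ValueError:
        # Malformed URLs that urllib refuses to open at all.
        return "failed"
```

One design caveat: the sketch sends HEAD requests to avoid downloading whole pages, but some servers reject HEAD, so a link reported as failed may deserve a second try in a browser before it is removed.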
Most of the time there are only a couple of broken links, if any, and then it’s just a matter of going back into the original Word document and the working PDF file and either removing the links or replacing them with correct links.
I won’t pretend this is the only way to automate link checking. It might be possible, for instance, to write an AppleScript that would identify and check the links, reporting back on which ones had troubles. But I do hope this will give you a sense of how you might be able to eliminate a manual step in producing PDF files that work as they should.