This article originally appeared in TidBITS on 2011-09-30 at 12:50 p.m.
The permanent URL for this article is: http://tidbits.com/article/12472

How Take Control Makes EPUBs in Pages

by Michael E. Cohen

A long, long time ago (to be precise, in 2003), Adam and Tonya Engst conceived the Take Control series. It was to be an “ebook-first” series, meaning that the manuscripts were written and formatted right from the start with the idea that they would be read as ebooks. It was possible to print them, of course, but they were really designed for reading onscreen. In those early days, the authors of Take Control books prepared their manuscripts in Microsoft Word X on the Mac.

Word was an obvious choice: Tonya, who created the first template, was an agile user of Word’s advanced features, and she knew that there was an Adobe Acrobat Pro plug-in for Word for Windows 2003 that would enable her to export PDFs that included clickable Web URLs, internal links, and bookmarks — that is, so long as the authors created the manuscripts to spec. Back then, PDF was the only widely used ebook standard, and for an “ebook-first” series, hot links and bookmarks were (and still are) required.

(Regrettably, the Mac versions of Word are, to this day, unable to generate a PDF with links and bookmarks. And, yes, there is more than a little irony that the Mac-focused Take Control series was exported to PDF with Windows software — until earlier this year when we switched away from authoring in Word. That said, Tonya did use Word on the Mac to export a PDF that she then swapped in to replace the visible layer of the somewhat ugly Windows-generated PDF, thus gaining the links and bookmarks from the Windows version and the better-looking PDF from the Mac version.)

Also, Word had long been a de facto standard in the book publishing world, which meant that all Take Control authors owned it and knew how to use it. Word offered a plethora and a half of formatting features, which meant that our authors could, with a little guidance, produce manuscripts that were close in appearance to the final PDF file that we distributed. That is, they could write directly into their layout, thus eliminating extra production time that would otherwise be needed to flow their final words into the final layout. Word’s advanced table-formatting features were especially useful, and tables were used to create grids for eye-catching visual elements such as figures and highlighted tips.

The Way it Was -- In those early days of the series, PDF was (as it remains) the primary format for published Take Control books. Within a few years, though, we started offering print-on-demand versions (generated from the PDFs, run through Apago’s PDF Enhancer [1] to shrink them to an appropriate physical trim size), and in late 2009 we even started publishing most of our titles in two other ebook formats, EPUB and Mobipocket. But, for the vast majority of our readers, PDF continued to be the format of choice.

This was far from surprising: the print-on-demand books cost significantly more than the PDF versions and took longer for customers to obtain than the downloadable PDF versions. As for the two non-PDF ebook formats we offered, the market was tiny, since few people had, or used, ebook readers.

As a result, we focused the bulk of our energies on producing our PDFs in-house and delegated the production of the other versions of our ebooks to our reselling partner O’Reilly Media. Given the low demand for these other versions, a short delay between the publication of the PDF version of an ebook and its availability in other formats seemed acceptable.

Then the Kindle with its Mobipocket format appeared, breathing life into the languishing ebook market, and, after the Kindle, the iPad. The iPad was immediately important to many people who read Take Control ebooks. And, although there are several excellent options for reading PDFs on the iPad, ebooks in EPUB format, such as those sold through Apple’s iBookstore, have become a fundamental part of the iPad experience.

But we were still using Microsoft Word to prepare our manuscripts, and our relationship with Word was a difficult one. Although it gave us our dealbreaker feature — the capability to add links and bookmarks to a long manuscript — and it supported many other desirable features, it also was, well, flaky. And, we were often under so much time pressure to finish ebooks quickly that we could not take the time to track down and share with Microsoft the exact nature of each and every problem we encountered. Without going into all the sordid details, let’s just say that 100-plus-page Word documents with huge amounts of change-tracking and comments do not always behave well. In addition, the internal links our books required were fragile and painstaking to create and correct, resulting in a lot of effort going into checking and fixing links.

Meanwhile, as interest in the iPad and mobile ebook reading took off, we realized that outsourcing our EPUB book production, with its associated time lag, was not working well for many of our readers. Also, outsourcing meant that we had to relinquish control over the final look and feel of our EPUBs. But, for the longest time, we weren’t able to find a viable alternative to Word that would let us generate our PDFs, work in a feature-rich writing and editing environment with our formatting showing as we wrote, and generate EPUBs too.

Enter Pages -- A few months after releasing the iPad, Apple released a minor update to its iWork productivity software suite for the Mac, which included an update to the Pages [2] word processor. Included in that update was the capability to export Pages documents to the EPUB format (see “iWork 9.0.4 Gives Pages EPUB Support [3],” 27 August 2010). Taking the revised Pages for a spin around the block, Adam discovered that it could import a Word manuscript for a Take Control book and export both a decent PDF and a credible EPUB version of it. Metaphorical bells rang and imaginary light bulbs turned on.

Tonya took on the side project of converting the complex welter of Word paragraph and character styles used in Take Control manuscripts to Pages equivalents. The object of the exercise was two-fold: to see if Pages could produce the look and feel of a Take Control book in PDF form from a Pages manuscript, and to see if, with minimal effort, that same manuscript could produce a satisfactory EPUB version, with the definition of “satisfactory” being something at least as good in appearance and navigability as those that were already being made available to customers.

Most Take Control books use about 12 custom character styles and nearly 60 custom paragraph styles, so there were a lot of styles to consider. Plus, many of the table-based layouts had to be abandoned, since Pages’ table features are blunt tools when compared with Word’s many refinements and because tables make less sense in the EPUB format (more on that in a moment). Furthermore, while Pages and Word are similar in many ways, their internal content models have some subtle (and some blatant) differences. Adding to the complexity were the limitations of the EPUB content model itself as compared to what is possible in a PDF.

Dealing with the limitations of the EPUB content model is no small undertaking. By comparison, the PDF content model is a powerful and complex thing, designed, among other considerations, to produce onscreen as close to an exact copy of a document’s printed appearance as possible. In a sense, PDFs live at the intersection of print and pixel: an onscreen PDF (such as a Take Control book) should look exactly like a printed version of the document, with the same fonts, colors, and layout characteristics, including the same pagination.

The EPUB format, on the other hand, was designed to present documents in a readable way on portable digital devices, allowing the EPUB reading software to adjust the layout and appearance of an onscreen document to conform to the characteristics of the device on which it is read. Identical presentation of the fonts, layout, and pagination of an EPUB from one device to another is not what the format is about. The device, and the user of the device, are in control of much of the appearance of an EPUB book. While an EPUB book can look rather like a printed version of the same book if enough care and trouble are taken in its design, it will never look exactly like it, and there is no guarantee that any two readers will see the book in exactly the same way, unlike with PDF.

Reflecting the limitations of the EPUB format, the EPUB documents exported from Pages can use only a subset of the formatting capabilities of Pages itself: in its EPUB exports, Pages ignores, among other things, bordered text boxes, floating text and graphic elements, and various other features.

So Tonya not only had to convert the Take Control styles from Word to Pages, she also had to create versions of those styles that would look attractive both in a PDF and when reduced to those visual characteristics that an EPUB reader — such as the iBooks app on an iPad — can reproduce. With a lot of trial-and-error experimentation and rethinking of the purpose of each of the styles, Tonya eventually came up with a set of styles required for producing a PDF ebook and an EPUB ebook from the same Pages file. This project took about 40 hours of work spread out over a number of weeks. (Note, however, that Tonya’s original set of styles continues to evolve as we develop more understanding of how Pages produces EPUBs.)

In addition to rethinking the ebooks’ visual appearance, Tonya and Adam had to figure out what to do about what they considered key ebook features — the hot internal link, the hot Web link, and the bookmarks list. Creating a hot internal or Web link in Word was a slightly unreliable and fussy business involving the Hyperlink dialog, which Microsoft has never updated in any significant way through various versions of Word. Some authors eventually developed automation to help them through it and Tonya ended up linking up a lot of ebooks by hand, using a Keyboard Maestro [4] macro for some of the heavy lifting. At least the Adobe Acrobat PDF plug-in for Word for Windows did generate bookmarks reliably.

Luckily, Pages includes a feature for making internal links and Web URLs, which we’ve found to be extremely reliable. However, nobody on the Take Control team has found a satisfactory way to automate it, so the process in the Link Inspector is more manual than we would prefer. At least, however, once we set a link, it sticks around reliably.

We ran into two problems with PDFs exported from Pages. First, for each bit of text that’s a link, Pages creates two PDF links, one stacked on top of the other. The functional effect for the reader is the same — clicking either link works. The problem, however, is what the reader sees while clicking — PDF links can either show a box around the clicked link or can invert the screen around the link. With either, the fact that the clicked link likely doesn’t cover the link text entirely is disconcerting. Luckily, it turns out to be easy to select all links in Acrobat Pro and change their highlight style to none — there’s no visual indication at all that a link has been clicked now. It’s the lesser of the weevils, and hopefully Apple will fix this unfortunate bug in the next revision of Pages.

Second, PDFs exported from Pages lack bookmarks associated with the headings in the document. This caused us consternation, since we’d rather not add them by hand during the final moments of production. Adam’s first solution involved using Smile’s PDFpen Pro [5], which provides tools for making bookmarks quickly from selected text, but that was still more involved than the technique he eventually settled on: using Aerialist X Pro [6], an Acrobat plug-in from Debenu that offers several advanced PDF manipulation features, including one that builds a hierarchical set of bookmarks automatically based on scanning for text in specific fonts and sizes. (Another tool that’s worth a look for this task, if you’re on a budget, is PDFOutliner [7].)
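
To make the idea of building bookmarks from fonts and sizes concrete, here is a minimal Python sketch. It is not how Aerialist works internally (we don’t know that); it only illustrates the general technique of mapping heading styles to outline levels and nesting them. The font names, sizes, and the `Bookmark` and `build_outline` names are all hypothetical, and the sketch assumes the text runs have already been extracted from the PDF by some other tool.

```python
from dataclasses import dataclass, field


@dataclass
class Bookmark:
    title: str
    page: int
    children: list = field(default_factory=list)


def build_outline(runs, level_for_style):
    """Nest headings into a bookmark tree.

    runs: iterable of (text, font, size, page) tuples in reading order.
    level_for_style: {(font, size): level}, where level 1 is the top level.
    Runs whose style is not in the mapping are treated as body text.
    """
    root = Bookmark("root", 0)
    stack = [(0, root)]  # path of (level, bookmark) from root to latest heading
    for text, font, size, page in runs:
        level = level_for_style.get((font, size))
        if level is None:
            continue  # not a heading style
        # Climb back up until the top of the stack is a valid parent.
        while stack[-1][0] >= level:
            stack.pop()
        node = Bookmark(text, page)
        stack[-1][1].children.append(node)
        stack.append((level, node))
    return root.children


# Hypothetical heading styles and text runs, for illustration only.
levels = {("Helvetica", 24): 1, ("Helvetica", 18): 2, ("Helvetica", 14): 3}
runs = [
    ("Introduction", "Helvetica", 24, 1),
    ("Some body text.", "Palatino", 11, 1),
    ("Quick Start", "Helvetica", 24, 3),
    ("Install", "Helvetica", 18, 4),
    ("Configure", "Helvetica", 18, 6),
    ("Details", "Helvetica", 14, 7),
    ("About This Book", "Helvetica", 24, 9),
]
outline = build_outline(runs, levels)
```

The stack keeps only the current chain of parent headings, so a new top-level heading automatically closes any deeper levels that were open before it.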

Making the Switch -- Having a set of styles for producing an EPUB does not a workflow make. For starters, we needed a set of style guidelines and instructions for our authors, as well as a template document containing all of the necessary styles, so that they could start working in Pages.

Coming up with those instructions and that template was only part of the task, of course: the other part was helping the Take Control authors to switch from one working tool, Word — software with which most had become comfortable and productive over the years — to a completely different one. Several authors, however, had been encouraging us to switch away from Word for years, even if they couldn’t suggest a viable alternative, and Joe Kissell, in particular, was supportive of leaving Word, which was important given the number of books he has written. (Now that Nisus Writer Pro [8] provides EPUB export and many other welcome features, Joe is looking at whether it could take over from Pages.)

An even bigger task was figuring out how to convert our collection of existing manuscripts from Word to Pages. After all, many Take Control books are updates or new editions of existing Take Control books. Simply importing those Word manuscripts into Pages is not enough: the original Word styles, designed for producing PDF output from Word, have to be replaced by Pages styles designed to work both for EPUBs and PDFs.

In addition, in the move to Pages, Tonya took the opportunity to rationalize the set of Word styles that had grown organically over time into a set of Pages styles that were more consistently named and organized in the Styles drawer in Pages. This means that most of the styles in a Word version of a Take Control manuscript have to be replaced by hand with differently named Pages equivalents.

The style conversion process has all sorts of hidden pitfalls: for example, the numbered and bulleted list styles that we used in our Word manuscripts avoided Word’s auto-numbering and auto-bulleting capabilities because we did not like them — they were overly helpful, and too often guessed wrong about what styling we wanted to apply. The equivalent Pages styles, however, have been a big help in speeding up production and keeping our numbering correct. But using auto-generated bullets and numbers means that converting a Word manuscript to Pages involves a lot of search and replace work, removing all of the manually inserted numbers and bullets that the Word manuscript contains, as well as setting and checking the automatic numbering and bullets produced by the Pages equivalents.
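
As an illustration of the kind of search-and-replace cleanup involved (this is not the actual Take Control procedure), the marker-stripping step could be sketched in Python as follows. The pattern is an assumption about what hand-typed markers look like: a bullet character, or a number followed by a period or closing parenthesis, and then whitespace. It should be applied only to paragraphs already known to be list items, since an ordinary paragraph can legitimately begin with a hyphen or a number.

```python
import re

# Assumed marker shapes: "•", "·", "-", "*", or "12." / "12)" plus whitespace.
MANUAL_MARKER = re.compile(r"^\s*(?:[•·\-\*]|\d+[.)])\s+")


def strip_manual_marker(line: str) -> str:
    """Remove one hand-typed bullet or number from the start of a line."""
    return MANUAL_MARKER.sub("", line, count=1)


def clean_list_paragraphs(paragraphs):
    """Strip manual markers from paragraphs that are known to be list items."""
    return [strip_manual_marker(p) for p in paragraphs]


cleaned = clean_list_paragraphs([
    "1. Open the manuscript.",
    "2) Apply the list style.",
    "• Check the numbering.",
])
```

Because the pattern requires whitespace after the marker, text such as a version number at the start of a line ("10.6 is required") is left alone.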

There are many other finicky differences between the Word styles and the Pages styles that require massaging. Tonya had to develop a written set of procedures for performing this conversion so that she wouldn’t have to memorize all the steps, nor be the only person who knew how to do it.

This procedural document currently consists of 30 major steps, along with substeps and notes. The original document had even more steps and involved running the Word document through the HTML format and a tremendously complex BBEdit text factory to clean up the internal links sufficiently so they could work in Pages. Over time, however, Tonya decided to dump Word’s links (which have several oddities that make them harder to work with in Pages) in favor of re-creating them in Pages, giving the documents a fresh start, link-wise. Not including any necessary re-linking, for a typical manuscript conversion from Word to Pages, running through those steps and double-checking the results takes about 2 hours — not an inconsiderable amount of time, but not a huge amount either, especially when compared to the amount of time required to write and edit a book!

Making a Book -- Actually producing both PDF and EPUB ebooks from a Pages manuscript is, however, a far less arduous undertaking than converting a Word document into its Pages equivalent.

Currently, there are roughly 15 steps that need to be performed on a final edited manuscript that are the same for both the PDF version and the EPUB version. None of these steps are particularly difficult, though some are time-consuming (for example, one of those steps is to check every single link in the manuscript, both internal links and links to external sites, a process that can take an hour or two for a link-heavy manuscript, but which, thankfully, seems to be required only once — if a link works in either the EPUB or the PDF, it will work in the other version of the ebook).
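
We check links by hand, but for readers curious about what an automated first pass might look like, here is a hedged Python sketch. It relies on the fact that an EPUB file is a Zip archive containing XHTML files, and it uses only the Python standard library. The function name `check_epub_links` is hypothetical. The sketch verifies only that internal links point at a file and anchor that exist inside the archive; it merely lists external Web links rather than fetching them, so it is no substitute for clicking through the book.

```python
import posixpath
import zipfile
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit


class _LinkCollector(HTMLParser):
    """Collects anchor targets (id, or name on <a>) and href values."""

    def __init__(self):
        super().__init__()
        self.ids = set()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id"):
            self.ids.add(attrs["id"])
        if tag == "a":
            if attrs.get("name"):
                self.ids.add(attrs["name"])
            if attrs.get("href"):
                self.hrefs.append(attrs["href"])


def check_epub_links(epub_path):
    """Return (broken_internal_links, external_links) as sorted lists
    of (source_file, href) tuples."""
    ids_by_file, hrefs_by_file = {}, {}
    with zipfile.ZipFile(epub_path) as book:
        names = set(book.namelist())
        for name in names:
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _LinkCollector()
                parser.feed(book.read(name).decode("utf-8", errors="replace"))
                ids_by_file[name] = parser.ids
                hrefs_by_file[name] = parser.hrefs

    broken, external = [], []
    for source, hrefs in hrefs_by_file.items():
        for href in hrefs:
            parts = urlsplit(href)
            if parts.scheme:  # http:, https:, mailto:, and so on
                external.append((source, href))
                continue
            target = source  # a bare "#anchor" points into the same file
            if parts.path:
                target = posixpath.normpath(posixpath.join(
                    posixpath.dirname(source), unquote(parts.path)))
            if target not in names:
                broken.append((source, href))
            elif parts.fragment and (
                    unquote(parts.fragment) not in ids_by_file.get(target, set())):
                broken.append((source, href))
    return sorted(broken), sorted(external)
```

Calling `check_epub_links("book.epub")` on an exported file (the filename here is a placeholder) returns the internal links that need attention and the external links that still need to be checked in a browser.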

Once those steps are performed, the document branches: one branch becomes the PDF and the other becomes the EPUB. By developing a workflow that branches only near the very end of the production process, we can be reasonably sure that both versions of a book contain identical content: usually, all typos and other minor errors in the content are addressed before the branching occurs. If a typo is discovered in either branch after the split, we do have to fix it in both branches, but that’s minor.

For the EPUB branch, there are about ten steps that we follow to make the manuscript ready for exporting. One of those steps involves replacing three styles, out of the several dozen in the manuscript, with modified styles that work better in EPUB. To be specific, the Pages EPUB exporter cannot produce multiple adjacent block paragraphs that share a colored background, such as the ones we use in chapter openers and sidebars. Therefore, for those color-background paragraphs, we substitute paragraph styles that separate paragraphs with first-line indents rather than with empty space after each paragraph. We also adjust the ebook’s page margin settings (EPUBs don’t need page margins; the EPUB reader supplies its own), remove the Table of Contents (the Pages exporter creates its own EPUB table of contents), provide a cover page designed for the EPUB version, and export the EPUB.

After that, it’s a matter of visually checking the book in an EPUB reader for any obvious visual anomalies. For that, we look the book over in iBooks on an iPad in both horizontal and vertical orientation, using the default iBooks font, Palatino. We also may do spot-checks of the book in Firefox, using the EPUBReader [9] add-on (see “EPUBReader Displays EPUBs in Firefox [10],” 10 September 2010). The entire process following the branch usually takes less than an hour, with the bulk of the time devoted to the final visual check.

Once all that’s done (and a similar number of post-branch steps performed for the PDF version), the ebooks are ready to be loaded into our Take Control catalog and made available for sale, something that takes Adam a few hours as he deals with the different content management systems involved at that final stage.

What’s Next -- Developing and refining our in-house workflows and document styles is a continuing journey. Books, especially technical books like Take Control books, are inherently complicated, and there are often exceptions or adjustments to our template that we have to make to accommodate a particular book’s content — after all, the template is the servant of the content, not the other way around.

What’s more, we have yet to find a good way of producing Mobipocket (Kindle) format ebooks in-house: the KindleGen [11] software that Amazon makes available for making Kindle books is something of a black box, and we have yet to find a way of employing it in-house that produces acceptable results with heavily formatted books, so we still outsource the production of Mobipocket ebooks to O’Reilly. But maybe someday a good EPUB-to-Mobipocket converter will appear (Calibre isn’t it, before you ask), and if it does, we’ll be able to bring the Mobipocket production in-house as well. It’s also possible things will change when the recently announced Kindle Format 8 becomes available, along with KindleGen 2.

We are finding certain frustrations with Pages that we didn’t anticipate. A big one is comment retention: In Word, a comment sticks around even if the text that it is associated with is deleted. So, in Word, an editor can highlight a word, insert a comment, and write “I think you made a typo in this word.” The author can easily perceive the problem, delete the word, and re-type it. However, in Pages, the author’s act of deleting the misspelled word also deletes the comment, so neither the editor nor the author can refer to it if a question remains about the edit. We’ve had to be much more careful about comment placement, and more wordy because the comment can’t highlight the text that it is discussing.

Another frustration is with navigating a manuscript in Pages: Word’s optional Navigation pane, which neatly shows the table of contents in a sidebar at the left of the main document display area, is sorely missed by some Take Control people. Tonya has resorted to opening a second copy of the Pages document and viewing its table of contents or using Outline view to simulate this functionality — she finds the ability to see the big picture while editing in a small area to be extremely important, especially when she considers the paths that different readers might take as they click through the internal links in the final ebook.

Nonetheless, the adoption of Pages, with all of the work we had to do to get there, is beginning to pay off for us: we are able to produce new Take Control books and revise existing books more quickly and efficiently than ever, and with more attractive results for the EPUBs. Not bad for a few dozen hours of hard thinking and research, and with a word processor that costs less than $20!

[1]: http://www.apagoinc.com/prod_home.php?prod_id=37
[2]: http://www.apple.com/iwork/pages/
[3]: http://tidbits.com/article/11550
[4]: http://www.keyboardmaestro.com/
[5]: http://www.smilesoftware.com/PDFpenPro/
[6]: http://www.artspdf.com/arts_pdf_aerialist_pro.asp
[7]: http://www.onekerato.me/OneKerato/PDFOutliner.html
[8]: http://www.nisus.com/pro/
[9]: https://addons.mozilla.org/en-US/firefox/addon/epubreader/
[10]: http://tidbits.com/article/11590
[11]: http://www.amazon.com/gp/feature.html?ie=UTF8&docId=1000234621