Avoid and Fix Word Document Corruption
The main reason we switched from Microsoft Word for the Mac to Apple’s Pages for writing Take Control books was that Pages supported EPUB export and its PDF export was superior to Word’s. A smaller reason for the switch was concern about occasional document corruption, which would always hit at an inopportune time. Since our documents were long and complex, with some breaking the 200-page mark, we learned to avoid certain Word features.
For example, we found that automated cross-references often caused corruption in our Word (.doc) files, and we eventually banned their use in Take Control manuscripts. We also developed specific ways of working to reduce the impact of a corrupted document. Before opening a file, each of us would make a copy in a separate folder, and increment a version number in the filename, making it easy to revert to a previous version should corruption crop up.
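The copy-and-increment habit described above is easy to script. The following is only a minimal sketch, not the workflow we actually used: the naming scheme ("Manuscript v7.docx"), the archive folder, and the function names `next_version_name` and `archive_and_bump` are all assumptions made for illustration.

```python
import re
import shutil
from pathlib import Path


def next_version_name(filename: str) -> str:
    """Return filename with the number at the end of its stem incremented.

    Assumes a hypothetical scheme such as 'Manuscript v7.docx'. A name with
    no trailing number gets ' v2' appended to the stem.
    """
    path = Path(filename)
    match = re.fullmatch(r"(.*?)(\d+)", path.stem)
    if match is None:
        return f"{path.stem} v2{path.suffix}"
    prefix, number = match.groups()
    # zfill keeps zero padding, so 'v007' becomes 'v008' rather than 'v8'.
    bumped = str(int(number) + 1).zfill(len(number))
    return f"{prefix}{bumped}{path.suffix}"


def archive_and_bump(document: Path, archive_dir: Path) -> Path:
    """Copy document into archive_dir, then rename the working file.

    Returns the path of the renamed working file, which carries the next
    version number. The untouched copy stays in archive_dir.
    """
    new_path = document.with_name(next_version_name(document.name))
    if new_path.exists():
        raise FileExistsError(f"refusing to overwrite {new_path}")
    archive_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(document, archive_dir / document.name)  # copy2 keeps timestamps
    document.rename(new_path)
    return new_path
```

Run before opening a manuscript, something like `archive_and_bump(Path("Book v3.docx"), Path("Old Versions"))` would leave "Book v3.docx" in the archive folder and give you "Book v4.docx" to edit, so reverting is just a matter of opening the archived copy.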
Even though Word document corruption is no longer a concern for our production process, I noticed that a recent discussion on the Office for Mac forum offers two useful pieces of advice for those who do worry about these problems: a list of best practices for avoiding corruption from MVP John McGhie and a technique for removing corruption if it happens.
John’s best practices include:
- Always run the latest version of Microsoft Office — John says that Word 2011 won’t necessarily even run on OS X 10.8 Mountain Lion without all the latest updates. To that I would add that it’s always worth waiting a week or so on updates, since quick follow-ups to fix newly introduced bugs in large software packages are becoming all the more common.
- Never use Track Changes. John is adamant about this, but it’s something I’d never heard before. We always used Track Changes, it being one of Word’s most useful features for collaborative editing, but the only problem we ever associated with it was sluggishness in documents with extensive tracked changes. Instead, John suggests relying on Compare Documents after the fact (find it in Tools > Track Changes > Compare Documents), which gives the same result safely (though you may find it more difficult to work with — we certainly did). For what it’s worth, we rely heavily on Track Changes in Pages too, and haven’t seen corruption issues there.
- Don’t apply direct formatting (bold, italic, font changes, etc.). Instead, define named character and paragraph styles and rely entirely on them. I’ve not heard this advice before, but it makes sense, given how Word stores formatting information in paragraph marks following each paragraph and at the end of the document. Better yet, a properly styled document is much easier to work with if you want to make wholesale style changes or import it into another application, like Adobe InDesign. We relied almost entirely on named styles in Word, though we applied some direct styling, like bold, by hand.
- Never use drag-and-drop for editing, and instead rely on cut and paste. John notes that he has trouble avoiding drag-and-drop editing, since it can be extremely convenient, and it’s a shame that such direct manipulation can cause trouble.
- Use only the modern .docx format, and save older .doc files to .docx. The XML-based .docx format can describe and store aspects of a document that are impossible in .doc, so saving in .doc format can remove information from your document. We were doing this “wrong” too, because we worked with too many people who hadn’t upgraded to versions of Word that could use the .docx format. Nowadays, there’s little excuse for not using .docx.
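One side benefit of the XML-based format is that a .docx file is a ZIP package of XML parts, so the grossest kinds of damage can be spotted without opening Word at all. The sketch below is an illustration under that assumption, using only Python's standard library; the function name `docx_problems` is made up for the example. It checks only that the package unzips and that its XML parts parse, so a file that passes can still contain the subtler kind of corruption discussed here.

```python
import zipfile
import xml.etree.ElementTree as ET


def docx_problems(path: str) -> list[str]:
    """Return structural problems found in a .docx package (empty list if none)."""
    try:
        package = zipfile.ZipFile(path)
    except zipfile.BadZipFile:
        return ["not a readable ZIP package"]

    problems = []
    with package:
        names = package.namelist()
        if "word/document.xml" not in names:
            problems.append("missing word/document.xml (the main document part)")
        for name in names:
            if not name.endswith((".xml", ".rels")):
                continue  # skip images and other binary parts
            try:
                ET.fromstring(package.read(name))  # read() also verifies the CRC
            except zipfile.BadZipFile:
                problems.append(f"{name}: damaged ZIP entry")
            except ET.ParseError as error:
                problems.append(f"{name}: malformed XML ({error})")
    return problems
```

An empty list from `docx_problems("Manuscript.docx")` means only that the container is intact. It says nothing about whether Word will be happy with the contents.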
What if your Word document is already showing signs of corruption? A technique called “doing a Maggie” (named for Margaret Secara from the TECHWR-L mailing list, who first publicized the technique) can help. Follow these steps:
- Create a new, empty document in the .docx format.
- In your corrupted document, display paragraph marks (¶); there’s usually a button you can click to do so, or try the Command-8 shortcut.
- Click at the very beginning of the corrupted document to set the insertion point there, scroll to the end of the document, hold down the Shift key, and click again just before the last paragraph mark in the document. (Various document attributes are stored in that last paragraph mark, so it’s a place where corruption can lurk.)
- Copy the selected text, switch to the new document, paste the text, and save with a new name.
If that doesn’t work, particularly with a long document, make a backup and then try copying just the first half of the corrupted document out to a new document. If that new document seems fine, the problem lies in what remains, so copy successive halves of the remainder until you isolate it. (If the new document still shows the problem, the corruption is in the half you copied, so subdivide that half instead.) At that point you can step back, extract the large portions of the original document on either side of the corruption, and reunite them in a new document. The concept here is the same as the old “binary search” method of isolating extension conflicts in the classic Mac OS: turn half of the extensions off, and if the Mac boots properly, enable half of the remaining extensions, repeating as necessary until the culprit is found.
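For the curious, the halving procedure can be written down as a short sketch. It is an illustration only: it treats the document as a list of paragraphs and assumes there is exactly one bad paragraph, and that any chunk containing it shows the problem. The `shows_problem` callback is hypothetical; in real life it is you, pasting a chunk into a new document and seeing whether Word misbehaves.

```python
from typing import Callable, Sequence


def find_bad_paragraph(
    paragraphs: Sequence[str],
    shows_problem: Callable[[Sequence[str]], bool],
) -> int:
    """Return the index of the single paragraph that triggers the problem."""
    if not paragraphs or not shows_problem(paragraphs):
        raise ValueError("the full document does not show the problem")
    low, high = 0, len(paragraphs)  # the culprit is somewhere in paragraphs[low:high]
    while high - low > 1:
        middle = (low + high) // 2
        if shows_problem(paragraphs[low:middle]):
            high = middle  # the first half is bad, so keep splitting it
        else:
            low = middle  # the first half is clean, so the culprit is in the rest
    return low


# Simulated document in which paragraph 137 is the corrupt one.
paragraphs = [f"Paragraph {i}" for i in range(200)]
culprit = find_bad_paragraph(paragraphs, lambda chunk: "Paragraph 137" in chunk)
print(culprit)  # 137
```

Because each check halves the suspect range, even a 200-paragraph document needs fewer than ten checks, which is why the technique is worth the tedium on long manuscripts.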
In a worst-case scenario, where these techniques don’t help, we’ve sometimes had luck with saving as RTF, and then opening that document and converting the RTF file back to Word format. Some aspects of the document may be lost, but if it’s either that or saving as plain text and losing all style information, RTF is the lesser of the weevils.
We had problems with large Word documents 20-cough-odd years ago. I'm astounded it's still an issue.
p.s. I think the autocorrect got you on 'lesser of two evils', though 'lesser of the weevils' is funny, particularly in this context.
I think it's less of one than it used to be, but the fact that this information is being shared on Microsoft's own forums tells me that it hasn't disappeared.
The "lesser of the weevils" joke is from Patrick O'Brian's Jack Aubrey books... (on which the movie "Master and Commander" was based).
http://www.goodreads.com/quotes/118106-two-weevils-crept-from-the-crumbs-you-see-those-weevils
A technique I've used for years (on Windows) that has always worked for me, assuming that the document can still be opened in Word, is to do a Save As RTF. The RTF version will be MUCH larger than the original document. Close Word, and then open the RTF file in Word and then do a Save As doc/docx.
Yep, saving as RTF is a long-standing solution - Tonya used that one back in the early 90s when she was working at Microsoft tech support.
I wonder if saving as docx would clear up the problem.
DOC files are very binary -- it is basically a memory dump of the state of Word's internal storage.
DOCX is a newer format, where the document is saved as an XML structure.
Of course there is the risk that the act of saving the file to XML would trigger the problem when it attempts to read and parse the corrupted data. But in that case, you'd expect saving to RTF wouldn't work either.
One of the rules of recovering from corruption is always to have a backup, at which point you can experiment with all sorts of possibilities.
A broken backup isn't enough. A series of backups, back to an uncorrupted version is better.
Make sure that they're real backups, not temp files. Trash the temp files to avoid confusion.
How about not using Word? Nisus Writer Pro is far superior, and its track changes feature actually works.
I've seen variations on this theme many times before, "Word is a powerful professional tool. And, it's perfectly stable as long as you never use Powerful Feature X."
In other words, have fun with your new Lamborghini! Just don't ever drive it fast.
A few additional tips and clarifications:
The cross-reference problem is a specific case of a more general problem: the more field codes, the greater the risk of corruption.
Maggie Secara (hi, Maggie!) popularized a technique invented by Woody Leonhard in his book "Word 97 Annoyances".
"Temporary" [sic] files are the bane of Word; delete them at least monthly. Google my article "Protecting yourself from Microsoft Word" for details.
Track changes has never been a problem for me (I'm a professional editor, and often use this feature to add more changes than there was original text). Stability isn't an issue.
Word 2011 is a black spot on Microsoft's reputation that is still in beta testing; it doesn't deserve to have the same name as its Windows counterparts, which are faster, stabler, easier to use, and more efficient. (Speaking as someone who wrote the book, Effective Onscreen Editing, on editing in Word...)
A classic way to create a corrupt document is to use Track Changes and have one writer in a Mac environment and the other in a Windows environment. Even using .docx for storage, it's an easy way to trigger corruption. The usual symptom is the last few lines of paragraphs disappearing unless you click and drag through them.
To avoid this one, we've developed a protocol in which all changes are accepted, the document is saved under a new name, and editing is performed on the new document.
Cross-referencing hasn't worked reliably since Word 5.1. It would be wonderful to have; it was a terrific, bullet-proof feature of the late, lamented Adobe FrameMaker. But, like formatted auto-numbering, it doesn't work reliably in Word.
I've used MSW for Mac to prepare engineering reports since 1987. Each rev got better until 5.1a, which is probably the best software MS ever wrote. After that MSW became a bloated mess. The entire 'Master Document' feature should also be on the 'Never, ever…' list. It'll turn a multi-file document to garbage in the twinkling of an eye, styled or not.
Another way to fix a corrupt document is to save as XML. Then re-load and save as docx and it won't be corrupt anymore :-) but it may have lost bits. :-(
I agree with use of references and change tracking causing corruption. (My suspicion is that the change tracking corruption occurs when one of the other users has a Normal.dot file that is FUBAR'd, but not enough to crash things.)
Not noticed a problem with applying bold/italic etc directly as opposed to using styles, though.
HTH
The main cause of "corruption" in Word documents is failure to define and use styles. That may also be a problem in Pages, but I don't like or use Pages.
The second cause of real corruption is using autosave, instead of frequent manual multiple backups; also any feature that depends on automation that you haven't created yourself. I've been using MS Word since v.3 for brochures, articles, technical reports and books. Once I figured out its quirks and powers, it's been reliable.
I use at least 50% of its features. Most people use less than 10%. Pages fails to do about 80% of what I can do in Word.
IMHO, XML is a big PITA. XML makes it harder, not easier to export/import and edit. Instead of using newer versions of Word, I've switched from Office 2004 [PPC on Mac] to LibreOffice or related software, or use MS Office for Windows with CrossOver Mac. I don't have time to make up for all the things that Pages can't do at all, or can't do very well.
When starting a document, NEVER, NEVER use the NORMAL template. It could look different for those sharing or receiving your document. I created my own NERMAL template that goes with the document wherever it's viewed and used. Normal defaults to Times New Roman 12, Nermal defaults to whatever I prefer, and you will see.
NEVER use predefined Styles. They're not yours and may not be the same on another computer. Create your own style names, e.g. Header=Hed, Body Text=Text, Footer=Foot, Hyperlink=Link, Table of Contents=TOC+. That way, with styles embedded in your document, it will always look the way you want with your chosen formatting, not the MS default. It will also export as you prefer.
Important:
If you will be exporting to a professional page layout program like Adobe InDesign or Quark XPress, be sure that the names that you use for styles match the style names in the target software program; same for sharing among word processing software.
I wanted to comment on your recent Word corruption article. I was told to go to my account and submit a comment from there. The sidebar has no link that takes you to your account. And when I click the account link in the upper right, next to your acknowledgement that I have signed in, I only come to a change-your-info page. My info is fine. I wanted to send a comment. I have spent 45 min on this, almost as long as it takes to fix a Word document with corruption. Tom Wilson
Submitting this required that I sign into my account! I thought I was. And then login would not take the password I used successfully a few minutes ago.
Tom