You have a crucial thesis or presentation that’s due in the morning, but when you try to open it, you get a message saying the file has an error. It may seem like the end of the road, but with a little XML hacking, you can repair your file in just a few minutes and be back to work. Document repair is something you can do yourself.
First, let’s look at different causes of file corruption. The number one cause is working on files while they are on temporary or removable media. A USB or flash drive is a convenient way to carry data. The common alternative is to keep your information in the Cloud. But both of these are hazardous if you’re editing files. Accidentally ejecting a USB stick or losing your Internet connection while a file is open in Office is a near-guarantee of corruption. This type of corruption is also disastrous, because the file contents are so thoroughly scrambled, there is no way to recover the data.
But there are also files that get scrambled by software and usually these are recoverable. We’ll use the same techniques covered in previous posts. Windows users should review XML Hacking: An Introduction, while OS X hackers need to follow these instructions: XML Hacking: Editing in OS X
Is the File Recoverable?
When opened in Office, unrecoverable files may give you errors like these:
The first step is to rename a copy of the file with a .zip ending and expand it. An unrecoverable file (one scrambled by a USB or Cloud drive) will almost always raise an Zip error. Cut your losses, you’re not going to be able to fix this. As a second-best alternative, try opening the original damaged file in NotePad (on Windows) or Text Edit (OS X) to recover whatever text you can. You also might be able to extract some contents by opening in a different word processor, like Pages on a Mac.
By contrast, if you see the following messages, document repair is possible:
You can see that the first 2 messages are generic, while the second 2 give a specific location for the error. This means the file is at least partially readable by the program.
There are quite a few document repair articles on the web that are worth reading for the variety of tools that people are using. I prefer a combination of a good text editor (NotePad++ on Windows, BBEdit on OS X), plus a modern browser like FireFox or Chrome. The text editor is where you do the editing, while the browser parses the XML and finds any errors.
You’ve already unzipped the document or presentation, now look for the XML portion that contains the error. Most of the time, with a Word file, document.xml will be the culprit. Open document.xml in the text editor and Prettify (NotePad++) or Tidy (BBEdit) it to make it readable. A raw document.xml file only has 2 lines, which is why the XML errors are invariably reported as being on line 2. Making the text readable also adds useful line numbers to error reports, making the errors much quicker to find. Now the file should look like this:
Save document.xml, then open it in your choice of browser. This is how FireFox and Chrome show where the first error is:
As you can see, the report is a little more informative in FireFox. The error is a mismatched tag: a tag was opened but not closed. It expected to see the closing tag </mc:Fallback> and it tells you exactly where it thought that tag should be. The arrow points to the first character that is in error. The correct way to interpret this is that the expected end tag should be inserted immediately before the tag pointed to.
Document Repair Technique
Here’s what the error location looks like in the text editor:
Then here is what it looks like after inserting the closing tag (you can copy and paste directly from the browser window):
Save document.xml in the text editor, then refresh the browser. The next error is shown:
Repeat the steps. Some files have only a couple of errors, others may have dozens. You’ll know when you’re done, because refreshing the browser will give you a different screen, displaying the XML instead of an error message:
Rebuild the File
Close the text editor and browser, then re-zip the folders and [Content_Types].xml, giving the zip file a new name and a file ending that matches the original. Open it to ensure it works. Office does not tolerate XML errors well and doesn’t give you clear error messages, so if the file doesn’t open, you missed something. In addition, Mac users have to use Terminal to zip and view files, as noted on the XML Hacking: Editing in OS X page.
Lots of people ask “How can I prevent this?”, but there isn’t a really good answer. If a file can be repaired, it’s almost always due to a program bug that writes malformed XML. In the Word file used for example, this is often when a placed graphic has no fallback information, which is supposed to help with graphic depiction in older file formats. It appears that the program omits the closing fallback tags when saving and you get the error. It’s not your fault, but Microsoft has not been able to find and eliminate this bug since the 2007 version.