OOXML Hacking: Document Repair

You have a crucial thesis or presentation that’s due in the morning, but when you try to open it, you get a message saying the file has an error. It may seem like the end of the road, but with a little XML hacking, you can repair your file in just a few minutes and be back to work. Document repair is something you can do yourself.

First, let’s look at different causes of file corruption. The number one cause is working on files while they are on temporary or removable media. A USB or flash drive is a convenient way to carry data. The common alternative is to keep your information in the Cloud. But both of these are hazardous if you’re editing files. Accidentally ejecting a USB stick or losing your Internet connection while a file is open in Office is a near-guarantee of corruption. This type of corruption is also disastrous: the file contents are so thoroughly scrambled that there is no way to recover the data.

But there are also files that get scrambled by software and usually these are recoverable. We’ll use the same techniques covered in previous posts. Windows users should review XML Hacking: An Introduction, while OS X hackers need to follow these instructions: XML Hacking: Editing in OS X


Is the File Recoverable?

When opened in Office, unrecoverable files may give you errors like these:

Recover Text
Parts Missing

The first step is to rename a copy of the file with a .zip ending and expand it. An unrecoverable file (one scrambled by a USB or Cloud drive) will almost always raise a Zip error. Cut your losses: you’re not going to be able to fix this. As a second-best alternative, try opening the original damaged file in Notepad (on Windows) or TextEdit (OS X) to recover whatever text you can. You might also be able to extract some contents by opening the file in a different word processor, like Pages on a Mac.
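The Zip check can also be done without renaming anything. The following is a minimal sketch in Python, assuming Python 3 is installed; the function name `check_ooxml_package` is made up for this illustration and is not part of Office or any repair tool. It only reports whether the file still opens as a Zip archive. It does not look at the XML inside.

```python
import zipfile
import zlib

def check_ooxml_package(path):
    """Return (ok, message) saying whether the file still opens as a Zip archive."""
    if not zipfile.is_zipfile(path):
        return False, "not a readable Zip archive"
    try:
        with zipfile.ZipFile(path) as package:
            # testzip() reads every part and returns the name of the
            # first one that fails its checksum, or None if all pass.
            bad_part = package.testzip()
    except (zipfile.BadZipFile, zlib.error, EOFError, OSError) as error:
        return False, f"damaged Zip archive: {error}"
    if bad_part is not None:
        return False, f"damaged part: {bad_part}"
    return True, "archive is intact"
```

Calling `check_ooxml_package("thesis.docx")` (a hypothetical file name) returns a pair such as `(True, "archive is intact")`. A `False` result suggests the file is in the unrecoverable group described above.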

By contrast, if you see the following messages, document repair is possible:

Illegal Character
XML Corruption Warning

You can see that the first two messages are generic, while the second two give a specific location for the error. This means the file is at least partially readable by the program.


Error-finding

There are quite a few document repair articles on the web that are worth reading for the variety of tools that people are using. I prefer a combination of a good text editor (Notepad++ on Windows, BBEdit on OS X), plus a modern browser like Firefox or Chrome. The text editor is where you do the editing, while the browser parses the XML and finds any errors.

You’ve already unzipped the document or presentation, so now look for the XML part that contains the error. Most of the time, with a Word file, document.xml will be the culprit. Open document.xml in the text editor and Prettify (Notepad++) or Tidy (BBEdit) it to make it readable. A raw document.xml file has only 2 lines, which is why the XML errors are invariably reported as being on line 2. Making the text readable also gives error reports useful line numbers, so the errors are much quicker to find. Now the file should look like this:
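If neither editor's prettify command is available, a rough substitute can be scripted. The sketch below is not what Notepad++ or BBEdit do internally. It is a plain text substitution in Python that starts a new line before every opening tag, and it never parses the file, so it also works on XML that is malformed. It assumes that a line break between two elements is harmless, which is normally true for document.xml. The function name `add_line_breaks` is invented for this example.

```python
import re

def add_line_breaks(xml_text):
    """Start a new line before each opening tag, without parsing the XML."""
    # Match a ">" that is directly followed by "<" plus any character
    # other than "/", i.e. the start of an opening tag, comment or
    # processing instruction. Text between tags is never touched.
    return re.sub(r">(?=<[^/])", ">\n", xml_text)

def add_line_breaks_to_file(path):
    """Rewrite the file in place with one opening tag per line."""
    with open(path, encoding="utf-8") as source:
        text = source.read()
    with open(path, "w", encoding="utf-8") as target:
        target.write(add_line_breaks(text))
```

There is no indentation in the result, but every element starts on its own line, which is enough to make the line numbers in error reports meaningful. Work on a copy of document.xml, since the file is rewritten in place.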

document.xml

Save document.xml, then open it in your choice of browser. This is how Firefox and Chrome show where the first error is:

Browsers View XML

As you can see, the report is a little more informative in Firefox. The error is a mismatched tag: a tag was opened but not closed. The parser expected to see the closing tag </mc:Fallback>, and it tells you exactly where it thought that tag should be. The arrow points to the first character that is in error. The correct way to interpret this is that the expected end tag should be inserted immediately before the tag pointed to.
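The same check can be run from a script instead of a browser. This is a hedged sketch using the XML parser in Python's standard library; `first_xml_error` is a name chosen for this example. Like the browsers, it stops at the first problem and reports a line and column. Unlike Firefox, its message may not name the closing tag it expected, so the browser view is still the more helpful one for deciding what to insert.

```python
import xml.etree.ElementTree as ET

def first_xml_error(path):
    """Return None if the file is well-formed, else (line, column, message)."""
    try:
        ET.parse(path)
    except ET.ParseError as error:
        # position is a (line, column) pair for the first error found.
        line, column = error.position
        return line, column, str(error)
    return None
```

Running `first_xml_error("document.xml")` after each fix plays the same role as refreshing the browser: when it returns `None`, the file parses cleanly.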


Document Repair Technique

Here’s what the error location looks like in the text editor:

Document Repair Error Location

Then here is what it looks like after inserting the closing tag (you can copy and paste directly from the browser window):

Step 1 Document Repair

Save document.xml in the text editor, then refresh the browser. The next error is shown:

Step 1 Result

Repeat the steps. Some files have only a couple of errors, others may have dozens. You’ll know when you’re done, because refreshing the browser will give you a different screen, displaying the XML instead of an error message:

Final Document Repair Result


Rebuild the File

Close the text editor and browser, then re-zip the folders and [Content_Types].xml, giving the zip file a new name and a file ending that matches the original. Open it to ensure it works. Office does not tolerate XML errors well and doesn’t give you clear error messages, so if the file doesn’t open, you missed something. In addition, Mac users have to use Terminal to zip and view files, as noted on the XML Hacking: Editing in OS X page.
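For anyone who would rather script the re-zip step, here is a minimal Python sketch. The function name `rebuild_package` and both paths are placeholders for illustration. The point it demonstrates is that the archive must contain the folders and [Content_Types].xml at its top level, not the folder that holds them.

```python
import os
import zipfile

def rebuild_package(source_folder, output_path):
    """Zip the contents of source_folder so [Content_Types].xml is at the archive root."""
    # Keep output_path outside source_folder, or the new archive
    # would be swept up into itself while the folder is walked.
    with zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as package:
        for folder, _subfolders, filenames in os.walk(source_folder):
            for filename in filenames:
                full_path = os.path.join(folder, filename)
                # Store each path relative to source_folder, so the
                # enclosing folder itself never appears in the archive.
                archive_name = os.path.relpath(full_path, source_folder)
                package.write(full_path, archive_name)
```

A call such as `rebuild_package("thesis_unzipped", "thesis_repaired.docx")` writes the new file directly with the right ending, so no renaming is needed afterwards.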

Lots of people ask “How can I prevent this?”, but there isn’t a really good answer. If a file can be repaired, it’s almost always due to a program bug that writes malformed XML. In Word files like the example, this often happens when a placed graphic has no fallback information, which is supposed to help with graphic depiction in older file formats. It appears that the program omits the closing fallback tags when saving, and you get the error. It’s not your fault, but Microsoft has not been able to find and eliminate this bug since the 2007 version.


17 thoughts on “OOXML Hacking: Document Repair”

  1. Thank you John for your clear and detailed explanation about this error! Editing the file was a little difficult because of its large size and the lack of memory in Notepad++, but I fixed it and recovered all the important information in it.

  2. Hi, thanks a lot for the article. I am having trouble re-creating the PowerPoint file. After I unzip it, zip it back and rename to .pptx, PowerPoint says it found a problem with the content and doesn’t open it. This happens even if I don’t do anything to the content of the unzipped folders, just rename, un-zip, zip back and rename back. Tried on both Mac and Win. Any ideas? Please help. Thank you.

    • If you’re editing on a Mac, this is a common problem. This article discusses the problem and how to fix it: http://www.brandwares.com/bestpractices/2015/11/xml-hacking-editing-in-os-x/. I always use BBEdit on OS X, since it allows you to edit the XML files without unzipping them.
      In Windows, I’ve never had such a problem. But since this is an article about document repair, are you positive you’re not working on a file that is already damaged and giving this report when you try to open it without unzipping and rezipping? If that’s not the case, please let me know what version of Windows you’re using and I’ll try to re-create the problem.

      • Thanks a lot for your quick response.
        I am using Win 7 on Parallels on Mac. PowerPoint 2016. The original file wasn’t broken and I wasn’t trying to repair it. What I was trying to do is change the links of the linked pictures I have in the presentation. So once I did that and re-zipped the folder and renamed the file back to .pptx, the PowerPoint didn’t like it. So I tried to unzip the brand new .pptx and zip it back without any changes at all. And PowerPoint doesn’t like this either.

        • Hmmm. I’ve never had that problem on Windows. Do you use WinZip, or the built-in Windows zip utility, or something else? If you have the patience, I’d like to get this figured out. If you can email the before and after versions, I’d like to take a look at them. My address is production at brandwares dot com.

          • The problem was this: When you re-zipped the files, you re-zipped the top level folder that the files are in, adding a folder that PowerPoint can’t recognize. Instead, open the folder that the XML files are in, select the _rels, docProps and ppt folders plus the [Content_Types].xml file and zip that. Then the file opens with no issues.

  3. Thank you, thank you!

    One stumbling block I hit with your article was to Prettify (Notepad++) – after downloading Notepad++ I couldn’t get it to show indents and line breaks until I researched more and realized I needed to install the XML Tools plugin from the menu:

    “Plugins -> Plugin manager -> Show Plugins” and then install XML Tools from the list.

    Then I could choose: “Plugins -> XML Tools -> Pretty print (XML only – with line breaks)”

    Hope this helps somebody. The article sure helped me!

  4. Thank you very much for sharing your knowledge. We did as described in the article and were successful in word file recovery.

  5. Dear Sir,
    I found this page very useful. I first tried the Microsoft Fix it tool, which was very cumbersome because the relevant tool has since been removed by Microsoft, and then 4-5 additional recovery programs, but I have had no success so far. Kindly guide me. I am facing a similar problem with multiple Word files, but surprisingly many .xml files, like style.xml, are missing when I rename the original damaged docx file with a .zip extension.
    Though document.xml is present, when I use Notepad++ along with XML Tools to find errors, a pop-up window is displayed at a particular column saying something like “at the end of document”.
    I am not getting results, maybe because the pen drive the file was originally on was not removed safely and power was disconnected suddenly. The data is crucial. Would you help me if I send you 2-3 docx files to recover data?

    • I wish I was able to help, but as mentioned in the article:

      “The first step is to rename a copy of the file with a .zip ending and expand it. An unrecoverable file (one scrambled by a USB or Cloud drive) will almost always raise a Zip error. Cut your losses, you’re not going to be able to fix this.”

      This is the problem you have. Editing files while they are on a removable drive is the fastest way to corrupt them and the resulting files are scrambled. They are not repairable. You can try recovering text in Notepad.

  6. Hello,

    First, thanks for publishing this guide! I’m learning a lot!

    I’m trying to repair a corrupted Word doc that spits out the error message “xml parsing error, Location…. line 2452 column (string of numbers)”. Unfortunately, I’m fighting this battle on a Mac, which seems to involve some extra pain-in-the-butt type troubleshooting.

    1) BBEdit no longer has the function “Tidy”. Any text filters that can serve the same function?
    2) For some reason I can only open the document.xml in Safari and not Firefox or Chrome. I wonder why?
    3) Reduced to using simple TextEdit and Safari, I can still find the error. But when I insert ” ” just before the indicated error in the document.xml, save, and reopen in Safari, I just get a new error message that says “Opening and ending tag mismatch: t line 0 and Fallback”.

    Would be grateful for any help. Thanks!

    • Here’s a new download of a BBEdit script to prettify XML into human-readable form: XML Tidy Script. The zip file contains instructions on how to install it.

      TextEdit is somewhat underpowered for document repair; BBEdit will work better. Probably the tag mismatch is preventing the file opening in Firefox or Chrome. Feel free to upload the problem document in a Zip archive to a cloud service, then email a link to me at production at brandwares dot com and I’ll take a look at it. Please send the most original version of the file you have.

  7. Thanks for the helpful guide. I experience the following error when attempting to open my word document: “HRESULT 0x80004005, Location: Part: /word/document.xml, Line:0 Column: 0”. Repair is unsuccessful, and when following the prescribed steps above I don’t get any errors when opening document.xml in Chrome or Firefox. Any ideas? Thanks.

    • HRESULT 0x80004005 is an unspecified error, so that isn’t informative. Unfortunately, not all documents can be repaired successfully. Repairable documents most often have been messed up by a bug in the host program. Documents that are ruined by corruption, editing while the document is on a USB drive or a non-supported cloud service are often too scrambled to be fixed.
