XML Hacking: An Introduction

With the introduction of Office 2007, Microsoft changed the basic file format that underlies Word, PowerPoint and Excel. Instead of the proprietary and mostly undocumented format that ruled from Office 97 to Office 2003, Microsoft made a smart decision and switched to XML. This is tagged text, similar in structure and concept to HTML code with which you may already be familiar.

XML opens up a world of possibilities for automated document construction, but that’s a topic for another day. The everyday relevance for you and me is that if a Word or PowerPoint file isn’t doing what you need it to do and there are no tools in the program for the job, we can now dive in and edit the file ourselves. If you’re a point-and-click user, this is probably not thrilling. But if you’re a hacker at heart, a midnight coder or just a curious tinkerer, you can do some cool stuff.

The main tool you’re going to need is a text editor. While you can get away for a while with Notepad or TextEdit, those simple text editors don’t quite have the tools that get the job done efficiently. On Mac, I use BBEdit and on Windows I reach for Notepad++. BBEdit is reasonably-priced shareware and Notepad++ is freeware. They have a similar style of operation, so if you’re a cross-platform hacker it’s easy to switch between them. Notepad++ uses a plugin system, so you can add tools. For this job, you’re definitely going to want the free XML plugin.

macOS requires somewhat more care when handling expanded Office files, or they won’t open after being rezipped. Please see this article for the best procedure on a Mac. The rest of this article describes Windows methods, but the XML file structure is the same on both platforms.

Word, Excel and PowerPoint files in the new format are actually simple Zip files with a different file extension. Getting into them couldn’t be easier: if you’re using Windows, add .zip to the end of the file name (use a copy of the file, if it’s anything important). You’ll get a warning from your OS, but you know what you’re doing! Now unzip it. Out pop several folders of XML, plus a top-level file or two.
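If you’d rather script this step than rename files by hand, here is a minimal Python sketch. It relies only on the fact stated above, that an Office file is a Zip archive. The function name and the file names in the usage note are my own placeholders, not anything Office defines.

```python
import zipfile
from pathlib import Path


def expand_office_file(office_path, out_dir):
    """Unzip an Office 2007+ file into out_dir and return its part names."""
    office_path = Path(office_path)
    out_dir = Path(out_dir)
    # The file is already a zip archive, so no renaming to .zip is needed here.
    with zipfile.ZipFile(office_path) as archive:
        archive.extractall(out_dir)
        return sorted(archive.namelist())
```

Calling expand_office_file("copy-of-deck.pptx", "deck_expanded") (hypothetical names) would leave the folders of XML in deck_expanded. As with the manual method, work on a copy if the file matters.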

Inside a Word File

Inside a simple Word file. The document text is stored in document.xml.

Select one of the files and open it in your text editor. All the files have been linearized to minimize file size. This is where your XML tools come into play. In Notepad++, choose Plugins>XML Tools>Pretty Print (XML Only – with line breaks). Now you have a nicely indented, easy-to-read page to edit. When you’re done, it’s not necessary to re-linearize. Word, PowerPoint or Excel will do that for you later.
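If you don’t have the Notepad++ plugin handy, the same kind of indenting can be sketched with Python’s standard library. This is only an approximation for reading a part comfortably: minidom writes its own XML declaration, so I would not assume the output is byte-for-byte what Office expects. The function name is my own.

```python
from xml.dom import minidom


def pretty_print_xml(xml_bytes):
    """Return an indented, line-broken copy of a linearized XML part."""
    # parseString accepts the raw bytes read from a file such as document.xml.
    return minidom.parseString(xml_bytes).toprettyxml(indent="  ")
```

For example, passing in the bytes read from document.xml returns a string with one element per line, indented two spaces per nesting level.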

For people using Windows’ built-in zip utility, there is an easy mistake to watch out for. By default, unzipping a file in Windows creates a new folder named for the file being expanded. If, when you’re re-assembling the file, you include this top-level folder, PowerPoint will raise an error about unreadable content in the presentation. To avoid this, first open the folder that Windows created. Select the _rels, docProps and ppt folders, plus the [Content_Types].xml file, then create a zip file from them.
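The same pitfall can be avoided in a script by storing every file under a path relative to the expanded folder, so the folder itself never ends up inside the archive. This is a minimal sketch with names of my own choosing; write the output file somewhere outside the expanded folder.

```python
import zipfile
from pathlib import Path


def rezip_office_folder(expanded_dir, office_path):
    """Zip the contents of expanded_dir, without the folder itself."""
    expanded_dir = Path(expanded_dir)
    with zipfile.ZipFile(office_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(expanded_dir.rglob("*")):
            if path.is_file():
                # Paths are stored relative to the folder's contents, so
                # [Content_Types].xml lands at the root of the archive.
                archive.write(path, path.relative_to(expanded_dir).as_posix())
```

Calling rezip_office_folder("deck_expanded", "deck_edited.pptx") (hypothetical names) produces an archive whose top level holds _rels, docProps, ppt and [Content_Types].xml, which is the layout described above.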

XML hacking is useful for Excel or Word when you want to add additional color themes or when you need to rescue a corrupt document. But it really shines with PowerPoint, allowing you to create custom table formats, add extra custom colors that don’t fit into a theme, set the default text size for tables and charts and much more. This technique separates the PowerPoint pros from the wannabes.

In my next post, I’ll get into the specifics of some cool XML hacking Office tricks. In the meantime, check out text editors and XML tools so you’re ready to hack!

Inside a PowerPoint File

A plain vanilla PowerPoint file: more complex than Word.


4 thoughts on “XML Hacking: An Introduction”

  1. Hello
    I tried a few changes and they look good. Only for some xlsm files I see bin files, and a few cannot be exported (unzipped). After rezipping the content I get a repair dialog box, and after repairing all seems to be OK.
    Is there a way to force the unzipping of the primary BIN files? Or do you know why such files can’t be unzipped? Maybe they are protected?

    • A .bin file is not a zip archive; it’s a binary file. It’s most likely a VBA macro, which is the most common use of .bin files in Office XML. VBA can be edited using the program interface after you make the Developer tab visible.

  2. I used the OOXML Chrome plug-in to edit the theme.xml file on my Mac. I downloaded the file and when I opened it in PowerPoint, I got an error “PowerPoint found a problem with content in FS Powerpoint template_2018.potx. PowerPoint can attempt to repair the presentation.”

    When I repair the presentation, all of the formatting is stripped out. Thoughts?

    • When you edited the XML, you introduced an error. Whether it’s small, like an omitted quotation mark, or large, PowerPoint gives you the same error message. It’s small help, but the section that was removed was the part that had the mistake.

      As mentioned in the article, because of Office’s uninformative feedback, it’s best to make one small change at a time, downloading and testing the file as you proceed. Once you gain experience and have a library of tested XML, it gets faster.
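One way to narrow down a slip like an omitted quotation mark before opening the file in Office is to run every XML part through a parser. The sketch below is an assumption-laden helper of my own, not an Office tool: it only catches XML that is not well-formed, and it will not flag XML that parses cleanly but that PowerPoint still rejects.

```python
import xml.etree.ElementTree as ET
import zipfile


def find_malformed_parts(office_path):
    """Return (part name, parser message) for each XML part that fails to parse."""
    problems = []
    with zipfile.ZipFile(office_path) as archive:
        for name in archive.namelist():
            # Binary parts such as .bin files and images are skipped.
            if name.endswith((".xml", ".rels")):
                try:
                    ET.fromstring(archive.read(name))
                except ET.ParseError as err:
                    # The message includes the line and column of the error.
                    problems.append((name, str(err)))
    return problems
```

An empty list means every XML part parsed. Otherwise each entry names the part to reopen in your text editor, along with the position where the parser gave up.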
