XML Hacking: An Introduction

With the introduction of Office 2007, Microsoft changed the basic file format that underlies Word, PowerPoint and Excel. Instead of the proprietary and mostly undocumented format that ruled from Office 97 to Office 2003, Microsoft made a smart decision and switched to XML. This is tagged text, similar in structure and concept to HTML code with which you may already be familiar.

XML opens up a world of possibilities for automated document construction, but that’s a topic for another day. The everyday relevance for you and me is that if a Word or PowerPoint file isn’t doing what you need it to do and there are no tools in the program for the job, we can now dive in and edit the file ourselves. If you’re a point-and-click user, this is probably not thrilling. But if you’re a hacker at heart, a midnight coder or just a curious tinkerer, you can do some cool stuff.

The main tool you’re going to need is a text editor. While you can get away for a while with Notepad or TextEdit, those simple text editors don’t quite have the tools that get the job done efficiently. On Mac, I use BBEdit and on Windows I reach for Notepad++. BBEdit is reasonably-priced shareware and Notepad++ is freeware. They have a similar style of operation, so if you’re a cross-platform hacker it’s easy to switch between them. Notepad++ uses a plugin system, so you can add tools. For this job, you’re definitely going to want the free XML plugin.

Word, Excel and PowerPoint files in the new format are actually simple Zip files with a different file extension. Getting into them couldn’t be easier: if you’re using Windows, add .zip to the end of the file name (use a copy of the file, if it’s anything important). You’ll get a warning from your OS, but you know what you’re doing! Now unzip it. Out pop several folders of XML, plus a top-level file or two.
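If you’d rather script the unzipping than rename files by hand, here is a minimal Python sketch using the standard zipfile module. The function name unpack_office_file and the file names in the usage note are my own placeholders, not anything Office defines. Because the script reads the file as-is and writes to a separate folder, the original stays untouched.

```python
import zipfile
from pathlib import Path


def unpack_office_file(source, dest_dir):
    """Extract every part of an Office file into dest_dir; return the part names."""
    source = Path(source)
    dest_dir = Path(dest_dir)
    # zipfile looks at the file contents, not the extension, so a .docx,
    # .pptx or .xlsx file can be opened directly without adding .zip.
    with zipfile.ZipFile(source) as package:
        names = package.namelist()
        package.extractall(dest_dir)
    return names
```

For example, unpack_office_file("report.docx", "report_parts") would create a report_parts folder holding the XML, assuming a file called report.docx sits in the current folder.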

Inside a Word File

Inside a simple Word file. The document text is stored in document.xml.

Nov. 14, 2015 edit: OS X requires somewhat more care with handling expanded Office files, or they won’t open. Please see this article for the best procedure on a Mac.

Select one of the files and open it in your text editor. All the files have been linearized to minimize file size. This is where your XML tools come into play. In Notepad++, choose Plugins>XML Tools>Pretty Print (XML Only – with line breaks). If you’re using BBEdit, choose Markup>Tidy>Reflow Document. Now you have a nicely indented, easy-to-read page to edit. When you’re done, it’s not necessary to re-linearize. Word, PowerPoint or Excel will do that for you later.
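If you don’t have either editor handy, Python’s built-in xml.dom.minidom can do a rough version of the same pretty print. This is only a sketch: the function name pretty_print_part is a placeholder of mine, and minidom rewrites the XML declaration at the top of the file, so work on a copy and confirm the result still opens in Office.

```python
from pathlib import Path
from xml.dom import minidom


def pretty_print_part(path):
    """Re-indent one XML part in place and return the new text."""
    path = Path(path)
    # Parse from bytes so the parser honors the file's own encoding declaration.
    dom = minidom.parseString(path.read_bytes())
    # With an encoding argument, toprettyxml returns bytes ready to write back.
    pretty = dom.toprettyxml(indent="  ", encoding="UTF-8")
    path.write_bytes(pretty)
    return pretty.decode("utf-8")
```

Calling pretty_print_part on the document.xml from an unzipped Word file replaces the one-line original with an indented version.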

For people using Windows’ built-in zip utility, there is an easy mistake to watch out for. By default, unzipping a file in Windows creates a new folder named for the file being expanded. If you include this top-level folder when you’re re-assembling the file, PowerPoint will raise an error about unreadable content in the presentation. To avoid this, first open the folder that Windows created. Select the _rels, docProps and ppt folders, plus the [Content_Types].xml file, then create a zip file from them. (For a Word file, the folder is named word instead of ppt.) Finally, change the .zip extension back to the original one.
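The same re-assembly can be scripted, which sidesteps the extra-folder mistake entirely. Here is a hedged Python sketch: repack_office_file and the file names in the usage note are placeholders I made up. It stores each file under its path relative to the unzipped folder, so _rels, docProps and the rest land at the top level of the archive.

```python
import os
import zipfile
from pathlib import Path


def repack_office_file(folder, target):
    """Zip the contents of folder (not the folder itself) into target."""
    folder = Path(folder)
    # Keep target outside of folder, or the new archive would try to include itself.
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as package:
        for root, _dirs, files in os.walk(folder):
            for name in files:
                full_path = Path(root) / name
                # The relative path drops the top-level folder name from the archive.
                package.write(full_path, full_path.relative_to(folder).as_posix())
    return target
```

For example, repack_office_file("deck_parts", "deck_edited.pptx") would rebuild a presentation from a folder named deck_parts, writing straight to a .pptx name so no renaming is needed.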

XML hacking is useful for Excel or Word when you want to add additional color themes or when you need to rescue a corrupt document. But it really shines with PowerPoint, allowing you to create custom table formats, add extra custom colors that don’t fit into a theme, set the default text size for tables and charts and much more. This technique separates the PowerPoint pros from the wannabes.

In my next post, I’ll get into the specifics of some cool XML hacking Office tricks. In the meantime, check out text editors and XML tools so you’re ready to hack!

Inside a PowerPoint File

A plain vanilla PowerPoint file: more complex than Word.
