A couple of years after introducing the Office 2007 Open
XML file formats, Microsoft recently published specifications
of its doc, xls and ppt binary formats. It seems that everyone was surprised
by how complicated these formats are. For example, the Excel 97-2003
file format specification is a 349-page PDF file.
Joel Spolsky, who worked on Microsoft's Excel development
team, shed some light on why the Microsoft
Office file formats are so complicated. He gives many reasons
for how that happened, but they can be summarized in two main
points:
- These file formats were designed long ago in the
era of slow machines
They were designed to be fast on very old computers. For the early versions of
Excel for Windows, 1 MB of RAM was a reasonable amount of memory, and an 80386
at 20 MHz had to be able to run Excel comfortably.
- Microsoft did not bother to clean up the formats or to design new ones for a long time
A lot of the complexities in these file
formats reflect features that are old, complicated, unloved, and rarely used.
They’re still in the file format for backwards compatibility, and because it
doesn’t cost anything for Microsoft to leave the code around.
While reading that, one question kept popping up in my head:
why did it take so long to switch to a better
file format? Computers became fast enough to no longer need binary formats
more than 10 years ago, the Internet has been here for two decades, and XML became popular in
the 1990s, yet Microsoft switched the format of one of its best-selling products to a
better one only in 2006. Even if they do not care about interoperability, how much
effort did it take to support those formats and to train the new people who
joined the Office team?