Saturday, November 21, 2009

File Format Rant

File formats are a necessary evil that shows up in almost every program of any real complexity.
Modern languages often have good serialization systems that can take care of much of this automatically, but in the end, if the data has to be passed between applications written in different languages on different platforms, you will most likely need some kind of standardized format to store it.
I've used and created many file formats in my life, and there is one thing that makes me really upset: format designers who think they are doing the users a favor by supporting many ways of storing the same data.
Typical examples:
  • Is the file format big endian or little endian when it comes to storing words? Let's introduce a flag so it's up to the writer. After all, you don't want to introduce unnecessary conversions!
  • Is the image stored as RGBA or ARGB? Well, introduce a big enum with lots of different encodings. Why should two be enough? Someone may have ABGR-encoded images.
  • Does the image coordinate system have an upward or a downward pointing Y axis? Let's support both. After all, it makes it so much easier to write the file if you don't have to convert it first!
  • Is the camera stored as a point and a look direction, or perhaps a look-at point, or perhaps a matrix? And what about the projection: horizontal fov, aspect ratio, or vertical fov? Let's support all of them, preferably with some kind of redundancy convention, like "if you have stored horizontal fov, vertical fov and aspect ratio, the vertical fov should be ignored".
I personally think a file format should have the confidence to be a standard rather than supporting every standard. For a file format that has some kind of data interchange role, there are guaranteed to be far more readers than writers out there, so make sure reading the file is the priority, not writing it!
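To make the point concrete, here is a minimal sketch of what a format with that kind of confidence could look like. Everything in it is made up for illustration (the `SIMG` magic, the header layout, the function names): byte order is always little endian, pixels are always RGBA with 8 bits per channel, and rows always run top to bottom. The writer has to convert once, and the reader has exactly one code path.

```python
import struct

# Hypothetical fixed-convention header: magic, width, height.
# "<" pins little-endian byte order; there is no flag to change it.
HEADER = struct.Struct("<4sII")
MAGIC = b"SIMG"  # made-up magic number for this sketch


def write_image(width, height, rgba):
    """Serialize pixels that are already RGBA8, top row first."""
    if len(rgba) != width * height * 4:
        raise ValueError("pixel data must be width * height * 4 bytes")
    return HEADER.pack(MAGIC, width, height) + bytes(rgba)


def read_image(blob):
    """Parse the one and only layout; returns (width, height, rgba)."""
    magic, width, height = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a SIMG file")
    pixels = blob[HEADER.size:HEADER.size + width * height * 4]
    if len(pixels) != width * height * 4:
        raise ValueError("truncated pixel data")
    return width, height, pixels
```

Any conversion from ARGB, bottom-up rows or big-endian words happens in the writer, before `write_image` is called, so every reader out there gets to stay this small.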

From the reading point of view, it's a nightmare to have to support thousands of permutations of how the file can be stored. Often you need a big test suite with sample data and differently encoded files to make sure you cover every single way of doing things. I ran into a TIFF file the other week that failed in my application because it was stored in a non-interleaved way (i.e. rrrrrr...gggggg...bbbbbb rather than rgbrgbrgb...). I had never seen one of those in my entire life, and Maya didn't handle it properly either!
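For what it's worth, the conversion itself is tiny once you know the file needs it. A rough sketch, assuming 8 bits per sample and equally sized planes (the function name is made up, and this is not how any particular TIFF library does it):

```python
def planar_to_interleaved(planar, channels=3):
    """Turn rrr...ggg...bbb into rgbrgbrgb..., one byte per sample."""
    if len(planar) % channels != 0:
        raise ValueError("plane sizes do not divide evenly")
    plane_size = len(planar) // channels
    out = bytearray(len(planar))
    for c in range(channels):
        # Every channels-th byte, starting at offset c, comes from plane c.
        out[c::channels] = planar[c * plane_size:(c + 1) * plane_size]
    return bytes(out)
```

The code is not the hard part. The hard part is that every reader has to know this layout exists, detect it, and have a test file for it.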

There are of course times when the requirements make it important to be flexible when it comes to writing.
  1. Your format has high performance requirements. Perhaps you need to know that you can memory map your image into a buffer to load it blazingly fast, or perhaps you want to read it as a blob and typecast it to a struct with well-known padding, so you don't have to translate every word you read in some fancy way. This is a dangerous path, however: there are many platforms out there, and if you start using a highly platform-dependent format you may make your program perform worse or become hard to port. It makes complete sense for data stored in the temp directory, though.
  2. Your format represents the internal state of your program. A 3D modelling package that supports many ways of working and of specifying a camera needs to make sure that whatever was used is what's in the file when loading it again. These formats should generally not be used for data exchange, though.
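On the first point, the "read a blob and typecast it to a struct" trick does not have to mean "whatever the writer's machine happened to use". A small sketch of the idea using Python's `ctypes`, with a made-up 12-byte header (the `Header` class and `read_header` are hypothetical names): the byte order is pinned by the base class, and all fields are 32 bits wide so there is no padding to guess at.

```python
import ctypes


class Header(ctypes.LittleEndianStructure):
    """Hypothetical 12-byte header; byte order is fixed by the base class."""
    _fields_ = [
        ("magic", ctypes.c_uint32),
        ("width", ctypes.c_uint32),
        ("height", ctypes.c_uint32),
    ]


def read_header(blob):
    """Overlay the struct on the first bytes of the blob, no per-field parsing."""
    if len(blob) < ctypes.sizeof(Header):
        raise ValueError("blob too short for header")
    return Header.from_buffer_copy(blob)
```

On a little-endian machine this is close to a plain copy, and on a big-endian one `ctypes` swaps the bytes when a field is accessed, so the file still has exactly one layout.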
So please think about the people reading the files rather than the ones writing them. There are more of them than there are writers. If there are programs that claim to be compatible with your format but fail to load a file someone wrote, you didn't really do anyone a favour by making your format easy to write.