One thing that has bothered me for a while is that the unzip command line tool only decompresses one file at a time. As a weekend project I wanted to see if I could make it work in parallel. One could write this functionality from scratch but this gave me a possibility to try really look into incremental development.
All developers love writing stuff from scratch rather than fixing and existing solutions (yes, guilty as charged). However the accepted wisdom is that you should never do a from scratch rewrite but instead improve what you have via incremental improvements. Thus I downloaded the sources of Info-Zip and got to work.
Info-Zip’s code base is quite peculiar. It predates things such as ANSI C and has support for tons of crazy long dead hardware. MS-DOS ranks among the most recently added platforms. There is a lot of code for 16 bit processors, near and far pointers and all that fun stuff your grandad used to complain about. There are even completely bizarre things such as this snippet:
# define const const
The code base contains roughly 80 000 lines of K&R C. This should prove an interesting challenge. Those wanting to play along can get the code from Github .
Compiling the code turned out to be quite simple. There is no configure script or the like, everything is #ifdef fed inside the source files. You just compile them into an app and then you have a working exe. The downside is that the source has more preprocessor code than actual code (only a slight exaggaration).
Originally the code used a single (huge) global struct that houses everything . At some point the developers needed to make the code reentrant. Usually this means changing every function to take the global state struct as a function argument instead. These people chose not to do this. Instead they created a C preprocessor macro system that can be used to pass the struct as an argument but also compile the code so it has the old style global struct. I have no idea why they did that. The only explanation that makes any sort of sense is that adding the pointer to stack on every function call is too expensive on 16 bit and smaller platforms. This is just speculation, though, but if anyone knows for sure please let me know.
This meant that every single function definition was a weird concoction of preprocessor macros and K&R syntax. For details see this commit that eventually killed it.
Getting rid of all the cruft was not particularly difficult, only tedious. The original developers were very pedantic about flagging their #if / #endif pairs so killing dead code was straightforward. The downside was that what remained after that was awful. The code had more asterisks than letters. A typical function was hundreds of lines long. Some preprocessor symbols were defined in opposite ways in different header files but things worked because some other preprocessor clauses kept all but one from being evaluated (the code confused Eclipse’s syntax highlighter so it’s really hard to see what was really happening).
Ten or so hours of solid work later most dead cruft was deleted and the code base had shrunk to 30 000 lines of code. At this point looking into adding threading was starting to become feasible. After going through the code that iterates the zip index and extracts files it became a lot less feasible. As an example the inflate function was not isolated from the rest of the code. All its arguments were given in The One Big Struct and it fiddled with it constantly. Those would need to be fully separated to make anything work.
That nagging sound in your ear
While fixing the code I kept hearing the call of the rewrite siren. Just rewrite from scratch, it would say. It’s a lot less work. Go on! Just try it! You know you want to!
Eventually the lure got too strong so I opened the Wikipedia page on Zip file format. Three hours and 373 lines of C++ later I had a parallel unzipper written from scratch. Granted it does not do advanced stuff like encryption, ZIP64 or creating subdirectories for files that it writes. But it works! Code is available in this repo .
There really is no reason, business or otherwise, to modernise the codebase of Info-Zip. With contemporary tools, libraries and methodologies you can create code that is an order of magnitude simpler, clearer, more maintainable and just all around more pleasant to work with than existing code. In a fraction of the time.
Sometimes rewriting from scratch is the correct thing to do.
This is the exact opposite of what I set out to prove but that’s research for you.