Extremely fast LZ4 decompression for the 8088/8086 CPU
LZ4_8088 is a chunk of assembly code that implements incredibly fast LZ4 decompression for 8088 and 8086 CPUs. It is specifically optimized for the 8088 and is intended for use in hobby or retrocomputing projects. Code is provided for most generic x86 assemblers, as well as an example of how to use it in Turbo Pascal.
LZ4 itself is a compression format that is implemented as an open-source C library, with ports to other platforms. The goal of the library is speed, and it currently holds many top speed rankings in compression benchmarks. One variant of LZ4 called LZ4_HC implements optimal parsing, which creates the very best possible set of literal and match runs for a given input and coding set. This leads to compression ratios that are competitive with PKZIP/zlib, but unlike PKZIP, LZ4 doesn’t implement order-0 coding (i.e. bit twiddling, which the 8088/8086 is very bad at), so the end result decompresses at nearly memcpy() speeds.
Here are some statistics comparing LZ4 with PKWare’s Data Compression Library (DCL). The DCL was chosen as a comparison because it uses an algorithm very similar to deflate (which was considered state of the art speed-wise for many years on DOS platforms), and also because the DCL could be measured with microsecond accuracy. To eliminate implementation bias, PKWare’s retail DCL compiled assembly code was used.
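To make "literal and match runs" concrete, here is a minimal Python sketch of a decoder for one raw LZ4 block. It is not the LZ4_8088 assembly routine and not the reference C library; the function name `lz4_block_decompress` is hypothetical, and the sketch assumes a bare block with no frame header and does not enforce the format's end-of-block rules. It only illustrates why decoding needs no bit-level work: every field is a whole byte.

```python
def lz4_block_decompress(src: bytes) -> bytes:
    """Decode one raw LZ4 block (no frame header). Illustrative sketch only."""
    out = bytearray()
    i = 0
    n = len(src)
    while i < n:
        token = src[i]
        i += 1

        # High nibble: literal count. 15 means extra length bytes follow,
        # each added to the count, ending with the first byte below 255.
        lit_len = token >> 4
        if lit_len == 15:
            while True:
                extra = src[i]
                i += 1
                lit_len += extra
                if extra != 255:
                    break

        # Literal run: bytes copied straight from input to output.
        out += src[i:i + lit_len]
        i += lit_len

        # The final sequence of a block stops after its literals.
        if i >= n:
            break

        # Match offset: 16-bit little-endian distance back into the output.
        offset = src[i] | (src[i + 1] << 8)
        i += 2
        if offset == 0 or offset > len(out):
            raise ValueError("invalid match offset")

        # Low nibble: match length minus 4, extended the same way as literals.
        match_len = token & 0x0F
        if match_len == 15:
            while True:
                extra = src[i]
                i += 1
                match_len += extra
                if extra != 255:
                    break
        match_len += 4

        # Forward byte-by-byte copy, so a match that overlaps the bytes
        # it is producing repeats them instead of reading stale data.
        start = len(out) - offset
        for k in range(match_len):
            out.append(out[start + k])
    return bytes(out)
```

For example, the hand-built block `44 61 62 63 64 04 00 50 78 79 7A 7A 79` holds four literals ("abcd"), an 8-byte match at offset 4, and a final literal run ("xyzzy").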
| Data type | Filename | Original size (bytes) | LZ4_HC compressed size (bytes) | LZ4_HC ratio | DCL Implode size (bytes) | DCL ratio | memcpy() time in µs (REP MOVSW) | LZ4 decompression speed (x slower than memcpy) | DCL decompression speed (x slower than memcpy) |
|---|---|---|---|---|---|---|---|---|---|
| Small text file | text.txt | 4988 | 3037 | 61% | 2323 | 47% | 13716 | 3.17 | 61.21 |
| Large text file | largetxt.txt | 56899 | 26890 | 47% | 25642 | 45% | 156457 | 3.17 | 54.43 |
| Sparse compiled binary | robotron.com | 40704 | 21048 | 52% | 18178 | 45% | 112036 | 2.75 | 53.51 |
| Dense compiled binary | linewars.exe | 61744 | 41500 | 67% | 36641 | 59% | 169924 | 2.16 | 70.08 |
Key takeaways from the above table:
- For the sources tested, decompression was never more than 3.2x slower than memcpy.
- For source material that contains long runs of repeated sequences (RLE-style data), decompression is faster than memcpy.
- Compression ratios are competitive; they’re mostly within a few percentage points of PKWare ratios.
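The figures in the table can be cross-checked with a few lines of Python. This is only a sketch: the row values are copied from the table, the names `ROWS` and `summarize` are hypothetical, and it assumes that "x slower than memcpy" means decompression time equals the memcpy() time multiplied by that factor.

```python
# (filename, original bytes, LZ4_HC bytes, DCL bytes,
#  memcpy time in microseconds, LZ4 slowdown, DCL slowdown)
ROWS = [
    ("text.txt",      4988,  3037,  2323,  13716, 3.17, 61.21),
    ("largetxt.txt", 56899, 26890, 25642, 156457, 3.17, 54.43),
    ("robotron.com", 40704, 21048, 18178, 112036, 2.75, 53.51),
    ("linewars.exe", 61744, 41500, 36641, 169924, 2.16, 70.08),
]


def summarize(row):
    """Derive ratios and estimated decompression times for one table row."""
    name, orig, lz4, dcl, memcpy_us, lz4_x, dcl_x = row
    return {
        "file": name,
        # Compressed size as a percentage of the original size.
        "lz4_ratio_pct": round(100 * lz4 / orig),
        "dcl_ratio_pct": round(100 * dcl / orig),
        # Assumption: decompression time = memcpy time * slowdown factor.
        "lz4_ms": memcpy_us * lz4_x / 1000,
        "dcl_ms": memcpy_us * dcl_x / 1000,
        # How many times faster LZ4 decompresses than the DCL.
        "lz4_speedup_over_dcl": dcl_x / lz4_x,
    }


if __name__ == "__main__":
    for row in ROWS:
        print(summarize(row))
```

The recomputed percentages agree with the ratio columns, and dividing the two slowdown columns suggests LZ4 decompression is roughly 17 to 32 times faster than the DCL on these four files.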
For full documentation, please see the included LZ4_8088.TXT in the download distribution. It is highly recommended you read the documentation to avoid any pitfalls using the code.
LZ4_8088.ZIP contains the assembler routine, a Turbo Pascal test harness, documentation, compression samples, and compiled binaries for Win32 and DOS 16-bit.
Ever striving for maximum speed, Peter Ferrie was able to write a slightly faster version of the decompressor, but it relies on reversed match offsets. lz4_8088_reversed_match_offsets.zip contains a small program to convert .LZ4 data to this reversed format, and also includes two new decompression routines that use the format.
The LZ4 library is provided under the BSD License. However, my code was not derived from any of the original LZ4 code so I can provide it via any license I choose. So, I am providing my code under what I am calling the Demoscene License. The Demoscene License grants you the following rights:
- You are free to use this code in any production, commercial or otherwise, without providing remuneration to the author.
- If you use this code, you must greet "Trixter/Hornet" if used in a demoscene production, or "Jim Leonard" if used in a normal program. Also, you must send email to email@example.com telling him you used the code so he can marvel at your result.
Questions and Answers
Q: Is there any way to speed this code up further? Yes, but there are a variety of tradeoffs involved that make further speedup less desirable. A list of these tradeoffs is in the included LZ4_8088.TXT file in the archive.
Q: Is this code faster than the LZ4 C source code? For 8088-80286 CPUs, yes. For 386+, no, because the 386 has 32-bit registers, additional segment registers, and additional instructions. Compiling the LZ4 C code with a suitable 32-bit compiler will produce code that definitely outperforms this 8088 assembly code.
Q: How is it possible to exceed the speed of a REP MOVSW? Because 1-byte and 2-byte runs in the compressed data are handled specially with REP STOSW. STOSW can set a fixed value in memory in half the time it takes to move a value from one memory location to another.
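The idea behind that answer can be modeled in Python. This sketch reads "1-byte and 2-byte runs" as matches whose offset is 1 or 2, which is an interpretation rather than something the documentation quoted here spells out; the function names are hypothetical, and Python cannot reproduce the timing difference. It only shows that such matches are equivalent to a fill, which is the property that lets a store instruction replace a memory-to-memory copy.

```python
def copy_match_generic(out: bytearray, offset: int, length: int) -> None:
    """Forward byte-by-byte match copy (the general case)."""
    start = len(out) - offset
    for k in range(length):
        out.append(out[start + k])


def copy_match_with_fill(out: bytearray, offset: int, length: int) -> None:
    """Same result, but offsets 1 and 2 are handled as fills."""
    if offset == 1:
        # The match repeats the last output byte: a plain byte fill.
        out += bytes([out[-1]]) * length
    elif offset == 2:
        # The match repeats the last two bytes: a 16-bit word fill,
        # trimmed to the requested length when it is odd.
        pair = bytes(out[-2:])
        out += (pair * (length // 2 + 1))[:length]
    else:
        copy_match_generic(out, offset, length)
```

Both functions append identical bytes for any valid offset; the second simply recognizes the two cases where the source data never changes during the copy.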
Q: Why did you bother doing this? Because I like impressing the 1980s me.
Currently this distribution only includes decompression code. It is possible to implement LZ4 "c0" compression on 8088 in as little as 16K memory and with reasonable speed, but until I (or anyone else) have an application for compression on the 8088, I don’t plan on developing it.
Initial release — no known bugs.
20130107: Initial release.
20130209: Now includes a size-optimized version of the code as well. The speed routine compiles to 446 bytes including the shift table; the size-optimized code is 30% slower but compiles to 79 bytes. Also, the speed-optimized version of the code was sped up an additional 1% thanks to contributions from both Peter Ferrie and Terje Mathisen (thanks guys!)
20130210: Forgot an optimization; the size version now optimizes to 78 bytes.
20130213: Fixed a bug that could corrupt output if the matches overlapped.