Comparison of Compression Algorithms

From LinuxReviews

GNU/Linux and *BSD have a wide range of compression algorithms available for file archiving. There are gzip, bzip2, xz, lzip, lzma and lzop, as well as less free tools like rar, zip and arc to choose from. Knowing which one to use can be confusing. Here is an attempt to give you an idea of how the various choices compare.


Most file archiving and compression on GNU/Linux and BSD is done with the tar utility. Its name is short for tape archiver, which is why every tar command you will ever use has to include the f flag to tell it that you will be working on files, not an ancient tape device. Creating a compressed file with tar is typically done by running tar with c (create), f and a compression algorithm's flag, followed by the archive name and the files and/or directories to archive. The compression flag options are:

short option   long option   algorithm
z              --gzip        gzip
j              --bzip2       bzip2
J              --xz          xz
Z              --compress    compress
(none)         --lzip        lzip
(none)         --lzma        lzma
(none)         --zstd        zstd
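As a minimal sketch of how these flags are used, the following creates a small throwaway directory and archives it with three of the algorithms listed. The directory name demo and the archive names are made up for illustration, and the example assumes GNU tar with gzip, bzip2 and xz installed.

```shell
#!/bin/sh
# Hypothetical example data: the "demo" directory and file names are placeholders.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir demo
seq 1 20000 > demo/numbers.txt

# c = create, f = write to the named file, plus one compression flag.
tar czf demo.tar.gz  demo/    # z = --gzip
tar cjf demo.tar.bz2 demo/    # j = --bzip2
tar cJf demo.tar.xz  demo/    # J = --xz

# The long option does the same thing as the short flag.
tar c --gzip -f demo-long.tar.gz demo/

# List the contents of one archive to confirm it is readable.
tar tzf demo.tar.gz
```

Algorithms without a short flag, such as lzip, are selected the same way with their long option, for example tar c --lzip -f demo.tar.lz demo/.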

So which should you use? It depends on the level of compression you want and the speed you desire. You may have to pick just one of the two. Speed will depend greatly on which binary you use for the compression algorithm you pick. As you will see below, there is a huge difference between using the standard bzip2 binary most (all?) distributions use by default and the parallel pbzip2, which can take advantage of multiple cores.

Compressing The Kernel (5.1.11)

Note: These tests were done using a Ryzen 1600X with 2xSamsung SSDs in RAID1. The differences between bzip2 and pbzip2 and xz and pixz will be much smaller on a dual-core. We could test on slower systems if anyone cares, but that seems unlikely.

These results are what you can expect in terms of relative performance when using tar to compress the kernel with tar c --algo -f linux-5.1.11.tar.algo linux-5.1.11/ (or tar cfX linux-5.1.11.tar.algo linux-5.1.11/ or tar c -I"programname -options" -f linux-5.1.11.tar.algo linux-5.1.11/)

In the case of bzip2 and pbzip2 the binary was simply switched using a symbolic link.

Ruling out cache impact was done by running sync; echo 3 > /proc/sys/vm/drop_caches between runs.

The exact numbers will vary depending on your CPU, number of cores and SSD/HDD speed, but the relative performance differences will be similar.
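The measurement procedure described above could be scripted roughly as follows. This is a sketch under assumptions, not the script used for the numbers in this article: the function name bench and its output format are invented here, timing is only whole-second accurate, and the cache drop is skipped when the script is not run as root.

```shell
#!/bin/sh
# bench SRC_DIR COMPRESSOR...: time "tar c -I COMPRESSOR" for each compressor.
# The function name and output format are illustrative, not from the article.
bench() {
    src=$1
    shift
    for comp in "$@"; do
        prog=${comp%% *}    # program name without its options
        if ! command -v "$prog" >/dev/null 2>&1; then
            echo "$comp: not installed, skipped"
            continue
        fi
        # Flush dirty pages, then drop the page cache (root only) so that
        # every run has to read the source files from disk again.
        sync
        if [ "$(id -u)" -eq 0 ]; then
            { echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null || true
        fi
        out=$(mktemp)
        start=$(date +%s)
        tar c -I "$comp" -f "$out" "$src"
        end=$(date +%s)
        size=$(wc -c < "$out" | tr -d ' ')
        echo "$comp: $((end - start))s, $size bytes"
        rm -f "$out"
    done
}

# Usage, assuming an unpacked kernel tree in the current directory:
# bench linux-5.1.11/ gzip pbzip2 plzip "plzip -9" pixz "pixz -9"
```

For sub-second precision, wrapping the tar call in the shell's time keyword, as was presumably done for the table below, is the simpler choice.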

algorithm   time        size   binary     parameters          info
none        0m1.193s    832M   just tar   cf
gzip        0m25.268s   163M   gzip       cfz
bzip2       1m4.857s    124M   bzip2      cfj
bzip2       0m12.350s   125M   pbzip2     cfj                 Parallel bzip2
lzip        5m7.392s    108M   lzip       c --lzip -f
lzip        0m55.115s   109M   plzip      c -Iplzip           Parallel lzip, default level -6
lzip        2m5.645s    102M   plzip      c -I"plzip -9"      Parallel lzip at best compression -9
xz          5m24.671s   105M   xz         cfJ
xz          1m3.713s    108M   pixz       c -Ipixz -f         Parallel xz
xz          1m42.497s   103M   pixz       c -I"pixz -9" -f    Parallel xz using best compression
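To put numbers on the comparisons, the times in the table can be converted to seconds and divided. The helper below is a small sketch; the function names to_seconds and speedup are made up here, and the inputs are copied from the table.

```shell
#!/bin/sh
# Convert a "time"-style value such as 5m24.671s into seconds.
to_seconds() {
    echo "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

# speedup SLOW FAST: how many times faster the FAST run is than the SLOW run.
speedup() {
    slow=$(to_seconds "$1")
    fast=$(to_seconds "$2")
    awk -v s="$slow" -v f="$fast" 'BEGIN { printf "%.1f\n", s / f }'
}

speedup 1m4.857s  0m12.350s   # bzip2 vs pbzip2, prints 5.3
speedup 5m24.671s 1m3.713s    # xz vs pixz at the default level, prints 5.1
speedup 5m7.392s  0m55.115s   # lzip vs plzip at the default level, prints 5.6
```

On this six-core machine, each parallel implementation is roughly five times faster than its single-threaded counterpart at default settings.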

A few details should be apparent from the numbers above.

a) Standard xz compression is really slow: at 5m24s it is the slowest run of all, with standard lzip close behind. It also produces the smallest file of the default-level runs (105M); only pixz and plzip at -9 do better.
b) pixz is about five times faster than xz, unless you have very few cores, in which case the difference shrinks or disappears. pixz at its best compression level -9 provides the best combination of speed and compression.
c) The difference between bzip2 and pbzip2 is huge. It may not appear that way since bzip2 is so much faster than xz, but pbzip2 is actually more than five times faster than bzip2 (about 12 seconds versus 65).
d) pbzip2's default compression level is apparently its best, -9. A close-up inspection of the output files reveals that they are identical (130260727 bytes) with and without -9.

pixz at -9 comes out as a clear winner, both when considering compression alone and when weighing speed against compression. The one huge drawback is that pixz is not a drop-in replacement for xz. Simply making xz a symbolic link to pixz won't work; it has to be invoked with -Ipixz or -I"pixz -9" to be used as a compressor.
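Since pixz has to be passed to tar explicitly, a small wrapper can hide that detail. The following is only a sketch: the function names pick_xz_compressor and make_txz are hypothetical, and falling back to single-threaded xz -9 when pixz is missing is an assumption about what you would want, not something the tests above covered.

```shell
#!/bin/sh
# pick_xz_compressor: print the compressor command to hand to tar -I.
# Prefers pixz -9 and falls back to plain xz -9 when pixz is not installed.
pick_xz_compressor() {
    if command -v pixz >/dev/null 2>&1; then
        echo "pixz -9"
    else
        echo "xz -9"
    fi
}

# make_txz ARCHIVE DIR: hypothetical wrapper around the tar call.
make_txz() {
    tar c -I "$(pick_xz_compressor)" -f "$1" "$2"
}

# Usage: make_txz linux-5.1.11.tar.xz linux-5.1.11/
```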

plzip is a real contender to pixz. It performs about the same as pixz at default settings but is slower at the highest compression level (2m5s versus 1m42s), in exchange for a slightly smaller file. Both xz and lzip use the Lempel-Ziv-Markov chain algorithm (LZMA), in different implementations, which is why they perform similarly.

pbzip2 wins when speed is a consideration and a slight increase in the output size is acceptable.

TIP: xz decompresses far faster than it compresses, which puts it in the same league as bzip2 and gzip when it comes to decompression. Its better compression, while more time-consuming, can make a real difference if you are going to distribute a file to hundreds or thousands of users. This is why the Linux Kernel is distributed using .tar.xz archives. It is an alternative worth considering if the difference between 120 and 110 MiB matters. You may want to use pixz or plzip with the best compression flag -9 when you plan on distributing something, and pbzip2 or bzip2 if you are going to back up 50 GB. Compressing something that size will take forever with pixz.



4 months ago
zstd is the new hot thing

Anonymous user #1

one month ago
Why are "J" and "j" used for the tar command? What does "J" stand for?


one month ago

j is a short-hand for --bzip2 and J is a short-hand for --xz

The J does not "stand for" anything and there's no logical reason why j means bzip2 and J means xz.