## 2018-06-24

### Not only zip bombs

Crawling around a few corners of the net I've found a file named something like do not uncompress me unless you're a bot dot gz. It is, as per the extension, a small gzipped file (26 kBytes), but (and this explains the suggestion in the name) it expands into a 10 GBytes long file.

A sort of modest zip bomb, but for gzip.

You can find more on

Beware, those files can be dangerous if not handled properly! Do whatever you want at your own risk. Be safe.

Of course I'm not a bot: I've just done

gzip -l donottrytouncompressmeunlessyouareabot.gz

and read the result:

         compressed        uncompressed  ratio ...
              26524            10420385  99.7% ...

It must have been done to trick bots which try to uncompress files in order to check what's inside.
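Incidentally, gzip -l doesn't decompress anything: it just reads the ISIZE field, the last four bytes of the file, which hold the uncompressed size modulo 2^32 in little-endian order. A quick check (the od interpretation assumes a little-endian machine):

```shell
# gzip stores the uncompressed size (mod 2^32) in its last 4 bytes (ISIZE)
printf 'hello world\n' | gzip -9 > hello.gz
tail -c 4 hello.gz | od -An -tu4 | tr -d ' '   # 12, the length of "hello world\n"
```

This also means gzip -l can be fooled: for anything 4 GBytes or larger the reported size wraps around.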

If I do

dd if=/dev/zero bs=1M count=10000 |gzip -9 >10.gz

the output file is a lot bigger than 26524 bytes (9.8 MBytes), though still a lot less than 10 GBytes. I can't reproduce that gzip; maybe it is handcrafted.
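That's not a limitation of my flags: deflate, the algorithm inside gzip, encodes at most a 258-byte match per symbol, and a symbol costs at least a couple of bits, which caps the ratio at roughly 1032:1 on any input (1032 is the commonly quoted deflate limit, not something derived here). A back-of-the-envelope check:

```shell
# best case for deflate is about 1032 input bytes per output byte,
# so 10000 MBytes of zeros cannot shrink below roughly:
echo $(( 10000 * 1024 * 1024 / 1032 ))   # 10160620 bytes, about 9.7 MiB
```

which is right in the ballpark of the 9.8 MBytes I got.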

Better luck with bzip2, though:

$ dd if=/dev/zero bs=1M count=1000 |bzip2 >1.bz2
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 29.1305 s, 36.0 MB/s
$ ls -l 1.bz2
-rw-r--r-- 1 user user 753 Jun XX XX:XX 1.bz2

A bzipped 10 GBytes stream of zeros is 7346 bytes. It took a while, though.

With gzip -l you can know in advance the size of the uncompressed file (or rather, that size modulo 2^32, since gzip only stores the low 32 bits), but bzip2 has no such option. It seems the only way is to actually uncompress it, avoiding the disk, of course:

bunzip2 -c file.bz2 |wc -c

You save disk space this way, but not CPU and time (and you may want to add the -s option to save memory, just in case, though it will take longer).

### Taking a look at the bz file

This is interesting, because you can spot a pattern.

00000000  42 5a 68 39 31 41 59 26  53 59 0e 09 e2 df 01 5f  |BZh91AY&SY....._|
00000010  8e 40 00 c0 00 00 08 20  00 30 80 4d 46 42 a0 25  |.@..... .0.MFB.%|
00000020  a9 0a 80 97 31 41 59 26  53 59 0e 09 e2 df 01 5f  |....1AY&SY....._|
00000030  8e 40 00 c0 00 00 08 20  00 30 80 4d 46 42 a0 25  |.@..... .0.MFB.%|
00000040  a9 0a 80 97 31 41 59 26  53 59 0e 09 e2 df 01 5f  |....1AY&SY....._|
00000050  8e 40 00 c0 00 00 08 20  00 30 80 4d 46 42 a0 25  |.@..... .0.MFB.%|
00000060  a9 0a 80 97 31 41 59 26  53 59 0e 09 e2 df 01 5f  |....1AY&SY....._|
...
00001c60  a9 0a 80 97 31 41 59 26  53 59 0e 09 e2 df 01 5f  |....1AY&SY....._|
00001c70  8e 40 00 c0 00 00 08 20  00 30 80 4d 46 42 a0 25  |.@..... .0.MFB.%|
00001c80  a9 0a 80 97 31 41 59 26  53 59 cc 7b e7 56 00 9e  |....1AY&SY.{.V..|
00001c90  cf c1 00 c0 00 00 00 80  08 20 00 30 cc 09 aa 69  |......... .0...i|
00001ca0  81 44 15 b5 55 48 82 bc  5d c9 14 e1 42 41 e2 7a  |.D..UH..]...BA.z|
00001cb0  4a 70                                             |Jp|

Let's take a look at the 1 GByte bzipped file.

00000000  42 5a 68 39 31 41 59 26  53 59 0e 09 e2 df 01 5f  |BZh91AY&SY....._|
00000010  8e 40 00 c0 00 00 08 20  00 30 80 4d 46 42 a0 25  |.@..... .0.MFB.%|
00000020  a9 0a 80 97 31 41 59 26  53 59 0e 09 e2 df 01 5f  |....1AY&SY....._|
00000030  8e 40 00 c0 00 00 08 20  00 30 80 4d 46 42 a0 25  |.@..... .0.MFB.%|
00000040  a9 0a 80 97 31 41 59 26  53 59 0e 09 e2 df 01 5f  |....1AY&SY....._|
00000050  8e 40 00 c0 00 00 08 20  00 30 80 4d 46 42 a0 25  |.@..... .0.MFB.%|
...
000002a0  a9 0a 80 97 31 41 59 26  53 59 0e 09 e2 df 01 5f  |....1AY&SY....._|
000002b0  8e 40 00 c0 00 00 08 20  00 30 80 4d 46 42 a0 25  |.@..... .0.MFB.%|
000002c0  a9 0a 80 97 31 41 59 26  53 59 40 2c 4a 4f 01 29  |....1AY&SY@,JO.)|
000002d0  20 40 08 c0 00 00 10 00  08 20 00 30 cc 05 29 a6  | @....... .0..).|
000002e0  02 22 46 c4 08 89 1e 2e  e4 8a 70 a1 21 22 bf ea  |."F.......p.!"..|
000002f0  ea                                                |.|

Also,

$ hexdump -C 1.bz2 | grep -c "1AY&SY"
23
$ hexdump -C 10.bz2 |grep -c "1AY&SY"
229

The files are the same up to offset 2b0 of the hex dump (hexdump -C). Then the bigger file has repetitions of the same two lines (in terms of the hex dump), until we reach the last four lines, where the pattern is broken.

It almost seems we could build huger files just by adding those 32 bytes several times (if 23 of those 32-byte blocks give 1 GByte, and 229 give 10 GBytes, maybe something like 2300 of them give 100 GBytes), provided that we can figure out how the last “lines” will change accordingly. Hard to say without knowing, or studying, what's in the bytes which make the difference. There's a checksum too, maybe.
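One manipulation that does work without touching those bytes: whole .bz2 files can simply be concatenated, and bunzip2 decompresses all the streams in sequence, since each one carries its own header and checksums. A small-scale sketch:

```shell
# each stream in the concatenation is complete and self-checksummed
dd if=/dev/zero bs=1M count=1 2>/dev/null | bzip2 -9 > one.bz2
cat one.bz2 one.bz2 one.bz2 > three.bz2
bunzip2 -c three.bz2 | wc -c   # 3145728, i.e. 3 MBytes of zeros
```

Unlike splicing blocks inside a single stream, though, this grows the compressed file proportionally: three streams, three times the bytes.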

The format has to be reverse engineered from the source code of bzip2, since there isn't an official specification. It was done, for example, by Joe Tsai for bzip2 version 1.0.6, and the documentation can be found here.
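From that documentation (and visible in the dumps above): a stream starts with the ASCII magic "BZh" plus a digit giving the block size ('9' means 900 kBytes); each compressed block then opens with the 48-bit magic 0x314159265359 (the digits of pi, which is the "1AY&SY" string greppable in the hex dump); and the stream ends with 0x177245385090 (the digits of sqrt(pi)) followed by a combined CRC. Blocks are bit-aligned, so apart from the header these magics generally don't fall on byte boundaries. The header, at least, is easy to check:

```shell
# the first four bytes are "BZh" plus the block-size digit
dd if=/dev/zero bs=1M count=1 2>/dev/null | bzip2 -9 > sample.bz2
head -c 4 sample.bz2 | od -An -c   # B Z h 9
```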

There isn't an original file size field: it isn't possible to know the original size without actually decompressing the file. This is the reason behind these answers.

The bzip2 implementation ignores trailing garbage at the end of the file: the decompression algorithm doesn't get confused if you add bytes at the end. It could be a good place to add information (not secrets, of course, because they would be noticeable).
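A quick way to see it (bzip2 may print a warning about the ignored trailing bytes on stderr, but the output itself is intact; the `|| true` just discards any warning exit status):

```shell
printf 'hello\n' | bzip2 > t.bz2
printf 'SMUGGLED BYTES' >> t.bz2       # hypothetical filler, not valid bzip2 data
bunzip2 -c t.bz2 2>/dev/null || true   # still prints just "hello"
```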