Crawling around a few corners of the net I've found a file named something like do not uncompress me unless you're a bot dot gz. It is, as per the extension, a small gzipped file (26 kBytes), but (and this explains the suggestion) it reportedly expands into a 10 GByte long file.
A sort of modest zip bomb, but for gzip.
You can find more on
Beware, those files can be dangerous if not handled properly! Do whatever you want at your own risk. Be safe.
Of course I'm not a bot: I've just done
gzip -l donottrytouncompressmeunlessyouareabot.gz
and read the result:
         compressed        uncompressed  ratio ...
              26524            10420385  99.7% ...
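By the way, gzip -l doesn't decompress anything: it just reads the ISIZE field stored in the last four bytes of the file, which holds the uncompressed size modulo 2^32 and, if the file is a concatenation of several gzip members, refers to the last member only. So the number it reports can't be fully trusted for this kind of file. You can peek at that field yourself (a minimal sketch, assuming a little-endian machine, where od's host byte order matches the field's):
$ # ISIZE: last 4 bytes of a gzip file, uncompressed size mod 2^32, little endian
$ tail -c 4 donottrytouncompressmeunlessyouareabot.gz | od -An -tu4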
It must have been done to trick bots which try to uncompress files in order to check what's inside.
If I do
dd if=/dev/zero bs=1M count=10000 | gzip -9 > 10.gz
the output file is a lot bigger than 26524 bytes (9.8 MBytes). Still a lot less than 10 GBytes, anyway. I can't reproduce that gzip; maybe it was handcrafted.
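Not too surprising, in hindsight: deflate, the algorithm inside gzip, cannot exceed a compression ratio of about 1032:1, so ~10 MBytes is close to the best possible for 10 GBytes of zeros, and a single honest 26524-byte deflate stream can expand to roughly 27 MBytes at most. Getting more spectacular numbers out of a .gz usually involves tricks such as concatenating many gzip members, which at least fools gzip -l (a sketch of the idea, not necessarily how that file was made):
$ dd if=/dev/zero bs=1M count=100 | gzip -9 > member.gz
$ cat member.gz member.gz member.gz > multi.gz
$ gzip -l multi.gz        # the uncompressed column shows the last member's ISIZE only
$ zcat multi.gz | wc -c   # the real total: 314572800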
Better luck with bzip2, though:
$ dd if=/dev/zero bs=1M count=1000 | bzip2 > 1.bz2
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 29.1305 s, 36.0 MB/s
$ ls -l 1.bz2
-rw-r--r-- 1 user user 753 Jun XX XX:XX 1.bz2
Repeating the same with count=10000 (output in 10.bz2), a bzipped 10 GByte stream of zeros is 7346 bytes. It took a while, though.
With gzip -l you can know in advance the size of the uncompressed file, but bzip2 has no such option. It seems the only way is to actually uncompress it, avoiding the disk, of course:
bunzip2 -c file.bz2 |wc -c
You save disk space, but not CPU time (and you can add the -s option to save memory, just in case, but then it will take even longer).
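For the record, bzcat is a shortcut for bunzip2 -c, and bzip2 -t tests the file without writing anything at all, though it reports no size and still has to decode every block:
$ bzcat file.bz2 | wc -c   # same as above
$ bzip2 -t file.bz2        # integrity check only, no byte count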
Taking a look at the bz2 file
Here is a hex dump of the 10 GByte file (10.bz2). This is interesting, because you can spot a pattern.
00000000 42 5a 68 39 31 41 59 26 53 59 0e 09 e2 df 01 5f |BZh91AY&SY....._|
00000010 8e 40 00 c0 00 00 08 20 00 30 80 4d 46 42 a0 25 |.@..... .0.MFB.%|
00000020 a9 0a 80 97 31 41 59 26 53 59 0e 09 e2 df 01 5f |....1AY&SY....._|
00000030 8e 40 00 c0 00 00 08 20 00 30 80 4d 46 42 a0 25 |.@..... .0.MFB.%|
00000040 a9 0a 80 97 31 41 59 26 53 59 0e 09 e2 df 01 5f |....1AY&SY....._|
00000050 8e 40 00 c0 00 00 08 20 00 30 80 4d 46 42 a0 25 |.@..... .0.MFB.%|
00000060 a9 0a 80 97 31 41 59 26 53 59 0e 09 e2 df 01 5f |....1AY&SY....._|
...
00001c60 a9 0a 80 97 31 41 59 26 53 59 0e 09 e2 df 01 5f |....1AY&SY....._|
00001c70 8e 40 00 c0 00 00 08 20 00 30 80 4d 46 42 a0 25 |.@..... .0.MFB.%|
00001c80 a9 0a 80 97 31 41 59 26 53 59 cc 7b e7 56 00 9e |....1AY&SY.{.V..|
00001c90 cf c1 00 c0 00 00 00 80 08 20 00 30 cc 09 aa 69 |......... .0...i|
00001ca0 81 44 15 b5 55 48 82 bc 5d c9 14 e1 42 41 e2 7a |.D..UH..]...BA.z|
00001cb0 4a 70 |Jp|
Let's take a look at the 1 GByte bzipped file (1.bz2).
00000000 42 5a 68 39 31 41 59 26 53 59 0e 09 e2 df 01 5f |BZh91AY&SY....._|
00000010 8e 40 00 c0 00 00 08 20 00 30 80 4d 46 42 a0 25 |.@..... .0.MFB.%|
00000020 a9 0a 80 97 31 41 59 26 53 59 0e 09 e2 df 01 5f |....1AY&SY....._|
00000030 8e 40 00 c0 00 00 08 20 00 30 80 4d 46 42 a0 25 |.@..... .0.MFB.%|
00000040 a9 0a 80 97 31 41 59 26 53 59 0e 09 e2 df 01 5f |....1AY&SY....._|
00000050 8e 40 00 c0 00 00 08 20 00 30 80 4d 46 42 a0 25 |.@..... .0.MFB.%|
...
000002a0 a9 0a 80 97 31 41 59 26 53 59 0e 09 e2 df 01 5f |....1AY&SY....._|
000002b0 8e 40 00 c0 00 00 08 20 00 30 80 4d 46 42 a0 25 |.@..... .0.MFB.%|
000002c0 a9 0a 80 97 31 41 59 26 53 59 40 2c 4a 4f 01 29 |....1AY&SY@,JO.)|
000002d0 20 40 08 c0 00 00 10 00 08 20 00 30 cc 05 29 a6 | @....... .0..).|
000002e0 02 22 46 c4 08 89 1e 2e e4 8a 70 a1 21 22 bf ea |."F.......p.!"..|
000002f0 ea |.|
Also,
$ hexdump -C 1.bz2 | grep -c "1AY&SY"
23
$ hexdump -C 10.bz2 | grep -c "1AY&SY"
229
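The numbers add up: 23 units of 32 bytes account for 736 of the 753 bytes of 1.bz2, and 229 × 32 = 7328 of the 7346 bytes of 10.bz2; the remainder is roughly the 4-byte header plus the slightly different tail. It also means each block swallows about 1000/23 ≈ 45 MBytes of zeros, which fits bzip2's 900 kBytes block size applied after its first run-length pass (up to 255 equal bytes packed into 5, a factor of 51).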
The files are the same up to offset 2b0 of the hex dump (hexdump -C). Then the bigger file has repetitions of the same two lines (in terms of the hex dump), until we reach the last four lines, where the pattern is broken.
It almost seems we could build huger files just by adding those 32 bytes several more times (if 23 of those 32-byte units give 1 GB and 229 give 10 GB, maybe something like 2300 of them give 100 GB), provided we can figure out how the last “lines” have to change accordingly. Hard to say without knowing, or studying, what's in those bytes that make the difference. There's a checksum in there too, maybe.
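Those differing bytes are indeed checksums, among other things: according to the reverse-engineered format described below, every block carries its own CRC32 right after the pi magic, and the stream ends with a footer magic (0x177245385090, the digits of the square root of pi) followed by a combined CRC32 derived from all the block CRCs. So splicing extra blocks in means recomputing that final checksum. A lazier way to grow the file, with every checksum staying valid, is to concatenate whole bzip2 streams, which bunzip2 accepts just like gzip does (a sketch, reusing the files from above):
$ for i in $(seq 10); do cat 1.bz2; done > 10cat.bz2   # each stream keeps its own CRCs
$ bunzip2 -c 10cat.bz2 | wc -c                         # 10485760000, ten times 1.bz2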
The format has to be reverse engineered from the source code of bzip2, since there isn't an official specification. This was done, for example, by Joe Tsai for bzip2 version 1.0.6, and the documentation can be found here.
There isn't an original-size field: it isn't possible to know the original size without actually decompressing the file. This is the reason behind these answers.
The bzip2 implementation ignores trailing garbage at the end of the file: the decompression algorithm doesn't get confused if you append bytes after the stream. It could be a good place to add information (not secrets, of course, because they would be noticeable).
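A quick way to see it, reusing the 1 GByte file from above:
$ cp 1.bz2 tagged.bz2
$ printf 'a hidden note' >> tagged.bz2
$ bunzip2 -c tagged.bz2 | wc -c   # still 1048576000: the extra bytes are skipped
$ tail -c 13 tagged.bz2           # and there the note is, in plain sight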