Computer Science Canada
Perfect File Compression is impossible.
mirhagk




PostPosted: Mon Mar 07, 2011 12:31 pm   Post subject: Perfect File Compression is impossible.

With lossless file compression, it is impossible to design an algorithm that compresses arbitrary random data, i.e. one that shrinks every possible file. While this seems weird to think about, it can be verified with the pigeonhole principle: suppose every possible file of n bits is mapped to a compressed file of fewer bits. There are 2^n files of n bits but only 2^n - 1 files shorter than n bits, so at least two different originals would have to map to the same compressed file. A lossless decompressor could never tell those two apart, so no such algorithm can exist.
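Here's a minimal sketch of that counting argument as a brute-force check over tiny four-bit "files" (illustrative Python, assuming a file is just a bit string):

```python
# Brute-force illustration of the pigeonhole argument for 4-bit "files":
# there are more originals than there are strictly shorter outputs, so any
# lossless (injective) compressor must leave some file the same size or larger.
from itertools import product

def all_files(n_bits):
    """All bit strings of exactly n_bits bits."""
    return [''.join(bits) for bits in product('01', repeat=n_bits)]

n = 4
originals = all_files(n)                                  # 2**4 = 16 files
shorter = [f for k in range(n) for f in all_files(k)]     # 2**4 - 1 = 15 files

print(len(originals), ">", len(shorter))                  # 16 > 15
# 16 inputs cannot fit into 15 distinct shorter outputs without a collision,
# and a collision can't be undone losslessly.
```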

So what does this mean? Does it mean that file compression is hopeless? Well, mathematically, yes: over all possible random data you cannot reduce space. Some random files will happen to compress, but others will then increase in size.

Let's look at an example. I have a hard drive with 2 TB of files on it, and I'm trying to compress them all in order to add more files. Mathematically speaking, I can't assume this is possible with lossless compression; in practice, however, data is not truly random, and compression is in fact possible in most cases.

So just remember that theory does not always carry over directly to the real world: an algorithm may look bad in its worst case, but that worst case may never actually occur. Algorithms with terrible worst-case behaviour can still be good choices if you look at whether those worst cases actually show up in practice.

Take bubble sort: it fails hardcore on random lists, but on an already sorted list, or one with a single unsorted element at the front, it is one of the fastest algorithms possible and will definitely outshine the likes of quicksort and mergesort.
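For reference, a quick sketch of bubble sort with the usual early-exit check (illustrative Python; the early exit is an assumption on my part, but it's what makes the sorted-list case a single pass):

```python
# Bubble sort with an early-exit flag: on an already sorted list it does one
# pass, sees no swaps, and stops immediately.
def bubble_sort(items):
    items = list(items)
    for end in range(len(items) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:      # nothing moved, so the list is already sorted
            break
    return items

print(bubble_sort([1, 2, 3, 4, 5]))    # best case: a single pass
print(bubble_sort([4, 1, 2, 3, 5]))    # still sorts arbitrary input
```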
Insectoid




PostPosted: Mon Mar 07, 2011 3:23 pm   Post subject: RE:Perfect File Compression is impossible.

A random file may or may not be compressible. It depends on what patterns arise.

Files with patterns can be compressed with no loss. The pigeonhole principle does not apply, because there are files that cannot be compressed, which rules out any duplicate files.

Many files cannot be compressed much at all without data loss: images and sound, for example. In these cases, though, exact data recovery isn't as important as the final file size, so we don't worry about the lost data.
Insectoid




PostPosted: Mon Mar 07, 2011 5:35 pm   Post subject: RE:Perfect File Compression is impossible.

On further note, if we assume that all files can be compressed, there are still far more compressed files than expanded. For all files of size n, there are (n-1)! possible compressed files, easily far more than n*. The worst compressions may only reduce the size to n-1, however files that compress well may reduce to mere bytes in size (however unlikely this is).

Compressing a file to a specific size m as you suggest in your post will, of course be far more limited in the number of files that can be compressed.

Furthermore, there may be different ways to interpret compressed data. A file n may use a particular algorithm to compress to archive m. Another file i might use a different algorithm and be compressed to an identical archive m. However, they expand using different algorithms as well, so while the data itself might be identical, the files contained are unique.


*this only applies to file sizes n>3 bits (but who's going to compress 3 bits of data?).

EDIT: There are actually far more than (n-1)! compressed files. The number of compressed files of size n bits is n! bits. Therefore the actual number of files (ignoring restrictions on file structure) is (n-1)!+(n-2)!+...+a! where a is the minimum file size.
mirhagk




PostPosted: Mon Mar 07, 2011 6:05 pm   Post subject: RE:Perfect File Compression is impossible.

If you include all files with less than 100 bytes of data, they can compress to files smaller than themselves, meaning you're comparing n! to (n-1)!

Some files cannot be compressed; that's my point.

And also, if two files are compressed to the same compressed file, then they cannot be distinguished, even if the originals were almost identical, varying only in the file name, or rather the extension.

And you can see this by putting a zip inside a zip: it increases the file size. It's just that most files have a lot of redundancy in them, so they're not random.

http://en.wikipedia.org/wiki/Pigeonhole_principle#Uses_and_applications

I turned a 181-byte file into a 1.45 KB file using 20 layers of rar files.
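Here's a rough way to reproduce the same effect with zlib (illustrative Python; the exact sizes will vary):

```python
# Compressing already-compressed data: the first pass removes the redundancy,
# so the second pass has nothing to exploit and the overhead makes it grow.
import zlib

original = b"CAT HAT BAT " * 50          # very redundant input
once = zlib.compress(original)
twice = zlib.compress(once)              # compress the compressed output again

print(len(original), "->", len(once), "->", len(twice))
# expect something like: 600 -> a few dozen bytes -> slightly larger again
```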
Brightguy




PostPosted: Mon Mar 07, 2011 6:06 pm   Post subject: Re: RE:Perfect File Compression is impossible.

Insectoid @ Mon Mar 07, 2011 3:23 pm wrote:
A random file may or may not be compressible. It depends on what patterns arise.

Files with patterns can be compressed with no loss. The pigeonhole principle does not apply, because there are files that cannot be compressed, which rules out any duplicate files.

Many files cannot be compressed much at all without data loss: images and sound, for example. In these cases, though, exact data recovery isn't as important as the final file size, so we don't worry about the lost data.

Any file can be compressed with some algorithm. The problem is there is no general algorithm which can compress every file if the compression function is injective. Also, real-world images/sound compress relatively well.

Insectoid @ Mon Mar 07, 2011 5:35 pm wrote:
On further note, if we assume that all files can be compressed, there are still far more compressed files than expanded. For all files of size n, there are (n-1)! possible compressed files, easily far more than n*. The worst compressions may only reduce the size to n-1, however files that compress well may reduce to mere bytes in size (however unlikely this is).

Compressing a file to a specific size m as you suggest in your post will, of course be far more limited in the number of files that can be compressed.

Furthermore, there may be different ways to interpret compressed data. A file n may use a particular algorithm to compress to archive m. Another file i might use a different algorithm and be compressed to an identical archive m. However, they expand using different algorithms as well, so while the data itself might be identical, the files contained are unique.


*this only applies to file sizes n>3 bits (but who's going to compress 3 bits of data?).

EDIT: There are actually far more than (n-1)! compressed files. The number of compressed files of size n bits is n! bits. Therefore the actual number of files (ignoring restrictions on file structure) is (n-1)!+(n-2)!+...+a! where a is the minimum file size.

Huh?
Why would you compare the file size to the number of possible files? Besides, (n-1)! isn't right; for binary files it would be 2^(n-1).
Insectoid




PostPosted: Mon Mar 07, 2011 6:13 pm   Post subject: RE:Perfect File Compression is impossible.

Thanks for pointing out that error.

Two files may be compressed to identical archives, using different algorithms. As long as you expand them using the correct algorithm, they will indeed expand to two different files.

For example, we'll take the list 2, 4, 6, 8, ... and compress it using the formula n/2, to give 1, 2, 3, 4, .... If we also take the list 10, 20, 30, 40, ... and compress it using the formula n/10, we'll get 1, 2, 3, 4, .... As you can see, the two sequences are identical. However, expanding them using the correct (2n and 10n respectively) formulas generates two unique sequences.

Now I know file compression hasn't progressed yet to the point where compression algorithms can be tailored to each individual file, but this shows that potentially any number of files can compress to an identical archive, with only metadata specifying each.
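A toy sketch of that idea (hypothetical "algorithms", just to show what role the metadata plays):

```python
# Two different lists compress to the same payload; only the recorded
# algorithm tag (metadata) lets each one expand back to the right original.
evens = [2, 4, 6, 8]
tens = [10, 20, 30, 40]

archive_a = ("div2", [x // 2 for x in evens])     # ("div2",  [1, 2, 3, 4])
archive_b = ("div10", [x // 10 for x in tens])    # ("div10", [1, 2, 3, 4])

expanders = {"div2": lambda xs: [2 * x for x in xs],
             "div10": lambda xs: [10 * x for x in xs]}

for tag, payload in (archive_a, archive_b):
    print(expanders[tag](payload))    # [2, 4, 6, 8] and [10, 20, 30, 40]
```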
Insectoid




PostPosted: Mon Mar 07, 2011 6:15 pm   Post subject: RE:Perfect File Compression is impossible.

Ah, I misunderstood the point mirhagk was trying to make. I assume you mean that there is no single algorithm with which to compress any file?
Brightguy




PostPosted: Mon Mar 07, 2011 6:43 pm   Post subject: Re: RE:Perfect File Compression is impossible.

Insectoid @ Mon Mar 07, 2011 6:13 pm wrote:
Thanks for pointing out that error.

Two files may be compressed to identical archives, using different algorithms. As long as you expand them using the correct algorithm, they will indeed expand to two different files.

For example, we'll take the list 2, 4, 6, 8, ... and compress it using the formula n/2, to give 1, 2, 3, 4, .... If we also take the list 10, 20, 30, 40, ... and compress it using the formula n/10, we'll get 1, 2, 3, 4, .... As you can see, the two sequences are identical. However, expanding them using the correct (2n and 10n respectively) formulas generates two unique sequences.

Now I know file compression hasn't progressed yet to the point where compression algorithms can be tailored to each individual file, but this shows that potentially any number of files can compress to an identical archive, with only metadata specifying each.

Naturally, two different compression algorithms can compress to the same file. But when it comes time to uncompress, the metadata will also need to be provided. Essentially, the metadata must be considered part of the compressed file.

Insectoid @ Mon Mar 07, 2011 6:15 pm wrote:
Ah, I misunderstood the point mirhagk was trying to make. I assume you mean that there is no single algorithm with which to compress any file?

Yes. If the compression is injective (reversible) and some file decreases in size then there must be some other file which increases in size.
Tony




PostPosted: Mon Mar 07, 2011 7:04 pm   Post subject: Re: RE:Perfect File Compression is impossible.

mirhagk @ Mon Mar 07, 2011 6:05 pm wrote:
if you include all files with less than 100 bytes of data...

The problem with this is that you are arguing for compressing _all_ of those files, at the same time, but into individually compressed files. This is not how compression is used.

Take http://en.wikipedia.org/wiki/Huffman_coding for example. It works very well for any one individual file with custom-built encoding tables. Including a global encoding table for all possible files with an encoding of just one individual file naturally breaks the effectiveness of the compression.

So the point is moot?
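For the curious, a small sketch of building a per-file Huffman code (illustrative Python; it only derives the code table, not a full archiver):

```python
# Build a Huffman code from the file's own symbol frequencies: frequent
# symbols get short bit codes, rare ones get longer codes.
import heapq
from collections import Counter

def huffman_codes(text):
    """Return {symbol: bit string} built from the text's symbol frequencies."""
    freq = Counter(text)
    # heap entries: (weight, tie-breaker, {symbol: code-so-far})
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes("CAT HAT BAT")
encoded = "".join(codes[ch] for ch in "CAT HAT BAT")
print(codes)
print(len(encoded), "bits, vs", 8 * len("CAT HAT BAT"), "bits uncompressed")
# A real archive would also have to store the table (or the tree),
# which is its own overhead.
```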
Latest from compsci.ca/blog: Tony's programming blog. DWITE - a programming contest.
RandomLetters




PostPosted: Mon Mar 07, 2011 11:19 pm   Post subject: RE:Perfect File Compression is impossible.

I don't understand. If the data is in binary, then wouldn't a repeated pattern emerge in even completely random data? If the file is large, there must be at least, say, one 1111 and, later on, another 1111. If that pattern doesn't continue, there would be repeating patterns like 1010, and so on.
Tony




PostPosted: Mon Mar 07, 2011 11:22 pm   Post subject: RE:Perfect File Compression is impossible.

@RandomLetters -- that's exactly it. In mirhagk's setup, the patterns in some files interfere with patterns in other files. Everything works perfectly fine for any individual file.
Latest from compsci.ca/blog: Tony's programming blog. DWITE - a programming contest.
mirhagk




PostPosted: Mon Mar 07, 2011 11:41 pm   Post subject: Re: RE:Perfect File Compression is impossible.

Tony @ Mon Mar 07, 2011 7:04 pm wrote:
mirhagk @ Mon Mar 07, 2011 6:05 pm wrote:
if you include all files with less than 100 bytes of data...

The problem with this is that you are arguing for compressing _all_ of those files, at the same time, but into individually compressed files. This is not how compression is used.

Take http://en.wikipedia.org/wiki/Huffman_coding for example. It works very well for any one individual file with custom-built encoding tables. Including a global encoding table for all possible files with an encoding of just one individual file naturally breaks the effectiveness of the compression.

So the point is moot?


But you either include the encoding table in the file, or you have one table that works for all files. The second falls under the pigeonhole principle (Huffman encoding is great for "CAT HAT BAT" but fails at "john@xyz123.ca"), and the first has overhead which will increase the file size in many cases.

1110010011 has repeated 1's and 0's, true, but how do you compress it? You need to specify that this section is a repeated section, or just raw data, or whatever. There is overhead, which means that some files will compress and others will grow.
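A quick run-length sketch of that overhead (illustrative Python, a toy scheme rather than any real compressor):

```python
# Run-length coding of the bit string above: every run costs an extra count
# symbol, so short alternating runs make the output as big as (or bigger than)
# the input.
from itertools import groupby

data = "1110010011"
encoded = [(bit, len(list(run))) for bit, run in groupby(data)]

print(encoded)              # [('1', 3), ('0', 2), ('1', 1), ('0', 2), ('1', 2)]
print(len(data), "symbols in,", 2 * len(encoded), "symbols out")  # 10 in, 10 out
```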

And I've thought a lot about the multiple-algorithms idea, and, well, they don't produce identical files, since the metadata must be known in order to decompress them, and there are still some files which can never be shrunk in size. Take, for instance, a file with just 1 bit: by compressing it you're adding data, since you now need to record which decompression algorithm to use.

You can design many algorithms to take advantage of the fact that random data doesn't really exist in practice, but if we had truly random data, lossless compression would be a lost cause.
Tony




PostPosted: Tue Mar 08, 2011 1:58 am   Post subject: Re: RE:Perfect File Compression is impossible.

mirhagk @ Mon Mar 07, 2011 11:41 pm wrote:
But you either include the encoding table in the file, or you have one table that works for all files. The second falls under the pigeonhole principle (Huffman encoding is great for "CAT HAT BAT" but fails at "john@xyz123.ca"), and the first has overhead which will increase the file size in many cases.

Yes, you include that in the file. It doesn't work well for a single short string, but if you have a contact list with thousands of emails, you can easily compress away all of the @gmail.com parts.

But if you want a universal compression utility for any one arbitrary word... then yes, on average there can't be any compression.
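Something like this shows the difference (illustrative Python with made-up addresses):

```python
# A long contact list full of repeated "@gmail.com" suffixes compresses well;
# a single short address only picks up overhead.
import zlib

contacts = "\n".join(f"user{i}@gmail.com" for i in range(2000)).encode()
single = b"john@xyz123.ca"

print(len(contacts), "->", len(zlib.compress(contacts)))   # large reduction
print(len(single), "->", len(zlib.compress(single)))       # slightly larger
```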
Latest from compsci.ca/blog: Tony's programming blog. DWITE - a programming contest.
mirhagk




PostPosted: Tue Mar 08, 2011 9:31 am   Post subject: RE:Perfect File Compression is impossible.

But the point is that it works off of patterns. With a purely random file, it could compress it, or it could make it larger. Pigeonhole principle.
Tony




PostPosted: Tue Mar 08, 2011 1:08 pm   Post subject: RE:Perfect File Compression is impossible.

Only if you try to use the same pattern for every random file.
Latest from compsci.ca/blog: Tony's programming blog. DWITE - a programming contest.