Thursday, November 19, 2015

Compression of random data

Over the years, several companies have purported to come up with algorithms that can compress any file, even files of random-looking data.  See here for a review of such claims over the years.  Compression here means that the compressed file is smaller than the original file.  The problem with such a claim is that if you kept applying the algorithm to the compressed file over and over again, you would eventually reach a file of the smallest possible size: a file of a single bit.  Since there are only two possible files of a single bit ("0" or "1"), and far more than two possible original files, this simply cannot be the case.  This is an application of the pigeonhole principle, and a similar counting argument for the impossibility of such an algorithm goes as follows.  There are 2^n different bit strings of n bits.  On the other hand, there are at most 2^n - 1 different bit strings of n-1 or fewer bits, since 2^0 + 2^1 + ... + 2^(n-1) = 2^n - 1 (counting even the empty string).  The pigeonhole principle implies that no algorithm can compress all n-bit strings to a smaller number of bits: two distinct inputs would have to share a compressed output, making lossless decompression impossible.
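
To make the counting concrete, here is a small Python sketch (mine, not from the original post) that checks the case n = 3 by brute force: there are 8 three-bit strings but only 7 strings of length at most two, so any compressor that shortens every 3-bit string must map two different inputs to the same output.

    from itertools import product

    n = 3
    # All 2^n bit strings of length n.
    inputs = ["".join(bits) for bits in product("01", repeat=n)]
    # All bit strings of length 0 to n-1: 2^0 + ... + 2^(n-1) = 2^n - 1 of them.
    shorter = ["".join(bits) for k in range(n) for bits in product("01", repeat=k)]

    print(len(inputs))   # 8 possible inputs
    print(len(shorter))  # only 7 possible shorter outputs

    # By the pigeonhole principle, any map from 8 inputs into 7 outputs
    # must send two inputs to the same output, so decompression cannot
    # recover both of them.
    assert len(inputs) > len(shorter)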

Based on these arguments, Mike Goldman proposed the following challenge (excerpted from the comp.compression FAQ):

Mike Goldman <whig@by.net> makes another offer:

    I will attach a prize of $5,000 to anyone who successfully meets this challenge.  First, the contestant will tell me HOW LONG of a data file to generate.  Second, I will generate the data file, and send it to the contestant.  Last, the contestant will send me a decompressor and a compressed file, which will together total in size less than the original data file, and which will be able to restore the compressed file to the original state.

    With this offer, you can tune your algorithm to my data.  You tell me the parameters of size in advance.  All I get to do is arrange the bits within my file according to the dictates of my whim.  As a processing fee, I will require an advance deposit of $100 from any contestant.  This deposit is 100% refundable if you meet the challenge.

This appears to be an easier challenge, since the compression and decompression algorithms can be tuned to the data generated.  Incidentally, in the following exchange, an attempt was made to win the challenge by exploiting how data is stored on a computer disk system: the data was split into many files, using the file names as additional storage, as sketched below.
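
A hedged reconstruction of that filename trick in Python (my sketch, not the code from the actual exchange): pick a byte value, split the data at every occurrence of it, and write the pieces to numbered files with that byte deleted.  The pieces total fewer bytes than the original, but the deleted information has simply moved into file system metadata, namely the number and ordering of the files.

    import os

    DELIM = b"5"  # arbitrary byte value to delete from the data

    def split_compress(data, outdir):
        # Write one numbered file per piece; the delimiter byte itself is
        # never stored, so the file contents total fewer bytes than the
        # original (assuming the delimiter occurs in the data at all).
        os.makedirs(outdir, exist_ok=True)
        for i, piece in enumerate(data.split(DELIM)):
            with open(os.path.join(outdir, str(i)), "wb") as f:
                f.write(piece)

    def join_decompress(outdir):
        # Reassemble by re-inserting the delimiter between the pieces in
        # filename order; the "saved" bytes were hiding in the metadata.
        pieces = []
        for name in sorted(os.listdir(outdir), key=int):
            with open(os.path.join(outdir, name), "rb") as f:
                pieces.append(f.read())
        return DELIM.join(pieces)

Of course, this does not defeat the pigeonhole principle: once the directory entries, file names and file lengths are counted, the total information stored is at least as large as the original file.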

There are several aspects that make Mike Goldman's challenge possibly solvable and not subject to the constraints of the pigeonhole principle.  First of all, the compression algorithm can be tuned to the data.  In other words, the compressor needs to handle only one single file; it does not need to compress all files, so the pigeonhole principle does not apply.  Secondly, the challenge does not prescribe what computer language the decompressor must be written in.  Can it be written in C, Perl, Python or Matlab?  Using a high-level language essentially provides the decompressor with complex subroutines that are callable with very short invocations, and thus some of the compression can be "hidden" in this way.
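
As an illustration of the last point, a decompressor written in Python can lean on the interpreter's built-in zlib module, so the decompressor source is only a few lines even though the DEFLATE machinery it invokes is substantial.  A minimal sketch (this does not help with truly random data, which zlib will not shrink, but it shows how much functionality a few bytes of high-level source can call upon):

    import sys
    import zlib

    # The entire DEFLATE decoder lives inside the Python runtime; this
    # whole "decompressor" is just a few lines of source code.
    with open(sys.argv[1], "rb") as f:
        sys.stdout.buffer.write(zlib.decompress(f.read()))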