How much is big data compressible? An interesting theorem.

Published February 24, 2014 | Vincent Granville

I am doing some research on compressing data available as tables (rows and columns, or cubes) more efficiently. This is the reverse of the usual data science approach: instead of receiving compressed data and applying statistical techniques to extract insights, here we start with uncompressed data, extract all possible insights, and eliminate everything but the insights in order to compress the data.

In this process, I was wondering whether one can design an algorithm that compresses any data set by at least one bit. Intuitively, the answer is clearly no; otherwise you could recursively compress any data set down to 0 bits. Any algorithm will compress some data sets and make other data sets bigger after compression. Data that looks random, that has no pattern, cannot be compressed. I have seen contests offering an award if you find a compression algorithm that defeats this principle, but participating would be a waste of time.
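The counting (pigeonhole) argument behind this is easy to make concrete. Here is a minimal Python sketch, purely arithmetic, showing that there are always fewer short outputs than inputs, so no lossless compressor can shrink every n-bit data set:

```python
# Pigeonhole argument: there are 2**n distinct bit strings of length n,
# but only 2**n - 1 distinct bit strings of length strictly less than n
# (1 + 2 + 4 + ... + 2**(n-1)). A lossless compressor must map distinct
# inputs to distinct outputs, so at least one n-bit input cannot shrink.
for n in range(1, 9):
    inputs = 2 ** n                  # number of possible n-bit inputs
    shorter_outputs = 2 ** n - 1     # number of possible outputs shorter than n bits
    assert shorter_outputs < inputs
    print(f"n={n}: {inputs} inputs, only {shorter_outputs} shorter outputs")
```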

But what if you design an algorithm that, when a data set cannot be compressed, leaves the data set unchanged? Would you then be able, on average, to compress any data set? Note that if you put numbers together arbitrarily to create a data set, the resulting data set will be mostly random. In fact, the vast majority of all possible data sets are essentially random and not compressible. Data sets resulting from experiments, however, are usually not random; they represent a tiny minority of all potential data sets, but in practice this tiny minority covers nearly all the data sets that data scientists are confronted with.
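To see how such a scheme behaves, here is a minimal sketch, assuming a 1-byte flag and using zlib as a stand-in for the compressor (both are assumptions for illustration, not the method discussed in the post): structured data shrinks, while random-looking data is stored verbatim and grows only by the flag overhead needed so the decoder knows which case it is handling.

```python
import os
import zlib

def pack(data: bytes) -> bytes:
    """Compress only if it helps; a 1-byte flag records which case applies."""
    compressed = zlib.compress(data, 9)
    if len(compressed) < len(data):
        return b"\x01" + compressed   # flag 1: payload is compressed
    return b"\x00" + data             # flag 0: payload stored verbatim

def unpack(blob: bytes) -> bytes:
    flag, payload = blob[:1], blob[1:]
    return zlib.decompress(payload) if flag == b"\x01" else payload

# Patterned (non-random) data compresses well; random data does not,
# so it is stored as-is and only pays the 1-byte flag.
structured = b"0123456789" * 1000
random_like = os.urandom(10_000)
for name, data in [("structured", structured), ("random", random_like)]:
    packed = pack(data)
    assert unpack(packed) == data
    print(f"{name}: {len(data)} -> {len(packed)} bytes")
```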
