Wednesday, March 23, 2011

Insanely compressed HTML files

Today, I discovered a URL that sent some insanely compressed content. The server sent it with Content-Encoding: gzip and Transfer-Encoding: chunked. The compressed size was 2,921,925 bytes, and it decompressed to 1,004,263,982 bytes, roughly 344 times the compressed size.

This caused certain things to go wrong in the production process. I had set a limit of a few megabytes on all fetches and had assumed that a single fetch could therefore never yield more than a few megabytes. This was the first time I had seen such a huge compression ratio. The oversized decompressed content caused a subsequent file mapping to fail for lack of memory.
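One way to guard against this is to cap the decompressed size, not just the number of bytes fetched. The following is a minimal sketch in Python, assuming the response body arrives as an iterable of gzip-compressed byte chunks. The post does not say which language or HTTP client was in use, so `bounded_gunzip` and `DecompressedSizeExceeded` are hypothetical names chosen for illustration.

```python
import gzip
import zlib


class DecompressedSizeExceeded(Exception):
    """Raised when the decompressed body grows past the configured limit."""


def bounded_gunzip(chunks, max_output_bytes):
    """Decompress gzip chunks, refusing to produce more than max_output_bytes."""
    # wbits = 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
    decomp = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    out = bytearray()

    def check_limit():
        if len(out) > max_output_bytes:
            raise DecompressedSizeExceeded(
                f"decompressed body exceeds {max_output_bytes} bytes"
            )

    for chunk in chunks:
        data = chunk
        while data and not decomp.eof:
            # Ask for one byte more than the remaining budget, so that an
            # overflow is detected without inflating the whole stream.
            budget = max_output_bytes - len(out) + 1
            out += decomp.decompress(data, budget)
            check_limit()
            # Input that was held back because of the output cap.
            data = decomp.unconsumed_tail
        if decomp.eof:
            break

    out += decomp.flush()
    check_limit()
    return bytes(out)


# Usage: a small, highly repetitive payload standing in for the real response.
payload = gzip.compress(b"<h2></h2> - <br/>" * 100_000)
chunks = [payload[i:i + 4096] for i in range(0, len(payload), 4096)]
try:
    bounded_gunzip(chunks, max_output_bytes=64 * 1024)
except DecompressedSizeExceeded as exc:
    print("rejected:", exc)
```

Because the cap is passed to each `decompress` call, memory use stays close to the limit even when the stream would expand to a gigabyte.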

The downloaded content suggested why it compressed so well. The page at this URL seems to have a dynamically generated part. If you examine its source, you will see a marker like this:


The content after that marker seems to be dynamically generated. You will find markup like this:

<h2></h2> - <br/><h4>... <a href="">read more</a></h4>

In this particular instance, an unusually large amount of fake content was generated. The downloaded file had just 33 lines, but the last line was one huge repetition of this pattern:

<a href="">read more</a></h4><br/><br/><h2></h2> - <br/><h4>... 

Such repetition would, of course, compress extremely well.
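A quick way to see the effect is to repeat the pattern quoted above and gzip it. This is only a rough sketch: the repeat count and compression level are assumptions, and `compression_ratio` is a hypothetical helper, so the ratio it prints will not match the server's exactly.

```python
import gzip

# The repeating unit quoted above, as it appeared in the downloaded file.
PATTERN = b'<a href="">read more</a></h4><br/><br/><h2></h2> - <br/><h4>... '


def compression_ratio(unit, repeats, level=6):
    """Return raw size divided by gzip-compressed size for `unit` repeated."""
    raw = unit * repeats
    packed = gzip.compress(raw, compresslevel=level)
    return len(raw) / len(packed)


# 20,000 repeats is an arbitrary choice that keeps the demo fast.
ratio = compression_ratio(PATTERN, 20_000)
print(f"compression ratio: about {ratio:.0f}x")
```

DEFLATE encodes each repeat as a short back-reference to the previous copy, which is why the ratio climbs so high. The format's ceiling is roughly 1032:1, so the 344:1 observed here is well within what gzip can produce.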
