Inspired by this idea of blogjects and twittering houses, I wanted to send out an early morning tweet of the number of cyclists who braved the streets across Fremont the day before. The data is uploaded by SDOT early morning, and a cron task would request for this and tweet it.
Simple and a tad boring. Now what if I could map the count to something interesting? Researching on interesting integers, I came up on the theorem that there are no uninteresting integers because after all if there were a bunch of these, and one of them must be the smallest of the lot, and the fact itself makes this number interesting.
Energized in no small measure by this revelation, I sallied forth to find a list of interesting integers in the thousands range, as every day there were ~ 3000 cyclists being logged. It didn't take long for me to reach a comprehensive page of integers. The pattern is an integer within a <font> tag, followed by a phrase that describes it.
<font size=+3 color=gray>0</font> is the <a href="http://mathworld.wolfram.com/AdditiveIdentity.html">additive identity</a>.<br> <font size=+3 color=gray>1</font> is the <a href="http://mathworld.wolfram.com/MultiplicativeIdentity.html">multiplicative identity</a>.<br> <font size=+3 color=darkblue>2</font> is the only even <a href="http://mathworld.wolfram.com/PrimeNumber.html">prime</a>.<br>
At first glance, it seemed a simple matter of using Beautiful Soup to get each font tag, extract its text, then look for the font's sibling to extract the phrase.
However, the font tag has multiple siblings that make up the complete phrase. In the soup these are represented as NavigableString objects. It's a matter of moving across the document until we hit a <br> tag, collecting all the text as we go along.
Now since all of this needs to be part of the tweet, I quickly realized that not much can be said in 140 characters. So I didn't bother keeping the URLs. I used a jupyter notebook to quickly prototype the outline, and I can't stress enough how useful this is, specially when you are dealing with an unfamiliar API (which Beautiful soup was to me).
Here is how I used the notebook to understand the basic structure of the page:
So getting to the integers was quite trivial as BeautifuSoup provides a way to search for a specific tag (font) with a specific value for a given attribute (size=+3). Since the phrase for the integer is in a number of contiguous elements, we need to construct it by visiting siblings of the font tag until we hit a <br> tag.
Now I ran through the results in jupyter, and the first few are shown below:
Since I collected the unknown tags, I could see what they were:
Here is an example of a superscript being used:
<font size=+3 color=FF6699>16</font> is the only number of the form x<sup>y</sup> = y<sup>x</sup> with x and y being different <a href="http://mathworld.wolfram.com/Integer.html">integers</a>.<br>
Now isn't that interesting, out of all the numbers that this is the only one?
Here is how the subscript is being used:
<font size=+3 color=brown>126</font> = <sub>9</sub><a href="http://mathworld.wolfram.com/Combination.html">C</a><sub>4</sub>.<br>
With this insight, I augmented the superscripts with the ^ symbol, and left the subscripts as is.
And I again digressed on a merry tangent where people were using non-ascii characters to tweet subscripts and superscripts.
However, somewhat surprisingly, the sibling list did not always end in a <br>. I hit a None for a section for integer 248 with the html:
<font size=+3 color=006600>248</font> is the smallest number n>1 for which the <a href="http://mathworld.wolfram.com/ArithmeticMean.html">arithmetic</a>, <a href="http://mathworld.wolfram.com/GeometricMean.html">geometric</a>, and <a href="http://mathworld.wolfram.com/HarmonicMean.html">harmonic means<a/> of <a href="http://mathworld.wolfram.com/TotientFunction.html">φ</a>(n) and <a href="http://mathworld.wolfram.com/DivisorFunction.html">σ</a>(n) are all <a href="http://mathworld.wolfram.com/Integer.html">integers</a>.<br>
Can you spot the problem, it is subtle?
Notice that harmonic means<a/> is not the correct encoding. Beautiful soup replaces this dangling tag with a beautiful pair <a></a>:
<a href="http://mathworld.wolfram.com/HarmonicMean.html">harmonic means<a></a> of <a href="http://mathworld.wolfram.com/TotientFunction.html">φ</a>(n) and <a href="http://mathworld.wolfram.com/DivisorFunction.html">σ</a>(n) are all <a href="http://mathworld.wolfram.com/Integer.html">integers</a>.<br/></a>
<font size=+3 color=FF6699>1351</font> has the property that <a href="http://mathworld.wolfram.com/e.html">e</a><sup>1351</a></sup> is within .0009 of an <a href="http://mathworld.wolfram.com/Integer.html">integer</a>.<br>
We are getting all the integers from 248 through 1351 in one unbroken block.
Is there then an easier way to solve this problem? It's tempting to think regular expressions when it comes to html parsing issues of this sort. What if we use a regular expression to split apart the sections containing the integer and phrase? After all, using a top down parser snagged on a mismatched tag, but maybe a regular expression can give us a better behaved set of html tags which we can then parse individually with Beautiful Soup.
We can get the html lines with a reg exp split.
Since we decided to forego the URLs, we could construct a BeautifulSoup instance for each line, and call the get_text() method to strip all tags.
But now we are not identifying the superscripts, subscripts, as you can see from the output for integer 16.
What we should do is to then use the regular expression to get the lines, apply the parser for each line, then use the function we wrote earlier to get the text within each line. Now since the parser can't go over a <br>, it just might result in a better extraction of phrases.
Now we convert this list of pairs to a dictionary, so we can quickly look up the integer:
The get_text_to_eol() was modified to handle hitting the end of the sibling list without hitting a <br>. Also, we keep all strings unicode up until they need to be output, at which point a conversion to utf-8 is done.
I think this will do for our purposes. In the next post I will show how this was used along with the bike counter uploads to tweet early morning updates to twitter.
The full source code, along with the twitter updates can be found here.