Sunday, November 05, 2017

A Beautiful Soupy Exercise in Scraping Interesting Integers

Integers can be very interesting at least if you are a mathematician, and even for a lay person like me, interesting integers can be used to spice up some data that is of interest. My interest here being the bike counter installed on the Fremont Bridge sidewalk that counts the number of bicycles crossing the bridge in both directions.

Inspired by this idea of blogjects and twittering houses, I wanted to send out an early morning tweet of the number of cyclists who braved the streets across Fremont the day before. The data is uploaded by SDOT early morning, and a cron task would request for this and tweet it.

Simple and a tad boring. Now what if I could map the count to something interesting? Researching on interesting integers, I came up on the theorem that there are no uninteresting integers because after all if there were a bunch of these, and one of them must be the smallest of the lot, and the fact itself makes this number interesting.

Energized in no small measure by this revelation, I sallied forth to find a list of interesting integers in the thousands range, as every day there were ~ 3000 cyclists being logged. It didn't take long for me to reach a comprehensive page of integers. The pattern is an integer within a <font> tag, followed by a phrase that describes it.











<font size=+3 color=gray>0</font> is the <a href="http://mathworld.wolfram.com/AdditiveIdentity.html">additive identity</a>.<br> <font size=+3 color=gray>1</font> is the <a href="http://mathworld.wolfram.com/MultiplicativeIdentity.html">multiplicative identity</a>.<br> <font size=+3 color=darkblue>2</font> is the only even <a href="http://mathworld.wolfram.com/PrimeNumber.html">prime</a>.<br>

At first glance, it seemed a simple matter of using Beautiful Soup to get each font tag, extract its text, then look for the font's sibling to extract the phrase.

However, the font tag has multiple siblings that make up the complete phrase. In the soup these are represented as NavigableString objects. It's a matter of moving across the document until we hit a <br> tag, collecting all the text as we go along.

Now since all of this needs to be part of the tweet, I quickly realized that not much can be said in 140 characters. So I didn't bother keeping the URLs. I used a jupyter notebook to quickly prototype the outline, and I can't stress enough how useful this is, specially when you are dealing with an unfamiliar API (which Beautiful soup was to me).

Here is how I used the notebook to understand the basic structure of the page:


So getting to the integers was quite trivial as BeautifuSoup provides a way to search for a specific tag (font) with a specific value for a given attribute (size=+3). Since the phrase for the integer is in a number of contiguous elements, we need to construct it by visiting siblings of the font tag until we hit a <br> tag.


unexpected_tags = {}
def get_text_to_eol(font_section):
    text_parts = []
    section = font_section.next_sibling
    while section.name != 'br':
        if section.name == 'a':
            text_parts.append(section.string)
        elif section.name is None:
            text_parts.append(str(section))
        else:
            print ("found %s tag" % section.name)
            unexpected_tags[section.name] = unexpected_tags.get(section.name, 0)+1
        section = section.next_sibling    
    return ' '.join(text_parts)  

Now I ran through the results in jupyter, and the first few are shown below:

for number, text in map(lambda section: (section.get_text(), get_text_to_eol(section)), integer_sections):
    print (number, text)

0  is the  additive identity .
1  is the  multiplicative identity .
2  is the only even  prime .
3  is the number of spatial dimensions we live in.
4  is the smallest number of colors sufficient to color all planar maps.
5  is the number of  Platonic solids .
6  is the smallest  perfect number .
7  is the smallest number of sides of a  regular  polygon that is not  constructible  by straightedge and compass.
8  is the largest  cube  in the  Fibonacci sequence .
9  is the maximum number of  cubes  that are needed to sum to any positive  integer .
10  is the base of our number system.
11  is the largest known  multiplicative persistence .
12  is the smallest  abundant number .
13  is the number of  Archimedean solids .
14  is the smallest even number n with no solutions to  φ (m) = n.
15  is the smallest  composite number  n with the property that there is only one  group  of order n.
found sup tag
found sup tag
16  is the only number of the form x  = y  with x and y being different  integers .
17  is the number of  wallpaper groups .
18  is the only positive number that is twice the sum of its digits.
found sup tag
19  is the maximum number of 4  powers needed to sum to any number.
20  is the number of  rooted trees  with 6 vertices.
21  is the smallest number of distinct  squares  needed to tile a  square .
22  is the number of  partitions  of 8.

Since I collected the unknown tags, I could see what they were:


Here is an example of a superscript being used:
<font size=+3 color=FF6699>16</font> is the only number of the form x<sup>y</sup> = y<sup>x</sup> with x and y being different <a href="http://mathworld.wolfram.com/Integer.html">integers</a>.<br>





Now isn't that interesting, out of all the numbers that this is the only one?

Here is how the subscript is being used:

<font size=+3 color=brown>126</font> = <sub>9</sub><a href="http://mathworld.wolfram.com/Combination.html">C</a><sub>4</sub>.<br>








With this insight, I augmented the superscripts with the ^ symbol, and left the subscripts as is.


def get_text_to_eol(font_section):
    text_parts = []
    section = font_section.next_sibling
    while section.name != 'br':
        if section.name == 'a':
            text_parts.append(section.string)
        elif section.name is None:
            text_parts.append(str(section))
        else:
            if section.name == 'sup':
                text_parts.append('^')
            text_parts.append(section.string)
        section = section.next_sibling    
    return ' '.join(text_parts)       

And I again digressed on a merry tangent where people were using non-ascii characters to tweet subscripts and superscripts.

However, somewhat surprisingly, the sibling list did not always end in a <br>. I hit a None for a section for integer 248 with the html:

<font size=+3 color=006600>248</font> is the smallest number n>1 for which the <a href="http://mathworld.wolfram.com/ArithmeticMean.html">arithmetic</a>, <a href="http://mathworld.wolfram.com/GeometricMean.html">geometric</a>, and <a href="http://mathworld.wolfram.com/HarmonicMean.html">harmonic means<a/> of <a href="http://mathworld.wolfram.com/TotientFunction.html">&phi;</a>(n) and <a href="http://mathworld.wolfram.com/DivisorFunction.html">&sigma;</a>(n) are all <a href="http://mathworld.wolfram.com/Integer.html">integers</a>.<br>

Can you spot the problem, it is subtle?

Notice that harmonic means<a/> is not the correct encoding. Beautiful soup replaces this dangling tag with a beautiful pair <a></a>:

<a href="http://mathworld.wolfram.com/HarmonicMean.html">harmonic means<a></a> of <a href="http://mathworld.wolfram.com/TotientFunction.html">φ</a>(n) and <a href="http://mathworld.wolfram.com/DivisorFunction.html">σ</a>(n) are all <a href="http://mathworld.wolfram.com/Integer.html">integers</a>.<br/></a>

This is all very nice, except that we were relying on a <br> tag to be an eventual sibling, and Beautiful soup is on a soupy wake trying to find the matching </a> to the tag it started with, finally finding it at:

<font size=+3 color=FF6699>1351</font> has the property that <a href="http://mathworld.wolfram.com/e.html">e</a><sup>1351</a></sup> is within .0009 of an <a href="http://mathworld.wolfram.com/Integer.html">integer</a>.<br>

We are getting all the integers from 248 through 1351 in one unbroken block.

Is there then an easier way to solve this problem? It's tempting to think regular expressions when it comes to html parsing issues of this sort. What if we use a regular expression to split apart the sections containing the integer and phrase? After all, using a top down parser snagged on a mismatched tag, but maybe a regular expression can give us a better behaved set of html tags which we can then parse individually with Beautiful Soup.

We can get the html lines with a reg exp split.

Since we decided to forego the URLs, we could construct a BeautifulSoup instance for each line, and call the get_text() method to strip all tags.

import re
lines = re.split("<br>[\r\n\s]*", html)
list_lines = list(filter(lambda x: x is not None, [re.search(r"<font size=\+3 .*", line) for line in lines]))
text_lines = [BeautifulSoup(l.group(0), 'html.parser').get_text() for l in list_lines]


['0 is the additive identity.',
 '1 is the multiplicative identity.',
 '2 is the only even prime.',
 '3 is the number of spatial dimensions we live in.',
 '4 is the smallest number of colors sufficient to color all planar maps.',
 '5 is the number of Platonic solids.',
 '6 is the smallest perfect number.',
 '7 is the smallest number of sides of a regular polygon that is not constructible by straightedge and compass.',
 '8 is the largest cube in the Fibonacci sequence.',
 '9 is the maximum number of cubes that are needed to sum to any positive integer.',
 '10 is the base of our number system.',
 '11 is the largest known multiplicative persistence.',
 '12 is the smallest abundant number.',
 '13 is the number of Archimedean solids.',
 '14 is the smallest even number n with no solutions to φ(m) = n.',
 '15 is the smallest composite number n with the property that there is only one group of order n.',
 '16 is the only number of the form xy = yx with x and y being different integers.',
 '17 is the number of wallpaper groups.',

But now we are not identifying the superscripts, subscripts, as you can see from the output for integer 16.

What we should do is to then use the regular expression to get the lines, apply the parser for each line, then use the function we wrote earlier to get the text within each line. Now since the parser can't go over a <br>, it just might result in a better extraction of phrases.


import re
lines = re.split("<br>[\r\n\s]*", html)
list_lines = list(filter(lambda x: x is not None, [re.search(r"<font size=\+3 .*", line) for line in lines]))
soups = [BeautifulSoup(l.group(0), 'html.parser').font for l in list_lines]
np_list = [(int(s.get_text()), get_text_to_eol(s)) for s in soups]


(7,
  ' is the smallest number of sides of a  regular  polygon that is not  constructible  by straightedge and compass.'),
 (8, ' is the largest  cube  in the  Fibonacci sequence .'),
 (9,
  ' is the maximum number of  cubes  that are needed to sum to any positive  integer .'),
 (10, ' is the base of our number system.'),
 (11, ' is the largest known  multiplicative persistence .'),
 (12, ' is the smallest  abundant number .'),
 (13, ' is the number of  Archimedean solids .'),
 (14, ' is the smallest even number n with no solutions to  φ (m) = n.'),
 (15,
  ' is the smallest  composite number  n with the property that there is only one  group  of order n.'),
 (16,
  ' is the only number of the form x ^ y  = y ^ x  with x and y being different  integers .'),
 (17, ' is the number of  wallpaper groups .'),

Now we convert this list of pairs to a dictionary, so we can quickly look up the integer:

hash = dict(np_list)


{0: ' is the  additive identity .',
 1: ' is the  multiplicative identity .',
 2: ' is the only even  prime .',
 3: ' is the number of spatial dimensions we live in.',
 4: ' is the smallest number of colors sufficient to color all planar maps.',
 5: ' is the number of  Platonic solids .',
 6: ' is the smallest  perfect number .',
 7: ' is the smallest number of sides of a  regular  polygon that is not  constructible  by straightedge and compass.',
 8: ' is the largest  cube  in the  Fibonacci sequence .',
 9: ' is the maximum number of  cubes  that are needed to sum to any positive  integer .',
 10: ' is the base of our number system.',
 11: ' is the largest known  multiplicative persistence .',
 12: ' is the smallest  abundant number .',
 13: ' is the number of  Archimedean solids .',
 14: ' is the smallest even number n with no solutions to  φ (m) = n.',
 15: ' is the smallest  composite number  n with the property that there is only one  group  of order n.',
 16: ' is the only number of the form x ^ y  = y ^ x  with x and y being different  integers .',

The get_text_to_eol() was modified to handle hitting the end of the sibling list without hitting a <br>. Also, we keep all strings unicode up until they need to be output, at which point a conversion to utf-8 is done.

def get_text_to_eol(font_section):
    text_parts = []
    section = font_section.next_sibling
    while section is not None:
        if section.name == 'a':
            text_parts.append(section.get_text())
        elif section.name is None:
            text_parts.append(unicode(section))
        else:
            if section.name == 'sup':
                text_parts.append('^')
            text_parts.append(section.get_text())
        section = section.next_sibling    
    return ' '.join(text_parts)     

I think this will do for our purposes. In the next post I will show how this was used along with the bike counter uploads to tweet early morning updates to twitter.

The full source code, along with the twitter updates can be found here.