Saturday, August 12, 2017
Now it is possible to reduce the space taken by the trie further by using an array instead of the map. Knowing that we need to use only lower case letters, we can use the charater before 'a' as the sentinel, so that the array length is set to 27.
Another space optimization comes about by using a single word (32 bits) to store both the character and the prefix count. Java uses two bytes for the char type, and we could do with one byte. But that still uses 5 bytes per Trie node, but we don't need the 2 billion range possible with 32 bits to represent the count of all prefixes for any English substring.
The prefix count is highest on the root node, as all words have the head node character as the prefix. So the highest prefix count is the number of words in the dictionary. This is generally never more than 250, 000. We can safely use 24 bits which can represent 8 million as a signed integer.
So we can combine the character and the prefix count to a single word.
Is there anything else we could do? Yes - we could read all the words into our Trie and trim the list on each Trie node. This results from the observation that we rarely use all the slots in our list - Especially as the trie spans out, there are fewer number of new words. Thus we could find the last used index on the list, and create a new shorter list.
Doing all of these drops the size of the trie from ~ 228M to ~ 68M.
Here is an implementation.
I store a random word list in pastebin for testing - there is code here that uses this, as well as pulling a dictionary of lower case words. If you use this, you will need to make sure the dictionary you substitute has only lower case words, so some pre-processing might be necessary - in particular, you are likely to find the hyphen (-) in some word which you will need to remove.
Last but not least, the memory stats don't give an idea of the space saving due to garbage collector not being deterministic. I use the sizeInBytes() to recursively calculate the memory foot print of the Trie.
Posted by thushara at 8:10 PM