QUESTION
What is the most space and time efficient way to implement a Trie in Mathematica?
Will it be practically faster than what is natively available in appropriate cases?
ANSWER
A combination of rules and recursion is able to produce rather powerful solutions. Here is my take on it:
ClearAll[makeTree];
makeTree[wrds : {__String}] := makeTree[Characters[wrds]];
makeTree[wrds_ /; MemberQ[wrds, {}]] :=
Prepend[makeTree[DeleteCases[wrds, {}]], {} -> {}];
makeTree[wrds_] :=
Reap[If[# =!= {}, Sow[Rest[#], First@#]] & /@
wrds, _, #1 -> makeTree[#2] &][[2]]
ClearAll[getSubTree];
getSubTree[word_String, tree_] := Fold[#2 /. #1 &, tree, Characters[word]]
ClearAll[inTreeQ];
inTreeQ[word_String, tree_] := MemberQ[getSubTree[word, tree], {} -> {}]
ClearAll[getWords];
getWords[start_String, tree_] :=
Module[{wordStack = {}, charStack = {}, words},
words[{} -> {}] :=
wordStack = {wordStack, StringJoin[charStack]};
words[sl_ -> ll_List] :=
Module[{},
charStack = {charStack, sl};
words /@ ll;
charStack = First@charStack;
];
words[First@Fold[{#2 -> #1} &, getSubTree[start, tree],
Reverse@Characters[start]]
];
ClearAll[words];
Flatten@wordStack];
The last function serves to collect the words from a tree, by performing a depth-first tree traversal and maintaining the stack of accumulated characters and words.
Here is a short example:
In[40]:= words = DictionaryLookup["absc*"]
Out[40]= {abscess,abscessed,abscesses,abscessing,abscissa,abscissae,abscissas,
abscission,abscond,absconded,absconder,absconders,absconding,absconds}
In[41]:= tree = makeTree[words]
Out[41]= {a->{b->{s->{c->{e->{s->{s->{{}->{},e->{d->{{}->{}},s->{{}->{}}},
i->{n->{g->{{}->{}}}}}}},i->{s->{s->{a->{{}->{},e->{{}->{}},s->{{}->{}}},
i->{o->{n->{{}->{}}}}}}},o->{n->{d->{{}->{},e->{d->{{}->{}},r->{{}->{},s->{{}->{}}}},
i->{n->{g->{{}->{}}}},s->{{}->{}}}}}}}}}}
In[47]:= inTreeQ[#,tree]&/@words
Out[47]= {True,True,True,True,True,True,True,True,True,True,True,True,True,True}
In[48]:= inTreeQ["absd",tree]
Out[48]= False
In[124]:= getWords["absce", tree]
Out[124]= {"abscess", "abscessed", "abscesses", "abscessing"}
I only constructed here a bare-bones tree, so you can only test whether or not the word is there, but not keep any other info. Here is a larger example:
In[125]:= allWords = DictionaryLookup["*"];
In[126]:= (allTree = makeTree[allWords]);//Timing
Out[126]= {5.375,Null}
In[127]:= And@@Map[inTreeQ[#,allTree]&,allWords]//Timing
Out[127]= {1.735,True}
In[128]:= getWords["pro",allTree]//Short//Timing
Out[128]= {0.015,{pro,proactive,proactively,probabilist,
<<741>>,proximate,proximately,proximity,proxy}}
In[129]:= DictionaryLookup["pro*"]//Short//Timing
Out[129]= {0.032,{pro,proactive,proactively,probabilist,<<741>>,
proximate,proximately,proximity,proxy}}
I don't know which approach has been used for the built-in functionality, but the above implementation seems to be generally in the same calss for performance. The slowest part is due to the top-level tree-traversing code in getWords
. It is slow because the top-level code is slow. One could speed it up considerably by hashing words to integers - then it can be Compiled
. This is how I'd do that, if I were really concerned with speed.
EDIT
For a really nice application of a Trie data structure, where it allows us to achieve major speed-up (w.r.t. using DictionaryLookup
, for example), see this post, where it was used it to implement an efficient Boggle solver.