John Machin
2006-08-09 22:54:30 UTC
I've tested that sorting just the strings instead of the tuples (and
import string
numbers = '22233344455566677778889999'
conv = string.maketrans(string.lowercase, numbers)
lines = file("/usr/share/dict/words").read().lower().splitlines()
# lines = map(str.strip, lines)
lines.sort()
self.dict = [(word.translate(conv), word) for word in lines]
If the words file is already sorted, you can skip the sorting line.
If the file contains extraneous spaces, you can strip them by
uncommenting the map(str.strip, lines) line.
1. Wouldn't it be a good idea to process the raw dictionary *once* and
cPickle the result?
2. All responses so far seem to have missed a major point in the
research paper quoted by the OP: each word has a *frequency* associated
with it. When there are multiple choices (e.g. "43" -> ["he", "if",
"id", ...]), the user is presented with the choices in descending
frequency order. Note that if one of the sort keys is (-frequency), the
actual frequency doesn't need to be retained in the prepared
dictionary.
3. Anyone interested in the techniques & heuristics involved in this
type of exercise might like to look at input methods for languages like
Chinese -- instead of 26 letters mapped to 8 digits, you have tens of
thousands of characters of wildly varying frequency mapped to e.g. 400+
Pinyin "words" entered on a "standard" keyboard.
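
Points 1 and 2 above can be sketched roughly as follows. This is only
an illustration under stated assumptions, not the method from the paper
or from the earlier posts: it is written for Python 3 (so pickle and
str.maketrans stand in for the Python 2 cPickle and string.maketrans
used in the quoted code), the (word, frequency) input is hypothetical,
and the names build_table, save_table and load_table are invented for
the example. Sorting on (digits, -frequency, word) puts the most
frequent candidates first, after which the frequency itself can be
dropped from the stored table.

```python
import pickle

# Keypad digits for a-z, same digit string as in the quoted code.
NUMBERS = "22233344455566677778889999"
CONV = str.maketrans("abcdefghijklmnopqrstuvwxyz", NUMBERS)


def build_table(word_freqs):
    """Map each digit string to its candidate words, most frequent first.

    word_freqs is a hypothetical iterable of (word, frequency) pairs;
    the source of the frequencies is left open here.
    """
    rows = sorted(
        (word.lower().translate(CONV), -freq, word.lower())
        for word, freq in word_freqs
    )
    table = {}
    for digits, _neg_freq, word in rows:
        # Rows arrive in descending frequency order within each digit
        # string, so the frequency need not be kept.
        table.setdefault(digits, []).append(word)
    return table


def save_table(table, path):
    """Pickle the prepared table so the raw word list is processed once."""
    with open(path, "wb") as fh:
        pickle.dump(table, fh, pickle.HIGHEST_PROTOCOL)


def load_table(path):
    """Load a table previously written by save_table."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

With made-up frequencies such as [("he", 50), ("if", 30), ("id", 5)],
build_table gives the candidates for "43" in the order he, if, id. The
table would be built and saved once, then reloaded with load_table at
startup instead of re-reading and re-sorting the words file.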
Cheers,
John