technique to enter text using a mobile phone keypad (T9 dictionary-based disambiguation)

Discussion:

John Machin

2006-08-09 22:54:30 UTC

I've tested that sorting just the strings instead of the tuples (and
numbers = '22233344455566677778889999'
conv = string.maketrans(string.lowercase, numbers)
lines =
file("/usr/share/dict/words").read().lower().splitlines()
# lines = map(str.strip, lines)
lines.sort()
self.dict = [(word.translate(conv), word) for word in lines]
If the words file is already sorted you can skip the sorting line.
If the file contains extraneous spaces, you can strip them uncommenting
that line.

1. Wouldn't it be a good idea to process the raw dictionary *once* and
cPickle the result?

2. All responses so far seem to have missed a major point in the
research paper quoted by the OP: each word has a *frequency* associated
with it. When there are multiple choices (e.g. "43" -> ["he", "if",
"id", ...]), the user is presented with the choices in descending
frequency order. Note that if one of the sort keys is (-frequency), the
actual frequency doesn't need to be retained in the prepared
dictionary.

3. Anyone interested in the techniques & heuristics involved in this
type of exercise might like to look at input methods for languages like
Chinese -- instead of 26 letters mapped to 8 digits, you have tens of
thousands of characters of wildly varying frequency mapped to e.g. 400+
Pinyin "words" entered on a "standard" keyboard.

Cheers,
John

bearophileHUGS

2006-08-10 11:32:24 UTC

Permalink

Yu-Xi Lim:

Thank you for your comments, and sorry for my last cryptic answer.

I think Bearophile isn't refering to compression of the dictionary, but the predictive algorithms used by modern data compressors. However, I think he's over-complicating the issue. It is *not* a data compression problem, imho.<

I agree that my solution is probably too much complex for most purposes
(using a compressor simpler than PAQ is probably better for most of
such purposes), but I think it is a data compression problem, because
compressing data essentially means predicting the next bit, and this
program has to predict what's the most probable letter that the user
wanted to add. See Dasher too at the end of this post.

While predictive input is desired, the PAQ algorithm utilizes multiple "contexts" (the novel contribution of the paper mentioned below). This is intended for general purpose data compressors which work on a variety of data, such as uncompressed graphics and audio, text, or other binary data. There is however, only one context in this case.<

PAQ8 manages many contexts. Some of them are fit for digital
audio/images, or Jpeg, etc. Such contexts can be removed (disabled)
from the PAQ source code, it's not difficult. But PAQ8 contains many
(more than one) contexts just for textual data, and you can keep such
contexts, because they improve text compression, so they improve the
prediction. (Removing those contexts improves speed and even more it
reduces memory used). For this program I think you can keep the
following ones: Order n, Sparse, Text, Formatted text, Fixed record
length, Context gap, Indirect. If you are short on memory you can
probably remove some of them. If you use the keyboard to input specific
kinds of data, you may add a context for them too.

A more advanced system (beyond regular T9 and comparable to Motorola's iTap) may consider the context of the word. So typing followed 2255#466 would make "call home" the most likely word.<

A good compressor (a PPM can be okay too) can do this too, its contexts
can be many chars long (but you need lot of memory, probably too much
for a telephone of today).

https://www.cs.fit.edu/Projects/tech_reports/cs-2005-16.pdf

Note that this document doesn't explain the new versions, that contain
a new good idea. Code for the last version:
http://cs.fit.edu/~mmahoney/compression/paq8h.zip

You can be interested in a similar project, that uses a PPM:
http://www.inference.phy.cam.ac.uk/djw30/dasher/

Using an instrumented PAQ this Dasher can be improved a little (speed
of the algorithm isn't important, because you are compressing few
bits/minute. The dictionaries created by the PAQ can be even frozen, in
some cases, so they can be read from disk/flash at the start of the
program.

Bye,
strong bear hugs,
bearophile

Justin Azoff

2006-08-09 03:51:17 UTC

Permalink

I have a standard 12-key mobile phone keypad connected to my Linux
machine as a I2C peripheral. I would like to write a code which allows
the text entry to the computer using this keypad (something like T9 on
the mobile phones)
According to the http://www.yorku.ca/mack/uist01.html
dictionary-based disambiguation is coming in the mind.
With dictionary-based disambiguation, each key is pressed only once.
For example, to enter the, the user enters 8-4-3-0. The 0 key, for
SPACE, delimits words and terminates disambiguation of the preceding
keys. The key sequence 8-4-3 has 3 ? 3 ? 3 = 27 possible renderings
(see Figure 1). The system compares the possibilities to a dictionary
of words to guess the intended word.
I would like to ask some guru here to give me the direction which
technique (Python functionality) or which strategy to use to solve
this riddle.
Thanks for your advices and comments
Regards
Petr Jakes

I can think of 2 approaches to this, 1) Map the numbers to parts of a
regular expression, and then use this to search through the
dictiionary. 2) Pre-compute a copy of the dictionary converted to it's
numerical equivalent, then just match the numbers.

The basic structure you need for both of these is simple. For the
first method you use
keys = ['','abc','def','ghi',....']

then if you have s="123321"
''.join(['[%s]' % keys[int(l)] for l in s])
will give you a string like
'[abc][def][ghi][def][abc]', which you can then use to match words...

I think the second solution would end up being faster, as long as you
have the memory - no regex work, plus, you can sort the wordlist.

The following quickly written class seems to work nicely:

import string
import bisect

letters = string.lowercase
numbers = '2223334445556667777888999'
letter_mapping = dict(zip(letters, numbers))

class phone:
def __init__(self):
self.read_dictionary()

def word_as_numbers(self, word):
nums=''
for letter in word:
if letter in letter_mapping:
nums += letter_mapping[letter]
return nums

def read_dictionary(self):
words = []
for line in file("/usr/share/dict/words"):
word = line.strip().lower()
nums = self.word_as_numbers(word)
words.append((nums, word))

words.sort()
self.dict = words

def get_matching_words(self, number_str):
tup = (number_str,)
left = bisect.bisect_left(self.dict, tup)

for num, word in self.dict[left:]:
if num.startswith(number_str):
yield word
else:
break

It takes a second or two to read the list of words in, but matching is
instant thanks to bisect:
In [14]:%time p=phone.phone()
CPU times: user 1.65 s, sys: 0.00 s, total: 1.65 s
Wall time: 1.66

In [15]:%time list(p.get_matching_words('43556'))
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.01
Out[15]:['hello', 'hellman', "hellman's", "hello's", 'hellos']

It seems the ruby version just posted takes a similar approach, but
uses an actual tree.. using the bisect module keeps it simple.

--
- Justin

bearophileHUGS

2006-08-09 20:14:07 UTC

Permalink

Post by Justin Azoff
It takes a second or two to read the list of words in,

Nice solution. If you want to speed up the initialization phase you may
use something like this (it requires a bit more memory, because lines
contains all the words).

Note that the words and numbers have the same sorting order, so you may
use this to speed up the sorting a little, like doing it on words only
(that is the lines list), but for small dictionaries sort is fast
enough already, so this isn't much important.

Note: you have to add another 9 to numbers, because z too is associated
to 9.

import string

class Phone:
def __init__(self):
numbers = '22233344455566677778889999'
convtable = string.maketrans(string.lowercase, numbers)
lines =
file("/usr/share/dict/words").read().lower().splitlines()
words = []
for line in lines:
word = line.strip()
nums = word.translate(convtable)
words.append( (nums, word) )

words.sort()
self.dict = words

p = Phone()

Bye,
bearophile

bearophileHUGS

2006-08-09 20:23:47 UTC

Permalink

Note that this is essentially a data-compression problem, so the most
accurate solution is probably to use an instrumeted PAQ compressor in a
certain smart way, but you have to work a lot to implement this
solution, and maybe this problem doesn't deserve all this work.

Bye,
bearophile

bearophileHUGS

2006-08-09 20:45:41 UTC

Permalink

I've tested that sorting just the strings instead of the tuples (and
removing the stripping) reduces the running time enough:

def __init__(self):
numbers = '22233344455566677778889999'
conv = string.maketrans(string.lowercase, numbers)
lines =
file("/usr/share/dict/words").read().lower().splitlines()
# lines = map(str.strip, lines)
lines.sort()
self.dict = [(word.translate(conv), word) for word in lines]

If the words file is already sorted you can skip the sorting line.
If the file contains extraneous spaces, you can strip them uncommenting
that line.

Bye

Petr Jakes

2006-08-09 19:39:03 UTC

Permalink

Thanks a lot. It works flawlessly and I have learned few new Python
"tricks" as well.
Petr Jakes

John Machin

2006-08-09 23:56:34 UTC

Permalink

Post by John Machin
2. All responses so far seem to have missed a major point in the
research paper quoted by the OP: each word has a *frequency* associated
with it. When there are multiple choices (e.g. "43" -> ["he", "if",
"id", ...]), the user is presented with the choices in descending
frequency order.

I haven't missed it; if you use the instrumeted PAQ compressor
approach, you gain the frequency information and more :-)

I didn't comment on that before because:

(1) I thought it sounded like a tool in search of a problem -- the
problem being to produce a user interface that meets conflicting goals
(few keystrokes and few mistakes and minimal
hurl-the-device-out-of-the-window frustrations); compression of the
dictionary is of course desirable but I wouldn't have thought that that
should have been foremost in the design process.

(2) Googling for instrumen?ted PAQ compress(or|ion) yielded nothing
that seemed relevant -- can you supply a link or two?

Cheers,
John

gene tani

2006-08-09 03:38:39 UTC

Permalink

http://rubyquiz.com/quiz20.html

Petr Jakeš

2006-08-08 22:45:46 UTC

Permalink

I have a standard 12-key mobile phone keypad connected to my Linux
machine as a I2C peripheral. I would like to write a code which allows
the text entry to the computer using this keypad (something like T9 on
the mobile phones)

According to the http://www.yorku.ca/mack/uist01.html
dictionary-based disambiguation is coming in the mind.

With dictionary-based disambiguation, each key is pressed only once.
For example, to enter the, the user enters 8-4-3-0. The 0 key, for
SPACE, delimits words and terminates disambiguation of the preceding
keys. The key sequence 8-4-3 has 3 ? 3 ? 3 = 27 possible renderings
(see Figure 1). The system compares the possibilities to a dictionary
of words to guess the intended word.

I would like to ask some guru here to give me the direction which
technique (Python functionality) or which strategy to use to solve
this riddle.

Thanks for your advices and comments

Regards

Petr Jakes

bearophileHUGS

2006-08-09 23:33:40 UTC

Permalink

I haven't missed it; if you use the instrumeted PAQ compressor
approach, you gain the frequency information and more :-)

Bye,
bearophile