Discussion:
Stripping ASCII codes when parsing
Erik Max Francis
2005-10-17 20:13:50 UTC
I am working with a text format that advises stripping any ASCII control
characters (0 - 30), and also the ASCII pipe character (124), from the
data as part of parsing. I think many of these characters are from a
different time. Since I have never seen most of them in text, I am not
sure how these first 30 control characters are all represented (other
than, say, tab (\t), newline (\n), and carriage return (\r)). What
should I do to remove these characters if they are ever encountered?
Many thanks.
Use ''.translate. Pass in the identity mapping for the first argument,
and for the second parameter, specify the list of all the characters you
wish to delete. This would probably be something like:

IDENTITY_MAP = ''.join([chr(x) for x in range(256)])
BAD_MAP = ''.join([chr(x) for x in range(32) + [124])

aNewString = aString.translate(IDENTITY_MAP, BAD_MAP)

Note that ASCII 31 is also a control character (US).
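
For readers on Python 3, where the two-argument form of str.translate used above is no longer available, a rough equivalent on byte strings might look like the sketch below. The names BAD_BYTES and strip_bad_bytes are illustrative and not from the thread, and the range is taken as 0-31 so that US (31) is included, which is an assumption beyond the 0-30 the format states.

```python
# Sketch for Python 3: bytes.translate(table, delete) with table=None
# leaves every byte unchanged and only removes the bytes listed in `delete`.
BAD_BYTES = bytes(range(32)) + b"|"   # controls 0-31 plus the pipe (124)

def strip_bad_bytes(data: bytes) -> bytes:
    """Return data with control bytes 0-31 and '|' removed."""
    return data.translate(None, BAD_BYTES)

cleaned = strip_bad_bytes(b"name|\x00value\x1f\r\n")
print(cleaned)  # b'namevalue'
```

Bytes 128-255 pass through untouched, so the charset question can be settled separately when decoding.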
--
Erik Max Francis && max at alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis
The believer is happy; the doubter is wise.
-- (an Hungarian proverb)
David Pratt
2005-10-17 17:50:31 UTC
This is very nice :-) Thank you, Tony. I think this will be the way to
go. My concern at the moment is where it will be best to convert to
Unicode. After this step the data will go into a dict, through a few
processes, and into a database. Because the input source has no explicit
encoding, I believe I will have to assume ISO-8859-1, though it could
well be cp1252 for the most part (because the format says no ASCII 0-30
but allows chars 128-254, and because most users are on Windows). I am
thinking of stripping these characters and validating the text first,
then converting to Unicode (utf-8) so that it is Unicode in the dict.
Then when I perform these other processes it should be uniform, and it
will go into the database as Unicode. I think this should be ok.

Regards,
David
David Pratt
2005-10-14 18:55:09 UTC
I am working with a text format that advises stripping any ASCII control
characters (0 - 30), and also the ASCII pipe character (124), from the
data as part of parsing. I think many of these characters are from a
different time. Since I have never seen most of them in text, I am not
sure how these first 30 control characters are all represented (other
than, say, tab (\t), newline (\n), and carriage return (\r)). What
should I do to remove these characters if they are ever encountered?
Many thanks.
Steve Holden
2005-10-17 09:04:37 UTC
I am working with a text format that advises stripping any ASCII control
characters (0 - 30), and also the ASCII pipe character (124), from the
data as part of parsing. I think many of these characters are from a
different time. Since I have never seen most of them in text, I am not
sure how these first 30 control characters are all represented (other
than, say, tab (\t), newline (\n), and carriage return (\r)). What
should I do to remove these characters if they are ever encountered?
Many thanks.
You will find the ord() function useful: control characters all have
ord(c) < 32.

You can also use the chr() function to return a character whose ord() is
a specific value, and you can use hex escapes to include arbitrary
control characters in string literals:

myString = "\x00\x01\x02"
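
As a small illustration of how these three pieces relate, here is a sketch written for Python 3 (the thread itself predates it). The helper name is_control is made up for this example. Note that ASCII DEL (127) is also a control character but falls outside the ord(c) < 32 test.

```python
# Hex escapes, chr() and ord() are three views of the same character codes.
my_string = "\x00\x01\x02"
assert my_string == chr(0) + chr(1) + chr(2)

codes = [ord(c) for c in "a\tb\x1f"]
print(codes)  # [97, 9, 98, 31]

def is_control(c: str) -> bool:
    """True for ASCII control characters 0-31 (DEL, 127, is not covered)."""
    return ord(c) < 32

flags = [is_control(c) for c in "a\tb\x1f"]
print(flags)  # [False, True, False, True]
```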

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/
David Pratt
2005-10-17 15:21:14 UTC
Many thanks, Steve. This is good information. I think this should work
fine. I was doing a string.replace in a cleanData() method with the
following characters, but I don't know whether that would have done it.
The list contains all the control characters that I really know about in
normal use. ord(c) < 32 sounds like a much better and more comprehensive
way to go. So I guess instead of string.replace, I should do a ... for
char in ... and check each character, correct? Or is there a better way
of eliminating these other than reading a string in character by
character?

'\a','\b','\e','\f','\n','\r','\t','\v','|'
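
One way to read the "for char in ..." idea is sketched below for Python 3. The function name clean_data is only loosely modelled on the cleanData() method mentioned above and is an assumption, as is the choice to drop codes 0-31 rather than 0-30. One caveat on the list: Python does not define a '\e' escape, so the escape character would be written '\x1b'.

```python
def clean_data(text: str) -> str:
    """Drop ASCII control characters (0-31) and the pipe character."""
    # Keep a character only if it is neither a control code nor '|'.
    return "".join(c for c in text if ord(c) >= 32 and c != "|")

result = clean_data("a\x1bb|c\td\n")
print(result)  # abcd
```

This visits every character once, which is simpler than chaining one replace call per unwanted character.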

Regards,
David
Steve Holden
2005-10-17 15:49:32 UTC
David Pratt wrote:
[about ord(), chr() and stripping control characters]
Post by David Pratt
Many thanks, Steve. This is good information. I think this should work
fine. I was doing a string.replace in a cleanData() method with the
following characters, but I don't know whether that would have done it.
The list contains all the control characters that I really know about in
normal use. ord(c) < 32 sounds like a much better and more comprehensive
way to go. So I guess instead of string.replace, I should do a ... for
char in ... and check each character, correct? Or is there a better way
of eliminating these other than reading a string in character by
character?
'\a','\b','\e','\f','\n','\r','\t','\v','|'
There are a number of different things you might want to try. One is
translate() which, given a string and a translate table, will perform
>>> delchars = "".join(chr(i) for i in range(32)) + "|"
>>> print repr(delchars)
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f|'
>>> nultxfrm = "".join(chr(i) for i in range(256))
So delchars is a string of the characters you want to remove, and
nultxfrm is a 256-character string where nultxfrm[n] == chr(n), so it
performs no translation at all. So then

s = s.translate(nultxfrm, delchars)

will remove all the "illegal" characters from s.

Note that I am sort-of cheating here, as this is only going to work for
8-bit characters. Once Unicode enters the picture all bets are off.
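
For text that has already been decoded, a comparable approach exists in Python 3, where str.translate takes a mapping from code points to replacements. The sketch below assumes Python 3 and uses the made-up name DELETE_MAP; it is not how the Python 2 code in this thread works.

```python
# In Python 3, str.translate takes a mapping keyed by code point;
# mapping a code point to None deletes that character.
DELETE_MAP = {code: None for code in range(32)}
DELETE_MAP[ord("|")] = None

text = "caf\u00e9|\x07 \u20ac5\n"
cleaned = text.translate(DELETE_MAP)
print(ascii(cleaned))  # 'caf\xe9 \u20ac5'
```

Characters outside the map, including any above code 255, pass through unchanged.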

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/
Tony Nelson
2005-10-17 16:48:35 UTC
In article <mailman.2153.1129538807.509.python-list at python.org>,
I am working with a text format that advises stripping any ASCII control
characters (0 - 30), and also the ASCII pipe character (124), from the
data as part of parsing. I think many of these characters are from a
different time. Since I have never seen most of them in text, I am not
sure how these first 30 control characters are all represented (other
than, say, tab (\t), newline (\n), and carriage return (\r)). What
should I do to remove these characters if they are ever encountered?
Many thanks.
Most of those characters are hard to see.

Represent arbitrary characters in a string in hex: "\x00\x01\x02" or
with chr(n).

If you just want to remove some characters, look into "".translate().

nullxlate = "".join([chr(n) for n in xrange(256)])
delchars = nullxlate[:31] + chr(124)
outputstr = inputstr.translate(nullxlate, delchars)
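
A quick check of what the [:31] slice covers, rewritten as a Python 3 sketch (range instead of xrange). It matches the 0-30 range the format states and therefore leaves ASCII 31 (US) in place. Using [:32] instead would remove 31 as well, if that is wanted.

```python
# The slice [:31] stops before index 31, so it covers codes 0 through 30.
nullxlate = "".join(chr(n) for n in range(256))
delchars = nullxlate[:31] + chr(124)

print(len(delchars))        # 32 (31 control codes plus the pipe)
print(chr(30) in delchars)  # True
print(chr(31) in delchars)  # False: US (31) is kept
```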
________________________________________________________________________
TonyN.:' *firstname*nlsnews at georgea*lastname*.com
' <http://www.georgeanelson.com/>
David Pratt
2005-10-17 17:30:42 UTC
Hi Steve. My plan is to parse the data, removing the control characters
and validating the data as records are added to a dictionary. I am
going to convert to Unicode after this step but before the data gets
into storage (in which case I think the translate method could work
well).

The encoding itself is not explicit for this data, except to say that it
is ASCII and that, besides not using chars 0-30, ASCII 128-254 is
permitted. I am not certain whether I should assume cp1252 or
ISO-8859-1. I can't say that everyone is using Windows, although the
vast majority likely are.
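
To make the difference between the two candidates concrete: they agree on bytes 160-255 and differ in the 128-159 range, where cp1252 has printable characters such as curly quotes and ISO-8859-1 has C1 control codes. A small Python 3 sketch with made-up sample bytes:

```python
raw = b"\x93quoted\x94 caf\xe9"

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")
print(ascii(as_cp1252))  # '\u201cquoted\u201d caf\xe9'
print(ascii(as_latin1))  # '\x93quoted\x94 caf\xe9'

# A few bytes (0x81 among them) are undefined in Python's cp1252 codec,
# while ISO-8859-1 accepts every byte value.
try:
    b"\x81".decode("cp1252")
    undefined_in_cp1252 = False
except UnicodeDecodeError:
    undefined_in_cp1252 = True
print(undefined_in_cp1252)  # True
```

So if the data contains bytes in the 128-159 range, the choice of charset changes the result, and cp1252 decoding can fail outright on some inputs.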

Would you think it safe to convert to Unicode before or after the stage
that seeks out control characters and validates? My validations are
relatively simple: ensure that if I am expecting a date, integer, string,
etc., the data is what it is supposed to be (since the next stage is the
database), unify whitespace, remove control characters, and check for
SQL strings in the data to prevent any stupid things from happening if
someone wanted to be malicious.
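
The simpler validations listed above could be sketched roughly as follows in Python 3. The function names and the YYYY-MM-DD date format are assumptions made for the example, since the thread does not describe the actual field formats.

```python
import datetime
import re

def unify_whitespace(value: str) -> str:
    """Collapse runs of whitespace to single spaces and trim the ends."""
    return re.sub(r"\s+", " ", value).strip()

def parse_int(value: str) -> int:
    """Raise ValueError unless value is a base-10 integer."""
    return int(value, 10)

def parse_date(value: str) -> datetime.date:
    """Raise ValueError unless value matches YYYY-MM-DD (assumed format)."""
    return datetime.datetime.strptime(value, "%Y-%m-%d").date()

print(unify_whitespace("  a \t b\n"))  # a b
print(parse_int(" 42 "))               # 42
print(parse_date("2005-10-17"))        # 2005-10-17
```

For the SQL concern, passing values as query parameters through the database driver is generally a more reliable defence than scanning the data for SQL text.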

Regards,
David
Tony Nelson
2005-10-18 03:11:24 UTC
In article <mailman.2178.1129571437.509.python-list at python.org>,
Post by David Pratt
This is very nice :-) Thank you, Tony. I think this will be the way to
go. My concern at the moment is where it will be best to convert to
Unicode. After this step the data will go into a dict, through a few
processes, and into a database. Because the input source has no explicit
encoding, I believe I will have to assume ISO-8859-1, though it could
well be cp1252 for the most part (because the format says no ASCII 0-30
but allows chars 128-254, and because most users are on Windows). I am
thinking of stripping these characters and validating the text first,
then converting to Unicode (utf-8) so that it is Unicode in the dict.
Then when I perform these other processes it should be uniform, and it
will go into the database as Unicode. I think this should be ok.
Definitely "".translate() then unicode(). See the docs for
"".translate(). As far as charset, well, if you can't know in advance
you'll want to have some way to configure it for when it's wrong. Also,
maybe 255 is not allowed and should be checked for?
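
Putting those three suggestions together (strip first, decode second, keep the charset configurable, and reject byte 255) might look like this Python 3 sketch. The function name clean_and_decode and the ISO-8859-1 default are assumptions, and bytes.translate stands in for the Python 2 "".translate() call.

```python
DELETE = bytes(range(31)) + b"|"   # 0-30 and the pipe, as the format states

def clean_and_decode(raw: bytes, encoding: str = "iso-8859-1") -> str:
    """Strip unwanted bytes, reject 0xFF, then decode with a configurable charset."""
    if b"\xff" in raw:
        raise ValueError("byte 255 is outside the permitted 128-254 range")
    return raw.translate(None, DELETE).decode(encoding)

decoded = clean_and_decode(b"caf\xe9|\x00\n")
print(ascii(decoded))  # 'caf\xe9'
```

Calling clean_and_decode(raw, "cp1252") switches the charset without touching the stripping step.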
________________________________________________________________________
TonyN.:' *firstname*nlsnews at georgea*lastname*.com
' <http://www.georgeanelson.com/>