Degree symbol (UTF-8 > ASCII)
Steven Taschuk
2003-04-17 19:06:05 UTC
[...]
So just do something like
print >>fileobject, chr(176)
print >>fileobject, u'\N{DEGREE SIGN}'.encode('latin-1')
It worked fine when printing to a file (i.e., from the prompt),
but I still got an error when I try to append it to a string destined
to be included in a list:
w += [u'\N{DEGREE SIGN}'.encode('latin-1') + scale.strip()]
UnicodeError: ASCII decoding error: ordinal not in range(128)
And scale is a Unicode string, right?
>>> '\xb0' + 'C'
'\xb0C'
>>> '\xb0' + u'C'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII decoding error: ordinal not in range(128)

(For the same reason as before: concatenating a normal string and
a Unicode string results in a Unicode string, which means that the
normal string has to be decoded into a Unicode string first.)

I'd suggest doing all your string assembly with Unicode strings,
and only encoding to ISO-8859-1 at the output stage. Mixing the
two kinds of strings is asking for trouble, as you have seen.
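
As a rough illustration of that advice, here is a minimal sketch in Python 3 syntax, where str plays the role of this thread's unicode type and bytes the role of the old "normal" string. The helper name format_reading and the in-memory stream out are made up for the example; they are not from the original poster's code.

```python
import io

DEGREE = "\N{DEGREE SIGN}"  # kept as text (Unicode) during all string assembly


def format_reading(value, scale):
    """Hypothetical helper: build the label as text; nothing is encoded here."""
    return "%d%s%s" % (value, DEGREE, scale.strip())


# Encode exactly once, at the output boundary.
out = io.BytesIO()  # stands in for a file opened in binary mode
out.write(format_reading(21, " C ").encode("latin-1"))
written = out.getvalue()  # one byte (0xB0) for the degree sign in Latin-1
```

Because every intermediate value is text, there is no point at which a byte string has to be implicitly decoded, which is where the ASCII decoding error came from.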
--
Steven Taschuk
staschuk at telusplanet.net
Dennis Reinhardt
2003-04-16 19:10:43 UTC
I'm working with an XML document which doesn't include an encoding,
The lack of an encoding is your problem. Pick an encoding which defines 256
characters, not 128, if you want to represent chr(176) in a single byte.
I would like to insert the degree symbol (chr(176)), but
because this is outside the bounds (chr(128) is the limit), Python
raises an XML error.
UTF-8 has no chr(128) limit. Your document is UTF-8 encoded (because you
have not specified an encoding) and you need to conform to UTF-8 encoding until
you explicitly specify some other encoding. More likely is that chr(176)
and perhaps what follows are not legal UTF-8.
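
A small check of that last point, sketched in Python 3 syntax (the thread itself uses Python 2, where chr(176) yields a one-byte string). The function name is_valid_utf8 is an illustrative assumption, not a standard library call.

```python
lone_byte = bytes([176])  # the single byte 0xB0, like Python 2's chr(176)


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 0xB0 on its own is a UTF-8 continuation byte with nothing to continue.
valid_alone = is_valid_utf8(lone_byte)
# The two-byte sequence 0xC2 0xB0 is how UTF-8 spells the degree sign.
valid_pair = is_valid_utf8(b"\xc2\xb0")
```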
--

Dennis Reinhardt

DennisR at dair.com http://www.spamai.com
Peter Clark
2003-04-17 01:20:28 UTC
Post by Dennis Reinhardt
I'm working with an XML document which doesn't include an encoding,
The lack of an encoding is your problem. Pick an encoding which defines 256
characters, not 128, if you want to represent chr(176) in a single byte.
I'm sorry; upon re-reading my first message, I realize that I did
not describe the problem adequately. Let me see if I can try
again.
When I try chr(176).encode('latin-1') I get the following message:
UnicodeError: ASCII decoding error: ordinal not in range(128). But
when I try 'print chr(176)' I get the degree symbol, which is what I
want. Yet when I try inserting this into the XML stream, I get the
following error:

w += [chr(176) + scale.strip()]
UnicodeError: ASCII decoding error: ordinal not in range(128)

I've tried changing its encoding to 'utf-8' as well with
chr(176).encode('utf-8'), but that just returns the same error. This
doesn't make sense: if 'print chr(176)' works fine, why doesn't it
work later on? In a nutshell, all I want to do is add the degree
character to a string contained within a list. That doesn't seem hard
at all, but I'm completely stumped.
Duncan Booth
2003-04-17 08:52:47 UTC
pc451 at yahoo.com (Peter Clark) wrote in
Post by Peter Clark
w += [chr(176) + scale.strip()]
UnicodeError: ASCII decoding error: ordinal not in range(128)
I've tried changing its encoding to 'utf-8' as well with
chr(176).encode('utf-8'), but that just returns the same error. This
doesn't make sense: if 'print chr(176)' works fine, why doesn't it
work later on? In a nutshell, all I want to do is add the degree
character to a string contained within a list. That doesn't seem hard
at all, but I'm completely stumped.
A latin1 string containing chr(176) is a degree symbol. This string is
already encoded (as latin-1), but Python doesn't know that until you tell
it. If you were to do 'print chr(176)' in a command window on the machine
I'm typing on, you get a funny graphic character because the output is
using codepage 437 not latin1.

If you want to change its encoding, e.g. to utf-8, you must first DECODE it
from latin1 (giving you a unicode string) and then ENCODE it to utf8.
>>> chr(176).decode('latin1').encode('utf8')
'\xc2\xb0'

(Or in my case I can do "print chr(176).decode('latin1').encode('cp437')"
and see a degree symbol.)
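
The same decode-then-encode round trip, sketched in Python 3 syntax, where the byte string and the text string are separate types and the decode step can no longer be skipped. The variable names are illustrative only.

```python
latin1_bytes = b"\xb0"  # the degree sign as stored in Latin-1

# Step 1: DECODE, telling Python which encoding the bytes are in.
text = latin1_bytes.decode("latin-1")

# Step 2: ENCODE the resulting text into whatever the destination expects.
as_utf8 = text.encode("utf-8")
as_cp437 = text.encode("cp437")

# Reading the Latin-1 byte as if it were codepage 437 gives a different
# character entirely (a shaded block), which is the "funny graphic".
misread = latin1_bytes.decode("cp437")
```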
--
Duncan Booth duncan at rcp.co.uk
int month(char *p){return(124864/((p[0]+p[1]-p[2]&0x1f)+1)%12)["\5\x8\3"
"\6\7\xb\1\x9\xa\2\0\4"];} // Who said my code was obscure?
Martin v. Löwis
2003-04-18 08:45:28 UTC
Post by Steven Taschuk
And scale is a Unicode string, right?
Only because the XML document has no specified encoding, so it
defaults to UTF-8, yes. But all the text is straight ASCII, except of
course for the inclusion of the degree symbol.
No. UTF-8 is *not* Unicode. In Python, there are two data types:
<type 'str'> and <type 'unicode'>. The string type represents bytes
(8 bits per element), and the type unicode represents characters.

The string type can also be used to represent characters, but only if
you assume that you are using some encoding. UTF-8 is an encoding, and
so is Latin-1. A string encoded in UTF-8 is still a byte string, not a
character string. A Unicode object may contain characters that can be
encoded in ASCII, or it can contain characters that cannot be encoded
in ASCII.
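
One way to see this model in action is the sketch below, written in Python 3 syntax, where the two types are named bytes and str rather than str and unicode. The variable names are illustrative assumptions.

```python
text = "\N{DEGREE SIGN}C"  # a character string: two characters

utf8_bytes = text.encode("utf-8")      # a byte string in the UTF-8 encoding
latin1_bytes = text.encode("latin-1")  # a byte string in the Latin-1 encoding

# Same characters, different byte sequences and different lengths.
lengths = (len(text), len(utf8_bytes), len(latin1_bytes))

# Decoding with the matching encoding recovers the same characters.
round_trip_ok = (
    utf8_bytes.decode("utf-8") == text
    and latin1_bytes.decode("latin-1") == text
)
```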

If this is not the mental model that you have, you will have a hard
time understanding all the phenomena you observe, and I suggest
reading

http://manatee.mojam.com/~skip/unicode/unicode/

Regards,
Martin
Peter Clark
2003-04-18 03:11:57 UTC
Post by Steven Taschuk
And scale is a Unicode string, right?
Only because the XML document has no specified encoding, so it
defaults to UTF-8, yes. But all the text is straight ASCII, except of
course for the inclusion of the degree symbol.
Post by Steven Taschuk
I'd suggest doing all your string assembly with Unicode strings,
and only encoding to ISO-8859-1 at the output stage. Mixing the
two kinds of strings is asking for trouble, as you have seen.
You're right, of course. The list is sent to a simple function
that prints it to stdout; the following worked fine, and handles any
integers or floats that might otherwise cause an attribute error:

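def printlist(w):
    for item in w:
        try:
            # Unicode strings are encoded for a Latin-1 display.
            print item.encode('latin-1')
        except AttributeError:
            # Integers and floats have no .encode method; print them as-is.
            print item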

Which is much nicer than having to mess with the strings as they
are added to the list. Thanks.
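
For comparison, a minimal Python 3 sketch of the same idea, assuming the list holds text, integers, and floats. The function name write_list and the stream parameter are hypothetical; converting every item with str() avoids the need for a try/except.

```python
import io


def write_list(items, stream, encoding="latin-1"):
    """Write each item on its own line, encoding only at the output stage."""
    for item in items:
        line = str(item)  # text stays text; numbers become text
        stream.write(line.encode(encoding) + b"\n")


buf = io.BytesIO()  # stands in for a binary output file
write_list(["\N{DEGREE SIGN}C", 21, 3.5], buf)
output_bytes = buf.getvalue()
```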
Peter Clark
2003-04-17 01:32:06 UTC
The "degree" symbol is chr(176) in what character encoding?
Certainly not in UTF-8. Perhaps in windows cp-1252 or ISO-8859-1.
Yes, latin-1.
(If you look at what is written, you'll see that the character
is encoded in two bytes when using UTF-8: 0xc2, 0xb0.)
This works in the shell, except that 'print
degreeChar.encode('UTF-8')' prints two characters, '??'. Well,
obviously, since it's Unicode being displayed in a latin-1
environment. (Strictly speaking, sys.getdefaultencoding() returns
'ascii', but the interactive prompt still prints characters greater
than 128.) However, when I put it into the code, I get this error:

w += [deg.encode('UTF-8') + scale.strip()]
UnicodeError: ASCII decoding error: ordinal not in range(128)

Since the output is meant to be read to be displayed by a font
which is in essentially latin-1 encoding, I need to restrict the
manner in which the degree symbol is displayed to one byte. Yet I
cannot get it to behave, even though 'print chr(176)' works perfectly
fine at the prompt. My suspicion is that the default encoding of the
system is messing python up somewhere along the way--is there any way
to tell it to just print the stupid character and not be concerned
with the output?
Peter Clark
2003-04-17 13:55:48 UTC
Post by Peter Clark
w += [deg.encode('UTF-8') + scale.strip()]
UnicodeError: ASCII decoding error: ordinal not in range(128)
Aha! Yes, it's a *decoding* error. Your 'deg' variable is a
normal string, right?
That's correct.
Post by Peter Clark
Since the output is meant to be read to be displayed by a font
which is in essentially latin-1 encoding, I need to restrict the
manner in which the degree symbol is displayed to one byte. [...]
So just do something like
print >>fileobject, chr(176)
print >>fileobject, u'\N{DEGREE SIGN}'.encode('latin-1')
It worked fine when printing to a file (i.e., from the prompt),
but I still got an error when I try to append it to a string destined
to be included in a list:

w += [u'\N{DEGREE SIGN}'.encode('latin-1') + scale.strip()]
UnicodeError: ASCII decoding error: ordinal not in range(128)

In this case, 'scale' is either 'C' or 'F' (hence the need for a
degree character). But then I realized that what you said above would
also apply to 'scale' as well. So when I changed the line to:

w += [u'\N{DEGREE SIGN}'.encode('latin-1') +
scale.strip().encode('latin-1')]

it worked perfectly. Problem solved, thank you very much! So in
retrospect, my mistake was in not paying attention to the encoding of
the rest of the string, which was why the error messages kept popping
up. Duly noted.
Erik Max Francis
2003-04-17 01:56:34 UTC
Post by Peter Clark
Since the output is meant to be read to be displayed by a font
which is in essentially latin-1 encoding, I need to restrict the
manner in which the degree symbol is displayed to one byte. Yet I
cannot get it to behave, even though 'print chr(176)' works perfectly
fine at the prompt. My suspicion is that the default encoding of the
system is messing python up somewhere along the way--is there any way
to tell it to just print the stupid character and not be concerned
with the output?
I've come into this conversation late, but could it be that what's
confusing you is that UTF-8 and Latin-1 are not the same thing? It
sounds like you want Latin-1 but are asking for UTF-8. UTF-8 is an
octet representation of Unicode which uses escape sequences and the like
to represent eight-bit information; Latin-1 is an eight-bit encoding.
Both have the property that pure-ASCII data will be represented without
modification, but they aren't the same beast. If you're converting to
UTF-8 and are puzzled why 8-bit data is expanding to multiple
characters, then chances are UTF-8 isn't what you wanted.

[where u is a Unicode string representing the degree symbol]
>>> u.encode('latin-1')
'\xb0'
>>> u.encode('utf-8')
'\xc2\xb0'
>>> print u.encode('latin-1')
[the degree symbol]
--
Erik Max Francis / max at alcyone.com / http://www.alcyone.com/max/
__ San Jose, CA, USA / 37 20 N 121 53 W / &tSftDotIotE
/ \ It was involuntary. They sank my boat.
\__/ John F. Kennedy (on how he became a war hero)
Bosskey.net: Return to Wolfenstein / http://www.bosskey.net/rtcw/
A personal guide to Return to Castle Wolfenstein.
Steven Taschuk
2003-04-17 03:01:53 UTC
Post by Peter Clark
w += [deg.encode('UTF-8') + scale.strip()]
UnicodeError: ASCII decoding error: ordinal not in range(128)
Aha! Yes, it's a *decoding* error. Your 'deg' variable is a
normal string, right?
>>> '\xb0'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII decoding error: ordinal not in range(128)
>>> '\xb0'.decode('latin-1').encode('utf-8')
'\xc2\xb0'
>>> '\xb0'.decode('latin-1') # produces Unicode string
u'\xb0'
>>> u'\xb0'.encode('utf-8') # uses Unicode string
'\xc2\xb0'

See? For .encode to work, it has to know what characters are in
the string being encoded. If it's a normal string with bytes
outside of range(128), they're not ASCII, so it quite properly
refuses to guess what they are. But if you tell it the correct
encoding, it can construct the right sequence of characters (as a
Unicode string), and then encode that as instructed.
Post by Peter Clark
Since the output is meant to be read to be displayed by a font
which is in essentially latin-1 encoding, I need to restrict the
manner in which the degree symbol is displayed to one byte. [...]
So just do something like
print >>fileobject, chr(176)
print >>fileobject, u'\N{DEGREE SIGN}'.encode('latin-1')

(Note that if you do want your file to be in ISO-8859-1, the XML
declaration should say that's what it is.)
--
Steven Taschuk
staschuk at telusplanet.net
Irmen de Jong
2003-04-16 19:11:03 UTC
I'm working with an XML document which doesn't include an encoding, so
it defaults to UTF-8. Of course, all of the text is ASCII, and likely
to remain so. I would like to insert the degree symbol (chr(176)), but
because this is outside the bounds (chr(128) is the limit), Python
raises an XML error. What's the simplest way of including an
unadorned degree symbol? Again, it's not necessary to preserve the
UTF-8 encoding, but I'm not quite certain as to how to tell Python
that the XML document is plain ASCII (or I guess in the case of the
degree symbol, latin-1).
Thanks,
:Peter
The "degree" symbol is chr(176) in what character encoding?
Certainly not in UTF-8. Perhaps in windows cp-1252 or ISO-8859-1.

Are you creating the UTF-8 encoded XML document using Python?
Then try the following to add a "degree" character to your
output file:

To get the (unicode) character you want:
degreeChar = u'\N{DEGREE SIGN}'

This is Unicode character U+00B0, so you could also use:
degreeChar = u'\u00b0'

To write it using UTF-8 encoding to a file object 'output':
output.write(degreeChar.encode('UTF-8'))

(if you look at what is written you'll see that the character
is encoded in two bytes when using UTF-8: 0xc2,0xb0)
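
A rough Python 3 equivalent of the steps above, for readers on a current interpreter. It assumes a throwaway temporary file as the output; in Python 3 the file object can do the UTF-8 encoding itself when opened with encoding="utf-8".

```python
import os
import tempfile

degree_char = "\N{DEGREE SIGN}"
same_char = "\u00b0"  # the same character, written by code point

# Create a scratch file for the demonstration and remove it afterwards.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    with open(path, "w", encoding="utf-8") as output:
        output.write(degree_char)  # encoded to UTF-8 on the way out
    with open(path, "rb") as raw:
        raw_bytes = raw.read()  # inspect what actually landed on disk
finally:
    os.remove(path)
```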

--Irmen
Steven Taschuk
2003-04-16 18:50:30 UTC
I'm working with an XML document which doesn't include an encoding, so
it defaults to UTF-8. Of course, all of the text is ASCII, and likely
to remain so. I would like to insert the degree symbol (chr(176)), but
because this is outside the bounds (chr(128) is the limit), Python
raises an XML error. What's the simplest way of including an
unadorned degree symbol? Again, it's not necessary to preserve the
UTF-8 encoding, but I'm not quite certain as to how to tell Python
that the XML document is plain ASCII (or I guess in the case of the
degree symbol, latin-1).
I'm not sure what you're after.

On the one hand, you say the text is all ASCII "and likely to
remain so", but then immediately declare that you want to insert a
non-ASCII character, namely the degree symbol. If the XML
document is to be in ASCII, it may not include chr(0xB0), since
ASCII has no such character. End of story.

If you want this character, either your document must be in a
character encoding which includes it (such as UTF-8 or
ISO-8859-1), or it must use XML character entity markup "&#xB0;".

If you use ISO-8859-1 (aka Latin-1), you'll have to indicate so in
the XML declaration; if I were you I'd leave the document in
UTF-8. (Among other things, XML processors are not required to
understand ISO-8859-1, but they are required to understand UTF-8.)
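
The three options can be compared with the standard library's xml.etree.ElementTree, sketched here in Python 3 syntax. This is only an illustration and not necessarily the XML API the original poster was using; the element name temp is made up. Note that ElementTree writes the character reference in decimal (&#176;), which is equivalent to the hexadecimal &#xB0; form.

```python
import xml.etree.ElementTree as ET

elem = ET.Element("temp")  # hypothetical element for the example
elem.text = "21\N{DEGREE SIGN}C"

# Default serialization is ASCII-only: the degree sign becomes a
# numeric character reference.
ascii_doc = ET.tostring(elem)

# ISO-8859-1: one byte for the degree sign, and the serializer adds an
# XML declaration naming the encoding.
latin1_doc = ET.tostring(elem, encoding="iso-8859-1")

# UTF-8: two bytes for the degree sign.
utf8_doc = ET.tostring(elem, encoding="utf-8")

# All three parse back to the same text.
parsed = [ET.fromstring(doc).text for doc in (ascii_doc, latin1_doc, utf8_doc)]
```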
--
Steven Taschuk staschuk at telusplanet.net
"Its force is immeasurable. Even Computer cannot determine it."
-- _Space: 1999_ episode "Black Sun"
Peter Clark
2003-04-16 17:30:09 UTC
I'm working with an XML document which doesn't include an encoding, so
it defaults to UTF-8. Of course, all of the text is ASCII, and likely
to remain so. I would like to insert the degree symbol (chr(176)), but
because this is outside the bounds (chr(128) is the limit), Python
raises an XML error. What's the simplest way of including an
unadorned degree symbol? Again, it's not necessary to preserve the
UTF-8 encoding, but I'm not quite certain as to how to tell Python
that the XML document is plain ASCII (or I guess in the case of the
degree symbol, latin-1).
Thanks,
:Peter
Ben Hutchings
2003-04-24 16:16:01 UTC
I'm working with an XML document which doesn't include an encoding, so
it defaults to UTF-8. Of course, all of the text is ASCII, and likely
to remain so. I would like to insert the degree symbol (chr(176)), but
because this is outside the bounds (chr(128) is the limit), Python
raises an XML error. What's the simplest way of including an
unadorned degree symbol?
Since you mentioned an XML error I'm assuming you're using some XML API
to read and write the file, so you want a Unicode string and the answer
is:
unichr(176)

chr(176) returns an ordinary string which should be interpreted in the
local encoding, which is ASCII by default; however, ASCII doesn't define
a character 176. Python doesn't interpret strings as anything other
than arbitrary bytes until you try to print or convert them, so that's
why the error only shows up later.
Again, it's not necessary to preserve the UTF-8 encoding, but I'm not
quite certain as to how to tell Python that the XML document is plain
ASCII (or I guess in the case of the degree symbol, latin-1).
An XML document in some 8-bit or multibyte encoding other than UTF-8
*must* have the encoding specified in its XML declaration. You should
not try to override this.