The first thing to know about Python's Unicode support is that you may need to install a recent version of Python to get it. Red Hat Linux 7.x ships Python 1.5.2 by default, for compatibility reasons, but Unicode support was not introduced until Python 1.6.
Python has two different string types: an 8-bit non-Unicode string
type (str) and a 16-bit Unicode string type
(unicode).
Unicode strings are written with a leading u. They
may contain Unicode escape sequences of the form \u0000,
just as in Java. For example:
question = u'\u00bfHabla espa\u00f1ol?' # ¿Habla español?
Some Unicode characters have numbers beyond U+FFFF, so
Python has another escape: \U00000000, which offers more
than enough digits to specify any Unicode codepoint. (Recent C and
C++ standards also offer this, but Java does not.)
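As a small illustration of the two escape forms, here is a sketch that is not part of the original examples. The variable names are made up, and U+1D11E (MUSICAL SYMBOL G CLEF) is simply one arbitrarily chosen code point beyond U+FFFF.

```python
# \u takes exactly four hex digits; \U takes exactly eight.
n_tilde = u'\u00f1'            # U+00F1 fits in four digits
same_n_tilde = u'\U000000f1'   # the same character, written with \U
g_clef = u'\U0001d11e'         # U+1D11E is too large for \u

# Both spellings produce the identical one-character string.
assert n_tilde == same_n_tilde

# Encoding the G clef as UTF-8 takes four bytes.
g_clef_utf8 = g_clef.encode('utf-8')
```

On a Python build that stores Unicode strings in 16-bit units, a character such as `g_clef` is held as a surrogate pair, so `len(g_clef)` may report 2 rather than 1.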
Python also offers a \N escape which allows you to
specify any Unicode character by name.
# This string has 7 characters in all, including the spaces
# between the symbols.
symbols = u'\N{BLACK STAR} \N{WHITE STAR} \N{LIGHTNING} \N{COMET}'
One more way to build a Unicode string object is with the built-in
unichr() function, which is the Unicode version of
chr().
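A minimal sketch of unichr() in use follows. The fallback to chr() is an assumption added here for interpreters that no longer provide unichr(); it is not something the text above describes.

```python
# unichr() exists on the Python versions this article describes.
# Assumption: where it is absent (Python 3), chr() plays the same role.
try:
    unichr
except NameError:
    unichr = chr

# Build a one-character Unicode string from its code point.
pound = unichr(0xA3)

# The result is the same string the escape syntaxes produce.
assert pound == u'\u00a3'
assert pound == u'\N{POUND SIGN}'

# ord() goes the other way, from character to code point.
code_point = ord(pound)
```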
Unicode strings are very similar to Python's ordinary 8-bit
strings. They have the same useful methods (split(),
strip(), find(), and so on). The
+ and * operators work on Unicode strings
just as they do for plain strings. And like plain strings, Unicode
strings can do
printf-like formatting, using the % symbol.
For the most part, you'll feel right at home.
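To make that concrete, here is a short sketch that reuses the question string from earlier. The other variable names and the sample names in the format call are invented for illustration.

```python
question = u'\u00bfHabla espa\u00f1ol?'    # the string from the earlier example

# The familiar string methods work on Unicode strings.
words = question.split()                   # two words, split on the space
stripped = question.strip(u'\u00bf?')      # punctuation removed from both ends
position = question.find(u'\u00f1')        # index of the n-with-tilde

# So do the + and * operators.
doubled = question * 2
joined = u'\u00a1Hola! ' + question

# And so does printf-like formatting with %.
greeting = u'%s %s' % (u'Se\u00f1or', u'Mu\u00f1oz')
```

Every result here is itself a Unicode string (or, for `split()`, a list of them), so the values can be passed on to further Unicode-aware code without conversion.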
This seamlessness extends to most of Python's standard library.
Python regular expressions can search Unicode strings.
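For instance, the following sketch finds words that contain non-ASCII letters. The sample text is made up, and the use of the re.UNICODE flag is a choice made for this example: on the Python versions described here it is what makes \w match letters outside ASCII.

```python
import re

text = u'ma\u00f1ana, espa\u00f1ol, caf\u00e9'

# re.UNICODE makes \w follow the Unicode character database,
# so accented letters count as word characters.
words = re.findall(u'\\w+', text, re.UNICODE)

# The matches come back as Unicode strings.
assert words == [u'ma\u00f1ana', u'espa\u00f1ol', u'caf\u00e9']
```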
Python's standard gettext
module supports Unicode. This is the module to use for internationalization
of Python programs.
The Tkinter GUI toolkit offers excellent Unicode support. Here is a minimal Hello, world program using Unicode and Tkinter.
Python's standard XML library is Unicode-aware (as required by the XML specification).
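As a rough sketch of what that means in practice, the example below parses a tiny UTF-8 document. It assumes xml.dom.minidom, which is only one of the XML modules in the standard library, and the document itself is invented.

```python
from xml.dom import minidom

# A small XML document, encoded as UTF-8 bytes (the XML default).
xml_bytes = u'<greeting>\u00a1Hola!</greeting>'.encode('utf-8')

doc = minidom.parseString(xml_bytes)

# The parser hands the text back as a decoded Unicode string.
text = doc.documentElement.firstChild.data
assert text == u'\u00a1Hola!'
```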
Most of the standard library works smoothly with Unicode strings. Some modules still are not fully Unicode-friendly, but the most important pieces are in place.
Reading and writing Unicode files from Python is simple. Use
codecs.open() and specify the encoding.
import codecs
# Open a UTF-8 file in read mode
infile = codecs.open("infile.txt", "r", "utf-8")
# Read its contents as one large Unicode string.
text = infile.read()
# Close the file.
infile.close()
The same function is used to open a file for writing; just use
"w" (write) or "a" (append) as the second
argument.
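A sketch of the write and append modes follows. The file name outfile.txt and the temporary directory are placeholders chosen so the example does not touch any real file.

```python
import codecs
import os
import tempfile

# Hypothetical output file inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "outfile.txt")

# Open in write mode; Unicode strings are encoded to UTF-8 on the way out.
outfile = codecs.open(path, "w", "utf-8")
outfile.write(u'\u00bfHabla espa\u00f1ol?\n')
outfile.close()

# Open the same file again in append mode and add a second line.
outfile = codecs.open(path, "a", "utf-8")
outfile.write(u'S\u00ed.\n')
outfile.close()

# Read everything back as one Unicode string to check the round trip.
infile = codecs.open(path, "r", "utf-8")
text = infile.read()
infile.close()
```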
A fourth argument, after the encoding, can be provided to specify error handling. The possible values are:
'strict' - The default. Raise an exception if errors are detected while encoding or decoding data.
'ignore' - Skip over errors or unencodable characters.
'replace' - Replace bad or unencodable data with a "replacement character", usually a question mark.
Since 'strict' is the default, expect a lot of UnicodeError exceptions to be raised if your data isn't quite right. Once you get the hang of it, those errors become much less frequent.
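The three settings can be compared with a sketch like the one below. It writes a deliberately malformed file (Latin-1 bytes that are not valid UTF-8) to a temporary, hypothetical path and then reads it back under each error-handling mode.

```python
import codecs
import os
import tempfile

# Hypothetical input file containing Latin-1 bytes: not valid UTF-8.
path = os.path.join(tempfile.mkdtemp(), "infile.txt")
raw = open(path, "wb")
raw.write(b'caf\xe9\n')
raw.close()

# 'strict' (the default): the bad byte raises a UnicodeError.
infile = codecs.open(path, "r", "utf-8")
try:
    infile.read()
    strict_failed = False
except UnicodeError:
    strict_failed = True
finally:
    infile.close()

# 'replace': the bad byte becomes a replacement character.
infile = codecs.open(path, "r", "utf-8", "replace")
replaced = infile.read()
infile.close()

# 'ignore': the bad byte is silently dropped.
infile = codecs.open(path, "r", "utf-8", "ignore")
ignored = infile.read()
infile.close()
```

When decoding, as here, the replacement character is U+FFFD rather than a question mark; the question mark appears when encoding to a character set that lacks the character.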
Sometimes a program simply needs to encode or decode a single chunk
of Unicode data. This, too, is easy in Python: Unicode strings have
an encode() method that returns a str, and
str objects have a decode() method that
returns a unicode string.
# Suppose we are given these bytes, perhaps over a socket
# or perhaps taken from a database.
bytes = 'Bun\xc4\x83-diminea\xc8\x9ba, lume'
# We want to convert these UTF-8 bytes to a Unicode string.
unicode_strg = bytes.decode('utf-8')
# Now print it, but in the ISO-8859-1 encoding, because
# (let's suppose) that is the format of our display.
print unicode_strg.encode('iso-8859-1', 'replace')
However, note that in this particular example, the source string
contains two characters (ă and ț) that are not available
in ISO-8859-1! Unfortunately, if our display can only handle
ISO-8859-1 characters, there is no satisfactory answer to this
problem. Some characters will be lost. The last line of the sample
code instructs Python to use the 'replace' error-handling
behavior instead of the default 'strict' behavior. This
way, although some characters will be replaced with question marks, at
least no exception will be thrown.
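The effect is easy to inspect without a display. This sketch repeats the conversion above and keeps the resulting bytes in a variable (the names are illustrative) instead of printing them.

```python
# The same UTF-8 bytes as in the example above.
utf8_bytes = b'Bun\xc4\x83-diminea\xc8\x9ba, lume'
text = utf8_bytes.decode('utf-8')

# Encode for an ISO-8859-1 display, replacing what cannot be represented.
latin1_bytes = text.encode('iso-8859-1', 'replace')

# The two unavailable characters have each become a question mark.
assert latin1_bytes == b'Bun?-diminea?a, lume'
```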
Of course, it would be better to use a display that can handle all Unicode characters, such as a Tk GUI.
print and Unicode strings
We now come to the most puzzling aspect of Python's Unicode
support. Attempting to print a Unicode string causes an
error:
>>> print u'\N{POUND SIGN}'
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)
Two elements combine to cause this error:
Python's default encoding is ASCII, and the pound sign is not an ASCII character.
The default error-handling behavior is 'strict'. If Python
encounters a character that it can't encode, it raises a
UnicodeError. (This is different from Java, which silently
replaces the character with a ? instead.)
Python defaults to ASCII because ASCII is the only thing likely to work everywhere. The correct encoding is not always Latin-1. In fact, it depends on how you are accessing Python.
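The failure can be reproduced without print by doing the ASCII encoding explicitly. This is a sketch of what the text describes, on the assumption that the implicit conversion behaves like a plain 'strict' encode to ASCII; the variable names are illustrative.

```python
pound = u'\N{POUND SIGN}'

# ASCII plus 'strict' (the default): encoding the pound sign fails.
try:
    pound.encode('ascii')
    raised = False
except UnicodeError:
    raised = True

# Asking for 'replace' instead yields a question mark, as Java would.
fallback = pound.encode('ascii', 'replace')
```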
When Python executes a print statement, it simply
passes the output to the operating system (using fwrite()
or something like it), and some other program is responsible
for actually displaying that output on the screen. For example, on
Windows, it might be the Windows console subsystem that displays the
result. Or if you're using Windows and running Python on a Unix box
somewhere else, your Windows SSH client is actually responsible for
displaying the data. If you are running Python in an xterm on Unix,
then xterm and your X server handle the display.
To print data reliably, you must know the encoding
that this display program expects.
Earlier it was mentioned that IBM PC computers use the "IBM Code Page 437" character set at the BIOS level. The Windows console still emulates CP437, so this print statement will work in a Windows console window.
# Windows console mode only
>>> s = u'\N{POUND SIGN}'
>>> print s.encode('cp437')
£
Several SSH clients display data using the Latin-1 character set, while
Tkinter assumes UTF-8 when 8-bit strings are passed into it. So in
general it is not possible to determine what encoding to use with
print. It is therefore better to send Unicode output to
files or Unicode-aware GUIs, not to sys.stdout.
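To see why the choice matters, this sketch encodes the same pound sign for the three kinds of display mentioned above. The dictionary and its name are just a convenient way to hold the results.

```python
pound = u'\N{POUND SIGN}'

# One character, three display encodings, three different byte sequences.
encodings = ['cp437', 'iso-8859-1', 'utf-8']
encoded = dict((name, pound.encode(name)) for name in encodings)

assert encoded['cp437'] == b'\x9c'        # Windows console
assert encoded['iso-8859-1'] == b'\xa3'   # a Latin-1 SSH client
assert encoded['utf-8'] == b'\xc2\xa3'    # what Tkinter assumes
```

Bytes produced for one display will show up as the wrong character, or as two characters, on another.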