Unicode in Python

The first thing to know about Python's Unicode support is that you may need to install a recent version of Python to get it. Users of Red Hat Linux 7.x have Python 1.5.2 by default, for compatibility reasons, but Unicode support was only introduced in Python 1.6.

Unicode Strings in Python

Python has two different string types: an 8-bit non-Unicode string type (str) and a Unicode string type (unicode). By default, unicode stores 16-bit code units, although Python can also be built to use 32-bit units.

Unicode strings are written with a leading u. They may contain Unicode escape sequences of the form \u0000, just as in Java. For example:

  question = u'\u00bfHabla espa\u00f1ol?'  # ¿Habla español?

Some Unicode characters have code points beyond U+FFFF, so Python has another escape: \U00000000, whose eight hex digits are more than enough to specify any Unicode code point. (Recent C and C++ standards also offer this, but Java does not.)
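As a small sketch of the two escapes side by side, the snippet below writes the same character with \u and with \U, then uses \U for a character beyond U+FFFF. It is written so that it should run on both Python 2.6+ and Python 3.3+ (an assumption about your interpreter), and the variable names are purely illustrative.

```python
# \u takes exactly four hex digits; \U takes exactly eight.
inverted_question = u'\u00bf'    # U+00BF INVERTED QUESTION MARK
same_char = u'\U000000bf'        # the same code point, spelled with \U
g_clef = u'\U0001d11e'           # U+1D11E MUSICAL SYMBOL G CLEF, beyond U+FFFF

# Both spellings produce an identical one-character string.
print(inverted_question == same_char)

# Encoding to UTF-8 gives four bytes for the character beyond U+FFFF.
g_clef_utf8 = g_clef.encode('utf-8')
print(len(g_clef_utf8))
```

Note that len(g_clef) depends on the build: it is 1 where strings are stored as full code points, and 2 on builds that store 16-bit units and represent the character as a surrogate pair.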

Python also offers a \N escape which allows you to specify any Unicode character by name.

  # This string has 7 characters in all, including the spaces
  # between the symbols.
  symbols = u'\N{BLACK STAR} \N{WHITE STAR} \N{LIGHTNING} \N{COMET}'

One more way to build a Unicode string object is with the built-in unichr() function, which is the Unicode version of chr().
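The following sketch ties the \N escape and unichr() together: it builds the same character both ways and checks the length claimed in the comment above. The try/except shim is an assumption added so the snippet also runs on Python 3, where chr() plays the role of unichr().

```python
import unicodedata

try:
    unichr            # built in on Python 2
except NameError:
    unichr = chr      # assumption: Python 3, where chr() returns a Unicode string

# Build a character from its code point number.
star = unichr(0x2605)

# Look up its official Unicode name (plain ASCII, so safe to print anywhere).
star_name = unicodedata.name(star)
print(star_name)

# The same string as in the example above: four symbols and three spaces.
symbols = u'\N{BLACK STAR} \N{WHITE STAR} \N{LIGHTNING} \N{COMET}'
print(len(symbols))
```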

Unicode Support in the Python Standard Library

Unicode strings are very similar to Python's ordinary 8-bit strings. They have the same useful methods (split(), strip(), find(), and so on). The + and * operators work on Unicode strings just as they do for plain strings. And like plain strings, Unicode strings can do printf-like formatting, using the % symbol. For the most part, you'll feel right at home.
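To illustrate, here is a short sketch that applies a few of those methods and operators to the Spanish question from earlier. The variable names are illustrative only, and the snippet is written to run on Python 2.6+ as well as Python 3.3+.

```python
question = u'\u00bfHabla espa\u00f1ol?'

words = question.split()                # split on whitespace
position = question.find(u'\u00f1')     # index of the n-with-tilde
shouted = question.upper() + u'!' * 2   # + and * behave as for plain strings

# printf-like formatting with the % operator.
report = u'%d characters, %d words' % (len(question), len(words))
print(report)
```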

This seamlessness extends to most of Python's standard library. Some modules are still not fully Unicode-friendly, but the most important pieces are in place.

Unicode Files and Python

Reading and writing Unicode files from Python is simple. Use codecs.open() and specify the encoding.

  import codecs
  # Open a UTF-8 file in read mode
  infile = codecs.open("infile.txt", "r", "utf-8")
  # Read its contents as one large Unicode string.
  text = infile.read()
  # Close the file.
  infile.close()

The same function is used to open a file for writing; just use "w" (write) or "a" (append) as the second argument.
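A minimal sketch of the write and append modes follows. The file name outfile.txt and the temporary directory are hypothetical choices made so the example does not touch any real files; the text is read back at the end to confirm the round trip.

```python
import codecs
import os
import tempfile

# Hypothetical location: a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "outfile.txt")

# "w" creates (or truncates) the file; Unicode strings are encoded on write.
outfile = codecs.open(path, "w", "utf-8")
outfile.write(u'\u00bfHabla espa\u00f1ol?\n')
outfile.close()

# "a" appends to the existing contents.
outfile = codecs.open(path, "a", "utf-8")
outfile.write(u'S\u00ed.\n')
outfile.close()

# Read everything back as one Unicode string.
infile = codecs.open(path, "r", "utf-8")
text = infile.read()
infile.close()

print(len(text))
```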

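A fourth argument, after the encoding, can be provided to specify error handling. The possible values are 'strict' (raise an exception on any encoding or decoding error), 'ignore' (silently skip the offending data), and 'replace' (substitute a replacement marker, such as ? when encoding).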

Since 'strict' is the default, expect a lot of UnicodeError exceptions to be raised if your data isn't quite right. Once you get the hang of it, those errors become much less frequent.
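The effect of the error-handling argument is easiest to see on a small piece of bad data. The sketch below uses the decode() method rather than codecs.open(), on the assumption that the three common handler names ('strict', 'ignore', 'replace') behave the same way in both places; the sample bytes are made up for illustration.

```python
# Latin-1 bytes for "cafe" with an acute accent; \xe9 alone is not valid UTF-8.
bad_bytes = b'caf\xe9'

# 'strict' is the default, so a plain decode raises an exception.
try:
    bad_bytes.decode('utf-8')
    strict_failed = False
except UnicodeError:
    strict_failed = True

# 'ignore' drops the undecodable byte.
ignored = bad_bytes.decode('utf-8', 'ignore')

# 'replace' substitutes U+FFFD REPLACEMENT CHARACTER when decoding.
replaced = bad_bytes.decode('utf-8', 'replace')

print(strict_failed)
```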

Sometimes a program simply needs to encode or decode a single chunk of Unicode data. This, too, is easy in Python: Unicode strings have an encode() method that returns a str, and str objects have a decode() method that returns a unicode string.

  # Suppose we are given these bytes, perhaps over a socket
  # or perhaps taken from a database.
  raw_bytes = 'Bun\xc4\x83-diminea\xc8\x9ba, lume'

  # We want to convert these UTF-8 bytes to a Unicode string.
  unicode_strg = raw_bytes.decode('utf-8')

  # Now print it, but in the ISO-8859-1 encoding, because
  # (let's suppose) that is the format of our display.
  print unicode_strg.encode('iso-8859-1', 'replace')

Note, however, that the source string in this example contains two characters (ă and ț) that are not available in ISO-8859-1. If our display can only handle ISO-8859-1 characters, there is no satisfactory answer to this problem: some characters will be lost. The last line of the sample code therefore tells Python to use the 'replace' error-handling behavior instead of the default 'strict' behavior. Those two characters are replaced with question marks, but at least no exception is raised.
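To make the loss concrete, this sketch repeats the conversion and keeps the encoded result in a variable so it can be inspected instead of printed. It uses b'' and print() so that it should run on Python 2.6+ and 3.3+ alike; the 'ignore' line is an addition for comparison and is not part of the example above.

```python
data = b'Bun\xc4\x83-diminea\xc8\x9ba, lume'
text = data.decode('utf-8')

# 'replace' puts a question mark where a character has no ISO-8859-1 form.
replaced = text.encode('iso-8859-1', 'replace')

# 'ignore' (shown for comparison) drops those characters altogether.
ignored = text.encode('iso-8859-1', 'ignore')

# Both results are plain ASCII here, so printing their repr is safe.
print(repr(replaced))
print(repr(ignored))
```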

Of course, it would be better to use a display that can handle all Unicode characters, such as a Tk GUI.

print and Unicode Strings

We now come to the most puzzling aspect of Python's Unicode support. Attempting to print a Unicode string that contains non-ASCII characters can cause an error:

>>> print u'\N{POUND SIGN}'
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: ASCII encoding error: ordinal not in range(128)

Two elements combine to cause this error:

  1. Python's default encoding is ASCII. The pound sign is not an ASCII character. (By contrast, Java's default encoding is usually something like Latin-1, which covers a bit more ground than ASCII.)
  2. The default error behavior is 'strict'. If Python encounters a character that it can't encode, it raises a UnicodeError. (This is different from Java, which silently replaces the character with a ? instead.)
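The two elements above can be reproduced without involving print at all, by encoding the string explicitly. This is only a sketch of the same failure: it assumes that an explicit encode('ascii') is a fair stand-in for the implicit conversion print attempts, and the variable names are illustrative.

```python
pound = u'\N{POUND SIGN}'

# Element 1 and 2 together: ASCII encoding with the default 'strict' handler.
try:
    pound.encode('ascii')
    error_name = None
except UnicodeError as exc:
    error_name = type(exc).__name__

# Relaxing the error behavior avoids the exception but loses the character.
lenient = pound.encode('ascii', 'replace')

# Choosing an encoding that contains the character avoids both problems.
latin1 = pound.encode('latin-1')

print(error_name)
```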

Python defaults to ASCII because ASCII is the only thing likely to work everywhere. The correct encoding is not always Latin-1. In fact, it depends on how you are accessing Python.

When Python executes a print statement, it simply passes the output to the operating system (using fwrite() or something like it), and some other program is responsible for actually displaying that output on the screen. For example, on Windows, it might be the Windows console subsystem that displays the result. Or if you're using Windows and running Python on a Unix box somewhere else, your Windows SSH client is actually responsible for displaying the data. If you are running Python in an xterm on Unix, then xterm and your X server handle the display.

To print data reliably, you must know the encoding that this display program expects.

Earlier it was mentioned that IBM PC computers use the "IBM Code Page 437" character set at the BIOS level. The Windows console still emulates CP437, at least on US-English systems; other locales may use a different code page. So this print statement will work on Windows, in a console window.

  # Windows console mode only
  >>> s = u'\N{POUND SIGN}'
  >>> print s.encode('cp437')
  £

Several SSH clients display data using the Latin-1 character set, while Tkinter assumes UTF-8 when 8-bit strings are passed into it. So in general a program cannot determine what encoding to use with print. It is therefore better to send Unicode output to files or Unicode-aware GUIs than to sys.stdout.
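The sketch below shows why the choice matters: the same pound sign becomes a different byte sequence for each of the three displays mentioned in this section. The table of display names is hypothetical and only pairs each environment with the encoding the text says it expects.

```python
pound = u'\N{POUND SIGN}'

# Hypothetical table: display environment -> encoding it is assumed to expect.
assumed_encodings = [
    ('Windows console (US)', 'cp437'),
    ('Latin-1 SSH client', 'latin-1'),
    ('Tkinter 8-bit string', 'utf-8'),
]

encoded = {}
for display, encoding in assumed_encodings:
    encoded[display] = pound.encode(encoding)
    # Print the bytes as an ASCII-safe repr rather than sending them raw.
    print('%-22s %-8s %r' % (display, encoding, encoded[display]))
```

Bytes prepared for one of these displays would show the wrong character, or garbage, on either of the others, which is why the encoding has to be known before anything is printed.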
