Unicode in HTML and XML

All HTML and XML documents are made up of Unicode characters, so any HTML or XML document may contain arbitrary Unicode text. (This is a good reason to use a dedicated XML parser library when appropriate, rather than writing home-brewed code that doesn't support the whole XML standard.)

Specifying the Encoding

If you are writing a program that generates an XML or HTML file, you should specify which encoding you are using. Here's how.

Specifying the encoding of an XML file:
  <?xml version="1.0" encoding="ISO-8859-1"?>

This is called the XML declaration, and the encoding="..." part of it is the encoding declaration. It belongs at the absolute beginning of the XML file. No comments or whitespace may precede it. (For more details, see section 4.3.3 of the XML specification.)
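As a rough illustration of a program that generates such a file, the sketch below uses Python's standard xml.etree.ElementTree module to serialize a small document in ISO-8859-1. The element name "price" and its content are made up for the example, and the xml_declaration argument to tostring assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document containing a non-ASCII character.
root = ET.Element("price")
root.text = "\u00a3100"  # pound sign followed by "100"

# Serialize to bytes in ISO-8859-1. xml_declaration=True asks for the
# declaration to be written as the very first bytes of the output.
data = ET.tostring(root, encoding="ISO-8859-1", xml_declaration=True)

# A parser reads the declared encoding and decodes the bytes accordingly.
parsed = ET.fromstring(data)
```

Here the pound sign is stored as the single byte 0xA3, and the parser recovers it because the declaration names the encoding that was used to write the bytes.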

Specifying the encoding of an HTML file:
  <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

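The <meta> tag is allowed anywhere in the <head> of an HTML document, but the earlier the better: a browser cannot be sure how to decode anything that precedes the tag, so place it before the <title> and any other text that may contain non-ASCII characters.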
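A minimal sketch of generating an HTML page this way might look like the following. The function name build_html_page and the sample text are hypothetical, and the sketch assumes the title and body are already valid HTML (it does no escaping). The point it shows is that the bytes are written in the same encoding that the <meta> tag declares.

```python
def build_html_page(title: str, body: str, encoding: str = "ISO-8859-1") -> bytes:
    """Return an HTML page as bytes, declaring the encoding it is written in."""
    page = (
        "<html>\n"
        "<head>\n"
        # The declaration comes first in <head>, ahead of the title,
        # so that no non-ASCII text precedes it.
        f'<meta http-equiv="Content-Type" content="text/html; charset={encoding}">\n'
        f"<title>{title}</title>\n"
        "</head>\n"
        f"<body><p>{body}</p></body>\n"
        "</html>\n"
    )
    # Encode with the same encoding the <meta> tag declares.
    return page.encode(encoding)


html_bytes = build_html_page("Prices", "Tea: \u00a32.50")
```

If the text contains a character that the chosen encoding cannot represent, encode raises UnicodeEncodeError, which is one reason numeric character references are useful.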

Numeric Character References

Both HTML and XML allow the author to specify a Unicode character by its codepoint. The syntax is:

  &#x00A3;

This is called a numeric character reference. The x marks the number as hexadecimal, and the 00A3 part is the Unicode codepoint of the desired character. This example specifies the codepoint U+00A3, which is the pound sign. So the code &#x00A3; is displayed like this in a document: £.

The leading zeros are optional, so the same character can be specified like this: &#xA3;. Without the x, the number is read as decimal, so &#163; also specifies the pound sign.
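The following sketch shows one way to produce and decode numeric character references in Python. The helper to_hex_reference is a hypothetical name introduced for this example; html.unescape and the "xmlcharrefreplace" error handler are part of the standard library. Note that the error handler emits the decimal form of the reference rather than the hexadecimal one.

```python
import html


def to_hex_reference(ch: str) -> str:
    """Return the hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):04X};"


pound = "\u00a3"
hex_ref = to_hex_reference(pound)  # "&#x00A3;"

# Both spellings decode to the same character.
decoded = [html.unescape("&#x00A3;"), html.unescape("&#xA3;")]

# Characters the target encoding cannot represent can be replaced by
# decimal references automatically while encoding.
ascii_bytes = "Cost: \u00a35".encode("ascii", errors="xmlcharrefreplace")
```

The resulting ascii_bytes contain only ASCII characters, so they can be placed in a document of any ASCII-compatible encoding without losing the pound sign.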

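HTML and XHTML also provide convenient names, called character entity references, for some commonly used characters. (Plain XML predefines only five: &lt;, &gt;, &amp;, &apos; and &quot;. Any other name must be declared in a DTD.) Here are some examples: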

Some HTML Character Entity References

  Code       Character  Name
  &eacute;   é          Latin small letter e with acute
  &copy;     ©          Copyright sign
  &deg;      °          Degree sign
  &raquo;    »          Right-pointing double angle quotation mark

A complete table is provided in the HTML 4 Recommendation.
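For readers who want to look these names up programmatically, Python's standard html.entities module exposes the HTML 4 entity names as a dictionary, name2codepoint. The sketch below prints the codepoint behind each name from the table above and decodes a made-up sample string.

```python
import html
from html.entities import name2codepoint

# Look up the codepoint behind each entity name from the table.
names = ["eacute", "copy", "deg", "raquo"]
for name in names:
    codepoint = name2codepoint[name]
    print(f"&{name};  U+{codepoint:04X}  {chr(codepoint)}")

# html.unescape replaces entity references with the characters they name.
text = html.unescape("caf&eacute; &copy; 20&deg;C &raquo;")
```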
