All HTML and XML documents are made up of Unicode characters, so any HTML or XML document may contain arbitrary Unicode text. (This is a good reason to use a dedicated XML parser library when appropriate, rather than writing home-brewed code that doesn't support the whole XML standard.)
If you are writing a program that generates an XML or HTML file, you should specify which encoding you are using. In XML, put a declaration like this on the first line:
<?xml version="1.0" encoding="ISO-8859-1"?>
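This is called an XML declaration, and the encoding="..." part of it is the encoding declaration. It belongs at the absolute beginning of the XML file; no comments or whitespace may precede it. (For more details, see section 4.3.3 of the XML specification.) In HTML, put a tag like this in the document head: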
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
The <meta> tag is allowed anywhere in the <head> of an HTML document, but the earlier the better.
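As a minimal sketch of the XML case, a program can let a library write the declaration and encode the text to match it, rather than assembling the first line by hand. The example below uses Python's standard `xml.etree.ElementTree` module; the element name `price`, the helper `build_price_document`, and the sample text are made up for illustration, and the `xml_declaration` argument assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET


def build_price_document(encoding: str = "ISO-8859-1") -> bytes:
    """Serialize a tiny document, letting the library write the declaration."""
    root = ET.Element("price")   # hypothetical element name
    root.text = "\u00a3 10"      # pound sign followed by an amount
    # xml_declaration=True asks ElementTree to emit the <?xml ...?> line
    # and to encode the character data in the same encoding it declares.
    return ET.tostring(root, encoding=encoding, xml_declaration=True)


xml_bytes = build_price_document()
first_line = xml_bytes.split(b"\n", 1)[0]
print(first_line.decode("ascii"))

# A parser that honors the declaration recovers the original text.
round_tripped = ET.fromstring(xml_bytes).text
print(round_tripped)
```

Because the declaration and the byte encoding come from the same call, they cannot drift apart, which is the main risk when the first line is written as a fixed string.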
Both HTML and XML allow the author to specify a Unicode character by its codepoint. The syntax is:
`&#x00A3;`
This is called a numeric character reference. The 00A3 part is the Unicode codepoint of the desired character. This example specifies the codepoint U+00A3, which is the pound sign. So the code `&#x00A3;` looks like this in a document: £.
The leading zeros are optional, so the same character can be specified like this: `&#xA3;`.
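The hexadecimal reference syntax described above can be produced and decoded with a few lines of Python. This is only a sketch: `to_hex_reference` is a hypothetical helper written for this example, while `html.unescape` and the `xmlcharrefreplace` error handler are part of the standard library. Note that `xmlcharrefreplace` writes the decimal form of the reference (`&#163;`), which names the same codepoint as the hexadecimal form.

```python
import html


def to_hex_reference(ch: str) -> str:
    """Return the hexadecimal numeric character reference for one character."""
    return f"&#x{ord(ch):04X};"


pound = "\u00a3"  # the pound sign, U+00A3

hex_ref = to_hex_reference(pound)  # "&#x00A3;"

# Encoding to ASCII with this error handler replaces every non-ASCII
# character with a decimal reference instead of raising an error.
ascii_bytes = pound.encode("ascii", errors="xmlcharrefreplace")  # b"&#163;"

# Decoding: the padded and unpadded forms name the same character.
decoded_padded = html.unescape("&#x00A3;")
decoded_short = html.unescape("&#xA3;")
```

Writing references this way is useful when the output encoding cannot represent a character directly, for example a pound sign in a file that must stay pure ASCII.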
HTML and XHTML also provide convenient nicknames for some commonly used characters. Here are some examples:
| Code | Character | Name |
|---|---|---|
| `&eacute;` | é | Latin small letter e with acute |
| `&copy;` | © | Copyright sign |
| `&deg;` | ° | Degree sign |
| `&raquo;` | » | Right-pointing double angle quotation mark |
A complete table is provided in the HTML 4 Recommendation.
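For a quick check of which character a nickname stands for, Python's standard library carries a copy of the HTML 4 entity table in `html.entities.name2codepoint`. The sketch below looks up the four names from the table above; the variable names are illustrative only.

```python
import html
from html.entities import name2codepoint

names = ("eacute", "copy", "deg", "raquo")

# Map each entity name to the character it stands for.
characters = {name: chr(name2codepoint[name]) for name in names}

for name, ch in characters.items():
    print(f"&{name};  {ch}  U+{ord(ch):04X}")

# html.unescape resolves the same names inside running text.
decoded = html.unescape("caf&eacute; &copy; 20&deg;")
```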