Web Character Encoding

Just a few notes. I have never yet read a web page that clearly explains character encoding for websites, and this one is no exception (I tried once before, but the more you delve, the more confusing it gets). Anyway...

ASCII

ASCII defines 128 characters for representing text in computers. Thirty-three of them (codes 0-31 and 127) are non-printing control characters that affect how text and space are processed. The remaining 95 (codes 32-126) are the space, letters, digits, punctuation marks, and a few miscellaneous symbols:

!"#$%&'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_
`abcdefghijklmnopqrstuvwxyz{|}~

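ASCII printable characters (codes 32-126):
1100001 = a (7-bit code, decimal 97) HTML &#97;
"Extended ASCII" tables (256 characters; the values here are from ISO 8859-1 / Windows-1252):
01010111 = W (8-bit byte, decimal 87; codes 0-127 are the same as ASCII) HTML &#87;
11101001 = é (8-bit byte, decimal 233; one of the added codes 128-255) HTML &#233;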
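
As a quick check, the bit patterns above can be reproduced in a few lines. This is only a sketch in Python (not a language this page otherwise uses). The helper names bits() and html_ref() are made up for illustration, and ISO 8859-1 is assumed as the "extended" table.

```python
def bits(byte_value: int, width: int = 8) -> str:
    """Return the value as a zero-padded binary string."""
    return format(byte_value, f"0{width}b")


def html_ref(char: str) -> str:
    """Return the HTML numeric character reference for one character."""
    return f"&#{ord(char)};"


# 'a' is code 97 and fits in 7 bits.
print(bits(ord("a"), 7), html_ref("a"))

# 'W' is code 87, so it is still plain ASCII even when written as 8 bits.
print(bits(ord("W")), html_ref("W"))

# 'é' is byte 233 in ISO 8859-1 (an assumption about which table is meant).
e_acute_byte = "é".encode("iso-8859-1")[0]
print(bits(e_acute_byte), html_ref("é"))
```

Note that W comes out as code 87, inside the 0-127 range, while é lands in 128-255.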


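Windows Notepad does not strictly create ASCII text. Older versions save as "ANSI" by default (Windows-1252 on Western systems), which matches ASCII only for codes 0-127; recent versions on Windows 10 and 11 default to UTF-8.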

Multi-byte encodings

To map characters for a language that uses more than 256 characters, one byte isn't enough. Using two bytes (16 bits) it's possible to encode 65,536 values:

10000001 01000000 (a Chinese character)
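
A small sketch of the arithmetic, in Python. It assumes the two bytes shown are GBK (the page does not say which double-byte encoding is meant), so treat the decoded character as illustrative only.

```python
# Two bytes give 2**16 distinct values.
two_byte_values = 2 ** 16

# The byte pair from the example above: 10000001 01000000.
pair = bytes([0b10000001, 0b01000000])

# Assumption: interpret the pair as GBK, one common double-byte
# encoding for Chinese. Other encodings may map it differently.
character = pair.decode("gbk")

print(two_byte_values)
print(pair.hex(), len(character))
```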

Two bytes still aren't enough. The Unicode standard defines 1,114,112 code points (U+0000 to U+10FFFF) that can be used for all sorts of letters and symbols. However, Unicode is not an encoding; UTF-8 is. UTF-8 is a variable-length encoding that uses one to four bytes per code point. If a character's code point is small enough (0-127, the ASCII range), UTF-8 encodes it with a single byte.
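
To see the variable length in practice, here is a minimal Python sketch that encodes four sample characters (chosen here for illustration, not taken from the original notes) and counts the bytes each one needs.

```python
samples = ["a", "é", "€", "😀"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Show the code point, the byte count, and the bytes in hex.
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# Byte counts per character, in the same order as samples.
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

# Code points run from U+0000 to U+10FFFF inclusive.
code_point_count = 0x10FFFF + 1
```

The ASCII letter takes one byte, and the byte count grows as the code point gets larger.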

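Unicode is a character 'repertoire': it gives each character a number but does not say how that number is stored as bytes. Older standards such as the ISO 8859 family (ISO 8859-1 for Western Europe), Mac OS Roman and Windows-1252 (for Western languages) are different: each defines a small repertoire together with a single-byte encoding for it.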

Handling encodings for web pages

Gar�bled te�xt is what happens when reading text using the wrong encoding. A computer (or web browser) must be told what encoding the text is in.
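
Both common kinds of garbling can be reproduced deliberately. The Python sketch below uses the sample word "café" and the UTF-8 / ISO 8859-1 pair purely as an illustration; any mismatched pair of encodings would do.

```python
original = "café"

# Written as UTF-8, then wrongly read as ISO 8859-1: every byte is valid
# in ISO 8859-1, so nothing fails, but 'é' turns into two characters.
utf8_read_as_latin1 = original.encode("utf-8").decode("iso-8859-1")

# Written as ISO 8859-1, then wrongly read as UTF-8: the lone byte 0xE9
# is not valid UTF-8, so a lenient decoder substitutes U+FFFD.
latin1_read_as_utf8 = original.encode("iso-8859-1").decode("utf-8", errors="replace")

print(utf8_read_as_latin1)
print(latin1_read_as_utf8)
```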

UTF-8 is binary compatible with ASCII, the de facto baseline for most encodings. However, PHP has no concept of characters or encodings: it only knows bytes that may or may not eventually be interpreted as characters by somebody else. Its only requirement is that source code is saved in an ASCII-compatible encoding. PHP source code can be saved in ISO-8859-1, Mac Roman, UTF-8 or any other ASCII-compatible encoding, and the string literals in a script will have whatever encoding the source file was stored in. Encoding only matters when manipulating strings or when systems are talking to each other:

A typical web publishing chain:
Web form input -> PHP
PHP -> text file (or database)
Text file (or database) -> PHP
PHP -> browser
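
One way to picture the chain is to decode bytes into text wherever data enters the program and encode again wherever it leaves. The sketch below is Python rather than PHP, uses a temporary file to stand in for storage, and the function names (receive_form, store, load, send_to_browser) are hypothetical.

```python
import os
import tempfile
from typing import Tuple

ENCODING = "utf-8"  # one agreed encoding for every hop in the chain


def receive_form(raw: bytes) -> str:
    """Web form input -> program: decode the submitted bytes."""
    return raw.decode(ENCODING)


def store(text: str, path: str) -> None:
    """Program -> text file: encode explicitly when writing."""
    with open(path, "w", encoding=ENCODING) as handle:
        handle.write(text)


def load(path: str) -> str:
    """Text file -> program: decode with the same encoding."""
    with open(path, "r", encoding=ENCODING) as handle:
        return handle.read()


def send_to_browser(text: str) -> Tuple[str, bytes]:
    """Program -> browser: declare the encoding and send matching bytes."""
    header = f"Content-Type: text/html; charset={ENCODING}"
    return header, text.encode(ENCODING)


submitted = "naïve café".encode(ENCODING)  # stands in for a form POST body

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "note.txt")
    store(receive_form(submitted), path)
    header, body = send_to_browser(load(path))

print(header)
print(body == submitted)
```

Because every hop names the same encoding, the bytes that reach the browser match the bytes that were submitted.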

The document character set for HTML is ISO 10646 (the Universal Character Set, which shares its repertoire with Unicode), but it needs an encoding to go with it. UTF-8 can represent any character in the ISO 10646 repertoire and is the most widely supported encoding in text editors and other publishing tools.

Some basics:

(1) Set up the web server (if Apache) to send charset=UTF-8 in its Content-Type headers with the following line in an .htaccess file: AddDefaultCharset UTF-8

(2) In HTML5 document heads include: <meta charset="utf-8">

(3) Clean input by specifying the character encoding to be used in form submission: add accept-charset="UTF-8" as an attribute of the form element (alongside action, not inside it).

(4) Clean output from PHP with PHP's mb_convert_encoding() function, passing the original encoding explicitly (this only works if you know what it is).

(5) Save local documents in a text editor with encoding set to UTF-8.

(6) Don't use PHP's encoding functions without knowing exactly why and what they actually do.
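
Points (4) and (6) amount to: convert only when the source encoding is known, and check rather than guess. A rough Python analogue follows (this is not PHP's mb_convert_encoding(), just the same idea). The helper names to_utf8() and is_valid_utf8() are hypothetical, and Windows-1252 is an assumed source encoding.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


# Pretend these bytes came from an old Windows-1252 file.
legacy = "Übergröße".encode("windows-1252")

print(is_valid_utf8(legacy))
converted = to_utf8(legacy, "windows-1252")
print(is_valid_utf8(converted), converted.decode("utf-8"))
```

Running the conversion twice, or with the wrong source encoding, would not raise an error here but would silently produce garbled text, which is the reason for point (6).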

Page last modified: November 30, 2014