All About Unicode, UTF8 & Character Sets

This is a story that dates back to the earliest days of computers. The story has a plot, well, sort of. It has competition and intrigue, as well as traversing oodles of countries and languages. There is conflict and resolution, and a happyish ending. But the main focus is the characters — 110,116 of them. By the end of the story, they will all find their own unique place in this world.

This story (or article, as known on Smashing Magazine) will follow a few of those characters more closely, as they journey from Web server to browser, and back again. Along the way, you’ll find out more about the history of characters, character sets, Unicode and UTF-8, and why question marks and odd accented characters sometimes show up in databases and text files.

Warning: this article contains lots of numbers, including a bit of binary – best approached after your morning cup of coffee.

ASCII

Computers only deal in numbers and not letters, so it’s important that all computers agree on which numbers represent which letters.

Let’s say my computer used the number 1 for A, 2 for B, 3 for C, etc and yours used 0 for A, 1 for B, etc. If I sent you the message HELLO, then the numbers 8, 5, 12, 12, 15 would whiz across the wires. But for you 8 means I, so you would receive and decode it as IFMMP. To communicate effectively, we would need to agree on a standard way of encoding the characters.
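The mix-up is easy to reproduce in Javascript (a quick sketch; the two made-up encodings are the A=1 and A=0 schemes above):

```javascript
// My computer: A=1, B=2, ... so each letter's code is its position in the alphabet
var codes = 'HELLO'.split ('').map (function (c) {
  return c.charCodeAt (0) - 64;   // A is 65 in Javascript's own encoding
});
console.log (codes.join (', ')); // 8, 5, 12, 12, 15

// Your computer: A=0, B=1, ... so every number is decoded one letter too far
var decoded = codes.map (function (n) {
  return String.fromCharCode (65 + n);
}).join ('');
console.log (decoded);           // IFMMP
```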

To this end, in the 1960s the American Standards Association created a 7-bit encoding called the American Standard Code for Information Interchange (ASCII). In this encoding HELLO is 72, 69, 76, 76, 79 and would be transmitted digitally as 1001000 1000101 1001100 1001100 1001111. Using 7 bits gives 128 possible values from 0000000 to 1111111, so ASCII has enough room for all lower case and upper case Latin letters, along with each numerical digit, common punctuation marks, spaces, tabs and other control characters. In 1968, US President Lyndon Johnson made it official – all computers must use and understand ASCII.

Trying It Yourself

There are plenty of ASCII tables available, displaying or describing the 128 characters. Or you can make one of your own with a little bit of CSS, HTML and Javascript, most of which is to get it to display nicely:

<html>
<body>
<style type="text/css">p {float: left; padding: 0 15px; margin: 0; font-size: 80%;}</style>
<script type="text/javascript">
for (var i=0; i<128; i++) document.writeln ((i%32?'':'<p>') + i + ': ' + String.fromCharCode (i) + '<br>');
</script>
</body>
</html>

This will display a table like this:

Do-It-Yourself Javascript ASCII table viewed in Firefox

The most important bit of this is the Javascript String.fromCharCode function. It takes a number and turns it into a character. In fact, the following four lines of HTML and Javascript all produce the same result. They all get the browser to display character numbers 72, 69, 76, 76 and 79:

HELLO
&#72;&#69;&#76;&#76;&#79;
<script>document.write ("HELLO");</script>
<script>document.write (String.fromCharCode (72,69,76,76,79));</script>

Also notice how Firefox displays the unprintable characters (like backspace and escape) in the first column. Some browsers show blanks or question marks. Firefox squeezes four hexadecimal digits into a small box.

The Eighth Bit

Teleprinters and stock tickers were quite happy sending 7 bits of information to each other. But the new fangled microprocessors of the 1970s preferred to work with powers of 2. They could process 8 bits at a time and so used 8 bits (aka a byte or octet) to store each character, giving 256 possible values.

An 8 bit character can store a number up to 255, but ASCII only assigns up to 127. The other values from 128 to 255 are spare. Initially, IBM PCs used the spare slots to represent accented letters, various symbols and shapes and a handful of Greek letters. For instance, number 200 was the lower left corner of a box: ╚, and 224 was the Greek letter alpha in lower case: α. This way of encoding the letters was later given the name code page 437.

However, unlike ASCII, characters 128-255 were never standardized, and various countries started using the spare slots for their own alphabets. Not everybody agreed that 224 should display α, not even the Greeks. This led to the creation of a handful of new code pages. For example, in Russian IBM computers using code page 855, 224 represents the Cyrillic letter Я. And in Greek code page 737, it is lower case omega: ω.

Even then there was disagreement. From the 1980s Microsoft Windows introduced its own code pages. In the Cyrillic code page Windows-1251, 224 represents the Cyrillic letter a, and Я is at 223.

Starting in the late 1980s, an attempt at standardization was made. Fifteen different 8 bit character sets were created to cover many different alphabets such as Cyrillic, Arabic, Hebrew, Turkish, and Thai. They are called ISO-8859-1 up to ISO-8859-16 (number 12 was abandoned). In the Cyrillic ISO-8859-5, 224 represents the letter р, and Я is at 207.

So if a Russian friend sends you a document, you really need to know what code page it uses. The document by itself is just a sequence of numbers. Character 224 could be Я, a or р. Viewed using the wrong code page, it will look like a bunch of scrambled letters and symbols.

(The situation isn’t quite as bad when viewing Web pages – as Web browsers can usually detect a page’s character set based on frequency analysis and other such techniques. But this is a false sense of security – they can and do get it wrong.)

Trying It Yourself

Code pages are also known as character sets. You can explore these character sets yourself, but you have to use PHP or a similar server side language this time (roughly because the character needs to be in the page before it gets to the browser). Save these lines in a PHP file and upload it to your server:

<html>
<head>
<meta charset="ISO-8859-5">
</head>
<body>
<style type="text/css">p {float: left; padding: 0 15px; margin: 0; font-size: 80%;}</style>
<?php  for ($i=0; $i<256; $i++) echo ($i%32?'':'<p>') . $i . ': ' . chr ($i) . '<br>'; ?>
</body>
</html>

This will display a table like this:

Cyrillic character set ISO-8859-5 viewed in Firefox

The PHP function chr does a similar thing to Javascript’s String.fromCharCode. For example chr(224) embeds the number 224 into the Web page before sending it to the browser. As we’ve seen above, 224 can mean many different things. So, the browser needs to know which character set to use to display the 224. That’s what the <meta> tag in the head is for. It tells the browser to use the Cyrillic character set ISO-8859-5:

<meta charset="ISO-8859-5">

If you exclude the charset line, then it will display using the browser’s default. In countries with Latin-based alphabets (like the UK and US), this is probably ISO-8859-1, in which case 224 is an a with grave accent: à. Try changing this line to ISO-8859-7 or Windows-1251 and refresh the page. You can also override the character set in the browser. In Firefox go to View > Character Encoding. Swap between a few to see what effect it has. If you try to display more than 256 characters, the sequence will repeat.

Summary Circa 1990

This is the situation in about 1990. Documents can be written, saved and exchanged in many languages, but you need to know which character set they use. There is also no easy way to use two or more non-English alphabets in the same document, and alphabets with more than 256 characters like Chinese and Japanese have to use entirely different systems.

Finally, the Internet is coming! Internationalization and globalization is about to make this a much bigger issue. A new standard is required.

Unicode To The Rescue

Starting in the late 1980s, a new standard was proposed – one that would assign a unique number (officially known as a code point) to every letter in every language, one that would have way more than 256 slots. It was called Unicode. It is now in version 6.1 and consists of over 110,000 code points. If you have a few hours to spare you can watch them all whiz past.

The first 128 Unicode code points are the same as ASCII. The range 128-255 contains currency symbols and other common signs and accented characters (aka characters with diacritical marks), and much of it is borrowed from ISO-8859-1. After 256 there are many more accented characters. After 880 it gets into Greek letters, then Cyrillic, Hebrew, Arabic, Indic scripts, and Thai. Chinese, Japanese and Korean start from 11904 with many others in between.

This is great – no more ambiguity – each letter is represented by its own unique number. Cyrillic Я is always 1071 and Greek α is always 945. 224 is always à, and H is still 72. Note that these Unicode code points are officially written in hexadecimal preceded by U+. So the Unicode code point for H is usually written as U+0048 rather than 72 (to convert from hexadecimal to decimal: 4*16+8=72).
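Javascript can translate between the two notations in a line or two (a quick sketch):

```javascript
// From U+ hexadecimal notation to a decimal code point and back
var decimal = parseInt ('0048', 16);  // the hex digits of U+0048: 4*16 + 8
console.log (decimal);                              // 72
console.log (String.fromCharCode (decimal));        // H
console.log (decimal.toString (16).toUpperCase ()); // 48
console.log ('Я'.charCodeAt (0));                   // 1071, i.e. U+042F
```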

The major problem is that there are more than 256 of them. The characters will no longer fit into 8 bits. However Unicode is not a character set or code page. So officially that is not the Unicode Consortium’s problem. They just came up with the idea and left someone else to sort out the implementation. That will be discussed in the next two sections.

Unicode Inside The Browser

Unicode does not fit into 8 bits, not even into 16. Although only 110,116 code points are in use, it has the capability to define up to 1,114,112 of them, which would require 21 bits.

However, computers have advanced since the 1970s. An 8 bit microprocessor is a bit out of date. New computers now have 64 bit processors, so why can’t we move beyond an 8 bit character and into a 32 bit or 64 bit character?

The first answer is: we can!

A lot of software is written in C or C++, which support a “wide character”. This is usually a 32 bit character called wchar_t (on some platforms it is only 16 bits). It is an extension of C’s 8 bit char type. Internally, modern Web browsers use these wide characters (or something similar) and can theoretically quite happily deal with over 4 billion distinct characters. This is plenty for Unicode. So – internally, modern Web browsers use Unicode.

Trying It Yourself

The Javascript code below is similar to the ASCII code above, except it goes up to a much higher number. For each number, it tells the browser to display the corresponding Unicode code point:

<html>
<body>
<style type="text/css">p {float: left; padding: 0 15px; margin: 0; font-size: 80%;}</style>
<script type="text/javascript">
for (var i=0; i<2096; i++)
  document.writeln ((i%256?'':'<p>') + i + ': ' + String.fromCharCode (i) + '<br>');
</script>
</body>
</html>

It will output a table like this:

A selection of Unicode code points viewed in Firefox

The screenshot above only shows a subset of the first few thousand code points output by the Javascript. The selection includes some Cyrillic and Arabic characters, displayed right-to-left.

The important point here is that Javascript runs completely in the Web browser where 32 bit characters are perfectly acceptable. The Javascript function String.fromCharCode(1071) outputs the Unicode code point 1071 which is the letter Я.

Similarly if you put the HTML entity &#1071; into an HTML page, a modern Web browser would display Я. Numerical HTML entities refer to Unicode code points.

On the other hand, the PHP function chr(1071) would output a forward slash / because the chr function only deals with 8 bit numbers from 0 to 255 and repeats itself after that, and 1071%256 = 47, which has been a / since the 1960s.

UTF-8 To The Rescue

So if browsers can deal with Unicode in 32 bit characters, where is the problem? The problem is in the sending and receiving, and reading and writing of characters.

The problem remains because:

  1. A lot of existing software and protocols send/receive and read/write 8 bit characters
  2. Using 32 bits to send/store English text would quadruple the amount of bandwidth/space required

Although browsers can deal with Unicode internally, you still have to get the data from the Web server to the Web browser and back again, and you need to save it in a file or database somewhere. So you still need a way to make 110,000 Unicode code points fit into just 8 bits.

There have been several attempts to solve this problem, such as UCS-2 and UTF-16. But the winner in recent years is UTF-8, which stands for Universal Character Set Transformation Format 8 bit.

UTF-8 is clever. It works a bit like the Shift key on your keyboard. Normally when you press the H on your keyboard a lower case “h” appears on the screen. But if you press Shift first, a capital H will appear.

UTF-8 treats numbers 0-127 as ASCII, 192-247 as Shift keys, and 128-191 as the keys to be shifted. For instance, characters 208 and 209 shift you into the Cyrillic range. 208 followed by 175 is character 1071, the Cyrillic Я. The exact calculation is (208%32)*64 + (175%64) = 1071. Characters 224-239 are like a double shift. 226 followed by 190 and then 128 is character 12160: ⾀. 240 and over is a triple shift.
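The shift arithmetic above can be written out as a short Javascript function (a rough sketch that handles only the one, two and three byte cases just described, with no error checking):

```javascript
// Decode a sequence of UTF-8 byte values into a string, using the
// modular arithmetic described in the text
function utf8Decode (bytes) {
  var result = '';
  for (var i = 0; i < bytes.length; ) {
    var b = bytes[i];
    if (b < 128) {          // 0-127: plain ASCII, no shift
      result += String.fromCharCode (b); i += 1;
    } else if (b < 224) {   // 192-223: a single shift, two bytes
      result += String.fromCharCode ((b % 32) * 64 + (bytes[i+1] % 64)); i += 2;
    } else {                // 224-239: a double shift, three bytes
      result += String.fromCharCode ((b % 16) * 4096 +
        (bytes[i+1] % 64) * 64 + (bytes[i+2] % 64)); i += 3;
    }
  }
  return result;
}

console.log (utf8Decode ([72]));            // H
console.log (utf8Decode ([208, 175]));      // Я (code point 1071)
console.log (utf8Decode ([226, 190, 128])); // ⾀ (code point 12160)
```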

UTF-8 is therefore a multi-byte variable-width encoding. Multi-byte because a single character like Я takes more than one byte to specify it. Variable-width because some characters like H take only 1 byte and some up to 4.

Best of all it is backward compatible with ASCII. Unlike some of the other proposed solutions, any document written only in ASCII, using only characters 0-127, is perfectly valid UTF-8 as well – which saves bandwidth and hassle.

Trying It Yourself

This is a different experiment. PHP embeds the 6 numbers mentioned above into an HTML page: 72, 208, 175, 226, 190, 128. The browser interprets those numbers as UTF-8, and internally converts them into Unicode code points. Then Javascript outputs the Unicode values. Try changing the character set from UTF-8 to ISO-8859-1 and see what happens:

<html>
<head>
<meta charset="UTF-8">
</head>
<body>
<p>Characters embedded in the page:<br>
<span id="chars"><?php echo chr(72).chr(208).chr(175).chr(226).chr(190).chr(128); ?></span>
<p>Character values according to Javascript:<br>
<script type="text/javascript">
function ShowCharacters (s) {var r=''; for (var i=0; i<s.length; i++)
  r += s.charCodeAt (i) + ': ' + s.substr (i, 1) + '<br>'; return r;}
document.writeln (ShowCharacters (document.getElementById('chars').innerHTML));
</script>
</body>
</html>

If you are in a hurry, this is what it will look like:

The sequence of numbers above shown using the UTF-8 character set

Same sequence of numbers shown using the ISO-8859-1 character set

If you display the page using the UTF-8 character set, you will see only 3 characters: HЯ⾀. If you display it using the character set ISO-8859-1, you will see six separate characters: HÐ¯â¾€. This is what is happening:

  1. On your Web server, PHP is embedding the numbers 72, 208, 175, 226, 190 and 128 into a Web page
  2. The Web page whizzes across the Internet from the Web server to your Web browser
  3. The browser receives those numbers and interprets them according to the character set
  4. The browser internally represents the characters using their Unicode values
  5. Javascript outputs the corresponding Unicode values

Notice that when viewed as ISO-8859-1 the first 5 numbers are the same (72, 208, 175, 226, 190) as their Unicode code points. This is because Unicode borrowed heavily from ISO-8859-1 in that range. The last number however, the euro symbol €, is different. It sits at position 128 (strictly speaking, that slot comes from Windows-1252, which browsers use when ISO-8859-1 is specified) but has the Unicode value 8364.
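You can reproduce the same two interpretations outside of a Web page with the TextDecoder API, available in modern browsers and in Node (a sketch; note that when asked for ISO-8859-1, TextDecoder follows browsers in actually using Windows-1252, which is why byte 128 comes out as €):

```javascript
// The six numbers from the PHP example, interpreted two different ways
var bytes = new Uint8Array ([72, 208, 175, 226, 190, 128]);
console.log (new TextDecoder ('utf-8').decode (bytes));      // HЯ⾀ - three characters
console.log (new TextDecoder ('iso-8859-1').decode (bytes)); // HÐ¯â¾€ - six characters
```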

Summary Circa 2003

UTF-8 is becoming the most popular international character set on the Internet, superseding the older single-byte character sets like ISO-8859-5. When you view or send a non-English document, you still need to know what character set it uses. For widest interoperability, website administrators need to make sure all their web pages use the UTF-8 character set.

Perhaps the Ð looks familiar – it will sometimes show up if you try to view Russian UTF-8 documents. The next section describes how character sets get confused and end up storing things wrongly in a database.

Lots Of Problems

As long as everybody is speaking UTF-8, this should all work swimmingly. If they aren’t, then characters can get mangled. To explain why, imagine a typical interaction with a website, such as a user making a comment on a blog post:

  1. A Web page displays a comment form
  2. The user types a comment and submits it
  3. The comment is sent back to the server and saved in a database
  4. The comment is later retrieved from the database and displayed on a Web page

This simple process can go wrong in lots of ways and produce the following types of problems:

HTML Entities

Pretend for a moment that you don’t know anything about character sets – erase the last 30 minutes from your memory. The form on your blog will probably display itself using the character set ISO-8859-1. This character set doesn’t know any Russian or Thai or Chinese, and only a little bit of Greek. If you attempt to copy and paste any of those languages into the form and press Submit, a modern browser will try to convert it into HTML numerical entities like &#1071; for Я.

That’s what will get saved in your database, and that’s what will be output when the comment is displayed – which means it will display fine on a Web page, but cause problems when you try to output it to a PDF or email, or run text searches for it in a database.

Confused Characters

How about if you operate a Russian website, and you have not specified a character set in your Web page? Imagine a Russian user whose default character set is ISO-8859-5. To say “hi”, they might type Привет. When the user presses Submit, the characters are encoded according to the character set of the sending page. In this case, Привет is encoded as the numbers 191, 224, 216, 210, 213 and 226. Those numbers will get sent across the Internet to the server, and saved like that into a database.

If somebody later views that comment using ISO-8859-5, they will see the correct text. But if they view using a different Russian character set like Windows-1251, they will see їаШТХв. It’s still Russian, but makes no sense.

Accented Characters with Lots of Vowels

If someone views the same comment using ISO-8859-1, they will see ¿àØÒÕâ instead of Привет. A longer phrase like Я тоже рада Вас видеть (“nice to see you” in a formal way to a female), submitted as ISO-8859-5, will show up in ISO-8859-1 as Ï âÞÖÕ àÐÔÐ. It looks like that because the 128-255 range of ISO-8859-1 contains lots of vowels with accents.

So if you see this sort of pattern, it’s probably because text has been entered in a single byte character set (one of the ISO-8859s or Windows ones) and is being displayed as ISO-8859-1. To fix the text, you’ll need to figure out which character set it was entered as, and resubmit it as UTF-8 instead.

Alternating Accented Characters

What if the user submitted the comment in UTF-8? In that case the Cyrillic characters which make up the word Привет would each get sent as 2 numbers: 208/159, 209/128, 208/184, 208/178, 208/181 and 209/130. If you viewed that in ISO-8859-1 it would look like: ÐŸÑ€Ð¸Ð²ÐµÑ‚.

Notice that every other character is a Ð or Ñ. Those characters are numbers 208 and 209, and they tell UTF-8 to switch to the Cyrillic range. So if you see a lot of Ð and Ñ, you can assume that you are looking at Russian text entered in UTF-8, viewed as ISO-8859-1. Similarly, Greek will have lots of Î and Ï, 206 and 207. And Hebrew has alternating ×, number 215.
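Once you have recognized this pattern, the double encoding can often be reversed with a script. Below is a sketch in Node: it assumes the mangled text really is UTF-8 that was displayed as ISO-8859-1, and it only works where the mangled characters are in the true ISO-8859-1 range (characters like € or Ÿ, which come from Windows-1252, would need a fuller mapping):

```javascript
// Turn each mangled character back into the byte it came from, then
// decode those bytes as the UTF-8 they originally were
function fixMojibake (mangled) {
  return Buffer.from (mangled, 'latin1').toString ('utf8');
}

console.log (fixMojibake ('Ð¯')); // Я - bytes 208 and 175, read as UTF-8
console.log (fixMojibake ('Â£')); // £ - bytes 194 and 163, read as UTF-8
```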

Vowels Before a Pound and Copyright Sign

A very common issue in the UK is the currency symbol £ getting converted into Â£. This is exactly the same issue as above with a coincidence thrown in to add confusion. The £ symbol has the Unicode and ISO-8859-1 value of 163. Recall that in UTF-8 any character over 127 is represented by a sequence of two or more numbers. In this case, the UTF-8 sequence is 194/163. Mathematically, this is because (194%32)*64 + (163%64) = 163.

Visually it means that if you view the UTF-8 sequence using ISO-8859-1, it appears to gain a Â, which is character 194 in ISO-8859-1. The same thing happens for all Unicode code points 161-191, which includes © and ® and ¥.

So if your £ or © suddenly inherit a Â, it is because they were entered as UTF-8.

Black Diamond Question Marks

How about the other way around? If you enter Привет as ISO-8859-5, it will get saved as the numbers shown above: 191, 224, etc. If you then try to view this as UTF-8, you may well see lots of question marks inside black diamonds: �. The browser displays these when it can’t make sense of the numbers it is reading.

UTF-8 is self-synchronizing. Unlike other multi-byte character encodings, you always know where you are with UTF-8. If you see a number 192-247, you know you are at the beginning of a multi-byte sequence. If you see 128-191, you know you are in the middle of one. There’s no danger of missing the first number and garbling the rest of the text.

This means that in UTF-8, the sequence 191 followed by 224 will never occur naturally, so the browser doesn’t know what to do with it and displays �� instead.

This can also cause £ and © related problems. £50 in ISO-8859-1 is the numbers 163, 53 and 48. The 53 and 48 cause no issues, but in UTF-8, 163 can never occur by itself, so this will show up as �50. Similarly if you see �2012, it is probably because ©2012 was input as ISO-8859-1 but is being displayed as UTF-8.

Blanks, Question Marks and Boxes

Even if they are fully up-to-speed with UTF-8 and Unicode, a browser still may not know how to display a character. The ASCII characters 1-31 are mostly control codes for teleprinters (things like Acknowledge and Stop). If you try to display them, a browser might show a ?, a blank, or a box with tiny numbers inside it.

Also, Unicode defines over 110,000 characters. Your browser may not have the correct font to display all of them. Some of the more obscure characters may also get shown as ? or blank or a small box. In older browsers, even fairly common non-English characters may show as boxes.

Older browsers may also behave differently for some of the issues above, showing ? and blank boxes more often.

Databases

The discussion above has avoided the middle step in the process – saving data to a database. Databases like MySQL can also specify a character set for a database, table or column. But it is less important than the Web page’s character set.

When saving and retrieving data, MySQL deals just with numbers. If you tell it to save number 163, it will. If you give it 208/159 it will save those two numbers. And when you retrieve the data, you’ll get the same two numbers back.

The character set becomes more important when you use database functions to compare, convert and measure the data. For example, the LENGTH of a field may depend on its character set, as do string comparisons using LIKE and =. The method used to compare strings is called a collation.
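The underlying reason is that with UTF-8 the number of bytes and the number of characters are no longer the same thing. You can see the same distinction in Javascript with TextEncoder, available in modern browsers and in Node (a quick sketch):

```javascript
// The same string measured two ways: characters versus UTF-8 bytes
var word = 'Привет';
console.log (word.length);                             // 6 characters
console.log (new TextEncoder ().encode (word).length); // 12 bytes in UTF-8
```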

Character sets and collations in MySQL are an in-depth subject. It’s not simply a case of changing the character set of a table to UTF-8. There are further SQL commands to take into account to make sure the data goes in and out in the right format as well. This blog post is a good starting point.

Trying It Yourself

The following PHP and Javascript code allows you to experiment with all these issues. You can specify which character set is used to input and output text, and you can see what the browser thinks about it too.

<?php
$charset = isset($_POST['charset']) ? $_POST['charset'] : 'ISO-8859-1';
$string = isset($_POST['string']) ? $_POST['string'] : '';
if ($string) {
        echo '<p>This is what PHP thinks you entered:<br>';
        for ($i=0; $i<strlen($string); $i++) {$c=substr ($string,$i,1); echo ord ($c).': '.$c.' <br/>';}
}
?>
<html>
<head>
<meta charset="<?php echo $charset?>">
</head>
<body>
<form method="post">
<input name="lastcharset" type="hidden" value="<?php echo $charset?>"/>
Form was submitted as: <?php echo $_POST['lastcharset']?><br/>
Text is displayed as: <?php echo $charset?><br/>
Text will be submitted as: <?php echo $charset?><br/>
Copy and paste or type here:
<input name="string" type="text" size="20" value="<?php echo $string?>"/><br/>
Next page will display as:
<select name="charset"><option>ISO-8859-1<option>ISO-8859-5
<option>Windows-1251<option>ISO-8859-7<option>UTF-8</select><br/>
<input type="submit" value="Submit" onclick="ShowCharacters (this.form.string.value); return 1;"/>
</form>
<script type="text/javascript">
function ShowCharacters (s) {
  var r='You entered:';
  for (var i=0; i<s.length; i++) r += '\n' + s.charCodeAt (i) + ': ' + s.substr (i, 1);
  alert (r);
}
</script>
</body>
</html>

This is an example of the code in action. The numbers at the top are the numerical values of each of the characters and their representation (when viewed individually) in the current character set:

Example of inputting and output in different character sets. This shows a £ sign turning into a � in Google Chrome.

The page above shows the previous, current and future character sets. You can use this code to quickly see how text can get really mangled. For example, if you pressed Submit again above, the � has Unicode code point 65533, which is 239/191/189 in UTF-8 and will be displayed as ï¿½50 in ISO-8859-1. So if you ever get £ symbols turning into ï¿½, that is probably the route they took.

Note that the select box at the bottom will change back to ISO-8859-1 each time.

One Solution

All the encoding problems above are caused by text being submitted in one character set and viewed in another. The solution is to make sure that every page on your website uses UTF-8. You can do this with one of these lines immediately after the <head> tag:

<meta charset="UTF-8">
<meta http-equiv="Content-type" content="text/html; charset=UTF-8">

It has to be one of the first things in your Web page, as it will cause the browser to look again at the page in a whole new light. For speed and efficiency, it should do this as soon as possible.

You can also specify UTF-8 in your MySQL tables, though to fully use this feature, you’ll need to delve deeper.

Note that users can still override the character set in their browsers. This is rare, but does mean that this solution is not guaranteed to work. For extra safety, you could implement a back-end check to ensure data is arriving in the correct format.

Existing Websites

If your website has already been collecting text in a variety of languages, then you will also need to convert your existing data into UTF-8. If there is not much of it, you can use a PHP page like the one above to figure out the original character set, and use the browser to convert the data into UTF-8.

If you have lots of data in various character sets, you’ll need to first detect the character set and then convert it. In PHP you can use mb_detect_encoding to detect and iconv to convert. Reading the comments for mb_detect_encoding, it looks like quite a fussy function, so be sure to experiment to make sure you are using it properly and getting the right results.
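A rough Javascript equivalent of the detection step is to attempt a strict UTF-8 decode and see whether it fails (a sketch; like mb_detect_encoding, it can only tell you that the bytes are valid UTF-8, not that they were meant as UTF-8):

```javascript
// Return true if a sequence of byte values is valid UTF-8
function looksLikeUtf8 (bytes) {
  try {
    new TextDecoder ('utf-8', {fatal: true}).decode (new Uint8Array (bytes));
    return true;
  } catch (e) {
    return false;   // the strict decoder rejected an invalid sequence
  }
}

console.log (looksLikeUtf8 ([208, 159, 209, 128])); // true - Пр in UTF-8
console.log (looksLikeUtf8 ([191, 224]));           // false - Пр in ISO-8859-5
```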

A potentially misleading function is utf8_decode. It turns UTF-8 into ISO-8859-1. Any characters not available in ISO-8859-1 (like Cyrillic, Greek, Thai, etc) are turned into question marks. It’s misleading because you might have expected more from it, but it does the best it can.

Summary

This article has relied heavily on numbers and has tried to leave no stone unturned. Hopefully it has provided an exhaustive understanding of character sets, Unicode, UTF-8 and the various problems that can arise. The morals of the story are:

  • You need to know the character set in order to make sense of non-Latin text
  • Internally, browsers use Unicode to represent characters
  • Make sure all your Web pages specify the UTF-8 character set

For a slightly different approach to this subject, this 2003 character set article is excellent. Thank you for sticking with this epic journey.

Paul Tero is an experienced PHP programmer and server administrator. He developed the Stockashop ecommerce system in 2005 for Sensable Media. He now works part-time maintaining and developing Stockashop, and the rest of the time freelancing from a corner of his living room, and sleeping, eating, having fun, etc. He has also written numerous other open sourcish scripts and programs.

  1. 1

    François "cahnory" Germain

    June 6, 2012 5:24 am

    Interesting but I think it’s more focused on how utf-8 works than how to use it. Here is an article I wrote on “using utf-8″ from getting data to output it (in french):
    http://cahnory.tumblr.com/post/17108999879/utf-8-comment-ca-marche
    I think it could clarify the last part of your article.

    0
  2. 3

    A few of useful sites that help you explore unicode:

    http://unicodinator.com/ – Unicodinator, a visual unicode navigator
    http://graphemica.com/ – For people who ♥ letters, numbers, punctuation, & co.
    http://unicode.martinleopold.com/ – Explore Unicode

    0
  3. 4

    Great article. I wish I had it a year ago!

    Here’s something important to watch our for. So you can have all web elements displaying info in UTF-8. That includes pages to add/edit content. It all looks fine and displays fine. But when you look at the raw data in the database, they characters are messed up. But somehow it still works.

    If using MySQL, you can have a non-utf-8 database, connect to it using your normal connection strings and in the front end it will all seem fine.

    Even if you convert your databases to UTF-8 this will be the case. The key is in the MySQL connection.

    Regularly, this is all you need (using plain old mysql_connect).
    mysql_connect($host,$user,$password) OR DIE(“Unable to connect to database”);
    @mysql_select_db($away_db) or die(“Unable to select database..”);

    However if you’re using UTF-8, you MUST append the following 2 lines.
    @mysql_query(“SET NAMES ‘utf8′”);
    @mysql_query(“SET CHARACTER_SET utf8″);

    Do this, and your characters will display beautifully throughout your system, including in the database. I hope it helps someone who finds themselves stuck as I was a year ago.

    0
    • 5

      Matt Moore, your solution of placing the command:

      @mysql_query(“SET NAMES ‘utf8′”);

      in my PHP code made everything work like a charm.

      Thanks a lot!

      0
  4. 6

    “All About Unicode, UTF8 & Character Sets”

    Am I the only one who gets excited by that title?

    0
  5. 7

    Using UTF-8 to store and transmit stuff on the web is great, in fact I’d say it’s essential nowadays. However it is also important to use a font that will be able to display the Unicode characters you’re publishing, most essentially if you’re working with Southeast Asian languages or other non-Latin character sets.

    There are a number of issues to work around here. First of all, there are almost no fonts out there that actually cover the full range of Unicode characters, any font that did would be insanely large (at a guess, I’d say 10s to 100s of MB). So you have to pick a font that has, at minimum, all of the characters you’re trying to display. Such a font could potentially still get too large for @font-face embedding (the smallest usable Simplified Chinese font I’ve seen was something like 1.5MB).

    In these cases it’s useful to determine whether your users already have a font installed on their machine that will support the language you’re displaying, and if so what font (or fonts) is likely to serve this purpose? This may seem like a step backwards these days but it’s still a practical alternative to forcing multiple-MB downloads to every user of your site.

    If you do decide to embed a font using @font-face to provide your “exotic character” support, be wary of services like Font Squirrel. Their (admittedly great) optimisation routine explicitly strips out non-essential non-Latin characters. So if you need to convert to EOT/WOFF or strip out characters, I’d suggest doing it yourself using the open source tools available for the purpose.

    Then of course you’ve got issues with older browsers (looking at you, IE). Modern browsers typically have a “fallback font” capable of displaying *most* UTF-8 characters when a suitable glyph is not found in the fonts specified in CSS’s “font-family”. Browsers like IE, however, can only use a single font for each element. In effect, if that font lacks a certain character, IE is incapable of falling back to a character from the next font in the “font-family” chain. It will simply display a box or equivalent “I can’t display this” character. Another terrible IE feature is that any embedded @font-face fonts will always be downloaded, even if that font face is never applied to any element. Thanks!

    There are many other issues surrounding display of non-Latin characters on the web, but the ones above were the core issues that I encountered as part of a team developing a news site for Southeast Asia and Pacific island nations.

    • 8

      I’m glad you made this remark. This article is wonderfully comprehensive; however, it definitely is missing the font piece. You could follow all the advice in the article, and without the right font you’re still getting nowhere.

      In my experience it depends on the control you have over your content. The more control you have, the more font choices you have. If you have limited control, for example on a user-generated content site, you are extremely limited in the fonts you can choose to represent UTF-8 data. In my experience, Arial is a safe choice in that case.

  6. 9

    And now there is UTF-16 and UTF-32. *suicide*

  7. 10

    Setting the character set in the HTML header actually is NOT the best way to do this, since the browser has already received some text by then. The proper way is to send the character set in the HTTP Content-Type response header.

  8. 11

    That was character information overload. My big takeaway is why I need a UTF-8 meta tag in my web pages. Thanks for the research.

  9. 12

    This is funny. I see “?” in the whole article where you give examples of foreign characters. I am using Chrome on windows.

    Thanks for the article.

    • 13

      I do too, Chrome on OS X. Same in Safari and Firefox.

      The irony, it burns.

    • 14

      Same issue here, but on Firefox 13 and on a Linux box.

    • 15

      I think that’s because we don’t have fonts installed for those languages

    • 16

      My guess is the article was copy-pasted in a non-Unicode-compliant way. Especially ironic given the content of the article.

    • 17

      That is very ironic. The article was written directly into WordPress and I checked it many times while writing it. So it must be something in the publishing – I’ll ask. Sorry about that – though it is quite funny.

  10. 18

    Great article – many developers and designers from the US I know simply do not understand this problem, as they have never met anything other than those 127 ASCII characters. For those of us who use another native language, with characters like ěščřžýáíéúů, this is daily business. So very often I need to fix bugs in US scripts where the developer simply did not use multibyte-compatible functions, so they do not work as expected.
    There is also one other problem not mentioned – the characters must be supported by the font (and properly). E.g. of the >500 Google web fonts, only 180 report knowing even just extended Latin characters, and even among those, about 30% have some wrong or missing characters.
    P.S.: By the way – your Russian azbuka texts here are replaced with ??? – a nice example of what you are writing about.

  11. 19

    Please keep in mind that Unicode does not “just solve” all text-related problems. Read up on Unicode normalization and why “ä” != “ä” ("\u00E4" != "\u0061\u0308"), depending on your favorite programming language.
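    A minimal Python illustration of the normalization point, using the standard unicodedata module:

```python
import unicodedata

composed = "\u00E4"          # ä as one precomposed code point
decomposed = "\u0061\u0308"  # a + combining diaeresis; renders identically

# The two spellings are different code point sequences...
assert composed != decomposed

# ...until both are normalized to the same form (NFC or NFD).
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```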

  12. 20

    Michael Kaiser

    June 8, 2012 9:35 pm

    If you want to demonstrate Greek characters, you should use a font that has Greek characters included (unlike the Proxima Nova from Typekit). And of course you should send Greek character codes, and not just U+003F, which is a question mark.
    But apart from that, this is a well-researched article.

  13. 21

    The best article on Unicode, UTF-8 and ASCII so far… this is simply perfect.

  14. 22

    An interesting MySQL fact is that the utf8 character set does not support all UTF-8 characters.
    You have to use the utf8mb4 character set instead.
    http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
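    The limitation shows up in plain byte counts: MySQL’s legacy utf8 stores at most three bytes per character, and a short Python check makes the difference visible:

```python
# Characters inside the Basic Multilingual Plane need at most
# three UTF-8 bytes, so they fit in MySQL's legacy "utf8".
assert len("€".encode("utf-8")) == 3          # U+20AC

# Code points above U+FFFF (emoji, rare CJK) need four bytes,
# which is why they require the utf8mb4 character set.
assert len("\U0001F600".encode("utf-8")) == 4  # 😀, U+1F600
```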

  15. 23

    Anders Floderus

    June 10, 2012 11:19 pm

    And there is more, like character entities. To display ISO-8859-1 characters 128–255 in a browser, you can code them as entities. You did show how to write them as numeric entities: for Latin capital letter A with diaeresis (Ä) you could write &#196;. (I don’t know how general that is.) But a more general way to write it is the named entity &Auml;.
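    The numeric entity &#196; and the named entity &Auml; decode to the same character, which Python’s html module can confirm:

```python
import html

# Numeric entity: code point 196 is Latin capital A with diaeresis.
assert html.unescape("&#196;") == "Ä"

# Named entity: the same character by its HTML name.
assert html.unescape("&Auml;") == "Ä"
```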

  16. 24

    Great! When I started creating web pages I soon saw the problem arising with the German ä’s and ö’s. So I looked around and then decided to use UTF-8. I didn’t think much about other solutions; I just wanted to see my pages displayed correctly wherever they are. Please encourage people to make their pages correct for ALL users! Thanks.

  17. 25

    This is great. The saying in the article referenced towards the end of yours is right: there are no plain text files any more. I would love to see more about big- and little-endian.

    Also, I really need to explore the business logic and data layers more. My site is fine at the website/presentation layer now; everything goes through fine. There are still issues in my back office, though.

  18. 26

    Snorri Kristjánsson

    June 11, 2012 3:46 am

    Great article – thanks for sharing.

  19. 27

    Hmm. I am afraid the bit around 8859-1 etc. is nothing like accurate. I should know – I proposed it initially many years ago…
    Originally, we dealt with 7-bit codes. The eighth bit was used as a parity check for those noisy old modems; even when we moved to digital, the US had a 56kbit/s path (8k sampling rate and 7 bits). So things like IBM’s code page 437 were designed so you could swap 7-bit code pages in and out, preserving the all-important control characters (they controlled the teletypes and, in band, devices like modems) irrespective of which 7-bit table was in use. ISO 2022 governed how you did this, and invented ‘escape sequences’ to control things (especially things with parameters, such as colour changes) as devices grew more complicated.

    So when we proposed 8859 as an 8-bit code, we needed an easy way to map in the old world of 7 bit. We did this by leaving the range 128–159 clear for extra ‘control’ characters – I was responsible for standardising Prestel, for example, which used this mechanism, and the actual code tables came primarily from contributions to and from Digital Equipment’s VT100 and VT220 VDUs. So 8859-1 (familiar to HTML users?) was the first version of this exciting new 8-bit code. Needless to say, Microsoft saw the gaps in the range 128–159 and, ignoring the standard, filled them in with text characters, and Windows 1252 (and a fair amount of hatred for MS’s ignorance, or deliberate ignoring of standards) was born.
    I then decided it might be a good idea to use the code extension for a better way to compress images in band, and guess what came next… JPEG…

    • 28

      Thank you for that explanation. I’m sorry I didn’t get it correct in the text. I did research the history, but spent most of my time on the rest of the article.

  20. 29

    Is it just me, or are the Unicode characters in this article not appearing?

    For example, where there is a reference to “omega”, I see a question mark.

  21. 30

    UTF-8 is not a character set. It is an *encoding* of the Unicode character set. I suspect that this error is responsible for much of the misunderstanding that surrounds an essentially simple idea.

    • 31

      Hmm, probably a bug – I’ll look into it. The intent is that (as in HTML) there should be no case-sensitivity for embed names. The plugin deliberately does not allow you to create two embeds with the same name but different cases (at least when they are global embeds; I can’t really prevent it with custom fields), but I guess there are still a couple of loopholes here and there. It should not matter what case you use. Thanks for the feedback, David. – Mike

  22. 32

    Thank you for all the comments, and the tips about MySQL and fonts. I focused on the characters in this article, but those two areas could be lengthy topics in their own right. I apologise for the ? issue. That is embarrassing and I hope to fix it very soon.

  23. 33

    JavaScript doesn’t deal with full Unicode well. It only handles 16-bit code units. String.fromCharCode(n) where n is greater than 0xFFFF doesn’t work so well. See the fromCharCode() documentation, and refer to the example function fixedFromCharCode().

    https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/String/fromCharCode
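    The fixedFromCharCode workaround boils down to splitting a code point above U+FFFF into a UTF-16 surrogate pair. The arithmetic can be sketched in Python (the function name here is just for illustration):

```python
def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if cp <= 0xFFFF:
        raise ValueError("fits in a single 16-bit code unit")
    cp -= 0x10000                   # leaves a 20-bit value
    high = 0xD800 + (cp >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (cp & 0x3FF)     # bottom 10 bits -> low surrogate
    return high, low

# U+1D306 becomes the pair D834 DF06 -- the two "char codes"
# JavaScript's String.fromCharCode needs to produce one character.
assert to_surrogate_pair(0x1D306) == (0xD834, 0xDF06)
```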

  24. 35

    It would be sinful not to leave a comment! I had a horrible experience with character sets a few years ago, and this would’ve been a real lifesaver if I had known about it back then.

    One thing I would love to follow up on, though, is the database ‘collation’ aspect of this topic. As with the remark above regarding fonts, I think anyone working on character sets would sooner or later bump into database collations.

    • 36

      When I started looking into database character sets and collations, it was more involved than I thought so I couldn’t include much in the article. The MySQL website explains collations pretty well: https://dev.mysql.com/doc/refman/5.5/en/charset-general.html. It’s about how strings are compared. So in a Greek case-insensitive collation, lower and upper case Alpha would match, but they wouldn’t in a Russian collation.
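      A rough Python analogue of a case-insensitive collation is comparing case-folded strings:

```python
# Greek capital and small alpha match once case is folded away,
# roughly how a Greek case-insensitive collation treats them.
assert "Α".casefold() == "α"   # U+0391 vs U+03B1

# The raw code points are still distinct characters.
assert "Α" != "α"
```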

  25. 37

    Thank You for the article, it’s clear and useful.

    Only a minor quibble about the C++ “wchar_t” type length. With different C++ compilers, I get different lengths for the “wchar_t” type:

    On 32-bit Visual Studio 2008 Express Edition C++ compiler, “sizeof(wchar_t)” returns 2, so it’s only 16-bit length and the article http://msdn.microsoft.com/en-us/library/s3f49ktz%28v=vs.90%29.aspx just confirmed it.

    On 32-bit Windows Vista, MinGW g++ 4.6.1 compiler “sizeof(wchar_t)” returns 2 also.

    On Fedora 8 and i386 g++ 4.1.2 compiler with default settings “sizeof(wchar_t)” returns 4, so it’s 32-bit length, but with “-fshort-wchar” option of the compiler “sizeof(wchar_t)” returns 2.

    It seems like the length of “wchar_t” depends on version/settings of a compiler.

  26. 38

    I often embed special Unicode characters directly in my HTML using the &#x… entity format. (Maybe that path, instead of what you suggest, has been my downfall.)

    I’m fascinated when I scroll down through the master reference at http://www.columbia.edu/kermit/utf8-t1.html. I was taken aback, though, to discover that a few of the (I thought) often-used characters posed problems for Opera Mini on my iPad. Some characters were simply displayed as blanks, but some (in the &#x9xxxx range) even led to rendering changes and formatting problems. So, I’m passing this on: be sure to test before deploying.

  27. 39

    One thing I was hoping this article would explain is what character encoding your operating system itself is using. When I edit a PHP or CSS file for example in a text editor and upload it to a server, is it encoded to UTF-8 by default or something? I know nobody puts @charset in their stylesheets anymore. For many years I had falsely assumed that when working in files it was always necessary to use HTML entities to avoid character encoding issues.

  28. 41

    If you’re talking about Unicode and PHP, it’s a good idea to mention never to use the standard string functions, but to always use their mb_* counterparts. So don’t use strlen(); use mb_strlen().
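    The character-count vs byte-count trap behind strlen() vs mb_strlen() exists in any language; a one-line Python illustration:

```python
word = "naïve"

# Five characters, but six UTF-8 bytes, because ï takes two.
# PHP's strlen() counts the bytes; mb_strlen() counts characters.
assert len(word) == 5
assert len(word.encode("utf-8")) == 6
```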

  29. 42

    I echo Benjamin’s sentiment! Hope one of you experts could give us a pointer on that. I understand that it’s important to specify which character set you are using – but how do I know myself if I’m just typing in Notepad?

  30. 43

    A fantastic description Paul!

    How I wish in 1968 the US President Lyndon Johnson had made it compulsory for all of the world that all computers must use and understand ASCII – for ever! We would then not have had all these problems. I prefer the simplistic attitude – if you want to deal/work with computers and cannot (or do not want to) speak English, then remain a “computer illiterate”.

    • 44

      I think you must be joking. I don’t think a modern Internet user could seriously hold an opinion as narrow-minded and offensive as that.

  31. 45

    We have held on to the 8-bit processor and its limited resources since the early days, like the Intel 8080 chip.
    Why has this not been expanded? We had just 8 IRQs, which were extended to 16 when a second controller was added to give us 16-bit operation – so why has it not moved on? Just to keep one software company happy.

  32. 46

    Our language project made the decision to use a somewhat rare symbol to display the glottal stop (ꞌ). It is fine for printed papers, books, and dictionaries. No one was thinking much about the internet’s transition from computer to mobile phone. Currently our character works great when browsing from a computer, but it is nowhere to be found in any of the standard fonts that come with Android and iOS. This is sad, as I notice that over half of the people liking our Facebook page are doing it from mobile browsers. This means that they are getting boxes for the glottal stop, which looks really ugly. I can’t hope that Android or iOS will make a change, but making the change for our language project will be painful and I don’t know if I can convince everyone involved.

    Oh and big thumbs down to Sreeni’s comment. Very ethnocentric. 14.1% of the world speaks Chinese as their native language. Whether you like it or not, the Chinese would have to go against your desires to keep everything ASCII. Add to that 4.46% Hindi speakers, 4.23% Arabic speakers, 3.05% Bengali speakers, 2.42% Russians, and 1.92% Japanese and you start to understand why this was such a big problem.

  34. 48

    I have a question. I am trying to upload a turnkey ebook store onto my existing website planforlifesuccess.com. The instructions say to import the database tables txt file and then select latin1 from the character-set-of-file list. However, latin1 is not given as an option in the list. Is another character set equivalent?

  35. 50

    The most thorough and complete article I have ever read on ‘tin ‘ter ‘net. This filled the gaps in my knowledge about character sets, especially the history.

    Thank you.

  36. 51

    Nice article! I’m trying to figure out if a persistent spammer on one of my sites is using something about character encoding to sneak spam through my email processor (PHP-based). I look for a string of numbers in the web form number field in my program, and block specific matches. However, the spam gets right through it!! For example (pseudocode):

    if ( trim($spamstring) NOT EQUALS "3425" ) then (send mail) else ignore;

    So, the code finds any matches on "3425" and will not send the message. However, the spam gets through even if $spamstring (apparently) matches "3425". I need to check how string matching is done in my system, but right now this is a curious little puzzle. Any thoughts appreciated.

    Bob

  37. 53

    Excellent article. Normally I skim through articles, but this one captivated me and I ended up spending more than an hour on it. Thanks for the comprehensive coverage in an easy-to-read style.

  38. 54

    Thanks for a great article (and some enlightening comments) that sheds light on this tricky subject.

    Until now I’d pretty much just used ISO-8859-1 and been happy with that, but now I want to allow user-generated content in multiple languages, so I’ve got to figure it out.

    Thanks for making it easier

  39. 55

    Hello, your article is well written and very interesting.

    I have a doubt: what’s the formula to calculate the Unicode equivalent of a character represented with three bytes (like Japanese kanji), or with four or more bytes?

    Thank you very much for the article and for the help.

    Best regards.
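    The three- and four-byte cases follow the same pattern as the two-byte one: mask off the lead byte’s marker bits, then take six bits from each continuation byte. A Python sketch (the helper name is illustrative):

```python
def decode_utf8_manually(byte_seq):
    """Hand-decode a single 2-, 3- or 4-byte UTF-8 sequence."""
    b = list(byte_seq)
    if 0xC0 <= b[0] <= 0xDF:    # 2 bytes: 5 + 6 bits
        return (b[0] % 32) * 64 + (b[1] % 64)
    if 0xE0 <= b[0] <= 0xEF:    # 3 bytes: 4 + 6 + 6 bits
        return (b[0] % 16) * 4096 + (b[1] % 64) * 64 + (b[2] % 64)
    if 0xF0 <= b[0] <= 0xF7:    # 4 bytes: 3 + 6 + 6 + 6 bits
        return ((b[0] % 8) * 262144 + (b[1] % 64) * 4096
                + (b[2] % 64) * 64 + (b[3] % 64))
    raise ValueError("not a multi-byte lead byte")

# The Japanese kanji 日 (U+65E5) is the three bytes E6 97 A5.
assert decode_utf8_manually("日".encode("utf-8")) == 0x65E5

# The emoji 😀 (U+1F600) is the four bytes F0 9F 98 80.
assert decode_utf8_manually("😀".encode("utf-8")) == 0x1F600
```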

  40. 56

    Great article, and good follow-up commentary. I am amused at the amount of ruckus involved in adapting to all the different character sets in the world. Too bad we abandoned the original “official language” of the Internet: English. But now that Pandora’s box is open, there is no going back. One-and-a-half-megabyte font sets may seem passé in the future. It’s a good thing the OS folks are moving away from local hardware-based solutions to totally online solutions, where the computing and storage overhead to support all this will be cloud-based (though unfortunately it will likely be $ub$cription-ba$ed, LoL). One prominent manufacturer is already setting the stage by migrating its extremely popular star product to the online rental model (aka Office 365). I can hardly wait to see how large their already bloated documents get when supporting full Chinese character sets. LoL

  41. 57

    I had a hard time understanding what you were saying in the “UTF-8 to the Rescue” section, specifically this part of a paragraph:

    “…For instance, characters 208 and 209 shift you into the Cyrillic range. 208 followed by 175 is character 1071, the Cyrillic Я. The exact calculation is (208%32)*64 + (175%64) = 1071. Characters 224-239 are like a double shift. 226 followed by 190 and then 128 is character 12160: ⾀. 240 and over is a triple shift.”

    I didn’t know what this paragraph was talking about. When you say “208 followed by 175”, do you mean how they are encoded in UTF-8 as 208/175? And what exactly is a double/triple shift? I honestly did not know what that meant.

    Also, I didn’t get this part:

    “…If you see a number 192-247, you know you are at the beginning of a multi-byte sequence. If you see 128-191 you know you are in the middle of one. There’s no danger of missing the first number and garbling the rest of the text.”

    How do I know if I am at the beginning, middle or end? I don’t see how looking at the numbers tells me that… And when you say 192-247, do you mean the range of numbers from 192 to 247? Please reply! Thanks. :)
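    Working the article’s example through in Python may help: “208 followed by 175” means those two byte values appear consecutively in the UTF-8 stream, and together they encode one character.

```python
# Two consecutive bytes, 208 then 175, decode to one character.
assert (208 % 32) * 64 + (175 % 64) == 1071
assert chr(1071) == "Я"
assert list("Я".encode("utf-8")) == [208, 175]

# Byte values alone tell you your position in a sequence:
# 192-247 can only start a multi-byte character, 128-191 can
# only continue one, and 0-127 is plain ASCII.
seq = "⾀".encode("utf-8")          # bytes 226, 190, 128
assert 192 <= seq[0] <= 247         # lead byte starts the sequence
assert all(128 <= b <= 191 for b in seq[1:])   # continuation bytes
```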

  42. 58

    Thanks, it helped me.


Leave a Comment

Yay! You've decided to leave a comment. That's fantastic! Please keep in mind that comments are moderated and rel="nofollow" is in use. So, please do not use a spammy keyword or a domain as your name, or else it will be deleted. Let's have a personal and meaningful conversation instead. Thanks for dropping by!

↑ Back to top