You, Me And The Emoji: Character Sets, Encoding And Emoji

We all recognize emoji. They’ve become the global pop stars of digital communication. But what are they, technically speaking? And what might we learn by taking a closer look at these images, characters, pictographs… whatever they are 🤔 (Thinking Face). We will dig deep to learn about how these thingamajigs work.

Please note: Depending on your browser, you may not be able to see all emoji featured in this article (especially the Tifinagh1 characters). Platforms also vary in how they display emoji. That’s why the article always provides textual alternatives. Don’t let it discourage you from reading though!

Now, let’s start with a seemingly simple question. What are emoji?

What Are Emoji?

    🌲             🏡        🌲🌲     🏃  🌲

What we’ll find is that they are born from, and depend on, the same technical foundation, character sets and document encoding that underlie the rest of our work as web-based designers, developers and content creators. So, we’ll delve into these topics using emoji as motivation to explore this fundamental aspect of the web. We’ll learn all about emoji as we go, including how we can effectively work them into our own projects, and we’ll collect valuable resources along the way.

There is a lot of misinformation about these topics online, a fact made painfully clear to me as I was writing this article. Chances are you’ve encountered more than a little of it yourself. The recent release of Unicode 9 and the enormous popularity of emoji make now as good a time as any to take a moment to appreciate just how important this topic is, to look at it afresh and to fill in any gaps in our knowledge, large or small.

By the end of this article, you will know everything you need to know about emoji, regardless of the platform or application you’re using, including the distributed web. What’s more, you’ll know where to find the authoritative details to answer any emoji-related question you may have now or in the future.

21 June 2016 brought the official release of Unicode Version 9.0, and with it 72 new emoji. What are they? Where do new emoji come from anyway? Why aren’t your friends seeing the new ROFL, Fox Face, Crossed Fingers and Pancakes emoji you’re sending them?! 😡 (Pouting Face emoji) Keep reading for the answers to these and many other questions.

A question to get us started: What is the plural form of the word “emoji”?

It’s a question that came up in the process of reviewing and editing this article. The good news is that I have an answer! The bad news (depending on how bothered you are by triviality) is that there is no definitive answer. The most accurate thing that can be said is that, currently, there is no established correct form for the plural of emoji.

An article titled “What’s the Plural of Emoji?3” by Robinson Meyer4, published by The Atlantic5 on 6 January 2016, discusses exactly this issue. The author turns up recent conflicting uses of both forms “emoji” and “emojis,” even within the same national publications:

In written English right now, there’s little consensus on this question. National publications have not settled on a regular style. The Atlantic, for instance, used both (emoji6, emojis7) in the last quarter of 2015. And in October alone in The New York Times, you could find the technology reporter Vindu Goel covering Facebook’s “six new emoji,”8 despite, two weeks later, Austin Ramzy detailing the Australian foreign minister’s “liberal use of emojis9.” …

The Unicode Emoji Subcommittee, which, as we will see, is the group responsible for emoji in the Unicode Standard, uses “emoji” as the plural form. This plural form appears in passages of documentation quoted in this article. Consider, for example, the very first sentence of the first paragraph of the Emoji Subcommittee’s official homepage at unicode.org10:

Emoji are pictographs (pictorial symbols) that are typically presented in a colorful form and used inline in text. They represent things such as faces, weather, vehicles and buildings, food and drink, animals and plants, or icons that represent emotions, feelings, or activities.

I have chosen the plural “emoji”, for the sake of consistency if nothing else. At this point in time, you can confidently use whichever form you prefer, unless of course the organization or individual for whom you’re writing has strong opinions one way or the other. You can and should consult your style guide if necessary.

We’ll start at the beginning, with the basic building blocks not just of emoji, nor even digital communication, but of all written language: characters and character sets.

Table of Contents

  1. Character Sets And Document Encoding: An Overview
    1. Characters
    2. Character Sets
    3. Coded Character Sets
    4. Encoding
  2. Declaring Character Sets And Document Encoding On The Web
    1. content-type HTTP Header Declaration
    2. Checking HTTP Headers Using A Browser’s Developer Tools
    3. Checking HTTP Headers Using Web-based Tools
    4. Using A Meta Element With charset Attribute
    5. An Encoding By Any Other Name
  3. What Were We Talking About Again? Oh Yeah, Emoji!
    1. So What Are Emoji?
    2. How Do We Use Emoji?
    3. Character References
    4. Glyphs
    5. How Do We Know If We Have These Symbols?
    6. The Great Emoji Proliferation Of 2016
  4. Emoji OS Support
    1. Emoji Support: Apple Platforms (macOS and iOS)
    2. Emoji Support: Windows
    3. Emoji Support: Linux
    4. Emoji Support: Android
  5. Emoji On The Web
    1. Emoji One
    2. Twemoji
  6. Conclusion

Character Sets And Document Encoding: An Overview

Characters

A dictionary definition for a character will do to get us started: “A character is commonly a symbol representing a letter or number.”

That’s simple enough. But like so many other concepts, for it to be meaningful, we need to consider the broader context and put it into practice. Characters, in and of themselves, are not enough. I could draw a squiggle with a pencil on a piece of paper and rightfully call it a character, but that wouldn’t be particularly valuable. Not only that, but it is difficult to convey a useful amount of information using a single character. We need more.

Character Sets

A character set is “a set of characters.”

Expanding on that a bit, we can take a step back and consider a slightly more precise but still general description of a set: a group or collection of things that belong together, resemble one another or are usually found together.

Because we’re dealing with sets in the context of computing, we can be a little more precise. In the field of computer science a set is: a collection of a finite number of values in no particular order, with the added condition that none of the values are repeated.

What’s this about a collection? Technically speaking, a collection is a grouping of a number of items, possibly zero, that have some shared significance.

So, a character set is a grouping of some finite number of characters (i.e. a collection), in no particular order, such that none of the characters are repeated.
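As a small illustration of that definition, here is a sketch in Python, whose built-in set type happens to have the same three properties: finite, unordered, and free of repeats. The helper name character_set is made up for this example.

```python
def character_set(text):
    """Collect the distinct characters that occur in a piece of text."""
    return set(text)

# "banana" has six letters but only three distinct characters.
letters = character_set("banana")
print(sorted(letters))   # sorted only so the printout is stable
print(len(letters))

# Order does not matter: both spellings yield the same set.
print(character_set("nab") == character_set("ban"))
```

Sorting before printing is purely cosmetic. The set itself has no order, which is exactly the point of the definition.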

That’s a solid, precise, if pedantic, definition.

The World Wide Web Consortium (W3C38), the international community of member organizations that work together to develop standards for the web, has its own definition, which is not far from the generic one we’ve arrived at on our own.

From the W3C’s “Character Encodings: Essential Concepts39”:

A character set or repertoire comprises the set of characters one might use for a particular purpose — be it those required to support Western European languages in computers, or those a Chinese child will learn at school in the third grade (nothing to do with computers).

So, any arbitrary set of characters can be considered a character set. There are, however, some well-known standardized character sets that are much more significant than any random grouping we might put together. One such standardized character set, unarguably the most important character set in use today, is Unicode. Again, quoting the W3C40:

Unicode is a universal character set, i.e. a standard that defines, in one place, all the characters needed for writing the majority of living languages in use on computers. It aims to be, and to a large extent already is, a superset of all other character sets that have been encoded.

Text in a computer or on the Web is composed of characters. Characters represent letters of the alphabet, punctuation, or other symbols.

In the past, different organizations have assembled different sets of characters and created encodings for them — one set may cover just Latin-based Western European languages (excluding EU countries such as Bulgaria or Greece), another may cover a particular Far Eastern language (such as Japanese), others may be one of many sets devised in a rather ad hoc way for representing another language somewhere in the world.

Unfortunately, you can’t guarantee that your application will support all encodings, nor that a given encoding will support all your needs for representing a given language. In addition, it is usually impossible to combine different encodings on the same Web page or in a database, so it is usually very difficult to support multilingual pages using ‘legacy’ approaches to encoding.

The Unicode Consortium provides a large, single character set that aims to include all the characters needed for any writing system in the world, including ancient scripts (such as Cuneiform, Gothic and Egyptian Hieroglyphs). It is now fundamental to the architecture of the Web and operating systems, and is supported by all major web browsers and applications. The Unicode Standard also describes properties and algorithms for working with characters.

This approach makes it much easier to deal with multilingual pages or systems, and provides much better coverage of your needs than most traditional encoding systems.

We just learned that the Unicode Consortium is the group responsible for the Unicode Standard. From their website41:

The Unicode Consortium enables people around the world to use computers in any language. Our freely-available specifications and data form the foundation for software internationalization in all major operating systems, search engines, applications, and the World Wide Web. An essential part of our mission is to educate and engage academic and scientific communities, and the general public.

They provide the following answer to the question, what is Unicode?42:

Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.

These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. … The emergence of the Unicode Standard, and the availability of tools supporting it, are among the most significant recent global software technology trends.

In short, Unicode is a single (very) large set of characters designed to encompass “all the characters needed for writing the majority of living languages in use on computers.” As such, it “provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.”

Both the W3C and Unicode Consortium use the term “encoding” as part of their definitions. Descriptions like that, helpful as they may be, are a big part of the reason why there is often confusion around what are in fact simple concepts. Encoding is a more involved, difficult-to-grasp concept than character sets, and one we’ll discuss shortly. Don’t worry about encoding quite yet; before we get from character sets to encoding, we need one more step.

Coded Character Sets

Going back to the same “Character Encodings: Essential Concepts” document, the W3C has something to say about coded character sets43, too:

A coded character set is a set of characters for which a unique number has been assigned to each character. Units of a coded character set are known as code points. A code point value represents the position of a character in the coded character set. For example, the code point for the letter ‘à’ in the Unicode coded character set is 225 in decimal, or E1 in hexadecimal notation. (Note that hexadecimal notation is commonly used for referring to code points…)

Note: There is an unfortunate mistake in the passage above. The character displayed is “à” and the location given for that symbol in the Unicode coded character set is 225 in decimal, or E1 hexadecimal notation. But 225 (dec) / E1 (hex) is the location of “á,” not “à,” which is found at 224 (dec) / E0 (hex). Oops! 😒 (Unamused Face emoji)

That isn’t too difficult to understand. Being able to describe any one character with a numeric code is convenient. Rather than writing “the Latin script letter ‘a’ with a diacritic grave,” we can say \xE0, the hexadecimal notation for the numeric location of that symbol (“à”) in the coded character set known as Unicode. Among other advantages of this arrangement, we can look up that character without having to know what “Latin script letter ‘a’ with a diacritic grave” means. The natural-language way of describing a character can be awkward for us, even more so for computers, which are both much better at looking up numeric references than we are and much worse at understanding natural-language descriptions.

So, a coded character set is simply a way to assign a numeric code to every character in a set such that there is a one-to-one correspondence between character and code. With that, not only is the Unicode Consortium’s description of Unicode more understandable, but we’re ready to tackle encoding.
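To see the correspondence between character and code for yourself, a short Python sketch is enough. The built-in functions ord and chr and the standard unicodedata module are real; the helper name describe is only illustrative. The sketch also confirms the correction made in the note above.

```python
import unicodedata

def describe(character):
    """Return (decimal code point, U+ notation, official Unicode name)."""
    code_point = ord(character)                 # character -> number
    return code_point, f"U+{code_point:04X}", unicodedata.name(character)

# chr() goes the other way: number -> character.
a_grave = chr(0xE0)   # the letter displayed in the W3C passage
a_acute = chr(0xE1)   # the letter the quoted numbers actually identify

for character in (a_grave, a_acute):
    print(describe(character))
```

The first tuple reports 224 and LATIN SMALL LETTER A WITH GRAVE, the second 225 and LATIN SMALL LETTER A WITH ACUTE.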

    🌲                    🏡        🏃 🌲🌲         🌲

Encoding

We’ve quickly reviewed characters, character sets and coded character sets. That brings us to the last concept we need to cover before turning our attention to emoji. Encoding is both the hardest concept to wrap our heads around and also the easiest. It’s the easiest because, as we’ll see, in a practical sense, we don’t need to know all that much about it.

We’ve come to an important point of transition. Character sets and coded character sets are in the human domain. These are concepts that we must have a good grasp of in order to confidently and effectively do our work. When we get to encoding, we’re transitioning into the realm of the computing devices and, more specifically, the low-level storage, retrieval and transmission of data. Encoding is interesting, and it is important that we get right what little of it we are responsible for, but we need only a high-level understanding of the technical details in order to do our part.

The first thing to know is that “character sets” and “encodings” (or, for our purpose here, “document encodings”) are not the same thing. That may seem obvious to you, especially now that we’re clearly discussing them separately, but it is a common source of confusion. The relationship is a little easier to understand, and keep straight, if we think of the latter as “character set encodings.”

It’s back to the W3C’s “Character Encodings: Essential Concepts” for a definition of encoding44 to get us started:

The character encoding reflects the way the coded character set is mapped to bytes for manipulation by a computing device.

In the table below, which reproduces the same information from a graphic appearing in the W3C document, the first 4 characters and corresponding code points are part of the Tifinagh alphabet, and the fifth is the more familiar exclamation point.

Table 1: A representation of the same information from a graphic that appears in the W3C document “Character Encodings: Essential Concepts”

Character (glyph)   Unicode code point (hexadecimal)   UTF-8 encoding (bytes in memory)
ⴰ                   2D30                               E2 B4 B0
ⵣ                   2D63                               E2 B5 A3
ⵓ                   2D53                               E2 B5 93
ⵍ                   2D4D                               E2 B5 8D
!                   21                                 21

The table shows, from left to right, the symbol itself, the corresponding code point and the way the code point maps to a sequence of bytes using the UTF-8 encoding scheme. Each byte in memory is represented by a two-digit hexadecimal number. So, for example, in the first row we see that the UTF-8 encoding of the Tifinagh letter ya (ⴰ) requires 3 bytes of storage (E2 B4 B0).

There are two important points to take away from the information in this table:

First, encodings are distinct from the coded character sets. The coded character set is the information that is stored, and the encoding is the manner in which it is stored. (Don’t worry about the specifics.)

Secondly, note how under the UTF-8 encoding scheme the Tifinagh code points map to three bytes, but the exclamation point maps to a single byte.
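The byte sequences in the table can be reproduced with a few lines of Python. This is a minimal sketch: str.encode is standard, while the helper name utf8_hex is an assumption made for the example.

```python
def utf8_hex(character):
    """Hexadecimal bytes of one character when stored as UTF-8."""
    return " ".join(f"{byte:02X}" for byte in character.encode("utf-8"))

# Code points taken from the table: four Tifinagh letters and "!".
for code_point in (0x2D30, 0x2D63, 0x2D53, 0x2D4D, 0x21):
    print(f"U+{code_point:04X} -> {utf8_hex(chr(code_point))}")
```

Each Tifinagh letter comes out as three bytes and the exclamation point as one, matching the table row for row.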

From the W3C49:

Although the code point for the letter à in the Unicode coded character set is always 225 (in decimal), in UTF-8 it is represented in the computer by two bytes. … there isn’t a trivial, one-to-one mapping between the coded character set value and the encoded value for this character. … the letter à can be represented by two bytes in one encoding and four bytes in another.

The encoding forms that can be used with Unicode are called UTF-8, UTF-16, and UTF-32.

The W3C’s explanation is accurate, concise, informative and, for many readers, clear as mud. At this point, we’re dealing with pretty low-level stuff. Let’s keep pushing ahead; as is often the case, learning more will give us the context we need to better understand what we’ve already seen.

UTF is a set of encodings specifically created for the implementation of Unicode. It is part of the core specification of Unicode itself.

The Unicode Consortium maintains an official website for Unicode 9.0 (as well as all previous versions of the specification). A PDF of the core specification51 was published to the website in August 2016. You’ll find the discussion of UTF in “Section 2.5: Encoding Forms.”

Computers handle numbers not simply as abstract mathematical objects, but as combinations of fixed-size units like bytes and 32-bit words. A character encoding model must take this fact into account when determining how to associate numbers with the characters.

Actual implementations in computer systems represent integers in specific code units of particular size—usually 8-bit (= byte), 16-bit, or 32-bit. In the Unicode character encoding model, precisely defined encoding forms specify how each integer (code point) for a Unicode character is to be expressed as a sequence of one or more code units. The Unicode Standard provides three distinct encoding forms for Unicode characters, using 8-bit, 16-bit, and 32-bit units. These are named UTF-8, UTF-16, and UTF-32, respectively. The “UTF” is a carryover from earlier terminology meaning Unicode (or UCS) Transformation Format. Each of these three encoding forms is an equally legitimate mechanism for representing Unicode characters; each has advantages in different environments.

Note: These encoding forms are consistent from one version of the specification to the next. In fact, their stability is vital to maintaining the integrity of the Unicode standard. Whatever we read about the encoding forms in the Version 9.0 specification was true of Version 8.0 as well, and will hold going forward.

The Unicode specification discusses at length the pros and cons and preferred usage of these three forms — UTF-8, UTF-16 and UTF-32 — endorsing the use of all three as appropriate. For the purposes of this brief discussion of UTF encoding, it’s enough to know the following:

  • UTF-8 uses 1 byte to represent characters in the ASCII set, 2 bytes for characters in several more alphabetic blocks, 3 bytes for the rest of the BMP, and 4 bytes as needed for supplementary characters.
  • UTF-16 uses 2 bytes for any character in the BMP, and 4 bytes for supplementary characters.
  • UTF-32 uses 4 bytes for all characters.

From the brief description of storage requirements for the various UTF encodings above, you might guess that UTF-8 is more complicated to implement (owing to the fact that it is not fixed-width) but more space efficient than say UTF-32, which is more regular but less space-efficient, with every character taking up exactly 4 bytes.
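Those storage figures are easy to check. The following Python sketch counts the bytes each encoding form needs for one character from each range mentioned in the list. The names CODECS, SAMPLES and encoded_sizes are illustrative. The "-be" codec variants are used because Python’s plain "utf-16" and "utf-32" codecs prepend a byte order mark, which would inflate the counts.

```python
CODECS = {"UTF-8": "utf-8", "UTF-16": "utf-16-be", "UTF-32": "utf-32-be"}

def encoded_sizes(character):
    """Number of bytes needed for one character in each encoding form."""
    return {name: len(character.encode(codec)) for name, codec in CODECS.items()}

SAMPLES = {
    "ASCII (exclamation point)": "\u0021",
    "2-byte range (e with acute)": "\u00E9",
    "rest of the BMP (Tifinagh ya)": "\u2D30",
    "supplementary (Relieved Face)": "\U0001F60C",
}

for label, character in SAMPLES.items():
    print(label, encoded_sizes(character))
```

UTF-8 grows from 1 to 4 bytes across the four samples, UTF-16 uses 2 bytes until the supplementary character forces 4, and UTF-32 uses 4 bytes every time.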

Table 2: The same characters as in Table 1, with the bytes used to store each one in UTF-8, UTF-16 and UTF-32

Encoding   ⴰ             ⵣ             ⵓ             ⵍ             !
UTF-8      E2 B4 B0      E2 B5 A3      E2 B5 93      E2 B5 8D      21
UTF-16     2D 30         2D 63         2D 53         2D 4D         00 21
UTF-32     00 00 2D 30   00 00 2D 63   00 00 2D 53   00 00 2D 4D   00 00 00 21

Ignoring the Berber characters and focusing on the exclamation point in the rightmost column, we see that the same character would take up a single byte in UTF-8, 2 bytes (two times the storage) in UTF-16, and 4 bytes (four times the storage) in UTF-32. That’s three very different amounts of storage to convey the exact same information. Multiply that difference in storage requirements by the size of the web, estimated to be at least 4.83 billion pages65 currently, and it’s easy to appreciate that the storage requirements of these encodings are not an inconsequential consideration.

Whether or not that all made sense to you, here’s the good news…

When dealing with HTML, the character set we’ll use is Unicode, and the character encoding is always UTF-8. It turns out that that’s all we’ll ever need to concern ourselves with. 😌 (Relieved Face emoji, U+1F60C) Regardless, it’s no less important to be aware of the general concepts, as well as the simple fact that there are other character sets and encodings.

Now, we can bring all of this to the context of the web, and start working our way toward emoji.

    🌲                    🏡   🏃      🌲🌲         🌲

Declaring Character Sets And Document Encoding On The Web

We need to tell user agents (web browsers, screen readers, etc.) how to correctly interpret our HTML documents. In order to do that, we need to specify both the character set and the encoding. There are two (overlapping) ways to go about this:

  • utilizing HTTP headers,
  • declaring within the HTML document itself.

A quick note on the HTML5 standard itself, which is maintained by two standards bodies, the W3C and the WHATWG. After a period of officially working together, the two have parted ways. However, there is still an awkward collaboration of sorts on the HTML5 standard itself. The WHATWG works on its specification, rolling in changes continually. Much like a modern evergreen operating system (OS) or application with an update feature, the latest changes are incorporated without waiting for the next official release. This is what the WHATWG means by “living standards,” which it describes as follows:

This means that they are standards that are continuously updated as they receive feedback, either from Web designers, browser vendors, tool vendors, or indeed any other interested party. It also means that new features get added to them over time, at a rate intended to keep the specifications a little ahead of the implementations but not so far ahead that the implementations give up.

Despite the continuous maintenance, or maybe we should say as part of the continuing maintenance, a significant effort is placed on getting the specifications and the implementations to converge — the parts of the specification that are mature and stable are not changed willy nilly. Maintenance means that the days where the specifications are brought down from the mountain and remain forever locked, even if it turns out that all the browsers do something else, or even if it turns out that the specification left some detail out and the browsers all disagree on how to implement it, are gone. Instead, we now make sure to update the specifications to be detailed enough that all the implementations (not just browsers, of course) can do the same thing. Instead of ignoring what the browsers do, we fix the spec to match what the browsers do. Instead of leaving the specification ambiguous, we fix the the [sic] specification to define how things work.

For its part, the W3C will from time to time package these updates (at least some of them), possibly along with its own changes, to arrive at a new version of its HTML 5.x standard.

Assuming that the WHATWG process works as advertised — and that may be a pretty good assumption considering that many of the people directly involved with the WHATWG also work for the organizations responsible for the implementation of the standard (e.g. Apple, Google, Mozilla and Opera) — the best strategy is probably to refer to the WHATWG spec first. That is what I have done in this article. Where I quote from an HTML5 spec, I am referencing the WHATWG specification. I do, however, make use of informational documents from the W3C throughout the article because they are helpful and not inconsistent with either spec.

Honestly, for our purposes here, it hardly matters. The sections I pull from are nearly (though not strictly) identical. But I suppose that’s really part of the problem, rather than evidence of cohesiveness. To get a sense of just how messy the situation is, take a look at the “Fork Tracking” page on the WHATWG’s wiki70.

content-type HTTP Header Declaration

As long as we’re talking about the web, there’s good reason to believe that the W3C has something to say71 about the topic:

When you retrieve a document, a web server sends some additional information. This is called the HTTP header. Here is an example of the kind of information about the document that is passed as part of the header with a document as it travels from the server to the client.

HTTP/1.1 200 OK
Date: Wed, 05 Nov 2003 10:46:04 GMT
Server: Apache/1.3.28 (Unix) PHP/4.2.3
Content-Location: CSS2-REC.en.html
Vary: negotiate,accept-language,accept-charset
TCN: choice
P3P: policyref=
Cache-Control: max-age=21600
Expires: Wed, 05 Nov 2003 16:46:04 GMT
Last-Modified: Tue, 12 May 1998 22:18:49 GMT
ETag: "3558cac9;36f99e2b"
Accept-Ranges: bytes
Content-Length: 10734
Connection: close
Content-Type: text/html; charset=UTF-8
Content-Language: en

If your document is dynamically created using scripting, you may be able to explicitly add this information to the HTTP header. If you are serving static files, the server may associate this information with the files. The method of setting up a server to pass character encoding information in this way will vary from server to server. You should check with the documentation or your server administrator.
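As one hedged illustration of a dynamically created document adding this information itself, here is a minimal sketch built on Python’s standard http.server module. The names PAGE, response_headers, Utf8PageHandler and make_server are invented for this example, and a production server would be configured quite differently.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = '<!DOCTYPE html><meta charset="UTF-8"><title>Emoji</title><p>\U0001F60C</p>'

def response_headers(body):
    """Header fields describing an HTML body that was encoded as UTF-8."""
    return [
        ("Content-Type", "text/html; charset=UTF-8"),
        ("Content-Length", str(len(body))),   # length in bytes, not characters
    ]

class Utf8PageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE.encode("utf-8")           # the encoding the header promises
        self.send_response(200)
        for name, value in response_headers(body):
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)

def make_server(port=0):
    """Create (but do not start) a local server; port 0 picks a free port."""
    return HTTPServer(("127.0.0.1", port), Utf8PageHandler)
```

Calling make_server(8000).serve_forever() would serve the page locally. The detail worth noticing is that the body is encoded with the same encoding that the header declares.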

Without getting too heavily into the details, while still coming away from the discussion with some sense of how this works, let’s clarify some of the terminology.

HTTP is the network application protocol underlying communication on the web. (The same protocol is also used in other contexts, but was originally designed for the web.) HTTP is a client-server protocol and facilitates communication between the software making a request (the client) and the software fulfilling or responding to the request (the server) by exchanging request and response messages. These messages all must follow a well-defined, standardized structure so that they can be anticipated and interpreted properly by the recipient.

Part of this structure is a header providing information about the message itself, about the capabilities or requirements of the originator or of the recipient of the message, and so on. The header consists of a number of individual header lines. Each line represents a single header field comprising a lone key-value pair. One of the more than 45 defined fields is Content-Type, which identifies both the encoding and character set of the content of the message.

In the example above, among the header lines, we see:

Content-Type: text/html; charset=UTF-8

The field contains two pieces of information.

The first is the media type74, text/html, which identifies the content of the message as an HTML document that a web browser can process directly. There are other media types, like application/pdf (a PDF document), which generally need to be handled differently.

The second piece of information is the document encoding and character set, charset=UTF-8. As we’ve already seen, UTF-8 is exclusively used with Unicode. So UTF-8 alone is enough to identify both the encoding and character set.
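The two pieces of information can be pulled apart programmatically. The sketch below leans on Python’s standard email.message.Message class, which understands this header syntax; the function name split_content_type is an assumption for the example.

```python
from email.message import Message

def split_content_type(value):
    """Split a Content-Type value into (media type, charset or None)."""
    message = Message()
    message["Content-Type"] = value
    # Both accessors normalize their result to lowercase.
    return message.get_content_type(), message.get_content_charset()

print(split_content_type("text/html; charset=UTF-8"))
print(split_content_type("application/pdf"))
```

The first call yields text/html together with utf-8. The second yields application/pdf and None, because that value carries no charset parameter.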

You can view these HTTP headers yourself. Options for doing so include, among others:

  • browser developer tools,
  • web-based tools.
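A short script is another option. The following sketch uses Python’s standard urllib.request module; the function name content_type_of is made up, and any URL passed to it is up to you.

```python
from urllib.request import urlopen

def content_type_of(url, timeout=10):
    """Fetch a URL and report what its Content-Type header declares."""
    with urlopen(url, timeout=timeout) as response:
        headers = response.headers
        return {
            "raw": headers.get("Content-Type"),
            "media_type": headers.get_content_type(),
            "charset": headers.get_content_charset(),  # None if not declared
        }
```

Calling content_type_of with the address of a page you want to inspect returns the raw header value alongside its two parts. A charset of None means the server did not declare an encoding in the header.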

Checking HTTP Headers Using A Browser’s Developer Tools

Checking HTTP Headers with Firefox (Recent Versions)

  1. Open the “Web Console” from the “Tools” → “Web Developer” menu.
  2. Select the “Network” tab in the pane that appears (at the bottom of the browser window, if you haven’t changed this default).
  3. Navigate to a web page you’d like to inspect. You’ll see a list of all resources that make up the page, with the root document at the top and its component resources listed underneath.
  4. Select any one of these resources from the list for which you’d like to look at the accompanying headers. (The pane should split.)
  5. Select the “headers” tab in the new pane that appears.
  6. You’ll see response and request headers corresponding to both ends of the exchange, and among the response headers you should find the Content-Type field.

Checking HTTP Headers with Chrome (Recent Versions)

  1. Open the “Developer Tools” from the “View” → “Developer” menu.
  2. Select the “Network” tab in the pane that appears (at the bottom of the browser window, if you haven’t changed this default).
  3. Navigate to a web page you’d like to inspect. You’ll see a list of all resources that make up the page, with the root document at the top and its component resources listed underneath.
  4. Select any one of these resources from the list for which you’d like to look at the accompanying headers. (The pane should split.)
  5. Select the “headers” tab in the new pane that appears.
  6. You’ll see response and request headers corresponding to both ends of the exchange, and among the response headers you should find the Content-Type field.

Checking HTTP Headers Using Web-Based Tools

Many websites allow you to view the HTTP headers returned from the server for any public website, some much better than others. One reliable option is the W3C’s own Internationalization Checker75.

Simply type a URL into the provided text-entry field, and the page will return a table with information related to the internationalization and language of the document at that address. You should see a section titled “Character Encoding,” with a row for the “HTTP Content-Type” header. You’re hoping to see a value of utf-8.

Using A Meta Element With charset Attribute

We can also declare the character set and encoding in the document itself, using an HTML meta element.

There are two different, widely used, equally valid formats (both interpreted in exactly the same way).

There is the older HTML 4.x format:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

And the newer, equivalent, HTML5 version:

<meta charset="UTF-8">

The latter form is shorter, so it is easier to type, harder to get wrong and takes up less space. For these reasons, it’s the one we should use.

It might occur to you to ask (or not), “If the browser needs to know the character set and encoding before it can read the document, how can it read the document to find the meta element and get the value of the charset attribute?”

That’s a good question. How clever of you. 🐙 (Octopus emoji, U+1F419 — widely considered to be among the cleverest of all animal emoji76) I probably wouldn’t have thought to ask that. I had to learn to ask that question. (Sometimes it’s not just the answers we need to learn, but the questions as well.)

For the answer to this apparent riddle, I’ll quote an often-cited blog post by Joel Spolsky on the topic of character sets and encoding, titled “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)77”:

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy… how can you read the HTML file until you know what encoding it’s in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it’s going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.

Notice the use of the older-style meta element. This post is from 2003 and, therefore, predates HTML5.

That makes sense, right? Let’s also take a look at what the HTML spec has to say about it. From the WHATWG HTML5 spec on “Specifying the Document’s Character Encoding78”:

The element containing the character encoding declaration must be serialized completely within the first 1024 bytes of the document.

In order to satisfy this condition as safely as possible, it’s best practice to have the meta element specifying the charset as the first element in the head section of the page.

This means that every HTML5 document should begin very much like the following (the only differences being the title text, possibly the value of the lang attribute, the use of white space, single versus double quotation marks around attribute values, and capitalization).

<!doctype html>
<html lang="en">
  <meta charset="utf-8">
  <title>Example title</title>
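To make the 1024-byte rule concrete, here is a rough Python sketch that looks for a complete meta charset declaration within the first 1024 bytes of a document. It is a simplification, not the spec’s actual prescan algorithm: the regular expression and the function name declared_charset are my own assumptions, chosen only to illustrate the idea.

```python
import re

# Simplified pattern: a complete <meta ...> tag that carries a charset value.
META_CHARSET = re.compile(
    rb'<meta\b[^>]*?charset\s*=\s*["\']?\s*([\w-]+)[^>]*>',
    re.IGNORECASE,
)


def declared_charset(document, limit=1024):
    """Return the charset from a meta element that ends within `limit` bytes."""
    match = META_CHARSET.search(document[:limit])
    return match.group(1).decode("ascii").lower() if match else None


early = b'<!doctype html>\n<html lang="en">\n  <meta charset="utf-8">\n'
late = b"<!doctype html>" + b" " * 2000 + b'<meta charset="utf-8">'

print(declared_charset(early))  # utf-8
print(declared_charset(late))   # None
```

The second document declares the same encoding, but too late to be found within the limit, which is exactly the situation that putting the meta element first avoids.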

Now you might be thinking, “Great, I can specify the character set and encoding in the HTML document. I’d much rather do that than worry about HTTP header field values.”

I don’t blame you. But that raises the question, “What happens if the character set and encoding are set in both places?”

Maybe somewhat surprisingly, the information in the HTTP headers takes precedence. Yes, that’s right, the meta element and charset attribute do not override the HTTP headers. (I think I remember being surprised by this.) 🙁 (Slightly Frowning Face emoji, U+1F641)

The W3C tells us in “Declaring Character Encodings in HTML79”:

If you have access to the server settings, you should also consider whether it makes sense to use the HTTP header. Note however that, since the HTTP header has a higher precedence than the in-document meta declarations, content authors should always take into account whether the character encoding is already declared in the HTTP header. If it is, the meta element must be set to declare the same encoding.

There’s no getting away from it. To avoid problems, we should always make sure the value is properly set in the HTTP header as well as in the HTML document.
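The precedence rule can be written down as a tiny decision function. This is a deliberately simplified sketch that only models the two declarations discussed here (it ignores byte order marks and any guessing a browser might do), and the name effective_encoding is hypothetical.

```python
def effective_encoding(http_charset, meta_charset):
    """Return (encoding used, whether the two declarations disagree)."""
    # The HTTP header wins whenever it declares a charset at all.
    chosen = http_charset or meta_charset
    mismatch = bool(
        http_charset
        and meta_charset
        and http_charset.lower() != meta_charset.lower()
    )
    return chosen, mismatch


print(effective_encoding("ISO-8859-1", "utf-8"))  # ('ISO-8859-1', True)
print(effective_encoding(None, "utf-8"))          # ('utf-8', False)
```

The first call is the case to worry about: the document says UTF-8, the server says otherwise, and the server wins.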

An Encoding By Any Other Name

Before we finish with character sets and encoding and move on to emoji, there are two other complications to consider. The first is as subtle as it is obvious.

It’s not enough to declare a document is encoded in UTF-8 — it must be encoded in UTF-8! To accomplish this, your editor needs to be set to encode the document as UTF-8. It should be a preference within the application.
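One quick sanity check is to ask whether a file’s bytes can be decoded as UTF-8 at all. The Python sketch below does just that; the helper name is_valid_utf8 is made up for this example. Keep in mind that a successful decode is only a necessary condition: pure ASCII, for instance, always passes.

```python
def is_valid_utf8(data):
    """Return True if the bytes can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


octopus = "\U0001F419"
print(is_valid_utf8(octopus.encode("utf-8")))  # True
print(is_valid_utf8("é".encode("latin-1")))    # False
```

The second call fails because a single 0xE9 byte, which is how Latin-1 stores é, is not a complete UTF-8 sequence.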

Say, I have a joke for you… What time is it when you have an editor that doesn’t allow you to set the encoding to UTF-8?

Punchline: Time to get a new editor! 🙄 (Face with rolling eyes emoji, U+1F644)

So, we’ve specified the character set and encoding both in the HTTP headers and in the document itself, and we’ve taken care that our files are encoded in UTF-8. Now, thanks to Unicode and UTF-8, we can know, without any doubt, that our documents will be interpreted and displayed properly for every visitor using any browser or other user agent on any device, running any software, anywhere in the world. But is that true? No, it’s not quite true.

There is still a missing piece of the puzzle. We’ll come back to this. Building suspense, that’s good writing! 😃 (Smiling Face with Mouth Open emoji, U+1F603)

    🌲            🏃      🏡           🌲🌲         🌲

What Were We Talking About Again? Oh Yeah, Emoji!

So What Are Emoji?

We’ve already mentioned the Unicode Consortium, the non-profit responsible for the Unicode standard.

There’s a subcommittee of the Unicode Consortium dedicated to emoji, called, unsurprisingly, the Unicode Emoji Subcommittee80. As with the rest of the Unicode standard, the Unicode Consortium and its website (unicode.org81) are the authoritative source for information about emoji. Fortunately for us, it provides a wealth of accessible information, as well as some more formal technical documents, which can be a little harder to follow. An example of the former is its “Emoji and Dingbats” FAQ82.

Q: What are emoji?

A: Emoji are “picture characters” originally associated with cellular telephone usage in Japan, but now popular worldwide. The word emoji comes from the Japanese 絵 (e ≅ picture) + 文字 (moji ≅ written character).

Note: See those Japanese characters in this primarily English-language document? Thanks Unicode!

Emoji are often pictographs — images of things such as faces, weather, vehicles and buildings, food and drink, animals and plants — or icons that represent emotions, feelings, or activities. In cellular phone usage, many emoji characters are presented in color (sometimes as a multicolor image), and some are presented in animated form, usually as a repeating sequence of two to four images — for example, a pulsing red heart.

Q: Do emoji characters have to look the same wherever they are used?

A: No, they don’t have to look the same. For example, here are just some of the possible images for U+1F36D LOLLIPOP, U+1F36E CUSTARD, U+1F36F HONEY POT, and U+1F370 SHORTCAKE:

Figure 3: Example representations of four emoji shown in four styles each. (Image: Unicode, Inc.84)

In other words, any pictorial representation of a lollipop, custard, honey pot or shortcake respectively, whether a line drawing, gray scale, or colored image (possibly animated) is considered an acceptable rendition for the given emoji. However, a design that is too different from other vendors’ representations may cause interoperability problems: see Design Guidelines86 in UTR #5187.

Read through just that one FAQ — and this article, of course 😁 (Grinning face with Smiling Eyes emoji, U+1F601) — and you’ll have a better handle on emoji than most people ever will.

How Do We Use Emoji?

The short answer is, the same way we use every other character. As we’ve already discussed, emoji are symbols associated with code points. What’s special about them is just semantics — i.e. the meaning we ascribe to them, not the mechanics of them.

If you have a key on a keyboard mapped to a particular character, producing that character is as simple as pressing the key. However, considering that, as we’ve seen, more than 120,000 characters are currently in use, and the space defined by Unicode allows for more than 1.1 million of them, creating a keyboard large enough to assign a character to each key is probably not a good strategy.
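Where does “more than 1.1 million” come from? Unicode’s code space is organized as 17 planes of 65,536 code points each. A couple of lines of Python confirm the arithmetic (the constant names are just for illustration):

```python
import sys

PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 65,536

total = PLANES * CODE_POINTS_PER_PLANE
print(total)                        # 1114112
print(total == sys.maxunicode + 1)  # True
```

The highest code point is therefore U+10FFFF, which is the value Python reports as sys.maxunicode.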

When we exhaust the reach of our keyboards, we can use a software utility to insert characters.

Recent versions of macOS include an “Emojis and Symbols” panel that can be accessed from anywhere in the OS (via the menu bar or the keyboard shortcut Control + Command + Space). Other operating systems offer similar capabilities to view and click or to copy and paste emoji into text-entry fields. Applications may offer additional app-specific features for inserting emoji beyond the system-wide utilities.

Lastly, we can take advantage of character references to enter emoji (and any other character we like, for that matter) on the web.

Character References

Character references are commonly used as a way to include syntax characters as part of the content of an HTML document, and also can be used to input characters that are hard to type or not available otherwise.

Note: I’m sure many of you are familiar with character references. But if you keep reading this section, it wouldn’t surprise me if you learn something new.

HTML is a markup language, and, as such, HTML documents contain both content and the instructions describing the document together as plain text in the document itself. Typically, the vast majority of characters are part of the document’s content. However, there are other “special” characters in the mix. In HTML, these form the tags corresponding to the HTML elements that define the structure and semantics of the document. Moreover, it’s worth taking a moment to recognize that the syntax itself — i.e. its implementation — creates a need for additional markup-specific characters. The ampersand is a good example. The ampersand (&) is special because it marks the beginning of all other character references. If the ampersand itself were not treated specially, we’d need another mechanism altogether.

Syntax characters are treated as special, and should never be used as part of the content of a document, because they are always interpreted specially, regardless of the author’s intent. Mistakenly using these characters as content makes it difficult or impossible for a browser or other user agent to parse the document correctly, leading to all sorts of structural and display issues. But aside from their special status, markup-related characters are characters like any other, and very often we need to use them as content. If we can’t type the literal characters, then we need some other way to represent them. We can use character references for this, referred to as “escaping” the character, as in getting outside of (i.e. escaping) the character’s markup-specific meaning.

What characters need to be escaped? According to the W3C three characters should always be escaped88: the less-than symbol (<, &lt;), the greater-than symbol (>, &gt;) and the ampersand (&, &amp;) — just those three.

Two others, the double quote (", &quot;) and single quote (', &apos;) are often escaped based on context; in particular, when they appear as part of the value of an attribute and the same character is used as the delimiter of the attribute value.
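Most languages ship a helper that applies these escapes for us. As one illustration, here is Python’s standard html.escape function at work on a made-up string. Note that it writes the single quote as the numeric reference &#x27; rather than &apos;, which is equally valid.

```python
import html

raw = 'Fish & chips <b>"today\'s" special</b>'

# Escape only the three characters that should always be escaped.
print(html.escape(raw, quote=False))
# Fish &amp; chips &lt;b&gt;"today's" special&lt;/b&gt;

# Escape the quotes as well, which is safe inside attribute values.
print(html.escape(raw))
# Fish &amp; chips &lt;b&gt;&quot;today&#x27;s&quot; special&lt;/b&gt;
```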

Note: Even this is a bit of a fib, though it comes from a typically reliable source. To be safe, we can always escape those characters. But in fact, because of the way document parsing works, we can often get away without escaping one or more of them.

If you’re interested in a more detailed run-through of just how complicated and fiddly the exceptions to the syntax-character rules can get in practice, have a look at the blog post “Ambiguous Ampersands110,” in which Mathias Bynens considers precisely when these characters must be escaped and when they don’t need to be.

But here’s the thing: These types of liberties can, and frequently do, cascade through our markup when we inevitably make other mistakes. For that reason alone, you may want to stick to the advice from the W3C. Not only is it the safer approach, it’s a lot easier to remember, and that’s not a bad thing.

In addition to these few syntax characters, there are similar references for every other Unicode character as well, all 120,000+ of them. These references come in two types:

  • named character references (named entities)
  • numeric character references (NCRs)

Note: It is perfectly confusing that the term “numeric character reference” is abbreviated NCR, which could just as easily be used as the abbreviation for named character reference. 😩 (Weary Face, U+1F629)

Named character references

Named character references (also known as named entities, entity references or character entity references) are pre-defined word-like references to code points. There are quite a few of them. The WHATWG provides a handy comprehensive table of character reference names supported by HTML5113, listing 2,231 of them. That’s far, far more than you will ever likely see used in practice. After all, the idea is that these names will serve as mnemonics (memory aids). But it’s difficult to remember 2,231 of anything. If you don’t know that a named reference exists, you won’t use it.

But where does this information come from? How can we be certain that the information in the table referenced above, which I describe as “comprehensive,” is in fact comprehensive? Not only are those perfectly valid questions, it’s exactly the type of questioning we need more of, to cut through the clutter of hearsay and other misinformation that is all too common online.

The very best sources of authoritative information are the specifications themselves, and that “handy, comprehensive table” is in fact a link to section 12.5 of the WHATWG’s HTML5 spec114.

Let’s say you wanted to include a greater-than symbol (>) in the content of your document.

As we have just seen, and as you probably already knew, we can’t just press the key bearing that symbol on our keyboard. That literal character is special and may be treated as markup, not content. There is a named character reference for the greater-than symbol. It looks like &gt;. Typing that in your document will get you the character you’re looking for every time, >.

So, in summary, named entities are predefined mnemonics for certain Unicode characters.

Essentially, a group of people thought it would be nice if we could refer to & as &amp;, so they arranged to make that possible, and now &amp; corresponds to the code point for &.
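If you’d like to poke at the table of names yourself, Python’s standard library happens to carry a copy of the HTML5 named references in html.entities.html5, keyed by name without the leading ampersand. A small sketch, assuming that dictionary mirrors the spec’s table:

```python
from html.entities import html5

# The number of names in the table.
print(len(html5))

# Each name maps to the character (or characters) it stands for.
print(html5["gt;"])   # >
print(html5["amp;"])  # &

# Names are mnemonics for a limited set of characters; emoji are not among them.
print(any("\U0001F419" in value for value in html5.values()))  # False
```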

Numeric Character References

There isn’t a named reference for every Unicode character. We just saw that there are 2,231 of them. That’s a lot, but certainly not all. You won’t find any emoji in that list, for one thing. But while there are not always named references, there are always numeric character references. We can use these for any Unicode character, both those that have named references and those that don’t.

An NCR is a reference that uses the code point value in decimal or hexadecimal form. Unlike the names, there is nothing easy to remember about these.

We’ve looked at the ampersand (&), for which there is a named character reference, &amp;. We can also write this as a numeric reference in decimal or hexadecimal form:

  • decimal: & (&#38;)
  • hexadecimal: & (&#x00026;)

Note: The # indicates that what follows is a numeric reference, and #x indicates that the numeric reference is in hexadecimal notation.
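Because an NCR is nothing more than the code point written out as a number, generating one is mechanical. Here is a short Python sketch; the helper name to_ncr is invented for this example, and html.unescape is used only to confirm that the references decode back to the original character.

```python
import html


def to_ncr(char, hexadecimal=True):
    """Build a numeric character reference for a single character."""
    code_point = ord(char)
    return f"&#x{code_point:X};" if hexadecimal else f"&#{code_point};"


print(to_ncr("&", hexadecimal=False))  # &#38;
print(to_ncr("&"))                     # &#x26;
print(to_ncr("\U0001F4A9"))            # &#x1F4A9;

# Leading zeros are allowed, so &#x00026; decodes to the same character.
print(html.unescape("&#x00026;") == html.unescape("&#38;") == "&")  # True
```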

You’ll find a table containing a complete list of emoji in the document “Full Emoji Data182117,” maintained by the Unicode Consortium’s Emoji Subcommittee. It includes several different representations of each emoji for comparison and, importantly, the Unicode code point in hexadecimal form, as well as the name of the character and some additional information. It’s a great resource and makes unnecessary any number of other websites that often contain incomplete, out of date or otherwise partially inaccurate information.

Note: If you look carefully, you might spot some unfamiliar emoji in this list. As I write this, the list includes new emoji from the just recently released Unicode Version 9.0. So, you can find 🤠 Face with Cowboy Hat (U+1F920), 🤷 Shrug (U+1F937), 🤞 Hand with Index and Middle Fingers Crossed (U+1F91E), 🥔 Potato (U+1F954) and 68 others from the newest version of Unicode118.

To look at just a few:

  • 💩 (&#x1F4A9;): Pile of Poo
  • 🍩 (&#x1F369;): Doughnut (that’s one delicious-looking emoji)
  • 🤸 (&#x1F938;): Person Doing a Cartwheel (new with version 9)

Are you seeing a box or other generic symbol, rather than an image for the last emoji in the list (and the preceding paragraph as well)? That makes perfect sense, if you’re reading this around the time I wrote it. In fact, it would be more remarkable if you were seeing emoji there. That last emoji, Person Doing a Cartwheel (U+1F938), is new as of Unicode Version 9.0119, released on 21 June 2016. It isn’t surprising at all if your platform or application doesn’t yet support an emoji released within the past several days (or weeks). Of course, as this newest release of Unicode begins to roll out, Version 9.0’s 72 new emoji120 will begin to appear, including 🤸 (Person Doing a Cartwheel, U+1F938), the symbol for which will automatically replace the blank here and in the list above.

Do we really have to type in these references? No, if the platform or application you’re using allows you to enter emoji in some other way, that will work just fine. In fact, it’s preferred. Emoji are not syntax characters and so can always be entered directly.

Here’s a basketball emoji I inserted into this document from macOS’s “Emojis and Symbols” panel: 🏀 (Basketball, U+1F3C0).

This brings up a more general question, “If there’s a reference for every Unicode character, when should we use them?”

The W3C provides us with a perfectly reasonable, well-justified answer to this question in a Q&A document titled “Using Character Escapes in Markup and CSS121”. The short version of its answer to the question of when should we use character references is, as little as possible:

When not to use escapes

It is almost always preferable to use an encoding that allows you to represent characters in their normal form, rather than using character entity references or NCRs.

Using escapes can make it difficult to read and maintain source code, and can also significantly increase file size.

Many English-speaking developers have the expectation that other languages only make occasional use of non-ASCII characters, but this is wrong.

Take for example the following passage in Czech.

Jako efektivnější se nám jeví pořádání tzv. Road Show prostřednictvím našich autorizovaných dealerů v Čechách a na Moravě, které proběhnou v průběhu září a října.

If you were to require NCRs for all non-ASCII characters, the passage would become unreadable, difficult to maintain and much longer. It would, of course, be much worse for a language that didn’t use Latin characters at all.

Jako efektivn&#x11B;j&#x161;&#xED; se n&#xE1;m jev&#xED; po&#x159;&#xE1;d&#xE1;n&#xED; tzv. Road Show prost&#x159;ednictv&#xED;m na&#x161;ich autorizovan&#xFD;ch dealer&#x16F; v &#x10C;ech&#xE1;ch a na Morav&#x11B;, kter&#xE9; prob&#x11B;hnou v pr&#x16F;b&#x11B;hu z&#xE1;&#x159;&#xED; a &#x159;&#xED;jna.

As we said before, use characters rather than escapes for ordinary text.

So, we should only use character references when we absolutely must, such as when escaping markup-specific characters, but not for “ordinary text” (including emoji). Still, it is nice to know that we can always use a numeric character reference to input any Unicode character at all into our HTML documents. Literally, a world 🌎 (Earth Globe Americas, U+1F30E) of characters are open to us.
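To get a feel for the cost the W3C describes, we can let Python do the escaping. Encoding to ASCII with the standard “xmlcharrefreplace” error handler swaps every non-ASCII character for a decimal NCR. The sketch below uses the opening of the Czech passage; the variable names are only illustrative.

```python
import html

text = "Jako efektivnější se nám jeví pořádání tzv. Road Show"

# Replace every non-ASCII character with a decimal numeric character reference.
escaped = text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(escaped)
print(len(text), len(escaped))

# The escaped form still decodes to exactly the same text.
print(html.unescape(escaped) == text)  # True
```

The meaning survives the round trip, but the escaped version is noticeably longer and much harder to read, which is the W3C’s point.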

The final point I will make before moving on has to do with case-sensitivity. This is another one of those issues about which there is much confusion and debate, despite the fact that it is not open to interpretation.

Character References and Case-Sensitivity

Named character references are case-sensitive and must match the case of the names given in the table of named character references supported by HTML122 that is part of the HTML5 spec. Having said that, if you look at the named references in that table carefully, you will see more than one name that maps to the same code point (and, so, the same character).

Take the ampersand. You’ll find the following four entries in the table for this single character:

AMP;   U+00026   (&)
AMP    U+00026   (&)
amp;   U+00026   (&)
amp    U+00026   (&)

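This may be where some of the confusion originates. One could easily be fooled into assuming that this simulated, limited case-insensitivity is the real thing. But it would be a mistake to think that these kinds of variations are consistent. They definitely are not. For example, the one and only valid capitalization of the named reference Backslash; is exactly that (and, despite its name, it maps to U+2216, the set minus character, rather than to the ASCII backslash, U+005C).

Backslash;   U+02216   (∖)

You won’t find backslash; or BACKSLASH; in that table, and so no other capitalization of that name is valid. (Differently spelled names, such as setminus;, do map to the same code point, but they are separate entries, each with its own fixed case.)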

If we look closely at the four named entities for the ampersand (U+00026) again, you’ll see that half of them include a trailing semicolon (;) and the other half don’t. This, too, may have led to confusion, with some people mistakenly believing that the semicolon is optional. It isn’t. There are some explicitly defined named character references, such as AMP and amp, without it, but the vast majority of named references and all numeric references include a semicolon. Furthermore, none of the named references without the trailing semicolon may be used in a conforming HTML5 document; browsers still recognize them only for backward compatibility. 😮 (Face with Mouth Open, U+1F62E) Section 12.1.4 of the HTML5 specification123 tells us (emphasis added):

Character references must start with a U+0026 AMPERSAND character (&). Following this, there are three possible kinds of character references:

Named character references

The ampersand must be followed by one of the names given in the named character references124 section, using the same case. The name must be one that is terminated by a U+003B SEMICOLON character (;).

Decimal numeric character reference

The ampersand must be followed by a U+0023 NUMBER SIGN character (#), followed by one or more ASCII digits125, representing a base-ten integer that corresponds to a Unicode code point that is allowed according to the definition below. The digits must then be followed by a U+003B SEMICOLON character (;).

Hexadecimal numeric character reference

The ampersand must be followed by a U+0023 NUMBER SIGN character (#), which must be followed by either a U+0078 LATIN SMALL LETTER X character (x) or a U+0058 LATIN CAPITAL LETTER X character (X), which must then be followed by one or more ASCII hex digits126, representing a hexadecimal integer that corresponds to a Unicode code point that is allowed according to the definition below. The digits must then be followed by a U+003B SEMICOLON character (;).

“The ampersand must be followed by one of the names given in the named character references section, using the same case. The name must be one that is terminated by a U+003B SEMICOLON character (;).”

That answers that.

By the way, the alpha characters in hexadecimal numeric character references (a to f, A to F) are always case-insensitive.
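These rules are easy to verify against a copy of the table. The Python sketch below again assumes that the standard library’s html.entities.html5 dictionary mirrors the spec’s list of names; it checks which spellings exist and confirms that the case of hexadecimal digits makes no difference.

```python
import html
from html.entities import html5

# All four ampersand entries exist, with and without the semicolon.
print([name in html5 for name in ("AMP;", "AMP", "amp;", "amp")])
# [True, True, True, True]

# Other names come in exactly one capitalization.
print("Backslash;" in html5, "backslash;" in html5)  # True False

# Hexadecimal digits (and the x itself) are case-insensitive.
print(html.unescape("&#x1f4a9;") == html.unescape("&#X1F4A9;"))  # True
```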

    🌲     🏃             🏡           🌲🌲         🌲


At the end of the encoding section, I asked whether Unicode and UTF-8 encoding are enough to ensure that our documents, interfaces and all of the characters in them will display properly for all visitors to our websites and all users of our applications. The answer was no. But if you remember, I left this as a cliffhanger. 😯 (Hushed Face, U+1F62F) There’s just one thing left to complete our picture of emoji (pun intended) 😬 (Grimacing Face U+1F62C).

Unicode and UTF-8 do most of the heavy lifting, but something is missing. We need a glyph associated with every character in order to be able to see a representation of the character.

From the Wikipedia entry for glyph127:

In typography, a glyph /’ɡlɪf/ is an elemental symbol within an agreed set of symbols, intended to represent a readable character for the purposes of writing and thereby expressing thoughts, ideas and concepts. As such, glyphs are considered to be unique marks that collectively add up to the spelling of a word, or otherwise contribute to a specific meaning of what is written, with that meaning dependent on cultural and social usage.

For example, in most languages written in any variety of the Latin alphabet the dot on a lower-case i is not a glyph because it does not convey any distinction, and an i in which the dot has been accidentally omitted is still likely to be recognized correctly. In Turkish, however, it is a glyph because that language has two distinct versions of the letter i, with and without a dot.

The relationship between the terms glyph, font and typeface is that a glyph is a component of a font that is composed of many such glyphs with a shared style, weight, slant and other characteristics. Fonts, in turn, are components of a typeface (also known as a font family), which is a collection of fonts that share common design features but each of which is distinctly different.

Essentially, we need a font that includes a representation for the code point of the emoji we want to display. If we don’t have that symbol we’ll get a blank space, an empty box or some other generic character as an indication that the symbol we’re after isn’t available. You’ve probably seen this. Now you understand why, if you didn’t before.

Going back to the figure from the W3C’s “Character Encodings: Essential Concepts128,” the original diagram featured the following five characters:

Table 3: A representation of the information from a graphic that appears in the W3C document “Character Encodings: Essential Concepts1296446”

Character   Name                  Code point   NCR
ⴰ           Tifinagh Letter Ya    U+2D30       &#x2d30;
ⵣ           Tifinagh Letter Yaz   U+2D63       &#x2d63;
ⵓ           Tifinagh Letter Yu    U+2D53       &#x2d53;
ⵍ           Tifinagh Letter Yal   U+2D4D       &#x2d4d;
!           Exclamation Point     U+0021       &#x0021;

What would happen if you tried to insert those characters into a web page using a numeric character reference? Let’s see:

  • ⴰ (&#x2d30;)
  • ⵣ (&#x2d63;)
  • ⵓ (&#x2d53;)
  • ⵍ (&#x2d4d;)
  • ! (&#x0021;)

Chances are you can see the exclamation point, but some of the others might be missing, replaced by a box or other generic symbol.
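It’s worth separating two things here: whether a character exists, and whether your platform has a glyph for it. The first can be checked without any font at all. This Python sketch decodes the five references above and looks up each character’s official name in the standard unicodedata module (whose coverage depends on the Unicode version bundled with your Python).

```python
import html
import unicodedata

references = ["&#x2d30;", "&#x2d63;", "&#x2d53;", "&#x2d4d;", "&#x0021;"]

for ref in references:
    char = html.unescape(ref)
    # Print the reference, its code point and the character's Unicode name.
    print(ref, f"U+{ord(char):04X}", unicodedata.name(char))
```

Every one of these resolves to a named character, even on a machine that can only draw a box for it. The missing piece in that case is the font, not Unicode.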

As has already been mentioned, those symbols are Tifinagh characters. Tifinagh130 is “a series of abjad and alphabetic scripts used to write Berber languages.” According to Wikipedia131:

Berber or the Amazigh languages or dialects (Berber name: Tamaziɣt, Tamazight, ⵜⴰⵎⴰⵣⵉⵖⵜ [tæmæˈzɪɣt], [θæmæˈzɪɣθ]) are a family of similar and closely related languages and dialects indigenous to North Africa. They are spoken by large populations in Algeria and Morocco, and by smaller populations in Libya, Tunisia, northern Mali, western and northern Niger, northern Burkina Faso, Mauritania, and in the Siwa Oasis of Egypt. Large Berber-speaking migrant communities have been living in Western Europe since the 1950s. In 2001, Berber became a constitutional national language of Algeria, and in 2011 Berber became a constitutionally official language of Morocco, after years of persecution.

Not long ago, it would have been surprising if your platform displayed any of the Tifinagh characters. But wide internationalization support has improved dramatically and is getting better all the time, thanks in large part to Unicode. There is now a good chance you’ll see them.

We can still reliably stump our platforms with references to characters that have been only very recently released, as we saw earlier with 🤸 (Person Doing a Cartwheel, U+1F938).

For those of you who really appreciate missing characters, here are some others:

  • 🤤 — (Drooling Face, U+1F924)
  • 🤷 — (Shrug, U+1F937)
  • 🤦 — (Face Palm, U+1F926)
  • 🤳 — (Selfie, U+1F933)
  • 🦉 — (Owl, U+1F989)
  • 🥕 — (Carrot, U+1F955)
  • 🥘 — (Shallow Pan of Food, U+1F958)
  • 🛒 — (Shopping Trolley, U+1F6D2)

Note: If you are seeing the eight emoji in that list, then it’s safe to say you have support for the newest emoji introduced with Unicode Version 9.0. As we’ll see, that support could be coming from your OS or application or it could even be loaded via JavaScript for a particular website (but not all of the others).

Regardless of what you or I do or do not see, Unicode is doing its part.

OK, so emoji are symbols (i.e. glyphs) that correspond to a specific code point — that’s a clean, easy-to-understand arrangement… well, it’s not quite so simple. There is one more thing we need to cover — the zero-width joiner.

Zero-Width Joiner: The Most Important Character You’ll Never See

The zero-width joiner (ZWJ) has a code point but no corresponding symbol. It is used to connect two or more other Unicode code points to create a new “compound character” with a unique glyph all its own.

The ZWJ is a part of Unicode, so let’s see what the Unicode Consortium has to say about it137:

The U+200D ZERO WIDTH JOINER (ZWJ) can be used between the elements of a sequence of characters to indicate that a single glyph should be presented if available. An implementation may use this mechanism to handle such an emoji zwj sequence as a single glyph, with a palette or keyboard that generates the appropriate sequences for the glyphs shown. So to the user, these would behave like single emoji characters, even though internally they are sequences.

When an emoji zwj sequence is sent to a system that does not have a corresponding single glyph, the ZWJ characters would be ignored and a fallback sequence of separate emoji would be displayed. Thus an emoji zwj sequence should only be supported where the fallback sequence would also make sense to a recipient. …

So, a ZWJ is exactly what it says it is: It does not have any appearance (i.e. it is “zero width”), and it joins other characters.

Closely related to ZWJ sequences are the skin tone modifiers that have already been mentioned, although they work slightly differently: a modifier is placed immediately after the base emoji, with no ZWJ in between, and the pair is displayed as a single glyph where supported. We’ve said that these skin tones were added to Unicode in version 8.0 to bring diversity to the appearance of emoji depicting human beings by allowing for a range of skin color. More specifically, Unicode has adopted the Fitzpatrick scale, a numeric classification scheme for skin tones specifying six broad groups or “types” of skin (Type I to Type VI) that represent in a general way at least a majority of people. From the Wikipedia entry for the Fitzpatrick scale138:

It was developed in 1975 by Thomas B. Fitzpatrick, a Harvard dermatologist, as a way to estimate the response of different types of skin to ultraviolet (UV) light. It was initially developed on the basis of skin and eye color, but when this proved misleading, it was altered to be based on the patient’s reports of how their skin responds to the sun; it was also extended to a wider range of skin types. The Fitzpatrick scale remains a recognized tool for dermatological research into human skin pigmentation.

  • Type I (scores 0–6) always burns, never tans (pale white; blond or red hair; blue eyes; freckles).
  • Type II (scores 7–13) usually burns, tans minimally (white; fair; blond or red hair; blue, green, or hazel eyes)
  • Type III (scores 14–20) sometimes mild burn, tans uniformly (cream white; fair with any hair or eye color)
  • Type IV (scores 21–27) burns minimally, always tans well (moderate brown)
  • Type V (scores 28–34) very rarely burns, tans very easily (dark brown)
  • Type VI (scores 35–36) Never burns, never tans (deeply pigmented dark brown to darkest brown)

Let’s take a closer look at this in practice using the Boy emoji (U+1F466). To illustrate that this really works as described, I’m going to write out all of the emoji using numeric character references. (You can take a look at the page source if you want to confirm this.)

So, we’ll start with our base emoji:

  • 👦 — (Boy, &#x1F466;)

Although the Fitzpatrick scale specifies six skin types, the depiction of the first two types using emoji are combined under Unicode. So, only five skin tone modifiers are actually available to us:

  • 🏻 — Emoji Modifier Fitzpatrick Type 1-2 (&#x1F3FB;)
  • 🏼 — Emoji Modifier Fitzpatrick Type 3 (&#x1F3FC;)
  • 🏽 — Emoji Modifier Fitzpatrick Type 4 (&#x1F3FD;)
  • 🏾 — Emoji Modifier Fitzpatrick Type 5 (&#x1F3FE;)
  • 🏿 — Emoji Modifier Fitzpatrick Type 6 (&#x1F3FF;)

These skin tone modifiers appear in the list “Full Emoji Data144,” along with all of the other emoji that are part of Unicode. They are represented as a square or some other shape of the appropriate color (i.e. skin tone) when used alone. That’s what you should be seeing in the list above.

We can manually build up the skin-type variants of the Boy emoji by following the base emoji directly with each of the skin-type symbols. No ZWJ is needed for this: UTR #51 defines an emoji modifier sequence as a base character immediately followed by a modifier.

  • 👦🏻 — Boy Type 1 to 2 (&#x1F466;&#x1F3FB;)
  • 👦🏼 — Boy Type 3 (&#x1F466;&#x1F3FC;)
  • 👦🏽 — Boy Type 4 (&#x1F466;&#x1F3FD;)
  • 👦🏾 — Boy Type 5 (&#x1F466;&#x1F3FE;)
  • 👦🏿 — Boy Type 6 (&#x1F466;&#x1F3FF;)
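
The skin tone variants can also be generated rather than typed out. What follows is only a sketch in Python, and the helper names `with_skin_tones` and `as_references` are my own inventions. It assumes what UTR #51 defines for an emoji modifier sequence: the base character immediately followed by the modifier.

```python
BOY = "\U0001F466"

# The five emoji modifiers occupy U+1F3FB through U+1F3FF.
MODIFIERS = [chr(cp) for cp in range(0x1F3FB, 0x1F400)]

def with_skin_tones(base):
    """Return the base emoji followed by each of the five modifiers."""
    return [base + modifier for modifier in MODIFIERS]

def as_references(text):
    """Spell out each code point as a hexadecimal character reference."""
    return "".join(f"&#x{ord(ch):X};" for ch in text)

for variant in with_skin_tones(BOY):
    print(variant, as_references(variant))
```

Whether each pair is drawn as a single tinted Boy or as a Boy followed by a colored swatch depends on the fonts available to whatever displays the output.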

Family and other groups are built up in a similar way, one character at a time, with ZWJs in between. But don’t forget we’ve already learned that we should use character references only when we absolutely must. So, although we could build up the group “Family, Man, Woman, Girl, Boy” manually…

&#x1F468; &#x200D; &#x1F469; &#x200D; &#x1F467; &#x200D; &#x1F466; — 👨‍👩‍👧‍👦

…it’s preferable (and also easier and less error-prone) to use the emoji symbol itself. So, of course that is what we should do. — 👨‍👩‍👧‍👦

We could also write the Latin lowercase “a” as either simply “a” or &#x61; (a). Imagine writing an entire HTML document out using only numeric character references, and you will probably appreciate the pointlessness of the exercise.
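
To convince yourself that the two spellings really are the same thing, here is a small Python sketch using the standard library's `html.unescape`. The variable names are mine, and the code points are the ones for Man, Woman, Girl and Boy used above.

```python
import html

ZWJ = "\u200d"
MAN, WOMAN, GIRL, BOY = "\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"

# Built up one character at a time, with a ZWJ between each member.
family = ZWJ.join([MAN, WOMAN, GIRL, BOY])

# The same sequence spelled out as numeric character references.
references = "&#x1F468;&#x200D;&#x1F469;&#x200D;&#x1F467;&#x200D;&#x1F466;"

# Decoding the references gives back exactly the same string.
print(html.unescape(references) == family)  # True
print(len(family))                          # 7 code points in all
print(html.unescape("&#x61;"))              # a
```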

How Do We Know If We Have These Symbols?

Like any other character, in order to display a specific emoji, a symbol must be included in some font available on the device you are using. Otherwise, the OS will find no way to graphically represent the code point.

An OS, be it a desktop operating system such as Apple’s macOS, or a mobile OS like iOS and Android, ships with quite a few preinstalled fonts. (Many more can be acquired and installed separately.) The fonts available may differ by geographic region or primary language of the intended users across localized versions.

Many of these fonts will overlap, offering the same characters but presenting them in different styles. From one font to the next, these differences may be subtle or extreme. However, not all fonts overlap. Fonts intended for use with Western languages, for example, will not contain the symbols for Chinese, Japanese or Korean characters, and the reverse is also true. Even overlapping fonts may have extended characters that differ.

It is the responsibility of the font designer to decide which characters will be included in a given font. It is the responsibility of the developers of an OS to make sure that the total collection of fonts covers all of the intended languages and provides a wide range of styles and decorative, mathematical and miscellaneous symbols, so that the platform is as expressively powerful as possible.

The current versions of all major platforms (Windows, macOS, Linux, iOS and Android) support emoji. Precisely what that means differs from one platform to the next, version to version and, in the case of Linux, from distribution to distribution.

The Great Emoji Proliferation Of 2016

In this article we’ve covered a little about the history of emoji through the current version of the Unicode Standard, Version 9.0 (released 21 June 2016). We’ve seen that Version 9.0 introduced 72 entirely new emoji, not counting variants. (The number is 167 if we add skin tone modifiers.)

Before moving on I should mention that the emoji included in Unicode 9.0 are referred to separately as “Unicode Emoji Version 3.0”. That is to say that “Unicode Version 9.0 emoji” and “Unicode Emoji Version 3.0” are the same set of characters.

As we begin to look at emoji support in the next section, and considering everything we’ve learned about emoji so far, it would seem relatively straightforward to know what we’re looking for. Ideally we would like to see full support for all Version 3.0 emoji, including skin tone modifiers and multi-person groupings. After all, a platform can’t possibly do better than full support for the current version of the Standard. That would be madness! Have you heard the Nietzsche quote:

There is always some madness in technology. But there is also always some reason in madness.

Nietzsche was talking about love, but isn’t it just as true of tech? 😉 (Winking face, U+1F609)

It turns out you can do “better” than 100% support, depending on your definition of better. Let’s say that you can do more than 100% support. This isn’t a unique concept in technology, or even web design and development. For example, implementers doing more than 100% support for CSS led to vendor prefixes. Somehow more than 100% always seems good at first, but tends to lead to problems. It’s a little like how spending more than 100% of the money you have starts out as a solution to a problem, but tends to eventually lead to more problems 😦 (Frowning face with open mouth, U+1F626).

What does more than 100% support look like in the context of emoji?

First, let’s agree that simply calling something an emoji does not make it an emoji. You may remember (or not) Pepsi’s global PepsiMoji campaign145. A PepsiMoji is not an emoji, it’s an image. We won’t concern ourselves with gimmicky marketing ploys like that. There are still a couple of ways platforms are exceeding 100% support:

  • Using Zero Width Joiners to create non-standard symbols by combining standard emoji.
  • Rolling out proposed emoji before they are officially released.

An example of the former is Microsoft’s somewhat silly (IMO) Ninja Cat emoji, which I discuss in the section on Windows’ emoji support. But cat-inspired OS mascots are not the only place we see this sort of unofficial emoji sequence. More practical uses have to do with multi-person groupings and diversity, i.e. gender and skin tones.

The Unicode Consortium has this to say about multi-person groupings in UTR51146, the Unicode Emoji Version 3.0 technical report:

Emoji for multi-person groupings present some special challenges:

Gender combinations. Some multi-person groupings explicitly indicate gender: MAN AND WOMAN HOLDING HANDS, TWO MEN HOLDING HANDS, TWO WOMEN HOLDING HANDS. Others do not: KISS, COUPLE WITH HEART, FAMILY (the latter is also non-specific as to the number of adult and child members). While the default representation for the characters in the latter group should be gender-neutral, implementations may desire to provide (and users may desire to have available) multiple representations of each of these with a variety of more-specific gender combinations.

Skin tones. In real multi-person groupings, the members may have a variety of skin tones. However, this cannot be indicated using an emoji modifier with any single character for a multi-person grouping.

The basic solution for each of these cases is to represent the multi-person grouping as a sequence of characters—a separate character for each person intended to be part of the grouping, along with characters for any other symbols that are part of the grouping. Each person in the grouping could optionally be followed by an emoji modifier. For example, conveying the notion of COUPLE WITH HEART for a couple involving two women can use a sequence with WOMAN followed by an emoji-style HEAVY BLACK HEART followed by another WOMAN character; each of the WOMAN characters could have an emoji modifier if desired.

This makes use of conventions already found in current emoji usage, in which certain sequences of characters are intended to be displayed as a single unit.

We’re told, “the default … should be gender-neutral, implementations may desire to provide … multiple representations of each of these with a variety of more-specific gender combinations.”

This is in fact what implementations are doing, and pretty aggressively. As we will see, with its Windows 10 Anniversary Update, Microsoft now allows for as many as 52,000 combinations of multi-person groupings. Is this a good thing?

It is certainly a laudable effort. Having said that, it’s worth keeping in mind what it means from a standards perspective. Though Microsoft has greatly increased the universality of its emoji through its flexibly diverse emoji handling, it comes at the expense of compatibility with other platforms. Be aware that users of every other platform, including every version of Windows other than Windows 10 Anniversary Update, will not see many of these groupings as intended.

Microsoft is not the only implementor to do this, but they’ve taken it further than others.
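
What users on other platforms are likely to see instead can be approximated in a few lines. This Python sketch builds the two-women Couple With Heart sequence described in the UTR #51 passage above, gives each woman her own skin tone modifier, and then imitates the fallback that the report describes for systems without a matching glyph, where the ZWJs are ignored. The `fallback` function is my own simplification; a real text renderer is far more involved.

```python
ZWJ = "\u200d"
WOMAN = "\U0001F469"
HEART = "\u2764\ufe0f"  # Heavy Black Heart + emoji variation selector
TYPE_1_2, TYPE_6 = "\U0001F3FB", "\U0001F3FF"  # two of the skin tone modifiers

# Couple With Heart for two women, each with her own skin tone modifier.
couple = ZWJ.join([WOMAN + TYPE_1_2, HEART, WOMAN + TYPE_6])

def fallback(sequence):
    """Approximate the display on a platform with no single glyph for
    the sequence: the same characters, with the ZWJs ignored."""
    return sequence.replace(ZWJ, "")

print(couple)            # a single symbol, where the platform supports it
print(fallback(couple))  # otherwise: woman, heart, woman
```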

That leaves the idea of rolling out proposed emoji before they are officially released. Everyone is getting in on this, and it is a new phenomenon. First, let’s take a brief moment to understand the situation.

Like most other standards bodies, the Unicode Consortium doesn’t wait for the release of one version of a standard to begin working on the next. As soon as the door closes on inclusion for one release, changes and new proposals are evaluated for the next.

Generally speaking, the Unicode Consortium invites outside parties to participate in selecting future emoji, the process for which is outlined in the document “Submitting Emoji Character Proposals147”. As that page describes, the process is somewhat drawn out, and proposals can be rejected for many reasons, but those that fare well are eventually added as “candidates”.

From the same page:

…proposals that are accepted as candidates are added to Emoji Candidates148, with placeholder code points…

So proposed emoji are made public, in some cases well before inclusion in the Unicode Emoji Standard. However, the proposals carry this warning:

Candidates are tentative: they may be removed or their code point, glyph, or name changed. No code point values for candidates are final, until (and if) the candidates are included as characters in a version of Unicode. Do not deploy any of these.

That would seem pretty cut and dried. Promising proposals are added to a publicly available candidates list with the disclaimer that they should not be used until officially released. But rarely are these kinds of issues so simple for long. Seemingly more often than not, the exception is the rule, and by that measure, emoji are following the rules.

It started with a proposal submitted by Google titled, “Expanding Emoji Professions: Reducing Gender Inequality149” which begins:

Google wants to increase the representation of women in emoji and would like to propose that Unicode implementers do the same. Our proposal is to create a new set of emoji that represents a wide range of professions for women and men with a goal of highlighting the diversity of women’s careers and empowering girls everywhere.

That proposal, well worth reading, was submitted in May 2016, a little over a month before the release of Unicode Version 9.0 with its 72 new emoji. It has led to a flurry of activity resulting in a Proposed Update to UTR #51150, and making way for Unicode Emoji Version 4.0 much more quickly than might have been expected. How quickly — right about now.

The original proposal led to another by the Unicode Subcommittee, “Gender Emoji ZWJ Sequences151“, published on 14 July 2016, that fast-tracked new officially recognized sequences providing greater gender parity among existing emoji characters as well as new profession emoji.

From the proposal:

This document describes how vendors can support a set of both female and male versions of many emoji characters, including new profession emoji. Because these emoji use sequences of existing Unicode characters composed according to UTR#51: Unicode Emoji, vendors can begin design and implementation work now and can deploy before the end of 2016, rather than waiting for Unicode v10.0 to come out in June of 2017.

Unicode itself does not normally specify the gender for emoji characters: the emoji character is RUNNER, not MAN RUNNER; POLICE OFFICER not POLICEMAN. Even where the name may appear to be exclusively one gender, such as U+2603 SNOWMAN or U+1F482 GUARDSMAN the character can be treated as neutral regarding gender.

To get a greater sense of realism for these characters, however, vendors typically have picked the appearance of a particular gender to display. This has led to gender disparities in the emoji that people can use. There is also a lack of emoji representing professions and roles, and the few that are present (like POLICE OFFICER) do not provide for both genders; a vendor has to choose one or the other, but can’t represent both.

You can also read more about this in the Unicode Consortium blog post, “Proposed Update UTR #51, Unicode Emoji (Version 4.0)152“.

At present all of this has resulted in a parallel beta standard for Unicode Emoji Version 4.0153.

How big a change are we talking about? There are a total of 88 new sequences combining existing emoji in new ways to provide gender alternates for current characters and also professions, for which there are male and female representations. However, adding skin tone variants to these new alternate characters and professions means that in total the number of emoji has increased from 1,788 in Version 3.0 to 2,243 in the proposed Version 4.0. That’s 455 new emoji in approximately 2 months. 🚀 (Rocket, U+1F680)

So why am I bothering to discuss unreleased, “beta” emoji? After all, the review period for the new proposed standard doesn’t close until 24 October 2016 (coincidentally the date this article is scheduled to be published). I’m covering all of this because implementations are already rolling out these changes. It’s no longer possible to accurately describe the current level of emoji support on these platforms without mentioning the proposed Version 4.0. Now that we have covered all of the officially-official emoji through the current version of the Standard, and unofficially-official emoji included in the post-current Standard, we can make sense of emoji support across popular platforms. 😵 (Dizzy face, U+1F635)

Emoji OS Support

Emoji Support: Apple Platforms (macOS and iOS)

The current version of Apple’s Mac operating system, “macOS Sierra” (10.12) released on 20 September 2016, includes well over 100 fonts, and among them one named “Apple Color Emoji”. It’s this font that contains all of the symbols for the platform’s native emoji. The same font is used by the current version of Apple’s mobile OS, iOS 10.

Users of Apple’s operating systems have typically enjoyed very good emoji support, and the newest versions continue the trend. To begin with, iOS 10 and macOS Sierra support all emoji through Unicode Version 9.0. Apple’s OS release cycle is nicely timed as far as emoji are concerned. Unicode updates happen in the summer and bring with them changes to emoji. Apple is able to roll them out across all of its platforms in the fall.

Beyond Unicode Emoji Version 3.0, macOS Sierra and iOS 10 support the gendered ZWJ sequences that are a key part of the proposed Unicode Emoji Version 4.0. However, new professions and sequences that add skin tone modifiers to existing multi-person groupings didn’t make the update. As a bonus, Apple threw in 🏳️‍🌈 from the Version 4.0 draft specification (Rainbow flag sequence: White flag, U+1F3F3 + Emoji variation selector, U+FE0F + ZWJ, U+200D + Rainbow, U+1F308).
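
When a symbol like the rainbow flag turns up, it can be handy to take it apart and see which code points it is made of. Here is a small Python sketch of that idea; `code_points` is a hypothetical helper of my own, not part of any library.

```python
def code_points(text):
    """List the code points of a string in the usual U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

# White flag, emoji variation selector, ZWJ, rainbow
rainbow_flag = "\U0001F3F3\ufe0f\u200d\U0001F308"

print(code_points(rainbow_flag))
# ['U+1F3F3', 'U+FE0F', 'U+200D', 'U+1F308']
```

The same helper works on any emoji pasted into the string, which makes it a quick way to tell a single code point from a sequence.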

In total, Apple’s newest OS updates include 632 changes and additions. Some of these changes are minor and reflect nothing more than an evolution of design sensibilities of those involved at Apple. Others are more dramatic however, most notably 🔫 (pistol, U+1F52B), which has been changed from a realistic looking weapon to a cartoonish sci-fi water gun.

Emoji Support: Windows

Windows 10, 8.1 and 8 all shipped with support for emoji. Limited support was added to Windows 7 through a software update154. (For more information, there is an associated informational article titled “An Update for the Segoe UI Symbol Font in Windows 7 and in Windows Server 2008 R2 Is Available155”.)

Windows 8 included a limited set of black-and-white emoji with the “Segoe UI Symbol” font. This same font was eventually added to Windows 7, providing that version of the OS with its basic emoji symbols.

Windows 8.1 was the first Windows OS to support color emoji by default, shipping with the “Segoe UI Emoji” font, which provides Windows with its unique set of color emoji symbols.

Windows 10 continues to build on increasingly good support, adding all Unicode Version 8.0 emoji, including skin tone modifiers156. However, Windows 10 did not include symbols for national flags 🇺🇸 (Flag of the United States of America, U+1F1FA, U+1F1F8) 🇩🇪 (Flag of Germany U+1F1E9, U+1F1EA), which are displayed as two-character country-code identifiers instead.

Note: Each flag emoji is made up of a pair of code points drawn from the 26 “regional indicator symbols.” Such a pair is referred to as an “emoji_flag_sequence”; it is the two code points together that produce a single flag emoji. For more information about flag symbols, refer to “Annex B: Flags157” in the document “Unicode Technical Report #51158.”
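
Because the regional indicator symbols run in alphabetical order from U+1F1E6 (A) to U+1F1FF (Z), a flag sequence can be computed from a two-letter country code. The Python sketch below shows one way to do it; the `flag` function is my own illustration, and it makes no attempt to check that the two letters belong to a real country.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # Regional Indicator Symbol Letter A

def flag(country_code):
    """Map a two-letter code such as "US" to its pair of regional
    indicator symbols."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= ch <= "Z" for ch in code):
        raise ValueError("expected two letters A-Z")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(ch) - ord("A")) for ch in code)

print(flag("US"))  # U+1F1FA U+1F1F8
print(flag("DE"))  # U+1F1E9 U+1F1EA
```

On a platform without flag glyphs, such as Windows 10 as described above, the output appears as the two-letter identifier rather than a flag.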

Windows 10 Anniversary Update159, released on 2 August 2016 (and available now via the Windows Update facility on your Windows 10 PC), brings with it a wealth of changes to emoji on Windows. The update (officially, version 1607, and the second major update to Windows 10) includes all of the new Unicode Version 9.0 emoji, but that’s just the beginning.

Microsoft took the Anniversary Update as an opportunity to completely rethink and rework emoji on Windows, an effort dubbed “Project Emoji.” In a Windows Experience blog160 post titled “Project Emoji: The Complete Redesign163161” we’re told:

The Microsoft Design Language Team embarked on Project Emoji, redesigning the emoji set from scratch in under a year. From early sketches to creating a new scripting method, the team knew only emoji. Illustrators, graphic designers, program managers, font technicians, production designers, and scripting gurus all worked with an impressive singular focus.

It’s a testament to the acknowledged significance of emoji that Microsoft would make this kind of an effort.

The update includes “over 1700 new glyphs, with a possible 52,000 combinations of diverse women, men, kids, babies, and families.” As we’ve already discussed, Unicode 9.0 adds only 72 new emoji. The fact that Microsoft added 1700 new symbols clearly demonstrates that diversity was a critical focus of the update. Support for diversity begins with the same skin tone modifiers available under previous editions of Windows 10, now extended to more emoji. But the most ambitious effort is expansive support for family and other multi-person groups.

From the same “Project Emoji” blog post:

So if you’re a single mother with three kids, you’ll be able to create that image. If your husband is dark-toned and you’re light-toned and your two kids are a blend of both, you can apply all of those modifiers to create your own personal family emoji, one that’s sincerely representative. It extends to the couple emoji, where you can join a woman, a heart, and a woman — both with unique skin tones — for a more inclusive emoji. Because they’re created dynamically, there are tens of thousands of permutations. And no other platform supports that today.

Emoji in Windows 10 Anniversary Update gives us a good sense of the scale of expanding skin tone modifiers across flexible multi-person groupings. However, this effort seems to be largely independent of Unicode Emoji Version 4.0. Missing are all of the gendered and profession emoji that are the hallmark of the proposed version.

The first thing you might notice are the bold outlines surrounding each of the new emoji. But the changes go much further than that. On the minor end of the scale, virtually all emoji have a more geometric look, which contributes to an overall stronger, more readable appearance. Beyond this, many emoji are drastically different, going so far as to be essentially new interpretations. Generally speaking, the new emoji are less generic, willowy and neutral, which is to say that they are more iconic, bold and dynamic. They’ve gone from looking like designs you might see on the wallpaper in a nursery to what you’d expect of the signage in a modern building. Some examples will give you a better sense of what I’m trying to describe:

Figure 5: Emoji in Windows 10 before and after the Anniversary Update

  • Smiling Face with Open Mouth, U+1F603
  • Happy Person Raising One Hand, U+1F64B
  • Man and Woman Holding Hands, U+1F46B
  • Dromedary Camel, U+1F42A
  • Soft Ice Cream, U+1F366

For more information on all things emoji in Windows 10 Anniversary Update, I highly recommend the Windows Experience Blog162 post “Project Emoji: The Complete Redesign163161,” written by Danielle McClune (4 August 2016). It does a good job of covering the changes introduced with the Anniversary Update and also provides a bit of an insider’s perspective. For those of you not particularly interested in Windows, the article offers some useful general information, including a brief illustrated discussion of the emoji skin tone modifiers and the Fitzpatrick scale.

If you still can’t get enough of Windows 10 emoji goodness, the next place to turn to is Emojipedia’s blog post “Avalanche of New Emojis Arrive on Windows164,” which nicely displays an overview of the changes to emoji in Anniversary Update. Beyond that, the Emojipedia page dedicated to the Anniversary Update165 lists all of the emoji for this version of Windows, with an option to narrow the list to just the new symbols.

If all of this wasn’t enough, Microsoft has added a new emoji keyboard to improve the experience of working with its updated emoji.

It’s fair to say Microsoft has really upped its emoji game with the Windows 10 Anniversary Update. Are there any notable gaps or other issues related to Windows’ emoji support other than the absent Version 4.0 symbols? I’ll mention two…

First, the flag emoji are still missing. You will continue to see country-code identifiers, as in earlier versions of Windows 10.

Second is something of an oddity that’s specific to Windows 10 Anniversary Update (and incompatible with every other platform): Ninja Cat.

Ninja Cat is a character that started out as something of an unofficial mascot for Windows 10 among developers, making its first appearance in a presentation about the OS in mid-2014 (before its release). Apparently, Ninja Cat has proven to be popular and enduring enough over the past couple of years to justify some desktop wallpapers and an animated GIF166, coinciding with the Anniversary Update, and — you know what’s coming — there are ninja cat emoji as well.

I’m making a point of mentioning Ninja Cat because it is another example of the use of zero-width joiners. All Ninja Cat emoji (yes there’s more than one) are sequences of the 🐱 (Cat face, U+1F431) emoji in combination with other standard emoji, connected with the ZWJ character, and resulting in new unofficial symbols.

Note: If skin tone modifiers and flexible multi-person groups are among the more important uses of ZWJ, then non-standardized mascots and gimmicks have to be among the worst, as fun as they may be.

The basic Ninja Cat is a combination of 🐱 (Cat Face, U+1F431) and 👤 (Bust in Silhouette, U+1F464). Other combinations include:

  • Astro Cat — 🐱 (Cat Face, U+1F431) and 🚀 (Rocket, U+1F680)
  • Dino Cat — 🐱 (Cat Face, U+1F431) and 🐉 (Dragon, U+1F409)
  • Hacker Cat — 🐱 (Cat Face, U+1F431) and 💻 (Personal Computer, U+1F4BB)
  • Hipster Cat — 🐱 (Cat Face, U+1F431) and 👓 (Eyeglasses, U+1F453)
  • Stunt Cat — 🐱 (Cat Face, U+1F431) and 🏍 (Racing motorcycle, U+1F3CD)

Here is what those symbols look like in Windows 10 Anniversary Update (the only place you will see them):

Figure 6: Ninja Cat emoji in Windows 10 Anniversary Update

  • Ninja Cat
  • Astro Cat
  • Dino Cat
  • Hacker Cat
  • Hipster Cat
  • Stunt Cat

Emoji Support: Linux

If you’re a Linux user, you’ll know that these kinds of things tend to be distribution-dependent. The Unicode underpinnings are there. What may be missing is a font providing the symbols for displaying the emoji. Adding an emoji font is a relatively simple matter, and good options are available, including some that we’ll see shortly.

Emoji Support: Android

Android has supported emoji since Jelly Bean (4.1) and color emoji since Kit Kat (4.4). The latest version of Android, the recently released “Nougat” version 7.1, includes substantial changes. Like Microsoft, Google has made a big push toward ensuring its emoji support is second to none. The prior version of Android, “Marshmallow” (6.0.1), supported all of the official emoji through Unicode Version 8, with the notable exception of skin tone modifiers.

“Noto” and “Roboto” are the standard font families on recent versions of Android and Chrome. Unlike “Apple Color Emoji” and Windows’ “Segoe UI Emoji,” Noto167 and Roboto168 are freely available for download, including “Noto Color Emoji,” the primary emoji font on Android.

For starters, Google is beginning to move away from the amorphous blobs that Android users are familiar with. However, Android is keeping the same gumdrops for generic faces:

Figure 8: Generic face emoji in Android Nougat (version 7.0)

  • Slightly Smiling Face, U+1F642 (Look familiar? Yep, it’s unchanged from Android 6.0.1)
  • Smiling Face With Open Mouth, U+1F603
  • Face With Stuck-out Tongue, U+1F61B
  • Drooling Face, U+1F924 (New with Unicode 9.0)
  • Rolling on the Floor Laughing, U+1F923 (New with Unicode 9.0)

Those last two, both introduced in Unicode 9.0, are proof that Google is not entirely abandoning its gumdrops. But emoji are changing where it matters most, with human-looking depictions for less generic faces, actions and groups, complete with skin tone modifiers (conspicuously absent from Android to date).

Here are a few examples to give you some sense of how much the situation has improved:

Figure 9: Comparison of Android emoji from 6.0.1 to 7.0 Dev Preview 2

  • Man With Turban, U+1F473 (in Android N Dev Preview 2, also with skin tone modifiers U+1F3FB, U+1F3FC, U+1F3FD, U+1F3FE, U+1F3FF)
  • Girl, U+1F467 (in Android N Dev Preview 2, also with skin tone modifiers U+1F3FB, U+1F3FC, U+1F3FD, U+1F3FE, U+1F3FF)
  • Happy Person Raising One Hand, U+1F64B (in Android N Dev Preview 2, also with skin tone modifiers U+1F3FB, U+1F3FC, U+1F3FD, U+1F3FE, U+1F3FF)
  • Man and Woman Holding Hands, U+1F46B
  • Couple With Heart: Woman, Woman (U+1F469 U+200D U+2764 U+FE0F U+200D U+1F469); Man, Man (U+1F468 U+200D U+2764 U+FE0F U+200D U+1F468); Woman, Man (U+1F469 U+200D U+2764 U+FE0F U+200D U+1F468)

Nougat includes all of the new emoji introduced with Unicode Version 9169. However, despite it originally being Google’s proposal, and their continued close involvement with the beta standard, Version 4.0 emoji, including gendered emoji and professions, didn’t make the original Nougat (7.0) update.

However, just in the past few days, on 20 October 2016, Google released Android 7.1 with support for those Emoji Version 4.0 sequences. The 7.1 update is the first version of Android to include the new professions and gendered emoji, as well as an expansion of multi-person groupings to include single parent families. For good measure, Android 7.1 also includes 🏳️‍🌈 (Rainbow flag sequence — White flag, U+1F3F3 + Emoji variation selector, U+FE0F + ZWJ, U+200D + Rainbow, U+1F308).

Once again, Emojipedia is a good resource for platform-specific emoji information. A page dedicated to Android Nougat 7.1170 shows all of the emoji for the most recent version of the OS. You can find pages for earlier releases as well.

Emoji On The Web

The person viewing your website or application must have emoji support to see the intended symbols. We’ve specified the character set and encoding, and written emoji into our documents and UIs. The user agent renders the page and all of the characters on it, and that’s what emoji are, of course.

You’ll recognize that this is no more than the usual arrangement. However, emoji are newer than the other elements we’re used to working with, and what’s more, they occupy an odd space somewhere between text and images. Also, many of us begin with a shaky understanding of character sets and encodings. Altogether, it leads to confusion and consternation 😕 (Confused Face, U+1F615) in the way that only something similar to but not exactly the same as what we already know well can. Though this should come as no surprise, it’s critically important, and for that reason I’m mentioning it here at the end of the article.

A code point without a glyph is just a code point. U+2D53 (Tifinagh letter Yu) is clearly different than U+1F32E (Taco emoji). But, ultimately, we care only about the corresponding symbols ⵓ (U+2D53) and 🌮 (U+1F32E). As usual, we’re at the mercy of our audience.
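
To software, the two really are the same kind of thing. The following Python sketch treats the Tifinagh letter and the taco identically, reporting each one's code point and the number of bytes it occupies in UTF-8. The `describe` helper is my own invention for this example.

```python
def describe(ch):
    """Return (code point, UTF-8 byte count) for a single character."""
    return f"U+{ord(ch):04X}", len(ch.encode("utf-8"))

print(describe("\u2d53"))      # ('U+2D53', 3)
print(describe("\U0001F32E"))  # ('U+1F32E', 4)
```

Nothing here tells us whether either character can actually be displayed. That depends entirely on the fonts available to the person looking at it.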

What about polyfills, shims and fallbacks? Is there anything like that we can use to improve the situation? I’m happy to say that the answer is yes.

Emoji One

The Emoji One171 project is aimed at creating a collection of up-to-date, high-quality and universal emoji resources. The largest component of the project by far is a comprehensive set of emoji that can be freely used by anyone for any purpose, both non-commercial and commercial (with attribution).

Emoji One set sampler
Figure 10: Emoji One set sampler

In the words of the people responsible for the project, Emoji One is:

The web’s first and only complete open source emoji set. It is 100% free and super easy to integrate.

In December 2015, Emoji One announced the upcoming release of an updated and redesigned set of all of their emoji, branded “The 2016 Collection” (Q1 2016 Version 2.1.0, January 29, 2016). More importantly, they committed to quarterly design updates, a promise they have made good on to date.

On May 30th, Emoji One officially released its second quarterly update (Q2 2016 update, Version 2.2.0), with a total of 624 “design upgrades,” encompassing changes to existing symbols and entirely new (to the set) emoji.

Shortly thereafter, on June 21st, and coinciding with the official release of Unicode 9.0, Emoji One released version 2.2.4, updating its set to include the 72 new Unicode emoji along with associated skin tone sequences.
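As a reminder of what a skin tone sequence is, the following Python sketch builds the five variants of one emoji by appending each of the emoji modifiers (U+1F3FB to U+1F3FF) directly after the base character. The function name and the choice of Thumbs Up as the base are mine, for illustration only. This is not how Emoji One generates its artwork.

```python
# The five emoji modifiers occupy the code points U+1F3FB through U+1F3FF.
SKIN_TONE_MODIFIERS = [chr(cp) for cp in range(0x1F3FB, 0x1F400)]

THUMBS_UP = "\U0001F44D"


def with_skin_tones(base):
    """Return the base emoji followed by each modifier in turn."""
    return [base + modifier for modifier in SKIN_TONE_MODIFIERS]


for variant in with_skin_tones(THUMBS_UP):
    # Each variant is two code points; supporting platforms draw one glyph.
    print(variant, ["U+{:04X}".format(ord(ch)) for ch in variant])
```

Only certain emoji are defined as modifier bases. Appending a modifier to anything else usually shows the two symbols side by side.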

Version 2.2.4 is the current version, though Emoji One has recently begun promoting the next major update, version 3.0, teasing “The revolution begins this December.”

Currently, the Emoji One set comprises 1834 emoji, organized in nine categories, all of which can be browsed on the Emoji One website in the Emoji Gallery and searched via The Demo.

Emoji One has also created two web apps:

  • emojicopy provides a responsive interface for searching emoji and copying selected symbols, with optional text, to paste into other apps.
  • The second app offers a number of useful features, including a “Cheat Sheet” for browsing and searching Emoji One short codes, a method of inputting emoji in applications and websites with built-in support for Emoji One. Its “Family Tree” is a table comparing emoji symbols across as many as 10 platforms (Symbola, Emoji One, Apple, Google, Windows, Twitter, Mozilla, LG, Samsung and Facebook). It’s essentially a prettier version of the same sort of table on the Unicode Consortium’s “Full Emoji Data” page previously mentioned.
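Short codes are easy to picture with a small sketch. The Python below expands codes of the form :name: into emoji characters using a tiny lookup table. Both the table and the function are hypothetical and hand-made for this example. They are not Emoji One's data or its conversion script, which cover the full set.

```python
import re

# Hypothetical, hand-picked table for illustration only.
SHORT_CODES = {
    ":taco:": "\U0001F32E",
    ":confused:": "\U0001F615",
}

# A short code here is a run of lowercase letters, digits, "_", "+" or "-" between colons.
SHORT_CODE_PATTERN = re.compile(r":[a-z0-9_+-]+:")


def expand_short_codes(text):
    """Replace known short codes with emoji; leave unknown ones untouched."""
    return SHORT_CODE_PATTERN.sub(
        lambda match: SHORT_CODES.get(match.group(0), match.group(0)),
        text,
    )


print(expand_short_codes("Lunch :taco: then nap :unknown:"))
```

Leaving unknown codes as they are is a deliberate choice in this sketch, so that ordinary text containing colons is not mangled.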

The Family Tree is certainly much cleaner and provides a beneficial search interface. Just keep in mind that, when in doubt, the Unicode Consortium is the authoritative source.

Lastly, an Emoji Rolodex lists links to quite a few resources: information, libraries, scripts, plugins, standalone apps and more. There are valuable tools here and some purely fun links, too.

All Emoji One art files are available for personal and commercial use under a Creative Commons license (CC BY 4.0). Emoji are available as PNG, SVG and font files (for Android, macOS and iOS) and can be downloaded as a complete set from the Emoji One developer page or individually from the gallery.

In addition to the emoji art itself, the developer resources include a toolkit of conversion scripts, style sheets, sprite files and more.

The project also offers an extension for Google Chrome, fittingly called “EmojiOne for Chrome,” which is described as “four game-changers in one,” providing:

  • a panel in Chrome for inputting emoji
  • emoji character search
  • set-once toggling between skin tones
  • well, the last “game changer” is Emoji One itself (which is cheating, but I think we can let it slide)

Emoji One is, without a doubt, an important (assuming you think emoji are important), well-executed, well-maintained and ambitious project. If all of that wasn’t enough, Emoji One has paid up to become a voting member of the Unicode Consortium.

Let’s hope the project has a bright future.

We can include Emoji One in our websites and applications and know that not only will visitors see the emoji we intend, but also that the emoji we use will maintain a consistent, high-quality appearance across devices, operating systems and browsers. Come on, that’s pretty great!

How do we use Emoji One?

To get started, you’ll want to read through the information in Emoji One’s GitHub repository, or download the complete developer toolkit. Presently, the toolkit is around 60 MB and includes all emoji symbols as PNG files in three sizes (64, 128 and 512 pixels), and SVG vector images as well.

If you just want to read through the instructions for getting started, the Emoji One readme on GitHub is probably a better option.

You will learn that Emoji One has partnered with jsDelivr, a “free super-fast CDN for developers and webmasters” (their words, not mine), to make it easy to install on any JavaScript-capable website. It’s a matter of adding script and link elements that point to the CDN-hosted JavaScript and CSS files in the usual way.

There are also options for installing via npm, Bower, Composer and Meteor package managers. If any of those are relevant to you, then you should have no trouble at all.


I would be remiss if I didn’t mention that Emoji One is not your only option for open source emoji sets. In November 2014, Twitter announced on its blog that it was “Open Sourcing Twitter emoji for Everyone.”

Figure 11: Twemoji set sampler

From the blog post:

The project ships with the simple twemoji.js library that can be easily embedded in your project. We strongly recommend looking at the preview.html source code to understand some basic usage patterns and how to take advantage of the library, which is hosted by our friends at MaxCDN.…

For more advanced uses, the twemoji library has one main method exposed: parse. You can parse a simple string, that should be sanitized, which will replace emoji characters with their respective images.

Twemoji is hosted on GitHub and includes, among other resources, a set of emoji images as PNG files at 16 × 16, 36 × 36, 72 × 72 sizes, and SVG vectors as well.
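The general idea behind replacing emoji characters with images can be sketched in a few lines of Python. To be clear, this is not the twemoji library or its parse method. It is a simplified illustration of the technique, and everything in it is an assumption made for the example: the base URL, the file naming by lowercase hexadecimal code point, and the deliberately narrow range of code points it recognizes.

```python
import re
from html import escape

# Hypothetical location of 72 x 72 PNG files named by hexadecimal code point.
BASE_URL = "https://example.com/emoji/72x72/"

# Deliberately narrow: single code points in U+1F300..U+1F6FF only.
# Flags, skin tones and ZWJ sequences would need a far more careful pattern.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001F6FF]")


def to_img(match):
    """Build an img element for one matched emoji character."""
    ch = match.group(0)
    file_name = format(ord(ch), "x") + ".png"
    # The original character is kept as alt text, so copy and paste still works.
    return '<img class="emoji" alt="{}" src="{}{}">'.format(ch, BASE_URL, file_name)


def replace_emoji_with_images(text):
    """Escape the text for HTML first, then swap emoji for images."""
    return EMOJI_PATTERN.sub(to_img, escape(text))


print(replace_emoji_with_images("Taco time \U0001F32E"))
```

Escaping before substituting mirrors the advice in the quote above that the string should be sanitized. The image markup is added only after the untrusted text has been made safe.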

After the initial release, version 2 added emoji from Unicode Version 8 (as well as other changes). Subsequently, version 2.1 added symbols for all of the Unicode 9 emoji, for a total of over 1,830 symbols. The current version of Twemoji is 2.2, which includes support for the gendered and profession emoji from the Unicode Emoji Version 4.0 draft, bringing the total number of symbols to 2,477. A complete download of the current version of the project is a ZIP file of over 400 MB. You can read about the latest updates in the project’s ReadMe file.


🏃  🌲                    🏡           🌲🌲         🌲

Are we done? Could it be? Do we know everything there is to know about emoji? 😫 (Tired Face, U+1F62B)

It might be a stretch to say we know absolutely everything there is to know, but we know what we need to know to find and understand everything there is to know. 😤 (Face with Look of Triumph, U+1F624)

🚶   🌲                    🏡           🌲🌲         🌲

(rb, al, il, vf)



Rob Reed is a long-time IT professional working (and living) in Boston, MA (USA), most recently as a consultant and previously with organizations ranging from regional nonprofits to a leading global business management firm. He has a master’s degree in Computer Science and for the past few years has been writing and teaching. Rob is a strong proponent of the open web and simple development, where simple means you know how it works.

  1. 1

    Were you paid by the word?

    I got bored half way through you going WAY off track and explaining what unicode is, which appears on page EIGHT if you were to print this out.

    A 27-point table of contents?

    You finally get onto the subject at hand in CHAPTER 3 ?????

    This needs a LOT of editing.

    • 2

      Hi Jim,

      I’m sorry you feel that way. Just like every article here, this article has been thoroughly edited and reviewed. And yes, it is quite long and comprehensive. No reason to get upset about that though. :-)

      • 3

        If I were to choose to print this article, Chrome tells me that it would currently run to around 78 pages of A4

        78 pages.

        As a comparison, the HTML 5 spec for FORMS (and all possible INPUT elements) currently runs to 75 pages, and one of those is almost blank

        I would suggest that it’s quite possible that , if I were to approach you on the street and ask “hey, tell me about emoji” – we would be finished with our conversation in around a minute, maybe 2 minutes.

        Not 78 pages.

        Today’s broadsheet newspaper – containing details of almost everything significant that has happened in the world in the last 24 hours – is unlikely to contain 70 pages of actual content. And LOTS is happening in the world at the moment.

        78 pages. On emoji.

        But thanks for editing it. I dread to think how long the original was….”Chapter 1 – The birth of Tim Berners-Lee’s grandfather”

        • 4

          valerio pierbattista

          November 15, 2016 12:14 pm

          dude ….

        • 5

          Jim you’re pretty mad. You could just not read it if you don’t care. Maybe you should relax.

        • 6

          So it’s a long article. So what? Why does that upset you to the point of writing such a scathing comment? Maybe through the author’s lens, the topic of emoji isn’t something that can be summarized in a two-minute street conversation. And that’s okay! :)

    • 7

      Ha. It is long. That was a calculated risk. In fact it did go through a ton of editing. The first draft was done at the beginning of the summer and then there were many revisions as the situation shifted with the release of Unicode 9, platform providers updated their emoji support and documentation, and other changes. IMO Smashing Magazine deserves a lot of credit for spending the time and effort required to edit it, and for being willing to publish the article in the first place. It is not the more typical relatively short piece about a narrow topic. Those are great, and often just what is needed, but this is something else. It tries to establish a foundation of knowledge on which many of those shorter articles build.

      Hopefully the TOC right at the beginning helps readers who are interested in just one aspect of the article to jump to that section, as well as to provide everyone an overview of the structure of the article. That’s the idea, and the TOC was intended specifically to address the length and the scope.

      I look at it like this… Any one article, book and the like is finite, and if it’s comprehensive and accurate, then it’s going to ultimately save me time as a reader because I won’t need to scramble to fill in gaps, read multiple overlapping and conflicting articles and books, and then try to sort it all out for myself – all of this scattered over weeks, months, and years. An article, even a long one, has a beginning and an end. Few things in technology provide that luxury.

      Not every topic lends itself to a comprehensive treatment, but I think it’s fair to say that character sets and encoding are topics that do. I’m always looking for these kinds of articles to hold on to, and I find too few of them. But that’s as much my preference as anything else. It’s not surprising there is both positive and negative feedback. Reasonable people can have differences of opinion about this kind of thing, and to some extent it’s a matter of perspective.

      For anyone intimidated, frustrated, or annoyed by the length but genuinely interested in the article, I might recommend breaking the reading into chunks. Spend 5 or 10 minutes on it a day and in a short period you’ll have finished the whole thing. That’s probably a better way to absorb the information anyway.

      Regardless, I do appreciate the feedback. I’m not being dismissive in any way. I hope that’s not how it comes across. As I write in the future it’s nice to have this perspective to carry with me.


  2. 8

    > As an example, let’s look at 🐘 (Elephant emoji). The code point for the Elephant emoji is U+1F418. That’s in the BMP, and it fits into figure 1 at box 1F, an address space generally reserved for non-Latin European scripts.

    Spotted a little mistake. I don’t think U+1F418 is in the BMP

  3. 13

    xylon gellanggao

    November 15, 2016 12:52 am

    Wow! very informative, I find it interesting. Instead of clicking the icon right away I would rather love to use the code. Thanks!


  4. 14

    I would like to propose emojii as the plural of emoji. Pronounced like cacti (which is, itself, not technically correct apparently, but it’s what a lot of us learnt is the plural of cactus in school).

    • 15

      I agree with you, but as you say, your example isn’t the best :) I propose relating it to the Japanese origin of the word, and Japanese does not have plural forms.

      Saying “emojis” is like saying “samurais”. 😅

      • 16

        That’s a very interesting point Tom. It made me want to dig a little more, and I found another article from The Atlantic which is a nice follow up to the one quoted in the article.

        If this question of pluralizing emoji is of interest to you, then you’ll want to read “The Trouble With Pluralizing Emoji” (Jan 2016).

  5. 17

    wooww ! awesome ! very very informative. and you this is article the best

  6. 18

    Holy 🐮, this was interesting! Best article I’ve read in a long time 😁
    I love your style of writing, all the explanations, asides and whatnots, hugely engaging and very informative. Great job, thank you!

    • 19

      Just WOW. It took me around 30 mins to read this article and it was worth it. Loved the way how author kept the readers engaged throughout. Massive information for any UX designer.


  7. 20

    Magnificent article. We need more of these long form articles. Documented, researched.

    One of the things I would have added (yeah making it a bit longer) is a reference to the art space it generated such as a translation of Herman Melville of Moby Dick into… emoji.

    • 21

      Thanks karl! Fortunately because of comments you did get to add that reference, and I think that’s valuable.

      I’m a huge fan of comments. Sure there are problems with comments, and I don’t mean to trivialize them. But we’re very capable of solving problems when we put our minds to it.

      I would argue that comments, and interaction more generally, are a key advantage of platforms like this and digital publication. I can imagine every book and article being the start of an ongoing exchange among a community including the author and readers. Rather than books and articles growing stale, they’re kept alive and relevant with corrections, new details, and entirely new ideas and perspectives. The author, readers, the publisher all benefit.

      Eventually that discussion could inform an update of an article or the next edition of a book.

      Think of the way GitHub has changed open source software projects. No longer is software just a download link. Instead it’s the beginning of a conversation.

      Thanks again!

  8. 22

    A true masterpiece 🎭

    Well done for writing this brain-busting knowledge-filled article. Read it over a few days and absolutely loved it. 👍🏼

  9. 23

    Mohamed Hussain S H

    November 21, 2016 8:25 am

    so much information in this article and it covers pretty well basic typography concepts along with emojis… thanks for this article…

    • 24

      Thanks Mohamed

      I wanted to tackle the topics of character sets and encoding, which are vitally important not just to emoji but everything we do as web designer and developers (and as digital communicators more generally). But those are pretty dry, intimidating topics. Emoji is such a fun topic that it struck me as a great approach. Also emoji are interesting and important in their own right.

      You’re absolutely right that everything in the article applies equally to typography. After all native emoji are implemented as fonts. That alone is an important insight.

  10. 25

    In the second half of the post it seems that you made several escaping bugs and what was probably intended to be a numeric representation of a character in brackets ended up being displayed as the character itself, giving a strange redundant form “x (x)” in many places.

    • 26

      Hi qbolec,

      You’re right. Somehow that got overlooked. The initial ampersand should have been escaped. At least that’s how it is in the local draft I have (I just looked) but apparently something just got lost in the translation. It would be easy to correct, I don’t know if that’s something Smashing Magazine is willing to do, but I can ask. So I’ll do that.

      Thanks for the careful reading and for taking the time to report the issue.



    • 27

      This should be corrected now. I think I got all of them. It was a little more fiddly than simply not escaping the character references correctly.

      Named character references and numeric references are being treated differently by the platform. For named references it was enough to escape the leading ampersand to get the reference to appear as content. That’s the behavior I’m used to. However with numeric references that wasn’t enough. The leading character reference was being decoded (by the editor I assume) and then the entire numeric character reference was being decoded rather than displayed. Anyway, I worked around it.

      Thanks again qbolec!


