Twelve Commandments Of Software Localization

You’ve presented the new website and everyone loves it. The design is crisp, the code is bug-free, and you’re ready to release. Then someone asks, “Does it work in Japanese?”

You break out in a cold sweat: you have no idea. The website works in English, and you figured other languages would come later. Now you have to rework the whole app to support other languages. Your release date slips, and you spend the next two months fixing bugs, only to find that you’ve missed half of them.

Localization makes your application ready to work in any language — and it’s much easier if you do it from the beginning. Just follow these 12 simple rules and you’ll be ready to run anywhere in the world.

1. “Resource” All Of Your Strings

The first step of localization is to get user-visible strings out of your code and into resource files. Those strings include titles, product names, error messages, strings in images and any other text the user might see.

Most resource files work by giving each string a name and allowing you to specify different translation values for that string. Many languages use properties files like this:

name = Username

Or they use gettext .po files like this:

msgid "Username"
msgstr "Nom d'utilisateur"

Or they use XLIFF files like this:

<trans-unit id="1">
 <source xml:lang="en">Username</source>
 <target xml:lang="fr">Nom d'utilisateur</target>
</trans-unit>

The resource files are then loaded by a library that uses a combination of the language and country, known as the locale, to identify the right string.

Once you’ve placed your strings in external resource files, you can send the files to translators and get back translated files for each locale that your application supports.
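To make the lookup concrete, here is a minimal sketch of how a library might pick a string for a locale. The `resourceBundles` object and the `lookupString` function are hypothetical names invented for this illustration; a real library would load the bundles from your translated files.

```javascript
// Hypothetical in-memory bundles, keyed by locale. In a real application
// these would be loaded from the translated resource files.
const resourceBundles = {
  en: { name: "Username", favor: "favor" },
  en_GB: { favor: "favour" },
  fr: { name: "Nom d'utilisateur" }
};

// Try the full locale first (en_GB), then the bare language (en),
// then a default language.
function lookupString(key, locale, defaultLocale = "en") {
  const language = locale.split("_")[0];
  for (const candidate of [locale, language, defaultLocale]) {
    const bundle = resourceBundles[candidate];
    if (bundle && Object.prototype.hasOwnProperty.call(bundle, key)) {
      return bundle[key];
    }
  }
  // Return the key itself so that a missing string is easy to spot.
  return key;
}

console.log(lookupString("favor", "en_GB")); // found in the en_GB bundle
console.log(lookupString("name", "en_GB"));  // falls back to the en bundle
```

The fallback chain means translators only need to supply the strings that differ between, say, British and American English.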

2. Never Concatenate Strings

Appending one string to another almost always results in a localization bug. It’s easy to see this with modifiers such as color.

Suppose your stationery store has items such as pencils, pens and sheets of paper. Shoppers will choose what they want and then select a color. In the shopping cart you would show them items such as a red pencil or a blue pen with a function like this:

function getDescription() {
    var color = getColor();
    var item = getItem();

    return color + " " + item;
}

This code works well in English, in which the color comes first, but it breaks in French, in which “red pencil” translates as “crayon rouge” and “blue pen” is “stylo – encre bleue.” French speakers (but not only them) put modifiers after the words they modify. The getDescription function would never be able to support languages like this with simple string concatenation.

The solution is to specify parametrized strings that change the order of the item and color for each language. Define a resourced string that looks like this:

itemDescription = {0} {1}

It might not look like much, but this string makes the translation possible. We can use it in a new getDescription function, like this:

function getDescription() {
    var color = getColor();
    var item = getItem();

    return getLocalizedString('itemDescription', color, item);
}

Now, your translators can easily switch the order, like this:

itemDescription = {1} {0}

The getLocalizedString function here takes the name of a resource string (itemDescription) and some additional parameters (color and item) to substitute for placeholders in the resource string. Most programming languages provide a function similar to getLocalizedString. (The one notable exception is JavaScript, but we’ll talk more about that later.)

This method also works for strings with text in them, like:

invalidUser = The username {0} is already taken. Please choose another one.
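Since JavaScript has no built-in equivalent, here is a rough sketch of what a `getLocalizedString` function could look like. The `placeholderStrings` table and the `currentLocale` variable are assumptions made up for this example, not part of any particular library.

```javascript
// Hypothetical resource strings; the French entry swaps the placeholders.
const placeholderStrings = {
  en: {
    itemDescription: "{0} {1}",
    invalidUser: "The username {0} is already taken. Please choose another one."
  },
  fr: {
    itemDescription: "{1} {0}"
  }
};

// Assumed to be set once per request or page load.
let currentLocale = "en";

function getLocalizedString(name, ...params) {
  const localized = placeholderStrings[currentLocale] || {};
  // Fall back to English when the current locale lacks the string.
  const template = name in localized ? localized[name] : placeholderStrings.en[name];
  if (template === undefined) {
    return name; // Make the missing resource visible instead of throwing.
  }
  // Replace {0}, {1}, ... with the matching parameter, if one was passed.
  return template.replace(/\{(\d+)\}/g, (placeholder, index) =>
    Number(index) < params.length ? String(params[Number(index)]) : placeholder
  );
}

console.log(getLocalizedString("itemDescription", "red", "pencil"));
```

Because the order lives in the resource string, the calling code stays the same for every language.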

3. Put All Of Your Punctuation In The Resourced String

Tacking on punctuation later is often tempting, so that you can reuse the same string, say, in a label where it needs a colon and in a tooltip where it doesn’t. But this is another example of bad string concatenation.

Here, we’re adding a simple log-in form using PHP in a WordPress environment:

<form>
<p>Username: <input type="text" name="username"></p>
<p>Password: <input type="password" name="password"></p>
</form>

We want the form to work in other languages, so let’s add the strings for localization. WordPress makes this easy with the __ function (i.e. underscore underscore):

<form>
<p><?php echo(__('Username', 'my-plugin')) ?>: <input type="text" name="username"></p>
<p><?php echo(__('Password', 'my-plugin')) ?>: <input type="password" name="password"></p>
</form>

Spot the bug? This is another case of string concatenation. The colon after the labels isn’t localized. This will look wrong in a language like French, which puts a space before the colon as well as after it. Punctuation is part of the string and belongs in the resource file.

<form>
<p><?php echo(__('Username:', 'my-plugin')) ?> <input type="text" name="username"></p>
<p><?php echo(__('Password:', 'my-plugin')) ?> <input type="password" name="password"></p>
</form>

Now the form can use Username: in English and Nom d'utilisateur : in French.

4. “First” Names Sometimes Aren’t

My name is Zack Grossbart. Zack is my given (or first) name, and Grossbart is my last (or family) name. Everyone in my family is named Grossbart, but I’m the only Zack.

In English-speaking countries, the first name is the given name and the last name is the family name. Many East Asian cultures go the other way, and some cultures have only one name.

The cellist Yo-Yo Ma is a member of the Ma family. In Chinese, he writes his family name first: Ma Yo-Yo (馬友友).

This gets tricky because many people change their names when moving from Asian countries to English-speaking ones. They often switch the order to fit local customs, so you can’t make any assumptions.

You must provide a way to customize the presentation of names; you can’t assume that the first name always comes first or that the last name always comes last.

WordPress handles this pretty well by asking you how you want your name to show up:

Name formatting in WordPress

It would be even better if WordPress supported a middle name and a way to specify the format per locale so that you could make your name one way in English and another in Chinese, but nothing’s perfect.
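One way to avoid hard-coding the order is to store the given and family names separately and let a pattern decide how they are displayed. The sketch below is only an illustration: `namePatterns`, `formatName` and the `preferredPattern` field are hypothetical names, and the per-locale defaults are assumptions that the user should always be able to override.

```javascript
// Hypothetical default patterns per locale.
const namePatterns = {
  en_US: "{given} {family}",
  zh_CN: "{family}{given}"
};

function formatName(person, locale) {
  // The person's own preference wins over any locale default.
  const pattern =
    person.preferredPattern || namePatterns[locale] || "{given} {family}";
  return pattern
    .replace("{given}", () => person.given || "")
    .replace("{family}", () => person.family || "")
    .trim(); // Tidies up the result for people who have only one name.
}

console.log(formatName({ given: "Yo-Yo", family: "Ma" }, "en_US"));
console.log(formatName({ given: "友友", family: "馬" }, "zh_CN"));
```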

5. Never Hard-Code Date, Time Or Currency Formats

The whole world is inconsistent about date and time formats. Some people put the month first (6/21/2012), others the day first (21/6/2012). Some use 24-hour (14:00) time, and some use 12 (2:00 PM). Taiwan uses specially translated strings instead of AM and PM, and those come first (上午 2:00).

Your best bet is to store all dates and times in a standard format such as ISO 8601 or epoch time, and to use a library like Date.js or Moment.js to format them for the given locale. These libraries can also handle converting the time to the current zone, so you can store all dates and times in a common format on the server (such as UTC) and convert them to the right time zone in the browser.

Dates and times are also tricky when displaying calendars and date pickers. Egypt starts the week on Saturday, the US starts on Sunday, the UK on Monday and the Maldives on Friday. The jQuery UI date picker includes over 50 localized files to support different calendar formats around the world.

The same is true of currencies and other number formats. Some countries use commas to separate numbers, and others use periods. Always use a library with localized files for each of the locales that you need to support.
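As an illustration of "store neutral, format late", here is a small sketch. Note that it swaps in the `Intl` API that ships with modern JavaScript engines instead of Date.js or Moment.js, and it assumes an environment with full locale data. The `formatDate` and `formatNumber` helpers are names made up for this example.

```javascript
// Store the moment in a locale-neutral form (ISO 8601, in UTC)...
const storedTimestamp = "2012-06-21T14:00:00Z";
const moment = new Date(storedTimestamp);

// ...and format it only at display time, for the reader's locale.
// The time zone is pinned to UTC here so the output is repeatable;
// a real page would pass the user's own time zone instead.
function formatDate(date, locale) {
  return new Intl.DateTimeFormat(locale, { timeZone: "UTC" }).format(date);
}

function formatNumber(value, locale) {
  return new Intl.NumberFormat(locale).format(value);
}

console.log(formatDate(moment, "en-US")); // month first
console.log(formatDate(moment, "en-GB")); // day first
console.log(formatNumber(1234567.89, "en-US")); // commas between thousands
console.log(formatNumber(1234567.89, "de-DE")); // periods between thousands
```

Nothing in the calling code knows which separator or field order a locale uses; that knowledge stays in the locale data.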

Stack Overflow covers this topic well when discussing daylight saving time and time zone best practices.

6. Use UTF-8 Almost All Of The Time

The history of computer character encodings is a long one, but the most important thing to remember is that UTF-8 is the right choice 99% of the time. The only time not to use UTF-8 is when you’re working primarily with Asian languages and absolutely need the efficiency of UTF-16.

This comes up a lot with Web applications. If the browser and the server don’t use the same character encoding, then the characters will get corrupted and your application will fill up with little squares and question marks.

Many programming languages store files using the system’s default encoding, but it won’t matter that your server is English when all of your users are browsing in Chinese. UTF-8 fixes that by standardizing the encodings across the browser and the server.

Declare UTF-8 at the top of all of your HTML pages:

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

And specify UTF-8 in the HTTP Content-Type header, like this:

Content-Type: text/html; charset=utf-8

The JSON specification requires that all JSON documents use Unicode with a default of UTF-8, so make sure to use UTF-8 whenever you’re reading or writing data.
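To see what an encoding mismatch actually does, here is a tiny sketch that writes a string as UTF-8 and then reads the same bytes back as Latin-1. It assumes a Node.js environment, since it relies on Node’s `Buffer` class.

```javascript
const original = "caf\u00e9"; // "café"

// Encode as UTF-8: the é becomes two bytes (0xC3 0xA9).
const utf8Bytes = Buffer.from(original, "utf8");

// Decode with the wrong encoding: each byte turns into its own character.
const misread = utf8Bytes.toString("latin1");

// Decode with the matching encoding: the text survives intact.
const roundTrip = utf8Bytes.toString("utf8");

console.log(utf8Bytes.length); // 5 bytes for 4 characters
console.log(misread);          // garbled
console.log(roundTrip);        // café
```

The bytes never changed; only the assumption about how to read them did. That is why the browser and the server need to agree.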

7. Give Strings Room To Grow And Shrink

Strings change size in translation.

Repeat password example

“Repeat password” is over 50% wider in German than in English; if there isn’t enough space, then your strings will overlap other controls. WordPress solves this problem by leaving extra space after each label for the string to grow.

Label spacing in the WordPress admin

This works well for languages whose strings are roughly of the same length, but for languages with long words, such as German and Finnish, the controls will overlap if you don’t leave enough space. You could add more space, but that would put the labels and controls pretty far apart from each other in compact languages such as Chinese, thus making them hard to use.

Label spacing in the WordPress admin in Chinese

Many designers of forms give their labels room to grow and shrink by aligning them to the right or by placing them above the controls.

Label above controls in the WordPress admin

Putting labels above the controls works well for a short form, but it makes a form with a lot of fields very tall.

There’s no perfect answer for how to make your application work in all languages; many form designers mix and match these approaches. Short labels like “User name” and “Role” won’t change much in translation and need just a little extra space. Longer paragraphs will change substantially and need room to grow wider, taller or sometimes both.

Label next to and above controls in the WordPress admin

Here, WordPress gives a little extra space for the “Biographical Info” label, but it puts the longer description below the field so that it can grow in translation.

8. Always Use A Full Locale

The full locale includes the language and country code, and it supports alternate spellings, date formats and other differences between two countries with a shared language.

Always use a full locale instead of just a language when translating, so that you know whether you’re doing someone a favor or a favour, and that they know whether to take the elevator or the lift, and that they know whether £100.00 is expensive.

9. Never Trust The Browser To Know The Right Locale

Localization is much more difficult with browsers and JavaScript because they give a different locale depending on who’s asking.

JavaScript has a property to tell you the current language, named navigator.language (older versions of Internet Explorer call it navigator.userLanguage). All browsers support one or the other, but it’s generally useless.

If I install Firefox in English, then my navigator.language value would say English. I can then go into my preferences and change my preferred languages. Firefox lets me select multiple languages, so I could specify my order of preference as English from the US, then any other English, then Japanese.

Language preferences in Firefox

Specifying a set of locales makes it possible for servers to find the best match between the languages that I know and the ones they support. Firefox takes these locales and sends them to the server in an HTTP header, like this:

Accept-Language: en-us,en;q=0.7,ja;q=0.3

Firefox even uses the quality factor (that q= part) to indicate how much I prefer one locale over another.

This means that the server might return content in English or Japanese or another language if it doesn’t support either. However, even after I’ve set my preferred language in Firefox, the value of my navigator.language property will still be English and only English. The other browsers don’t do much better. This means that I might end up with the server thinking I want Japanese and with the JavaScript thinking I want English.

JavaScript has never solved this problem, and it has not one standard localization library, but dozens of different standards. The best solution is to embed a JavaScript property or some other field in your page that indicates the locale when the server processes each request. Then you can use that locale when formatting any strings, dates or numbers from JavaScript.
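Here is a rough sketch of that approach, written as server-side JavaScript. The function names (`parseAcceptLanguage`, `pickLocale`, `renderPage`) and the list of supported locales are assumptions for this example; the matching is deliberately simple and is not a full implementation of HTTP language negotiation.

```javascript
// Turn "en-us,en;q=0.7,ja;q=0.3" into a list sorted by preference.
function parseAcceptLanguage(header) {
  return header
    .split(",")
    .map((part) => {
      const [tag, ...params] = part.trim().split(";");
      const quality = params.map((p) => p.trim()).find((p) => p.startsWith("q="));
      return {
        tag: tag.trim().toLowerCase(),
        q: quality ? parseFloat(quality.slice(2)) : 1 // No q means most preferred.
      };
    })
    .filter((entry) => entry.tag !== "" && !Number.isNaN(entry.q))
    .sort((a, b) => b.q - a.q);
}

// Pick the first preferred language the application supports: an exact
// match if there is one, otherwise any locale in the same language.
function pickLocale(header, supported, fallback) {
  for (const { tag } of parseAcceptLanguage(header || "")) {
    const language = tag.split("-")[0];
    const match =
      supported.find((s) => s.toLowerCase() === tag) ||
      supported.find((s) => s.toLowerCase().split("-")[0] === language);
    if (match) {
      return match;
    }
  }
  return fallback;
}

// Embed the server's decision in the page so scripts don't have to guess.
// The locale comes from our own supported list, so it is safe to inline.
function renderPage(locale, body) {
  return `<html lang="${locale}"><body>${body}</body></html>`;
}

const chosen = pickLocale("en-us,en;q=0.7,ja;q=0.3", ["ja", "fr"], "fr");
console.log(renderPage(chosen, "..."));
```

In the browser, your scripts can then read `document.documentElement.lang` and format strings, dates and numbers with the same locale the server used.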

10. Plan For Languages That Read Left To Right And Right To Left

Most languages are written on screen from left to right, but Arabic, Hebrew and plenty of others go from right to left. HTML provides an attribute for the html element named dir that indicates whether the page is ltr (left to right) or rtl (right to left).

<html dir="rtl">

There’s also a direction property in CSS:

input {
    direction: rtl;
}

Setting the direction property will make the page work for the standard HTML tags, but it can’t switch a CSS element with float: left to float: right or change an absolutely positioned layout. To make more complex layouts work, you will need a new style sheet.

An easy way to determine the direction of the current language is to include a direction string in the resourced strings.

direction = rtl

Then you can use that string to load a different style sheet based on the current locale.
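As a small sketch of that last step, the helper below turns the resourced direction string into a list of style sheets. The file names `base.css` and `rtl.css` and both function names are hypothetical; use whatever your project already has.

```javascript
// Every locale gets the base style sheet; right-to-left locales
// get an extra one that flips floats and positioned layouts.
function styleSheetsFor(direction) {
  const sheets = ["base.css"];
  if (direction === "rtl") {
    sheets.push("rtl.css");
  }
  return sheets;
}

// Build the link tags to place in the page's head.
function styleSheetLinks(direction) {
  return styleSheetsFor(direction)
    .map((href) => `<link rel="stylesheet" href="${href}">`)
    .join("\n");
}

console.log(styleSheetLinks("rtl"));
```

Loading the right-to-left sheet after the base one lets it override only the rules that need to change.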

11. Never Sort In The Browser

JavaScript provides a sort function that arranges lists of strings alphabetically. It works by comparing each character in each string to determine whether a is greater than b or y is less than z. That’s why it makes 40 come before 5.

The browser knows that y comes before z by using a large mapping table for each character. However, the browser includes the mapping tables only in the current locale. This means that if you have a list of Japanese names, the browser wouldn’t be able to sort them properly in an English locale; it would just sort them by Unicode value, which isn’t correct.

This problem is easy to see in languages such as Polish and Vietnamese, which frequently use diacritical marks. The browser can tell that a comes before b, but it doesn’t know whether ą comes before ã.

The only place to sort strings properly is on the server. Make sure that the server has all of the code mappings for the languages you support, and that you send lists to the browser presorted, and that you call the server whenever you want to change the sorting. Also, make sure that the server takes locale into account for sorting, including right-to-left locales.
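If your server happens to run JavaScript, a locale-aware sort could look something like the sketch below. It assumes a runtime with full locale data and uses the built-in `Intl.Collator`; the `sortForLocale` helper is a name invented for this example.

```javascript
const unsortedNames = ["Zoe", "Ärger", "Adam"];

// The default sort compares character codes, so Ä lands after Z.
const byCharacterCode = [...unsortedNames].sort();

// A collator knows the alphabet rules of one specific locale.
function sortForLocale(list, locale) {
  const collator = new Intl.Collator(locale);
  return [...list].sort(collator.compare);
}

console.log(byCharacterCode);                    // Adam, Zoe, Ärger
console.log(sortForLocale(unsortedNames, "de")); // German puts Ä next to A
console.log(sortForLocale(unsortedNames, "sv")); // Swedish puts Ä after Z
```

The same list has two different correct orders, which is why the sort has to know the locale of the person reading it.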

12. Test Early And Often

Most teams don’t worry about localization until it’s too late. A big customer in Asia will complain that the website doesn’t work, and everyone will scramble to fix 100 little localization bugs that they had never thought of. Following the rules in this article will avoid many of those problems, but you will still need to test; and translations usually aren’t ready until the end of the project.

I used to translate my projects into Pig Latin, but that didn’t test Asian characters, and most browsers don’t support it. Now I create test translations with Xhosa (xh_ZA). All browsers support Xhosa, and Nelson Mandela speaks it natively, but I’ve never been asked to support it in a product.

I don’t speak Xhosa, so I create a new translation file and add xh to the beginning and end of every string. The xh makes it easy to see whether I’ve missed a string in the code. Throw in a few Japanese Kanji characters to test character encoding, and I have a messy string that tests for all of my translation issues.

Making the test translation file is easy. Just save a new properties file with xh_ZA in the file name and turn…

name = Username

… into:

name = xh吳清源Username吳清源xh

The resulting jumble will test that I’ve resourced every string, that I’m using the right locale, that my forms work with longer strings and that I’m using the right character set. Then I’ll just quickly scan the application for anything without the xh and fix the bugs before they become urgent issues.
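If you have more than a handful of strings, a few lines of script can build the test file for you. This is only a sketch: `pseudoTranslate` and `pseudoTranslateLine` are names made up here, and the parsing assumes simple `key = value` lines rather than the full properties file format.

```javascript
const PREFIX = "xh吳清源";
const SUFFIX = "吳清源xh";

// Wrap the value of one "key = value" line; leave comments,
// blank lines and anything without an "=" untouched.
function pseudoTranslateLine(line) {
  const trimmed = line.trim();
  const separator = line.indexOf("=");
  if (trimmed === "" || trimmed.startsWith("#") || separator === -1) {
    return line;
  }
  const key = line.slice(0, separator).trim();
  const value = line.slice(separator + 1).trim();
  return `${key} = ${PREFIX}${value}${SUFFIX}`;
}

function pseudoTranslate(propertiesText) {
  return propertiesText.split("\n").map(pseudoTranslateLine).join("\n");
}

console.log(pseudoTranslate("# Login form\nname = Username"));
```

Placeholders such as {0} pass through unchanged, so parametrized strings still get their values filled in.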

Do the right thing for localization ahead of time, and you’ll save yourself a lot of trouble in the long run.


Zack Grossbart is an engineer, designer, and author. He's a founding member of the Spiffy UI project, the architect of the WordPress Editorial Calendar, and a Consulting Engineer with NetIQ. Zack began loading DOS from a floppy disk when he was five years old. He first worked professionally with computers when he was 15 and started his first software company when he was 16. Zack lives in Cambridge, Massachusetts with his wife and daughter.

  1. 1

    Gunnar Bittersmann

    July 18, 2012 7:03 am

    > Localization makes your application ready to work in any language — and it’s much easier if you do it from the beginning.

    No. You’re confusing localization with internationalization, cf. http://www.w3.org/International/questions/qa-i18n It’s internationalization what makes your application ready to work in any language.

    > 4. “First” Names Sometimes Aren’t

    That’s why “try to avoid using the labels ‘first name’ and ‘last name’ in non-localized forms”, use ‘given name’ and ‘family name’ instead (even though these labels don’t fit to some cultures), or use just one input field ‘name’. http://www.w3.org/International/questions/qa-personal-names esp. section http://www.w3.org/International/questions/qa-personal-names#fielddesign

    > However, even after I’ve set my preferred language in Firefox, the value of my navigator.userLanguage property will still be English and only English. The other browsers don’t do much better.

    Of course, they don’t. They can’t do better because it’s good as it is.

    > This means that I might end up with the server thinking I want Japanese and with the JavaScript thinking I want English.

    No, this is not what it means. You’re confusing the language of the browser’s user interface (navigator.userLanguage) with the user’s preferred (ordered list of) languages for content (the Accept-Language field in the HTTP header). For many users the browser UI might equal to the first language in Accept-Language, but that’s not necessarily the case.

    And please do not link to w3schools, thank you. http://w3fools.com/

    You’ll find good resources on i18n and l10n on http://www.w3.org/International/articlelist

    0
    • 2

      Thanks for the comments. You’re right about the difference between localization and internationalization, but I wanted to keep the article easier to follow.

      Creating a single name field has some advantages, but it makes it impossible to sort by first and last name separately.

      The basic point about the browser’s locale is there’s no good way in JavaScript to know what the server thought the best locale was. You have to get that information from the server.

      Cheers,
      Zack

      0
      • 3

        Gunnar Bittersmann

        July 18, 2012 9:25 am

        Simplification is good; using wrong terms is over-simplification, which is not good, IMHO.

        Of course, one needs to consider whether given and family names are needed separately for a given application. If not, don’t bother the user with separate inputs.

        I’m not sure what you mean with “browser’s locale”. In the dialog shown in sec 9 the user sets just their preferred languages, but not preferred number/date formats, metric vs. imperial units etc. A “locale” would include all of this, so me thinks the term doesn’t fit here.

        I’ve also wondered once if there’s a way to access the preferred languages in JavaScript. And the answer is no.

        0
        • 4

          Well true, also Japanese is originally vertical. In Japanese they actually mix horizontal and vertical text in the newspapers, where some titles are horizontal and the body text is vertical. But for most daily use Chinese and Japanese run horizontally.

          0
      • 5

        Hi Jack.

        Agree with your answers below. We generally use or follow word “Localization:” even though we know the difference between localization and internationalization.

        “Thanks for the comments. You’re right about the difference between localization and internationalization, but I wanted to keep the article easier to follow. ”

        MJ

        0
    • 6

      Gunnar Bittersmann

      July 24, 2012 3:45 am

      > It’s internationalization what makes your application ready to work in any language.

      … and it’s localization to actually make the application work in a specific language for a specific audience.

      You might want to read Molly Holzschlag’s article “Putting the World into the World Wide Web” http://h30565.www3.hp.com/t5/Feature-Articles/Putting-the-World-into-the-World-Wide-Web/ba-p/5052 that covers not only some of the internationalization points made by Zack’s in this article but also sheds light on localization and addresses cultural issues.

      0
    • 7

      Downvote all you want, but what is with all the hate for w3schools?

      0
  2. 8

    Good article. The amount of annoyances localisation can cause without really trying is a possible proof of God and the Tower of Babel “incident”.

    I prefer to output the language that the server is presenting as an attribute of the html tag. This makes it easy to set locale values in JavaScript.

    Some minor points:
    “French speakers (but not only them) put modifiers after the words they modify.”
    Except when they don’t (de jolies fleurs) :-). The grammatical rules for “de” instead of du, de la, de l’ and des include “in front of an adjective in front of a plural noun” so that construct must exist.

    Only the US uses month/day/year for dates. Everybody else uses the saner “order by unit size” be that day/month/year or year/month/day. Also, UK can start weeks on a Sunday or Monday, just to be annoying.

    If you’ve got the content type in HTTP headers you don’t need it in the HTML. If you do you get a re-evaluation of the entire page in IE when it discovers it, even if it is the same. http-equiv has no place in the document anyway as if you know what the value is you should have set the HTTP header with that value in the first place.

    JavaScript’s sort method can take a function as a parameter to determine the result of comparing two items. This means you can write functions that sort objects based on a property of those objects. Not providing the function means it defaults to string sorting.

    0
  3. 9

    Gunnar Bittersmann

    July 18, 2012 8:02 am

    A minor addition to my comment (that still has to show up): In HTML5, you can specify the character encoding the short way: <meta charset="UTF-8"/>

    0
    • 10

      The weird thing about this short-hand is: You can send the current used charset with the HTTP header beforehand, so there is no need at all to specify it in the document itself.

      Example: https://gist.github.com/3151888

      This method formerly known as “meta http-equiv” is simply for the cases where there is no option to send HTTP headers (like when serving a simple HTML document instead of eg. a PHP-generated one) or for folks who don’t know jack about programming ;)

      cu, w0lf.

      0
      • 11

        Gunnar Bittersmann

        July 29, 2012 10:17 am

        > You can send the current used charset with the HTTP header beforehand, so there is no need at all to specify it in the document itself.

        Best practice is to use both: “you should use HTTP headers if it makes sense for any type of content, but in conjunction with an in-document declaration” http://www.w3.org/International/questions/qa-html-encoding-declarations

        > This method formerly known as “meta http-equiv” is simply for the cases where there is no option to send HTTP headers

        Not only for these cases. And with HTML5, you can use the shorter notation for that purpose. (That’s what I was gonna say with my previous comment.)

        0
  4. 12

    Good article overall, and thanks Gunnar for the pointers.

    0
  5. 13

    Very good article, thank you!
    Could you give an example where UTF-16 is needed when working with Asian languages? I’ve worked on many projects using a very wide range of Chinese characters (Unicode Ext. A + B) and, I might be wrong, but the best choice has always been UTF-8.

    0
    • 14

      UTF-8 vs. UTF-16 is a question of size. UTF-8 is more efficient with ASCII characters since it represents them as one byte each. The trade-off is that it must represent most Asian characters as three bytes each since it needs the first byte to identify that a longer character is coming. UTF-16 represents most characters, including the most common Asian ones, as two bytes each (rarer characters take four). That means a document of ASCII characters will be much shorter in UTF-8, but a document of Asian characters will usually be shorter in UTF-16.

      Both formats can represent all of the characters. It’s just a question of performance.

      Does that make sense?

      Thanks,
      Zack

      0
  6. 16

    Karel Thönissen

    July 19, 2012 7:52 am

    There is a large world outside web programming!

    The use of UTF-8 is probably a good choice for the encoding of a web page in transit. And agreed UTF-16, may be a good choice for a web page in transit written in an Asian language.

    However, for typical processing, UTF-8 is a horrible choice. Better use UTF-16 or UTF-32 for all internal representations of text, because the variable lengths of the encodings for the characters will make programming a hell.

    Use UTF-32 internally, and use UTF-8 for documents in transit.

    0
  7. 17

    Totally agree on the performance point of view!

    0
  8. 18

    That’s interesting, your use of Xhosa. I use a milder version of the same kind of thing: Metał. Every lower-case l gets transliterated into an ł, and every lower-case i gets transliterated into an ï. It is very rare for a message to lack both letters.

    The advantage of thïs subtler approach ïs that I can use ït throughout the software devełopment process because ït doesn’t ïmpact usabïłïty too much. The ï signals that the string has been translated and the ł checks that full UTF-8 support is present.

    Not that your more intrusive version doesn’t have its uses: but Metał is soft enough to be left turned on the whole time.

    0
  9. 19

    > This will look wrong in languages such as French and Russian,
    > which always put spaces around colons

    This is French only. There should be no space before a colon in Russian, just like in English.

    0
    • 20

      And in French, it should be a non-breaking space (U+00A0 or &nbsp;, depending on how this textarea mangles ampersands :) ) before and a regular space after colons and semicolons. It should really be a thin space, but that lacks a bit of support…

      0
    • 21

      Thanks for the update Vadim. I’ve taken Russian out of the punctuation example. It shows me yet again how hard it is to know everything about every language.

      0
  10. 22

    I’d like to add a general case warning: when you configure your server to serve localized interfaces, either using the browser’s Accept-Language HTTP header or IP-based country resolving, always, *always* allow the visitor a way to override the server-side guess about the correct language. Being stuck in the wrong language because the server “knows best” is a big-time annoyance (I’m talking to you, Bing). Let’s put it this way: you may guess and be right almost all the time, but your visitor definitely always knows for sure which is his/her preferred language.

    0
  11. 23

    The link to StackOverflow points to some Novell login page. It should be http://stackoverflow.com/questions/2532729/daylight-saving-time-and-timezone-best-practices I believe.

    0
  12. 25

    You should use YAML for localization files. It is much easier to read and write for humans and machines alike. http://www.yaml.org/

    0
  13. 26

    Instead of echo(__('Username:', 'my-plugin')) you could just try _e('Username:', 'my-plugin'). Shorter, better, more productive. By the way, you left every single double function call like the ones you’ve shown unclosed. So c&p team, be careful.

    You see, WordPress makes this even easier if you know your way around.

    0
  14. 28

    Good point about using a full locale. Some frameworks however don’t support this. I’ve seen a lot of systems that only support a language code in their localisation configs.

    Another common mistake is to use Geo-IP detection to force the /correct/ language. I don’t want the whole Internet in French just because I’m on holiday in France.

    This is a great primer for the technical and design problems. But 90% of the errors I’ve experienced over the years are due to coordination problems – getting content from design to translation and back to dev, all via people with different software and levels of technical understanding. I wrote my own software to manage this problem, and I’m launching it to everyone. http://localise.biz/

    0
  15. 29

    Some interesting tips here – especially the ones regarding testing and using concatenation with caution.

    Personally I think it’s better to check with the user what their language/locale is, certainly for something like a web based business application. I would rather have the app say “It looks like your locale is France – is that correct?” when I log in and for it to let me change it than be forced to accept French just because I’m visiting our French office and using a local desktop.

    0
  16. 30

    In Estonia, the week starts on Monday, not Saturday.

    0
