Localizing a Game for Unity

I enjoy putting off localization.

If you’re unfamiliar with the term, congratulations! You’ve lived a fun carefree life of placing strings directly in your code, and doing more fun “important” things like writing shaders and custom sorting algorithms.

Basically, localization means translating your game into different languages. Some people will say there’s a little more to it than that, but not really.

How do you know it’s time to start localization? For me, it’s all about minimizing the total amount of work. There’s no point in beginning localization when you’re still prototyping game systems; after all, you might rip those out and replace them several times, and then all your localization work will go out with the trash. But as soon as you’re thinking about adding your first bit of final production content, it’s time to start localizing. It’s easier to do it as I go than to run the risk of having to redo all my production content because I didn’t consider the nuts and bolts of how it would be localized.

I’m lazy, just proactively lazy. After all, the person who has to go in and clean up any messes I’ve made is me, so it’s less work in the long term to plan things out ahead of time.

Okay, with that out of the way, we can talk about the first topic in localization: making sure you can display all the different languages you want to localize to. In most cases, Unity will handle this for you. If you use its standard text component, it has access to the system fonts, which have characters for pretty much any language you can think of.

I have, however, chosen to use TextMeshPro in my project… Why? Well, TextMeshPro solves a ton of issues with rendering text in 3D, adds a bunch of fancy effects, and it’s free!

(From this point on, I’m going to go on and on about the virtues of TextMeshPro, or TMP for short. But most of the things I’m going to say will be of use to any text rendering system that places its letters on a texture.)

Here, take another look at the title image:

Alphacutoff

Can you guess which text is TMP and which is standard Unity text? The text on the bottom, starting with “EN Bob”, is TMP. The blurry mess on top is standard Unity.

Normal text looks fine until you zoom in, rotate around it, or look at it at any scale other than the one intended; then the edges look blurry and blocky. TMP edges remain smooth because, rather than using alpha as transparency, it uses alpha sort of like waterdrops with surface tension: each texel stores how far it is from the edge of the letter, and the shader draws the edge wherever that distance crosses a threshold. It’s called “distance field rendering” and works similarly to an alpha-cutoff shader.

In order to use TMP you have to go through the extra step of rendering all the font characters to a texture. Don’t worry, TMP has a tool for this (the font asset creator), and there are plenty of tutorials online for how to use it, so I won’t go into detail here. But you can’t just rely on Unity to provide fonts for you; you need to process each one you want to use ahead of time.

For Latin fonts it’s straightforward. Just grab some free fonts from your favorite website (I recommend fontsquirrel). But you will need to understand a little bit about ASCII characters.

When the computer world started (Thursday, 1 January 1970) the standard set of computer characters (A-Z and friends) were each assigned a value from 0 to 127. Lots of software was written assuming these were the only characters anyone would ever need, standards were created, everyone was happy, and then (sometime around 1982, if I remember correctly) someone who didn’t speak English tried to use a computer and everything fell apart.

Don’t be too hasty to judge: back in the 80s each bit of computer memory cost about $38,000 and took up most of a small home (my numbers might be a little off here), so it was more about practicality than exclusion.

Anyway, eventually memory prices came down, and Unicode was born. Unicode is like ASCII, but with space for over a million characters, so there is room enough for all the umlauts and kanji you can shake a stick at. They even made the numbers match up between Unicode and ASCII, so that all the world’s programmers wouldn’t have to commit ritual suicide.

But this does mean that not all fonts contain all characters, and when rendering the fonts for use with TMP you’ll have to supply the codes for the characters you want. Just to be nice, I’m going to list the ranges for all the major markets below (in hexadecimal… if you don’t know what hexadecimal is then… HOLY CRUP!!! How did you read this far without getting bored?! Go read “The Martian”; it’s entertaining and explains hexadecimal).

0020-007F : The standard ASCII characters needed for English speakers.

0080-024F : The accented characters needed for the Latin-based languages (umlauts for German, accents for French, etc.)

0400-04FF : Cyrillic, a.k.a. Russian.

AC00-D7A3 : Korean

3000-303F, 3040-309F, 30A0-30FF, FF00-FFEF : Japanese (uses several ranges)

4E00-9FFF : Use this range for both Chinese and Japanese, but you’ll need different fonts!

If you need more specific ranges, go look them up at unicode-table.com (and check out the hieroglyphs while you’re at it; just keep scrolling, and scrolling). But you should be able to render everything up to Cyrillic onto a reasonably sized 512×512 texture, no problem.
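To get a feel for how many characters each of those ranges asks for, here is a small Python sketch. The dictionary keys and function names are my own, and the comma-separated `START-END` string is an assumption about what the font asset creator’s hex range field accepts, so check it against your TMP version. Keep in mind these are counts of code point slots; a given font may contain far fewer actual glyphs.

```python
# Inclusive (start, end) code point ranges taken from the list above.
# The key names are hypothetical labels, not anything TMP defines.
RANGES = {
    "ascii": (0x0020, 0x007F),
    "latin_extended": (0x0080, 0x024F),
    "cyrillic": (0x0400, 0x04FF),
    "korean": (0xAC00, 0xD7A3),
}


def count_code_points(name):
    """Number of code point slots in a named range (not glyphs in a font)."""
    start, end = RANGES[name]
    return end - start + 1


def to_hex_range_string(names):
    """Join the named ranges as 'START-END,START-END' in uppercase hex."""
    return ",".join(
        "{:04X}-{:04X}".format(*RANGES[name]) for name in names
    )


if __name__ == "__main__":
    western = ["ascii", "latin_extended", "cyrillic"]
    print(to_hex_range_string(western))
    print(sum(count_code_points(name) for name in western), "slots up to Cyrillic")
    print(count_code_points("korean"), "slots for Korean")
```

Everything up to Cyrillic comes to 816 slots, while the Korean range alone is 11,172.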

Except… not all fonts are going to be able to display Cyrillic. You’ll need to find Cyrillic-specific fonts, and those might not be the same ones you want to use for English/Latin text. Fortunately, TMP lets one font fall back to another, so you can set up a chain of fonts: if a character doesn’t exist in one, TMP will automatically swap in the next. The interface for this looks like so:

fallbackfonts
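If the idea of a fallback chain is new to you, this is roughly the lookup it boils down to. This is a conceptual sketch in Python, not TMP’s actual code or API: the function name, the font names, and the idea of modelling a font as a plain set of characters are all made up for illustration.

```python
def resolve_font(character, font_chain):
    """Return the name of the first font in the chain that has the character.

    font_chain is an ordered list of (font_name, set_of_characters) pairs,
    a stand-in for a primary font followed by its fallbacks.
    Returns None when no font in the chain contains the character.
    """
    for font_name, characters in font_chain:
        if character in characters:
            return font_name
    return None


if __name__ == "__main__":
    # Hypothetical fonts with tiny character sets, just to show the order.
    chain = [
        ("MyLatinFont", set("abcdefghijklmnopqrstuvwxyz")),
        ("MyCyrillicFont", set("абвгдежз")),
    ]
    for ch in "aж한":
        print(repr(ch), "->", resolve_font(ch, chain))
```

The order of the chain matters: the first font that has the character wins, so put the font you actually want to see first.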

And then there’s Asia… oh, boy. The problem here is that they have a lot of characters, almost as many as words. Korean, for example, has 11,172 characters. If you tried to render out a Korean font at the same level of detail as Latin, you’d need an 8,192 x 8,192 texture!

Korean is a bit of an oddball as far as Asian languages go. Its characters are phonetic, not pictographs, but they are designed to sort of resemble pictographs. Each character represents a syllable, so they have a different character for “jot”, “tot”, and “rot”, but they are all structured the same, and they all look similar to one another (here’s a cartoon about it). Each of its base letters (19 initial consonants, 21 vowels, and an optional final consonant with 27 possibilities) fits into slots around each character, which is where 19 × 21 × 28 = 11,172 comes from. You could just render each letter into each character and then each character on screen, but most systems just aren’t set up that way, and it wouldn’t look quite as nice, so we’re handed a ridiculous pile of 11,172 characters.
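The slot structure is baked right into Unicode, which is why the Korean block is exactly 11,172 characters long. Every syllable is an initial consonant (19 choices), a vowel (21 choices), and an optional final consonant (27 choices plus “none”, so 28). Here is a small Python sketch of that arithmetic; the function and constant names are my own, and the indices are positions in Unicode’s ordering of the letters.

```python
HANGUL_BASE = 0xAC00  # first precomposed syllable in the Korean range
LEAD_COUNT = 19       # initial consonants
VOWEL_COUNT = 21      # vowels
TAIL_COUNT = 28       # 27 final consonants, plus index 0 for "no final"


def compose_syllable(lead, vowel, tail=0):
    """Build one Korean syllable character from its three slot indices."""
    if not (0 <= lead < LEAD_COUNT):
        raise ValueError("lead index out of range")
    if not (0 <= vowel < VOWEL_COUNT):
        raise ValueError("vowel index out of range")
    if not (0 <= tail < TAIL_COUNT):
        raise ValueError("tail index out of range")
    offset = (lead * VOWEL_COUNT + vowel) * TAIL_COUNT + tail
    return chr(HANGUL_BASE + offset)


if __name__ == "__main__":
    print(LEAD_COUNT * VOWEL_COUNT * TAIL_COUNT)   # total syllable count
    print(compose_syllable(0, 0))                  # first character in the range
    print(compose_syllable(18, 20, 27))            # last character in the range
```

The first and last combinations land exactly on AC00 and D7A3, the two ends of the Korean range listed earlier.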

Here’s what a texture sheet containing all 11,172 characters looks like:

korean

Each of those little specks is a letter.

And for reference here’s English:

english

There are two consequences to this:

  1. You’re probably only going to have one Korean font, so make sure it’s versatile.
  2. TMP relies on large gaps between characters to do effects like outlines, so for any tightly packed language, you might not be able to use them.

You can cheat, however. While Korean has 11,172 characters, some of those are for syllables that are never used, or only used in a word or two; a lot of them will be nonsense syllables like “eekey” or “eyechai”.

There is a subset of Korean (dubbed the high-frequency hangul syllables), one with only 2,780 characters, that works just fine for day-to-day use. The tricky bit is that while it used to be like ASCII, with all the common characters in a single range, they’ve now combined those characters into the range of all characters (interspersing them), so it’s hard to get one without the other.

Here are links to files with both the common and rare Korean characters (a special thank you to Dr. Ken Lunde!):

CommonKoreanGylphs

RareKoreanGylphs

NOTE: Don’t just copy and paste from the page, as your browser will probably misread the Korean glyphs; instead, save the entire page. Once you have the files open in a viewer that can show UTF-8 correctly, cut and paste from there into the font asset creator, setting “character set” to “custom characters”.

I had to include the characters themselves rather than their Unicode values, because the font asset creator can’t handle strings that long. If you want to find the raw data and do it yourself you can find it here.

For my game, I rendered the 2,780 common characters into one texture, and then the 8,392 rare characters into another (at a lower font size to save space), and then set the common font to fall back to the rare font if it doesn’t contain the character.
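If you only have the common list and want to produce the rare list yourself, the split is just a set difference over the Korean range. This is a minimal Python sketch of that idea, not the exact process I used; the function names are made up, and it assumes you have already loaded the common characters into a string (for example from a UTF-8 text file).

```python
HANGUL_FIRST = 0xAC00
HANGUL_LAST = 0xD7A3


def all_hangul_syllables():
    """Every character in the Korean range AC00-D7A3 as a set."""
    return {chr(cp) for cp in range(HANGUL_FIRST, HANGUL_LAST + 1)}


def split_common_and_rare(common_text):
    """Return (common, rare) strings, each sorted by code point.

    Anything in common_text that is not a Korean syllable (newlines,
    spaces, stray punctuation) is ignored, and duplicates are dropped.
    """
    everything = all_hangul_syllables()
    common = {ch for ch in common_text if ch in everything}
    rare = everything - common
    return "".join(sorted(common)), "".join(sorted(rare))


if __name__ == "__main__":
    # A tiny stand-in for the real 2,780-character common list.
    common, rare = split_common_and_rare("가나다\n가")
    print(len(common), len(rare))
```

With the real 2,780-character list as input, the rare string comes out at 11,172 - 2,780 = 8,392 characters, ready to paste into the second font asset.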

Japanese is a bit simpler in a way. It’s a lot like Korean in that there are a ton of characters (40,000 or more!), but the characters aren’t phonetic, so there’s not a good way to define a complete set. Still, about 99.5% of Japanese writing contains only 3,500 or so distinct characters (kanji).

I won’t even try to get a complete font for Japanese. If you blindly download the best Google font, it’s going to have 21,488 characters. It’s hard to justify filling video memory with something that will never be seen. The gold standard for reasonable Japanese fonts seems to be the fonts based on the M+ font, which all contain around 5,000 characters.

As an aside, I should mention that TMP has a tool by which you can feed in all your text and it will spit out textures that contain only the characters you actually use. This works great if you don’t add text later or allow user-generated content, but that’s not my plan.
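If you want to see what “only the characters you actually use” means for your own game, the core of it is collecting the distinct characters from all your strings. Here is a rough Python sketch of that step; it is not how TMP’s tool works internally, the function name is made up, and it assumes your game text is already available as a list of strings.

```python
def collect_used_characters(strings):
    """Return every distinct character used, sorted by code point.

    Newlines, tabs and other whitespace are skipped because they don't
    need a glyph, but a plain space is always included.
    """
    used = {" "}
    for text in strings:
        used.update(ch for ch in text if not ch.isspace())
    return "".join(sorted(used))


if __name__ == "__main__":
    # Hypothetical game strings in a couple of languages.
    game_text = ["Start Game", "Начать игру", "게임 시작\n"]
    characters = collect_used_characters(game_text)
    print(len(characters), "distinct characters")
    print(characters)
```

The resulting string is the kind of thing you could paste into the font asset creator as custom characters, which is exactly why it breaks down as soon as players can type their own text.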

You can tell how many characters a font has when you make the asset:

localironhamaru

See the 5350/21488 in yellow? That means I asked for 21,488 kanji characters, and the font only contained 5,350.

One thing you have to watch out for is fonts that say they contain more characters than they actually do, like this one:

localJKG

It says it has 12,597 characters, but if you count the rows and columns you’ll see it’s closer to 6,000. Whether some are duplicates or defined as zero-size characters, I do not know. But I’d avoid these fonts as they might cause bugs later on.

Japanese kanji fonts tend to require a bit more detail than Korean (and Chinese requires more than Japanese!). So far, I’ve found that two 2,048 by 1,024 textures work for Korean (one for the high-frequency characters, and one for the low), a single 2,048 by 2,048 works for Japanese, and I haven’t tried Chinese yet.
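To put those texture sizes in terms of memory, here is some back-of-the-envelope arithmetic in Python. It assumes one byte per pixel (a single-channel alpha texture) and no mipmaps; both are my assumptions, so check the actual format of your font atlases. The function names are made up.

```python
def atlas_bytes(width, height, bytes_per_pixel=1):
    """Raw size of one atlas texture, assuming no mipmaps or compression."""
    return width * height * bytes_per_pixel


def to_mib(byte_count):
    """Convert a byte count to mebibytes."""
    return byte_count / (1024 * 1024)


if __name__ == "__main__":
    korean = 2 * atlas_bytes(2048, 1024)   # common + rare textures
    japanese = atlas_bytes(2048, 2048)     # single texture
    naive_korean = atlas_bytes(8192, 8192) # Korean at full Latin detail
    print("Korean (two textures):", to_mib(korean), "MiB")
    print("Japanese (one texture):", to_mib(japanese), "MiB")
    print("Korean at 8,192 x 8,192:", to_mib(naive_korean), "MiB")
```

Under those assumptions the split Korean textures and the Japanese texture each come to 4 MiB, against 64 MiB for a single 8,192 x 8,192 atlas.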

Anyway, it’s so late it’s early, so until next time…
