Libre Graphics magazine archives


Localizing type

Denis Jacquerye

In the digital type universe, there is a complex set of elements which make it a struggle for some to use typography in their own language. When viewing digital type, it's common for some characters to be shown, some not, because they don't fit within the particular font being used. A font can contain a capital letter but not the corresponding lowercase letter. Users don't really know how to deal with that. They try different fonts. If they're more courageous, they go online and look up how to complain about those fonts not supporting necessary characters.

Very often, they end up taking their complaints to font designers or software engineers. The designers and engineers try to solve the problems as well as they can, but it can be difficult. Adding a missing character is easy, but languages also have more complex requirements. Take the ogonek, used in Polish: a little tail attached to a vowel, showing that the vowel is nasalized. In some languages, the tail is centered under the letter. It's quite rare to see a font that has that. When font designers face the issue, they have to choose whether to go with one tradition or another. Whichever way they go, they cater only to the people in that tradition.

Older encodings, like ASCII (the basic western Latin alphabet), were simple. Each character was represented by a single byte. That byte identified the character, and the character could be displayed in different fonts and styles to meet the requirements of different people. But many requirements were difficult to fit into ASCII. One option was to start with ASCII and add specific requirements. That choice resulted in a collection of different standards, responding to different needs. The same byte could have different meanings in different standards, and each meaning could be displayed differently in different fonts, so text read under the wrong standard often rendered as gibberish.
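To make the ambiguity concrete, here is a minimal Python sketch that reads one byte under three legacy 8-bit encodings. The choice of byte (0xB1) and of the three codecs is only for illustration, and the helper name decode_with is hypothetical.

```python
def decode_with(raw, codecs):
    """Decode the same bytes under each named legacy codec."""
    return {codec: raw.decode(codec) for codec in codecs}


# One byte value, three different readings.
readings = decode_with(bytes([0xB1]), ["latin-1", "iso8859_2", "cp1250"])
for codec, text in readings.items():
    print(f"byte 0xB1 read as {codec}: {text}")

# The reverse problem: the same letter is stored as different bytes.
for codec in ("iso8859_2", "cp1250"):
    print(f"a with ogonek stored in {codec}: 0x{'ą'.encode(codec).hex().upper()}")
```

Under ISO 8859-2 the byte is the Polish a with ogonek, while under Latin-1 and Windows-1250 it is a plus-minus sign, so a file written with one standard and opened with another silently changes its text.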

Enter Unicode

In the late '80s, people started thinking about compatibility problems in type. In the '90s, with Unicode, they started really working on it. Companies got together and worked on one single unifying standard that would be compatible with all of the previous standards. In Unicode, there's a universal code point to identify a character. That character can be displayed with different glyphs depending on the font or style selected.

Inclusion criteria have changed a bit over time. Initially, in Unicode, there was basic Latin. And then they started adding all the special characters that were used in, for example, the International Phonetic Alphabet. Initially, they added the characters already used in other encodings. Then they added all the other accented characters they knew about, even the ones which weren't already present in other encodings. Then came Latin letters with marks, used in transcription. At some point, they realized that this list of accented characters would continue growing and considered that there must be a smarter way to do things. They figured they could use parts of characters, broken apart: a base letter, with marks added. Breaking things apart would save them from having thousands of accented characters. They could have pretty much any possible accented character, using parts to represent it.
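The base-letter-plus-marks idea can be seen with Python's standard unicodedata module. This is a small sketch, not a description of how any particular application handles text; the helper name code_points and the two sample letters are chosen only for illustration.

```python
import unicodedata


def code_points(text):
    """List the code points in a string, e.g. 'U+0061 U+0328'."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# An accented character that Unicode encodes as a single code point.
precomposed = "\u0105"  # a with ogonek
decomposed = unicodedata.normalize("NFD", precomposed)
print(code_points(precomposed), "decomposes to", code_points(decomposed))

# Normalizing back recombines the parts into the single code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print("recomposed matches original:", recomposed == precomposed)

# Open e (U+025B) with ogonek has no single code point of its own,
# so it can only be written as base letter plus combining mark.
open_e_ogonek = "\u025B\u0328"
composed = unicodedata.normalize("NFC", open_e_ogonek)
print(code_points(composed), "stays as", len(composed), "code points")
```

The last case is the one that matters for less common needs: the letter exists only as a sequence, so fonts, keyboards and applications all have to handle the sequence correctly.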

Most keyboards are based on the old encodings, with accented characters as single characters. For a sequence of several characters, like those in the new Unicode style, either more typing is necessary or a special keyboard layout allowing one key to be mapped to several characters is needed. That's technically feasible, but it's a slow process. Developers might add very common combinations to the keyboard layout or to applications, but other people have different needs that are less common. It takes the same effort again to make those sequences available.

Most of the necessary documentation is actually available in a book published by Unicode with every new version. That book has a few chapters that describe how Unicode works, how characters should work together, what properties they have, and all the relevant differences between scripts. It also covers special cases, trying to cater to needs that went unmet when proposals were rejected.

Extending Unicode

Unfortunately, sometimes there's just no code point for the needed character. That could be because the character wasn't in any existing standard, no one has ever needed it before, or the people who needed it simply used old printers and metal type, never switching to digital. In cases where a character doesn't exist in Unicode, it comes down to dealing with the Unicode organization itself. They have a few ways to communicate. There's a public mailing list and also a forum. In those venues, it's possible to ask questions about characters. It might be that they do exist and are just difficult to find. Most of the time, finding characters can be a problem because Unicode is organized with a very restrictive set of rules. In most applications, characters are simply listed in code point order, the order in which they appear in Unicode. So the capital “A” is 41 (in hexadecimal), “B” is 42, etc. Because Unicode is expanding organically, work is done on one script and then another, and then they come back to the previous script and add things, so the result may not be in a logical or practical order.
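A short Python sketch of what code point order means in practice. The sample letters are arbitrary; the point is only that related letters end up far apart, and that searching by character name can be easier than scanning a chart.

```python
import unicodedata

# Hexadecimal code points of the first capital letters.
print(f"A is {ord('A'):X}, B is {ord('B'):X}")

# Plain sorting follows code point order, not any alphabet's order.
letters = ["z", "ą", "b", "a", "ɛ"]
in_code_point_order = sorted(letters)
for ch in in_code_point_order:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# Looking a character up by its Unicode name instead of by position.
found = unicodedata.lookup("LATIN SMALL LETTER A WITH OGONEK")
print("found:", found)
```

Here the a with ogonek sorts after z, and the open e lands in a different block again, which is the kind of scattering that makes characters hard to find in a list ordered by code point.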

If a character actually isn't available in Unicode and you want it added, there has to be a formal proposal. Unicode provides a template with questions. Proof needs to be provided that the proposed character is actually used. The final element in the proposal is an actual font with the character, so that they can use that in their documentation. Basically, the inclusion process is quite difficult to navigate.

Designing with Unicode

The documentation of Unicode itself is not prescriptive, meaning that the shapes of the glyphs are not set in stone. There's room for styling. The Unicode charts show just one shape for each character, and then it's the font designer's choice to draw different ones. Unicode is also not about glyphs: it's really about how information is represented, not how it's displayed.

One of the ways to implement all of those features is with TrueType/OpenType. It's very technical and can be slow to update, so if there's a mistake in the actual specification of OpenType, it takes a while before it's corrected and before that correction shows up in any applications. A further issue is that it has its own language code system. Some languages that have identifiers elsewhere just can't be identified in OpenType. One of the features in OpenType is that it's possible to specify: “If I'm using Polish, I want this shape and if I'm using Navajo, I want this shape.” That's really cool, because it allows for a single font that's used by Polish speakers and Navajo speakers, and they don't have to worry about changing fonts as long as they label their text with the language they're using. But it becomes a problem for languages that don't have language codes in OpenType. The option is closed to people using those languages.
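The language-specific selection described above can be pictured with a toy model in Python. This is not a real shaping engine or font API: the glyph names, the pick_glyph function and the placeholder language code "zz" are all hypothetical, and the sketch assumes a font whose default ogonek follows the Polish tradition with an alternate drawn for Navajo. Only the two language system tags, PLK and NAV, come from OpenType itself.

```python
# Toy model only: a real font stores this in its OpenType layout tables.

# Tiny subset of language codes mapped to OpenType language system tags.
OT_LANGUAGE_TAGS = {"pl": "PLK", "nv": "NAV"}

# Hypothetical glyph names for a with ogonek.
DEFAULT_GLYPHS = {"\u0105": "aogonek"}
LOCALIZED_GLYPHS = {"NAV": {"\u0105": "aogonek.NAV"}}


def pick_glyph(char, language):
    """Return the glyph name for char, given the text's language label."""
    tag = OT_LANGUAGE_TAGS.get(language)  # None when no tag exists
    localized = LOCALIZED_GLYPHS.get(tag, {})
    return localized.get(char, DEFAULT_GLYPHS[char])


print(pick_glyph("\u0105", "pl"))  # default shape
print(pick_glyph("\u0105", "nv"))  # Navajo alternate
print(pick_glyph("\u0105", "zz"))  # no tag: falls back to the default
```

The last call shows the limitation: without a language tag there is nothing to attach the alternate to, so the text always gets the default shape.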

Most font designers still work with the old encoding mindset where one character is equal to one letter. Some think that following the Unicode character charts is good enough. It's a hard change to make, because there are very few connections between the Unicode world, the people who work on OpenType libraries and decide how OpenType is handled, the desires of font designers and, not least of all, the actual needs of the users.