Term of the Week: Character Set – The Language of Localization

What is it?

A defined list of grouped symbols used for digital communication.

Why is it important?

Every piece of digital text is stored using a particular character set. Programs and platforms need to know which character set a text uses so that they can process and render each character correctly.

Why does a technical communicator need to know this?

In its simplest form, a character set is a mapping (table) between text characters and the binary numbers that a computer or other digital device understands. For example, a computer using the ASCII character set (one of the early character sets) reads the three letters A, B, C as 01000001, 01000010, 01000011.
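The mapping described above can be checked directly. The following is a minimal sketch in Python; the helper name `ascii_bits` is made up for this illustration and is not part of any library.

```python
def ascii_bits(text: str) -> list[str]:
    """Return the 8-bit binary string for each character, using ASCII."""
    # encode("ascii") turns each character into its ASCII number (one byte);
    # format(..., "08b") writes that number as eight binary digits.
    return [format(byte, "08b") for byte in text.encode("ascii")]


for char, bits in zip("ABC", ascii_bits("ABC")):
    print(char, ord(char), bits)
# A 65 01000001
# B 66 01000010
# C 67 01000011
```

The numbers 65, 66, and 67 are the decimal forms of the same three binary values.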

As the need for global software arose in the 1980s and 1990s, computer scientists devised digital character sets that could manage character complexity and the thousands of characters in languages such as Chinese. Some character sets used a single byte for each character, while others used two or more bytes. Vendor- and platform-specific character sets also became common. As a result, similar character sets could assign different values to the same character, and characters were rendered incorrectly whenever text was processed using the mapping for the wrong character set [IANA].
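To make the problem concrete, here is a small Python sketch that compares two single-byte character sets, ISO 8859-1 ("latin-1") and the original IBM PC code page ("cp437"). The choice of these two sets and the helper name `byte_values` are assumptions made for illustration only.

```python
def byte_values(char: str, charsets: list[str]) -> dict[str, bytes]:
    """Return the byte value each character set assigns to one character."""
    return {name: char.encode(name) for name in charsets}


# The same character has a different value in each character set.
values = byte_values("é", ["latin-1", "cp437"])
for name, value in values.items():
    print(name, value.hex())
# latin-1 e9
# cp437 82

# Text stored with one mapping but read with the other is rendered wrongly.
stored = "café".encode("latin-1")
misread = stored.decode("cp437")
print(misread == "café")  # False: the last character is no longer "é"
```

The plain letters "caf" survive because both sets agree on the ASCII range; only the accented character is affected.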

If an application uses a specific character set, the user’s device must recognize and support that same character set. Verifying this is part of the due diligence for publishing globally.

For this reason, software localization and development engineers must understand character sets [Zentgraf 2015]. Issues with character sets can be the bane of their lives, especially when character corruption occurs, for example when translated software strings are moved across platforms that support different character sets or character encodings (such as from UNIX to Windows).
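The kind of corruption described here can be reproduced in a few lines of Python. This sketch assumes a translated string saved as UTF-8 on one system and then opened on another system that assumes the Windows code page cp1252; the sample string and variable names are hypothetical.

```python
original = "Prénom"  # a translated UI string (French for "first name")

# One platform writes the string as UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# Another platform reads those bytes using the wrong mapping.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # PrÃ©nom

# If nothing else has altered the text, reversing the wrong step recovers it.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)  # Prénom
```

The more reliable fix is to state the encoding explicitly wherever text is written or read, rather than relying on each platform's default.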

Today, the widespread adoption of Unicode [Tero 2012], which assigns a unique number to every character in nearly every language, and its various character encodings has brought more harmonization to this area. A character set can have multiple character encodings, but each encoding can relate to only one character set [Open-Std].
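The relationship between one character set and several encodings can be sketched in Python using the euro sign as a sample character. The three encodings shown (UTF-8, UTF-16, and UTF-32, in their little-endian forms) are chosen here for illustration.

```python
char = "€"

# Unicode assigns the character one number (its code point).
code_point = ord(char)
print(f"U+{code_point:04X}")  # U+20AC

# Each encoding stores that same number as a different byte sequence.
sizes = {}
for name in ["utf-8", "utf-16-le", "utf-32-le"]:
    encoded = char.encode(name)
    sizes[name] = len(encoded)
    print(name, encoded.hex(" "), len(encoded), "bytes")
    # Decoding with the matching encoding always returns the same character.
    assert encoded.decode(name) == char
# utf-8 e2 82 ac 3 bytes
# utf-16-le ac 20 2 bytes
# utf-32-le ac 20 00 00 4 bytes
```

The code point stays the same in every case; only the bytes used to store it change.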

References

[IANA] Official character set names on the Internet: IANA.
[Zentgraf 2015] Character sets and encoding: What every programmer absolutely needs to know: Zentgraf, David C. Programmer information on character sets and encodings.
[Tero 2012] Unicode, UTF8 & Character Sets: The Ultimate Guide: Tero, Paul.
[Open-Std] Universal Character Set Characters: Open-std.org. Open standard that lists the Universal Character Set characters.

About Dave Ruane

Dave Ruane is part of the enterprise business team at Xplanation Language Services. His current interests are Agile methodologies, global continuous delivery methods, and creating customer experience through global content. His background is in localization engineering, and he has been in the industry since the mid-1990s. He is a frequent speaker at industry events and is the creator of the Process Innovation Challenge, a platform for developing and showcasing innovation in the industry.

Term: Character Set

Email: dave.ruane@xplanation.com

Website: xplanation.com

Twitter: @DaveRuaneSpain

LinkedIn: linkedin.com/in/davidruane/