Character Sets

And The Issues Surrounding Them!

A character that appears on your computer screen, whether a number, letter or symbol, is the graphical interpretation of a number. To know which character to display, a computer refers to a lookup table that associates a single character with each number. This table is called a character set. Unfortunately, because not all computers agree on which number maps to which character, two users with incompatible character sets will have difficulty sharing information. For plain text messages written in English this rarely happens, but for more complex documents, such as those using extended or, especially, non-European characters, it is a major concern.
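As a rough illustration of the idea that every character is backed by a number, the short Python sketch below prints the numbers behind a few characters. The helper name `codes` is made up for this example; `ord` and `chr` are Python built-ins.

```python
def codes(text):
    """Return the number stored for each character in text."""
    return [ord(ch) for ch in text]


# Each character maps to a number, and each number maps back to a character.
for ch in ("A", "a", "7", "?"):
    print(repr(ch), "->", ord(ch))

print(codes("Hi!"))   # [72, 105, 33]
print(chr(65))        # A
```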

The best-known character set is US ASCII, which has become the standard for use on the Internet. US ASCII is a 7-bit set, which means it has room for 128 characters; this is usually sufficient for English writers, but is too narrow in scope to accommodate many other European languages. It is wholly inappropriate for languages that use different writing systems, such as Chinese, Arabic and Russian. A number of 7-bit variants of US ASCII were created to address this deficiency, but the 128-character limit made this a poor solution.
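The 128-character limit is easy to see in practice. The following is a minimal sketch, using Python's built-in "ascii" codec, that checks whether a piece of text fits in US ASCII. The function name `fits_in_ascii` is hypothetical.

```python
def fits_in_ascii(text):
    """Return True if every character in text is one of the 128 US ASCII characters."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(fits_in_ascii("character set"))  # True
print(fits_in_ascii("café"))           # False: é is outside US ASCII
print(fits_in_ascii("Привет"))         # False: Cyrillic is outside US ASCII
```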

8-bit character sets allow for up to 256 characters, expanding the range of languages that can be supported. ISO 8859 is a series of 8-bit sets developed as a standard. ISO 8859-1, also called Latin-1, is the best known of these sets and was designed for Western European languages. Other ISO 8859 sets cover Cyrillic, Hebrew, and Arabic, among others. All of the sets are identical to US ASCII for the first 128 characters, but differ in the second 128.
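To see how the ISO 8859 sets share their lower half but diverge in the upper half, the sketch below decodes the same two byte values under three of the sets. The helper `decode_in` and the choice of sets are assumptions made for this illustration; the codec names are the ones Python ships with.

```python
SETS = ("iso8859_1", "iso8859_5", "iso8859_7")  # Latin-1, Cyrillic, Greek


def decode_in(byte_value, charsets=SETS):
    """Decode a single byte value under each of the named character sets."""
    return {name: bytes([byte_value]).decode(name) for name in charsets}


# 0x41 is in the first 128 values: every set agrees it is "A".
print(decode_in(0x41))

# 0xE9 is in the second 128 values: each set gives a different character.
print(decode_in(0xE9))
```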

Not all programs and operating systems make use of the ISO 8859 character sets, however. Mac OS and DOS (though not Windows) use their own 8-bit sets, which are compatible with US ASCII but not with ISO 8859. There are also a number of competing sets, such as KOI8-R for Cyrillic, which is actually more popular than ISO 8859-5 (the Cyrillic 8859 set). Many Asian writing systems, which have thousands of characters, require even more complex sets. Adding to the confusion, some programs, operating systems and Internet gateways are incapable of handling anything other than 7-bit characters. The result is that, because most character sets agree only on the first 128 characters (i.e. US ASCII), text on the Internet is still largely US ASCII.
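The mismatch between competing 8-bit sets can be sketched in a few lines of Python. The example assumes Python's "mac_roman" and "cp437" codecs as stand-ins for the Mac OS and DOS sets, and the sample word is arbitrary.

```python
# The same letter is stored as a different number in each 8-bit set.
E_ACUTE = "é"
for name in ("iso8859_1", "mac_roman", "cp437"):
    print(name, hex(E_ACUTE.encode(name)[0]))

# Cyrillic text written with KOI8-R but read as ISO 8859-5.
original = "Привет"
sent = original.encode("koi8_r")
misread = sent.decode("iso8859_5")   # wrong set: garbled text
recovered = sent.decode("koi8_r")    # right set: the original text

print(misread)
print(recovered)
```

The bytes on the wire are the same in both cases; only the receiver's choice of character set determines whether the text is readable.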

With the growing acceptance of MIME, that is changing. With MIME, it is possible to specify the character set used in an attached text document. If a MIME-compliant client receiving the document can translate the character set and has a font that can display it, it will be able to display the text correctly. Netscape Navigator for Mac OS, for example, can translate ISO 8859-1 into MacRoman, making it possible for Mac users to read text written with ISO 8859-1.
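A minimal sketch of this mechanism, using Python's standard `email` package, might look like the following. The subject line and message text are invented for the example, and the final step only imitates the kind of ISO 8859-1 to MacRoman translation described above; it is not how any particular mail client is implemented.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Sender: label the text with its character set in the MIME headers.
msg = EmailMessage()
msg["Subject"] = "charset demo"
msg.set_content("café", charset="iso-8859-1", cte="quoted-printable")
wire = msg.as_bytes()  # what actually travels over the network

# Receiver: read the declared character set and decode the text with it.
received = message_from_bytes(wire, policy=policy.default)
declared = received.get_content_charset()
text = received.get_content().rstrip("\n")

# Translate into the local set (MacRoman here) for display.
mac_bytes = text.encode("mac_roman")

print(declared)   # iso-8859-1
print(text)       # café
print(mac_bytes)
```

Quoted-printable transfer encoding is chosen here so that the message itself stays within 7-bit characters while still declaring an 8-bit character set.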

Another potential solution is the Unicode standard. Unicode was designed as a 16-bit character set, which can encode up to 65,536 characters, giving it enough room to include even the writing systems of East Asia. Later versions of the standard extend beyond 16 bits, to more than a million possible code points.
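The sketch below shows characters from several writing systems sharing one numbering scheme under Unicode. The sample characters are arbitrary choices for illustration, and each happens to fit in a single 16-bit unit.

```python
SAMPLES = {
    "Latin": "A",
    "Latin with accent": "é",
    "Cyrillic": "Щ",
    "Hebrew": "א",
    "Chinese": "中",
}

for script, ch in SAMPLES.items():
    number = ord(ch)                   # the Unicode number for this character
    encoded = ch.encode("utf-16-be")   # stored as 16-bit units, big-endian
    print(f"{script}: {ch} -> U+{number:04X}, {len(encoded)} bytes")
```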

About Dewwa Socc