What is character encoding?

Character encoding, simply put, is a system of codes that represents the characters of a repertoire in a communication or storage system. Depending on the level of abstraction, the codes can be natural numbers (code points), bit patterns, octets, or even electrical pulses. Today character encoding underpins data storage, computation, and the transmission of text data. Terms such as ‘character set,’ ‘codeset,’ ‘character map,’ and ‘code page’ are related, but they are not identical in meaning.
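A minimal sketch of the abstraction levels described above, using Python's built-in Unicode support: a character maps first to a natural number (its code point), and an encoding then turns that number into octets.

```python
text = "é"

# At the abstract level, each character corresponds to a natural
# number called a code point.
code_point = ord(text)
print(hex(code_point))  # 0xe9, i.e. code point U+00E9

# At the byte level, an encoding turns that code point into octets.
# Different encodings produce different octet sequences for the
# same character.
print(text.encode("utf-8"))    # b'\xc3\xa9' (two octets)
print(text.encode("latin-1"))  # b'\xe9'     (one octet)
```

Note that the code point is a property of the character itself, while the octets depend entirely on which encoding is chosen.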

Early character encoding

Originally, character codes were associated with electrical telegraphy and optical codes. Those instruments could represent only a small subset of characters, generally restricted to numerals, letters, and some punctuation. The low overhead of representing characters in modern computers, however, allows for far more elaborately designed codes. Unicode, the most commonly used of these, represents the majority of characters used in the world's written languages. Character encoding relies on internationally accepted standards to allow text to be exchanged easily in electronic form.

IBM was one of the first to introduce such a system with Binary Coded Decimal (BCD), a six-bit encoding used by IBM through the 1960s in its 1401 and 1620 computers, and later in the 7000 series and various peripherals. BCD extended an existing four-bit numeric encoding with alphabetic and special characters, mapping it readily onto punch cards.

Character maps, sets and code pages

Historically, terms like character map, character encoding, and code page were used interchangeably in computer science, since a single standard typically specified both a character repertoire and how its characters were encoded. Now the terms bear somewhat distinct meanings. Efforts by standards bodies to make the terminology precise have resulted in a more or less unified vocabulary across current systems.

A ‘code page’ is generally a byte-oriented encoding, often one of a family of related encodings that share most characters and codes while differing in details. Well-known examples include Windows-1252, the default code page for Western editions of Microsoft Windows, and code page 437, the original character set of IBM PC DOS. Most encodings referred to as code pages use a single byte per character.
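Because each code page assigns its own character to each byte value, the same octet can decode to different characters under different pages. A short illustration using Python's standard codecs for the two pages named above:

```python
# The single byte 0x9C means different things in different code pages.
raw = bytes([0x9C])

print(raw.decode("cp437"))   # '£' in IBM code page 437
print(raw.decode("cp1252"))  # 'œ' in Windows-1252
```

This is exactly why text transferred between systems without an agreed-upon encoding could appear garbled.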

IBM’s Character Data Representation Architecture (CDRA) designates encodings with coded character set identifiers, each of which is variously known as a charset, code page, or encoding. Interestingly, Linux, which evolved from Unix, does not use the term code page but rather charmap.

Legacy Encoding Systems

The term legacy encoding is used, with a certain degree of ambiguity, to signify older character encodings. Most often it appears in the context of Unicodification, where it refers to encodings that do not cover all Unicode code points or that use a somewhat different character repertoire. In different legacy encodings, several different code points may represent a single Unicode character.
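The last point can be demonstrated with Python's standard codecs: the Unicode character U+00E9 (‘é’) is assigned different byte values in different legacy code pages.

```python
# Two legacy code pages assign different byte values to the same
# Unicode character U+00E9 ('é').
print(bytes([0x82]).decode("cp437"))   # 'é' at byte 0x82 in code page 437
print(bytes([0xE9]).decode("cp1252"))  # 'é' at byte 0xE9 in Windows-1252
```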

In some material, legacy simply refers to any encoding that precedes Unicode. All of the Windows code pages are called legacy in this sense, both because they obviously antedate Unicode and because they cannot represent all of the 2²¹ possible code points.
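The limitation is easy to see in practice. A single-byte code page such as Windows-1252 covers at most 256 code points, so text that stays within its repertoire encodes fine, while most other Unicode characters cannot be represented at all. A short sketch:

```python
# Text within the Windows-1252 repertoire encodes without trouble.
print("café".encode("cp1252"))  # b'caf\xe9'

# Characters outside its 256-entry table cannot be represented.
try:
    "日本語".encode("cp1252")
except UnicodeEncodeError:
    print("not representable in cp1252")
```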