Compliance with UTF-8

What is character encoding?

Encoding is a way of translating characters (letters, punctuation marks, symbols, whitespace, control characters) into numbers and ultimately into bits. Every character can be encoded as a specific byte sequence.

Every character a computer can work with has a numeric value assigned to it. Characters fall into the following groups:

  • lowercase English letters: a-z,
  • uppercase English letters: A-Z,
  • punctuation marks and symbols, e.g. $, !,
  • whitespace characters: space " ", new line "\n", carriage return "\r", tab "\t", etc.,
  • non-printable characters, e.g. backspace "\b": characters that cannot be displayed the way a digit or the letter A can.

There are several ways to encode characters, and each one assigns numbers to individual characters in its own way. That is why data must be read with the same encoding it was written in. The same string of 0s and 1s can be decoded as a sequence of completely different symbols, depending on the encoding used.
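The effect can be sketched in Python with a single character. The choice of "é" and of Latin-1 as the mismatched encoding is only an illustration, not something the text above prescribes:

```python
# The same two bytes read back with two different encodings.
data = "é".encode("utf-8")        # written as UTF-8: b'\xc3\xa9'

right = data.decode("utf-8")      # read with the encoding it was written in
wrong = data.decode("latin-1")    # read with a different encoding

print(right)  # é
print(wrong)  # Ã©
```

The bytes never change; only the table used to interpret them does.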

ASCII

The simplest encoding is ASCII (American Standard Code for Information Interchange). The full ASCII table contains 128 characters. If a symbol is missing from it, ASCII has no representation for that symbol and it cannot be encoded in ASCII.

Below is the full list of ASCII codes. It shows which character each number is assigned to (in decimal dec, hexadecimal hex and octal oct). The letters A-Z are represented by the numbers 65-90, while the letters a-z have the numbers 97-122.

[Figure: ASCII table]

Most important attributes of ASCII:

  • Character encoding system.
  • 7 bits per character.
  • Assigns numbers from 0 to 127 to the letters of the Latin alphabet, digits, punctuation marks and other symbols.
  • Example: the letter "A" is assigned the number 65.
  • Released in 1963.
  • It cannot store language-specific letters from Polish, Chinese, etc.
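These attributes can be checked with a few lines of Python. This is a minimal sketch; the sample text and the Polish letter "ł" are arbitrary choices made for the illustration:

```python
# Letters map to the numbers listed in the ASCII table.
print(ord("A"), ord("Z"))  # 65 90
print(ord("a"), ord("z"))  # 97 122

# Every ASCII character fits in 7 bits, so its code is below 128.
ascii_text = "Hello!"
fits_in_7_bits = all(ord(ch) < 128 for ch in ascii_text)
print(fits_in_7_bits)  # True

# A letter outside the table cannot be encoded as ASCII.
try:
    "ł".encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)  # False
```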

Unicode

As you have probably noticed, the problem with ASCII is that its code table is not large enough to accommodate all the languages, dialects and symbols of the world (for example, it has no Polish or Chinese characters). Unicode works the same way as ASCII, but it defines far more characters.

Unicode can be thought of as a much bigger, newer version of the ASCII table, one that contains not 128 characters but 1 114 112 possible codes. ASCII is in fact a subset of Unicode: the first 128 characters in Unicode are the same as in the ASCII table.
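Both numbers can be confirmed from Python. The sketch below relies on sys.maxunicode, which holds the highest code Python accepts:

```python
import sys

# Codes run from 0 to sys.maxunicode, so the total is one more than the maximum.
total_codes = sys.maxunicode + 1
print(total_codes)  # 1114112

# The first 128 Unicode codes are exactly the ASCII characters with the same numbers.
same_as_ascii = all(chr(i).encode("ascii") == bytes([i]) for i in range(128))
print(same_as_ascii)  # True
```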

What matters from a technical point of view is that Unicode itself is not an encoding. Unicode is implemented by several different encodings. The best way to picture it is as a kind of map, or a two-column database: Unicode maps characters (for example "a", ":", "3") to unique non-negative integers (code points).
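A short sketch of that distinction: one character has one Unicode number, but the bytes depend on the encoding chosen. The three encodings used here are picked only for illustration:

```python
ch = "é"

# Unicode only assigns the number.
code = ord(ch)
print(code)  # 233

# Turning that number into bytes is the job of an encoding.
print(ch.encode("utf-8"))      # b'\xc3\xa9'
print(ch.encode("utf-16-le"))  # b'\xe9\x00'
print(ch.encode("utf-32-le"))  # b'\xe9\x00\x00\x00'
```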

In encodings that implement Unicode, a character may be stored in several bytes (1 byte = 8 bits). Unicode is not a character encoding in the full sense of the term, because it says nothing about how to get the actual bits from text. It does not define how to convert text to binary data and back. Unicode is an abstract standard for encoding, not an encoding itself. A popular implementation of the Unicode standard is...

UTF-8

Most important attributes of UTF-8:

  • UTF-8 (8-bit Unicode Transformation Format) is an encoding of Unicode.
  • The default character encoding in many contexts, including Python 3 source files.
  • Uses 1 to 4 bytes to encode one character.
  • ASCII compatible.
  • Can encode over a million characters.
  • Unicode character codes are written with a "U+" prefix.
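The byte counts and the "U+" notation from the list above can be checked in Python. The four sample characters are arbitrary, chosen so that each needs a different number of bytes:

```python
samples = ["a", "é", "€", "🐍"]

# Number of bytes UTF-8 needs for each character.
byte_counts = [len(ch.encode("utf-8")) for ch in samples]
print(byte_counts)  # [1, 2, 3, 4]

# The same characters written in "U+" notation (hexadecimal code).
code_points = [f"U+{ord(ch):04X}" for ch in samples]
print(code_points)  # ['U+0061', 'U+00E9', 'U+20AC', 'U+1F40D']
```

The ASCII letter "a" takes a single byte, which is what makes UTF-8 ASCII compatible.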

What about character encoding in Python?

  • Python 3 source code is UTF-8 encoded by default.
  • This means you can use accented characters in strings.
  • Accented letters can also be used in the names of variables, functions or classes, but this is not recommended (people from other countries may be working on the project).

This is good practice:

    text_en = "Hello, young programmers!"
    text_fr = "Voix ambiguë d'un cœur qui, au zéphyr, préfère les jattes de kiwis."
    text_cn = "你好,世界"
    print(text_en)
    print(text_fr)
    print(text_cn)

This is not recommended, although it will work:

    整数 = 7
    turtle = "turtle"
    vérité = True
    print(整数)
    print(turtle)
    print(vérité)

Encoding and decoding in Python 3

The str data type in Python is the type for human-readable text and can contain any character from the Unicode table.

The bytes data type represents binary data, that is, a sequence of raw bytes with no encoding attached.

Encoding and decoding are the processes that convert between these two types: encoding turns str into bytes, and decoding turns bytes back into str.

[Figure: encoding and decoding]

Python provides two methods for this, str.encode() and bytes.decode(). Both use UTF-8 by default, but in general it is safer and clearer to specify the encoding explicitly, as in the example below:

    >>> "résumé".encode("utf-8")
    b'r\xc3\xa9sum\xc3\xa9'
    >>> b"r\xc3\xa9sum\xc3\xa9".decode("utf-8")
    'résumé'

A literal with a lowercase b before the opening apostrophe or quotation mark is a bytes object, not a str.

The result of str.encode() is a bytes object. The representation of bytes shows only ASCII characters directly; all other bytes are shown as escape sequences. In the example above, \xc3\xa9 stands for the character é, which is not in ASCII.
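A slightly longer sketch of the same round trip, which also shows what happens when the target encoding lacks a character. The use of errors="replace" is one of several standard error handlers and is picked here only as an example:

```python
text = "résumé"
data = text.encode("utf-8")

# str counts characters, bytes counts bytes: each "é" takes two bytes in UTF-8.
print(type(text).__name__, len(text))  # str 6
print(type(data).__name__, len(data))  # bytes 8

# Decoding with the same encoding restores the original text.
round_trip = data.decode("utf-8")
print(round_trip == text)  # True

# ASCII has no "é": strict mode raises, errors="replace" substitutes "?".
try:
    text.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
replaced = text.encode("ascii", errors="replace")
print(strict_ok)  # False
print(replaced)   # b'r?sum?'
```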

Python helps the programmer with built-in functions that make working with character codes easier:

  • ascii()

Returns an ASCII-only text representation of the object. This means that all non-ASCII characters, Polish letters included, are replaced with escape sequences.

    >>> ascii("abcdefg")
    "'abcdefg'"

    >>> ascii("jalepeño")
    "'jalepe\\xf1o'"

Note the doubled quotation marks (' inside "): the result is a string that itself contains the quoted representation.

  • bin()

Returns the binary representation of an integer as a string. The function adds a '0b' prefix, which marks the value as a binary number.

    >>> bin(0)
    '0b0'

    >>> bin(400)
    '0b110010000'
  • bytes()

Converts the input to the bytes type.

    >>> bytes("🐍", "utf-8")
    b'\xf0\x9f\x90\x8d'
  • chr()

Converts an integer character code to the corresponding Unicode character.

    >>> chr(97)
    'a'

    >>> chr(7048)
    'ᮈ'
  • ord()

The inverse of chr(). Converts a Unicode character to its integer code.

    >>> ord("a")
    97

    >>> ord("ę")
    281
  • str()

Converts the input to the str type, which represents text.

    >>> str("some string")
    'some string'

    >>> str(5)
    '5'
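The functions listed above combine naturally. The helper below is hypothetical (describe is not a built-in, just a name chosen for this sketch) and gathers several views of a single character:

```python
def describe(ch):
    """Return (code, binary string, UTF-8 bytes, ASCII-only repr) for one character."""
    code = ord(ch)
    return code, bin(code), bytes(ch, "utf-8"), ascii(ch)

print(describe("a"))  # (97, '0b1100001', b'a', "'a'")
print(describe("ę"))  # (281, '0b100011001', b'\xc4\x99', "'\\u0119'")

# chr() and ord() undo each other.
print(chr(ord("ę")) == "ę")  # True
```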