
Monday, April 02, 2012 

Unicode, things to know.


Unicode currently defines more than 100,000 characters, but has space for 1,114,112 code points. They are organized into 17 “planes” of 2^16 (65,536) code points each, numbered 0 through 16. Plane 0 is called the “Basic Multilingual Plane” or BMP; planes 1 through 16 are informally called the “astral” planes.
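To make the plane arithmetic concrete, here is a small sketch. The helper name plane_of is my own, not a standard function; it relies only on the fact that each plane holds 2^16 code points, so the plane number is simply the code point shifted right by 16 bits.

```python
def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0..16) that a code point falls in."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Each plane holds 2**16 code points, so the plane is the high bits.
    return code_point >> 16

# 17 planes of 65,536 code points give the total code space.
TOTAL_CODE_POINTS = 17 * 2**16

print(TOTAL_CODE_POINTS)     # 1114112
print(plane_of(ord("A")))    # 0, i.e. the BMP
print(plane_of(0x1F600))     # 1, an astral plane
```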

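UTF-32: each character takes exactly 32 bits. To find the byte order, the text can start with the byte order mark U+FEFF. If the first 32-bit value reads as 0xFFFE0000 instead of 0x0000FEFF, the bytes are in the opposite order from what you assumed, so you swap them accordingly.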
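A rough sketch of that byte order check, using the BOM constants from Python's standard codecs module. The function name utf32_byte_order is hypothetical, and the sketch only looks at the first four bytes; it does not try to guess the order when no mark is present.

```python
import codecs

def utf32_byte_order(data: bytes) -> str:
    """Guess UTF-32 byte order from a leading byte order mark (U+FEFF)."""
    head = data[:4]
    if head == codecs.BOM_UTF32_BE:   # bytes 00 00 FE FF
        return "big"
    if head == codecs.BOM_UTF32_LE:   # bytes FF FE 00 00
        return "little"
    return "unknown"                  # no mark, the caller has to decide

# U+FEFF written explicitly, then encoded without an automatic mark.
big = "\ufeffhi".encode("utf-32-be")
little = "\ufeffhi".encode("utf-32-le")

print(utf32_byte_order(big))      # big
print(utf32_byte_order(little))   # little

# Reading the big-endian mark with the wrong order gives 0xFFFE0000.
print(hex(int.from_bytes(big[:4], "little")))   # 0xfffe0000
```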

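UTF-16: BMP characters take a single 16-bit unit. Characters beyond the BMP (the astral planes) are encoded as a surrogate pair, that is, two 16-bit units taken from the reserved surrogate blocks (U+D800 to U+DBFF for the first half, U+DC00 to U+DFFF for the second). When you look at a sixteen-bit quantity, you can tell right away whether it's an ordinary BMP character or half of an astral-plane character, and if so, which half. For byte ordering, the byte order mark U+FEFF can be used as the first character; alternatively the text is labelled UTF-16BE or UTF-16LE, which fixes the byte order and needs no mark.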
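The surrogate arithmetic can be sketched as follows. The function names are mine; the constants (0x10000, 0xD800, 0xDC00, 10 bits per half) come from the UTF-16 definition. Python's built-in encoder is used only as a cross-check.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split an astral code point into a (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points beyond the BMP need surrogates")
    offset = code_point - 0x10000          # 20 bits of payload
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def classify_unit(unit: int) -> str:
    """Say what a single 16-bit unit is, just from its value."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "BMP"

high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))                       # 0xd83d 0xde00
print(classify_unit(high), "/", classify_unit(low))
print(hex(from_surrogates(high, low)))           # 0x1f600
```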

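UCS-2: the older fixed-width 16-bit encoding, traditionally associated with JavaScript, whose strings are sequences of 16-bit units. It has no surrogate pairs (compare UTF-16 above), so it can represent every BMP character without any problem but nothing beyond the BMP.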
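As a quick illustration of that limit, the sketch below checks whether a string stays within the BMP and counts how many 16-bit units it would occupy in UTF-16. Both helper names are hypothetical. The unit count is what a language that indexes strings by 16-bit units would report as the length, which is why one astral character counts as two.

```python
def fits_in_ucs2(text: str) -> bool:
    """True if every character is in the BMP, so no surrogates are needed."""
    return all(ord(ch) <= 0xFFFF for ch in text)

def utf16_code_units(text: str) -> int:
    """Number of 16-bit units the text occupies when encoded as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

print(fits_in_ucs2("héllo"))            # True
print(fits_in_ucs2("a\U0001F600"))      # False
print(utf16_code_units("a\U0001F600"))  # 3: one unit for "a", two for the astral character
```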

UTF-8: Characters whose value is less than 128 (i.e. ASCII) are encoded as themselves in one byte; the high-order bit will always be zero. (Which means that a pure ASCII text is actually UTF-8 as it sits.) The rest have their bits ripped apart and dealt out into several (from two to four) bytes as follows:
  • The first byte has a bunch of high-order one bits telling you how many bytes are used to encode the character, followed by a zero bit.
  • The rest of the bytes each begin with a single one bit followed by a zero bit.
  • The bits of the character are dealt out in the space left over after these signaling bits.
Suppose a character is encoded in two bytes. Then the first byte has two one bits and a zero bit, leaving five bits of payload. The second has a one, a zero, and six bits of payload. Thus there are eleven bits of payload, and the biggest character that can squeeze into two bytes in UTF-8 is U+07FF, which is 11 ones.
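The bit dealing described above can be written out by hand. This is only a sketch for illustration (in practice you would call str.encode), the name utf8_encode is my own, and unlike a strict encoder it does not reject surrogate code points.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 by hand (no surrogate check)."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx (11 payload bits)
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    if cp <= 0x10FFFF:                  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0b11110000 | (cp >> 18),
                      0b10000000 | ((cp >> 12) & 0x3F),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    raise ValueError("beyond the Unicode code space")

# U+07FF is the largest value that fits in two bytes: 11 one bits of payload.
print(utf8_encode(0x7FF).hex())   # dfbf
# One more and a third byte is needed.
print(utf8_encode(0x800).hex())   # e0a080
```

Comparing the result with chr(cp).encode("utf-8") for a few values is an easy way to convince yourself the bit layout is right.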

In UTF-8 the unit of encoding is the byte, so there are no byte-ordering issues.