Unicode, things to know.
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
When the linked article was written (2003), Unicode defined just under 100,000 characters (later versions define more), but it has space for 1,114,112 code points. They are organized into 17 “planes” of 2^16 (65,536) code points each, numbered 0 through 16. Plane 0 is called the “Basic Multilingual Plane” or BMP.
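The plane arithmetic above can be sketched in a few lines of Python. The names plane_of, is_bmp and TOTAL_CODE_POINTS are made up for this note and are not part of any library.

```python
MAX_CODE_POINT = 0x10FFFF  # last code point of plane 16

# 17 planes of 2**16 code points each
TOTAL_CODE_POINTS = 17 * 2**16


def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) that a code point falls in."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane holds 2**16 code points, so the plane is everything
    # above the low 16 bits.
    return code_point >> 16


def is_bmp(code_point: int) -> bool:
    """True if the code point is in plane 0, the Basic Multilingual Plane."""
    return plane_of(code_point) == 0


print(TOTAL_CODE_POINTS)     # 1114112
print(plane_of(ord("A")))    # 0
print(plane_of(0x1F600))     # 1
```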
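UTF-32: each character takes 32 bits. To find the byte order, the text can start with the byte order mark U+FEFF: if the first four bytes are 00 00 FE FF the data is big-endian, and if they are FF FE 00 00 it is little-endian, so you read the rest of the bytes accordingly.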
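A minimal sketch of that byte-order check, assuming the data really is UTF-32 and may or may not start with a byte order mark. The function name utf32_byte_order is made up here. codecs.BOM_UTF32_BE and codecs.BOM_UTF32_LE are constants from the Python standard library.

```python
import codecs


def utf32_byte_order(data: bytes) -> str:
    """Guess the UTF-32 byte order from a leading BOM, if there is one."""
    if data.startswith(codecs.BOM_UTF32_BE):  # bytes 00 00 FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF32_LE):  # bytes FF FE 00 00
        return "little-endian"
    return "unknown"  # no BOM, so the order has to come from elsewhere


# "utf-32-le" encodes without a BOM, so one is prepended by hand here.
sample = codecs.BOM_UTF32_LE + "hi".encode("utf-32-le")
print(utf32_byte_order(sample))  # little-endian
```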
UTF-16: characters outside the BMP (the “astral planes”) are encoded as surrogate pairs, i.e. two 16-bit code units taken from the reserved surrogate blocks. When you look at a sixteen-bit quantity, you can tell right away whether it's an ordinary BMP character or half of an astral-plane character, and if so, which half. For byte ordering, the text can start with the byte order mark U+FEFF, or the order can be fixed up front by labelling the data as UTF-16BE (big-endian) or UTF-16LE (little-endian).
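A rough sketch of how a surrogate pair is built and how a single 16-bit unit can be classified. The helper names to_surrogates and classify_unit are made up for this note. The ranges 0xD800-0xDBFF (high half) and 0xDC00-0xDFFF (low half) are the surrogate blocks.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split an astral code point into a (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only astral code points need a surrogate pair")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the first unit
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the second
    return high, low


def classify_unit(unit: int) -> str:
    """Say what a single 16-bit code unit is, just from its value."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "BMP character"


high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))      # 0xd83d 0xde00
print(classify_unit(high))      # high surrogate
print(classify_unit(0x0041))    # BMP character
```

The result can be cross-checked against Python's own encoder: "\U0001F600".encode("utf-16-be") gives the same four bytes, D8 3D DE 00.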
UCS-2: often said to be what JavaScript uses. It is a fixed-width 16-bit encoding with no surrogate pairs (compare UTF-16), so it can represent any BMP character without a problem, but nothing outside the BMP.
UTF-8: Characters whose value is less than 128 (i.e. ASCII) are encoded as themselves in one byte; the high-order bit will always be zero. (Which means that a pure ASCII text is actually UTF-8 as it sits.) The rest have their bits ripped apart and dealt out into several (from two to four) bytes as follows:
- The first byte has a bunch of high-order one bits telling you how many bytes are used to encode the character, followed by a zero bit.
- The rest of the bytes each begin with a single one bit followed by a zero bit.
- The bits of the character are dealt out in the space left over after these signaling bits.
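The three rules above can be sketched as a hand-rolled encoder. This is only an illustration of the bit layout: utf8_encode is a made-up name, and unlike a real encoder it does not reject the surrogate range U+D800 to U+DFFF.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 by dealing out its bits by hand."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    if code_point < 0x80:
        # ASCII: encoded as itself, high-order bit zero.
        return bytes([code_point])
    if code_point < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0x3F),
                      0b10000000 | (code_point & 0x3F)])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0x3F),
                  0b10000000 | ((code_point >> 6) & 0x3F),
                  0b10000000 | (code_point & 0x3F)])


# Compare with Python's built-in encoder for a few sample characters.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```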
Suppose a character is encoded in two bytes. Then the first byte has two one bits and a zero bit, leaving five bits of payload. The second has a one, a zero, and six bits of payload. Thus there are eleven bits of payload, and the biggest character that can squeeze into two bytes in UTF-8 is U+07FF, which is 11 ones.
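The same payload counting works for three and four bytes. A small sketch (max_code_point is a made-up helper name) that repeats the arithmetic for each multi-byte length:

```python
def max_code_point(num_bytes: int) -> int:
    """Largest value whose bits fit in a UTF-8 sequence of num_bytes bytes."""
    if num_bytes == 1:
        return 0x7F  # one zero bit plus 7 bits of payload
    if num_bytes not in (2, 3, 4):
        raise ValueError("UTF-8 sequences are 1 to 4 bytes long")
    # The first byte spends num_bytes one bits and a zero bit on signaling.
    lead_payload = 8 - (num_bytes + 1)
    # Every following byte spends two bits ("10") and carries six.
    payload_bits = lead_payload + 6 * (num_bytes - 1)
    return (1 << payload_bits) - 1


print(hex(max_code_point(2)))  # 0x7ff    (11 bits)
print(hex(max_code_point(3)))  # 0xffff   (16 bits)
print(hex(max_code_point(4)))  # 0x1fffff (21 bits)
```

Four bytes have room for 21 bits, which is more than Unicode needs: the highest code point is U+10FFFF.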
In UTF-8 the unit of encoding is the byte, so there are no byte-ordering issues.
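A quick way to see the difference, using Python's built-in codecs ("utf-8", "utf-16-be" and "utf-16-le" are standard codec names; the variable names are just for this note):

```python
text = "\u00e9"  # é, a single BMP character

utf8 = text.encode("utf-8")         # one byte sequence, no ordering choice
be = text.encode("utf-16-be")       # 16-bit unit, most significant byte first
le = text.encode("utf-16-le")       # same unit, least significant byte first

print(utf8)  # b'\xc3\xa9'
print(be)    # b'\x00\xe9'
print(le)    # b'\xe9\x00'
```

The two UTF-16 results hold the same 16-bit unit with its bytes swapped, while UTF-8 has only the one form.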