ucs2 Character Set (UCS-2 Unicode Encoding)
utf16 Character Set (UTF-16 Unicode Encoding)
utf16le Character Set (UTF-16LE Unicode Encoding)
utf32 Character Set (UTF-32 Unicode Encoding)
utf8 Character Set (3-Byte UTF-8 Unicode Encoding)
utf8mb3 Character Set (Alias for utf8)
utf8mb4 Character Set (4-Byte UTF-8 Unicode Encoding)

The initial implementation of Unicode support (in MySQL 4.1) included two character sets for storing Unicode data:
ucs2, the UCS-2 encoding of the Unicode character set using 16 bits per character.
utf8, a UTF-8 encoding of the Unicode character set using one to three bytes per character.
These two character sets support the characters from the Basic Multilingual Plane (BMP) of Unicode Version 3.0. BMP characters have these characteristics:
Their code values are between 0 and 65535 (or U+0000 .. U+FFFF).
They can be encoded with a fixed 16-bit word, as in ucs2.
They can be encoded with 8, 16, or 24 bits, as in utf8.
They are sufficient for almost all characters in major languages.
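The BMP properties listed above can be checked with a short Python sketch. This is only an illustration: it uses Python's own codecs rather than MySQL, and the helper name `bmp_widths` is made up for this example. Python's `utf-16-be` codec stands in for the fixed 16-bit word, which is a fair substitute only because the inputs are restricted to BMP characters.

```python
def bmp_widths(ch):
    """Return (code point, bytes as one 16-bit word, bytes in UTF-8) for a BMP character.

    Hypothetical helper for illustration; Python codecs stand in for the encodings.
    """
    cp = ord(ch)
    if cp > 0xFFFF:
        raise ValueError("not a BMP character")
    # For BMP characters, big-endian UTF-16 is a single 16-bit word.
    return cp, len(ch.encode("utf-16-be")), len(ch.encode("utf-8"))


# U+0041 (A), U+00E9 (e with acute accent), U+20AC (euro sign): all in the BMP
for ch in ["A", "\u00e9", "\u20ac"]:
    cp, fixed, utf8 = bmp_widths(ch)
    print(f"U+{cp:04X}: {fixed} bytes as a 16-bit word, {utf8} byte(s) in UTF-8")
```

The three samples need 8, 16, and 24 bits in UTF-8 respectively, while each fits in one 16-bit word.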
Characters not supported by the aforementioned character sets include supplementary characters that lie outside the BMP. Characters outside the BMP compare as REPLACEMENT CHARACTER and convert to '?' when converted to a Unicode character set.
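As a rough illustration of the conversion result described above, the following Python sketch replaces every character outside the BMP with '?'. The function `to_bmp_only` is a hypothetical helper written for this example; it imitates the outcome only and says nothing about how MySQL performs the conversion internally.

```python
def to_bmp_only(text):
    """Replace each character outside the BMP (code point > U+FFFF) with '?'.

    Hypothetical helper: imitates the described result, not MySQL's implementation.
    """
    return "".join(ch if ord(ch) <= 0xFFFF else "?" for ch in text)


# U+1F600 is a supplementary character; the other characters are in the BMP.
sample = "caf\u00e9 \U0001F600"
print(to_bmp_only(sample))
```

BMP characters pass through unchanged, so only the final character of the sample is replaced.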
In MySQL 5.6, Unicode support includes supplementary characters, which requires new character sets that have a broader range and therefore take more space. The following table shows a brief feature comparison of previous and current Unicode support.
| Before MySQL 5.5 | MySQL 5.5 and up |
|---|---|
| All Unicode 3.0 characters | All Unicode 5.0 and 6.0 characters |
| No supplementary characters | With supplementary characters |
| ucs2 character set, BMP only | No change |
| utf8 character set for up to three bytes, BMP only | No change |
| | New utf8mb4 character set for up to four bytes, BMP or supplemental |
| | New utf16 character set, BMP or supplemental |
| | New utf16le character set, BMP or supplemental (5.6.1 and up) |
| | New utf32 character set, BMP or supplemental |
These changes are upward compatible. If you want to use the new character sets, there are potential incompatibility issues for your applications; see Section 10.1.11, "Upgrading from Previous to Current Unicode Support". That section also describes how to convert tables from utf8 to the (4-byte) utf8mb4 character set, and what constraints may apply in doing so.
MySQL 5.6 supports these Unicode character sets:
ucs2, the UCS-2 encoding of the Unicode character set using 16 bits per character.
utf16, the UTF-16 encoding for the Unicode character set; like ucs2 but with an extension for supplementary characters.
utf16le, the UTF-16LE encoding for the Unicode character set; like utf16 but little-endian rather than big-endian.
utf32, the UTF-32 encoding for the Unicode character set using 32 bits per character.
utf8, a UTF-8 encoding of the Unicode character set using one to three bytes per character.
utf8mb4, a UTF-8 encoding of the Unicode character set using one to four bytes per character.
ucs2 and utf8 support BMP characters. utf8mb4, utf16, utf16le, and utf32 support BMP and supplementary characters.
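To make the storage difference concrete, the sketch below encodes one BMP character and one supplementary character with Python codecs that correspond to the four character sets able to hold both. The mapping from MySQL names to Python codec names is an assumption made for this illustration, and `encoded_lengths` is a hypothetical helper. ucs2 and the 3-byte utf8 are left out because Python has no codec restricted to the BMP.

```python
# Assumed mapping for illustration: MySQL character set name -> Python codec.
CODECS = {
    "utf16": "utf-16-be",    # big-endian, no BOM
    "utf16le": "utf-16-le",  # little-endian, no BOM
    "utf32": "utf-32-be",    # big-endian, no BOM
    "utf8mb4": "utf-8",      # one to four bytes per character
}


def encoded_lengths(ch):
    """Return the encoded size in bytes of one character under each codec."""
    return {name: len(ch.encode(codec)) for name, codec in CODECS.items()}


print(encoded_lengths("\u20ac"))      # BMP character (euro sign)
print(encoded_lengths("\U0001F600"))  # supplementary character
```

The BMP character takes 2 bytes in the UTF-16 variants, 4 in UTF-32, and 3 in UTF-8. The supplementary character takes 4 bytes in every one of them: a surrogate pair in UTF-16, one 32-bit unit in UTF-32, and a 4-byte sequence in UTF-8.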
A similar set of collations is available for most Unicode character sets. For example, each has a Danish collation, the names of which are ucs2_danish_ci, utf16_danish_ci, utf32_danish_ci, utf8_danish_ci, and utf8mb4_danish_ci. The exception is utf16le, which has only two collations. All Unicode collations are listed at Section 10.1.14.1, "Unicode Character Sets", which also describes collation properties for supplementary characters.
Note that although many of the supplementary characters come from East Asian languages, what MySQL 5.6 adds is support for more Japanese and Chinese characters in Unicode character sets, not support for new Japanese and Chinese character sets.
The MySQL implementation of UCS-2, UTF-16, and UTF-32 stores characters in big-endian byte order and does not use a byte order mark (BOM) at the beginning of values. Other database systems might use little-endian byte order or a BOM. In such cases, conversion of values will need to be performed when transferring data between those systems and MySQL. The implementation of UTF-16LE is little-endian.
MySQL uses no BOM for UTF-8 values.
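When data comes from a system that writes a BOM or uses little-endian order, it has to be normalized before it matches the byte layout described above. The following Python sketch shows one way this might be done on the client side. Both function names are hypothetical, and the code relies only on Python's standard codecs, not on any MySQL API.

```python
import codecs


def to_big_endian_no_bom(data):
    """Re-encode BOM-prefixed UTF-16 bytes as big-endian UTF-16 without a BOM.

    Hypothetical helper. The generic "utf-16" codec reads and drops a leading BOM;
    input without a BOM would be decoded in the platform's native byte order,
    so this sketch assumes a BOM is present.
    """
    text = data.decode("utf-16")
    return text.encode("utf-16-be")  # the explicit -be codec never writes a BOM


def strip_utf8_bom(data):
    """Return UTF-8 bytes with a leading BOM removed, if there is one."""
    return data.decode("utf-8-sig").encode("utf-8")


little_endian_with_bom = codecs.BOM_UTF16_LE + "A\u20ac".encode("utf-16-le")
print(to_big_endian_no_bom(little_endian_with_bom).hex())  # 004120ac
print(strip_utf8_bom(codecs.BOM_UTF8 + b"abc"))            # b'abc'
```

Decoding to text first and then encoding with an explicit byte order keeps the two concerns separate: the BOM is consumed on input, and the output order is chosen by codec name rather than by the platform.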
Client applications that need to communicate with the server using Unicode should set the client character set accordingly; for example, by issuing a SET NAMES 'utf8' statement. ucs2, utf16, utf16le, and utf32 cannot be used as a client character set, which means that they do not work for SET NAMES or SET CHARACTER SET. (See Section 10.1.4, "Connection Character Sets and Collations".)
The following sections provide additional detail on the Unicode character sets in MySQL.