Spec-Zone .ru
спецификации, руководства, описания, API
|
MySQL 5.6 supports these Unicode character sets:
ucs2
, the UCS-2 encoding of the Unicode character set
using 16 bits per character.
utf16
, the UTF-16 encoding for the Unicode character
set; like ucs2
but with an extension for supplementary characters.
utf16le
, the UTF-16LE encoding for the Unicode
character set; like utf16
but little-endian rather than big-endian.
utf32
, the UTF-32 encoding for the Unicode character
set using 32 bits per character.
utf8
, a UTF-8 encoding of the Unicode character set
using one to three bytes per character.
utf8mb4
, a UTF-8 encoding of the Unicode character set
using one to four bytes per character.
ucs2
and utf8
support Basic Multilingual Plane (BMP)
characters. utf8mb4
, utf16
, utf16le
,
and utf32
support BMP and supplementary characters. utf16le
was added in MySQL 5.6.1.
You can store text in about 650 languages using these character sets. This section lists the collations available for each Unicode character set and describes their differentiating properties. For general information about the character sets, see Section 10.1.10, "Unicode Support".
A similar set of collations is available for most Unicode character sets. These are shown in the following list,
where xxx
represents the character set name. For example,
represents the Danish
collations, the specific names of which are xxx
_danish_ciucs2_danish_ci
, utf16_danish_ci
,
utf32_danish_ci
, utf8_danish_ci
, and utf8mb4_danish_ci
.
Collation support for utf16le
is more limited. The only collations available are
utf16le_general_ci
and utf16le_bin
. These are similar
to utf16_general_ci
and utf16_bin
.
Unicode collation names may also include a version number (for example,
) to indicate the Unicode Collation Algorithm
version on which the collation is based, as described later in this section. For such collations, there is no
xxx
_unicode_520_ciutf8mb3
alias to the corresponding utf8
collation. See
Section 10.1.10.6, "The utf8mb3
"Character Set" (Alias for utf8
)".
xxx
_bin
xxx
_croatian_ci
xxx
_czech_ci
xxx
_danish_ci
xxx
_esperanto_ci
xxx
_estonian_ci
(default) xxx
_general_ci
xxx
_german2_ci
xxx
_general_mysql500_ci
xxx
_hungarian_ci
xxx
_icelandic_ci
xxx
_latvian_ci
xxx
_lithuanian_ci
xxx
_persian_ci
xxx
_polish_ci
xxx
_roman_ci
xxx
_romanian_ci
xxx
_sinhala_ci
xxx
_slovak_ci
xxx
_slovenian_ci
xxx
_spanish_ci
xxx
_spanish2_ci
xxx
_swedish_ci
xxx
_turkish_ci
xxx
_unicode_ci
xxx
_vietnamese_ci
MySQL implements the
collations according to the Unicode Collation Algorithm (UCA) described at xxx
_unicode_ci
collations
have only partial support for the Unicode Collation Algorithm. Some characters are not supported yet. Also,
combining marks are not fully supported. This affects primarily Vietnamese, Yoruba, and some smaller languages
such as Navajo. xxx
_unicode_ci
MySQL implements language-specific Unicode collations only if the ordering with
does not work well for a language.
Language-specific collations are UCA-based. They are derived from xxx
_unicode_ci
with additional language tailoring rules. xxx
_unicode_ci
Collations based on UCA versions later than 4.0.0 include the version in the collation name. Thus,
collations are based on
UCA 5.2.0 weight keys: xxx
_unicode_520_ci
LOWER()
and UPPER()
perform case folding according to the collation of their argument. A
character that has uppercase and lowercase versions only in a Unicode version more recent than 4.0.0 will be
converted by these functions only if the argument has a collation that uses a recent enough UCA version.
For any Unicode character set, operations performed using the
collation are faster than those for the xxx
_general_ci
collation. For example,
comparisons for the xxx
_unicode_ciutf8_general_ci
collation are faster, but slightly less
correct, than comparisons for utf8_unicode_ci
. The reason for this is that utf8_unicode_ci
supports mappings such as expansions; that is, when one character
compares as equal to combinations of other characters. For example, in German and some other languages "ß
" is equal to "ss
". utf8_unicode_ci
also supports contractions and ignorable characters. utf8_general_ci
is a legacy collation that does not support expansions,
contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
To further illustrate, the following equalities hold in both utf8_general_ci
and
utf8_unicode_ci
(for the effect this has in comparisons or when doing searches, see
Section 10.1.7.8, "Examples of the Effect of
Collation"):
Ä = AÖ = OÜ = U
A difference between the collations is that this is true for utf8_general_ci
:
ß = s
Whereas this is true for utf8_unicode_ci
, which supports the German DIN-1 ordering
(also known as dictionary order):
ß = ss
MySQL implements language-specific collations for the utf8
character set only if
the ordering with utf8_unicode_ci
does not work well for a language. For example,
utf8_unicode_ci
works fine for German dictionary order and French, so there is no
need to create special utf8
collations.
utf8_general_ci
also is satisfactory for both German and French, except that "ß
" is equal to "s
", and not to "ss
". If this is acceptable for
your application, you should use utf8_general_ci
because it is faster. If this is
not acceptable (for example, if you require German dictionary order), use utf8_unicode_ci
because it is more accurate.
If you require German DIN-2 (phone book) ordering, use the utf8_german2_ci
collation, which compares the following sets of characters equal:
Ä = Æ = AEÖ = Œ = OEÜ = UEß = ss
utf8_german2_ci
is similar to latin1_german2_ci
, but
the latter does not compare "Æ
" equal to "AE
" or "Œ
" equal to "OE
". There is no utf8_german_ci
corresponding to latin1_german_ci
for German dictionary order because utf8_general_ci
suffices.
includes Swedish rules. For
example, in Swedish, the following relationship holds, which is not something expected by a German or French
speaker: xxx
_swedish_ci
Ü = Y < Ö
The
and xxx
_spanish_ci
collations correspond to
modern Spanish and traditional Spanish, respectively. In both collations, "xxx
_spanish2_ciñ
" (n-tilde) is a separate letter between "n
" and "o
". In addition, for
traditional Spanish, "ch
" is
a separate letter between "c
"
and "d
", and "ll
" is a separate letter
between "l
" and "m
"
In the
collations, xxx
_roman_ciI
and J
compare as equal, and U
and V
compare as equal.
The
collations are
tailored for these Croatian letters: xxx
_croatian_ciČ
, Ć
, Dž
, Đ
, Lj
, Nj
, Š
, Ž
.
For all Unicode collations except the "binary" (
) collations, MySQL performs a table
lookup to find a character's collating weight. This weight can be displayed using the xxx
_binWEIGHT_STRING()
function. (See Section
12.5, "String Functions".) If a character is not in the table (for example, because it is a "new" character), collating weight determination becomes
more complex:
For BMP characters in general collations (
), weight = code point. xxx
_general_ci
For BMP characters in UCA collations (for example,
and language-specific collations), the
following algorithm applies: xxx
_unicode_ci
if (code >= 0x3400 && code <= 0x4DB5) base= 0xFB80; /* CJK Ideograph Extension */else if (code >= 0x4E00 && code <= 0x9FA5) base= 0xFB40; /* CJK Ideograph */else base= 0xFBC0; /* All other characters */aaaa= base + (code >> 15);bbbb= (code & 0x7FFF) | 0x8000;
The result is a sequence of two collating elements, aaaa
followed by
bbbb
. For example:
mysql> SELECT HEX(WEIGHT_STRING(_ucs2
0x04CF COLLATE ucs2_unicode_ci));
+----------------------------------------------------------+| HEX(WEIGHT_STRING(_ucs2 0x04CF COLLATE ucs2_unicode_ci)) |+----------------------------------------------------------+| FBC084CF |+----------------------------------------------------------+
Thus, U+04cf CYRILLIC SMALL LETTER PALOCHKA
is, with all UCA 4.0.0
collations, greater than U+04c0 CYRILLIC LETTER PALOCHKA
. With UCA
5.2.0 collations, all palochkas sort together.
For supplementary characters in general collations, the weight is the weight for
0xfffd REPLACEMENT CHARACTER
. For supplementary characters in UCA 4.0.0
collations, their collating weight is 0xfffd
. That is, to MySQL, all
supplementary characters are equal to each other, and greater than almost all BMP characters.
An example with Deseret characters and COUNT(DISTINCT)
:
CREATE TABLE t (s1 VARCHAR(5) CHARACTER SET utf32 COLLATE utf32_unicode_ci);INSERT INTO t VALUES (0xfffd); /* REPLACEMENT CHARACTER */INSERT INTO t VALUES (0x010412); /* DESERET CAPITAL LETTER BEE */INSERT INTO t VALUES (0x010413); /* DESERET CAPITAL LETTER TEE */SELECT COUNT(DISTINCT s1) FROM t;
The result is 2 because in the MySQL
collations, the replacement character has a weight of xxx
_unicode_ci0x0dc6
, whereas
Deseret Bee and Deseret Tee both have a weight of 0xfffd
. (Were the
utf32_general_ci
collation used instead, the result would be 1 because
all three characters have a weight of 0xfffd
in that collation.)
An example with cuneiform characters and WEIGHT_STRING()
:
/*The four characters in the INSERT string are00000041 # LATIN CAPITAL LETTER A0001218F # CUNEIFORM SIGN KAB000121A7 # CUNEIFORM SIGN KISH00000042 # LATIN CAPITAL LETTER B*/CREATE TABLE t (s1 CHAR(4) CHARACTER SET utf32 COLLATE utf32_unicode_ci);INSERT INTO t VALUES (0x000000410001218f000121a700000042);SELECT HEX(WEIGHT_STRING(s1)) FROM t;
The result is:
0E33 FFFD FFFD 0E4A
0E33
and 0E4A
are primary weights as in FFFD
is the weight for KAB and also for KISH.
The rule that all supplementary characters are equal to each other is nonoptimal but is not expected to cause trouble. These characters are very rare, so it will be very rare that a multi-character string consists entirely of supplementary characters. In Japan, since the supplementary characters are obscure Kanji ideographs, the typical user does not care what order they are in, anyway. If you really want rows sorted by MySQL's rule and secondarily by code point value, it is easy:
ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_bin
For supplementary characters based on UCA versions later than 4.0.0 (for example,
),
supplementary characters do not necessarily all have the same collation weight. Some have explicit
weights from the UCA xxx
_unicode_520_ciallkeys.txt
file. Others have weights calculated from
this algorithm:
aaaa= base + (code >> 15);bbbb= (code & 0x7FFF) | 0x8000;
The utf16_bin
Collation
There is a difference between "ordering by the character's code value"
and "ordering by the character's binary representation,"
a difference that appears only with utf16_bin
, because of surrogates.
Suppose that utf16_bin
(the binary collation for utf16
)
was a binary comparison "byte by byte" rather than "character by character." If that were so, the order of
characters in utf16_bin
would differ from the order in utf8_bin
.
For example, the following chart shows two rare characters. The first character is in the range E000
-FFFF
, so it is greater than a surrogate but
less than a supplementary. The second character is a supplementary.
Code point Character utf8 utf16---------- --------- ---- -----0FF9D HALFWIDTH KATAKANA LETTER N EF BE 9D FF 9D10384 UGARITIC LETTER DELTA F0 90 8E 84 D8 00 DF 84
The two characters in the chart are in order by code point value because 0xff9d
< 0x10384
. And they are in order by utf8
value
because 0xef
< 0xf0
. But they are not in order by
utf16
value, if we use byte-by-byte comparison, because 0xff
> 0xd8
.
So MySQL's utf16_bin
collation is not "byte by byte."
It is "by code point." When MySQL sees a
supplementary-character encoding in utf16
, it converts to the character's
code-point value, and then compares. Therefore, utf8_bin
and utf16_bin
are the same ordering. This is consistent with the SQL:2008 standard requirement for a UCS_BASIC collation:
"UCS_BASIC is a collation in which the ordering is determined entirely by the Unicode scalar values of the characters in the strings being sorted. It is applicable to the UCS character repertoire. Since every character repertoire is a subset of the UCS repertoire, the UCS_BASIC collation is potentially applicable to every character set. NOTE 11: The Unicode scalar value of a character is its code point treated as an unsigned integer."
If the character set is ucs2
, comparison is byte-by-byte, but ucs2
strings should not contain surrogates, anyway.
The
collations
were added in MySQL 5.6.5. They preserve the pre-5.1.24 ordering of the original xxx
_general_mysql500_ci
collations and permit upgrades for tables
created before MySQL 5.1.24. For more information, see Section
2.11.3, "Checking Whether Tables or Indexes Must Be Rebuilt", and Section
2.11.4, "Rebuilding or Repairing Tables or Indexes". xxx
_general_ci
For additional information about Unicode collations in MySQL, see Collation-Charts.Org (