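This section describes the LDML syntax that MySQL recognizes. This is a subset of the syntax described in the LDML specification. In many cases, the relevant part of an existing LDML collation definition (that is, the part between the <rules> and </rules> tags) can be pasted into the MySQL Index.xml file.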
The rules described here are all supported except that character sorting occurs only at the primary level. Rules
that specify differences at secondary or higher sort levels are recognized (and thus can be included in
collation definitions) but are treated as equality at the primary level.
The MySQL server generates diagnostics when it finds problems while parsing the Index.xml file. See Section 10.4.4.3, "Diagnostics During Index.xml Parsing".
Character Representation
Characters named in LDML rules can be written literally or in \unnnn format, where nnnn is the hexadecimal Unicode code point value. For example, A and á can be written literally or as \u0041 and \u00E1. Within hexadecimal values, the digits A through F are not case sensitive; \u00E1 and \u00e1 are equivalent. For UCA 4.0.0 collations, hexadecimal notation can be used only for characters in the Basic Multilingual Plane, not for characters outside the BMP range of 0000 to FFFF. For UCA 5.2.0 collations, hexadecimal notation can be used for any character.
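As a rough illustration of the two notations, the following Python sketch decodes \unnnn escapes into the literal characters they name. The function name `decode_ldml_escapes` is hypothetical and not part of MySQL. The sketch assumes exactly four hexadecimal digits per escape, so it covers only the BMP case and does not model how the server itself parses the file.

```python
import re

# Matches a backslash, 'u', and exactly four hex digits (either letter case).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_ldml_escapes(text: str) -> str:
    # Replace each four-digit escape with the character at that code point.
    # Hypothetical helper for illustration; not a MySQL API.
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(decode_ldml_escapes(r"<reset>\u0041</reset>"))  # prints <reset>A</reset>
```

Because the pattern accepts both uppercase and lowercase hex digits, `\u00E1` and `\u00e1` decode to the same character.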
The Index.xml file itself should be written using UTF-8 encoding.
Syntax Rules
LDML has reset rules and shift rules to specify character ordering. Orderings are given as a set of rules that begin with a reset rule that establishes an anchor point, followed by shift rules that indicate how characters sort relative to the anchor point.
A <reset> rule does not specify any ordering in and of itself. Instead, it "resets" the ordering for subsequent shift rules to cause them to be taken in relation to a given character. Either of the following rules resets subsequent shift rules to be taken in relation to the letter 'A':
<reset>A</reset>
<reset>\u0041</reset>
The <p>, <s>, and <t> shift rules define primary, secondary, and tertiary differences of a character from another character:
Use primary differences to distinguish separate letters.
Use secondary differences to distinguish accent variations.
Use tertiary differences to distinguish lettercase variations.
Either of these rules specifies a primary shift rule for the 'G' character:
<p>G</p>
<p>\u0047</p>
The <i> shift rule indicates that one character sorts identically to another. The following rules cause 'b' to sort the same as 'a':
<reset>a</reset>
<i>b</i>
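To make the reset/shift structure concrete, here is a minimal Python sketch that walks a rule fragment and reports, for each shift rule, the most recent reset anchor and the level of the difference. The function `describe_rules` and the level labels are assumptions made for illustration. The sketch handles only the plain <reset>, <p>, <s>, <t>, and <i> elements and does not reproduce MySQL's own parser.

```python
import xml.etree.ElementTree as ET

# Shift tags described in the text, mapped to illustrative level labels.
SHIFT_LEVELS = {"p": "primary", "s": "secondary", "t": "tertiary", "i": "identical"}


def describe_rules(fragment: str):
    # Wrap the fragment so it parses as a single XML document.
    root = ET.fromstring(f"<rules>{fragment}</rules>")
    anchor = None
    described = []
    for elem in root:
        if elem.tag == "reset":
            # A reset rule only establishes the anchor; it orders nothing itself.
            anchor = elem.text
        elif elem.tag in SHIFT_LEVELS:
            described.append((anchor, SHIFT_LEVELS[elem.tag], elem.text))
    return described


print(describe_rules("<reset>a</reset><i>b</i>"))  # prints [('a', 'identical', 'b')]
```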
Abbreviated shift syntax specifies multiple shift rules using a single pair of tags. The following table shows the correspondence between abbreviated syntax rules and the equivalent nonabbreviated rules.
Table 10.2. Abbreviated Shift Syntax
Abbreviated Syntax | Nonabbreviated Syntax |
---|---|
<pc>xyz</pc> | <p>x</p><p>y</p><p>z</p> |
<sc>xyz</sc> | <s>x</s><s>y</s><s>z</s> |
<tc>xyz</tc> | <t>x</t><t>y</t><t>z</t> |
<ic>xyz</ic> | <i>x</i><i>y</i><i>z</i> |
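The correspondence in the table can be sketched as a small text transformation. The Python function below, `expand_abbreviated` (a hypothetical name), rewrites each abbreviated element into one shift rule per character. It assumes the element body contains only literal characters; a body written with \unnnn escapes would need decoding first.

```python
import re

# Matches <pc>, <sc>, <tc>, or <ic> elements and captures the tag and body.
_ABBREVIATED = re.compile(r"<(pc|sc|tc|ic)>(.*?)</\1>")


def expand_abbreviated(rules: str) -> str:
    def expand(match):
        tag = match.group(1)[0]  # 'pc' -> 'p', 'sc' -> 's', and so on
        return "".join(f"<{tag}>{ch}</{tag}>" for ch in match.group(2))

    return _ABBREVIATED.sub(expand, rules)


print(expand_abbreviated("<reset>a</reset><pc>xyz</pc>"))
# prints <reset>a</reset><p>x</p><p>y</p><p>z</p>
```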
An expansion is a reset rule that establishes an anchor point for a multiple-character sequence. MySQL supports expansions 2 to 6 characters long. The following rules put 'z' greater at the primary level than the sequence of three characters 'abc':
<reset>abc</reset>
<p>z</p>
A contraction is a shift rule that sorts a multiple-character sequence. MySQL supports contractions 2 to 6 characters long. The following rules put the sequence of three characters 'xyz' greater at the primary level than 'a':
<reset>a</reset>
<p>xyz</p>
Long expansions and long contractions can be used together. These rules put the sequence of three characters 'xyz' greater at the primary level than the sequence of three characters 'abc':
<reset>abc</reset>
<p>xyz</p>
Normal expansion syntax uses <x> plus <extend> elements to specify an expansion. The following rules put the character 'k' greater at the secondary level than the sequence 'ch'. That is, 'k' behaves as if it expands to a character after 'c' followed by 'h':
<reset>c</reset>
<x><s>k</s><extend>h</extend></x>
This syntax permits long sequences. These rules sort the sequence 'ccs' greater at the tertiary level than the sequence 'cscs':
<reset>cs</reset>
<x><t>ccs</t><extend>cs</extend></x>
The LDML specification describes normal expansion syntax as "tricky." See that specification for details.
Previous context syntax uses <x> plus <context> elements to specify that the context before a character affects how it sorts. The following rules put '-' greater at the secondary level than 'a', but only when '-' occurs after 'b':
<reset>a</reset>
<x><context>b</context><s>-</s></x>
Previous context syntax can include the <extend> element. These rules put 'def' greater at the primary level than 'aghi', but only when 'def' comes after 'abc':
<reset>a</reset>
<x><context>abc</context><p>def</p><extend>ghi</extend></x>
Reset rules permit a before attribute. Normally, shift rules after a reset rule indicate characters that sort after the reset character. Shift rules after a reset rule that has the before attribute indicate characters that sort before the reset character. The following rules put the character 'b' immediately before 'a' at the primary level:
<reset before="primary">a</reset>
<p>b</p>
Permissible before attribute values specify the sort level by name or the equivalent numeric value:
<reset before="primary">
<reset before="1">
<reset before="secondary">
<reset before="2">
<reset before="tertiary">
<reset before="3">
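The name/number equivalence above amounts to a small lookup. The following Python sketch normalizes a before value to its level name and rejects anything else. The helper `normalize_before` is hypothetical, and the choice to raise an error for other values is an assumption of this sketch, not documented server behavior.

```python
# Numeric before values and the level names they are equivalent to.
_LEVEL_BY_NUMBER = {"1": "primary", "2": "secondary", "3": "tertiary"}


def normalize_before(value: str) -> str:
    # Map "1"/"2"/"3" to the level name; pass level names through unchanged.
    name = _LEVEL_BY_NUMBER.get(value, value)
    if name not in _LEVEL_BY_NUMBER.values():
        raise ValueError(f"unsupported before value: {value!r}")
    return name


print(normalize_before("2"))  # prints secondary
```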
A reset rule can name a logical reset position rather than a literal character:
<first_tertiary_ignorable/>
<last_tertiary_ignorable/>
<first_secondary_ignorable/>
<last_secondary_ignorable/>
<first_primary_ignorable/>
<last_primary_ignorable/>
<first_variable/>
<last_variable/>
<first_non_ignorable/>
<last_non_ignorable/>
<first_trailing/>
<last_trailing/>
These rules put 'z' greater at the primary level than nonignorable characters that have a Default Unicode Collation Element Table (DUCET) entry and that are not CJK:
<reset><last_non_ignorable/></reset>
<p>z</p>
Logical positions have the code points shown in the following table.
Table 10.3. Logical Reset Position Code Points
Logical Position | Unicode 4.0.0 Code Point | Unicode 5.2.0 Code Point |
---|---|---|
<first_non_ignorable/> | U+02D0 | U+02D0 |
<last_non_ignorable/> | U+A48C | U+1342E |
<first_primary_ignorable/> | U+0332 | U+0332 |
<last_primary_ignorable/> | U+20EA | U+101FD |
<first_secondary_ignorable/> | U+0000 | U+0000 |
<last_secondary_ignorable/> | U+FE73 | U+FE73 |
<first_tertiary_ignorable/> | U+0000 | U+0000 |
<last_tertiary_ignorable/> | U+FE73 | U+FE73 |
<first_trailing/> | U+0000 | U+0000 |
<last_trailing/> | U+0000 | U+0000 |
<first_variable/> | U+0009 | U+0009 |
<last_variable/> | U+2183 | U+1D371 |
The <collation> element permits a shift-after-method attribute that affects character weight calculation for shift rules. The attribute has these permitted values:
simple: Calculate character weights as for reset rules that do not have a before attribute. This is the default if the attribute is not given.
expand: Use expansions for shifts after reset rules.
Suppose that '0' and '1' have weights of 0E29 and 0E2A and we want to put all basic Latin letters between '0' and '1':
<reset>0</reset>
<pc>abcdefghijklmnopqrstuvwxyz</pc>
For simple shift mode, weights are calculated as follows:
'a' has weight 0E29+1
'b' has weight 0E29+2
'c' has weight 0E29+3
...
However, there are not enough vacant positions to put 26 characters between '0' and '1': the weight 0E29+1 is already 0E2A, the weight of '1'. The result is that digits and letters are intermixed.
To solve this, use shift-after-method="expand". Then weights are calculated like this:
'a' has weight [0E29][233D+1]
'b' has weight [0E29][233D+2]
'c' has weight [0E29][233D+3]
...
233D is the UCA 4.0.0 weight for character 0xA48C, which is the last nonignorable character (in effect, the greatest character in the collation, excluding CJK). UCA 5.2.0 is similar but uses 3ACA, for character 0x1342E.
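The arithmetic behind the two modes can be checked with a short Python sketch. It uses the weights quoted in the example and models each character's weight as a tuple compared element by element. That tuple comparison is a simplifying assumption for illustration, and the function names are hypothetical.

```python
import string

# Weights taken from the example in the text (hexadecimal).
WEIGHT_DIGIT_0 = 0x0E29
WEIGHT_DIGIT_1 = 0x0E2A
LAST_NON_IGNORABLE_UCA_400 = 0x233D  # UCA 4.0.0 weight of character 0xA48C


def simple_weights(anchor, chars):
    # Simple mode: one weight per character: anchor+1, anchor+2, ...
    return {ch: (anchor + n,) for n, ch in enumerate(chars, start=1)}


def expand_weights(anchor, base, chars):
    # Expand mode: two weights per character: [anchor][base+n].
    return {ch: (anchor, base + n) for n, ch in enumerate(chars, start=1)}


letters = string.ascii_lowercase
simple = simple_weights(WEIGHT_DIGIT_0, letters)
expand = expand_weights(WEIGHT_DIGIT_0, LAST_NON_IGNORABLE_UCA_400, letters)

# Simple mode: 'a' lands on the weight already used by '1'.
print(simple["a"] == (WEIGHT_DIGIT_1,))  # prints True
# Expand mode: every letter falls strictly between '0' and '1'.
print(all((WEIGHT_DIGIT_0,) < w < (WEIGHT_DIGIT_1,) for w in expand.values()))  # prints True
```

Under this model, the second weight in expand mode is what creates room for all 26 letters after '0' without reaching the weight of '1'.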
MySQL-Specific LDML Extensions
In MySQL 5.6, an extension to LDML rules permits the <collation> element to include an optional version attribute that indicates the UCA version on which the collation is based. If the version attribute is omitted, its default value is 4.0.0. For example, this specification indicates a collation that is based on UCA 5.2.0:
<collation id="nnn" name="utf8_xxx_ci" version="5.2.0">...</collation>
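The defaulting rule for the version attribute can be sketched in Python as follows. The function `uca_version` is hypothetical, and the id and name values in the sample elements are placeholders chosen for illustration, not real collation definitions.

```python
import xml.etree.ElementTree as ET

DEFAULT_UCA_VERSION = "4.0.0"  # default stated in the text when version is omitted


def uca_version(collation_xml: str) -> str:
    # Parse a single <collation> element and read its optional version attribute.
    elem = ET.fromstring(collation_xml)
    return elem.get("version", DEFAULT_UCA_VERSION)


# Placeholder id/name values, for illustration only.
with_version = '<collation id="1" name="example_a_ci" version="5.2.0"><rules/></collation>'
without_version = '<collation id="2" name="example_b_ci"><rules/></collation>'

print(uca_version(with_version))     # prints 5.2.0
print(uca_version(without_version))  # prints 4.0.0
```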