The Asian character sets that we support include Chinese, Japanese, Korean, and Thai. These can be complicated.
For example, the Chinese sets must allow for thousands of different characters. See Section
10.1.14.7.1, "The cp932 Character Set", for additional information about
the cp932 and sjis character sets.
In MySQL, the sjis character set corresponds to the Shift_JIS
character set defined by IANA, which supports JIS X0201 and JIS X0208 characters. (See http://www.iana.org/assignments/character-sets.)
However, as a descriptive term, "SHIFT JIS" has become vague:
it often also covers the extensions to Shift_JIS
that are defined by various vendors.
For example, the "SHIFT JIS" used in Japanese Windows
environments is a Microsoft extension of Shift_JIS whose exact name is Microsoft Windows Codepage 932, or cp932. In
addition to the characters supported by Shift_JIS, cp932 supports extension characters such as NEC special characters, NEC
selected IBM extended characters, and IBM selected characters.
Many Japanese users have experienced problems using these extension characters. These problems stem from the
following factors:
- MySQL automatically converts character sets.
- Character sets are converted using Unicode (ucs2).
- The sjis character set does not support the conversion of these extension characters.
- There are several conversion rules from so-called "SHIFT JIS" to Unicode, and some characters are converted to Unicode
  differently depending on the conversion rule. MySQL supports only one of these rules (described later).
The MySQL cp932 character set is designed to solve these problems.
Because MySQL supports character set conversion, and because IANA Shift_JIS
and cp932 follow different conversion rules, it is important to keep them as
two separate character sets.
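The gap between the two character sets can be seen outside MySQL as well. The following is a minimal sketch that uses Python's built-in shift_jis and cp932 codecs as stand-ins for the MySQL sjis and cp932 character sets; the codecs are not MySQL's implementation, and the helper name try_encode is made up for this illustration. It tries to encode one NEC special character (U+2460, CIRCLED DIGIT ONE) with each codec.

```python
# Illustration only: Python's codecs stand in for MySQL's sjis and cp932.
NEC_SPECIAL = "\u2460"  # CIRCLED DIGIT ONE, one of the NEC special characters


def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


for codec in ("shift_jis", "cp932"):
    encoded = try_encode(NEC_SPECIAL, codec)
    if encoded is None:
        print(codec, "cannot represent U+2460")
    else:
        print(codec, "encodes U+2460 as 0x" + encoded.hex().upper())
```

Characters that belong to JIS X 0208 itself, such as hiragana, encode to the same bytes under both codecs; only the vendor extensions behave differently.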
How does cp932 differ from sjis?
The cp932 character set differs from sjis in the
following ways:
cp932 supports NEC special characters, NEC
selected IBM extended characters, and IBM selected characters.
Some cp932 characters have two different code
points, both of which convert to the same Unicode code point. When converting from Unicode back to
cp932, one of the code points must be selected. For this "round trip conversion," the rule
recommended by Microsoft is used. (See http://support.microsoft.com/kb/170559/EN-US/.)
The conversion rule works like this:
- If the character is in both JIS X 0208 and NEC special characters,
  use the code point of JIS X 0208.
- If the character is in both NEC special characters and IBM selected
  characters, use the code point of NEC special characters.
- If the character is in both IBM selected characters and NEC
  selected IBM extended characters, use the code point of IBM selected characters.
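The duplicate code points behind this rule can be inspected with a short script. This is a sketch under stated assumptions: Python's cp932 codec stands in for MySQL's cp932 conversion and may not choose the same code point as MySQL in every case, and the names PAIRS and round_trip are invented for the example. Each pair holds two cp932 byte sequences for one character: U+2235 (in JIS X 0208 and in NEC special characters) and U+7E8A (in IBM selected characters and in NEC selected IBM extended characters).

```python
# Illustration only: Python's cp932 codec stands in for MySQL's cp932.
PAIRS = {
    "JIS X 0208 / NEC special": (bytes.fromhex("81e6"), bytes.fromhex("879a")),
    "IBM selected / NEC selected IBM extended": (
        bytes.fromhex("fa5c"),
        bytes.fromhex("ed40"),
    ),
}


def round_trip(raw, codec="cp932"):
    """Decode raw bytes to Unicode, then encode the result back to bytes."""
    return raw.decode(codec).encode(codec)


for label, (first, second) in PAIRS.items():
    char = first.decode("cp932")
    print(label)
    print("  both decode to U+%04X:" % ord(char), char == second.decode("cp932"))
    for raw in (first, second):
        print("  0x%s round-trips to 0x%s"
              % (raw.hex().upper(), round_trip(raw).hex().upper()))
```

Whichever code point the codec picks, both members of a pair round-trip to the same bytes, so one of the two original byte sequences is not preserved. Applications that compare raw bytes before and after a conversion need to account for this.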
The table shown at http://www.microsoft.com/globaldev/reference/dbcs/932.htm
provides information about the Unicode values of cp932 characters.
In that table, a four-digit number shown under a cp932 character
is the corresponding Unicode (ucs2) encoding. A table entry with an underlined two-digit
value stands for a range of cp932 character values that
begin with those two digits. Clicking such a table entry takes you to a page that displays the
Unicode value for each of the cp932 characters that begin with
those digits.
The following links are of special interest. They correspond to the encodings for the following
sets of characters:
cp932 supports conversion of user-defined
characters in combination with eucjpms, and solves the problems with
sjis/ujis conversion. For details, please
refer to http://www.opengroup.or.jp/jvc/cde/sjis-euc-e.html.
For some characters, conversion to and from ucs2 is different for sjis and cp932. The following tables illustrate
these differences.