MySQL 5.5 supports these Unicode character sets:
ucs2
, the UCS-2 encoding of the Unicode character set using 16 bits per characterutf16
, the UTF-16 encoding for the Unicode character set; likeucs2
but with an extension for supplementary charactersutf32
, the UTF-32 encoding for the Unicode character set using 32 bits per characterutf8
, a UTF-8 encoding of the Unicode character set using one to three bytes per characterutf8mb4
, a UTF-8 encoding of the Unicode character set using one to four bytes per character
ucs2
and utf8
support
Basic Multilingual Plane (BMP) characters.
utf8mb4
, utf16
, and
utf32
support BMP and supplementary
characters. The utf8mb4
,
utf16
, and utf32
character sets were added in MySQL 5.5.3.
You can store text in about 650 languages using these character sets. This section lists the collations available for each Unicode character set and describes their differentiating properties. For general information about the character sets, see Section 9.1.10, “Unicode Support”.
A similar set of collations is available for each Unicode
character set. These are shown in the following list, where
xxx
represents the character set
name. For example,
represents the Danish collations, the specific names of which
are xxx
_danish_ciucs2_danish_ci
,
utf16_danish_ci
,
utf32_danish_ci
,
utf8_danish_ci
, and
utf8mb4_danish_ci
.
xxx
_binxxx
_czech_cixxx
_danish_cixxx
_esperanto_cixxx
_estonian_ci
(default)xxx
_general_cixxx
_hungarian_cixxx
_icelandic_cixxx
_latvian_cixxx
_lithuanian_cixxx
_persian_cixxx
_polish_cixxx
_roman_cixxx
_romanian_cixxx
_sinhala_cixxx
_slovak_cixxx
_slovenian_cixxx
_spanish_cixxx
_spanish2_cixxx
_swedish_cixxx
_turkish_cixxx
_unicode_ci
MySQL implements the
collations according to the Unicode Collation Algorithm (UCA)
described at
http://www.unicode.org/reports/tr10/. The
collation uses the version-4.0.0 UCA weight keys:
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt.
Currently, the
xxx
_unicode_ci
collations have only partial support for the Unicode Collation
Algorithm. Some characters are not supported yet. Also,
combining marks are not fully supported. This affects
primarily Vietnamese, Yoruba, and some smaller languages such
as Navajo.
xxx
_unicode_ci
MySQL implements language-specific Unicode collations only if
the ordering with
does not work well for a language. Language-specific
collations are UCA-based. They are derived from
xxx
_unicode_ci
with additional language tailoring rules.
xxx
_unicode_ci
For any Unicode character set, operations performed using the
collation are faster than those for the
xxx
_general_ci
collation. For example, comparisons for the
xxx
_unicode_ciutf8_general_ci
collation are faster, but
slightly less correct, than comparisons for
utf8_unicode_ci
. The reason for this is
that utf8_unicode_ci
supports mappings such
as expansions; that is, when one character compares as equal
to combinations of other characters. For example, in German
and some other languages “ß
”
is equal to “ss
”.
utf8_unicode_ci
also supports contractions
and ignorable characters. utf8_general_ci
is a legacy collation that does not support expansions,
contractions, or ignorable characters. It can make only
one-to-one comparisons between characters.
To further illustrate, the following equalities hold in both
utf8_general_ci
and
utf8_unicode_ci
(for the effect this has in
comparisons or when doing searches, see
Section 9.1.7.8, “Examples of the Effect of Collation”):
Ä = A Ö = O Ü = U
A difference between the collations is that this is true for
utf8_general_ci
:
ß = s
Whereas this is true for utf8_unicode_ci
,
which supports the German DIN-1 ordering (also known as
dictionary order):
ß = ss
MySQL implements language-specific collations for the
utf8
character set only if the ordering
with utf8_unicode_ci
does not work well for
a language. For example, utf8_unicode_ci
works fine for German dictionary order and French, so there is
no need to create special utf8
collations.
utf8_general_ci
also is satisfactory for
both German and French, except that
“ß
” is equal to
“s
”, and not to
“ss
”. If this is acceptable
for your application, you should use
utf8_general_ci
because it is faster.
Otherwise, use utf8_unicode_ci
because it
is more accurate.
includes Swedish rules. For example, in Swedish, the following
relationship holds, which is not something expected by a
German or French speaker:
xxx
_swedish_ci
Ü = Y < Ö
The
and
xxx
_spanish_ci
collations correspond to modern Spanish and traditional
Spanish, respectively. In both collations,
“xxx
_spanish2_ciñ
” (n-tilde) is a separate
letter between “n
” and
“o
”. In addition, for
traditional Spanish, “ch
” is a
separate letter between “c
”
and “d
”, and
“ll
” is a separate letter
between “l
” and
“m
”
In the
collations, xxx
_roman_ciI
and J
compare as equal, and U
and
V
compare as equal.
For all Unicode collations except the “binary”
(
)
collations, MySQL performs a table lookup to find a
character's collating weight. If a character is not in the
table (for example, because it is a “new”
character), collating weight determination becomes more
complex:
xxx
_bin
For BMP characters in general collations (
), weight = code point.xxx
_general_ciFor BMP characters in UCA collations (for example,
and language-specific collations), the following algorithm applies:xxx
_unicode_ciif (code >= 0x3400 && code <= 0x4DB5) base= 0xFB80; /* CJK Ideograph Extension */ else if (code >= 0x4E00 && code <= 0x9FA5) base= 0xFB40; /* CJK Ideograph */ else base= 0xFBC0; /* All other characters */ aaaa= base + (code >> 15); bbbb= (code & 0x7FFF) | 0x8000;
The result is a sequence of two collating elements,
aaaa
followed bybbbb
.Thus,
U+04cf CYRILLIC SMALL LETTER PALOCHKA
currently is, with all UCA collations, greater thanU+04c0 CYRILLIC LETTER PALOCHKA
. Eventually, after further collation tuning, all palochkas will sort together.For supplementary characters in general collations, the weight is the weight for
0xfffd REPLACEMENT CHARACTER
. For supplementary characters in UCA collations, their collating weight is0xfffd
. That is, to MySQL, all supplementary characters are equal to each other, and greater than almost all BMP characters.An example with Deseret characters and
COUNT(DISTINCT)
:CREATE TABLE t (s1 VARCHAR(5) CHARACTER SET utf32 COLLATE utf32_unicode_ci); INSERT INTO t VALUES (0xfffd); /* REPLACEMENT CHARACTER */ INSERT INTO t VALUES (0x010412); /* DESERET CAPITAL LETTER BEE */ INSERT INTO t VALUES (0x010413); /* DESERET CAPITAL LETTER TEE */ SELECT COUNT(DISTINCT s1) FROM t;
The result is 2 because in the MySQL
collations, the replacement character has a weight ofxxx
_unicode_ci0x0dc6
, whereas Deseret Bee and Deseret Tee both have a weight of0xfffd
. (Were theutf32_general_ci
collation used instead, the result would be 1 because all three characters have a weight of0xfffd
in that collation.)
The current rule that all supplementary characters are equal to each other is nonoptimal but is not expected to cause trouble. These characters are very rare, so it will be very rare that a multi-character string consists entirely of supplementary characters. In Japan, since the supplementary characters are obscure Kanji ideographs, the typical user does not care what order they are in, anyway. If you really want rows sorted by MySQL's rule and secondarily by code point value, it is easy:
ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_bin
The utf16_bin
Collation
There is a difference between “ordering by the
character's code value” and “ordering by the
character's binary representation,” a difference that
appears only with utf16_bin
, because of
surrogates.
Suppose that utf16_bin
(the binary
collation for utf16
) was a binary
comparison “byte by byte” rather than
“character by character.” If that were so, the
order of characters in utf16_bin
would
differ from the order in utf8_bin
. For
example, the following chart shows two rare characters. The
first character is in the range
E000
-FFFF
, so it is
greater than a surrogate but less than a supplementary. The
second character is a supplementary.
Code point Character utf8 utf16 ---------- --------- ---- ----- 0FF9D HALFWIDTH KATAKANA LETTER N EF BE 9D FF 9D 10384 UGARITIC LETTER DELTA F0 90 8E 84 D8 00 DF 84
The two characters in the chart are in order by code point
value because 0xff9d
<
0x10384
. And they are in order by
utf8
value because 0xef
< 0xf0
. But they are not in order by
utf16
value, if we use byte-by-byte
comparison, because 0xff
>
0xd8
.
So MySQL's utf16_bin
collation is not
“byte by byte.” It is “by code
point.” When MySQL sees a supplementary-character
encoding in utf16
, it converts to the
character's code-point value, and then compares. Therefore,
utf8_bin
and utf16_bin
are the same ordering. This is consistent with the SQL:2008
standard requirement for a UCS_BASIC collation:
“UCS_BASIC is a collation in which the ordering is
determined entirely by the Unicode scalar values of the
characters in the strings being sorted. It is applicable to
the UCS character repertoire. Since every character repertoire
is a subset of the UCS repertoire, the UCS_BASIC collation is
potentially applicable to every character set. NOTE 11: The
Unicode scalar value of a character is its code point treated
as an unsigned integer.”
If the character set is ucs2
, comparison is
byte-by-byte, but ucs2
strings should not
contain surrogates, anyway.
For additional information about Unicode collations in MySQL, see Collation-Charts.Org (utf8).