Network Working Group                                         M. Crispin
Request for Comments: 5051                      University of Washington
Category: Standards Track                                   October 2007


         i;unicode-casemap - Simple Unicode Collation Algorithm

Status of This Memo

This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.

Abstract

This document describes "i;unicode-casemap", a simple case-insensitive
collation for Unicode strings. It provides equality, substring, and
ordering operations.

1. Introduction

The "i;ascii-casemap" collation described in [COMPARATOR] is quite
simple to implement and provides case-independent comparisons for the
26 letters of the basic Latin alphabet. It is specified as the default
and/or baseline comparator in some application protocols, e.g.,
[IMAP-SORT].

However, the "i;ascii-casemap" collation does not produce
satisfactory results with non-ASCII characters. It is possible, with
a modest extension, to provide a more sophisticated collation with
greater multilingual applicability than "i;ascii-casemap". This
extension provides case-independent comparisons for a much greater
number of characters. It also collates characters that carry
diacriticals together with their non-diacritical forms.

This collation, "i;unicode-casemap", is intended to be an alternative
to, and preferred over, "i;ascii-casemap". It does not replace the
"i;basic" collation described in [BASIC].

2. Unicode Casemap Collation Description

The "i;unicode-casemap" collation is a simple collation which is
case-insensitive in its treatment of characters. It provides
equality, substring, and ordering operations. The validity test
operation returns "valid" for any input.

Crispin Standards Track [Page 1]

RFC 5051 i;unicode-casemap October 2007

This collation allows strings in arbitrary (and mixed) character
sets, as long as the character set for each string is identified and
it is possible to convert the string to Unicode. Strings which have
an unidentified character set and/or cannot be converted to Unicode
are not rejected, but are treated as binary.

Each input string is prepared by converting it to a "titlecased
canonicalized UTF-8" string according to the following steps, using
UnicodeData.txt ([UNICODE-DATA]):

(1) A Unicode codepoint is obtained from the input string.

    (a) If the input string is in a known charset that can be
        converted to Unicode, a sequence in the string's charset
        is read and checked for validity according to the rules of
        that charset. If the sequence is valid, it is converted
        to a Unicode codepoint. Note that for input strings in
        UTF-8, the UTF-8 sequence must be valid according to the
        rules of [UTF-8]; e.g., overlong UTF-8 sequences are
        invalid.

    (b) If the input string is in an unknown charset, or an
        invalid sequence occurs in step (1)(a), conversion ceases.
        No further preparation is performed, and any partial
        preparation results are discarded. The original string is
        used unchanged with the i;octet comparator.

(2) The following steps, using UnicodeData.txt ([UNICODE-DATA]),
    are performed on the resulting codepoint from step (1)(a).

    (a) If the codepoint has a titlecase property in
        UnicodeData.txt (this is normally the same as the
        uppercase property), the codepoint is converted to the
        codepoints in the titlecase property.

    (b) If the resulting codepoint from (2)(a) has a decomposition
        property of any type in UnicodeData.txt, the codepoint is
        converted to the codepoints in the decomposition property.
        This step is recursively applied to each of the resulting
        codepoints until no more decomposition is possible
        (effectively Normalization Form KD).

    Example: codepoint U+01C4 (LATIN CAPITAL LETTER DZ WITH CARON)
    has a titlecase property of U+01C5 (LATIN CAPITAL LETTER D
    WITH SMALL LETTER Z WITH CARON). Codepoint U+01C5 has a
    decomposition property of U+0044 (LATIN CAPITAL LETTER D)
    U+017E (LATIN SMALL LETTER Z WITH CARON). U+017E has a
    decomposition property of U+007A (LATIN SMALL LETTER Z) U+030C
    (COMBINING CARON). None of U+0044, U+007A, or U+030C has a
    decomposition property. Therefore, U+01C4 is converted
    to U+0044 U+007A U+030C by this step.

(3) The resulting codepoint(s) from step (2) is/are appended, in
    UTF-8 format, to the "titlecased canonicalized UTF-8" string.

(4) Repeat from step (1) until there is no more data in the input
    string.
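
The preparation steps above can be sketched in Python as follows.
This is an illustration under stated assumptions, not a reference
implementation: the names titlecase_codepoint, decompose, and prepare
are hypothetical, the Unicode tables are those bundled with the Python
interpreter rather than a specific UnicodeData.txt revision, and
step (2)(a) is approximated with str.title(), which is accepted only
when it yields a single codepoint.

```python
import unicodedata
from typing import Optional


def titlecase_codepoint(ch: str) -> str:
    """Step (2)(a), approximated with str.title() (an assumption)."""
    mapped = ch.title()
    # str.title() uses full case mappings, which can expand one codepoint
    # into several (e.g. U+00DF). UnicodeData.txt carries only
    # single-codepoint titlecase mappings, so expansions are ignored here.
    return mapped if len(mapped) == 1 else ch


def decompose(ch: str) -> str:
    """Step (2)(b): recursive decomposition, without mark reordering."""
    fields = unicodedata.decomposition(ch).split()
    if not fields:
        return ch  # no decomposition property
    if fields[0].startswith("<"):
        fields = fields[1:]  # drop a compatibility tag such as <compat>
    return "".join(decompose(chr(int(f, 16))) for f in fields)


def prepare(raw: bytes, charset: Optional[str]) -> bytes:
    """Return the "titlecased canonicalized UTF-8" form of raw."""
    if charset is None:
        return raw  # step (1)(b): unidentified charset, keep the octets
    try:
        text = raw.decode(charset, errors="strict")  # step (1)(a)
    except (LookupError, UnicodeError):
        # step (1)(b): unsupported charset or invalid sequence;
        # partial results are discarded and the original octets are used
        return raw
    pieces = []
    for ch in text:  # step (4): repeat for every codepoint
        pieces.append(decompose(titlecase_codepoint(ch)))  # step (2)
    return "".join(pieces).encode("utf-8")  # step (3)
```

With this sketch, prepare("\u01c4".encode("utf-8"), "utf-8") yields the
UTF-8 encoding of U+0044 U+007A U+030C, matching the example in
step (2), and the overlong sequence b"\xc0\xaf" is returned unchanged.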

Following the above preparation process on each string, the equality,
ordering, and substring operations are as for i;octet.
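
As a minimal sketch, and assuming the prepared strings are held as
Python bytes objects, the three operations reduce to plain octet
comparisons. The function names below are hypothetical; Python compares
bytes objects octet by octet as unsigned values, with a proper prefix
sorting first.

```python
def octet_equal(a: bytes, b: bytes) -> bool:
    """Equality: the prepared strings are identical octet sequences."""
    return a == b


def octet_order(a: bytes, b: bytes) -> int:
    """Ordering: -1, 0, or 1 from an unsigned octet-by-octet comparison."""
    return (a > b) - (a < b)


def octet_substring(needle: bytes, haystack: bytes) -> bool:
    """Substring: needle occurs as a contiguous run of octets in haystack."""
    return needle in haystack
```

The arguments are expected to be the outputs of the preparation
process, not the original input strings.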

It is permitted to use an alternative implementation of the above
preparation process if it produces the same results. For example, it
may be more convenient for an implementation to convert all input
strings to a sequence of UTF-16 or UTF-32 values prior to performing
any of the step (2) actions. Similarly, if all input strings are (or
are convertible to) Unicode, it may be possible to use UTF-32 as an
alternative to UTF-8 in step (3).

Note: UTF-16 is unsuitable as an alternative to UTF-8 in step (3),
because UTF-16 surrogates will cause i;octet to collate codepoints
U+E000 through U+FFFF after non-BMP codepoints.
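
The effect can be checked with a short Python sketch. It compares a
BMP codepoint above the surrogate range (U+E000 is used here) with the
first non-BMP codepoint in three encodings; big-endian forms are
assumed so that the octet order follows the code unit order.

```python
bmp = "\ue000"         # BMP codepoint above the surrogate range
astral = "\U00010000"  # first non-BMP codepoint

# In codepoint order, the BMP codepoint comes first.
codepoint_order = bmp < astral

# UTF-8 and UTF-32 preserve codepoint order under octet comparison.
utf8_preserved = bmp.encode("utf-8") < astral.encode("utf-8")
utf32_preserved = bmp.encode("utf-32-be") < astral.encode("utf-32-be")

# UTF-16 does not: the non-BMP codepoint begins with a surrogate
# (0xD800), which is numerically below 0xE000.
utf16_preserved = bmp.encode("utf-16-be") < astral.encode("utf-16-be")
```

Only utf16_preserved is False, which is why UTF-16 cannot stand in for
UTF-8 in step (3) while UTF-32 can.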

This collation is not locale sensitive. Consequently, care should be
taken when using OS-supplied functions to implement this collation.
Functions such as strcasecmp and toupper are sometimes locale
sensitive and may inconsistently casemap letters.

The i;unicode-casemap collation is well suited to use with many
Internet protocols and computer languages. Use with natural language
is often inappropriate; even though the collation apparently supports
languages such as Swahili and English, in real-world use it tends to
mis-sort a number of types of string:

o  people and place names containing scripts that are not collated
   according to "alphabetical order".

o  words with characters that have diacriticals. However,
   i;unicode-casemap generally does a better job than i;ascii-casemap
   for most (but not all) languages. For example, German umlaut
   letters will sort correctly, but some Scandinavian letters will
   not.

o  names such as "Lloyd" (which in Welsh sorts after "Lyon", unlike
   in English).

o  strings containing other non-letter symbols; e.g., euro and pound
   sterling symbols, quotation marks other than '"', dashes/hyphens,
   etc.
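
The Scandinavian case can be illustrated with a small Python sketch.
The helper approx_key is hypothetical and only approximates the
preparation process: it titlecases each codepoint with str.title()
(ignoring multi-codepoint results) and decomposes it with the
interpreter's NFKD routine, which is adequate for the Latin letters
used here.

```python
import unicodedata


def approx_key(text: str) -> bytes:
    """Approximate sort key: titlecase, decompose, encode as UTF-8."""
    pieces = []
    for ch in text:
        titled = ch.title()
        if len(titled) != 1:
            titled = ch  # ignore multi-codepoint case mappings
        pieces.append(unicodedata.normalize("NFKD", titled))
    return "".join(pieces).encode("utf-8")


# U+00C5 is LATIN CAPITAL LETTER A WITH RING ABOVE.
names = ["Zebra", "\u00c5se", "apfel"]
ordered = sorted(names, key=approx_key)
```

Here ordered is ["apfel", "\u00c5se", "Zebra"]: U+00C5 decomposes to
U+0041 U+030A and therefore collates among the "A" entries, whereas
Swedish, Danish, and Norwegian alphabetical order places it after "Z".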

3. Unicode Casemap Collation Registration

<?xml version='1.0'?>
<!DOCTYPE collation SYSTEM 'collationreg.dtd'>
<collation rfc="5051" scope="global" intendedUse="common">
  <identifier>i;unicode-casemap</identifier>
  <title>Unicode Casemap</title>
  <operations>equality order substring</operations>
  <specification>RFC 5051</specification>
  <owner>IETF</owner>
  <submitter>mrc@cac.washington.edu</submitter>
</collation>

4. Security Considerations

The security considerations for [UTF-8], [STRINGPREP], and
[UNICODE-SECURITY] apply and are normative to this specification.

The results from this comparator will vary depending upon the
implementation for several reasons. Implementations MUST consider
whether these possibilities are a problem for their use case:

1) New characters added in Unicode may have decomposition or
   titlecase properties that will not be known to an implementation
   based upon an older revision of Unicode. This impacts step (2).

2) Step (2)(b) defines a subset of Normalization Form KD (NFKD) that
   does not require normalization of out-of-order diacriticals.
   However, an implementation MAY use an NFKD library routine that
   does such normalization. This impacts step (2)(b) and possibly
   also step (1)(a), and is an issue only with ill-formed UTF-8
   input.

3) The set of charsets handled in step (1)(a) is open-ended. UTF-8
   (and, by extension, US-ASCII) are the only mandatory-to-implement
   charsets. This impacts step (1)(a).

   Implementations SHOULD, as far as feasible, support all the
   charsets they are likely to encounter in the input data, in order
   to avoid poor collation caused by the fall through to the (1)(b)
   rule.

4) Other charsets may have revisions which add new characters that
   are not known to an implementation based upon an older revision.
   This impacts step (1)(a) and possibly also step (1)(b).
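
The difference described in item 2 can be observed with a string whose
combining marks are not in canonical order. The Python sketch below
assumes the interpreter's bundled Unicode tables; decompose is a
hypothetical helper that applies the decomposition property
recursively, as in step (2)(b), without reordering marks.

```python
import unicodedata


def decompose(ch: str) -> str:
    """Recursive decomposition as in step (2)(b), no mark reordering."""
    fields = unicodedata.decomposition(ch).split()
    if not fields:
        return ch
    if fields[0].startswith("<"):
        fields = fields[1:]  # drop a compatibility tag such as <compat>
    return "".join(decompose(chr(int(f, 16))) for f in fields)


# "a", then COMBINING ACUTE ACCENT (combining class 230), then
# COMBINING DOT BELOW (combining class 220): not in canonical order.
text = "a\u0301\u0323"

step_2b_result = "".join(decompose(ch) for ch in text)
library_nfkd = unicodedata.normalize("NFKD", text)
```

Here step_2b_result keeps the marks in their input order, while
library_nfkd reorders them to U+0323 U+0301, so two implementations
that differ in this respect produce different prepared strings for the
same input.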

An attacker may create input that is ill-formed or in an unknown
charset, with the intention of impacting the results of this
comparator or exploiting other parts of the system which process this
input in different ways. Note, however, that even well-formed data
in a known charset can impact the result of this comparator in
unexpected ways. For example, an attacker can substitute U+0041
(LATIN CAPITAL LETTER A) with U+0391 (GREEK CAPITAL LETTER ALPHA) or
U+0410 (CYRILLIC CAPITAL LETTER A) with the intention of causing a
non-match of strings which visually appear the same and/or causing
the string to appear elsewhere in a sort.
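
A brief Python sketch of the substitution follows. The three letters
are already capitals, so titlecasing is skipped here, and the
interpreter's NFKD routine stands in for step (2)(b); the dictionary
names are illustrative only.

```python
import unicodedata

lookalikes = {
    "LATIN CAPITAL LETTER A": "\u0041",
    "GREEK CAPITAL LETTER ALPHA": "\u0391",
    "CYRILLIC CAPITAL LETTER A": "\u0410",
}

# None of the three has a decomposition, so each prepared form is
# simply the UTF-8 encoding of the codepoint itself.
prepared = {
    name: unicodedata.normalize("NFKD", ch).encode("utf-8")
    for name, ch in lookalikes.items()
}

all_distinct = len(set(prepared.values())) == len(prepared)
sort_order = sorted(prepared, key=prepared.get)
```

The prepared forms are pairwise distinct, so the three visually
similar letters never compare equal, and they sort in the order Latin,
Greek, Cyrillic rather than together.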

5. IANA Considerations

The i;unicode-casemap collation defined in section 2 has been added
to the registry of collations defined in [COMPARATOR].

6. Normative References

[COMPARATOR]        Newman, C., Duerst, M., and A. Gulbrandsen,
                    "Internet Application Protocol Collation
                    Registry", RFC 4790, February 2007.

[STRINGPREP]        Hoffman, P. and M. Blanchet, "Preparation of
                    Internationalized Strings ("stringprep")", RFC
                    3454, December 2002.

[UTF-8]             Yergeau, F., "UTF-8, a transformation format of
                    ISO 10646", STD 63, RFC 3629, November 2003.

[UNICODE-DATA]      <http://www.unicode.org/Public/UNIDATA/
                    UnicodeData.txt>

                    Although the UnicodeData.txt file referenced
                    here is part of the Unicode standard, it is
                    subject to change as new characters are added
                    to Unicode and errors are corrected in Unicode
                    revisions. As a result, it may be less stable
                    than might otherwise be implied by the
                    standards status of this specification.

[UNICODE-SECURITY]  Davis, M. and M. Suignard, "Unicode Security
                    Considerations", February 2006,
                    <http://www.unicode.org/reports/tr36/>.

7. Informative References

[BASIC]             Newman, C., Duerst, M., and A. Gulbrandsen,
                    "i;basic - the Unicode Collation Algorithm",
                    Work in Progress, March 2007.

[IMAP-SORT]         Crispin, M. and K. Murchison, "Internet Message
                    Access Protocol - SORT and THREAD Extensions",
                    Work in Progress, September 2007.

Author's Address

Mark R. Crispin
Networks and Distributed Computing
University of Washington
4545 15th Avenue NE
Seattle, WA 98105-4527

Phone: +1 (206) 543-5762
EMail: MRC@CAC.Washington.EDU

Full Copyright Statement

Copyright (C) The IETF Trust (2007).

This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.

This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.