Sorting Unicode

To make sorting and searching in Unicode ignore diacritical marks, normalize the text to the compatibility decomposition form (NFKD) and then keep only the first character (or only the ASCII characters) of each decomposed sequence.

The ICU library provides normalization, collation, and much more, with implementations for C/C++ (ICU4C) and Java (ICU4J).

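In Python this looks like the following, where s is a Unicode string (in Python 3 the result is a bytes object):

import unicodedata
unicodedata.normalize('NFKD', s).encode('ASCII', 'ignore')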
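
As a rough sketch of how the NFKD approach might be used for sorting and searching, assuming Python 3: the helper names strip_marks and strip_combining and the sample word list are hypothetical, chosen only for illustration. The first helper keeps only the ASCII characters; the second keeps the base character of each sequence and drops only the combining marks.

```python
import unicodedata

def strip_marks(s):
    """Hypothetical helper: keep only the ASCII characters after NFKD."""
    decomposed = unicodedata.normalize('NFKD', s)
    # Every non-ASCII code point, combining marks included, is dropped here.
    return decomposed.encode('ASCII', 'ignore').decode('ASCII')

def strip_combining(s):
    """Hypothetical helper: drop combining marks but keep other characters."""
    decomposed = unicodedata.normalize('NFKD', s)
    # unicodedata.combining() returns 0 for base (non-combining) characters.
    return ''.join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

words = ['zebra', 'étude', 'apple', 'Émile']

# Sort by the folded form; casefold() is an extra assumption here so that
# upper and lower case do not affect the order either.
ordered = sorted(words, key=lambda w: strip_marks(w).casefold())
print(ordered)

# Searching: compare the folded forms of the query and the candidate.
print(strip_marks('café') == strip_marks('cafe'))
```

Note that the ASCII-only variant silently discards letters that have no decomposition: strip_marks('Søren') gives 'Sren', while strip_combining('Søren') leaves the 'ø' in place. Which behaviour is preferable depends on the data being sorted.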
sorting_unicode.txt · Last modified: 2007/07/27 13:39 (external edit)