python - How to filter (or replace) unicode characters that would take more than 3 bytes in UTF-8?

Question

asked Oct 17, 2021 in Technique by 深蓝 (71.8m points)

I'm using Python and Django, but I'm having a problem caused by a limitation of MySQL. According to the MySQL 5.1 documentation, its utf8 implementation does not support 4-byte characters. MySQL 5.5 will support 4-byte characters using utf8mb4, and someday in the future utf8 might support them as well.

But my server is not ready to upgrade to MySQL 5.5, and thus I'm limited to UTF-8 characters that take 3 bytes or less.

My question is: How can I filter (or replace) unicode characters that would take more than 3 bytes?

I want to replace all 4-byte characters with the official \ufffd (U+FFFD REPLACEMENT CHARACTER), or with ?.

In other words, I want behavior quite similar to Python's own str.encode() method (when passing the 'replace' parameter). Edit: I want behavior similar to encode(), but I don't want to actually encode the string. I still want to have a unicode string after filtering.

I DON'T want to escape the characters before storing them in MySQL, because that would mean I would need to unescape every string I get from the database, which is very annoying and unfeasible.

[EDIT] Added tests about the proposed solutions

So I have good answers so far. Thanks, people! Now, in order to choose one of them, I ran a quick test to find the simplest and fastest one.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# vi:ts=4 sw=4 et

import cProfile
import random
import re

# How many times to repeat each filtering
repeat_count = 256

# Percentage of "normal" chars, when compared to "large" unicode chars
normal_chars = 90

# Total number of characters in this string
string_size = 8 * 1024

# Generating a random testing string
test_string = u''.join(
        unichr(random.randrange(32,
            0x10ffff if random.randrange(100) > normal_chars else 0x0fff
        )) for i in xrange(string_size) )

# RegEx to find invalid characters
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)

def filter_using_re(unicode_string):
    return re_pattern.sub(u'\uFFFD', unicode_string)

def filter_using_python(unicode_string):
    return u''.join(
        uc if uc < u'\ud800' or u'\ue000' <= uc <= u'\uffff' else u'\ufffd'
        for uc in unicode_string
    )

def repeat_test(func, unicode_string):
    for i in xrange(repeat_count):
        tmp = func(unicode_string)

print '='*10 + ' filter_using_re() ' + '='*10
cProfile.run('repeat_test(filter_using_re, test_string)')
print '='*10 + ' filter_using_python() ' + '='*10
cProfile.run('repeat_test(filter_using_python, test_string)')

#print test_string.encode('utf8')
#print filter_using_re(test_string).encode('utf8')
#print filter_using_python(test_string).encode('utf8')

The results:

filter_using_re() did 515 function calls in 0.139 CPU seconds (0.138 CPU seconds at the sub() built-in)
filter_using_python() did 2097923 function calls in 3.413 CPU seconds (1.511 CPU seconds at the join() call and 1.900 CPU seconds evaluating the generator expression)
I did not test the itertools solution because, although interesting, it was quite big and complex.

Conclusion

The RegEx solution was, by far, the fastest one.
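
For anyone who wants to repeat the comparison on Python 3, where unichr, xrange and the print statement are gone, the following is a rough sketch rather than the original script. It assumes Python 3.3 or later, swaps cProfile for timeit, skips surrogate code points when generating the test string, and matches code points above U+FFFF directly. No timings are claimed here; they will depend on the machine and interpreter.

```python
import random
import re
import timeit

random.seed(0)  # fixed seed so repeated runs use the same test string

def random_char():
    # Mirrors the original generator: a small share of characters is drawn
    # from the full Unicode range, the rest from below U+0FFF.
    limit = 0x10ffff if random.randrange(100) > 90 else 0x0fff
    while True:
        cp = random.randrange(32, limit)
        # Assumption: skip surrogates, which are not valid characters
        # on their own in a Python 3 str meant for UTF-8 encoding.
        if not 0xD800 <= cp <= 0xDFFF:
            return chr(cp)

test_string = ''.join(random_char() for _ in range(8 * 1024))

# Any code point above U+FFFF needs 4 bytes in UTF-8.
re_pattern = re.compile('[\U00010000-\U0010ffff]')

def filter_using_re(s):
    return re_pattern.sub('\ufffd', s)

def filter_using_python(s):
    return ''.join(c if c <= '\uffff' else '\ufffd' for c in s)

# Both filters should agree before their speed is compared.
assert filter_using_re(test_string) == filter_using_python(test_string)

for func in (filter_using_re, filter_using_python):
    seconds = timeit.timeit(lambda: func(test_string), number=256)
    print('%s: %.3f s for 256 runs' % (func.__name__, seconds))
```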


1 Answer

深蓝 · Answer 1 · 2021-10-16T23:50:15+0000

Unicode characters in the ranges \u0000-\uD7FF and \uE000-\uFFFF will have 3-byte (or shorter) encodings in UTF-8. The \uD800-\uDFFF range is reserved for UTF-16 surrogate pairs. I do not know Python, but you should be able to set up a regular expression to match outside those ranges.

pattern = re.compile(u"[\uD800-\uDFFF].", re.UNICODE)
pattern = re.compile(u"[^\u0000-\uFFFF]", re.UNICODE)
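
The byte-length claim behind these ranges is easy to check by encoding single characters. This is a minimal sketch, assuming Python 3 (the question itself targets Python 2); the helper name utf8_length is made up for illustration.

```python
def utf8_length(ch):
    """Return how many bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode('utf-8'))

# One sample per encoding length, plus the boundary on each side of U+FFFF.
samples = ['\u0041', '\u00e9', '\u20ac', '\uffff', '\U00010000', '\U0010ffff']
for ch in samples:
    print('U+%06X -> %d byte(s)' % (ord(ch), utf8_length(ch)))
```

The counts come out as 1, 2, 3, 3, 4 and 4, so U+FFFF is the last code point that fits in 3 bytes and everything from U+10000 upward needs 4.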

Edit: adding the Python code from Denilson Sá's script in the question body:

re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
filtered_string = re_pattern.sub(u'\uFFFD', unicode_string)
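
On Python 3 the same idea might look like the sketch below. It assumes Python 3.3 or later, where a str is always a sequence of full code points, so the pattern can target U+10000 and above directly. The function name replace_4byte_chars and its replacement parameter are hypothetical. Unlike the pattern above, this version leaves lone surrogates untouched.

```python
import re

# Code points above U+FFFF are exactly those that need 4 bytes in UTF-8.
_FOUR_BYTE = re.compile('[\U00010000-\U0010ffff]')

def replace_4byte_chars(text, replacement='\ufffd'):
    """Return text with every 4-byte (in UTF-8) character replaced.

    The result is still a str; nothing is encoded to bytes.
    """
    return _FOUR_BYTE.sub(replacement, text)

print(replace_4byte_chars('caf\u00e9 \U0001f600 ok'))       # U+FFFD replaces the emoji
print(replace_4byte_chars('caf\u00e9 \U0001f600 ok', '?'))  # '?' replaces the emoji
```

Passing '?' as the replacement covers the alternative mentioned in the question; characters of 3 bytes or fewer, such as é, pass through unchanged.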
