Python: Find equivalent surrogate pair from non-BMP unicode char

Question

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

The answer presented here: How to work with surrogate pairs in Python? tells you how to convert a surrogate pair, such as '\ud83d\ude4f', into a single non-BMP unicode character (the answer being "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')). I would like to know how to do this in reverse. How can I, using Python, find the equivalent surrogate pair from a non-BMP character, converting '\U0001f64f' (🙏) back to '\ud83d\ude4f'? I couldn't find a clear answer to that.

1 Answer

深蓝 · Answer 1 · 2021-10-17T00:11:43+0000

You'll have to manually replace each non-BMP code point with its surrogate pair. You could do this with a regular expression:

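import re

# Match any code point outside the Basic Multilingual Plane (U+10000 to U+10FFFF).
_nonbmp = re.compile(r'[\U00010000-\U0010FFFF]')

def _surrogatepair(match):
    char = match.group()
    assert ord(char) > 0xffff
    # UTF-16 encodes a non-BMP character as two 16-bit code units (4 bytes).
    encoded = char.encode('utf-16-le')
    # Turn each 2-byte code unit back into a (lone surrogate) code point.
    return (
        chr(int.from_bytes(encoded[:2], 'little')) +
        chr(int.from_bytes(encoded[2:], 'little')))

def with_surrogates(text):
    # Replace every non-BMP character in text with its surrogate pair.
    return _nonbmp.sub(_surrogatepair, text)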

Demo:

>>> with_surrogates('\U0001f64f')
'\ud83d\ude4f'
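
As a possible alternative to going through the UTF-16 codec, the pair can also be computed arithmetically from the code point. This is only a minimal sketch and not part of the answer above: the function name surrogate_pair is hypothetical, and it assumes the input is a single character outside the BMP. It subtracts 0x10000 from the code point, puts the top 10 bits of the result in the high surrogate (base 0xD800) and the bottom 10 bits in the low surrogate (base 0xDC00).

```python
def surrogate_pair(char):
    """Return the UTF-16 surrogate pair for one non-BMP character.

    Hypothetical helper for illustration; not part of the answer above.
    """
    code_point = ord(char)
    if code_point <= 0xFFFF:
        raise ValueError("character is inside the BMP and has no surrogate pair")
    offset = code_point - 0x10000        # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return chr(high) + chr(low)


pair = surrogate_pair('\U0001f64f')
# ascii() is used because printing lone surrogates directly can raise
# UnicodeEncodeError on a UTF-8 terminal.
print(ascii(pair))  # '\ud83d\ude4f'

# Round trip using the conversion quoted in the question.
restored = pair.encode('utf-16', 'surrogatepass').decode('utf-16')
```

Here restored should compare equal to '\U0001f64f', which gives a quick way to check the result against the forward conversion from the question.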
