I have text data that I want to clean (i.e. keep only alphanumeric characters) with Python. However, most of the text I encounter contains emoji. I want to strip non-alphanumeric characters from the text but still keep the emoji.
First, I used the emoji library in Python to convert each emoji in the text to a distinguishable string pattern. An example of an emoji that has been "demojized" (demojize is an actual function in the library) is shown below:
':smiley_face:' # a "demojized" emoji.
After scrolling through the data, I find that these emojis (once "demojized") exhibit the same pattern, which in regex terms seems to be
':[a-z_]+:' # regex for matching emojis.
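As a quick check of that pattern, here is a minimal sketch that extracts the demojized tokens with re.findall. The name EMOJI_PATTERN is only illustrative, and the pattern is the one guessed above from the data. If some demojized names in the data contain digits, capitals, or hyphens, the character class would need widening.

```python
import re

# Pattern taken from the observation above: a colon, lowercase letters or
# underscores, then a closing colon. EMOJI_PATTERN is an illustrative name.
EMOJI_PATTERN = re.compile(r':[a-z_]+:')

text = 'Wow.. :smiley_face: this is delicious!'

# findall returns every non-overlapping match as a list of strings.
emojis = EMOJI_PATTERN.findall(text)
print(emojis)  # [':smiley_face:']
```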
OK, so I know the pattern for emojis and I can extract every emoji from the text data I have. The problem is that I want to strip non-alphanumerics from the text without altering the emoji patterns. My initial attempt to clean the data:
>>> import re
>>> text = 'Wow.. :smiley_face: this is delicious!' # A string containing emoji
>>> cleaned_text = re.sub('[^a-zA-Z0-9]+',' ',text) # regex to keep only alphanumerics
>>> print(cleaned_text)
Wow smiley face this is delicious
Clearly this isn't my desired output. I want to keep the emoji text intact, as shown below:
'Wow :smiley_face: this is delicious' # Desired output
So far I have looked into things like lookahead assertions, but to no avail. Is it possible with regex to remove non-alphanumerics whilst excluding the ':[a-z_]+:' pattern from the match? Apologies if the question is unclear.
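One possible approach, sketched here as an assumption rather than a confirmed answer, is to put the emoji pattern first in an alternation and capture it, so that re.sub can put the emoji back unchanged and replace everything else with a space. The names CLEAN_PATTERN and clean_keep_emoji are hypothetical. Note that the second alternative matches a single character (no '+'); otherwise a run such as '.. :' would swallow the emoji's opening colon before the first alternative gets a chance to match.

```python
import re

# Alternative 1 (captured): a demojized emoji such as ':smiley_face:'.
# Alternative 2: any single non-alphanumeric character.
# Order matters: at a ':' the emoji alternative is tried first.
CLEAN_PATTERN = re.compile(r'(:[a-z_]+:)|[^a-zA-Z0-9]')

def clean_keep_emoji(text):
    """Replace non-alphanumerics with spaces, leaving emoji tokens intact."""
    # If group 1 matched, return the emoji unchanged; otherwise emit a space.
    replaced = CLEAN_PATTERN.sub(lambda m: m.group(1) or ' ', text)
    # Collapse runs of whitespace and trim the ends.
    return ' '.join(replaced.split())

text = 'Wow.. :smiley_face: this is delicious!'
print(clean_keep_emoji(text))  # Wow :smiley_face: this is delicious
```

One caveat with this sketch: ordinary text that happens to look like the pattern (for example 'a:b:c') would have ':b:' kept as if it were an emoji, since the regex cannot tell the two apart.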
question from:
https://stackoverflow.com/questions/65906945/regex-match-certain-patterns-while-excluding-others