Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Regex - Match certain patterns while excluding others?

I have text data that I want to clean (i.e. keep only alphanumeric characters) with Python. However, most of the text I encounter contains emoji. I want to strip the non-alphanumeric characters but still keep the emoji.

First, I used the emoji library in Python to convert each emoji in a text to a certain string pattern to make it distinguishable. An example of an emoji that has been "demojized" (a literal function in the library) is shown below:

':smiley_face:' # a "demojized" emoji.

After scrolling through the data, I find that these emojis (once "demojized") exhibit the same pattern, which in regex terms seems to be

':[a-z_]+:' # regex for matching emojis.
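
A minimal sketch of extracting the demojized tokens with this pattern. The name EMOJI_PATTERN and the second token in the sample string are illustrative assumptions; token names that contain characters outside a-z and _ would need a broader character class.

```python
import re

# Pattern from the question: a colon, one or more lowercase letters or
# underscores, and a closing colon.
EMOJI_PATTERN = re.compile(r':[a-z_]+:')

# Hypothetical sample string with two demojized tokens.
text = 'Wow.. :smiley_face: this is delicious! :thumbs_up:'

print(EMOJI_PATTERN.findall(text))
# => [':smiley_face:', ':thumbs_up:']
```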

OK, so I know the pattern for emojis and can extract every emoji from my text data. The problem is that I want to remove the non-alphanumeric characters without altering the emoji pattern. My initial attempt to clean the data:

>>> import re
>>> text = 'Wow.. :smiley_face: this is delicious!' # A string containing emoji
>>> cleaned_text = re.sub('[^a-zA-Z0-9]+',' ',text) # replace each run of non-alphanumerics with a space
>>> print(cleaned_text)
Wow smiley face this is delicious

Clearly this isn't my desired output. I want to keep the emoji text intact, as shown below:

'Wow :smiley_face: this is delicious' # Desired output

So far I have looked into things like lookahead assertions, but to no avail. Is it possible with regex to remove non-alphanumerics while excluding the ':[a-z_]+:' pattern from the match? Apologies if the question is unclear.

question from: https://stackoverflow.com/questions/65906945/regex-match-certain-patterns-while-excluding-others

1 Answer

If you just want to remove all special chars except the colons and underscores inside colon-word(s)-colon contexts, you can use

re.sub(r'(:[a-z_]+:)|[^\w\s]|_', r'\1', text)

Details:

  • (:[a-z_]+:) - Capturing group 1 (\1): a :, one or more lowercase ASCII letters or _, and a :
  • | - or
  • [^\w\s]|_ - any char other than a word or whitespace char, or a _ (_ is a word char, so [^\w\s] does not match it and it has to be added as a separate alternative).
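
To make the mechanics explicit, here is a sketch of the same substitution written with a replacement function instead of the \1 backreference. The names PATTERN and keep_emoji are illustrative, not part of the original answer. Group 1 is only set when the emoji branch matched, so every other match is replaced with an empty string.

```python
import re

# Same alternation as above: group 1 is the emoji token to protect,
# the remaining branches are the characters to drop.
PATTERN = re.compile(r'(:[a-z_]+:)|[^\w\s]|_')

def keep_emoji(match):
    # match.group(1) is None when a non-emoji branch matched,
    # so the matched character is replaced with nothing.
    return match.group(1) or ''

text = 'Wow.. :smiley_face: this is delicious!'
cleaned = PATTERN.sub(keep_emoji, text)
print(cleaned)
# => Wow :smiley_face: this is delicious
```

The emoji branch comes first in the alternation, so at a leading colon the whole token is consumed before the [^\w\s] branch gets a chance to remove that colon.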

Python demo:

import re
text = 'Wow.. :smiley_face: this is delicious!' # A string containing emoji
print( re.sub(r'(:[a-z_]+:)|[^\w\s]|_', r'\1', text) )
# => Wow :smiley_face: this is delicious
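
One caveat, offered as a hedged sketch rather than part of the original answer: in Python 3, \w is Unicode-aware for str patterns, so accented letters and non-Latin letters are kept. If only a-z, A-Z and 0-9 should survive, as in the question, passing re.ASCII restricts \w and \s to ASCII. The helper name clean_text and its ascii_only parameter are assumptions made for illustration; the whitespace normalization at the end is also an optional extra.

```python
import re

def clean_text(text, ascii_only=True):
    """Drop special characters but keep :emoji_name: tokens intact."""
    # re.ASCII makes \w match only [a-zA-Z0-9_] and \s only ASCII whitespace.
    flags = re.ASCII if ascii_only else 0
    cleaned = re.sub(r'(:[a-z_]+:)|[^\w\s]|_', r'\1', text, flags=flags)
    # Collapse the runs of whitespace left behind by removed characters.
    return ' '.join(cleaned.split())

print(clean_text('Wow.. :smiley_face: this is delicious!'))
# => Wow :smiley_face: this is delicious
```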
