Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - PySpark - Extract Regex that matches dataframe value

I have a list of regexes {"WeLove*", "Arizona.*hot", "Mahi*"} and a DataFrame whose values may match one of the regex patterns in the list.

_c0                        _c1   _c2  _c3
Arizona is hot             2020  1    Y
Arizona happens to be hot  2020  1    Y
MahiWalia                  2020  1    Y
MahiSingh                  2020  1    Y
MahiRandhawa               2020  1    Y
WeLovechocolate            2020  1    Y
question from:https://stackoverflow.com/questions/65895988/pyspark-extract-regex-that-matches-dataframe-value


1 Answer


You can join the regex DataFrame to df on an rlike condition, then count the matches for each regex:

import pyspark.sql.functions as F

# regex.csv holds one pattern per line; it is read into column _c0
regex = spark.read.csv("regex.csv", header=False)

# Join on an rlike condition: each row of df is paired with every
# pattern it matches, then matches are counted per pattern
result = df.alias('df').join(
    regex.alias('regex'),
    F.expr('df._c0 rlike regex._c0')
).groupBy('regex._c0').count()

result.show()
+------------+-----+
|         _c0|count|
+------------+-----+
|Arizona.*hot|    2|
|       Mahi*|    3|
|     WeLove*|    1|
+------------+-----+
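Spark's rlike applies a Java-style regular expression as a partial (substring) match. As a quick local sanity check of which rows each pattern matches, the same counting can be sketched in plain Python with the `re` module (an illustration of the matching semantics, not Spark code; Python's `re.search` is close to, but not identical with, Java regex):

```python
import re

# Patterns and _c0 values taken from the question above
patterns = ["WeLove*", "Arizona.*hot", "Mahi*"]
rows = [
    "Arizona is hot",
    "Arizona happens to be hot",
    "MahiWalia",
    "MahiSingh",
    "MahiRandhawa",
    "WeLovechocolate",
]

# re.search mimics rlike: the pattern may match anywhere in the string
counts = {
    p: sum(1 for r in rows if re.search(p, r) is not None)
    for p in patterns
}
print(counts)
# {'WeLove*': 1, 'Arizona.*hot': 2, 'Mahi*': 3}
```

Note that "Mahi*" means "Mah" followed by zero or more "i" characters; it matches all three Mahi rows because the regex only needs to match a substring, which is also why the rlike join pairs each row with every pattern it contains.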


...