Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - PySpark - Extract Regex that matches dataframe value

I have a list of regexes {"WeLove*", "Arizona.*hot", "Mahi*"} and a DataFrame whose values may match one of the regex patterns in the list.

_c0                        _c1   _c2  _c3
Arizona is hot             2020  1    Y
Arizona happens to be hot  2020  1    Y
MahiWalia                  2020  1    Y
MahiSingh                  2020  1    Y
MahiRandhawa               2020  1    Y
WeLovechocolate            2020  1    Y
question from:https://stackoverflow.com/questions/65895988/pyspark-extract-regex-that-matches-dataframe-value


1 Answer


You can join the regex DataFrame to df on an rlike condition, then count the matches for each regex:

import pyspark.sql.functions as F

# regex.csv holds one pattern per line; it is read into column _c0
regex = spark.read.csv("regex.csv", header=False)

# Join on an rlike condition: each row of df is paired with every
# pattern it matches, then matches are counted per pattern
result = df.alias('df').join(
    regex.alias('regex'),
    F.expr('df._c0 rlike regex._c0')
).groupBy('regex._c0').count()

result.show()
+------------+-----+
|         _c0|count|
+------------+-----+
|Arizona.*hot|    2|
|       Mahi*|    3|
|     WeLove*|    1|
+------------+-----+
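Spark's rlike applies a Java-style regular expression as a partial (substring) match. As a quick local sanity check of which rows each pattern matches, the same counting can be sketched in plain Python with the `re` module (an illustration of the matching semantics, not Spark code; Python's `re.search` is close to, but not identical with, Java regex):

```python
import re

# Patterns and _c0 values taken from the question above
patterns = ["WeLove*", "Arizona.*hot", "Mahi*"]
rows = [
    "Arizona is hot",
    "Arizona happens to be hot",
    "MahiWalia",
    "MahiSingh",
    "MahiRandhawa",
    "WeLovechocolate",
]

# re.search mimics rlike: the pattern may match anywhere in the string
counts = {
    p: sum(1 for r in rows if re.search(p, r) is not None)
    for p in patterns
}
print(counts)
# {'WeLove*': 1, 'Arizona.*hot': 2, 'Mahi*': 3}
```

Note that "Mahi*" means "Mah" followed by zero or more "i" characters; it matches all three Mahi rows because the regex only needs to match a substring, which is also why the rlike join pairs each row with every pattern it contains.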


...