Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Creating a relationship between 2 dataframes in pyspark

I am trying to create a mapping as follows:

df_input_1: this is the grouped data

+------+------+
|Hier_1|Hier_2|
+------+------+
|   Jim|   Pan|
|   Tak|   Can|
|   Pac|   Dan|
|   Foe|   Man|
|   Yat|   Van|
+------+------+

df_output_1: Created after applying logic

+---+---------+--------------------+---+
| Sr|Parent_Sr|                Name| ID|
+---+---------+--------------------+---+
|123|       NA|Jim is father of Pan|Jim|
|456|       NA|Tak is father of Can|Tak|
|789|       NA|Pac is father of Dan|Pac|
|143|       NA|Foe is father of Man|Foe|
|457|       NA|Yat is father of Van|Yat|
+---+---------+--------------------+---+

df_output_2: Second dataframe after using another input.

+---+---------+--------------------+---+
| Sr|Parent_Sr|                Name| ID|
+---+---------+--------------------+---+
|998|       NA|Pan is father of Fen|Pan|
|887|       NA|Can is father of Den|Can|
|776|       NA|Dan is father of Qen|Dan|
|665|       NA|Man is father of Men|Man|
|554|       NA|Van is father of Ren|Van|
+---+---------+--------------------+---+

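Expected df_output_2:

+---+---------+--------------------+---+
| Sr|Parent_Sr|                Name| ID|
+---+---------+--------------------+---+
|998|      123|Pan is father of Fen|Pan|
|887|      456|Can is father of Den|Can|
|776|      789|Dan is father of Qen|Dan|
|665|      143|Man is father of Men|Man|
|554|      457|Van is father of Ren|Van|
+---+---------+--------------------+---+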
Question from: https://stackoverflow.com/questions/65856725/creating-a-relationship-between-2-dataframes-in-pyspark


1 Answer


Your question is not entirely clear, but judging from the inputs and the desired output, I believe you simply want to join df_output_2 with df_input_1 on ID = Hier_2 to get the relation between the IDs, then join the result with df_output_1 on Hier_1 = ID to get the parent Sr values:

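from pyspark.sql.functions import col

# Step 1: map each child ID to its parent's name through df_input_1 (Hier_2 -> Hier_1).
# Step 2: look up that parent's Sr in df_output_1.
df_output_2 = (
    df_output_2.alias("out2")
    .join(df_input_1.alias("in1"), col("out2.ID") == col("in1.Hier_2"), "left")
    .join(df_output_1.alias("out1"), col("out1.ID") == col("in1.Hier_1"), "left")
    # coalesce keeps an existing Parent_Sr and otherwise falls back to the parent's Sr;
    # this assumes missing values are real nulls, not the literal string "NA".
    .selectExpr("out2.Sr", "coalesce(out2.Parent_Sr, out1.Sr) as Parent_Sr", "out2.name", "out2.ID")
)

df_output_2.show()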

#+---+---------+--------------------+---+
#| Sr|Parent_Sr|                name| ID|
#+---+---------+--------------------+---+
#|998|      123|Pan is father of Fen|Pan|
#|665|      143|Man is father of Men|Man|
#|887|      456|Can is father of Den|Can|
#|554|      457|Van is father of Ren|Van|
#|776|      789|Dan is father of Qen|Dan|
#+---+---------+--------------------+---+
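
To follow the two-step lookup without a Spark session, here is a minimal plain-Python sketch of the same idea. It is only an illustration, not the PySpark code itself: the rows are copied from the question, the helper name fill_parent_sr is hypothetical, and the "NA" placeholder is assumed to be a missing value (modelled here as None).

```python
# Rows copied from the question; "NA" is assumed to mean "missing" and is stored as None.
df_input_1 = [("Jim", "Pan"), ("Tak", "Can"), ("Pac", "Dan"), ("Foe", "Man"), ("Yat", "Van")]

df_output_1 = [
    {"Sr": 123, "Parent_Sr": None, "Name": "Jim is father of Pan", "ID": "Jim"},
    {"Sr": 456, "Parent_Sr": None, "Name": "Tak is father of Can", "ID": "Tak"},
    {"Sr": 789, "Parent_Sr": None, "Name": "Pac is father of Dan", "ID": "Pac"},
    {"Sr": 143, "Parent_Sr": None, "Name": "Foe is father of Man", "ID": "Foe"},
    {"Sr": 457, "Parent_Sr": None, "Name": "Yat is father of Van", "ID": "Yat"},
]

df_output_2 = [
    {"Sr": 998, "Parent_Sr": None, "Name": "Pan is father of Fen", "ID": "Pan"},
    {"Sr": 887, "Parent_Sr": None, "Name": "Can is father of Den", "ID": "Can"},
    {"Sr": 776, "Parent_Sr": None, "Name": "Dan is father of Qen", "ID": "Dan"},
    {"Sr": 665, "Parent_Sr": None, "Name": "Man is father of Men", "ID": "Man"},
    {"Sr": 554, "Parent_Sr": None, "Name": "Van is father of Ren", "ID": "Van"},
]


def fill_parent_sr(children, parents, hierarchy):
    """Hypothetical helper mirroring the two left joins plus coalesce."""
    # First join: child ID (Hier_2) -> parent ID (Hier_1).
    child_to_parent = {hier_2: hier_1 for hier_1, hier_2 in hierarchy}
    # Second join: parent ID -> parent Sr.
    parent_sr_by_id = {row["ID"]: row["Sr"] for row in parents}

    result = []
    for row in children:
        parent_id = child_to_parent.get(row["ID"])   # None if no match (left join)
        looked_up = parent_sr_by_id.get(parent_id)   # None if no match (left join)
        filled = dict(row)
        # coalesce(Parent_Sr, looked_up): keep an existing value, else use the lookup.
        if filled["Parent_Sr"] is None:
            filled["Parent_Sr"] = looked_up
        result.append(filled)
    return result


linked = fill_parent_sr(df_output_2, df_output_1, df_input_1)
for row in linked:
    print(row["Sr"], row["Parent_Sr"], row["Name"], row["ID"])
```

A child whose ID has no entry in df_input_1 keeps Parent_Sr as None, which matches the behaviour of the left joins.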
