Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
180 views
in Technique[技术] by (71.8m points)

scala - How to avoid duplicate columns after join?

I have two dataframes with the following columns:

df1.columns
//  Array(ts, id, X1, X2)

and

df2.columns
//  Array(ts, id, Y1, Y2)

After I do

val df_combined = df1.join(df2, Seq(ts,id))

I end up with the following columns: Array(ts, id, X1, X2, ts, id, Y1, Y2). I could expect that the common columns would be dropped. Is there something that additional that needs to be done?

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

The simple answer (from the Databricks FAQ on this matter) is to perform the join where the joined columns are expressed as an array of strings (or one string) instead of a predicate.

Below is an example adapted from the Databricks FAQ but with two join columns in order to answer the original poster's question.

Here is the left dataframe:

val llist = Seq(("bob", "b", "2015-01-13", 4), ("alice", "a", "2015-04-23",10))

val left = llist.toDF("firstname","lastname","date","duration")

left.show()

/*
+---------+--------+----------+--------+
|firstname|lastname|      date|duration|
+---------+--------+----------+--------+
|      bob|       b|2015-01-13|       4|
|    alice|       a|2015-04-23|      10|
+---------+--------+----------+--------+
*/

Here is the right dataframe:

val right = Seq(("alice", "a", 100),("bob", "b", 23)).toDF("firstname","lastname","upload")

right.show()

/*
+---------+--------+------+
|firstname|lastname|upload|
+---------+--------+------+
|    alice|       a|   100|
|      bob|       b|    23|
+---------+--------+------+
*/

Here is an incorrect solution, where the join columns are defined as the predicate left("firstname")===right("firstname") && left("lastname")===right("lastname").

The incorrect result is that the firstname and lastname columns are duplicated in the joined data frame:

left.join(right, left("firstname")===right("firstname") &&
                 left("lastname")===right("lastname")).show

/*
+---------+--------+----------+--------+---------+--------+------+
|firstname|lastname|      date|duration|firstname|lastname|upload|
+---------+--------+----------+--------+---------+--------+------+
|      bob|       b|2015-01-13|       4|      bob|       b|    23|
|    alice|       a|2015-04-23|      10|    alice|       a|   100|
+---------+--------+----------+--------+---------+--------+------+
*/

The correct solution is to define the join columns as an array of strings Seq("firstname", "lastname"). The output data frame does not have duplicated columns:

left.join(right, Seq("firstname", "lastname")).show

/*
+---------+--------+----------+--------+------+
|firstname|lastname|      date|duration|upload|
+---------+--------+----------+--------+------+
|      bob|       b|2015-01-13|       4|    23|
|    alice|       a|2015-04-23|      10|   100|
+---------+--------+----------+--------+------+
*/

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

2.1m questions

2.1m answers

60 comments

57.0k users

...