I want to filter dataframe according to the following conditions firstly (d<5) and secondly (value of col2 not equal its counterpart in col4 if value in col1 equal its counterpart in col3).
If the original dataframe DF
is as follows:
+----+----+----+----+---+
|col1|col2|col3|col4| d|
+----+----+----+----+---+
| A| xx| D| vv| 4|
| C| xxx| D| vv| 10|
| A| x| A| xx| 3|
| E| xxx| B| vv| 3|
| E| xxx| F| vvv| 6|
| F|xxxx| F| vvv| 4|
| G| xxx| G| xxx| 4|
| G| xxx| G| xx| 4|
| G| xxx| G| xxx| 12|
| B|xxxx| B| xx| 13|
+----+----+----+----+---+
The desired Dataframe is:
+----+----+----+----+---+
|col1|col2|col3|col4| d|
+----+----+----+----+---+
| A| xx| D| vv| 4|
| A| x| A| xx| 3|
| E| xxx| B| vv| 3|
| F|xxxx| F| vvv| 4|
| G| xxx| G| xx| 4|
+----+----+----+----+---+
Code I have tried that did not work as expected:
cols=[('A','xx','D','vv',4),('C','xxx','D','vv',10),('A','x','A','xx',3),('E','xxx','B','vv',3),('E','xxx','F','vvv',6),('F','xxxx','F','vvv',4),('G','xxx','G','xxx',4),('G','xxx','G','xx',4),('G','xxx','G','xxx',12),('B','xxxx','B','xx',13)]
df=spark.createDataFrame(cols,['col1','col2','col3','col4','d'])
df.filter((df.d<5)& (df.col2!=df.col4) & (df.col1==df.col3)).show()
+----+----+----+----+---+
|col1|col2|col3|col4| d|
+----+----+----+----+---+
| A| x| A| xx| 3|
| F|xxxx| F| vvv| 4|
| G| xxx| G| xx| 4|
+----+----+----+----+---+
What should I do to achieve the desired result?
See Question&Answers more detail:
os