With the approach and Glue configuration below, the job completed in 121 minutes.
Glue details:
Worker type => G.2X
Number of workers => 50. You could also try 149 workers; that should complete the job in about 35-45 minutes.
I created two files:
df1 => 7 columns, 1700000 rows, 140 MB (your file size may differ depending on column sizes)
df2 => 7 columns, 25000 rows, 2 MB
I then repartitioned the first dataframe into 40500 partitions.
How did I get 40500? First I created df1 with 1 record and df2 with 25000 records, cross joined them, and saved the output.
The output was a 3.5 MB file. For best performance, a partition should be around 128 MB.
Let's assume you want each partition to produce about 150 MB of output.
Since the output generated from 1 record was 3.5 MB, a 150 MB partition
needs approx. 42 records (150 / 3.5).
We have 1700000 records, which makes approx. 40500 partitions (1700000 / 42).
For you, the output size for 1 record could differ. Use the same approach to calculate the number of partitions.
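The sizing arithmetic above can be sketched in plain Python. This is only an illustration of the calculation, not Glue or Spark code; the function names are hypothetical, and the inputs (3.5 MB of output per df1 record, a 150 MB target, 1700000 records) are the figures from this answer, so substitute your own measurements.

```python
import math


def records_per_partition(target_partition_mb, output_mb_per_record):
    """How many df1 records fit in one partition so that the
    cross-join output of that partition stays near the target size."""
    return max(1, math.floor(target_partition_mb / output_mb_per_record))


def partition_count(total_records, per_partition):
    """Number of partitions needed to hold all df1 records."""
    return math.ceil(total_records / per_partition)


# Figures measured in this answer (assumptions for your own data).
per_partition = records_per_partition(150, 3.5)
num_partitions = partition_count(1_700_000, per_partition)

print(per_partition, num_partitions)
```

The exact result is slightly below the rounded 40500 used here; rounding up a little only makes each partition marginally smaller.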
After the repartition, just use a cross join with broadcast.
from pyspark.sql.functions import broadcast
df1 = df1.repartition(40500)
result = df1.crossJoin(broadcast(df2))