Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
231 views
in Technique[技术] by (71.8m points)

amazon web services - Cross join optimizations on AWS Glue/Spark

I have 2 dataframes:

df1 - 7 columns (IDs and VARCHARs), rows: 1,700,000

df2 - 7 columns (IDs and VARCHARs), rows: 25,000

Need to find all possible similarities, no way to skip cartesian-product.

AWS Glue: Cluster with 10 (or 20) G.1X Workers

Already tested for 178 partitions (Spark calculated on fly when filtered df1 from bigger df) Running time: 10 hours... I stopped the job! But on S3, more than 999 part-XXX-YYYYY files were found.

Question: How to optimize this cross join on Glue/Spark, if no way to skip cross-join?

question from:https://stackoverflow.com/questions/66064608/cross-join-optimizations-on-aws-glue-spark

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

With below approach and Glue configuration, Job completed in 121 min:

Glue Details=>

Workers=>G2.X

Number of Workers=> 50 . You could try with 149 also, this should complete job in 35-45 Min.

I have created two files:-

df1=> 7 columns rows: 1700000, size 140 MB (Based on column size, file size may be different for you)

df2=> 7 columns rows: 25000, size 2 MB

Now I have partitioned first dataframe with 42500.

How did I get the 42500-> First I have created DF1 with 1 records, DF2 with 25000 and saved, cross join output.

It was 3.5 MB file, For best performance, Optimum partition should be around 128 MB. Lets assume you want to make one partition size as 150 MB.

Now output generated from 1 record was 3.5 MB, to make 150 MB partition size we need approx. 42 records per partitions. We have 1700000 records, which makes it approx. 40500 partitions.

For you, size for 1 record could differ. Use same approach to calculate partition size. After the reparation, just use cross join with broadcast.

df1.reparition(40500)

df.crossJoin(broadcast(df2))

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...