pyspark - Total size of serialized results of tasks is bigger than spark.driver.maxResultSize

Question

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

Good day.

I am running development code that parses log files. It runs smoothly when I parse fewer files, but as I increase the number of log files, it fails with errors such as "too many open files" and "Total size of serialized results of tasks is bigger than spark.driver.maxResultSize".

I tried increasing spark.driver.maxResultSize, but the error persists.

Can you give me any ideas on how to resolve this issue?

Thanks.

Sample Error

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:16:30+0000

"Total size of serialized results of tasks is bigger than spark.driver.maxResultSize" means that the serialized results the tasks send back to the driver (for example, from a collect()) add up to more than spark.driver.maxResultSize. One possible solution, as @mayank agrawal suggested, is to keep increasing the limit until the job works. This is not recommended if the executors are sending too much data to the driver in the first place.
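As a rough illustration of the point above, here is a minimal PySpark sketch. It assumes the limit is set before the session is created (a session that already exists may not pick up the new value) and that the job can write its output to storage instead of pulling it to the driver with collect(). The app name, the 2g value, and the helper names are hypothetical placeholders, not taken from the original code.

```python
# Hypothetical settings; the value is a placeholder, not a recommendation.
# The Spark default for this limit is 1g, and 0 means unlimited.
DRIVER_CONF = {
    "spark.driver.maxResultSize": "2g",
}


def build_session(app_name="log-parser", conf=DRIVER_CONF):
    """Create a SparkSession with the given settings applied up front."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def save_instead_of_collect(df, output_path):
    """Write results from the executors so nothing large reaches the driver."""
    df.write.mode("overwrite").parquet(output_path)
```

If the parsed logs are only needed downstream, writing them out this way may avoid the limit altogether, whatever value it is set to.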

I would suggest looking at your code to see whether the data is skewed, so that one executor does most of the work and moves a lot of data in and out. If the data is skewed, you could try repartitioning it.
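One way to check for skew before repartitioning is to count the rows in each partition. The sketch below is only an assumption about how that could look: the helper names, the threshold of 3.0, and the target of 200 partitions are all hypothetical and would need tuning for a real job.

```python
def skew_ratio(partition_sizes):
    """Largest partition divided by the mean size; 1.0 means perfectly even."""
    sizes = list(partition_sizes)
    if not sizes or sum(sizes) == 0:
        return 0.0
    return max(sizes) / (sum(sizes) / len(sizes))


def rows_per_partition(df):
    """Count rows in each partition; only one small integer per partition
    is returned to the driver."""
    return df.rdd.mapPartitions(lambda rows: [sum(1 for _ in rows)]).collect()


def rebalance_if_skewed(df, num_partitions=200, threshold=3.0):
    """Repartition only when one partition is much larger than the average."""
    if skew_ratio(rows_per_partition(df)) > threshold:
        # repartition() does a full shuffle and spreads rows evenly.
        return df.repartition(num_partitions)
    return df
```

For example, partition sizes of [100, 0, 0, 0] give a ratio of 4.0, which would trigger the repartition with the threshold above.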

As for the "too many open files" error, a possible cause is that Spark creates a large number of intermediate files before a shuffle. This can happen when an executor uses too many cores, when parallelism is high, or when there are many unique keys (in your case, the likely cause is the huge number of input files). One option to look into is consolidating the intermediate files with this flag when you run spark-submit: --conf spark.shuffle.consolidateFiles=true. Note that this setting belongs to the older hash-based shuffle, so check whether your Spark version still supports it.
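Before changing shuffle settings, it may help to confirm what the open-file limit actually is on the machine and to try reading the logs into fewer partitions. The following is a sketch under assumptions: it uses Python's standard resource module (Unix only), and the function names, the path glob, and the target of 64 partitions are hypothetical. Whether fewer partitions is enough to stay under the limit depends on the job.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


def read_logs_with_fewer_partitions(spark, path_glob, target_partitions=64):
    """Read many log files and merge them into fewer partitions, so that
    later stages run fewer tasks and write fewer intermediate files."""
    df = spark.read.text(path_glob)
    # coalesce() reduces the partition count without a full shuffle.
    return df.coalesce(target_partitions)
```

If the soft limit turns out to be low (1024 is a common default), raising it for the user that runs the executors is another thing to try alongside the Spark-side changes.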

One more thing to check is this thread, in case it is similar to your use case: https://issues.apache.org/jira/browse/SPARK-12837
