Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


0 votes
642 views
in Technique by (71.8m points)

apache-spark - Why does reading a Parquet file take much more memory than writing in Spark?

I have a 40 GB CSV file and use a Spark job to convert it into a Parquet file with Snappy compression, resulting in a 1.8 GB Parquet file.

Then I have another Spark job that reads the Parquet file and processes it.

And I found that even if I just read the file without any processing, I need to assign 75 GB of memory to the reading job for it to run smoothly, while the writing job only needed 14 GB!
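For reference, memory for a Spark job is typically assigned at submit time via `spark-submit`. A hedged sketch of the two submissions implied above (the script names are assumptions; only the memory figures come from the question):

```shell
# Writing job: converting CSV to Parquet ran smoothly with 14 GB.
spark-submit --driver-memory 14g convert_csv_to_parquet.py

# Reading job: merely reading the same Parquet file back
# reportedly needed about 75 GB to run smoothly.
spark-submit --driver-memory 75g read_parquet.py
```

On a single-machine (local-mode) deployment like the asker's, the driver and executors share one JVM, so `--driver-memory` is the setting that governs the job's heap.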

I use Spark 2.8 on a single machine with an 8-core CPU and 128 GB of RAM.

All the settings for both the read and write jobs are the same.

I find it really weird that reading takes over 5 times as much memory as writing (75 GB vs 14 GB).

Does anyone have an idea why?

Thanks!

asked by Danny, translated from Stack Overflow


1 Answer

0 votes
by (71.8m points)
Waiting for an expert to reply.

