This needs to be broken down:
- Does the Parquet code get the predicates from Spark? Yes.
- Does Parquet then attempt to selectively read only those columns, using the Hadoop FileSystem `seek()` + `read()` or `readFully(position, buffer, length)` calls? Yes.
- Does the S3A connector translate these file operations into efficient HTTP GET requests? In Amazon EMR: yes. In Apache Hadoop, you need Hadoop 2.8+ on the classpath and the property `spark.hadoop.fs.s3a.experimental.fadvise=random` set to trigger random access (see the configuration sketch just below this list).
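As a minimal sketch of wiring that option in, assuming a SparkSession-based job (the app name, bucket path, and column names here are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

// The spark.hadoop. prefix forwards the option into the Hadoop Configuration
// used by the S3A connector.
val spark = SparkSession.builder()
  .appName("parquet-over-s3a")                                   // hypothetical app name
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
  .getOrCreate()

// Column-pruned read: with random fadvise, only the byte ranges of the selected
// columns' chunks should be fetched from S3.
val df = spark.read.parquet("s3a://some-bucket/events/")         // hypothetical path
df.select("user_id", "ts").show()                                // hypothetical columns
```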
Hadoop 2.7 and earlier handle this aggressive seek()ing around the file badly: they always initiate a GET from the current offset to the end of the file, get surprised by the next seek(), have to abort that connection, and reopen a new TCP/HTTPS 1.1 connection (slow, CPU-heavy), then do it all again, repeatedly. Random-IO mode hurts bulk loading of things like .csv.gz, but it is critical to getting ORC/Parquet performance.
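To make that IO pattern concrete, here is a rough sketch of the kind of calls a Parquet reader issues against the Hadoop FileSystem API; the object key, offsets, and buffer size are invented for illustration:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val conf = new Configuration()
// Hadoop 2.8+: tell S3A to issue bounded GETs instead of reading to end-of-file.
conf.set("fs.s3a.experimental.fadvise", "random")

val path = new Path("s3a://some-bucket/data/part-00000.parquet")  // hypothetical object
val in = path.getFileSystem(conf).open(path)
try {
  val buf = new Array[Byte](4096)
  // Positioned read of one column chunk: on 2.8+ with random fadvise this becomes a
  // short GET; on 2.7 the stream already has a GET running to end-of-file, which must
  // be aborted and a fresh HTTPS connection set up before the read can proceed.
  in.readFully(1024L * 1024, buf, 0, buf.length)
  in.seek(8L * 1024 * 1024)
  in.readFully(buf, 0, buf.length)
} finally {
  in.close()
}
```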
You don't get the speedup with Hadoop 2.7's hadoop-aws JAR. If you need it, you have to update the hadoop-*.jar files and their dependencies, or build Spark from scratch against Hadoop 2.8.
Note that Hadoop 2.8+ also has a nice little feature: if you call toString() on an S3A filesystem client in a log statement, it prints out all the filesystem IO stats, including how much data was discarded in seeks, aborted TCP connections, etc. That helps you work out what's going on.
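For example, a small sketch of dumping those stats after a workload; the bucket URI is made up:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

val fs = FileSystem.get(new URI("s3a://some-bucket/"), new Configuration())
// ... run the Parquet reads ...
// On Hadoop 2.8+ the S3A client's toString() includes its instrumentation counters,
// e.g. bytes read, bytes discarded in seeks, aborted connections.
println(fs)
```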
2018-04-13 warning: Do not try to drop the Hadoop 2.8+ hadoop-aws JAR onto the classpath alongside the rest of the Hadoop 2.7 JAR set and expect to see any speedup. All you will see are stack traces. You need to update all the Hadoop JARs and their transitive dependencies.
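One way to keep those versions aligned, assuming an sbt build, is to pin every Hadoop artifact to the same version; this fragment is illustrative and the version number is an assumption, not a recommendation:

```scala
// Hypothetical build.sbt fragment: hadoop-aws must match the rest of the Hadoop
// artifacts exactly, so derive both from a single version value.
val hadoopVersion = "2.8.3"   // assumption: pick whatever 2.8+ release you actually deploy

libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-client" % hadoopVersion,
  "org.apache.hadoop" % "hadoop-aws"    % hadoopVersion
)
```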