
python - How to read compressed avro files (.gz) in spark?

I am trying to read a gzip-compressed (.gz) Avro file with Spark, but I get the error below. The documentation suggests that Spark can read .gz files without any additional conversion, though that may apply only to CSV/text files.

Running the command below fails:

df = spark.read.format("com.databricks.spark.avro").load("/user/data/test1.avro.gz")

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/2.6.1.0-129/spark2/python/pyspark/sql/readwriter.py", line 149, in load
    return self._df(self._jreader.load(path))
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/hdp/2.6.1.0-129/spark2/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/2.6.1.0-129/spark2/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o72.load.
: java.io.IOException: Not an Avro data file
        at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:63)
        at com.databricks.spark.avro.DefaultSource$$anonfun$5.apply(DefaultSource.scala:80)
        at com.databricks.spark.avro.DefaultSource$$anonfun$5.apply(DefaultSource.scala:77)
        at scala.Option.getOrElse(Option.scala:121)
        at com.databricks.spark.avro.DefaultSource.inferSchema(DefaultSource.scala:77)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:184)
        at scala.Option.orElse(Option.scala:289)
        at org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:183)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
Question from: https://stackoverflow.com/questions/65896452/how-to-read-compressed-avro-files-gz-in-spark


1 Answer


Compression in an Avro file works by compressing the individual data blocks separately; the Avro container file itself is not compressed (docs). ORC and Parquet compression work in a similar way, which is what keeps these formats splittable.
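
You can see why the reader rejects the gzipped file by checking the container's magic bytes directly. A minimal sketch in plain Python (no Spark needed; the local path is illustrative): an Avro container must start with b'Obj\x01', while a gzip stream starts with b'\x1f\x8b', which is exactly what triggers the "Not an Avro data file" error above.

# Check the first four bytes; an Avro container starts with the magic
# bytes b'Obj\x01'. A gzip stream starts with b'\x1f\x8b' instead, so
# the Avro reader rejects it as "Not an Avro data file".
with open("test1.avro.gz", "rb") as f:   # local copy; path is illustrative
    magic = f.read(4)

print(magic)                   # b'\x1f\x8b...' for a gzipped file
print(magic == b"Obj\x01")     # False -> not an Avro container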

In other words, you can't run gzip on an already-written .avro file and read the result directly, the way you can with plain text files.
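
One workaround, sketched below, is to strip the gzip wrapper first and then point the Avro reader at the decompressed file. This assumes test1.avro.gz is a gzip of a valid Avro container; the paths are illustrative, and on HDFS you would decompress with your own tooling rather than the local-file approach shown here.

import gzip
import shutil

# Strip the outer gzip layer; the result is the original Avro container.
with gzip.open("test1.avro.gz", "rb") as src, open("test1.avro", "wb") as dst:
    shutil.copyfileobj(src, dst)

# The file now starts with the Avro magic bytes and loads normally.
df = spark.read.format("com.databricks.spark.avro").load("test1.avro")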

Compression happens when you write the Avro file. In Spark this is controlled either by the spark.sql.avro.compression.codec SparkConf setting or by the compression option on the writer (docs).
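
For example, here is a minimal sketch of writing block-compressed Avro from PySpark, using the codec settings described in the databricks spark-avro docs (deflate with a chosen level; snappy is the other common codec). The output path is illustrative.

# Select the per-block codec before writing; the container stays readable
# and splittable because only the data blocks are compressed.
spark.conf.set("spark.sql.avro.compression.codec", "deflate")
spark.conf.set("spark.sql.avro.deflate.level", "5")   # 1 (fast) .. 9 (small)

df.write.format("com.databricks.spark.avro").save("/user/data/test1_deflate.avro")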

