I need to read Parquet data from AWS S3. If I use the AWS SDK for this, I can get an InputStream like this:
S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, bucketKey));
InputStream inputStream = object.getObjectContent();
But the Apache Parquet reader only takes a local file, like this:
ParquetReader<Group> reader =
    ParquetReader.builder(new GroupReadSupport(), new Path(file.getAbsolutePath()))
        .withConf(conf)
        .build();
reader.read();
So I don't know how to read a Parquet file from an InputStream.
For example, for CSV files there is CSVParser, which works with an InputStream.
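A rough sketch of what I mean for the CSV case (assuming Apache Commons CSV is on the classpath, reading straight from the S3 stream):

// wrap the S3 object's InputStream in a Reader and parse records directly
Reader csvReader = new InputStreamReader(inputStream, StandardCharsets.UTF_8);
for (CSVRecord record : CSVFormat.DEFAULT.parse(csvReader)) {
    System.out.println(record.get(0)); // first column of each row
}

I am looking for something equivalent for Parquet.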
I know one solution is to use Spark for this, like this:
SparkSession spark = SparkSession
.builder()
.getOrCreate();
Dataset<Row> ds = spark.read().parquet("s3a://bucketName/file.parquet");
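(As far as I understand, this only works if the hadoop-aws/s3a connector is on the classpath and credentials are configured, roughly like this:)

// hypothetical credential setup for the s3a:// scheme
spark.sparkContext().hadoopConfiguration().set("fs.s3a.access.key", accessKey);
spark.sparkContext().hadoopConfiguration().set("fs.s3a.secret.key", secretKey);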
But I cannot use Spark.
Could anyone suggest another way to read Parquet data from S3?