Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.0k views
in Technique[技术] by (71.8m points)

amazon web services - Avoid creation of _$folder$ keys in S3 with hadoop (EMR)

I am using an EMR Activity in AWS data pipeline. This EMR Activity is running a hive script in EMR Cluster. It takes dynamo DB as input and stores data in S3.

This is the EMR step used in EMR Activity

s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://my-s3-bucket/hive/my_hive_script.q,-d,DYNAMODB_INPUT_TABLE1=MyTable,-d,S3_OUTPUT_BUCKET=#{output.directoryPath}

where

out.direcoryPath is :

s3://my-s3-bucket/output/#{format(@scheduledStartTime,"YYYY-MM-dd")}

So this creates one folder and one file in S3. (technically speaking it creates two keys 2017-03-18/<some_random_number> and 2017-03-18_$folder$)

2017-03-18
2017-03-18_$folder$

How to avoid creation of these extra empty _$folder$ files.

EDIT: I found a solution listed at https://issues.apache.org/jira/browse/HADOOP-10400 but I don't know how to implement it in AWS data pipeline.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

EMR doesn't seem to provide a way to avoid this.

Because S3 uses a key-value pair storage system, the Hadoop file system implements directory support in S3 by creating empty files with the "_$folder$" suffix.

You can safely delete any empty files with the <directoryname>_$folder$ suffix that appear in your S3 buckets. These empty files are created by the Hadoop framework at runtime, but Hadoop is designed to process data even if these empty files are removed.

https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/

It's in the Hadoop source code, so it could be fixed, but apparently it's not fixed in EMR.

If you are feeling clever, you could create an S3 event notification that matches the _$folder$ suffix, and have it fire off a Lambda function to delete the objects after they're created.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

2.1m questions

2.1m answers

60 comments

57.0k users

...