amazon web services - Avoid creation of _$folder$ keys in S3 with hadoop (EMR)

Question

Welcome To Ask or Share your Answers For Others

amazon web services - Avoid creation of _$folder$ keys in S3 with hadoop (EMR)

asked Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

amazon web services - Avoid creation of _$folder$ keys in S3 with hadoop (EMR)

I am using an EMR Activity in AWS data pipeline. This EMR Activity is running a hive script in EMR Cluster. It takes dynamo DB as input and stores data in S3.

This is the EMR step used in EMR Activity

s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://my-s3-bucket/hive/my_hive_script.q,-d,DYNAMODB_INPUT_TABLE1=MyTable,-d,S3_OUTPUT_BUCKET=#{output.directoryPath}

where

out.direcoryPath is :

s3://my-s3-bucket/output/#{format(@scheduledStartTime,"YYYY-MM-dd")}

So this creates one folder and one file in S3. (technically speaking it creates two keys 2017-03-18/<some_random_number> and 2017-03-18_$folder$)

2017-03-18
2017-03-18_$folder$

How to avoid creation of these extra empty _$folder$ files.

EDIT: I found a solution listed at https://issues.apache.org/jira/browse/HADOOP-10400 but I don't know how to implement it in AWS data pipeline.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-23T17:52:30+0000

EMR doesn't seem to provide a way to avoid this.

Because S3 uses a key-value pair storage system, the Hadoop file system implements directory support in S3 by creating empty files with the "_$folder$" suffix.

You can safely delete any empty files with the <directoryname>_$folder$ suffix that appear in your S3 buckets. These empty files are created by the Hadoop framework at runtime, but Hadoop is designed to process data even if these empty files are removed.

https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/

It's in the Hadoop source code, so it could be fixed, but apparently it's not fixed in EMR.

If you are feeling clever, you could create an S3 event notification that matches the _$folder$ suffix, and have it fire off a Lambda function to delete the objects after they're created.

Categories

amazon web services - Avoid creation of _$folder$ keys in S3 with hadoop (EMR)

amazon web services - Avoid creation of _$folder$ keys in S3 with hadoop (EMR)

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags