I am using an EmrActivity in AWS Data Pipeline. The EmrActivity runs a Hive script on an EMR cluster, taking DynamoDB as input and storing the output in S3.
This is the EMR step used in the EmrActivity:
s3://elasticmapreduce/libs/script-runner/script-runner.jar,s3://elasticmapreduce/libs/hive/hive-script,--run-hive-script,--hive-versions,latest,--args,-f,s3://my-s3-bucket/hive/my_hive_script.q,-d,DYNAMODB_INPUT_TABLE1=MyTable,-d,S3_OUTPUT_BUCKET=#{output.directoryPath}
where
#{output.directoryPath} is:
s3://my-s3-bucket/output/#{format(@scheduledStartTime,"YYYY-MM-dd")}
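The Hive script itself isn't shown in the question; a minimal sketch of what my_hive_script.q might look like, assuming the standard EMR DynamoDB storage handler (the table name, columns, and mappings below are hypothetical placeholders):

```sql
-- Hypothetical external table backed by the DynamoDB table passed in via -d.
-- Column names, types, and the mapping are placeholders; adjust to the real schema.
CREATE EXTERNAL TABLE ddb_input (id string, payload string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
  "dynamodb.table.name" = "${DYNAMODB_INPUT_TABLE1}",
  "dynamodb.column.mapping" = "id:id,payload:payload"
);

-- Write the result set to the S3 path supplied by the pipeline.
INSERT OVERWRITE DIRECTORY '${S3_OUTPUT_BUCKET}'
SELECT * FROM ddb_input;
```

The `${...}` references are substituted from the `-d` arguments in the EMR step above.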
So this creates one folder and one file in S3 (technically speaking, it creates two keys):
2017-03-18/<some_random_number>
2017-03-18_$folder$
How can I avoid the creation of these extra, empty _$folder$ marker files?
EDIT:
I found a possible solution at https://issues.apache.org/jira/browse/HADOOP-10400, but I don't know how to implement it in AWS Data Pipeline.
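One workaround that stays inside Data Pipeline (rather than changing the Hadoop filesystem layer as the JIRA discusses) would be a ShellCommandActivity that runs after the EmrActivity and deletes the marker keys with the AWS CLI. This is only a sketch; the `id`, `runsOn`, and `dependsOn` values are hypothetical and would have to match the pipeline's actual objects:

```json
{
  "id": "CleanupFolderMarkers",
  "type": "ShellCommandActivity",
  "runsOn": { "ref": "MyEc2Resource" },
  "dependsOn": { "ref": "MyEmrActivity" },
  "command": "aws s3 rm s3://my-s3-bucket/output/ --recursive --exclude '*' --include '*_$folder$'"
}
```

The `--exclude '*' --include '*_$folder$'` pair makes `aws s3 rm` delete only keys ending in `_$folder$` (later filters take precedence over earlier ones), leaving the real output untouched.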