My MapReduce job processes data by date and needs to write its output to a specific folder structure. The expected output layout is:
2013
    01
    02
    ..
2012
    01
    02
    ..
etc.
At any time I get at most 12 months of data, so I am using the MultipleOutputs class to create 12 outputs with the following function in the driver:
public void createOutputs() {
    Calendar c = Calendar.getInstance();
    String monthStr, pathStr;
    // Create multiple outputs for the last 12 months
    // TODO make 12 configurable
    for (int i = 0; i < 12; ++i) {
        // Get the month and add 1, since Calendar.MONTH is 0-based
        int month = c.get(Calendar.MONTH) + 1;
        // Add a leading 0 for single-digit months (>= 10 so that October is not padded)
        monthStr = month >= 10 ? "" + month : "0" + month;
        // Generate the output name in the format 201303etl
        pathStr = c.get(Calendar.YEAR) + "" + monthStr + "etl";
        // Add the named output
        MultipleOutputs.addNamedOutput(config, pathStr);
        // Move to the previous month
        c.add(Calendar.MONTH, -1);
    }
}
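To check the naming logic outside Hadoop, the loop above can be pulled into a small standalone helper. This is only a sketch: the class name MonthNames and the method lastMonths are made up for illustration, and it uses String.format for the zero padding instead of the manual check.

```java
import java.util.ArrayList;
import java.util.Calendar;
import java.util.List;
import java.util.Locale;

// Hypothetical helper (not part of the job): builds the named-output strings.
class MonthNames {
    // Returns names such as "201303etl" for the month of 'start'
    // and the (count - 1) months before it, newest first.
    static List<String> lastMonths(Calendar start, int count) {
        Calendar c = (Calendar) start.clone(); // do not modify the caller's calendar
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            int month = c.get(Calendar.MONTH) + 1; // Calendar.MONTH is 0-based
            names.add(String.format(Locale.ROOT, "%04d%02detl", c.get(Calendar.YEAR), month));
            c.add(Calendar.MONTH, -1); // move to the previous month
        }
        return names;
    }
}
```

Calling MonthNames.lastMonths(Calendar.getInstance(), 12) would give the same 12 names the driver registers, which makes the padding for months 10 to 12 easy to verify.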
In the reducer, I added a cleanup function to move the generated output to appropriate directories.
protected void cleanup(Context context) throws IOException, InterruptedException {
    // Custom function to recursively process data
    moveFiles(FileSystem.get(new Configuration()), new Path("/MyOutputPath"));
}
Problem: the reducer's cleanup function runs before the output is moved from the _temporary directory to the output directory. As a result, moveFiles does not see any output when it runs, because all the data is still under _temporary.
What is the best way for me to achieve the desired functionality?
Appreciate any insights.
I am thinking of the following:
- Is there a way to use a custom OutputCommitter?
- Is it better to chain another job, or is that overkill for this?
- Is there a simpler alternative that I am just not aware of?
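For the third option, one thing that may be worth trying is the write(key, value, baseOutputPath) overload of the new-API MultipleOutputs (org.apache.hadoop.mapreduce.lib.output), where the base output path may contain "/" and is resolved relative to the job output directory. I have not confirmed that this overload is available in the Hadoop version used here, so treat the following as a sketch. The OutputPaths class and baseOutputPath method are hypothetical names; only the path-building part is shown as runnable code.

```java
import java.util.Locale;

// Hypothetical helper: builds the relative path handed to MultipleOutputs.write.
class OutputPaths {
    // Returns a relative base path such as "2013/03/etl" for a year and a 1-based month.
    static String baseOutputPath(int year, int month) {
        if (month < 1 || month > 12) {
            throw new IllegalArgumentException("month must be 1..12, got " + month);
        }
        return String.format(Locale.ROOT, "%04d/%02d/etl", year, month);
    }
}

// Assumed usage inside the reducer (not runnable without Hadoop):
//   mos = new MultipleOutputs<>(context);                              // in setup()
//   mos.write(key, value, OutputPaths.baseOutputPath(year, month));    // in reduce()
//   mos.close();                                                       // in cleanup()
```

If this works as described, the files should be committed directly as 2013/03/etl-r-00000 and so on under the output directory, and the move step in cleanup would no longer be needed.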
Here is a sample log of the file structure as seen from the cleanup function:
MyMapReduce: filepath:hdfs://localhost:8020/dev/test
MyMapReduce: filepath:hdfs://localhost:8020/dev/test/_logs
MyMapReduce: filepath:hdfs://localhost:8020/dev/test/_logs/history/job_201310301015_0224_1383763613843_371979_HtmlEtl
MyMapReduce: filepath:hdfs://localhost:8020/dev/test/_temporary
MyMapReduce: filepath:hdfs://localhost:8020/dev/test/_temporary/_attempt_201310301015_0224_r_000000_0
MyMapReduce: filepath:hdfs://localhost:8020/dev/test/_temporary/_attempt_201310301015_0224_r_000000_0/201307etl-r-00000
MyMapReduce: filepath:hdfs://localhost:8020/dev/test/_temporary/_attempt_201310301015_0224_r_000000_0/part-r-00000
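If the move step is kept, it could instead run in the driver once job.waitForCompletion(true) has returned, since by then the committed files are no longer under _temporary. The mapping from a file name such as 201307etl-r-00000 in the log above to its target location might look like the sketch below. TargetPaths and targetFor are hypothetical names, and the target layout (year/month/file) is my assumption about the desired structure.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps a named-output file name to its target relative path.
class TargetPaths {
    // Matches names like "201307etl-r-00000": 4-digit year, 2-digit month, then the rest.
    private static final Pattern NAMED =
            Pattern.compile("^(\\d{4})(\\d{2})(etl-[mr]-\\d+)$");

    // Returns e.g. "2013/07/etl-r-00000", or null if the name is not a named output
    // (for example the default "part-r-00000" file).
    static String targetFor(String fileName) {
        Matcher m = NAMED.matcher(fileName);
        return m.matches() ? m.group(1) + "/" + m.group(2) + "/" + m.group(3) : null;
    }
}
```

The driver would then list the output directory, call targetFor on each file name, and rename the files that return a non-null target.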