java - Getting Filename/FileData as key/value input for Map when running a Hadoop MapReduce Job

Question

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I went through the question "How to get Filename/File Contents as key/value input for MAP when running a Hadoop MapReduce Job?". Though it explains the concept, I am unable to turn it into working code.

Basically, I want the file name as key and the file data as value. For that I wrote a custom RecordReader as recommended in the aforementioned question. But I couldn't understand how to get the file name as the key in this class. Also, while writing the custom FileInputFormat class, I couldn't understand how to return the custom RecordReader I wrote previously.

The RecordReader code is:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class CustomRecordReader extends RecordReader<Text, Text> {

    private static final String LINE_SEPARATOR = System.getProperty("line.separator");

    private StringBuffer valueBuffer = new StringBuffer("");
    private Text key = new Text();
    private Text value = new Text();
    private RecordReader<Text, Text> recordReader;

    public SPDRecordReader(RecordReader<Text, Text> recordReader) {
        this.recordReader = recordReader;
    }

    @Override
    public void close() throws IOException {
        recordReader.close();
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return recordReader.getProgress();
    }

    @Override
    public void initialize(InputSplit arg0, TaskAttemptContext arg1)
            throws IOException, InterruptedException {
        recordReader.initialize(arg0, arg1);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {

        if (valueBuffer.equals("")) {
            while (recordReader.nextKeyValue()) {
                valueBuffer.append(recordReader.getCurrentValue());
                valueBuffer.append(LINE_SEPARATOR);
            }
            value.set(valueBuffer.toString());
            return true;
        }
        return false;
    }

}

And the incomplete FileInputFormat class is:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class CustomFileInputFormat extends FileInputFormat<Text, Text> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, Text> getRecordReader(InputSplit arg0, JobConf arg1,
            Reporter arg2) throws IOException {
        return null;
    }
}

1 Answer

深蓝 · Answer 1 · 2021-10-23T21:35:31+0000

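Put the following code in your CustomRecordReader class. It uses the old org.apache.hadoop.mapred API (JobConf, FileSplit, next/createKey/createValue), so the class should implement org.apache.hadoop.mapred.RecordReader<Text, Text>. Each line of the file becomes one record whose key is the file name.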

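// Requires org.apache.hadoop.io.LongWritable and, from org.apache.hadoop.mapred,
// FileSplit, JobConf and LineRecordReader.
private LineRecordReader lineReader;
private LongWritable lineOffset; // LineRecordReader's own key is a byte offset
private String fileName;

public CustomRecordReader(JobConf job, FileSplit split) throws IOException {
    lineReader = new LineRecordReader(job, split);
    lineOffset = lineReader.createKey();
    fileName = split.getPath().getName();
}

public boolean next(Text key, Text value) throws IOException {
    // Read the next line straight into value; the offset key is discarded.
    if (!lineReader.next(lineOffset, value)) {
        return false;
    }

    // Replace the key with the file name. value already holds the line.
    key.set(fileName);

    return true;
}

public Text createKey() {
    return new Text("");
}

public Text createValue() {
    return new Text("");
}

// The remaining old-API RecordReader methods delegate to the line reader.
public long getPos() throws IOException {
    return lineReader.getPos();
}

public float getProgress() throws IOException {
    return lineReader.getProgress();
}

public void close() throws IOException {
    lineReader.close();
}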

Remove the SPDRecordReader constructor. A constructor must have the same name as its class, so it does not compile inside CustomRecordReader.
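
A related pitfall in the question's nextKeyValue(): the check valueBuffer.equals("") is always false, because StringBuffer does not override equals, so that method would never return a record. Below is a minimal sketch of the difference. The class and method names are made up for illustration and are not part of Hadoop.

```java
// Demonstrates why comparing a StringBuffer with "" via equals() never succeeds.
class BufferEmptyCheck {

    // StringBuffer inherits Object.equals, so this is an identity
    // comparison against a String and is always false.
    static boolean brokenIsEmpty(StringBuffer buffer) {
        return buffer.equals("");
    }

    // Checking the length is a reliable way to test for emptiness.
    static boolean isEmpty(StringBuffer buffer) {
        return buffer.length() == 0;
    }

    public static void main(String[] args) {
        StringBuffer valueBuffer = new StringBuffer("");
        System.out.println(brokenIsEmpty(valueBuffer)); // prints "false"
        System.out.println(isEmpty(valueBuffer));       // prints "true"
    }
}
```

Testing valueBuffer.length() == 0 instead is one way around it.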

And put this code in your CustomFileInputFormat class:

public RecordReader<Text, Text> getRecordReader(
  InputSplit input, JobConf job, Reporter reporter)
  throws IOException {

    reporter.setStatus(input.toString());
    return new CustomRecordReader(job, (FileSplit)input);
}
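
Note that a line-based reader emits one record per line, each keyed by the file name. If the whole file should arrive as a single value, as the question asks, the usual pattern is a reader that returns exactly one record and then reports that it is done. The sketch below models that contract in plain Java, with no Hadoop dependency, so the control flow can be run on its own. WholeFileReaderSketch and its use of java.nio.file are assumptions for illustration only. In a real RecordReader the bytes would more likely come from FileSystem.open(path) on the split's path, read with org.apache.hadoop.io.IOUtils.readFully.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical plain-Java model of a one-record-per-file reader.
// It mirrors the nextKeyValue()/getCurrentKey()/getCurrentValue()
// contract but is NOT a Hadoop class.
class WholeFileReaderSketch {

    private final Path file;
    private String key;
    private String value;
    private boolean processed = false; // true once the single record was emitted

    WholeFileReaderSketch(Path file) {
        this.file = file;
    }

    // Returns true exactly once, after loading the whole file as the value.
    boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        key = file.getFileName().toString();
        value = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        processed = true;
        return true;
    }

    String getCurrentKey() {
        return key;
    }

    String getCurrentValue() {
        return value;
    }

    // The single record is either not yet read (0) or fully read (1).
    float getProgress() {
        return processed ? 1.0f : 0.0f;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sample", ".txt");
        Files.write(tmp, "line one\nline two\n".getBytes(StandardCharsets.UTF_8));
        WholeFileReaderSketch reader = new WholeFileReaderSketch(tmp);
        while (reader.nextKeyValue()) {
            System.out.println(reader.getCurrentKey() + " -> "
                    + reader.getCurrentValue().length() + " chars");
        }
        Files.delete(tmp);
    }
}
```

The processed flag replaces the buffer-based check in the question's reader. This only works as intended when isSplitable returns false, as it does in the CustomFileInputFormat above, so that each file maps to a single split.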
