This article collects typical usage examples of the Java class org.apache.crunch.Target. If you have been wondering what exactly the Target class is for, how to use it, or where to find examples of it in use, the curated class examples here may help.
The Target class belongs to the org.apache.crunch package. 13 code examples of the class are shown below, sorted by popularity by default. You can upvote the examples you like or find useful; your feedback helps our system recommend better Java code examples.
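Before the examples, here is a minimal orientation sketch, not taken from any of the projects below (the class name and the input/output paths are hypothetical). It shows how a Target is passed to Pipeline.write together with a Target.WriteMode, which controls how an existing output location is handled (DEFAULT, OVERWRITE, APPEND, or CHECKPOINT):

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.Target;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.To;

public class TargetWriteModeSketch {
  public static void main(String[] args) {
    Pipeline pipeline = new MRPipeline(TargetWriteModeSketch.class);
    // Read a text file (hypothetical path) into a PCollection.
    PCollection<String> lines = pipeline.readTextFile("/tmp/in");
    // To.textFile(...) builds a filesystem-backed Target (hypothetical path).
    Target target = To.textFile("/tmp/out");
    // OVERWRITE replaces any existing output instead of failing the pipeline.
    pipeline.write(lines, target, Target.WriteMode.OVERWRITE);
    pipeline.done();
  }
}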
Example 1: testGeneric

import org.apache.crunch.Target; // import the required package/class

@Test
public void testGeneric() throws IOException {
  Dataset<Record> inputDataset = repo.create("in", new DatasetDescriptor.Builder()
      .schema(USER_SCHEMA).build());
  Dataset<Record> outputDataset = repo.create("out", new DatasetDescriptor.Builder()
      .schema(USER_SCHEMA).build());

  // write two files, each of 5 records
  writeTestUsers(inputDataset, 5, 0);
  writeTestUsers(inputDataset, 5, 5);

  Pipeline pipeline = new MRPipeline(TestCrunchDatasets.class);
  PCollection<GenericData.Record> data = pipeline.read(
      CrunchDatasets.asSource(inputDataset, GenericData.Record.class));
  pipeline.write(data, CrunchDatasets.asTarget(outputDataset), Target.WriteMode.APPEND);
  pipeline.run();

  checkTestUsers(outputDataset, 10);
}

Author: cloudera, Project: cdk, Lines: 20, Source: TestCrunchDatasets.java
Example 2: testGenericParquet

import org.apache.crunch.Target; // import the required package/class

@Test
public void testGenericParquet() throws IOException {
  Dataset<Record> inputDataset = repo.create("in", new DatasetDescriptor.Builder()
      .schema(USER_SCHEMA).format(Formats.PARQUET).build());
  Dataset<Record> outputDataset = repo.create("out", new DatasetDescriptor.Builder()
      .schema(USER_SCHEMA).format(Formats.PARQUET).build());

  // write two files, each of 5 records
  writeTestUsers(inputDataset, 5, 0);
  writeTestUsers(inputDataset, 5, 5);

  Pipeline pipeline = new MRPipeline(TestCrunchDatasets.class);
  PCollection<GenericData.Record> data = pipeline.read(
      CrunchDatasets.asSource(inputDataset, GenericData.Record.class));
  pipeline.write(data, CrunchDatasets.asTarget(outputDataset), Target.WriteMode.APPEND);
  pipeline.run();

  checkTestUsers(outputDataset, 10);
}

Author: cloudera, Project: cdk, Lines: 20, Source: TestCrunchDatasets.java
Example 3: asTarget

import org.apache.crunch.Target; // import the required package/class

/**
 * Expose the given {@link Dataset} as a Crunch {@link Target}.
 *
 * Only the FileSystem {@code Dataset} implementation is supported and the
 * file format must be {@code Formats.PARQUET} or {@code Formats.AVRO}. In
 * addition, the given {@code Dataset} must not be partitioned,
 * <strong>or</strong> must be a leaf partition in the partition hierarchy.
 *
 * <strong>The {@code Target} returned by this method will not write to
 * sub-partitions.</strong>
 *
 * @param dataset the dataset to write to
 * @return the {@link Target}, or <code>null</code> if the dataset is not
 * filesystem-based.
 */
public static Target asTarget(Dataset dataset) {
  Path directory = Accessor.getDefault().getDirectory(dataset);
  if (directory != null) {
    final Format format = dataset.getDescriptor().getFormat();
    if (Formats.PARQUET.equals(format)) {
      return new AvroParquetFileTarget(directory);
    } else if (Formats.AVRO.equals(format)) {
      return new AvroFileTarget(directory);
    } else {
      throw new UnsupportedOperationException(
          "Not a supported format: " + format);
    }
  }
  return null;
}

Author: cloudera, Project: cdk, Lines: 31, Source: CrunchDatasets.java
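Given that contract, callers should guard against the null return. The sketch below is a hypothetical helper, not from the original sources (the appendToDataset name and its parameters are illustrative); it assumes a filesystem-backed dataset in AVRO or PARQUET format:

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.Target;

// Hypothetical helper illustrating the contract above: asTarget() returns
// null when the dataset is not filesystem-based, so check before writing.
static void appendToDataset(Pipeline pipeline,
                            PCollection<GenericData.Record> records,
                            Dataset dataset) {
  Target target = CrunchDatasets.asTarget(dataset);
  if (target == null) {
    throw new IllegalArgumentException("Dataset is not filesystem-based: " + dataset);
  }
  pipeline.write(records, target, Target.WriteMode.APPEND);
}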
Example 4: testPartitionedSourceAndTarget

import org.apache.crunch.Target; // import the required package/class

@Test
@SuppressWarnings("deprecation")
public void testPartitionedSourceAndTarget() throws IOException {
  PartitionStrategy partitionStrategy = new PartitionStrategy.Builder().hash(
      "username", 2).build();

  Dataset<Record> inputDataset = repo.create("in", new DatasetDescriptor.Builder()
      .schema(USER_SCHEMA).partitionStrategy(partitionStrategy).build());
  Dataset<Record> outputDataset = repo.create("out", new DatasetDescriptor.Builder()
      .schema(USER_SCHEMA).partitionStrategy(partitionStrategy).build());

  writeTestUsers(inputDataset, 10);

  PartitionKey key = partitionStrategy.partitionKey(0);
  Dataset<Record> inputPart0 = inputDataset.getPartition(key, false);
  Dataset<Record> outputPart0 = outputDataset.getPartition(key, true);

  Pipeline pipeline = new MRPipeline(TestCrunchDatasets.class);
  PCollection<GenericData.Record> data = pipeline.read(
      CrunchDatasets.asSource(inputPart0, GenericData.Record.class));
  pipeline.write(data, CrunchDatasets.asTarget(outputPart0), Target.WriteMode.APPEND);
  pipeline.run();

  Assert.assertEquals(5, datasetSize(outputPart0));
}

Author: cloudera, Project: cdk, Lines: 26, Source: TestCrunchDatasets.java
Example 5: run

import org.apache.crunch.Target; // import the required package/class

@Override
public int run(String[] args) throws Exception {
  // Construct a filesystem dataset repository rooted at /tmp/data on HDFS
  DatasetRepository fsRepo = DatasetRepositories.open("repo:hdfs:/tmp/data");

  // Construct an HCatalog dataset repository using external Hive tables
  DatasetRepository hcatRepo = DatasetRepositories.open("repo:hive:/tmp/data");

  // Turn debug on while in development.
  getPipeline().enableDebug();
  getPipeline().getConfiguration().set("crunch.log.job.progress", "true");

  // Load the events dataset and get the correct partition to sessionize
  Dataset<StandardEvent> eventsDataset = fsRepo.load("events");
  Dataset<StandardEvent> partition;
  if (args.length == 0 || (args.length == 1 && args[0].equals("LATEST"))) {
    partition = getLatestPartition(eventsDataset);
  } else {
    partition = getPartitionForURI(eventsDataset, args[0]);
  }

  // Create a parallel collection from the working partition
  PCollection<StandardEvent> events = read(
      CrunchDatasets.asSource(partition, StandardEvent.class));

  // Group events by user and cookie id, then create a session for each group
  PCollection<Session> sessions = events
      .by(new GetSessionKey(), Avros.strings())
      .groupByKey()
      .parallelDo(new MakeSession(), Avros.specifics(Session.class));

  // Write the sessions to the "sessions" Dataset
  getPipeline().write(sessions, CrunchDatasets.asTarget(hcatRepo.load("sessions")),
      Target.WriteMode.APPEND);

  return run().succeeded() ? 0 : 1;
}

Author: cloudera, Project: cdk-examples, Lines: 39, Source: CreateSessions.java
Example 6: run

import org.apache.crunch.Target; // import the required package/class

@Override
public int run(String... args) throws Exception {
  if (args.length != 3) {
    System.err.println("Usage: " + CombinedLogFormatConverter.class.getSimpleName() +
        " <input> <dataset_uri> <dataset_name>");
    return 1;
  }
  String input = args[0];
  String datasetUri = args[1];
  String datasetName = args[2];

  Schema schema = new Schema.Parser().parse(
      Resources.getResource("combined_log_format.avsc").openStream());

  // Create the dataset
  DatasetRepository repo = DatasetRepositories.open(datasetUri);
  DatasetDescriptor datasetDescriptor = new DatasetDescriptor.Builder()
      .schema(schema).build();
  Dataset<Object> dataset = repo.create(datasetName, datasetDescriptor);

  // Run the job
  final String schemaString = schema.toString();
  AvroType<GenericData.Record> outputType = Avros.generics(schema);
  PCollection<String> lines = readTextFile(input);
  PCollection<GenericData.Record> records = lines.parallelDo(
      new ConvertFn(schemaString), outputType);
  getPipeline().write(records, CrunchDatasets.asTarget(dataset),
      Target.WriteMode.APPEND);
  run();
  return 0;
}

Author: cloudera, Project: cdk, Lines: 32, Source: CombinedLogFormatConverter.java
Example 7: run

import org.apache.crunch.Target; // import the required package/class

@Override
public int run(String[] args) throws Exception {
  final long startOfToday = startOfDay();

  // the destination dataset
  Dataset<Record> persistent = Datasets.load(
      "dataset:file:/tmp/data/logs", Record.class);

  // the source: anything before today in the staging area
  Dataset<Record> staging = Datasets.load(
      "dataset:file:/tmp/data/logs_staging", Record.class);
  View<Record> ready = staging.toBefore("timestamp", startOfToday);

  ReadableSource<Record> source = CrunchDatasets.asSource(ready);
  PCollection<Record> stagedLogs = read(source);

  getPipeline().write(stagedLogs,
      CrunchDatasets.asTarget(persistent), Target.WriteMode.APPEND);

  PipelineResult result = run();

  if (result.succeeded()) {
    // remove the source data partition from staging
    ready.deleteAll();
    return 0;
  } else {
    return 1;
  }
}

Author: kite-sdk, Project: kite-examples, Lines: 31, Source: StagingToPersistent.java
Example 8: compressedTextOutput

import org.apache.crunch.Target; // import the required package/class

protected final Target compressedTextOutput(Configuration conf, String outputPathKey) {
  // The way this is used, it doesn't seem like we can just set the object in getConf(). Need
  // to set the copy in the MRPipeline directly?
  conf.setClass(FileOutputFormat.COMPRESS_CODEC, GzipCodec.class, CompressionCodec.class);
  conf.setClass(MRJobConfig.MAP_OUTPUT_COMPRESS_CODEC, SnappyCodec.class, CompressionCodec.class);
  return To.textFile(Namespaces.toPath(outputPathKey));
}

Author: apsaltis, Project: oryx, Lines: 8, Source: JobStep.java
Example 9: run

import org.apache.crunch.Target; // import the required package/class

@Override
public int run(String[] args) throws Exception {
  JCommander jc = new JCommander(this);
  try {
    jc.parse(args);
  } catch (ParameterException e) {
    jc.usage();
    return 1;
  }

  if (paths == null || paths.size() != 2) {
    jc.usage();
    return 1;
  }

  String inputPathString = paths.get(0);
  String outputPathString = paths.get(1);

  Configuration conf = getConf();
  Path inputPath = new Path(inputPathString);
  Path outputPath = new Path(outputPathString);
  outputPath = outputPath.getFileSystem(conf).makeQualified(outputPath);

  Pipeline pipeline = new MRPipeline(getClass(), conf);

  VariantsLoader variantsLoader;
  if (dataModel.equals("GA4GH")) {
    variantsLoader = new GA4GHVariantsLoader();
  } else if (dataModel.equals("ADAM")) {
    variantsLoader = new ADAMVariantsLoader();
  } else {
    jc.usage();
    return 1;
  }

  Set<String> sampleSet = samples == null ? null :
      Sets.newLinkedHashSet(Splitter.on(',').split(samples));

  PTable<String, SpecificRecord> partitionKeyedRecords =
      variantsLoader.loadPartitionedVariants(inputFormat, inputPath, conf, pipeline,
          variantsOnly, flatten, sampleGroup, sampleSet, redistribute, segmentSize,
          numReducers);

  if (FileUtils.sampleGroupExists(outputPath, conf, sampleGroup)) {
    if (overwrite) {
      FileUtils.deleteSampleGroup(outputPath, conf, sampleGroup);
    } else {
      LOG.error("Sample group already exists: " + sampleGroup);
      return 1;
    }
  }

  pipeline.write(partitionKeyedRecords, new AvroParquetPathPerKeyTarget(outputPath),
      Target.WriteMode.APPEND);

  PipelineResult result = pipeline.done();
  return result.succeeded() ? 0 : 1;
}

Author: cloudera, Project: quince, Lines: 59, Source: LoadVariantsTool.java
Example 10: run

import org.apache.crunch.Target; // import the required package/class

@Override
public int run(String[] args) throws Exception {
  // Turn debug on while in development.
  getPipeline().enableDebug();
  getPipeline().getConfiguration().set("crunch.log.job.progress", "true");

  Dataset<StandardEvent> eventsDataset = Datasets.load(
      "dataset:hdfs:/tmp/data/default/events", StandardEvent.class);
  View<StandardEvent> eventsToProcess;

  if (args.length == 0 || (args.length == 1 && args[0].equals("LATEST"))) {
    // get the current minute
    Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
    cal.set(Calendar.SECOND, 0);
    cal.set(Calendar.MILLISECOND, 0);
    long currentMinute = cal.getTimeInMillis();

    // restrict events to before the current minute
    // in the workflow, this also has a lower bound for the timestamp
    eventsToProcess = eventsDataset.toBefore("timestamp", currentMinute);
  } else if (isView(args[0])) {
    eventsToProcess = Datasets.load(args[0], StandardEvent.class);
  } else {
    eventsToProcess = FileSystemDatasets.viewForPath(eventsDataset, new Path(args[0]));
  }

  if (eventsToProcess.isEmpty()) {
    LOG.info("No records to process.");
    return 0;
  }

  // Create a parallel collection from the working partition
  PCollection<StandardEvent> events = read(
      CrunchDatasets.asSource(eventsToProcess));

  // Group events by user and cookie id, then create a session for each group
  PCollection<Session> sessions = events
      .by(new GetSessionKey(), Avros.strings())
      .groupByKey()
      .parallelDo(new MakeSession(), Avros.specifics(Session.class));

  // Write the sessions to the "sessions" Dataset
  getPipeline().write(sessions,
      CrunchDatasets.asTarget("dataset:hive:/tmp/data/default/sessions"),
      Target.WriteMode.APPEND);

  return run().succeeded() ? 0 : 1;
}

Author: kite-sdk, Project: kite-examples, Lines: 49, Source: CreateSessions.java
Example 11: avroOutput

import org.apache.crunch.Target; // import the required package/class

protected final Target avroOutput(String outputPathKey) {
  return To.avroFile(Namespaces.toPath(outputPathKey));
}

Author: apsaltis, Project: oryx, Lines: 4, Source: JobStep.java
Example 12: output

import org.apache.crunch.Target; // import the required package/class

protected final Target output(String outputPathKey) {
  return avroOutput(outputPathKey);
}

Author: apsaltis, Project: oryx, Lines: 4, Source: JobStep.java
Example 13: outputConf

import org.apache.crunch.Target; // import the required package/class

@Override
public Target outputConf(final String key, final String value) {
  // Record a per-target configuration override and return this Target for chaining.
  extraConf.put(key, value);
  return this;
}

Author: spotify, Project: hdfs2cass, Lines: 6, Source: CQLTarget.java
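The @Override here indicates that outputConf is declared on the Crunch Target interface itself, which lets callers attach per-target configuration fluently. A minimal, hypothetical usage sketch (the property name and paths are illustrative assumptions, not taken from hdfs2cass):

import org.apache.crunch.Target;
import org.apache.crunch.io.To;

// Chain a per-target configuration override onto a Target before passing it
// to Pipeline.write; the property shown is an illustrative Hadoop setting.
Target compressed = To.textFile("/tmp/out")
    .outputConf("mapreduce.output.fileoutputformat.compress", "true");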
Note: the org.apache.crunch.Target class examples in this article were collected from GitHub, MSDocs, and other source-code and documentation hosting platforms, and the snippets were selected from open-source projects contributed by many developers. Copyright of the source code belongs to the original authors; redistribution and use should follow the corresponding project's license. Please do not reproduce without permission.