This article collects typical usage examples of the Java class org.apache.spark.ml.feature.VectorAssembler. If you have been wondering what VectorAssembler is for, how to use it, or where to find working examples, the curated class examples below should help.
The VectorAssembler class belongs to the org.apache.spark.ml.feature package. Fifteen code examples of the class are shown below, ordered by popularity by default. You can upvote the examples you like or find useful; your feedback helps the site recommend better Java code samples.
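Before the collected examples, here is a minimal, self-contained sketch of the typical VectorAssembler workflow: merging several numeric columns into a single vector column that downstream ML estimators can consume. The column names ("height", "weight"), the class name, and the local SparkSession settings are illustrative assumptions rather than code taken from any of the projects below.

import java.util.Arrays;

import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class VectorAssemblerQuickStart {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[*]")
                .appName("VectorAssemblerQuickStart")
                .getOrCreate();

        // Two hypothetical numeric feature columns, "height" and "weight"
        StructType schema = new StructType(new StructField[]{
                new StructField("height", DataTypes.DoubleType, false, Metadata.empty()),
                new StructField("weight", DataTypes.DoubleType, false, Metadata.empty())
        });
        Dataset<Row> df = spark.createDataFrame(Arrays.asList(
                RowFactory.create(1.70, 65.0),
                RowFactory.create(1.82, 80.0)), schema);

        // Assemble the two columns into a single vector column named "features"
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"height", "weight"})
                .setOutputCol("features");
        assembler.transform(df).show(false);

        spark.stop();
    }
}

The transformed DataFrame gains a "features" column of vector type, which is the column that the estimators in the examples below (PCA, KMeans, BisectingKMeans, GaussianMixture, LogisticRegression) are configured to read.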
Example 1: main

import org.apache.spark.ml.feature.VectorAssembler; // import the required package/class
public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
            .master("local[8]")
            .appName("PCAExpt")
            .getOrCreate();

    // Load and parse the data
    String filePath = "/home/kchoppella/book/Chapter09/data/covtypeNorm.csv";
    Dataset<Row> inDataset = spark.read()
            .format("com.databricks.spark.csv")
            .option("header", "true")
            .option("inferSchema", true)
            .load(filePath);

    // Build a single "features" column holding the feature vectors
    ArrayList<String> inputColsList = new ArrayList<String>(Arrays.asList(inDataset.columns()));
    inputColsList.remove("class");
    String[] inputCols = inputColsList.parallelStream().toArray(String[]::new);

    // Prepare the dataset for training with all features in the "features" column
    VectorAssembler assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");
    Dataset<Row> dataset = assembler.transform(inDataset);

    // Reduce the assembled features to 16 principal components
    PCAModel pca = new PCA()
            .setK(16)
            .setInputCol("features")
            .setOutputCol("pcaFeatures")
            .fit(dataset);
    Dataset<Row> result = pca.transform(dataset).select("pcaFeatures");

    System.out.println("Explained variance:");
    System.out.println(pca.explainedVariance());
    result.show(false);

    spark.stop();
}
Author: PacktPublishing, Project: Machine-Learning-End-to-Endguide-for-Java-developers, Lines: 39, Source: PCAExpt.java
Example 2: main

import org.apache.spark.ml.feature.VectorAssembler; // import the required package/class
public static void main(String[] args) throws IOException {
    SparkSession spark = SparkSession.builder().master("local").appName("DataProcess").getOrCreate();
    String filename = "prices-split-adjusted.csv";
    String symbol = "GOOG";

    // Load data from the CSV file, casting the price and volume columns to double
    Dataset<Row> data = spark.read().format("csv").option("header", true)
            .load(new ClassPathResource(filename).getFile().getAbsolutePath())
            //.filter(functions.col("symbol").equalTo(symbol))
            //.drop("date").drop("symbol")
            .withColumn("openPrice", functions.col("open").cast("double")).drop("open")
            .withColumn("closePrice", functions.col("close").cast("double")).drop("close")
            .withColumn("lowPrice", functions.col("low").cast("double")).drop("low")
            .withColumn("highPrice", functions.col("high").cast("double")).drop("high")
            .withColumn("volumeTmp", functions.col("volume").cast("double")).drop("volume")
            .toDF("date", "symbol", "open", "close", "low", "high", "volume");
    data.show();

    // Count the number of rows per symbol
    Dataset<Row> symbols = data.select("date", "symbol").groupBy("symbol").agg(functions.count("date").as("count"));
    System.out.println("Number of Symbols: " + symbols.count());
    symbols.show();

    // Assemble the numeric columns into one feature vector, then rescale it to [0, 1]
    VectorAssembler assembler = new VectorAssembler()
            .setInputCols(new String[] {"open", "low", "high", "volume", "close"})
            .setOutputCol("features");
    data = assembler.transform(data).drop("open", "low", "high", "volume", "close");
    data = new MinMaxScaler().setMin(0).setMax(1)
            .setInputCol("features").setOutputCol("normalizedFeatures")
            .fit(data).transform(data)
            .drop("features").toDF("date", "symbol", "features");
}
Author: IsaacChanghau, Project: StockPrediction, Lines: 34, Source: DataPreview.java
Example 3: encodeFeatures

import org.apache.spark.ml.feature.VectorAssembler; // import the required package/class
@Override
public List<Feature> encodeFeatures(SparkMLEncoder encoder){
    VectorAssembler transformer = getTransformer();

    // Collect the features already registered for each input column
    List<Feature> result = new ArrayList<>();
    String[] inputCols = transformer.getInputCols();
    for(String inputCol : inputCols){
        List<Feature> features = encoder.getFeatures(inputCol);
        result.addAll(features);
    }

    return result;
}
Author: jpmml, Project: jpmml-sparkml, Lines: 16, Source: VectorAssemblerConverter.java
Example 4: getModelInfo

import org.apache.spark.ml.feature.VectorAssembler; // import the required package/class
@Override
VectorAssemblerModelInfo getModelInfo(VectorAssembler from) {
    VectorAssemblerModelInfo vectorAssemblerModelInfo = new VectorAssemblerModelInfo();
    vectorAssemblerModelInfo.setInputKeys(new LinkedHashSet<>(Arrays.asList(from.getInputCols())));

    Set<String> outputKeys = new LinkedHashSet<String>();
    outputKeys.add(from.getOutputCol());
    vectorAssemblerModelInfo.setOutputKeys(outputKeys);

    return vectorAssemblerModelInfo;
}
Author: flipkart-incubator, Project: spark-transformers, Lines: 13, Source: VectorAssemblerModelAdapter.java
Example 5: getModelInfo

import org.apache.spark.ml.feature.VectorAssembler; // import the required package/class
@Override
VectorAssemblerModelInfo getModelInfo(VectorAssembler from, DataFrame df) {
    VectorAssemblerModelInfo vectorAssemblerModelInfo = new VectorAssemblerModelInfo();
    vectorAssemblerModelInfo.setInputKeys(new LinkedHashSet<>(Arrays.asList(from.getInputCols())));

    Set<String> outputKeys = new LinkedHashSet<String>();
    outputKeys.add(from.getOutputCol());
    vectorAssemblerModelInfo.setOutputKeys(outputKeys);

    return vectorAssemblerModelInfo;
}
Author: flipkart-incubator, Project: spark-transformers, Lines: 13, Source: VectorAssemblerModelAdapter.java
Example 6: main

import org.apache.spark.ml.feature.VectorAssembler; // import the required package/class
public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
            .master("local[8]")
            .appName("KMeansExpt")
            .getOrCreate();

    // Load and parse the data
    String filePath = "/home/kchoppella/book/Chapter09/data/covtypeNorm.csv";
    Dataset<Row> inDataset = spark.read()
            .format("com.databricks.spark.csv")
            .option("header", "true")
            .option("inferSchema", true)
            .load(filePath);

    // Build a single "features" column holding the feature vectors
    ArrayList<String> inputColsList = new ArrayList<String>(Arrays.asList(inDataset.columns()));
    inputColsList.remove("class");
    String[] inputCols = inputColsList.parallelStream().toArray(String[]::new);

    // Prepare the dataset for training with all features in the "features" column
    VectorAssembler assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");
    Dataset<Row> dataset = assembler.transform(inDataset);

    // Train a k-means model with 27 clusters
    KMeans kmeans = new KMeans().setK(27).setSeed(1L);
    KMeansModel model = kmeans.fit(dataset);

    // Evaluate the clustering by computing the Within Set Sum of Squared Errors
    double WSSSE = model.computeCost(dataset);
    System.out.println("Within Set Sum of Squared Errors = " + WSSSE);

    // Show the result
    Vector[] centers = model.clusterCenters();
    System.out.println("Cluster Centers: ");
    for (Vector center: centers) {
        System.out.println(center);
    }

    spark.stop();
}
Author: PacktPublishing, Project: Machine-Learning-End-to-Endguide-for-Java-developers, Lines: 43, Source: KMeansExpt.java
Example 7: main

import org.apache.spark.ml.feature.VectorAssembler; // import the required package/class
public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
            .master("local[8]")
            .appName("BisectingKMeansExpt")
            .getOrCreate();

    // Load and parse the data
    String filePath = "/home/kchoppella/book/Chapter09/data/covtypeNorm.csv";
    Dataset<Row> inDataset = spark.read()
            .format("com.databricks.spark.csv")
            .option("header", "true")
            .option("inferSchema", true)
            .load(filePath);

    // Build a single "features" column holding the feature vectors
    ArrayList<String> inputColsList = new ArrayList<String>(Arrays.asList(inDataset.columns()));
    inputColsList.remove("class");
    String[] inputCols = inputColsList.parallelStream().toArray(String[]::new);

    // Prepare the dataset for training with all features in the "features" column
    VectorAssembler assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");
    Dataset<Row> dataset = assembler.transform(inDataset);

    // Train a bisecting k-means model with 27 clusters
    BisectingKMeans bkm = new BisectingKMeans().setK(27).setSeed(1);
    BisectingKMeansModel model = bkm.fit(dataset);

    // Evaluate the clustering by computing the Within Set Sum of Squared Errors
    double WSSSE = model.computeCost(dataset);
    System.out.println("Within Set Sum of Squared Errors = " + WSSSE);

    // Show the result
    Vector[] centers = model.clusterCenters();
    System.out.println("Cluster Centers: ");
    for (Vector center: centers) {
        System.out.println(center);
    }

    spark.stop();
}
Author: PacktPublishing, Project: Machine-Learning-End-to-Endguide-for-Java-developers, Lines: 44, Source: BisectingKMeansExpt.java
Example 8: main

import org.apache.spark.ml.feature.VectorAssembler; // import the required package/class
public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
            .master("local[8]")
            .appName("KMeansWithPCAExpt")
            .getOrCreate();

    // Load and parse the data
    String filePath = "/home/kchoppella/book/Chapter09/data/covtypeNorm.csv";
    Dataset<Row> inDataset = spark.read()
            .format("com.databricks.spark.csv")
            .option("header", "true")
            .option("inferSchema", true)
            .load(filePath);

    // Build a single "features" column holding the feature vectors
    ArrayList<String> inputColsList = new ArrayList<String>(Arrays.asList(inDataset.columns()));
    inputColsList.remove("class");
    String[] inputCols = inputColsList.parallelStream().toArray(String[]::new);

    // Prepare the dataset for training with all features in the "features" column
    VectorAssembler assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");
    Dataset<Row> dataset = assembler.transform(inDataset);

    // Reduce the assembled features to 16 principal components
    PCAModel pca = new PCA()
            .setK(16)
            .setInputCol("features")
            .setOutputCol("pcaFeatures")
            .fit(dataset);
    Dataset<Row> result = pca.transform(dataset).select("pcaFeatures");
    System.out.println("Explained variance:");
    System.out.println(pca.explainedVariance());
    result.show(false);

    // Train a k-means model with 27 clusters
    KMeans kmeans = new KMeans().setK(27).setSeed(1L);
    KMeansModel model = kmeans.fit(dataset);

    // Evaluate the clustering by computing the Within Set Sum of Squared Errors
    double WSSSE = model.computeCost(dataset);
    System.out.println("Within Set Sum of Squared Errors = " + WSSSE);

    spark.stop();
}
Author: PacktPublishing, Project: Machine-Learning-End-to-Endguide-for-Java-developers, Lines: 47, Source: KMeansWithPCAExpt.java
Example 9: main

import org.apache.spark.ml.feature.VectorAssembler; // import the required package/class
public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
            .master("local[8]")
            .appName("GaussianMixtureModelExpt")
            .getOrCreate();

    // Load and parse the data
    String filePath = "/home/kchoppella/book/Chapter09/data/covtypeNorm.csv";
    Dataset<Row> inDataset = spark.read()
            .format("com.databricks.spark.csv")
            .option("header", "true")
            .option("inferSchema", true)
            .load(filePath);

    // Build a single "features" column holding the feature vectors
    ArrayList<String> inputColsList = new ArrayList<String>(Arrays.asList(inDataset.columns()));
    inputColsList.remove("class");
    String[] inputCols = inputColsList.parallelStream().toArray(String[]::new);

    // Prepare the dataset for training with all features in the "features" column
    VectorAssembler assembler = new VectorAssembler().setInputCols(inputCols).setOutputCol("features");
    Dataset<Row> dataset = assembler.transform(inDataset);

    // Reduce the assembled features to 16 principal components and rename them to "features"
    PCAModel pca = new PCA()
            .setK(16)
            .setInputCol("features")
            .setOutputCol("pcaFeatures")
            .fit(dataset);
    Dataset<Row> result = pca.transform(dataset).select("pcaFeatures").withColumnRenamed("pcaFeatures", "features");

    String outPath = "/home/kchoppella/book/Chapter09/data/gmm_params.csv";
    try (BufferedWriter writer = Files.newBufferedWriter(Paths.get(outPath))) {
        // Cluster the data into 27 classes using a Gaussian mixture model
        int numClusters = 27;
        GaussianMixtureModel gmm = new GaussianMixture()
                .setK(numClusters)
                .fit(result);

        // Output the parameters of the mixture model
        int numGaussians = gmm.getK();
        for (int i = 0; i < numGaussians; i++) {
            String msg = String.format("Gaussian %d:\nweight=%f\nmu=%s\nsigma=\n%s\n\n",
                    i,
                    gmm.weights()[i],
                    gmm.gaussians()[i].mean(),
                    gmm.gaussians()[i].cov());
            System.out.print(msg);
            writer.write(msg + "\n");
            writer.flush();
        }
    }
    catch (IOException iox) {
        System.out.println("Write Exception: \n");
        iox.printStackTrace();
    }

    spark.stop();
}
Author: PacktPublishing, Project: Machine-Learning-End-to-Endguide-for-Java-developers, Lines: 68, Source: GaussianMixtureModelExpt.java
Example 10: VectorAssemblerConverter

import org.apache.spark.ml.feature.VectorAssembler; // import the required package/class
public VectorAssemblerConverter(VectorAssembler transformer){
    super(transformer);
}
Author: jpmml, Project: jpmml-sparkml, Lines: 4, Source: VectorAssemblerConverter.java
Example 11: getSource

import org.apache.spark.ml.feature.VectorAssembler; // import the required package/class
@Override
public Class<VectorAssembler> getSource() {
    return VectorAssembler.class;
}
Author: flipkart-incubator, Project: spark-transformers, Lines: 5, Source: VectorAssemblerModelAdapter.java
Example 12: testVectorAssembler

import org.apache.spark.ml.feature.VectorAssembler; // import the required package/class
@Test
public void testVectorAssembler() {
    // Prepare the data
    JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
            RowFactory.create(0d, 1d, new DenseVector(new double[]{2d, 3d})),
            RowFactory.create(1d, 2d, new DenseVector(new double[]{3d, 4d})),
            RowFactory.create(2d, 3d, new DenseVector(new double[]{4d, 5d})),
            RowFactory.create(3d, 4d, new DenseVector(new double[]{5d, 6d})),
            RowFactory.create(4d, 5d, new DenseVector(new double[]{6d, 7d}))
    ));
    StructType schema = new StructType(new StructField[]{
            new StructField("id", DataTypes.DoubleType, false, Metadata.empty()),
            new StructField("value1", DataTypes.DoubleType, false, Metadata.empty()),
            new StructField("vector1", new VectorUDT(), false, Metadata.empty())
    });
    Dataset<Row> df = spark.createDataFrame(jrdd, schema);

    VectorAssembler vectorAssembler = new VectorAssembler()
            .setInputCols(new String[]{"value1", "vector1"})
            .setOutputCol("feature");

    // Export this model
    byte[] exportedModel = ModelExporter.export(vectorAssembler);
    String exportedModelJson = new String(exportedModel);

    // Import and get the Transformer
    Transformer transformer = ModelImporter.importAndGetTransformer(exportedModel);

    // Compare the predictions
    List<Row> sparkOutput = vectorAssembler.transform(df).orderBy("id").select("id", "value1", "vector1", "feature").collectAsList();
    for (Row row : sparkOutput) {
        Map<String, Object> data = new HashMap<>();
        data.put(vectorAssembler.getInputCols()[0], row.get(1));
        data.put(vectorAssembler.getInputCols()[1], ((DenseVector) row.get(2)).toArray());
        transformer.transform(data);

        double[] output = (double[]) data.get(vectorAssembler.getOutputCol());
        assertArrayEquals(output, ((DenseVector) row.get(3)).toArray(), 0d);
    }
}
Author: flipkart-incubator, Project: spark-transformers, Lines: 43, Source: VectorAssemblerBridgeTest.java
Example 13: testVectorAssembler

import org.apache.spark.ml.feature.VectorAssembler; // import the required package/class
@Test
public void testVectorAssembler() {
    // Prepare the data
    JavaRDD<Row> jrdd = sc.parallelize(Arrays.asList(
            RowFactory.create(0d, 1d, new DenseVector(new double[]{2d, 3d})),
            RowFactory.create(1d, 2d, new DenseVector(new double[]{3d, 4d})),
            RowFactory.create(2d, 3d, new DenseVector(new double[]{4d, 5d})),
            RowFactory.create(3d, 4d, new DenseVector(new double[]{5d, 6d})),
            RowFactory.create(4d, 5d, new DenseVector(new double[]{6d, 7d}))
    ));
    StructType schema = new StructType(new StructField[]{
            new StructField("id", DataTypes.DoubleType, false, Metadata.empty()),
            new StructField("value1", DataTypes.DoubleType, false, Metadata.empty()),
            new StructField("vector1", new VectorUDT(), false, Metadata.empty())
    });
    DataFrame df = sqlContext.createDataFrame(jrdd, schema);

    VectorAssembler vectorAssembler = new VectorAssembler()
            .setInputCols(new String[]{"value1", "vector1"})
            .setOutputCol("feature");

    // Export this model
    byte[] exportedModel = ModelExporter.export(vectorAssembler, null);
    String exportedModelJson = new String(exportedModel);

    // Import and get the Transformer
    Transformer transformer = ModelImporter.importAndGetTransformer(exportedModel);

    // Compare the predictions
    Row[] sparkOutput = vectorAssembler.transform(df).orderBy("id").select("id", "value1", "vector1", "feature").collect();
    for (Row row : sparkOutput) {
        Map<String, Object> data = new HashMap<>();
        data.put(vectorAssembler.getInputCols()[0], row.get(1));
        data.put(vectorAssembler.getInputCols()[1], ((DenseVector) row.get(2)).toArray());
        transformer.transform(data);

        double[] output = (double[]) data.get(vectorAssembler.getOutputCol());
        assertArrayEquals(output, ((DenseVector) row.get(3)).toArray(), 0d);
    }
}
Author: flipkart-incubator, Project: spark-transformers, Lines: 43, Source: VectorAssemblerBridgeTest.java
Example 14: trainModel

import org.apache.spark.ml.feature.VectorAssembler; // import the required package/class
private static Transformer trainModel(SQLContext sqlContext, DataFrame train, String tokenizerOutputCol, boolean useCV) {
    train = getCommonFeatures(sqlContext, train, TOKENIZER_OUTPUT);

    // Normalize the common features
    VectorAssembler featuresForNorm = new VectorAssembler()
            .setInputCols(new String[] {"commonfeatures"})
            .setOutputCol("commonfeatures_norm");
    Normalizer norm = new Normalizer()
            .setInputCol(featuresForNorm.getOutputCol())
            .setOutputCol("norm_features");

    // TF-IDF features over the n-grams
    HashingTF hashingTF = new HashingTF()
            .setInputCol("ngrams")
            .setOutputCol("tf");
    IDF idf = new IDF()
            .setInputCol(hashingTF.getOutputCol())
            .setOutputCol("idf");

    // Learn a mapping from words to vectors
    Word2Vec word2Vec = new Word2Vec()
            .setInputCol(tokenizerOutputCol)
            .setOutputCol("w2v");

    // Assemble the selected feature columns into a single "features" column
    List<String> assemblerInput = new ArrayList<>();
    assemblerInput.add("commonfeatures");
//    assemblerInput.add(norm.getOutputCol());
//    assemblerInput.add(idf.getOutputCol());
//    assemblerInput.add(word2Vec.getOutputCol());
    assemblerInput.add(W2V_DB);
    VectorAssembler assembler = new VectorAssembler()
            .setInputCols(assemblerInput.toArray(new String[assemblerInput.size()]))
            .setOutputCol("features");

    LogisticRegression lr = new LogisticRegression();
//    int[] layers = new int[] {108, 10, 10, 2};
//    // create the trainer and set its parameters
//    MultilayerPerceptronClassifier perceptron = new MultilayerPerceptronClassifier()
//            .setLayers(layers)
//            .setBlockSize(128)
//            .setSeed(1234L)
//            .setMaxIter(100);
//            .setRegParam(0.03);
//            .setElasticNetParam(0.3);

    // ngramTransformer, hashingTF, idf,
    PipelineStage[] pipelineStages = new PipelineStage[] { /*hashingTF, idf, word2Vec,*/ w2vModel, /*featuresForNorm, norm, */assembler, lr};
    Pipeline pipeline = new Pipeline()
            .setStages(pipelineStages);
    stagesToString = ("commonfeatures_suff1x\t" + StringUtils.join(pipelineStages, "\t")).replaceAll("([A-Za-z]+)_[0-9A-Za-z]+", "$1");

    // We use a ParamGridBuilder to construct a grid of parameters to search over.
    // With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
    // this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
    ParamMap[] paramGrid = new ParamGridBuilder()
//            .addGrid(word2Vec.vectorSize(), new int[] {100, 500})
//            .addGrid(word2Vec.minCount(), new int[] {2, 3, 4})
//            .addGrid(ngramTransformer.n(), new int[] {2, 3})
//            .addGrid(hashingTF.numFeatures(), new int[] {1000, 2000})
            .addGrid(lr.maxIter(), new int[] {10})
//            .addGrid(lr.regParam(), new double[] {0.0, 0.1, 0.4, 0.8, 1, 3, 5, 10})
//            .addGrid(lr.fitIntercept())
//            .addGrid(lr.elasticNetParam(), new double[] {0.0, 0.2, 0.5, 0.8, 1.0} )
//            .addGrid(idf.minDocFreq(), new int[]{2, 4})
            .build();

    // Train with either a validation split or cross-validation
    Transformer model;
    if (!useCV) {
        model = trainWithValidationSplit(train, pipeline, paramGrid);
    } else {
        model = trainWithCrossValidation(train, pipeline, paramGrid);
    }

    return model;
}
Author: mhardalov, Project: news-credibility, Lines: 80, Source: NewsCredibilityMain.java
Example 15: getAssembler

import org.apache.spark.ml.feature.VectorAssembler; // import the required package/class
private static VectorAssembler getAssembler(String[] input, String output) {
    return new VectorAssembler().setInputCols(input).setOutputCol(output);
}
Author: deeplearning4j, Project: deeplearning4j, Lines: 4, Source: SparkDl4jNetworkTest.java
Note: The org.apache.spark.ml.feature.VectorAssembler class examples in this article were collected from source-code and documentation hosting platforms such as GitHub and MSDocs. The code snippets are drawn from open-source projects contributed by their respective developers; copyright remains with the original authors, and any redistribution or use should follow the license of the corresponding project. Do not republish without permission.