Imagine you have a file named numbers.txt with the following content:
10
5
8
7
3
6
9
11
3
1
You can achieve your goal like this:
import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

int desiredSum = 15;
SparkSession spark = SparkSession
    .builder()
    .appName("My App")
    .master("local[*]")
    .getOrCreate();
// Read each line as text, then cast it to a long column named "number"
Dataset<Row> df = spark
    .read()
    .text("numbers.txt")
    .withColumnRenamed("value", "number")
    .withColumn("number", col("number").cast(DataTypes.LongType));
df.createOrReplaceTempView("myTable");
// Self-join on the sum condition; first.number <= second.number keeps each pair only once
spark.sql("select first.number, second.number as number_2"
    + " from myTable first inner join myTable second"
    + " on first.number + second.number = " + desiredSum
    + " where first.number <= second.number").show();
+------+--------+
|number|number_2|
+------+--------+
| 5| 10|
| 7| 8|
| 6| 9|
+------+--------+
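If you prefer the DataFrame API over SQL, the same self-join can be expressed directly against the Dataset (a minimal sketch, assuming the df Dataset and desiredSum from the snippet above are in scope; the variable name pairs is just for illustration):

import static org.apache.spark.sql.functions.col;

// Same self-join expressed with the DataFrame API instead of SQL
Dataset<Row> pairs = df.as("first")
    .join(df.as("second"),
        col("first.number").plus(col("second.number")).equalTo(desiredSum))
    .filter(col("first.number").leq(col("second.number")))
    .select(col("first.number"), col("second.number").as("number_2"));
pairs.show();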
Or, if the data is small, you can achieve your goal with a Cartesian product (cross join) in Spark, like this:
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.udf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;

int desiredSum = 15;
SparkSession spark = SparkSession
    .builder()
    .appName("My App")
    .master("local[*]")
    .getOrCreate();
Dataset<Row> df = spark
    .read()
    .text("numbers.txt")
    .withColumnRenamed("value", "number")
    .withColumn("number", col("number").cast(DataTypes.LongType));
// Pair every row with every other row, keeping each pair only once
Dataset<Row> joinedDf = df.crossJoin(df.withColumnRenamed("number", "number_2"))
    .filter("number <= number_2");
// UDF that adds the two numbers of a pair
UserDefinedFunction sumUdf = udf((UDF2<Long, Long, Long>) Long::sum, DataTypes.LongType);
joinedDf = joinedDf.withColumn("sum", sumUdf.apply(col("number"), col("number_2")))
    .filter("sum = " + desiredSum);
joinedDf.show();
which results in:
+------+--------+---+
|number|number_2|sum|
+------+--------+---+
| 5| 10| 15|
| 7| 8| 15|
| 6| 9| 15|
+------+--------+---+
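As a side note, the UDF above is only there to show the mechanism; the sum column can also be computed with built-in column arithmetic, which avoids the UDF call overhead and lets Catalyst optimize the expression (a minimal sketch, reusing the df Dataset and desiredSum from above):

// Same cross-join result without a UDF, using built-in column arithmetic
Dataset<Row> pairs2 = df.crossJoin(df.withColumnRenamed("number", "number_2"))
    .filter("number <= number_2")
    .withColumn("sum", col("number").plus(col("number_2")))
    .filter(col("sum").equalTo(desiredSum));
pairs2.show();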
**Take the time and space complexity into account when you use a cross join: on n input rows it materializes on the order of n² candidate pairs before filtering.**