

hadoop - How can we solve a two-sum algorithm as a big data problem, leveraging MapReduce or Spark?

Assume the list/array of numbers is stored in a very large data file and we need to find the pairs whose sum matches a particular number 'k'. I know how to solve this with the usual in-memory data structures, but I cannot think of an approach that leverages Hadoop MapReduce or Spark in particular.

Assume a file containing 1, 2, 3, 6, 7, 7, 8, 9. My thought process: load the data into a dataframe and then add one more column that identifies the difference, i.e. if i <= k/2 then k - i else i. The dataframe for the data above would then hold each number next to that derived value.
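A minimal sketch of that idea with the Spark Java DataFrame API (assumptions: the numbers sit one per line in a hypothetical numbers.txt, the target k is 15, and the pairKey column name and app name are only illustrative; the corner case where two equal values of k/2 form a pair is left out):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.countDistinct;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.when;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

long k = 15; // hypothetical target sum
SparkSession spark = SparkSession.builder().appName("TwoSumSketch").master("local[*]").getOrCreate();
// One number per line, cast to a long column named "number".
Dataset<Row> numbers = spark.read()
        .text("numbers.txt")
        .withColumnRenamed("value", "number")
        .withColumn("number", col("number").cast(DataTypes.LongType));
// Map each number i to a shared pair key: k - i when i <= k/2, otherwise i itself,
// so both members of a candidate pair land under the same key.
Dataset<Row> keyed = numbers.withColumn("pairKey",
        when(col("number").leq(k / 2), lit(k).minus(col("number")))
                .otherwise(col("number")));
// A pair exists for a key once both i and k - i have been seen under it.
keyed.groupBy("pairKey")
        .agg(countDistinct("number").alias("distinctMembers"))
        .filter(col("distinctMembers").geq(2))
        .show();

Grouping on the derived key needs only one shuffle of the data, so it avoids comparing every number against every other one.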



1 Answer


Imagine you have a file named numbers.txt like the following:

10
5
8
7
3
6
9
11
3
1

You can achieve your goal like this:

import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

int desiredSum = 15;
SparkSession spark = SparkSession
        .builder()
        .appName("My App")
        .master("local[*]")
        .getOrCreate();
// Read each line of the file and cast it to a long column named "number".
Dataset<Row> numbers = spark
        .read()
        .text("numbers.txt")
        .withColumnRenamed("value", "number")
        .withColumn("number", col("number").cast(DataTypes.LongType));
numbers.createOrReplaceTempView("myTable");
// Self-join on the sum condition; first.number <= second.number keeps each pair only once.
spark.sql("select first.number, second.number as number_2 from myTable first "
        + "inner join myTable second on first.number + second.number = " + desiredSum
        + " where first.number <= second.number").show();



+------+--------+
|number|number_2|
+------+--------+
|     5|      10|
|     7|       8|
|     6|       9|
+------+--------+

Or, if the data is small, you can achieve your goal using a Cartesian product (cross join) in Spark like this:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.udf;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;

int desiredSum = 15;
SparkSession spark = SparkSession
        .builder()
        .appName("My App")
        .master("local[*]")
        .getOrCreate();
Dataset<Row> numbers = spark
        .read()
        .text("numbers.txt")
        .withColumnRenamed("value", "number")
        .withColumn("number", col("number").cast(DataTypes.LongType));
// Cross join the dataset with itself; number <= number_2 keeps each pair only once.
Dataset<Row> joined = numbers
        .crossJoin(numbers.withColumnRenamed("number", "number_2"))
        .filter("number <= number_2");
// UDF that adds the two columns, then keep only the rows whose sum equals the target.
UserDefinedFunction sumUdf = udf((UDF2<Long, Long, Object>) Long::sum, DataTypes.LongType);
joined = joined.withColumn("sum", sumUdf.apply(col("number"), col("number_2")))
        .filter("sum = " + desiredSum);
joined.show();

which results in this:

+------+--------+---+
|number|number_2|sum|
+------+--------+---+
|     5|      10| 15|
|     7|       8| 15|
|     6|       9| 15|
+------+--------+---+


**Take into account the time and space complexity when you use a cross join.**
