Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
403 views
in Technique[技术] by (71.8m points)

python - Get the max value for each key in a Spark RDD

What is the best way to return the max row (value) associated with each unique key in a spark RDD?

I'm using python and I've tried Math max, mapping and reducing by keys and aggregates. Is there an efficient way to do this? Possibly an UDF?

I have in RDD format:

[(v, 3),
 (v, 1),
 (v, 1),
 (w, 7),
 (w, 1),
 (x, 3),
 (y, 1),
 (y, 1),
 (y, 2),
 (y, 3)]

And I need to return:

[(v, 3),
 (w, 7),
 (x, 3),
 (y, 3)]

Ties can return the first value or random.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Actually you have a PairRDD. One of the best ways to do it is with reduceByKey:

(Scala)

val grouped = rdd.reduceByKey(math.max(_, _))

(Python)

grouped = rdd.reduceByKey(max)

(Java 7)

JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
    new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer v1, Integer v2) {
            return Math.max(v1, v2);
    }
});

(Java 8)

JavaPairRDD<String, Integer> grouped = new JavaPairRDD(rdd).reduceByKey(
    (v1, v2) -> Math.max(v1, v2)
);

API doc for reduceByKey:


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...