Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
r - Sparklyr: Use group_by and then concatenate strings from rows in a group

I am trying to use the group_by() and mutate() functions in sparklyr to concatenate rows in a group.

Here is a simple example that I think should work but doesn't:

library(sparklyr)
library(dplyr)
# sc is an existing Spark connection, e.g. sc <- spark_connect(master = "local")

d <- data.frame(id = c("1", "1", "2", "2", "1", "2"),
                x = c("200", "200", "200", "201", "201", "201"),
                y = c("This", "That", "The", "Other", "End", "End"))
d_sdf <- copy_to(sc, d, "d")
d_sdf %>% group_by(id, x) %>% mutate(y = paste(y, collapse = " "))

What I'd like it to produce is:

Source: local data frame [6 x 3]
Groups: id, x [4]

# A tibble: 6 x 3
      id      x         y
  <fctr> <fctr>     <chr>
1      1    200 This That
2      1    200 This That
3      2    200       The
4      2    201 Other End
5      1    201       End
6      2    201 Other End

I get the following error:

Error: org.apache.spark.sql.AnalysisException: missing ) at 'AS' near '' '' in selection target; line 1 pos 42

Note that the same code works fine on an ordinary data.frame:

d %>% group_by(id, x) %>% mutate( y = paste(y, collapse = " "))
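For comparison, the same group-wise collapse can be done locally in base R with `ave()`, which applies a function within each group and recycles the result across the group's rows (a local sketch only, not something sparklyr can translate to Spark SQL):

```r
# Local base-R analogue of the desired mutate(): ave() collapses y within
# each (id, x) group and repeats the concatenated string on every row.
d <- data.frame(id = c("1", "1", "2", "2", "1", "2"),
                x  = c("200", "200", "200", "201", "201", "201"),
                y  = c("This", "That", "The", "Other", "End", "End"),
                stringsAsFactors = FALSE)
d$y_collapsed <- ave(d$y, d$id, d$x,
                     FUN = function(s) paste(s, collapse = " "))
d$y_collapsed
# "This That" "This That" "The" "Other End" "End" "Other End"
```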
1 Answer


Spark SQL doesn't allow aggregate functions outside of an aggregation, which is why this works in dplyr on an ordinary data.frame but not on a Spark DataFrame: sparklyr translates your commands into a SQL statement. You can see where it goes wrong in the second part of the error message:

== SQL ==
SELECT `id`, `x`, CONCAT_WS(' ', `y`, ' ' AS "collapse") AS `y`

paste is translated to CONCAT_WS (concat, by contrast, would concatenate columns together). The closer equivalents are collect_list and collect_set, but they produce list outputs. You can build on those, though:

If you do not want the same row replicated in your result, you can use summarise with collect_list and paste:

res <- d_sdf %>%
  group_by(id, x) %>%
  summarise(yconcat = paste(collect_list(y)))

result:

Source:     lazy query [?? x 3]
Database:   spark connection master=local[8] app=sparklyr local=TRUE
Grouped by: id

     id     x   yconcat
  <chr> <chr>     <chr>
1     1   201       End
2     2   201 Other End
3     1   200 This That
4     2   200       The

You can join this back onto your original data if you do want the rows replicated:

d_sdf %>% left_join(res, by = c("id", "x"))

result:

Source:     lazy query [?? x 4]
Database:   spark connection master=local[8] app=sparklyr local=TRUE

     id     x     y   yconcat
  <chr> <chr> <chr>     <chr>
1     1   200  This This That
2     1   200  That This That
3     2   200   The       The
4     2   201 Other Other End
5     1   201   End       End
6     2   201   End Other End
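The summarise-then-join pattern can be mirrored locally in base R with `aggregate()` and `merge()`, which is handy for sanity-checking the expected result without a Spark connection (a sketch only; `merge()` sorts rows by the join keys, so row order differs from the Spark output):

```r
d <- data.frame(id = c("1", "1", "2", "2", "1", "2"),
                x  = c("200", "200", "200", "201", "201", "201"),
                y  = c("This", "That", "The", "Other", "End", "End"),
                stringsAsFactors = FALSE)

# One concatenated string per (id, x) group, like summarise(collect_list).
agg <- aggregate(y ~ id + x, data = d,
                 FUN = function(s) paste(s, collapse = " "))
names(agg)[names(agg) == "y"] <- "yconcat"

# Join back onto the original rows, like the left_join step above.
res <- merge(d, agg, by = c("id", "x"))
```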
