Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
258 views
in Technique[技术] by (71.8m points)

mysql - mutate new column based on sequence of time stamp in dbplyr

I have a dataset from a PostgreSQL DB, which I have two columns:


id = the ID of a participant

time_stamp = the time stamp of the recorded measurement


I need to work with dbplyr in order to mutate a new column based on the sequence of time_stamp. In other words, if the time_stamp is in sequence (meaning that it is in one minute intervals), this is recognized as one event.

For example, this is my dataset:

library(dplyr)
library(dbplyr)
library(lubridate)

mf <- memdb_frame(
  id = "id001", 
  time_stamp = c(
    seq(from = as_datetime("2021-01-01 08:00:00"), to = as_datetime("2021-01-01 08:03:00"), by = "1 min"),
    seq(from = as_datetime("2021-01-01 08:05:00"), to = as_datetime("2021-01-01 08:08:00"), by = "1 min"),
    seq(from = as_datetime("2021-01-01 08:12:00"), to = as_datetime("2021-01-01 08:18:00"), by = "1 min")
  )
)

mf %>% 
  collect() %>% 
  mutate(time_stamp = as_datetime(time_stamp))

#> # A tibble: 15 x 2
#>    id    time_stamp         
#>    <chr> <dttm>             
#>  1 id001 2021-01-01 08:00:00
#>  2 id001 2021-01-01 08:01:00
#>  3 id001 2021-01-01 08:02:00
#>  4 id001 2021-01-01 08:03:00
#>  5 id001 2021-01-01 08:05:00
#>  6 id001 2021-01-01 08:06:00
#>  7 id001 2021-01-01 08:07:00
#>  8 id001 2021-01-01 08:08:00
#>  9 id001 2021-01-01 08:12:00
#> 10 id001 2021-01-01 08:13:00
#> 11 id001 2021-01-01 08:14:00
#> 12 id001 2021-01-01 08:15:00
#> 13 id001 2021-01-01 08:16:00
#> 14 id001 2021-01-01 08:17:00
#> 15 id001 2021-01-01 08:18:00

Now, I need to recognize the events. That means, find time_stamps that happened in sequence (sequence = 1 minute interval). As an example, this would be my expected output:

#> # A tibble: 15 x 3
#>    id    time_stamp          events 
#>    <chr> <dttm>              <chr>  
#>  1 id001 2021-01-01 08:00:00 event_1
#>  2 id001 2021-01-01 08:01:00 event_1
#>  3 id001 2021-01-01 08:02:00 event_1
#>  4 id001 2021-01-01 08:03:00 event_1
#>  5 id001 2021-01-01 08:05:00 event_2
#>  6 id001 2021-01-01 08:06:00 event_2
#>  7 id001 2021-01-01 08:07:00 event_2
#>  8 id001 2021-01-01 08:08:00 event_2
#>  9 id001 2021-01-01 08:12:00 event_3
#> 10 id001 2021-01-01 08:13:00 event_3
#> 11 id001 2021-01-01 08:14:00 event_3
#> 12 id001 2021-01-01 08:15:00 event_3
#> 13 id001 2021-01-01 08:16:00 event_3
#> 14 id001 2021-01-01 08:17:00 event_3
#> 15 id001 2021-01-01 08:18:00 event_3

Note that from row 4 to 5 there was an interval of 2 minutes – causing the next event to start. Same thing from row 8 to 9: there was an interval of 4 minutes and then the next event started.

PS: I need it to work fully in dbplyr, that is: without using collect()

Any ideas would be very appreciated!

Thank you!

question from:https://stackoverflow.com/questions/65835364/mutate-new-column-based-on-sequence-of-time-stamp-in-dbplyr

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Working without collect means we are limited to mostly dplyr functions as these are the functions for which SQL translations are defined.

Very similar to @Sinh_Nguyen, I would propose the following:

output = mf %>%
  group_by(id) %>%
  arrange(time_stamp) %>%
  mutate(prev_time_stamp = lag(time_stamp, 1)) %>%
  mutate(hours_diff = DATEPART('hour', time_stamp - prev_time_stamp),
         min_part_diff = DATEPART('minute', time_stamp - prev_time_stamp)) %>%
  mutate(gap = hours_diff * 60 + min_part_diff) %>%
  mutate(is_gap = ifelse(is.na(prev_time_stamp) | gap == 1, 0, 1)) %>%
  mutate(event_index = cumsum(is_gap))

Notes:

  • The group_by and arrange happen once at the start, but are used implicitly by the lag and cumsum functions.
  • If dbplyr does not have a translation defined then it passes the command as-is. Writing DATEPART in capitals ensures it is not translated, so we get the PostgreSQL DATEPART function. I usually use SQL server, so I am following these examples for how to calculate the difference in PostgreSQL.
  • The idea of the last two lines is to create a binary indicator when a record is not a continuation of the previous event. Summing these indicators increments the event counter for each new event.
  • If there are errors, you can use show_query(output) to view the SQL translation. Viewing / sharing the SQL translation often helps for trouble shooting.

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...