Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

R: Cutting a year of dates into 2 month bins yields 7 bins instead of 6?

I am trying to use the cut() function in R to divide a year of dates into six two-month bins. When I do, it makes seven bins instead of six, with the last bin being empty. I am using the following code:

dates <- seq(as.Date("2021-1-1"),as.Date("2021-12-31"),by="day")
months <- cut(dates,"month",labels=1:12)
table(months)
# months
#  1  2  3  4  5  6  7  8  9 10 11 12 
# 31 28 31 30 31 30 31 31 30 31 30 31 
sextiles <- cut(dates,"2 months",labels=1:6)
# Error in cut.default(unclass(x), unclass(breaks), labels = labels, right = right,  : 
#   lengths of 'breaks' and 'labels' differ
sextiles <- cut(dates,"2 months",labels=1:7)
table(sextiles)
# sextiles
#  1  2  3  4  5  6  7 
# 59 61 61 62 61 61  0 

The code works fine when I divide the year into single-month bins, but produces an error when I divide it into two-month bins unless I supply 7 labels instead of 6 in the labels argument. If I start removing dates from the end of the year, the code eventually works with 6 bins once the last 3 days of the year are gone:

dates_364 <- dates[-length(dates)]
sextiles <- cut(dates_364,"2 months",labels=1:6)
# Error in cut.default(unclass(x), unclass(breaks), labels = labels, right = right,  : 
#   lengths of 'breaks' and 'labels' differ
dates_363 <- dates_364[-length(dates_364)]
sextiles <- cut(dates_363,"2 months",labels=1:6)
# Error in cut.default(unclass(x), unclass(breaks), labels = labels, right = right,  : 
#   lengths of 'breaks' and 'labels' differ
dates_362 <- dates_363[-length(dates_363)]
sextiles <- cut(dates_362,"2 months",labels=1:6)
table(sextiles)
# sextiles
#  1  2  3  4  5  6 
# 59 61 61 62 61 58 

This seems like a bug in the function. Can anyone shed any light on something I'm missing? Thanks!

Question from: https://stackoverflow.com/questions/66057389/r-cutting-a-year-of-dates-into-2-month-bins-yields-7-bins-instead-of-6


1 Answer


There are two ways to define "bins" for a number range such that all provided numbers are within one of the bins:

  • find the minimum and the maximum and, because Date bins default to right=FALSE (right-open intervals), push the upper edge a little past the maximum; or
  • find the minimum and, instead of finding the maximum, use Inf as the upper edge so that the last bin always contains the largest values.

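cut.Date chose the first of the two. Further, instead of bumping the maximum out by 1 day, it bumps it out by the step: for "2 months", the end is pushed 31 * 2 = 62 days past the maximum and then snapped to the first day of that month. When the maximum sits close to the end of a bin, that padding is enough to reach one more 2-month edge beyond the one that already closes the data.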

Namely, if you look at the source for cut.Date:

        start <- as.POSIXlt(min(x, na.rm = TRUE))
# ...
            end <- as.POSIXlt(max(x, na.rm = TRUE))
# and then if 'months', then
            end <- as.POSIXlt(end + (31 * step * 86400))
# and eventually
            breaks <- as.Date(seq(start, end, breaks))

So I'll debug(cut.Date) and take a look at cut(dates, "2 months"):

start
# [1] "2021-01-01 UTC"
# debug: end <- as.POSIXlt(max(x, na.rm = TRUE))
# debug: step <- if (length(by2) == 2L) as.integer(by2[1L]) else 1L
end
# [1] "2021-12-31 UTC"
step
# [1] 2

# debug: as.integer(by2[1L])
# debug: end <- as.POSIXlt(end + (31 * step * 86400))
end
# [1] "2022-03-03 UTC"

# debug: end$mday <- 1L
# debug: end$isdst <- -1L
# debug: breaks <- as.Date(seq(start, end, breaks))
breaks
# [1] "2021-01-01" "2021-03-01" "2021-05-01" "2021-07-01" "2021-09-01" "2021-11-01" "2022-01-01"
# [8] "2022-03-01"

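Eight breaks define seven bins, which is why labels needs 7 entries: the seventh bin, from 2022-01-01 up to (but not including) 2022-03-01, contains none of your dates. (When labels is not supplied, cut.Date names the levels with breaks[-length(breaks)], i.e. the left edge of each bin, so you see seven levels rather than eight.) My guess is that there are corner cases (leap years, perhaps?) where the 31 * step * 86400 seconds of padding (or the equivalent for other by-units) do not always align perfectly, so they buffered it a little.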
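
To see the same thing without stepping through the debugger, here is a minimal sketch that mirrors the lines quoted above for month-based breaks. Note that month_breaks is a hypothetical helper name of my own, not part of base R, and it only imitates the "months" branch as excerpted here, so treat it as an illustration rather than the actual implementation.

```r
# Hypothetical helper: imitates the month branch of cut.Date quoted above.
month_breaks <- function(x, step) {
  start <- as.POSIXlt(min(x, na.rm = TRUE))
  start$mday <- 1L                             # snap start to the 1st of its month
  start$isdst <- -1L
  end <- as.POSIXlt(max(x, na.rm = TRUE))
  end <- as.POSIXlt(end + 31 * step * 86400)   # pad by 31 * step days (in seconds)
  end$mday <- 1L                               # snap padded end to the 1st of its month
  end$isdst <- -1L
  as.Date(seq(start, end, by = paste(step, "months")))
}

dates <- seq(as.Date("2021-01-01"), as.Date("2021-12-31"), by = "day")

# Full year: the padded end reaches March 2022, so there are 8 edges (7 bins).
length(month_breaks(dates, 2L))
# [1] 8

# Dropping the last 3 days: the padded end only reaches February 2022,
# so the sequence stops at 2022-01-01 and there are 7 edges (6 bins).
length(month_breaks(head(dates, 362), 2L))
# [1] 7
```

This lines up with the question: 2021-12-29 plus 62 days is 2022-03-01, which still produces the extra edge, while 2021-12-28 plus 62 days is 2022-02-28, which does not.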

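Long story short (too late), I suggest you use labels=FALSE instead. That returns plain integer bin codes, so there is no label vector whose length has to match the number of bins.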

sextiles <- cut(dates, "2 months", labels = FALSE)
table(sextiles)
# sextiles
#  1  2  3  4  5  6 
# 59 61 61 62 61 61 

If you want them to be integer-looking factors (which are string levels with true integers underneath), then perhaps:

sextiles <- factor(sextiles)
head(sextiles)
# [1] 1 1 1 1 1 1
# Levels: 1 2 3 4 5 6
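
If you would rather keep the labels argument, another option is to hand cut() the bin edges yourself as a Date vector, so that it does not have to derive (and pad) them. This is a sketch under the assumption that you know the year in advance; edges is just an illustrative variable name.

```r
dates <- seq(as.Date("2021-01-01"), as.Date("2021-12-31"), by = "day")

# Seven explicit edges define exactly six right-open bins; the last edge,
# 2022-01-01, closes the final bin without adding an empty one after it.
edges <- seq(as.Date("2021-01-01"), as.Date("2022-01-01"), by = "2 months")

sextiles <- cut(dates, breaks = edges, labels = 1:6)
table(sextiles)
# sextiles
#  1  2  3  4  5  6 
# 59 61 61 62 61 61 
```

The trade-off is that the edges are now hard-coded to the data's range, whereas labels=FALSE keeps working when the dates change.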


...