Isn't factor just integer which should be easier to do counting sort than character?
Yes, if you're given a factor already. But the time to create that factor can be significant, and that's what setkey (and ad hoc by) aim to beat. Try timing factor() on a randomly ordered character vector, say 1e6 long with 1e4 levels. Then compare to setkey or ad hoc by on the original randomly ordered character vector.
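A minimal sketch of that timing comparison, assuming data.table is installed. The level names, the column names x and v, and the seed are made up for illustration; actual timings depend on the machine and the data.table version, so none are quoted here.

```r
library(data.table)

set.seed(1)
n   <- 1e6
lev <- sprintf("id%05d", 1:1e4)       # 1e4 distinct strings (hypothetical names)
x   <- sample(lev, n, replace = TRUE) # randomly ordered character vector

# Cost of creating the factor from the character vector
t_factor <- system.time(f <- factor(x))

# setkey directly on the original character column
DT <- data.table(x = x, v = 1L)
t_setkey <- system.time(setkey(DT, x))

# ad hoc by on a fresh, unkeyed copy of the same data
DT2 <- data.table(x = x, v = 1L)
t_by <- system.time(res <- DT2[, sum(v), by = x])

# Elapsed seconds for each approach
print(c(factor = t_factor[["elapsed"]],
        setkey = t_setkey[["elapsed"]],
        by     = t_by[["elapsed"]]))
```

Running each timing a few times gives a fairer picture, since the first call can include one-off overhead.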
agstudy's comment is correct too; i.e., character vectors (being pointers to R's cached strings) are quite similar to factors anyway. On 32-bit systems a character vector is the same size as the factor's integer vector, but the factor also has the levels attribute to store (and sometimes copy). On 64-bit systems the pointers are twice as big. On the other hand, R's string cache can be looked up directly from the character vector's pointers, whereas the factor needs an extra hop via levels. (The levels attribute is itself a character vector of R string cache pointers.)
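The size and "extra hop" points can be checked with a rough sketch like the one below. The variable names are illustrative, and object.size() is only an estimate that also counts the strings themselves, so treat the numbers as indicative.

```r
set.seed(1)
x <- sample(sprintf("id%05d", 1:1e4), 1e6, replace = TRUE)
f <- factor(x)

# Pointer width in bytes on this build of R
ptr_bytes <- .Machine$sizeof.pointer

# Estimated memory footprint of each representation
size_chr <- object.size(x)
size_fac <- object.size(f)   # integer codes plus the levels attribute
print(ptr_bytes)
print(size_chr)
print(size_fac)

# The extra hop: a factor's strings are reached through levels
via_levels <- levels(f)[as.integer(f)]
print(identical(via_levels, x))
```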