I have a large data.table (about 24000 rows and growing). I want to subset that datatable based on a couple of criteria and from that subset (ends up being about 3000 rows) I want to randomly sample just 4 rows. I do not want to create a named 3000 or so row data.table, count its rows and then sample based on row number. How can I do it on the fly? Or should I just suck it up by creating the table and then working on it, sampling it and then using rm()
to get rid of it?
Lets simulate my issue
require(data.table)
random.length <- sample(x = 15:30, size = 1)
data.table(city=sample(c("Cape Town", "New York", "Pittsburgh", "Tel Aviv", "Amsterdam"), size=random.length, replace = TRUE), score = sample(x=1:10, size = random.length, replace=TRUE))
That makes a random length table, which simulates the fact that depending on my criteria and depending on my starting table, I do not know what the length of the subsetted table with be
Now, if I just wanted the first three rows I could do as so
data.table(city=sample(c("Cape Town", "New York", "Pittsburgh", "Tel Aviv", "Amsterdam"), size=random.length, replace = TRUE), score = sample(x=1:10, size = random.length, replace=TRUE))[1:3]
But let us say I did not want the first three rows but rather a random 3 rows, then I would want to do something such as this...
data.table(city=sample(c("Cape Town", "New York", "Pittsburgh", "Tel Aviv", "Amsterdam"), size=random.length, replace = TRUE), score = sample(x=1:10, size = random.length, replace=TRUE))[sample(x= 1:number of rows of that previous data.table,size = 3 ]
That will not work. How do I compute, on the fly, what the length of the initial data.frame was?
See Question&Answers more detail:
os