I am trying to fit a fixed-effects linear regression in R. My data looks like this:
dte yr id v1 v2
. . . . .
. . . . .
. . . . .
I then decided to simply do this by making yr a factor and using lm:
lm(v1 ~ factor(yr) + v2 - 1, data = df)
However, this seems to run out of memory. There are 20 levels in the factor, and df has 14 million rows and takes about 2 GB to store. I am running this on a machine with 22 GB dedicated to this process.
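For scale, here is a back-of-the-envelope sketch of how large one dense design matrix for this formula would be. It assumes the matrix is stored as 8-byte doubles with one column per year level plus v2, and it does not count any additional copies lm may make internally.

```r
# Rough size of one dense design matrix for v1 ~ factor(yr) + v2 - 1
n_rows <- 14e6      # rows in df, as stated above
n_cols <- 20 + 1    # 20 year dummies plus v2
bytes  <- n_rows * n_cols * 8   # assumption: 8 bytes per double

bytes / 2^30        # size in GiB, roughly 2.2
```

So a single copy of the design matrix is about the same size as df itself, and each extra copy made during fitting would add a similar amount.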
I then decided to try things the old-fashioned way: create dummy variables t1 to t20, one for each of my years, by doing:
df$t1 <- 1*(df$yr==1)
df$t2 <- 1*(df$yr==2)
df$t3 <- 1*(df$yr==3)
...
and simply compute, where x is the matrix holding the dummy columns and v2, and y is v1:
solve(crossprod(x), crossprod(x,y))
This runs without a problem and produces the answer almost right away.
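The dummy-variable approach above can be sketched end to end on a small synthetic data set. The data here is made up purely for illustration (the real df has 14 million rows), and the construction of x and y is an assumption about how they were built, since the post does not show it.

```r
# Small synthetic stand-in for df (hypothetical data, for illustration only)
set.seed(1)
n  <- 1000
df <- data.frame(yr = sample(1:20, n, replace = TRUE), v2 = rnorm(n))
df$v1 <- 0.5 * df$yr + 2 * df$v2 + rnorm(n)

# One 0/1 dummy column per year, named t1 ... t20
for (k in 1:20) df[[paste0("t", k)]] <- 1 * (df$yr == k)

# Design matrix (20 dummies plus v2, no intercept) and response
x <- as.matrix(df[, c(paste0("t", 1:20), "v2")])
y <- df$v1

# Solve the normal equations (X'X) b = X'y
beta <- solve(crossprod(x), crossprod(x, y))

# Cross-check against lm on the small sample
fit <- lm(v1 ~ factor(yr) + v2 - 1, data = df)
all.equal(unname(drop(beta)), unname(coef(fit)))
```

Note that crossprod(x) is only a 21 x 21 matrix, so once x exists the solve step itself needs almost no extra memory.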
I am specifically curious what it is about lm that makes it run out of memory when I can compute the coefficients just fine this way. Thanks.