r - predict.lm() with an unknown factor level in test data

Question

Welcome To Ask or Share your Answers For Others

r - predict.lm() with an unknown factor level in test data

asked Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

r - predict.lm() with an unknown factor level in test data

I am fitting a model to factor data and predicting. If the newdata in predict.lm() contains a single factor level that is unknown to the model, all of predict.lm() fails and returns an error.

Is there a good way to have predict.lm() return a prediction for those factor levels the model knows and NA for unknown factor levels, instead of only an error?

Example code:

foo <- data.frame(response=rnorm(3),predictor=as.factor(c("A","B","C")))
model <- lm(response~predictor,foo)
foo.new <- data.frame(predictor=as.factor(c("A","B","C","D")))
predict(model,newdata=foo.new)

I would like the very last command to return three "real" predictions corresponding to factor levels "A", "B" and "C" and an NA corresponding to the unknown level "D".

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Answer

深蓝 · Answer 1 · 2021-10-17T00:06:55+0000

You have to remove the extra levels before any calculation, like:

> id <- which(!(foo.new$predictor %in% levels(foo$predictor)))
> foo.new$predictor[id] <- NA
> predict(model,newdata=foo.new)
         1          2          3          4 
-0.1676941 -0.6454521  0.4524391         NA

This is a more general way of doing it, it will set all levels that do not occur in the original data to NA. As Hadley mentioned in the comments, they could have chosen to include this in the predict() function, but they didn't

Why you have to do that becomes obvious if you look at the calculation itself. Internally, the predictions are calculated as :

model.matrix(~predictor,data=foo) %*% coef(model)
        [,1]
1 -0.1676941
2 -0.6454521
3  0.4524391

At the bottom you have both model matrices. You see that the one for foo.new has an extra column, so you can't use the matrix calculation any more. If you would use the new dataset to model, you would also get a different model, being one with an extra dummy variable for the extra level.

> model.matrix(~predictor,data=foo)
  (Intercept) predictorB predictorC
1           1          0          0
2           1          1          0
3           1          0          1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$predictor
[1] "contr.treatment"

> model.matrix(~predictor,data=foo.new)
  (Intercept) predictorB predictorC predictorD
1           1          0          0          0
2           1          1          0          0
3           1          0          1          0
4           1          0          0          1
attr(,"assign")
[1] 0 1 1 1
attr(,"contrasts")
attr(,"contrasts")$predictor
[1] "contr.treatment"

You can't just delete the last column from the model matrix either, because even if you do that, both other levels are still influenced. The code for level A would be (0,0). For B this is (1,0), for C this (0,1) ... and for D it is again (0,0)! So your model would assume that A and D are the same level if it would naively drop the last dummy variable.

On a more theoretical part: It is possible to build a model without having all the levels. Now, as I tried to explain before, that model is only valid for the levels you used when building the model. If you come across new levels, you have to build a new model to include the extra information. If you don't do that, the only thing you can do is delete the extra levels from the dataset. But then you basically lose all information that was contained in it, so it's generally not considered good practice.

Categories

r - predict.lm() with an unknown factor level in test data

r - predict.lm() with an unknown factor level in test data

Please log in or register to add a comment.

Please log in or register to answer this question.

1 Answer

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags