Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
912 views
in Technique[技术] by (71.8m points)

r - How to update `lm` or `glm` model on same subset of data?

I am trying to fit two nested models and then test those against each other using anova function. The commands used are:

probit <- glm(grad ~ afqt1 + fhgc + mhgc + hisp + black + male, data=dt, 
    family=binomial(link = "probit"))
nprobit <- update(probit, . ~ . - afqt1)
anova(nprobit, probit, test="Rao")

However, the variable afqt1 apparently contains NAs and because the update call does not take the same subset of data, anova() returns error

Error in anova.glmlist(c(list(object), dotargs), dispersion = dispersion, : models were not all fitted to the same size of dataset

Is there a simple way how to achieve refitting the model on the same dataset as the original model?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

As suggested in the comments, a straightforward approach to this is to use the model data from the first fit (e.g. probit) and update's ability to overwrite arguments from the original call.

Here's a reproducible example:

data(mtcars)
mtcars[1,2] <- NA
nobs( xa <- lm(mpg~cyl+disp, mtcars) ) 
## [1] 31
nobs( update(xa, .~.-cyl) )  ##not nested
## [1] 32
nobs( xb <- update(xa, .~.-cyl, data=xa$model) )  ##nested
## [1] 31

It is easy enough to define a convenience wrapper around this:

update_nested <- function(object, formula., ..., evaluate = TRUE){
    update(object = object, formula. = formula., data = object$model, ..., evaluate = evaluate)
}

This forces the data argument of the updated call to re-use the data from the first model fit.

nobs( xc <- update_nested(xa, .~.-cyl) )
## [1] 31
all.equal(xb, xc)  ##only the `call` component will be different
## [1] "Component “call”: target, current do not match when deparsed"
identical(xb[-10], xc[-10])
## [1] TRUE

So now you can easily do anova:

anova(xa, xc)
## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + disp
## Model 2: mpg ~ disp
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     28 269.97                              
## 2     29 312.96 -1   -42.988 4.4584 0.04378 *
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The other approach suggested is na.omit on the data frame prior to the lm() call. At first I thought this would be impractical when dealing with a big data frame (e.g. 1000 cols) and with a large number of vars in the various specifications (e.g ~15 vars), but not because of speed. This approach requires manual bookkeeping of which vars should be sanitized of NAs and which shouldn't, and is precisely what the OP seems intent to avoid. The biggest drawback would be that you must always keep in sync the formula with the subsetted data frame.

This however can be overcome rather easily, as it turns out:

data(mtcars)
for(i in 1:ncol(mtcars)) mtcars[i,i] <- NA
nobs( xa <- lm(mpg~cyl + disp + hp + drat + wt + qsec + vs + am + gear + 
                    carb, mtcars) ) 
## [1] 21
nobs( xb <- update(xa, .~.-cyl) )  ##not nested
## [1] 22
nobs( xb <- update_nested(xa, .~.-cyl) )  ##nested
## [1] 21
nobs( xc <- update(xa, .~.-cyl, data=na.omit(mtcars[ , all.vars(formula(xa))])) )  ##nested
## [1] 21
all.equal(xb, xc)
## [1] "Component “call”: target, current do not match when deparsed"
identical(xb[-10], xc[-10])
## [1] TRUE

anova(xa, xc)
## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     10 104.08                           
## 2     11 104.42 -1  -0.34511 0.0332 0.8591

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...