I definitely agree that this isn't intuitive (I made that point before on SO). In defense of R, I think that knowing when you have a missing value is useful (i.e. this is not a bug). The ==
operator is explicitly designed to notify the user of NA or NaN values. See ?"==" for more information. It states:
Missing values ('NA') and 'NaN' values are regarded as
non-comparable even to themselves, so comparisons involving them
will always result in 'NA'.
In other words, a missing value isn't comparable using a binary operator (because it's unknown).
Beyond is.na(), you could also do:
which(a$col2==2) # tests explicitly for TRUE
Or
a$col2 %in% 2 # only checks for 2
%in% is defined as using the match()
function:
'"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'
This is also covered in "The R Inferno".
Checking for NA values in your data is crucial in R, because many important operators don't handle it the way you expect. Beyond ==, this is also true for things like &, |, <, sum(), and so on. I am always thinking "what would happen if there was an NA here" when I'm writing R code. Requiring an R user to be careful with missing values is "by design".
Update: How is NA handled when there are multiple logical conditions?
NA
is a logical constant and you might get unexpected subsetting if you don't think about what might be returned (e.g. NA | TRUE == TRUE
). These truth tables from ?Logic
may provide a useful illustration:
outer(x, x, "&") ## AND table
# <NA> FALSE TRUE
#<NA> NA FALSE NA
#FALSE FALSE FALSE FALSE
#TRUE NA FALSE TRUE
outer(x, x, "|") ## OR table
# <NA> FALSE TRUE
#<NA> NA NA TRUE
#FALSE NA FALSE TRUE
#TRUE TRUE TRUE TRUE
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…