I want to create a new variable that is equal to the value of one of two other variables, conditional on the values of still other variables. Here's a toy example with fake data.
Each row of the data frame represents a student. Each student can be studying up to two subjects (subj1 and subj2), and can be pursuing a degree ("BA") or a minor ("MN") in each subject. My real data includes thousands of students, several types of degree, about 50 subjects, and students can have up to five majors/minors.
  ID subj1 degree1 subj2 degree2
1  1   BUS      BA  <NA>    <NA>
2  2   SCI      BA   ENG      BA
3  3   BUS      MN   ENG      BA
4  4   SCI      MN   BUS      BA
5  5   ENG      BA   BUS      MN
6  6   SCI      MN  <NA>    <NA>
7  7   ENG      MN   SCI      BA
8  8   BUS      BA   ENG      MN
...
Now I want to create a sixth variable, df$major, that equals the value of subj1 if subj1 is the student's primary major, or the value of subj2 if subj2 is the primary major. The primary major is the first subject with degree equal to "BA". I tried the following code:
df$major[df$degree1 == "BA"] = df$subj1
df$major[df$degree1 != "BA" & df$degree2 == "BA"] = df$subj2
Unfortunately, I got an error message:
> df$major[df$degree1 == "BA"] = df$subj1
Error in df$major[df$degree1 == "BA"] = df$subj1 :
NAs are not allowed in subscripted assignments
I assume this means that a vectorized assignment can't be used if the assignment evaluates to NA for at least one row.
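To check that assumption, here is a minimal sketch on toy vectors rather than on the data frame. The names degree1, subj1, idx, major, and flag are made up for illustration, and the NA stands in for a student with no first degree recorded. As far as I can tell, the error appears when the logical index contains NA and the replacement has more than one element, while a single replacement value goes through:

```r
# Toy vectors standing in for df$degree1 and df$subj1 (hypothetical values).
degree1 <- c("BA", NA, "MN", "BA")
subj1   <- c("BUS", NA, "SCI", "ENG")

# Comparing NA with "BA" yields NA, so the logical index is TRUE, NA, FALSE, TRUE.
idx <- degree1 == "BA"
print(idx)

# Multi-element replacement with an NA in the index: R raises an error.
major <- rep(NA_character_, length(degree1))
err <- tryCatch({
  major[idx] <- subj1
  "no error"
}, error = function(e) conditionMessage(e))
print(err)

# Length-one replacement with the same index: allowed, NA positions are skipped.
flag <- rep("other", length(degree1))
flag[idx] <- "BA first"
print(flag)
```

So the NA in the index, not in the assigned values, seems to be what R objects to.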
I feel like I must be missing something basic here, but the code above seemed like the obvious thing to do and I haven't been able to come up with an alternative.
In case it would be helpful in writing an answer, here's sample data, created using dput(), in the same format as the fake data listed above:
structure(list(ID = 1:20, subj1 = structure(c(3L, NA, 1L, 2L,
2L, 3L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 3L, 3L, 1L, 2L, 1L
), .Label = c("BUS", "ENG", "SCI"), class = "factor"), degree1 = structure(c(2L,
NA, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), .Label = c("BA", "MN"), class = "factor"), subj2 = structure(c(1L,
2L, NA, NA, 1L, NA, 3L, 2L, NA, 2L, 2L, 1L, 3L, NA, 2L, 1L, 1L,
NA, 2L, 2L), .Label = c("BUS", "ENG", "SCI"), class = "factor"),
degree2 = structure(c(2L, 2L, NA, NA, 2L, NA, 1L, 2L, NA,
2L, 1L, 1L, 2L, NA, 1L, 2L, 2L, NA, 1L, 2L), .Label = c("BA",
"MN"), class = "factor")), .Names = c("ID", "subj1", "degree1",
"subj2", "degree2"), row.names = c(NA, -20L), class = "data.frame")