How to recode some dataframe values to NA if they don't appear in a separate vector?
More specifically, how to approach such a task when:
- each data column to clean has its own set of "valid" values to keep, independent of other columns
- the column-specific values are given in a separate table (as vectors nested in a list-column of a tibble)
Example
- My data to clean up is my_mtcars
- I want to clean up certain columns (cars, gear, and carb)
- In each of those columns, I want to keep only the values specified in a separate table, table_valid_values, under valid_values. Any value not specified as "valid" should turn to NA.
- For any column of my_mtcars that does not appear in table_valid_values, no cleanup is needed.
library(tibble)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
my_mtcars <- rownames_to_column(mtcars, "cars")
as_tibble(my_mtcars)
#> # A tibble: 32 x 12
#> cars mpg cyl disp hp drat wt qsec vs am gear carb
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 Mazda RX4 ~ 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 Hornet 4 D~ 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 Hornet Spo~ 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
table_valid_values <-
  structure(
    list(
      var_name = c("cars", "gear", "carb"),
      valid_values = list(
        c("Valiant", "AMC Javelin", "Ferrari Dino"),
        c(3, 5),
        c(1, 4, 6)
      )
    ),
    row.names = c(NA, -3L),
    class = c("tbl_df", "tbl", "data.frame")
  )
table_valid_values
#> # A tibble: 3 x 2
#> var_name valid_values
#> <chr> <list>
#> 1 cars <chr [3]>
#> 2 gear <dbl [2]>
#> 3 carb <dbl [3]>
table_valid_values %>%
  pull(valid_values)
#> [[1]]
#> [1] "Valiant" "AMC Javelin" "Ferrari Dino"
#>
#> [[2]]
#> [1] 3 5
#>
#> [[3]]
#> [1] 1 4 6
Created on 2021-01-27 by the reprex package (v0.3.0)
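Since each row of table_valid_values pairs a column name with its vector of valid values, one convenient shape for a per-column lookup is a named list. The following is a minimal sketch, assuming tibble::deframe() is acceptable here; valid_lookup is a name introduced only for illustration, and the table is rebuilt with tibble() so the snippet runs on its own.

```r
library(tibble)

# Same content as table_valid_values above, rebuilt so this snippet is self-contained
table_valid_values <- tibble(
  var_name = c("cars", "gear", "carb"),
  valid_values = list(
    c("Valiant", "AMC Javelin", "Ferrari Dino"),
    c(3, 5),
    c(1, 4, 6)
  )
)

# deframe() converts a two-column data frame into a named vector or list:
# names are taken from the first column, values from the second.
# valid_lookup is a hypothetical name, not part of the original question.
valid_lookup <- deframe(table_valid_values)

names(valid_lookup)      # column names that have a set of valid values
valid_lookup[["gear"]]   # valid values for the gear column
```

With this shape, the valid values for any column can be fetched by name, which is what a column-wise cleanup needs.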
Desired Output
Provided with only table_valid_values, how can I clean up my_mtcars to get the following:
## cars mpg cyl disp hp drat wt qsec vs am gear carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA 21 6 160 110 3.9 2.62 16.5 0 1 NA 4
## 2 NA 21 6 160 110 3.9 2.88 17.0 0 1 NA 4
## 3 NA 22.8 4 108 93 3.85 2.32 18.6 1 1 NA 1
## 4 NA 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 NA 18.7 8 360 175 3.15 3.44 17.0 0 0 3 NA
## 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 NA 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 NA 24.4 4 147. 62 3.69 3.19 20 1 0 NA NA
## 9 NA 22.8 4 141. 95 3.92 3.15 22.9 1 0 NA NA
## 10 NA 19.2 6 168. 123 3.92 3.44 18.3 1 0 NA 4
## 11 NA 17.8 6 168. 123 3.92 3.44 18.9 1 0 NA 4
## 12 NA 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 NA
## 13 NA 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 NA
## 14 NA 15.2 8 276. 180 3.07 3.78 18 0 0 3 NA
## 15 NA 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
## 16 NA 10.4 8 460 215 3 5.42 17.8 0 0 3 4
## 17 NA 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
## 18 NA 32.4 4 78.7 66 4.08 2.2 19.5 1 1 NA 1
## 19 NA 30.4 4 75.7 52 4.93 1.62 18.5 1 1 NA NA
## 20 NA 33.9 4 71.1 65 4.22 1.84 19.9 1 1 NA 1
## 21 NA 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
## 22 NA 15.5 8 318 150 2.76 3.52 16.9 0 0 3 NA
## 23 AMC Javelin 15.2 8 304 150 3.15 3.44 17.3 0 0 3 NA
## 24 NA 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
## 25 NA 19.2 8 400 175 3.08 3.84 17.0 0 0 3 NA
## 26 NA 27.3 4 79 66 4.08 1.94 18.9 1 1 NA 1
## 27 NA 26 4 120. 91 4.43 2.14 16.7 0 1 5 NA
## 28 NA 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 NA
## 29 NA 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
## 30 Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
## 31 NA 15 8 301 335 3.54 3.57 14.6 0 1 5 NA
## 32 NA 21.4 4 121 109 4.11 2.78 18.6 1 1 NA NA
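One possible way to get this output, offered as a sketch rather than a definitive answer: turn table_valid_values into a named list and apply the check column by column with dplyr::across() and dplyr::cur_column(). This assumes dplyr 1.0.0 or later; valid_lookup and cleaned are names introduced here for illustration.

```r
library(tibble)
library(dplyr)

my_mtcars <- as_tibble(rownames_to_column(mtcars, "cars"))

# Same content as table_valid_values in the question
table_valid_values <- tibble(
  var_name = c("cars", "gear", "carb"),
  valid_values = list(
    c("Valiant", "AMC Javelin", "Ferrari Dino"),
    c(3, 5),
    c(1, 4, 6)
  )
)

# Named list: one vector of valid values per column name (hypothetical helper object)
valid_lookup <- deframe(table_valid_values)

cleaned <- my_mtcars %>%
  mutate(across(
    # only the columns listed in the table are touched
    all_of(names(valid_lookup)),
    # cur_column() gives the name of the column currently being processed,
    # so each column is compared against its own set of valid values
    ~ replace(.x, !.x %in% valid_lookup[[cur_column()]], NA)
  ))

cleaned
```

Columns that are not listed in table_valid_values are never selected by all_of(), so they pass through unchanged, and replace() keeps each column's original type.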
I also wonder: what if we wanted to replace invalid values with a string of choice (say, invalid) rather than NA?
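For the string variant, one thing to watch is that a numeric column such as gear or carb cannot hold the string invalid, so it has to be converted to character first. A minimal sketch under that assumption, again using dplyr 1.0.0 or later; valid_lookup is written out by hand here as a named list mirroring table_valid_values, and flagged is a name introduced for illustration.

```r
library(tibble)
library(dplyr)

my_mtcars <- as_tibble(rownames_to_column(mtcars, "cars"))

# Hand-written named list with the same content as table_valid_values (assumption:
# in real use it would be derived from that table rather than typed out)
valid_lookup <- list(
  cars = c("Valiant", "AMC Javelin", "Ferrari Dino"),
  gear = c(3, 5),
  carb = c(1, 4, 6)
)

flagged <- my_mtcars %>%
  mutate(across(
    all_of(names(valid_lookup)),
    # valid values are kept (as character); everything else becomes "invalid"
    ~ if_else(
      .x %in% valid_lookup[[cur_column()]],
      as.character(.x),
      "invalid"
    )
  ))

flagged
```

The trade-off is that gear and carb come back as character columns, so any later numeric work on them would need the invalid entries handled and the column converted back.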
question from:
https://stackoverflow.com/questions/65916731/recode-dataframe-values-to-na-per-column