r - tidyr separate column values into character and numeric using regex

Question

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I'd like to separate column values using tidyr::separate and a regular expression, but I am new to regex.

df <- data.frame(A=c("enc0","enc10","enc25","enc100","harab0","harab25","harab100","requi0","requi25","requi100"), stringsAsFactors=FALSE)

This is what I've tried

library(tidyr)
df %>%
   separate(A, c("name","value"), sep="[a-z]+")

Bad Output

   name value
1           0
2          10
3          25
4         100
5           0
# etc

How do I save the name column as well?

1 Answer

深蓝 · Answer 1 · 2021-10-23T21:39:51+0000

You may use a (?<=[a-z])(?=[0-9]) lookaround based regex with tidyr::separate:

> tidyr::separate(df, A, into = c("name", "value"), "(?<=[a-z])(?=[0-9])")
    name value
1    enc     0
2    enc    10
3    enc    25
4    enc   100
5  harab     0
6  harab    25
7  harab   100
8  requi     0
9  requi    25
10 requi   100

The (?<=[a-z])(?=[0-9]) pattern matches the position in the string between a lowercase ASCII letter ((?<=[a-z])) and a digit ((?=[0-9])). (?<=...) is a positive lookbehind that requires its pattern to appear immediately to the left of the current position, and (?=...) is a positive lookahead that requires its pattern to appear immediately to the right. Because both are zero-width assertions, the match consumes no characters, so the letters and digits are kept intact when splitting.
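To see that the pattern matches a position rather than any characters, here is a minimal base R sketch that runs regexpr with perl = TRUE on one of the sample values. The variable names (x, m, pos, len, left, right) are illustrative only and do not come from the original answer.

```r
# Base R only; "harab25" is one of the sample values from the question.
x <- "harab25"

# perl = TRUE is required for lookbehind/lookahead support in base R.
m <- regexpr("(?<=[a-z])(?=[0-9])", x, perl = TRUE)

pos <- as.integer(m)            # 1-based index of the first character after the split point
len <- attr(m, "match.length")  # length of the match; 0 means nothing was consumed

left  <- substr(x, 1, pos - 1)     # everything before the split point
right <- substr(x, pos, nchar(x))  # everything from the split point onward

print(c(pos = pos, len = len))
print(c(left = left, right = right))
```

A match length of 0 is what lets tidyr::separate cut the string at that point without dropping any letters or digits.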

Alternatively, you may use extract:

extract(df, A, into = c("name", "value"), "^([a-z]+)(\\d+)$")

Output:

    name value
1    enc     0
2    enc    10
3    enc    25
4    enc   100
5  harab     0
6  harab    25
7  harab   100
8  requi     0
9  requi    25
10 requi   100

The ^([a-z]+)(\d+)$ pattern (the backslash must be doubled, \\d, inside an R string literal) matches:

^ - start of string
([a-z]+) - Capturing group 1 (column name): one or more lowercase ASCII letters
(\d+) - Capturing group 2 (column value): one or more digits
$ - end of string.
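Since the goal is a character column and a numeric column, it may help to know that tidyr::extract accepts convert = TRUE, which runs type conversion on the new columns. The sketch below is an illustration under assumptions: df2 and the value "no_digits" are hypothetical and were added only to show what happens when a row does not match the pattern.

```r
library(tidyr)

# Hypothetical sample: two values from the question plus one that does not fit the pattern.
df2 <- data.frame(A = c("enc0", "harab25", "no_digits"), stringsAsFactors = FALSE)

# convert = TRUE asks tidyr to convert the new columns to a suitable type,
# so "value" should come back as integer instead of character.
res <- tidyr::extract(df2, A, into = c("name", "value"),
                      regex = "^([a-z]+)(\\d+)$", convert = TRUE)

print(res)
str(res)  # shows the column types; the non-matching row is NA in both columns
```

The same convert = TRUE argument is also accepted by tidyr::separate, so either approach from the answer can produce a numeric value column directly.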
