A common technique to extract a number before or after a word is to match the whole string while capturing only the number: match everything up to the word (or up to the number), capture the number, match the rest of the string, and then replace the entire match with the captured substring using sub:
# Extract the first number after a word:
as.integer(sub(".*?<WORD_OR_PATTERN_HERE>.*?(\\d+).*", "\\1", x))
# Extract the first number before a word:
as.integer(sub(".*?(\\d+)\\s*<WORD_OR_PATTERN_HERE>.*", "\\1", x))
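As a small illustration of the "number before a word" form, here is a sketch with sqm standing in for the word placeholder. The sample strings are made up for this example, and perl = TRUE is my own choice so that the lazy .*? behaves as it does in PCRE; it is not part of the snippets above.

```r
# Hypothetical sample strings, not taken from the question
x <- c("lot area 25 sqm", "3rd floor, floor area: 50 sqm and a lift")

# Capture the digits that directly precede "sqm" (optional whitespace allowed),
# then replace the whole string with that captured group
nums <- as.integer(sub(".*?(\\d+)\\s*sqm.*", "\\1", x, perl = TRUE))
nums
# [1] 25 50
```

Note that the 3 in "3rd" is skipped because it is not followed by sqm.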
NOTE: Replace \\d+ with \\d+(?:\\.\\d+)? to match integer or float numbers (for consistency with the code above, remember to change as.integer to as.numeric). \\s* matches zero or more whitespace characters in the second sub call.
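To make the note above concrete, here is a minimal sketch of the float-aware variant. The input strings are hypothetical, and I pass perl = TRUE as an assumption to be safe with the non-capturing group (?:...); adjust as needed for your data.

```r
# Hypothetical sample strings with an integer and a decimal value
x <- c("floor area: 50.5 sqm", "Floor Area: 30 sq.m")

# \\d+(?:\\.\\d+)? matches digits with an optional fractional part;
# as.numeric (not as.integer) keeps the decimals
areas <- as.numeric(sub("(?i).*?\\bfloor area:?\\s*(\\d+(?:\\.\\d+)?).*", "\\1", x, perl = TRUE))
areas
# [1] 50.5 30.0
```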
For the current scenario, a possible solution looks like this:
v <- c("A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift","Newbuild flat. Floor Area: 30 sq.m","6 bed house with floor area 50 sqm, lot area 25 sqm")
as.integer(sub("(?i).*?\\bfloor area:?\\s*(\\d+).*", "\\1", v))
# [1] 50 30 50
See the regex demo.
You may also leverage a capturing mechanism with str_match from stringr and get the second column value ([,2]):
> library(stringr)
> v <- c("A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift","Newbuild flat. Floor Area: 30 sq.m","6 bed house with floor area 50 sqm, lot area 25 sqm")
> as.integer(str_match(v, "(?i)\\bfloor area:?\\s*(\\d+)")[,2])
[1] 50 30 50
See the regex demo.
The regex matches:
(?i) - in a case-insensitive way
\bfloor area:? - the phrase floor area starting at a word boundary (\b), followed by an optional : (? means one or zero occurrences)
\s* - zero or more whitespace characters
(\d+) - Group 1 (will be in [,2]) capturing one or more digits
See R demo online
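One difference between the two approaches is what happens when a string has no match: sub returns the input unchanged, so as.integer yields NA with a coercion warning, while a capture-based approach gives NA directly. If you prefer to avoid the stringr dependency, the same capture idea can be sketched in base R with regexec and regmatches. The third string and the helper lambda below are my own additions for illustration.

```r
v <- c("A beautiful flat on the 3rd floor with floor area: 50 sqm and a lift",
       "Newbuild flat. Floor Area: 30 sq.m",
       "Studio, no size given")  # hypothetical string with no match

# regexec/regmatches return, per string, the full match followed by the groups,
# or character(0) when there is no match
m <- regmatches(v, regexec("(?i)\\bfloor area:?\\s*(\\d+)", v, perl = TRUE))

# Take Group 1 where present, NA otherwise
area <- as.integer(vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_, character(1)))
area
# [1] 50 30 NA
```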