I'm working with transcriptions of speech where colloquial contracted forms such as "wanna" or "dunno" are rendered as whitespace-separated word forms, such as "wan na" (2 words) or "du n no" (3 words). The transcriptions also contain Part-of-Speech tags. I want to contract the tags for the to-be-contracted forms but run into problems with that.
Data:
The data is arranged in element pairs: the first part of each pair contains the form I wish to contract (e.g., wan and na), the second part contains a similar form that I do not wish to contract (e.g., need and to):
df_reduced <- data.frame(
  Text = c("they wan na run around .", "You need to get your b's off the top .",    # wan na
           "What you gon na put ?", "Well I was trying to",                         # gon na
           "Just got ta .", "She never even said to ask me",                        # got ta
           "It 's true though in n it ?", "Is n't it obvious ?",                    # in n it
           "Must of been clean mud , I du n no !", "And you do n't expect it do you ?"  # du n no
  ),
  Tag = c("PNP VVB TO0 VVI AVP", "PNP VVB TO0 VVI DPS ZZ0 PRP AT0 NN1",             # "VVB TO0"
          "DTQ PNP VVG TO0 VVI", "AV0 PNP VBD VVG TO0",                             # "VVG TO0"
          "AV0 VVD TO0", "PNP AV0 AV0 VVD TO0 VVI PNP",                             # "VVD TO0"
          "PNP VBZ AJ0 AV0 VBZ XX0 PNP", "VBZ XX0 PNP AJ0",                         # "VBZ XX0 PNP"
          "VM0 VHI VBN AJ0 NN1 PNP VDB XX0 VVI", "CJC PNP VDB XX0 VVI PNP VDB PNP"  # "VDB XX0 VVI"
  )
)
Note that "wan na" (which I want to contract) and "need to" (which I do not want to contract) have exactly the same Part-of-Speech tags, namely "VVB TO0". The same is true of all other element pairs in Tag.
I want to replace the relevant tags only if the Text contains the to-be-contracted form. For example, I want to contract VVB TO0 only if Text contains the substring wan na, and contract VVG TO0 only if Text contains the substring gon na, and so forth. I therefore define replacements_tag and a forms_pattern for all the to-be-contracted forms:
library(stringr) # provides str_replace_all()
# define replacements:
replacements_tag <- setNames(c("VVB_TO0", "VVG_TO0", "VVD_TO0", "VBZ_XX0_PNP", "VDB_XX0_VVI"), # new forms
c("VVB TO0", "VVG TO0", "VVD TO0", "VBZ XX0 PNP", "VDB XX0 VVI")) # old forms
# define pattern:
forms <- c("wan na", "gon na", "got ta", "in n it", "du n no")
forms_pattern <- paste0("\\b(", paste0(forms, collapse = "|"), ")\\b")
# create new column:
df_reduced$Tag_new <- ifelse(grepl(forms_pattern, df_reduced$Text),
str_replace_all(df_reduced$Tag[grepl(forms_pattern, df_reduced$Text)], replacements_tag),
df_reduced$Tag)
The replacements work fine, except that the Tag values to which no changes have been made carry numbers instead of the expected 'old' unchanged Tag values:
df_reduced$Tag_new
[1] "PNP VVB_TO0 VVI AVP" "8" "AV0 VVD_TO0"
[4] "1" "VM0 VHI VBN AJ0 NN1 PNP VDB_XX0_VVI" "5"
[7] "DTQ PNP VVG_TO0 VVI" "9" "PNP VBZ AJ0 AV0 VBZ_XX0_PNP"
[10] "3"
How can the 'old' Tag values be displayed in Tag_new?
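A possible explanation, offered as a sketch rather than a definitive answer: the numbers look like factor codes. If Tag is a factor (the data.frame() default before R 4.0.0, which is an assumption about the setup here), ifelse() returns the integer codes of the factor instead of its labels. In addition, the "yes" argument is a shorter, subsetted vector, so it gets recycled and can end up in the wrong rows. The snippet below reproduces the factor case on a four-row subset of the data and shows one way around both issues: convert to character and assign only to the matching rows. The names df, old, new and hit are illustrative, and base gsub() is swapped in for str_replace_all() so that no package is needed.

```r
# Toy subset of the data; stringsAsFactors = TRUE is an assumption that
# mimics the pre-4.0.0 default and turns Tag into a factor.
df <- data.frame(
  Text = c("they wan na run around .", "You need to get your b's off the top .",
           "What you gon na put ?", "Well I was trying to"),
  Tag  = c("PNP VVB TO0 VVI AVP", "PNP VVB TO0 VVI DPS ZZ0 PRP AT0 NN1",
           "DTQ PNP VVG TO0 VVI", "AV0 PNP VBD VVG TO0"),
  stringsAsFactors = TRUE
)

# Each form is paired with the tag sequence it should contract (illustrative names)
forms <- c("wan na", "gon na")
old   <- c("VVB TO0", "VVG TO0")
new   <- c("VVB_TO0", "VVG_TO0")

# Start from a character copy so unchanged rows keep their labels, not codes
df$Tag_new <- as.character(df$Tag)

for (i in seq_along(forms)) {
  # Rows whose Text contains the i-th form as a whole-word sequence
  hit <- grepl(paste0("\\b", forms[i], "\\b"), df$Text)
  # Replace only in those rows; indexing both sides keeps rows aligned
  df$Tag_new[hit] <- gsub(old[i], new[i], df$Tag_new[hit], fixed = TRUE)
}

df$Tag_new
```

Pairing each form with its own tag sequence also means that VVB TO0 is contracted only in rows containing wan na, rather than in any row that matches any of the forms.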