Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


r - How to contract Part-of-Speech tags for contracted word forms

I'm working with transcriptions of speech where colloquial contracted forms such as "wanna" or "dunno" are rendered as whitespace-separated word forms, such as "wan na" (2 words) or "du n no" (3 words). The transcriptions also contain Part-of-Speech tags. I want to contract the tags for the to-be-contracted forms but run into problems with that.

Data:

The data is arranged in element pairs: the first part of each pair contains the form I wish to contract (e.g., wan na); the second part contains a similar form that I do not wish to contract (e.g., need to):

df_reduced <- data.frame(
  Text = c("they wan na run around .", "You need to get your b's off the top .",         # wan na
           "What you gon na put ?", "Well I was trying to",                              # gon na
           "Just got ta .", "She never even said to ask me",                             # got ta
           "It 's true though in n it ?", "Is n't it obvious ?",                         # in n it
           "Must of been clean mud , I du n no !", "And you do n't expect it do you ?"   # du n no
           ),
  Tag = c("PNP VVB TO0 VVI AVP", "PNP VVB TO0 VVI DPS ZZ0 PRP AT0 NN1",                  # "VVB TO0"            
          "DTQ PNP VVG TO0 VVI", "AV0 PNP VBD VVG TO0",                                  # "VVG TO0"
          "AV0 VVD TO0", "PNP AV0 AV0 VVD TO0 VVI PNP",                                  # "VVD TO0"
          "PNP VBZ AJ0 AV0 VBZ XX0 PNP", "VBZ XX0 PNP AJ0",                              # "VBZ XX0 PNP"
          "VM0 VHI VBN AJ0 NN1 PNP VDB XX0 VVI", "CJC PNP VDB XX0 VVI PNP VDB PNP"       # "VDB XX0 VVI"
          )
)

Note that "wan na" (which I want to contract) and "need to" (which I do not want to contract) have exactly the same Part-of-Speech tags, namely "VVB TO0". The same is true of all other element pairs in Tag.

I want to replace the relevant tags only if the Text contains the to-be-contracted form. For example, I want to contract VVB TO0 only if Text contains the substring wan na, and contract VVG TO0 only if Text contains the substring gon na, and so forth. I therefore define replacements_tag and a forms_pattern for all the to-be-contracted forms:

# define replacements:
replacements_tag <- setNames(c("VVB_TO0", "VVG_TO0", "VVD_TO0", "VBZ_XX0_PNP", "VDB_XX0_VVI"),   # new forms
                         c("VVB TO0", "VVG TO0", "VVD TO0", "VBZ XX0 PNP", "VDB XX0 VVI"))       # old forms

# define pattern:
forms <- c("wan na", "gon na", "got ta", "in n it", "du n no")
forms_pattern <- paste0("\\b(", paste0(forms, collapse = "|"), ")\\b")

# create new column (str_replace_all() comes from the stringr package):
library(stringr)
df_reduced$Tag_new <- ifelse(grepl(forms_pattern, df_reduced$Text),
                             str_replace_all(df_reduced$Tag[grepl(forms_pattern, df_reduced$Text)], replacements_tag),
                             df_reduced$Tag)

The replacements work fine, except that the Tag values to which no changes have been made carry numbers instead of the expected 'old' unchanged Tag values:

 df_reduced$Tag_new
 [1] "PNP VVB_TO0 VVI AVP"                 "8"                                   "AV0 VVD_TO0"                        
 [4] "1"                                   "VM0 VHI VBN AJ0 NN1 PNP VDB_XX0_VVI" "5"                                  
 [7] "DTQ PNP VVG_TO0 VVI"                 "9"                                   "PNP VBZ AJ0 AV0 VBZ_XX0_PNP"        
[10] "3"

How can the 'old' Tag values be displayed in Tag_new?
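
One plausible cause of the numbers, offered as a guess rather than a confirmed diagnosis: before R 4.0.0, data.frame() converted character columns to factors by default, and ifelse() does not keep the factor attributes of its `no` argument, so a factor comes back as its integer level codes. A minimal sketch with toy data (the `toy` data frame and its values are made up for illustration, and stringsAsFactors = TRUE is set explicitly to mimic the old default):

```r
# Toy data; stringsAsFactors = TRUE mimics the data.frame() default before R 4.0.0
toy <- data.frame(
  Text = c("they wan na run", "You need to get", "What you gon na put ?"),
  Tag  = c("PNP VVB TO0 VVI AVP", "PNP VVB TO0 VVI DPS", "DTQ PNP VVG TO0 VVI"),
  stringsAsFactors = TRUE
)

# Rows whose Text contains one of the contracted forms
hit <- grepl("\\b(wan na|gon na)\\b", toy$Text)

# Passing the factor itself as `no`: unmatched rows receive the level code
with_codes <- ifelse(hit, "changed", toy$Tag)

# Converting to character first: unmatched rows keep the tag text
with_text <- ifelse(hit, "changed", as.character(toy$Tag))

print(with_codes)
print(with_text)
```

If this is what happens in the question, the second element of with_codes is a digit string while the second element of with_text is the original tag sequence. Running is.factor(df_reduced$Tag) on the real data would confirm or rule out this explanation.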



1 Answer

Waiting for an expert to reply.
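
A possible approach, sketched under assumptions rather than given as a definitive answer. Two things in the question's ifelse() call look suspect. First, Tag may be a factor, which would explain the numbers. Second, the `yes` argument is computed only on the matching rows, so it is shorter than `test` and gets recycled, which can put a replaced tag on the wrong row. In the output shown, the row for "What you gon na put ?" appears to carry the tags of "Just got ta .". The sketch below avoids both by keeping Tag as character and assigning into the matching rows directly. It uses base R gsub() with fixed = TRUE in a loop in place of stringr's str_replace_all(). The `contractions` lookup table is a hypothetical helper, and the data is a four-row subset of the question's example.

```r
# Four-row subset of the question's data; Tag is kept as character, not factor
df <- data.frame(
  Text = c("they wan na run around .", "You need to get your b's off the top .",
           "What you gon na put ?", "Well I was trying to"),
  Tag  = c("PNP VVB TO0 VVI AVP", "PNP VVB TO0 VVI DPS ZZ0 PRP AT0 NN1",
           "DTQ PNP VVG TO0 VVI", "AV0 PNP VBD VVG TO0"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: each contracted form paired with its own tags
contractions <- data.frame(
  form    = c("wan na", "gon na"),
  old_tag = c("VVB TO0", "VVG TO0"),
  new_tag = c("VVB_TO0", "VVG_TO0"),
  stringsAsFactors = FALSE
)

# Start from the unchanged tags, then overwrite only the matching rows
df$Tag_new <- df$Tag
for (i in seq_len(nrow(contractions))) {
  # Rows whose Text contains this particular contracted form
  hit <- grepl(paste0("\\b", contractions$form[i], "\\b"), df$Text)
  # Literal (non-regex) replacement of the tag sequence in those rows only
  df$Tag_new[hit] <- gsub(contractions$old_tag[i], contractions$new_tag[i],
                          df$Tag_new[hit], fixed = TRUE)
}

print(df$Tag_new)
```

Pairing each form with its own tag sequence means VVB TO0 is only touched in rows that contain "wan na", as the question asks. One limitation: gsub() replaces every occurrence of the tag sequence in a matching row, so a sentence containing both "wan na" and "need to" would have both contracted. Handling that case would need token-level alignment between Text and Tag.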
