I want to compare two texts to similarity, therefore i need a simple function to list clearly and chronologically the words and phrases occurring in both texts. these words/sentences should be highlighted or underlined for better visualization)
on the base of @joris Meys ideas, i added an array to divide text into sentences and subordinate sentences.
this is how it looks like:
textparts <- function (text){
textparts <- c("\,", "\.")
i <- 1
while(i<=length(textparts)){
text <- unlist(strsplit(text, textparts[i]))
i <- i+1
}
return (text)
}
textparts1 <- textparts("This is a complete sentence, whereas this is a dependent clause. This thing works.")
textparts2 <- textparts("This could be a sentence, whereas this is a dependent clause. Plagiarism is not cool. This thing works.")
commonWords <- intersect(textparts1, textparts2)
commonWords <- paste("\<(",commonWords,")\>",sep="")
for(x in commonWords){
textparts1 <- gsub(x, "\1*", textparts1,ignore.case=TRUE)
textparts2 <- gsub(x, "\1*", textparts2,ignore.case=TRUE)
}
return(list(textparts1,textparts2))
However, sometimes it works, sometimes it doesn't.
I WOULD like to have results like these:
> return(list(textparts1,textparts2))
[[1]]
[1] "This is a complete sentence" " whereas this is a dependent clause*" " This thing works*"
[[2]]
[1] "This could be a sentence" " whereas this is a dependent clause*" " Plagiarism is not cool" " This thing works*"
whereas i get none results.
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…