Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


Java - divide a sentence into words and punctuation

I need to parse a Sentence class into words and punctuation marks (whitespace counts as a punctuation mark), then add all of the pieces to a general ArrayList<Sentence>.

An example sentence:

A man, a plan, a canal — Panama!
A => word
whitespace => punctuation
man => word
, + space => punctuation
a => word
[...]

I tried reading the sentence one character at a time, collecting consecutive characters of the same kind, and creating a new Word or Punctuation from each collected run.

Here's my code:

public class Sentence {

    private String sentence;
    private LinkedList<SentenceElement> elements;

    /**
     * Constructs a sentence.
     * @param aText a string containing all characters of the sentence
     */
    public Sentence(String aText) {
        sentence = aText.trim();
        splitSentence();
    }

    public String getSentence() {
        return sentence;
    }

    public LinkedList<SentenceElement> getElements() {
        return elements;
    }

    /**
     * Splits the sentence into words and punctuation.
     */
    private void splitSentence() {
        if (sentence == "" || sentence == null || sentence == "\n") {
            return;
        }

        StringBuilder builder = new StringBuilder();

        int j = 0;
        boolean mark = false;
        while (j < sentence.length()) {
            //char current = sentence.charAt(j);

            while (Character.isLetter(sentence.charAt(j))) {
                if (mark) {
                    elements.add(new Punctuation(builder.toString()));
                    builder.setLength(0);
                    mark = false;
                }
                builder.append(sentence.charAt(j));
                j++;
            } 
            mark = true;

            while (!Character.isLetter(sentence.charAt(j))) {
                if (mark) {
                    elements.add(new Word(builder.toString()));
                    builder.setLength(0);
                    mark = false;
                }
                builder.append(sentence.charAt(j));
                j++;
            }
            mark = true;
        }
    }
}

But the logic of splitSentence() doesn't work correctly, and I can't find the right solution.

What I want to implement: read the first character and add it to the builder; while the next character is of the same type (letter or punctuation), keep appending; when the next character differs in type from the builder's content, create a new Word or Punctuation and reset the builder.

Then repeat the same logic for the rest of the sentence.

How can I implement this check correctly?


1 Answer


Split the string on word boundaries (except the one at the start of the string):

String[] parts = sentence.split("(?<!^)\\b");

The array will contain alternating word/punctuation/word/punctuation/word etc.
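
Since the question wants element objects rather than plain strings, the split result could be mapped onto them roughly as follows. This is only a sketch: SentenceElement, Word and Punctuation are not shown in the question, so the minimal versions here (and the SplitToElements helper) are hypothetical stand-ins.

```java
import java.util.LinkedList;
import java.util.List;

// Hypothetical stand-ins: the question's own element classes are not shown.
abstract class SentenceElement {
    final String text;
    SentenceElement(String text) { this.text = text; }
}

final class Word extends SentenceElement {
    Word(String text) { super(text); }
}

final class Punctuation extends SentenceElement {
    Punctuation(String text) { super(text); }
}

// Hypothetical helper that turns the split parts into typed elements.
final class SplitToElements {
    static List<SentenceElement> parse(String sentence) {
        List<SentenceElement> elements = new LinkedList<>();
        if (sentence == null || sentence.isEmpty()) {
            return elements;
        }
        for (String part : sentence.split("(?<!^)\\b")) {
            if (part.isEmpty()) {
                continue; // defensive: skip any empty piece
            }
            // Pieces made only of word characters are words; the rest is punctuation.
            elements.add(part.matches("\\w+") ? new Word(part) : new Punctuation(part));
        }
        return elements;
    }
}
```

Note that \w matches only ASCII letters, digits and underscore by default, so non-ASCII words may need a different test.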


Here's some test code:

String sentence = "A man, a plan, a canal — Panama!";
String[] parts = sentence.split("(?<!^)\\b");
for (String part : parts)
    System.out.println('"' + part + "\" (" + (part.matches("\\w+") ? "word" : "punctuation") + ")");

Output:

"A" (word)
" " (punctuation)
"man" (word)
", " (punctuation)
"a" (word)
" " (punctuation)
"plan" (word)
", " (punctuation)
"a" (word)
" " (punctuation)
"canal" (word)
" — " (punctuation)
"Panama" (word)
"!" (punctuation)

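If you would rather keep the character-by-character approach described in the question, a single loop that flushes the builder whenever the character kind changes is one way to do it. This is a minimal sketch, not the asker's actual classes: Element and RunSplitter are hypothetical names, and Character.isLetter is assumed as the word test (so digits count as punctuation here, unlike \w).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the question's Word/Punctuation classes (not shown there).
final class Element {
    final String text;
    final boolean isWord;

    Element(String text, boolean isWord) {
        this.text = text;
        this.isWord = isWord;
    }

    @Override
    public String toString() {
        return "\"" + text + "\" (" + (isWord ? "word" : "punctuation") + ")";
    }
}

final class RunSplitter {
    // Groups consecutive characters of the same kind (letter vs. non-letter) into one element.
    static List<Element> split(String sentence) {
        List<Element> elements = new ArrayList<>();
        if (sentence == null || sentence.isEmpty()) {
            return elements;
        }
        StringBuilder run = new StringBuilder();
        boolean runIsWord = Character.isLetter(sentence.charAt(0));
        for (int i = 0; i < sentence.length(); i++) {
            char c = sentence.charAt(i);
            boolean letter = Character.isLetter(c);
            if (letter != runIsWord) {
                // Kind changed: store the finished run and start a new one.
                elements.add(new Element(run.toString(), runIsWord));
                run.setLength(0);
                runIsWord = letter;
            }
            run.append(c);
        }
        // Store the last run, which the loop itself never closes.
        elements.add(new Element(run.toString(), runIsWord));
        return elements;
    }

    public static void main(String[] args) {
        for (Element e : split("A man, a plan, a canal — Panama!")) {
            System.out.println(e);
        }
    }
}
```

Compared with the code in the question, there is only one index and one loop, so the index can never run past the end of the string, and the final run is stored explicitly after the loop.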