Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

java - Splitting a nested string keeping quotation marks

I am working on a project in Java that requires having nested strings.

For an input string that in plain text looks like this:

This is "a string" and this is "a \"nested\" string"

The result must be the following:

[0] This
[1] is
[2] "a string"
[3] and
[4] this
[5] is
[6] "a \"nested\" string"

Note that I want the \" sequences to be kept.
I have the following method:

public static String[] splitKeepingQuotationMarks(String s);

and I need to create an array of strings out of the given s parameter by the given rules, without using the Java Collection Framework or its derivatives.

I am unsure about how to solve this problem.
Can a regular expression solve this?

UPDATE based on questions from comments:

  • each unescaped " has its closing unescaped " (they are balanced)
  • each escaping character must itself be escaped when we want a literal representing it (to produce the text \ we need to write \\).

1 Answer

You can use the following regex:

"[^"\\]*(?:\\.[^"\\]*)*"|\S+

See the regex demo

Java demo:

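import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Input text: This is "a string" and this is "a \"nested\" string"
String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
// A quoted chunk (escape sequences allowed inside) or a run of non-whitespace
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);
while (matcher.find()) {
    System.out.println(matcher.group(0));
}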

Explanation:

  • "[^"\\]*(?:\\.[^"\\]*)*" - a double quote, followed by 0+ characters other than " and \ ([^"\\]), followed by 0+ repetitions of an escape sequence (\\.) plus 0+ characters other than " and \, followed by the closing double quote
  • | - or...
  • \S+ - 1 or more non-whitespace characters

NOTE

@Pshemo's suggestion - "\"(?:\\\\.|[^\"])*\"|\\S+" (or "\"(?:\\\\.|[^\"\\\\])*\"|\\S+", which would be more correct) - matches the same text, but is much less efficient because it uses an alternation group quantified with *. That construct involves much more backtracking: the regex engine has to test each position, and there are 2 alternatives to try at each one. My unroll-the-loop version matches whole chunks of text at once, which makes it faster and more reliable.
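
To check that the two forms tokenize the sample input identically, here is a minimal sketch. The class name PatternComparison and the helper tokens are made up for illustration. It only confirms that the matches agree on this one input; it does not measure the performance difference described above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternComparison {
    // Unrolled-loop form from the answer.
    static final Pattern UNROLLED =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
    // Alternation form (the stricter of the two variants in the note).
    static final Pattern ALTERNATION =
            Pattern.compile("\"(?:\\\\.|[^\"\\\\])*\"|\\S+");

    // Hypothetical helper: joins all matches with '|' so two token
    // streams can be compared as plain strings.
    static String tokens(Pattern p, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = p.matcher(input);
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Text: This is "a string" and this is "a \"nested\" string"
        String input = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        System.out.println(tokens(UNROLLED, input));
        System.out.println(tokens(ALTERNATION, input));
    }
}
```

Both lines printed by main should be identical for this input, with the two quoted chunks kept whole.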

UPDATE

Since a String[] is required as output, you need two passes over the input: count the matches, allocate the array, then reset the matcher and run it again to fill the array:

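import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

int cnt = 0;
// Input text: This is "a string" and this is "a \"nested\" string"
String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);
// First pass: count the matches so the array can be sized
while (matcher.find()) {
    cnt++;
}
System.out.println(cnt);
String[] result = new String[cnt];
// Second pass: rewind the matcher and copy each match into the array
matcher.reset();
int idx = 0;
while (matcher.find()) {
    result[idx] = matcher.group(0);
    idx++;
}
System.out.println(Arrays.toString(result));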

See another IDEONE demo
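
Fitted to the method signature from the question, the two passes might look like the sketch below. The class name QuoteSplitter and the constant TOKEN are illustrative names, not part of the original answer. Only java.util.regex and a plain array are used, so nothing from the Collections Framework is involved.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteSplitter {
    // Same unrolled-loop pattern as in the answer, compiled once.
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");

    public static String[] splitKeepingQuotationMarks(String s) {
        Matcher matcher = TOKEN.matcher(s);

        // First pass: count the matches to size the array.
        int count = 0;
        while (matcher.find()) {
            count++;
        }

        // Second pass: rewind and copy each match into its slot.
        String[] result = new String[count];
        matcher.reset();
        for (int i = 0; matcher.find(); i++) {
            result[i] = matcher.group();
        }
        return result;
    }

    public static void main(String[] args) {
        // Text: This is "a string" and this is "a \"nested\" string"
        String[] parts = splitKeepingQuotationMarks(
                "This is \"a string\" and this is \"a \\\"nested\\\" string\"");
        for (int i = 0; i < parts.length; i++) {
            System.out.println("[" + i + "] " + parts[i]);
        }
    }
}
```

An empty input yields an empty array, since the first pass finds no matches.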

