Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


java - How to properly split on a non escaped delimiter?

I have the following example string:

A|B\|C\\|D\\\|E\\\\|F

with | being the delimiter and \ being the escape character. A proper split should look as follows:

A
B\|C\\
D\\\|E\\\\
F

I also need this logic to be generally applicable, in case the delimiter or the escape consists of multiple characters.

I already have a regex which splits at the correct position, but it does not produce the desired output:

Regex:

(?<!\Q\\E)(?:(\Q\\\E)*)\Q|\E

Output:

A
B\|C
D\\\|E
F

I usually test at https://regex101.com/, but I am working in Java, so I have a few more capabilities.

I also tried the following, again without success (it doesn't work on the webpage at all, and in Java it just doesn't produce the desired result):

(?=(\Q\\\E){0,5})(?<!\Q\\E)\Q|\E
question from:https://stackoverflow.com/questions/65901486/how-to-properly-split-on-a-non-escaped-delimiter


1 Answer


Extracting approach

You can use a matching approach, as it is the most stable and allows an arbitrary number of escaping chars. You can use

(?s)(?:\\.|[^\\|])+

See the regex demo. Details:

  • (?s) - Pattern.DOTALL embedded flag option
  • (?:\\.|[^\\|])+ - one or more repetitions of either a \ followed by any one char, or any char other than \ and |.

See the Java demo:

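import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Backslashes are doubled inside Java string literals; the runtime value is A|B\|C\\|D\\\|E\\\\|F
String s = "A|B\\|C\\\\|D\\\\\\|E\\\\\\\\|F";
Pattern pattern = Pattern.compile("(?:\\\\.|[^\\\\|])+", Pattern.DOTALL);
Matcher matcher = pattern.matcher(s);
List<String> results = new ArrayList<>();
while (matcher.find()) {
    results.add(matcher.group()); // each match is one field, escapes kept as-is
}
System.out.println(results);
// => [A, B\|C\\, D\\\|E\\\\, F]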
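
If the delimiter and the escape are single characters that are only known at runtime, the same pattern can probably be assembled with Pattern.quote. The following is only a sketch under that assumption: the class name EscapedFieldExtractor and the method extractFields are made up for this illustration, and a lookahead replaces the character class so that no manual escaping inside [...] is needed. Like the pattern above, it never returns empty fields, so "A||B" yields two items, not three.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapedFieldExtractor {

    // Hypothetical helper: builds (?:ESC.|(?!ESC|DELIM).)+ for a one-char ESC and DELIM.
    public static List<String> extractFields(String input, char delimiter, char escape) {
        String esc = Pattern.quote(String.valueOf(escape));
        String delim = Pattern.quote(String.valueOf(delimiter));
        Pattern pattern = Pattern.compile(
                "(?:" + esc + ".|(?!" + esc + "|" + delim + ").)+", Pattern.DOTALL);

        Matcher matcher = pattern.matcher(input);
        List<String> fields = new ArrayList<>();
        while (matcher.find()) {
            fields.add(matcher.group()); // escape sequences are kept verbatim
        }
        return fields;
    }

    public static void main(String[] args) {
        String s = "A|B\\|C\\\\|D\\\\\\|E\\\\\\\\|F";
        System.out.println(extractFields(s, '|', '\\'));
        // prints [A, B\|C\\, D\\\|E\\\\, F]
    }
}
```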

Splitting approach (workaround for split)

You may (ab)use the constrained-width lookbehind support in Java regex and use a limiting quantifier like {0,1000} instead of the * quantifier. A workaround would look like

String s = "A|B\\|C\\\\|D\\\\\\|E\\\\\\\\|F";
String[] results = s.split("(?<=(?<!\\\\)(?:\\\\{2}){0,1000})\\|");
System.out.println(Arrays.toString(results));

See this Java demo.

Note that the (?:\\{2}){0,1000} part only allows up to 1000 escaped backslashes (2000 backslash characters) directly before the delimiter. That should suffice in most cases, I believe, but you might want to test this first. I'd still recommend the first solution.

Details:

  • (?<= - start of a positive lookbehind:
    • (?<!\\) - a location not immediately preceded with a \
    • (?:\\{2}){0,1000} - zero to one thousand occurrences of a double backslash
  • ) - end of the positive lookbehind
  • \| - a | char.
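
For a delimiter or an escape that is longer than one character, as the question asks, a plain left-to-right scan may be easier to reason about than a regex. The following is only a sketch of that idea, not part of the original answer: the class EscapeAwareSplitter and its split method are hypothetical names, and it assumes that the delimiter and the escape are non-empty and that neither is a prefix of the other. Escape sequences are copied to the output unchanged, matching the expected output in the question.

```java
import java.util.ArrayList;
import java.util.List;

public class EscapeAwareSplitter {

    // Hypothetical helper: splits on `delimiter` unless an `escape` directly protects it.
    public static List<String> split(String input, String delimiter, String escape) {
        if (delimiter.isEmpty() || escape.isEmpty()) {
            throw new IllegalArgumentException("delimiter and escape must be non-empty");
        }
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            if (input.startsWith(escape, i)) {
                // Copy the escape itself, then the token it protects.
                current.append(escape);
                i += escape.length();
                if (input.startsWith(delimiter, i)) {
                    current.append(delimiter);
                    i += delimiter.length();
                } else if (input.startsWith(escape, i)) {
                    current.append(escape);
                    i += escape.length();
                } else if (i < input.length()) {
                    current.append(input.charAt(i));
                    i++;
                }
            } else if (input.startsWith(delimiter, i)) {
                // Unescaped delimiter: close the current field.
                fields.add(current.toString());
                current.setLength(0);
                i += delimiter.length();
            } else {
                current.append(input.charAt(i));
                i++;
            }
        }
        fields.add(current.toString()); // last field, possibly empty
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("A|B\\|C\\\\|D\\\\\\|E\\\\\\\\|F", "|", "\\"));
        // prints [A, B\|C\\, D\\\|E\\\\, F]
        System.out.println(split("a::b^^::c::d", "::", "^^"));
        // prints [a, b^^::c, d]
    }
}
```

Unlike String.split, this sketch keeps trailing empty fields, so the caller has to drop them if they are not wanted.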
