Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


regex - Python re.finditer match.groups() does not contain all groups from match

I am trying to use regex in Python to find and print all matching lines from a multiline search. The text that I am searching through may have the below example structure:

AAA
ABC1
ABC2
ABC3
AAA
ABC1
ABC2
ABC3
ABC4
ABC
AAA
ABC1
AAA

From this I want to retrieve each run of one or more ABC* lines that is immediately preceded by an AAA line.

The problem is that, although the overall match covers what I want:

match = <_sre.SRE_Match object; span=(19, 38), match='AAA\nABC2\nABC3\nABC4\n'>

... I can access only the last match of the group:

match groups = ('AAA\n', 'ABC4\n')

Below is the example code that I use for this problem.

#! python
import sys
import re

string = "AAA\nABC1\nABC2\nABC3\nAAA\nABC1\nABC2\nABC3\nABC4\nABC\nAAA\nABC1\nAAA\n"
print(string)

# One AAA line followed by one or more ABC<digit> lines
p_MATCHES = []
p_MATCHES.append(re.compile(r'(AAA\n)(ABC[0-9]\n){1,}'))
matches = re.finditer(p_MATCHES[0], string)

for match in matches:
    strout = ''
    gr_iter = 0
    print("match = " + str(match))
    print("match groups = " + str(match.groups()))
    for group in match.groups():
        gr_iter += 1
        sys.stdout.write("TEST GROUP:" + str(gr_iter) + "" + group)  # test output
        if group is not None:
            if group != '':
                # Strip the trailing newline and quote the captured text
                strout += '"' + group.replace("\n", "", 1) + '"' + '\n'
    sys.stdout.write("\nCOMPLETE RESULT:\n" + strout + "====\n")


1 Answer


Here is your regular expression:

(AAA\n)(ABC[0-9]\n){1,}


Your goal is to capture all ABC#s that immediately follow AAA. As the Debuggex demo shows, all ABC#s are indeed being matched (they're highlighted in yellow). However, only the "what is being repeated" part

ABC[0-9]\n

is inside the capturing parentheses, while its quantifier,

{1,}

is outside them. A capturing group that repeats keeps only its last iteration, so every ABC# except the final one is discarded. To get them all, you must also capture the quantifier:

AAA\n((?:ABC[0-9]\n){1,})


I've placed the "what is being repeated" part (ABC[0-9]\n) into a non-capturing group, and wrapped that group together with its quantifier in a capturing group. (I've also stopped capturing AAA, as you don't seem to need it.)
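As a rough illustration of the difference in Python, the sketch below runs both patterns against a shortened sample. The names `sample`, `repeated` and `wrapped` are made up for this example and do not come from the question.

```python
import re

# Shortened sample input (an assumption for illustration)
sample = "AAA\nABC1\nABC2\nABC3\n"

# Repeated capturing group: each iteration overwrites the previous capture
repeated = re.search(r"(AAA\n)(ABC[0-9]\n){1,}", sample)

# Quantifier inside the capturing group: the whole run is captured once
wrapped = re.search(r"AAA\n((?:ABC[0-9]\n){1,})", sample)

print(repeated.groups())  # ('AAA\n', 'ABC3\n')
print(wrapped.groups())   # ('ABC1\nABC2\nABC3\n',)
```

Both patterns match the same span of text; only what ends up in the groups differs.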

The captured text can be split on the newline, and will give you all the pieces as you wish.
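A minimal sketch of that split, assuming the sample string from the question; `abc_runs` is a hypothetical helper name, and `str.splitlines()` is used here as one convenient way to split on the newline.

```python
import re

text = "AAA\nABC1\nABC2\nABC3\nAAA\nABC1\nABC2\nABC3\nABC4\nABC\nAAA\nABC1\nAAA\n"

# Capture the whole run of ABC<digit> lines that follows an AAA line
pattern = re.compile(r"AAA\n((?:ABC[0-9]\n){1,})")

def abc_runs(s):
    """Return one list of ABC<digit> lines per AAA block that has any."""
    return [m.group(1).splitlines() for m in pattern.finditer(s)]

runs = abc_runs(text)
for run in runs:
    print(run)
# ['ABC1', 'ABC2', 'ABC3']
# ['ABC1', 'ABC2', 'ABC3', 'ABC4']
# ['ABC1']
```

The bare ABC line and the final AAA (which has no ABC# after it) produce no entries, which seems to be what the question asks for.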

(Note that \n by itself doesn't work in Debuggex. It requires \r\n.)


This is a workaround. Few regular expression flavors let you iterate over the individual captures of a repeated group (.NET, through its Captures collection, is one that does). A more common approach is to loop over the matches and process each one as it is found. Here's an example from Java:

import java.util.regex.*;

public class RepeatingCaptureGroupsDemo {
   public static void main(String[] args) {
      String input = "I have a cat, but I like my dog better.";

      Pattern p = Pattern.compile("(mouse|cat|dog|wolf|bear|human)");
      Matcher m = p.matcher(input);

      while (m.find()) {
         System.out.println(m.group());
      }
   }
}

Output:

cat
dog

(From http://ocpsoft.org/opensource/guide-to-regular-expressions-in-java-part-1/, about a quarter of the way down)
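Since the question is about Python, here is a rough equivalent of the Java loop above using `re.finditer`. It is a sketch, not part of the linked guide; the variable names `animal` and `found` are chosen for this example.

```python
import re

text = "I have a cat, but I like my dog better."
animal = re.compile(r"mouse|cat|dog|wolf|bear|human")

# finditer yields one match object per occurrence, left to right
found = []
for m in animal.finditer(text):
    found.append(m.group())
    print(m.group())
# cat
# dog
```

As in the Java version, each match is handled as it is found, so no repeated capture group is needed.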


Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. The links in this answer come from it.

