Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
173 views
in Technique[技术] by (71.8m points)

python - How to add information found with regular expressions in a CSV file

I am trying to "append" new information to a CSV file. The problem is that that information is not in a dataframe structure, but is information extracted from a text using regular expressions. The sample text would be the next one:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Etiam id diam posuere, eleifend diam at, condimentum justo. Pellentesque mollis a diam id consequat.

TITLE-SDFSD-DFDS-SFDS-01-01: This is the title 1 that

is split into two lines with a blank line in the middle

Conditions Pellentesque blandit scelerisque pellentesque. Sed nec quam purus. Quisque nec tellus sed neque accumsan lacinia sit amet sit amet tellus. Etiam venenatis nibh vel pellentesque elementum. Nullam eget tortor quam. Morbi sed leo et arcu aliquet luctus.

Opening date 15 Apr 2021

Deadline 26 Aug 2021

Indicative budget: The total indicative budget for the topic is EUR 20.00 million.

TITLE-SDFSD-DFDS-SFDS-01-02; This is the title2 in one single line

Conditions Cras egestas consectetur sapien at dignissim. Maecenas commodo purus nibh, a tempus augue vestibulum feugiat. Vestibulum dolor neque, sagittis ut tortor et, lobortis faucibus quam.

Opening date 15 March 2021

Deadline 17 Aug 2021

Indicative budget: The total indicative budget for the topic is EUR 15.00 million.

TITLE-SDFSD-DFDS-SFDS-01-03: This is the title3 that is too long and takes two lines

Conditions Cras egestas consectetur sapien at dignissim. Maecenas commodo purus nibh, a tempus augue vestibulum feugiat. Vestibulum dolor neque, sagittis ut tortor et, lobortis faucibus quam.

Opening date 15 May 2021

Deadline 26 Sep 2021

Indicative budget: The total indicative budget for the topic is EUR 5.00 million.

To extract all the information, I have to make several interactions to extract the information I need. I know it is possible to make just one iteration subdividing into several groups what I need, but it is very hard for me to find just a regular expression that works. Instead, I am using several of them:

import re
import csv
    
with open('doubt2.txt','r', encoding="utf-8") as f:
    f_contents = f.read()

regexHOR =r'
(TITLE-S+-d{2}-d{2})[:|;](.*?)^Conditions'
regexOD = r'^Opening dates+(d{1,2} w+ d{4})s*?'
regexDL =r'^Deadlines+(d+ w+ d+)'

patternHOR = re.compile(regexHOR, re.MULTILINE | re.DOTALL)
patternOD = re.compile(regexOD, re.MULTILINE | re.DOTALL)
patternDL = re.compile(regexDL, re.MULTILINE | re.DOTALL)

matchesHOR = patternHOR.finditer(f_contents)
matchesOD = patternOD.finditer(f_contents)
matchesDL = patternDL.finditer(f_contents)

marchesHOR finds two groups, whereas the other matches are just one group. Once I have the matches I have to export it in a CSV file executing the next code:

with open("result.csv", "w",newline='') as outfile:
    csvfile = csv.writer(outfile)
    csvfile.writerow(['Topic ID', 'Title', 'Opening date', 'Deadline'])
    for match in matchesHOR:
        csvfile.writerow([match.group(1), match.group(2).replace('
', ' '),'',''])
    for match in matchesOD:
        csvfile.writerow(['','',match.group(1),''])
    for match in matchesDL:
        csvfile.writerow(['','','',match.group(1)])

The problem is that when I write the new nows after the matchesHOR it put me below, as you can see in this table:

CODE TITLE Opening Deadline
CODE 1 TITLE 1
CODE 2 TITLE 2
CODE 3 TITLE 3
OPENING 1
OPENING 2
OPENING 3
DEADLINE 1
DEADLINE 2
DEADLINE 3
question from:https://stackoverflow.com/questions/66048123/how-to-add-information-found-with-regular-expressions-in-a-csv-file

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

You need to rearrange things a bit so that all items are written for a row at the same time. The approach here is to use the match_hor to find each title start, and to then use this as a starting point for the match_od, which is in turn used as a starting point for the match_dl.

import re
import csv
    
with open('doubt2.txt','r', encoding="utf-8") as f:
    f_contents = f.read()

regexHOR = r'
(TITLE-S+-d{2}-d{2})[:|;](.*?)^Conditions'
regexOD = r'^Opening dates+(d{1,2} w+ d{4})s*?'
regexDL =r'^Deadlines+(d+ w+ d+)'

patternHOR = re.compile(regexHOR, re.MULTILINE | re.DOTALL)
patternOD = re.compile(regexOD, re.MULTILINE | re.DOTALL)
patternDL = re.compile(regexDL, re.MULTILINE | re.DOTALL)

with open("result.csv", "w",newline='') as outfile:
    csvfile = csv.writer(outfile)
    csvfile.writerow(['Topic ID', 'Title', 'Opening date', 'Deadline'])
    
    for match_hor in patternHOR.finditer(f_contents):
        code, title = [match_hor.group(1), match_hor.group(2).replace('
', ' ')]
        offset = match_hor.end()
        
        match_od = patternOD.search(f_contents[offset:])
        offset += match_od.end()
        opening = match_od.group(1)
        
        match_dl = patternDL.search(f_contents[offset:]) 
        offset += match_dl.end()
        deadline = match_dl.group(1)
        
        csvfile.writerow([code, title.strip(), opening, deadline])

This would give you result.csv containing:

Topic ID,Title,Opening date,Deadline
TITLE-SDFSD-DFDS-SFDS-01-01,This is the title 1 that  is split into two lines with a blank line in the middle,15 Apr 2021,26 Aug 2021
TITLE-SDFSD-DFDS-SFDS-01-02,This is the title2 in one single line,15 March 2021,17 Aug 2021
TITLE-SDFSD-DFDS-SFDS-01-03,This is the title3 that is too long and takes two lines,15 May 2021,26 Sep 2021

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...