Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - What's missing from this regex to match the lines of Apache logs?

I have these lines:

5.10.80.69 - - [21/Jun/2019:15:46:20 -0700] "PATCH /niches/back-end HTTP/2.0" 406 15834
11.57.203.39 - carroll8889 [21/Jun/2019:15:46:21 -0700] "HEAD /visionary/cultivate HTTP/1.1" 404 15391
124.137.187.175 - - [21/Jun/2019:15:46:22 -0700] "DELETE /expedite/exploit/cultivate/web-enabled HTTP/1.0" 403 2606
203.36.55.39 - collins6322 [21/Jun/2019:15:46:23 -0700] "PATCH /efficient/productize/disintermediate HTTP/1.1" 504 13377
175.5.52.40 - - [21/Jun/2019:15:46:24 -0700] "POST /real-time HTTP/1.1" 200 2660
232.220.131.214 - - [21/Jun/2019:15:46:25 -0700] "GET /wireless/matrix/synergistic/expedite HTTP/1.1" 205 15081
87.234.209.125 - labadie6990 [21/Jun/2019:15:46:26 -0700] "GET /unleash/aggregate HTTP/2

and I need to turn each of them into a dictionary like this:

example_dict = {"host":"146.204.224.152", 
                "user_name":"feest6811", 
                "time":"21/Jun/2019:15:45:24 -0700",
                "request":"POST /incentivize HTTP/1.1"}

This is what I have done:

import re
def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
        return logdata
    
partes = [
    r'(?P<host>\S+)',                   # host %h
    r'\S+',                             # indent %l (unused)
    r'(?P<user>\S+)',                   # user %u
    r'\[(?P<time>.+)\]',                # time %t
    r'"(?P<request>.*)"',               # request "%r"
    r'(?P<status>[0-9]+)',              # status %>s
    r'(?P<size>\S+)',                   # size %b (careful, can be '-')
    r'"(?P<referrer>.*)"',              # referrer "%{Referer}i"
    r'"(?P<agent>.*)"',                 # user agent "%{User-agent}i"
]

pattern = re.compile(r'\s+'.join(partes)+r'\s*')

log_data = []

for line in logs():
  log_data.append(pattern.match(line).groupdict())
    
print (log_data)

But I get this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-029948b6e367> in <module>
     23 # Get components from each line of the log file into a structured dict
     24 for line in logs():
---> 25   log_data.append(pattern.match(line).groupdict())
     26 
     27 

AttributeError: 'NoneType' object has no attribute 'groupdict'

This error is obviously caused by the regex being wrong, but I'm not sure why. The code is taken from here:

https://gist.github.com/sumeetpareek/9644255

Update:

import re
def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
        return logdata

regex="^(\S+) (\S+) (\S+) \[([\w:/]+\s[+-]\d{4})\] "(\S+)\s?(\S+)?\s?(\S+)?" (\d{3}|-) (\d+|-)\s?"?([^"]*)"?\s?"?([^"]*)?"?$"

log_data = []

for line in logs():
    m = pattern.match(line)
    log_data.append(re.findall(regex, line).groupdict())
    
print (log_data)

But I get this error: unexpected character after line continuation character

Update 2:

When adding the items to a dictionary, the items must arrive in this format:

assert len(logs()) == 979

one_item={'host': '146.204.224.152',
  'user_name': 'feest6811',
  'time': '21/Jun/2019:15:45:24 -0700',
  'request': 'POST /incentivize HTTP/1.1'}
assert one_item in logs(), "Sorry, this item should be in the log results, check your formating"
Question from: https://stackoverflow.com/questions/65882119/whats-missing-this-regex-to-match-the-lines-of-apache-logs


1 Answer


Since there are a lot of issues with the solution you have, please consider revamping it completely.
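Two of those issues can be reproduced in isolation. The following is only a minimal sketch, and the string `text` is a made-up stand-in for the file contents: looping over the string returned by `file.read()` yields single characters rather than lines, and a pattern that requires quoted referrer and agent fields cannot match lines that end right after the size.

```python
import re

# Hypothetical two-line sample standing in for the contents of logdata.txt.
text = (
    '5.10.80.69 - - [21/Jun/2019:15:46:20 -0700] "PATCH /niches/back-end HTTP/2.0" 406 15834\n'
    '175.5.52.40 - - [21/Jun/2019:15:46:24 -0700] "POST /real-time HTTP/1.1" 200 2660\n'
)

# Issue 1: looping over a str yields characters, so each "line" is one character.
first_items = [item for item in text][:3]

# Splitting into lines gives whole log entries instead.
lines = text.splitlines()

# Issue 2: a pattern that demands two trailing quoted fields fails on these lines.
strict = re.compile(r'\S+ \S+ \S+ \[.+\] ".*" [0-9]+ \S+ ".*" ".*"')
strict_match = strict.match(lines[0])

print(first_items)   # ['5', '.', '1']
print(len(lines))    # 2
print(strict_match)  # None
```

Calling `.groupdict()` on that `None` result is what raises the `AttributeError` shown in the question.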

The regex that should work for you is

^(?P<host>\S+) +\S+ +(?P<user>\S+) +\[(?P<time>[\w:/]+ +[+-]\d{4})\] +"(?P<request>\S+) +(?P<status>\S+) +(?P<size>\S+)" +(?P<someid>\d{3}|-) +(?P<someid2>\d+|-)(?: +"(?P<referrer>[^"]*)"(?: +"(?P<agent>[^"]*)")?)?$

Note that the last part, (?: +"([^"]*)"(?: +"([^"]*)")?)?, matches two optional quoted fields, and the second one is only tried if the first one matched.
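The behaviour of that optional tail can be checked on its own. The sketch below uses a shortened, hypothetical pattern named `tail` (a status, a size, and the two nested optional quoted fields) with made-up input strings; it is an illustration, not the full log pattern.

```python
import re

# Shortened illustration: status, size, then the nested optional quoted fields.
tail = re.compile(r'^(\d{3}) +(\d+|-)(?: +"([^"]*)"(?: +"([^"]*)")?)?$')

no_extra = tail.search('200 2660')
one_extra = tail.search('200 2660 "http://example.com/"')
two_extra = tail.search('200 2660 "http://example.com/" "Mozilla/5.0"')

print(no_extra.groups())   # ('200', '2660', None, None)
print(one_extra.groups())  # ('200', '2660', 'http://example.com/', None)
print(two_extra.groups())  # ('200', '2660', 'http://example.com/', 'Mozilla/5.0')
```

Groups that did not take part in the match come back as `None`, so lines without a referrer or agent still match.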

The code may look like this:

import re

pattern = re.compile(r'''^(?P<host>\S+) +\S+ +(?P<user>\S+) +\[(?P<time>[\w:/]+ +[+-]\d{4})\] +"(?P<request>\S+) +(?P<status>\S+) +(?P<size>\S+)" +(?P<someid>\d{3}|-) +(?P<someid2>\d+|-)(?: +"(?P<referrer>[^"]*)"(?: +"(?P<agent>[^"]*)")?)?$''')

log_data = []

with open("assets/logdata.txt", "r") as file:
  for line in file:
    m = pattern.search(line.strip())
    if m:
      log_data.append(m.groupdict())

print(log_data)
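In the pattern above, the groups named `request`, `status` and `size` capture the method, path and protocol separately, while the question asks for the keys `host`, `user_name`, `time` and `request`. A variant that produces exactly those four keys might look like the sketch below. The names `simple_pattern` and `parse_lines` are illustrative and not part of the code above, the sample lines are copied from the question, and everything after the quoted request is simply ignored.

```python
import re

# Illustrative pattern: captures only the four fields asked for in the question.
simple_pattern = re.compile(
    r'^(?P<host>\S+) +\S+ +(?P<user_name>\S+) +'
    r'\[(?P<time>[^\]]+)\] +"(?P<request>[^"]*)"'
)

def parse_lines(lines):
    """Return one dict per matching line; lines that do not match are skipped."""
    results = []
    for line in lines:
        m = simple_pattern.search(line.strip())
        if m:
            results.append(m.groupdict())
    return results

# Two sample lines taken from the question.
sample = [
    '11.57.203.39 - carroll8889 [21/Jun/2019:15:46:21 -0700] "HEAD /visionary/cultivate HTTP/1.1" 404 15391',
    '175.5.52.40 - - [21/Jun/2019:15:46:24 -0700] "POST /real-time HTTP/1.1" 200 2660',
]
parsed = parse_lines(sample)
print(parsed[0])
```

Because iterating over an open file yields one line at a time, the same helper can be given the file object from `open("assets/logdata.txt", "r")` instead of the `sample` list.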


