My current Python Project will require a lot of string splitting to process incoming packages. Since I will be running it on a pretty slow system, I was wondering what the most efficient way to go about this would be. The strings would be formatted something like this:
Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5
Explanation: This particular example would come from a list where the first two Items are a title and a date, while Item 3 to Item 5 would be invited people (The number of those can be anything from zero to n, where n is the number of registered users on the server).
From what I see, I have the following options:
- repeatedly use
split()
- Use a regular expression and Regex functions
- Some other Python functions I have not thought about yet (There are probably some)
Solution 1 would include splitting at |
and then splitting the last element of the resulting list at <>
for this example, while solution 2 would probably result in a regular expression like:
((.+)|)+((.+)(<>)?)+
Okay, this RegEx is horrible, I can see that myself. It is also untested. But you get the idea.
Now, I am looking for the way that a) takes the least amount of time and b) ideally uses the least amount of memory. If only one of the two is possible, I would prefer less time. The ideal solution would also work for Strings that have more Items seperated with |
and strings that completely lack the <>
. At least the Regular Expression-based Solution would do that
My understanding would be that split()
would use more memory (since you basically get two resulting lists, one that splits at |
and the second one that splits at <>
), but I don't know enough about Pythons implementation of regular Expressions to judge how the RegEx would perform. split()
is also less dynamic than a regular expression if it somes to different numbers of Items and the absence of the second seperator. Still, I can't shake the impression that python can do this better without regular expressions, that's why I am asking
Some notes:
- Yes, I could just benchmark both solutions, but I'm trying to learn something about python in general and how it works here, and if I just benchmark these two, I still don't know what python functions I have missed.
- Yes, optimizing at this level is only really required for high-performance stuff, but as I said, I am trying to learn things about python.
- Addition: in the original question, I completely forgot to mention that I need to be able to distinguish the parts that were seperated by
|
from the parts with the seperator <>
, so a simple flat list as generated by re.split(||<>,input)
(as proposed by @obmarg) will not work too well. Solutions fitting this criterium are much appreciated.
To sum the question up: Which solution would be the most efficient one, for what reasons.
Due to multiple requests, I have run some timeit on the split()
-solution and the first proposed regular expression by @obmarg, as well as the solutions by @mgibsonbr and @duncan:
import timeit
import re
def splitit(input):
res0 = input.split("|")
res = []
for element in res0:
t = element.split("<>")
if t != [element]:
res0.remove(element)
res.append(t)
return (res0, res)
def regexit(input):
return re.split( "||<>", input )
def mgibsonbr(input): # Solution by @mgibsonbr
items = re.split(r'||<>', input) # Split input in items
offset = 0
result = [] # The result: strings for regular itens, lists for <> separated ones
acc = None
for i in items:
delimiter = '|' if offset+len(i) < len(input) and input[offset+len(i)] == '|' else '<>'
offset += len(i) + len(delimiter)
if delimiter == '<>': # Will always put the item in a list
if acc is None:
acc = [i] # Create one if doesn't exist
result.append(acc)
else:
acc.append(i)
else:
if acc is not None: # If there was a list, put the last item in it
acc.append(i)
else:
result.append(i) # Add the regular items
acc = None # Clear the list, since what will come next is a regular item or a new list
return result
def split2(input): # Solution by @duncan
res0 = input.split("|")
res1, res2 = [], []
for r in res0:
if "<>" in r:
res2.append(r.split("<>"))
else:
res1.append(r)
return res1, res2
print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit()
print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()
print "split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()
print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit()
print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()
print "split:", timeit.Timer("split2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()
The results:
mgibs: 14.7349407408
split: 6.403942732
split2: 3.68306812233
regex: 5.28414318792
mgibs: 107.046683735
split: 46.0844590775
split2: 26.5595985591
regex: 28.6513302646
At the moment, it looks like split2 by @duncan beats all other algorithms, regardless of length (with this limited dataset at least), and it also looks like @mgibsonbr's solution has some performance issues (Sorry 'bout that, but thanks for the solution regardless).
Thanks for the input, everyone.
See Question&Answers more detail:
os