In your current code, you're reading the whole file into memory at once. Since they're 500Mb files, that means 500Mb strings. And then you do repeated replacements of them, which means Python has to create a new 500Mb string with the first replacement, then destroy the first string, then create a second 500Mb string for the second replacement, then destroy the second string, et cetera, for each replacement. That turns out to be quite a lot of copying of data back and forth, not to mention using a lot of memory.
If you know the replacements will always be contained in a line, you can read the file line by line by iterating over it. Python will buffer the read, which means it will be fairly optimized. You should open a new file, under a new name, for writing the new file simultaneously. Perform the replacement on each line in turn, and write it out immediately. Doing this will greatly reduce the amount of memory used and the amount of memory copied back and forth as you do the replacements:
for file in files:
fname = os.path.join(dir, file)
inFile = codecs.open(fname, "r", "utf-8")
outFile = codecs.open(fname + ".new", "w", "utf-8")
for line in inFile:
newline = do_replacements_on(line)
outFile.write(newline)
inFile.close()
outFile.close()
os.rename(fname + ".new", fname)
If you can't be certain if they'll always be on one line, things get a little harder; you'd have to read in blocks manually, using inFile.read(blocksize)
, and keep careful track of whether there might be a partial match at the end of the block. Not as easy to do, but usually still worth it to avoid the 500Mb strings.
Another big improvement would be if you could do the replacements in one go, rather than trying a whole bunch of replacements in order. There are several ways of doing that, but which fits best depends entirely on what you're replacing and with what. For translating single characters into something else, the translate
method of unicode objects may be convenient. You pass it a dict mapping unicode codepoints (as integers) to unicode strings:
>>> u"xff and ubd23".translate({0xff: u"255", 0xbd23: u"something else"})
u'255 and something else'
For replacing substrings (and not just single characters), you could use the re
module. The re.sub
function (and the sub
method of compiled regexps) can take a callable (a function) as the first argument, which will then be called for each match:
>>> import re
>>> d = {u'spam': u'spam, ham, spam and eggs', u'eggs': u'saussages'}
>>> p = re.compile("|".join(re.escape(k) for k in d))
>>> def repl(m):
... return d[m.group(0)]
...
>>> p.sub(repl, u"spam, vikings, eggs and vikings")
u'spam, ham, spam and eggs, vikings, saussages and vikings'