If this is really something line based (where a true XML parser isn't necessarily the best solution), mmap can help here.
mmap the file, then call .rfind('\n') on the resulting object (possibly with adjustments to handle the file ending with a newline when you really want the non-empty line before it, not the empty "line" following it). You can then slice out the final line alone. If you need to modify the file in place, you can resize the file to shave off (or add) a number of bytes corresponding to the difference between the line you sliced and the new line, then write back the new line. This avoids reading or writing any more of the file than you need.
Example code (please comment if I made a mistake):
    import mmap

    # In Python 3.1 and earlier, you'd wrap mmap in contextlib.closing; mmap
    # didn't support the context manager protocol natively until 3.2; see example below
    with open("large.XML", 'r+b') as myfile, mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
        # len(mm) - 1 handles files ending w/newline by getting the prior line
        # + 1 to avoid catching prior newline (and handle one line file seamlessly)
        startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1

        # Get the line (with any newline stripped)
        line = mm[startofline:].rstrip(b'\r\n')

        # Do whatever calculates the new line, decoding/encoding to use str
        # in do_something to simplify; this is an XML file, so I'm assuming UTF-8
        new_line = do_something(line.decode('utf-8')).encode('utf-8')

        # Resize to accommodate the new line (or to strip data beyond the new line)
        mm.resize(startofline + len(new_line))  # + 1 if you need to add a trailing newline
        mm[startofline:] = new_line  # Replace contents; add a b"\n" if needed
Apparently on some systems without mremap (e.g. OS X), mm.resize won't work, so to support those systems, you'd probably split the with statement (so the mmap closes before the file object), and use file object based seeks, writes and truncates to fix up the file. The following example includes my previously mentioned Python 3.1 and earlier specific adjustment to use contextlib.closing for completeness:
    import mmap
    from contextlib import closing

    with open("large.XML", 'r+b') as myfile:
        with closing(mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE)) as mm:
            startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
            line = mm[startofline:].rstrip(b'\r\n')
            new_line = do_something(line.decode('utf-8')).encode('utf-8')

        # The mmap is closed at this point; fix up the file through the file object
        myfile.seek(startofline)  # Move to where old line began
        myfile.write(new_line)  # Overwrite existing line with new line
        myfile.truncate()  # If existing line longer than new line, get rid of the excess
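To try the seek/write/truncate variant end to end, here is a minimal sketch on a throwaway file. The helper name replace_last_line, its transform callback (standing in for do_something) and the demo file contents are all made up for illustration; it also assumes a non-empty UTF-8 file and always writes a trailing newline, which may not be what you want.

```python
import mmap
import os
import tempfile


def replace_last_line(path, transform):
    """Replace the last non-empty line of `path` with transform(line).

    Hypothetical helper for illustration; assumes a non-empty UTF-8 file.
    """
    with open(path, 'r+b') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
            # Stop the search at len(mm) - 1 so a trailing newline is skipped
            startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
            line = mm[startofline:].rstrip(b'\r\n')
        # The mmap is closed here; only the file object is used from now on
        new_line = transform(line.decode('utf-8')).encode('utf-8')
        f.seek(startofline)
        f.write(new_line + b'\n')  # Assumption: always end the file with a newline
        f.truncate()  # Drop leftover bytes if the old line was longer


# Demo on a temporary file with three short lines
fd, demo_path = tempfile.mkstemp(suffix='.xml')
with os.fdopen(fd, 'wb') as f:
    f.write(b'<a>1</a>\n<a>2</a>\n<a>3</a>\n')

replace_last_line(demo_path, lambda s: s.replace('3', 'three'))

with open(demo_path, 'rb') as f:
    result = f.read()
os.remove(demo_path)

print(result)  # b'<a>1</a>\n<a>2</a>\n<a>three</a>\n'
```

Only the last line is rewritten; the first two lines are never touched by the write.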
The advantages to mmap over any other approach are:
- No need to read any more of the file beyond the line itself (meaning 1-2 pages of the file; the rest never gets read or written)
- Using rfind means you can let Python do the work of finding the newline quickly at the C layer (in CPython); explicit seeks and reads of a file object could match the "only read a page or so", but you'd have to hand-implement the search for the newline
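For comparison, this is roughly what the hand-implemented search with explicit seeks and reads might look like. It is only a sketch: the function name find_last_line_start and the 4096 byte default chunk size are my own choices, not anything from a library.

```python
import io
import os


def find_last_line_start(f, chunk_size=4096):
    """Return the byte offset where the last non-empty line of `f` begins.

    `f` must be a seekable binary file object. Hypothetical helper for illustration.
    """
    end = f.seek(0, os.SEEK_END)
    if end == 0:
        return 0
    # Ignore the final byte so a trailing newline isn't taken as the separator
    # (same idea as passing len(mm) - 1 to mm.rfind)
    pos = end - 1
    while pos > 0:
        start = max(0, pos - chunk_size)
        f.seek(start)
        chunk = f.read(pos - start)
        idx = chunk.rfind(b'\n')
        if idx != -1:
            return start + idx + 1  # First byte after the newline
        pos = start  # No newline in this chunk; step further back
    return 0  # No newline found: the whole file is one line


# Works on any seekable binary stream; BytesIO stands in for a real file here
print(find_last_line_start(io.BytesIO(b'ab\ncd\n')))  # 3
```

It reads about as little as the mmap version does, but you have to get the chunk boundaries right yourself, which is exactly the bookkeeping rfind on the mmap saves you.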
Caveat: This approach will not work (at least, not without modification to avoid mapping more than 2 GB, and to handle resizing when the whole file might not be mapped) if you're on a 32 bit system and the file is too large to map into memory. On most 32 bit systems, even in a newly spawned process, you only have 1-2 GB of contiguous address space available; in certain special cases, you might have as much as 3-3.5 GB of user virtual addresses (though you'll lose some of the contiguous space to the heap, stack, executable mapping, etc.).
mmap doesn't require much physical RAM, but it needs contiguous address space; one of the huge benefits of a 64 bit OS is that you stop worrying about virtual address space in all but the most ridiculous cases, so mmap can solve problems in the general case that it couldn't handle without added complexity on a 32 bit OS. Most modern computers are 64 bit at this point, but it's definitely something to keep in mind if you're targeting 32 bit systems (and on Windows, even if the OS is 64 bit, a 32 bit version of Python may have been installed by mistake, so the same problems apply).
Here's yet one more example that works (assuming the last line isn't 100+ MB long) on 32 bit Python (omitting closing and imports for brevity) even for huge files:
    with open("large.XML", 'r+b') as myfile:
        filesize = myfile.seek(0, 2)
        # Get an offset that only grabs the last 100 MB or so of the file aligned properly
        offset = max(0, filesize - 100 * 1024 ** 2) & ~(mmap.ALLOCATIONGRANULARITY - 1)
        with mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE, offset=offset) as mm:
            startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
            # If line might be > 100 MB long, probably want to check if startofline
            # follows a newline here
            line = mm[startofline:].rstrip(b'\r\n')
            new_line = do_something(line.decode('utf-8')).encode('utf-8')

        myfile.seek(startofline + offset)  # Move to where old line began, adjusted for offset
        myfile.write(new_line)  # Overwrite existing line with new line
        myfile.truncate()  # If existing line longer than new line, get rid of the excess
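In case the offset arithmetic looks opaque: mmap requires the offset to be a multiple of mmap.ALLOCATIONGRANULARITY, and since that value is a power of two, masking off the low bits rounds down to the nearest multiple. A small worked sketch follows; the function name aligned_tail_offset and the sample sizes are invented for illustration, and the granularity is passed explicitly so the numbers don't depend on your platform.

```python
import mmap


def aligned_tail_offset(filesize, window, granularity=mmap.ALLOCATIONGRANULARITY):
    """Offset for mapping roughly the last `window` bytes of a file.

    Hypothetical helper; `granularity` must be a power of two, so clearing
    the low bits rounds the offset down to a multiple of it.
    """
    return max(0, filesize - window) & ~(granularity - 1)


# 10,000,000 byte file, ~1,000,000 byte window, 4096 byte granularity:
# 9,000,000 rounds down to 2197 * 4096
print(aligned_tail_offset(10000000, 1000000, 4096))   # 8998912

# Same file with a 64 KiB granularity: rounds down to 137 * 65536
print(aligned_tail_offset(10000000, 1000000, 65536))  # 8978432

# File smaller than the window: map from the start
print(aligned_tail_offset(500, 1000000, 4096))        # 0
```

Because the offset is rounded down, the mapping ends up slightly larger than the requested window, never smaller.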