I am ordering a huge pile landsat scenes from the USGS, which come as tar.gz archives. I am writing a simple python script to unpack them. Each archive contains 15 tiff images from 60-120 mb in size, totalling just over 2 gb. I can easily extract an entire archive with the following code:
import tarfile
fileName = "LT50250232011160-SC20140922132408.tar.gz"
tfile = tarfile.open(fileName, 'r:gz')
tfile.extractall("newfolder/")
I only actually need 6 of those 15 tiffs, identified as "bands" in the title. These are some of the larger files, so together they account for about half the data. So, I thought I could speed this process up by modifying the code as follows:
fileName = "LT50250232011160-SC20140922132408.tar.gz"
tfile = tarfile.open(fileName, 'r:gz')
membersList = tfile.getmembers()
namesList = tfile.getnames()
bandsList = [x for x, y in zip(membersList, namesList) if "band" in y]
print("extracting...")
tfile.extractall("newfolder/",members=bandsList)
However, adding a timer to both scripts reveals no significant efficiency gain of the second script (on my system, both run in about a minute on a single scene). While the extraction is somewhat faster, it seems like that gain is offset by the time it takes to figure out which files need to be extracted int he first place.
The question is, is this tradeoff inherant in what I am doing, or just the result of my code being inefficient? I'm relatively new to python and only discovered tarfile today, so it wouldn't surprise me if the latter were true, but I haven't been able to find any recommendations for efficient extraction of only part of an archive.
Thanks!
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…