Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


compression - How do I list contents of a gz file without extracting it in Python?

I have a .gz file and I need to get the names of the files inside it using Python.

This question is the same as this one.

The only difference is that my file is .gz, not .tar.gz, so the tarfile library did not help me here.

I am using the requests library to request a URL. The response is a compressed file.

Here is the code I am using to download the file:

import shutil
import requests

# line, base_output_dir, current_dir and count come from the surrounding loop
response = requests.get(line.rstrip(), stream=True)
if response.status_code == 200:
    with open(str(base_output_dir)+"/"+str(current_dir)+"/"+str(count)+".gz", 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response

This code downloads the file with a name such as 1.gz. If I open that file with an archive manager, it contains something like my_latest_data.json.

I need to extract the file so that the output is named my_latest_data.json.

Here is the code I am using to extract the file:

import gzip

# f is the path of the downloaded .gz file
outfilename = f.split(".")[0]
with gzip.open(f, 'rb') as inF, open(outfilename, 'wb') as outF:
    outF.write(inF.read())

The outfilename variable is a string I provide in the script, but I need the real file name (my_latest_data.json).



1 Answer


You can't, because Gzip is not an archive format.

That's a bit of a crap explanation on its own, so let me break this down a bit more than I did in the comment...

It's just compression

Being "just a compression system" means that Gzip operates on input bytes (usually from a file) and outputs compressed bytes. You cannot know whether the bytes inside represent multiple files or just a single file -- it is just a stream of bytes that has been compressed. That is why you can accept gzipped data over a network, for example. It's bytes_in -> bytes_out.
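To make the bytes_in -> bytes_out point concrete, here is a minimal sketch using only the standard library. The roundtrip helper and the sample payloads are made up for illustration; the point is that two "files" joined before compression come back as one undivided run of bytes.

```python
import gzip


def roundtrip(payload: bytes) -> bytes:
    """Compress and then decompress a byte string with gzip."""
    compressed = gzip.compress(payload)
    return gzip.decompress(compressed)


# Two hypothetical "files" concatenated before compression.
first = b'{"a": 1}'
second = b'{"b": 2}'
joined = first + second

# What comes back is the same byte stream; gzip has recorded nothing
# about where the first "file" ended and the second began.
restored = roundtrip(joined)
print(restored == joined)
```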

What's a manifest?

A manifest is a header within an archive that acts as a table of contents for the archive. Note that now I am using the term "archive" and not "compressed stream of bytes". An archive implies that it is a collection of files or segments that are referred to by a manifest -- a compressed stream of bytes is just a stream of bytes.

What's inside a Gzip, anyway?

A somewhat simplified description of a .gz file's contents is:

  1. A header with a magic number to indicate it's a gzip, the compression method, flags and a timestamp (10 bytes)
  2. Optional headers; usually including the original filename (if the compression target was a file)
  3. The body -- some compressed payload
  4. A CRC-32 checksum and the uncompressed size at the end (8 bytes)

That's it. No manifest.
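The optional filename header is the one piece of the layout above that can help with the question. As far as I know, Python's gzip module does not expose that field when reading, so the sketch below parses the header by hand following RFC 1952. The function names read_gzip_original_name and build_sample_gzip are my own, not part of any library, and the field is optional: whatever produced the file may have left it out or stored something unhelpful.

```python
import gzip
import io
import struct
from typing import BinaryIO, Optional

# Flag bits from RFC 1952.
FEXTRA = 0x04
FNAME = 0x08


def read_gzip_original_name(stream: BinaryIO) -> Optional[str]:
    """Return the filename stored in the gzip header, or None if absent."""
    header = stream.read(10)
    if len(header) < 10 or header[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    flags = header[3]
    if flags & FEXTRA:
        # Optional extra field: 2-byte little-endian length, then the data.
        (xlen,) = struct.unpack("<H", stream.read(2))
        stream.read(xlen)
    if not flags & FNAME:
        return None
    # The name is a zero-terminated Latin-1 string.
    name = bytearray()
    while (byte := stream.read(1)) not in (b"", b"\x00"):
        name += byte
    return name.decode("latin-1")


def build_sample_gzip(name: str) -> bytes:
    """Build a small gzip stream in memory with `name` stored in its header."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=name, mode="wb", fileobj=buf) as gz:
        gz.write(b'{"ok": true}')
    return buf.getvalue()


sample = build_sample_gzip("my_latest_data.json")
print(read_gzip_original_name(io.BytesIO(sample)))
```

For a file on disk you would pass open(path, "rb") instead of the in-memory buffer. Only the first few bytes are read, so nothing is decompressed.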

Archive formats, on the other hand, will have a manifest inside. That's where the tar library would come in. Tar is just a way to shove a bunch of bits together into a single file, and it writes a header in front of each original file that lets you know its name and what size it was before being concatenated into the archive. Hence, .tar.gz being so common.
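For comparison, this is roughly what listing looks like when there really is a tar archive inside the gzip stream. It is a sketch that builds a throwaway .tar.gz in memory; the helper names and the member names a.json and b.txt are invented for the example.

```python
import io
import tarfile


def build_sample_tar_gz() -> bytes:
    """Create a tiny .tar.gz in memory with two hypothetical members."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as archive:
        for name, body in (("a.json", b"{}"), ("b.txt", b"hello")):
            info = tarfile.TarInfo(name=name)
            info.size = len(body)
            archive.addfile(info, io.BytesIO(body))
    return buf.getvalue()


def list_tar_gz_members(data: bytes) -> list[str]:
    """Return member names without writing anything to disk."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
        return archive.getnames()


print(list_tar_gz_members(build_sample_tar_gz()))
```

The names come from tar's own metadata, not from gzip, which is why the same call fails on a plain .gz file.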

There are utilities that allow you to decompress parts of a gzipped file at a time, or decompress it only in memory to then let you examine a manifest or whatever that may be inside. But the details of any manifest are specific to the archive format contained inside.
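As one possible illustration of decompressing only part of a stream, the sketch below uses zlib's decompressobj to pull out the first few hundred bytes and checks them for the "ustar" marker that POSIX-style tar headers carry at offset 257. Both function names are hypothetical, and the check is a heuristic: very old tar variants have no such marker.

```python
import gzip
import zlib


def peek_decompressed(data: bytes, limit: int = 512) -> bytes:
    """Decompress at most `limit` bytes from the start of a gzip stream."""
    # wbits=31 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=31)
    return decompressor.decompress(data, limit)


def looks_like_tar(data: bytes) -> bool:
    """Heuristic: does the decompressed payload start with a ustar header?"""
    head = peek_decompressed(data, 512)
    return head[257:262] == b"ustar"


plain = gzip.compress(b'{"a": 1}')
print(looks_like_tar(plain))
```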

Note that this is different from a zip archive. Zip is an archive format, and as such contains a manifest. Gzip is a compression format, like bzip2 and friends.
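To show the contrast with zip, here is a small sketch that reads the member names from a zip archive built in memory. The helper names and member names are made up for the example.

```python
import io
import zipfile


def build_sample_zip() -> bytes:
    """Create a tiny zip archive in memory with two hypothetical members."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("a.json", "{}")
        archive.writestr("b.txt", "hello")
    return buf.getvalue()


def list_zip_members(data: bytes) -> list[str]:
    """Return member names from the archive's table of contents."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        return archive.namelist()


print(list_zip_members(build_sample_zip()))
```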

