Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


0 votes
84 views
in Technique by (71.8m points)

How to save storage in python crawler (common strings)

I have a python3 crawler that connects to target sites and saves all HTML and resources. Although I compress with gzip before saving, it consumes too much space, and I usually hit my configured space limit before even half of a website's pages are crawled.

The point is that all pages of the same website share a lot of common strings (some websites even inline resources such as CSS into every HTML page instead of linking them). So my idea is to store the common strings only once per website. I thought this kind of optimization would be documented somewhere, but I didn't find anything about it.

Although I have this idea, I don't know how to implement this kind of algorithm. Any help would be appreciated.

question from:https://stackoverflow.com/questions/65879548/how-to-save-storage-in-python-crawler-common-strings


1 Answer

0 votes
by (71.8m points)

Common compression algorithms already save a lot of space on repeated strings, and pages from the same website are an ideal case. Instead of saving and compressing each page individually, put similar pages into the same file and compress that file; the compression ratio should improve considerably.
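A minimal way to see the effect with nothing but the standard library's gzip (the page contents below are made up):

```python
import gzip

# Boilerplate shared by every page of the site (e.g. inlined CSS).
common = (b"<style>body{margin:0;font:16px sans-serif}"
          b"h1{color:#333}nav{display:flex}</style>") * 5
page1 = b"<html><head>" + common + b"</head><body><h1>Page one</h1></body></html>"
page2 = b"<html><head>" + common + b"</head><body><h1>Page two</h1></body></html>"

# Compressed separately, the shared boilerplate is encoded twice.
separate = len(gzip.compress(page1)) + len(gzip.compress(page2))

# Compressed together, the second page's boilerplate becomes back-references
# into the first page, so it costs almost nothing.
together = len(gzip.compress(page1 + page2))

print(separate, together)  # the combined archive is noticeably smaller
```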

You should also try to make the compression window larger than the average page, so that the compressor can actually reference the data of previous pages.
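The reason this matters: DEFLATE (gzip/zlib) has a fixed 32 KiB window, so it can never match data more than 32 KiB back, while formats such as lzma or zstd allow far larger windows. A small demonstration using only the standard library, with lzma standing in for any large-window codec:

```python
import gzip
import lzma
import os

# A 64 KiB "page" of incompressible bytes, stored twice back to back.
# The second copy starts 64 KiB after the first, beyond DEFLATE's
# 32 KiB window but well inside lzma's multi-megabyte dictionary.
page = os.urandom(64 * 1024)
data = page + page

gz = len(gzip.compress(data))   # gzip cannot see the earlier copy
xz = len(lzma.compress(data))   # lzma encodes the repeat as one long match

print(gz, xz)
```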

If you are saving pages to disk or to a database, you can use the domain name or the first X characters of the URL as your index for finding the compressed file, so pages under the same domain/directory naturally get compressed together.
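A sketch of that bucketing, assuming the crawl results are already in memory (the URLs and page bodies below are made up):

```python
import gzip
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical crawl results: URL -> raw HTML bytes.
pages = {
    "https://example.com/a": b"<html>...page a...</html>",
    "https://example.com/b": b"<html>...page b...</html>",
    "https://other.org/x":   b"<html>...page x...</html>",
}

# Bucket page bodies by domain so same-site pages compress together.
buckets = defaultdict(list)
for url, body in pages.items():
    buckets[urlparse(url).netloc].append(body)

# One compressed blob per domain; a 4-byte length prefix before each
# page keeps the pages separable when decompressing later.
archives = {}
for domain, bodies in buckets.items():
    payload = b"".join(len(b).to_bytes(4, "big") + b for b in bodies)
    archives[domain] = gzip.compress(payload)

print(sorted(archives))  # ['example.com', 'other.org']
```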


Another approach you can take is to create "dictionaries" for your compression. Some compression algorithms let you train a dictionary on a few sample files, which can then be used to compress similar files more effectively. One algorithm that supports this is zstd; its documentation describes how to train and use such dictionaries.
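zstd's Python bindings (the third-party `zstandard` package) offer trained dictionaries, but the standard library's zlib supports the same idea through preset dictionaries, so a sketch can stay dependency-free (the boilerplate string below is made up):

```python
import zlib

# Shared boilerplate extracted from earlier pages of the same site,
# used as a preset dictionary (zlib's analogue of a zstd dictionary).
boilerplate = (b"<html><head><style>body{margin:0;font:16px sans-serif}"
               b"</style></head><body>")

page = boilerplate + b"<h1>Article</h1><p>Some unique text.</p></body></html>"

# Without a dictionary the boilerplate must be encoded from scratch.
plain = zlib.compress(page)

# With the dictionary, the boilerplate is replaced by matches into it.
comp = zlib.compressobj(zdict=boilerplate)
with_dict = comp.compress(page) + comp.flush()

# Decompression must supply the same dictionary.
decomp = zlib.decompressobj(zdict=boilerplate)
assert decomp.decompress(with_dict) == page

print(len(plain), len(with_dict))  # the dictionary version is smaller
```

The trade-off is that the dictionary must be stored once and kept around: losing it makes every blob compressed against it unreadable.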

