Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

multithreading - Threading in Python doesn't run in parallel

I'm making data-scraping calls with urllib2, and each call takes around 1 second to complete. I wanted to test whether I could split the URL-call loop across several threads, each working on a different offset.

This is what I'm doing now with my update_items() method, whose first and second parameters are the offset and limit of the loop:

import threading
t1 = threading.Thread(target=trade.update_items(1, 100))
t2 = threading.Thread(target=trade.update_items(101, 200))
t3 = threading.Thread(target=trade.update_items(201, 300))

t1.start()
t2.start()
t3.start()

#t1.join()
#t2.join()
#t3.join()

As the code shows, I tried commenting out the join() calls so that the main thread wouldn't wait for the threads, but it seems I've misunderstood this library. I inserted print() calls into update_items(), and oddly they show that the loops still run serially, one after another, rather than all 3 threads running in parallel as I wanted.

My normal scraping run takes about 5 hours to complete. Each piece of data is very small, but every HTTP call takes some time. I want to run this task across at least a few threads to cut the fetching time to around 30-45 minutes.

1 Answer

To fetch multiple URLs in parallel, limiting to 20 connections at a time:

import urllib2
from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API

def generate_urls():
    # generate some dummy urls
    for i in range(100):
        yield 'http://example.com?param=%d' % i

def get_url(url):
    # return (url, body, error) so that one failed request doesn't stop the loop
    try:
        return url, urllib2.urlopen(url).read(), None
    except EnvironmentError as e:
        return url, None, e

pool = Pool(20)  # limit number of concurrent connections
# results arrive in completion order, not in the order of the urls
for url, result, error in pool.imap_unordered(get_url, generate_urls()):
    if error is None:
        print result,
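
As for why the threads in the question ran one after another: threading.Thread(target=trade.update_items(1, 100)) calls update_items(1, 100) right away, in the main thread, and passes its return value as target. Each loop has therefore already finished before start() is ever called. Below is a minimal sketch of the likely fix, passing the function itself plus its arguments via args. It is written for Python 3, and update_items here is a hypothetical stand-in that sleeps instead of making HTTP calls; its extra log parameter exists only for illustration.

```python
import threading
import time

def update_items(offset, limit, log):
    # Hypothetical stand-in for trade.update_items: sleeps instead of doing HTTP calls.
    time.sleep(0.3)
    log.append((threading.current_thread().name, offset, limit))

def run_in_threads(ranges):
    log = []
    threads = [
        # Pass the function object as target and its arguments via args.
        # Writing target=update_items(1, 100, log) would call it right here instead.
        threading.Thread(target=update_items, args=(offset, limit, log))
        for offset, limit in ranges
    ]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for all workers before reading the results
    return log, time.time() - start

if __name__ == "__main__":
    log, elapsed = run_in_threads([(1, 100), (101, 200), (201, 300)])
    for name, offset, limit in sorted(log, key=lambda entry: entry[1]):
        print(name, offset, limit)
    print("elapsed: %.2fs" % elapsed)
```

The Pool version above avoids the same pitfall because get_url is passed uncalled. Since the work is mostly waiting on the network, threads should be able to overlap those waits even with the GIL.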
