Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


django - ReactorNotRestartable - Twisted and scrapy

Before you link me to other answers related to this, note that I've read them and am still a bit confused. Alrighty, here we go.

I am creating a web app in Django and importing the latest Scrapy library to crawl a website. I am not using Celery (I know very little about it, but saw it mentioned in other topics related to this).

One of our site's URLs, /crawl/, is meant to start the crawler. It's the only URL on the site that requires Scrapy. Here is the view function that is called when the URL is visited:

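from django.shortcuts import render
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

# ReviewSpider is the project's own spider; its import is not shown here.

def crawl(request):
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()

    d = runner.crawl(ReviewSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished

    return render(request, 'index.html')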

You'll notice that this is an adaptation of the Scrapy tutorial on their website. The first time this URL is visited after the server starts running, everything works as intended. On the second and every later visit, a ReactorNotRestartable exception is thrown. I understand that this exception is raised when a reactor that has already been stopped is told to start again, which is not possible.

Looking at the sample code, I would assume the line "runner = CrawlerRunner()" returns a new reactor each time this URL is visited. But perhaps my understanding of Twisted reactors is not completely clear.

How would I go about getting and running a new reactor each time this URL is visited?

Thank you so much


Fight the dragon too long and you become the dragon yourself; gaze into the abyss too long and the abyss gazes back...

1 Answer


Generally speaking, you can't get a new reactor. There is a single global one per process, and once it has been stopped it cannot be started again. This is clearly a design mistake and maybe it will be corrected in the future, but that's the current state of affairs.

You might be able to use Crochet to run a single reactor in a separate thread for the lifetime of your whole process, rather than repeatedly starting and stopping it.

Consider the example from the Crochet docs:

#!/usr/bin/python
"""
Do a DNS lookup using Twisted's APIs.
"""
from __future__ import print_function

# The Twisted code we'll be using:
from twisted.names import client

from crochet import setup, wait_for
setup()


# Crochet layer, wrapping Twisted's DNS library in a blocking call.
@wait_for(timeout=5.0)
def gethostbyname(name):
    """Lookup the IP of a given hostname.

    Unlike socket.gethostbyname() which can take an arbitrary amount of time
    to finish, this function will raise crochet.TimeoutError if more than 5
    seconds elapse without an answer being received.
    """
    d = client.lookupAddress(name)
    d.addCallback(lambda result: result[0][0].payload.dottedQuad())
    return d


if __name__ == '__main__':
    # Application code using the public API - notice it works in a normal
    # blocking manner, with no event loop visible:
    import sys
    name = sys.argv[1]
    ip = gethostbyname(name)
    print(name, "->", ip)

This gives you a blocking gethostbyname function that's implemented using Twisted APIs. The implementation uses twisted.names.client which just relies on being able to import the global reactor.

Note there is no reactor.run or reactor.stop call - just the Crochet setup call.

