Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

php - Scraping website with Goutte hangs until timeout on specific site

I'm playing around with Goutte and can't get it to connect to a certain website. All other URLs seem to be working perfectly, and I'm struggling to understand what's preventing it from connecting. It just hangs until it times out after 30 seconds. If I remove the timeout, the same happens after 150 seconds.

Key points to note:

  • So far, the hang / timeout only happens on tesco.com. asda.com, google.com, etc. work fine and return a result.
  • The site loads instantly in a web browser (Chrome), so it isn't IP or ISP related.
  • A GET request to the same URL in Postman returns a result fine.
  • It doesn't appear to be user-agent related.

<?php

namespace App\Http\Controllers;

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;

class ScraperController extends Controller
{
    public function scrape()
    {
        $goutteClient = new Client();

        $goutteClient->setHeader('user-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36');

        $guzzleClient = new GuzzleClient(array(
            'timeout' => 30,
            'verify' => true,
            'debug' => true,
        ));
        $goutteClient->setClient($guzzleClient);
        $crawler = $goutteClient->request('GET', 'https://www.tesco.com/');

        dump($crawler);

        /*$crawler->filter('.result__title .result__a')->each(function ($node) {
            dump($node->text());
        });*/

    }
}

This is the "debug" output, including the error:

* Trying 104.123.91.150:443...
* TCP_NODELAY set
* Connected to www.tesco.com (104.123.91.150) port 443 (#0)
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
* subject: C=GB; L=Welwyn Garden City; jurisdictionC=GB; O=Tesco PLC; businessCategory=Private Organization; serialNumber=00445790; CN=www.tesco.com
* start date: Feb 4 11:09:23 2020 GMT
* expire date: Feb 3 11:39:21 2022 GMT
* subjectAltName: host "www.tesco.com" matched cert's "www.tesco.com"
* issuer: C=US; O=Entrust, Inc.; OU=See www.entrust.net/legal-terms; OU=(c) 2014 Entrust, Inc. - for authorized use only; CN=Entrust Certification Authority - L1M
* SSL certificate verify ok.
> GET / HTTP/1.1
Host: www.tesco.com
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36
* old SSL session ID is stale, removing
* Operation timed out after 30001 milliseconds with 0 bytes received
* Closing connection 0
GuzzleHttp\Exception\ConnectException
cURL error 28: Operation timed out after 30001 milliseconds with 0 bytes received (see https://curl.haxx.se/libcurl/c/libcurl-errors.html)
http://localhost/scrape

Can anyone see why I'm getting no response at all?

Question from: https://stackoverflow.com/questions/65865442/scraping-website-with-goutte-hangs-until-timeout-on-specific-site

1 Answer

I managed to resolve this by sending more browser-like request headers and enabling cookies on the Guzzle client:

<?php

namespace App\Http\Controllers;

use Goutte\Client;
use GuzzleHttp\Client as GuzzleClient;

class ScraperController extends Controller
{
    public function scrape()
    {
        $goutteClient = new Client();

        $goutteClient->setHeader('accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9');
        $goutteClient->setHeader('accept-encoding', 'gzip, deflate, br');
        $goutteClient->setHeader('accept-language', 'en-GB,en-US;q=0.9,en;q=0.8');
        $goutteClient->setHeader('upgrade-insecure-requests', '1');
        $goutteClient->setHeader('user-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36');
        $goutteClient->setHeader('connection', 'keep-alive');

        $guzzleClient = new GuzzleClient(array(
            'timeout' => 5,
            'verify' => true,
            'debug' => true,
            'cookies' => true,
        ));
        $goutteClient->setClient($guzzleClient);
        $crawler = $goutteClient->request('GET', 'https://www.tesco.com/');

        dump($crawler);

        /*$crawler->filter('.result__title .result__a')->each(function ($node) {
            dump($node->text());
        });*/
    }
}
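
If the same headers are needed in more than one scraper, they could be collected in a small helper. This is only a sketch: `browserLikeHeaders()` is a hypothetical function (not part of Goutte or Guzzle), the header values are copied from the code above, and the usage part assumes the Goutte 3 style `setHeader()` / `setClient()` API used in this thread, so it is skipped when that API is not available.

```php
<?php

/**
 * Hypothetical helper: returns the browser-like headers from the answer above.
 * Header names are case-insensitive, so override keys are lower-cased before
 * merging to avoid sending the same header twice.
 */
function browserLikeHeaders(array $overrides = []): array
{
    $defaults = [
        'accept' => 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'accept-encoding' => 'gzip, deflate, br',
        'accept-language' => 'en-GB,en-US;q=0.9,en;q=0.8',
        'upgrade-insecure-requests' => '1',
        'user-agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.96 Safari/537.36',
        'connection' => 'keep-alive',
    ];

    return array_merge($defaults, array_change_key_case($overrides, CASE_LOWER));
}

// Usage, assuming Goutte 3 and Guzzle are installed and autoloaded.
// The guard skips this part when setHeader() is not available.
if (class_exists('GuzzleHttp\Client') && method_exists('Goutte\Client', 'setHeader')) {
    $goutteClient = new \Goutte\Client();

    foreach (browserLikeHeaders() as $name => $value) {
        $goutteClient->setHeader($name, $value);
    }

    $goutteClient->setClient(new \GuzzleHttp\Client([
        'timeout' => 5,
        'verify' => true,
        'cookies' => true,
    ]));
}
```

The `$overrides` argument is there so a single header can be changed per call. For example, `browserLikeHeaders(['Accept-Encoding' => 'gzip, deflate'])` drops `br`, which may be worth trying if the response body comes back undecoded on a curl build without Brotli support.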
