I'm using Puppeteer to scrape some pages, and I'm curious how to manage it in production for a Node app. I'll be scraping up to 500,000 pages a day, but the scrape jobs will arrive at random intervals, so it's not a single queue that I can plow through.
What I'm wondering is: is it better to open a browser, go to the page, and then close the browser for each job? I assume that would be a lot slower, but it might handle memory better.
Or do I open one global browser when the app boots, go to the page, and have some way to dump that page when I'm done with it (e.g. closing all tabs in Chrome, but not closing Chrome itself), then open a new page when I need one? This way seems like it would be faster, but it could potentially eat up a lot of memory.
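To make the two options concrete, here is a rough sketch of what I have in mind. The names `scrapeWithFreshBrowser`, `createSharedScraper`, and `maxJobsPerBrowser` are made up for illustration, and the `launch` parameter exists only so the logic can be tried with a stub instead of real Chrome. Recycling the shared browser after a fixed number of jobs is my own guess at how to bound memory, not something I know to be recommended.

```javascript
// Assumption: Puppeteer is installed; it is loaded lazily so a stub can be injected.
const defaultLaunch = () => require('puppeteer').launch();

// Option 1: a fresh browser for every job.
async function scrapeWithFreshBrowser(url, launch = defaultLaunch) {
  const browser = await launch();
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return await page.content();
  } finally {
    await browser.close(); // always release the Chrome process
  }
}

// Option 2: one long-lived browser, one short-lived page (tab) per job.
// After maxJobsPerBrowser jobs (hypothetical knob) the browser is retired and
// closed once its last open page has finished.
function createSharedScraper(launch = defaultLaunch, maxJobsPerBrowser = 200) {
  let current = null; // { browserPromise, jobs, openPages, retired }

  async function closeIfIdle(slot) {
    if (slot.retired && slot.openPages === 0) {
      const browser = await slot.browserPromise;
      await browser.close();
    }
  }

  function acquire() {
    if (!current || current.jobs >= maxJobsPerBrowser) {
      if (current) {
        current.retired = true;
        closeIfIdle(current).catch(() => {}); // close right away if no pages are open
      }
      current = { browserPromise: launch(), jobs: 0, openPages: 0, retired: false };
    }
    current.jobs += 1;
    current.openPages += 1;
    return current;
  }

  async function scrape(url) {
    const slot = acquire();
    let page;
    try {
      const browser = await slot.browserPromise;
      page = await browser.newPage();
      await page.goto(url, { waitUntil: 'domcontentloaded' });
      return await page.content();
    } finally {
      if (page) await page.close().catch(() => {}); // drop the tab, keep Chrome running
      slot.openPages -= 1;
      await closeIfIdle(slot);
    }
  }

  async function shutdown() {
    if (!current) return;
    const slot = current;
    current = null;
    slot.retired = true;
    await closeIfIdle(slot);
  }

  return { scrape, shutdown };
}
```

Usage would be something like `const scraper = createSharedScraper();` at boot, `await scraper.scrape(url)` per job, and `await scraper.shutdown()` on exit. The sketch does not deal with browser crashes, launch failures, or limiting how many pages are open at once.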
I've never worked with this library, especially in a production environment, so I'm not sure if there are things I should watch out for.