Dynamic 'wait' arg in Scrapy-splash

Stack Overflow Asked by Winters on November 16, 2021

I am scraping multiple pages using Scrapy-Splash.

class Spider(scrapy.Spider):
    name = "scrape"

    def start_requests(self):
        urls = get_urls()
        for url in urls:
            yield scrapy.Request(url, callback=self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 8}
                }
            })

The code works fine, I get the desired result from the pages.

The problem is that I have to set a large wait time (> 4 seconds), or Splash is sometimes terminated by the next request before it returns a result. This seems terribly unreliable.

Is there a way to set the wait time to something more dynamic? I found a partial solution here using a Lua script:

Adding a wait-for-element while performing a SplashRequest in python Scrapy

function main(splash)
  splash:set_user_agent(splash.args.ua)
  assert(splash:go(splash.args.url))

  -- requires Splash 2.3  
  while not splash:select('.my-element') do
    splash:wait(0.1)
  end
  return {html=splash:html()}
end

But it appears to require a hard-coded selector (".my-element") to decide when Splash is done, and I am scraping many different websites, each with different elements to be collected.

How can I dynamically set the 'wait' arg, or customise the Lua script to terminate Splash once it has collected the desired element? Surely this is a common problem?
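One possible direction (a sketch, not a tested answer) is to keep the Lua script generic and pass the per-site selector into it through `splash.args`: scrapy-splash forwards everything in a `SplashRequest`'s `args` dict to the script when using the `execute` endpoint. The `LUA_WAIT_FOR` script, the `SELECTORS` map, and the `wait_selector` helper below are hypothetical names, and the selector values are placeholders for whatever each site actually uses:

```python
# Sketch: one generic Lua script, parameterised per request via splash.args.
# Assumes scrapy-splash is installed and a Splash instance is running.
from urllib.parse import urlparse

# Waits until the selector passed in splash.args appears, with a cap
# (max_wait) so a missing element cannot hang the render forever.
LUA_WAIT_FOR = """
function main(splash)
  assert(splash:go(splash.args.url))
  local elapsed = 0
  -- splash:select requires Splash 2.3+
  while not splash:select(splash.args.wait_for) do
    splash:wait(0.1)
    elapsed = elapsed + 0.1
    if elapsed > splash.args.max_wait then
      break  -- give up and return whatever has rendered so far
    end
  end
  return {html=splash:html()}
end
"""

# Hypothetical per-site map of "content is ready" selectors.
SELECTORS = {
    "example.com": "div.article-body",
    "news.example.org": "section.story",
}

def wait_selector(url, default="body"):
    """Pick the readiness selector for a URL's domain, falling back
    to a generic selector for sites not in the map."""
    return SELECTORS.get(urlparse(url).netloc, default)
```

In `start_requests`, the `render.html` request could then become something like `yield SplashRequest(url, self.parse, endpoint='execute', args={'lua_source': LUA_WAIT_FOR, 'wait_for': wait_selector(url), 'max_wait': 8})`, so each site waits only as long as its own element takes to appear.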
