TransWikia.com

Scrapy Last Page is not null and after page 146 last page is showing again

Stack Overflow Asked by Slacoff on November 15, 2021

The website has 146 pages of words, but after page 146 the last page is shown again.

     if next_page is not None:
         yield response.follow(next_page, callback = self.parse)

With this method the spider does not stop at page 146; it keeps going because pages 147, 148, 149, ... are the same as page 146. I tried a for loop, but that did not work. I also tried to take the value from the next-page button and break out of the function using next_extract. The output of next_extract is ['kelimeler.php?s=1'], and the number increases with the page number, like ['kelimeler.php?s=2']. That approach did not work either.

     next_page = response.css('div.col-md-6.col-sm-6.col-xs-6:nth-child(2) a::attr(href)').get()
     next_extract = response.css('div.col-md-6.col-sm-6.col-xs-6:nth-child(2) a').xpath("@href").extract()

     print(next_page)
     print(next_extract)

     if next_extract is 'kelimeler.php?s=147':
         break
     if next_page is not None:
         yield response.follow(next_page, callback = self.parse)

What should I do to stop the scraping at page 146?

Here is the whole parse function:

     def parse(self, response):

         items = TidtutorialItem()

         all_div_kelimeler = response.css('a.collapsed')

         for tid in all_div_kelimeler:

             kelime = tid.css('a.collapsed::text').extract()
             link = tid.css('a.collapsed::text').xpath("@href").extract()

             items['Kelime'] = kelime
             items['Link'] = link

             yield items

         next_page = response.css('div.col-md-6.col-sm-6.col-xs-6:nth-child(2) a::attr(href)').get()
         next_extract = response.css('div.col-md-6.col-sm-6.col-xs-6:nth-child(2) a').xpath("@href").extract()

         print(next_page)
         print(next_extract)

         if next_page is not None:
         #if next_extract is not 'kelimeler.php?s=2':
         #for i in range (10):
             yield response.follow(next_page, callback = self.parse)

One Answer

I can't be very precise about the best approach without seeing the page, but I can give you some suggestions.

     next_page = response.css('div.col-md-6.col-sm-6.col-xs-6:nth-child(2) a::attr(href)').get()
     next_extract = response.css('div.col-md-6.col-sm-6.col-xs-6:nth-child(2) a').xpath("@href").extract()

I'm not sure what you are trying to accomplish here, since the two selectors are essentially the same. The only difference is that the second one uses the .extract() method, which returns a list, while .get() returns a single string (or None). Because next_extract is a list, the condition in the following lines can never be true:

    if next_extract is 'kelimeler.php?s=147':
        break
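To make the difference concrete, here is a small plain-Python sketch with no Scrapy involved. The two hard-coded values are only assumed stand-ins for what .get() and .extract() would return on the page described in the question.

```python
# Stand-ins (assumed values) for what the two selector calls return.
next_page = 'kelimeler.php?s=147'        # .get() gives one string (or None)
next_extract = ['kelimeler.php?s=147']   # .extract() gives a list of strings

target = 'kelimeler.php?s=147'

# A list is never equal to a string, whatever the list contains.
list_matches = next_extract == target

# Comparing the string from .get() by value works as expected.
string_matches = next_page == target

# To use the list, pick out an element first.
first_item_matches = next_extract[0] == target
```

Note that the comparisons use ==, which compares values. The is operator checks whether two names refer to the same object, so it is not a reliable way to compare strings.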

Another important point is that break may only be used inside a loop. In your snippet it is not inside one, so Python rejects the code with a SyntaxError ('break' outside loop) when the file is compiled, before the condition is ever evaluated.
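How Python treats a break outside a loop can be checked without running a spider. The sketch below compiles a small snippet held in a string and records the error message; the snippet and the variable names are made up for illustration.

```python
# A snippet with `break` outside any loop, kept as a string so that
# compiling it here does not stop this example itself.
snippet = "if False:\n    break\n"

try:
    compile(snippet, "<example>", "exec")
    error_message = None
except SyntaxError as exc:
    # The error is raised at compile time, even though the
    # condition `if False` could never be true at run time.
    error_message = exc.msg
```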

Again, without seeing the page I can't say this for sure, but I believe this would accomplish what you are trying to do:

    if next_page == 'kelimeler.php?s=147':
        return

Notice next_page instead of next_extract. If you want to use the latter, remember it is a list, not a string.
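If comparing against one exact string feels brittle, a possible variation is to read the page number out of the href and compare it with a limit. This is only a sketch under assumptions: the helper names page_number and should_follow and the constant LAST_PAGE are hypothetical, the value 146 comes from the question, and it assumes the s query parameter is the page number.

```python
from urllib.parse import parse_qs, urlparse

LAST_PAGE = 146  # assumption: taken from the question; adjust to the real site


def page_number(href):
    """Return the integer value of the `s` query parameter, or None."""
    if not href:
        return None
    values = parse_qs(urlparse(href).query).get('s')
    if not values:
        return None
    try:
        return int(values[0])
    except ValueError:
        return None


def should_follow(href, last_page=LAST_PAGE):
    """True only when href points at a page number up to last_page."""
    number = page_number(href)
    return number is not None and number <= last_page
```

Inside parse, the existing check could then become `if should_follow(next_page):` before the `yield response.follow(next_page, callback=self.parse)` line. Whether the limit should be 146 or some other value depends on how the site numbers its pages, so it is worth checking against the printed next_page values.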

Answered by renatodvc on November 15, 2021
