AnswerBun.com

Python Scrapy how to save data in different files

I want each quote from http://quotes.toscrape.com/ saved into a CSV file (2 fields: author, quote). I also need these quotes saved in different files separated by the page they reside on, i.e. page1.csv, page2.csv, …. I have tried to achieve this by declaring feed exports in the custom_settings attribute of my spider, as shown below. This, however, doesn't even produce a file called page-1.csv. I am a total beginner with Scrapy; please explain assuming I know little to nothing.

import scrapy
import urllib

class spidey(scrapy.Spider):
    name = "idk"
    start_urls = [
        "http://quotes.toscrape.com/"
    ]

    custom_settings = {
        'FEEDS' : {
            'file://page-1.csv' : { #edit: uri needs to be absolute path
                'format' : 'csv',
                'store_empty' : True
            }
        },
        'FEED_EXPORT_ENCODING' : 'utf-8',
        'FEED_EXPORT_FIELDS' : ['author', 'quote']
    }
    

    def parse(self, response):
        for qts in response.xpath('//*[@class="quote"]'):
            author = qts.xpath("./span[2]/small/text()").get()
            quote = qts.xpath('./*[@class="text"]/text()').get()
            yield {
                'author' : author,
                'quote' : quote
                }

        next_pg = response.xpath('//li[@class="next"]/a/@href').get()      
        if next_pg is not None:
            next_pg = urllib.parse.urljoin(self.start_urls[0], next_pg)
            yield scrapy.Request(next_pg, self.parse)

How I ran the crawler: scrapy crawl idk
As an added question: I need my files to be overwritten, as opposed to appended like when specifying the -o flag. Is it possible to do this without manually checking for and deleting preexisting files from the spider?

Stack Overflow Asked by Silver Flash on December 27, 2020

One Answer

Saving your items into a file named after the page you found them on is (as far as I know) not supported in settings. If you want to achieve this, you can build it yourself with Python's open function and csv.writer in your parse method. An alternative is to write an item pipeline which manages different item exporters for different files.
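To illustrate the first option, here is a minimal sketch (the helper name write_page_csv and the list-of-dicts input are illustrative, not part of Scrapy's API): a small function that writes one page's quotes to its own CSV file with the stdlib csv module.

```python
import csv

def write_page_csv(page_no, items):
    """Write one page's quotes to page-<page_no>.csv, overwriting any old file."""
    # "w" truncates an existing file, so re-running the spider overwrites
    # instead of appending -- which also answers the -o flag concern.
    with open(f"page-{page_no}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["author", "quote"])  # header row
        for item in items:
            writer.writerow([item["author"], item["quote"]])
```

In parse you would collect the yielded dicts for the current page into a list and call write_page_csv(page_no, collected) before following the next-page link.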

What you can do with settings, however, is limit the number of items per file with the FEED_EXPORT_BATCH_ITEM_COUNT setting, supported since Scrapy 2.3.
Overwriting instead of appending to a file is also possible since Scrapy 2.4: in FEEDS you can set overwrite to True, as demonstrated below.

If you replace your custom_settings with the following, it will produce files with 10 items each, named page- followed by the batch_id, which starts at 1. So your first 3 files would be named page-1.csv, page-2.csv and page-3.csv.

    custom_settings = {
        'FEED_EXPORT_BATCH_ITEM_COUNT': 10,
        'FEEDS' : {
            'page-%(batch_id)d.csv' : {
                'format' : 'csv',
                'store_empty' : True,
                'overwrite': True
            }
        }
    }

Implementing as pipeline

If you wanted to implement this using an item pipeline, you could save the page number you are on in the dictionary you return, which then gets processed and removed by the item pipeline.

The pipeline in your pipelines.py (based on this example) could then look like this:

from scrapy.exporters import CsvItemExporter


class PerFilenameExportPipeline:
    """Distribute items across multiple CSV files according to their 'page' field"""

    def open_spider(self, spider):
        # maps a file name to an (exporter, file handle) pair
        self.filename_to_exporter = {}

    def close_spider(self, spider):
        for exporter, csv_file in self.filename_to_exporter.values():
            exporter.finish_exporting()
            csv_file.close()

    def _exporter_for_item(self, item):
        filename = 'page-' + str(item['page_no'])
        del item['page_no']
        if filename not in self.filename_to_exporter:
            csv_file = open(f'{filename}.csv', 'wb')
            exporter = CsvItemExporter(csv_file)
            exporter.start_exporting()
            self.filename_to_exporter[filename] = (exporter, csv_file)
        return self.filename_to_exporter[filename][0]

    def process_item(self, item, spider):
        exporter = self._exporter_for_item(item)
        exporter.export_item(item)
        return item

You would then need to add a routine to your spider to track which page you are on, as well as set the pipeline in your custom_settings, which you could do like the following:

import scrapy
from ..pipelines import PerFilenameExportPipeline


class spidey(scrapy.Spider):
    name = "idk"
    custom_settings = {
        'ITEM_PIPELINES': {
            PerFilenameExportPipeline: 100
        }
    }
    
    def start_requests(self):
        yield scrapy.Request("http://quotes.toscrape.com/", cb_kwargs={'page_no': 1})

    def parse(self, response, page_no):
        for qts in response.xpath('//*[@class="quote"]'):
            yield {
                'page_no': page_no,
                'author' : qts.xpath("./span[2]/small/text()").get(),
                'quote' : qts.xpath('./*[@class="text"]/text()').get()
            }

        next_pg = response.xpath('//li[@class="next"]/a/@href').get()      
        if next_pg is not None:
            yield response.follow(next_pg, cb_kwargs={'page_no': page_no + 1})

However, there is one issue with this: the last file (page-10.csv) stays empty, for reasons beyond my comprehension. I have asked why that could be here.

Answered by Patrick Klein on December 27, 2020
