
How to insert several csv files into Elasticsearch?

Database Administrators Asked by Revolucion for Monica on December 19, 2020

I have several csv files on university courses, which you can find here, that all seem to be linked by an ID, and I wondered how to put them into Elasticsearch. Thanks to this video, I know how to insert a single csv file into Elasticsearch with Logstash. But do you know how to insert several, such as those in the provided link?

At the moment I have started with a first .config file for a first csv file, ACCREDITATION.csv, but it would be painful to write them all…

The .config file is:

input{
    file{
        path =>"Users/mike/Data/ACCREDITATION.csv"
        start_position => "begining"
        sincedb_path => "/dev/null"
    }
}

filter{
    csv{
        separator => ","
        columns => ['PUBUKPRN', 'UKPRN', 'KISCOURSEID', 'KISMODE', 'ACCTYPE', 'ACCDEPEND', 'ACCDEPENDURL', 'ACCDEPENDURLW']

    }
    mutate{convert => ["PUBUKPRN","integer"]}
    mutate{convert => ["UKPRN","integer"]}
    mutate{convert => ["KISMODE","integer"]}
    mutate{convert => ["ACCTYPE","integer"]}
    mutate{convert => ["ACCDEPEND","integer"]}
}

output{
    elasticsearch{
        hosts =>"localhost"
        index =>"accreditation"
        document_type =>"accreditaiton keys"
    }
    stdout{}
}
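
For reference, a single-file config like this can be run with the Logstash command line, assuming a local Logstash installation and that the config above has been saved as accreditation.conf (a file name chosen here only for the example):

bin/logstash -f accreditation.conf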

Update, May 3rd

Not knowing how to use a .config file to load several csv files into Elasticsearch, I fell back on the Elastic blog and tried to write a shell script, importCSVFiles, for a first .csv file before trying to generalize the approach:

importCSVFiles:

#!/bin/bash
while read f1
do        
   curl -XPOST 'https://XXX.us-east-1.aws.found.io:9243/courses/accreditation' -H "Content-Type: application/json" -u elastic:XXX -d "{ "accreditation": "$f1" }"
done < AccreditationByHep.csv

Yet I received a mapper_parsing_exception in the terminal:

mike@mike-thinks:~/Data/on_2018_04_25_16_43_17$ ./importCSVFiles
{"error":{"root_cause":
            [{"type":"mapper_parsing_exception","reason":"failed to parse"}],
          "type":"mapper_parsing_exception",
          "reason":"failed to parse",
          "caused_by":{"type":"i_o_exception","reason":"Illegal unquoted character ((CTRL-CHAR, code 13)): 
              has to be escaped using backslash to be included in string valuen at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@e18584; line: 1, column: 88]"}
         },"status":400
}
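
Looking more closely at the error: CTRL-CHAR, code 13 is a carriage return, which suggests the csv was saved with Windows-style (CRLF) line endings, and the inner double quotes in the -d payload are not escaped, so the keys end up unquoted in the JSON that gets sent. A corrected sketch of the script (same placeholder endpoint and credentials as above) would be:

#!/bin/bash
while read -r f1
do
   # strip the trailing carriage return left over from CRLF line endings
   f1=${f1%$'\r'}
   # escape the inner quotes so the payload is valid JSON
   curl -XPOST 'https://XXX.us-east-1.aws.found.io:9243/courses/accreditation' \
        -H "Content-Type: application/json" \
        -u elastic:XXX \
        -d "{ \"accreditation\": \"$f1\" }"
done < AccreditationByHep.csv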

One Answer

I just had a look at the data in the Higher Education Statistics Agency (HESA) zipped file and the files are all different.

This means you will either have to create an individual .config file for each import or create a single .config file using conditions as described in the following article:

Reference: How to use multiple csv files in logstash (Elastic Discuss Forum)

Expanding on your first .config by one level:

input{
    file{
        path =>"Users/mike/Data/ACCREDITATION.csv"
        start_position => "begining"
        sincedb_path => "/dev/null"
    }
    file{
        path => "/Users/mike/Data/AccreditationByHep.csv"
        start_position => "beginning"
        sincedb_path => "/dev/null"
    }

}

filter{
    # added condition for first file
    if [path] == "/Users/mike/Data/ACCREDITATION.csv"{
        csv{
            separator => ","
            columns => ['PUBUKPRN', 'UKPRN', 'KISCOURSEID', 'KISMODE', 'ACCTYPE', 'ACCDEPEND', 'ACCDEPENDURL', 'ACCDEPENDURLW']

        }
        mutate{convert => ["PUBUKPRN","integer"]}
        mutate{convert => ["UKPRN","integer"]}
        mutate{convert => ["KISMODE","integer"]}
        mutate{convert => ["ACCTYPE","integer"]}
        mutate{convert => ["ACCDEPEND","integer"]}
    }
    # added condition for second file
    else if [path] == "/Users/mike/Data/AccreditationByHep.csv"{
        csv{
            separator => ","
            columns => ['AccreditingBodyName', 'AccreditionType', 'HEP', 'KisCourseTitle', 'KiscourseID']
        }
    # omitted mutations for second file
    }

}

output{
    # added condition for first file
    if [path] == "/Users/mike/Data/ACCREDITATION.csv"{
        elasticsearch{
            hosts =>"localhost"
            index =>"accreditation"
            document_type =>"accreditaiton keys"
        }
    }
    # added condition for second file
    else if [path] == "/Users/mike/Data/AccreditationByHep.csv"{
        elasticsearch{
            hosts =>"localhost"
            index =>"accreditationByHep"
            document_type =>"accreditaitonbyhep keys"
        }
    }
    stdout{}
}

Note that document_type is a deprecated configuration option; on recent versions it can simply be omitted.

You should be able to expand on this example on your own.
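
If all of the csv files sit in the same directory, the two file blocks in the input section can also be collapsed into one by using a glob pattern (a sketch, assuming the files live under /Users/mike/Data/); the per-file conditionals in the filter and output sections keep working because they still match on the path field:

input{
    file{
        # the glob picks up every csv file in the directory
        path => "/Users/mike/Data/*.csv"
        start_position => "beginning"
        sincedb_path => "/dev/null"
    }
}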

Answered by John aka hot2use on December 19, 2020
