
What is the most used format to save data with type information

Data Science: Asked by Pieter on February 14, 2021

I am exporting data from an SQL database and importing it into R. This is a two-step process, since I first (automatically) download the data to a hard drive and then import the file with R.

Currently I am using CSV files to save the data. Every tool supports CSV, but CSV does not carry type information. This sometimes makes loading a CSV file cumbersome, because I have to check all the column types myself, which seems unnecessary since the SQL database already specifies them.
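
For illustration, this is roughly what the extra work looks like on the R side (a minimal sketch assuming the readr package; the file name and columns are made up):

```r
library(readr)

# Reading the exported CSV means re-declaring every column type by hand,
# even though the SQL schema already knows them. (Hypothetical file/columns.)
dat <- read_csv(
  "export.csv",
  col_types = cols(
    id         = col_integer(),
    name       = col_character(),
    amount     = col_double(),
    created_at = col_datetime()
  )
)
```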

I want to know whether there is a broadly accepted file format for saving data that also specifies the column types.

Currently I am working with SQL databases, FME ETL'ing, and R, but I think this is an issue for every data transfer.

2 Answers

I think it depends on your requirements (read/write, sparse/non-sparse, ...). There are many alternatives.

A really common choice is SQLite, the "most widely deployed and used database engine": a small relational database that is used behind the scenes these days by many open-source and commercial software packages with data-storage needs (e.g., Adobe Lightroom, Mozilla Firefox).
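
For example, from R you can push a data frame into a single-file SQLite database and read it back (a minimal sketch assuming the DBI and RSQLite packages; the file and table names are made up):

```r
library(DBI)

# Open (or create) a single-file database on disk.
con <- dbConnect(RSQLite::SQLite(), "export.sqlite")

# Write a data frame; SQLite stores a type affinity (INTEGER, REAL, TEXT, ...)
# per column, so numbers come back as numbers and text as text.
dat <- data.frame(id = 1:3, name = c("a", "b", "c"), amount = c(1.5, 2.5, 3.5))
dbWriteTable(con, "measurements", dat, overwrite = TRUE)

# Read the table back into R.
measurements <- dbReadTable(con, "measurements")

dbDisconnect(con)
```

Note that SQLite's typing is looser than most SQL databases (it uses type affinities rather than strict column types), but it carries far more information than a plain CSV.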

Off the top of my head:

If you work with R and Python:

The feather format was designed for fast data-frame serialization and is currently available for R and Python; it was a collaboration between leading R and Python developers. It is built on top of Apache Arrow, it is fast to read, but it is still in an alpha state. (A minimal round trip is sketched after this list.)

There are some serialization formats available from the XML community. You can store complex webs of objects in these formats.

There are also JSON and JSON Schema.

If your tables are sparse, there is, for instance, the "sparse ARFF" format (little used, though). There must be others (I would have to look this up myself).
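
The feather round trip mentioned above could look like this (a sketch assuming the feather package; newer versions of the format ship with the arrow package, and the file name is made up):

```r
library(feather)

# Write a data frame to disk; the column types travel with the file.
write_feather(mtcars, "mtcars.feather")

# Read it back: the types are restored without any col_types bookkeeping.
dat <- read_feather("mtcars.feather")
```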

Correct answer by knb on February 14, 2021

Parquet and Avro both support data types (strings, integers, floats, etc.). These are the primary file types used for "big data" projects, although your data doesn't have to be big. Apache Spark is able to read both with ease. Having said that, I'm a big fan of keeping data in a SQL database (e.g., MySQL or Postgres), because that is what they are built for. If you can't re-use the database you're pulling from, could you set up your own database locally or on a separate server? I would try using a relational database until your data exceeds 50 GB (an arbitrarily chosen "somewhat large" size), and then I would use Avro or Parquet.
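
From R, the arrow package gives you Parquet directly, without needing Spark (a minimal sketch; the file name is made up):

```r
library(arrow)

# Write a data frame to Parquet; the column types (integer, double, string,
# timestamp, ...) are stored in the file's schema.
write_parquet(mtcars, "mtcars.parquet")

# Read it back with the types restored.
dat <- read_parquet("mtcars.parquet")
```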

Answered by Ryan Zotti on February 14, 2021
