
Building Financial Data Time Series Database from scratch

Quantitative Finance Asked by mountainclimber11 on October 27, 2021

My company is starting a new initiative aimed at building a financial database from scratch.

We would be using it in these ways:

  1. Time series analysis of: a company’s financial data (ex: IBM’s total fixed assets over time), aggregations (ex: total fixed assets for the materials sector over time), etc.
  2. Single company snapshot: various data points of a single company
  3. Analysis of multiple companies across multiple data fields for a single time frame, usually the current day.
  4. Backtesting, rank analysis, data analysis, etc. of ideas and custom factors.

Approximate breadth of data:

  1. 3000 companies
  2. 3500 data fields (ex: total fixed assets, earnings, etc.)
  3. 500 aggregation levels
  4. Periodicity: daily, monthly, quarterly, annual
  5. 20 year look-back that would grow over time

Questions:

  1. What database should we choose? We are currently limited to free options and we prefer open source (on principle). Currently we use PostgreSQL.
  2. How should I structure this schema-wise? I am thinking of breaking the field types into categories (balance sheet, descriptive, income statement, custom calculations, etc.), so each company would have a table per category, with each row representing one day and the columns being the fields appropriate to that category. That would be my fully normalized database. On top of it I would then build a data warehouse, temp tables, views, etc. that are not fully normalized, to make queries fast for the various use cases described above. One issue with this approach is the number of tables: with, say, 5 categories of company data and 3000 companies, I would have 15,000 tables in the fully normalized database just for storing the company data. Still, from my perspective, it seems like the best way to do it.
  3. What is the best strategy for indexing and structuring the time series portion of this? I've talked to a few people and done some research on time series database indexing/structure, but help/references/tips in this area, even if they duplicate what I have found, would be welcome. I realize this depends on the answer to #1 above, so assume I am staying with PostgreSQL and will be building the time-series-specific bells and whistles myself (a rough sketch of what I have in mind follows this list).
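
For illustration only, here is roughly what I have in mind for the warehouse side of those "bells and whistles" (all table and column names are placeholders, not a final design): one long table across all companies, range-partitioned by date, with a composite B-tree index for per-company/per-field look-ups and a BRIN index for date-range scans.

-- warehouse-style fact table (hypothetical names), partitioned by date
CREATE TABLE daily_fundamentals (
    obs_date    date     NOT NULL,
    company_id  integer  NOT NULL,
    field_id    integer  NOT NULL,
    value       numeric
) PARTITION BY RANGE (obs_date);

-- one partition per year; old years can be archived or dropped cheaply
CREATE TABLE daily_fundamentals_2020 PARTITION OF daily_fundamentals
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');

-- composite B-tree for "one company, one field, over time" queries
CREATE INDEX ON daily_fundamentals_2020 (company_id, field_id, obs_date);

-- BRIN on the date column: tiny, and effective because rows are loaded in date order
CREATE INDEX ON daily_fundamentals_2020 USING brin (obs_date);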

Notes:

  • In-depth technical answers and references/links much preferred.
  • This is for a small buy side financial investment firm.
  • If you have been down this road before, suggestions outside of the scope of my initial question are welcome.
  • We cannot compromise on the amount of data, so reducing the amount of data isn’t an option for us; however, the numbers I supplied are estimates only.
  • If there is a better place to ask this question, please let me know.
  • There is much more to what we want to do, but this represents the core of what we want to do from a data structure perspective.

10 Answers

QuestDB is another option. It is billed as "the fastest open source time series database".

Answered by MrR on October 27, 2021

Axibase Time Series Database FINANCE Edition is built exactly for storing and retrieving Last sale, Level 1, and reference data.

It's faster than files and comes with custom SQL functions for fast filtering and aggregation.

SELECT datetime, symbol, open(), close(), vwap()
    FROM atsd_trade
  WHERE datetime BETWEEN '2020-12-01' AND '2021-01-01' EXCL
    AND IS_INSTRUMENT_IN_INDEX('<index_name>')
    -- AND exchange = 'LSE' AND class = 'IOB' AND symbol = 'RDS.A'
  GROUP BY exchange, class, symbol, period(1 HOUR)

Notice functions such as vwap(), which are not part of standard SQL, or the IS_INSTRUMENT_IN_INDEX function, which filters trades by the components that belonged to the index on the trade date (not only the current components).

More SQL examples, along with guidance on how to handle option EOD data, are listed in the product documentation.

The database is free for production on a single node.

(Disclaimer: I work for Axibase)

Answered by Sergei Rodionov on October 27, 2021

DolphinDB is designed for your use case. It is really powerful and performant, though it is not open source and requires a commercial license.

Answered by Wei Qiu on October 27, 2021

Another solution is QAdirect. The data comes from various sources, and you can add your own custom tables. The schema is well documented, and extraction is via the usual SQL technologies. As far as cost goes, it isn't free, but it costs less than Bloomberg.

They seem to be committed to getting information from new sources, which will help in this heterogeneous environment.

I know of several shops that combined QAdirect, Capital IQ / alpha-factor library, and Datastream, along with FactSet and Bloomberg for point solutions.

The real problem with all of this is cost. I have seen suppliers start to ramp up prices and try to build strangleholds on the data sources. Control of the data is the real battle for the right to charge rent to investors these days.

Answered by Paul Brennan on October 27, 2021

I will prefix my answer with the following:

  • I've not read the holy wars stuff in the other answers in detail. I think it's missing the point.
  • I've implemented exactly this project more than once in small buy side firms.

Your main issue is getting the data in the first place. I have a recommendation - S&P Capital IQ. You can get a deployment where you either:

  • connect to their database server in the cloud
  • deploy their loaders and database on your own hardware (there's a choice of target OS and database, so if your org is Postgres / SQL Server / Oracle, it's not an issue - just pick what your guys prefer).

The loaders take care of syncing the data, and of the schema.

I've used similar competing products from other vendors and S&P was the only one that had what I would call a 'sane' data structure. SQL queries are pretty straightforward and your average quant analyst will have no issue at all getting what they want. The documentation is good, and they have support teams that actually have a clue.

If you want to extract that data into something that suits your analytic workload better, then it's much easier to do that from a properly organised dataset. I'm specifically talking about extracting to another store for faster processing. For most quant workloads it's going to be either:

  • snapshot the DB to a temp copy
  • read into python etc, do work, dump snapshot to my local disk
  • some other file format / database for highly concurrent analytic workload (tableau, Apache Spark, custom calc, etc)

The only issue with S&P is cost - but you were going to need to buy something anyway, as collecting it yourself is totally infeasible. They are far cheaper than Bloomberg, and I cannot recommend Thomson Reuters.

Don't worry too much about scalable tech storage etc.; the real issues are cleaning the data, identifiers, and dealing with things like corporate actions. That's waaaaay harder.

Regarding the tech:

Postgres is fine, and it won't annoy you with issues like float precision etc. There is now TimescaleDB, which gives you some pretty powerful time series performance for zero $, without forcing you to silo the time series data from the reference data.
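
As a rough sketch (table and column names invented), turning an ordinary Postgres table into a TimescaleDB hypertable is about this much work:

CREATE EXTENSION IF NOT EXISTS timescaledb;

-- an ordinary PostgreSQL table...
CREATE TABLE daily_prices (
    ts           timestamptz  NOT NULL,
    company_id   integer      NOT NULL,
    close_price  numeric,
    volume       bigint
);

-- ...becomes a hypertable, automatically chunked by time
SELECT create_hypertable('daily_prices', 'ts');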

For the kind of data you are talking about, you won't need a clustered database solution if your data is reasonably normalised. S&P's product is, and at that dataset size it was a couple of TB, IIRC.

Answered by ThatDataGuy on October 27, 2021

Interesting debate. Not to wake sleeping dogs, but the world has moved quite a bit in the last 1.5 years, and the data space has exploded.

I would like to recommend some new technologies and at the same time share a few of my experiences in this space.

As @madilyn is trying to explain: it all depends on your use case. In my experience it's easy to know what you want to do today, but really hard to foresee all the future use cases. Therefore I also take agility into account when choosing the stack for a given system.

Flat files are just really powerful, especially if they are stored in a binary format (e.g. tea, HDF5, Feather, Apache Parquet); using JSON and serializing / de-serializing the data is not clever at all.

When dealing with huge numbers of rows of structured data, a modern column-oriented database is hard to beat, especially in combination with interesting technologies like Snappy (compression that, well, compresses AND gives faster I/O - what!?!) and distributed file systems (e.g. GlusterFS, GridFS, Ceph), which allow you to build a relatively inexpensive and scalable database cluster. Check out MariaDB ColumnStore and the super performant (but with some drawbacks) ClickHouse.
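
To give a rough flavour of the column-store approach (a hypothetical ClickHouse table, not a recommendation): each column is stored and compressed separately, and the ORDER BY key fixes the on-disk sort order, so scans by symbol and date range stay cheap.

CREATE TABLE eod_fundamentals
(
    obs_date  Date,
    symbol    LowCardinality(String),
    field     LowCardinality(String),
    value     Float64
)
ENGINE = MergeTree
ORDER BY (symbol, field, obs_date);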

Most of the data in finance has a time dimension, so it might be a good idea to design for this from the start. KDB+ has been around for decades with a super strong database, unfortunately mostly unavailable to smaller companies due to cost. Now a whole sub-industry is emerging, fueled by the IoT buzz, offering time series databases (e.g. InfluxDB, RiakTS, OpenTSDB), but in my opinion the latest and still relatively unknown contender, TimescaleDB, offers some truly unique features. TimescaleDB is an extension to PostgreSQL and offers time series capabilities inside the same database where your non-time-dimensional data resides, making it easy to JOIN the two together and, in general, to leverage PostgreSQL's large feature set.
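
A hedged example of that JOIN point (assuming a hypertable like daily_prices(ts, company_id, close_price) and an ordinary reference table companies(company_id, sector); all names are invented): because the time series and the reference data live in the same PostgreSQL instance, a sector-level aggregation is a single query.

SELECT c.sector,
       time_bucket('7 days', p.ts) AS week,
       avg(p.close_price)          AS avg_close
  FROM daily_prices p
  JOIN companies    c ON c.company_id = p.company_id
 GROUP BY c.sector, week
 ORDER BY c.sector, week;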

The query language is really important, and in my opinion nothing beats SQL (and NewSQL) in terms of compatibility. By forcing other people to learn and use CQL or MongoDB Query Language, you might very well end up building a data graveyard.

When NoSQL was at the peak of its hype, I was one of the cool kids on board with a MongoDB database. I quickly realized that (a) I was the only user, and (b) I wanted to use BI tools for the early data exploration. MongoDB has probably had a lot of customers with the same request, so luckily they have made a 'Connector for BI'. I tried both the official one and a bunch of the SQL connectors made by 3rd party vendors, and let me just say this: you don't have to do the same experiment, unless you wish to waste a day or two of your life.


Conclusion: I will (partly) side with @madilyn. PostgreSQL with the TimescaleDB extension might be the way to go for the OP, but if you're not already married to PostgreSQL then also check out MariaDB ColumnStore. Once you have a nice solution, build a feature/script to extract data into a binary file for one-off / event research.

Answered by chjortlund on October 27, 2021

All of the answers above (unfortunately highly upvoted at this point) are missing the point. You shouldn't pick a DBMS or storage solution by general performance benchmarks; you should pick it by use case. If someone cites an "x ms read", "y inserts per second", "k times speedup", "store n TB of data" or "m years of experience" to justify a proposal to you, don't trust that person.

I can describe a common breaking point for every single one of the proposed solutions above:

  1. Flat files: This is a bad idea when you start to have many client applications, you have a small team, and/or you need to access this data in real time. Even a small team can have hundreds of processes on heterogeneous servers hammering the data simultaneously, so you start making this a hardware or file system problem if you store your data in flat files and rely on your own applications to manage concurrent access. Even 10-20 year old RDBMSes do a fairly decent job of managing this, and it would take a nontrivial amount of time for your own developers to replicate. Writing low-level concurrent software to deal with network I/O bounds or file system limitations is almost always more expensive than figuring out how to implement a cluster database or sharding with open source tools, and if you have a small team, your developer time is infinitely more valuable than single-file query speed. You said that you plan to backtest against the data store - yeah sure, you can backtest at 500 MB/s to 2 GB/s (millions of entries per second, cool!) on a single server with a plain file on a fast SSD, but it's nontrivial to scale this with plain files.

  2. Column-oriented time series database: Most people mistake modern database optimizations for column-oriented advantages. More modern DBMSes have clever parallel radix hash joins, SIMD-based aggregation operations and the like, which explain their speedup. In theory, a row-oriented DBMS should always outperform a column-oriented DBMS in write speed, but you'd see the reverse in many benchmarks because many column-oriented DBMSes have more modern ways to defer metadata generation or maintain indices. In the end, column-oriented storage is:

    i. A bad idea when you almost always fully materialize your records early. For example, suppose you are just storing {time, best_bid, best_ask} and you are selecting all the columns in the time interval $[a,b]$ because you are doing exploratory analysis and don't yet know what function $f(\text{best bid}, \text{best ask})$ you want to work with. First, the cache advantage of sequentially going through the times (stored contiguously) in $O(n)$ loses to the algorithmic advantage of chasing pointers through the index B-tree in $O(\log n)$. Most column-oriented DBMS architects are aware of this and implement their query optimizers to fall back on the index if the query follows this pattern, so row-oriented and column-oriented DBMSes are evenly matched here. But at the materialization step, the column-oriented DBMS still has to deserialize the separate columns back into row-oriented records, whereas the row-oriented DBMS just writes out the data in storage order and should be faster in theory.

    ii. Negligible if your query qualifying sets are usually small. The biggest bottleneck is in loading a disk sector into memory. If your qualifying set is small, all of it lies on the same (few) sector(s) regardless of column-oriented or row-oriented layout, so there is no first order speed advantage in column-oriented layout.

    iii. Expensive if you need third-party support. There are only a few production-strength column-oriented DBMSes, and getting a contractor to help you here is more expensive than getting free, quality assistance on MySQL. If you want realtime applications subscribing to your database, chances are you need an expensive commercial solution, because all the open-source options have weak support for such functionality.

  3. NoSQL: Let's say you have 2 companies that decide to merge, which happens very often since you are looking at daily granularity. There are different conventions for handling this, but now how do you update the earnings associated with either company? There's no cascade-style data model in MongoDB, so you offload the work from the database level to the application level (a small SQL sketch of what cascades give you follows this list). This can be bad for a few reasons: (1) chances are, you hand this task to an analyst, who is more prone to mess it up than letting the schema ensure integrity for you; (2) you need to write use-case-specific code to update specific fields in your JSON documents, which is difficult to maintain; (3) MongoDB, proposed above, has a rather inefficient serialization format (BSON), and almost all your applications downstream are going to get bottlenecked by the BSON library at some point.
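
For illustration of the cascade point only (hypothetical tables, not the OP's schema): in a relational schema, a foreign key with cascades lets the database itself propagate an identifier remap after a merger, instead of application code patching documents one by one.

CREATE TABLE companies (
    company_id  integer  PRIMARY KEY,
    name        text     NOT NULL
);

CREATE TABLE earnings (
    company_id  integer  NOT NULL
        REFERENCES companies (company_id)
        ON UPDATE CASCADE
        ON DELETE CASCADE,
    obs_date    date     NOT NULL,
    eps         numeric,
    PRIMARY KEY (company_id, obs_date)
);

-- remapping the surviving entity's identifier updates every earnings row in one statement
UPDATE companies SET company_id = 1001 WHERE company_id = 42;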

Here's what I suggest you do: stick with PostgreSQL because you are already familiar with it, design your schema in a futureproof way that makes it easy to migrate your data to any future solution, and determine where the performance bottlenecks of your use cases actually are before asking for a more specific solution (one cheap way to do that is sketched below).
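
For example (reusing the hypothetical daily_fundamentals layout sketched in the question), timing a representative query with PostgreSQL's own instrumentation shows whether a use case is I/O-bound or CPU-bound before you commit to a new storage engine:

EXPLAIN (ANALYZE, BUFFERS)
SELECT obs_date, value
  FROM daily_fundamentals
 WHERE company_id = 42
   AND field_id   = 7
   AND obs_date BETWEEN '2010-01-01' AND '2020-12-31';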

You'll have to ask your end users what these bottlenecks are. Is it:

  1. Backtesting
  2. Running time range queries ("Get me all the data columns in this time range.")
  3. Running aggregation operations ("I want to find max trade price, calculate total volume etc.")
  4. Concurrent access ("I want to write my backtest results on server A while they are being broadcast to servers B and C.")
  5. Maintaining complex relationships ("I need to know all the dividend date revisions and I have to update them frequently.")
  6. Maintaining unstructured relationships ("This asset class has strike prices but this other asset class doesn't.")

The ideal solution differs depending on your use case.

Answered by madilyn on October 27, 2021

Regarding storage, I stream real-time updates for exchange-listed contracts (outrights + exchange-listed calendar spreads) to InfluxDB. It's a time-series database, mostly geared towards IT Ops for storing log data, but it works fine with homogeneous finance data.

For options strips, due to the sheer amount of data generated per day, I use TeaFiles. Pros of using TeaFiles are that the file sizes are small and load times are fast. For each .tea file I use the following naming strategy: [ContractSymbol]_[Strike]_[CallOrPut]_[Date].tea. Files are then stored in Azure Blob storage. Read benchmarks are good; ~25 ms to read a file with 300k entries from a C# loader.

Answered by omencat on October 27, 2021

I am going to recommend something that I have no doubt will get people completely up in arms and probably get people to attack me. It happened in the past and I lost many points on StackOverflow as people downvoted my answer. I certainly hope people are more open minded in the quant forum.

Note - It seems that this suggestion has created some strong disagreement again. Before you read this I would like to point out that this suggestion is for a "Small buy side firm" and not a massive multiuser system.

I spent 7 years managing a high-frequency trading operation, and our primary focus was building systems just like this. We spent a huge amount of time trying to figure out the most efficient way to store, retrieve and analyze order-level data from the NYSE, NASDAQ and a wide variety of ECNs. What I am giving you is the result of that work.

Our answer was Don't Use a Database. A basic structured file system of serialized data chunks works far better. Market time series data is unique in many ways, both in how it is used and how it is stored. Databases were developed for wildly different needs and actually hurt the performance of what you are trying to do.

This is in the context of a small to mid-sized trading operation that is focused on data analysis related to trading strategies or risk analytics. If you are creating a solution for a large brokerage, bank or have to meet the needs of a large number of simultaneous clients then I imagine that your solution would differ from mine.

I happen to love databases. I am using MongoDB right now for part of a new project allowing us to analyze options trades, but my market timeseries data, including 16 years of options data, is all built into a structured file store.

Let me explain the reasoning behind this and why it is more performant.

First, let's look at storing the data. Databases are designed to allow a system to do a wide variety of things with data: the basic CRUD functions - Create, Read, Update and Delete. To do these things effectively and safely, many checks and safety mechanisms must be implemented. Before you read data, the database needs to be sure the data isn't being modified, it is checking for collisions, etc. When you do read data, the server puts a lot of effort into caching it and determining whether it can be served up faster later. There are indexing operations, and data is replicated to prepare it to be viewed in different ways. Database designers have put huge amounts of effort into making these functions fast, but they all take processing time, and if they are not used they are just an impediment.

Market time series data is stored in a completely different way. In fact, I would say it is prepared rather than stored. Each data item only needs to be written once and after that never needs to be modified or changed. Data items can be written sequentially; there is no need to insert anything in the middle. It needs no ACID functionality at all, and has little to no references out to any other data. The time series is effectively its own thing.

As a database does all the magic that makes databases wonderful, it also packs on the bytes. The minimum space data can take up is its own original size. Databases may be able to play some tricks with normalization and compression, but those only go so far and slow things down. The indexing, caching and referencing of the data end up packing on the bytes and chewing up storage.

Reading is also very simplified. Finding data is as simple as time & symbol; complex indexing does it no good. Since time series data is typically read in a linear fashion, a sequential chunk at a time, caching strategies actually slow access down instead of helping. It takes processor cycles to cache data you aren't going to read again anytime soon.

These are the basic structures that worked for us. We created basic data structures for serializing the data. If your major concern is speed and data size, you can go with a simple custom binary storage. In another answer, omencat suggested using TeaFiles, and that looks like it has some promise also. Our recent need is for more flexibility, so we chose to use a fairly dense but flexible JSON format.

We broke the data up into fairly obvious chunks. The EOD stock data is a very easy example, but the concept works for our larger datasets also.

We use the data for analysis in fairly traditional time series scenarios. It could be referenced as a single quote or as a series containing years of data at a time. It was important to break the data down into bite-sized chunks for storage, so we chose to make one "block" of our data equal to one year of EOD stock time series data. Each block is one file that contains a year of OHLC EOD data serialized as JSON. The name of the file is the stock symbol prefixed by an underscore. Note - the underscore prevents issues when the stock symbol conflicts with reserved DOS device names such as COM or PRN.

Note: make sure you understand the limitations of your file system. We got in trouble when we put too many files in one place. This led to a directory structure that is effectively its own index. It is broken down by the year of data and then sorted by the first letter of the stock symbol. This gives us roughly 20 to a few hundred symbol files per directory. It looks roughly like this:

StockEOD\{YYYY}\{Initial}\_symbol.json

AAPL data for 2015 would be

StockEOD\2015\A\_AAPL.json

A small piece of its data file looks like this:

[{"dt":"2007-01-03T00:00:00","o":86.28,"h":86.58,"l":81.9,"c":83.8,"v":43674760},
{"dt":"2007-01-04T00:00:00","o":84.17,"h":85.95,"l":83.82,"c":85.66,"v":29854074},
{"dt":"2007-01-05T00:00:00","o":85.84,"h":86.2,"l":84.4,"c":85.05,"v":29631186},
{"dt":"2007-01-08T00:00:00","o":85.98,"h":86.53,"l":85.28,"c":85.47,"v":28269652}

We have a router object that can give us a list of filenames for any data request in just a handful of lines. Each file is read with an Async filestream and deserialized. Each quote is turned into an Object and added to a sorted list in the system. At that point, we can do a very quick query to trim off the unneeded data. The data is now in memory and can be used in almost any way needed.

If the query size gets too big for the computer to handle, it isn't difficult to chunk the process. It takes a massive request to get there.

I have had programmers who I described this to almost go into a rage, telling me how I was doing it wrong, that this was "rolling my own database" and a complete waste of time. In fact, we switched from a fairly sophisticated database. When we did, the codebase to handle this dropped to a small handful of classes and less than 1/4 of the code we used to manage the database solution. We also got nearly a 100x jump in speed. I can retrieve 7 years of stock end-of-day data for 20 symbols in a couple of milliseconds.

Our old HF trading system used similar concepts but in a highly optimized Linux environment and operated in the nanosecond range.

Answered by drobertson on October 27, 2021

The standard answer is going to be that for time series, you want a column-store database. These are optimized for range queries (i.e. give me everything between two timestamps) because, crucially, they store data along one of the dimensions (which you must choose, usually time) contiguously on disk, and thus reads are extremely fast. The alternative, when fully normalizing on a relational database, is that time will be indexed but not stored contiguously. Thus if you ask for, say, 3000 data points, you're hitting the index 3000 times, and even if said index is held entirely in memory (unlikely if you have 3000 companies x 3500 fields), this takes a lot of time. Anecdotally, even coming from MongoDB, which is very fast, I obtained a 100x speedup using Cassandra. HBase will serve you equally well, while lesser-known entrants such as RiakTS, InfluxDB et al. are more specialized towards time series and often don't have tool support (for example, Flink or Spark). I have found Cassandra to suit my needs very well, but then, I don't have as many fields as you (though it will handle them with ease). However, I do store vast amounts of time series data (intraday), and Cassandra also has very high ingest rates.

However: you mention normalization. Cassandra et al. will not give you normalization. They will not give you anything like the sophisticated integrity tools which relational databases provide, nor the complex schema options. In particular, if you want column-store speed along more than one of the axes in your data hypercube, you will need to repeat the data in transposed form. Otherwise it will revert to relational performance (i.e. you can still index across dimensions, of course). If you find that most of your queries will be along the time axis, and that you don't have much need for complex schemas, I would heartily recommend Cassandra. It is heavily used in the financial industry for this reason.

An option which might be equivalent for you, however, is Postgres with its column-store capabilities. You can specify to Postgres that it should store data along one axis contiguously, thereby getting the benefit of the column stores. Note, however, that Postgres doesn't scale across multiple machines nearly as easily as Cassandra or HBase. You'll have to scale it vertically, which is much, much more expensive when you start pushing the envelope on ingest or query performance. With Cassandra, you just add cheap boxes. We all know that 10 cheap boxes are much cheaper than the single monster box that will keep up with their (parallelized) performance. Indeed, at some performance levels relational will just choke, whereas Cassandra will scale linearly to as many boxes as you like (Apple reportedly runs 75,000 nodes). That's why Netflix uses Cassandra, as does Apple. You'll need some serious z-enterprise-style big iron if you want to get close to the performance of even a medium-sized Cassandra cluster with a relational database.

So, Postgres or Cassandra. It will probably come down to whether you are going to put the data up for public (i.e. large numbers of users) consumption, or whether your data is truly big (> 5 terabytes-ish is what I call "big"). In either of those cases, Cassandra. If instead you need computer-science-rigorous normalization and all the bells and whistles that Postgres will give you, then Postgres is an excellent choice (with column store), provided you are not too worried about truly big data growth.

Answered by Thomas Browne on October 27, 2021
