Lately, we have seen a proliferation of time-series data across domains such as financial analysis, microservices, data centers, container monitoring, and IoT applications. Time-series databases built to handle this data are evolving like never before, and there are plenty to choose from: MongoDB, InfluxDB, Couchbase, Cassandra, Prometheus, Graphite, DalmatinerDB, OpenTSDB, KairosDB, ClickHouse, Riak TS, and so on.
The major reason behind the adoption of time-series NoSQL databases is their flexibility and non-relational nature. Relational databases still offer many unique features that are highly useful to enterprises of all types, but they are tough to scale when needed. Because time-series data tends to pile up so quickly, developers increasingly assume that relational databases are ill-suited to handle it.
In reality, however, relational databases can handle time-series data effectively while retaining all their other benefits. The single problem to be solved is scaling, and this is what databases like TimescaleDB address. There are two options to consider: scaling up a single machine with more storage and memory, or scaling out so that data is stored across multiple machines.
The need for scaling up
One major problem when scaling a database up on a single machine is the high cost of expanding disk and memory. Memory is faster than disk but considerably more expensive, which is one of the oldest problems facing administrators of relational databases. In many conventional relational databases, a table is stored as fixed-size pages of data, on top of which indexing structures are built. With an index in place, a query can find the row holding a unique ID without scanning the entire table.
If the working data set and its indexes are small, they can be kept entirely in memory. Relational databases keep a B-tree for every table index, making it much easier to find values. However, if the data is large, it cannot all fit in memory for in-memory B-tree modification. And because databases access disk on page-size boundaries, even the smallest update can trigger page swaps: to change a single cell, the database must swap out an existing page, write it back to disk, and read in the new page in order to modify it.
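The index lookup described above can be sketched with SQLite, which behaves like any page-oriented relational database here; the table and values are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(10_000)])

# The B-tree behind the primary key lets this query jump straight
# to the matching row instead of scanning all 10,000 rows.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT value FROM readings WHERE id = 4242"
).fetchone()
print(plan[-1])  # reports an index SEARCH, not a full table SCAN

row = conn.execute("SELECT value FROM readings WHERE id = 4242").fetchone()
print(row[0])  # 2121.0
```

`EXPLAIN QUERY PLAN` confirms the query uses the primary-key B-tree rather than reading every page of the table.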
One mitigation is to use smaller, variable-sized pages. This helps minimize disk fragmentation, and it can also reduce the overhead of seek time, which is usually 5 to 10 milliseconds on spinning disks.
Solid-state drives based on NAND flash help reduce this seek time, but they too read and write at page-level granularity. To update a single byte, the SSD firmware may need to read an 8KB page from disk into a buffer cache, modify the page, and write the updated page out to a new location on disk. The cost of swapping pages in and out of memory shows up clearly in PostgreSQL performance graphs.
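The write amplification this implies is easy to quantify; the numbers below are a hypothetical illustration, assuming the 8KB page size mentioned above:

```python
# Updating 1 byte on a page-oriented device still costs a whole
# page read and a whole page write.
PAGE_SIZE = 8 * 1024       # 8 KB flash page (an assumption)
bytes_changed = 1

# Ratio of bytes physically written to bytes logically changed.
write_amplification = PAGE_SIZE / bytes_changed
print(write_amplification)  # 8192.0
```

In other words, a one-byte logical change can cost thousands of times its size in physical I/O, which is why small random updates are so expensive on page-based storage.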
How time-series data differs
Looking at the actual problem, relational databases have been designed for online transaction processing (OLTP) ever since IBM's seminal System R in the late '70s. OLTP operations are mostly transactional, touching a few rows at a time. For example, consider a typical bank transfer in which a user debits money from one account and credits it to another. This corresponds to updates to two different rows in a table, and since the transfer can occur between any two accounts, the two rows involved may be randomly distributed across the table.
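The bank-transfer pattern can be sketched as a single atomic transaction; the table layout and amounts below are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [(1, 100.0), (2, 50.0)])

# An OLTP-style transfer: both row updates commit together or
# not at all (sqlite3's context manager wraps them in a transaction).
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")

balances = conn.execute(
    "SELECT balance FROM accounts ORDER BY id").fetchall()
print(balances)  # [(70.0,), (80.0,)]
```

The two updated rows live at unrelated positions in the table, which is exactly the random-access pattern OLTP systems are tuned for.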
Time-series data can be found in various settings beyond typical financial applications: transportation, logistics, industrial machines, server and data-center monitoring, DevOps, and more. Let us explore a few standard time-series workloads.
DevOps monitoring: Such a system collects metrics about various containers and servers, including CPU usage, network tx/rx, free vs. used memory, disk IOPS, etc. Each metric is associated with a timestamp, a unique server ID or name, and a distinct set of tags describing an attribute of what is being collected.
IoT sensor data: Every IoT device delivers many sensor readings per time frame. For example, an air quality and environmental monitoring system could report humidity, temperature, pressure, sound level, carbon monoxide, nitrogen dioxide, and so on. Each reading is associated with the timestamp at which it was taken, the sensor ID, and some metadata.
Financial data: This may include several timestamped streams. For example, stock market data includes the name of a stock, its current price, its change in price over time, etc. It could also be a payment transaction on an e-commerce website, including an account ID, transaction amount, timestamp, and metadata. Note how this differs from the OLTP examples discussed above: here we record every transaction as it happens, whereas a typical OLTP system records only the current state of the system at a given moment.
Fleet management or asset management systems: Here the data may describe a moving vehicle, with GPS readings, an asset ID, a timestamp, and associated metadata. This is more or less the same pattern as the OLTP systems mentioned earlier.
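All of these workloads fit a common shape: a timestamp, a source identifier, a measured value, and some metadata. A minimal sketch of such a table, with illustrative (not standard) column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metrics (
        ts        TEXT NOT NULL,   -- timestamp of the reading
        device_id TEXT NOT NULL,   -- server name, sensor ID, asset ID, ...
        metric    TEXT NOT NULL,   -- e.g. 'cpu_usage' or 'temperature'
        value     REAL NOT NULL,
        tags      TEXT             -- free-form metadata, e.g. JSON
    )
""")
# Time-series queries usually ask "readings for device X over a time
# range", so a composite (device, time) index keeps that a cheap scan.
conn.execute("CREATE INDEX idx_device_ts ON metrics (device_id, ts)")

conn.execute(
    "INSERT INTO metrics VALUES (?, ?, ?, ?, ?)",
    ("2024-01-01T00:00:00Z", "sensor-7", "temperature", 21.5,
     '{"unit": "C"}'))
row = conn.execute(
    "SELECT value FROM metrics WHERE device_id = ? AND metric = ?",
    ("sensor-7", "temperature")).fetchone()
print(row[0])  # 21.5
```

Whether the source is a server, a sensor, or a vehicle, only the contents of `device_id`, `metric`, and `tags` change; the access pattern stays the same.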
In all these cases, the dataset is a stream of measurements, with new rows inserted into the database at various time intervals. Data may also arrive much later than its timestamp, whether because of connectivity delays or because of corrections and updates made after the fact. All of this can be handled effectively by databases like TimescaleDB.