What You Possibly Don’t Know About Columnar Storage
Columnar storage is a commonly used storage technique. Often, it implies high performance and has become a standard configuration for today’s analytical databases.
Join the DZone community and get the full member experience.
Join For FreeColumnar storage is a commonly used storage technique. Often, it implies high performance and has basically become a standard configuration for today’s analytical databases.
The basic principle of columnar storage is reducing the amount of data retrieved from the hard disk. A data table can have a lot of columns, but the computation may use only a very small number of them. With columnar storage, useless columns do not need to be retrieved, while with row-wise storage, all columns need to be scanned. When the retrieved columns only take up a very small part of the total, columnar storage has a big advantage in terms of IO time, and computation seems to get much faster.
But the columnar storage also has another side – it isn’t the fastest for any scenario.
Implementing columnar storage is much more complex than implementing row-wise storage because, for a data table, the number of columns can be determined in advance, but the number of rows will not stop growing. With row-wise storage, we write and append data to the table according to the order of records. It is easy to store the data table as a single file. But this does not work for data stored in columnar format. As there will be data appending, we cannot know the number of rows beforehand, and it is thus impossible to finish writing a column and then the next. Generally, we divide the storage space into a number of blocks, write a fixed number of rows (represented by N) to one block, and then move to the next when finish the writing. Later, data will be retrieved block by block. In each block, data is stored in the column-wise format, while between blocks, data can be regarded as stored row-wise. A special management module, where a table of contents is used to record information on the continuously growing data blocks and every column they store, is needed, causing a lot of inconveniences. So, it is difficult to implement columnar storage in a single data file. The storage schema is usually adopted by special data warehouse products.
However, the block storage mechanism is unfriendly to the implementation of parallel processing when the data amount is not large. Parallel processing requires that data be divided into multiple segments. To be able to do this, there are two conditions: an almost equal amount of data in each segment (equal processing load for each thread) and the ability for flexible segmentation (the number of parallel tasks cannot be determined beforehand). Row-wise data can be segmented according to the number of rows, and parallel processing becomes feasible even for a very small amount of data. Column-wise data can only be divided into blocks, where data cannot be further divided. The number of records (the above-mentioned N) should not be too small; otherwise, too many resources will be wasted due to the existence of the smallest disk retrieval unit. In the extreme case of N=1, the storage schema is equal to row-wise storage. When N is too small, and the total amount of data involved is huge, the table of contents becomes very large and overburdens the content management. So, N is usually specified as one million or above. In order to segment data flexibly, there need to be at least hundreds of data blocks. That is to say, the parallel computation on column-wise data becomes smooth only when the total amount reaches at least hundreds of millions of data rows.
esProc SPL offers the double increment segmentation strategy to make N grow as the data amount increases while maintaining the same number of data blocks. This way, the size of the table of contents can also be fixed, the columnar storage can be conveniently implemented in a single file, and flexible segmentation can be implemented for performing parallel computation on a small amount of data.
According to the principle of columnar storage, the storage schema brings an obvious advantage only when the computation involves a relatively small number of columns. Many performance test cases (such as TPCH used as the international standard) choose such computing scenarios so they are convenient for bringing out the advantages of columnar databases. Those are only a part of the real-life business scenarios. In the finance industry, it is not rare that a computation involves most of the columns in a table having over one hundred columns. In that case, columnar storage only gives half-play to its advantage. Even if columnar storage has a higher compression ratio and a smaller amount of data retrieved than row-wise storage, its advantage is not that noticeable when many columns participate in the computation. After all, the process of retrieving data stored column-wise is much more complex than that of retrieving row-wised stored data.
Therefore, when a real-world computation does not have as good performance as the test case gets, it is normal, and this does not mean that the test result is fake.
Columnar storage also leads to random disk accesses. Data in each column is stored continuously, but data in different columns isn’t. The more columns that are retrieved, the more serious the degree of randomness resulting from the retrieval, even with a single-thread task. For SSDs, it isn’t a very serious problem because when data in each column is continuous, and the above-mentioned N is big enough, the retrieval cost takes up a very small proportion, and the SSD does not have the seek time.
But for HDDs that have the seek time, the problem becomes disastrous. When a lot of columns are accessed, it is probable that the performance is not even as good as that of the row-wise storage. Both concurrency and parallel processing will worsen the problem. On the other hand, increasing the size of the cache to alleviate the problem will occupy too much memory space.
Be cautious when you try to use the columnar storage on HDDs.
Another big problem with columnar storage is that it has a much lower indexing performance than row-wise storage. As we said, the index table stores ordered key values and positions of their corresponding records in the original table. For row-wise storage, the position of a record can be represented by one number, but for columnar storage, each column in a record has a different position, and, in principle, these positions should all be recorded. This creates an index table almost as big as the original table, which leads to heavy storage utilization and high retrieval costs. There isn’t much difference between this and the method of copying the original table and sorting it.
Of course, no one will do that in real-world practices. The general approach is still the previously mentioned block storage mechanism. The index only stores ordinal numbers of the records. The search reads an ordinal number from the index table, locates the corresponding block, “counts” from the first record to the one with the corresponding ordinal number in the block, and retrieves the column value. The “count” action is performed on each column. In the best-case scenario, a number of disk units equal to the number of columns will be read; if you are not lucky, the whole block will be scanned. By contrast, an index for row-wise storage generally only needs to read one or two disk units (determined by the space the records occupy). The amount of data retrieved under columnar storage is dozens of, even one hundred, times more than that under row-wise storage. With HDDs, there is also the unbearable seek time. Therefore, the columnar storage basically cannot handle the high-concurrency query requirements.
Use the columnar storage for traversal and the row-wise storage for search. For data on which both traversal and search require high performance, it is even necessary to store two copies of data redundantly. The data platform should permit programmers to adopt the most suitable storage schema for each computing scenario rather than making the same decision for all scenarios.
Well, esProc SPL allows users to choose a more suitable one between the row-wise storage and the columnar storage. It also offers the value-attached indexing strategy to store a row-wise data copy of the traversal-oriented columnar data for the search.
Published at DZone with permission of Judy Liu. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments