A table's data are stored in a set of buckets. A bucket is a directory containing a set of files.
Option with one file per column
With option « bucket.oneFilePerColumn = true », each column's data are stored in a dedicated file. This option is faster on cloud storage and offers more flexibility.
Option with all columns in one file
With option « bucket.oneFilePerColumn = false », all columns are stored in a single file. This option requires fewer files to be open simultaneously and can be a better choice on HDFS.
The set of types supported by the K-Store file format is kept deliberately minimal:
• TINYINT: 8-bit signed integer
• SMALLINT: 16-bit signed integer
• INT: 32-bit signed integer
• FLOAT: IEEE 32-bit floating-point value
• BIGINT: 64-bit signed integer
• DOUBLE: IEEE 64-bit floating-point value
• STRING: Unicode string
• TIMESTAMP: timestamp value stored as a Unicode string
• DATE: date value stored as a Unicode string
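As an illustration of the fixed-width types above, the following Python sketch maps each numeric type to a format code of the standard « struct » module and derives its size in bytes. The table name FIXED_WIDTH_FORMATS and the helper width_in_bytes are hypothetical, and the little-endian byte order is an assumption: the byte order used by K-Store is not stated here.

```python
import struct

# Hypothetical mapping from K-Store numeric types to struct format codes.
# "<" selects little-endian, which is an assumption, not a documented fact.
FIXED_WIDTH_FORMATS = {
    "TINYINT": "<b",   # 8-bit signed integer
    "SMALLINT": "<h",  # 16-bit signed integer
    "INT": "<i",       # 32-bit signed integer
    "FLOAT": "<f",     # IEEE 32-bit floating-point value
    "BIGINT": "<q",    # 64-bit signed integer
    "DOUBLE": "<d",    # IEEE 64-bit floating-point value
}


def width_in_bytes(type_name):
    """Return the number of bytes one value of the given type occupies."""
    return struct.calcsize(FIXED_WIDTH_FORMATS[type_name])


def pack_value(type_name, value):
    """Serialize one value of the given type to raw bytes."""
    return struct.pack(FIXED_WIDTH_FORMATS[type_name], value)
```

STRING, TIMESTAMP and DATE are left out of the mapping because they are variable-length strings.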
Compression / Encoding
Integers are compressed using delta encoding and other types are compressed with Snappy.
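To show why delta encoding suits integer columns, here is a minimal Python sketch of the general technique: each value is replaced by its difference from the previous one, so slowly changing sequences become runs of small numbers. The function names are hypothetical, and the exact variant used by K-Store (bit packing, block size, handling of the first value) is not described here, so this should be read as an illustration only.

```python
def delta_encode(values):
    """Replace each integer by its difference from the previous one.

    The first value is stored as a difference from 0 (an assumption).
    """
    deltas = []
    previous = 0
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Rebuild the original integers by accumulating the differences."""
    values = []
    current = 0
    for delta in deltas:
        current += delta
        values.append(current)
    return values
```

For example, the sequence 100, 101, 103, 103 is encoded as 100, 1, 2, 0, and decoding those differences returns the original sequence.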
Strings are stored as a 16-bit length followed by the UTF-8 encoded characters.
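The length-prefixed string layout can be sketched in Python as follows. Several details are assumptions made for the example: the length is treated as unsigned, little-endian, and counted in bytes rather than characters. The function names encode_string and decode_string are hypothetical.

```python
import struct

MAX_STRING_BYTES = 0xFFFF  # largest value that fits in an unsigned 16-bit length


def encode_string(text):
    """Encode text as a 16-bit length prefix followed by its UTF-8 bytes."""
    payload = text.encode("utf-8")
    if len(payload) > MAX_STRING_BYTES:
        raise ValueError("string does not fit in a 16-bit length prefix")
    # "<H" is an unsigned little-endian 16-bit integer (assumed layout).
    return struct.pack("<H", len(payload)) + payload


def decode_string(buffer, offset=0):
    """Decode one string at offset; return the text and the next offset."""
    (length,) = struct.unpack_from("<H", buffer, offset)
    start = offset + 2
    end = start + length
    return buffer[start:end].decode("utf-8"), end
```

Returning the next offset from decode_string lets a reader walk through consecutive strings in the same buffer.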
In column files, data are stored as a sequence of pages. A page contains a fixed number of rows defined by « bucket.pagesize ».
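The paging scheme can be illustrated with a short Python sketch. It splits a column's values into pages of a given size and computes which page holds a given row. The function names are hypothetical, the page_size argument stands in for the value of « bucket.pagesize », and allowing the last page to hold fewer rows is an assumption.

```python
def split_into_pages(values, page_size):
    """Split a column's values into consecutive pages of page_size rows.

    The last page may be shorter when the row count is not a multiple
    of page_size (an assumption about how a partial page is handled).
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return [values[i:i + page_size] for i in range(0, len(values), page_size)]


def locate_row(row_index, page_size):
    """Return (page number, position within that page) for a row index."""
    return divmod(row_index, page_size)
```

With a page size of 4, ten rows are split into pages of 4, 4 and 2 rows, and row 9 (counting from 0) is found at position 1 of page 2.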