This section describes some internal implementation details about compression in InnoDB. The information presented here may be helpful in tuning for performance, but is not necessary to know for basic use of compression.
Some operating systems implement compression at the file system level. Files are typically divided into fixed-size blocks that are compressed into variable-size blocks, which easily leads to fragmentation. Every time something inside a block is modified, the whole block is recompressed before it is written to disk. These properties make this compression technique unsuitable for use in an update-intensive database system.
The InnoDB storage engine implements a novel type of compression with the help of the well-known zlib library, which implements the LZ77 compression algorithm. This compression algorithm is mature, robust, and efficient both in CPU utilization and in reduction of data size. The algorithm is “lossless”, so that the original uncompressed data can always be reconstructed from the compressed form. LZ77 compression works by finding sequences of data that are repeated within the data to be compressed. The patterns of values in your data determine how well it compresses, but typical user data often compresses by 50% or more.
Unlike compression performed by an application, or compression features of some other database management systems, InnoDB compression applies both to user data and to indexes. In many cases, indexes can constitute 40-50% or more of the total database size, so this difference is significant. When compression is working well for a data set, the size of the InnoDB data files (the .ibd files) is 25% to 50% of the uncompressed size or possibly smaller. Depending on the workload, this smaller database can in turn lead to a reduction in I/O, and an increase in throughput, at a modest cost in terms of increased CPU utilization.
All user data in InnoDB is stored in pages comprising a B-tree index (the so-called clustered index). In some other database systems, this type of index is called an “index-organized table”. Each row in the index node contains the values of the (user-specified or system-generated) primary key and all the other columns of the table.
Secondary indexes in InnoDB are also B-trees, containing pairs of values: the index key and a pointer to a row in the clustered index. The pointer is in fact the value of the primary key of the table, which is used to access the clustered index if columns other than the index key and primary key are required. Secondary index records must always fit on a single B-tree page.
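For illustration, the following sketch shows what such a table might look like; the table and index names are hypothetical, and the example assumes the Barracuda file format and file-per-table tablespaces are enabled so that ROW_FORMAT=COMPRESSED is available. The PRIMARY KEY forms the clustered index, and idx_name stores (name, id) pairs, with id acting as the pointer back into the clustered index.

    -- Hypothetical compressed table: rows live in the clustered index
    -- (ordered by the primary key); the secondary index stores the index
    -- key plus the primary key value as its pointer to the full row.
    CREATE TABLE customer (
        id    INT NOT NULL,
        name  VARCHAR(100),
        notes TEXT,                      -- long values may go to overflow pages
        PRIMARY KEY (id),                -- clustered index key
        KEY idx_name (name)              -- secondary index: (name, id) pairs
    ) ENGINE=InnoDB
      ROW_FORMAT=COMPRESSED
      KEY_BLOCK_SIZE=8;                  -- 8K compressed page size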
The compression of B-tree nodes (of both clustered and secondary indexes) is handled differently from compression of overflow pages used to store long VARCHAR, BLOB, or TEXT columns, as explained in the following sections.
Because they are frequently updated, B-tree pages require special treatment. It is important to minimize the number of times B-tree nodes are split, as well as to minimize the need to uncompress and recompress their content.
One technique InnoDB uses is to maintain some system information in the B-tree node in uncompressed form, thus facilitating certain in-place updates. For example, this allows rows to be delete-marked and deleted without any compression operation.
In addition, InnoDB attempts to avoid unnecessary uncompression and recompression of index pages when they are changed. Within each B-tree page, the system keeps an uncompressed “modification log” to record changes made to the page. Updates and inserts of small records may be written to this modification log without requiring the entire page to be completely reconstructed.
When the space for the modification log runs out, InnoDB uncompresses the page, applies the changes and recompresses the page. If recompression fails, the B-tree nodes are split and the process is repeated until the update or insert succeeds.
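One way to see how often this uncompress/recompress path is taken is to compare total compression attempts with the number that succeed. The following is a sketch that assumes the INFORMATION_SCHEMA.INNODB_CMP table shipped with the InnoDB Plugin; a large gap between COMPRESS_OPS and COMPRESS_OPS_OK suggests frequent recompression failures and page splits.

    -- Sketch: successful compressions versus total attempts, per page size.
    SELECT page_size,
           compress_ops,
           compress_ops_ok,
           uncompress_ops
    FROM INFORMATION_SCHEMA.INNODB_CMP;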
Generally, InnoDB requires that each B-tree page can accommodate at least two records. For compressed tables, this requirement has been relaxed. Leaf pages of B-tree nodes (whether of the primary key or secondary indexes) only need to accommodate one record, but that record must fit, in uncompressed form, in the per-page modification log. Starting with InnoDB storage engine version 1.0.2, and if InnoDB strict mode is ON, the InnoDB storage engine checks the maximum row size during CREATE TABLE or CREATE INDEX. If the row does not fit, the following error message is issued: ERROR HY000: Too big row.
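As a rough sketch of how this check surfaces, the statements below (with a hypothetical table definition) enable strict mode before creating a compressed table; if the maximum possible row cannot fit as described above, the CREATE TABLE itself is rejected rather than failing later at insert time.

    -- Sketch: with strict mode ON, row-size problems are reported at
    -- CREATE TABLE time instead of at a later INSERT or UPDATE.
    SET SESSION innodb_strict_mode = ON;

    CREATE TABLE wide_row_example (      -- hypothetical table
        id INT PRIMARY KEY,
        c1 VARCHAR(500),
        c2 VARCHAR(500)
    ) ENGINE=InnoDB
      ROW_FORMAT=COMPRESSED
      KEY_BLOCK_SIZE=1;                  -- small page size; easiest to overflow
    -- If the row cannot fit, this fails with: ERROR HY000: Too big row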
If you create a table when InnoDB strict mode is OFF, and a subsequent INSERT or UPDATE statement attempts to create an index entry that does not fit in the size of the compressed page, the operation fails with ERROR 42000: Row size too large. Unfortunately, this error message does not name the index for which the record is too large, or mention the length of the index record or the maximum record size on that particular index page. The only remedy is to rebuild the table with ALTER TABLE, to select a larger compressed page size (KEY_BLOCK_SIZE), to shorten any column prefix indexes, or to disable compression entirely with ROW_FORMAT=DYNAMIC or ROW_FORMAT=COMPACT.
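These remedies map directly onto ALTER TABLE options, sketched below against a hypothetical table t1 with a prefix index idx_notes:

    -- Remedy 1: rebuild with a larger compressed page size.
    ALTER TABLE t1 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

    -- Remedy 2: shorten a column prefix index (here, to a 20-byte prefix).
    ALTER TABLE t1 DROP INDEX idx_notes, ADD INDEX idx_notes (notes(20));

    -- Remedy 3: disable compression entirely.
    ALTER TABLE t1 ROW_FORMAT=DYNAMIC;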
In a clustered index, BLOB, VARCHAR, and TEXT columns that are not part of the primary key may be stored on separately allocated (“overflow”) pages. We call these “off-page columns”; their values are stored on singly-linked lists of overflow pages.
For tables created in ROW_FORMAT=DYNAMIC or ROW_FORMAT=COMPRESSED, the values of BLOB, TEXT, or VARCHAR columns may be stored fully off-page, depending on their length and the length of the entire row. For columns that are stored off-page, the clustered index record only contains 20-byte pointers to the overflow pages, one per column. Whether any columns are stored off-page depends on the page size and the total size of the row. When the row is too long to fit entirely within the page of the clustered index, InnoDB chooses the longest columns for off-page storage until the row fits on the clustered index page. As noted above, if a row does not fit by itself on a compressed page, an error occurs.
Tables created in previous versions of InnoDB use the “Antelope” file format, which supports only ROW_FORMAT=REDUNDANT and ROW_FORMAT=COMPACT. In these formats, InnoDB stores the first 768 bytes of BLOB, VARCHAR, and TEXT columns in the clustered index record along with the primary key. The 768-byte prefix is followed by a 20-byte pointer to the overflow pages that contain the rest of the column value.
When a table is in COMPRESSED format, all data written to overflow pages is compressed “as is”; that is, InnoDB applies the zlib compression algorithm to the entire data item. Other than the data, compressed overflow pages contain an uncompressed header and trailer comprising a page checksum and a link to the next overflow page, among other things. Therefore, very significant storage savings can be obtained for longer BLOB, TEXT, or VARCHAR columns if the data is highly compressible, as is often the case with text data (but not previously compressed images).
The overflow pages are of the same size as other pages. A row containing ten columns stored off-page occupies ten overflow pages, even if the total length of the columns is only 8K bytes. In an uncompressed table, ten uncompressed overflow pages occupy 160K bytes. In a compressed table with an 8K page size, they occupy only 80K bytes. Thus, it is often more efficient to use compressed table format for tables with long column values.
Using a 16K compressed page size can reduce storage and I/O costs for BLOB, VARCHAR, or TEXT columns, because such data often compresses well, and might therefore require fewer “overflow” pages, even though the B-tree nodes themselves take as many pages as in the uncompressed form.
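For example, a table dominated by long text columns might keep the full 16K page size for its B-tree nodes while still compressing the overflow pages; the definition below is a hypothetical sketch.

    -- Hypothetical sketch: 16K compressed page size, so B-tree nodes stay
    -- the same size as uncompressed pages while overflow pages holding the
    -- long column values are compressed.
    CREATE TABLE document (
        doc_id INT PRIMARY KEY,
        title  VARCHAR(200),
        body   LONGTEXT                  -- likely stored on overflow pages
    ) ENGINE=InnoDB
      ROW_FORMAT=COMPRESSED
      KEY_BLOCK_SIZE=16;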
In a compressed InnoDB table, every compressed page (whether 1K, 2K, 4K or 8K) corresponds to an uncompressed page of 16K bytes. To access the data in a page, InnoDB reads the compressed page from disk if it is not already in the buffer pool, then uncompresses the page to its original 16K byte form. This section describes how InnoDB manages the buffer pool with respect to pages of compressed tables.
To minimize I/O and to reduce the need to uncompress a page, at times the buffer pool contains both the compressed and uncompressed form of a database page. To make room for other required database pages, InnoDB may “evict” from the buffer pool an uncompressed page, while leaving the compressed page in memory. Or, if a page has not been accessed in a while, the compressed form of the page may be written to disk, to free space for other data. Thus, at any given time, the buffer pool may contain both the compressed and uncompressed forms of the page, or only the compressed form of the page, or neither.
InnoDB keeps track of which pages to keep in memory and which to evict using a least-recently-used (LRU) list, so that “hot” or frequently accessed data tends to stay in memory. When compressed tables are accessed, InnoDB uses an adaptive LRU algorithm to achieve an appropriate balance of compressed and uncompressed pages in memory. This adaptive algorithm is sensitive to whether the system is running in an I/O-bound or CPU-bound manner. The goal is to avoid spending too much processing time uncompressing pages when the CPU is busy, and to avoid doing excess I/O when the CPU has spare cycles that can be used for uncompressing compressed pages (that may already be in memory). When the system is I/O-bound, the algorithm prefers to evict the uncompressed copy of a page rather than both copies, to make more room for other disk pages to become memory resident. When the system is CPU-bound, InnoDB prefers to evict both the compressed and uncompressed page, so that more memory can be used for “hot” pages, and to reduce the need to uncompress data that is held in memory only in compressed form.
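The amount of buffer pool memory devoted to compressed pages can be observed through the INFORMATION_SCHEMA tables provided with the InnoDB Plugin; the sketch below assumes the INNODB_CMPMEM table with the column layout shown.

    -- Sketch: buffer pool pages currently holding compressed data,
    -- broken down by compressed page size.
    SELECT page_size,
           pages_used,
           pages_free
    FROM INFORMATION_SCHEMA.INNODB_CMPMEM;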
Before a compressed page is written to a database file, InnoDB writes a copy of the page to the redo log (if it has been recompressed since the last time it was written to the database). This is done to ensure that redo logs will always be usable, even if a future version of InnoDB uses a slightly different compression algorithm. Therefore, some increase in the size of log files, or a need for more frequent checkpoints, can be expected when using compression. The amount of increase in the log file size or checkpoint frequency depends on the number of times compressed pages are modified in a way that requires reorganization and recompression.
Note that the redo log file format (and the database file format) are different from previous releases when using compression. The MySQL Enterprise Backup product does support this latest “Barracuda” file format for compressed InnoDB tables. The older InnoDB Hot Backup product can only back up tables using the file format “Antelope”, and thus does not support InnoDB tables that use compression.