Most often, the internal optimizations in InnoDB described in Section 13.7.3.4.2, “InnoDB Data Storage and Compression”, ensure that the system runs well with compressed data. However, because the efficiency of compression depends on the nature of your data, there are some factors you should consider to get the best performance. You need to choose which tables to compress, and what compressed page size to use. You might also adjust the size of the buffer pool based on run-time performance characteristics, such as the amount of time the system spends compressing and uncompressing data.
In general, compression works best on tables that include a reasonable number of character string columns and where the data is read far more often than it is written. Because there are no guaranteed ways to predict whether or not compression benefits a particular situation, always test with a specific workload and data set running on a representative configuration. Consider the following factors when deciding which tables to compress.
A key determinant of the efficiency of compression in reducing the size of data files is the nature of the data itself. Recall that compression works by identifying repeated strings of bytes in a block of data. Completely randomized data is the worst case. Typical data often has repeated values, and so compresses effectively. Character strings often compress well, whether defined in CHAR, VARCHAR, TEXT or BLOB columns. On the other hand, tables containing mostly binary data (integers or floating-point numbers) or data that is previously compressed (for example, JPEG or PNG images) may not compress well, significantly or at all.
Compression is chosen on a table-by-table basis with the InnoDB storage engine, and a table and all of its indexes use the same (compressed) page size. It might be that the primary key (clustered) index, which contains the data for all columns of a table, compresses more effectively than the secondary indexes. For those cases where there are long rows, the use of compression might result in long column values being stored “off-page”, as discussed in Section 13.7.5.3, “DYNAMIC Row Format”. Those overflow pages may compress well. Given these considerations, for many applications, some tables compress more effectively than others, and you might find that your workload performs best only with a subset of tables compressed.
Experimenting is the only way to determine whether or not to compress a particular table. InnoDB compresses data in 16K chunks corresponding to the uncompressed page size, and, in addition to user data, the page format includes some internal system data that is not compressed. Compression utilities compress an entire stream of data, and so may find more repeated strings across the entire input stream than InnoDB would find in a table compressed in 16K chunks. But you can get a sense of compression efficiency by using a utility that implements LZ77 compression (such as gzip or WinZip) on your data file.
Another way to test compression on a specific table is to copy some data from your uncompressed table to a similar, compressed table (having all the same indexes) and look at the size of the resulting file. If you do so while no other compressed tables are being accessed, you can also examine the ratio of successful compression operations to overall compression operations. (In the INNODB_CMP table, compare COMPRESS_OPS to COMPRESS_OPS_OK. See INNODB_CMP for more information.) If a high percentage of compression operations complete successfully, the table might be a good candidate for compression.
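For example, a test along these lines creates a compressed copy of an existing table, populates it, and then checks the compression statistics. The table names t1 and t1_zip are placeholders, and the statements assume a server configured to support compressed tables (for example, with innodb_file_per_table enabled and the Barracuda file format):

-- Create a compressed copy of an existing (placeholder) table t1,
-- keeping the same indexes.
CREATE TABLE t1_zip LIKE t1;
ALTER TABLE t1_zip ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

-- Copy a representative sample of rows, then compare the sizes of the
-- .ibd files for t1 and t1_zip on disk.
INSERT INTO t1_zip SELECT * FROM t1 LIMIT 100000;

-- While no other compressed tables are being accessed, compare successful
-- compression operations to total compression operations.
SELECT page_size, compress_ops, compress_ops_ok
  FROM INFORMATION_SCHEMA.INNODB_CMP;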
Decide whether to compress data in your application or in the InnoDB table. It is usually not sensible to store data that is compressed by an application in an InnoDB compressed table. Further compression is extremely unlikely, and the attempt to compress just wastes CPU cycles.
The InnoDB table compression is automatic and applies to all columns and index values. The columns can still be tested with operators such as LIKE, and sort operations can still use indexes even when the index values are compressed. Because indexes are often a significant fraction of the total size of a database, compression could result in significant savings in storage, I/O or processor time. The compression and decompression operations happen on the database server, which likely is a powerful system that is sized to handle the expected load.
If you compress data such as text in your application before it is inserted into the database, you might save overhead for data that does not compress well by compressing some columns and not others. This approach uses CPU cycles for compression and uncompression on the client machine rather than the database server, which might be appropriate for a distributed application with many clients, or where the client machine has spare CPU cycles.
Of course, it is possible to combine these approaches. For some applications, it may be appropriate to use some compressed tables and some uncompressed tables. It may be best to externally compress some data (and store it in uncompressed InnoDB tables) and allow InnoDB to compress (some of) the other tables in the application. As always, up-front design and real-life testing are valuable in reaching the right decision.
In addition to choosing which tables to compress (and the page size), the workload is another key determinant of performance. If the application is dominated by reads rather than updates, fewer pages need to be reorganized and recompressed after the index page runs out of room for the per-page “modification log” that InnoDB maintains for compressed data. If the updates predominantly change non-indexed columns or those containing BLOBs or large strings that happen to be stored “off-page”, the overhead of compression may be acceptable. If the only changes to a table are INSERTs that use a monotonically increasing primary key, and there are few secondary indexes, there is little need to reorganize and recompress index pages. Since InnoDB can “delete-mark” and delete rows on compressed pages “in place” by modifying uncompressed data, DELETE operations on a table are relatively efficient.
For some environments, the time it takes to load data can be as important as run-time retrieval. Especially in data warehouse environments, many tables may be read-only or read-mostly. In those cases, it might or might not be acceptable to pay the price of compression in terms of increased load time, unless the resulting savings in fewer disk reads or in storage cost is significant.
Fundamentally, compression works best when CPU time is available for compressing and uncompressing data. Thus, if your workload is I/O-bound rather than CPU-bound, you might find that compression can improve overall performance. Therefore, when you test your application performance with different compression configurations, it is important to test on a platform similar to the planned configuration of the production system.
Reading and writing database pages from and to disk is the slowest aspect of system performance. Therefore, compression attempts to reduce I/O by using CPU time to compress and uncompress data, and thus is most effective when I/O is a relatively scarce resource compared to processor cycles.
This is often especially the case when running in a multi-user environment with fast, multi-core CPUs. When a page of a compressed table is in memory, InnoDB often uses an additional 16K in the buffer pool for an uncompressed copy of the page. The adaptive LRU algorithm in the InnoDB storage engine attempts to balance the use of memory between compressed and uncompressed pages to take into account whether the workload is running in an I/O-bound or CPU-bound manner. Nevertheless, a configuration with more memory dedicated to the InnoDB buffer pool tends to run better when using compressed tables than a configuration where memory is highly constrained.
The optimal setting of the compressed page size depends on the type and distribution of data that the table and its indexes contain. The compressed page size should always be bigger than the maximum record size, or operations may fail as noted in Section 13.7.3.4.2.1, “Compression of B-Tree Pages”.
Setting the compressed page size too large wastes some space, but the pages do not have to be compressed as often. If the compressed page size is set too small, inserts or updates may require time-consuming recompression, and the B-tree nodes may have to be split more frequently, leading to bigger data files and less efficient indexing.
Typically, one would set the compressed page size to 8K or 4K bytes. Given that the maximum InnoDB record size is around 8K, KEY_BLOCK_SIZE=8 is usually a safe choice.
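As a minimal sketch, a table with 8K compressed pages can be defined as shown below; the table and column names are illustrative only, and the example assumes a server where innodb_file_per_table is enabled and innodb_file_format is set to Barracuda, which are typically prerequisites for ROW_FORMAT=COMPRESSED in this release:

SET GLOBAL innodb_file_per_table=1;
SET GLOBAL innodb_file_format='Barracuda';

-- Illustrative table definition; 8K compressed pages are usually a safe choice.
CREATE TABLE t_compressed (
  id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  note TEXT
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;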
The current version of the InnoDB storage engine provides only a limited means to monitor the performance of compression at runtime. Overall application performance, CPU and I/O utilization and the size of disk files are the best indicators of how effective compression is for your application.
The InnoDB storage engine does include some Information Schema tables (see Example 13.1, “Using the Compression Information Schema Tables”) that reflect the internal use of memory and the rates of compression used overall. The INNODB_CMP tables report information about compression activity for each compressed page size (KEY_BLOCK_SIZE) in use. The information in these tables is system-wide, and includes summary data across all compressed tables in your database. You can use this data to help decide whether or not to compress a table by examining these tables when no other compressed tables are being accessed. The key statistics to consider are the number of, and amount of time spent performing, compression and uncompression operations. Since InnoDB must split B-tree nodes when they are too full to contain the compressed data following a modification, you should also compare the number of “successful” compression operations with the number of such operations overall. Based on the information in the INNODB_CMP tables and overall application performance and hardware resource utilization, you might make changes in your hardware configuration, adjust the size of the InnoDB buffer pool, choose a different page size, or select a different set of tables to compress.
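As a rough illustration, a query like the following reads the operation counts and the time (in seconds) spent compressing and uncompressing pages for each compressed page size in use:

SELECT page_size,
       compress_ops,
       compress_ops_ok,
       compress_time,     -- total seconds spent compressing
       uncompress_ops,
       uncompress_time    -- total seconds spent uncompressing
  FROM INFORMATION_SCHEMA.INNODB_CMP;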
If the amount of CPU time required for compressing and uncompressing is high, changing to faster CPUs, or those with more cores, can help improve performance with the same data, application workload and set of compressed tables. Increasing the size of the InnoDB buffer pool might also help performance, so that more uncompressed pages can stay in memory, reducing the need to uncompress pages that exist in memory only in compressed form.
A large number of compression operations overall (compared to the number of INSERT, UPDATE and DELETE operations in your application and the size of the database) could indicate that some of your compressed tables are being updated too heavily for effective compression. If so, choose a larger page size, or be more selective about which tables you compress.
If the number of “successful” compression operations (COMPRESS_OPS_OK) is a high percentage of the total number of compression operations (COMPRESS_OPS), then the system is likely performing well. If the ratio is low, however, InnoDB is reorganizing, recompressing and splitting B-tree nodes more often than is desirable. In this case, avoid compressing some tables, or choose a larger KEY_BLOCK_SIZE for some of the compressed tables. You might want to turn off compression for tables that cause the number of “compression failures” in your application to be more than 1% or 2% of the total (although such a failure ratio might be acceptable during a temporary operation such as a data load).
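As an illustrative sketch, you can compute that failure percentage for each page size directly from the INNODB_CMP statistics; the 1% or 2% figure above is only a rule of thumb:

SELECT page_size,
       compress_ops,
       compress_ops - compress_ops_ok AS failed_ops,
       100.0 * (compress_ops - compress_ops_ok) / compress_ops AS failure_pct
  FROM INFORMATION_SCHEMA.INNODB_CMP
 WHERE compress_ops > 0;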