How to calculate cardinality

When deciding whether to create a new index, or to split a table, it's useful to work out the cardinality of the data first to see whether the change is worth making. You can calculate the cardinality of the data in a column before creating an index on it.

Calculating the cardinality of a column is easy. First, find out how many rows you have in the table:

SELECT COUNT(*) FROM {table}

This will be used to calculate the index cardinality 'score': the number of distinct values divided by the number of rows.

Cardinality of full column indexes

Calculating the cardinality of single, full column indexes is just a case of running SQL SELECTs to count the distinct values:

SELECT COUNT(DISTINCT {field}) FROM {table}
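The two counts can be combined into a single 'score' in one step. The sketch below is only an illustration: it assumes the score is the number of distinct values divided by the number of rows, and it uses Python's built-in sqlite3 module with a small made-up table as a stand-in for a MySQL connection. The function name cardinality_score and the sample rows are my own inventions, not part of the 'world' database.

```python
import sqlite3


def cardinality_score(conn, table, field):
    """Return (rows, distinct, score) for one column.

    Assumption: score = distinct values / total rows.
    Table and field names are interpolated into the SQL directly,
    so only pass identifiers you trust.
    """
    rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = conn.execute(
        f"SELECT COUNT(DISTINCT {field}) FROM {table}"
    ).fetchone()[0]
    score = distinct / rows if rows else 0.0
    return rows, distinct, score


# Made-up sample data (hypothetical, not the real 'world' database).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE City (Name TEXT, CountryCode TEXT)")
conn.executemany(
    "INSERT INTO City VALUES (?, ?)",
    [
        ("Kabul", "AFG"),
        ("Herat", "AFG"),
        ("Amsterdam", "NLD"),
        ("Utrecht", "NLD"),
    ],
)

rows, distinct, score = cardinality_score(conn, "City", "CountryCode")
print(rows, distinct, score)
```

The same two SELECT statements should work unchanged against MySQL; only the connection object would differ.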

Let's use some proper data to demonstrate this. If you haven't already, download and install the 'world' database from the MySQL website: http://dev.mysql.com/doc/#sampledb. This database doesn't contain much in the way of indexes, so it is ideal for experimentation.

One table in this database that could use some indexing is the 'City' table. First, we find the number of rows:

mysql> SELECT COUNT(*) FROM City;
+----------+
| COUNT(*) |
+----------+
|     4079 |
+----------+
1 row in set (0.02 sec)

This table contains a field called 'CountryCode', which is likely to be used to find all the cities in a given country and so will contain duplicates – this makes it a potential candidate for an index. So, let's see what the cardinality of the 'CountryCode' field is:

mysql> SELECT COUNT(DISTINCT CountryCode) FROM City;
+-----------------------------+
| COUNT(DISTINCT CountryCode) |
+-----------------------------+
|                         232 |
+-----------------------------+
1 row in set (0.00 sec)

If we created an index on that field it would have a cardinality of 232, and with 4079 rows in the table this gives a 'score' of 232 / 4079 ≈ 0.06. Although this would be classed as a 'low' cardinality according to my own scoring, the index is still useful – we have already identified that users are likely to run SELECT statements using this column (and hence this index), so there's a good chance the query optimiser will use it.
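To make the arithmetic concrete, here is a small sketch using the two counts reported above. The extra figure, average rows per country code, is my own addition: it is simply the inverse of the score and gives a feel for how many rows a typical lookup on this column would return.

```python
rows = 4079      # COUNT(*) for the City table, as reported above
distinct = 232   # COUNT(DISTINCT CountryCode), as reported above

# Score as used in this article: distinct values / total rows.
score = distinct / rows

# Inverse view: how many rows share each value, on average.
avg_rows_per_value = rows / distinct

print(f"score = {score:.2f}")
print(f"average rows per country code = {avg_rows_per_value:.1f}")
```

So a lookup by country code narrows the 4079 rows down to about 18 on average, which is why a 'low' score does not automatically mean a useless index.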

Calculating cardinality for indexes using prefixes

The MySQL database server allows you to create indexes based on part of a column's value. This is not the same as the 'partial indexes' of other RDBMSs (which allow you to create an index on a subset of the table data) – all rows are added to the index. Prefix indexes are extremely useful, but require investigation and maintenance to ensure that they're running at their best, as the cardinality can vary wildly with data changes.

4079 records is quite a lot – let's assume that a system using this table would display an A-Z index, allowing the user to pick the first letter of the city they're looking for. As it stands, there's no index on the 'Name' field, so the MySQL database server would have to scan all 4079 rows to get all of the records for a given letter (the server cannot guarantee that new rows matching the search condition haven't been added either to the end of the table or in the space left by a deleted row).

Common sense tells us we can do better than that – if we had an index based on the first character of the name, the lookup would be much faster. So, we calculate the cardinality of a single character prefix index:

mysql> SELECT COUNT(DISTINCT(LEFT(Name,1))) FROM City;
+-------------------------------+
| COUNT(DISTINCT(LEFT(Name,1))) |
+-------------------------------+
|                            30 |
+-------------------------------+
1 row in set (0.01 sec)

A cardinality 'score' of 0.007 – that's very low! It's probably the most efficient index for our case of letting users select a city by its starting character, but it's unlikely to be much use for anything else. Let's try increasing the length of the prefix, and see what kind of cardinality we get:

mysql> SELECT COUNT(DISTINCT(LEFT(Name,2))) FROM City;
+-------------------------------+
| COUNT(DISTINCT(LEFT(Name,2))) |
+-------------------------------+
|                           315 |
+-------------------------------+
1 row in set (0.00 sec)

mysql> SELECT COUNT(DISTINCT(LEFT(Name,3))) FROM City;
+-------------------------------+
| COUNT(DISTINCT(LEFT(Name,3))) |
+-------------------------------+
|                          1623 |
+-------------------------------+
1 row in set (0.00 sec)

mysql> SELECT COUNT(DISTINCT(LEFT(Name,4))) FROM City;
+-------------------------------+
| COUNT(DISTINCT(LEFT(Name,4))) |
+-------------------------------+
|                          2965 |
+-------------------------------+
1 row in set (0.00 sec)

'Scores' of 0.08, 0.40 and 0.73, respectively. The 3 character prefix index looks the most interesting – a middle-of-the-road cardinality means it's likely to be considered for many more queries. Although each query would have to go through more index entries, fewer indexes are likely to be needed and there is more chance that the index blocks will be available in the key cache anyway.
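Running one query per prefix length by hand gets tedious, so the sweep can be scripted. This is a rough sketch rather than a finished tool: it uses Python's sqlite3 module and a handful of made-up city names as a stand-in for MySQL, and SQLite's substr(Name, 1, n) in place of MySQL's LEFT(Name, n). The function name prefix_cardinalities is hypothetical. SQLite also compares strings case-sensitively by default, which may not match the collation of your MySQL table, so treat the counts on real data as approximate.

```python
import sqlite3


def prefix_cardinalities(conn, table, field, max_len):
    """Return a list of (prefix_length, distinct_count, score) tuples.

    Assumption: score = distinct prefixes / total rows.
    Identifiers are interpolated directly, so only pass trusted names.
    """
    rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    results = []
    for length in range(1, max_len + 1):
        distinct = conn.execute(
            f"SELECT COUNT(DISTINCT substr({field}, 1, ?)) FROM {table}",
            (length,),
        ).fetchone()[0]
        results.append((length, distinct, distinct / rows))
    return results


# Made-up sample data (hypothetical, not the real 'world' database).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE City (Name TEXT)")
conn.executemany(
    "INSERT INTO City VALUES (?)",
    [(name,) for name in
     ["Amsterdam", "Amersfoort", "Antwerpen", "Berlin", "Bern", "Bergen"]],
)

results = prefix_cardinalities(conn, "City", "Name", 4)
for length, distinct, score in results:
    print(f"prefix {length}: {distinct} distinct, score {score:.2f}")
```

Because prefix cardinality shifts as the data changes, a sweep like this is worth re-running from time to time rather than once.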

Mixing it up – multiple column indexes

The MySQL database server allows you to create indexes based on multiple columns, which can be extremely useful. However, the cardinality of these indexes is greater than that of the single column indexes, which could reduce their chances of being used by the query optimiser. Having said that, the MySQL database server can also use a 'left-most' prefix of a multiple column index (that is, only the leading columns of the index, in the order they were defined), which can help consolidate the indexes on a table and reduce the total number without impacting performance too much.
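The left-most rule is easier to see written out. The helper below is purely illustrative (leftmost_prefixes is a name I made up, and it does not talk to a database): it lists the column combinations that count as leading prefixes of a multiple column index.

```python
def leftmost_prefixes(index_columns):
    """Return every leading run of columns from a multiple column index."""
    return [
        tuple(index_columns[:n])
        for n in range(1, len(index_columns) + 1)
    ]


# Hypothetical index on (CountryCode, Name).
prefixes = leftmost_prefixes(["CountryCode", "Name"])
print(prefixes)

# A lookup on Name alone does not start at the left of the index.
print(("Name",) in prefixes)
```

So an index on (CountryCode, Name) can stand in for a separate index on CountryCode, but not for one on Name.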

You can get a good idea of the cardinality of multiple column indexes, although it's a bit more tricky – every single combination of the columns must be considered, and duplicates removed. This, however, can still be done with a single query – let's take a look at the estimated cardinality of a multiple column index on the City table:

mysql> SELECT COUNT(DISTINCT(CONCAT(t1.CountryCode, t1.Name))) FROM City t1 LEFT JOIN City t2 USING (CountryCode);
+--------------------------------------------------+
| COUNT(DISTINCT(CONCAT(t1.CountryCode, t1.Name))) |
+--------------------------------------------------+
|                                             4056 |
+--------------------------------------------------+
1 row in set (0.09 sec)

Although this does place the index in the (very!) high cardinality category, with a 'score' of 4056 / 4079 ≈ 0.99, it's still a very useful index, especially as some queries may run entirely from the index.
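Two details of the query above are worth a closer look. The self-join only repeats rows from t1, and COUNT(DISTINCT ...) ignores repeats, so the count should come out the same without it. Also, concatenating columns with no separator can in general merge different pairs into the same string; that looks harmless here if every country code has the same length, but it is not safe for arbitrary columns. The sketch below demonstrates both points using Python's sqlite3 module as a stand-in for MySQL, with a made-up table (pairs_demo, columns a and b) chosen to trigger the collision. On the MySQL side, COUNT(DISTINCT ...) also accepts several expressions, for example COUNT(DISTINCT CountryCode, Name), which avoids the concatenation altogether.

```python
import sqlite3

# Made-up table and rows, chosen so that concatenation collides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pairs_demo (a TEXT, b TEXT)")
conn.executemany(
    "INSERT INTO pairs_demo VALUES (?, ?)",
    [
        ("AB", "C"),           # 'AB' + 'C'  -> 'ABC'
        ("A", "BC"),           # 'A'  + 'BC' -> 'ABC' (collides)
        ("NLD", "Amsterdam"),
        ("NLD", "Amsterdam"),  # genuine duplicate pair
    ],
)

# Distinct (a, b) pairs, counted without concatenation.
distinct_pairs = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT a, b FROM pairs_demo) AS pairs"
).fetchone()[0]

# Distinct concatenated strings (|| is SQLite's concatenation operator).
concatenated = conn.execute(
    "SELECT COUNT(DISTINCT a || b) FROM pairs_demo"
).fetchone()[0]

# Same concatenated count, with a self-join like the one in the article.
self_joined = conn.execute(
    "SELECT COUNT(DISTINCT t1.a || t1.b) "
    "FROM pairs_demo t1 LEFT JOIN pairs_demo t2 ON t1.a = t2.a"
).fetchone()[0]

print(distinct_pairs, concatenated, self_joined)
```

The pair count and the concatenated count differ on this data, while the self-join makes no difference to the result.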

Real ultimate power – multiple column indexes using prefixes

Combining multiple column indexes with indexes using prefixes is a powerful technique to reduce the number of indexes and keep cardinality reasonable. Calculating the estimated cardinality uses a mixture of the previous two methods:

mysql> SELECT COUNT(DISTINCT(CONCAT(t1.CountryCode, LEFT(t1.Name,3)))) FROM City t1 LEFT JOIN City t2 USING (CountryCode);
+----------------------------------------------------------+
| COUNT(DISTINCT(CONCAT(t1.CountryCode, LEFT(t1.Name,3)))) |
+----------------------------------------------------------+
|                                                     3163 |
+----------------------------------------------------------+
1 row in set (1.68 sec)

This has lowered the cardinality compared with the full multiple column index, making it much more useful for general purpose queries.
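To compare the candidates side by side, the distinct counts reported throughout this article can be turned into scores in one go. This is just the same division repeated, with the score taken as distinct values over the 4079 rows; the labels such as Name(3) are my own shorthand for a 3 character prefix.

```python
ROWS = 4079  # COUNT(*) for the City table, as reported earlier

# Distinct counts copied from the query results in this article.
candidates = {
    "CountryCode": 232,
    "Name(1)": 30,
    "Name(2)": 315,
    "Name(3)": 1623,
    "Name(4)": 2965,
    "CountryCode, Name": 4056,
    "CountryCode, Name(3)": 3163,
}

# Score = distinct values / total rows.
scores = {name: distinct / ROWS for name, distinct in candidates.items()}

# Print from lowest to highest score.
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:<22} {candidates[name]:>5}  {score:.3f}")
```

On these figures the prefixed multiple column index scores roughly 0.78 against roughly 0.99 for the full one.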