Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
792 views
in Technique[技术] by (71.8m points)

indexing - MySQL varchar index length

I have a table like this:

CREATE TABLE `products` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(512) NOT NULL,
  `description` text,
  PRIMARY KEY (`id`),
) ENGINE=InnoDB AUTO_INCREMENT=38 DEFAULT CHARSET=utf8;

and one like this:

CREATE TABLE `product_variants` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `product_id` int(11) unsigned NOT NULL,
  `product_code` varchar(255) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `product_code` (`product_code`),
  KEY `product_variant_product_fk` (`product_id`),
  CONSTRAINT `product_variant_product_fk` FOREIGN KEY (`product_id`) REFERENCES `products` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1037 DEFAULT CHARSET=utf8;

and an SQL statement like this

SELECT p.id AS id, p.name AS name, p.description AS description, pv.id AS product_variant_id, pv.product_code AS product_code
FROM products p
INNER JOIN product_variants pv ON pv.product_id = p.id
ORDER BY p.name ASC
LIMIT 300 OFFSET 0;

which if I explain gives me this:

+----+-------------+-------+------+----------------------------+----------------------------+---------+---------+--------+----------------+
| id | select_type | table | type | possible_keys              | key                        | key_len | ref     | rows   | Extra          |
+----+-------------+-------+------+----------------------------+----------------------------+---------+---------+--------+----------------+
|  1 | SIMPLE      | p     | ALL  | PRIMARY                    | NULL                       | NULL    | NULL    | 993658 | Using filesort |
|  1 | SIMPLE      | pv    | ref  | product_variant_product_fk | product_variant_product_fk | 4       | db.p.id |      1 |                |
+----+-------------+-------+------+----------------------------+----------------------------+---------+---------+--------+----------------+
2 rows in set (0.00 sec)

For a million rows, this is pretty slow. I've tried adding an index on products.name with:

ALTER TABLE products ADD INDEX `product_name_idx` (name(512));

which gives this:

mysql> show indexes from products;
+----------+------------+------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table    | Non_unique | Key_name         | Seq_in_index | Column_name     | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+----------+------------+------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| products |          0 | PRIMARY          |            1 | id              | A         |      993658 |     NULL | NULL   |      | BTREE      |         |               |
| products |          1 | product_manf_fk  |            1 | manufacturer_id | A         |          18 |     NULL | NULL   | YES  | BTREE      |         |               |
| products |          1 | product_name_idx |            1 | name            | A         |         201 |      255 | NULL   |      | BTREE      |         |               |
+----------+------------+------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
3 rows in set (0.00 sec)

I think that the Sub_part column shows the prefix that has been in indexed (in bytes), as described on this page.

When I re-explain the query, I get:

+----+-------------+-------+------+----------------------------+----------------------------+---------+---------+--------+----------------+
| id | select_type | table | type | possible_keys              | key                        | key_len | ref     | rows   | Extra          |
+----+-------------+-------+------+----------------------------+----------------------------+---------+---------+--------+----------------+
|  1 | SIMPLE      | p     | ALL  | PRIMARY                    | NULL                       | NULL    | NULL    | 993658 | Using filesort |
|  1 | SIMPLE      | pv    | ref  | product_variant_product_fk | product_variant_product_fk | 4       | db.p.id |      1 |                |
+----+-------------+-------+------+----------------------------+----------------------------+---------+---------+--------+----------------+
2 rows in set (0.00 sec)

which looks like the new index is not being used. As described on this page, indexes will not be used for sorting if they are prefix indexes. In fact if I truncate the data with:

alter table products modify `name`  varchar(255) not null;

The explain gives:

+----+-------------+-------+-------+----------------------------+----------------------------+---------+----------------------------------------------+------+-------+
| id | select_type | table | type  | possible_keys              | key                        | key_len | ref                                          | rows | Extra |
+----+-------------+-------+-------+----------------------------+----------------------------+---------+----------------------------------------------+------+-------+
|  1 | SIMPLE      | p     | index | PRIMARY                    | product_name_idx           | 767     | NULL                                         |  300 |       |
|  1 | SIMPLE      | pv    | ref   | product_variant_product_fk | product_variant_product_fk | 4       | oh_2c98c233_69fe_4f06_ad0d_fe6f85a5beac.p.id |    1 |       |
+----+-------------+-------+-------+----------------------------+----------------------------+---------+----------------------------------------------+------+-------+

which I think backs that up. However, it says on this page that InnoDB tables can have up to 767 bytes of index. If the length is in bytes, why does it refuse to have more than 255? If it's in characters, how is it deciding the length of each UTF-8 character? Is it just assuming 3?

Also, am using this version of MySQL:

mysql> select version();
+------------+
| version()  |
+------------+
| 5.5.27-log |
+------------+
1 row in set (0.00 sec)
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Sept 2021 edit: I've been using MySQL 8.0 for a couple years now, so here's some updated info.

The MySQL manual now has a very informative page regarding conversion between utf8mb3 (currently also known as utf8) and utf8mb4. utf8mb3 is deprecated and will be removed eventually; and when it is removed, its current alias, utf8, will refer to utf8mb4 instead.

With deprecated utf8mb3, you can store up to 255 characters in an index, while with utf8mb4, up to 191, when using COMPACT or REDUNDANT row format.

With COMPRESSED or DYNAMIC row format, index key prefixes can be up to 3072 bytes. With them, you can index up to 1024 characters for utf8mb3, and 768 characters for utf8mb4.

Below is my previous answer, which explains some of the logic behind the number of characters you can index versus the number of bytes.


I must revise my answer due to my research. I originally posted this (quoting myself):

I believe the answer is that you cannot know how many characters will be in the index because you cannot know how many bytes your characters will be (unless you do something to exclude multi-byte characters).

And I'm not sure, but it might still be correct, but not in quite the way I was thinking.

Here is the correct answer:

MySQL assumes 3 bytes per utf8 character. 255 characters is the maximum index size you can specify per column, because 256x3=768, which breaks the 767 byte limit.

If you don't specify index size, MySQL chooses the maximum size (i.e. 255 per column). A UNIQUE constraint cannot be put on a utf8 column whose length is greater than 255, because a unique index must contain the entire cell value. But a regular index can be used - it will just index the first 255 characters (or first 767 bytes?). And that is where there is still some mystery for me.

The MySTERY: I can see why MySQL assumes 3 bytes per character, for safety, because otherwise the UNIQUE constraint could be broken. But the docs seem to suggest that the index is actually sized in bytes, not characters. So, suppose you put a 255 char (765 byte) index on a varchar(256) column. If the characters you store are all ASCII, 1-byte characters, like A-Z, a-z, 0-9, then then you can fit the entire column into the 767 byte index. And it seems like that is what would actually happen.

Below is some more information from my original answer about characters, bytes, etc.


According to wikipedia, UTF-8 character can be 1,2, 3, or 4 bytes long. But, according to this mysql documentation, the maximium character size is 3 bytes, and so any column index index over 255 characters might hit that byte limit. But as I understand it, it might not. If most of your characters are in the ASCII range, then your average character size will be closer to 1 byte. If your average character size is, for example, 1.3 bytes (mostly 1 byte, but a significant number of 2-3 byte characters), then you could specify an index of 767/1.3

So, if you are storing mostly 1-byte characters, your actual character limit would be more like: 767 / 1.3 = 590. But it turns out that is not the way it works. 255 characters is the limit.

As mentioned in this MySQL documentation,

Prefix limits are measured in bytes, whereas the prefix length in CREATE INDEX statements is interpreted as number of characters for nonbinary data types (CHAR, VARCHAR, TEXT). Take this into account when specifying a prefix length for a column that uses a multi-byte character set.

It seems that MySQL is advising people to do a calculation/guestimation like I just did in order to determine your key size for a varchar column. But in fact you cannot specify an index larger than 255 for utf8 columns.

Finally, if you refer back to my second link again, there is also this:

When the innodb_large_prefix configuration option is enabled, this length limit is raised to 3072 bytes, for InnoDB tables that use the DYNAMIC and COMPRESSED row formats.

So it seems like you can get much larger indexes if you want to, with a bit of tweaking. Just make sure the row formats are DYNAMIC or COMPRESSED. You can probably specify an index of 1023 or 1024 characters in that case.


By the way, it turns out that you can store 4-byte characters using [the utf8mb4 character set][4]. The utf8 character set apparently stores only ["plane 0" characters][5].

EDIT:

I just tried to create a composite index on a varchar(511) column with a tinyint(1) column and got the error message saying the max index size was 767 bytes. This makes me believe that MySQL assumes utf8 character set columns will contain 3 bytes per character (the maximum), and allows you to use 255 chars max. But perhaps that is only with composite indexes. I will update my answer as I find out more. But for now I'm leaving this as an edit.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...