Channel: Percona Database Performance Blog

Revamp MySQL Query Optimization and Overcome Slowness of ORDER BY with LIMIT Queries

The efficiency of database queries in MySQL can make all the difference in the performance and responsiveness of applications. In this blog post, I’ll dig into MySQL query optimization and show how MySQL uses indexes for queries that sort and limit their results. While sorting may seem simple, doing it efficiently depends on understanding which index the optimizer chooses and why.

Sorting rows is not free: sorting a large data set can take a significant amount of time and resources, so do it deliberately. If you don’t need your rows in a certain order, don’t order them.

However, if you need to order your rows, doing it efficiently and effectively is essential to optimize your queries. You must understand how to use indexes to make sorting cheaper. 

Consider this question: which is faster, LIMIT 1 or LIMIT 10? Presumably, fetching fewer rows is faster than fetching more. However, for 16 years, since 2007, the MySQL query optimizer has had a “bug” that not only can make LIMIT 1 slower than LIMIT 10 but can also turn the former into a table scan, which tends to cause problems. I worked through the details of such a case for one of our clients last week, which led me to write this article. It is aimed at developers and DBAs and explains how the MySQL query optimizer handles queries that combine GROUP BY or ORDER BY with LIMIT, and how we can now control that behavior with the prefer_ordering_index flag of the optimizer_switch variable, which is documented under Switchable Optimizations in MySQL.

Before looking at the problematic query, I will walk you through a little detail about the optimizer.  The Query Optimizer is the part of query execution that chooses the query plan.  A Query Execution Plan is the way a database chooses to run a specific query.  It includes index choices, join types, table query order, temporary table usage, sorting type, etc. The execution plan for a specific query can be obtained using the EXPLAIN command.
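As a minimal sketch of how a plan can be obtained, the statements below run EXPLAIN in its common forms. The table t and its column id are hypothetical placeholders, not objects from this post. FORMAT=TREE assumes MySQL 8.0.16 or later, and EXPLAIN ANALYZE assumes MySQL 8.0.18 or later.

```sql
-- Hypothetical table `t` with an indexed column `id`; substitute your own query.

-- Traditional tabular plan: one row per table with type, key, rows, and Extra.
EXPLAIN SELECT id FROM t ORDER BY id DESC LIMIT 1;

-- JSON plan, which also exposes the optimizer's cost estimates.
EXPLAIN FORMAT=JSON SELECT id FROM t ORDER BY id DESC LIMIT 1;

-- Tree-style plan (assumes MySQL 8.0.16 or later).
EXPLAIN FORMAT=TREE SELECT id FROM t ORDER BY id DESC LIMIT 1;

-- Executes the query and reports actual timings and row counts
-- (assumes MySQL 8.0.18 or later). Use with care on slow queries,
-- because the statement really runs.
EXPLAIN ANALYZE SELECT id FROM t ORDER BY id DESC LIMIT 1;
```

Plain EXPLAIN only estimates, so it is safe to run against a query that takes hours; EXPLAIN ANALYZE is not.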

There is a concept called Switchable Optimizations, where MySQL lets you control the query optimizer, which is managed by the optimizer_switch variable. This system variable enables control over optimizer behavior. The value of this variable is a set of flags, each of which has a value of on or off to indicate whether the corresponding optimizer behavior is enabled or disabled. This variable has global and session values and can be changed at runtime. The global default can be set at server startup.

To see the current set of optimizer flags, select the variable value:

mysql> SELECT @@optimizer_switch\G
*************************** 1. row ***************************
@@optimizer_switch: index_merge=on,index_merge_union=on,
                    index_merge_sort_union=on,index_merge_intersection=on,
                    engine_condition_pushdown=on,index_condition_pushdown=on,
                    mrr=on,mrr_cost_based=on,block_nested_loop=on,
                    batched_key_access=off,materialization=on,semijoin=on,
                    loosescan=on,firstmatch=on,duplicateweedout=on,
                    subquery_materialization_cost_based=on,
                    use_index_extensions=on,condition_fanout_filter=on,
                    derived_merge=on,use_invisible_indexes=off,skip_scan=on,
                    hash_join=on,subquery_to_derived=off,
                    prefer_ordering_index=on,hypergraph_optimizer=off,
                    derived_condition_pushdown=on

Optimizer flag

Let’s take a deeper look into one of the opt_name flags, i.e., prefer_ordering_index. This flag controls whether, in the case of a query having an ORDER BY or GROUP BY with a LIMIT clause, the optimizer tries to use an ordered index instead of an unordered index, a filesort, or some other optimization. This optimization is performed by default whenever the optimizer determines that using it would allow for faster query execution. Because the algorithm that makes this determination cannot handle every conceivable case (due in part to the assumption that the distribution of data is always more or less uniform), there are cases in which this optimization may not be desirable. Prior to MySQL 8.0.21, it was not possible to disable this optimization, but in MySQL 8.0.21 and later, while it remains the default behavior, it can be disabled by setting the prefer_ordering_index flag to off.
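A minimal sketch of toggling this single flag is shown here. It assumes MySQL 8.0.21 or later (earlier versions do not know the flag) and, for the last statement, an account allowed to change global system variables.

```sql
-- Disable the flag for the current session only.
SET SESSION optimizer_switch = 'prefer_ordering_index=off';

-- Check the session value: returns 1 while the flag is off, 0 otherwise.
SELECT @@SESSION.optimizer_switch LIKE '%prefer_ordering_index=off%' AS flag_is_off;

-- Restore this one flag to its built-in default (on); other flags are untouched.
SET SESSION optimizer_switch = 'prefer_ordering_index=default';

-- Change the global value; sessions that are already connected keep their own copy.
SET GLOBAL optimizer_switch = 'prefer_ordering_index=on';
```

Only the flags named in the string are changed, so there is no need to repeat the full list shown above.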

Here, we will look through a case study with an example. Let’s first understand the problem.

Problem statement:

The query below was taking far too long to execute. The table being queried is around 850 GB in size, and the query took around three hours to return a single row. Isn’t that crazy? Why does it take so long? Let’s get into the details to find out.

mysql> select `tokenId` from `test_db`.`Tokens` order by `tokenId` desc limit 1;

Here is the structure of the table:

mysql> show create table `test_db`.`Tokens` \G
*************************** 1. row ***************************
      Table: Tokens
Create Table: CREATE TABLE `Tokens` (
  `tokenId` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `tokenTypeId` int(10) unsigned NOT NULL DEFAULT '1',
  `accountId` bigint(20) unsigned DEFAULT NULL,
  `token` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `validFrom` datetime NOT NULL,
  `validTo` datetime NOT NULL,
  `dateCreated` datetime NOT NULL,
  `lastUpdated` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `status` enum('INACTIVE','ACTIVE','REDEEMED','CANCELLED','EXPIRED','INVALIDATED','USED') COLLATE utf8_unicode_ci NOT NULL,
  `passphrase` varchar(255) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
  PRIMARY KEY (`tokenId`),                         
  KEY `idx_dateCreated` (`dateCreated`),
  KEY `idx_status_validTo` (`status`,`validTo`),
  KEY `idx_token` (`token`),
  KEY `idx_tokenTypeId` (`tokenTypeId`),  <<---This index is being used by the query and scans 4065011580 rows.
  KEY `idx_accountId_status` (`accountId`,`status`)
) ENGINE=InnoDB AUTO_INCREMENT=5984739122 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci ROW_FORMAT=COMPRESSED
1 row in set (0.00 sec)

mysql> select table_schema, table_name, table_rows, round(data_length / 1024 / 1024 / 1024) DATA_GB, round(index_length / 1024 / 1024 / 1024) INDEX_GB, round(data_free / 1024 / 1024 / 1024) FREE_GB, round(((data_length / 1024 / 1024)+round(index_length / 1024 / 1024)+round(data_free / 1024 / 1024))/1024) TOTAL_GB from information_schema.tables where table_name='Tokens';
+----------------+------------+------------+---------+----------+---------+----------+
| table_schema   | table_name | table_rows | DATA_GB | INDEX_GB | FREE_GB | TOTAL_GB |
+----------------+------------+------------+---------+----------+---------+----------+
| test_db        | Tokens     | 4069894019 |     360 |      438 |      52 |      850 |
+----------------+------------+------------+---------+----------+---------+----------+
1 row in set (0.01 sec)

prefer_ordering_index=OFF

mysql> show variables like '%optimizer_switch%' \G
*************************** 1. row ***************************
Variable_name: optimizer_switch
        Value: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,engine_condition_pushdown=on,index_condition_pushdown=on,mrr=on,mrr_cost_based=on,block_nested_loop=on,batched_key_access=off,materialization=on,semijoin=on,loosescan=on,firstmatch=on,duplicateweedout=on,subquery_materialization_cost_based=on,use_index_extensions=on,condition_fanout_filter=on,derived_merge=on,prefer_ordering_index=off,favor_range_scan=off
1 row in set (0.01 sec)

This is the execution plan with prefer_ordering_index=OFF:

mysql> explain select `tokenId` from `test_db`.`Tokens` order by `tokenId` desc limit 1\G
*************************** 1. row ***************************
          id: 1
  select_type: SIMPLE
        table: Tokens
  partitions:
        type: index
possible_keys: NULL
          key: idx_tokenTypeId
      key_len: 4
          ref: NULL
        rows: 3187489428
    filtered: 100.00
        Extra: Using index; Using filesort
1 row in set, 1 warning (0.00 sec)

  • Here, the query uses the secondary index idx_tokenTypeId and a filesort to order the result set. The estimate of roughly 3.2 billion rows shows that MySQL scans the entire index and sorts it just to return a single row.

prefer_ordering_index=ON 

This is how the query behaves after switching prefer_ordering_index to ON. The optimizer now uses the PRIMARY key, which already returns rows in the requested order:

mysql> set optimizer_switch='prefer_ordering_index=on';
Query OK, 0 rows affected (0.00 sec)

mysql> explain select `tokenId` from `test_db`.`Tokens` order by `tokenId` desc limit 1\G
*************************** 1. row ***************************
          id: 1
  select_type: SIMPLE
        table: Tokens
  partitions:
        type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 8
          ref: NULL
        rows: 1
    filtered: 100.00
        Extra: Using index
1 row in set, 1 warning (0.00 sec)

With this plan, the query executes in less than a second. Great!

mysql> select `tokenId` from `test_db`.`Tokens` order by `tokenId` desc limit 1 ;
+------------+
| tokenId    |
+------------+
| 5984755269 |
+------------+
1 row in set (0.00 sec)

By default, prefer_ordering_index is set to ON.

Let me take you into more detail now. The usage of the index in the above case is based on the following aspects: 

  1. Index cardinality

An estimate of the number of unique values in the index. CARDINALITY is counted based on statistics stored as integers, so the value is not necessarily exact, even for small tables. The higher the cardinality, the greater the chance MySQL uses the index.
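The cardinality estimates the optimizer works with can be inspected directly. The sketch below uses the Tokens table from the case study; whether refreshing statistics is appropriate on a table of that size is something to judge for your own environment.

```sql
-- Cardinality estimates for every index on the table from the case study.
SHOW INDEX FROM `test_db`.`Tokens`;

-- The same estimates from information_schema, one row per indexed column.
SELECT index_name, seq_in_index, column_name, cardinality
FROM information_schema.statistics
WHERE table_schema = 'test_db'
  AND table_name = 'Tokens'
ORDER BY index_name, seq_in_index;

-- Refresh the estimates if they look stale. For InnoDB this is based on
-- sampled index pages, but try it off-peak first on a very large table.
ANALYZE TABLE `test_db`.`Tokens`;
```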

  2. Filesort

Filesort is the catch-all algorithm for producing sorted results for ORDER BY or GROUP BY queries. This is how filesort works:

  • Read the rows that match the WHERE clause.
  • For each row, record a tuple of values consisting of the sort key value and the additional fields referenced by the query.
  • When the sort buffer becomes full, sort the tuples by sort key value in memory and write them to a temporary file.
  • After merge-sorting the temporary files, retrieve the rows in sorted order and read the required columns directly from the sorted tuples.
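One way to watch these steps from the outside is through the session sort counters. This is a sketch only: the table t and its columns id and payload are hypothetical, and FLUSH STATUS needs the appropriate privilege on your server.

```sql
-- Size of the per-session sort buffer, in bytes.
SELECT @@SESSION.sort_buffer_size;

-- Reset the session status counters before the query under test.
FLUSH STATUS;

-- Hypothetical query: `payload` is assumed to have no index, so rows cannot
-- be read in the requested order and MySQL has to sort them.
SELECT id, payload FROM t ORDER BY payload LIMIT 10;

-- Sort_rows: rows that were sorted.
-- Sort_scan / Sort_range: sorts fed by a table scan or by a range scan.
-- Sort_merge_passes: merge passes the sort had to make; a growing value
-- suggests the sort buffer was too small for the data being sorted.
SHOW SESSION STATUS LIKE 'Sort%';
```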
  3. How does the query optimizer prefer an ordering index?

For a query with an ORDER BY or GROUP BY and a LIMIT clause, the optimizer tries to choose an ordered index (one that returns rows in the requested order) instead of an unordered index whenever it estimates that doing so would speed up execution. An unordered index may require a filesort, which can increase the query execution time. This optimization is applied by default.

Prior to MySQL 8.0.21, there was no way to override this behavior, even in cases where using some other optimization might be faster. Beginning with MySQL 8.0.21, this optimization can be turned off by setting the optimizer_switch system variable’s prefer_ordering_index flag to off.

Caution:

Disabling prefer_ordering_index exposes another problem: MySQL no longer scans the primary key for SELECT … FROM t ORDER BY pk_col LIMIT n. Instead, it does a full scan plus a sort, which is unnecessary and likely to cause problems. This is what happened to our client: they had disabled prefer_ordering_index because they preferred that the optimizer pick the index that satisfies the WHERE clause. That choice caused a real issue for this query, and we would call the behavior a bug.

Also note that a change to optimizer_switch affects all subsequent queries if you make it at the global level or in the my.cnf file, which may hurt the performance of other queries in your application. To treat one query differently from another, change optimizer_switch at the session level before that query, or use FORCE INDEX in the query itself. As of MySQL 8.2.0, there is no dedicated optimizer hint for the prefer_ordering_index flag. Hopefully, MySQL will add one in an upcoming release.
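The per-query options can be sketched as follows, using the query from the case study. The user variable @saved_switch is a name chosen for this example. Option 3 relies on the general-purpose SET_VAR hint rather than a hint dedicated to this flag, so verify that it behaves as expected on your version before depending on it.

```sql
-- Option 1: change the flag for this session around one statement,
-- then put back whatever the session had before.
SET @saved_switch = @@SESSION.optimizer_switch;
SET SESSION optimizer_switch = 'prefer_ordering_index=on';
SELECT `tokenId` FROM `test_db`.`Tokens` ORDER BY `tokenId` DESC LIMIT 1;
SET SESSION optimizer_switch = @saved_switch;

-- Option 2: name the index explicitly with an index hint.
SELECT `tokenId` FROM `test_db`.`Tokens` FORCE INDEX (PRIMARY)
ORDER BY `tokenId` DESC LIMIT 1;

-- Option 3: scope the setting to a single statement with SET_VAR.
SELECT /*+ SET_VAR(optimizer_switch = 'prefer_ordering_index=on') */ `tokenId`
FROM `test_db`.`Tokens`
ORDER BY `tokenId` DESC LIMIT 1;
```

Option 1 leaves the rest of the session unchanged only if the restore statement actually runs, so application code should restore the value in its error path as well.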

Another example

Table: employee_details

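+----+-------------+---------------+------------+
| id | employee_id | department_id | hire_date  |
+----+-------------+---------------+------------+
|  1 |         101 |          1001 | 2017-11-09 |
|  2 |         102 |          1003 | 2020-02-06 |
|  3 |         103 |          1006 | 2021-05-15 |
|  4 |         104 |          1002 | 2022-07-10 |
|  5 |         105 |          1006 | 2022-02-06 |
|  6 |         106 |          1004 | 2023-06-14 |
+----+-------------+---------------+------------+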

Primary Key    : id
Secondary index: <department_id,employee_id>
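To follow along, the table can be recreated with a sketch like the one below. The column types are my assumptions, since the post does not show the table definition; the index name idx_dept_empid is taken from the EXPLAIN output later in this section.

```sql
-- Column types are assumed for illustration; only the column names, the keys,
-- and the six rows come from the example.
CREATE TABLE employee_details (
  id            BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  employee_id   BIGINT UNSIGNED NOT NULL,
  department_id BIGINT UNSIGNED NOT NULL,
  hire_date     DATE NOT NULL,
  PRIMARY KEY (id),
  KEY idx_dept_empid (department_id, employee_id)
) ENGINE=InnoDB;

INSERT INTO employee_details (id, employee_id, department_id, hire_date) VALUES
  (1, 101, 1001, '2017-11-09'),
  (2, 102, 1003, '2020-02-06'),
  (3, 103, 1006, '2021-05-15'),
  (4, 104, 1002, '2022-07-10'),
  (5, 105, 1006, '2022-02-06'),
  (6, 106, 1004, '2023-06-14');
```

With only six rows, the optimizer's cost estimates can differ from what a production-sized table would produce, so the plans you see may not match the ones shown in this post.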

mysql> SELECT * FROM employee_details WHERE department_id = '1006' ORDER BY id LIMIT 1;

How should MySQL execute that query? Developers tend to say, “Use the secondary index for the WHERE condition department_id = ‘1006’.” That’s reasonable; it makes sense. 

The secondary index has two matching records: <‘1006’,103> and <‘1006’, 105>. That will cause four lookups total: two secondary index reads + two corresponding primary key reads. Furthermore, the query needs to be ordered by id, which is not the order of the secondary index, so MySQL will also sort the results after those four lookups. That means EXPLAIN will say, “Using filesort”.

Let’s walk through the secondary index access step by step:

  1. Match Secondary Index <‘1006’, 105>
  2. Read corresponding PK row 5 into sort buffer
  3. Match Secondary Index <‘1006’,103>
  4. Read corresponding PK row 3 into sort buffer
  5. Sort the buffer for ORDER BY: [5, 3] → [3, 5]
  6. Apply LIMIT 1 to return PK row <3, ‘1006’,103>

That’s not a bad execution plan, but the query optimizer can choose a completely different plan: an index scan on the ORDER BY column, which happens to be the primary key: id. (Remember: an index scan on the primary is the same as a table scan because the primary key is the table.) Why? In the source code, a code comment explains:

/*Switch to index that gives order if its scan time is smaller than read_time of current chosen access method*/.

Reading rows in order might be faster than unordered secondary index lookups plus sorting. With this optimization, the new query execution plan would be:

  1. Read PK row 1 and discard (department_id value doesn’t match)
  2. Read PK row 2 and discard (department_id value doesn’t match)
  3. Read PK row 3 (department_id value matches)

It looks like MySQL is correct: by scanning the primary key in order, it performs three reads instead of four and avoids the filesort. For now, the point is that this query optimization works this way and might be faster.
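The read counts of the two access paths can be compared with the session handler counters. This is a sketch that assumes the employee_details table and the idx_dept_empid index from this example exist; FORCE INDEX is used only to pin each plan for the comparison.

```sql
-- Plan A: secondary index lookup followed by a sort.
FLUSH STATUS;
SELECT * FROM employee_details FORCE INDEX (idx_dept_empid)
WHERE department_id = '1006' ORDER BY id LIMIT 1;
-- Handler_read_key / Handler_read_next count index lookups and
-- subsequent reads in key order.
SHOW SESSION STATUS LIKE 'Handler_read%';

-- Plan B: scan the primary key in order and stop at the first match.
FLUSH STATUS;
SELECT * FROM employee_details FORCE INDEX (PRIMARY)
WHERE department_id = '1006' ORDER BY id LIMIT 1;
SHOW SESSION STATUS LIKE 'Handler_read%';
```

On a real table the comparison depends heavily on where the first matching row sits in primary key order, which is exactly the assumption about data distribution that the optimizer cannot always get right.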

— BEFORE: Secondary index lookup

mysql> explain SELECT * FROM employee_details WHERE department_id = '1006' ORDER BY id LIMIT 1\G
*************************** 1. row ***************************
          id: 1
  select_type: SIMPLE
        table: employee_details
  partitions:
        type: ref
possible_keys: idx_dept_empid
          key: idx_dept_empid
      key_len: 16
          ref: const
        rows: 1000      
    filtered: 100.00
        Extra: Using filesort
1 row in set, 1 warning (0.00 sec)

But after the change, you would see an EXPLAIN plan like:

mysql> explain SELECT * FROM employee_details WHERE department_id = '1006' ORDER BY id LIMIT 1\G
*************************** 1. row ***************************
          id: 1
  select_type: SIMPLE
        table: employee_details
  partitions:
        type: index
possible_keys: idx_dept_empid
          key: PRIMARY
      key_len: 8
          ref: NULL
        rows: 2
    filtered: 100.00
        Extra: Using index
1 row in set, 1 warning (0.00 sec)

Notice that fields type, key, ref, and Extra all change. Also, PRIMARY is not listed for possible_keys before, but after (when MySQL changes the execution plan) it appears as the chosen key.
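To compare the two plans on your own server, a sketch like the following can be used. It assumes MySQL 8.0.21 or later and the employee_details table from this example; which plan appears in each case still depends on the optimizer's cost estimates for your data.

```sql
-- Flag on (the default): the optimizer may switch to the ordering index (PRIMARY).
SET SESSION optimizer_switch = 'prefer_ordering_index=on';
EXPLAIN SELECT * FROM employee_details
WHERE department_id = '1006' ORDER BY id LIMIT 1;

-- Flag off: the optimizer no longer switches to the ordering index for this reason.
SET SESSION optimizer_switch = 'prefer_ordering_index=off';
EXPLAIN SELECT * FROM employee_details
WHERE department_id = '1006' ORDER BY id LIMIT 1;

-- A larger LIMIT changes the cost comparison and can also change the plan.
EXPLAIN SELECT * FROM employee_details
WHERE department_id = '1006' ORDER BY id LIMIT 1000;

-- Return the flag to its built-in default when done.
SET SESSION optimizer_switch = 'prefer_ordering_index=default';
```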

Conclusion

The optimization that switches from a non-ordering index to an ordering index for GROUP BY and ORDER BY queries with a LIMIT clause goes badly wrong for certain queries. After testing and analyzing most of the bugs reported in this area, the MySQL team identified the problems with the algorithm described above and introduced an optimizer switch, prefer_ordering_index, that lets you disable this LIMIT optimization. The switch does not fix the underlying problem, but it gives users a way to control when the optimization applies. You can enable or disable the prefer_ordering_index flag, work around the issue by increasing the LIMIT value to change the cost calculations, or use FORCE INDEX to force the optimal index. Whichever solution you choose, be aware of the performance penalties it may bring, and make sure you understand the optimization technique you are relying on.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure open source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!

 

Try Percona Distribution for MySQL today!

