
During the development of the myloader –innodb-optimize-keys option, which was released in version 0.10.7, we found several issues and opportunities to improve the process. We had to change the approach, reimplement some of the core functionality and add a couple of data structures. That allowed us to implement, at a really low cost, a feature that executes the files that contain INSERT statements, sorted by Primary Key. This is desirable to reduce page splits, which cause on-disk tablespace fragmentation.
In this blog post, I will present the differences in data fragmentation for each version.
Test Details
These are local vm tests as there is no intention to show performance gain.
The table that I used is:
CREATE TABLE `perf_test` ( `id` int(11) NOT NULL AUTO_INCREMENT, `val` varchar(108) DEFAULT NULL, PRIMARY KEY (`id`), KEY `val` (`val`(2)), KEY `val_2` (`val`(4)), KEY `val_3` (`val`(8)) ) ENGINE=InnoDB
And I inserted the data with:
INSERT INTO perf_test(val) SELECT concat(uuid(),uuid(),uuid()) FROM perf_test;
The graphs below were made with innodb_ruby (more info about it in this blog post) and based on a table of 131K rows with –rows 100. The intention of this test was to create a lot of files that will cause better spread in the Primary Key. The timings are over the same table structure but the table has 32M rows. Finally, I performed the test with 1 and 4 threads and with –innodb-optimize-keys when possible.
Tests Performed
In myloader v0.10.5 there was no file sorting, which is why we can see that lower Primary Key values were updated recently:
It doesn’t matter the number of threads, we can see how pages, across the whole file, are being updated at any time.
This is happening because mydumper exported the files in order with these min_id and max_id values:
File | min_id | max_id |
test.perf_test.00000.sql | 1 | 21261 |
test.perf_test.00001.sql | 21262 | 42522 |
test.perf_test.00002.sql | 42523 | 49137 |
test.perf_test.00003.sql | 65521 | 85044 |
test.perf_test.00004.sql | 85045 | 98288 |
test.perf_test.00006.sql | 131056 | 148827 |
test.perf_test.00007.sql | 148828 | 170088 |
test.perf_test.00008.sql | 170089 | 191349 |
test.perf_test.00009.sql | 191350 | 196591 |
test.perf_test.00012.sql | 262126 | 276393 |
But, during import, there was no order, let’s see the log:
** Message: 12:55:12.267: Thread 3 restoring `test`.`perf_test` part 1476. Progress 1 of 1589 . ** Message: 12:55:12.269: Thread 1 restoring `test`.`perf_test` part 87. Progress 2 of 1589 . ** Message: 12:55:12.269: Thread 2 restoring `test`.`perf_test` part 1484. Progress 3 of 1589 . ** Message: 12:55:12.269: Thread 4 restoring `test`.`perf_test` part 1067. Progress 4 of 1589 . ** Message: 12:55:13.127: Thread 1 restoring `test`.`perf_test` part 186. Progress 5 of 1589 . ** Message: 12:55:13.128: Thread 4 restoring `test`.`perf_test` part 1032. Progress 6 of 1589 .
With these max_id and max_id per file:
File | min_id | max_id |
test.perf_test.01476.sql | 31381237 | 31402497 |
test.perf_test.00087.sql | 1849708 | 1870968 |
test.perf_test.01484.sql | 31551325 | 31572585 |
test.perf_test.01067.sql | 22685488 | 22706748 |
test.perf_test.00186.sql | 3954547 | 3975807 |
test.perf_test.01032.sql | 21941353 | 21962613 |
With this kind of insert order, you can only imagine the amount of page splits that cause the fragmentation in the InnoDB datafile.
Timings were:
0.10.5/mydumper/myloader -t 1 6:52 0.10.5/mydumper/myloader -t 4 4:55
In v0.10.7-2 we have the same behavior:
But we have a small performance increase:
0.10.7-2/mydumper/myloader -t 1 6:49 0.10.7-2/mydumper/myloader -t 4 4:47
We see the same pattern, even if we use the –innodb-optimize-keys:
The main difference is the index creation stage.
0.10.7-2/mydumper/myloader --innodb-optimize-keys -t 1 6:07 0.10.7-2/mydumper/myloader --innodb-optimize-keys -t 4 5:53
Now, in v0.10.9, where we have table and file sorting, the graphs have a significant change:
It is also a bit shocking the difference between the 2 graphs, not about color trending, but about the number of pages used which indicates a high fragmentation when multiple threads are used.
master/mydumper/myloader -t 1 5:50 master/mydumper/myloader -t 4 4:29
Let’s check now with –innodb-optimize-keys:
This is what we are looking for! As you can see with 1 thread is perfect, but with 4 threads there is some odd distribution, but for sure, much better than the other options.
However, the timings are not the best:
master/mydumper/myloader --innodb-optimize-keys -t 1 5:33 master/mydumper/myloader --innodb-optimize-keys -t 4 5:10
Let’s compare them:
Data | Index | Total | Table 0 00:05:50 | 0 00:00:00 | 0 00:05:50 | `test`.`perf_test` -t 1 0 00:04:29 | 0 00:00:00 | 0 00:04:29 | `test`.`perf_test` -t 4 0 00:02:33 | 0 00:02:59 | 0 00:05:33 | `test`.`perf_test` -t 1 --innodb-optimize-keys 0 00:02:01 | 0 00:03:09 | 0 00:05:10 | `test`.`perf_test` -t 4 --innodb-optimize-keys
But that makes sense if you read this blog post. Actually, it would be really nice to have a feature that determines when –innodb-optimize-keys needs to be used.
Conclusions
Version 0.10.9 of MyDumper will allow myloader to insert better than previous versions. Multithreaded inserts sorted by Primary Key are now possible and faster than ever!