
For a long time, MyDumper has been the fastest tool to take logical backups, and we have been adding features to expand its use cases. Masquerade was one of these features, but it only covered integer and UUID values. In this blog post, I'm going to present new functionality, already merged into MyDumper and available in the next release: the ability to build random data based on a format that the user defines.
How does it work?
During export, mydumper sends SELECT statements to the database, and each row is written one by one as an INSERT statement. Something important that you might not know is that each column of a row can be transformed by a function. When you execute a backup, the default is the identity function, as nothing needs to be changed. The function, which can be configured inside the defaults file, changes the content of the column before the row is written to disk.
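As a conceptual illustration (the table, column, and masked value below are made up, not literal MyDumper output): if the server returns the row (1,'john@example.com') for a hypothetical `db`.`users` table, a masking function configured on the `email` column would make mydumper write

INSERT INTO `users` VALUES(1,'xkqpd@rmtwz.net');

to disk instead of the original value.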
How can we select the column to masquerade?
I think the most valuable element of this feature is how simple it is to define which columns will be modified and how you want to mask them. The format is:
[`schema_name`.`table_name`] `column1`=random_int `column2`=random_string
In the section name, you add the schema and table name, each surrounded by backticks and separated by a dot. Then, in each key-value entry, the key is the column name surrounded by backticks, and the value is the masking function definition.
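For example, using the pre-existing random_int and random_string functions on a hypothetical `prod`.`users` table, the defaults file would contain a section like:

[`prod`.`users`]
`id`=random_int
`name`=random_string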
New random format function
Having string, integer, and UUID masking is nice, but what about building dynamic data with a specific format? As we want more realistic data, we want to dynamically build worldwide addresses, phone numbers, emails, etc. The new function has this syntax:
random_format { <{file|string n|number n}> | DELIMITER | 'CONSTANT' }*
These are some examples:
`phone`=random_format '+1 ('<number 3>') '<number 3>'-'<number 4>
`emails`=random_format <file names.txt>'.'<file surnames.txt>'@'<file domains.txt>
`addresses`=random_format <number 3>' '<file streets.txt>', '<file cities.txt>', '<file states_and_zip.txt>', USA'
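Judging by the examples later in this post, each <file> tag picks a line from the given file, so those files are just plain-text lists with one value per line; for instance, the domains.txt used in the emails example could simply be:

gmail.com
outlook.com
yahoo.com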
Performance considerations
You should expect performance degradation when you compare masqueraded backups against regular backups. It is impossible to measure the impact in general, as it depends on the amount of data that needs to be masked. However, I will try to give you an idea through an example over a sysbench table of 10M rows.
Baseline backup
We are going to split by rows and compress with ZSTD:
# rm -rf data/; time ./mydumper -o data -B test --defaults-file=mydumper.cnf -r 100000 -c

real    0m19.964s
user    0m48.396s
sys     0m7.885s
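For reference, the baseline mydumper.cnf contains no masking section at all; a minimal defaults file could just carry connection settings, for example (the values below are placeholders):

[mydumper]
host = 127.0.0.1
user = root
password = secret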
It took about 19.9 seconds to complete, and here is an example of the output:
# zstdcat data/test.sbtest1.00000.sql.zst | grep INSERT -A10 | head
INSERT INTO `sbtest1` VALUES(1,4992833,"83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330","67847967377-48000963322-62604785301-91415491898-96926520291")
,(2,5019684,"38014276128-25250245652-62722561801-27818678124-24890218270-18312424692-92565570600-36243745486-21199862476-38576014630","23183251411-36241541236-31706421314-92007079971-60663066966")
One integer column
We are going to use random_int over the k column, which in the configuration (mydumper-k.cnf) will be:
[`test`.`sbtest1`]
`k`=random_int
The backup took 20.7 seconds, an increase of 4%:
# rm -rf data/; time ./mydumper -o data -B test --defaults-file=mydumper-k.cnf -r 100000 -c

real    0m20.709s
user    0m46.056s
sys     0m11.247s
And as you can see, the data in the second column has changed:
# zstdcat data/test.sbtest1.00000.sql.zst | grep INSERT -A10 | head
INSERT INTO `sbtest1` VALUES(1,1527173,"83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330","67847967377-48000963322-62604785301-91415491898-96926520291")
,(2,3875126,"38014276128-25250245652-62722561801-27818678124-24890218270-18312424692-92565570600-36243745486-21199862476-38576014630","23183251411-36241541236-31706421314-92007079971-60663066966")
random_format with <number 11>
Now, we are going to mask the last column (pad), using the number tag with 11 digits to simulate the original values:
`pad`=random_format <number 11>-<number 11>-<number 11>-<number 11>-<number 11>
We can see that it took 36.6 seconds to complete, and the values in the last column have changed:
# rm -rf data/; time ./mydumper -o data -B test --defaults-file=mydumper-pad-long.cnf -r 100000 -c

real    0m36.667s
user    1m3.785s
sys     0m32.757s

# zstdcat data/test.sbtest1.00000.sql.zst | grep INSERT -A10 | head
INSERT INTO `sbtest1` VALUES(1,4992833,"83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330","32720009027-12540600353-41008809903-18811191622-46944507919")
,(2,5019684,"38014276128-25250245652-62722561801-27818678124-24890218270-18312424692-92565570600-36243745486-21199862476-38576014630","14761241271-79422723442-42242331639-12424460062-25625932261")
Take into consideration that 11 digits forced g_random_int to be executed twice per tag, as a single call returns a 32-bit value, which only guarantees nine full decimal digits. This means that if we instead use:
`pad`=random_format <number 9>-<number 9>-<number 9>-<number 9>-<number 9>
It will take 29 seconds.
random_format with <file> using a 100-line file
In this case, the configuration will be:
`pad`=random_format <file words_alpha.txt.100>-<file words_alpha.txt.100>-<file words_alpha.txt.100>-<file words_alpha.txt.100>-<file words_alpha.txt.100>
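The 100-line file can be created from any word list; for example, assuming words_alpha.txt is a one-word-per-line dictionary file:

# head -n 100 words_alpha.txt > words_alpha.txt.100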
And it will take 34 seconds:
# rm -rf data/; time ./mydumper -o data -B test --defaults-file=mydumper-simple-pad.cnf -r 100000 -c

real    0m34.224s
user    0m56.702s
sys     0m29.474s

# zstdcat data/test.sbtest1.00000.sql.zst | grep INSERT -A10 | head
INSERT INTO `sbtest1` VALUES(1,4992833,"83868641912-28773972837-60736120486-75162659906-27563526494-20381887404-41576422241-93426793964-56405065102-33518432330","aam-abacot-abalienated-abandonedly-ab")
,(2,5019684,"38014276128-25250245652-62722561801-27818678124-24890218270-18312424692-92565570600-36243745486-21199862476-38576014630","aardwolves-abaised-abandoners-aaronitic-abacterial")
Warning
This feature is not fully tested in MyDumper yet; you should consider it Beta. However, I found it worth showing the potential it might have for the community.
Conclusion
It has never been easier to build a masqueraded environment than it is now with MyDumper.