Channel: Percona Database Performance Blog

The Power of utf8mb4 in MySQL 8.0: Unleashing the Full Potential of Multilingual Data


Modern web applications increasingly need to store and process text in many languages and scripts. MySQL, one of the most popular relational database management systems, has supported this through the utf8mb4 character set since version 5.5, and MySQL 8.0 makes utf8mb4 the default character set. In this blog post, we will explore utf8mb4 and its advantages in MySQL 8.0, backed by practical examples.

Understanding utf8mb4

Before diving into the benefits, let’s clarify what utf8mb4 represents. In MySQL, “utf8” refers to a character encoding that supports the Unicode character set using a maximum of three bytes per character. However, the original utf8 implementation in MySQL does not cover all Unicode characters. utf8mb4, on the other hand, is a modified version of utf8 that supports the complete Unicode character set, including emojis and other supplementary characters, by using a maximum of four bytes per character.

The characters that the original utf8 implementation can store are those of the Basic Multilingual Plane (BMP), which spans code points U+0000 through U+FFFF. The BMP covers most characters in everyday use, but it holds fewer than half of the characters that Unicode defines today. Everything else, including most emojis, lives in the supplementary planes (U+10000 through U+10FFFF) and needs four bytes in UTF-8, which is why only utf8mb4 can store it.
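The byte counts above are a property of the UTF-8 encoding itself, so they can be checked outside MySQL. The following Python sketch is only an illustration (the helper name utf8_byte_length and the sample characters are assumptions, not part of MySQL): it encodes a few characters and reports how many bytes each needs. Anything that needs four bytes lies outside the BMP and therefore does not fit in MySQL's three-byte utf8.

```python
def utf8_byte_length(ch: str) -> int:
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(ch.encode("utf-8"))


# Sample characters, written as escapes so the code points are unambiguous:
# "A" (U+0041), e-acute (U+00E9), euro sign (U+20AC), tetragram (U+1D306).
samples = ["A", "\u00e9", "\u20ac", "\U0001D306"]

for ch in samples:
    code_point = ord(ch)
    # Code points up to U+FFFF belong to the BMP; the rest are supplementary.
    plane = "BMP" if code_point <= 0xFFFF else "supplementary"
    print(f"U+{code_point:04X} {plane}: {utf8_byte_length(ch)} byte(s)")
```

The first three samples need one, two, and three bytes respectively and all sit in the BMP; the last one is supplementary and needs four.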

Here is a table showing the difference between utf8 and utf8mb4:

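+-----------------------------+------------+------------+----------------------------+
| Feature                     | utf8       | utf8mb3    | utf8mb4                    |
+-----------------------------+------------+------------+----------------------------+
| Maximum bytes per character | 3          | 3          | 4                          |
| Characters supported        | BMP only   | BMP only   | BMP + supplementary planes |
| Default in MySQL            | No         | No         | Yes (since MySQL 8.0)      |
| Deprecation status          | Deprecated | Deprecated | Not deprecated             |
+-----------------------------+------------+------------+----------------------------+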

Note: Historically, MySQL used utf8 as an alias for utf8mb3. Starting with MySQL 8.0.28, the output of SHOW statements and Information Schema tables uses the name utf8mb3 exclusively when referring to that character set. In the future, utf8 is expected to become a reference to utf8mb4. To avoid any ambiguity, it is recommended to specify utf8mb4 explicitly rather than utf8.

As you can see, the main difference between utf8, utf8mb3, and utf8mb4 is the maximum number of bytes per character. utf8 and utf8mb3 can only store characters in the Basic Multilingual Plane (BMP), while utf8mb4 can also store characters in the supplementary planes. This means that utf8mb4 supports a wider range of characters, including emojis, mathematical alphanumeric symbols, and less common CJK ideographs.

Another difference is default status. Neither utf8 nor utf8mb3 has been the server default: in MySQL 5.7 and earlier, the default character set is latin1, as the DEFAULT CHARSET=latin1 in the MySQL 5.7 example below shows. Starting with MySQL 8.0, the default character set is utf8mb4.

Finally, utf8mb3, and with it the utf8 alias in its current meaning, is deprecated in MySQL 8.0. It is expected to be removed in a future release, so it is recommended to use utf8mb4 instead.

So, if you need to store all Unicode characters, including emojis and other supplementary characters, then you should use utf8mb4. However, if you only need to store characters from the BMP, then utf8 may be sufficient.

Here is an example comparison of utf8 and utf8mb4 using MySQL tables and queries:

MySQL 5.7

mysql> select version();
+-----------+
| version() |
+-----------+
| 5.7.42-46 |
+-----------+

Table:

mysql> CREATE TABLE users (
  id INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255) CHARACTER SET utf8,
  email VARCHAR(255) CHARACTER SET utf8
);
Query OK, 0 rows affected (0.03 sec)

mysql> show create table users\G
*************************** 1. row ***************************
       Table: users
Create Table: CREATE TABLE `users` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
  `email` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.01 sec)

Next, insert three rows into the users table. The third row contains '𝌆' (U+1D306), a supplementary character that, like most emojis, lies outside the BMP:

mysql> INSERT INTO users (name, email) VALUES
('Arun Jith', 'arunjith@example.com'),
('Jane Doe', 'janedoe@example.com'),
('𝌆', 'emoji@example.com');
ERROR 1366 (HY000): Incorrect string value: '\xF0\x9D\x8C\x86' for column 'name' at row 3
mysql>

ERROR 1366 (HY000) means that the value cannot be represented in the column's character set. The bytes F0 9D 8C 86 quoted in the message are the four-byte UTF-8 encoding of '𝌆'. Because the 'name' column uses utf8, which stores at most three bytes per character, MySQL rejects the value, and under the default strict SQL mode the whole statement fails. Leaving out the third row lets the insert succeed:
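When this error shows up in a log, it can help to turn the quoted bytes back into a character to see what was rejected. The following Python sketch does that; the function name decode_mysql_hex_escape is hypothetical, and the code assumes the message contains a complete UTF-8 sequence written as xHH pairs, with or without leading backslashes.

```python
import re


def decode_mysql_hex_escape(value: str) -> str:
    """Decode hex pairs such as xF0x9Dx8Cx86 (optionally backslash-prefixed)
    into the text they encode, assuming the bytes are valid UTF-8."""
    hex_pairs = re.findall(r"\\?x([0-9A-Fa-f]{2})", value)
    return bytes.fromhex("".join(hex_pairs)).decode("utf-8")


# The byte sequence quoted in the ERROR 1366 message above.
reported = r"\xF0\x9D\x8C\x86"
char = decode_mysql_hex_escape(reported)

# Show the character, its code point, and its UTF-8 length.
print(char, f"U+{ord(char):04X}", len(char.encode("utf-8")), "bytes")
```

For the bytes in this example, the sketch prints the character '𝌆' with code point U+1D306 and a length of 4 bytes, one more than a utf8 column can hold per character.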

mysql> INSERT INTO users (name, email) VALUES
('Arun Jith', 'arunjith@example.com'),
('Jane Doe', 'janedoe@example.com')
;
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

MySQL 8.0

mysql> select version();
+-------------------------+
| version()               |
+-------------------------+
| 8.0.33-0ubuntu0.22.04.2 |
+-------------------------+

Table:

CREATE TABLE users (
  id INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255) CHARACTER SET utf8,
  email VARCHAR(255) CHARACTER SET utf8
);

mysql> show create table users\G
*************************** 1. row ***************************
       Table: users
Create Table: CREATE TABLE `users` (
  `id` int NOT NULL AUTO_INCREMENT,
  `name` varchar(255) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci DEFAULT NULL,
  `email` varchar(255) CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

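Although the columns were declared with utf8, SHOW CREATE TABLE reports them as utf8mb3, the character set that the utf8 alias currently refers to. The table default is utf8mb4, but the character set specified at the column level takes precedence. This means that the name and email columns can store all characters from the BMP, but they cannot store emojis or other supplementary characters.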

Query:

As in the MySQL 5.7 example, inserting the row that contains '𝌆' fails with ERROR 1366 (HY000). The four bytes F0 9D 8C 86 that encode the character do not fit in the 'name' column, which allows at most three bytes per character:

mysql> INSERT INTO users (name, email) VALUES
    -> ('Arun Jith', 'arunjith@example.com'),
    -> ('Jane Doe', 'janedoe@example.com'),
    -> ('𝌆', 'emoji@example.com');
ERROR 1366 (HY000): Incorrect string value: '\xF0\x9D\x8C\x86' for column 'name' at row 3

mysql> INSERT INTO users (name, email) VALUES
    -> ('Arun Jith', 'arunjith@example.com'),
    -> ('Jane Doe', 'janedoe@example.com')
    -> ;
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

The second statement inserts only the first two rows, which contain plain text, so it succeeds. The third row cannot be stored because the utf8 character set cannot represent emojis or other supplementary characters.

Output:

mysql> SELECT * FROM users;
+----+-----------+----------------------+
| id | name      | email                |
+----+-----------+----------------------+
|  4 | Arun Jith | arunjith@example.com |
|  5 | Jane Doe  | janedoe@example.com  |
+----+-----------+----------------------+
2 rows in set (0.00 sec)

This query returns the two rows that were inserted, with the id, name, and email of each user. The third row is missing because its insert failed: the utf8 character set cannot store emojis or other supplementary characters. The ids start at 4 rather than 1, presumably because the failed three-row insert had already reserved auto-increment values, which InnoDB does not reuse.
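An application that still writes to utf8 (utf8mb3) columns could detect such values before sending them to the server. The following Python sketch is one possible pre-check, not something MySQL provides: the function name supplementary_characters is hypothetical, and the rows simply mirror the example data. It flags any character whose code point is above U+FFFF, which is exactly the set that a three-byte column cannot hold.

```python
def supplementary_characters(text: str) -> list:
    """Return the characters in text that lie outside the BMP (above U+FFFF)."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


# Rows mirroring the example; the third name is the character U+1D306.
rows = [
    ("Arun Jith", "arunjith@example.com"),
    ("Jane Doe", "janedoe@example.com"),
    ("\U0001D306", "emoji@example.com"),
]

for name, email in rows:
    offending = supplementary_characters(name) + supplementary_characters(email)
    if offending:
        codes = ", ".join(f"U+{ord(ch):04X}" for ch in offending)
        print(f"{email}: needs utf8mb4 ({codes})")
    else:
        print(f"{email}: fits in utf8mb3")
```

Only the third row is reported as needing utf8mb4. A check like this is a stopgap; switching the columns to utf8mb4 removes the need for it.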

Table:

To store emojis and other supplementary characters, let's drop the earlier table and recreate it with columns that use the utf8mb4 character set. Afterward, we can check whether the insert works correctly.

mysql> CREATE TABLE users (
    ->   id INT AUTO_INCREMENT PRIMARY KEY,
    ->   name VARCHAR(255) CHARACTER SET utf8mb4,
    ->   email VARCHAR(255) CHARACTER SET utf8mb4
    -> );
Query OK, 0 rows affected (0.03 sec)

mysql> show create table users\G
*************************** 1. row ***************************
       Table: users
Create Table: CREATE TABLE `users` (
  `id` int NOT NULL AUTO_INCREMENT,
  `name` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  `email` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

Query:

mysql> INSERT INTO users (name, email) VALUES
    -> ('Arun Jith', 'arunjith@example.com'),
    -> ('Jane Doe', 'janedoe@example.com'),
    -> ('𝌆', 'emoji@example.com');
Query OK, 3 rows affected (0.01 sec)
Records: 3  Duplicates: 0  Warnings: 0

This table uses the utf8mb4 character set for both the name and email columns. This means that the table can store all characters from the full Unicode character set, including emojis and other supplementary characters.

This query inserts three rows into the users table. The first two rows contain simple text data, while the third row contains an emoji. The emoji will be stored correctly in the database because the utf8mb4 character set can store emojis.

Output:

mysql> SELECT * FROM users;
+----+-----------+----------------------+
| id | name      | email                |
+----+-----------+----------------------+
|  1 | Arun Jith | arunjith@example.com |
|  2 | Jane Doe  | janedoe@example.com  |
|  3 | 𝌆         | emoji@example.com    |
+----+-----------+----------------------+
3 rows in set (0.00 sec)

This query returns all three rows from the users table, with the id, name, and email of each user. The third row comes back with its character intact because the utf8mb4 character set can store emojis and other supplementary characters.

Conclusion

As you can see, the utf8mb4 character set can store all characters from the full Unicode character set, including emojis and other supplementary characters. This makes it a good choice for multilingual text that needs to be stored, searched, and compared. The utf8 character set, on the other hand, can only store characters from the BMP, so it cannot store emojis or other supplementary characters.

In general, it is recommended to use utf8mb4 for all new applications. This will ensure that your data can be stored and processed correctly, regardless of the characters that it contains.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure open source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!

 

Try Percona Distribution for MySQL today!

