TransWikia.com

How to easily convert utf8 tables to utf8mb4 in MySQL 5.5

Database Administrators Asked by geoaxis on August 14, 2020

I have a database which now needs to support 4 byte characters (Chinese). Luckily I already have MySQL 5.5 in production.

So I would just like to make all collations which are utf8_bin to utf8mb4_bin.

I believe there is no performance loss/gain with this change other than a bit of storage overhead.

10 Answers

If you, like me, do not trust automation, this is how I have handled the problem.

First Stop digging!

Start with altering the default charset of new tables by changing the DB definition(like in all other answers):

ALTER DATABASE database_name 
  CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

Then generate sql to change the default charset for new columns of all existing tables:

SELECT concat("ALTER TABLE `",table_schema,"`.`",table_name,"` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_swedish_ci;") as _sql
  FROM information_schema.TABLES
 WHERE table_schema like "database_name" and TABLE_TYPE="BASE TABLE"
 GROUP BY table_schema, table_name ;

Now we can handle the "legacy"

List character datatypes you are using:

select distinct data_type  from information_schema.columns where table_schema = "database_name" and CHARACTER_SET_NAME is not null;

For me that list was "varchar" and "text"

List character_SETS_ in use:

select distinct character_set_name from information_schema.columns where table_schema = "database_name";

This gives me "utf8", "latin1", and "utf8mb4" which is a reason I do not trust automation, the latin1 columns risk having dirty data.

Now you can make a list of all columns you need to update with:

select table_name, column_name, data_type, character_set_name, collation_name
  from information_schema.columns 
 where table_schema = "database_name" and CHARACTER_SET_NAME is not null AND CHARACTER_SET_NAME <> "utf8mb4"
 group by table_name, , data_type, character_set_name, collation_name;

Tables containing only utf8 or utf8mb4 could be converted with "CONVERT TO CHARACTER SET" as Mathias and MrJingles describes above, but then you risk MySQL changing the types for you, so you may be better of running "CHANGE COLUMN" instead since that gives you control of exactly what happens.

If you have non-utf8 columns these questions may give inspiration about checking the columns data: https://stackoverflow.com/q/401771/671282 https://stackoverflow.com/q/9304485/671282

Since you probably know what you expect to have in most of the columns something like this will probably handle most of them after modifying the non-ascii chars allowed to suit your needs:

SELECT distinct section FROM table_name WHERE column_name NOT REGEXP '^([A-Za-z0-9åäöÅÄÖ&.,_ -])*$';

When the above did not fit I used the below that have a bit "fuzzier" maching:

SELECT distinct
  CONVERT(CONVERT(column_name USING BINARY) USING latin1) AS latin1,
  CONVERT(CONVERT(column_name USING BINARY) USING utf8) AS utf8
FROM table_name
WHERE CONVERT(column_name USING BINARY) RLIKE CONCAT('[', UNHEX('C0'), '-', UNHEX('F4'), '][',UNHEX('80'),'-',UNHEX('FF'),']') limit 5;

This query matches any two characters that could start an utf8-character, thus allowing you to inspect those records, it may give you a lot of false positives. The utf8 conversion fails returning null if there is any character it can not convert, so in a large field there is a good chance of it not being useful.

Answered by Samuel Åslund on August 14, 2020

Run the following SQL statements to get the result statements for updating character set and collation:

SET @database = "your_database_name";

SET @charset = "utf8mb4";

SET @collate = "utf8mb4_0900_ai_ci";

SELECT "SET foreign_key_checks = 0;"
UNION ALL
SELECT concat(
  "ALTER DATABASE `",
  `SCHEMA_NAME`,
  "` CHARACTER SET = ",
  @charset,
  " COLLATE = ",
  @collate,
  ";"
) AS `sql`
FROM `information_schema`.`SCHEMATA`
WHERE `SCHEMA_NAME` = @database
  AND (
    `DEFAULT_CHARACTER_SET_NAME` <> @charset
    OR `DEFAULT_COLLATION_NAME` <> @collate
  )
UNION ALL
SELECT concat(
  "ALTER TABLE `",
  `TABLE_SCHEMA`,
  "`.`",
  `TABLE_NAME`,
  "` CONVERT TO CHARACTER SET ",
  @charset,
  " COLLATE ",
  @collate,
  ";"
) AS `sql`
FROM `information_schema`.`TABLES`
WHERE `TABLE_SCHEMA` = @database
  AND `TABLE_TYPE` = "BASE TABLE"
  AND `TABLE_COLLATION` <> @collate
UNION ALL
SELECT concat(
  "ALTER TABLE `",
  c.`TABLE_SCHEMA`,
  "`.`",
  c.`TABLE_NAME`,
  "` CHANGE `",
  c.`COLUMN_NAME`,
  "` `",
  c.`COLUMN_NAME`,
  "` ",
  c.`COLUMN_TYPE`,
  " CHARACTER SET ",
  @charset,
  " COLLATE ",
  @collate,
  if(c.`IS_NULLABLE`="YES", " NULL", " NOT NULL"),
  ";"
) AS `sql`
FROM `information_schema`.`COLUMNS` c,
  `information_schema`.`TABLES` t,
  `information_schema`.`COLLATION_CHARACTER_SET_APPLICABILITY` a
WHERE c.`TABLE_SCHEMA` = t.`TABLE_SCHEMA`
  AND c.`TABLE_NAME` = t.`TABLE_NAME`
  AND t.`TABLE_COLLATION` = a.`COLLATION_NAME`
  AND c.`TABLE_SCHEMA` = @database
  AND c.`DATA_TYPE` IN (
    'varchar',
    'char',
    'text',
    'tinytext',
    'mediumtext',
    'longtext'
  )
  AND (
    c.`CHARACTER_SET_NAME` <> a.`CHARACTER_SET_NAME`
    OR c.`COLLATION_NAME` <> t.`TABLE_COLLATION`
  )
  AND t.`TABLE_TYPE` = "BASE TABLE"
UNION ALL
SELECT "SET foreign_key_checks = 1;";

Notice:

  • It only changes database, tables or columns that are not consistent with the given character set or collation.
  • It doesn't treat views as regular tables.
  • It avoids foreign key violation by turning off related checks.
  • Update variable values accordingly before running the above SQL statements.
    • Default collation utf8mb4_0900_ai_ci may not be friendly to your preferred language. For example, it consider half-width parentheses () the same as full-width parentheses (), and causes trouble in certain cases.
    • Run SHOW COLLATION WHERE CHARSET = 'utf8mb4' AND COLLATION LIKE 'utf8mb4%0900%'; and choose a suitable collation for your needs.

Answered by Rockallite on August 14, 2020

I made a script which does this more or less automatically:

<?php
/**
 * Requires php >= 5.5
 * 
 * Use this script to convert utf-8 data in utf-8 mysql tables stored via latin1 connection
 * This is a PHP port from: https://gist.github.com/njvack/6113127
 *
 * BACKUP YOUR DATABASE BEFORE YOU RUN THIS SCRIPT!
 *
 * Once the script ran over your databases, change your database connection charset to utf8:
 *
 * $dsn = 'mysql:host=localhost;port=3306;charset=utf8';
 * 
 * DON'T RUN THIS SCRIPT MORE THAN ONCE!
 *
 * @author hollodotme
 *
 * @author derclops since 2019-07-01
 *
 *         I have taken the liberty to adapt this script to also do the following:
 *
 *         - convert the database to utf8mb4
 *         - convert all tables to utf8mb4
 *         - actually then also convert the data to utf8mb4
 *
 */

header('Content-Type: text/plain; charset=utf-8');

$dsn      = 'mysql:host=localhost;port=3306;charset=utf8';
$user     = 'root';
$password = 'root';
$options  = [
    PDO::ATTR_CURSOR                   => PDO::CURSOR_FWDONLY,
    PDO::MYSQL_ATTR_USE_BUFFERED_QUERY => true,
    PDO::MYSQL_ATTR_INIT_COMMAND       => "SET CHARACTER SET latin1",
];


$dbManager = new PDO( $dsn, $user, $password, $options );

$databasesToConvert = [ 'database1',/** database3, ... */ ];
$typesToConvert     = [ 'char', 'varchar', 'tinytext', 'mediumtext', 'text', 'longtext' ];

foreach ( $databasesToConvert as $database )
{
    echo $database, ":n";
    echo str_repeat( '=', strlen( $database ) + 1 ), "n";

    $dbManager->exec( "USE `{$database}`" );

    echo "converting database to correct locale too ... n";

    $dbManager->exec("ALTER DATABASE `{$database}` CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci");


    $tablesStatement = $dbManager->query( "SHOW TABLES" );
    while ( ($table = $tablesStatement->fetchColumn()) )
    {
        echo "Table: {$table}:n";
        echo str_repeat( '-', strlen( $table ) + 8 ), "n";

        $columnsToConvert = [ ];

        $columsStatement = $dbManager->query( "DESCRIBE `{$table}`" );

        while ( ($tableInfo = $columsStatement->fetch( PDO::FETCH_ASSOC )) )
        {
            $column = $tableInfo['Field'];
            echo ' * ' . $column . ': ' . $tableInfo['Type'];

            $type = preg_replace( "#(d+)#", '', $tableInfo['Type'] );

            if ( in_array( $type, $typesToConvert ) )
            {
                echo " => must be convertedn";

                $columnsToConvert[] = $column;
            }
            else
            {
                echo " => not relevantn";
            }
        }


        //convert table also!!!
        $convert = "ALTER TABLE `{$table}` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci";

        echo "n", $convert, "n";
        $dbManager->exec( $convert );
        $databaseErrors = $dbManager->errorInfo();
        if( !empty($databaseErrors[1]) ){
            echo "n !!!!!!!!!!!!!!!!! ERROR OCCURED ".print_r($databaseErrors, true)." n";
            exit;
        }


        if ( !empty($columnsToConvert) )
        {
            $converts = array_map(
                function ( $column )
                {
                    //return "`{$column}` = IFNULL(CONVERT(CAST(CONVERT(`{$column}` USING latin1) AS binary) USING utf8mb4),`{$column}`)";
                    return "`{$column}` = CONVERT(BINARY(CONVERT(`{$column}` USING latin1)) USING utf8mb4)";
                },
                $columnsToConvert
            );

            $query = "UPDATE IGNORE `{$table}` SET " . join( ', ', $converts );

            //alternative
            // UPDATE feedback SET reply = CONVERT(BINARY(CONVERT(reply USING latin1)) USING utf8mb4) WHERE feedback_id = 15015;


            echo "n", $query, "n";


            $dbManager->exec( $query );

            $databaseErrors = $dbManager->errorInfo();
            if( !empty($databaseErrors[1]) ){
                echo "n !!!!!!!!!!!!!!!!! ERROR OCCURED ".print_r($databaseErrors, true)." n";
                exit;
            }
        }

        echo "n--n";
    }

    echo "n";
}

Answered by clops on August 14, 2020

For people who might have this problem the best solution is to modify first the columns to a binary type, according to this table:

  1. CHAR => BINARY
  2. TEXT => BLOB
  3. TINYTEXT => TINYBLOB
  4. MEDIUMTEXT => MEDIUMBLOB
  5. LONGTEXT => LONGBLOB
  6. VARCHAR => VARBINARY

And after that modify the column back to its former type and with your desired charset.

Eg.:

ALTER TABLE [TABLE_SCHEMA].[TABLE_NAME] MODIFY [COLUMN_NAME] LONGBLOB;
ALTER TABLE [TABLE_SCHEMA].[TABLE_NAME] MODIFY [COLUMN_NAME] VARCHAR(140) CHARACTER SET utf8mb4;

I tried in several latin1 tables and it kept all the diacritics.

You can extract this query for all columns doing this:

SELECT
CONCAT('ALTER TABLE ', TABLE_SCHEMA,'.', TABLE_NAME,' MODIFY ', COLUMN_NAME,' VARBINARY;'),
CONCAT('ALTER TABLE ', TABLE_SCHEMA,'.', TABLE_NAME,' MODIFY ', COLUMN_NAME,' ', COLUMN_TYPE,' CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;')
FROM information_schema.columns
WHERE TABLE_SCHEMA IN ('[TABLE_SCHEMA]')
AND COLUMN_TYPE LIKE 'varchar%'
AND (COLLATION_NAME IS NOT NULL AND COLLATION_NAME NOT LIKE 'utf%');

Answered by MalachiteBR on August 14, 2020

I used the following shell script. It takes database name as a parameter and converts all tables to another charset and collation (given by another parameters or default value defined in the script).

#!/bin/bash

# mycollate.sh <database> [<charset> <collation>]
# changes MySQL/MariaDB charset and collation for one database - all tables and
# all columns in all tables

DB="$1"
CHARSET="$2"
COLL="$3"

[ -n "$DB" ] || exit 1
[ -n "$CHARSET" ] || CHARSET="utf8mb4"
[ -n "$COLL" ] || COLL="utf8mb4_general_ci"

echo $DB
echo "ALTER DATABASE `$DB` CHARACTER SET $CHARSET COLLATE $COLL;" | mysql

echo "USE `$DB`; SHOW TABLES;" | mysql -s | (
    while read TABLE; do
        echo $DB.$TABLE
        echo "ALTER TABLE `$TABLE` CONVERT TO CHARACTER SET $CHARSET COLLATE $COLL;" | mysql $DB
    done
)

Answered by Petr Stastny on August 14, 2020

I wrote this guide: http://hanoian.com/content/index.php/24-automate-the-converting-a-mysql-database-character-set-to-utf8mb4

From my work, I saw that ALTER the database and the tables is not enough. I had to go into each table and ALTER each of the text/mediumtext/varchar columns too.

Luckily I was able to write a script to detect the metadata of MySQL databases, so it could loop through the tables and columns and ALTERed them automatically.

Long index for MySQL 5.6:

There is one thing you must have DBA/SUPER USER privilege to do: Setting the database parameters:

innodb_large_prefix : ON
innodb_file_format : Barracuda 
innodb_file_format_max : Barracuda

In the answers for this question, there is instruction how to set those parameters above: https://stackoverflow.com/questions/35847015/mysql-change-innodb-large-prefix

Of course, in my article, there are instructions to do that too.

For MySQL version 5.7 or newer, the innodb_large_prefix is ON by default, and the innodb_file_format is also Barracuda by default.

Answered by Châu Hồng Lĩnh on August 14, 2020

Ran into this situation; here's the approach I used to convert my database:

  1. First, you need to edit my.cnf to make default database connection (between applications and MYSQL) utf8mb4_unicode_ci compliant. Without this characters like emojis and similar submitted by your apps won't make it to your tables in right bytes/encoding (unless your application's DB CNN params specify a utf8mb4 connection).

    Instructions given here.

  2. Execute the following SQL (no need to get prepared SQL to change individual columns, ALTER TABLE statements will do that).

    Before you execute below code replace "DbName" with your actual DB name.

    USE information_schema;
    
    SELECT concat("ALTER DATABASE `",table_schema,
                  "` CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;") as _sql
      FROM `TABLES`
     WHERE table_schema like "DbName"
     GROUP BY table_schema;
    
    SELECT concat("ALTER TABLE `",table_schema,"`.`",table_name,
                  "` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;") as _sql
      FROM `TABLES`
     WHERE table_schema like "DbName"
     GROUP BY table_schema, table_name;
    
  3. Collect and save output of above SQL in a dot sql file and execute it.

  4. If you get an error like #1071 - Specified key was too long; max key length is 1000 bytes. along with the problematic table name, this means index key on some column of that table (which was supposed to be converted to MB4 charstring) will be very big hence that Varchar column should be <= 250 so that its index key will be max 1000 bytes. Check the columns on which you have indexes and if one of them is a varchar > 250 (most likely 255) then

    • Step 1: check data in that column to make sure that max string size in that column is <= 250.

      Example query:

      select `id`,`username`, `email`,
             length(`username`) as l1,
             char_length(`username`) as l2,
             length(`email`) as l3,
             char_length(`email`) as l4
        from jos_users
       order by l4 Desc;
      
    • Step 2: if max charlength of indexed column data <= 250 then change the col length to 250. if that is not possible, remove index on that column

    • Step 3: then run the alter table query for that table again and table should now be converted into utf8mb4 successfully.

Cheers!

Answered by Nav44 on August 14, 2020

I have a solution that will convert databases and tables by running a few commands. It also converts all columns of the type varchar, text, tinytext, mediumtext, longtext, char. You should also backup your database in case something breaks.

Copy the following code into a file called it preAlterTables.sql:

use information_schema;
SELECT concat("ALTER DATABASE `",table_schema,"` CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;") as _sql 
FROM `TABLES` where table_schema like "yourDbName" and TABLE_TYPE='BASE TABLE' group by table_schema;
SELECT concat("ALTER TABLE `",table_schema,"`.`",table_name,"` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;") as _sql  
FROM `TABLES` where table_schema like "yourDbName" and TABLE_TYPE='BASE TABLE' group by table_schema, table_name;
SELECT concat("ALTER TABLE `",table_schema,"`.`",table_name, "` CHANGE `",column_name,"` `",column_name,"` ",data_type,"(",character_maximum_length,") CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci",IF(is_nullable="YES"," NULL"," NOT NULL"),";") as _sql 
FROM `COLUMNS` where table_schema like "yourDbName" and data_type in ('varchar','char');
SELECT concat("ALTER TABLE `",table_schema,"`.`",table_name, "` CHANGE `",column_name,"` `",column_name,"` ",data_type," CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci",IF(is_nullable="YES"," NULL"," NOT NULL"),";") as _sql 
FROM `COLUMNS` where table_schema like "yourDbName" and data_type in ('text','tinytext','mediumtext','longtext');

Replace all occurences of "yourDbName" with the database you want to convert. Then run:

mysql -uroot < preAlterTables.sql | egrep '^ALTER' > alterTables.sql

This will generate a new file alterTables.sql, with all the queries you need to convert the database. Run the following command to start the conversion:

mysql -uroot < alterTables.sql

You can also adapt this to run through multiple databases, by changing the condition for the table_schema. For example table_schema like "wiki_%" will convert all databases with the name prefix wiki_. To convert all databases replace the condition with table_type!='SYSTEM VIEW'.

An issue that might arise. I had some varchar(255) columns in mysql keys. This causes an error:

ERROR 1071 (42000) at line 2229: Specified key was too long; max key length is 767 bytes

If that happens you can simply change the column to be smaller, like varchar(150), and rerun the command.

Please note: This answer converts the database to utf8mb4_unicode_ci instead of utf8mb4_bin, asked in the question. But you can simply replace this.

Answered by MrJingles87 on August 14, 2020

From my guide How to support full Unicode in MySQL databases, here are the queries you can run to update the charset and collation of a database, a table, or a column:

For each database:

ALTER DATABASE
    database_name
    CHARACTER SET = utf8mb4
    COLLATE = utf8mb4_unicode_ci;

For each table:

ALTER TABLE
    table_name
    CONVERT TO CHARACTER SET utf8mb4
    COLLATE utf8mb4_unicode_ci;

For each column:

ALTER TABLE
    table_name
    CHANGE column_name column_name
    VARCHAR(191)
    CHARACTER SET utf8mb4
    COLLATE utf8mb4_unicode_ci;

(Don’t blindly copy-paste this! The exact statement depends on the column type, maximum length, and other properties. The above line is just an example for a VARCHAR column.)

Note, however, that you cannot fully automate the conversion from utf8 to utf8mb4. As described in step 4 of the abovementioned guide, you’ll need to check the maximum length of columns and index keys, as the number you specify has a different meaning when utf8mb4 is used instead of utf8.

Section 10.1.11 of the MySQL 5.5 Reference Manual has some more information on this.

Answered by Mathias Bynens on August 14, 2020

I would write a script (in Perl, or whatever) to use information_schema (TABLES and COLUMNS) to walk through all the tables, and do MODIFY COLUMN on every CHAR/VARCHAR/TEXT field. I would collect all the MODIFYs into a single ALTER for each table; this will be more efficient.

I think (but am not sure) that Raihan's suggestion only changes the default for the table.

Answered by Rick James on August 14, 2020

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP