Finding Multi-byte Characters in MySQL Fields

So I was recently helping a client with an issue in MySQL where a migration failed to transfer the full contents of some fields. This amounted to a little over 1% of the total messages transferred. In doing some research, we discovered that the one thing every message had in common was the presence of multi-byte (high unicode) characters. In many cases, this was due to a user pasting some text from Microsoft Word.

Remember, there is a correct way to dump and restore data from MySQL, and it doesn’t involve piping output. That was the root cause of this problem.

So, how to clean up the mess? Well, fortunately, we still had the old database up. So we needed to figure out what messages might be suspicious and need repair. That means finding fields that contain unicode characters.

There are several other articles on the Internet, but they are mostly wrong. They all tell you to do things like this:

SELECT * FROM TABLE WHERE NOT HEX(COLUMN) REGEXP '^([0-7][0-9A-F])*$';
SELECT id FROM posts WHERE LENGTH(body) != CHAR_LENGTH(body);

THIS IS WRONG. DO NOT DO THIS. User eggyal in this Stack Overflow thread :explains why

Suppose, for example, that one has a UTF-16 column containing ‘ā’ (encoded by the byte sequence 0x0101) - it would be deemed “ASCII” using this test: a false negative; indeed, some character sets do not encode ASCII characters within 0x00 to 0x7f whereupon this solution would yield a false positive.

Many character sets are fixed-length and so LENGTH(column) will be a constant multiple of CHAR_LENGTH(column) irrespective of the value.

This is the correct way:

SELECT id FROM posts WHERE body <> CONVERT(body USING ASCII)

Ollie Jones explains why:

The CONVERT(col USING charset) function will turns the unconvertable characters into replacement characters. Then, the converted and unconverted text will be unequal.

Armed with this, I was able to write a program to query all the suspicious messages and compare them with the data in the old database. Where they were not the same, we were able to restore the old data.

Did something I wrote help you out?

That's great! I don't earn any money from this site - I run no ads, sell no products and participate in no affiliate programs. I do this solely because it's fun; I enjoy writing and sharing what I learn.

All the same, if you found this article helpful and want to show your appreciation, here's my Amazon.com wishlist.

Read More

Recursive Queries with MySQL

Discovered something neat with the new version of MySQL and thought it warranted a mention. Storing tree structures in a relational database is a common use case across many different areas of tech. The problem comes when you need to construct a query based on a subset of that tree. But MySQL 8 has some nice new features that makes doing this a breeze.

MySQL mathematical operations and NULL values

So I came across an interesting quirk in MySQL the other day. Let’s say you have a table schema and some values that look like this: +-------------------+------------------+------+-----+---------+-------+ | Field             | Type             | Null | Key | Default | Extra | +-------------------+------------------+------+-----+---------+-------+ | page_id       | varchar(30)      | YES  |     | NULL    |       | | clicks            | int(10) unsigned | YES  |     | NULL    |       | +-------------------+------------------+------+-----+---------+-------+ +---------+--------+ | page_id | clicks | +---------+--------+ | 1 | NULL | +---------+--------+ And then let’s say you pass the following SQL statement to MySQL: update page_click_count set clicks = clicks + 1 where page_id=1; If you come from a loosely-typed language such as PHP, you would probably expect clicks for page_id 1 to now be 1. But that’s not the case in MySQL. After the query is run, the table will still look like this: +---------+--------+ | page_id | clicks | +---------+--------+ | 1 | NULL | +---------+--------+ Not only does the query fail, but it fails with no warnings given. It appears that mathematical operations on null values silently fail. There are a couple of ways around this. The first and most obvious is to set NOT NULL and a default value on the column. In the example above, this would work. The NULL value in that field becomes a 0 and you can to normal mathematical operations on it. But what happens if, for whatever reason, you can’t do that? We actually have this situation in a few places at dealnews, where NULL represents a distinct value of that field that is different from 0. In this case, you can use COALESCE() to fill in the appropriate value for the field. update page_click_count set clicks = coalesce(clicks, 0) + 1 where page_id=1; Edit: Brian Moon informs me that this is actually part of the SQL specification. So hooray for specifications. Still, it’s kind of arcane; in working with MySQL (and PHP) for a decade now, this is the first time I’ve ever actually encountered this. Hopefully this helps someone who was as confused as I was.