Building upon Joel’s post on Unicode, here are some real-world tips relating to character encodings.

Use them by name

Always explicitly specify which encoding you want (probably UTF-8). Don’t assume the language, library, or tool you’re using will make the right decision for you. If you value your time, don’t ignore this recommendation; otherwise you’ll likely spend far more of it patching things up later.
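In Python, for instance, that means passing an encoding everywhere one is accepted instead of leaning on platform defaults (a minimal sketch; the file name is just an example):

```python
# Be explicit at every boundary: files, sockets, DB connections.
with open("articles.txt", "w", encoding="utf-8") as f:
    f.write("café\n")

# The same goes for converting between bytes and text.
data = "café".encode("utf-8")   # text -> bytes, explicitly UTF-8
text = data.decode("utf-8")     # bytes -> text, explicitly UTF-8

with open("articles.txt", encoding="utf-8") as f:
    assert f.read() == text + "\n"
```

The moment you write `encode()` or `open()` without an encoding, you’ve delegated the decision to whatever the runtime feels like that day.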

A tale to drive this home

Until very recently, MySQL’s default charset was latin1. Our legacy code didn’t specify an encoding in PHP’s connection to MySQL, in our HTML, or in our HTTP headers. And web browsers default to god knows what (“it depends”).

When the time came to export some table data to XML, I ran into invalid UTF-8 byte sequences. I knew there were special characters in some text fields, introduced by Microsoft Word copy-pasters, so I figured flipping the switch to UTF-8 would help. NOPE!

It turns out that failing to specify an encoding in the past caused special characters (curly double quotes, daggers, long dashes, etc.) to be stored in MySQL as raw single bytes that aren’t valid UTF-8. Shit!

So I had to come up with a batch of find-and-replace regex patterns to turn those characters into proper UTF-8 or their HTML entities.
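My original patterns aren’t shown here, but a rough Python equivalent, assuming the stray bytes are Windows-1252 (the usual suspect for Word pastes), might look like:

```python
# Common Windows-1252 bytes that Word pastes leave behind in otherwise-UTF-8
# text, mapped to the character they were meant to be.
CP1252_FIXES = {
    b"\x91": "\u2018",  # left single quote
    b"\x92": "\u2019",  # right single quote
    b"\x93": "\u201c",  # left double quote
    b"\x94": "\u201d",  # right double quote
    b"\x96": "\u2013",  # en dash
    b"\x97": "\u2014",  # em dash
    b"\x85": "\u2026",  # ellipsis
}

def patch(raw: bytes) -> bytes:
    """Replace known bad bytes with the UTF-8 encoding of the right character."""
    for bad, char in CP1252_FIXES.items():
        raw = raw.replace(bad, char.encode("utf-8"))
    return raw
```

For example, `patch(b"\x93hello\x94")` yields valid UTF-8 for “hello” with real curly quotes. A lookup table like this is crude, but it’s predictable, which is exactly what you want when mass-editing production data.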

So please, do the following:

  • Choose an encoding and stick to it
  • Save your text files in that encoding
  • Set the character encoding in your HTML
  • Send the Content-Type header with a charset from your web server
  • Change the configuration defaults for MySQL or your DBMS
  • Check and set the default for each of your databases (note I said database not DBMS)
  • Check and specify the encoding for your tables and text fields
  • Make sure you specify the encoding when you create or alter tables in the future
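Concretely, the HTML, HTTP, and MySQL steps above boil down to artifacts like the following (sketched in Python; `blog`, `posts`, and `comments` are placeholder names, and utf8mb4 needs MySQL 5.5.3+ — use utf8 on older servers). Heed the caveats in the sections below before converting existing data:

```python
# HTML: declare the encoding in the document itself.
HTML_META = '<meta charset="utf-8">'

# HTTP: send it in the Content-Type header too.
CONTENT_TYPE = "text/html; charset=utf-8"

# MySQL: check and set the default at every level, not just the server.
SQL = [
    "SHOW VARIABLES LIKE 'character_set%';",                # inspect current defaults
    "ALTER DATABASE blog CHARACTER SET utf8mb4;",           # per-database default
    "ALTER TABLE posts CONVERT TO CHARACTER SET utf8mb4;",  # existing table + columns
    "CREATE TABLE comments (body TEXT) DEFAULT CHARSET=utf8mb4;",  # future tables
]
```

If the HTML, the header, and the database all say the same thing, browsers and clients stop guessing.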

Don’t convert blindly

If you failed to do the above at one point in time, you may have content in your database that isn’t valid UTF-8 (assuming you want to convert to that to prevent further issues).

Users that copy and paste from Microsoft Word are likely the source of this problem, but don’t blame them or Word. Blame the programmer responsible!

Your task now is to convert those invalid UTF-8 byte strings into something more usable. You may be able to convert many to their HTML entities, but you probably want to convert alphabetical characters to their valid UTF-8 representation to ensure the text remains easily searchable.
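One way to make that split in Python, again assuming the bad bytes are Windows-1252: decode everything, keep letters as real UTF-8 so searches still match, and entity-encode the fancy punctuation (the function name is mine):

```python
def salvage(raw: bytes) -> str:
    """Turn a not-quite-UTF-8 byte string into usable text.

    Assumes stray bytes are Windows-1252, the usual result of Word pastes.
    """
    text = raw.decode("cp1252", errors="replace")
    out = []
    for ch in text:
        if ch.isascii() or ch.isalpha():
            out.append(ch)  # keep letters (é, ü, ...) searchable as real UTF-8
        else:
            out.append(f"&#{ord(ch)};")  # fancy punctuation -> numeric entity
    return "".join(out)
```

So `salvage(b"caf\xe9")` keeps the é as a real character, while smart quotes come out as `&#8220;`/`&#8221;` entities that render fine in HTML.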

Don’t convert blindlier

If you have to convert your table definitions and data to UTF-8 (from something like MySQL’s latin1), be careful. There’s a great post by the WordPress crew on what it entails (a multi-stage process that converts to intermediate binary column types first), but it didn’t cover this troublesome gotcha: if MySQL finds characters it can’t convert to UTF-8, it truncates the rest of the field data.

MySQL will gladly go through the motions with you as you convert from latin1 to UTF-8. It’ll issue warnings that something went wrong during the conversion. But it will still truncate your data at the first character that isn’t valid UTF-8. Instantly your article text will go from 10234 bytes to 145. There’s probably a valid technical reason why it doesn’t error out altogether, but all I know is I now have to re-populate tons of data from backups.
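The safe move is to find the invalid bytes yourself before letting MySQL near them. In Python, a strict UTF-8 decode tells you exactly where a field would get truncated (a sketch; the function name is mine):

```python
def first_invalid_utf8(raw: bytes):
    """Return the offset of the first byte that isn't valid UTF-8, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start  # the decoder reports where it choked
    return None
```

Run it over every row of every text column before the conversion; any row that returns a non-None offset is one whose tail MySQL would have silently thrown away.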