Overcoming missing Unicode support in PHP


 

PHP has no native Unicode support, but there are workarounds that let you develop properly internationalized applications anyway. The first problem to solve is how to represent Unicode data. PHP uses so-called binary strings: a PHP string is not a sequence of Unicode characters but a sequence of bytes. A workable approach is to store all strings internally in UTF-8 and to make sure that every input to the script is decoded to UTF-8 and every output from it is encoded as UTF-8.

In theory, you can use encodings other than UTF-8, but UTF-8 causes less trouble than the alternatives. Many PHP libraries already expect strings to be encoded in UTF-8, including all functions that work with XML and the newly added intl library. To work smoothly with UTF-8-encoded strings, it is best to keep string data in UTF-8 throughout the script and to send the script's output in UTF-8 as well.
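The "UTF-8 in, UTF-8 out" rule can be sketched roughly as follows. This is only a minimal illustration, assuming the mbstring extension is loaded and PHP 7 or later; the helper name utf8_input_or_null and the title request parameter are hypothetical and exist only for this example.

```php
<?php
// Use UTF-8 as the default encoding for all mbstring functions.
mb_internal_encoding("UTF-8");

// Hypothetical helper: accept a raw input value only if it is valid UTF-8.
function utf8_input_or_null($raw)
{
    if (!is_string($raw) || !mb_check_encoding($raw, "UTF-8")) {
        return null;
    }
    return $raw;
}

// Declare the output encoding before any output is sent.
if (!headers_sent()) {
    header("Content-Type: text/html; charset=utf-8");
}

// "title" is an assumed parameter name used only for this sketch.
$title = utf8_input_or_null($_GET["title"] ?? null);
echo htmlspecialchars($title ?? "(invalid or missing title)", ENT_QUOTES, "UTF-8"), "\n";
```

Rejecting invalid input outright is only one possible policy. If you know which legacy encoding a client uses, converting the value with mb_convert_encoding() would be a reasonable alternative.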

Still, turning everything into UTF-8 does not solve everything. A Latin character with an accent or a non-Latin character takes two, three, or four bytes in UTF-8, which confuses the PHP string functions that compute string length or work with substrings, because they count bytes rather than characters. Listing 1 demonstrates this problem.

Listing 1. Problems related to improper Unicode support in PHP

<?php

// Tell the client that the output is UTF-8-encoded plain text.
header("Content-Type: text/plain; charset=utf-8");

$text["en"] = "The Hitchhiker's Guide to the Galaxy";
$text["es"] = "Guía del autoestopista galáctico";
$text["cs"] = "Stopařův průvodce po Galaxii";
$text["ru"] = "Путеводитель хитч-хайкера по Галактике";
$text["ja"] = "銀河ヒッチハイク・ガイド";

// Print each title with its byte count (strlen) and its character count (mb_strlen).
foreach ($text as $lang => $t)
{
    echo $lang, ": ", $t, " (", strlen($t), " vs. ", mb_strlen($t, "utf-8"), ")\n";
}
?>
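
The same byte-versus-character mismatch affects substrings. The following is a small sketch, assuming the mbstring extension is available; the variable names are chosen only for this illustration.

```php
<?php
$title = "Stopařův průvodce";

// substr() counts bytes: "ř" occupies two bytes in UTF-8,
// so cutting after byte 6 splits that character in half.
$byBytes = substr($title, 0, 6);

// mb_substr() counts characters when told the string is UTF-8.
$byChars = mb_substr($title, 0, 6, "UTF-8");

var_dump(mb_check_encoding($byBytes, "UTF-8")); // bool(false): the result is no longer valid UTF-8
var_dump($byChars);                             // string(7) "Stopař": 6 characters, 7 bytes
```

A string cut in the middle of a character typically shows up as a replacement character or other garbage in the browser, so byte-oriented functions are only safe on text that is known to be pure ASCII.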