June 03, 2005

Proposal for Implementing Unicode in PHP

On the Drupal team, I am known as an encoding nut: whenever there's an encoding issue or a question about Unicode, people tend to knock on my door. Usually any fix or answer from me is accompanied by a lot of cursing to the unfortunate inquirer about how "PHP is horrible when it comes to string handling" and how it seems that "the entire PHP dev team has its head planted firmly into the ground when it comes to Unicode".

To which the reply is more than often: "Why don't you fix it yourself?".

Well, I'm not a PHP language developer. To be honest I have no interest or time for becoming one. But I do know a lot about encodings and Unicode, so I decided to write this article describing the problem and possible solutions. That way, maybe others can take some of these ideas and put them into practice. At the very least, it should answer a lot of questions that people have about Unicode and PHP.

Right now, the message from the PHP developers seems to be that "PHP supports Unicode, but some assembly is required". In fact, it is a lot worse. Please, read on.

About encodings and Unicode

First, I recommend that anyone reading this article first reads The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky. It is an excellent introduction to Unicode and encodings in general. Note also that the article was written in 2003 and specifically mentions PHP's Unicode support being hopeless. We are now two years later and the situation has not changed much.

The only important thing about Unicode which isn't explained in Joel's article is that Unicode is in fact more than just a big table which maps characters to numbers: it is also a set of character properties, recommendations and algorithms on how those characters should be used. And this is why Unicode needs (and deserves!) much more attention than any other character set.

What is the current situation?

As far as PHP is concerned at the moment, a character consists of 8 bits and a string is a series of characters. This is good enough for legacy 8-bit encodings (like the common ISO-8859-1 or Latin-1 encoding used in Western Europe), but does not cater to more complicated encodings.

To accomodate those, the multibyte string extension (Mbstring) can be used. This extension was originally developed for handling Japanese encodings, but it has now been extended to support many more encodings, including the Unicode Transformation Formats (like the popular UTF-8). Mbstring provides encoding-aware versions of many of PHP's string functions (substr(), strlen(), ereg(), ...). Through a feature called overloading, you can tell PHP to always use the Mbstring version of a function if there is one.

Aside from Mbstring, there are a few other libraries and extensions which may be used to provide encoding- and Unicode-related services, like Imap, Iconv or GNU Recode.

What problems are there with the current approach?

PHP itself still doesn't know anything about encodings or Unicode. Aside from function calls, there are other ways of interacting with strings in PHP. For example, there is the {} operator for selecting characters from strings, as if they were arrays. And like in most programming languages, you can define strings in code with the familiar quote syntax. But all of these methods work with literal bytes, not with actual encoded characters.

PHP source code itself must be encoded in an ASCII-compatible encoding and there is no way to use Unicode codepoints directly. If you want to store a character in a variable, you either have to use a short string of bytes (the encoded representation of the character) or an integer representing the character's Unicode codepoint. But converting between a codepoint and its encoded representation requires ugly work-arounds and wrappers, as PHP itself provides no easy mechanism for doing this.
PHP does not guarantee anything about the local setup as far as encoding support goes. All the actual encoding functionality is located in libraries or extensions which may not be present on the average PHP install or which may be outdated. This makes it very difficult to make Unicode-compatible PHP programs work everywhere. One of PHP's assets is its large install base, yet the large majority of those installs is completely unsuited for Unicode work. At the time of writing this article, the latest PHP (5.0.4) still does not enable the Mbstring extension by default.

A trickier example: in Drupal 4.6.0 we depend on the Perl-compatible Regular Expression Library's support for Unicode and UTF-8. This was supposedly present since PHP 4.1 (exception: since PHP 4.2.3 on Windows). But actual testing shows that it took until PHP 4.3.3 for this library to know how to deal correctly with UTF-8 and the full Unicode range. But even now, PHP still has the ability to use the system-provided PCRE library, which can still be compiled without UTF-8 support. This can result in unsupported installs even for those using the latest PHP version.
When you use Mbstring overloading, you can no longer easily work with strings of binary data. Mbstring overloading sounds nice in theory, as it gives you smarter string functions for free without having to adapt your code. However, this feature denies a basic fact: text strings are fundamentally different from binary data. If this sounds strange to you, consider this:
- Binary data requires no meta-information about its encoding and can be passed around freely. Operations on two byte arrays are guaranteed to work. Text, on the other hand, is always encoded in a particular way. Text operations can only work if the encoding is known and verified to be the same for all operands involved.
- Binary data can contain arbitrary bits, while most text encodings have a much more limited syntax. Take a look at UTF-8's bit patterns for example. However, even plain US-ASCII text has historically had the limitation that it may not contain the NULL character.
- Binary data has no intrinsic semantic meaning, while text does. Many operations (like case conversion) only make sense on text, while other operations become much more complicated (e.g. text sorting needs to take local conventions into account). Specifically, there are a lot of Unicode algorithms for advanced text processing (e.g. the Bidirectional Algorithm for handling text with mixed writing directions).
Due to the fact that text has been 8-bit encoded for a long time, a lot of programmers don't think twice about using text functions for dealing with binary data and vice-versa. But this assumption is no longer valid today.

If Mbstring overloading is enabled and a PHP programmer wants to perform operations on binary data, (s)he has to temporarily trick PHP into using a simple 8-bit encoding (like ISO-8859-1). Quite possibly, locale settings have to be changed back and forth as well. This results in bloated, complicated code.
PHP's string functions don't form a clean, consistent API. There is no consistent naming convention (e.g. substr(), str_replace(), convert_cyr_string(), parse_str(), sprintf(), ...).

There are also a bunch of hodge-podge functions which are only useful in very specific situations and/or which are tied to a particular encoding (e.g. utf8_encode()) or locale (e.g. ucfirst()).

Finally, though some functions take an encoding argument to allow for some encoding support, this is rare and inconsistent. For example, while the html_entities() function supports several encodings, the utility function get_html_translation_table() which fetches its translation table does not.
PHP's locale mechanism is completely platform-dependant and offers no guarantees. The locale identifiers passed to setlocale() differ completely between Windows and Unix platforms, but even between similar Unix platforms there is no guarantee of which locales are available. The dependency of PHP on system locales also means that you are restricted to whatever encodings the system locales are available in.
PHP's XML parser is notorious for violating the specifications when it comes to encodings. In today's web, XML is everywhere in the form of XHTML, RSS feeds, OPML, etc. Being able to parse XML correctly is essential to any PHP application. A significant portion of the XML specification talks about encodings and how to deal with them, but PHP does not implement them correctly.

For example, if an XML document starts with a UTF-8 signature (in the form of the byte-order mark), PHP5's parser will die if it is told the document is in UTF-8 encoding. Similar simple, but critical bugs have had to be worked around by PHP programmers in the past. Before PHP5, absolutely no encoding autodetection was present in the XML parser: this had to be done by the code invoking the parser.
Mbstring is a pragmatic library, not a fully featured Unicode solution. Example limitations include not being able to specify characters beyond U+FFFF for some functions (e.g. mb_substitute_character()) or the way mb_strwidth() seems to be hardcoded for Japanese only (there are no zero widths for combining accents?).

All of these problems together mean that it is very hard at the moment to write PHP software which can support encodings and Unicode. Even worse, if this software has to run on a typical PHP install, then you can forget about implementing anything more than simple pass-through behaviour as far as text is concerned.

Proposed solution

Unfortunately, PHP is very hot on backwards compatibility, so significant changes to the existing string API are pretty much out of the question. New types and APIs need to be introduced which offer a complete, consistent and flexible solution for dealing with encodings and Unicode.

PHP needs a new Unicode text string type which is separate from the classic byte string. This type, let's call it ustring, would represent a string of Unicode text.

Internally, it would be stored using one of the UTF's. In the interests of internal processing efficiency, UTF-16 is probably the best choice, but UTF-8 can be considered as well as it is the most popular UTF on the web today. In that case, outputting UTF-8 could be done without any conversion. On the other hand, the complicated bit patterns and variability of UTF-8 mean that it is harder to find character boundaries and such. Looking at how languages like Perl and Python approach this is a good idea. After all, they've had Unicode strings for quite some time.

To distinguish ustrings from plain strings when defined, a syntax similar to C could be introduced, for example U"This is a Unicode string". This syntax would support \u####, \U######## and \x{#..} notation for defining characters by codepoint inside the string.

Using the {} operator on a ustring would return ints, not chars. To reduce confusion, perhaps a uchar type could be introduced specifically for handling Unicode codepoints. As the Unicode codespace is only 21-bit wide, there would be subtle differences between uchar and int, though both would probably be stored as 32-bit.

For backwards compatibility, plain quoted strings would remain used for byte strings, although it might be interesting to define a B"This is a byte string" notation, while providing a configurable option for choosing which type of string is assumed when there is no prefix. As Unicode usage would become more widespread, it would be nice to not have to litter your code with U's everywhere.

Though the internal encoding would be fixed to one of the UTF's, the external encoding might vary (and would be configurable through an API). When casting a ustring to a string, a conversion would take place from the internal encoding to the external one, and vice-versa. It remains to be seen which type takes precedence when both are mixed together (e.g. $string = U"Unicode" . "Bytes").
PHP needs a new Unicode string API. This API would contain a selection of functions from both the plain String API as well as the Mbstring API, but would have a simpler and more logical naming convention. For example, making all ustring functions start with ustr_. Each of these would accept a ustring where the current ones accept a plain string.

External APIs, like the PCRE library, could choose whether to accept string, ustring or both. For example for PCRE, it makes sense to replace the PHP-proprietary /u modifier with a simple string type check instead.
PHP needs to ensure that a baseline set of encoding-related functions are always available. I believe the Iconv extension is now standard since PHP5, but things like complete UTF-8 support in PCRE are important too. This allows programmers to write their code in a straightforward fashion without having to check for a gazillion exceptions or exotic configurations.
PHP needs an independent locale library across all platforms This ensures consistent handling of locales and no longer limits PHP to what the platform supports. The International Components for Unicode (ICU) are an excellent candidate.

The choice to limit this new string functionality to Unicode strings might seem elitist: after all, the idea of Unicode is not to get rid of other encodings, but merely to ensure compatibility. Non-Unicode encodings will keep fulfilling an important role in the years to come. On the other hand, as Unicode is guaranteed to be a perfect intermediate format, it makes sense to use it for internal string handling. It limits the functionality that has to be dealt with and creates a common baseline to work with.

Finally, as the original String and Mbstring APIs would not be altered by these changes, programmers would be free to use the 'old school' way of dealing with strings. They would simply not be able to take advantage of the cleaner API and consistent locales.

Drupal PHP Unicode June 03, 2005

Hackery, Math & Design

Steven Wittens i

Proposal for Implementing Unicode in PHP

About encodings and Unicode

What is the current situation?

What problems are there with the current approach?

Proposed solution