Proposal for Implementing Unicode in PHP

Jun 03, 2005

On the Drupal team, I am known as an encoding nut: whenever there's an encoding issue or a question about Unicode, people tend to knock on my door. Usually any fix or answer from me is accompanied by a lot of cursing to the unfortunate inquirer about how "PHP is horrible when it comes to string handling" and how it seems that "the entire PHP dev team has its head planted firmly into the ground when it comes to Unicode".

To which the reply is, more often than not: "Why don't you fix it yourself?"

Well, I'm not a PHP language developer. To be honest I have no interest or time for becoming one. But I do know a lot about encodings and Unicode, so I decided to write this article describing the problem and possible solutions. That way, maybe others can take some of these ideas and put them into practice. At the very least, it should answer a lot of questions that people have about Unicode and PHP.

Right now, the message from the PHP developers seems to be that "PHP supports Unicode, but some assembly is required". In fact, it is a lot worse. Please, read on.

About encodings and Unicode

First, I recommend that anyone reading this article start with The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky. It is an excellent introduction to Unicode and encodings in general. Note also that the article was written in 2003 and specifically mentions PHP's Unicode support being hopeless. Two years later, the situation has not changed much.

The only important thing about Unicode which isn't explained in Joel's article is that Unicode is in fact more than just a big table which maps characters to numbers: it is also a set of character properties, recommendations and algorithms on how those characters should be used. And this is why Unicode needs (and deserves!) much more attention than any other character set.

What is the current situation?

As far as PHP is concerned at the moment, a character consists of 8 bits and a string is a series of characters. This is good enough for legacy 8-bit encodings (like the common ISO-8859-1 or Latin-1 encoding used in Western Europe), but does not cater to more complicated encodings.
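
To make the byte-oriented view concrete, here is a minimal sketch. It assumes the text is UTF-8 encoded and uses explicit byte escapes so that the encoding of the source file does not matter; the variable names are purely illustrative.

```php
<?php
// "é" (U+00E9) encoded as UTF-8 takes two bytes: 0xC3 0xA9.
$text = "caf\xC3\xA9";

// strlen() counts bytes, not characters: 4 characters, but 5 bytes.
$byteCount = strlen($text);

// Cutting at a byte offset can split a character in half,
// leaving an invalid lone lead byte (0xC3) at the end.
$truncated = substr($text, 0, 4);

echo $byteCount, "\n";           // 5
echo bin2hex($truncated), "\n";  // 636166c3
```

The truncated result is no longer valid UTF-8, which is exactly the kind of silent corruption a byte-based string API allows.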

To accommodate those, the multibyte string extension (Mbstring) can be used. This extension was originally developed for handling Japanese encodings, but it has now been extended to support many more encodings, including the Unicode Transformation Formats (like the popular UTF-8). Mbstring provides encoding-aware versions of many of PHP's string functions (substr(), strlen(), ereg(), ...). Through a feature called overloading, you can tell PHP to always use the Mbstring version of a function if there is one.

Aside from Mbstring, there are a few other libraries and extensions which may be used to provide encoding- and Unicode-related services, like Imap, Iconv or GNU Recode.


What problems are there with the current approach?

  1. PHP itself still doesn't know anything about encodings or Unicode.
    Aside from function calls, there are other ways of interacting with strings in PHP. For example, there is the {} operator for selecting characters from strings, as if they were arrays. And like in most programming languages, you can define strings in code with the familiar quote syntax. But all of these methods work with literal bytes, not with actual encoded characters.

    PHP source code itself must be encoded in an ASCII-compatible encoding and there is no way to use Unicode codepoints directly. If you want to store a character in a variable, you either have to use a short string of bytes (the encoded representation of the character) or an integer representing the character's Unicode codepoint. But converting between a codepoint and its encoded representation requires ugly work-arounds and wrappers, as PHP itself provides no easy mechanism for doing this.

  2. PHP does not guarantee anything about the local setup as far as encoding support goes.
    All the actual encoding functionality is located in libraries or extensions which may not be present on the average PHP install or which may be outdated. This makes it very difficult to make Unicode-compatible PHP programs work everywhere. One of PHP's assets is its large install base, yet the large majority of those installs is completely unsuited for Unicode work. At the time of writing this article, the latest PHP (5.0.4) still does not enable the Mbstring extension by default.

    A trickier example: in Drupal 4.6.0 we depend on the Perl-compatible Regular Expression Library's support for Unicode and UTF-8. This was supposedly present since PHP 4.1 (exception: since PHP 4.2.3 on Windows), but actual testing shows that it took until PHP 4.3.3 for this library to deal correctly with UTF-8 and the full Unicode range. Even now, PHP can still be built against the system-provided PCRE library, which may itself be compiled without UTF-8 support. This can result in unsupported installs even for those using the latest PHP version.

  3. When you use Mbstring overloading, you can no longer easily work with strings of binary data.
    Mbstring overloading sounds nice in theory, as it gives you smarter string functions for free without having to adapt your code. However, this feature denies a basic fact: text strings are fundamentally different from binary data. If this sounds strange to you, consider this:

    • Binary data requires no meta-information about its encoding and can be passed around freely. Operations on two byte arrays are guaranteed to work. Text, on the other hand, is always encoded in a particular way. Text operations can only work if the encoding is known and verified to be the same for all operands involved.
    • Binary data can contain arbitrary bits, while most text encodings have a much more limited syntax. Take a look at UTF-8's bit patterns for example. However, even plain US-ASCII text has historically had the limitation that it may not contain the NUL character (byte value 0).
    • Binary data has no intrinsic semantic meaning, while text does. Many operations (like case conversion) only make sense on text, while other operations become much more complicated (e.g. text sorting needs to take local conventions into account). Specifically, there are a lot of Unicode algorithms for advanced text processing (e.g. the Bidirectional Algorithm for handling text with mixed writing directions).

    Due to the fact that text has been 8-bit encoded for a long time, a lot of programmers don't think twice about using text functions for dealing with binary data and vice-versa. But this assumption is no longer valid today.

    If Mbstring overloading is enabled and a PHP programmer wants to perform operations on binary data, (s)he has to temporarily trick PHP into using a simple 8-bit encoding (like ISO-8859-1). Quite possibly, locale settings have to be changed back and forth as well. This results in bloated, complicated code.

  4. PHP's string functions don't form a clean, consistent API.
    There is no consistent naming convention (e.g. substr(), str_replace(), convert_cyr_string(), parse_str(), sprintf(), ...).

    There are also a bunch of hodge-podge functions which are only useful in very specific situations and/or which are tied to a particular encoding (e.g. utf8_encode()) or locale (e.g. ucfirst()).

    Finally, though some functions take an encoding argument to allow for some encoding support, this is rare and inconsistent. For example, while the htmlentities() function supports several encodings, the utility function get_html_translation_table() which fetches its translation table does not.

  5. PHP's locale mechanism is completely platform-dependent and offers no guarantees.
    The locale identifiers passed to setlocale() differ completely between Windows and Unix platforms, but even between similar Unix platforms there is no guarantee of which locales are available. The dependency of PHP on system locales also means that you are restricted to whatever encodings the system locales are available in.

  6. PHP's XML parser is notorious for violating the specifications when it comes to encodings.
    In today's web, XML is everywhere in the form of XHTML, RSS feeds, OPML, etc. Being able to parse XML correctly is essential to any PHP application. A significant portion of the XML specification talks about encodings and how to deal with them, but PHP does not implement them correctly.

    For example, if an XML document starts with a UTF-8 signature (in the form of the byte-order mark), PHP5's parser will die if it is told the document is in UTF-8 encoding. Similar simple, but critical bugs have had to be worked around by PHP programmers in the past. Before PHP5, absolutely no encoding autodetection was present in the XML parser: this had to be done by the code invoking the parser.

  7. Mbstring is a pragmatic library, not a fully featured Unicode solution.
    Example limitations include not being able to specify characters beyond U+FFFF for some functions (e.g. mb_substitute_character()) or the way mb_strwidth() seems to be hardcoded for Japanese only (there are no zero widths for combining accents?).
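
Problem 1 above mentions that converting between a codepoint and its encoded representation requires work-arounds. As a rough sketch of what such a wrapper looks like for UTF-8 (the function name codepoint_to_utf8() is hypothetical and not part of PHP):

```php
<?php
/**
 * Encode a single Unicode codepoint as a UTF-8 byte string.
 * Hypothetical helper; returns false for values that are not valid
 * Unicode scalar values (surrogates, or anything beyond U+10FFFF).
 */
function codepoint_to_utf8($cp) {
  if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
    return false;
  }
  if ($cp < 0x80) {
    // 1 byte: 0xxxxxxx
    return chr($cp);
  }
  if ($cp < 0x800) {
    // 2 bytes: 110xxxxx 10xxxxxx
    return chr(0xC0 | ($cp >> 6))
         . chr(0x80 | ($cp & 0x3F));
  }
  if ($cp < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return chr(0xE0 | ($cp >> 12))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
  }
  // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return chr(0xF0 | ($cp >> 18))
       . chr(0x80 | (($cp >> 12) & 0x3F))
       . chr(0x80 | (($cp >> 6) & 0x3F))
       . chr(0x80 | ($cp & 0x3F));
}

// The euro sign, U+20AC, becomes the three bytes E2 82 AC.
echo bin2hex(codepoint_to_utf8(0x20AC)), "\n";  // e282ac
```

Every application that wants to refer to a character by number ends up carrying some variant of this around.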

All of these problems together mean that it is very hard at the moment to write PHP software which can support encodings and Unicode. Even worse, if this software has to run on a typical PHP install, then you can forget about implementing anything more than simple pass-through behaviour as far as text is concerned.
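
The point made under problem 3, that text has a much stricter syntax than binary data, can be illustrated with a well-formedness check for UTF-8. The sketch below uses PCRE in plain byte mode (no /u modifier), so it does not depend on the library's own UTF-8 support; the function name is_valid_utf8() is hypothetical.

```php
<?php
/**
 * Check whether a byte string is well-formed UTF-8.
 * The pattern spells out the permitted byte sequences explicitly and
 * is matched byte by byte, without relying on PCRE's UTF-8 mode.
 */
function is_valid_utf8($bytes) {
  return preg_match('/\A(?:
      [\x00-\x7F]                        # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequences
    | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte, general
    | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, excluding overlongs
    | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, general
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, up to U+10FFFF
  )*\z/x', $bytes) === 1;
}
```

Arbitrary binary data will usually fail this check, while any such data is a perfectly valid byte string. For very long inputs a pattern like this may run into PCRE's internal limits, in which case preg_match() fails and the function reports false.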

Proposed solution

Unfortunately, PHP is very hot on backwards compatibility, so significant changes to the existing string API are pretty much out of the question. New types and APIs need to be introduced which offer a complete, consistent and flexible solution for dealing with encodings and Unicode.

  1. PHP needs a new Unicode text string type which is separate from the classic byte string.
    This type, let's call it ustring, would represent a string of Unicode text.

    Internally, it would be stored using one of the UTFs. In the interests of internal processing efficiency, UTF-16 is probably the best choice, but UTF-8 can be considered as well, as it is the most popular UTF on the web today and could be output without any conversion. On the other hand, the complicated bit patterns and variable length of UTF-8 make it harder to find character boundaries and the like. Looking at how languages like Perl and Python approach this is a good idea. After all, they've had Unicode strings for quite some time.

    To distinguish ustrings from plain strings when defined, a syntax similar to C could be introduced, for example U"This is a Unicode string". This syntax would support \u####, \U######## and \x{#..} notation for defining characters by codepoint inside the string.

    Using the {} operator on a ustring would return ints, not chars. To reduce confusion, perhaps a uchar type could be introduced specifically for handling Unicode codepoints. As the Unicode codespace is only 21 bits wide, there would be subtle differences between uchar and int, though both would probably be stored as 32-bit.

    For backwards compatibility, plain quoted strings would remain used for byte strings, although it might be interesting to define a B"This is a byte string" notation, while providing a configurable option for choosing which type of string is assumed when there is no prefix. As Unicode usage would become more widespread, it would be nice to not have to litter your code with U's everywhere.

    Though the internal encoding would be fixed to one of the UTFs, the external encoding might vary (and would be configurable through an API). When casting a ustring to a string, a conversion would take place from the internal encoding to the external one, and vice-versa. It remains to be seen which type takes precedence when both are mixed together (e.g. $string = U"Unicode" . "Bytes").

  2. PHP needs a new Unicode string API.
    This API would contain a selection of functions from both the plain String API as well as the Mbstring API, but would have a simpler and more logical naming convention. For example, making all ustring functions start with ustr_. Each of these would accept a ustring where the current ones accept a plain string.

    External APIs, like the PCRE library, could choose whether to accept string, ustring or both. For example for PCRE, it makes sense to replace the PHP-proprietary /u modifier with a simple string type check instead.

  3. PHP needs to ensure that a baseline set of encoding-related functions is always available.
    I believe the Iconv extension is now standard since PHP5, but things like complete UTF-8 support in PCRE are important too. This allows programmers to write their code in a straightforward fashion without having to check for a gazillion exceptions or exotic configurations.

  4. PHP needs an independent locale library across all platforms.
    This ensures consistent handling of locales and no longer limits PHP to what the platform supports. The International Components for Unicode (ICU) are an excellent candidate.
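
The ustring type proposed in point 1 does not exist, but the behaviour of its {} operator (returning codepoints rather than bytes) can be approximated today in userland. The following is only a sketch under the assumption that the input is well-formed UTF-8; the function name utf8_to_codepoints() is hypothetical.

```php
<?php
/**
 * Decode a UTF-8 byte string into an array of integer codepoints.
 * Assumes the input is well-formed UTF-8; no validation is done here.
 */
function utf8_to_codepoints($bytes) {
  $codepoints = array();
  $length = strlen($bytes);
  $i = 0;
  while ($i < $length) {
    $lead = ord($bytes[$i]);
    if ($lead < 0x80) {        // 0xxxxxxx: 1 byte
      $cp = $lead;
      $extra = 0;
    } elseif ($lead < 0xE0) {  // 110xxxxx: 2 bytes
      $cp = $lead & 0x1F;
      $extra = 1;
    } elseif ($lead < 0xF0) {  // 1110xxxx: 3 bytes
      $cp = $lead & 0x0F;
      $extra = 2;
    } else {                   // 11110xxx: 4 bytes
      $cp = $lead & 0x07;
      $extra = 3;
    }
    // Each continuation byte (10xxxxxx) contributes six more bits.
    for ($j = 1; $j <= $extra; $j++) {
      $cp = ($cp << 6) | (ord($bytes[$i + $j]) & 0x3F);
    }
    $codepoints[] = $cp;
    $i += $extra + 1;
  }
  return $codepoints;
}

// Index 3 now refers to the fourth character (0xE9), not the fourth byte.
$codepoints = utf8_to_codepoints("caf\xC3\xA9");
```

With a native ustring, this decoding step would happen inside the engine, and an expression along the lines of U"caf\u00E9"{3} would yield 0xE9 directly.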

The choice to limit this new string functionality to Unicode strings might seem elitist: after all, the idea of Unicode is not to get rid of other encodings, but merely to ensure compatibility. Non-Unicode encodings will keep fulfilling an important role in the years to come. On the other hand, as Unicode is guaranteed to be a perfect intermediate format, it makes sense to use it for internal string handling. It limits the functionality that has to be dealt with and creates a common baseline to work with.

Finally, as the original String and Mbstring APIs would not be altered by these changes, programmers would be free to use the 'old school' way of dealing with strings. They would simply not be able to take advantage of the cleaner API and consistent locales.

Parallel effort -- ICU is already underway

Jun 03, 2005 Dan Scott

See "PHP and Unicode: A Love at Fifth Sight" -- Andrei Zmievski and Derick Rethans have already been doing the design work towards implementing Unicode support in PHP, with a complete understanding of the issues from both the Unicode and the PHP side of the fence.

Couldn't find it, but the future looks good

Jun 03, 2005 Steven

Sorry, when I googled for information about ICU and PHP all I seemed to find was a couple of messages from 2001 and a burnt out attempt in 2003.

Edit: I just read through the slides... seems they had mostly the same ideas as me. Yay, this looks good!

I agree very much with your

Nov 01, 2005 Anonymous

I agree very much with your article and your ideas for implementation. I have a couple of comments though:

You said: "PHP source code itself must be encoded in an ASCII-compatible encoding and there is no way to use Unicode codepoints directly."

This is inaccurate. I use UTF-8 strings all the time in my source code (for strings at least) and they work fine in PHP5.

You said: "..., if an XML document starts with a UTF-8 signature (in the form of the byte-order mark), PHP5's parser will die if it is told the document is in UTF-8 encoding."

I believe that this is the correct behaviour. I'm 99% sure that the BOM is considered illegal in a valid UTF-8 bytestream because its presence is totally unnecessary.

Confusion?

Nov 02, 2005 Steven

It seems you are confusing various terms.

First I said there is no way to use Unicode codepoints directly in PHP code, i.e. referring to Unicode characters by their number and retrieving the number directly from a string/character. When you use UTF-8 encoded text in PHP source code, it is 'baked in' and cannot be interpreted correctly without specialized routines. The rest of the original paragraph explains this in more detail really.

The part about requiring an ASCII-compatible encoding is also correct, because UTF-8 is exactly that. ASCII only uses the first 128 characters, which are encoded the same in UTF-8. On the other hand, you cannot use e.g. UTF-16 encoded PHP code.

Finally, the UTF-8 BOM at the beginning of an XML document is indeed valid and this is mentioned specifically in the XML specifications. Even so, there is no reason why the byte order mark would be illegal in a UTF-8 bytestream because the character it represents is perfectly valid anyhow.
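
One possible workaround for the parser problem is to strip the signature before handing the document to the XML parser. A minimal sketch; the helper name strip_utf8_bom() is hypothetical:

```php
<?php
/**
 * Remove a leading UTF-8 byte-order mark (EF BB BF), if present.
 * Hypothetical helper for pre-processing a document before parsing.
 */
function strip_utf8_bom($data) {
  if (strncmp($data, "\xEF\xBB\xBF", 3) === 0) {
    // The cast guards against older PHP versions, where substr()
    // returns false instead of "" when nothing follows the BOM.
    return (string) substr($data, 3);
  }
  return $data;
}
```

The remaining bytes are still UTF-8, so the parser can be told so explicitly.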

I agree that PHP should

Nov 14, 2005 Wong PoKér Hu Online

I agree that PHP should still be developed further. PHP as it is has many deficiencies. I once thought that it could not stand alone as a scripting language for web development. Adding a new feature will complement the advantages of using PHP and will produce better output.

Re: I agree very much with your

May 17, 2006 Peter Molnar

Your PHP source files work fine because UTF-8 is designed to be US-ASCII compatible for the 0-127 character range. In UTF-8, a set MSB (highest bit) signifies that the byte is part of a multi-byte sequence representing a single Unicode character.
In other words, as long as you use US English characters, there is no problem. Non-ASCII characters in your script will be handled as multiple characters. Try strlen() on Japanese text. ;)
Because the PHP language syntax only uses US-ASCII characters, your code will not be broken syntactically, but it might not work as you would like.

Are we there yet?

Jul 25, 2007 James Shields

I found myself thinking almost exactly the same thing today, which prompted me to search for your article. After reading it, I'm not sure if we're any further on in 2007 than we were in 2003, despite the efforts of a few people.

I'm not sure I agree about ustr_ functions. Wouldn't it be better to just overload the existing functions with ones that can recognise ustring parameters?

Like you, I'm not sure I have the skill or time to implement these changes, but maybe that's just making excuses and it's time I learned them.

I'm looking at starting a project that will need to support Unicode, and I would like to do it in PHP, but I'm not sure I can justify it when I know it will be the source of potential problems, and there are other platforms that have had seamless Unicode support for years.

James: Depends

Jul 25, 2007 Steven

Actually if you are just doing a single project in PHP, the Unicode support you get with PHP's mbstring extension will probably be 'good enough', and you can enable it on your own server.

My post was written from the perspective of an open source CMS developer, whose code has to run on $10/month shared hosting. We can't rely on mbstring being available, so we have to include our own (slow) routines for processing Unicode text without it.
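
As a rough illustration of what such a fallback routine can look like (a sketch for illustration only; utf8_strlen() and utf8_strlen_fallback() are hypothetical names, and well-formed UTF-8 input is assumed):

```php
<?php
/**
 * Count the characters (codepoints) in a UTF-8 string.
 * Uses mbstring when it is loaded, and a pure-PHP fallback otherwise.
 */
function utf8_strlen($text) {
  if (function_exists('mb_strlen')) {
    return mb_strlen($text, 'UTF-8');
  }
  return utf8_strlen_fallback($text);
}

/**
 * Fallback: drop all continuation bytes (10xxxxxx), so that exactly
 * one byte per character remains, then count the bytes.
 */
function utf8_strlen_fallback($text) {
  return strlen(preg_replace('/[\x80-\xBF]/', '', $text));
}
```

The fallback copies the whole string just to measure it, which hints at why routines like this are slow compared to a native implementation.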

But you're right, the situation today has barely improved. PHP6 is on the horizon, but not here yet. And when it gets here, we'll have to wait for hosts to upgrade to it. Given the effort required to get PHP5 going, I'm not hopeful.

As for that 'ustr_' prefix idea... It was mainly aimed at preserving backwards compatibility, while allowing a new, cleaner String API to emerge. The PHP team, however, has chosen to break backwards compatibility when PHP6's Unicode mode is turned on, without improving the actual String API along with it. A big shame, if you ask me.
