Unicode

Proposal for Implementing Unicode in PHP

Jun 03, 2005

On the Drupal team, I am known as an encoding nut: whenever there's an encoding issue or a question about Unicode, people tend to knock on my door. Usually any fix or answer from me is accompanied by a lot of cursing to the unfortunate inquirer about how "PHP is horrible when it comes to string handling" and how it seems that "the entire PHP dev team has its head planted firmly into the ground when it comes to Unicode".

To which the reply is, more often than not: "Why don't you fix it yourself?"

Well, I'm not a PHP language developer. To be honest, I have neither the interest nor the time to become one. But I do know a lot about encodings and Unicode, so I decided to write this article describing the problem and possible solutions. That way, maybe others can take some of these ideas and put them into practice. At the very least, it should answer a lot of questions that people have about Unicode and PHP.

Right now, the message from the PHP developers seems to be that "PHP supports Unicode, but some assembly is required". In fact, it is a lot worse. Please, read on.

PHP, Unicode and ostriches.

Mar 25, 2005

Update: I've written a follow-up post that describes how I would like PHP's encoding support to be.

As the resident encoding geek on the Drupal team, it's usually my job to make sure Drupal handles encodings and Unicode correctly. I don't mind doing this, but PHP doesn't exactly make it easy. With the new search.module for Drupal 4.6 being Unicode-aware, this has become very obvious, as we've had to bump up the minimum required version of PHP to 4.3.3. The UTF-8 support in the Perl-compatible regular expressions in PHP 4.3.2 and earlier is completely broken. And now I've had a bug report about someone on PHP 4.3.8 who still had problems getting it to work.

I don't know why exactly, but as far as encodings go, PHP is still in the Stone Age. This is odd, as you'd expect a web-oriented scripting language to have excellent support for sharing and exchanging textual information. There is a multi-byte string extension available, but it's not available on 90% of PHP hosts out there, and it's more of a black-box library anyway: it does not present your strings to you as sequences of Unicode code points, but still as arrays of bytes. Furthermore, if you actually enable the mbstring overrides, you lose the ability to work with bytes at will. Apparently, the PHP team still hasn't figured out that bytes and characters are not the same. The other extensions which deal with encodings (iconv, recode) are also unavailable on the majority of PHP installs out there.
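To make the bytes-versus-characters distinction concrete, here is a minimal sketch in Python (not PHP), chosen only because it exposes both views of a string. The variable names are illustrative, and the sample text is an assumption, not something from a real bug report.

```python
# "naïve café", written with escapes so the source file encoding cannot matter
text = "na\u00efve caf\u00e9"
data = text.encode("utf-8")      # the same text as a sequence of UTF-8 bytes

char_count = len(text)           # counts Unicode code points
byte_count = len(data)           # counts bytes; each accented letter takes two

# Cutting the byte sequence at an arbitrary offset can split a character in two.
broken = data[:3]                # 'n', 'a', and only the first byte of 'ï'
try:
    broken.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(char_count, byte_count, decode_failed)   # 10 12 True
```

A byte-oriented string function that truncates or indexes such data without knowing about UTF-8 runs into exactly this kind of split.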

This means that if you want to make a PHP application which supports any language and runs on the average PHP host out there, there's only one option: use UTF-8 internally, and write your own functions for string truncation, email header encoding, validation, etc. Using UTF-8 ensures that you only have one encoding to worry about, and because it's Unicode, it is guaranteed to be able to represent any language. Of course, you will no longer be able to do something as simple as upper/lowercasing a string, as PHP's built-in functions for this don't handle UTF-8 at all.
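As an illustration of what such a hand-written truncation function has to do, here is a rough sketch in Python operating on raw bytes, the way a byte-oriented language would. The function name `truncate_utf8` is hypothetical, and the sketch assumes its input is already valid UTF-8; it is not Drupal's actual implementation.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut valid UTF-8 to at most max_bytes without splitting a character."""
    if max_bytes <= 0:
        return b""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # Continuation bytes have the bit pattern 10xxxxxx. If the first byte we
    # would drop is a continuation byte, the cut falls inside a character,
    # so back up until the cut lands on the start of a character.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


sample = "h\u00e9llo".encode("utf-8")    # "héllo": 5 characters, 6 bytes
print(truncate_utf8(sample, 2))          # b'h' (the 2-byte 'é' does not fit)
print(truncate_utf8(sample, 3))          # b'h\xc3\xa9'
```

The same leading-byte test is the basis for counting characters or validating input when only byte strings are available.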

What PHP needs is Unicode string support in the core, along with a good library of useful functions for handling the very large Unicode character range efficiently. ASP, Perl, Python, Java all have it... for me, it's the only thing that would've made PHP5 worth upgrading to.

It's as if the entire PHP team has stuck their head in the ground, hoping that all this Unicode stuff will somehow blow over. It won't.

Sprankle Character Map (open-source)

Dec 23, 2004

It hit me a while ago that entering characters which are not available on your keyboard or through your IME is much too complicated. Usually it involves opening up some character map, scrolling through hundreds of symbols to find the one you need and copy/pasting it into the application of your choice.

Not very handy. Enter Sprankle Character Map. The idea is to hit a special key combination when typing (WIN + S for Sprankle) which pops up a character map where you are typing. You then type a symbol to find similar characters and choose one from the list using either numbers or arrows + space. Here's how it looks.

This is just a prototype, but it demonstrates the idea nicely and it's actually pretty usable. Certainly better than firing up a full character map every time.
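The "type a symbol, get similar characters" lookup can be approximated without a hand-made table. The sketch below, in Python, is not how Sprankle works (it reads its sets from sprankle.txt); it swaps in Unicode canonical decomposition as the notion of "similar". The function name and the default code point range are assumptions for illustration.

```python
import unicodedata


def similar_characters(typed: str, start: int = 0x00A0, stop: int = 0x0250) -> list:
    """Return characters in [start, stop) whose decomposition begins with `typed`."""
    matches = []
    for code in range(start, stop):
        ch = chr(code)
        decomposed = unicodedata.normalize("NFD", ch)
        # Keep only characters that really decompose, e.g. 'é' -> 'e' + accent.
        if decomposed != ch and decomposed[0] == typed:
            matches.append(ch)
    return matches


print("".join(similar_characters("e")))
```

This only finds accented variants of a base letter; symbols with no decomposition, such as math operators, would still need a curated list.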

Notes:

  • Sprankle is a Unicode application and only runs on Windows 2000/XP.
  • The map appears on top of the current text field. For large, multi-line text fields this is far from ideal. It would be better to have it appear at the current caret position.
  • Sprankle doesn't work in Mozilla Firefox (or other applications that do special keyboard processing). If anyone has an idea on how to fix this, please let me know.
  • It might be better to implement Sprankle as a real IME so it integrates completely with the text field. I have no idea how to do this, but I'm sure MSDN has some documentation about it. The downside would be that it might not work in combination with existing IMEs (e.g. for Japanese).
  • Many of the symbols in the character set are not present in most fonts. Sprankle currently looks for Arial Unicode MS, the universal font that comes with XP and Office.
  • It might be cool to make a JavaScript version of this, so it can be integrated on websites with CMSes like Drupal.
  • You can customize Sprankle's character sets by editing sprankle.txt (UTF-16LE encoded). Right now it covers most of the Latin characters, basic Greek plus some math symbols.

Download Sprankle (source + win32 binary).

UFPDF: Unicode/UTF-8 extension for FPDF

Sep 01, 2004

Note: I wrote UFPDF as an experiment, not as a finished product. If you have problems using it, don't bug me for support. Patches are welcome, but I don't have much time to maintain this.

FPDF is a PHP class for generating PDF files on-the-fly. Unfortunately it does not support Unicode. So I've coded UFPDF, an extension of FPDF which accepts input in UTF-8.

Only TrueType fonts are supported for now. To embed .TTF files, you need to extract the font metrics and build the required tables using the provided utilities (see README.txt). Included is a modified version of TTF2PT1 which extracts the Unicode glyph info.

UFPDF works the same as FPDF, except that all text is in UTF-8, so consult the FPDF documentation for usage.

Download UFPDF
Example PDF

UTF-8 conversion support for mIRC

Jul 13, 2004

mIRC's lack of UTF-8 support has been an issue for quite some time. The author promised to 'look at it', but in the meantime, chatting in UTF-8 is not possible. This is problematic for any language that uses more than the occasional accented letter.

So I decided to make a temporary fix myself. The result is a flexible conversion mechanism between UTF-8 and the ANSI codepages. The user sees and types regular ANSI characters, but all data which is sent to and received from the IRC server is UTF-8 encoded. You are still limited to one ANSI codepage though: making mIRC support real Unicode is not possible without an mIRC rewrite.

The script performs real UTF-8 encoding and decoding, so unlike a simple 'find and replace' approach, characters which do not fit into the current codepage are indicated as such.
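The conversion idea can be sketched in a few lines of Python using its built-in codecs. This is not the mIRC script itself; the function names are hypothetical, and using "?" as the marker for characters outside the codepage is simply what Python's "replace" error handler does, not necessarily what the script displays.

```python
def utf8_to_codepage(data: bytes, codepage: str = "cp1252") -> bytes:
    """Incoming text: decode UTF-8, then re-encode for the local ANSI codepage."""
    text = data.decode("utf-8", errors="replace")
    # Characters that do not exist in the codepage come out as b"?".
    return text.encode(codepage, errors="replace")


def codepage_to_utf8(data: bytes, codepage: str = "cp1252") -> bytes:
    """Outgoing text: interpret the typed bytes in the codepage, send UTF-8."""
    return data.decode(codepage, errors="replace").encode("utf-8")


incoming = "caf\u00e9 \u0395".encode("utf-8")   # "café" plus a Greek capital epsilon
print(utf8_to_codepage(incoming))               # b'caf\xe9 ?'
print(utf8_to_codepage(incoming, "cp1253"))     # the Greek letter survives, the 'é' becomes '?'
```

The one-codepage limitation is visible here: whichever codepage is active, characters from the others are lost on display.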

I included conversion tables for all of the Windows ANSI codepages:

  • 1250 (ANSI - Central Europe)
  • 1251 (ANSI - Cyrillic)
  • 1252 (ANSI - Western Europe / Latin I)
  • 1253 (ANSI - Greek)
  • 1254 (ANSI - Turkish)
  • 1255 (ANSI - Hebrew)
  • 1256 (ANSI - Arabic)
  • 1257 (ANSI - Baltic)
  • 1258 (ANSI/OEM - Viet Nam)

There is also a little utility (with source) for generating conversion tables for more codepages.
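For a sense of what generating such a table involves, here is a small Python sketch that builds a byte-to-code-point mapping from Python's own codec data. It is not the bundled utility, and the function name `build_table` and the dictionary layout are assumptions for illustration.

```python
def build_table(codepage: str) -> dict:
    """Map each high byte (0x80-0xFF) of a single-byte codepage to its code point."""
    table = {}
    for byte in range(0x80, 0x100):
        try:
            table[byte] = ord(bytes([byte]).decode(codepage))
        except UnicodeDecodeError:
            pass  # this byte is unassigned in the codepage
    return table


table = build_table("cp1252")
print(hex(table[0x80]))    # 0x20ac, the euro sign
print(0x81 in table)       # False: 0x81 is unassigned in Python's cp1252 codec
```

Bytes below 0x80 are left out because they are plain ASCII in all of the codepages listed above.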

For instructions on how to use it, check the top of the utf-8.mrc file. You can download the script here (19 KB).

Important: This script is provided as-is without any guarantees. Use it if you like it, but don't bug me if you can't get it to work. If you find bugs, feel free to report them, but try to give a little more information than just 'it doesn't work'.

My ideal text editor

Feb 19, 2004

On the recommendation of a certain evil Norwegian, I gave EditPad Pro a whirl. Took me 10 minutes to remove it again.

Am I too picky? Maybe. Here's what I want from a text editor (in no particular order):

  • Runs on Windows 2000. Vent your anti-Microsoft anger somewhere else, I use Windows every day and I'm not likely to switch any time soon.
  • Native Unicode and UTF-8 support. This is 2004. Unicode has been around for ages, and I see no reason why I should occupy myself with encoding issues. I deal with multiple languages, so Unicode is the only logical choice. Unicode compatibility is no longer a problem thanks to the Microsoft Layer for Unicode (from now on I will shoot everyone who refers to a byte as a 'character'). Note: automatic conversion between Unicode and the current ANSI codepage doesn't cut it (that's what EditPad Pro seems to do).
  • IME-friendly, with bonus points for an integrated IME. Sometimes I type Japanese, and it requires indirect input and conversion of typed characters. Certain editors I've encountered do weird things which prevent the IME from doing its job, so that's why I mention it explicitly.
  • Advanced editing for web-development. I do a lot of HTML, CSS, PHP, SQL and JavaScript, so anything that can make coding easier is a plus. The least I want is syntax highlighting, but intelligent auto-completion, validation, previewing and other visual cues are very handy too.
  • Good user interface. This one shouldn't really be necessary to mention, but so many programs seem to miss the point here: a program should be easy to use. I'm not going to go into specifics, as there are a lot of good references on the subject around. Because I'm picky as hell, reconfigurable toolbars, panels and hotkeys score well too. Don't confuse this item with the next one, which is:
  • Nice to look at. I don't need menus that whiz by, flashy windows with skins or other novelty visual effects, but that doesn't mean my applications can be butt-ugly. Things such as proper spacing and margins, aesthetic proportions and contemporary looks are big pluses.

I don't think these are such crazy demands, so if anyone who has suffered through this rant up to now knows a program which satisfies these conditions, please post a link here ;).

Update: I've settled for Notepad2 for now. It's a small, functional, neat editor and it's open-source too.
