Unicode


PHP: Convert JavaScript-escaped Unicode characters to HTML hex references

There are cases where one might receive in PHP, escaped Unicode characters from the client side JavaScript. According to the RFC it is normal for JavaScript to convert characters to that format and in effect that we receive any character in the escaped format of \uXXXX in PHP.

Any character may be escaped.
If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF),
then it may be represented as a six-character sequence:
a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point.
The hexadecimal letters A though F can be upper or lowercase.

A sample input you might receive could look like this George\u2019s treasure box instead of George’s treasure box.

This kind of input should not be stored as is as it does not make sense to the HTML language, instead we should fix it up using preg_replace.

$decoded = preg_replace('/\\\\u([a-fA-F0-9]{4})/', '&#x\\1;', $input);

The above command will look for all instances of \uXXXX in the $input and it will replace each one with the appropriate character using the XXXX value that it will match.

What this part '/\\\\u([a-fA-F0-9]{4})/' of the code do is the following:

  • \\\\ – Find the character \ in the string, the reason we have four \ instead of one, is because it has special meaning in the regular expression and we have to escape it. For that reason we need to use two of them and get \\. After that, we need to escape each of them again due to the special meaning they have in PHP and we end up with four of them.
  • u – The previous step must be followed by a u character.
  • ([a-fA-F0-9]{4}) – After the previous step has matched, we need to match 4 characters. Each of them must be either a character from A-Z or a-z or 0-9.

This part '&#x\\1;' will:

  • &#x – Is a constant string that will print the characters &#x. These characters will instruct HTML to print the character that will occur using hexadecimal entity reference that will follow.
  • \\1 – Contains the reference of the 1st parenthesized pattern. In this case we only have a parenthesis around the XXXX part of the \uXXXX so \\1 will be replaced with the XXXX value.