Bytefreaks.net

PHP: Convert JavaScript-escaped Unicode characters to HTML hex references

There are cases where PHP receives escaped Unicode characters from client-side JavaScript. According to the JSON RFC this is normal: any character may be sent in the escaped form \uXXXX, so PHP has to be prepared to receive it. The relevant passage reads:

Any character may be escaped.
If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF),
then it may be represented as a six-character sequence:
a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character's code point.
The hexadecimal letters A though F can be upper or lowercase.

A sample input might look like this: George\u2019s treasure box, instead of George’s treasure box.

Input like this should not be stored as is, because the \uXXXX notation means nothing in HTML and would be displayed literally. Instead, we can convert it using preg_replace:

$decoded = preg_replace('/\\\\u([a-fA-F0-9]{4})/', '&#x\\1;', $input);

The above statement looks for every occurrence of \uXXXX in $input and replaces it with the HTML hexadecimal character reference &#xXXXX;, reusing the four hexadecimal digits it matched. The browser then renders that reference as the intended character.
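As a quick check, the replacement can be run against the sample input from this post. This is a minimal sketch. The call to html_entity_decode is only there to show what a browser would render and is not part of the conversion itself.

```php
<?php
// Sample input from the post: a literal backslash, "u", and four hex digits.
// Single quotes keep the backslash as-is in the PHP string.
$input = 'George\u2019s treasure box';

// Same replacement as above.
$decoded = preg_replace('/\\\\u([a-fA-F0-9]{4})/', '&#x\\1;', $input);

echo $decoded, PHP_EOL; // George&#x2019;s treasure box

// Decode the reference the way a browser would, to confirm the result.
$rendered = html_entity_decode($decoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');
echo $rendered, PHP_EOL; // George’s treasure box
```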

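The pattern '/\\\\u([a-fA-F0-9]{4})/' does the following: it matches a literal backslash followed by the lowercase letter u (the four backslashes in the single-quoted PHP string reach the regular expression engine as two, which it reads as one literal backslash), and then it captures exactly four hexadecimal digits, in either case, as group 1.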
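To see what the pattern matches and what it captures, it can be run on its own with preg_match_all. This is a small sketch. The second escape in the input (\u00e9) is an extra example that is not from the original post.

```php
<?php
// Two escapes: one from the post, one added here for illustration.
$input = 'George\u2019s treasure box \u00e9';

// $matches[0] holds the full matches, $matches[1] holds capture group 1.
preg_match_all('/\\\\u([a-fA-F0-9]{4})/', $input, $matches);

print_r($matches[0]); // the full sequences: \u2019 and \u00e9
print_r($matches[1]); // the captured hex digits: 2019 and 00e9
```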

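The replacement '&#x\\1;' writes the characters &#x, then the four hexadecimal digits captured in group 1 (the \\1 back-reference), then a closing semicolon. The result is an HTML hexadecimal character reference such as &#x2019;.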
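One caveat follows from the Basic Multilingual Plane restriction in the RFC passage quoted earlier. Characters outside that plane, such as most emoji, arrive as two consecutive \uXXXX escapes (a surrogate pair). Replacing each half on its own produces two references that are not valid HTML. The sketch below is one possible way to handle this, assuming such input matters for your application. The function name js_escapes_to_html_refs is hypothetical and not part of the original one-liner.

```php
<?php
// Hypothetical helper: combines surrogate pairs first, then falls back to
// the original single-escape replacement.
function js_escapes_to_html_refs(string $input): string
{
    // First pass: a high surrogate (D800-DBFF) immediately followed by a
    // low surrogate (DC00-DFFF) becomes one code point above U+FFFF.
    $input = preg_replace_callback(
        '/\\\\u(d[89ab][0-9a-f]{2})\\\\u(d[c-f][0-9a-f]{2})/i',
        function (array $m): string {
            $high = hexdec($m[1]);
            $low  = hexdec($m[2]);
            $codePoint = 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
            return '&#x' . strtoupper(dechex($codePoint)) . ';';
        },
        $input
    );

    // Second pass: the original replacement for every remaining escape.
    // A lone (unpaired) surrogate is still converted as-is here.
    return preg_replace('/\\\\u([a-fA-F0-9]{4})/', '&#x\\1;', $input);
}

echo js_escapes_to_html_refs('Smile \ud83d\ude00 and George\u2019s box'), PHP_EOL;
// Smile &#x1F600; and George&#x2019;s box
```

Input that contains only Basic Multilingual Plane escapes gives the same result as the one-line preg_replace call.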
