Decode Utf8
Ever needed to replace all the multibyte characters in a string with their Latin equivalents? Me neither. This handy little function uses a simple string replace to "decode" the UTF-8 characters in a string and returns the string with those characters replaced by their Latin counterparts where available. If no Latin counterpart exists, an approximation of the character is used. Of course this may be problematic when it comes to Simplified Chinese, but the user can always set the corresponding value to an empty string, or to their own interpretation.
But PHP already has methods for handling UTF-8! This is correct. PHP has iconv(), which can be used with the //TRANSLIT option, and utf8_decode() (deprecated as of PHP 8.2). Let's see how well they do.
<?php
echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", "мúĺţìбýťę śťřïňğ.");
?>
The results from above will vary depending on the locale and iconv implementation of the system it is run on, but they are less than favourable. Let's see how utf8_decode() fares.
<?php
echo utf8_decode("мúĺţìбýťę śťřïňğ.");
?>
Once again, the results are less than optimal: utf8_decode() converts to ISO-8859-1, which has no code points for characters such as м or ţ, so they come back as question marks. Now let's try with these arrays for substitution.
<?php
/**
 * Replace accented and Cyrillic characters with Latin approximations.
 *
 * @param string $string The string to convert
 *
 * @return string The converted string
 */
function decode_utf8($string)
{
$accented = array(
'À', 'Á', 'Â', 'Ã', 'Ä', 'Å', 'Æ', 'Ă', 'Ą',
'Ç', 'Ć', 'Č', 'Œ',
'Ď', 'Đ',
'à', 'á', 'â', 'ã', 'ä', 'å', 'æ', 'ă', 'ą',
'ç', 'ć', 'č', 'œ',
'ď', 'đ',
'È', 'É', 'Ê', 'Ë', 'Ę', 'Ě',
'Ğ',
'Ì', 'Í', 'Î', 'Ï', 'İ',
'Ĺ', 'Ľ', 'Ł',
'è', 'é', 'ê', 'ë', 'ę', 'ě',
'ğ',
'ì', 'í', 'î', 'ï', 'ı',
'ĺ', 'ľ', 'ł',
'Ñ', 'Ń', 'Ň',
'Ò', 'Ó', 'Ô', 'Õ', 'Ö', 'Ø', 'Ő',
'Ŕ', 'Ř',
'Ś', 'Ş', 'Š',
'ñ', 'ń', 'ň',
'ò', 'ó', 'ô', 'ö', 'ø', 'ő',
'ŕ', 'ř',
'ś', 'ş', 'š',
'Ţ', 'Ť',
'Ù', 'Ú', 'Û', 'Ų', 'Ü', 'Ů', 'Ű',
'Ý', 'ß',
'Ź', 'Ż', 'Ž',
'ţ', 'ť',
'ù', 'ú', 'û', 'ų', 'ü', 'ů', 'ű',
'ý', 'ÿ',
'ź', 'ż', 'ž',
'А', 'Б', 'В', 'Г', 'Д', 'Е', 'Ё', 'Ж', 'З', 'И', 'Й', 'К', 'Л', 'М', 'Н', 'О', 'П', 'Р',
'а', 'б', 'в', 'г', 'д', 'е', 'ё', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о', 'р',
'С', 'Т', 'У', 'Ф', 'Х', 'Ц', 'Ч', 'Ш', 'Щ', 'Ъ', 'Ы', 'Ь', 'Э', 'Ю', 'Я',
'с', 'т', 'у', 'ф', 'х', 'ц', 'ч', 'ш', 'щ', 'ъ', 'ы', 'ь', 'э', 'ю', 'я'
);
$replace = array(
'A', 'A', 'A', 'A', 'A', 'A', 'AE', 'A', 'A',
'C', 'C', 'C', 'CE',
'D', 'D',
'a', 'a', 'a', 'a', 'a', 'a', 'ae', 'a', 'a',
'c', 'c', 'c', 'ce',
'd', 'd',
'E', 'E', 'E', 'E', 'E', 'E',
'G',
'I', 'I', 'I', 'I', 'I',
'L', 'L', 'L',
'e', 'e', 'e', 'e', 'e', 'e',
'g',
'i', 'i', 'i', 'i', 'i',
'l', 'l', 'l',
'N', 'N', 'N',
'O', 'O', 'O', 'O', 'O', 'O', 'O',
'R', 'R',
'S', 'S', 'S',
'n', 'n', 'n',
'o', 'o', 'o', 'o', 'o', 'o',
'r', 'r',
's', 's', 's',
'T', 'T',
'U', 'U', 'U', 'U', 'U', 'U', 'U',
'Y', 'ss',
'Z', 'Z', 'Z',
't', 't',
'u', 'u', 'u', 'u', 'u', 'u', 'u',
'y', 'y',
'z', 'z', 'z',
'A', 'B', 'B', 'r', 'A', 'E', 'E', 'X', '3', 'N', 'N', 'K', 'N', 'M', 'H', 'O', 'N', 'P',
'a', 'b', 'b', 'r', 'a', 'e', 'e', 'x', '3', 'n', 'n', 'k', 'n', 'm', 'h', 'o', 'p',
'C', 'T', 'Y', 'O', 'X', 'U', 'u', 'W', 'W', 'b', 'b', 'b', 'E', 'O', 'R',
'c', 't', 'y', 'o', 'x', 'u', 'u', 'w', 'w', 'b', 'b', 'b', 'e', 'o', 'r'
);
return str_replace($accented, $replace, $string);
}
?>
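The two arrays above only work while every character sits at the same position as its replacement, which is easy to break when adding entries. One possible alternative, sketched here under assumptions, is a single associative array passed to strtr(). The function name decode_utf8_map(), the $overrides parameter and the abbreviated map are all illustrative and not part of the original function. The file needs to be saved as UTF-8 for the keys to match.

```php
<?php
/**
 * Sketch only: decode_utf8_map() and its shortened map are hypothetical.
 *
 * @param string $string    The string to convert
 * @param array  $overrides Optional character => replacement pairs
 *
 * @return string The converted string
 */
function decode_utf8_map($string, array $overrides = array())
{
    // Each character sits next to its replacement, so pairs cannot drift apart.
    $map = array(
        'À' => 'A', 'Æ' => 'AE', 'ç' => 'c', 'é' => 'e',
        'ß' => 'ss', 'ł' => 'l', 'ñ' => 'n', 'ü' => 'u',
        'м' => 'm', 'б' => 'b',
    );

    // Array union keeps the left-hand value, so caller-supplied pairs win.
    $map = $overrides + $map;

    // strtr() with an array does not re-scan text it has already replaced.
    return strtr($string, $map);
}

echo decode_utf8_map('façade über Straße'), "\n";
echo decode_utf8_map('über', array('ü' => 'ue')), "\n";
```

The second call shows the "own interpretation" idea from the introduction: passing array('ü' => 'ue') changes one mapping without touching the defaults, and an empty string as the value removes the character altogether.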
Example Usage
<?php
echo decode_utf8('мúĺţìбýťę śťřïňğ');
?>
Demonstration
This time the result is as expected. The string "multibyte string" is returned and a calm falls upon the earth.
Feel free to add more characters to the arrays as your character set requires, keeping each character at the same position as its replacement.
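To find out which characters still need an entry, one option is to list whatever non-ASCII characters remain after the replacement has run. The helper below is a minimal sketch; find_unmapped_chars() is a hypothetical name, and it assumes the input is valid UTF-8. The lowercase 'õ', for instance, does not appear in the $accented array above, so it would show up here.

```php
<?php
/**
 * Sketch only: find_unmapped_chars() is a hypothetical helper,
 * not part of decode_utf8().
 *
 * @param string $string Text that has already been through the replacement
 *
 * @return array Distinct non-ASCII characters still present
 */
function find_unmapped_chars($string)
{
    // The /u modifier makes the pattern match whole UTF-8 characters
    // rather than single bytes.
    $found = preg_match_all('/[^\x00-\x7F]/u', $string, $matches);

    // preg_match_all() returns false when the input is not valid UTF-8.
    if ($found === false || $found === 0) {
        return array();
    }

    // Drop duplicates and reindex from zero.
    return array_values(array_unique($matches[0]));
}

print_r(find_unmapped_chars('multibyte 字符串 and õ'));
```

Each character reported this way is a candidate for a new pair in the arrays, or for an empty-string replacement if it should simply be dropped.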