Character Set Problems

This text is aimed at users whose web servers are hosted by BCA, but it applies more generally. The page was written primarily as an aide memoire and test page for myself, so it might not make easy or exciting reading.

David Gibson, 15-Dec-2017

Update 04-Jul-2021: also see Using UTF-8


If you are using non-7-bit characters on your web pages, BCA's server upgrades may give rise to a number of subtle problems. Non-7-bit characters include the pound sign (i.e. the HTML entity &pound;) and all accented characters (e.g. if you have a <FORM> that allows users to enter their name and address).

If your pound symbols, £, are being replaced by Â£ or � or disappearing completely, read on...

BCA's britiac3 server runs PHP 5.4.45. Some of the problems reported here will arise only at a later upgrade, because they occur from PHP 5.6 onwards. However, the problem with htmlentities() already occurs (from PHP 5.4).

The origin of the problem appears to be that, from PHP 5.6.0, its php.ini file sets default_charset="UTF-8" instead of leaving that parameter empty. Every web page that is parsed by PHP will have a header added to the output, specifying Content-Type: text/html; charset=utf-8. You can test this using the PHP function get_headers().
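As a rough illustration of that check, the sketch below extracts the charset from a list of header lines such as those returned by headers_list() or get_headers(). The helper name charset_from_headers() is invented for this example (it is not a PHP built-in), and the regular expression is deliberately simple rather than a full header parser.

```php
<?php
// Hypothetical helper (not a PHP built-in): return the charset named in a
// Content-Type header line, or null if there is no such header or charset.
function charset_from_headers(array $headers)
{
    foreach ($headers as $line) {
        // Header names are case-insensitive; the charset parameter is optional.
        if (preg_match('/^Content-Type:.*;\s*charset=([^;\s]+)/i', $line, $m)) {
            return trim($m[1], '"');
        }
    }
    return null;
}

// The value PHP will use when it adds the header itself.
echo 'default_charset = ', var_export(ini_get('default_charset'), true), "\n";

// Headers queued by the current script; under the CLI this list is usually empty.
echo 'charset in headers = ', var_export(charset_from_headers(headers_list()), true), "\n";
```

The same helper could be given the array returned by get_headers() for a URL, since that array also contains one header line per element.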

The following text refers to the pound sign, but the description applies to accented characters as well. It refers to the ANSI character set, which is equivalent to Windows-1252. This is similar to ISO-8859-1 but assigns printable characters (including Š, š, Ž, ž and the euro sign) to positions 128 to 159, which ISO-8859-1 leaves to control codes. Reportedly, browsers are likely to default to Windows-1252 when ISO-8859-1 is specified. But whether that applies to non-Windows browsers, I do not know.
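The byte-level difference between these character sets can be sketched as follows. This is an illustration only, and it assumes the mbstring extension is loaded (mb_convert_encoding() is not available otherwise).

```php
<?php
// Assumption: the mbstring extension is enabled.
// The pound sign is the single byte 0xA3 in both ISO-8859-1 and Windows-1252.
$pound = "\xA3";

// Converted to UTF-8 it becomes the two bytes 0xC2 0xA3.
$utf8 = mb_convert_encoding($pound, 'UTF-8', 'Windows-1252');
echo bin2hex($utf8), "\n";   // c2a3

// Byte 0x9A shows where the two 8-bit sets differ: Windows-1252 maps it to
// s-caron (U+0161), whereas ISO-8859-1 maps it to the control code U+009A.
echo bin2hex(mb_convert_encoding("\x9A", 'UTF-8', 'Windows-1252')), "\n"; // c5a1
echo bin2hex(mb_convert_encoding("\x9A", 'UTF-8', 'ISO-8859-1')), "\n";   // c29a
```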

The effects of the problem include the following...

Examples

As an example, here is a test string, which may (or may not) appear correctly in your browser. £ | á-é-í-ñ-ö-š-ž-ü | Á-É-Í-Ñ-Ö-Š-Ž-Ü. That string was generated from a PHP statement to echo a character string. (I cannot type the characters directly into this HTML editor because it would convert them straightaway into HTML entities, which defeats the object of the test). This page was written in ANSI and those 8-bit chars are sent direct to your browser.

To demonstrate the problem, click one of the buttons below. This will re-fetch the current page with a query string that causes the page to issue a header specifying a character set for the page. The page will also have a query string element of &text=%A3+%C2%A3 added, so you can see how this is decoded.

The headers returned by PHP can be seen using the PHP function headers_list(), and are displayed below. To see the full list of response headers, the function get_headers() can be used - see the buttons below.

Form 1

Test string £ | á-é-í-ñ-ö-š-ž-ü | Á-É-Í-Ñ-Ö-Š-Ž-Ü
@$_SERVER['QUERY_STRING']
@$_GET['text']
Call this page "as is", using browser and server defaults
Execute header('Content-type: text/html;charset=ISO-8859-1'); at the start of this page.
Execute header('Content-type: text/html;charset=UTF-8'); at the start of this page
Execute get_headers() and display the output on a new page.

Result of print_r(headers_list());

Array
(
    [0] => Content-type: text/html; charset=Windows-1252
)

Results of Test

No Headers | ISO-8859 | UTF-8
Results depend on the server settings - both on the defaults and on any 'corrective settings' I have applied. | $_GET['text'] was £ Â£ | $_GET['text'] was � £
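To see why the two columns differ, it helps to look at the bytes that the query string element &text=%A3+%C2%A3 decodes to. The sketch below is not the code behind this page; is_valid_utf8() is a helper name made up for the example, and it relies on preg_match() failing on malformed UTF-8 when the /u modifier is used.

```php
<?php
// The value of "text" in the query string &text=%A3+%C2%A3.
$text = urldecode('%A3+%C2%A3');      // the bytes PHP places in $_GET['text']
echo bin2hex($text), "\n";            // a320c2a3  (0xA3, space, 0xC2 0xA3)

list($ansiPound, $utf8Pound) = explode(' ', $text);

// Hypothetical helper: is this byte string well-formed UTF-8?
// preg_match() with the /u modifier returns false on invalid UTF-8.
function is_valid_utf8($bytes)
{
    return preg_match('//u', $bytes) === 1;
}

var_dump(is_valid_utf8($ansiPound));  // bool(false): a UTF-8 page shows U+FFFD here
var_dump(is_valid_utf8($utf8Pound));  // bool(true):  a UTF-8 page shows a pound sign
```

On a page served as ISO-8859-1 or Windows-1252 the roles are reversed: the lone 0xA3 displays as a pound sign, while 0xC2 0xA3 displays as two characters, Â followed by £.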

Conclusions

Note that this is a PHP problem, not a web server problem 'as such'. As far as I know, Apache does not try to force a character set on you - but PHP does. The required character set must be set on the server. This can be done in a number of places.

...or you could specify Windows-1252.

If you are migrating from britiac2 (or, like me, you have a localhost set-up running PHP as a module) to britiac3, and you want to use all the same files, you can create a .user.ini file (which will be ignored by britiac2) but, additionally, put the following in .htaccess (which will be ignored by britiac3).

<IfModule !proxy_fcgi_module>
# If we are NOT running on britiac3...
php_value default_charset "ISO-8859-1"
</IfModule>

# Note: We cannot use <IF> and a test for the server name because
# Apache on britiac2 does not have an <IF> module. Instead, we test
# for a module that is missing on britiac2.
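
Another place where the charset can be set is the script itself. The following is a minimal sketch of that fallback, assuming that neither .user.ini nor .htaccess can be relied upon; it is not taken from this site's configuration, and it has to run before any output is sent.

```php
<?php
// Per-script fallback (an assumption for illustration, not this site's set-up).
// default_charset can be changed at run time with ini_set().
ini_set('default_charset', 'ISO-8859-1');

// The explicit header used by the test buttons on this page does the same job:
// header('Content-type: text/html;charset=ISO-8859-1');

echo ini_get('default_charset'), "\n";   // ISO-8859-1
```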

Server Settings

Parameter | Comment | This Server | Recorded values from my Localhost | Recorded values from britiac3
echo php_sapi_name(); | .user.ini is processed only by the CGI/FastCGI SAPI. | cgi-fcgi | apache2handler | cgi-fcgi
echo ini_get('max_execution_time'); | If my .user.ini is read, this should be 31 | 31 | 30 | 31
echo PHP_VERSION; | From version 5.6.0 the default character set is UTF-8 | 5.6.40 | 5.6.15 | 5.4.45
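
The values in the table above can be gathered in one go. The sketch below is only an illustration: server_settings() is a made-up function name, and the default_charset and utf8_is_default entries are additions that do not appear in the table.

```php
<?php
// Gather the values behind the table above, plus default_charset itself
// (not in the table, but it is the setting this page is about).
function server_settings()
{
    return array(
        'php_sapi_name'      => php_sapi_name(),               // e.g. cgi-fcgi, apache2handler
        'max_execution_time' => ini_get('max_execution_time'), // 31 on this site if .user.ini was read
        'PHP_VERSION'        => PHP_VERSION,
        'default_charset'    => ini_get('default_charset'),
        // True when this PHP version defaults to UTF-8 (5.6.0 onwards).
        'utf8_is_default'    => version_compare(PHP_VERSION, '5.6.0', '>='),
    );
}

foreach (server_settings() as $name => $value) {
    echo str_pad($name, 20), var_export($value, true), "\n";
}
```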

More Excruciating Detail

Form 2

The following textbox is initialised (via JavaScript's String.fromCharCode()) with the ANSI codes for £ and Â£.

text2:

This page specified (via its headers) a charset of Windows-1252.

Now submit the form back to this page, with a chosen charset header

     

$_SERVER['QUERY_STRING'] is

$_GET['text2'] is

Results of Test & Conclusions

You need to pay attention to detail in order to interpret the above results, or else it can seem as if the behaviour is inconsistent. The most interesting observation is with UTF-8.

The puzzle is why the text box seems to accept character codes as if they were ANSI. This is true even if the attribute accept-charset=utf-8 is added for the FORM element (as I have ascertained). In that situation, the characters are correctly submitted as UTF-8 (even over-riding the content-type header), but the FORM still interprets input as ANSI. Update 04-Jul-2021: Is that because I used String.fromCharCode(0xC2, 0xA3);, specifying single-byte values?
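
On the question in that update: String.fromCharCode() takes character codes (UTF-16 code units), not bytes, so String.fromCharCode(0xC2, 0xA3) builds the two characters U+00C2 and U+00A3 (Â followed by £) rather than one UTF-8 encoded pound sign. That would be consistent with the text box appearing to treat its input as ANSI, although this has not been re-tested on the form itself. The PHP sketch below shows the difference at byte level; html_entity_decode() is used here only as a convenient way of turning code points into UTF-8.

```php
<?php
// Two code points, U+00C2 and U+00A3, encoded as UTF-8: four bytes.
$twoChars = html_entity_decode('&#xC2;&#xA3;', ENT_QUOTES, 'UTF-8');
echo bin2hex($twoChars), "\n";   // c382c2a3

// One code point, U+00A3, encoded as UTF-8: the two bytes 0xC2 0xA3.
$onePound = html_entity_decode('&#xA3;', ENT_QUOTES, 'UTF-8');
echo bin2hex($onePound), "\n";   // c2a3
```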


Black Diamond Question Marks

If a browser doesn't understand a character, it uses the Unicode Character 'REPLACEMENT CHARACTER' (U+FFFD). This can be represented as an HTML entity using &#xFFFD; or &#65533;, viz: �. In UTF-8 this is the three-byte combination 0xEF 0xBF 0xBD, and this explains why my pound signs were being replaced, in my database, by ï¿½ (that is &iuml;&iquest;&frac12;), because that is the ANSI translation of those three bytes.
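
That translation can be reproduced in a couple of lines. This is a sketch for illustration, using htmlentities() with an explicit Windows-1252 charset so that each of the three bytes is treated as a separate 8-bit character.

```php
<?php
// U+FFFD encoded as UTF-8 is the three bytes 0xEF 0xBF 0xBD.
$replacement = "\xEF\xBF\xBD";

// Read as three separate ANSI characters, the same bytes become
// i-diaeresis, inverted question mark and one-half.
$asAnsi = htmlentities($replacement, ENT_QUOTES, 'Windows-1252');
echo $asAnsi, "\n";   // &iuml;&iquest;&frac12;
```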

Another Problem with PHP's htmlentities()

Windows-1252 (see https://en.wikipedia.org/wiki/Windows-1252) contains a block of 32 characters that are not specified in ISO-8859-1. Two of these characters - namely 142 (&Zcaron;) Ž and 158 (&zcaron;) ž - are not covered by PHP's translation table for Windows-1252, as used by htmlentities() (in PHP 5.6, anyway). Whether this is a PHP bug, or a Windows departure-from-standard, I do not know.

Here is the list of translations for characters 128 to 159, using htmlentities(chr($j), 0, 'Windows-1252');. Un-translated characters are listed by number only.

128 (&euro;) | 129 | 130 (&sbquo;) | 131 (&fnof;) ƒ | 132 (&bdquo;) | 133 (&hellip;) | 134 (&dagger;) | 135 (&Dagger;) | 136 (&circ;) ˆ | 137 (&permil;) | 138 (&Scaron;) Š | 139 (&lsaquo;) | 140 (&OElig;) Œ | 141 | 142 | 143 | 144 | 145 (&lsquo;) | 146 (&rsquo;) | 147 (&ldquo;) | 148 (&rdquo;) | 149 (&bull;) | 150 (&ndash;) | 151 (&mdash;) | 152 (&tilde;) ˜ | 153 (&trade;) | 154 (&scaron;) š | 155 (&rsaquo;) | 156 (&oelig;) œ | 157 | 158 | 159 (&Yuml;) Ÿ
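
A possible explanation for the two missing translations is that &Zcaron; and &zcaron; are not among the named entities of HTML 4.01, which is the table htmlentities() uses by default; they are named entities in HTML5. The sketch below, which is a guess at the kind of loop behind the list above rather than the actual page code, prints the default (ENT_HTML401) result next to the ENT_HTML5 result so that the two tables can be compared. entity_or_null() is a made-up helper name.

```php
<?php
// Hypothetical helper: return the entity for a single Windows-1252 byte,
// or null if htmlentities() leaves it untranslated (or rejects it).
function entity_or_null($byte, $flags)
{
    $out = htmlentities($byte, $flags, 'Windows-1252');
    return ($out === $byte || $out === '') ? null : $out;
}

for ($j = 128; $j <= 159; $j++) {
    $html4 = entity_or_null(chr($j), ENT_HTML401);  // same as passing flags 0
    $html5 = entity_or_null(chr($j), ENT_HTML5);    // flag available from PHP 5.4
    printf("%d  %-10s %s\n", $j,
        $html4 === null ? '-' : $html4,
        $html5 === null ? '-' : $html5);
}
```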

Below 128, only the special chars & < > " ' have HTML entity translations. Above 160, for info, the translations are...

160 (&nbsp;)   | 161 (&iexcl;) ¡ | 162 (&cent;) ¢ | 163 (&pound;) £ | 164 (&curren;) ¤ | 165 (&yen;) ¥ | 166 (&brvbar;) ¦ | 167 (&sect;) § | 168 (&uml;) ¨ | 169 (&copy;) © | 170 (&ordf;) ª | 171 (&laquo;) « | 172 (&not;) ¬ | 173 (&shy;) ­ | 174 (&reg;) ® | 175 (&macr;) ¯ | 176 (&deg;) ° | 177 (&plusmn;) ± | 178 (&sup2;) ² | 179 (&sup3;) ³ | 180 (&acute;) ´ | 181 (&micro;) µ | 182 (&para;) | 183 (&middot;) · | 184 (&cedil;) ¸ | 185 (&sup1;) ¹ | 186 (&ordm;) º | 187 (&raquo;) » | 188 (&frac14;) ¼ | 189 (&frac12;) ½ | 190 (&frac34;) ¾ | 191 (&iquest;) ¿ | 192 (&Agrave;) À | 193 (&Aacute;) Á | 194 (&Acirc;) Â | 195 (&Atilde;) Ã | 196 (&Auml;) Ä | 197 (&Aring;) Å | 198 (&AElig;) Æ | 199 (&Ccedil;) Ç | 200 (&Egrave;) È | 201 (&Eacute;) É | 202 (&Ecirc;) Ê | 203 (&Euml;) Ë | 204 (&Igrave;) Ì | 205 (&Iacute;) Í | 206 (&Icirc;) Î | 207 (&Iuml;) Ï | 208 (&ETH;) Ð | 209 (&Ntilde;) Ñ | 210 (&Ograve;) Ò | 211 (&Oacute;) Ó | 212 (&Ocirc;) Ô | 213 (&Otilde;) Õ | 214 (&Ouml;) Ö | 215 (&times;) × | 216 (&Oslash;) Ø | 217 (&Ugrave;) Ù | 218 (&Uacute;) Ú | 219 (&Ucirc;) Û | 220 (&Uuml;) Ü | 221 (&Yacute;) Ý | 222 (&THORN;) Þ | 223 (&szlig;) ß | 224 (&agrave;) à | 225 (&aacute;) á | 226 (&acirc;) â | 227 (&atilde;) ã | 228 (&auml;) ä | 229 (&aring;) å | 230 (&aelig;) æ | 231 (&ccedil;) ç | 232 (&egrave;) è | 233 (&eacute;) é | 234 (&ecirc;) ê | 235 (&euml;) ë | 236 (&igrave;) ì | 237 (&iacute;) í | 238 (&icirc;) î | 239 (&iuml;) ï | 240 (&eth;) ð | 241 (&ntilde;) ñ | 242 (&ograve;) ò | 243 (&oacute;) ó | 244 (&ocirc;) ô | 245 (&otilde;) õ | 246 (&ouml;) ö | 247 (&divide;) ÷ | 248 (&oslash;) ø | 249 (&ugrave;) ù | 250 (&uacute;) ú | 251 (&ucirc;) û | 252 (&uuml;) ü | 253 (&yacute;) ý | 254 (&thorn;) þ | 255 (&yuml;) ÿ

For clarity, note that the above-listed HTML entities are not the only ones in existence - they are just the ones corresponding to the 8-bit ANSI character set. If you are using this as your default character set, you can only enter these 8-bit characters in a FORM, but you can display other characters. For example, two authors of a paper in Cave & Karst Science 43(2) are Okan KÜLKÖYLÜOĞLU and Ozan Gönensin BOZDAĞ. The Ğ character is described, in HTML, as G&#774; where the character entity &#774; is a 'combining breve', which adds the breve accent to the previous character. In UTF-8 this would be 0xCC 0x86 but, clearly, there is no way it can be stored in a database that uses an 8-bit ANSI character set (which, for historical reasons, is the case for the C&KS data – the C&KS data is, in fact, coded as HTML entities).
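
The byte values quoted for the combining breve can be checked as follows. This is a minimal sketch; html_entity_decode() is used simply as a way of converting the numeric entity to UTF-8.

```php
<?php
// "G" followed by the combining breve U+0306 (decimal 774), as UTF-8.
$combined = html_entity_decode('G&#774;', ENT_QUOTES, 'UTF-8');
echo bin2hex($combined), "\n";   // 47cc86  ("G", then 0xCC 0x86)

// With an 8-bit target charset there is no byte for U+0306, so the entity
// is expected to come back undecoded (stated here as an expectation only).
echo html_entity_decode('G&#774;', ENT_QUOTES, 'Windows-1252'), "\n";
```

Keeping such characters as HTML entities, as the C&KS data does, is therefore one way of carrying them through an 8-bit database.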



This page, http://caves.org.uk/charset_test.html was last modified on Sun, 04 Jul 2021 09:36:30 +0000