Character Set Problems

This text is aimed at users whose web servers are hosted by BCA, but it applies more generally. The page was written primarily as an aide-memoire and test page for myself, so it might not make easy or exciting reading.

David Gibson, 15-Dec-2017

Update 04-Jul-2021: also see Using UTF-8


If you are using non-7-bit characters on your web pages, BCA's server upgrades may give rise to a number of subtle problems. Non-7-bit characters include the pound sign (i.e. the HTML entity &pound;) and all accented characters (which may arrive, for example, if you have a <FORM> that allows users to enter their name and address).

If your pound symbols, £, are being replaced by Â£ or ï¿½ or disappearing completely, read on...

BCA's britiac3 server runs PHP 5.4.45. Some of the problems reported here will arise only at a later upgrade, as they occur from PHP 5.6 onwards. However, the problem with htmlentities() already occurs (from PHP 5.4).

The origin of the problem appears to be that, from PHP 5.6.0, its php.ini file sets default_charset="UTF-8" instead of leaving that parameter empty. Every web page that is parsed by PHP will have a header added to the output, specifying Content-Type: text/html; charset=utf-8. You can test this using the PHP function get_headers().

The following text refers to the pound sign, but the description applies to accented characters as well. It refers to the ANSI character set, which here means Windows-1252. This is similar to ISO-8859-1 but assigns printable characters (such as the euro sign, curly quotes and a few accented letters) to the range 0x80 to 0x9F, where ISO-8859-1 has only control codes. Browsers that follow the WHATWG Encoding Standard treat the label ISO-8859-1 as Windows-1252, whatever the operating system.
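The difference between the two character sets can be seen by decoding the same byte both ways. This is a minimal sketch, assuming the mbstring extension is available; the variable names are illustrative only.

```php
<?php
// Assumes the mbstring extension is loaded.
// Bytes are written as \x escapes so the result does not depend on
// the encoding of this source file.
$byte = "\x80"; // euro sign in Windows-1252, a control code in ISO-8859-1

$asCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($asCp1252), "\n"; // e282ac, the UTF-8 bytes of U+20AC
echo bin2hex($asLatin1), "\n"; // c280, the UTF-8 bytes of U+0080

// The pound sign sits at 0xA3 in both character sets.
$poundCp1252 = mb_convert_encoding("\xA3", 'UTF-8', 'Windows-1252');
$poundLatin1 = mb_convert_encoding("\xA3", 'UTF-8', 'ISO-8859-1');
var_dump($poundCp1252 === $poundLatin1); // bool(true)
```

For the pound sign and the accented letters from 0xA0 upwards, then, it makes no difference which of the two names is used.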

The effects of the problem include the following...

If you have a script that includes a pound sign that has been encoded in the ANSI character set (as 0xA3), it will not be valid in UTF-8, which encodes the pound sign differently (as two bytes). Your browser probably displays such invalid bytes as a black diamond question mark, like this: �. (See more on BDQMs below.)
If you have a FORM on your web page, and the user enters a pound sign, it will be encoded in UTF-8 as 0xC2 0xA3. If this is later displayed with the ANSI character set, it will appear as Â£. This could happen if your web server generates an email containing the data, or if you write it to a database or text file and process it later, as ANSI text.
If your data contains the ANSI pound sign 0xA3 and you use the PHP function htmlentities() to attempt to convert it to &pound; for display in your browser, that will not work because the default character set for htmlentities() is now (from PHP 5.4) UTF-8, in which a lone 0xA3 byte is not a valid sequence. The function does not report the error; unless ENT_IGNORE or ENT_SUBSTITUTE is set, it returns an empty string, so the text disappears from your output. You need to tell htmlentities() what character set you are using, e.g. htmlentities($string, ENT_COMPAT | ENT_HTML401, 'ISO-8859-1'). Note that, from PHP 5.6, this problem might 'go away' because from PHP 5.6 the default character set for htmlentities() will follow the system's default_charset setting.
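The three effects above can be reproduced from the command line. This is a sketch only, assuming the mbstring extension is available. The flags are passed explicitly because the default flags of htmlentities() differ between PHP versions.

```php
<?php
// Assumes the mbstring extension is loaded.
$ansiPound = "\xA3";      // pound sign as a single Windows-1252 byte
$utf8Pound = "\xC2\xA3";  // pound sign as UTF-8

// Effect 1: the lone byte 0xA3 is not valid UTF-8.
var_dump(mb_check_encoding($ansiPound, 'UTF-8')); // bool(false)

// Effect 2: UTF-8 bytes read as Windows-1252 become two characters.
$misread = mb_convert_encoding($utf8Pound, 'UTF-8', 'Windows-1252');
echo $misread, "\n"; // Â£ when the output is viewed as UTF-8

// Effect 3: htmlentities() needs to know the character set of its input.
$flags  = ENT_COMPAT | ENT_HTML401;
$text   = 'Price: ' . $ansiPound . '5';
$wrong  = htmlentities($text, $flags, 'UTF-8');      // empty string
$right  = htmlentities($text, $flags, 'ISO-8859-1'); // Price: &pound;5
var_dump($wrong, $right);
```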

Examples

As an example, here is a test string, which may (or may not) appear correctly in your browser. £ | á-é-í-ñ-ö-š-ž-ü | Á-É-Í-Ñ-Ö-Š-Ž-Ü. That string was generated from a PHP statement to echo a character string. (I cannot type the characters directly into this HTML editor because it would convert them straight away into HTML entities, which defeats the object of the test.) This page was written in ANSI and those 8-bit characters are sent direct to your browser.

To demonstrate the problem, click one of the buttons below. This will re-fetch the current page with a query string that causes the page to issue a header specifying a character set for the page. The page will also have a query string element of &text=%A3+%C2%A3 added, so you can see how this is decoded.

The headers returned by PHP can be seen using the PHP function headers_list(), and are displayed below. To see the full list of response headers, the function get_headers() can be used - see the buttons below.

Form 1

Result of print_r(headers_list());

Array
(
    [0] => Content-type: text/html; charset=Windows-1252
)
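To pick the character set out of such a list of headers, something like the following could be used. The function name charset_from_headers() is hypothetical and not part of this page's code; it is a sketch that simply takes the last Content-Type header naming a charset.

```php
<?php
/**
 * Hypothetical helper: return the charset named in the last Content-Type
 * header of a list of header lines, or null if none names one.
 */
function charset_from_headers(array $headers)
{
    $charset = null;
    foreach ($headers as $line) {
        if (preg_match('/^Content-Type:.*?charset=([^;\s]+)/i', $line, $m)) {
            $charset = trim($m[1], '"\'');
        }
    }
    return $charset;
}

// Sample data copied from the print_r() output above.
$sample = ['Content-type: text/html; charset=Windows-1252'];
echo charset_from_headers($sample), "\n"; // Windows-1252

// In a web page:    charset_from_headers(headers_list());
// For a remote URL: get_headers($url) returns an array of header lines,
//                   or false on failure, so check before passing it in.
```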

Results of Test

No Headers	ISO-8859	UTF-8
Results depend on the server settings - both on the defaults and on any 'corrective settings' I have applied.	$_GET['text'] was £ Â£	$_GET['text'] was � £
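The two results in the table can be reproduced by decoding the query string value and reading the same four bytes in each character set. This sketch assumes the mbstring extension; it uses htmlspecialchars() with ENT_SUBSTITUTE as a stand-in for what the browser does with an invalid byte.

```php
<?php
// Assumes the mbstring extension is loaded.
$raw = urldecode('%A3+%C2%A3');   // the bytes A3 20 C2 A3
echo bin2hex($raw), "\n";         // a320c2a3

// Read as Windows-1252 (the ISO-8859 column):
$asAnsi = mb_convert_encoding($raw, 'UTF-8', 'Windows-1252');

// Read as UTF-8, substituting U+FFFD for the invalid lone byte 0xA3
// (the UTF-8 column):
$asUtf8 = htmlspecialchars($raw, ENT_COMPAT | ENT_SUBSTITUTE, 'UTF-8');

// Both results are UTF-8 strings, so view this output as UTF-8.
echo $asAnsi, "\n"; // £ Â£
echo $asUtf8, "\n"; // � £
```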

Conclusions

Note that this is a PHP problem, not a web server problem 'as such'. As far as I know, Apache does not try to force a character set on you - but PHP does. The required character set must be set on the server. This can be done in a number of places.

header('Content-type: text/html;charset=ISO-8859-1'); executed at the start of each page. Warning: 'charset' must be in lower case or else it will not over-ride any previous setting of 'charset'.
default_charset="ISO-8859-1" in .user.ini but note that this file is processed only by the CGI/FastCGI SAPI - it works on britiac3, but not on my localhost.
php_value default_charset ISO-8859-1 in .htaccess but that will cause a 500 Server Error if it is executed in the CGI/FastCGI SAPI, so it works on my localhost, but not on britiac3.
default_charset="ISO-8859-1" in php.ini if you have sufficient admin privileges.

...or you could specify Windows-1252.
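For the first option, the header() call could be wrapped in a small function included at the top of every page. The name send_charset_header() is hypothetical; this is one possible arrangement, not the code used on this site.

```php
<?php
/**
 * Hypothetical helper: build the Content-Type header line for a charset
 * and send it if no output has started yet. Returns the line either way.
 */
function send_charset_header($charset = 'ISO-8859-1')
{
    // 'charset' is kept in lower case, as the warning above requires.
    $line = 'Content-type: text/html;charset=' . $charset;
    if (!headers_sent()) {
        header($line);
    }
    return $line;
}

// Call this before the page produces any output.
$sent = send_charset_header('Windows-1252');
```

Checking headers_sent() first avoids the "headers already sent" warning if some output has slipped out before the call, although in that case the charset will not be changed.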

If you are migrating from britiac2 (or, like me, you have a localhost set-up running PHP as a module) to britiac3, and you want to use all the same files, you can create a .user.ini file (which will be ignored by britiac2) but, additionally, put the following in .htaccess (which will be ignored by britiac3).

<IfModule !proxy_fcgi_module>
# If we are NOT running on britiac3...
php_value default_charset "ISO-8859-1"
</IfModule>

# Note: We cannot use <IF> and a test for the server name because
# Apache on britiac2 does not have an <IF> module. Instead, we test
# for a module that is missing on britiac2.

Server Settings

Parameter	Comment	This Server	Recorded values from my Localhost	Recorded values from britiac3
echo php_sapi_name();	.user.ini is processed only by the CGI/FastCGI SAPI.	cgi-fcgi	apache2handler	cgi-fcgi
echo ini_get('max_execution_time');	If my .user.ini is read, this should be 31	31	30	31
echo PHP_VERSION;	From version 5.6.0 the default character set is UTF-8	5.6.40	5.6.15	5.4.45
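The values in this table can be gathered in one go. The sketch below adds default_charset and a version check, neither of which is in the table; the function name charset_diagnostics() is hypothetical.

```php
<?php
// Collect the settings listed in the table above, plus default_charset.
function charset_diagnostics()
{
    return [
        'sapi'               => php_sapi_name(),
        'max_execution_time' => ini_get('max_execution_time'),
        'php_version'        => PHP_VERSION,
        'default_charset'    => ini_get('default_charset'),
        // True from PHP 5.6.0, where default_charset defaults to UTF-8.
        'is_5_6_or_later'    => version_compare(PHP_VERSION, '5.6.0', '>='),
    ];
}

foreach (charset_diagnostics() as $name => $value) {
    echo str_pad($name, 20), var_export($value, true), "\n";
}
```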

More Excruciating Detail

Form 2

The following textbox is initialised (via JavaScript's String.fromCharCode()) with the ANSI codes for £ and Â£.

text2:

This page specified (via its headers) a charset of Windows-1252.

Now submit the form back to this page, with a chosen charset header

$_SERVER['QUERY_STRING'] is

$_GET['text2'] is

Results of Test & Conclusions

You need to pay attention to detail in order to interpret the above results, or else it can seem as if the behaviour is inconsistent. The most interesting observation is with UTF-8.

Click the UTF-8 button, so that you are presented with a page for which the browser tells you that it is going to use UTF-8. Ignore the 'results' at this point as they may be erroneous.
Note that the text box still interprets the character codes as if they were ANSI. Why is this?
Add a pound symbol to the text box, to give £ Â£ £. This is to check that user input is treated the same way as initialisation via JavaScript.
Click the UTF-8 button
Note that the query string shows that all the text input was interpreted as UTF-8 before it was sent to the server.

The puzzle is why the text box seems to accept character codes as if they were ANSI. This is true even if the attribute accept-charset=utf-8 is added to the FORM element (as I have ascertained). In that situation, the characters are correctly submitted as UTF-8 (even over-riding the content-type header), but the FORM still interprets input as ANSI. Update 04-Jul-2021: Is that because I used String.fromCharCode(0xC2, 0xA3), specifying single-byte values?
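One likely explanation is that String.fromCharCode() takes UTF-16 code units rather than bytes, so String.fromCharCode(0xC2, 0xA3) creates the two characters U+00C2 and U+00A3, which are Â and £ whatever the page's charset. On that reading the text box always holds Unicode characters, and the charset only decides how they are turned into bytes when the form is submitted. The PHP sketch below imitates that submission step; it assumes PHP 7.0 or later and the mbstring extension, and it is an illustration rather than a trace of what the browser actually does.

```php
<?php
// Assumes PHP 7.0+ (for "\u{...}" escapes) and the mbstring extension.
// Text box content as code points: U+00A3, space, U+00C2, U+00A3.
$textBox = "\u{A3} \u{C2}\u{A3}"; // held here as UTF-8

// Submitted from a UTF-8 page: each character is sent as its UTF-8 bytes.
$utf8Query = urlencode($textBox);

// Submitted from a Windows-1252 page: one byte per character.
$ansiQuery = urlencode(mb_convert_encoding($textBox, 'Windows-1252', 'UTF-8'));

echo $utf8Query, "\n"; // %C2%A3+%C3%82%C2%A3
echo $ansiQuery, "\n"; // %A3+%C2%A3
```

The second value matches the query string element &text=%A3+%C2%A3 used by Form 1.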

Black Diamond Question Marks

If a browser doesn't understand a character, it uses the Unicode character 'REPLACEMENT CHARACTER' (U+FFFD). This can be represented as an HTML entity using &#xFFFD; or &#65533;, viz: �. In UTF-8 this is the three-byte combination 0xEF 0xBF 0xBD, which explains why my pound signs were being replaced, in my database, by ï¿½ (that is, &iuml;&iquest;&frac12;): that is the ANSI translation of those three bytes.
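The byte arithmetic can be checked as follows. This is a sketch assuming PHP 7.0 or later and the mbstring extension.

```php
<?php
// Assumes PHP 7.0+ (for "\u{...}" escapes) and the mbstring extension.
$replacement = "\u{FFFD}";           // REPLACEMENT CHARACTER as UTF-8
echo bin2hex($replacement), "\n";    // efbfbd

// Read those three bytes as Windows-1252 and re-encode for display.
$misread = mb_convert_encoding($replacement, 'UTF-8', 'Windows-1252');
echo $misread, "\n";                 // ï¿½ when the output is viewed as UTF-8
```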

Another Problem with PHP's htmlentities()

Windows-1252 (see https://en.wikipedia.org/wiki/Windows-1252) contains a block of 32 characters that are not specified in ISO-8859-1. Two of these characters, namely 142 (&Zcaron;) Ž and 158 (&zcaron;) ž, are not covered by PHP's translation table for Windows-1252, as used by htmlentities() (in PHP 5.6, anyway). Whether this is a PHP bug, or a Windows departure-from-standard, I do not know.

Here is the list of translations for characters 128 to 159, using htmlentities(chr($j), 0, 'Windows-1252'); un-translated characters are shown without an entity name.

128 (&euro;) € | 129 | 130 (&sbquo;) ‚ | 131 (&fnof;) ƒ | 132 (&bdquo;) „ | 133 (&hellip;) … | 134 (&dagger;) † | 135 (&Dagger;) ‡ | 136 (&circ;) ˆ | 137 (&permil;) ‰ | 138 (&Scaron;) Š | 139 (&lsaquo;) ‹ | 140 (&OElig;) Œ | 141 | 142 Ž | 143 | 144 | 145 (&lsquo;) ‘ | 146 (&rsquo;) ’ | 147 (&ldquo;) “ | 148 (&rdquo;) ” | 149 (&bull;) • | 150 (&ndash;) – | 151 (&mdash;) — | 152 (&tilde;) ˜ | 153 (&trade;) ™ | 154 (&scaron;) š | 155 (&rsaquo;) › | 156 (&oelig;) œ | 157 | 158 ž | 159 (&Yuml;) Ÿ
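A list of this kind can be built with a short loop. One possible explanation for the two missing entries, which I have not verified across PHP versions, is that &Zcaron; and &zcaron; are named entities in HTML5 but not in HTML 4.01, and a flags value of 0 selects the HTML 4.01 table. The sketch below lets the two tables be compared; entity_list() is a hypothetical name.

```php
<?php
// Build the translation list for a range of Windows-1252 byte values.
// $flags selects the entity table: ENT_HTML401 (value 0, as used above)
// or ENT_HTML5.
function entity_list($from, $to, $flags)
{
    $out = [];
    for ($j = $from; $j <= $to; $j++) {
        $translated = htmlentities(chr($j), $flags, 'Windows-1252');
        // An unchanged single byte means no entity was found.
        $out[$j] = ($translated === chr($j)) ? null : $translated;
    }
    return $out;
}

$html401 = entity_list(128, 159, ENT_HTML401);
$html5   = entity_list(128, 159, ENT_HTML5);

var_dump($html401[138]); // &Scaron;
var_dump($html401[142]); // NULL: no entity in the HTML 4.01 table
var_dump($html5[142]);   // compare with the HTML5 table
```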

Below 128, only the special chars & < > " ' have HTML entity translations. From 160 upwards, for info, the translations are...

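160 (&nbsp;) | 161 (&iexcl;) ¡ | 162 (&cent;) ¢ | 163 (&pound;) £ | 164 (&curren;) ¤ | 165 (&yen;) ¥ | 166 (&brvbar;) ¦ | 167 (&sect;) § | 168 (&uml;) ¨ | 169 (&copy;) © | 170 (&ordf;) ª | 171 (&laquo;) « | 172 (&not;) ¬ | 173 (&shy;) | 174 (&reg;) ® | 175 (&macr;) ¯ | 176 (&deg;) ° | 177 (&plusmn;) ± | 178 (&sup2;) ² | 179 (&sup3;) ³ | 180 (&acute;) ´ | 181 (&micro;) µ | 182 (&para;) ¶ | 183 (&middot;) · | 184 (&cedil;) ¸ | 185 (&sup1;) ¹ | 186 (&ordm;) º | 187 (&raquo;) » | 188 (&frac14;) ¼ | 189 (&frac12;) ½ | 190 (&frac34;) ¾ | 191 (&iquest;) ¿ | 192 (&Agrave;) À | 193 (&Aacute;) Á | 194 (&Acirc;) Â | 195 (&Atilde;) Ã | 196 (&Auml;) Ä | 197 (&Aring;) Å | 198 (&AElig;) Æ | 199 (&Ccedil;) Ç | 200 (&Egrave;) È | 201 (&Eacute;) É | 202 (&Ecirc;) Ê | 203 (&Euml;) Ë | 204 (&Igrave;) Ì | 205 (&Iacute;) Í | 206 (&Icirc;) Î | 207 (&Iuml;) Ï | 208 (&ETH;) Ð | 209 (&Ntilde;) Ñ | 210 (&Ograve;) Ò | 211 (&Oacute;) Ó | 212 (&Ocirc;) Ô | 213 (&Otilde;) Õ | 214 (&Ouml;) Ö | 215 (&times;) × | 216 (&Oslash;) Ø | 217 (&Ugrave;) Ù | 218 (&Uacute;) Ú | 219 (&Ucirc;) Û | 220 (&Uuml;) Ü | 221 (&Yacute;) Ý | 222 (&THORN;) Þ | 223 (&szlig;) ß | 224 (&agrave;) à | 225 (&aacute;) á | 226 (&acirc;) â | 227 (&atilde;) ã | 228 (&auml;) ä | 229 (&aring;) å | 230 (&aelig;) æ | 231 (&ccedil;) ç | 232 (&egrave;) è | 233 (&eacute;) é | 234 (&ecirc;) ê | 235 (&euml;) ë | 236 (&igrave;) ì | 237 (&iacute;) í | 238 (&icirc;) î | 239 (&iuml;) ï | 240 (&eth;) ð | 241 (&ntilde;) ñ | 242 (&ograve;) ò | 243 (&oacute;) ó | 244 (&ocirc;) ô | 245 (&otilde;) õ | 246 (&ouml;) ö | 247 (&divide;) ÷ | 248 (&oslash;) ø | 249 (&ugrave;) ù | 250 (&uacute;) ú | 251 (&ucirc;) û | 252 (&uuml;) ü | 253 (&yacute;) ý | 254 (&thorn;) þ | 255 (&yuml;) ÿ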

For clarity, note that the above-listed HTML entities are not the only ones in existence - they are just the ones corresponding to the 8-bit ANSI character set. If you are using this as your default character set, you can only enter these 8-bit characters in a FORM, but you can display other characters. For example, two authors of a paper in Cave & Karst Science 43(2) are Okan KÜLKÖYLÜOĞLU and Ozan Gönensin BOZDAĞ. The Ğ character is described, in HTML, as a G followed by a numeric character reference for U+0306 (for example, G&#774;), which is a 'combining breve' that adds the breve accent to the previous character. In UTF-8 the combining breve would be 0xCC 0x86 but, clearly, there is no way it can be stored in a database that uses an 8-bit ANSI character set (which, for historical reasons, is the case for the C&KS data – the C&KS data is, in fact, coded as HTML entities).
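Storing such characters as numeric character references in an 8-bit field might look like the following. This is a sketch assuming PHP 7.0 or later and the mbstring extension; it is not the code that produced the C&KS data.

```php
<?php
// Assumes PHP 7.0+ (for "\u{...}" escapes) and the mbstring extension.
$gBreve = "G\u{0306}";          // G followed by COMBINING BREVE
echo bin2hex("\u{0306}"), "\n"; // cc86

// Replace every non-ASCII character with a decimal character reference,
// leaving a string that fits in an 8-bit (or even 7-bit) field.
// The convmap is [first code point, last code point, offset, mask].
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$stored  = mb_encode_numericentity($gBreve, $convmap, 'UTF-8');
echo $stored, "\n";             // G&#774;

// Decoding restores the original UTF-8 string.
$restored = mb_decode_numericentity($stored, $convmap, 'UTF-8');
var_dump($restored === $gBreve); // bool(true)
```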


This page, http://caves.org.uk/charset_test.html was last modified on Sun, 04 Jul 2021 09:36:30 +0000

Test string	£ | á-é-í-ñ-ö-š-ž-ü | Á-É-Í-Ñ-Ö-Š-Ž-Ü
	Call this page "as is", using browser and server defaults
	Execute header('Content-type: text/html;charset=ISO-8859-1'); at the start of this page.
	Execute header('Content-type: text/html;charset=UTF-8'); at the start of this page
	Execute get_headers() and display the output on a new page.