This page was written primarily as an aide memoire and test page for myself, so it might not make easy or exciting reading.
David Gibson, 04-Jul-2021
My previous notes on Character Set Problems focussed on the problem of interpreting pound signs correctly in ISO-8859-1. But we need to move on from that. There are plenty of accented characters that ISO-8859-1 cannot encode, for which we need to use UTF-8. Also, HTML 5 defaults to UTF-8, so it is sensible to use it. The problem for me is that I tend to keep my data in text documents that I edit in a simple text editor. When it reads UTF-8 characters it replaces them with a ?. So... some tests ...
This page has now loaded, and reports... No data written from GET string to file(s)
My ancient HTML editor doesn't like an HTML 5 doctype, so this page is specified as HTML 4. Because <meta charset='UTF-8'> is HTML 5 syntax and is not part of HTML 4, I cannot rely on it here and instead use a PHP line before the doctype (header() must be called before any output), saying
header("Content-type: text/html;charset=utf-8");
To check whether that setting has been understood, we can ask JS via document.characterSet, which reports as follows.
If this page is accessed via a server, it should report UTF-8. If it is accessed via a file:/// address it should report windows-1252.
1. When this page is parsed, PHP fetches a string from a test file and outputs it below. The PHP code is
$utf8 = file_get_contents('charset_data.utf8'); echo $utf8;
À³
2. We can also use PHP to output a script that is executed to update a FORM element, as follows. (View the source code of this page to see the script that has been written.)
When this page is parsed, PHP also fetches a URL-encoded string from a test file and outputs it below.
%EF%BB%BF%C3%80%C2%B3
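The URL-encoded string above begins with %EF%BB%BF, which is the UTF-8 byte order mark (BOM), followed by the bytes for À (%C3%80) and ³ (%C2%B3). A minimal sketch of detecting and removing that BOM before the string is echoed into a page; the helper names has_utf8_bom and strip_utf8_bom are hypothetical and not part of this page's actual code.

```php
<?php
// UTF-8 byte order mark: the three bytes EF BB BF.
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: true if $s starts with the UTF-8 BOM.
function has_utf8_bom(string $s): bool
{
    return strncmp($s, UTF8_BOM, 3) === 0;
}

// Hypothetical helper: return $s without a leading BOM, if one is present.
function strip_utf8_bom(string $s): string
{
    return has_utf8_bom($s) ? substr($s, 3) : $s;
}

// The URL-encoded test string shown on this page, decoded back to raw bytes.
$raw = rawurldecode('%EF%BB%BF%C3%80%C2%B3');

echo bin2hex($raw), "\n";                  // efbbbfc380c2b3
echo bin2hex(strip_utf8_bom($raw)), "\n";  // c380c2b3
```

Only core string functions are used, so this should not depend on the mbstring extension being installed.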
We can attempt to convert the original UTF-8 string into HTML entities...
À³
... but it doesn't convert everything that it should do. What's the point of that? What have I missed?
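One possible explanation, offered as an assumption about what was observed rather than a diagnosis: htmlentities() only replaces characters that have a named entity in the table it is using, so anything outside that table passes through unchanged. A sketch of an alternative that replaces every non-ASCII character with a numeric character reference instead; utf8_codepoint and to_numeric_entities are hypothetical helper names, and the input is assumed to be valid UTF-8.

```php
<?php
// Decode one UTF-8 encoded character (1 to 4 bytes) to its Unicode code point.
function utf8_codepoint(string $ch): int
{
    $b = array_values(unpack('C*', $ch));
    switch (count($b)) {
        case 1:  return $b[0];
        case 2:  return (($b[0] & 0x1F) << 6) | ($b[1] & 0x3F);
        case 3:  return (($b[0] & 0x0F) << 12) | (($b[1] & 0x3F) << 6) | ($b[2] & 0x3F);
        default: return (($b[0] & 0x07) << 18) | (($b[1] & 0x3F) << 12)
                      | (($b[2] & 0x3F) << 6) | ($b[3] & 0x3F);
    }
}

// Hypothetical helper: escape the HTML special characters, then replace
// every remaining non-ASCII character with a numeric reference (&#NNN;).
function to_numeric_entities(string $utf8): string
{
    $escaped = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');
    $out = preg_replace_callback(
        '/[^\x00-\x7F]/u',   // the /u flag makes the pattern match whole characters
        function (array $m): string {
            return '&#' . utf8_codepoint($m[0]) . ';';
        },
        $escaped
    );
    return $out ?? '';       // preg_replace_callback returns null on invalid UTF-8
}

$s = "\xC3\x80\xC2\xB3";     // the bytes of the test string, without the BOM

echo htmlentities($s, ENT_QUOTES, 'UTF-8'), "\n";  // &Agrave;&sup3;
echo to_numeric_entities($s), "\n";                // &#192;&#179;
```

Numeric references work for any character, whether or not it has a named entity, at the cost of being harder to read in the page source.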
Now we need to verify that we can write correctly to a simple text file. Clicking the SUBMIT box will submit the INPUT box as a GET string. It is submitted back to this page and, on receiving it, this page should report that it has saved it to charset_data.utf8. The page then proceeds to load, as before.
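A minimal sketch of how the receiving side of that round trip might look. It is not the code this page actually runs: the function name save_get_string, the field name 'text' and the wording of the success and error messages are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: save the submitted value to $path and return a status message.
// 'text' is an assumed field name; the real form may use a different one.
function save_get_string(array $get, string $path): string
{
    if (!isset($get['text']) || !is_string($get['text']) || $get['text'] === '') {
        return 'No data written from GET string to file(s)';
    }
    $value = $get['text'];   // PHP has already URL-decoded the query string

    // An empty pattern with the /u flag matches only if the subject is valid UTF-8.
    if (preg_match('//u', $value) !== 1) {
        return 'Rejected: submitted value is not valid UTF-8';
    }
    if (file_put_contents($path, $value, LOCK_EX) === false) {
        return 'Could not write to ' . $path;
    }
    return 'Saved ' . strlen($value) . ' bytes to ' . $path;
}

// In the page itself this might be called as:
//   echo save_get_string($_GET, 'charset_data.utf8');
```

Checking that the value is valid UTF-8 before writing keeps a badly encoded submission from corrupting the test file.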
The problem (as confirmed on Stack Overflow) is that TextPad does not, in fact, allow the display of UTF-8 characters, despite what it says in the manual. Apparently TextPad 8 is OK, but I have not tried that yet. A temporary solution would seem to be that all the affected characters should be URL-encoded before storing them in my files - and should be decoded again before outputting.
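The URL-encoding workaround can be sketched with PHP's rawurlencode() and rawurldecode(). The helper names encode_for_storage and decode_for_output are hypothetical, and escaping the decoded text with htmlspecialchars() is an added assumption about how the value would be echoed into the page.

```php
<?php
// Hypothetical helper: make UTF-8 text safe for an ASCII-only editor before saving.
function encode_for_storage(string $utf8): string
{
    return rawurlencode($utf8);
}

// Hypothetical helper: turn a stored value back into UTF-8 and escape it for HTML.
function decode_for_output(string $stored): string
{
    return htmlspecialchars(rawurldecode($stored), ENT_QUOTES, 'UTF-8');
}

$stored = encode_for_storage("\xC3\x80\xC2\xB3");  // the bytes of the test string
echo $stored, "\n";                                 // %C3%80%C2%B3
echo bin2hex(decode_for_output($stored)), "\n";     // c380c2b3
```

One drawback of this approach is that rawurlencode() also encodes spaces and most ASCII punctuation, so the stored files become harder to read and edit by hand.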
Word will accept a UTF-8 encoded text file, but it asks to confirm the encoding method. Adding a Byte Order Mark (BOM) at the start of the document doesn't seem to assist in this - Word still wants confirmation.
This page, http://caves.org.uk/charset_utf8.html was last modified on Sun, 04 Jul 2021 16:51:21 +0000