Using UTF-8

This page was written primarily as an aide memoire and test page for myself, so it might not make easy or exciting reading.

David Gibson, 04-Jul-2021


My previous notes on Character Set Problems focussed on the problem of interpreting pound signs correctly in ISO-8859-1. But we need to move on from that. There are plenty of accented characters that ISO-8859-1 cannot encode, and for those we need UTF-8. Also, UTF-8 is the encoding that HTML 5 expects, so it is sensible to use it. The problem for me is that I tend to make use of text documents that I edit in a simple text editor, and when it reads UTF-8 characters it replaces them with a ?. So... some tests...

Processing the GET data

This page has now loaded, and reports... No data written from GET string to file(s)

Opening this document

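My ancient HTML editor doesn't like an HTML 5 doctype, so this page is specified as HTML 4. The short form <meta charset='UTF-8'> is HTML 5 syntax, so I have not used it under an HTML 4 doctype. Instead I use a PHP line before the doctype, which sends the charset in the HTTP header (and an HTTP header takes precedence over any META declaration), saying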

header("Content-type: text/html;charset=utf-8");

To check whether that setting has been understood, we can ask JS via document.characterSet, which reports as follows.

 
   

If this page is accessed via a server, it should report UTF-8. If it is accessed via a file:/// address, the PHP header is never sent, so the browser falls back to its default and should report windows-1252.

Reading files

a) UTF-8

1. When this page is parsed, PHP fetches a string from a test file and outputs it below. The PHP code is

$utf8 = file_get_contents('charset_data.utf8'); echo $utf8;
À³

2. We can also use PHP to output a script that is executed to update a FORM element, as follows. (See source code of this page to see the script that has been written).

          

  1. Display some £ signs. When typed directly from the keyboard, as part of a JS string, my editor encodes £ as the single ANSI (windows-1252) byte 0xA3. That byte is not a valid UTF-8 sequence on its own, so when the page is rendered as UTF-8 the black diamond question mark (the Unicode replacement character, U+FFFD) is displayed.
  2. Display £s by encoding as document.form.Text1.value = 'Testing \u00A3\u00A3'. Explanation: the UTF-16 encoding for £ is 0x00A3, and JS uses UTF-16 internally (i.e. character strings hold 16-bit values). But in UTF-8 the two bytes that encode that Unicode code point are 0xC2 0xA3, hence the sequence %C2%A3 given by the urlencode() function, below.
  3. Display a couple of complicated accented Vietnamese words
  4. Display a long Vietnamese test string
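
The byte-level reasoning in items 1 and 2 can be checked with a short JS sketch. This is only an illustration, not the script this page actually writes: the variable names are made up, and it assumes an environment that provides TextEncoder and TextDecoder (modern browsers and Node.js do).

```javascript
// A lone ANSI byte 0xA3 (how windows-1252 stores a pound sign) is not valid
// UTF-8, so a UTF-8 decoder substitutes U+FFFD, the black diamond question mark.
const ansiPound = new Uint8Array([0xA3]);
const decodedAnsi = new TextDecoder('utf-8').decode(ansiPound);
console.log(decodedAnsi === '\uFFFD'); // true

// The escape \u00A3 is a single UTF-16 code unit inside the JS string...
const pound = '\u00A3';
console.log(pound.length); // 1

// ...but it becomes two bytes, 0xC2 0xA3, when encoded as UTF-8.
const utf8Bytes = Array.from(new TextEncoder().encode(pound));
console.log(utf8Bytes.map((b) => b.toString(16)).join(' ')); // c2 a3

// Percent-encoding works on those UTF-8 bytes, which is why the URL-encoded
// form of a pound sign is %C2%A3.
const percentEncoded = encodeURIComponent(pound);
console.log(percentEncoded); // %C2%A3
```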

b) URL-encoded

When this page is parsed, PHP also fetches a URL-encoded string from a test file and outputs it below.

%EF%BB%BF%C3%80%C2%B3
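
Reading the URL-encoded string byte by byte: the first three bytes, EF BB BF, are the UTF-8 Byte Order Mark, followed by C3 80 (À) and C2 B3 (³). A minimal JS sketch of the decoding follows; the variable names are mine, and stripping the BOM is an optional extra step, not something this page does.

```javascript
// The URL-encoded test string shown above.
const stored = '%EF%BB%BF%C3%80%C2%B3';

// decodeURIComponent treats the percent-encoded bytes as UTF-8.
const decodedText = decodeURIComponent(stored);

// List the code points: the BOM (U+FEFF) comes first, then the two characters.
const codePoints = Array.from(
  decodedText,
  (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
);
console.log(codePoints.join(' ')); // U+FEFF U+00C0 U+00B3

// Optionally drop a leading BOM before using the text.
const withoutBom = decodedText.replace(/^\uFEFF/, '');
console.log(withoutBom.length); // 2
```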

c) HTML entities

We can attempt to convert the original UTF-8 string into HTML entities...

&Agrave;&sup3;

... but it doesn't convert everything that it should do. What's the point of that? What have I missed?
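
One possible explanation, offered as an assumption because the conversion code is not shown here: PHP's htmlentities() only converts characters that have a named entity, and many characters (most accented Vietnamese letters, for instance) have none. Numeric character references cover any code point. A rough JS sketch of that alternative, using a hypothetical helper name toNumericEntities:

```javascript
// Replace every non-ASCII character with a hexadecimal numeric character
// reference. ASCII is left untouched, so &, < and > are NOT escaped here.
function toNumericEntities(text) {
  let out = '';
  for (const ch of text) { // for...of iterates by code point, not code unit
    const cp = ch.codePointAt(0);
    out += cp > 0x7F ? '&#x' + cp.toString(16).toUpperCase() + ';' : ch;
  }
  return out;
}

console.log(toNumericEntities('\u00C0\u00B3')); // &#xC0;&#xB3;
console.log(toNumericEntities('Vi\u1EC7t'));    // Vi&#x1EC7;t
```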

Writing files

Now we need to verify that we can write correctly to a simple text file. Clicking the SUBMIT box will submit the INPUT box as a GET string. It is submitted back to this page and, on receiving it, this page should report that it has saved it to charset_data.utf8. The page then proceeds to load, as before.
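
For reference, here is roughly what the GET string looks like when the page is served as UTF-8. This is a sketch using the standard URLSearchParams class rather than the browser's real form submission; the field name Text1 is taken from the script above, and the test value is an assumption.

```javascript
// Build the query string the way a UTF-8 form submission would encode it:
// the space becomes + and each pound sign becomes %C2%A3.
const query = new URLSearchParams({ Text1: 'Testing \u00A3\u00A3' }).toString();
console.log(query); // Text1=Testing+%C2%A3%C2%A3

// Parsing the query string recovers the original text.
const received = new URLSearchParams(query).get('Text1');
console.log(received === 'Testing \u00A3\u00A3'); // true
```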

Compatibility with Textpad and Word

The problem (as confirmed on Stack Overflow) is that Textpad does not, in fact, allow the display of UTF-8 characters, despite what it says in the manual. Apparently Textpad 8 is OK, but I have not tried that yet. A temporary solution would seem to be that all the affected characters should be URL-encoded before storing them in my files, and decoded again before outputting.
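
A minimal sketch of that temporary solution, assuming only the non-ASCII characters (plus any literal % signs, so that decoding stays unambiguous) need encoding. The function names are hypothetical and this is not code from this page.

```javascript
// Percent-encode runs of non-ASCII characters and literal % signs, leaving
// the rest of the text readable in an ASCII-only editor.
function encodeForAsciiEditor(text) {
  return text.replace(/(?:[^\x00-\x7F]|%)+/gu, (run) => encodeURIComponent(run));
}

// Reverse the encoding before the text is output.
function decodeFromAsciiEditor(text) {
  return decodeURIComponent(text);
}

const encodedLine = encodeForAsciiEditor('Price: \u00A35 (100%)');
console.log(encodedLine); // Price: %C2%A35 (100%25)
console.log(decodeFromAsciiEditor(encodedLine) === 'Price: \u00A35 (100%)'); // true
```

Encoding the % sign as well is a design choice: without it, a stored line that already contained something like %41 would be altered on decoding.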

Word will accept a UTF-8 encoded text file, but it asks to confirm the encoding method. Adding a Byte Order Mark (BOM) at the start of the document doesn't seem to assist in this: Word still asks for confirmation.

Concluding Remarks


This page, http://caves.org.uk/charset_utf8.html, was last modified on Sun, 04 Jul 2021 16:51:21 +0000