I hadn't really thought about it before, but I happily use HTML entities within my UTF-8 documents. I know this works, but browsers can be very forgiving, and I couldn't help wondering whether it was actually "legal".
I thought I'd email The Goodwitch herself with the question, as I figured being the co-lead of the International Liaison Group she'd know a thing or three about such things. As ever she did not disappoint and I share her email for those who might (now) be curious about the answer:
I concur with your thoughts on UTF-8 and entities. Both &amp;amp; and &amp;#38; will "pass the test" in an XHTML doc with UTF-8 character encoding. In reality, when you use the named character entity, it will be converted by the browser to the numeric entity. So &amp;amp; is fine.
If you are storing data to a database, the answer is different: you would stick with the &amp;#38; to stay in pure utf-8.
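To convince myself that the named and numeric forms really are interchangeable, here is a minimal sketch in Python (my choice of language for illustration, not something from Glenda's email). It assumes the standard library's html.unescape() behaves like a browser's entity decoding for these simple cases. Strictly speaking, a parser resolves both forms to the same character rather than turning one into the other, but the end result is identical.

```python
import html

# Named, decimal, and hexadecimal references for the ampersand.
named = "Fish &amp; chips"
decimal = "Fish &#38; chips"
hexadecimal = "Fish &#x26; chips"

# Decode each form; all three should yield the same plain text.
decoded = [html.unescape(s) for s in (named, decimal, hexadecimal)]

# Encode the decoded text as UTF-8: the ampersand is just one ordinary byte.
utf8_bytes = decoded[0].encode("utf-8")

print(decoded)
print(utf8_bytes)
```

Whichever form you type, the character that ends up in the document is the same.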
When I mentioned I'd like to blog the answer she also sent me this additional note on databases:
Thanks Glenda! :)
You might want to add the following (which is an excerpt from a conversation I was having with Ralph Brandi about utf-8 and named character entities):
This depends on the database and how you get the information in and out of it. For example, in PHP, there's a function to escape HTML characters when you retrieve information (two functions, actually: htmlspecialchars() and htmlentities()). If you're using one of those functions, then you would want to store an ampersand as simply "&" in the database and let the PHP function expand it to an entity. You'll also want to make sure that your database is set up to use UTF-8, and that the tables in which you're storing the data are also set to use UTF-8. To ensure that you're not accidentally storing entities in the database, you might want to filter data that you're inputting through the html_entity_decode() function.
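The "decode on the way in, escape on the way out" pattern described above might look something like the sketch below. I've written it in Python rather than PHP, so html.unescape() and html.escape() are only rough stand-ins for html_entity_decode() and htmlspecialchars(). The function names clean_for_storage and render_for_html are hypothetical, made up for this example.

```python
import html


def clean_for_storage(submitted: str) -> str:
    """Decode any entities so only plain text is stored.

    Rough analogue of PHP's html_entity_decode() (an assumption, not an exact match).
    """
    return html.unescape(submitted)


def render_for_html(stored: str) -> str:
    """Escape &, <, > and quotes when writing stored text into a page.

    Rough analogue of PHP's htmlspecialchars().
    """
    return html.escape(stored, quote=True)


# Input that already contains entities, as might arrive from a form.
stored = clean_for_storage("Tom &amp; Jerry &lt;3 caf&eacute;")
rendered = render_for_html(stored)

print(stored)
print(rendered)
```

The stored value holds a bare "&" and a real "é", and only the markup-significant characters are turned back into entities on output. The "é" is left alone, which is fine as long as the page itself is served as UTF-8.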
If you just want to store the raw XHTML, then you would avoid these functions. Either &amp;amp; or &amp;#38; should work fine in that case.
From a theoretical pie-in-the-sky perspective, I would say that one should store the data without the entities. There may come a time when you need to output the data as XML, and XML by default supports a very limited set of entities (unless you tell your document to incorporate a separate DTD that defines the entities). From a practical standpoint, this may not be an issue. I don't know the particular circumstance, so I couldn't say one way or the other.
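The XML point is easy to check. XML predefines only five entities (amp, lt, gt, quot and apos), so an HTML-only name such as eacute is rejected unless a DTD declares it. Here is a small sketch, again in Python and using the standard library's xml.etree.ElementTree parser as a stand-in for whatever XML tooling you might actually use:

```python
import xml.etree.ElementTree as ET

# A numeric reference and the predefined &amp; need no DTD, so this parses.
numeric = ET.fromstring("<p>caf&#233; &amp; bar</p>")

# &eacute; is an HTML named entity, not one of XML's five predefined ones,
# so a plain XML parser with no DTD refuses it.
try:
    ET.fromstring("<p>caf&eacute; &amp; bar</p>")
    named_error = None
except ET.ParseError as exc:
    named_error = str(exc)

print(numeric.text)
print(named_error)
```

Data stored as plain UTF-8 text sidesteps the problem entirely, since there are no HTML-specific entities left to trip over when the output format changes.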