Download now or view the code for the ExtendedHttpUtility class, which you can include in your project.
I used this Html Entity Lookup to complete the switch statements in the code.
Here's the first version of the code.
About the code
Originally posted to this blog about Unicode
to Html Entity and back again, this method can be used to convert non-ASCII characters
to a form that won't be 'munged' in emails or some databases.
HtmlEncode
The .NET Framework has an HtmlEncode method (in the HttpUtility class
and in the HttpServerUtility class). However, while it will 'encode'
upper-range/extended ASCII characters (e.g. decimal 160 - 255) as Html Entities
(e.g. this format: &#1234;), it ignores the remaining Unicode range
(thousands of characters).
That means that while it will encode the characters used by
various European languages (such as é) it will not 'fix'
Japanese, Chinese, Korean or other Unicode characters which you may
also want to encode...
This code (updated 12-July-04)
contains an 'extended' HtmlEncode and HtmlDecode method which will
process all characters outside the narrow, standard ASCII range.
The encode method follows the same pattern as the 'standard' Framework
function (similar to the IL code here).
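As a rough illustration of that pattern, here is a minimal sketch of an encoder that walks the string one character at a time and writes a numeric entity for everything above 127. It is not the code from the download: the class name `EntityEncodeSketch` and the method name `Encode` are hypothetical, and the surrogate-pair handling is an assumption about what an 'extended' encoder would probably want.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical sketch; NOT the author's ExtendedHtmlUtility implementation.
public static class EntityEncodeSketch
{
    public static string Encode(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;

        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            switch (c)
            {
                // Markup-significant ASCII characters get named entities.
                case '&': sb.Append("&amp;"); break;
                case '<': sb.Append("&lt;"); break;
                case '>': sb.Append("&gt;"); break;
                case '"': sb.Append("&quot;"); break;
                default:
                    if (c <= 127)
                    {
                        // Plain 7-bit ASCII passes through untouched.
                        sb.Append(c);
                        break;
                    }
                    int codePoint = c;
                    // Assumption: combine a valid surrogate pair into one code point.
                    // A lone surrogate is written out as its own UTF-16 value.
                    if (char.IsHighSurrogate(c) && i + 1 < text.Length
                        && char.IsLowSurrogate(text[i + 1]))
                    {
                        codePoint = char.ConvertToUtf32(c, text[i + 1]);
                        i++; // skip the low surrogate we just consumed
                    }
                    sb.Append("&#")
                      .Append(codePoint.ToString(CultureInfo.InvariantCulture))
                      .Append(';');
                    break;
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, `EntityEncodeSketch.Encode("é")` gives `&#233;`, and the output never contains a character above 127, which is the property that matters for 7-bit storage.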
The decode method uses a Regular Expression to find entities in the HTML
and resolve them to Unicode characters.
@"([&][#](?'decimal'[0-9]+);)|([&][#][xX](?'hex'[0-9a-fA-F]+);)|([&](?'html'\w+);)"
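To show how a pattern like this can drive the decode step, here is a minimal sketch that reuses the expression quoted above (writing the hex prefix as `[xX]`) with `Regex.Replace` and a match evaluator. The class name `EntityDecodeSketch`, the method names and the tiny named-entity table are all hypothetical; the real code resolves named entities with much larger switch statements.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical sketch; NOT the author's ExtendedHtmlUtility implementation.
public static class EntityDecodeSketch
{
    // Same three alternatives as the quoted pattern: decimal, hex, named.
    private static readonly Regex EntityPattern = new Regex(
        @"([&][#](?'decimal'[0-9]+);)|([&][#][xX](?'hex'[0-9a-fA-F]+);)|([&](?'html'\w+);)");

    // Assumption: a handful of named entities, for illustration only.
    private static readonly Dictionary<string, string> Named =
        new Dictionary<string, string>
        {
            { "amp", "&" }, { "lt", "<" }, { "gt", ">" },
            { "quot", "\"" }, { "nbsp", "\u00A0" }, { "eacute", "\u00E9" }
        };

    public static string Decode(string html)
    {
        if (string.IsNullOrEmpty(html)) return html;
        return EntityPattern.Replace(html, Resolve);
    }

    // Turns one matched entity into its character, or leaves it unchanged.
    private static string Resolve(Match m)
    {
        int codePoint;
        if (m.Groups["decimal"].Success
            && int.TryParse(m.Groups["decimal"].Value, NumberStyles.None,
                            CultureInfo.InvariantCulture, out codePoint))
        {
            return FromCodePoint(codePoint, m.Value);
        }
        if (m.Groups["hex"].Success
            && int.TryParse(m.Groups["hex"].Value, NumberStyles.AllowHexSpecifier,
                            CultureInfo.InvariantCulture, out codePoint))
        {
            return FromCodePoint(codePoint, m.Value);
        }
        string named;
        if (m.Groups["html"].Success
            && Named.TryGetValue(m.Groups["html"].Value, out named))
        {
            return named;
        }
        return m.Value; // unknown or out-of-range entity: keep the original text
    }

    private static string FromCodePoint(int codePoint, string fallback)
    {
        bool valid = codePoint >= 0 && codePoint <= 0x10FFFF
                     && !(codePoint >= 0xD800 && codePoint <= 0xDFFF);
        return valid ? char.ConvertFromUtf32(codePoint) : fallback;
    }
}
```

Because each match is replaced exactly once, a string such as `&amp;#233;` decodes to the literal text `&#233;` and is not decoded a second time.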
The ExtendedHtmlUtility methods are used as follows:
string encoded = ExtendedHtmlUtility.HtmlEntityEncode("test string with Unicode chars and & < > é");
string decoded = ExtendedHtmlUtility.HtmlEntityDecode(encoded); // in: "string with & < > é"
This can help you accept Unicode data for storage in a 7-bit
database text field while still allowing it to be read and displayed
in a browser. WARNING: don't send entity-encoded strings
to JavaScript windows, because they probably won't be displayed correctly.
P.S. finding info for this code resulted in that rare breed: a
single Google result.
It's not quite a GoogleWhack because
it has three words.
Useful links
UrlEncode vs. HtmlEncode
From aspnetresources.com...
Representing characters in HTML
Numeric Character References (NCRs), Character Entity References (i18nguy.com)
W3C Entity Reference
"A character entity reference is an SGML construct that references a character of the document character set."
W3C Character References
mentions the &#x123; hexadecimal format, which I had not seen before
HttpServerUtility Class
On MSDN... as the above article says, this doco seems confused between HtmlEncode/Decode and UrlEncode/Decode
HttpUtility Class
On MSDN... another class (the 'static version'?).
HTML::Entities (Perl)