  ConceptDevelopment.NET

View the code for the ExtendedHtmlUtility class, which you can include in your project. I used this Html Entity Lookup to complete the switch statements in the code.
Here's the first version of the code.


About the code
Originally posted to this blog under 'Unicode to Html Entity and back again', this code can be used to convert non-ASCII characters to a form that won't be 'munged' in emails or some databases.

 

HtmlEncode 
The .NET Framework has an HtmlEncode method (in the HttpUtility class and in the HttpServerUtility class). However, while it will 'encode' upper-range/extended ASCII characters (e.g. decimal 128 - 255) as Html Entities (e.g. in the format &#1234;), it ignores the remaining Unicode range (thousands of characters).

That means that while it will encode the characters used by various European languages (such as é) it will not 'fix' Japanese, Chinese, Korean or other Unicode characters which you may also want to encode...

This code (updated 12-July-04) contains 'extended' HtmlEncode and HtmlDecode methods which will process all characters outside the narrow, standard ASCII range.

The encode method follows the same pattern as the 'standard' Framework function (similar to the IL code here).
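As a rough illustration of that pattern, here is a minimal sketch of an encoder that walks the string one character at a time. It is not the ExtendedHtmlUtility source: the class and method names (EntityEncodeSketch, EncodeNonAscii) are made up for this example, and it always writes numeric entities such as &#233; where the real code uses switch statements to pick named entities such as &eacute;.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical sketch, not the author's ExtendedHtmlUtility implementation.
public static class EntityEncodeSketch
{
    public static string EncodeNonAscii(string input)
    {
        if (input == null) return null;

        var sb = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            switch (c)
            {
                // The characters that are special in HTML markup.
                case '&': sb.Append("&amp;"); break;
                case '<': sb.Append("&lt;"); break;
                case '>': sb.Append("&gt;"); break;
                case '"': sb.Append("&quot;"); break;
                default:
                    if (c > 127)
                    {
                        // Anything outside 7-bit ASCII becomes a decimal entity.
                        int codePoint = c;
                        if (char.IsHighSurrogate(c) && i + 1 < input.Length
                            && char.IsLowSurrogate(input[i + 1]))
                        {
                            // Combine a surrogate pair into one code point.
                            codePoint = char.ConvertToUtf32(c, input[i + 1]);
                            i++;
                        }
                        sb.Append("&#")
                          .Append(codePoint.ToString(CultureInfo.InvariantCulture))
                          .Append(';');
                    }
                    else
                    {
                        sb.Append(c);
                    }
                    break;
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, EncodeNonAscii("a & é") gives "a &amp; &#233;", and the Japanese character 日 (U+65E5) comes out as &#26085;.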

The decode method uses a Regular Expression to find entities in the HTML and resolve them to Unicode characters.
@"([&][#](?'decimal'[0-9]+);)|([&][#][xX](?'hex'[0-9a-fA-F]+);)|([&](?'html'\w+);)"
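To show how a pattern with those three named groups can drive a decoder, here is a small sketch built on Regex.Replace with a MatchEvaluator. The names (EntityDecodeSketch, Decode, ResolveEntity) are invented for this example, the hex alternative is written as [xX], and the named-entity table holds only a handful of entries, so treat it as an outline of the approach and not the actual ExtendedHtmlUtility code.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical sketch, not the author's ExtendedHtmlUtility implementation.
public static class EntityDecodeSketch
{
    private static readonly Regex EntityPattern = new Regex(
        @"([&][#](?'decimal'[0-9]+);)|([&][#][xX](?'hex'[0-9a-fA-F]+);)|([&](?'html'\w+);)");

    // Deliberately tiny lookup table; a full version needs every HTML entity name.
    private static readonly Dictionary<string, string> Named =
        new Dictionary<string, string>
        {
            { "amp", "&" }, { "lt", "<" }, { "gt", ">" },
            { "quot", "\"" }, { "eacute", "\u00E9" }
        };

    public static string Decode(string input)
    {
        if (input == null) return null;
        return EntityPattern.Replace(input, ResolveEntity);
    }

    private static string ResolveEntity(Match m)
    {
        if (m.Groups["decimal"].Success)
            return FromCodePoint(m.Groups["decimal"].Value, NumberStyles.None, m.Value);
        if (m.Groups["hex"].Success)
            return FromCodePoint(m.Groups["hex"].Value, NumberStyles.AllowHexSpecifier, m.Value);

        // Unknown names are left untouched instead of being guessed at.
        string replacement;
        return Named.TryGetValue(m.Groups["html"].Value, out replacement)
            ? replacement
            : m.Value;
    }

    private static string FromCodePoint(string digits, NumberStyles style, string fallback)
    {
        int cp;
        if (!int.TryParse(digits, style, CultureInfo.InvariantCulture, out cp))
            return fallback;
        // Reject values that are not valid Unicode scalar values.
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return fallback;
        return char.ConvertFromUtf32(cp);
    }
}
```

In this sketch, Decode("&#233; &#xE9; &eacute;") returns "é é é", while an unrecognised name such as &bogus; passes through unchanged.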

The class is used as follows:
string encoded = ExtendedHtmlUtility.HtmlEntityEncode(
                 "test string with Unicode chars and & < > é");

// in: "test string with Unicode chars and &amp; &lt; &gt; &eacute;"
string decoded = ExtendedHtmlUtility.HtmlEntityDecode(encoded);

This can help you accept Unicode data for storage in a 7-bit database text field while still allowing it to be read and displayed in a browser. WARNING: don't send entity-encoded strings to JavaScript windows, because they probably won't be displayed correctly.
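The 7-bit storage point can be checked with a quick experiment. The sketch below stands in for a 7-bit text field by pushing a string through Encoding.ASCII and back; SevenBitDemo and ThroughAscii are names made up for this illustration, and a real database column may mangle characters differently.

```csharp
using System;
using System.Text;

// Hypothetical demo: simulates a 7-bit field with an ASCII round trip.
public static class SevenBitDemo
{
    public static string ThroughAscii(string s)
    {
        // Encoding.ASCII replaces anything above 127 with '?' by default.
        byte[] bytes = Encoding.ASCII.GetBytes(s);
        return Encoding.ASCII.GetString(bytes);
    }
}
```

Here ThroughAscii("café") comes back as "caf?", so the accented character is lost, whereas the entity-encoded form "caf&#233;" comes back unchanged and can still be rendered by a browser.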

P.S. finding info for this code resulted in that rare breed: a single Google result.
It's not quite a GoogleWhack because it has three words.

Useful links

UrlEncode vs. HtmlEncode
From aspnetresources.com...


Representing characters in HTML
Numeric Character References (NCRs), Character Entity References (i18nguy.com)

W3C Entity Reference
"A character entity reference is an SGML construct that references a character of the document character set."

W3C Character References
mentions the &#x123; hexadecimal format which I had not seen before


HttpServerUtility Class
On MSDN... as the above article says, this doco seems confused between HtmlEncode/Decode and UrlEncode/Decode

HttpUtility Class
On MSDN... another class (the 'static version'?).

HTML::Entities (Perl)