Code?

There isn't really any code - but these are related:

  • NCharDet
  • HtmlEncode
  • NileGlobal
    Enjoy.


    About Charsets & Unicode 

    ASCII
    First, there was ASCII - the American Standard Code for Information Interchange - using an enormous 7 bits to map 128 (count them!) characters, including control characters, the English alphabet, digits, punctuation and some other bits and pieces.
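
As a small illustration of the 7-bit limit, the sketch below checks whether a string stays inside the 128 ASCII code points. The helper name `is_ascii` and the sample strings are made up for this example.

```python
# Every ASCII character has a code point in the range 0..127,
# which is exactly what fits in 7 bits.
ASCII_SIZE = 2 ** 7  # 128 code points


def is_ascii(text: str) -> bool:
    """Return True if every character in text is within the 7-bit ASCII range."""
    return all(ord(ch) < ASCII_SIZE for ch in text)


letter_a = ord("A")             # 65, i.e. 0x41
plain_ok = is_ascii("Hello")    # True: plain English letters
accented_ok = is_ascii("café")  # False: 'é' is code point 233, outside ASCII
```

Anything beyond unaccented English text falls outside the range, which is the problem the extended code pages tried to solve.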

    Extended codepages
    When presented with the problem of storing and displaying other languages on early computers, simply extending the underlying representation to 9, 10 or more bits to cope with more characters really wasn't as simple as it might sound. The underlying architecture (hardware, compilers, etc) was focussed on shuffling 8 bits around, and existing software was already using ASCII, EBCDIC or something of the sort. Various encodings were proposed, most of them using the spare eighth bit to add another 128 characters, so the same byte value could mean a different character depending on which code page was in use.
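
To see why competing 8-bit code pages cause trouble, here is a minimal sketch that decodes one and the same byte under three different code pages. The three code pages are an arbitrary selection for illustration; the codec names are the ones Python's standard library uses.

```python
# One byte above the ASCII range, interpreted under three 8-bit code pages.
raw = bytes([0xE9])

CODE_PAGES = ["latin-1", "cp1251", "cp437"]

# Map each code page name to the character it assigns to byte 0xE9.
decoded = {name: raw.decode(name) for name in CODE_PAGES}

for name, ch in decoded.items():
    print(f"{name:8} 0xE9 -> {ch} (U+{ord(ch):04X})")
```

The byte never changes, yet it comes out as a Latin, a Cyrillic and a Greek letter in turn. Unless the reader knows which code page the writer used, the text is garbled.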

    Unicode
    The 'ultimate code-page', Unicode defines an enormous set of code-points to represent (almost) every character used in any language! These code-points can be encoded in different forms built on 7-bit, 8-bit, 16-bit or 32-bit units (UTF-7, UTF-8, UTF-16 and UTF-32 respectively).
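
The difference between a code-point and its encoded form can be sketched as follows. The sample characters and the helper `encoded_lengths` are chosen for this example only; the big-endian variants are used so that no byte-order mark is added to the output.

```python
# Three code points: one ASCII, one elsewhere in the Basic Multilingual
# Plane, and one outside it.
SAMPLES = ["A", "\u20ac", "\U0001F600"]  # U+0041, U+20AC, U+1F600

ENCODINGS = ["utf-8", "utf-16-be", "utf-32-be"]


def encoded_lengths(ch: str) -> dict:
    """Return the number of bytes each encoding needs for a single character."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


for ch in SAMPLES:
    print(f"U+{ord(ch):04X}")
    for enc in ENCODINGS:
        print(f"  {enc:10} {ch.encode(enc).hex()}")
```

UTF-8 uses between one and four bytes per code-point, UTF-16 uses two or four, and UTF-32 always uses four. The code-point itself is the same in every case.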

    Using Unicode
    Development environments like Microsoft .NET and Java now use Unicode internally to represent a 'character', making the development of multilingual applications significantly easier. However, you still need to be aware of how you display characters to the user, particularly on the web.
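
For the web case, two common options are to send the page as UTF-8 and say so, or to stay within ASCII and write everything else as numeric character references. The sketch below shows both in Python. The function names and the sample text are invented for this example, and it is not a description of how any of the tools listed on this page work.

```python
import html


def to_utf8_page(text: str) -> tuple:
    """Escape markup characters and encode as UTF-8, with a matching header value."""
    body = html.escape(text).encode("utf-8")
    return "text/html; charset=utf-8", body


def to_ascii_html(text: str) -> bytes:
    """Escape markup characters, then turn non-ASCII characters into &#NNN; references."""
    escaped = html.escape(text)  # handles &, <, > and quotes
    return escaped.encode("ascii", "xmlcharrefreplace")


content_type, utf8_body = to_utf8_page("café <b>")
ascii_body = to_ascii_html("café <b>")
```

The first form is compact but depends on the browser being told the right charset. The second survives being read under almost any ASCII-compatible charset, at the cost of larger output.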

    Useful links

    Unicode.org
    kinda obvious...

    i18nguy.com on Unicode
    more references than you can poke a stick at...

    character code issues
    A tutorial on character code issues by Jukka "Yucca" Korpela.

    Unicode Case Mapping
    A discussion of how Mozilla handles ToUpper and ToLower case conversions for Unicode data.