  ConceptDevelopment.NET
  Download now

Download the C# code [164kb] - a Visual Studio 2005 solution with a project for the NCharDet assembly and a Console Application project that you can use to test it.
About the code
All I have done is 'port' the Java code to C# - it took about an hour of pretty basic syntax conversion.

The original download [91kb] was a #develop project published under an outdated license.


 

NCharDet : character-set detection (guessing)
July 2004  

Read about JCharDet first. It is a Java port of the C++ code used in the Mozilla and Firefox browsers.

Using (hardcoded) data on the frequency with which certain characters (and therefore certain byte patterns) appear in each language/charset, the code attempts to guess the character-set by matching the actual frequency distribution of bytes in the input-data with one of the pre-determined datasets (one for each character-set).
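To make the idea of frequency matching concrete, here is a minimal sketch in C#. It is not the NCharDet or Mozilla algorithm: the class and method names are hypothetical, the profiles are supplied by the caller rather than hardcoded, and a plain sum of squared differences stands in for whatever scoring the real detector uses.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; none of these names come from NCharDet.
public static class FrequencyGuessSketch
{
    // Build a normalised histogram: the fraction of the input taken up by each byte value.
    public static double[] Histogram(byte[] data, int len)
    {
        double[] hist = new double[256];
        if (len == 0) return hist;
        for (int i = 0; i < len; i++) hist[data[i]] += 1.0;
        for (int b = 0; b < 256; b++) hist[b] /= len;
        return hist;
    }

    // Sum of squared differences between two distributions; smaller means more alike.
    public static double Distance(double[] a, double[] b)
    {
        double sum = 0.0;
        for (int i = 0; i < 256; i++)
        {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Return the name of the profile closest to the observed byte distribution,
    // or null when no profiles were supplied.
    public static string Guess(byte[] data, int len, IDictionary<string, double[]> profiles)
    {
        double[] observed = Histogram(data, len);
        string best = null;
        double bestDistance = double.MaxValue;
        foreach (KeyValuePair<string, double[]> p in profiles)
        {
            double d = Distance(observed, p.Value);
            if (d < bestDistance)
            {
                bestDistance = d;
                best = p.Key;
            }
        }
        return best;
    }
}
```

In this sketch each profile is a 256-entry array of expected byte frequencies, one per candidate character-set; a caller could build toy profiles by running Histogram over sample text in each encoding.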

Update June 2007: The code has been re-published with new license terms, and is provided as a Visual Studio 2005 solution.


C# port of Java port of...

This 'simple console app' code does just what you would expect: given an input stream it attempts to determine what character-set was used:

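Uri url = new Uri(argv[0]);
HttpWebRequest req = (HttpWebRequest) WebRequest.Create(url.AbsoluteUri);

// Declared outside the try block so it is still in scope for the read loop below.
HttpWebResponse imp = null;
try {
    imp = (HttpWebResponse) req.GetResponse();
} catch (System.Net.WebException we) { // remote url not found, 404; other error
    Console.Out.WriteLine("Web Request Error " + we.Message);
    return; // nothing to read (assumes this code sits in a void method such as Main)
}
byte[] buf = new byte[1024];
int len;
bool done = false;
bool isAscii = true;

// det is the detector instance, created before this snippet (not shown here).
// Stream.Read returns 0 at end of stream in .NET (Java's read returns -1).
System.IO.Stream stream = imp.GetResponseStream();
while ((len = stream.Read(buf, 0, buf.Length)) > 0) {
    // Check if the stream is only ascii.
    if (isAscii)
        isAscii = det.isAscii(buf, len);
    // DoIt if non-ascii and not done yet.
    if (!isAscii && !done)
        done = det.DoIt(buf, len, false);
}
det.DataEnd();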
The Console application seems to work...

NOTE: I have NOT done exhaustive testing of the detection algorithm itself. I've just assumed that it worked before, and that my C# port (since it compiles and runs :) works just as well.

NOTE: it appears from the latest Mozilla source that the Java port (and hence this C# port) is slightly out of date (why? there are more files in the Mozilla src directory). That may or may not be fixed.


What to do with NCharDet?

If you are building an application that downloads content from the web (like Searcharoo, for example), then you need to know which character-set/encoding has been used before you can convert the byte[] into a string (since .NET uses Unicode internally).
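As a small illustration of why the encoding matters, the sketch below decodes the same bytes with two different charset names. The class and method names are hypothetical; Encoding.GetEncoding and GetString are standard .NET calls.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class DecodeSketch
{
    // Decode raw bytes using the named charset (for example "utf-8" or "iso-8859-1").
    public static string Decode(byte[] raw, string charsetName)
    {
        return Encoding.GetEncoding(charsetName).GetString(raw);
    }
}
```

For example, the bytes 0x63 0x61 0x66 0xC3 0xA9 decode to "café" as utf-8, but to "cafÃ©" as iso-8859-1, because the two-byte UTF-8 sequence for "é" is read as two separate Latin-1 characters.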

For example, the code below 'assumes' that a page is UTF-8 encoded if no HTTP Header or META Content-Type is specified.

string enc = "utf-8"; // default
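if (webresponse.ContentEncoding != String.Empty) {
    // Use the HttpHeader Content-Type in preference to the one set in META.
    // Note: on System.Net.HttpWebResponse, ContentEncoding reports the Content-Encoding
    // header (e.g. gzip); the Content-Type charset is exposed as CharacterSet.
    htmldoc.Encoding = webresponse.ContentEncoding;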
} else if (htmldoc.Encoding == String.Empty) {
    // TODO: if still no encoding determined, try to readline the stream until we find either
    // * META Content-Type or * </head> (ie. stop looking for META)
    htmldoc.Encoding = enc; // default
}
//http://www.c-sharpcorner.com/Code/2003/Dec/ReadingWebPageSources.asp
System.IO.StreamReader stream = new System.IO.StreamReader
                (webresponse.GetResponseStream(), Encoding.GetEncoding(htmldoc.Encoding) );
In the final else clause (where it says TODO) you could use NCharDet to attempt to identify the charset programmatically. Because detection is a guess, this does not guarantee the right answer, but it makes it more likely that Encoding.GetEncoding(htmldoc.Encoding) returns the correct encoding, so that the contents of the file can be processed correctly.
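One way the fallback order could be arranged is sketched below. This is an assumption-laden outline, not Searcharoo or NCharDet code: ChooseCharset and IsKnownToDotNet are hypothetical names, and the detector is passed in as a plain delegate so that no particular NCharDet API is assumed.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class EncodingFallbackSketch
{
    // Pick a charset name in order of preference:
    // HTTP header, then META tag, then a detector's guess, then the utf-8 default.
    public static string ChooseCharset(string headerCharset, string metaCharset,
                                       Func<byte[], string> guessCharset, byte[] sample)
    {
        if (!String.IsNullOrEmpty(headerCharset)) return headerCharset;
        if (!String.IsNullOrEmpty(metaCharset)) return metaCharset;

        // guessCharset stands in for a call into a detector such as NCharDet.
        string guessed = guessCharset != null ? guessCharset(sample) : null;
        if (!String.IsNullOrEmpty(guessed) && IsKnownToDotNet(guessed)) return guessed;

        return "utf-8"; // default, as in the snippet above
    }

    // True when Encoding.GetEncoding accepts the name on this platform.
    public static bool IsKnownToDotNet(string name)
    {
        try
        {
            Encoding.GetEncoding(name);
            return true;
        }
        catch (ArgumentException)
        {
            return false;
        }
        catch (NotSupportedException)
        {
            return false;
        }
    }
}
```

Checking the guessed name before using it keeps an unrecognised charset name from the detector from throwing later, when the StreamReader is constructed.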

Useful links

Mozilla Public License 1.1
The latest C++ code is published under the MPL, and NCharDet is now published under the same license.
About Copyleft

Two projects implementing NCharDet:

(thanks Pedro)

Netscape Public License 1.1
The original C# and C++ code and the Java port were published under the NPL (other Mozilla licence info).