Download now
Download the C# code [164kb]
- a Visual Studio 2005 solution with a project for the NCharDet assembly and a Console Application project
that you can use to test it.
About the code
All I have done is 'port' the Java code
to C# - it took about an hour of pretty basic syntax conversion.
The original download [91kb]
was a #develop project published under an outdated license.
NCharDet : character-set detection (guessing)
July 2004
Read about JCharDet first. It is
a Java port of the C++ code used in the Mozilla and Firefox browsers.
Using hardcoded data on how frequently certain characters (and therefore certain byte patterns)
appear in each language/charset, the code attempts to guess the character set by matching the
actual frequency distribution of bytes in the input data against the pre-determined datasets
(one for each character set).
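The matching idea can be pictured with a toy sketch. This is NOT NCharDet's actual data or scoring: the class name `FrequencyGuess`, the per-byte histogram and the sum-of-absolute-differences distance are all assumptions chosen only to illustrate "compare the input's byte distribution with one stored profile per charset".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not part of NCharDet.
static class FrequencyGuess
{
    // Normalised histogram: the fraction of the input taken up by each byte value.
    public static double[] Histogram(byte[] data, int len)
    {
        var h = new double[256];
        if (len == 0) return h;
        for (int i = 0; i < len; i++) h[data[i]]++;
        for (int b = 0; b < 256; b++) h[b] /= len;
        return h;
    }

    // Sum of absolute differences between two histograms (0 means identical).
    public static double Distance(double[] a, double[] b)
    {
        double d = 0;
        for (int i = 0; i < 256; i++) d += Math.Abs(a[i] - b[i]);
        return d;
    }

    // Name of the profile whose histogram is closest to the input's, or null if there are none.
    public static string Guess(byte[] data, int len, IDictionary<string, double[]> profiles)
    {
        double[] actual = Histogram(data, len);
        string best = null;
        double bestDistance = double.MaxValue;
        foreach (KeyValuePair<string, double[]> p in profiles)
        {
            double d = Distance(actual, p.Value);
            if (d < bestDistance) { bestDistance = d; best = p.Key; }
        }
        return best;
    }
}
```

In this sketch the profiles would be built once from sample text known to be in each charset (for example with `Histogram`), then `Guess` is called on the downloaded bytes.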
Update June 2007: The code has been re-published with new license terms, and provided as a
Visual Studio 2005 solution.
Update September 2008: Marcos pointed me to a CodeProject article
describing how to access MLang from C#. MLang is the Microsoft-supplied C++ library used by
Internet Explorer (among other applications) to perform the same task as NCharDet. If you don't need
a completely managed-code solution, wrapping MLang will almost certainly provide you with more accurate
results (and possibly better performance) than the NCharDet code.
C# port of Java port of...
This 'simple console app' code does just what you
would expect: given an input stream it attempts to determine what character-set was used:
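Uri url = new Uri(argv[0]);
HttpWebRequest req = (HttpWebRequest) WebRequest.Create(url.AbsoluteUri);

// Declared outside the try block so the response is still in scope for the read loop
HttpWebResponse imp;
try {
    imp = (HttpWebResponse) req.GetResponse();
} catch (System.Net.WebException we) {
    Console.Out.WriteLine("Web Request Error " + we.Message);
    return;  // no response, so there is nothing to inspect
}

// 'det' is the NCharDet detector instance, created before this point (not shown here)
byte[] buf = new byte[1024];
int len;
bool done = false;
bool isAscii = true;
System.IO.Stream stream = imp.GetResponseStream();
// In .NET, Stream.Read returns 0 at the end of the stream (not -1 as in Java)
while ((len = stream.Read(buf, 0, buf.Length)) > 0) {
    // Keep checking until the first non-ASCII byte is seen
    if (isAscii)
        isAscii = det.isAscii(buf, len);
    // Feed non-ASCII data to the detector until it reports that it is done
    if (!isAscii && !done)
        done = det.DoIt(buf, len, false);
}
det.DataEnd();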
and seems to work...
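The loop above only hands data to the detector once a non-ASCII byte has turned up. A minimal sketch of such a check, assuming it tests nothing more than the high bit of each byte (`AsciiCheck.IsAscii` is a hypothetical stand-in written for this page; `det.isAscii` itself may be implemented differently):

```csharp
// Hypothetical stand-in for the detector's ASCII check; not NCharDet code.
static class AsciiCheck
{
    // True when none of the first len bytes has its high bit set (all are 0x00-0x7F).
    public static bool IsAscii(byte[] buf, int len)
    {
        for (int i = 0; i < len; i++)
        {
            if ((buf[i] & 0x80) != 0) return false;
        }
        return true;
    }
}
```

Pure ASCII content decodes identically under most common charsets, which is presumably why detection can be skipped until a byte of 0x80 or above appears.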
NOTE: I have NOT done exhaustive testing of the detection algorithm itself.
I've just assumed that it worked before, and that my C# port (since it compiles and runs :)
works just as well...
NOTE: it appears from the
latest Mozilla source
that the Java port (and hence this C# port) is slightly out-of-date (the Mozilla src
directory contains more files). That may or may not be fixed.
What to do with NCharDet?
If you are building an application that downloads content from the web
(like Searcharoo for example) then you need to know
what character-set/encoding has been used before you can convert the byte[] into
a String (since .NET uses Unicode internally).
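To see why the encoding matters, here is a small sketch (the `DecodeDemo` class is made up for illustration) that decodes the same bytes with two different charset names using the standard `Encoding.GetEncoding` and `GetString` calls:

```csharp
using System.Text;

// Hypothetical helper for illustration only.
static class DecodeDemo
{
    // Decode raw bytes into a .NET string using the named charset.
    public static string Decode(byte[] bytes, string charset)
    {
        return Encoding.GetEncoding(charset).GetString(bytes);
    }
}
```

Given the five bytes 0x63 0x61 0x66 0xC3 0xA9 (the word "café" encoded as UTF-8), `Decode(bytes, "utf-8")` returns "café", while `Decode(bytes, "iso-8859-1")` returns "cafÃ©": the bytes are the same, only the assumed charset differs.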
For example, the code below 'assumes' that a page is UTF-8 encoded if
no HTTP Header or META Content-Type is specified.
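string enc = "utf-8";  // default when nothing else is specified
if (webresponse.ContentEncoding != String.Empty) {
    // Prefer the encoding reported with the HTTP response
    htmldoc.Encoding = webresponse.ContentEncoding;
} else if (htmldoc.Encoding == String.Empty) {
    // TODO: no HTTP header or META Content-Type found; fall back to the default
    htmldoc.Encoding = enc;
}
System.IO.StreamReader stream = new System.IO.StreamReader
       (webresponse.GetResponseStream(), Encoding.GetEncoding(htmldoc.Encoding));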
In the final else clause (where it says TODO) you could use NCharDet to
programmatically attempt to identify the charset. Because the result is a guess it is not
guaranteed, but it makes it more likely that
Encoding.GetEncoding(htmldoc.Encoding) will return the correct value to allow
the contents of the file to be processed correctly.
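One way to arrange that fallback is sketched below. The `EncodingFallback` class and its `Resolve` method are hypothetical names; the sketch only assumes that whatever charset name the detector reported is available as a string, and that unknown names should be skipped rather than allowed to throw.

```csharp
using System;
using System.Text;

// Hypothetical helper; not part of NCharDet or Searcharoo.
static class EncodingFallback
{
    // Try each candidate charset name in order and return the first one .NET recognises.
    public static Encoding Resolve(params string[] candidates)
    {
        foreach (string name in candidates)
        {
            if (string.IsNullOrEmpty(name)) continue;  // nothing specified at this level
            try
            {
                return Encoding.GetEncoding(name);
            }
            catch (ArgumentException)
            {
                // Unknown or unsupported charset name: try the next candidate.
            }
        }
        return Encoding.UTF8;  // last-resort default, matching the "utf-8" default above
    }
}
```

A call such as `EncodingFallback.Resolve(headerCharset, metaCharset, detectedCharset)` would then prefer the HTTP header, then the META tag, then the detector's guess (all three variable names are placeholders for values your own code would supply).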
Useful links
Mozilla Public License 1.1
The latest C++ code is published under MPL, and NCharDet is now published under the same license.
About Copyleft
Two projects implementing NCharDet: (thanks Pedro)
Netscape Public License 1.1
The original C# port, the C++ code and the Java port were published under NPL (other Mozilla licence info).
References
Character encoding detection
on Shared Development blog
Howto identify UTF-8 encoded strings
on stackoverflow.com
Detect Encoding for in- and outgoing text (which utilises MLang)
on codeproject.com