Searcharoo 3: Adding features to the C# search engine/web spider
[Searcharoo Version 1] [Searcharoo Version 2] [Searcharoo Version 3]
Download the source code for this article [ZIP 8kb]
Background
This article follows on from the previous two Searcharoo samples:
Searcharoo Version 1 describes building
a simple search engine that crawls the file system from a specified folder, and indexes all HTML
(or other known types) of document. A basic design and object model was developed to support simple,
single-word searches, whose results were displayed ina rudimentary query/results page.
Searcharoo Version 2 focused on adding a
'spider' to find data to index by following web links (rather than just looking at directory listings in
the file system). This means downloading files via HTTP, parsing the HTML to find more links and ensuring
we don't get into a recursive loop because many web pages refer to each other. This article also
discusses how multiple search words results are combined into a single set of 'matches'.
Introduction
This article (version 3 of Searcharoo) covers three main areas:
- Implementing a ‘save to disk’ function for the catalog
- Feature suggestions, bug fixes and incorporation of code contributed by others on previous articles (mostly via CodeProject - thankyou!)
- Improving the code itself (adding comments, moving classes, improving readability and hopefully making it easier to modify & re-use)
New 'features' include:
- Saving the catalog (which resides in memory for fast searching) to disk
- Making the Spider recognise and follow pages referenced in FRAMESETs and IFRAMEs (suggested by le_mo_mo)
- Paging results rather than just listing them all on one page (submitted by Jim Harkins)
- Normalising words and numbers (removing punctuation, etc)
- (Optional) stemming of English words to reduce catalog size (suggested by
Chris Taylor and
Trickster)
- (Optional) use of Stop words to reduce catalog size
- (Optional) creation of a Go word list, to specifically catalog domain-specific words like "C#",
which might otherwise be ignored
The bug fixes include:
- Correctly parsing <TITLE> tags that may have additional attributes eg. an
ID= attribute in an ASP.NET environment. (submitted by xenomouse)
- Handling Cookies if the server has set them to track a 'session' (submitted by Simon Jones)
- Checking the 'final' URL after redirects to ensure the right page
is indexed and linked (submitted by Simon Jones)
- Correctly parsing (and obeying!) the ROBOTS meta tag (I found this bug myself).
Code layout improvements included:
- The Spider code that was a bit of a mess in SearcharooSpider.aspx being moved into a proper
C# class (and implementing an EventHandler to allow monitoring of progress)
- Encapsulation of Preferences into a single static class
- Layout of Searcharoo.cs using #regions (easy to read if you have VS.NET)
- User control (Searcharoo.ASCX) created for search box - if you want to re-brand it you only
have to modify in one place.
- Paging implementation using PagedDataSource means you can easily alter the 'template' for the
results (eg link size/color/layout) in Searcharoo3.aspx
Design
The fundamental Catalog-File-Word design remains unchanged (from Version 1), however there are quite a
few extra classes implemented in this version.

To build the catalog, SearcharooSpider.aspx calls Spider.BuildCatalog() which:
- Accesses Preferences static object to read settings
- Creates empty Catalog
- Creates IGoWord, IStopper and IStemming implementations (based on Preferences)
- Processes startPageUri (with a WebRequest)
- Creates HtmlDocument, populates properties including Link collections
- Parses the content of the page, creating Word and File objects as required
- Recursively applies steps 4 through 6 for each LocalLink
- BinarySerializes the Catalog to disk using CatalogBinder
- Adds the Catalog to Application.Cache[], for use by Searcharoo3.aspx for searching!
Code Structure
These are the files used in this version (and contained in the download).
| web.config |
14 settings that control how the spider and the search page behave.
They are all 'optional' (ie the spider and search page will run if no config settings
are provided) but I recommend at least providing
<add key="Searcharoo_VirtualRoot" value="http://localhost/content/" />
|
| Searcharoo.cs |
Most code for the application is in this file. Many classes that were in ASPX files
in version 2 have been moved into this file (such as Spider and HtmlDocument) because
it's easier to read and maintain. New version 3 features (Stop, Go, Stemming) all added
here.
|
| Searcharoo3.aspx |
Search page (input and results). Checks the Application-Cache for a Catalog,
and if none exists, creates one (deserialize OR run SearcharooSpider.aspx) |
| Searcharoo.ascx |
NEW user control that contains two asp:Panels:
- the 'blank' search box (when page is first loaded, defaults to yellow background)
- the populated search box (when results are displayed, defaults to blue background)
(see the screenshot at the top of the article)
|
| SearcharooSpider.aspx |
The main page (Searcharoo3.aspx) does a Server.Transfer to this page to
create a new Catalog (if required).
Almost ALL of the code that was in this page in version 2
has been migrated to Searcharoo.cs - OnProgressEvent() allows
it to still display 'progress' messages as the spidering is taking place.
|
Saving the Catalog to Disk
There are a couple of reasons why saving the catalog to disk is useful:
- It can be built on a different server to the website
(for smaller sites, where the code may not have permission to write to disk on the webserver)
- If the server Application restarts, the catalog can be re-loaded rather than rebuilt entirely
- You can finally 'see' what information is stored in the catalog - useful for debugging!
There are two types of Serialization (Xml and Binary) available in the Framework, and since
the Xml is 'human readable', that seemed the logical one to try. The code required to serialize
the Catalog is very simple - the code below is from the Catalog.Save() method, so the
reference to this is the Catalog object.
XmlSerializer serializerXml = new XmlSerializer( typeof( Catalog ) );
System.IO.TextWriter writer
= new System.IO.StreamWriter( Preferences.CatalogFileName+".xml" );
serializerXml.Serialize( writer, this );
writer.Close();
The 'test dataset' I've mostly used is the
CIA World Factbook
(download) which is about 52.6 Mb on disk
for the HTML only (not including images and non-searchable data) - so imagine my "surprise"
when the Xml-Serialized-Catalog itself three times the size at 156 Mb (yes, megabytes!).
Couldn't even open it easily, except by 'type'ing it from the Command Prompt.
OUCH - what a waste of space! And worse, this was the first time I'd noticed the fields
defined in the File class were declared public and not private (see the elements beginning with underscors).
Firstly, let's get rid of the serialized
duplicates (fields that should be private, and their public property counterparts) -- rather than
change the visibility (and pontentially break code), the [XmlIgnore] attribute can be added to
the definition. To further reduce the amount of repeated text, the element names are compressed to
single letters using the [XmlElement] attribute, and to reduce the number of <> some of the
properties are marked to be serialized as [XmlAttribute]s.
[Serializable]
public class Word
{
[XmlElement("t")] public string Text;
[XmlElement("fs")] public File[] Files
...
[Serializable]
public class File
{
[XmlIgnore] public string _Url;
...
[XmlAttribute("u")] public string Url { ...
[XmlAttribute("t")] public string Title { ...
[XmlElement("d")] public string Description { ...
[XmlAttribute("d")] public DateTime CrawledDate { ...
[XmlAttribute("s")] public long Size { ...
...
The Xml file is now a teeny (not!) 49 Mb in size, still too large for notepad
but easily viewed via cmd. As you can see below, the 'compression' of the Xml certainly saved some space
- at least the Catalog is now smaller than the source data!
Even with the smaller output, 49 Mb is of Xml is still a little too verbose to be practical
(hardly a surprise really, Xml often is!) so let's serialize the
index to a Binary format (again, the Framework classes make it really simple).
System.IO.Stream stream = new System.IO.FileStream
(Preferences.CatalogFileName+".dat" , System.IO.FileMode.Create );
System.Runtime.Serialization.IFormatter formatter =
new System.Runtime.Serialization.Formatters.Binary.BinaryFormatter();
formatter.Serialize (stream, this);
stream.Close();
The results of changing to Binary Serialization were dramatic - the same catalog data
was 4.6 Mb rather than 150! That's about 3% of the Xml size, definitely the way to go.
Now that I had the Catalog being saved successfully to disk, it seemed like it would be a
simple matter to re-load it back into memory & the Application Cache...
Loading the Catalog from Disk
Unfortunately, it was NOT that simple. Whenever the Application restarted (say web.config or Searcharoo.cs
was changed), the code could not de-serialize the file but instead threw this cryptic error:
Cannot find the assembly h4octhiw, Version=0.0.0.0, Culture=neutral, PublicKeyToken=null

At first I was stumped - I didn't have any assembly named h4octhiw, so it wasn't immediately
apparent why it could not be found. There are a couple of hints though:
- The 'not found ' assembly appears to have a randomly generated name... and what do we know uses randomly
generated assembly names? The \Temporary ASP.NET Files\ directory where dynamically compiled assemblies
(from src="" and ASPX) are saved.
- The error line references only 'object' and 'stream' types - surely they aren't causing the problem
- Reading through the Stack Trace (click on the image)
from the bottom, up (as always), you can infer that the Deserialize method creates a BinaryParser that
creates an ObjectMap with an array of MemberNames which in turn request ObjectReader.GetType() which triggers
the GetAssembly() method... but it fails!. Hmm - sounds like it might be looking
for the Types that have been serialized - why can't it find them?
If your Google skills are honed, rather than the dozens of useless links returned
when you search for ASP.NET "Cannot find the assembly"
you'll be lucky and stumble across this
CodeProject article on Serialization
where you will learn a very interesting fact:
Type information is also serialized while the class is serialized
enabling the class to be deserialized using the type information. Type
information consists of namespace, class name, assembly name, culture
information, assembly version, and public key token. As long as your
deserialized class and the class that is serialized reside in the same
assembly it does not cause any problem. But if the serializer is in a
separate assembly, .NET cannot find your class' type hence cannot
deserialize it.
But what does it mean? Every time the web/IIS 'Application' restarts, all your ASPX and src="" code
is recompiled to a NEW, RANDOMLY NAMED assembly in \Temporary ASP.NET Files\. So although
the Catalog class is based on the same code, its Type Information
(namespace, class name, assembly name, culture information, assembly version, and public key token)
is DIFFERENT!
And, importantly, when a class is binary serialized, its Type Information is stored along with it (aside:
this doesn't happen with Xml Serialization, so we probably would have been OK if we'd stuck with that).
The upshot: after every recompile (whatever triggered it: web.config change, code change, IIS restart,
machine reboot, etc) our Catalog class has different Type info - and when it tries to load the serialized
version we saved earlier, it doesn't match and the Framework can't find the assembly where the
previous Catalog Type is defined (since it was only Temporary and has been deleted when the recompile took place).
Custom Formatter implementation
Sounds complex? It is, kinda, but the whole 'temporary assemblies' thing is something that happens
invisibly and most developers don't need to know or care much about it. Thankfully we don't have to
worry too much either, because the
CodeProject article on Serialization
also contains the solution: a helper class that 'tricks' the Binary Deserializer into using the
'current' Catalog type.
public class CatalogBinder: System.Runtime.Serialization.SerializationBinder
{
public override Type BindToType (string assemblyName, string typeName)
{
// get the 'fully qualified (ie inc namespace) type name' into an array
string[] typeInfo = typeName.Split('.');
// because the last item is the class name, which we're going to 'look for'
// in *this* namespace/assembly
string className=typeInfo[typeInfo.Length -1];
if (className.Equals("Catalog"))
{
return typeof (Catalog);
}
else if (className.Equals("Word"))
{
return typeof (Word);
}
if (className.Equals("File"))
{
return typeof (File);
}
else
{ // pass back exactly what was passed in!
return Type.GetType(string.Format( "{0}, {1}", typeName, assemblyName));
}
}
}
Et Voila! Now that the Catalog can be saved/loaded, the search engine is much more robust than before. You
can save/back-up the Catalog, turn on debugging to see its contents, and even generate it on a different
machine (say a local PC) and upload to your web server!
Using the 'debug' Xml serialized files, for the first time I could the contents of the
Catalog, and I found lots of 'garbage' was being stored that was both wasteful in terms of memory/disk,
but also useless/unsearchable. With the major task for this release complete, it seemed appropriate
to do some bugfixes and add some "real search engine" features to clean up the Catalog's contents.
New features & bug fixes
FRAME and IFRAME support
CodeProject member le_mo_mo pointed out that the spider did not follow (and index)
framed content. This was a minor change to the regex that finds links - previously A and AREA
tags were supported, so it was simple enough to add FRAME and IFRAME to the pattern.
foreach (Match match in Regex.Matches(htmlData
, @"(?<anchor><\s*(a|area|frame|iframe)\s*(?:(?:\b\w+\b\s*(?:=\s*(?:""[^""]*""|'[^']*'|[^""'<> ]+)\s*)?)*)?\s*>)"
, RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture))
{
Stop words
Let's start with
Google's definition of Stop Words:
Google ignores common words and characters, such as "where" and "how,"
as well as certain single digits and single letters. These terms rarely
help narrow a search and can slow search results. We call them "stop
words."
The basic premise is that we don't want to waste space in the catalog storing data will never be
used, the 'Stop Words' assumption is that you'll never search for words like "a in at I" because
they appear on almost every page, and therefore don't actually help you find anything!
Here's a basic definition from MIT
and some
interesting statistics and Stop Word thoughts
including the 'classic' Stop Word conundrum: should users be able to
search for Hamlet's soliloquy "to be or not to be"?
The Stop Word code supplied with Searcharoo3 is pretty basic - it strips out ALL one and two letter
words, plus
the, and, that, you, this, for, but, with, are, have, was, out, not
A more complex implementation is left for others to contribute (or a future version, whichever
comes first).
Word normalization
I had noticed words were often being stored
with any punctuation that was adjacent to them
in the source text. For example, the Catalog contained Files with Word instances for
| "People | people | people* | people |
This prevented the pages containing those words from ever being returned in a search, unless the
user had typed the exact punctuation as well - in the above example a search for
people
would only return one page, when you would expect it to return all four pages.
The previous version of Searcharoo did have a 'black list' of punctuation [,./?;:()-=etc] but that wasn't
sufficient as I could not predict/foresee all possible punctuation characters. Also, it was implemented
with the Trim() method which was not parsing out punctuation within words [aside: the handling of
parenthesised words is still not satisfactory in version 3]. The following 'white list' of characters that
are allowed to be indexed ensures that NO punctuation is accidentally stored as part of a word.
key = System.Text.RegularExpressions.Regex.Replace(key, @"[^a-z0-9,.]"
, ""
, System.Text.RegularExpressions.RegexOptions.IgnoreCase);
Culture note: this "white list" method of removing punctuation is VERY
English-language centric, as it will remove at least some characters from most
European languages, and it will strip ALL content from most Asian-language content.
If you want to use Searcharoo with non-English character sets, you should find the above line of code
and REPLACE it with this "black list" from Version 2. While it allows more characters to be searched,
the results are more likely to be polluted by punctuation which could reduce searchability.
key = word.Trim
(' ','?','\"',',','\'',';',':','.','(',')','[',']','%','*','$','-').ToLower();
Number normalization
Numbers are a special case of word normalization: some punctuation is required to interpret the number
(eg decimal point), then convert it to a proper number.
Although not perfect, this means phone numbers written as 0412-345-678 or (04)123-45678
would both be Catalogued as 0412345678 and therefore
searching for either 0412-345-678 or (04)123-45678
would match both source documents.
private bool IsNumber (ref string word)
{
try
{
long number = Convert.ToInt64(word); //;int.Parse(word);
word = number.ToString();
return (word!=String.Empty);//true;
}
catch
{
return false;
}
}
Go words
After reading the Word Normalization section above you can see how cataloging and searching for
a technical term/phrase (like C# or C++) is impossible - the non-alphanumeric characters are filtered
out before they have a chance to be catalogued.
To avoid this, Searcharoo allows a 'Go words' list to be created. A 'Go word' is the opposite of
a 'Stop word': instead of being blocked from cataloguing, it is given a free-pass into the catalog,
bypassing the Normalization and Stemming code.
The weakness in this approach is that you must know ahead of time all the different
Go words that your users might search for. In future, you might want to store each unsuccessful
search term for later analysis and expansion of your Go word list. The Go word implementation
is very simple:
public bool IsGoWord (string word)
{
switch (word.ToLower())
{
case "c#":
case "vb.net":
case "asp.net":
return true;
break;
}
return false;
}
Stemming
The most basic explanation of 'stemming' is that it attempts to identify 'related' words and
return them in response to a query. The simplest example is plurals: searching for "field" should
also find instances of "fields" and vice versa. More complex examples are "realize" and "realization",
"populate" and "population" - the
This page on How a Search Engine Works
contains a brief explanation of Stemming and some of the other techniques described above.
The Porter Stemming Algorithm
already existed as a C# class,
so was utilized 'as is' in Searcharoo3 (credit and thanks to Martin Porter).
Affect on Catalog size
The Stop Words, Stemming, and Normalization steps above were all developed to 'tidy up' the Catalog
and hopefully reduce its size/increase search speed. The results are listed below for our
CIA World Factbook:
source: 800 files 52.6 Mb |
Raw * |
+ Stop words |
+ Stemming |
+'white list' normalization |
| Unique Words |
30,415 |
30,068 |
26,560 |
26,050 |
| Xml Serialized |
156 Mb ^ |
149 Mb |
138 Mb |
136 Mb |
| Binary Serialized |
4.6 Mb |
4.5 Mb |
4.1 Mb |
4.0 Mb |
| Binary % of source |
8.75% |
8.55% |
7.79%% |
7.60% |
* black list normalization, which is commented out in the code, and mentioned in the 'culture note'
^ 49 Mb after 'compressing' the Xml output with [Attributes]
The result was a 14% reduction in the number of words and a 13% decrease in Binary file size
(mostly due to the addition of Stemming). Because the whole Catalog stays in memory (in the Application Cache)
keeping the size small is important - maybe a future version will be able to persist some 'working copy'
of the data to disk and enable spidering of really large sites, but for now the catalog seems to
take less than 10% of the source data size.
...but what about the UI?
The search user interface also had some improvements:
- Moving the search inputs into the Searcharoo.ascx User Control
- Adding the same Stemming, Stop and Go word parsing to the search term that
is applied during spidering
- Generating the result list using the new ResultFile class to construct a DataSource to
bind to a Repeater control
- Adding PagedDataSource and custom paging links rather than one long list of results (thanks to
Jim Harkin's feedback/code
and uberasp.net)
ResultFile and SortedList
In version 2, outputting the results was very crude: the code was littered with
Response.Write() calls making it difficult to reformat the output.
Jim Harkins
posted some Visual Basic code which is converted to C# below.
// build each result row
foreach (object foundInFile in finalResultsArray.Keys)
{
// Create a ResultFile with it's own Rank
infile = new ResultFile ((File)foundInFile);
infile.Rank = (int)((DictionaryEntry)finalResultsArray[foundInFile]).Value;
sortrank = infile.Rank * -1000; // Assume not 'thousands' of results
if (output.Contains(sortrank) )
{ // rank exists - drop key index one number until it fits
for (int i = 1; i < 999; i++)
{
sortrank++;
if (!output.Contains (sortrank))
{
output.Add (sortrank, infile);
break;
}
}
} else {
output.Add(sortrank, infile);
}
sortrank = 0; // reset for next pass
}
Jim's code does some trickery with a new 'sortrank' variable to try and keep
the files in 'Searcharoo rank' order, but with unique keys in the output SortedList.
If thousands of results were returned, you might run into trouble...
PagedDataSource
Once the results are in the SortedList, assigned to a PagedDataSource which is then
bound to a Repeater control on Searcharoo3.aspx.
SortedList output =
new SortedList (finalResultsArray.Count); // empty sorted list
...
pg.DataSource = output.GetValueList();
pg.AllowPaging = true;
pg.PageSize = Preferences.ResultsPerPage; // defaults to 10 10;
pg.CurrentPageIndex = Request.QueryString["page"]==null?0:
Convert.ToInt32(Request.QueryString["page"])-1;
SearchResults.DataSource = pg;
SearchResults.DataBind();
making it a LOT easier to reformat the results list however you like!
<asp:Repeater id="SearchResults" runat="server">
<HeaderTemplate>
<p><%=NumberOfMatches%> results for <%=Matches%> took <%=DisplayTime%></p>
</HeaderTemplate>
<ItemTemplate>
<a href="<%# DataBinder.Eval(Container.DataItem, "Url") %>"><b><%# DataBinder.Eval(Container.DataItem, "Title") %></b></a>
<a href="<%# DataBinder.Eval(Container.DataItem, "Url") %>"
target=\"_blank\" title="open in new window" style="font-size:x-small">↑</a>
<font color=gray>(<%# DataBinder.Eval(Container.DataItem, "Rank") %>)</font>
<br><%# DataBinder.Eval(Container.DataItem, "Description") %>...
<br><font color=green><%# DataBinder.Eval(Container.DataItem, "Url") %> - <%# DataBinder.Eval(Container.DataItem, "Size") %>
bytes</font>
<font color=gray>- <%# DataBinder.Eval(Container.DataItem, "CrawledDate") %></font><p>
</ItemTemplate>
<FooterTemplate>
<p><%=CreatePagerLinks(pg, Request.Url.ToString() )%></p>
</FooterTemplate>
</asp:Repeater>
Unfortunately the page links are generated via embedded Response.Write() calls
in CreatePagerLinks()... maybe this will be templated in a future version...
The Future...
If you check the dates below, you'll notice there was almost one and a half years between version 2 and 3,
so it might sound optimistic to discuss another 'future' version - but you never know...
Unfortunately many of the new features above are English-language specific (although they can
be disabled to ensure Searcharoo can still be used on other language websites).
However in a future version I'd like to try making the code can be a little more intelligent
about handling European, Asian and other languages.
It would also be nice if the user could type boolean OR searches, or group terms with quotes " "
like Google, Yahoo, etc.
And finally, indexing of document types besides Html (mainly other web-types like PDF) would
be useful for many sites.