Searcharoo Too : Populating the Search Catalog with a C# Spider
Download the source code for this article [ZIP 8kb]
Article I describes building a simple search engine
that crawls the filesystem from a specified folder and indexes all HTML (or other
types of) documents. A basic design and object model was developed, as well as a query/results
page which you can see here.
This second article in the series discusses replacing the 'filesystem crawler' with
a 'web spider' to search and catalog a website by following the links in the HTML.
The challenges involved include:
- Downloading HTML (and other document types) via HTTP
- Parsing the HTML looking for links to other pages
- Ensuring that we don't keep recursively searching the same pages, resulting in an infinite loop
- Parsing the HTML to extract the words to populate the search catalog from Article I
Design
The design from Article I remains unchanged...
A Catalog contains a collection of Words,
and each Word contains a reference to every File that it appears in |
... the object model is the same too...
What has changed is the way the Catalog is populated. Instead of looping through
folders in the filesystem to look for files to open, the code requires the Url of
a start page which it will load, index and then attempt to follow every link within
that page, indexing those pages too. To prevent the code from indexing the entire
internet (in this version) it only attempts to download pages on the same server as
the start page.
Code Structure
Some of the code from Article I will be referenced
again, but we've added a new page - SearcharooSpider.aspx - that does the HTTP
access and HTML link parsing [making the code that walks directories in the filesystem
- SearcharooCrawler.aspx - obsolete]. We've also changed the name of the search
page to SearcharooToo.aspx so you can use it side-by-side with the old one.
Searcharoo.cs | Implementation of the object model; compiled into both ASPX pages. RE-USED FROM ARTICLE 1 |
SearcharooCrawler.aspx | OBSOLETE, REPLACED WITH SPIDER |
SearcharooToo.aspx | <%@ Page Language="C#" Src="Searcharoo.cs" %> <%@ import Namespace="Searcharoo.Net"%> Retrieves the Catalog object from the Cache and allows searching via an HTML form. UPDATED SINCE ARTICLE 1 TO IMPROVE USEABILITY, and renamed to SearcharooToo.aspx |
SearcharooSpider.aspx | <%@ Page Language="C#" Src="Searcharoo.cs" %> <%@ import Namespace="Searcharoo.Net"%> Starting from the start page, download and index every linked page. NEW PAGE FOR THIS ARTICLE |
There are three fundamental tasks for a search spider:
- Finding the pages to index
- Downloading each page successfully
- Parsing the page content and indexing it
The big search engines - Yahoo, Google, MSN - all 'spider' the internet to build their
search catalogs. Following links to find documents requires us to write an HTML parser
that can find and interpret the links, and then follow them! This includes being able
to follow HTTP-302 redirects, recognising the type of document that has been returned,
determining what character set/encoding was used (for Text and HTML documents), etc. -
basically a mini-browser! We'll start small and attempt to build a passable spider
using C#...
Build the Spider [SearcharooSpider_alpha.aspx]
Getting Started - Downloading a Page
To get something working quickly, let's just try to download the 'start page' - say
the root page of the local machine (ie. Step 2 - downloading pages).
Here is the simplest possible code to get the contents of an HTML page from a website
(localhost in this case):
|
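using System.Net;
using System.Text;
// Ask the local web server for its default document
string url = "http://localhost/";
WebClient browser = new WebClient();
// Assume the page is UTF-8 encoded (revisited in Problem 3)
UTF8Encoding enc = new UTF8Encoding();
string fileContents = enc.GetString(browser.DownloadData(url));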
|
|
Listing 1 - Simplest way to download an Html document |
The first thing to notice is the inclusion of the System.Net namespace.
It contains a number of useful classes including WebClient, which
is a very simple 'browser-like' object that can download text or data from a given
URL.
The second thing is that we assume the page is encoded using UTF-8, using the UTF8Encoding class
to convert the downloaded Byte[] array into a string. If the page
returned was encoded differently (say, Shift_JIS or GB2312) then this conversion
would produce garbage. We'll have to fix this later.
The third thing, which might not be immediately obvious, is that I haven't actually
specified a page in the url. We rely on the server to resolve
the request and return the default document to us - however the server might have
issued a 302 Redirect to another page (or another directory, or even another site). WebClient will
successfully follow those redirects but its interface has no simple way for the code
to query what the page's actual URL is (after the redirects). We'll have to
fix this later, too, otherwise it's impossible to resolve relative Urls within
the page.
Despite those problems, we now have the full text of the 'start page' in a variable.
That means we can begin to work on the code for Step 1 - finding pages to index.
Parsing the page
There are two options (OK, probably more, but two main options) for parsing the links
(and other data) out of Html:
- Reading in the entire page string, building a DOM and walking through its elements looking for links, or
- Using Regular Expressions to find link patterns in the page string.
Although I suspect "commercial" search engines might use option 1 (building a DOM),
it's much simpler to use Regular Expressions. Because my initial test website had
very-well-formed HTML, I could get away with this code:
|
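// Requires System.Collections and System.Text.RegularExpressions
ArrayList linkLocal = new ArrayList();
ArrayList linkExternal = new ArrayList();
foreach (Match match in Regex.Matches(htmlData
    , @"(?<=<(a|area)\s+href="").*?(?=""\s*/?>)"
    , RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture)) {
    // The match is everything between href=" and the "> that closes the tag
    string link = match.Value;
    // Chop off any attributes that follow the URL, at the first quote or space
    int spacePos = link.IndexOf(' ');
    int quotePos = link.IndexOf('"');
    int chopPos = (quotePos < spacePos ? quotePos : spacePos);
    if (chopPos > 0) {
        link = link.Substring(0, chopPos);
    }
    if ( (link.Length > 8) && (link.Substring(0, 7).ToLower() == "http://") ) {
        // Anything starting with http:// is treated as external
        linkExternal.Add(link);
    } else {
        // Otherwise assume the link is relative to the starting Url
        link = startingUrl + link;
        linkLocal.Add(link);
    }
}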
|
|
Listing 2 - Simplest way to find links in a page |
As with the first cut of page-downloading, there are a number of problems with this
code. Firstly, the Regular Expression used to find the links is *very* restrictive,
ie. it will find -
<a href="News.htm">News</a>
<area href="News.htm" shape="rect" coords="0,0,110,20">
- because the href appears as the first attribute after the a (or area),
and the URL itself is double-quoted. However that code will have trouble with a lot
of valid links, including:
<a href='News.htm'>News</a>
<a href=News.htm>News</a>
<a class="cssLink" href="News.htm">News</a>
<area shape="rect" coords="0,0,110,20" href="News.htm">
<area href='News.htm' shape="rect" coords="0,0,110,20">
It will also attempt to use 'internal page links' (beginning with #),
and it assumes that any link beginning with http:// is external,
without first checking the servername against the target server. Despite the bugs,
testing against tailored HTML pages this code will successfully parse the links into
the linkLocal ArrayList, ready for processing -- coupling that list
of URLs with the code to download URLs, we can effectively 'spider' a website!
Downloading More Pages
The basic code is shown below - comments show where additional code is required, either from
the listings above or in Article I.
|
protected void Page_Load (object sender, System.EventArgs e) {
    startingPageUrl = "http://localhost/";
    parseUrl (startingPageUrl, new UTF8Encoding(), new WebClient() );
}
public void parseUrl (string url, UTF8Encoding enc, WebClient browser) {
    if (visited.Contains(url)) {
        // Already processed: skipping it is what prevents an infinite loop
        Response.Write ("<br><font size=-2> " + url + " already spidered</font>");
    } else {
        // Remember this page so that it is only processed once
        visited.Add(url);
        string fileContents = enc.GetString (browser.DownloadData(url));
        // ... parse the links in fileContents into pmd.LocalLinks (see Listing 2) ...
        // ... index the words in fileContents (see Article I) ...
        if (null != pmd.LocalLinks)
            foreach (object link in pmd.LocalLinks) {
                parseUrl (Convert.ToString(link), enc, browser);
            }
    }
}
|
|
Listing 3 - Combining the link parsing and page downloading code. |
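The loop-prevention idea in Listing 3 can be tried without a web server. The following is only a sketch: the `SpiderSketch` class and its in-memory `site` dictionary (page URL mapped to the URLs it links to) are hypothetical stand-ins for the download and link-parsing code, and it uses generic collections from later versions of .NET rather than the ArrayList used in the article.

```csharp
using System;
using System.Collections.Generic;

public static class SpiderSketch
{
    // Visits startUrl and every page reachable from it exactly once.
    // 'site' is a hypothetical stand-in for the website being spidered:
    // it maps each page URL to the URLs that page links to.
    public static List<string> Crawl(string startUrl, IDictionary<string, string[]> site)
    {
        var visited = new HashSet<string>();
        var order = new List<string>();
        Visit(startUrl, site, visited, order);
        return order;
    }

    private static void Visit(string url, IDictionary<string, string[]> site,
                              HashSet<string> visited, List<string> order)
    {
        // HashSet.Add returns false when the URL is already present,
        // which stops circular links from recursing forever.
        if (!visited.Add(url)) return;
        order.Add(url);

        string[] links;
        if (!site.TryGetValue(url, out links)) return; // unknown page: nothing to follow
        foreach (string link in links)
        {
            Visit(link, site, visited, order);
        }
    }
}
```

Calling `SpiderSketch.Crawl("http://localhost/", site)` returns the pages in the order they were first reached, even when the pages link back to each other.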
Review the three fundamental tasks for a search spider, and you can see we've developed
enough code to build it:
- Finding the pages to index - we can start at a specific Url and find links using Listings 2 & 3.
- Downloading each page successfully - we can do this using the WebClient in Listings 1 & 3.
- Parsing the page content and indexing it - we already have this code from Article I
Although the example above is picky about what links it will find, it will work
to 'spider' and then search a website! FYI, you can view
the 'alpha version' of the code and use it in conjunction with the other
files from Article I to search the catalog. The
remainder of this article discusses the changes required to this code to fix the shortcomings
discussed earlier; the ZIP file contains a complete set
of updated code.
Fix the Spider [SearcharooSpider.aspx]
Problem 1 - Correctly parsing relative links
The alpha code fails to follow 'relative' and 'absolute' links (eg. "../../News/Page.htm"
and "/News/Page2.htm" respectively) partly because it does not 'remember' what folder/subdirectory
it is parsing. My first instinct was to build a new 'Url' class which would take a
page URL and a link, and encapsulate the code required to build the complete link
by resolving directory traversal (eg. "../") and absolute references (eg. starting with
"/"). The code would need to do something like this:
Page URL | Link in page | Result should be |
http://localhost/News/ | Page2.htm | http://localhost/News/Page2.htm |
http://localhost/News/ | ../Contact.htm | http://localhost/Contact.htm |
http://localhost/News/ | /Downloads/ | http://localhost/Downloads/ |
etc. |
Solution: Uri class
The first lesson to learn when you have a class library at your disposal is LOOK BEFORE
YOU CODE. It was almost by accident that I stumbled across the Uri class,
which has a constructor -
new Uri (baseUri, relativeUri)
- that does exactly what I need. No re-inventing the wheel!
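As a rough illustration, the rows of the table above can be reproduced with that constructor. The `LinkResolver` class and its method names are made up for this sketch; the `IsSameServer` check is one possible way to keep the spider on the start page's server, and is not necessarily how the Searcharoo code does it.

```csharp
using System;

public static class LinkResolver
{
    // Combines the URL of the current page with a link found in that page.
    // Relative links, "../" traversal and links starting with "/" are all
    // resolved by the Uri(Uri, string) constructor.
    public static Uri Resolve(string pageUrl, string link)
    {
        return new Uri(new Uri(pageUrl), link);
    }

    // True when the link points at the same host as the start page.
    public static bool IsSameServer(Uri startPage, Uri link)
    {
        return string.Equals(startPage.Host, link.Host,
                             StringComparison.OrdinalIgnoreCase);
    }
}
```

For example, `LinkResolver.Resolve("http://localhost/News/", "../Contact.htm")` produces a Uri whose AbsoluteUri is http://localhost/Contact.htm.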
Problem 2 - Following redirects
Following relative links is made even more difficult because the WebClient class,
while it enabled us to quickly get the spider up-and-running, is pretty dumb. It does
not expose all the properties and methods required to properly emulate a web browser's
behaviour... It is capable of following redirects issued by a server, but it has no
simple interface to communicate to the calling code exactly what URL it ended up requesting.
Solution: HttpWebRequest & HttpWebResponse classes
The HttpWebRequest and HttpWebResponse classes provide
a much more powerful interface for HTTP communication. HttpWebRequest has
a number of useful properties, including:
- AllowAutoRedirect - configurable!
- MaximumAutomaticRedirections - redirection can be limited to prevent 'infinite loops' in naughty pages
- UserAgent - set to "Mozilla/6.0 (MSIE 6.0; Windows NT 5.1; Searcharoo.NET Robot)" (see Problem 5 below)
- KeepAlive - efficient use of connections
- Timeout - configurable based on the expected performance of the target website
which are set in the code to help us get the pages we want. HttpWebResponse has
one key property - ResponseUri - that returns the final Uri
that was read; for example, if we tried to access http://localhost/ and
the server issued a 302-redirect to /en/index.html then
the HttpWebResponseInstance.ResponseUri would be http://localhost/en/index.html and
NOT just http://localhost/. This is important because
unless we know the URL of the current page, we cannot process relative links correctly
(see Problem 1).
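A minimal sketch of how those properties might be set is shown below, assuming the .NET Framework that the article targets (newer .NET versions flag WebRequest.Create as obsolete). The `SpiderRequest` class, its method names and the redirect limit of 3 are assumptions made for illustration; the User-Agent string is the one quoted above.

```csharp
using System;
using System.Net;

public static class SpiderRequest
{
    // Builds a request configured along the lines described above.
    public static HttpWebRequest Create(string url, int timeoutSeconds)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.AllowAutoRedirect = true;
        request.MaximumAutomaticRedirections = 3; // assumed limit, not from the article
        request.UserAgent = "Mozilla/6.0 (MSIE 6.0; Windows NT 5.1; Searcharoo.NET Robot)";
        request.KeepAlive = true;
        request.Timeout = timeoutSeconds * 1000;  // the Timeout property is in milliseconds
        return request;
    }

    // Sends the request and reports which URL was actually read,
    // i.e. the address after any redirects have been followed.
    public static Uri GetFinalUri(HttpWebRequest request)
    {
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            return response.ResponseUri;
        }
    }
}
```

`GetFinalUri` needs a reachable server, so it is the part to try against your own site; `Create` only builds the request and sends nothing.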
Problem 3 - Using the correct character-set when downloading files
The alpha code assumes that every page is UTF-8 and never checks the Content-Type that the server returns.
Solution: HttpWebResponse and the Encoding namespace
The HttpWebResponse has another advantage over WebClient:
it's easier to access HTTP server headers such as the ContentType and ContentEncoding.
This enables the following code to be written:
|
if (webresponse.ContentEncoding != String.Empty) {
    // Use the encoding that the server specified in the HTTP headers
    htmldoc.Encoding = webresponse.ContentEncoding;
} else if (htmldoc.Encoding == String.Empty) {
    // Nothing specified anywhere: fall back to UTF-8
    htmldoc.Encoding = "utf-8";
}
System.IO.StreamReader stream = new System.IO.StreamReader
    (webresponse.GetResponseStream(), Encoding.GetEncoding(htmldoc.Encoding) );
// Keep the final Uri (after any redirects) so that relative links can be resolved
htmldoc.Uri = webresponse.ResponseUri;
htmldoc.Length = webresponse.ContentLength;
htmldoc.All = stream.ReadToEnd ();
stream.Close();
|
|
Listing 4 - Check the HTTP Content Encoding and use the correct Encoding class
to process the Byte[] Array returned from the server |
Elsewhere in the code we use the ContentType to parse out the MIME-Type
of the data, so that we can ignore images, stylesheets (and, for this version,
Word, PDF, ZIP and other file types).
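The sketch below shows one way both pieces of information could be pulled out of a Content-Type header value such as "text/html; charset=utf-8", using the ContentType class from System.Net.Mime. It is not the code from the ZIP file: the `ContentTypeSketch` class, its method names and the choice of which MIME types count as indexable are assumptions. Note that the character set normally travels in the Content-Type header (HttpWebResponse also exposes it as CharacterSet), whereas the Content-Encoding header usually describes compression such as gzip.

```csharp
using System;
using System.Net.Mime;
using System.Text;

public static class ContentTypeSketch
{
    // Returns the MIME type part of a Content-Type header value,
    // e.g. "text/html" for "text/html; charset=utf-8".
    public static string GetMimeType(string contentTypeHeader)
    {
        return new ContentType(contentTypeHeader).MediaType.ToLowerInvariant();
    }

    // Picks the Encoding named by the charset parameter, falling back to
    // UTF-8 when the parameter is missing or names an unknown encoding.
    public static Encoding GetEncoding(string contentTypeHeader)
    {
        string charset = new ContentType(contentTypeHeader).CharSet;
        if (string.IsNullOrEmpty(charset)) return Encoding.UTF8;
        try
        {
            return Encoding.GetEncoding(charset);
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8;
        }
    }

    // Assumption for this sketch: only HTML and plain text are indexed.
    public static bool IsIndexable(string contentTypeHeader)
    {
        string mime = GetMimeType(contentTypeHeader);
        return mime == "text/html" || mime == "text/plain";
    }
}
```

The Encoding returned by `GetEncoding` can be passed to a StreamReader in the same way as in Listing 4.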
Problem 4 - Does not recognise many valid link formats
When building the alpha code I implemented the simplest Regular Expression I could
find to locate links in a string - (?<=<(a|area)\s+href=").*?(?="\s*/?>).
The problem is that it is far too dumb to find the majority of links.
Solution: Smarter Regular Expressions
Regular Expressions can be very powerful, and clearly a more complex expression was
required. Not being an expert in this area, I turned to Google and eventually Matt
Bourne who posted a couple of very useful Regex patterns, which resulted
in the following code:
|
foreach (Match match in Regex.Matches(htmlData
    , @"(?<anchor><\s*(a|area)\s*(?:(?:\b\w+\b\s*(?:=\s*(?:""[^""]*""|'[^']*'|[^""'<> ]+)\s*)?)*)?\s*>)"
    , RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture)) {
    // match.Value is a whole <a> or <area> tag; now look through its attributes
    link=String.Empty;
    foreach (Match submatch in Regex.Matches(match.Value.ToString()
        , @"(?<name>\b\w+\b)\s*=\s*(""(?<value>[^""]*)""|'(?<value>[^']*)'|(?<value>[^""'<> \s]+)\s*)+"
        , RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture)) {
        // Groups[1] is the attribute name, Groups[2] is its value
        if ("href" == submatch.Groups[1].ToString().ToLower() ) {
            link = submatch.Groups[2].ToString();
            break;
        }
    }
    // ... link now holds the href value (or String.Empty), ready to be processed ...
}
|
|
Listing 5 - More powerful Regex matching |
Listing 5 performs three steps:
- Match entire link tags (from < to >) including the tag name and all attributes. The Match.Value for each match could be any of the link samples shown earlier:
<a href='News.htm'>
<a href=News.htm>
<a class="cssLink" href="News.htm">
<area shape="rect" coords="0,0,110,20" href="News.htm">
<area href='News.htm' shape="rect" coords="0,0,110,20">
- The second expression matches the key-value pairs of each attribute, so it will return:
href='News.htm'
href=News.htm
class="cssLink" href="News.htm"
shape="rect" coords="0,0,110,20" href="News.htm"
href='News.htm' shape="rect" coords="0,0,110,20"
- We access the groups within the match and only get the value for the href attribute, which becomes a link for us to process.
The combination of these two Regular Expressions makes the link parsing a lot more
robust.
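To experiment with the two-step approach outside the spider, here is a self-contained sketch. It deliberately uses simpler patterns than Listing 5 (they are my own, not Matt Bourne's), so it will stumble on a '>' inside a quoted attribute value; the `LinkExtractor` class is hypothetical, and skipping links that start with # is an assumption based on the 'internal page links' problem mentioned earlier.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Step 1: an opening <a ...> or <area ...> tag, up to the first '>'
    private static readonly Regex TagPattern = new Regex(
        @"<\s*(?:a|area)\b[^>]*>", RegexOptions.IgnoreCase);

    // Step 2: name=value pairs, where the value is double-quoted,
    // single-quoted or not quoted at all
    private static readonly Regex AttributePattern = new Regex(
        @"(?<name>[\w-]+)\s*=\s*(?:""(?<value>[^""]*)""|'(?<value>[^']*)'|(?<value>[^\s""'<>]+))",
        RegexOptions.IgnoreCase);

    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match tag in TagPattern.Matches(html))
        {
            foreach (Match attribute in AttributePattern.Matches(tag.Value))
            {
                // Step 3: only the href attribute is of interest
                if (!string.Equals(attribute.Groups["name"].Value, "href",
                                   StringComparison.OrdinalIgnoreCase)) continue;
                string link = attribute.Groups["value"].Value;
                // Skip empty links and links within the same page (#anchor)
                if (link.Length > 0 && !link.StartsWith("#")) links.Add(link);
                break;
            }
        }
        return links;
    }
}
```

Each string returned by `ExtractLinks` is still relative to the page it came from, so it needs to be combined with the page's Uri before it can be downloaded.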
Problem 5 - Poor META-tag handling
The alpha has very rudimentary META tag handling - so primitive that it accidentally
assumed that <META NAME="" CONTENT=""> was the only format and ignored
the <META HTTP-EQUIV="" CONTENT=""> form.
There are two reasons to process the META tags correctly: (1) to get the Description
and Keywords for this document, and (2) read the ROBOTS tag so that our spider behaves
nicely when presented with content that should not be indexed.
Solution: Smarter Regular Expressions and support for
more tags
Using a variation of the Regular Expressions from Problem 4, the code parses out the
META tags as required, adds Keywords and Description to the indexed content and stores
the Description for display on the Search Results page.
|
string metaKey = String.Empty, metaValue = String.Empty;
foreach (Match metamatch in Regex.Matches (htmlData
    , @"<meta\s*(?:(?:\b(\w|-)+\b\s*(?:=\s*(?:""[^""]*""|'[^']*'|[^""'<> ]+)\s*)?)*)/?\s*>"
    , RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture)) {
    metaKey = String.Empty;
    metaValue = String.Empty;
    // Step 1 matched a whole META tag; now loop through its attribute/value pairs
    foreach (Match submetamatch in Regex.Matches(metamatch.Value.ToString()
        , @"(?<name>\b(\w|-)+\b)\s*=\s*(""(?<value>[^""]*)""|'(?<value>[^']*)'|(?<value>[^""'<> ]+)\s*)+"
        , RegexOptions.IgnoreCase|RegexOptions.ExplicitCapture)) {
        if ("http-equiv" == submetamatch.Groups[1].ToString().ToLower() ) {
            metaKey = submetamatch.Groups[2].ToString();
        }
        // Only use 'name' when no key has been found yet
        if ( ("name" == submetamatch.Groups[1].ToString().ToLower() )
            && (metaKey == String.Empty) ) {
            metaKey = submetamatch.Groups[2].ToString();
        }
        if ("content" == submetamatch.Groups[1].ToString().ToLower() ) {
            metaValue = submetamatch.Groups[2].ToString();
        }
    }
    // Step 2: now that we know what the content relates to, store it
    switch (metaKey.ToLower()) {
        case "description":
            htmldoc.Description = metaValue;
            break;
        case "keywords":
        case "keyword":
            htmldoc.Keywords = metaValue;
            break;
        case "robots":
        case "robot":
            htmldoc.SetRobotDirective (metaValue);
            break;
    }
}
|
|
Listing 6 - Parsing META tags is a two step process, because we have to check the
'name/http-equiv' so that we know what the content relates to! |
It also obeys the ROBOTS NOINDEX and NOFOLLOW directives if they appear in the META
tags (you can read more about the Robot
Exclusion Protocol as it relates to META tags; note that we have not implemented
support for the robots.txt file which sits in the root of a website
- perhaps in version 3!). We also set our User-Agent (Solution 2) to indicate that
we are a Robot so that the web log of any site we spider will clearly differentiate
our requests from regular browsers; it also enables us to prevent Searcharoo from
indexing itself.
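The body of SetRobotDirective is not shown in this article, so the following is only a guess at the kind of parsing it needs to do with a value such as "NOINDEX, NOFOLLOW". The `RobotDirective` class and its property names are invented for this sketch.

```csharp
using System;

public sealed class RobotDirective
{
    // Both actions are allowed unless the META tag says otherwise
    public bool AllowIndex { get; private set; }
    public bool AllowFollow { get; private set; }

    private RobotDirective()
    {
        AllowIndex = true;
        AllowFollow = true;
    }

    // Parses the CONTENT value of a ROBOTS META tag, e.g. "noindex, nofollow"
    public static RobotDirective Parse(string metaContent)
    {
        var directive = new RobotDirective();
        if (string.IsNullOrEmpty(metaContent)) return directive;

        foreach (string part in metaContent.Split(','))
        {
            switch (part.Trim().ToLowerInvariant())
            {
                case "noindex":
                    directive.AllowIndex = false;
                    break;
                case "nofollow":
                    directive.AllowFollow = false;
                    break;
                case "none": // shorthand for "noindex, nofollow"
                    directive.AllowIndex = false;
                    directive.AllowFollow = false;
                    break;
            }
        }
        return directive;
    }
}
```

A spider would then skip cataloging a page when AllowIndex is false, and skip the page's links when AllowFollow is false.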
Spidering the web!
When you load the SearcharooSpider.aspx page it immediately begins
spidering, starting with either (a) the root document in the folder where the file
is located, OR (b) the location specified in web.config (if it exists).
|
|
Screenshot 1 - The title of each page is displayed as it is spidered. We're using
the CIA
World FactBook as test data |
Once the catalog is built, you are ready to search.
Performing the Search
All the hard work was done in Article 1 - this code is repeated for your information...
|
public Hashtable Search (string searchWord) {
    // Strip punctuation and lower-case the word so it matches the indexed form
    searchWord = searchWord.Trim('?','\"', ',', '\'', ';', ':', '.', '(', ')').ToLower();
    Hashtable retval = null;
    if (index.ContainsKey (searchWord) ) {
        // The word is in the catalog: return the files it appears in
        Word thematch = (Word)index[searchWord];
        retval = thematch.InFiles();
    }
    return retval;
}
|
|
Article 1 Listing 8 - the Search method of the Catalog object |
We have not modified any of the Search objects in the diagram at the start of this
article, in an effort to show how data encapsulation allows you to change both the
way you collect data (ie. from filesystem crawling to website spidering)
and the way you present data (ie. updating the search results page) without
affecting your data tier. In article 3 we'll examine if it's possible to convert the
Search objects to use a database back-end without affecting the collection and presentation
classes...
Improving the Results [SearcharooToo.aspx]
These are the changes we will make to the results page:
- Enable searching for more than one word and requiring all terms to appear in the resulting document matches (boolean AND search)
- Improved formatting, including:
  - Pre-filled search box on the results page
  - Document count for each term in the query, and link to view those results
  - Time taken to perform query
The first change to support searching on multiple terms is to 'parse' the query typed
by the user. This means: trimming whitespace from around the query, and compressing
whitespace between the query terms. We then Split the query into
an Array[] of words and Trim any punctuation from around each term.
|
searchterm = Request.QueryString["searchfor"].ToString().Trim(' ');
// Compress any run of whitespace between terms into a single space
Regex r = new Regex(@"\s+");
searchterm = r.Replace(searchterm, " ");
searchTermA = searchterm.Split(' ');
for (int i = 0; i < searchTermA.Length; i++) {
    // Remove punctuation from around each term and lower-case it
    searchTermA[i] = searchTermA[i].Trim
        (' ', '?','\"', ',', '\'', ';', ':', '.', '(', ')').ToLower();
}
|
|
Listing 7 - Parsing the query into individual search terms |
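The same parsing steps can be wrapped in a small function so they can be tried without a web request. This is a sketch only: the `QueryParser` class is a hypothetical name, and returning an empty array for a blank query is my own choice rather than something the article specifies.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QueryParser
{
    // The same characters that the article trims from around each term
    private static readonly char[] Punctuation =
        { ' ', '?', '"', ',', '\'', ';', ':', '.', '(', ')' };

    public static string[] Parse(string query)
    {
        // Trim the ends, then compress whitespace between the terms
        string compressed = Regex.Replace(query.Trim(), @"\s+", " ");
        if (compressed.Length == 0) return new string[0];

        string[] terms = compressed.Split(' ');
        for (int i = 0; i < terms.Length; i++)
        {
            terms[i] = terms[i].Trim(Punctuation).ToLower();
        }
        return terms;
    }
}
```

For example, `QueryParser.Parse("  Snow   cold,  (weather)? ")` yields the three terms snow, cold and weather.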
Now that we have an Array of the individual search terms, we will find ALL the documents
matching each individual term. This is done using the same m_catalog.Search() method
from Article I. After each search we check if any results
were returned, and store them in the searchResultsArrayArray to process
further.
|
Hashtable[] searchResultsArrayArray = new Hashtable[searchTermA.Length];
HybridDictionary finalResultsArray = new HybridDictionary();
string matches="";
bool botherToFindMatches = true;
int indexOfShortestResultSet = -1, lengthOfShortestResultSet = -1;
for (int i = 0; i < searchTermA.Length; i++) {
    searchResultsArrayArray[i] = m_catalog.Search (searchTermA[i].ToString());
    if (null == searchResultsArrayArray[i]) {
        // No document contains this term, so an AND query cannot match anything
        matches += searchTermA[i] + " <font color=gray style='font-size:xx-small'>(not found)</font> ";
        botherToFindMatches = false;
    } else {
        int resultsInThisSet = searchResultsArrayArray[i].Count;
        matches += "<a href=\"?searchfor="+searchTermA[i]+"\">"
                + searchTermA[i]
                + "</a> <font color=gray style='font-size:xx-small'>(" + resultsInThisSet + ")</font> ";
        // Remember the smallest result set: it limits the number of possible matches
        if ( (lengthOfShortestResultSet == -1) || (lengthOfShortestResultSet > resultsInThisSet) ) {
            indexOfShortestResultSet = i;
            lengthOfShortestResultSet = resultsInThisSet;
        }
    }
}
|
|
Listing 8 - Find the results for each of the terms individually |
Describing how we find the documents that match ALL words in the query is easiest
with an example, so imagine we're searching for the query "snow cold weather" in the CIA
World FactBook. Listing 8 found the Array of documents matching each word, and
placed them inside another Array. "snow" has 10 matching documents, "cold"
has 43 matching documents and "weather" has 22 matching documents.
Obviously the maximum possible number of overall matches is 10 (the smallest result
set), and the minimum is zero -- maybe there are NO documents that appear in all three
collections. Both of these possibilities are catered for - indexOfShortestResultSet remembers
which word had fewest results and botherToFindMatches is set to false
if any word fails to get a single match.
|
|
Diagram 1 - Finding the intersection of the result sets for each word involves traversing
the 'array of arrays' |
Listing 9 shows how we approached this problem. It may not be the most efficient
way to do it, but it works! Basically we choose the smallest resultset and
loop through its matching Files, looping through the searchResultsArrayArray (counter
'cx') looking for that same file in all the other resultsets.
Imagine, referring to the diagram above, that we begin with [0][0] file D (we
start with index [0] "snow" because it's the SMALLEST set, NOT just because
it's item 0). The loop below will now start checking all the other files to see if
it finds D again... but it won't start in set [0] because we already
know that D is unique in this set. "if (cx==c)" checks that condition
and prevents looping through resultset [0].
Counter 'cx' will be incremented to 1, and the loop will begin examining items
[1][0], [1][1], [1][2], [1][3], [1][4] (files G, E, S, H, K) but "if (fo.Key ==
fox.Key)" won't match because we are still searching for matches to file [0][0] D.
However, on the next iteration, file [1][5] is found to be file D,
so we know that file D is a match for BOTH "snow" and "cold"!
The next problem is, how will we remember that this file exists in both sets? I chose
a very simple solution - keep a running total for the sets we're comparing (totalcount)
and add the same amount to matchcount each time we find the file in a set.
We can then safely break out of that loop (knowing that the file
is unique within a resultset, and we wouldn't care if it was duplicated in there anyway)
and start checking the next resultset.
After the looping has completed, "if (matchcount == totalcount)" then we know this
file exists in ALL the sets, and can be added to the finalResultsArray,
which is what we'll use to show the results page to the user.
The looping will continue with 'cx' incremented to 2, and the "weather" matches will
be checked for file D. It is found at position [2][2] and the matchcount will
be adjusted accordingly. The whole looping process will then begin again in the "snow"
matches [0][1] file G, and all the other files will again be checked
against this one to see if it exists in all sets.
After a LOT of looping, the code will discover that only files D and G exist in all
three sets, and the finalResultsArray will have just two elements
which it passes to the same display-code from Listings 10-13 in Article
I .
|
if (botherToFindMatches) {
    // Start from the smallest result set: only its files can be in every set
    int c = indexOfShortestResultSet;
    Hashtable searchResultsArray = searchResultsArrayArray[c];
    if (null != searchResultsArray)
    foreach (object foundInFile in searchResultsArray) {
        DictionaryEntry fo = (DictionaryEntry)foundInFile;
        int matchcount=0, totalcount=0, weight=0;
        for (int cx = 0; cx < searchResultsArrayArray.Length; cx++) {
            totalcount+=(cx+1);
            if (cx == c) {
                // The file always matches within its own set
                matchcount += (cx+1);
                weight += (int)fo.Value;
            } else {
                Hashtable searchResultsArrayx = searchResultsArrayArray[cx];
                if (null != searchResultsArrayx)
                foreach (object foundInFilex in searchResultsArrayx) {
                    DictionaryEntry fox = (DictionaryEntry)foundInFilex;
                    if (fo.Key == fox.Key) {
                        matchcount += (cx+1);
                        weight += (int)fox.Value;
                        break;
                    }
                }
            }
        }
        if ( (matchcount>0) && (matchcount == totalcount) ) {
            fo.Value = weight;
            if ( !finalResultsArray.Contains (fo.Key) ) finalResultsArray.Add ( fo.Key, fo);
        }
    }
}
|
|
Listing 9 - Finding the sub-set of documents that contain EVERY word in the query.
There's three nested loops in there - I never said this was efficient! |
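For comparison, here is a sketch of the same intersection that replaces the innermost loop with a keyed lookup. It is not the code used in Searcharoo: the `ResultIntersection` class is hypothetical, documents are represented simply by their URL as a string, and each result set maps a document to the number of times the term appears in it (an assumption about what the Hashtable values hold).

```csharp
using System;
using System.Collections.Generic;

public static class ResultIntersection
{
    // Returns the documents that appear in EVERY result set,
    // with their weights added together.
    public static Dictionary<string, int> MatchAll(IList<Dictionary<string, int>> resultSets)
    {
        var final = new Dictionary<string, int>();
        if (resultSets.Count == 0) return final;

        // Only documents in the smallest set can possibly be in all sets
        Dictionary<string, int> smallest = resultSets[0];
        foreach (Dictionary<string, int> set in resultSets)
        {
            if (set.Count < smallest.Count) smallest = set;
        }

        foreach (KeyValuePair<string, int> candidate in smallest)
        {
            int weight = 0;
            bool inEverySet = true;
            foreach (Dictionary<string, int> set in resultSets)
            {
                int count;
                // A keyed lookup replaces the loop over each set's entries
                if (!set.TryGetValue(candidate.Key, out count))
                {
                    inEverySet = false;
                    break;
                }
                weight += count;
            }
            if (inEverySet) final[candidate.Key] = weight;
        }
        return final;
    }
}
```

With this approach the work grows with the size of the smallest result set multiplied by the number of terms, rather than with the size of every result set.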
The algorithm described above is performing a boolean AND query on all the words
in the query, ie. the example is searching for "snow AND cold AND weather". If we
wished to build an OR query, we could simply loop through all the files and filter
out duplicates. OR queries aren't that useful unless you can combine them with AND
clauses, such as "snow AND (cold OR weather)" - but this is NOT supported in Version
2!
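Although OR queries are not part of Version 2, the "loop through all the files and filter out duplicates" idea is short enough to sketch. As with the previous sketch, the `ResultUnion` class is hypothetical and each result set is assumed to map a document URL to a match count.

```csharp
using System;
using System.Collections.Generic;

public static class ResultUnion
{
    // Returns every document that appears in ANY result set. A document
    // found in several sets is listed once, with its weights added together.
    public static Dictionary<string, int> MatchAny(IEnumerable<Dictionary<string, int>> resultSets)
    {
        var final = new Dictionary<string, int>();
        foreach (Dictionary<string, int> set in resultSets)
        {
            foreach (KeyValuePair<string, int> entry in set)
            {
                int soFar;
                final.TryGetValue(entry.Key, out soFar); // soFar stays 0 for a new document
                final[entry.Key] = soFar + entry.Value;
            }
        }
        return final;
    }
}
```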
BTW, the variables in that code which I've called "Array" for simplicity are actually
either Hashtables or HybridDictionaries. Don't be confused when you look at the code
- there were good reasons why each Collection class was chosen (mainly that I didn't
know in advance the final number of items, so using Array was too hard).
The Finished Result
|
|
Screenshot 2 - The Search input page has minor changes, including the filename
to SearcharooToo.aspx! |
|
|
Screenshot 3 - You can refine your search, see the number of matches for each
search term, view the time taken to perform the search and, most importantly, see
the documents containing all the words in your query! |
Using the sample code
The goal of this article was to build a simple search engine that you can install
just by placing some files on your website; so you can copy Searcharoo.cs, SearcharooSpider.aspx
and SearcharooToo.aspx to your web root and away you go!
However that means you accept all the default settings, such as crawling from the
website root, and a 5 second timeout when downloading pages.
To change those defaults you need to add some settings to web.config:
<appSettings>
    <add key="Searcharoo_VirtualRoot" value="http://localhost/" /> <!--website to spider-->
    <add key="Searcharoo_RequestTimeout" value="5" /> <!--5 second timeout when downloading-->
    <add key="Searcharoo_RecursionLimit" value="200" /> <!--Max pages to index-->
</appSettings>
|
|
Listing 14 - web.config |
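Reading those settings with sensible fallbacks might look like the sketch below. The `SpiderSettings` class is hypothetical; it takes a NameValueCollection (the type that ASP.NET exposes appSettings as, for example via ConfigurationSettings.AppSettings) so it can be exercised without a web.config. The 5 second default comes from the text above, while the default recursion limit of 200 is an assumption taken from the example value.

```csharp
using System;
using System.Collections.Specialized;

public sealed class SpiderSettings
{
    public string VirtualRoot;
    public int RequestTimeoutSeconds;
    public int RecursionLimit;

    // Reads the Searcharoo_* keys, falling back to defaults when a key
    // is missing or does not contain a number.
    public static SpiderSettings Read(NameValueCollection appSettings, string defaultRoot)
    {
        var settings = new SpiderSettings();
        settings.VirtualRoot = appSettings["Searcharoo_VirtualRoot"] ?? defaultRoot;
        settings.RequestTimeoutSeconds =
            ParseOrDefault(appSettings["Searcharoo_RequestTimeout"], 5);
        settings.RecursionLimit =
            ParseOrDefault(appSettings["Searcharoo_RecursionLimit"], 200); // assumed default
        return settings;
    }

    private static int ParseOrDefault(string value, int fallback)
    {
        int parsed;
        return int.TryParse(value, out parsed) ? parsed : fallback;
    }
}
```

Remember that the timeout here is in seconds, so it would need multiplying by 1000 before being assigned to HttpWebRequest.Timeout, which is measured in milliseconds.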
Then simply navigate to
http://localhost/SearcharooToo.aspx (or
wherever you put the Searcharoo files) and it will build the catalog for the first
time.
If your application re-starts for any reason (eg. you compile code into the /bin/
folder, or change web.config settings) the catalog will need to be rebuilt - the next
user who performs a search will trigger the catalog build. This is accomplished by
checking if the Cache contains a valid Catalog and if not using Server.Transfer to
start the spider and return to the search page when complete.
Future
SearcharooSpider.aspx greatly increases the utility of Searcharoo, because you can
now index your static and dynamic (eg. database generated) pages to allow
visitors to search your site. That means you could use it with products like Microsoft
Content Management Server (CMS) which does not expose its content-database directly.
The two remaining (major) problems with Searcharoo are:
(a) It cannot persist the catalog to disk or a database - meaning that a very large
site will cause a lot of memory to be used to store the catalog, and
(b) Most websites contain more than just HTML pages; they also link to Microsoft Word
or other Office files, Adobe Acrobat (PDF Files) and other forms of content which
Searcharoo currently cannot 'understand' (ie. parse and catalog).
The next articles in this series will (hopefully) examine these two problems in more
detail...
Glossary
Term | Meaning |
HTML | Hyper Text Markup Language |
HTTP | Hyper Text Transfer Protocol |
URL | Uniform Resource Locator |
URI | Uniform Resource Identifier |
DOM | Document Object Model |
302 Redirect | The HTTP Status code that tells a browser to redirect to a different URL/page. |
UTF-8 | Unicode Transformation Format - 8 bit |
MIME Type | Multipurpose Internet Mail Extensions |
Spider | Program that goes from webpage to webpage by finding and following links in the HTML: visualize a spider crawling on a web :) |
Crawler | Although the terms 'spider' and 'crawler' are often used interchangeably, we'll use 'crawler' to refer to a program that locates target pages on a filesystem or external 'list'; whereas a 'spider' will find other pages via embedded links. |
Shift_JIS, GB2312 | Character sets... |
Search Engine Glossary |
Postscript : What about code-behind and Visual-Studio.NET?
(from Article I)
You'll notice the two ASPX pages use the src="Searcharoo.cs" @Page
attribute to share the common object model without compiling to an assembly, with
the page-specific code 'inline' using <script runat="server"> tags (similar to ASP3.0).
The advantage of this approach is that you can place these three
files in any ASP.NET website and they'll 'just work'. There are no other dependencies
(although they work better if you set some web.config settings) and no DLLs to worry
about.
However, this also means these pages can't be edited in Visual-Studio.NET,
because it does not support the @Page src="" attribute, instead preferring the codebehind=""
attribute coupled with CS files compiled to the /bin/ directory. To get these pages
working in VisualStudio.NET you'll have to set up a Project and add the CS file and
the two ASPX files (you can move the <script> code into the code-behind if you
like) then compile.
Links
Code for this article [ZIP 24kb]
Article I - which describes the data model and initial implementation
Working with Single-File Web Forms Pages in Visual Studio .NET (to help those wanting to use VisualStudio)