Search

Search technology is a huge subject, encompassing:

  • networking (spidering the web),
  • string and markup-language manipulation (parsing HTML),
  • language and text-parsing (finding words & sentences in documents, stemming and other linguistic analysis),
  • algorithms (finding matches, AND/OR queries, combining multiple word results), and
  • performance (both increasing spidering speed, and making large catalogs fast to search).
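
As a rough illustration of the "algorithms" item above, here is a minimal Python sketch of a word-to-document catalog with AND/OR matching. It is not the Searcharoo implementation (which is written for ASP.NET); the function names (tokenize, build_catalog, search_and, search_or) and the sample page names are assumptions made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_catalog(documents):
    """Map each word to the set of document ids that contain it."""
    catalog = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            catalog[word].add(doc_id)
    return catalog


def search_and(catalog, query):
    """Return documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Start from the first word's matches, then intersect with the rest.
    result = set(catalog.get(words[0], set()))
    for word in words[1:]:
        result &= catalog.get(word, set())
    return result


def search_or(catalog, query):
    """Return documents containing at least one word in the query."""
    result = set()
    for word in tokenize(query):
        result |= catalog.get(word, set())
    return result


# Hypothetical sample pages, used only to exercise the functions.
sample_pages = {
    "a.html": "Spidering the web",
    "b.html": "Parsing HTML on the web",
    "c.html": "Stemming words",
}
sample_catalog = build_catalog(sample_pages)
print(sorted(search_and(sample_catalog, "web parsing")))
print(sorted(search_or(sample_catalog, "spidering stemming")))
```

Storing a set of document ids per word keeps both query types down to simple set operations: intersection for AND, union for OR. A real catalog would also need ranking, stemming and stop-word handling, which this sketch leaves out.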

In addition to the articles and code below, these search-related links might be interesting or useful.

Searcharoo.NET - Version 7 (LATEST)
Highlight search terms in a proper 'document summary' on the results page
Searcharoo.NET - Version 6
Search, index and catalog images and GPS coordinates!
Searcharoo.NET - Version 5
Remove Binary Serialization to solve Medium Trust problem; index OpenXML document formats
Searcharoo.NET - Version 4
Refactored codebase, plus the ability to index and search Microsoft Word, Excel, PowerPoint and Acrobat PDF documents. Small improvements such as robots.txt support and excluding regions of HTML were also added.
Searcharoo.NET - Version 3
Add disk-based catalog persistence, frameset/iframe spidering, paged results, stemming, stop-words and more!
Searcharoo.NET - Version 2
Extend Searcharoo to populate its search catalog by spidering HTML pages - follow links and imagemaps to process both static and dynamically generated pages! You can also search for multiple words.
Searcharoo.NET - Version 1
How to build a simple, extensible search engine using ASP.NET that can crawl files and create a searchable catalog by processing the text from HTML source.
Useful links

searcharoo.net

On Search, the Series

Lucene.net [Open Source]

Nata1 [Open Source]

SiteSearchEngine [article]

What is Stemming?

Robots.txt
