Yider 0.5 - An ASP Spider/Search system

Used throughout the world in numerous languages, the Yider is an open source VBScript spider that allows you to quickly add a search system to your site like the one at the top of this page. It stores data in a Microsoft Access or SQL Server 2000 database, with optional full-text searching on SQL Server. The Yider does not require any custom DLLs or COM components to run, only the standard XMLHTTP or WinHTTP component described in Tutorial 1. It works for all languages. Gauged by typical usage, it seems there are 2-3 new Yider users a day in the world!

The Yider is very easy to use and requires little coding experience. It comes with full instructions and several detailed tutorials to get you up and running fast. If there's one philosophy guiding the Yider's development, it's to make the process of adding search functionality to a web site with Microsoft tools as simple, quick and easy as possible.

As of August 2005, I no longer offer support on the Yider. Please do not email me any queries as they won't be answered! Having said that, the Yider is quite stable and I still use it myself. If the first tutorial works for you then you should not have any problems.


1. Licensing

The Yider is copyrighted by (c) Yart Pty Ltd 2002 until eternity. The Yider and its associated source code are distributed under the terms of the GNU General Public License.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

If you would like to add features to the Yider's source code, please contact me and I will provide you with instructions on how to go about it.



2. Prerequisites


To use the Yider you need:

a) A web site that can run ASP with VBScript 5. This must be on a machine with Windows 2000, XP or later, or Microsoft Windows NT 4.0 with Microsoft Internet Explorer version 5.01 or later. You can determine the version of Internet Explorer by clicking About Internet Explorer on the Help menu. The version number reported in this dialog box should be 5.00.2900.6300 or higher.

b) A Microsoft Access or SQL Server 2000 database. If you are using SQL Server, you only need the database connection string - you don't need Enterprise Manager or Query Analyzer to use the Yider.

c) But you're still wondering, should I use the Yider - right? The Yider is suited to web sites that:

   i) Don't need Word documents or PDFs spidered,
   ii) Do not require access from .NET.

These limitations are being overcome, but that's where it stands at the current version.



3. Tutorials


Tutorial 1 - Installation


I'm going to hold your hand and gently walk you through using the Yider with several tutorials that demonstrate everything it can do. The basic tutorials to get you up and running are very simple and should take you no more than 15 minutes to complete.

1. Unzip the source code. This file contains an empty Access database that we'll use in the examples.

Note: If you download and use this file, make sure you join the spam-free mailing list for this site so you can be notified when new versions of the Yider are released.

2. The tutorial uses the Access version of the Yider and you should stick with this until you've completed Tutorial 2. Give the Yider folder write permission. This must be granted to the user that IIS is running under, not the user currently using the machine. Lest you quake in fear at this technical tidbit, I have created an FAQ that shows inexperienced system administrators exactly how to do this.

3. Set up a virtual site on IIS to point to the directory you unzipped the source code into (I'll be assuming you have called it http://localhost/yider in the tutorials to follow).

4. We now need to check whether the XMLHTTP or WinHTTP COM components are installed on your server. At least one of these components is required so that the Yider can read the contents of the URLs it is to spider. If you use Unicode character sets in your HTML, you must use XMLHTTP. Otherwise, WinHTTP should be satisfactory. To see which component is installed, run http://localhost/yider/xmlhttp.asp.

5. If it's not installed, you can download XMLHTTP here
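
A check of this kind can be sketched in a few lines of VBScript. This is only an illustration, not the code in xmlhttp.asp: the function name is hypothetical, and the ProgIDs in the usage note are assumptions about which versions of the components your server has.

```vbscript
' Returns True if a COM component with the given ProgID can be created.
Function IsComponentInstalled(prog_id)
    Dim obj
    On Error Resume Next            ' a missing component raises an error here
    Set obj = CreateObject(prog_id)
    IsComponentInstalled = (Err.Number = 0)
    Err.Clear
    Set obj = Nothing
    On Error GoTo 0
End Function
```

In an ASP page you might call it with IsComponentInstalled("MSXML2.ServerXMLHTTP") for XMLHTTP and IsComponentInstalled("WinHttp.WinHttpRequest.5.1") for WinHTTP and write the results with Response.Write.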



Tutorial 2 - Basic Usage


1. We are now going to spider (or should I say yider?) the first six pages in the site http://www.yart.com.au. To do this, run http://localhost/yider/population.asp and press the 'Populate' button. The Yider is configured to do this upon installation without a single setting being touched.

When you press this button, there will be a delay before the screen is refreshed with information about which files have been parsed. The length of the delay depends on the speed of your internet connection. During the delay, the Yider is extracting the contents of the pages on the Yart site, parsing them for more URLs to spider, and storing the contents in an Access database.
If you want to repeat this process, press the 'Clear' button to empty the Access database tables and then press 'Populate' again.

2. After you have populated the database once, try pressing the 'Populate' button again. You will notice that the Yider tells you that there are no more URLs to parse. The time it takes to do this is less than the time the original population took, but something is clearly going on. What is it? When you press the 'Populate' button, the Yider re-examines the contents of every URL and only re-indexes those pages whose total size in bytes has changed. If you want to force the Yider to spider every URL, select the 'Clear' button before you select the 'Populate' button.

3. Let's see what the Yider has spidered. Run http://localhost/yider/search.asp. Enter the term 'xml' into the search box and press the Go button. You should see a list of search result pages. The pages are ordered according to a page rank algorithm that I developed. You can read about it here. You can also search for words based on the * wildcard. Click on the Help button to see just how this works.

Note: If you use the Access version of the Yider, you must have the Access database closed during population and searching.
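
The size-based re-indexing rule from step 2 can be sketched as below. This is a minimal illustration and not the Yider's own code: the function name is hypothetical, and a Scripting.Dictionary stands in for the sizes the Yider keeps in its database.

```vbscript
' stored_sizes: Scripting.Dictionary mapping URL -> size in bytes when last indexed.
' Returns True when the page is new or its size has changed.
Function NeedsReindex(stored_sizes, url, current_size)
    If Not stored_sizes.Exists(url) Then
        NeedsReindex = True
    Else
        NeedsReindex = (stored_sizes(url) <> current_size)
    End If
End Function
```

One consequence of comparing sizes only is that a page edited without changing its byte count would not be picked up, which is a good reason to press 'Clear' first when in doubt.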

That's it, that's how easy it is to use the Yider. Let's now look at how you will go about using the Yider on your site.


Tutorial 3 - Essential Yider Configuration


1. The Yider has lots of options you can set and these are located in the file configuration.asp. Some of these must be changed by you to use the Yider. We'll cover the essential ones in this tutorial. Open configuration.asp in a text editor and we'll get going. If you find anything goes wrong during this tutorial, please read the frequently asked questions, which cover most stumbling blocks that are commonly encountered.

2. The first options to set are the database connection strings. If you are happy with the Access database in the installed directory (and there's no reason you shouldn't be) and you will be keeping all Yider files in the same directory, move on to the next point.

If you will be storing the Yider files in a separate directory from the Access database, you'll need to modify the variable g_path at the top of configuration.asp to point to the path of your Access database.

If the database is installed at an ISP and you aren't sure what its location is, paste this code into an empty ASP page and run this script from the database's directory:

<%

Dim path
path = Request.ServerVariables("PATH_TRANSLATED")
Response.Write("path is " & path)

%>


This code will print the full physical path of the page you are running, from which you can read off the directory.
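
PATH_TRANSLATED generally reports the physical path of the page including its file name. If you only want the directory part, a small helper along these lines could trim it. The function name and the example path in the usage note are hypothetical.

```vbscript
' Returns everything before the last backslash of a Windows path.
Function DirectoryOf(full_path)
    Dim pos
    pos = InStrRev(full_path, "\")
    If pos > 0 Then
        DirectoryOf = Left(full_path, pos - 1)
    Else
        DirectoryOf = full_path     ' no backslash: nothing to trim
    End If
End Function
```

For example, DirectoryOf("c:\inetpub\wwwroot\yider\path.asp") gives c:\inetpub\wwwroot\yider.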

You don't need to create any database tables to use the Yider. The Yider does this for you automatically.

If you want to use the SQL Server database, comment out line 10 of configuration.asp by placing a ' at the beginning of the line. Remove the ' at the beginning of line 7 and change YourSQLServerName, ADatabaseName, username and password to your own settings. The connection string must give you the database owner (dbo) role (the ability to create and delete tables) or you will receive error messages when using the Yider. If your database has been created by an ISP, your connection string may not provide dbo access, so it's best to check. Unless you turn on full-text searching (see Tutorial 7), SQL Server offers no performance improvement over Access, so I recommend the Access database because of its simplicity.
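
For orientation, a SQL Server connection string of the usual OLE DB form is sketched below. It is an assumption about what the uncommented line might look like, not a copy of configuration.asp: the variable name is borrowed from the Access example in Tutorial 5, and the server, database and login are the placeholders named above.

```vbscript
' Hypothetical example only. The exact line in your copy of configuration.asp
' may differ; every value below is a placeholder to replace with your own.
g_database_connection = "PROVIDER=SQLOLEDB;DATA SOURCE=YourSQLServerName;INITIAL CATALOG=ADatabaseName;USER ID=username;PASSWORD=password"
```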

3. g_url_to_spider - this, surprisingly, is the URL at which spidering commences. You will need to change this to your URL. The Yider will find every file that is referred to by href and src tags within this domain, even if they are in different directories. Wow.

Note: Although not essential, it is always best to run the Yider from a separate physical machine from the one you are spidering as there seems to be a bug with XMLHTTP on certain Windows 2000 patches on some users' systems.

4. g_valid_url_strings - the default value means that the fully qualified URL of every page the Yider crawls must contain the text 'www.yart.com.au'. The syntax for a single string is:

Array("www.yart.com.au", true)

or

Array("(www.yart.com.au)|(www.yartsoftware.com.au)", true)

for multiple URLs. You need to change this to an appropriate value for your site.

As the Yider crawls through an URL, it might find a link to say www.microsoft.com.au. If it started crawling through Microsoft's domain it would, of course, go on for a long, long time. This parameter prevents that from happening. The word 'true' should not be changed.

If you are a perceptive reader, you will notice that "(www.yart.com.au)|(www.yartsoftware.com.au)" is a VBScript regular expression and you can use any regular expression as the string. VBScript regular expressions are documented here.

Note: As the Yider crawls through an URL, it looks for href tags to see where it should go next eg:

<a href="index.asp">home</a>

You don't need to add www.yart.com.au to this for the Yider to work e.g. you don't need to do this:

<a href="http://www.yart.com.au/index.asp">home</a>

The Yider will be smart enough to do this for you.

5. g_valid_file_extensions - the Yider will only crawl through files that end with these file extensions. If you were indexing a PHP site, you might change this to:

Array("(htm)|(html)|(php)", true)

Or even:

Array("(htm)|(html)|(jsp)", true)

for a JSP site. The word 'true' should not be changed.

The first parameter in g_valid_file_extensions is a regular expression as per the example above.
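
To see how a setting such as g_valid_url_strings or g_valid_file_extensions behaves as a regular expression, you can try it against sample text with the VBScript RegExp object. The sketch below is not the Yider's own code: the function name is hypothetical, and treating the second array element as an ignore-case flag is an assumption.

```vbscript
' setting is a two-element array: Array(pattern, ignore_case).
' Returns True if the pattern matches anywhere in text.
Function MatchesSetting(text, setting)
    Dim re
    Set re = New RegExp
    re.Pattern = setting(0)
    re.IgnoreCase = setting(1)      ' assumption: True means case insensitive
    MatchesSetting = re.Test(text)
    Set re = Nothing
End Function
```

Remember that an unescaped dot matches any character, so "www.yart.com.au" also matches "wwwxyart.com.au". If that matters for your site, escape the dots: "www\.yart\.com\.au".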

6. g_max_pages - You need to change this value as the default is too small. The Yider will spider no more than this many pages. This variable should be set to at least 1 more than the total number of pages in your site. In that case, you might think it's wise to set it to 1,000,000,000. But what if you got something wrong and the Yider found a link to an URL you didn't want it to go to? It might keep spidering forever. This parameter is a safety measure built in to stop the Yider after it has searched a certain number of pages.

Note: To avoid a large amount of unnecessary database querying, the parameter g_max_pages is only implemented approximately. You may find a small number of extra pages are parsed when you use it.

7. g_your_email_address - Although not essential, it would be useful for you to fill this value out. The Yider uses global VBScript exception handling and sends me a report whenever it crashes along with some basic data about the error code and description. If this variable is filled in, I will be able to contact you to sort the problem out. Any email addresses filled out here will be kept confidential and not distributed to third parties.

At this point, you now have enough information to implement the Yider on your site. If you don't want any more information on the Yider's configuration options, skip to Tutorial 5.


Tutorial 4 - Optional Yider Configuration


1. g_compact - if this is set to true, the Yider will compress its database tables when you clear the database. This is a good idea, as databases do not automatically reduce their size when you empty their data. Change this option to true and, if you are using the Access database, change the database's folder permissions to write and modify for the IIS user. This is explained here.

2. g_default_documents - by default, this is set to 0. An example of using this variable is:

Array("index.htm", "index.asp")

If your web server has default web pages, e.g. when the URL http://www.yart.com.au defaults to http://www.yart.com.au/index.htm, you may wish to use this parameter. This is because your web site may contain pages that have links to both http://www.yart.com.au and http://www.yart.com.au/index.htm. The Yider will not know these URLs are the same page, so it will index both of them. This is inefficient and, worse, makes search results appear as though they are contained in two different pages. To remove this problem, add the default page names to the array.
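
The idea behind g_default_documents can be illustrated with a small normalising function. This is a sketch under stated assumptions rather than the Yider's implementation: the function name is hypothetical, the URLs are assumed to carry no query string, and dropping the trailing slash is a design choice of this example.

```vbscript
' Maps "http://host", "http://host/" and "http://host/index.htm" to one form
' so that they are treated as a single page.
Function NormalizeDefaultDocument(url, default_documents)
    Dim i, doc, result
    result = url
    For i = 0 To UBound(default_documents)
        doc = "/" & default_documents(i)
        If Len(result) > Len(doc) Then
            If StrComp(Right(result, Len(doc)), doc, vbTextCompare) = 0 Then
                result = Left(result, Len(result) - Len(doc))
                Exit For
            End If
        End If
    Next
    ' Treat a trailing slash the same as no slash at all.
    If Right(result, 1) = "/" Then result = Left(result, Len(result) - 1)
    NormalizeDefaultDocument = result
End Function
```

With Array("index.htm", "index.asp"), both http://www.yart.com.au/index.htm and http://www.yart.com.au/ come out as http://www.yart.com.au.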

3. g_urls_not_to_view - by default, this is set to 0. An example of using this variable is:

g_urls_not_to_view = Array("http://www.yart.com.au/finances", true)

or

g_urls_not_to_view = Array("(http://www.yart.com.au/finances)|(http://www.yart.com.au/pays_due)", true)

for more than one string.

The Yider will not visit any URLs containing any of the strings in the array. In our example, this means the Yider will not examine the contents of any URLs that contain the string "http://www.yart.com.au/finances" or "http://www.yart.com.au/pays_due". The last parameter, currently set to true, means the comparison will be case insensitive. If set to false, it will be case sensitive.

If you are a perceptive reader, you will notice that "(http://www.yart.com.au/finances)|(http://www.yart.com.au/pays_due)" is a VBScript regular expression and you can use any regular expression as part of the excluding string. VBScript regular expressions are documented here.

4. g_urls_to_view_not_store - by default, this is set to 0. An example of using this variable is:

g_urls_to_view_not_store = Array("http://www.yart.com.au/finances", true)

or

g_urls_to_view_not_store = Array("(http://www.yart.com.au/finances)|(http://www.yart.com.au/pays_due)", true)

for more than one string.

If the Yider finds an URL containing any of the strings in the array, it will search the web page for further URLs to spider, but not store the contents of those pages for searching. In our example, this means the Yider will not store the content of any URLs that contain the string "http://www.yart.com.au/finances" or "http://www.yart.com.au/pays_due" but it will search those web pages for further URLs to spider. The last parameter, currently set to true, means the comparison will be case insensitive. If set to false, it will be case sensitive.

The first parameter in g_urls_to_view_not_store is a regular expression as per the example above.
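
The difference between g_urls_not_to_view and g_urls_to_view_not_store can be summarised as a three-way decision for each URL. The sketch below is illustrative only: the function names and the returned labels are hypothetical, and letting g_urls_not_to_view win when both settings match is an assumption.

```vbscript
' Returns True if setting is Array(pattern, ignore_case) and the pattern matches.
' A setting left at its default of 0 never matches.
Function SettingMatches(text, setting)
    Dim re
    SettingMatches = False
    If Not IsArray(setting) Then Exit Function
    Set re = New RegExp
    re.Pattern = setting(0)
    re.IgnoreCase = setting(1)
    SettingMatches = re.Test(text)
End Function

' Decides what to do with a URL: ignore it, read it for links only, or index it.
Function ClassifyUrl(url, urls_not_to_view, urls_to_view_not_store)
    If SettingMatches(url, urls_not_to_view) Then
        ClassifyUrl = "skip"
    ElseIf SettingMatches(url, urls_to_view_not_store) Then
        ClassifyUrl = "follow links only"
    Else
        ClassifyUrl = "index"
    End If
End Function
```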

5. g_bad_page_strings - by default, this is set to 0. An example of using this variable is:

g_bad_page_strings = Array("bad string", true)

or

g_bad_page_strings = Array("(bad string)|(another bad string)", true)

for more than one string.

If any of the web pages (that's pages, not the URL itself) crawled contain the text of any of the strings in g_bad_page_strings, the page will not be added to the database. However, the page will be examined for further URLs to visit. In our example, this means the Yider will not store any pages that contain the text "bad string" or "another bad string". The last parameter should always be true.

Note: The first parameter in g_bad_page_strings is a regular expression as per the example above.

6. g_search_maintain_url_params - by default, this is set to 0. An example of using this variable is:

g_search_maintain_url_params = Array("foo", "param")

In addition, suppose you had the form variables:

<input type="hidden" name="foo" value="foovalue">
<input type="hidden" name="param" value="paramvalue">


embedded somewhere in the Yider <form>.

The Yider search results page would add the string

foo=foovalue&param=paramvalue

to the Next, Previous and more results page links at the bottom of the search results.

7. g_search_maintain_search_term - by default, this is set to an empty string. An example of using this variable is:

g_search_maintain_search_term = "yider_search_term"

Now, let's say you were searching for the string 'xml'. This would result in the string:

yider_search_term=xml

appended to the hyperlinks of document titles in the search results.

8. g_delete_between_tags - by default, this is set to 0. An example of using this variable is:

g_delete_between_tags = Array("<noindex>", "</noindex>", "<!--", "-->")

Using these settings, the Yider would not store any text between the tags:

<noindex> and </noindex> and <!-- and -->

It would, however, scan the text between these tags for further URLs to spider.
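
As an illustration of what removing text between tag pairs involves, here is a minimal sketch. It is not the Yider's own routine: the function name is hypothetical, and it removes the tags themselves along with the text between them.

```vbscript
' tags is a flat array of pairs: Array(open1, close1, open2, close2, ...).
' Returns text with every open...close span removed.
Function DeleteBetweenTags(text, tags)
    Dim i, start_pos, end_pos, result
    result = text
    For i = 0 To UBound(tags) - 1 Step 2
        start_pos = InStr(1, result, tags(i), vbTextCompare)
        Do While start_pos > 0
            end_pos = InStr(start_pos + Len(tags(i)), result, tags(i + 1), vbTextCompare)
            If end_pos = 0 Then Exit Do   ' unmatched opening tag: leave the rest alone
            result = Left(result, start_pos - 1) & Mid(result, end_pos + Len(tags(i + 1)))
            start_pos = InStr(start_pos, result, tags(i), vbTextCompare)
        Loop
    Next
    DeleteBetweenTags = result
End Function
```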

9. g_delete_between_tags_complete - by default, this is set to 0. An example of using this variable is:

g_delete_between_tags_complete = Array("<noindex>", "</noindex>", "<!--", "-->")

Using these settings, the Yider would neither store the text between the following tags for searching nor scan it for further URLs to spider:

<noindex> and </noindex> and <!-- and -->

10. If the site you are spidering requires authentication, you will need to set the g_username and g_password settings.

11. g_strip_url_parameters - by default, this is set to 0. An example of using this variable is:

g_strip_url_parameters = Array("y", "z")

This would result in the query string parameters "y=..." and "z=..." being stripped. For instance, the URL:

http://www.yart.com.au?y=4&z=5

would become:

http://www.yart.com.au

Why do you need this variable? Imagine two pages that had exactly the same content but perhaps different URLs:

http://www.yart.com.au?id=34
and
http://www.yart.com.au?id=45

If there is a search result in this content, it will appear to be in two pages unless you specify:

g_strip_url_parameters = Array("id")
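
The effect of g_strip_url_parameters can be sketched as a function that drops the named parameters from a query string and keeps the rest. This is an illustration rather than the Yider's code, and the function name is hypothetical.

```vbscript
' Removes the query string parameters listed in names from url.
Function StripUrlParameters(url, names)
    Dim q_pos, base, pairs, i, j, param_name, eq_pos, keep, kept
    q_pos = InStr(url, "?")
    If q_pos = 0 Then
        StripUrlParameters = url            ' no query string at all
        Exit Function
    End If
    base = Left(url, q_pos - 1)
    pairs = Split(Mid(url, q_pos + 1), "&")
    kept = ""
    For i = 0 To UBound(pairs)
        ' The parameter name is the part before "=", if there is one.
        param_name = pairs(i)
        eq_pos = InStr(param_name, "=")
        If eq_pos > 0 Then param_name = Left(param_name, eq_pos - 1)
        keep = (Len(pairs(i)) > 0)
        For j = 0 To UBound(names)
            If StrComp(param_name, names(j), vbTextCompare) = 0 Then keep = False
        Next
        If keep Then
            If Len(kept) > 0 Then kept = kept & "&"
            kept = kept & pairs(i)
        End If
    Next
    If Len(kept) > 0 Then
        StripUrlParameters = base & "?" & kept
    Else
        StripUrlParameters = base
    End If
End Function
```

Parameters that are not listed survive, so with Array("id") the URL http://www.yart.com.au?id=34&page=2 becomes http://www.yart.com.au?page=2.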

12. g_urls_per_iteration - I suggest you leave this value unchanged. The Yider will spider no more than this many pages before it pauses for g_pause seconds. Let me explain why this is necessary.

As you can imagine, the Yider works by requesting web pages from a web server. Sometimes, the requests for these pages will be issued faster than the web server can respond to them. In such cases, the web server will return pages that look like this:

The page cannot be displayed

The site you are attempting to access is temporarily unavailable due to heavy usage. Please try again later.

HTTP 403.9 - Too many users are connected
Internet Information Services


Technical Information (for support personnel)

  • Background:
    This error can occur if the Web server is busy and cannot process your request due to heavy traffic.

  • More information:
    Microsoft Support

Requesting a limited number of pages before pausing gives the web server intermittent rests and prevents this error.

Strangely enough, on my own web server, if I set g_urls_per_iteration to the number of pages in the site, I can never get it to fall over. In fact, the variables above were introduced because a user I no longer have contact with got this problem. If you can also generate this problem, please let me know as there is a means of the Yider automatically determining these variables without you having to set them but I need a helpful user to work with me on this one.

If you are analysing a real site, I suggest you leave the default settings as they are very conservative. If you want to speed up population, you may want to increase m_max_urls_per_parse.

13. From the variable g_search_style onwards, there are a number of variables that control the fonts used by the Yider when it displays results in search.asp. The best way to see what they do is to change the font sizes to extreme values so you can see where they are being applied.

14. g_use_keywords - Many sites use the meta tag to create keywords that are meant to be used by search engines when indexing a site e.g:

<meta name="keywords" content="City of Turlock government city, government, fines">

In the example above, searching for any of the words:

"City of Turlock government city, government, fines"

should return the page where this meta tag was found over a page where say the phrase "City of Turlock" was found in the body of the document but not in the meta tag.

If g_use_keywords is set to true, the Yider will add the keywords to the beginning of the text in the web page thus giving those words the highest possible page rank (Yider's page rank is explained here). In general, this is not recommended and the default setting of false should not be changed unless:

a) You are certain you want pages ranked by keywords,
b) Each page has different keywords. If it doesn't, the search results will always show the keywords first and it will look like every page has exactly the same data in the search results.


Tutorial 5 - Implementing the Yider On Your Site


1. After modifying the Yider's options in Tutorial 3 (or if you were keen, Tutorial 4 as well), you need to repopulate the database. Run http://localhost/yider/population.asp and select the 'Clear' button to remove the Yart site's data. Then select the 'Populate' button to spider your site.

2. Now all you have to do is place some of the code in search.asp into your web pages. Firstly, you have to add the search box to every page in your site that needs it. To do this, paste the code:

<!-- #include file="CYiderSearch.asp" -->
<!-- #include file="configuration.asp" -->
<!-- #include file="search_button_input.asp" -->


into the files that require a search box in the position you need it (see search.asp to see how simple this is).

If you have a complex site with many directories, I recommend keeping all Yider files in a separate Yider directory to keep things simple. For instance, if you had a web site with an URL such as www.mysite.com and all Yider files were located in www.mysite.com/yider, you would need to change the includes to:

<!-- #include virtual="yider/CYiderSearch.asp" -->
<!-- #include virtual="yider/configuration.asp" -->
<!-- #include virtual="yider/search_button_input.asp" -->


If you were using the Access database, you would also have to hard code the database path in configuration.asp:

g_database_connection = "PROVIDER=MICROSOFT.JET.OLEDB.4.0;DATA SOURCE=c:\a directory\on your PC\yider.mdb"

3. It is possible to use a graphical button instead of the standard HTML Go button. To do this, paste the code:

<!-- #include file="CYiderSearch.asp" -->
<!-- #include file="configuration.asp" -->
<!-- #include file="search_button_image.asp" -->


into the files that require a search box.

4. The search box you have added will display results in a file with the name defined by the variable g_results in configuration.asp. If you want to change the name of the results page, do so here e.g:

g_results = "search.asp"

5. You now have to add code to your search results page. To do this, paste the code:

<!-- #include file="CYiderSearch.asp" -->
<!-- #include file="configuration.asp" -->
<!-- #include file="search_include.asp" -->


into the files that require a search box and results in the position you require it (see search.asp to see how simple this is).

6. If you would like the length of the search box increased, modify the parameter g_search_box_length in configuration.asp.

That's it!


Tutorial 6 - Implementing the Yider On Foreign Language Sites


1. The Yider did not spider foreign language sites correctly until version 0.4. However, if you used the Yider pre version 0.4 on a foreign language site and it worked, you can almost certainly ignore this section.

2. Change g_english in configuration.asp to false. This seemingly trivial variable has the effect of slowing down spidering significantly. Unfortunately, when spidering foreign language sites, the Yider has to read HTTP streams in binary and then convert all characters back to ASCII. This is a particularly inefficient operation in VBScript, whose string handling abilities are terrible. Oh well.

3. Change g_charset in configuration.asp to the charset you use in the HTML tag:

<meta http-equiv="Content-Type" content="text/html; charset=gb_2312-80">

would result in:

g_charset = "gb_2312-80"

4. Change the Windows Family Codepage on line 1 in the files population.asp and search.asp:

<%@CodePage=936 Language=VBScript%>

5. There are a number of English words used throughout the Yider's search results pages. These are defined from the variable g_text_search onwards in configuration.asp. These need to be translated to your language. I won't explain them all, as they are pretty obvious. The best way to make sure you've covered them all is to test them on a search word that you know will appear in more than 10 pages as this will display them all.

6. Rewrite the search_help.htm file in your language. Note that this search page does not appear in multibyte character sets like Chinese, Japanese, Korean etc., because the Yider cannot use the * wildcard when searching in these languages. If you use one of these languages, don't bother with this point.

7. That's it! Try repopulating and searching your site.

Note: If searching doesn't appear to work, reboot your machine. There is a bug in at least some versions of Windows that causes foreign languages not to be recognised during the search if you switch between foreign languages during development. This is something we experience every time we execute the Yider's test plan!


Tutorial 7 - Full-text Searching with SQL Server

1. In order to use the Yider with full-text searching, you must use the SQL Server database connection string in configuration.asp.

2. If you are upgrading, delete the three tables starting with the word 'Yider' in the database.

3. Set the variable g_full_text to true in configuration.asp.

4. If you are fussy, set the variable g_local_ID to your language. The default seems to be OK for English. The variable is used when the stored procedure sp_fulltext_column is applied to the database. SQL Server help defines g_local_ID as:

"...the language of the data stored in the column. The following table lists languages included in SQL Server..."

and then goes on to define the values you can see in configuration.asp. I presume this variable allows SQL Server to recognise word boundaries in various languages.

5. Note that full-text searching with the Yider does not work for multibyte languages like Chinese, Korean, Japanese etc.


4. To Do List

Look at the Change Log after the current release. It shows all Yider features and roughly when they will appear.


5. Frequently Asked Questions


I get an error message when populating the Yider

This is the most common problem when using the Yider and is usually the result of an oversight. The best way to find the problem is to remove the line:

on error resume next

at approximately line 50 in population.asp. This will cause the Yider to crash when the error occurs but will give a much clearer error message. The error will state the problem and the file and the line number and should be fairly obvious to fix.
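
If you would rather not remove the line, you can also inspect the Err object straight after a suspect statement, because On Error Resume Next records the failure there instead of stopping the script. The sketch below shows the general pattern on a deliberately failing division. It is unrelated to the Yider's own code and the function name is hypothetical.

```vbscript
' Demonstrates reading Err after a statement that may fail.
Function SafeDivide(a, b)
    On Error Resume Next
    SafeDivide = a / b
    If Err.Number <> 0 Then
        ' Without this check the failure would pass unnoticed.
        SafeDivide = "Error " & Err.Number & ": " & Err.Description
        Err.Clear
    End If
    On Error GoTo 0
End Function
```

SafeDivide(10, 2) returns 5, whereas SafeDivide(1, 0) returns a message beginning with "Error 11", the VBScript code for division by zero.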


Should I Use The Access Or SQL Server Version Of The Yider?

If you don't need full text searching, there's really no difference between the Access and SQL versions. Since Access databases are usually cheaper to host on a commercial ISP than SQL Server, and simpler to use, I suggest Access as the most economical alternative.


How Do I Use The Yider To Index Web Pages That Are Created Dynamically?

Let's say you have a site whose navigation depends on a select:

<select name="page">
<option id="1">News</option>
<option id="2">Press Releases</option>
</select>


When a button is clicked to submit the form, the next web page might be:

articles.asp?id=1 for case 1 or
articles.asp?id=2 for case 2.

If these pages are not referred to by href tags in the site, you need to trick the Yider into including them. To do this, you should include a reference to these files in a page you know the Yider will spider.

<a href= "articles.asp?id=1"></a>
<a href= "articles.asp?id=2"></a>


Note that since there's no text within the link tags, they are invisible to viewers of the site but not the Yider. If you don't want to create dummy tags in an existing page because it will be too slow to load, you could put the tags in a dummy file that the Yider spiders but no one else sees. The dummy page could be referenced from an existing page e.g.

<a href= "dummy.asp"></a>

dummy.asp would contain thousands of links to files the Yider has to spider.
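
Rather than typing thousands of links by hand, dummy.asp could generate them in a loop. The sketch below assumes the ids are consecutive numbers, which may not be true of your site. The function name is hypothetical.

```vbscript
' Builds one empty link per id, e.g. <a href="articles.asp?id=1"></a>
Function BuildDummyLinks(page, first_id, last_id)
    Dim id, html
    html = ""
    For id = first_id To last_id
        html = html & "<a href=""" & page & "?id=" & id & """></a>" & vbCrLf
    Next
    BuildDummyLinks = html
End Function
```

Inside dummy.asp you would then write something like Response.Write BuildDummyLinks("articles.asp", 1, 2).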

See this frequently asked question as well.


Isn't Full Text Searching Necessary For Fast, Efficient Searching?

The Yider works by using the SQL LIKE operator. This is, in effect, a GREP on database tables. Now if you've ever tried GREP on large text files (say 10 MB), you'll notice that it performs searches fairly quickly on modern processors. If you have a site whose total text content is less than 10 MB, there's no need for the full-text version. That's 10 MB of text only, not images, as the Yider strips everything else from the HTML files it stores.
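
To make the idea concrete, a LIKE query for a search term could be built as shown below. This is a sketch only and not the query the Yider actually issues: the table name pages and the columns url and content are hypothetical, as is the function name, and mapping the * wildcard to SQL's % is an assumption.

```vbscript
' Builds a LIKE query that finds the term anywhere in the stored text.
Function BuildLikeQuery(term)
    Dim safe_term
    safe_term = Replace(term, "'", "''")       ' escape single quotes
    safe_term = Replace(safe_term, "*", "%")   ' user wildcard -> SQL wildcard
    BuildLikeQuery = "SELECT url FROM pages WHERE content LIKE '%" & safe_term & "%'"
End Function
```

Because LIKE has to scan the stored text of every row, the search time grows with the total amount of text, which is why the 10 MB rule of thumb above matters.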


How Does The Yider Spider A Site?

The function GetURLsDirect from CURLExtractor uses WinHTTP to extract the HTML from the domain it has been requested to search. It then searches for href and src tags to find further files.

href and src tags can refer to different files within a web site in many ways:

<a href="index.asp">a link</a>
<a href="/index.asp">a link</a>
<a href="./index.asp">a link</a>
<a href="../index.asp">a link</a>
<a href="../index.asp#comment">a link</a>
<a href="../index.asp?id=1">a link</a>
<a href="index/">a link</a>
<a href="javascript:newWin('newpage.asp?ID=130',450,400);">a link</a>
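
Each of these forms has to be turned into a full URL before the file can be fetched. The sketch below shows one way to do that. It is not the code in GetURLsDirect: the function name is hypothetical, the page URL is assumed to be absolute, and links that cannot be fetched (such as javascript: links) simply return an empty string.

```vbscript
' Resolves an href against the URL of the page it was found on.
' Returns "" when there is nothing to fetch (javascript:, mailto:, bare #anchor).
Function ResolveUrl(page_url, href)
    Dim link, pos, host_end, root, path, query, parts, stack, count, i

    ResolveUrl = ""
    link = Trim(href)

    ' A #fragment points inside a page, so it is dropped.
    pos = InStr(link, "#")
    If pos > 0 Then link = Left(link, pos - 1)
    If link = "" Then Exit Function

    If LCase(Left(link, 11)) = "javascript:" Then Exit Function
    If LCase(Left(link, 7)) = "mailto:" Then Exit Function

    ' Already absolute: nothing to resolve.
    If LCase(Left(link, 7)) = "http://" Or LCase(Left(link, 8)) = "https://" Then
        ResolveUrl = link
        Exit Function
    End If

    ' Split the page URL into scheme + host ("root") and path.
    host_end = InStr(InStr(page_url, "://") + 3, page_url, "/")
    If host_end = 0 Then
        root = page_url
        path = "/"
    Else
        root = Left(page_url, host_end - 1)
        path = Mid(page_url, host_end)
    End If
    pos = InStr(path, "?")
    If pos > 0 Then path = Left(path, pos - 1)

    ' Relative links start from the page's directory; "/..." starts from the root.
    If Left(link, 1) <> "/" Then link = Left(path, InStrRev(path, "/")) & link

    ' Set the query string aside while the path is tidied up.
    query = ""
    pos = InStr(link, "?")
    If pos > 0 Then
        query = Mid(link, pos)
        link = Left(link, pos - 1)
    End If

    ' Walk the segments, dropping "." and letting ".." remove its parent.
    parts = Split(link, "/")
    ReDim stack(UBound(parts))
    count = 0
    For i = 1 To UBound(parts)
        If parts(i) = ".." Then
            If count > 0 Then count = count - 1
        ElseIf parts(i) <> "." Then
            stack(count) = parts(i)
            count = count + 1
        End If
    Next

    path = ""
    For i = 0 To count - 1
        path = path & "/" & stack(i)
    Next
    If path = "" Then path = "/"

    ResolveUrl = root & path & query
End Function
```

For a page at http://www.yart.com.au/docs/page.asp, the link ../index.asp?id=1 resolves to http://www.yart.com.au/index.asp?id=1.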



What Information Does The Yider Extract From A Web Page?

The Yider strips HTML files of all:

a) HTML comment tags,
b) Carriage returns,
c) Line feeds.

The stripped file contains only the text a user would see when reading a web page in a browser. The stripped file is then stored in the database ready for searching.
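
A rough idea of this kind of stripping can be given with the VBScript RegExp object. The sketch below is an approximation and not the Yider's parser: the function name is hypothetical, it needs a script engine that supports non-greedy matching (VBScript 5.5 or later), and it makes no attempt to handle script blocks, style blocks or HTML entities.

```vbscript
' Reduces HTML to the plain text between its tags, on a single line.
Function StripHtml(html)
    Dim re, text
    Set re = New RegExp
    re.Global = True
    re.IgnoreCase = True

    re.Pattern = "<!--[\s\S]*?-->"      ' comments, including their contents
    text = re.Replace(html, " ")
    re.Pattern = "<[^>]+>"              ' remaining tags
    text = re.Replace(text, " ")
    re.Pattern = "\s+"                  ' carriage returns, line feeds, runs of spaces
    text = re.Replace(text, " ")

    StripHtml = Trim(text)
    Set re = Nothing
End Function
```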


How Does The Yider Obtain Search Results?

This is best explained by an example. Let's say you had a web page whose source looked like this:


<html>
<head>
<title>My Life In Yart</title>
</head>
<body>
At first, I wanted to be an artist but my mother suggested I would be much better off as a Yartist.
It was a more liberating profession. I wasn't hard to convince.
</body>
</html>

Once this file is parsed it would be stored in the database like this:

My Life In Yart At first, I wanted to be an artist but my mother suggested I would be much better off as a Yartist. It was a more liberating profession. I wasn't hard to convince.


Now, let's say you searched for the term "Life". The Yider would search for the first occurrence of the word "Life" and display 10 words before it and 10 words after it. As you can see, there is only 1 word before "Life" but more than 10 after it. This results in a search results page that looks like this:


My Life In Yart
My Life In Yart At first, I wanted to be an artist ...
http://localhost/yidertest/crossing_the.htm

Result pages: 1   


Because there are more than 10 words after the word "Life", three dots are appended to the match. Now let's say you searched for the phrase "Life hard". This phrase does not occur in the page but the individual words do. The Yider searches the document for each word and highlights it:


My Life In Yart
My Life In Yart At first, I wanted to be an artist ...
... better off as a Yartist. It was a more liberating profession. I wasn't hard to convince.

http://localhost/yidertest/crossing_the.htm

Result pages: 1   


Three dots followed by 10 words appear before the word "hard" to show that the match is in the middle of the page. Now let's say you searched for the phrase "Life first hard". This phrase does not occur in the page but again the individual words do. The Yider again searches the document for each word and highlights it:


My Life In Yart
My Life In Yart At first, I wanted to be an artist ...
... better off as a Yartist. It was a more liberating profession. I wasn't hard to convince.

http://localhost/yidertest/crossing_the.htm

Result pages: 1   


Since the word "first" occurs within the search results of the word "Life", it is highlighted in the same line. When this happens, 10 words are shown before and after the first matched word.
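The excerpt rule described above (up to 10 words either side of the first match, with three dots where text was cut off) can be sketched in VBScript as follows. The names BuildExcerpt and CONTEXT_WORDS are hypothetical and the code is a simplified illustration, not the Yider's actual implementation. It handles a single search term and does no highlighting.

```vbscript
' Hypothetical helper, not taken from the Yider's source.
Const CONTEXT_WORDS = 10

Function BuildExcerpt(ByVal text, ByVal term)
    Dim words, i, hit, first, last, excerpt
    words = Split(text, " ")
    hit = -1

    ' Find the first word that contains the search term, ignoring case
    For i = 0 To UBound(words)
        If InStr(1, words(i), term, vbTextCompare) > 0 Then
            hit = i
            Exit For
        End If
    Next

    ' No match: return an empty excerpt
    If hit = -1 Then
        BuildExcerpt = ""
        Exit Function
    End If

    ' Clamp the window to the bounds of the document
    first = hit - CONTEXT_WORDS
    If first < 0 Then first = 0
    last = hit + CONTEXT_WORDS
    If last > UBound(words) Then last = UBound(words)

    excerpt = ""
    For i = first To last
        If i > first Then excerpt = excerpt & " "
        excerpt = excerpt & words(i)
    Next

    ' Three dots mark text that was cut off on either side
    If first > 0 Then excerpt = "... " & excerpt
    If last < UBound(words) Then excerpt = excerpt & " ..."

    BuildExcerpt = excerpt
End Function
```

Called with the stored text from the example and the term "Life", this sketch returns "My Life In Yart At first, I wanted to be an artist ...", which matches the first excerpt line shown above.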


How Does The Yider Rank Search Results?

The Yider's page rank system is documented here.


Can I Use the Yider If My Web Site Is Not In English?

Yes. Read this tutorial.


I Can't Use The Yider Because It's Missing A Small Feature. What Do I Do?

Write to me and tell me what you need. I usually add features that Yider users ask for so there's a good chance your requirements will be added in a forthcoming release.


Why Was The Yider Created?

When I first wrote this company's web site, I thought I'd use the Google Free Site Search system to make it searchable. Unfortunately, it came with a whole host of problems. So I thought I'd write my own searching system. After all, how hard could it be to write a spider? I've called the result a Yider (that's a Yarted spider).


Does The Yider Leak Memory?

In VBScript, there are two ways a script can leak memory. The first is not setting objects to Nothing:

set file_object = Server.CreateObject("Scripting.FileSystemObject")
set file_object = Nothing


The second is by not closing recordsets, databases or file objects:

set recordset = database.Execute("select * from [Yider]")
recordset.Close
set recordset = Nothing

set database = Server.CreateObject("ADODB.Connection")
database.Open database_connection_string
database.Close
set database = Nothing


Of course, if you diligently freed memory, you would never have a leak. But how can you be sure that in hundreds of lines of code, you haven't missed a deallocation? The only way I am aware of is to count the set and open operations and check that each one is matched by a cleanup operation. The Yider does this with the variables g_set and g_open:

set file_object = Server.CreateObject("Scripting.FileSystemObject")
g_set = g_set + 1
set file_object = Nothing
g_set = g_set - 1

set recordset = database.Execute("select * from [Yider]")
g_set = g_set + 1
g_open = g_open + 1
recordset.Close
g_open = g_open - 1
set recordset = Nothing
g_set = g_set - 1

set database = Server.CreateObject("ADODB.Connection")
g_set = g_set + 1
database.Open database_connection_string
g_open = g_open + 1
database.Close
g_open = g_open - 1
set database = Nothing
g_set = g_set - 1


g_set and g_open are global counters and their values are printed out when I am developing the Yider. If either is not equal to 0, then I know there is a missing cleanup operation somewhere in my code.
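A check of this kind could be run at the end of a page during development. The sketch below is only an illustration: the Sub name ReportLeakCounters and the messages are hypothetical, and only the counter names g_set and g_open come from the description above.

```vbscript
' g_set and g_open are the global counters described above
Dim g_set, g_open
g_set = 0
g_open = 0

' Hypothetical helper: call it once, after all other code has run
Sub ReportLeakCounters()
    If g_set <> 0 Or g_open <> 0 Then
        ' A non-zero counter means a Set or an Open was never undone
        Response.Write "<br>Possible leak: g_set = " & g_set & ", g_open = " & g_open
    Else
        Response.Write "<br>All objects were released and closed"
    End If
End Sub
```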

Microsoft claims that VBScript internally cleans up resources when scripts exit. However, I have never believed this to be 100% true. My ISP, who hosts thousands of sites, confirmed my suspicion. They run numerous memory scans on scripts as they execute. Owners of scripts that begin consuming 100 MB of resources are told to clean up their code or find another host. Tough but fair.


How Do You Use WinHTTP To Store Charsets From Foreign Languages In Access and SQL Server Databases?

One of the worst problems when developing the Yider was retrieving web pages written in foreign languages using WinHTTP. I thought I'd describe how I did this as it might save someone many hours of wasted time.

1. These must be the first two commands in the ASP page:

<%@ CodePage=936 Language=VBScript %>
<% Response.Charset = "gb2312" %>

If you are not sure what code page to use for your character set, refer to this.

2. You must retrieve foreign language pages in WinHTTP in binary format using the ResponseBody property of WinHTTP, not ResponseText.

3. You must convert the binary string returned by ResponseBody into text using a binary to text function. I use this one:


Function BinaryToString1(Binary, CharSet)

  Const adTypeText = 2
  Const adTypeBinary = 1

  'Create Stream object
  Dim BinaryStream 'As New Stream
  Set BinaryStream = CreateObject("ADODB.Stream")

  'Specify stream type - we want to write binary data first.
  BinaryStream.Type = adTypeBinary

  'Open the stream and write the binary data to the object
  BinaryStream.Open
  BinaryStream.Write Binary

  'Rewind and change the stream type to text
  BinaryStream.Position = 0
  BinaryStream.Type = adTypeText

  'Specify the charset of the source data.
  BinaryStream.CharSet = CharSet

  'Read the data back from the object as text
  BinaryToString1 = BinaryStream.ReadText

  'Close the stream and release the object
  BinaryStream.Close
  Set BinaryStream = Nothing

End Function


4. Make sure you use the tag:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1250">

in all foreign language web pages.

5. If your charset is Unicode (see this), make sure you store data in SQL Server as ntext, nvarchar etc. See SQL Server help for the special "N" prefix to insert data into Unicode fields e.g.

insert into [atable] values (N'value')
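Steps 2 and 3 above can be combined as in the following sketch. It assumes the BinaryToString1 function from step 3 is available on the page. The file name index.htm and the charset "gb2312" are placeholders for illustration, and the variable names are hypothetical.

```vbscript
' Sketch only: the URL and the charset are placeholders
Dim request, pageText
Set request = Server.CreateObject("WinHttp.WinHttpRequest.5.1")

' False makes the call synchronous, so Send returns when the page has arrived
request.Open "GET", "http://localhost/yidertest/index.htm", False
request.Send

' ResponseBody holds the raw bytes of the page; convert them with the
' charset the page was written in rather than relying on ResponseText
pageText = BinaryToString1(request.ResponseBody, "gb2312")

Set request = Nothing
```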


How Do I Enable Permissions For The Yider Directory?

Firstly you need to enable folder permissions to be viewed in Windows Explorer. Open Windows Explorer and select the Yider directory. Then select Tools/Folder Options from the main menu and click on the View tab. Scroll down to the 'Use Simple File Sharing (Recommended)' option and make sure it is not ticked. That's NOT ticked.

Now, go back to Windows Explorer and right click on the folder the Yider is in. Select 'Properties' and then the 'Security' tab (removing simple file sharing enables the Security tab which is otherwise not visible). Click on the user IIS is running under. I use Windows XP Professional and for me that's the user called 'Users'. For default Yider usage, turn on the 'Write' option. If you have the Yider option g_compact set to true, turn the Modify option on as well. For what it's worth, I always switch both these options on no matter how I'm using the Yider. If you're not sure which user IIS is, turn these options on for all users.



If you're using Windows 2000 Server Pro, you will need to turn on a few more options. Select the Yider folder, Properties, Security Tab, Select IIS user account, Advanced, Permissions Tab, Select IIS user account again, View/Edit and select the following options:



Note: These settings may be slightly different in Windows 2000 Server and Windows 2000 Advanced Server.

If you're hosting the Yider at an ISP, ask them how you go about changing folder permissions. Most ISPs provide a control panel to set these options, or they give you a special folder where databases are to be placed and where these options are on by default.


How Do I Spider Sites That Contain JavaScript Menus?

The Yider finds web pages by searching for href and src tags:

<a href="articles.htm">articles</a>

and

<frame name="bottom" src="index.htm" scrolling="auto">

If your navigation system uses JavaScript, Flash, etc., these tags might not be present. To ensure every page is spidered, create an invisible link on a page you know is being spidered to another page which lists all your URLs:

<a href="dummy.htm"></a>


dummy.htm would contain all your links:

<a href="page1.htm">page1.htm</a>
<a href="page2.htm">page2.htm</a>
<a href="page3.htm">page3.htm</a>
<a href="page4.htm">page4.htm</a>
etc



6. Troubleshooting Yider Problems


Known Bugs

1. If the Yider doesn't seem to work, firstly try running it from a separate physical machine from the one you are spidering. There is a bug with XMLHTTP on certain Windows 2000 patches that prevents the spidering of URLs on the same physical machine.

2. The Yider cannot exclude a specific page if there are other pages that are similar. For instance, if you wanted to search the URL http://www.yart.com.au for further URLs to parse but not store its contents, you might think the setting:

g_urls_to_view_not_store = Array("http://www.yart.com.au", true)

would do it. However, this setting would also exclude any URL containing the string http://www.yart.com.au such as http://www.yart.com.au/yider.asp. This will be fixed in version 0.5.

3. The Yider incorrectly weights common words similarly to other words in the search as described in the comments to this article.

4. The Yider will not find URLs reached by redirects unless they have a fully qualified name or are virtual eg:

Don't do this in a page that is reached from a redirect:

<a href="../page.htm">

instead, do this:

<a href="/page.htm"> or <a href="http://www.yart.com.au/page.htm">

5. The Yider cannot find the & character in sites that were spidered with foreign language settings on.

6. The Yider cannot use the * wildcard when searching in multibyte character sets like Chinese, Japanese, Korean etc.

7. If you change the Yider's charset in the foreign language settings frequently, you may find it does not return results accurately. This is considered a Windows bug and the only way to be sure it is not affecting you is to reboot after modifying charsets.
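Known bug 2 above behaves like a substring comparison. The following sketch shows the difference between a substring test and an exact test on a URL. Both function names are hypothetical and the code is not taken from the Yider's source; it only illustrates why http://www.yart.com.au/yider.asp is caught by a setting of http://www.yart.com.au.

```vbscript
' Hypothetical illustration; neither function is taken from the Yider's source.

' Substring test: True for http://www.yart.com.au and also for
' http://www.yart.com.au/yider.asp when pattern is http://www.yart.com.au
Function ContainsPattern(ByVal url, ByVal pattern)
    ContainsPattern = (InStr(1, url, pattern, vbTextCompare) > 0)
End Function

' Exact test: True only for the URL itself, ignoring case and a trailing slash
Function IsSameUrl(ByVal url, ByVal pattern)
    If Right(url, 1) = "/" Then url = Left(url, Len(url) - 1)
    If Right(pattern, 1) = "/" Then pattern = Left(pattern, Len(pattern) - 1)
    IsSameUrl = (StrComp(url, pattern, vbTextCompare) = 0)
End Function
```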


I Get An Error When Using The Yider

If the error is not covered by any of the troubleshooting tips below, send me:

a) The URL you are trying to spider.
b) The contents of your population.asp or population_large.asp file.
c) The URL of the page with the error if you can detect it.

I respond to most errors within 24 hours.


I Get A WinHTTP Error When Trying To Index The Site

The error is in the file CURLExtractor.asp at the point:

m_winHttp.send()

the error says:

WinHttp.WinHttpRequest (0x80072EE7)
The server name or address could not be resolved


This is the most common error with the Yider. WinHTTP is telling you that it can't resolve or connect to the URL you are requesting. This could be because your network is down, your Internet connection is down, or there is an error with WinHTTP. If the error persists, try running this script:

<%Option Explicit%>

<html>
<head>
</head>
<body>

<%
Dim html, winHttp, URL
set winHttp = Server.CreateObject("MSXML2.ServerXMLHTTP")
URL = "http://www.microsoft.com"

' Trap errors so that a failed request reaches the check below
On Error Resume Next
winHttp.open "GET", URL, false
winHttp.send()

if Err.Number <> 0 then
    Response.Write "<br>The URL " & URL & " cannot be found"
else
    html = winHttp.ResponseText
    Response.Write "The text at " & URL & " is " & Server.HTMLEncode(html)
end if
On Error Goto 0

set winHttp = Nothing
%>

</body>
</html>


This script connects to http://www.microsoft.com and dumps the HTML. If this doesn't work but you can connect to the Internet, reboot. If it still doesn't work, try reinstalling XMLHTTP. If it does work, replace http://www.microsoft.com with your URL and see what happens.


My Web Site Is Not In English And Pages Referred To By Directory Listings Don't Seem to Be Spidered

If you have pages that are directory listings and you have used non-English characters in the URLs in those pages, there is a chance those characters will be modified during spidering. The Yider won't be able to find those pages because the modified characters produce a non-existent URL. This is because spidering uses a codepage to interpret characters whilst directory listings are never created with a codepage by IIS. Thus it's best to avoid non-English characters in URLs.


I Get A Server Timeout Error Or The Error "A Connection With The Server Could Not Be Established" During Spidering

There are two entirely different reasons why this may be happening.

It may be due to this Microsoft knowledge base article:

"...the calling Active Server Page (ASP) should not send requests to an ASP in the same virtual directory or to another virtual directory in the same pool or process. This can result in poor performance due to thread starvation..."

Luckily, this problem seems to occur irregularly. The way to get around it is as follows. If you are trying to spider http://www.yart.com.au from http://www.yart.com.au/yider/population.asp and you see this error, try creating a new virtual directory e.g. http://new_virtual_directory/yider/population.asp or http://localhost/yider/population.asp or http://127.0.0.1/yider/population.asp

Alternatively, disable the AspExecuteInMTA metabase property for your site.

Secondly, the code processing web pages may be calling the database faster than the database can respond to requests. Try setting max_pages to 10 and urls_per_iteration to 10 as these are very low values. If this solves the problem, increase these values until failure occurs to determine the maximum spidering speed you can attain.


The Yider Is Not Spidering Every Page In My Site

The Yider finds web pages by searching for href and src tags:

<a href="articles.htm">articles</a>

and

<frame name="bottom" src="index.htm" scrolling="auto">

If your navigation system uses JavaScript, Flash, etc., these tags might not be present. To ensure every page is spidered, create an invisible link on a page you know is being spidered to another page which lists all your URLs:

<a href="dummy.htm"></a>


dummy.htm would contain all your links:

<a href="page1.htm">page1.htm</a>
<a href="page2.htm">page2.htm</a>
<a href="page3.htm">page3.htm</a>
<a href="page4.htm">page4.htm</a>
etc



7. Change Log


Note: Always download the latest version of the Yider. I only offer support on the most recent releases.

Date Description Release
18th Sep 02 Bug fix. Previous version thought links like this were valid:
&lt;a href="index.asp"&gt;link description&lt;/a&gt;
0.11
19th Sep 02 Feature - Ability to exclude pages that contain certain strings when Spidering
Feature - Ability to exclude URL's that contain certain strings when Spidering
Bug fix - URLs containing the single quote character now work.
0.2
27th Sep 02 Feature - Spidering performance improvement to cope with sites over 200 pages.
Bug fix - crash when search term contains a single-quote.
Bug fix - SQL 7 Yider table creation bug fixed.
0.21
10th Oct 02 When installing this release you must first manually delete the table called Yider in your SQL database.
Feature - improved usage on big sites where web server speed used to require manual code modifications.
Feature - the spidering algorithms have been improved yet again for faster indexing.
Bug fix - some types of sentence endings were not correctly recognised in previous versions.
0.22
8th Nov 02 Bug fix - Database connection string modified 0.23
18th Jan 03 Bug fix - Malformed tag crash
Bug fix - crash when spidering files that are images named as asp, php, jsp etc., pages
0.24
22nd Jan 03 Feature - Support for Access databases added
Feature - Phrase matches rank higher than individual word matches

Note: When upgrading to this release, you must manually delete the table 'Yider' in your database
0.3
24th Jan 03 Feature - Support for Access database compression added
Bug fix - pages containing tags like this caused a crash </title> <title>. Why people would do this, I don't know, but they do. It takes all sorts...
0.31
7th Feb 03 Feature - Support for SQL database compression added
Feature - Enhanced presentation of matches when the search term consists of multiple words
Bug fix - some types of unusual hrefs were missed by previous versions or counted twice by the Yider
e.g <a href="/somewhere">link</a>
<a href="./somewhere">link</a>
<a href="javascript:newwin('file.asp')">link</a>

Note: When upgrading to this release, you must change the Access compaction option from m_compact_access to m_compact
0.32
3rd March 2003
Feature - You can now re-yider, oops, I mean re-spider a site without clearing the database. The Yider will only update pages that have changed since its last pass.

Bug fix - imagine you were at the page http://www.yart.com.au/articles_new/more/index.asp and there was a href "../index.asp". This means there must be a page at http://www.yart.com.au/articles_new/index.asp and the Yider got this right. Now imagine you were at the page http://www.yart.com.au/index.asp and there was a href "../../../../../abc.asp". Where does this refer to? Surprisingly, it refers to http://www.yart.com.au/abc.asp which the Yider got wrong in previous releases.

Bug fix - Any URL with a # would never have been found.

Bug fix - Searching for terms containing any of the letters "/\.*+?|()[]{}" caused a crash.

Bug fix - the Yider did not highlight multiple matches if they were adjacent (ie separated only by a space). Consider the sentence:

the the the cat

If you searched for 'the', the Yider would display the matched result as

the the the cat

The correct result is:

the the the cat

Note: When upgrading to this release, you must delete the Access and SQL Server tables Yider and YiderResult
0.33
2nd September 2003 Feature - Page ranking
Feature - Searching using the * wildcard.
Feature - Enhanced directory exclusion.
Feature - Improved More Page links for > 100 results.
Feature - New configuration options g_search_maintain_url_params, g_search_maintain_search_term, g_default_documents, g_delete_between_tags and g_your_email_address.
Feature - The Yider uses a single configuration file.
Feature - Help rewritten based on a usability study.
Bug fix - The Yider uses XMLHTTP as WinHTTP has trouble with Unicode
Bug fix - The Yider now copes with foreign languages correctly. This was not thought to be a problem with previous versions of the Yider because foreign language sites appeared to work correctly on our test server and some other users' servers. By a tortuous process, it was discovered that, depending on which version of Windows you were using, the patches you had, the last charset you were using and, believe it or not, the last time you had rebooted your machine, the Yider may or may not have worked correctly. This has now been solved.
Bug fix - Highlight matched words in the document title
Bug fix - Highlighting of matched words that contained any of the escaped regular expression characters .*?\+|()[] was slightly incorrect
Bug fix - Memory leaks removed
Bug fix - The search results did not highlight a word if it was the very last word in the document
Bug fix - Spidering did not occur properly for pages without a . in their name e.g. http://www.website.com/dir?parameter1&title=parameter2

Beta i changes
Bug fix - Remove text between <head></head> for searching in all pages spidered
Bug fix - g_default_documents bug when URLs contained JavaScript
Bug fix - catch of WinHTTP exceptions on URLs that return redirects
Bug fix - spidering of multibyte character set languages like Chinese now works
Bug fix - ability to search for . and & characters
Bug fix SQL Server version - ability to search for search terms containing the characters []^%_

Beta j changes
Bug fix - Highlighting search terms in multibyte charsets fixed

Beta k changes
Bug fix - Switch to using octal character matches in VBScript's regular expressions as single character searches were not showing up correctly in the previous version.
Bug fix - Tags that are stripped from the Yider are replaced by a space character.

0.4 release changes
Bug fix - When the Yider generates a message, it displays a JavaScript box which failed when the message had new line characters
Bug fix - SQL Server sometimes doesn't like retrieving the contents of the data type 'text' from a recordset unless they are placed in a variable first

0.41 changes
Bug fix - When the Yider generates a message, it displays a JavaScript box which failed when the message had new line characters, yep, just like 0.4 release

0.42 changes
Bug fix - PageRank failed on some installations of Windows due to foreign language settings affecting number formatting

Note: When upgrading to this release, you must delete the Access and SQL Server tables Yider, YiderResult and YiderConstants. In addition, you should also read this file as the source code has changed significantly.
0.4
17th June 2004 Feature - Full text searching on SQL Server
Feature - Enhanced error reporting. All Err object properties printed
Feature - Increase in the search styles on the line You searched for the word...
Feature - New configuration parameter g_strip_url_parameters
Feature - New configuration parameter g_delete_between_tags_complete very similar to g_delete_between_tags
Bug fix - Text between <script> tags is removed prior to parsing and searching
Bug fix - Access version would not work if it resided in a directory with spaces
Bug fix - Highlighting was incorrect if a phrase search didn't return a complete match but the words in the phrase did occur in a document (thought I had this one fixed months ago!!!)
Bug fix - Documents with a title match but no body showed up with an unnecessary blank line
Bug fix - hrefs of the form <a href='url.html'> were not found
Bug fix - hrefs of different case were ignored

Note: When upgrading to this release, you must delete the Access and SQL Server tables Yider, YiderResult and YiderConstants. In addition, you should also read this file as the source code has changed significantly.
30th October 2004 Feature - Added the variable g_use_keywords 0.51
16th November 2004 Bug fix - https would not spider. Can't believe this bug was picked up only now!!! 0.52
28th November 2004 This is the current version of the Yider
Bug fix - searching with image didn't work
0.53
Some time in the future Yider additions in approximate order of importance:

Allow spidering of Word and PDF files
Create an API to only index one page at a time
Spider files in robots.txt files
Ability to allow searching of sites that are identical, sit on the same machine but are listed under multiple domains.
Ability to define the structure of links rather than just searching for href and src tags
Ability to give certain search words more importance and/or rank URL matches with those words more highly
Create a GUI for all variables in configuration.asp
Display the results of URLs parsed, URLs not parsed, pages not parsed, badly formed URLs and all other options visually
Warn the user when server timeouts are occurring
.NET searching
Spell checking
For those with physical or terminal services access to the server, a .vbs or executable that could be scheduled in Task Manager, run from the commandline, etc
?

