Frequently Asked Question #21
Can I prevent spiders from indexing pages or directories within my site?
Yes. To block all other spiders from indexing your site while still allowing FusionBot to crawl it, place a file named robots.txt (all lower case) in your site's root directory containing, at a minimum, the following:

User-Agent: *
Disallow: /

User-Agent: fusionbot
Disallow:

- OR -

User-Agent: *
Disallow: /

User-Agent: fusionbot
Disallow: /cgi-bin/
Disallow: /info/secret.htm
Disallow: /info/brochure.pdf

The first example above DISALLOWS every other spider (*) from indexing any directory within your site while DISALLOWING NO directories for the FusionBot spider (an empty Disallow line means nothing is off limits). The end result is a successful index build for your FusionBot implementation while still preventing other spiders from indexing your content.

The second example is similar, but also shows how to prevent FusionBot from indexing particular directories and pages within your site. In this example, no other spider will index any content, and the FusionBot spider will index all of your content except anything within the /cgi-bin directory and its sub-directories, and the files 'secret.htm' and 'brochure.pdf' in the 'info' directory.
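Before publishing either file, it can help to check how a standards-following parser reads it. The sketch below uses Python's standard urllib.robotparser module as a stand-in; it is not FusionBot's own parser, and the host www.yoursite.com, the sample page paths, and the agent name "otherbot" are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

FIRST_EXAMPLE = """\
User-Agent: *
Disallow: /

User-Agent: fusionbot
Disallow:
"""

SECOND_EXAMPLE = """\
User-Agent: *
Disallow: /

User-Agent: fusionbot
Disallow: /cgi-bin/
Disallow: /info/secret.htm
Disallow: /info/brochure.pdf
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


SITE = "http://www.yoursite.com"  # placeholder host, not a real requirement

first = load_rules(FIRST_EXAMPLE)
second = load_rules(SECOND_EXAMPLE)

# First example: FusionBot may fetch everything, every other spider nothing.
print(first.can_fetch("fusionbot", SITE + "/products/widget.asp"))  # True
print(first.can_fetch("otherbot", SITE + "/products/widget.asp"))   # False

# Second example: FusionBot is kept out of the listed paths only.
print(second.can_fetch("fusionbot", SITE + "/cgi-bin/search.cgi"))  # False
print(second.can_fetch("fusionbot", SITE + "/info/secret.htm"))     # False
print(second.can_fetch("fusionbot", SITE + "/info/index.htm"))      # True
print(second.can_fetch("otherbot", SITE + "/info/index.htm"))       # False
```

This only confirms how the standard rules are interpreted. FusionBot-specific extensions such as absolute URLs and wildcards in Disallow lines are not understood by this module.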

Alternatively, you may log in to your FusionBot account and set up an internal robots deny list by clicking on the 'Spider' tab, then selecting the 'Exclude Pages & Directories' link. The syntax for creating your deny list via the Robots Exclusion Form is identical to that of a robots.txt file placed on your web server. However, because you do not have to publish an actual text file to your server, the process is much simpler. Creating a deny list in your FusionBot account only affects how the FusionBot spider crawls your site. Also, if you have implemented BOTH a robots.txt file AND populated your Robots Exclusion Form, FusionBot will combine their contents when determining which pages or directories should be excluded from your index.

Also, there may be times when you include another site for indexing within your mini-portal, but want to specify which sections of that site should not be indexed. While FusionBot will always adhere to the contents of that site's own robots.txt file, the site may not have implemented one, or you may wish to specify additional or alternate pages and directories that our crawler should omit.

To do this, simply add additional Disallow instructions with absolute URLs in your own robots.txt file or FusionBot Exclusion Form.

For example, assuming you have a site named www.widgets.com as part of your mini-portal and you wish to exclude the contents of its /cgi-bin directory, add the final Disallow line shown here to the end of the fusionbot section of your robots.txt file:

User-Agent: fusionbot
Disallow: /cgi-bin/
Disallow: /info/secret.htm
Disallow: /info/brochure.pdf
Disallow: http://www.widgets.com/cgi-bin
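One way to picture how a deny list mixing relative paths and absolute URLs might be applied is sketched below. This is an illustration only, not FusionBot's actual logic: the function name is_excluded is hypothetical, and it assumes that relative rules apply only to your own host (www.yoursite.com is a placeholder) while absolute rules are matched as prefixes of the full URL.

```python
from urllib.parse import urlsplit


def is_excluded(url, disallow_rules, own_host):
    """Return True if url falls under any Disallow rule.

    Assumptions (not confirmed FusionBot behavior): relative rules apply
    only to own_host, and absolute rules are prefix-matched against the
    full URL.
    """
    parts = urlsplit(url)
    for rule in disallow_rules:
        if not rule:
            continue  # an empty Disallow excludes nothing
        if rule.startswith(("http://", "https://")):
            if url.startswith(rule):
                return True
        elif parts.hostname == own_host and parts.path.startswith(rule):
            return True
    return False


rules = [
    "/cgi-bin/",
    "/info/secret.htm",
    "/info/brochure.pdf",
    "http://www.widgets.com/cgi-bin",
]
own_host = "www.yoursite.com"  # placeholder for your own site

print(is_excluded("http://www.widgets.com/cgi-bin/order.cgi", rules, own_host))  # True
print(is_excluded("http://www.widgets.com/catalog.htm", rules, own_host))        # False
print(is_excluded("http://www.yoursite.com/info/secret.htm", rules, own_host))   # True
```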

You may also implement a robots.txt file for your mini-portal sites without specifying any pages or directories to be omitted within your own site.

For a detailed explanation of how the robots.txt file can be implemented, please visit http://www.robotstxt.org/wc/exclusion-admin.html.

In addition to adhering to the robots.txt standards outlined in the previous link, FusionBot also extends the functionality of the robots.txt standard by allowing you to populate your DISALLOW statements with wildcards (*). In this manner, you can instruct, for example, the FusionBot crawler to not crawl any pages on your site that contain a specific value within the pagename / querystring.

For example, you may have various pages that offer a version optimized for viewing in a browser, and another version, optimized for printing. Having FusionBot crawl both of these pages would result in a number of duplicate / unnecessary pages being crawled and indexed.

In this scenario, the only characteristic that differentiates one page from the other is often an additional querystring variable. For example, the browser-optimized page may have a URL such as:

http://www.yoursite.com/products/widget.asp

While the print optimized page's URL would appear as:

http://www.yoursite.com/products/widget.asp?print=1

Many different pages throughout a site may use this same syntax, where "print=1" indicates the "print" version of the same document. In this case, using wildcards in your robots.txt file or FusionBot online exclusion form, you can instruct FusionBot not to crawl your print-only pages by including the following lines:

User-Agent: fusionbot
Disallow: *print=1*

In this manner, any document / link on your site that contains "print=1", anywhere within the pagename / url, will be omitted from your index. Use this same syntax for any documents that contain a common querystring variable, anywhere within the URL, that should be omitted, when present.
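To preview which URLs a wildcard rule would catch, the matching can be approximated in a few lines of Python. This is a rough sketch rather than FusionBot's implementation: the helper names wildcard_to_regex and is_disallowed are hypothetical, and it assumes that * matches any run of characters and that a rule without a leading wildcard must match from the start of the path.

```python
import re


def wildcard_to_regex(pattern):
    """Turn a Disallow pattern into a regex, treating * as 'any characters'."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts))


def is_disallowed(path_and_query, patterns):
    """Return True if any pattern matches from the start of the path."""
    return any(wildcard_to_regex(p).match(path_and_query) for p in patterns)


patterns = ["*print=1*"]

print(is_disallowed("/products/widget.asp?print=1", patterns))               # True
print(is_disallowed("/products/widget.asp", patterns))                       # False
print(is_disallowed("/products/widget.asp?id=7&print=1&lang=en", patterns))  # True
print(is_disallowed("/products/widget.asp?print=10", patterns))              # True
```

The last line shows a side effect of "contains" matching: under these assumptions a rule of *print=1* would also catch a URL containing "print=10", so choose the wildcard value with other querystring values in mind.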

Also, please refer to our FAQ for details on implementing robots exclusion syntax within the <HEAD> section of each page on your site.


If you did not find an answer to your question, please return to the Support Page and fill out the form provided.


Copyright 1998 - 2010 LOGIKA Corporation. All rights reserved.