What Are Search Engine Robots?

Search engine robots, sometimes called "spiders", "bots" or "crawlers", are programs that search engines use to explore the web, seeking out new pages and checking known pages for updated content. The information a search engine uses to rank your pages, and every other page it holds in its database, was found by these spiders before being indexed and added to the database.

There are four reasons a robot might visit your pages:

  1. You submitted the URL to the search engine
  2. The engine knows about your pages and is checking to see if the content has changed
  3. The robot has followed an internal link to a new page you have recently uploaded
  4. The robot has followed an external link from another site that links to you

You might think that, with such a major role to play in indexing the web, robots would be powerful and sophisticated animals. You would be wrong! Robots are relatively simple programs with limited functionality, not unlike early browsers.

Robots either don't understand or have difficulty understanding:

  • Frames
  • Flash Movies
  • Flash Intros
  • Invalid Code
  • JavaScript
  • Image Maps
  • Dynamically Generated URLs
  • JavaScript Navigation
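
To illustrate why JavaScript navigation is a problem, here is a minimal sketch of the kind of link discovery a simple robot might perform. It is an assumption for illustration only, not the code of any real search engine: the page content, the `LinkCollector` class and the `extract_links` function are all hypothetical. The parser reads plain `<a href>` links but never runs scripts, so the JavaScript-driven link goes unnoticed.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from plain <a> tags without running any scripts."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # A "javascript:" href needs a script engine to resolve, so skip it.
            if name == "href" and value and not value.startswith("javascript:"):
                self.links.append(value)


# Hypothetical page: one plain link and one link that depends on JavaScript.
SAMPLE_PAGE = """
<html><body>
<a href="/products.html">Products</a>
<a href="javascript:goTo('contact')">Contact</a>
<script>
function goTo(name) { window.location = '/' + name + '.html'; }
</script>
</body></html>
"""


def extract_links(html):
    """Return the links a script-blind parser can find in the given HTML."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    print(extract_links(SAMPLE_PAGE))
```

Only the products page is discovered here; the contact page would need a plain link somewhere on the site to be found by a robot working this way.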

When a robot arrives at your website, the first thing it does is check your robots.txt file, if you have one. This file tells robots which pages or directories you don't want indexed, such as directories containing legacy pages or printer-friendly pages. A robot gathers as much information as it can about a page before following any links through to other pages.
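
As a rough sketch of that first step, the example below uses Python's standard `urllib.robotparser` module to decide whether a URL may be fetched. The robots.txt content, the directory names `/legacy/` and `/print/`, the robot name `ExampleBot` and the `example.com` URLs are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps all robots out of two directories.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /legacy/
Disallow: /print/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("http://www.example.com/index.html",
                "http://www.example.com/legacy/old.html"):
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "ExampleBot", url))
```

A well-behaved robot would skip any URL for which this check returns False. Nothing forces a robot to honour the file, so it should not be relied on to keep malicious spiders out.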

Not all robots are friendly. Some malicious spiders are designed to scrape e-mail addresses that will later be used to send unsolicited spam e-mail.

If you have access to your server logs or a log statistics program, you can see which robots visited your site, when they visited, which pages they visited and how often they visit. Some robots are easily spotted from their user agent names, such as "Googlebot", Google's spider.
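
If you only have raw logs, a short script can do the spotting for you. The sketch below assumes your server writes the common "combined" log format, with the user agent as the last quoted field; your own format may differ. The sample log lines, the IP addresses and the short `ROBOT_MARKERS` list are hypothetical and would need adapting to your site.

```python
import re
from collections import Counter

# Assumed "combined" log format: ... [time] "METHOD path proto" status size "referrer" "agent"
LOG_PATTERN = re.compile(
    r'\[(?P<when>[^\]]+)\] "(?:GET|HEAD|POST) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Lowercase fragments that identify a robot's user agent (illustrative subset).
ROBOT_MARKERS = ("googlebot", "scooter", "slurp")

# Hypothetical log lines for demonstration.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Oct/2004:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.10 - - [10/Oct/2004:13:55:40 +0000] "GET /about.html HTTP/1.0" 200 1100 "-" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"',
    '192.0.2.55 - - [10/Oct/2004:14:02:11 +0000] "GET /index.html HTTP/1.0" 200 2326 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
    '192.0.2.77 - - [11/Oct/2004:09:12:03 +0000] "GET /index.html HTTP/1.0" 200 2326 "-" "Scooter-3.0.3"',
]


def robot_visits(log_lines):
    """Count visits per (robot marker, page path) for known robots."""
    visits = Counter()
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue  # Line is not in the assumed format.
        agent = match.group("agent").lower()
        for marker in ROBOT_MARKERS:
            if marker in agent:
                visits[(marker, match.group("path"))] += 1
                break
    return visits


if __name__ == "__main__":
    for (robot, path), count in sorted(robot_visits(SAMPLE_LOG).items()):
        print(robot, path, count)
```

Matching on a fragment of the user agent is a convenience, not proof of identity, because any program can claim to be a well-known robot.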

If you identify activity from these spiders in your server logs or log statistics program, your pages are probably about to be listed on that particular search engine. Be patient, however: some search engines can take 3 to 6 months to update their databases.

The following information is intended to help you identify the search engine spiders and robots that visit your site, based on what you can see in your site's visitor log reports.

For information on blocking any of these robots using the robots.txt exclusion standard, see
http://www.robotstxt.org/wc/exclusion.html

Company: Alta Vista
User Agent: Scooter-3.0.3 (many variations; most contain the word "Scooter")
robots.txt Identifier: User-agent: Scooter
Details: (none listed)

Company: Ask Jeeves
User Agent: Mozilla/2.0 (compatible; Ask Jeeves)
robots.txt Identifier: User-agent: directhit
                       User-agent: teomaagent1
Details: (none listed)

Company: Fast Search and Transfer ASA
User Agent: FAST-WebCrawler/3.4/Nirvana)
AKA: Mozilla/4.0 (compatible; FastCrawler3, support-fastcrawler3@fast.no)
robots.txt Identifier: User-agent: fast
Details: http://fast.no/support/crawler.asp
Powers: Alltheweb.com, Lycos and many smaller search engines

Company: Google
User Agent: Googlebot/2.1 (+http://www.googlebot.com/bot.html)
AKA: Wget/1.5.3
AKA: Googlebot-image (+http://www.googlebot.com/bot.html)
robots.txt Identifier: User-agent: googlebot
Details: http://www.googlebot.com/bot.html

Company: Inktomi
User Agent: Slurp (Slurp.so/1.0 (slurp@inktomi.com; http://www.inktomi.com/slurp.html)
robots.txt Identifier: User-agent: slurp
Details: http://www.inktomi.com/slurp.html
Powers: AOL, MSN and many others

Company: Lycos
User Agent: Lycos_Spider_(modspider)
AKA: T-Rex
robots.txt Identifier: User-agent: lycos
Details: Lycos is powered by the search engine at Fast; we don't know why they continue to operate their own spider.

Company: Microsoft / MSN
User Agent: MSNbot / MSRbot
robots.txt Identifier: N/A
Details: MSNbot is the development bot for their new search engine. MSRbot is supposed to be a research bot.
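
The identifiers listed above are what you would put in a robots.txt record to block a particular robot. As a hedged sketch, the helper below builds such records and checks them with Python's standard `urllib.robotparser`. The function names `build_robots_txt` and `can_visit` and the `example.com` URL are hypothetical, and the check only shows how Python's parser reads the file; a real robot may interpret it differently or ignore it.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_agents, blocked_path="/"):
    """Return robots.txt text that disallows blocked_path for each agent."""
    records = []
    for agent in blocked_agents:
        records.append(f"User-agent: {agent}\nDisallow: {blocked_path}\n")
    # Records are separated by a blank line.
    return "\n".join(records)


def can_visit(robots_txt, user_agent, url):
    """Return True if Python's parser reads robots_txt as allowing the visit."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_robots_txt(["Scooter", "slurp"])
    print(rules)
    page = "http://www.example.com/index.html"
    print("Scooter-3.0.3:", can_visit(rules, "Scooter-3.0.3", page))
    print("Googlebot/2.1:", can_visit(rules, "Googlebot/2.1", page))
```

In this sketch the two named robots are shut out of the whole site, while robots with no matching record remain free to visit.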
