[CSEO 1] Instructions and information about the "Bot List"

#1 admin

  • Administrator
  • Group: Root Admin
  • Posts: 7,799
  • Joined: 25-January 07

Posted 12 February 2007 - 04:34 AM

The /community_seo/Documentation/ folder contains two files: Bot_List_Readme.txt and Default_Bot_List.txt. If you are new to Invision Power Board, you may not understand what these are for.

Firstly, I would like to extend my thanks to Daniel, the author of the largest bot list on the net. You can click on the preceding link to visit his site and view his bot lists, email extractor lists, spam bot lists, and 'other lists' (which he describes as link checkers, verifiers, etc.). The included Default_Bot_List.txt is a modified version of his original bot list.

In the IPB ACP, you can go to Tools & Settings -> Search Engine Spiders, where one of the settings (Spider Bot User-Agent) lets you enter spider mappings. Each mapping has the form:

(user agent string)=Displayed Bot Name


The "(user agent string)" is a string that is matched against the visitor's HTTP User Agent. The Displayed Bot Name is what a matching visitor will be shown as on the site and in the spider logs in the ACP. By default, IPB supplies only 6 spider mappings. They are (arguably) the most common/largest spiders, but the list is far from comprehensive.

googlebot=Google.com
slurp@inktomi=Hot Bot
ask jeeves=Ask Jeeves
lycos=Lycos.com
whatuseek=What You Seek
ia_archiver=Archive.org
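
To make the mapping format concrete, here is a minimal Python sketch of how such a list could be parsed and checked. This is not IPB's actual code: the function names are hypothetical, and it assumes a case-insensitive substring match where the first matching entry in the list wins.

```python
def parse_mappings(text):
    """Parse 'pattern=Displayed Bot Name' lines into an ordered list of pairs."""
    mappings = []
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        pattern, name = line.split("=", 1)
        mappings.append((pattern.strip().lower(), name.strip()))
    return mappings


def identify_spider(user_agent, mappings):
    """Return the display name of the first pattern found in the user agent."""
    ua = user_agent.lower()
    for pattern, name in mappings:
        if pattern in ua:
            return name
    return None  # not a known spider


DEFAULT_LIST = """\
googlebot=Google.com
slurp@inktomi=Hot Bot
ask jeeves=Ask Jeeves
lycos=Lycos.com
whatuseek=What You Seek
ia_archiver=Archive.org
"""

mappings = parse_mappings(DEFAULT_LIST)
print(identify_spider("Mozilla/5.0 (compatible; Googlebot/2.1)", mappings))  # Google.com
print(identify_spider("Mozilla/5.0 (Windows NT 5.1) Firefox/2.0", mappings))  # None
```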


What does this mean for you?

Not much, specifically, but the more mappings you have, the more spiders you can recognize, log, and treat specially. This is why we want to enter in a more comprehensive list of mappings so we can monitor different kinds of spiders.

Why not just link to Daniel's original list?

The problem I found with Daniel's list was that many entries were effectively duplicates. Here is an example:

AbachoBOT (Mozilla compatible)=Crawler.de
AbachoBOT=Crawler.de


Remember that the text on the left side is matched against the user agent, and if a match is found, the name on the right side is displayed. Here, the forums first try to match the string "AbachoBOT (Mozilla compatible)"; if that matches, the name "Crawler.de" is used. If not, the next string, "AbachoBOT", is tested, and its name, "Crawler.de", is used. The problem is that we can achieve the same end result by matching ONLY against the second entry, "AbachoBOT": any bot that matches the first entry also matches the second, so there is no benefit to testing them separately. Additionally, the larger this list is, the more overhead IPB has to contend with in checking for the bots.
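
The trimming rule described above can be sketched in a few lines of Python. This is an illustration only, with a hypothetical function name: it treats an entry as redundant when another entry with the same display name has a pattern that is contained in it (case-insensitively). It does not consider entries with different names that might sit between the two, so any result should still be reviewed by hand.

```python
def remove_redundant(mappings):
    """Drop entries whose pattern contains another same-name entry's pattern."""
    kept = []
    for i, (pattern, name) in enumerate(mappings):
        p = pattern.lower()
        redundant = False
        for j, (other, other_name) in enumerate(mappings):
            if i == j or other_name != name:
                continue
            o = other.lower()
            # A shorter same-name pattern inside this one makes this entry
            # unnecessary; for exact duplicates, only the first copy is kept.
            if (o in p and o != p) or (o == p and j < i):
                redundant = True
                break
        if not redundant:
            kept.append((pattern, name))
    return kept


entries = [
    ("AbachoBOT (Mozilla compatible)", "Crawler.de"),
    ("AbachoBOT", "Crawler.de"),
]
print(remove_redundant(entries))  # [('AbachoBOT', 'Crawler.de')]
```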

By going through Daniel's list, to which I was originally just going to link, and removing duplicates like the one above, I've cut the list down by 10 KB. This is a huge resource saving when you factor in that the list has to be parsed and loaded into a regular expression on every single page load.
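
As a rough picture of what "loaded into a regular expression" can look like, here is a Python sketch that joins every pattern into one case-insensitive alternation. It is an assumption about the general technique, not IPB's actual session handler code, and the function names are hypothetical. Note that a regex alternation reports the leftmost match in the user agent, which is not always the same as "first entry in the list wins".

```python
import re


def compile_spider_regex(mappings):
    """Build one case-insensitive regex from all patterns, plus a name lookup."""
    names = {}
    for pattern, name in mappings:
        names.setdefault(pattern.lower(), name)  # first duplicate wins
    alternation = "|".join(re.escape(pattern) for pattern, _ in mappings)
    return re.compile(alternation, re.IGNORECASE), names


def lookup(user_agent, regex, names):
    """Return the display name for the matched pattern, or None."""
    match = regex.search(user_agent)
    if match is None:
        return None
    return names[match.group(0).lower()]


MAPPINGS = [
    ("googlebot", "Google.com"),
    ("slurp@inktomi", "Hot Bot"),
    ("ask jeeves", "Ask Jeeves"),
    ("lycos", "Lycos.com"),
    ("whatuseek", "What You Seek"),
    ("ia_archiver", "Archive.org"),
]

regex, names = compile_spider_regex(MAPPINGS)
print(lookup("Mozilla/5.0 (compatible; Googlebot/2.1)", regex, names))  # Google.com
```

Every extra entry adds another alternative to build and test, which is why a shorter list is cheaper when this work is repeated on each request.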

Summary

The bot list is used to recognize spiders, with the option to treat them specially (for example, putting them in a special member group or forcing them to use a specific skin) and the ability to log spider activity on your site. We've included a more comprehensive list than the one included in IPB. You can see the original list that our included list is based on here; however, at the time of this writing, there is no benefit to using the original list over our trimmed list.

#2 justme

  • Advanced Member
  • Group: Customers
  • Posts: 313
  • Joined: 22-March 07

Posted 16 May 2007 - 12:05 PM

Here's a simple list that catches most of the major bots:
slurp@inktomi=Hot Bot
lycos=Lycos.com
whatuseek=What You Seek
ArchitectSpider=Excite.com
ask jeeves=Ask Jeeves
BSDSeek/1.0=Inktomi.com
BullsEye=Intelliseek.com
Google=Google.com
Googlebot/1.0=Google.com
Googlebot/2.1= Google.com
Googlebot/Test=Google.com
googlebot@googlebot.com=Google.com
Googlebot=Google.com
Googlebot-Image/1.0=Google.com
Googlebot-Image/1.0=Google.com Image Bot
gsa-crawler (Enterprise; GID-01422; jplastiras@google.com)=Google.com
gsa-crawler (Enterprise; GID-01742;gsatesting@rediffmail.com)=Google.com
gsa-crawler=Google.com
Inktomi Search=Yahoo.com
Inktomi=Yahoo.com
jeeves=Ask Jeeves
MSNBOT/0.1=MSN.com
msnbot=MSN.com
Slurp.so/1.0= Yahoo.com
Slurp/2.0j=Yahoo.com
Slurp/2.0-KiteHourly=Yahoo.com
Slurp/2.0-OwlWeekly=Yahoo.com
Slurp/3.0-AU=Yahoo.com
slurp@inktomi.com=Yahoo.com
slurp@inktomi= Yahoo.com
Slurp=Yahoo.com
spider@aeneid.com=Yahoo.com
www.inktomisearch.com=Yahoo.com
Yahoo-Blogs/v3.9=Yahoo.com Blogs
YahooSeeker/CafeKelsa=Yahoo.com

A note of warning: if you effectively filter all the bots from your online list, the number of "guests" may APPEAR to drop significantly (identified bots are not shown as guests). The larger the site, the more pronounced the effect. In my experience, using the list above instead of the default IPB list makes the number of 'guests' online drop by about half.
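
The effect on the guest count can be illustrated with a small Python sketch. The session user agents below are made-up samples and the function name is hypothetical; the point is only that the same visitors are counted as guests or as bots depending on which list is installed.

```python
def count_online(user_agents, patterns):
    """Return (guests, bots) for a list of anonymous session user agents."""
    guests = bots = 0
    for ua in user_agents:
        if any(p.lower() in ua.lower() for p in patterns):
            bots += 1  # recognized spider, not shown as a guest
        else:
            guests += 1
    return guests, bots


# Hypothetical anonymous sessions currently online
sessions = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "msnbot/1.0 (+http://search.msn.com/msnbot.htm)",
    "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/2.0",
]

default_patterns = ["googlebot", "slurp@inktomi", "ask jeeves",
                    "lycos", "whatuseek", "ia_archiver"]
larger_patterns = ["googlebot", "msnbot", "slurp"]

print(count_online(sessions, default_patterns))  # (3, 1)
print(count_online(sessions, larger_patterns))   # (1, 3)
```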

#3 admin

  • Administrator
  • Group: Root Admin
  • Posts: 7,799
  • Joined: 25-January 07

Posted 22 May 2007 - 04:49 AM

Your list is a good mix of the major spiders; however, it can be cleaned up quite a bit, which would also improve performance (the smaller the list, the faster the regex in the IPB session handler can process the user agent). For example:

Slurp.so/1.0= Yahoo.com
Slurp/2.0j=Yahoo.com
Slurp/2.0-KiteHourly=Yahoo.com
Slurp/2.0-OwlWeekly=Yahoo.com
Slurp/3.0-AU=Yahoo.com
slurp@inktomi.com=Yahoo.com
slurp@inktomi= Yahoo.com
Slurp=Yahoo.com


could all be effectively combined to

slurp=Yahoo.com


gsa-crawler (Enterprise; GID-01422; jplastiras@google.com)=Google.com
gsa-crawler (Enterprise; GID-01742;gsatesting@rediffmail.com)=Google.com
gsa-crawler=Google.com


could be combined to

gsa-crawler=Google.com


Inktomi Search=Yahoo.com
Inktomi=Yahoo.com


to

Inktomi=Yahoo.com


Googlebot/1.0=Google.com
Googlebot/2.1= Google.com
Googlebot/Test=Google.com
googlebot@googlebot.com=Google.com
Googlebot=Google.com
Googlebot-Image/1.0=Google.com
Googlebot-Image/1.0=Google.com Image Bot


to

googlebot=Google.com


(Also, an interesting note: you have Googlebot-Image/1.0 here twice. The first entry would match first, so the second entry does nothing.)
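
Entries that can never take effect, like the second Googlebot-Image/1.0 line, can be found mechanically. The following Python sketch uses a hypothetical function name and assumes entries are tested in list order with a case-insensitive substring match. It flags any entry whose pattern already contains an earlier entry's pattern, since the earlier one would always match first.

```python
def find_shadowed(mappings):
    """Return (pattern, name, earlier_pattern) for entries that can never win."""
    shadowed = []
    for index, (pattern, name) in enumerate(mappings):
        p = pattern.lower()
        for earlier, _ in mappings[:index]:
            if earlier.lower() in p:
                # Any user agent containing `pattern` also contains `earlier`,
                # and `earlier` is tested first.
                shadowed.append((pattern, name, earlier))
                break
    return shadowed


entries = [
    ("Googlebot-Image/1.0", "Google.com"),
    ("Googlebot-Image/1.0", "Google.com Image Bot"),
]
print(find_shadowed(entries))
# [('Googlebot-Image/1.0', 'Google.com Image Bot', 'Googlebot-Image/1.0')]
```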

In the end, you'd end up with the following:

lycos=Lycos.com
whatuseek=What You Seek
ArchitectSpider=Excite.com
BSDSeek/1.0=Inktomi.com
BullsEye=Intelliseek.com
Google=Google.com
gsa-crawler=Google.com
Inktomi=Yahoo.com
jeeves=Ask Jeeves
msnbot=MSN.com
Slurp=Yahoo.com
spider@aeneid.com=Yahoo.com
Yahoo-Blogs/v3.9=Yahoo.com Blogs
Yahoo=Yahoo.com



It's important to note that once something on the left hand side matches the user agent, the value on the right hand side of the = sign will be used. This is why for example you can just use

yahoo=Yahoo.com

If the user agent is "YahooSeeker/CafeKelsa", then "Yahoo" is also found in the user agent, as you can see, and you end up with the same net result with fewer bots to check. The trimmed list I just posted should give you the same end result as the original in justme's post.
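
One way to check a trimmed list against the original is to run sample user agents through both and report any that get a different name. This is a Python sketch with hypothetical function names, again assuming a case-insensitive substring match in list order. The sample entries are a small excerpt from the lists in this thread, not the complete lists.

```python
def first_match(user_agent, mappings):
    """Return the name of the first entry whose pattern is in the user agent."""
    ua = user_agent.lower()
    for pattern, name in mappings:
        if pattern.lower() in ua:
            return name
    return None


def compare_lists(user_agents, original, trimmed):
    """Return (user_agent, original_name, trimmed_name) where the lists disagree."""
    differences = []
    for ua in user_agents:
        before = first_match(ua, original)
        after = first_match(ua, trimmed)
        if before != after:
            differences.append((ua, before, after))
    return differences


original = [
    ("Yahoo-Blogs/v3.9", "Yahoo.com Blogs"),
    ("YahooSeeker/CafeKelsa", "Yahoo.com"),
]
trimmed = [
    ("Yahoo-Blogs/v3.9", "Yahoo.com Blogs"),
    ("Yahoo", "Yahoo.com"),
]
samples = ["YahooSeeker/CafeKelsa", "Yahoo-Blogs/v3.9"]

print(compare_lists(samples, original, trimmed))  # []
```

The order of entries matters here: the more specific "Yahoo-Blogs/v3.9" has to stay above the general "Yahoo" entry, otherwise the blog spider would be reported as plain Yahoo.com.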

#4 smallblockfuelie

  • Advanced Member
  • Group: Customers
  • Posts: 238
  • Joined: 10-September 07
  • Gender:Male
  • Location:CA

Posted 29 April 2008 - 12:35 PM

Have there been any updates to this list?

#5 admin

  • Administrator
  • Group: Root Admin
  • Posts: 7,799
  • Joined: 25-January 07

Posted 30 April 2008 - 03:26 AM

I haven't seen a need to update the list. There are literally thousands of "spiders" (which we will define as any automated program that can index a webpage) out there, but webmasters shouldn't really care about any beyond those in the included list.

Even then, there's plenty in the list I wouldn't personally care about. In all honesty, outside of Yahoo, MSN, Google, and Ask.com, most of the other search engines are so rarely used (and already syndicate indexed pages from the other major search engines) they don't really matter that much.

#6 smallblockfuelie

  • Advanced Member
  • Group: Customers
  • Posts: 238
  • Joined: 10-September 07
  • Gender:Male
  • Location:CA

Posted 02 May 2008 - 09:15 AM

I am only getting Ask on my site a few times (less than a dozen) a month and then it's indexing the same page every time. Why isn't it doing more?

#7 twistedgamer

  • Advanced Member
  • Group: Customers
  • Posts: 368
  • Joined: 09-November 07
  • Gender:Male
  • Location:Middle of the Desert state

Posted 04 May 2008 - 02:36 PM

smallblockfuelie, on May 2 2008, 07:15 AM, said:

I am only getting Ask on my site a few times (less than a dozen) a month and then it's indexing the same page every time. Why isn't it doing more?

I could have sworn I responded to this question on this site already, but maybe not.

The short answer is that Ask is refocusing on sites geared towards women. If your site isn't geared towards women, it's not going to index much of your site. What it sounds like they're doing on your site is what it looks like they're doing on mine: indexing the home page, looking for links to the content their users are searching for, namely women-related sites.

Looking at the logs, I think Ask has indexed a total of 5 or 6 pages on my site.
