Instructions and information about the "Bot List"
admin
post Feb 12 2007, 04:34 AM
Post #1


Administrator
***

Group: Root Admin
Posts: 4,991
Joined: 25-January 07
Member No.: 1



Contained in the /community_seo/Documentation/ folder is a Bot_List_Readme.txt, and a Default_Bot_List.txt. If you are new to Invision Power Board, you may not understand what these are for.

Firstly, I would like to extend my thanks to Daniel, the author of the largest bot list on the net. You can click on the preceding link to visit his site and view his bot lists, email extractor lists, spam bot lists, and 'other lists' (which he describes as link checkers, verifiers, etc.). The included Default_Bot_List.txt is a modified version of his original bot list.

In the IPB ACP, go to Tools & Settings -> Search Engine Spiders. One of the settings there (Spider Bot User-Agent) lets you enter spider mappings. Each mapping works like so:

CODE
(user agent string)=Displayed Bot Name


The "(user agent string)" is a string that is matched against the visitor's HTTP User-Agent. The Displayed Bot Name is what the visitor will be shown as on the site and in the spider logs in the ACP. By default, IPB supplies only 6 spider mappings. They are (arguably) the most common/largest spiders, but the list is far from comprehensive.

CODE
googlebot=Google.com
slurp@inktomi=Hot Bot
ask jeeves=Ask Jeeves
lycos=Lycos.com
whatuseek=What You Seek
ia_archiver=Archive.org
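
As a rough illustration of the mapping format described above, here is a minimal Python sketch. It is not IPB's own code: the function names `parse_mappings` and `identify` are hypothetical, and it assumes case-insensitive substring matching where the first matching line wins and the line is split at the first "=" sign.

```python
DEFAULT_MAPPINGS = """\
googlebot=Google.com
slurp@inktomi=Hot Bot
ask jeeves=Ask Jeeves
lycos=Lycos.com
whatuseek=What You Seek
ia_archiver=Archive.org
"""

def parse_mappings(text):
    """Turn 'pattern=Displayed Bot Name' lines into an ordered list of pairs."""
    mappings = []
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        # Assumption: the pattern ends at the first '=' on the line.
        pattern, _, name = line.partition("=")
        mappings.append((pattern.strip(), name.strip()))
    return mappings

def identify(user_agent, mappings):
    """Return the displayed name of the first matching entry, or None."""
    ua = user_agent.lower()
    for pattern, name in mappings:
        # Assumption: matching is a case-insensitive substring test.
        if pattern.lower() in ua:
            return name
    return None

mappings = parse_mappings(DEFAULT_MAPPINGS)
```

With these definitions, a user agent containing "Googlebot/2.1" would be reported as Google.com, while an ordinary browser user agent would return None and be treated as a regular guest.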


What does this mean for you?

Not much, specifically, but the more mappings you have, the more spiders you can recognize, log, and treat specially. This is why we want to enter a more comprehensive list of mappings: so we can monitor different kinds of spiders.

Why not just link to Daniel's original list?

The problem I found with Daniel's list was that it contained many entries that are effectively duplicates. Here is an example:

CODE
AbachoBOT (Mozilla compatible)=Crawler.de
AbachoBOT=Crawler.de


Remember that the data on the left side is matched against the user agent, and if a match is found, the name on the right side is displayed. Here, the forum will first try to match the string "AbachoBOT (Mozilla compatible)"; if it matches, the name "Crawler.de" is used. If not, the next string, "AbachoBOT", is tested, and on a match its name, "Crawler.de", is used. The problem is that we can achieve the same end result by ONLY matching against the second entry, "AbachoBOT": any bot that matches the first entry also matches the second, and both display the same name, so there is no benefit to matching them separately. Additionally, the larger this list is, the more overhead IPB has to contend with in checking for the bots.
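
The kind of trimming described above can be sketched in Python. This is only an illustration under assumptions, not the procedure actually used on Daniel's list: `prune_redundant` is a hypothetical helper, matching is assumed to be a case-insensitive substring test, and only entries with the same displayed name are collapsed.

```python
def prune_redundant(mappings):
    """Drop entries already covered by a shorter pattern with the same name.

    mappings is an ordered list of (pattern, displayed_name) pairs.
    """
    kept = []
    for i, (pattern, name) in enumerate(mappings):
        p = pattern.lower()
        redundant = False
        for j, (other, other_name) in enumerate(mappings):
            if i == j or other_name != name:
                continue
            o = other.lower()
            # A different pattern contained in this one already covers it;
            # for exact duplicates, only the first occurrence is kept.
            if (o != p and o in p) or (o == p and j < i):
                redundant = True
                break
        if not redundant:
            kept.append((pattern, name))
    return kept

example = [
    ("AbachoBOT (Mozilla compatible)", "Crawler.de"),
    ("AbachoBOT", "Crawler.de"),
]
trimmed = prune_redundant(example)
```

Here `trimmed` keeps only the "AbachoBOT" entry. Entries with different displayed names are left alone, because removing them could change which name is shown.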

By going through Daniel's list (to which I was originally just going to link) and removing duplicates like the one above, I've trimmed the list by 10 KB. This is a huge resource savings when you factor in that this list has to be parsed and loaded into a regular expression on every single page load.
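
To make the per-page-load cost more concrete, here is one way a mapping list could be loaded into a single regular expression. The thread does not show IPB's session handler, so this Python sketch is an assumption about the general technique only; `compile_bot_regex` and `identify` are hypothetical names.

```python
import re

def compile_bot_regex(mappings):
    """Build one case-insensitive alternation from (pattern, name) pairs."""
    names = {}
    for pattern, name in mappings:
        # The first entry for a given pattern wins; later duplicates are ignored.
        names.setdefault(pattern.lower(), name)
    # Escape each pattern so characters like '.' or '(' are matched literally.
    regex = re.compile("|".join(re.escape(p) for p in names), re.IGNORECASE)
    return regex, names

def identify(user_agent, regex, names):
    """Return the displayed name for the first match found, or None."""
    if not names:
        return None  # an empty alternation would match everything
    match = regex.search(user_agent)
    return names[match.group(0).lower()] if match else None

regex, names = compile_bot_regex([
    ("googlebot", "Google.com"),
    ("slurp", "Yahoo.com"),
])
```

Every entry adds another alternative that has to be escaped, joined, and compiled, which is why a shorter list is cheaper to prepare on each request. Note that a regex alternation reports the leftmost match in the user agent string, which is not always the same as the first entry in the list.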

Summary

The bot list is used to recognize spiders, with the option to treat them specially (for example, putting them in a special member group or forcing them to use a specific skin) and to log spider activity on your site. We've included a more comprehensive list than the one that ships with IPB. You can see the original list that our included list is based on here; however, at the time of this writing, there is no benefit to using the original list over our trimmed list.
justme
post May 16 2007, 12:05 PM
Post #2


Advanced Member
***

Group: Customers
Posts: 303
Joined: 22-March 07
Member No.: 54



Here's a simple list that catches most of the major bots:
slurp@inktomi=Hot Bot
lycos=Lycos.com
whatuseek=What You Seek
ArchitectSpider=Excite.com
ask jeeves=Ask Jeeves
BSDSeek/1.0=Inktomi.com
BullsEye=Intelliseek.com
Google=Google.com
Googlebot/1.0=Google.com
Googlebot/2.1= Google.com
Googlebot/Test=Google.com
googlebot@googlebot.com=Google.com
Googlebot=Google.com
Googlebot-Image/1.0=Google.com
Googlebot-Image/1.0=Google.com Image Bot
gsa-crawler (Enterprise; GID-01422; jplastiras@google.com)=Google.com
gsa-crawler (Enterprise; GID-01742;gsatesting@rediffmail.com)=Google.com
gsa-crawler=Google.com
Inktomi Search=Yahoo.com
Inktomi=Yahoo.com
jeeves=Ask Jeeves
MSNBOT/0.1=MSN.com
msnbot=MSN.com
Slurp.so/1.0= Yahoo.com
Slurp/2.0j=Yahoo.com
Slurp/2.0-KiteHourly=Yahoo.com
Slurp/2.0-OwlWeekly=Yahoo.com
Slurp/3.0-AU=Yahoo.com
slurp@inktomi.com=Yahoo.com
slurp@inktomi= Yahoo.com
Slurp=Yahoo.com
spider@aeneid.com=Yahoo.com
www.inktomisearch.com=Yahoo.com
Yahoo-Blogs/v3.9=Yahoo.com Blogs
YahooSeeker/CafeKelsa=Yahoo.com

A note of warning: if you effectively filter all the bots from your online list, the number of "guests" may APPEAR to drop significantly (identified bots are not shown as guests). The larger the site, the more pronounced the effect. In my experience, using the list above instead of the default IPB list makes the number of 'guests' online drop by about half.
admin
post May 22 2007, 04:49 AM
Post #3


Administrator
***

Group: Root Admin
Posts: 4,991
Joined: 25-January 07
Member No.: 1



Your list is a good mix of the major spiders; however, it can be cleaned up quite a bit, which would also improve performance (the smaller the list, the faster the regex in the IPB session handler can process the user agent). For example,

CODE
Slurp.so/1.0= Yahoo.com
Slurp/2.0j=Yahoo.com
Slurp/2.0-KiteHourly=Yahoo.com
Slurp/2.0-OwlWeekly=Yahoo.com
Slurp/3.0-AU=Yahoo.com
slurp@inktomi.com=Yahoo.com
slurp@inktomi= Yahoo.com
Slurp=Yahoo.com


Could all be effectively combined to

CODE
slurp=Yahoo.com


CODE
gsa-crawler (Enterprise; GID-01422; jplastiras@google.com)=Google.com
gsa-crawler (Enterprise; GID-01742;gsatesting@rediffmail.com)=Google.com
gsa-crawler=Google.com


could be combined to

CODE
gsa-crawler=Google.com


CODE
Inktomi Search=Yahoo.com
Inktomi=Yahoo.com


to

CODE
Inktomi=Yahoo.com


CODE
Googlebot/1.0=Google.com
Googlebot/2.1= Google.com
Googlebot/Test=Google.com
googlebot@googlebot.com=Google.com
Googlebot=Google.com
Googlebot-Image/1.0=Google.com
Googlebot-Image/1.0=Google.com Image Bot


to

CODE
googlebot=Google.com


(Also, an interesting note, you have the Googlebot-Image/1.0 here twice - the first entry would match first, so the second entry does nothing).
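
Entries that can never take effect, like the second Googlebot-Image/1.0 line, can be spotted mechanically. The following Python sketch assumes first-match-wins, case-insensitive substring matching; `shadowed_entries` is a hypothetical helper, not part of IPB.

```python
def shadowed_entries(mappings):
    """Return (pattern, name, earlier_pattern) for entries that never win.

    An entry is shadowed when an earlier pattern is contained in it, because
    any user agent matching the later entry matches the earlier one first.
    """
    shadowed = []
    for i, (pattern, name) in enumerate(mappings):
        p = pattern.lower()
        for earlier, _ in mappings[:i]:
            if earlier.lower() in p:
                shadowed.append((pattern, name, earlier))
                break
    return shadowed

entries = [
    ("Googlebot-Image/1.0", "Google.com"),
    ("Googlebot-Image/1.0", "Google.com Image Bot"),
    ("msnbot", "MSN.com"),
]
report = shadowed_entries(entries)
```

Here `report` contains only the "Google.com Image Bot" entry. If the image bot should be displayed under its own name, its more specific pattern would have to come before any shorter pattern that also matches it.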

In the end, you'd end up with the following

CODE
lycos=Lycos.com
whatuseek=What You Seek
ArchitectSpider=Excite.com
BSDSeek/1.0=Inktomi.com
BullsEye=Intelliseek.com
Google=Google.com
gsa-crawler=Google.com
Inktomi=Yahoo.com
jeeves=Ask Jeeves
msnbot=MSN.com
Slurp=Yahoo.com
spider@aeneid.com=Yahoo.com
Yahoo-Blogs/v3.9=Yahoo.com Blogs
Yahoo=Yahoo.com



It's important to note that once something on the left-hand side matches the user agent, the value on the right-hand side of the = sign will be used. This is why, for example, you can just use

yahoo=Yahoo.com

If the user agent is "YahooSeeker/CafeKelsa", then "Yahoo" is also found in the user agent, and you end up with the same net result with fewer entries to check. The trimmed list I just posted should give you the same end result as the original in justme's post.
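
One way to check that a trimmed list really behaves like the original is to run both against a handful of sample user agents and report any that are named differently. This Python sketch is only an illustration: `identify` and `differences` are hypothetical helpers, the sample user agent strings are made up for the example, and matching is assumed to be a case-insensitive substring test where the first match wins.

```python
def identify(user_agent, mappings):
    """Return the displayed name of the first matching entry, or None."""
    ua = user_agent.lower()
    for pattern, name in mappings:
        if pattern.lower() in ua:
            return name
    return None

def differences(original, trimmed, sample_agents):
    """List (user_agent, name_before, name_after) where the two lists disagree."""
    diffs = []
    for ua in sample_agents:
        before = identify(ua, original)
        after = identify(ua, trimmed)
        if before != after:
            diffs.append((ua, before, after))
    return diffs

# A small excerpt of each list, in the same order as in the posts.
original = [
    ("slurp@inktomi", "Hot Bot"),
    ("Slurp/2.0j", "Yahoo.com"),
    ("Slurp", "Yahoo.com"),
    ("YahooSeeker/CafeKelsa", "Yahoo.com"),
]
trimmed = [
    ("Slurp", "Yahoo.com"),
    ("Yahoo", "Yahoo.com"),
]
samples = [
    "YahooSeeker/CafeKelsa",
    "Mozilla/3.0 (Slurp/2.0j; slurp@inktomi.com)",  # illustrative string
    "Mozilla/5.0 (Windows NT 10.0)",
]
diffs = differences(original, trimmed, samples)
```

With this excerpt, the YahooSeeker agent is named Yahoo.com by both lists. The one disagreement is the agent containing "slurp@inktomi": justme's list begins with slurp@inktomi=Hot Bot, so it shows that agent as Hot Bot, while the trimmed list shows it as Yahoo.com.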
smallblockfuelie
post Apr 29 2008, 12:35 PM
Post #4


Advanced Member
***

Group: Customers
Posts: 143
Joined: 10-September 07
From: CA
Member No.: 459



Have there been any updates to this list?
admin
post Apr 30 2008, 03:26 AM
Post #5


Administrator
***

Group: Root Admin
Posts: 4,991
Joined: 25-January 07
Member No.: 1



I haven't seen a need to update the list. There are literally thousands of "spiders" (which we will define as any automated program that can index a webpage) out there, but webmasters shouldn't really care about any beyond those in the included list.

Even then, there are plenty in the list I wouldn't personally care about. In all honesty, outside of Yahoo, MSN, Google, and Ask.com, most of the other search engines are so rarely used (and already syndicate indexed pages from the other major search engines) that they don't really matter that much.
smallblockfuelie
post May 2 2008, 09:15 AM
Post #6


Advanced Member
***

Group: Customers
Posts: 143
Joined: 10-September 07
From: CA
Member No.: 459



I am only getting Ask on my site a few times (less than a dozen) a month and then it's indexing the same page every time. Why isn't it doing more?
twistedgamer
post May 4 2008, 02:36 PM
Post #7


Advanced Member
***

Group: Customers
Posts: 308
Joined: 9-November 07
From: Middle of the Desert state
Member No.: 574



QUOTE (smallblockfuelie @ May 2 2008, 07:15 AM) *
I am only getting Ask on my site a few times (less than a dozen) a month and then it's indexing the same page every time. Why isn't it doing more?

I could have sworn I responded to this question on this site already, but maybe not.

Short answer: Ask is refocusing on sites geared towards women. If your site isn't geared towards women, it's not going to index much of your site. What it sounds like they're doing on your site is what it looks like they're doing on mine: indexing the home page, looking for links to the content their users are searching for, namely women-related sites.

I think Ask has indexed a total of 5 or 6 pages on my site, looking at the logs.
