Instructions and information about the "Bot List"
Feb 12 2007, 04:34 AM
Post #1
Administrator | Group: Root Admin | Posts: 4,991 | Joined: 25-January 07 | Member No.: 1
Contained in the /community_seo/Documentation/ folder is a Bot_List_Readme.txt, and a Default_Bot_List.txt. If you are new to Invision Power Board, you may not understand what these are for.
Firstly, I would like to extend my thanks to Daniel, the author of the largest bot list on the net. You can click on the preceding link to visit his site and view his bot lists, email extractor lists, spam bot lists, and 'other lists' (which he describes as link checkers, verifiers, etc.). The included Default_Bot_List.txt is a modified version of his original bot list.

In the IPB ACP, you can go to Tools & Settings -> Search Engine Spiders, where one of the settings lets you enter spider mappings (Spider Bot User-Agent). The mapping works like so:

CODE
(user agent string)=Displayed Bot Name

The "(user agent string)" is a string that is matched against the visitor's HTTP User-Agent. The Displayed Bot Name is what the visitor will be shown as on the site and in the spider logs in the ACP. By default, IPB supplies only 6 spider mappings. They are (arguably) the most common/largest spiders, but the list is far from comprehensive.

CODE
googlebot=Google.com
slurp@inktomi=Hot Bot
ask jeeves=Ask Jeeves
lycos=Lycos.com
whatuseek=What You Seek
ia_archiver=Archive.org

What does this mean for you? Not much, specifically, but the more mappings you have, the more spiders you can recognize, log, and treat specially. This is why we want to enter a more comprehensive list of mappings, so we can monitor different kinds of spiders.

Why not just link to Daniel's original list? The problem I found with Daniel's list was that it contained many entries that are effectively duplicates. Here is an example:

CODE
AbachoBOT (Mozilla compatible)=Crawler.de
AbachoBOT=Crawler.de

Remember that the data on the left side is matched against the user agent, and if a match is found, the name on the right side is displayed. What happens here is that the forums will try to match against the first string, "AbachoBOT (Mozilla compatible)". If that matches, its name, "Crawler.de", is used. If not, the next string, "AbachoBOT", is tested, and then its name, "Crawler.de", is used.
The problem is that we can achieve the same end result by ONLY trying to match against the second entry, "AbachoBOT". Any bot that matches the first entry also matches the second, so there is no benefit to testing them separately. Additionally, the larger this list is, the more overhead IPB has to contend with in checking for the bots. By going through Daniel's list, to which I was originally just going to link, and removing duplicates like the one above, I've trimmed the list by 10 KB. This is a huge resource saving when you factor in that this list has to be parsed and loaded into a regular expression on every single page load.

Summary

The bot list is used to recognize spiders, with the option to treat them specially (for example, put them in a special member group or force them to use a specific skin) and the ability to log spider activity on your site. We've included a more comprehensive list than the one included in IPB. You can see the original list that our included list is based on here; however, at the time of this writing, there is no benefit to using the original list over our trimmed list.
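To make the matching described above concrete, here is a minimal Python sketch of the idea. It is not IPB's actual code (IPB is written in PHP). The function names `parse_bot_list` and `build_matcher` are made up for illustration, and case-insensitive substring matching is an assumption on my part. The sketch parses the `pattern=Display Name` lines, joins the patterns into a single regular expression, and returns the display name of whichever pattern is found in the user agent.

```python
import re

# The six default mappings quoted in the post above.
DEFAULT_BOT_LIST = """\
googlebot=Google.com
slurp@inktomi=Hot Bot
ask jeeves=Ask Jeeves
lycos=Lycos.com
whatuseek=What You Seek
ia_archiver=Archive.org
"""


def parse_bot_list(text):
    """Parse 'pattern=Display Name' lines into an ordered list of pairs."""
    entries = []
    for line in text.splitlines():
        # Split on the first '=' only; skip blank or malformed lines.
        pattern, sep, name = line.partition("=")
        pattern, name = pattern.strip(), name.strip()
        if sep and pattern and name:
            entries.append((pattern, name))
    return entries


def build_matcher(entries):
    """Return a function mapping a user agent to a display name or None.

    Assumption: patterns are literal substrings compared case-insensitively.
    """
    names = {}
    for pattern, name in entries:
        # If the same pattern appears twice, the first entry wins.
        names.setdefault(pattern.lower(), name)
    if not names:
        return lambda user_agent: None

    # One alternation of escaped literals, so the list is scanned in one pass.
    regex = re.compile("|".join(re.escape(p) for p in names), re.IGNORECASE)

    def identify(user_agent):
        match = regex.search(user_agent)
        if match is None:
            return None
        return names.get(match.group(0).lower())

    return identify


identify = build_matcher(parse_bot_list(DEFAULT_BOT_LIST))
print(identify("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
```

One thing to keep in mind with a single alternation: the regex engine reports the match that starts earliest in the user agent, and list order only breaks ties between patterns that could start at the same position. Every extra pattern also makes the expression longer, which is the overhead the trimmed list is meant to reduce.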
May 16 2007, 12:05 PM
Post #2
Advanced Member | Group: Customers | Posts: 303 | Joined: 22-March 07 | Member No.: 54
Here's a simple list that catches most of the major bots:

slurp@inktomi=Hot Bot
lycos=Lycos.com
whatuseek=What You Seek
ArchitectSpider=Excite.com
ask jeeves=Ask Jeeves
BSDSeek/1.0=Inktomi.com
BullsEye=Intelliseek.com
Google=Google.com
Googlebot/1.0=Google.com
Googlebot/2.1= Google.com
Googlebot/Test=Google.com
googlebot@googlebot.com=Google.com
Googlebot=Google.com
Googlebot-Image/1.0=Google.com
Googlebot-Image/1.0=Google.com Image Bot
gsa-crawler (Enterprise; GID-01422; jplastiras@google.com)=Google.com
gsa-crawler (Enterprise; GID-01742;gsatesting@rediffmail.com)=Google.com
gsa-crawler=Google.com
Inktomi Search=Yahoo.com
Inktomi=Yahoo.com
jeeves=Ask Jeeves
MSNBOT/0.1=MSN.com
msnbot=MSN.com
Slurp.so/1.0= Yahoo.com
Slurp/2.0j=Yahoo.com
Slurp/2.0-KiteHourly=Yahoo.com
Slurp/2.0-OwlWeekly=Yahoo.com
Slurp/3.0-AU=Yahoo.com
slurp@inktomi.com=Yahoo.com
slurp@inktomi= Yahoo.com
Slurp=Yahoo.com
spider@aeneid.com=Yahoo.com
www.inktomisearch.com=Yahoo.com
Yahoo-Blogs/v3.9=Yahoo.com Blogs
YahooSeeker/CafeKelsa=Yahoo.com

A note of warning: if you effectively filter all the bots from your online list, the number of "guests" may APPEAR to drop significantly, because identified bots are not shown as guests. The larger the site, the more pronounced the effect. In my experience, using the list above instead of the default IPB list cuts the number of 'guests' online by about half.
May 22 2007, 04:49 AM
Post #3
Administrator | Group: Root Admin | Posts: 4,991 | Joined: 25-January 07 | Member No.: 1
Your list is a good mesh of the major spiders; however, it can be cleaned up quite a bit, which would also improve performance (the smaller the list, the faster the regex in the IPB session handler can process the user agent). For example:

CODE
Slurp.so/1.0= Yahoo.com
Slurp/2.0j=Yahoo.com
Slurp/2.0-KiteHourly=Yahoo.com
Slurp/2.0-OwlWeekly=Yahoo.com
Slurp/3.0-AU=Yahoo.com
slurp@inktomi.com=Yahoo.com
slurp@inktomi= Yahoo.com
Slurp=Yahoo.com

could all be effectively combined to

CODE
slurp=Yahoo.com

CODE
gsa-crawler (Enterprise; GID-01422; jplastiras@google.com)=Google.com
gsa-crawler (Enterprise; GID-01742;gsatesting@rediffmail.com)=Google.com
gsa-crawler=Google.com

could be combined to

CODE
gsa-crawler=Google.com

CODE
Inktomi Search=Yahoo.com
Inktomi=Yahoo.com

to

CODE
Inktomi=Yahoo.com

CODE
Googlebot/1.0=Google.com
Googlebot/2.1= Google.com
Googlebot/Test=Google.com
googlebot@googlebot.com=Google.com
Googlebot=Google.com
Googlebot-Image/1.0=Google.com
Googlebot-Image/1.0=Google.com Image Bot

to

CODE
googlebot=Google.com

(Also, an interesting note: you have Googlebot-Image/1.0 here twice. The first entry would match first, so the second entry does nothing.)

In the end, you'd end up with the following:

CODE
lycos=Lycos.com
whatuseek=What You Seek
ArchitectSpider=Excite.com
BSDSeek/1.0=Inktomi.com
BullsEye=Intelliseek.com
Google=Google.com
gsa-crawler=Google.com
Inktomi=Yahoo.com
jeeves=Ask Jeeves
msnbot=MSN.com
Slurp=Yahoo.com
spider@aeneid.com=Yahoo.com
Yahoo-Blogs/v3.9=Yahoo.com Blogs
Yahoo=Yahoo.com

It's important to note that once something on the left-hand side matches the user agent, the value on the right-hand side of the = sign is used. This is why, for example, you can just use yahoo=Yahoo.com. If the user agent contains "YahooSeeker/CafeKelsa", then "Yahoo" is also found in the user agent, and you end up with the same net result with fewer bots to check. The trimmed list I just posted should give you the same end result as the original in justme's post.
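The manual clean-up above can also be sketched in code. This is only an illustration in Python under my own assumptions, not a tool that ships with the product: the helper `prune_redundant` is hypothetical, it assumes patterns are compared case-insensitively as substrings, and it only drops an entry when a shorter pattern with the exact same display name already covers it. It does not look at list order or at entries with different display names, so review the result by hand.

```python
def prune_redundant(entries):
    """Drop entries already covered by another pattern with the same name.

    entries: list of (pattern, display_name) tuples in list order.
    An entry is treated as redundant when another entry with the same
    display name has a pattern that is a case-insensitive substring of it,
    or when it is an exact repeat of an earlier entry.
    """
    kept = []
    for i, (pattern, name) in enumerate(entries):
        redundant = False
        for j, (other, other_name) in enumerate(entries):
            if i == j or other_name != name:
                continue
            p, o = pattern.lower(), other.lower()
            if o == p:
                # Exact repeat: keep only the first occurrence.
                if j < i:
                    redundant = True
                    break
            elif o in p:
                # A shorter pattern with the same name matches whenever
                # this one does, so this entry adds nothing.
                redundant = True
                break
        if not redundant:
            kept.append((pattern, name))
    return kept


slurp_entries = [
    ("Slurp.so/1.0", "Yahoo.com"),
    ("Slurp/2.0j", "Yahoo.com"),
    ("Slurp/2.0-KiteHourly", "Yahoo.com"),
    ("Slurp/2.0-OwlWeekly", "Yahoo.com"),
    ("Slurp/3.0-AU", "Yahoo.com"),
    ("slurp@inktomi.com", "Yahoo.com"),
    ("slurp@inktomi", "Yahoo.com"),
    ("Slurp", "Yahoo.com"),
]
print(prune_redundant(slurp_entries))
```

Entries that share a pattern prefix but carry different display names (such as the "Google.com Image Bot" line) are deliberately left alone, since deciding which name should win is a judgment call.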
Apr 29 2008, 12:35 PM
Post #4
Advanced Member | Group: Customers | Posts: 143 | Joined: 10-September 07 | From: CA | Member No.: 459
Have there been any updates to this list?
Apr 30 2008, 03:26 AM
Post #5
Administrator | Group: Root Admin | Posts: 4,991 | Joined: 25-January 07 | Member No.: 1
I haven't seen a need to update the list. There are literally thousands of "spiders" (which we will define as any automated program that can index a webpage) out there, but webmasters shouldn't really care about any beyond those in the included list.
Even then, there are plenty in the list I wouldn't personally care about. In all honesty, outside of Yahoo, MSN, Google, and Ask.com, most of the other search engines are so rarely used (and already syndicate indexed pages from the major search engines) that they don't really matter that much.
May 2 2008, 09:15 AM
Post #6
Advanced Member | Group: Customers | Posts: 143 | Joined: 10-September 07 | From: CA | Member No.: 459
I am only getting Ask on my site a few times (less than a dozen) a month and then it's indexing the same page every time. Why isn't it doing more?
May 4 2008, 02:36 PM
Post #7
Advanced Member | Group: Customers | Posts: 308 | Joined: 9-November 07 | From: Middle of the Desert state | Member No.: 574
"I am only getting Ask on my site a few times (less than a dozen) a month and then it's indexing the same page every time. Why isn't it doing more?"

I could have sworn I responded to this question on this site already, but maybe not. The short answer is that Ask is refocusing on sites geared towards women. If your site isn't geared towards women, it's not going to index much of your site. What it sounds like they're doing on your site is what it looks like they're doing on mine: indexing my home page, looking for links to content their users are searching for, namely women-related sites. Looking at the logs, I think Ask has indexed a total of 5 or 6 pages on my site.