Exabot has to be the rudest spider out there. They totally ignore robots.txt; actually, they don't even seem to look for it. Then they make requests CONSTANTLY, many pages per second. Rude, rude, rude.
Exabot/Exalead.. you guys suck. Welcome to Deny.
They used to use 193.47.80.42; now they use 193.47.80.46, in case anyone wants to be pre-emptive.
Posted by Kevin at July 31, 2006 09:33 AM
We are sorry for the numerous hits on your site.
We do observe some fairness towards the sites we crawl by not downloading a new page more often than every 3 seconds.
However, our algorithm is based on hostnames, and if your site is designed to generate a huge number of different hostnames, it will mislead our crawler. This issue should be solved in a few weeks.
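To illustrate why a delay keyed on hostnames can be misled, here is a minimal Python sketch of a per-hostname throttle. It is not Exalead's code: the class name `HostThrottle`, the `example.com` URLs, and the slot-reservation logic are assumptions made for illustration. Only the 3-second interval comes from the comment above.

```python
from urllib.parse import urlsplit

MIN_DELAY = 3.0  # seconds between fetches to one hostname (figure quoted in the comment)


class HostThrottle:
    """Hypothetical throttle: tracks when each hostname may next be fetched."""

    def __init__(self, min_delay=MIN_DELAY):
        self.min_delay = min_delay
        self.next_allowed = {}  # hostname -> earliest time of the next fetch

    def wait_time(self, url, now):
        """Return how long to wait before fetching url, and reserve that slot."""
        host = urlsplit(url).hostname
        ready = max(self.next_allowed.get(host, now), now)
        self.next_allowed[host] = ready + self.min_delay
        return ready - now


throttle = HostThrottle()

# Two pages on the same hostname: the second fetch is held back 3 seconds.
same = [
    throttle.wait_time("http://www.example.com/a", now=0.0),
    throttle.wait_time("http://www.example.com/b", now=0.0),
]

# Two hostnames that may well live on one server: neither fetch waits.
different = [
    throttle.wait_time("http://a.example.com/", now=0.0),
    throttle.wait_time("http://b.example.com/", now=0.0),
]

print(same, different)
```

Because the table is keyed on the hostname alone, a server that answers under many hostnames gets one independent 3-second budget per name, so the combined load on it can be far higher than one page every 3 seconds.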
If it annoys you, you can contact me and we can add special treatment for your site.
Thank you for this advertisement.
Posted by: ExaleadGuy on August 2, 2006 04:44 PM
It's not an advertisement. Fairness is looking for robots.txt, which I didn't see happen. And, when the world should just see www.domain.com, there should be one hostname. You hit www.domain.com multiple times a second requesting different pages, non-stop. Fairness is only grabbing a few pages, waiting, then some more. Not being overly aggressive.
I couldn't care less if it were my personal site, but this was my job site. It was pounding the site, not once, but twice. I had to ban your IP the last time too. I've seen numerous reports on boards about your spider doing the same elsewhere.
I'll wait a few weeks, unban the IP, then see if it does it again. If it does, I'll just email you directly so maybe you can check your spider logs to see what it's doing.
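For anyone who wants to check their own logs for this kind of behaviour, a rough Python sketch follows. It assumes the access log has already been parsed into (IP, Unix second) pairs. The function name `peak_rate_by_ip`, the sample hits, and the documentation-range IP addresses are all made up for illustration and are not taken from any real log.

```python
from collections import Counter


def peak_rate_by_ip(hits):
    """Given (ip, unix_second) pairs, return {ip: most requests seen in any one second}."""
    per_second = Counter(hits)  # (ip, second) -> number of requests in that second
    peak = {}
    for (ip, _second), count in per_second.items():
        peak[ip] = max(peak.get(ip, 0), count)
    return peak


# Illustrative sample only: one client sends three requests in the same second.
sample = [
    ("198.51.100.7", 100), ("198.51.100.7", 100), ("198.51.100.7", 100),
    ("198.51.100.7", 101),
    ("192.0.2.10", 100),
]

peaks = peak_rate_by_ip(sample)
print(peaks)
```

Any address whose peak is well above one request per second is a candidate for a closer look before it goes on the deny list.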
We are already looking for the robots.txt on a daily basis. And if the site uses a single hostname, we should not come back more often than every 3 seconds.
If you have observed something else, it is a bug on our side; the current algorithm should respect this.
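As a rough picture of what honouring robots.txt looks like from the crawler's side, here is a small Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `example.com` URLs, and the `/private/` path are invented for illustration, and this is not a description of how Exabot actually works. Note that `Crawl-delay` is a widely used extension rather than part of the original robots.txt convention, so not every crawler honours it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish.
ROBOTS_TXT = """\
User-agent: Exabot
Crawl-delay: 3
Disallow: /private/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse in memory, no network access

# A polite crawler asks before every fetch.
allowed = parser.can_fetch("Exabot", "http://www.example.com/index.html")
blocked = parser.can_fetch("Exabot", "http://www.example.com/private/report.html")

# Requested pause between fetches, if the site declares one (None otherwise).
delay = parser.crawl_delay("Exabot")

print(allowed, blocked, delay)
```

In a real crawler the file would be fetched from the site (for example with `set_url` and `read`) and refreshed periodically, which matches the daily check mentioned above.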
Can you give me, at least by private mail, the URL of the corresponding site so that we can investigate?
As for the various reports on the web, the problem was generally due to features on the edge of the robots.txt specs, and it was solved directly with the people responsible, even if they do not all report that on their blogs afterwards.
Posted by: ExaleadGuy on August 2, 2006 05:08 PM
Sure, I'll drop you an email with the site info.
Posted by: Kevin on August 2, 2006 05:12 PM