July 31, 2006
Exabot

Exabot has to be the rudest spider out there. They totally ignore robots.txt; actually, they don't even seem to look for it. Then they make requests CONSTANTLY, at many pages per second. Rude, rude, rude.
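For comparison, here is a minimal Python sketch of the two courtesies a well-behaved spider is expected to observe: checking robots.txt before fetching, and spacing out requests to the same hostname. This is not Exabot's code. The class name PoliteFetchPolicy, the user agent ExampleBot, and the 3-second default delay are all assumptions made for illustration.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PoliteFetchPolicy:
    """Hypothetical helper: decides whether and when a crawler may request a URL."""

    def __init__(self, user_agent, min_delay=3.0, clock=time.monotonic):
        self.user_agent = user_agent
        self.min_delay = min_delay   # assumed minimum gap between hits, in seconds
        self.clock = clock           # injectable so the logic can be tested
        self.rules = {}              # hostname -> parsed robots.txt
        self.last_hit = {}           # hostname -> time of the last request

    def load_robots(self, host, robots_txt):
        """Parse the text of a robots.txt file already downloaded for host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def allowed(self, url):
        """True only if robots.txt for the URL's host is loaded and permits it."""
        parser = self.rules.get(urlsplit(url).hostname)
        # No robots.txt loaded yet: refuse rather than guess.
        return parser is not None and parser.can_fetch(self.user_agent, url)

    def seconds_to_wait(self, url):
        """How long to sleep before hitting this URL's host again."""
        last = self.last_hit.get(urlsplit(url).hostname)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record_hit(self, url):
        """Remember that a request to this URL's host was just made."""
        self.last_hit[urlsplit(url).hostname] = self.clock()
```

A crawler loop would call allowed() first, sleep for seconds_to_wait(), fetch, then call record_hit(). Note that the delay is keyed by hostname, so a site reachable under many hostnames would still see more traffic than the per-host delay suggests.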

Exabot/Exalead.. you guys suck. Welcome to Deny.

They used to use 193.47.80.42; now they use 193.47.80.46, in case anyone wants to be pre-emptive.
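If blocking at the web server is not an option, the same deny list can be checked in application code. The following is only a sketch: the function name is_denied is made up for illustration, and the two addresses are the ones given above.

```python
import ipaddress

# The two addresses mentioned in this post; extend the set as needed.
DENIED_ADDRESSES = {
    ipaddress.ip_address("193.47.80.42"),
    ipaddress.ip_address("193.47.80.46"),
}


def is_denied(remote_addr):
    """Return True if the client address (a string) is on the deny list."""
    try:
        return ipaddress.ip_address(remote_addr) in DENIED_ADDRESSES
    except ValueError:
        # Not a parseable IP address, so it cannot match the list.
        return False
```

A request handler would call is_denied() with the client's address and answer with a 403 when it returns True.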

Posted by Kevin at July 31, 2006 09:33 AM
Comments

We are sorry for the numerous hits on your site.

We do observe some fairness towards the sites we crawl by not downloading a new page more often than every 3 seconds.
However, our algorithm is based on hostnames, and if your site is designed to generate a huge number of different hostnames, it will mislead our crawler. This issue should be solved in a few weeks.

If it annoys you, you can contact me and we can add special treatment for your site.

Thank you for this advertisement.

Posted by: ExaleadGuy on August 2, 2006 4:44 PM

It's not an advertisement. Fairness is looking for robots.txt, which I didn't see happen. And since the world should just see www.domain.com, there should be one hostname. You hit www.domain.com multiple times a second requesting different pages, non-stop. Fairness is only grabbing a few pages, waiting, then grabbing some more. Not being overly aggressive.

I couldn't care less if it were my personal site, but this was my job site. It was pounding the site, not once, but twice. I had to ban your IP the last time too. I've seen numerous reports on boards about your spider doing the same elsewhere.

I'll wait a few weeks, unban the IP, then see if it does it again. If it does, I'll just email you directly so maybe you can check your spider logs to see what it's doing.

Posted by: Kevin on August 2, 2006 5:01 PM

We are already looking for the robots.txt on a daily basis. And if the site uses a single hostname, we should not come back more often than every 3 seconds.
If you have observed something else, it is a bug on our side; the current algorithm should respect this.
Can you give me, at least by private email, the URL of the site in question so that we can investigate?

As for the various reports on the web, the problem was generally due to features at the edge of the robots.txt spec, and it was resolved directly with the site owners, even if they do not all report that on their blogs afterwards.

Posted by: ExaleadGuy on August 2, 2006 5:08 PM

Sure, I'll drop you an email with the site info.

Posted by: Kevin on August 2, 2006 5:12 PM