Computer Science Canada

Responsible crawler

Author:  mirhagk [ Fri Aug 09, 2013 9:59 am ]
Post subject:  Responsible crawler

I am building a crawler to collect data from a website. It currently makes one request at a time and waits for each to finish before moving on to the next, but since it's crawling about 10K pages it takes many hours to complete. It caches old content so that I can pause it partway through without a problem, but I'd still like to make it faster.

Now I can make thousands of requests simultaneously, but obviously that's mean to the server, and I'd like to do it in a responsible manner. What would be a responsible rate at which to crawl the site (how many milliseconds or seconds should I delay before launching a new request)? I want to make sure my crawler doesn't cause an inconvenience.

The site doesn't have a robots.txt, and the crawler only targets very specific pages (all of which are public, main pages), so I'm making sure I'm not hitting anything like login servers.

Thanks,
Nathan Jervis
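
The setup described in this post (one request at a time, a cache of pages already fetched, and a fixed delay between requests) could be sketched roughly as follows. This is only an illustration: `crawl_politely`, `fetch`, `cache` and `min_interval` are hypothetical names, and the 1-second default is a placeholder, not a rate anyone in this thread recommends. At one request per second, 10K uncached pages would take roughly 2.8 hours.

```python
import time
from typing import Callable, Dict, Iterable


def crawl_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    cache: Dict[str, str],
    min_interval: float = 1.0,  # assumed placeholder, in seconds
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch urls sequentially, starting requests at least min_interval apart."""
    last_start = None
    for url in urls:
        if url in cache:
            continue  # fetched on an earlier run, so no request is needed
        if last_start is not None:
            # Sleep only for whatever part of the interval has not yet elapsed.
            remaining = min_interval - (clock() - last_start)
            if remaining > 0:
                sleep(remaining)
        last_start = clock()
        cache[url] = fetch(url)
    return cache
```

Measuring the interval from the start of the previous request means a slow response eats into the wait instead of adding to it. `clock` and `sleep` are parameters only so the pacing can be checked without real waiting.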

Author:  DemonWasp [ Sat Aug 10, 2013 2:06 am ]
Post subject:  RE:Responsible crawler

Why are you crawling a site so frequently that you want it to take less time to complete? The answer to that might help reveal a meaningful answer to your question.

Author:  rdrake [ Sat Aug 10, 2013 9:04 am ]
Post subject:  Re: RE:Responsible crawler

DemonWasp @ Sat Aug 10, 2013 2:06 am wrote:
Why are you crawling a site so frequently that you want it to take less time to complete? The answer to that might help reveal a meaningful answer to your question.

^ this

Schedule it to run off-peak (~2-3 AM) and then it won't matter if it takes a few hours. It shouldn't affect other users and as an added bonus you won't be awake to care how long it takes.

Author:  DemonWasp [ Sat Aug 10, 2013 12:47 pm ]
Post subject:  Re: RE:Responsible crawler

rdrake @ Sat Aug 10, 2013 9:04 am wrote:
Schedule it to run off-peak (~2-3 AM) and then it won't matter if it takes a few hours. It shouldn't affect other users and as an added bonus you won't be awake to care how long it takes.


Good idea. When doing that, though, keep in mind the site's audience and their preferred waking hours. Not everyone lives in North America; e.g. if the site is targeted at Australia, then aim for 2-3 AM Australian Eastern Standard Time.
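
One way to apply the timezone point is to compute the wait until the off-peak window in the site's local time rather than the crawler's. The sketch below is a minimal example under stated assumptions: `seconds_until_window` is a hypothetical helper, the 2-3 AM window is the one suggested above, and the caller supplies the site's timezone.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def seconds_until_window(
    now_utc: datetime,
    site_tz: tzinfo,
    start_hour: int = 2,  # window start in site-local time
    end_hour: int = 3,    # window end in site-local time (exclusive)
) -> float:
    """Return 0.0 inside the window, else seconds until it next opens.

    now_utc must be timezone-aware.
    """
    local = now_utc.astimezone(site_tz)
    if start_hour <= local.hour < end_hour:
        return 0.0
    start = local.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    if start <= local:
        start += timedelta(days=1)  # today's window has passed; use tomorrow's
    # Compare in UTC so a daylight-saving change in between is counted correctly.
    return (start.astimezone(timezone.utc) - now_utc).total_seconds()
```

For a real site the timezone would normally come from the standard library, e.g. `zoneinfo.ZoneInfo("Australia/Sydney")` on Python 3.9+, and the current time from `datetime.now(timezone.utc)`.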

Author:  mirhagk [ Sat Aug 10, 2013 10:27 pm ]
Post subject:  RE:Responsible crawler

That's a good idea. Mostly I just need the data quickly, so I didn't want to wait a really long time for it to complete. I guess running it overnight is a good idea (this site is Canada-based and doesn't really even operate at night).

So is there a rate limit I should set, or should I just wait until the last request finishes before sending the next one?

Author:  rdrake [ Sat Aug 10, 2013 11:59 pm ]
Post subject:  RE:Responsible crawler

You can also contact the webmaster and ask them if they're OK with you crawling their site and when. I'm not sure you're obligated to do so, but it seems like the "nice" thing to do.

Author:  chrisbrown [ Sun Aug 11, 2013 12:30 am ]
Post subject:  Re: Responsible crawler

Quote:
So is there a rate limit I should set, or should I just wait until the last request finishes before sending the next one?

Here's a "just for fun" for you:

Let's assume that the time to complete a request is proportional to the load on the server, and that the time it takes to route a request varies randomly from request to request but stays roughly constant on average.

So if you track the rate of change of how long it takes to complete a request and adjust the period between requests proportionally, the system will naturally settle at a rate that doesn't apply an unsustainable load to the server.

By tuning the proportionality constant, you can track the server load fairly closely while filtering out fluctuations in routing delay.

And by subtracting a very small constant from the delay, you can nudge the system towards higher frequencies when the request time is constant.

code:
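var reqStartTime, reqEndTime
var curReqDuration, lastReqDuration
var delay = 1.0      // seconds
var minDelay = 0.1   // seconds; assumed floor so the delay never reaches zero
var scale = 0.123    // proportionality constant, to be tuned
var bias = 0.01      // seconds shaved off per request
var firstRequest = true

loop
    reqStartTime = curTime()
    makeRequest()
    reqEndTime = curTime()

    lastReqDuration = curReqDuration
    curReqDuration = reqEndTime - reqStartTime

    // No previous duration exists on the first pass, so skip the adjustment.
    if not firstRequest then
        delay += scale * (curReqDuration - lastReqDuration) - bias
        delay = max(delay, minDelay)
    end if
    firstRequest = false

    sleep(delay)
end loop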
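
For anyone who wants to try the idea, here is a rough Python version of the pseudocode above. It is a sketch, not a tested crawler: `next_delay`, `adaptive_crawl` and `fetch` are hypothetical names, and the `min_delay` and `max_delay` clamps are assumptions added so that the delay can never go negative or grow without bound.

```python
import time
from typing import Callable, Iterable, Optional


def next_delay(
    delay: float,
    cur_duration: float,
    last_duration: Optional[float],
    scale: float = 0.123,    # proportionality constant from the pseudocode
    bias: float = 0.01,      # seconds shaved off per request
    min_delay: float = 0.1,  # assumed floor, in seconds
    max_delay: float = 60.0, # assumed ceiling, in seconds
) -> float:
    """Return the delay to use before the next request."""
    if last_duration is not None:
        # Slower responses lengthen the delay; the bias nudges it back down.
        delay += scale * (cur_duration - last_duration) - bias
    return max(min_delay, min(delay, max_delay))


def adaptive_crawl(
    urls: Iterable[str],
    fetch: Callable[[str], object],
    delay: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Fetch urls one at a time, adapting the pause; return the final delay."""
    last_duration = None
    for url in urls:
        start = clock()
        fetch(url)
        cur_duration = clock() - start
        delay = next_delay(delay, cur_duration, last_duration)
        last_duration = cur_duration
        sleep(delay)
    return delay
```

Note that the adjustments telescope: before clamping, the delay after n requests is the starting delay plus scale times (latest duration minus first duration), minus (n - 1) times bias. With a steady response time the bias alone walks the delay down, so the floor is what ultimately caps the request rate.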

