
-----------------------------------
mirhagk
Fri Aug 09, 2013 9:59 am

Responsible crawler
-----------------------------------
I am building a crawler to collect data from a website. It currently does one request at a time, waiting until each is done before moving on to the next, but since it's crawling about 10K pages it takes many hours to complete. I have it cache old content so that I can pause it partway through without a problem, but I'd still like to make it faster.

Now I can make thousands of requests simultaneously, but obviously that's mean to the server, and I'd like to do it in a responsible manner. What would be a responsible rate at which to crawl the site (how many milliseconds or seconds should I delay before launching a new request)? I want to make sure my crawler doesn't cause an inconvenience.

The site doesn't have a robots.txt, and the crawler is targeted at very specific pages (all of which are public, main pages), so I'm making sure I'm not hitting anything like login pages, etc.
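For reference, a stripped-down version of what I'm doing now, with an optional fixed pause between requests (`fetch` and the URL list are stand-ins for my real code):

```python
import time

def crawl(urls, fetch, delay_s=1.0, sleep=time.sleep):
    """Fetch each URL sequentially, pausing delay_s seconds between requests.

    fetch and sleep are injectable so the loop is easy to test; in the real
    crawler, fetch does the HTTP request and caches the response.
    """
    results = []
    for i, url in enumerate(urls):
        results.append(fetch(url))
        if i < len(urls) - 1:   # no pause needed after the final request
            sleep(delay_s)
    return results
```

At 10K pages, even a 1-second pause alone adds close to three hours, which is why I'm wondering what delay is actually reasonable.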

Thanks,
Nathan Jervis

-----------------------------------
DemonWasp
Sat Aug 10, 2013 2:06 am

RE:Responsible crawler
-----------------------------------
Why are you crawling a site so frequently that you want it to take less time to complete? The answer to that might help reveal a meaningful answer to your question.

-----------------------------------
rdrake
Sat Aug 10, 2013 9:04 am

Re: RE:Responsible crawler
-----------------------------------
Why are you crawling a site so frequently that you want it to take less time to complete? The answer to that might help reveal a meaningful answer to your question.
^ this

Schedule it to run off-peak (~2-3 AM) and then it won't matter if it takes a few hours.  It shouldn't affect other users and as an added bonus you won't be awake to care how long it takes.

-----------------------------------
DemonWasp
Sat Aug 10, 2013 12:47 pm

Re: RE:Responsible crawler
-----------------------------------
Schedule it to run off-peak (~2-3 AM) and then it won't matter if it takes a few hours.  It shouldn't affect other users and as an added bonus you won't be awake to care how long it takes.

Good idea. When doing that, though, keep in mind the site's audience and their waking hours: not everyone lives in North America. If the site is targeted at Australia, for example, aim for 2-3 AM Australian Eastern Standard Time.
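If the crawler runs unattended, something like this can work out how long to sleep before kicking off at the target local time (Python 3.9+, since zoneinfo is in the standard library; the timezone name is whatever fits the site's audience):

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

def seconds_until(hour, tz_name):
    """Seconds from now until the next time it is `hour`:00 in tz_name."""
    tz = ZoneInfo(tz_name)
    now = datetime.now(tz)
    target = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= now:           # already past that hour today; use tomorrow's
        target += timedelta(days=1)
    return (target - now).total_seconds()

# e.g. sleep until 2 AM Sydney time, then start crawling:
# time.sleep(seconds_until(2, "Australia/Sydney"))
```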

-----------------------------------
mirhagk
Sat Aug 10, 2013 10:27 pm

RE:Responsible crawler
-----------------------------------
That's a good idea. Mostly I just need the data soon, so I didn't want to wait a really long time for it to complete. Running it overnight makes sense (the site doesn't really see any activity at night anyway, and is Canada-based).

So is there still a rate limit I should respect, or is just waiting until the last request finishes enough?

-----------------------------------
rdrake
Sat Aug 10, 2013 11:59 pm

RE:Responsible crawler
-----------------------------------
You can also contact the webmaster and ask them if they're OK with you crawling their site and when.  I'm not sure you're obligated to do so, but it seems like the "nice" thing to do.

-----------------------------------
chrisbrown
Sun Aug 11, 2013 12:30 am

Re: Responsible crawler
-----------------------------------
So is there still a rate limit I should respect, or is just waiting until the last request finishes enough?
Here's a "just for fun" for you:

Let's assume that the time to complete a request is proportional to the load on the server, and that the routing delay is noisy but statistically stable over time.

So if you track the rate of change of how long it takes to complete a request and adjust the period between requests proportionally, the system will naturally settle at a rate that doesn't apply an unsustainable load to the server.

By tuning the proportionality constant, you can track the server load fairly closely while filtering out fluctuations in routing delay.

And by subtracting a very small constant from the delay, you can nudge the system towards higher frequencies when the request time is constant.

[code]import time

delay = 1.0     # seconds between requests
scale = 0.123   # proportionality constant
bias = 0.01     # small nudge towards faster crawling
cur_duration = None

while True:
    start = time.monotonic()
    make_request()              # your request function
    cur_duration, last_duration = time.monotonic() - start, cur_duration

    # There's no previous duration on the first pass, so skip the adjustment.
    if last_duration is not None:
        delay = max(0.0, delay + scale * (cur_duration - last_duration) - bias)
    time.sleep(delay)
[/code]
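To sanity-check the behaviour, the update rule can be factored into a pure function and poked at directly (a sketch; the constants are the same arbitrary ones as above, with a clamp added so the delay can never go negative):

```python
def next_delay(delay, cur_duration, last_duration, scale=0.123, bias=0.01):
    """One step of the proportional adjustment, clamped at zero."""
    return max(0.0, delay + scale * (cur_duration - last_duration) - bias)

# Steady request times: delay creeps down by `bias` each step.
print(next_delay(1.0, 0.2, 0.2))
# Request times rising (server under load): delay grows.
print(next_delay(1.0, 0.5, 0.2))
# Request times falling sharply: the clamp stops the delay going negative.
print(next_delay(0.0, 0.1, 0.5))
```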
