Author |
Message |
Tony
|
Posted: Mon Jan 10, 2005 2:22 pm Post subject: Determining if a webpage has changed |
|
|
So here's what I'm interested in finding out: how would I be able to tell if a given webpage has changed since my last visit?
The most obvious method that comes to mind is checking the META time-stamp to see when the page was generated, but that appears to apply only to static .HTML pages. Anything dynamic (.PHP for example) and the approach goes flying out the window.
I suppose I could just cache the page, but then even the most minor changes will set off flags (that, and the space issue). Is there a better way? I was thinking of just keeping track of the hyperlinks on the page or something |
|
|
|
|
|
wtd
|
Posted: Mon Jan 10, 2005 3:44 pm Post subject: Re: Determining if a webpage has changed |
|
|
tony wrote: I suppose I could just cache the page, but then even the most minor changes will set off flags (that, and the space issue). Is there a better way? I was thinking of just keeping track of the hyperlinks on the page or something
You wouldn't have to cache the entire page. Just use an MD5 hash or such. |
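A minimal sketch of that suggestion, in Python (the thread does not name a language for this step, so that choice is an assumption). The two byte strings are made-up stand-ins for two downloads of the same URL. Only the short digest has to be kept between visits, not the page itself.

```python
import hashlib


def page_digest(content: bytes) -> str:
    """Return the MD5 digest of a page's raw bytes as a hex string."""
    return hashlib.md5(content).hexdigest()


# Hypothetical page contents standing in for two separate downloads.
first_visit = b"<html><body>News: nothing yet</body></html>"
second_visit = b"<html><body>News: something happened</body></html>"

stored = page_digest(first_visit)  # the only thing kept between visits
changed = page_digest(second_visit) != stored
print(changed)
```

Note that any byte-level difference flips the digest, so this answers "did anything change at all", not "what changed".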
|
|
|
|
|
Amailer
|
Posted: Mon Jan 10, 2005 4:11 pm Post subject: (No subject) |
|
|
http://web.archive.org/web/*/http://www.compsci.ca
does something like that, but it looks like it archives most of the pages or just the front page... |
|
|
|
|
|
Tony
|
Posted: Mon Jan 10, 2005 5:15 pm Post subject: Re: Determining if a webpage has changed |
|
|
wtd wrote: Just use an MD5 hash or such.
could you elaborate a bit please?
Amailer: web.archive.org does that about once a month... I need to be able to capture "updates" to the webpage as they appear... some have news feeding through them quite fast... damn spammers |
|
|
|
|
|
rizzix
|
Posted: Mon Jan 10, 2005 6:42 pm Post subject: (No subject) |
|
|
Tony, in Java (5.0) it's just a matter of using the Adler32 or the CRC32 class (both in java.util.zip) |
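The post names the Java classes. As a rough equivalent, here is a sketch using Python's `zlib` module, which exposes the same two checksums. Python is swapped in here as an assumption of this example, and the `checksums` helper name is made up for illustration.

```python
import zlib
from typing import Tuple


def checksums(content: bytes) -> Tuple[int, int]:
    """Return (Adler-32, CRC-32) of the given bytes as unsigned integers."""
    return zlib.adler32(content), zlib.crc32(content)


# Hypothetical page bytes; in practice this would be the downloaded page.
adler, crc = checksums(b"<html><body>front page</body></html>")
print(f"adler32={adler:08x} crc32={crc:08x}")
```

Storing either number per page and comparing it on the next visit works the same way as with a longer hash.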
|
|
|
|
|
wtd
|
Posted: Mon Jan 10, 2005 7:55 pm Post subject: (No subject) |
|
|
To expand on that, hashing basically just takes a big chunk of data and performs a calculation on it such that two different pieces of data are extremely unlikely to yield the same "hash". MD5 is a pretty good algorithm. |
|
|
|
|
|
rizzix
|
Posted: Mon Jan 10, 2005 8:56 pm Post subject: (No subject) |
|
|
And to expand on that: a hash function generates a fixed-length identity (usually a series of bits) for variable-length data. What's important to note here is that the result is always of a fixed length; its length depends on the function.
MD5 is a comparatively slow function, but it generates a 128-bit hash, while CRC32 generates a 32-bit hash. Adler32 also generates a 32-bit value; it's the fastest of the lot and roughly as reliable as CRC32, though it is known to be weaker on very short inputs. |
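To illustrate the fixed-length point, here is a small Python sketch (Python is an assumption of this example; the inputs are arbitrary). It hashes a one-byte input and a 100,000-byte input and reports the output size in bits for each function.

```python
import hashlib
import zlib


def md5_bits(data: bytes) -> int:
    """Size of the MD5 digest in bits, whatever the input length."""
    return len(hashlib.md5(data).digest()) * 8


short_input = b"a"
long_input = b"a" * 100_000

# MD5 output is the same size for both inputs.
print(md5_bits(short_input), md5_bits(long_input))

# CRC-32 and Adler-32 results always fit in 32 bits.
for data in (short_input, long_input):
    print(zlib.crc32(data) < 2**32, zlib.adler32(data) < 2**32)
```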
|
|
|
|
|
josh
|
Posted: Mon Jan 10, 2005 10:16 pm Post subject: (No subject) |
|
|
I am just curious as to why you would need to keep track of the changes on a webpage? Just wondering about what this applies to? |
|
|
|
|
|
Hikaru79
|
Posted: Mon Jan 10, 2005 10:56 pm Post subject: (No subject) |
|
|
wtd wrote: To expand on that, hashing basically just takes a big chunk of data and performs a calculation on it such that two different pieces of data are extremely unlikely to yield the same "hash". MD5 is a pretty good algorithm.
MD5 is apparently on the way out. I remember seeing a bunch of articles about it on Slashdot. Here's a few of them:
http://slashdot.org/article.pl?sid=04/08/17/0030243&tid=93&tid=162&tid=1
http://developers.slashdot.org/article.pl?sid=04/07/03/1728231&tid=172&tid=93&tid=8
http://developers.slashdot.org/article.pl?sid=04/12/07/2019244&tid=93&tid=172&tid=8
That last one is especially interesting... to quote it:
Slashdot wrote: ...essentially, we can create 'doppelganger' blocks (my term) anywhere inside a file that may be swapped out, one for another, without altering the final MD5 hash. This lets us create any number of binary-inequal files with the same md5sum. But MD5 uses an appendable cascade construction -- in other words, if you happen to find yourself with two files that MD5 to the same hash, an arbitrary payload can be applied to both files and they'll still have the same hash. Wang released the two files needed (but not the collision finder itself). A tool, Stripwire, demonstrates the use of colliding datasets to create two executable packages with wildly different behavior but the same MD5 hash. The faults discovered are problematic but not yet fatal; developers (particularly of P2P software) who claim they'd like advance notice that their systems will fail should take note." |
|
|
|
|
|
Tony
|
Posted: Tue Jan 11, 2005 9:02 am Post subject: (No subject) |
|
|
rhysticlight wrote: just wondering about what this applies to?
A more general example would be search engines: when a website is updated, Google wants to keep up, but obviously it's a waste of resources to catalog some huge site all over again if it was not altered significantly.
Though in my case I want to be able to monitor for certain changes (such as news updates, for example). Determining if the contents of the page have changed at all is the first step.
wtd, rizzix: thx guys. This information could come in useful |
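The "first step" described in this post, noticing that a page changed at all, could be sketched like this in Python. Everything here is an assumption for illustration: the `ChangeTracker` class, the `fetch` callback, and the example URL are hypothetical, and the fetch is faked with a dictionary so the sketch runs without network access.

```python
import hashlib
from typing import Callable, Dict


class ChangeTracker:
    """Remembers one digest per URL and reports when it differs."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch              # hypothetical download function
        self._seen: Dict[str, str] = {}  # url -> last digest

    def has_changed(self, url: str) -> bool:
        digest = hashlib.md5(self._fetch(url)).hexdigest()
        previous = self._seen.get(url)
        self._seen[url] = digest
        # The first visit only records a baseline, so it is not a change.
        return previous is not None and previous != digest


# Fake "web" so the example is self-contained.
url = "http://example.com/news"
pages = {url: b"old headline"}
tracker = ChangeTracker(lambda u: pages[u])

first = tracker.has_changed(url)   # baseline visit
second = tracker.has_changed(url)  # same content
pages[url] = b"new headline"
third = tracker.has_changed(url)   # content differs
print(first, second, third)
```

Injecting the fetch function keeps the comparison logic separate from however the pages are actually downloaded.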
|
|
|
|
|
Martin
|
Posted: Tue Jan 11, 2005 1:54 pm Post subject: (No subject) |
|
|
Take .png screenshots of the websites, and then use whatdotcolour to determine if they are rendering any differently. |
|
|
|
|
|
josh
|
Posted: Tue Jan 11, 2005 5:57 pm Post subject: (No subject) |
|
|
I see now, thanks for the good example |
|
|
|
|
|
Tony
|
Posted: Tue Jan 11, 2005 9:46 pm Post subject: (No subject) |
|
|
martin wrote: use whatdotcolour to determine if they are rendering
maybe I can subcontract Andy for this project
well, hashing seems to be a good trick for this, works quite nicely.
on top of that I will try to play around with some regular expressions to perhaps determine what kind of changes have occurred. Thx for the tutorials wtd - they are amazing |
|
|
|
|
|
wtd
|
Posted: Wed Jan 12, 2005 12:03 am Post subject: (No subject) |
|
|
Thank you. Be careful though. HTML is beyond the parsing capabilities of regular expressions. There are proper HTML/XML parsing tools for many languages, though. |
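As one example of such a tool, Python's standard library includes `html.parser`. The sketch below uses it to collect the hyperlinks on a page, which is the idea floated in the first post. The `LinkCollector` class, the `extract_links` helper, and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser
from typing import List


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html: str) -> List[str]:
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical markup with mixed-case tags, which a naive regex could miss.
sample = '<p><a href="/news/1">One</a> <A HREF="/news/2">Two</A></p>'
links = extract_links(sample)
print(links)
```

Comparing the list of links between visits (or hashing it) ignores cosmetic edits elsewhere on the page.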
|
|
|
|
|
Andy
|
Posted: Wed Jan 12, 2005 7:09 pm Post subject: (No subject) |
|
|
Oooooo SNAP! yea i'd love to rofl |
|
|
|
|
|
|