
-----------------------------------
r.3volved
Wed Jun 14, 2006 7:04 pm

Spiders and creepy Crawlers
-----------------------------------
Does anyone have experience here with Java based webcrawlers?
I've tried out some open-source crawlers, but they don't offer what I want... most seem to be built for spidering and mirroring full sites.

What I'm looking to do is crawl a single domain and parse each page for specific data. Then export this data to a .CSV file for later use.
Has anyone come across any open-source apps that will do this, or that will let me, say, search for the word "Description" and grab all the text after it until I hit the end of a table row?
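For what it's worth, the "grab everything after 'Description' until the end of the table row" step doesn't need a framework at all; it can be sketched in plain Java string handling. This is only a rough, hypothetical example — the class and method names (RowExtractor, extractAfterLabel, toCsvField) are made up, and a real page would need more robust HTML handling:

```java
// Hypothetical sketch: pull the text that follows a label (e.g. "Description")
// up to the end of the enclosing table row, then quote it as a CSV field.
public class RowExtractor {

    // Returns the text between the label and the next </tr>, or null if
    // either the label or the row end is missing.
    static String extractAfterLabel(String html, String label) {
        int start = html.indexOf(label);
        if (start < 0) return null;
        start += label.length();
        int end = html.indexOf("</tr>", start);
        if (end < 0) return null;
        // Strip any remaining tags and collapse whitespace.
        return html.substring(start, end)
                   .replaceAll("<[^>]*>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    // Quote a value for CSV output, doubling any embedded quotes.
    static String toCsvField(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        String html = "<tr><td>Description</td><td>A sample widget</td></tr>";
        System.out.println(toCsvField(extractAfterLabel(html, "Description")));
        // prints "A sample widget" (with the surrounding CSV quotes)
    }
}
```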

I'm fairly proficient in Java, so modifying code is not a problem... I'm simply looking for a lightweight app that supports what I'm trying to do, without all the crap I don't need.

Any help is much appreciated.

-----------------------------------
rizzix
Wed Jun 14, 2006 7:40 pm


-----------------------------------
No, I don't know of any, but I do know some cool technologies you can put to good use to build your own webcrawler. :)

Search Engine: [url=http://lucene.apache.org/]Lucene[/url]
HTTP & Misc: [url=http://jakarta.apache.org/commons/index.html]Jakarta Commons (HttpClient, etc.)[/url]
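The core crawl loop with those pieces is basically: fetch a page (Commons HttpClient's GetMethod or the JDK's own HttpURLConnection both work), harvest the links that stay on your domain, and push them onto a queue of pages to visit. A rough sketch of just the link-harvesting step, with made-up names (LinkHarvester, sameDomainLinks) and a deliberately naive regex:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the crawler's link-harvesting step. Fetching the
// page body is left out; any HTTP client will do.
public class LinkHarvester {

    // Naive href matcher -- fine for a quick single-domain crawl, not for
    // arbitrary HTML in the wild.
    private static final Pattern HREF =
        Pattern.compile("href=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    // Collect links that stay on the given domain (absolute links on that
    // host, plus site-relative paths). The crawl loop would enqueue these.
    static List<String> sameDomainLinks(String html, String domain) {
        List<String> links = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String url = m.group(1);
            if (url.startsWith("http://" + domain) || url.startsWith("/")) {
                links.add(url);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<a href=\"/about.html\">About</a>"
                    + "<a href=\"http://example.com/faq\">FAQ</a>"
                    + "<a href=\"http://other.org/\">Elsewhere</a>";
        System.out.println(sameDomainLinks(page, "example.com"));
        // prints [/about.html, http://example.com/faq]
    }
}
```

Keep a Set of already-visited URLs alongside the queue so you don't loop; Lucene only enters the picture if you want to index the crawled text for searching afterwards.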

-----------------------------------
r.3volved
Thu Jun 15, 2006 10:01 pm


-----------------------------------
:? 
Does anyone have any relevant information in response to my question?
Some of you leet :roll: UofW students must know something about data mining...

-----------------------------------
Tony
Thu Jun 15, 2006 10:57 pm


-----------------------------------
I wrote my own web crawler in Ruby during a work term... It was for highly specialized data extraction, so I don't know of any pre-packaged programs (let alone Java ones).
