[Photo: The Dregs by jazzijava (CC BY-NC-ND)]
Node.js + Zombie - a Great Combo

As it happens, Node.js and associated technologies are a great fit for this purpose. You get to use a familiar query syntax. And there is tooling available. A lot of it.
[Photo: Disfigured by scabeater]
My absolute favorite is Zombie.js. Although designed mainly for testing, it often works fine for scraping too. node.io is another good alternative. In one case I had to fall back to a combination of request, htmlparser and soupselect, as zombie just didn't bite there.
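To give you an idea of what this looks like, here is a minimal Zombie.js sketch. The URL and the h2.headline selector are hypothetical stand-ins for whatever your target page contains, and the visit callback signature has varied a bit between Zombie releases.

```js
// A minimal scraping sketch with Zombie.js. The URL and selector
// below are made up for illustration.
var Browser = require('zombie');

var browser = new Browser();
browser.visit('http://example.com/news', function (error) {
  if (error) throw error;

  // queryAll evaluates a CSS selector and returns the matching DOM nodes
  var headlines = browser.queryAll('h2.headline').map(function (node) {
    return node.textContent.trim();
  });

  console.log(headlines);
});
```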
Basic Workflow

When it comes to scraping, the basic workflow is quite simple. During development it can be useful to stub out functionality and fill it in as you progress. Here is the rough approach I use:
- Figure out how the data is currently structured
- Come up with selectors to access it
- Map the data into some new structure
- Serialize the data into some machine readable format based on your needs
- Serve the data through a web interface if you want to (steps 3-5 are sketched right after this list)
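As a rough illustration of the last three steps, the sketch below maps scraped table rows into plain objects, serializes them as JSON, and serves the result over HTTP. The URL, the table.products selector, and the field names are all hypothetical; swap in whatever matches your target page.

```js
var http = require('http');
var Browser = require('zombie');

var browser = new Browser();
browser.visit('http://example.com/products', function (error) {
  if (error) throw error;

  // Step 3: map each table row into a plain object.
  // Selectors and field names are made up for illustration.
  var products = browser.queryAll('table.products tr').map(function (row) {
    var cells = browser.queryAll('td', row);
    return {
      name: cells[0] && cells[0].textContent.trim(),
      price: cells[1] && cells[1].textContent.trim()
    };
  });

  // Step 4: serialize into a machine readable format, JSON here
  var json = JSON.stringify(products);

  // Step 5: serve the data through a minimal web interface
  http.createServer(function (req, res) {
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(json);
  }).listen(8080);
});
```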
It can be helpful to know how to use Chrome Developer Tools or Firebug effectively. The SelectorGadget bookmarklet may come in handy too. If you feel like it, play around with jQuery selectors in your browser; being able to compose selectors effectively will serve you well.
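For instance, on a page that already loads jQuery you can paste something like the following into the console to test a selector before committing it to your scraper. The selectors here are hypothetical.

```js
// How many rows does the selector match?
$('table.results tr').length;

// Extract the first column as an array of strings
$('table.results tr td:first-child').map(function () {
  return $(this).text().trim();
}).get();
```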
[Photo: Shady Customer by Petur]
Even the crude methods briefly described here are often quite enough. But you can definitely make scraping a more interesting problem if you want to.