|The Dregs by jazzijava (CC BY-NC-ND)|
Node.js + Cheerio + Request - a Great Combo

As it happens, Node.js and its associated technologies are a great fit for this purpose. You get to use a familiar query syntax, and there is plenty of tooling available.
|Disfigured by scabeater|
My absolute favorite used to be Zombie.js. Although designed mainly for testing, it often works fine for scraping as well. node.io is another good alternative. In one case I had to use a combination of request, htmlparser and soupselect, as Zombie just didn't bite there.
These days I prefer the combination of cheerio and request. Getting this combo to work in various environments is easier than with Zombie. In addition you get to operate with a familiar jQuery-style syntax, which is a big bonus as well.
Basic Workflow

When it comes to scraping, the basic workflow is quite simple. During development it can be useful to stub out functionality and fill it in as you progress. Here is the rough approach I use:
- Figure out how the data is structured currently
- Come up with selectors to access it
- Map the data into some new structure
- Serialize the data into some machine-readable format based on your needs
- Serve the data through a web interface if you so want
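The steps above can be stubbed out roughly like this. All the names and data here are placeholders for illustration; in a real scraper the extract step would use cheerio selectors against fetched HTML.

```javascript
// Stubbed-out scraping workflow: fill each step in as you go.

// 1-2. Figure out the structure and select the raw data.
// Hardcoded here; a real version would parse a page.
function extract() {
  return [{ name: 'Lounasravintola', dish: 'Pea soup' }];
}

// 3. Map the raw data into the structure you actually want.
function transform(items) {
  return items.map(function (item) {
    return { restaurant: item.name, menu: [item.dish] };
  });
}

// 4. Serialize into a machine-readable format.
function serialize(items) {
  return JSON.stringify(items, null, 2);
}

var output = serialize(transform(extract()));
console.log(output);
```

Stubbing like this lets you wire up the whole pipeline first and then replace each step with the real thing.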
It can be helpful to know how to use Chrome Developer Tools or Firebug effectively. The SelectorGadget bookmarklet may come in handy too. If you feel like it, play around with jQuery selectors in your browser; being able to compose selectors effectively will serve you well.
|Shady Customer by Petur|
sonaatti-scraper scrapes some restaurant data. It uses node.io, comes with a small CLI tool and makes it possible to serve the data through a web interface.
There is some room for improvement. For instance, it would be a good idea not to scrape the data each time a query hits the web API; a cache of some sort would avoid unnecessary polling. Still, it is a good starting point given its simplicity.
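A cache along those lines can be very simple. The sketch below keeps the scraped value in memory with a made-up time-to-live; the names and the ten-minute figure are illustrative, and a real scraper would fetch asynchronously rather than call a synchronous function.

```javascript
// Minimal in-memory cache so the web API does not re-scrape
// on every request.
function createCache(ttlMs) {
  var value = null;
  var fetchedAt = 0;
  return {
    get: function (fetchFn) {
      var now = Date.now();
      // Re-fetch only when there is no value or it has expired.
      if (value === null || now - fetchedAt > ttlMs) {
        value = fetchFn();
        fetchedAt = now;
      }
      return value;
    }
  };
}

// Usage: scrape at most once per ten minutes.
var cache = createCache(10 * 60 * 1000);
var data = cache.get(function () { return { scrapedAt: Date.now() }; });
```

Anything fancier (persistence, per-query keys, background refresh) can be layered on top of the same idea.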
My other example, jklevents, is based on cheerio. It is a lot more complex, as it parses through a whole collection of pages, not just one. It also performs tasks such as geocoding to further improve the quality of the data.
lte-scraper uses cheerio and request. The implementation is fairly short and may be worth studying.
When scraping, be polite. Sometimes the "targets" of your scraping might actually be happy that you are doing some of the work for them. In the case of jkl-event-scraper I contacted the rights holder of the data and we agreed on an attribution deal, so it is alright to use the data commercially as long as there is attribution.
This is just a point I wanted to make, as good things can sometimes come out of this sort of contact. In the best case you might even earn a client this way.
Node.js is an amazing platform for scraping. The tooling is mature enough, and you get to use a familiar query syntax. It does not get much better than that, for me at least. I believe it could be interesting to apply fuzzier approaches (think AI) to scraping.
For instance, in the case of restaurant data this might lead to a more generic scraper you could then apply to many pages containing that type of data. After all, there is a certain structure to it, although the way it is laid out in the DOM will always vary somewhat.
Even the crude methods described here briefly are often quite enough. But you can definitely make scraping a more interesting problem if you want to.