Web scraping is a technique for extracting information from websites. Usually, information about products, services, prices or discounts is targeted. For example, a start-up might need to keep up with its competitors' prices on various products or services.
Traditionally, acquiring this kind of information and keeping it up to date could mean a lot of manual work. Today you can pay for commercial scraping services, write the code yourself, or even get free tools that do all the work for you.
What used to require some level of programming skill has today been replaced with automated tools. You don't need any pre-existing knowledge; you only need to be able to read and have access to a computer with an Internet connection. Online you can find tutorials (step-by-step videos) and forums supporting you in your quest of scraping.
To demonstrate the point, here is Import.io, a free scraping tool that allows anyone to scrape, regardless of his or her previous knowledge.
I typed in my desired URL; for this example I went to IKEA's webpage and searched for tables.
I got the product name, price, model and description. Now that I have the data, I might want to explore the data set.
My data set was ready for exploring! I also had the choice to export the data to my desktop. Scraping could not be easier.
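For readers comfortable with a little code, the same kind of extraction can be done by hand. Below is a minimal sketch using only Python's standard library; the HTML snippet and its class names are invented for illustration and do not reflect IKEA's actual markup.

```python
from html.parser import HTMLParser

# Hypothetical product-listing markup; real sites structure pages differently.
HTML = """
<div class="product"><span class="name">LACK</span><span class="price">9.99</span></div>
<div class="product"><span class="name">LISABO</span><span class="price">199.00</span></div>
"""

class ProductParser(HTMLParser):
    """Collects name/price pairs from spans tagged with the assumed classes."""
    def __init__(self):
        super().__init__()
        self.products = []   # list of {"name": ..., "price": ...} dicts
        self._field = None   # which field the current <span> holds, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "div" and cls == "product":
            self.products.append({})
        elif tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field and self.products:
            self.products[-1][self._field] = data.strip()
            self._field = None

parser = ProductParser()
parser.feed(HTML)
print(parser.products)
# → [{'name': 'LACK', 'price': '9.99'}, {'name': 'LISABO', 'price': '199.00'}]
```

In practice you would fetch the page first (for example with `urllib.request`) and feed the response body to the parser, but the parsing step is the heart of it.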
Web scraping is today in its golden age, not only because of how simple it is to use, but also because of the legal "grey area" it occupies. On the opposing side are the web administrators of the world, for whom protecting the information on a site, and making sure as much of its traffic as possible is human, is a challenge.
Since legislation has not kept pace with the rapid development of the web, it is neither easy nor cheap to settle disputes in court. Therefore web administrators need to know what kind of traffic is accessing their sites. Today you can block specific IPs, as well as ranges of IPs, from browsing your site, but because of the changing "nature" of IPs, it can be hard to keep track of which ones to block. To force users to verify that they are indeed real people, you can insert CAPTCHA tests; however, there exist online services that circumvent these tests.
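Checking an address against a block list of individual IPs and ranges can be sketched in a few lines with Python's `ipaddress` module. The blocked networks below are reserved documentation ranges used purely as examples, not real scraper addresses:

```python
import ipaddress

# Example block list mixing a whole CIDR range and a single address.
# These are reserved documentation ranges, used here only for illustration.
BLOCKED = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.7/32"),
]

def is_blocked(addr: str) -> bool:
    """Return True if the address falls inside any blocked network."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED)

print(is_blocked("203.0.113.42"))   # → True (inside the /24 range)
print(is_blocked("192.0.2.1"))      # → False (not on the list)
```

A real deployment would do this at the web server or firewall rather than in application code, but the lookup logic is the same.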
You can control bot traffic generated by honest crawlers (bots that identify themselves, like Googlebot for Google) by specifying in your robots.txt which ones you allow, and by limiting their access.
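As a sketch, a robots.txt like the following allows Googlebot everywhere while disallowing all other crawlers (a deliberately blunt policy; real rule sets are usually more granular, and robots.txt only restrains bots that choose to honour it):

```
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
```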
Monitor your traffic 24/7 and identify price scraping
Keeping track of all the traffic on your site can be time-consuming, especially if the site generates a lot of traffic. You can hire commercial anti-scraping services to monitor your traffic and identify which traffic is genuine. ScrapeSentry offers automated defence against scraping, as well as around-the-clock monitoring.
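The kind of detection such services automate can be sketched crudely: flag any client that exceeds a request-rate threshold within a sliding time window. The threshold and window size below are arbitrary illustrations, not recommended values:

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # sliding window length (illustrative)
MAX_REQUESTS = 5      # requests allowed per window (illustrative)

class RateMonitor:
    """Flags clients whose request rate exceeds the threshold."""
    def __init__(self):
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def record(self, ip: str, now: float) -> bool:
        """Record a request; return True if this client looks like a scraper."""
        q = self._hits[ip]
        q.append(now)
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] > WINDOW_SECONDS:
            q.popleft()
        return len(q) > MAX_REQUESTS

monitor = RateMonitor()
# A polite visitor: one request every three seconds.
human = [monitor.record("192.0.2.1", t) for t in range(0, 18, 3)]
# A scraper: six requests within a single second.
bot = [monitor.record("203.0.113.9", 0.1 * i) for i in range(6)]
print(any(human), bot[-1])   # → False True
```

Real detection systems weigh many more signals (user-agent strings, navigation patterns, failed CAPTCHA attempts), but rate anomalies are a common starting point.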