In the last four years I have done a lot of screen scraping. There are many things I enjoy about it, but mostly it is the constant challenge. Each website, and in some cases each web page, presents hurdles that I have never cleared before. Most of my time is spent in the present, hacking, rather than in the past, remembering how I got past the same problem.
When you do it enough, you eventually start sparring with the technical team that runs the website you are scraping. It is like a chess match: they try to prevent me from scraping their website, and I hack my way through their attempts.
To date, no website has been able to stop me. If you are wondering, Google is the most difficult. They do some sick things with JavaScript.
One day I thought, I wonder if I could stop myself. I didn’t just want to make it more difficult for myself; I wanted to make it impossible. I have a website called Furnished.com, and I wanted to see whether I could prevent myself from screen scraping the furnished rental listings on it.
After many months of research, I found a way to make it impossible to screen scrape not just my Furnished.com website, but any website. I leveraged the hierarchical nature of HTML, combined with its vast array of tags and attributes, to obfuscate the HTML on the page so that the content shown to the visitor is identical, but a program can never consistently grab the same thing from the markup.
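To make the idea concrete, here is a minimal sketch in Python. It is not the code behind my product, just an illustration of the general principle under some assumptions: the tag list, the function names (`obfuscate`, `visible_text`) and the random class names are all made up for this example, and the tags are assumed to render without any styling of their own. Each word is wrapped in a random chain of tags with throwaway class names, so the markup differs on every render while the text a visitor reads stays the same.

```python
import random
import string
from html import escape
from html.parser import HTMLParser

# Assumption: inline tags that carry no default styling when used bare.
NEUTRAL_TAGS = ["span", "font", "label"]


def random_name(rng, length=8):
    """Return a throwaway class name made of random lowercase letters."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def obfuscate(text, rng, max_depth=3):
    """Wrap each word in a random chain of tags with random class names."""
    pieces = []
    for word in text.split(" "):
        markup = escape(word)
        for _ in range(rng.randint(1, max_depth)):
            tag = rng.choice(NEUTRAL_TAGS)
            markup = f'<{tag} class="{random_name(rng)}">{markup}</{tag}>'
        pieces.append(markup)
    return " ".join(pieces)


class TextExtractor(HTMLParser):
    """Collect only the text nodes, i.e. what a visitor would read."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(markup):
    """Strip all tags from the markup and return the remaining text."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


listing = "Furnished 2BR apartment, 1800 per month"
first = obfuscate(listing, random.Random(1))
second = obfuscate(listing, random.Random(2))
```

Here `first` and `second` stand in for two renders of the same listing. The tag paths and class names a scraper would normally key on differ between them, while `visible_text` returns the original string for both.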
I then challenged a few colleagues who have extensive screen-scraping experience, and they were unable to scrape a number of web pages protected by my latest creation.
It was at that point that I realized I had something that could probably help a lot of websites out there. Over the next couple of months I refactored the code, tested for edge cases, played with a few different configurations, and added a few more features. Taking a lesson from anti-virus software, I built in not only IP address blacklists but also pattern matching, similar to virus definitions.
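A rough sketch of how those two checks can fit together, again in Python and again only an illustration rather than my production code. The blocked network is a reserved documentation range, and the signature patterns and the `is_blocked` function are hypothetical names chosen for this example.

```python
import ipaddress
import re

# Assumption: example blacklist using a reserved documentation range.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]

# Assumption: example "definitions" matched against the User-Agent header,
# in the same spirit as anti-virus signatures.
SIGNATURES = [
    re.compile(r"python-requests", re.IGNORECASE),
    re.compile(r"scrapy", re.IGNORECASE),
    re.compile(r"curl/", re.IGNORECASE),
]


def is_blocked(ip, user_agent):
    """Return True if the request matches the blacklist or a signature."""
    address = ipaddress.ip_address(ip)
    if any(address in network for network in BLOCKED_NETWORKS):
        return True
    return any(signature.search(user_agent) for signature in SIGNATURES)
```

With these example lists, `is_blocked("203.0.113.7", "Mozilla/5.0")` is caught by the blacklist and `is_blocked("198.51.100.1", "python-requests/2.31")` by a signature. Like virus definitions, the signature list only helps if it is kept up to date.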
Right now I am mainly selling this product to small and medium-sized online retailers with a lot of SKUs (that is, a large number of products sold online). They are the easiest customers to get in contact with, and most of them have tried within the last year to find a product like mine, only to discover that none existed.
As for the future of this product, I will likely move into preventing more advanced bot activity such as programmatic posting of data.
To read more about it, check out the website Unscrapable.com!