The first major hurdle after building a pretty site was populating the initial WooCommerce database with all of the existing items from the client’s Etsy Shop. Not only were there roughly 1,100 items, but each item sported at least six variations (multiple colours and styles), as well as a sizeable number of related ‘addon accessories’ specific to each listing. This was going to be a challenge.
Once I realised Etsy has a very solid API, my initial panic settled somewhat, but it quickly became apparent that only a paid tier of API access would provide the kind of detail required to export the complex taxonomies and variation data I needed. The client was not prepared to pay for that on a once-off or ongoing basis (and ongoing Etsy -> Website synchronisation would be needed in the future).
After attempting some pretty hacky CSV => SQL data conversions via custom bash scripts, which were hit-and-miss at best, I delved into the dirty, dirty world of scraping via Node, NightmareJS and the gorgeous Cheerio. I realise these tools are now (and were then, tbh) ‘old hat’ in the world of web-scraping, but that is precisely why I selected them.
I had done some Python scraping for a handful of previous projects with Beautiful Soup but I was not about to try and shoe-horn a Python data collection node into WordPress for a one-off project with a serious budget and I was a relative noob to JS-based scraping frameworks. Accordingly, I elected to go with ‘older’ scraping frameworks on the (correct) assumption that the documentation and communities behind them would help me out if (and by that I mean when) I got stuck.
When all was said and done, I ended up using a headless NightmareJS script to extract the Etsy Shop listings and associated data and Cheerio consumed and filtered that rather raw diet, extracted what we needed, then spat the result into a SQL database on the back-end. From there, WordPress and Woo were, with a few custom functions, comfortable doing what they do best – making a pretty Website Shop for the client.
With ongoing access to Etsy’s API out of the question, synchronisation of items/listings could only be one-way: Etsy -> Website. This was achieved by coding up a WordPress plugin which heavily modified WP’s cron functionality to periodically engage the NightmareJS script. This proved to be surprisingly reliable, believe it or not. That plugin may or may not be released in due course… I am still fuzzy on some of the bona fide enforceable ‘legals’ when it comes to certain aspects of scraping (as opposed to Terms of Service breaches, which are very, very clear) and am not prepared to put in the work only to be shot down by WP Plugin admins.
The Website is no longer in service as the client elected to stick solely with Etsy listings; however, a snapshot of its state can still be perused here if anyone is interested.
This project forced me to learn far more than I ever wanted to know about scraping as a source of consumable data, which isn’t necessarily a bad thing. Learning never is. And… if nothing else, I will never, ever complain again about how ‘janky’ an API is… ever.
Bits of the NightmareJS script used in this project are the subject of another post here if anyone is interested.