Recently, I tried to write a script to parse the output of a log file created by Cloudera Manager. Unfortunately, I had to abandon the script after a week and quite a number of hours invested. Parsing text is hard. It is not that CM created bad log entries or anything like that. It is just that what you expect to find in a log might not be there to support the decisions you want to make. That was the case for me: the answer I wanted couldn't be derived easily from the logs.
Fast forward a few days. My buddy calls me and asks me about scraping a website for product info. My first question was, "Is this the only option?" If we could engage with the website owner, we could get them to give us access to the products and related tables, especially since the project was contracted by the owner of the website.
He said to consider it our only option: getting access to the actual data would mean talking to different teams, and those teams haven't been responsive so far. I laid out for him why getting data from the primary source is preferable to trying to parse web pages.
When people look at a web page, they see structure and something that is easy to digest, but they are not seeing what the code looks like. A web page is coded so a browser can render it for consumption by a human, not by a machine or another program. When developers want data to be consumed by a machine or another program, it is much more efficient to use binary or some other structured format, even if you also want it to be human readable. Even JSON, XML, or CSV is not meant to be a presentation format for humans; it just happens to be more open.
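To make the contrast concrete, here is a minimal sketch using only the Python standard library. The product data, class names, and markup are all hypothetical; the point is that scraping means walking tag soup and hoping the markup never changes, while a structured payload parses in one line:

```python
import json
from html.parser import HTMLParser

# The same (hypothetical) product data in two forms: the HTML a browser
# would render, and a JSON payload meant for machine consumption.
html_page = """
<div class="product">
  <span class="name">Widget</span>
  <span class="price">$9.99</span>
</div>
"""
json_payload = '{"products": [{"name": "Widget", "price": 9.99}]}'

# Scraping: track which tag we are inside and capture its text,
# relying entirely on class names that the site could change tomorrow.
class ProductScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current = None
        self.data = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in ("name", "price"):
            self.current = cls

    def handle_data(self, data):
        if self.current and data.strip():
            self.data[self.current] = data.strip()
            self.current = None

scraper = ProductScraper()
scraper.feed(html_page)
print(scraper.data)   # {'name': 'Widget', 'price': '$9.99'} -- price is a string

# Structured data: one call, typed values, no guessing.
products = json.loads(json_payload)["products"]
print(products[0])    # {'name': 'Widget', 'price': 9.99} -- price is a number
```

Notice that even in this toy case the scraped price comes back as the display string `"$9.99"`, while the JSON gives a real number. Multiply that cleanup work across every field on every page and you get a sense of the maintenance burden.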
What he was asking about wasn't even some text interchange format like JSON or XML, but parsing a number of pages for product details. That would be such a nightmare of a project. So here's hoping we really don't have to do that.