I mentioned recently how a friend proposed that we do some data import by scraping a website. I tried to explain that this was very difficult, time consuming, and error prone. Well, I didn’t do such a good job of getting the point across. So let me try again, here. Hopefully, this will be understandable by both techies and non-techies.
The best way, I explained, was to import the data directly from one computer’s storage to another’s. People in computer technology would know this as a data interchange or exchange format. If data is produced by one computer for consumption by another computer, it is best that the data be in a binary or otherwise encoded form. There really isn’t any reason to move data from one computer to another in a human-readable format, unless we want to ensure that we can still use it at some future date.
So, writing a program to scrape data off a website: sure, that is doable, but it would be a lot easier to just have the data exported, even if we go for some intermediate format like CSV, JSON, or XML.
As an example, here are three ways of passing the same information between computers:
Example 1: HTML
<html>
<head>
</head>
<body>
<div>
<person>
<name><b>John Smith</b></name>
<age><i>33</i></age>
<ssn><i>011-11-1234</i></ssn>
<dob><i>1982-03-04</i></dob>
</person>
</div>
</body>
</html>
All we really wanted to send was John’s information. Notice all the extra stuff about how it is presented: the name bolded and the other bits italicized. And quite a bit that would usually be there has already been stripped out. So if you had to get this data from one computer to another, why put in all that extra?
Now here is example two, slightly better, using CSV and JSON:
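To make the cost concrete, here is a minimal sketch of what scraping the page in Example 1 might look like in Python, using only the standard library's html.parser module. The names PAGE, FIELDS, PersonScraper and scrape_person are made up for this illustration, and the sketch assumes the page looks exactly like the example; a real site would have far more markup to wade through.

```python
from html.parser import HTMLParser

# The page from Example 1, copied as-is.
PAGE = """<html>
<head>
</head>
<body>
<div>
<person>
<name><b>John Smith</b></name>
<age><i>33</i></age>
<ssn><i>011-11-1234</i></ssn>
<dob><i>1982-03-04</i></dob>
</person>
</div>
</body>
</html>"""

# Hypothetical: the tags we happen to know carry the data we want.
FIELDS = {"name", "age", "ssn", "dob"}


class PersonScraper(HTMLParser):
    """Collects the text found inside the tags listed in FIELDS."""

    def __init__(self):
        super().__init__()
        self.current = None  # data tag we are currently inside, if any
        self.record = {}

    def handle_starttag(self, tag, attrs):
        if tag in FIELDS:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        # Presentation tags such as <b> and <i> are ignored; only their text is kept.
        text = data.strip()
        if self.current and text:
            self.record[self.current] = self.record.get(self.current, "") + text


def scrape_person(html):
    scraper = PersonScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.record


print(scrape_person(PAGE))
```

Note that every value comes back as text, so the age still has to be converted to a number, and the whole thing breaks as soon as someone changes the layout of the page.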
Example 2: CSV
name,age,ssn,dob
John Smith,33,011-11-1234,1982-03-04
Example 2: JSON
{"name":"John Smith", "age":33, "ssn":"011-11-1234", "dob":"1982-03-04"}
NOTE: Even if the JSON were spread out over several lines, it would still be more compact than the HTML.
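For comparison, here is a rough sketch of reading both Example 2 formats in Python with the standard csv and json modules. The names CSV_TEXT, JSON_TEXT, read_csv_person and read_json_person are made up for this illustration, and the sketch assumes the data holds exactly one person.

```python
import csv
import io
import json

# The two Example 2 payloads, written out as Python strings.
CSV_TEXT = "name,age,ssn,dob\nJohn Smith,33,011-11-1234,1982-03-04\n"
JSON_TEXT = '{"name":"John Smith", "age":33, "ssn":"011-11-1234", "dob":"1982-03-04"}'


def read_csv_person(text):
    # DictReader uses the header row as keys; every value is a string.
    return next(csv.DictReader(io.StringIO(text)))


def read_json_person(text):
    # JSON keeps basic types, so the age comes back as an integer.
    return json.loads(text)


print(read_csv_person(CSV_TEXT))
print(read_json_person(JSON_TEXT))
```

There is no presentation markup to skip over in either case; the parser hands back just the fields that were sent.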
Finally, here is the same data, encoded in “some” binary form. I won’t go into the details of it, but let’s assume that both the sending and receiving computers know they are exchanging data about a person.
Example 3: Binary
0a4a6f686e20536d69746801210b3031312d31312d313233340a313938322d30332d3034
It is basically [length:data] repeated. For example, the number 33 takes up 1 byte and is ‘21’ in hexadecimal, so you see 0121 somewhere in that run of characters. The name “John Smith” is 10 characters long; turning each character into a byte and then into its hex value, you get ‘0a’ followed by ‘4a6f686e20536d697468’.
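The [length:data] layout described above can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names encode_person and decode_person are made up, each length is a single byte (so no field can be longer than 255 bytes), the text fields are plain ASCII, the age fits in one byte, and both sides agree that the fields arrive in the order name, age, ssn, dob.

```python
def encode_person(name, age, ssn, dob):
    """Pack the four fields as [length][data] pairs, one length byte each."""
    fields = (
        name.encode("ascii"),
        bytes([age]),          # assumes 0 <= age <= 255, stored as one byte
        ssn.encode("ascii"),
        dob.encode("ascii"),
    )
    out = bytearray()
    for field in fields:
        if len(field) > 255:
            raise ValueError("field too long for a one-byte length")
        out.append(len(field))  # the length prefix
        out += field            # the data itself
    return bytes(out)


def decode_person(blob):
    """Walk the bytes, reading a length and then that many bytes of data."""
    fields = []
    pos = 0
    while pos < len(blob):
        length = blob[pos]
        pos += 1
        fields.append(blob[pos:pos + length])
        pos += length
    name, age, ssn, dob = fields  # order agreed on by both computers
    return {
        "name": name.decode("ascii"),
        "age": age[0],
        "ssn": ssn.decode("ascii"),
        "dob": dob.decode("ascii"),
    }


blob = encode_person("John Smith", 33, "011-11-1234", "1982-03-04")
print(blob.hex())
print(decode_person(blob))
```

Encoding John’s record this way reproduces the hex string shown in Example 3, and decoding it is a simple loop with no searching or guessing involved.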
This is certainly harder for a human to read. But it wasn’t meant for a human, not yet anyway. We wanted the most efficient way of passing some data between computers to be processed by computers. And since the binary form only has [length:data] repeated (without the ‘:’, by the way), we only send what we need and save a ton of space and time, not to mention there is less opportunity for error in trying to parse the data on the other end.
And that is why it is so clear to me that we should be doing things using either Example 2 or Example 3, but not Example 1.