Understanding the value of Data Lake(s) I

Tech loves buzzwords. Data lake is a big thing with Big Data. Before we go crazy technical, try to simplify the term. What comes to mind when you think of a data lake? First, it’s two words: data and lake.

Data: You remember this one from primary school. Unprocessed or raw information, facts and figures, all in their various formats. .MP3, .DOC, .CSV, .JPG etc.

Lake: A large body of water, simply put. I think of my weekend camping trip in Lake George, but in this context, a lake will be a storage for data, and not water.

Now think of yourself. What do you need data for? To get work done. For example, I need to write a research paper. I generate some original ideas, and source for some more. I need to document and compile my findings. I research, type up, save what I want, and store of the content in some kind of database, to use the term loosely, for future use or reference. For now I’ll just save in folders on my laptop or my Google Drive (Box sometimes, my Dropbox is full). Whenever I need it, all I need to do is search (query) and retrieve.

Right! Data, lake, base but I’m really thinking about a hard drive. When do I need one or the other?

Nuances. A database is really a collection or repository of information that’s organized so that it can easily be accessed, managed and updated. [1] My hard drive is a storage medium. It is my local data repository, my local database on my laptop.

Now picture several people having similar needs as you, not limited to students across the globe writing research papers, or people working in an organization, all constantly needing data to get all sorts of work done, you’ll start to see why a hard drive is no longer enough. Now a client-server model is important. Entire computers need to be dedicated now, to provide database services using computer programs to other computer programs, using a client-server model. That is what a database server does.

My hard drive or disk is clearly not a server. However servers are made up of hard disks. Local storage media include disk, tape – the magnetic ones such as floppy and cassettes, or memory. Data is stored onto the hard disk in form of files, a collection of bytes.

Having cleared these nuances, or at least attempted to, let’s crank it up a notch and think enterprise. By that, I mean scale, which is where a Database Management System (DBMS) comes in. Of course, the database and the DBMS are two separate things. The DBMS is a system software. It creates and manages the database, and interfaces between the database and the end user(s) or an app. Fundamentally, it manages the data, the database engine, and the database schema, and users can Create, Read, Update, Delete (CRUD) data in a database.

In the enterprise, more users are doing CRUD frequently and need a single version of the truth. In that case, the data (base) needs to be ACID (atomic, consistent, isolated, durable). So you see how the news of the Swift hack was a big deal.

That helps. Now that I know better, I can ask better questions, like what is the difference between a database and a data lake?

I think this is enough for one post. Now we got the introduction, let’s look at that in a sequel. (How does Wikipedia do it?)



  1. Database, TechTarget.com http://searchsqlserver.techtarget.com/definition/database