I blogged about attending Grace Hopper last year and some of the great things that followed as a result. With exactly a week to go, GHC 16 is upon us and I’m looking forward to being there, not only as an attendee but also as a volunteer blogger! Let’s connect in Houston, TX.
LinkedIn Analytics: Another Person You May Know
This past Labor Day, my college friend was in town for a month and had invited me to hang out with friends over grilled food and games. It was a good time. I got to meet new people and enjoy their company, and all of us – guests and the host couple – ended up having a blast. A couple of days later, I logged into LinkedIn (via web), and a picture of the host popped up in the bottom right corner of the webpage as Another Person You May Know.
I was curious. Certainly, there’s a lot to keep up with: LinkedIn’s algorithms have changed a ton over time. When I joined, they categorized and recommended potential contacts based on your existing network and your work and school history. That was not the case here, though. The host and I had one mutual friend, someone we had both been on- and offline friends with for years, and I wanted to figure out what suddenly roused this latent connection into a social recommendation. So I started to mentally retrace my steps:
- My friend Peter mentions he has a friend he wants me to meet, Joyce. While he’s talking about the impressive work Joyce has been involved with, I google her name on my phone’s browser. Cookies, browsing-history tracking.
- A couple of hours later, Joyce and her husband show up to pick Peter up from my place, and right there and then, all four of us are at the same location. Our phones, each with an IP address, are oozing geo-spatial data. The three go off to spend some time together, and later return to drop Peter off at mine. Again, another geo-spatial count.
- This time around, we talk for a bit, and Joyce and I exchange numbers before they leave. Phone numbers/contacts. By the way, I don’t have my contacts synced to my LinkedIn account. Maybe she does; I didn’t ask. A couple of days before the party, we trade texts and a phone call.
- Peter posts a picture on Instagram, tags me in it and Joyce leaves a comment. I check her profile, ‘like’ some of her pictures and follow her. She does the same. Social media data.
- Finally it’s party time! I visit her home, sharing geo-spatial data again. And a couple days later, she’s in my LinkedIn feed.
Without knowing whether, or how heavily, LinkedIn draws on a 360-degree view of its users, my best bet is to solve by elimination. I check the default privacy settings on their site, and the glaring culprit appears to be the phone number exchange. I reckon that if it were geo-spatial data, Joyce’s husband’s profile (that’s if he has one) would also pop up in my feed, since I met both her and her husband at the same time and place. However, I only exchanged numbers with her. The same goes for the interaction on Instagram. Plus, the next time we would all regroup at the party, there were more people – none of whom have popped up in my feed. At least not yet. There was no number swapping (or Instagram following) there, but then again, what if they do not have LinkedIn accounts?
It gets more interesting to break this up into chunks and identify the underlying technologies that enable this interaction.
Understanding the value of Data Lake(s) I
Tech loves buzzwords. The data lake is a big thing in Big Data. Before we go crazy technical, let’s try to simplify the term. What comes to mind when you think of a data lake? First, it’s two words: data and lake.
Data: You remember this one from primary school. Unprocessed or raw information, facts and figures, all in their various formats: .MP3, .DOC, .CSV, .JPG, etc.
Lake: A large body of water, simply put. I think of my weekend camping trip at Lake George, but in this context, a lake will be storage for data, not water.
Now think of yourself. What do you need data for? To get work done. For example, I need to write a research paper. I generate some original ideas and source some more. I need to document and compile my findings. I research, type up, save what I want, and store the content in some kind of database, to use the term loosely, for future use or reference. For now, I’ll just save to folders on my laptop or my Google Drive (Box sometimes; my Dropbox is full). Whenever I need it, all I have to do is search (query) and retrieve.
Right! Data, lake, base. But I’m really thinking about a hard drive. When do I need one or the other?
Nuances. A database is really a collection or repository of information that’s organized so that it can easily be accessed, managed and updated.  My hard drive is a storage medium. It is my local data repository, my local database on my laptop.
Now picture many people with needs similar to yours: students across the globe writing research papers, or people working across an organization, all constantly needing data to get all sorts of work done. You’ll start to see why a hard drive is no longer enough. This is where the client-server model becomes important: entire computers are now dedicated to providing database services, through computer programs, to other computer programs. That is what a database server does.
My hard drive or disk is clearly not a server. However, servers are made up of hard disks. Local storage media include disk (hard and floppy), magnetic tape such as cassettes, and memory. Data is stored on the hard disk in the form of files, each a collection of bytes.
Having cleared up these nuances, or at least attempted to, let’s crank it up a notch and think enterprise. By that, I mean scale, which is where a Database Management System (DBMS) comes in. Of course, the database and the DBMS are two separate things. The DBMS is system software: it creates and manages the database, and interfaces between the database and the end user(s) or an app. Fundamentally, it manages the data, the database engine, and the database schema, and through it users can Create, Read, Update, and Delete (CRUD) data in a database.
In the enterprise, more users are doing CRUD more frequently and need a single version of the truth. In that case, transactions against the database need to be ACID (atomic, consistent, isolated, durable). So you see how news of the SWIFT hack was a big deal.
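The CRUD operations and the atomic all-or-nothing behavior above can be sketched with Python’s built-in sqlite3 module standing in for a DBMS; the table and titles below are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a throwaway in-memory database
cur = conn.cursor()

# Create
cur.execute("CREATE TABLE papers (id INTEGER PRIMARY KEY, title TEXT)")
cur.execute("INSERT INTO papers (title) VALUES (?)", ("Data Lakes 101",))

# Read
title = cur.execute("SELECT title FROM papers WHERE id = 1").fetchone()[0]

# Update
cur.execute("UPDATE papers SET title = ? WHERE id = 1", ("Data Lakes 102",))

# Delete
cur.execute("DELETE FROM papers WHERE id = 1")
conn.commit()

# Atomicity: either every statement in the transaction lands, or none do.
try:
    with conn:  # commits on success, rolls back on an exception
        conn.execute("INSERT INTO papers (title) VALUES ('kept?')")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

# The failed insert was rolled back, so the table is still empty.
count = conn.execute("SELECT COUNT(*) FROM papers").fetchone()[0]
```

SQLite is a library rather than a client-server DBMS, but the CRUD statements and the rollback-on-failure guarantee are the same ideas at enterprise scale.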
That helps. Now that I know better, I can ask better questions, like what is the difference between a database and a data lake?
I think this is enough for one post. Now that we’ve got the introduction down, let’s look at that in a sequel. (How does Wikipedia do it?)
- Database, TechTarget.com http://searchsqlserver.techtarget.com/definition/database
Rethinking commodity hardware for Hadoop
Traditionally, the idea of deploying Hadoop on commodity hardware was genius. With this option of low-cost infrastructure, Hadoop was in fact designed to be the RAID of compute farms: made up of homogeneous, generally cheap, and easily replaceable servers. This model does work, but when big data starts to really scale, and I mean really scale, the terms ‘commodity’ and ‘cheap’ go from hand-in-hand to tongue-in-cheek. In other words, it would be an oversight, and in poor taste, to build long-term plans around those two.
Get to your point already?
Infrastructure solutions come as servers with internal storage. Stripped down to the basics, they perform two functions: compute and storage. The issue with this is the very reason behind the term ‘Big Data’: the data will grow. As this growth occurs (the technical term is scale), more and more storage is needed to house the data. You could store this data on servers, of course, but if your need is really just storage, then you probably don’t need the compute resources. And that’s where the challenge begins. Because traditional commodity servers have their storage and compute resources joined at the hip, this is a quick way to under-utilize resources, which is not cost effective.
So hey, why don’t you use external storage?
Well, great question. That’s definitely an option, but there’s a whole “movement” about moving the analytics to the data, which we could pick apart. Keeping the storage local means faster jobs. The moment you move data across a network to external storage, you become susceptible to the issues that accompany network and storage bandwidth – the good, the bad, and the real ugly (e.g., complexity, latency, loss of features, loss of data governance, more infrastructure). So ideally, you want your data to be local, on local disks, which is also where your programs run.
Having said that, at the enterprise level, data resides on SANs – Storage Area Networks, and must be moved to compute nodes for processing. To optimize infrastructure and reduce data bottlenecks, the nodes should be on the same cluster.
So what else is there to Hadoop and commodity hardware?
High availability, simply a mechanism to avoid a single point of failure (SPOF) in the overall system, is essential to Hadoop. To build in that redundancy, the recommendation is to keep more than one copy of the data; with Hadoop, specifically three copies. And just as you see the good intentions, you also see the implications of scale. Replicating the data three times means replicating local disks, and scaling servers and disks together. Again, the demon of under-utilization rears its head, and the physical footprint of the datacenter multiplies. What a mess!
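For reference, the three-copy recommendation above is a cluster-wide setting in a vanilla Apache Hadoop install, controlled by the dfs.replication property in hdfs-site.xml (3 is the default):

```xml
<!-- hdfs-site.xml: how many copies HDFS keeps of each data block -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

Lowering it trades durability for disk; raising it multiplies the footprint problem described above.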
One popular use of Hadoop is as a data warehouse. If you’re considering doing this, i.e., modernizing your existing/traditional data warehouse solution, also be mindful of the impact of scaling on your datacenter and the associated costs, which are not limited to:
- Personnel: administrators for deployment and day-to-day operations
- Network: bottlenecks, bandwidth and (re)solutions
- Workloads: nature of jobs and needs (streaming data, or data at rest)
- Software licensing (operating systems, applications) per node/cluster
Sudo: Great Power, Great Responsibility
This post should really be called How to modify the /etc/hosts file on a Mac but how do you resist an opportunity to hail Sudo?
While installing a Hadoop distribution on my sandbox, one of the requisite steps for configuring the network was to edit the /etc/hosts file to include the IP address, fully qualified domain name, and short name of the hosts in the cluster.
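Each host gets one line in that file, in the order IP address, fully qualified domain name, short name; the addresses and names below are made-up placeholders:

```
192.168.56.101  node1.example.com  node1
192.168.56.102  node2.example.com  node2
```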
From the terminal, I logged on to the remote server via ssh and ran the following commands:
cat /etc/hosts
vi /etc/hosts
The first displayed the contents of the file; the second gave a warning about changing a read-only file, and did not quite take the edits I was trying to make. Not until I used sudo:
sudo vi /etc/hosts
This got me through, and I completed the necessary edits. An additional tip for Mac users: hit Esc before entering the save-and-exit command (:wq) to exit the editor. On that note:
Business for Tech, Or Technology for Business?
Every day millions of dollars are wasted in companies because non-tech people and tech people either don’t communicate at all or completely miss each other’s points.
That’s where people like us come in! The Hybrids: the ones who bridge tech and business, and everything in between (traveling, writing, building bridges, speaking – you name it).