Rethinking commodity hardware for Hadoop

Traditionally, the idea of deploying Hadoop on commodity hardware was genius. With this option of low-cost infrastructure, Hadoop was in fact designed to be the RAID of compute farms – made up of homogeneous, generally cheap and easily replaceable servers. This model does work, but when big data starts to really scale, and I mean really scale, the terms ‘commodity’ and ‘cheap’ go from hand-in-hand to tongue-in-cheek. In other words, it would be an oversight to build long-term plans around those two.

Get to your point already?

Infrastructure solutions come as servers with internal storage. Stripped down to the basics, they perform two functions: compute and storage. The issue with this is the very reason behind the term ‘Big Data’: the data will grow. As this growth occurs (the technical term is scale), more and more storage is needed to house the data. You could keep buying servers of course, but if your need is really just storage, then you probably don’t need the extra compute resources. And that’s where the challenge begins. Because traditional commodity servers have their storage and compute resources joined at the hip, scaling one to satisfy the other is a quick way to under-utilize resources, which is not cost effective.

So hey, why don’t you use external storage?

Well, great question. That’s definitely an option, but there’s a whole “movement” about moving the analytics to the data, which we could pick apart. Keeping the storage local means faster jobs. The moment you move data across a network to external storage, you become susceptible to the issues that accompany network and storage bandwidth – the good, the bad, and the potentially real ugly (e.g. complexity, latency, loss of features, loss of data governance, more infrastructure). So ideally, you want your data to be local – on local disks, which is also where your programs run.

Having said that, at the enterprise level, data often resides on SANs (Storage Area Networks) and must be moved to compute nodes for processing. To optimize the infrastructure and reduce data bottlenecks, those compute nodes should be part of the same cluster, as close to the data as possible.

So what else is there to Hadoop and commodity hardware?

High availability is essential to Hadoop; it is simply a mechanism to avoid having a single point of failure (SPOF) in the overall system. Thus, to incorporate redundancy, the recommendation is to keep more than one copy of the data – and with Hadoop, specifically three copies. And just as you see the good intentions, you also see the implications of scale. This single move of replicating the data three times means multiplying local disks, as well as scaling server and disk together. Again, the demon of under-utilization rears its head, causing the physical footprint of the datacenter to multiply. What a mess!
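For reference, that three-copy behavior is HDFS’s replication factor, which you can check or change from the command line. A minimal sketch – the /data/warehouse path below is just a placeholder:

hdfs getconf -confKey dfs.replication

hdfs dfs -setrep -w 3 /data/warehouse

The first command reports the configured default replication factor (dfs.replication, which defaults to 3); the second sets a replication factor of 3 on an existing path and waits for the re-replication to complete.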

One popular use of Hadoop is as a data warehouse. If you’re considering doing this, i.e. modernizing your existing/traditional data warehouse solution, also be mindful of the impact of scaling on your datacenter and the associated costs, which include but are not limited to:

  • Personnel: administrators for deployment and day-to-day operations
  • Network: bottlenecks, bandwidth and (re)solutions
  • Workloads: nature of jobs and needs – data streaming or data at rest
  • Software licensing (operating systems, applications) per node/cluster

 

:wq

Sudo: Great Power, Great Responsibility

This post should really be called How to modify the /etc/hosts file on a Mac, but how do you resist an opportunity to hail Sudo?

While installing a Hadoop distribution on my sandbox, I hit one of the requisite steps for configuring the network: editing the /etc/hosts file to include the IP address, fully qualified domain name, and short name of each host in the cluster.
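The entries follow the usual one-host-per-line format; the addresses and names below are made up for illustration:

192.168.56.101   node1.cluster.local   node1
192.168.56.102   node2.cluster.local   node2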

From the terminal, I logged on to the remote server via ssh and ran the following commands:

cat /etc/hosts

vi /etc/hosts

The first displayed the contents of the file; the second gave a warning about changing a read-only file and would not take the edits I was trying to make. Not until I used sudo:

sudo vi /etc/hosts
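Inside the editor, it’s the usual vi routine – make the edits in insert mode, then write and quit. Roughly:

i          (enter insert mode and make the changes)
Esc        (return to command mode)
:wq        (write the file and quit)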

Running vi with sudo got me over the hump, and I completed the necessary edits. Additional tip for Mac users: hit Esc to leave insert mode before entering the save-and-exit command. On that note:

:wq