Hadoop At Scale – Challenges Drive Innovation GHC16

Dr. Elephant, popular for Hadoop diagnostics

LinkedIn is a data-driven company that relies heavily on Big Data processing and storage systems like Hadoop to drive critical decisions and features for its business applications. In this session, software engineer Vinitha Gankidi shares LinkedIn's experiences with scalability and performance challenges, and the solutions to them, when allocating and monitoring resources and running workflows on Hadoop. The results affected the following areas:

  • Developer productivity: Production reviews at the company once posed a series of challenges, including cluster inefficiency and reduced productivity, while consuming valuable resources and time. Teams at LinkedIn developed a solution that cut review timelines from an average of six weeks to one day. That solution was Dr. Elephant (now open sourced on GitHub), a performance tuning tool that helps users of Hadoop and Spark understand, analyze, and improve the performance of their flows by providing job-level suggestions rather than cluster-level statistics[1].
  • Capacity planning and resource management: Over the years, LinkedIn has dealt with unprecedented growth of its user base and, in turn, of its resource usage. These growth spikes were usually accompanied by issues that often went unnoticed until workflows degraded or the clusters underperformed due to over-utilization. To take proactive steps in selecting new hardware and anticipating capital expenditure, LinkedIn teams built analytics pipelines on PrestoDB (an open source distributed SQL query engine for running interactive analytic queries against data sources), with metrics dashboards that use tools such as Apache Hive and Avro.
  • Cost-effective scale testing: A software performance regression is a situation where the software functions correctly, but performs slowly or uses more memory when compared to previous versions[2]. Dynamometer is a tool developed at LinkedIn to address cluster degradation, slow-running jobs, and performance regressions. It simulates test clusters using 10 to 100x fewer hardware resources while replaying realistic production workloads and monitoring metrics. Be on the lookout: this tool will also be open sourced soon.
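
To make the idea of a performance regression concrete, here is a minimal Python sketch of how one might flag it when comparing two versions of the same workload. This is not how Dynamometer works; the `RunMetrics` class, the `find_regressions` function, the 10% tolerance, and the sample numbers are all hypothetical and only illustrate the definition above.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    """Hypothetical measurements for one run of a workload."""
    runtime_s: float
    peak_memory_mb: float


def find_regressions(baseline, candidate, tolerance=0.10):
    """Return the metric names where candidate exceeds baseline by more than tolerance."""
    regressions = []
    for name in ("runtime_s", "peak_memory_mb"):
        old = getattr(baseline, name)
        new = getattr(candidate, name)
        # A metric regresses only if it grew beyond the allowed margin.
        if new > old * (1 + tolerance):
            regressions.append(name)
    return regressions


# Made-up numbers: the new version is 25% slower but uses about the same memory.
baseline = RunMetrics(runtime_s=120.0, peak_memory_mb=2048.0)
candidate = RunMetrics(runtime_s=150.0, peak_memory_mb=2100.0)
print(find_regressions(baseline, candidate))  # ['runtime_s']
```

The tolerance is there because measurements are noisy; a real setup would likely compare several runs rather than a single pair.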

 

Citations:
  1. Open Sourcing Dr. Elephant: Self-Serve Performance Tuning for Hadoop and Spark (2016, April 8). Retrieved from https://engineering.linkedin.com/blog/2016/04/dr-elephant-open-source-self-serve-performance-tuning-hadoop-spark
  2. Software regression (2016, July 28). Retrieved from https://en.wikipedia.org/wiki/Software_regression
Disclaimer: GHC Speaker Vinitha Gankidi is a Software Engineer at LinkedIn. I am a community volunteer for Grace Hopper Celebration of Women In Computing. Opinions are my own.
