Hadoop At Scale – Challenges Drive Innovation GHC16

Dr. Elephant, popular for Hadoop diagnostics

LinkedIn is a data-driven company that relies heavily on Big Data processing and storage systems like Hadoop to drive critical decisions and features for its business applications. In this session, software engineer Vinitha Gankidi shares LinkedIn’s experiences with scalability and performance challenges, and their solutions, in allocating and monitoring resources and running workflows on Hadoop. Some of the results impacted the following areas:

  • Developer productivity: Production reviews at the company once posed a series of challenges, including cluster inefficiency and reduced productivity, while consuming valuable resources and time. Teams at LinkedIn developed a solution that cut review timelines from an average of six weeks to one day. The solution was Dr. Elephant (now open sourced on GitHub), a performance tuning tool that helps users of Hadoop and Spark understand, analyze, and improve the performance of their flows by providing job-level suggestions rather than cluster-level statistics[1].
  • Capacity Planning and Resource Management: Over the years, LinkedIn has dealt with unprecedented growth of its user base and, in turn, of its resource usage. These growth spikes were usually accompanied by issues that went unnoticed until workflows suffered or clusters under-performed due to over-utilization. To take proactive steps in selecting new hardware and anticipating capital expenditure, LinkedIn teams built analytics pipelines on PrestoDB (an open source distributed SQL query engine for running interactive analytic queries against data sources), with dashboards for metrics, using tools such as Apache Hive and Avro.
  • Cost-effective scale testing: A software performance regression is a situation where the software functions correctly but performs more slowly or uses more memory than previous versions[2]. Dynamometer is a tool developed at LinkedIn to address cluster degradation, slow-running jobs, and performance regressions. It simulates test clusters using 10 to 100x less hardware while replaying realistic production workloads and monitoring metrics. Be on the lookout: this tool will also be open sourced soon.
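
To make the definition of a performance regression concrete, here is a minimal sketch of how one might flag a regression by comparing job runtimes across two versions. It is not how Dynamometer or Dr. Elephant work; the function name, the 10% tolerance, and the sample runtimes are all assumptions chosen for illustration.

```python
from statistics import median


def is_performance_regression(baseline_secs, candidate_secs, tolerance=0.10):
    """Return True if the candidate's median runtime exceeds the
    baseline's median runtime by more than `tolerance` (a fraction)."""
    baseline = median(baseline_secs)
    candidate = median(candidate_secs)
    return candidate > baseline * (1 + tolerance)


# Hypothetical runtimes (in seconds) of the same job on two versions.
old_version = [120.0, 118.5, 121.2]
new_version = [139.8, 141.0, 138.4]

print(is_performance_regression(old_version, new_version))  # True
```

The median is used rather than the mean so that a single unusually slow run does not trigger a false alarm; a real setup would likely need more runs and a tolerance tuned to the workload.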

 

Citations:
  1. Open Sourcing Dr. Elephant: Self-Serve Performance Tuning for Hadoop and Spark (2016, April 8). Retrieved from https://engineering.linkedin.com/blog/2016/04/dr-elephant-open-source-self-serve-performance-tuning-hadoop-spark
  2. Software regression (2016, July 28). Retrieved from https://en.wikipedia.org/wiki/Software_regression
Disclaimer: GHC Speaker Vinitha Gankidi is a Software Engineer at LinkedIn. I am a community volunteer for Grace Hopper Celebration of Women In Computing. Opinions are my own.

When Data is Your Product GHC16

The role of (information) technology in business is to facilitate business outcomes. If this statement is true, then it resolves the chicken-or-egg dilemma of tech for business versus business for tech, while also making a case for “hybrids”, i.e. people in an organization who are able to combine business acumen with tech savvy. Hybrids are the real MVPs saving companies from waste, and analytics solutions are the tools that enable these superhero abilities. The title refers to viewing data systems as a product, and the goal of this session is to provide a guide to empowering business users to make better data-driven decisions.

Behind business goals are Key Performance Indicators (KPIs) dressed up as questions that are crucial for organizational success. Business users are desperate for data and tools that will provide these answers. They are curious to know, for example, how well their marketing campaigns are performing or how much time elapses between orders. Placing analytics solutions in the hands of product teams that need to be business smart and data savvy will not only maximize effectiveness but also transform how teams work, helping them build smarter products and make better decisions.
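
As an illustration of turning one of these questions into a number, here is a small sketch that computes the elapsed time between consecutive orders. It was not shown in the session; the function name and the order timestamps are made up for the example.

```python
from datetime import datetime, timedelta


def elapsed_between_orders(order_times):
    """Return the gaps between consecutive orders, oldest first."""
    ordered = sorted(order_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical order timestamps for a single customer.
orders = [
    datetime(2016, 10, 1, 9, 0),
    datetime(2016, 10, 3, 9, 0),
    datetime(2016, 10, 4, 21, 0),
]

gaps = elapsed_between_orders(orders)
average_gap = sum(gaps, timedelta()) / len(gaps)

print(average_gap)  # 1 day, 18:00:00
```

In practice a business user would get this from a dashboard or a SQL query rather than a script, but the calculation behind the KPI is the same.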

To empower business users with data, it’s important to start where the organization is. Below are some pointers to consider when planning a data product.

  • Communication and inclusion. See the business users as customers, and make them an active part of requirements, implementation and testing phases. Start the conversation between the tech and non-tech teams, with a clear vision. Seek to understand the problem and goals, and resist trying to dazzle with tech.
  • Consider the options. Having understood the goals, proceed to pick the tech or use what you have, bearing in mind that there are many possible paths to success. Make the most of your expertise to solve the problem, harnessing data models and structure. To avoid over- or under-engineering a solution, stay flexible and focused on solving problems without getting too attached to a single idea.
  • Be open-minded about choosing software tools. What matters most is getting the job done, and while there are many good answers, paying attention to needs and success criteria will help you find the best fit without losing sight of the goal.
  • Be rigorous in ensuring correctness and availability of your data and product. Apply engineering and operational rigor through SDLC processes (testing, validation, quality, anomalies) and SLAs (availability, stability, performance, scalability), respectively.

Bringing it all together, remember that business outcomes come first as you bridge business acumen and technology, and let the two work in tandem to get the job done.

Disclaimer: GHC speaker Denise McInerney is a Data Architect at Intuit. I am a community volunteer for Grace Hopper Celebration of Women In Computing. Opinions are my own.

Doing Good with Data – Human-Centered Data Science For Social Good GHC16

Measuring civic life is critical to improving it. This was the core message I got from an enlightening panel of five women in the field of data science, who introduced themselves by their favorite data sets and introduced Civic Data to the audience: what it is, who owns it, and how citizens can leverage it. At the intersection of analytics and social good, the conversation touched on real-life data sets and issues spanning transportation and safety in New York City, water consumption in cities like Flint, Michigan, historic droughts in California, human trafficking and labor exploitation, and racial and ethnic bias in patent filing and rates of arrest.

In developed countries today, we live in data-saturated environments. The deluge of open data comes with tons of benefits. It’s impactful when conversations are reoriented as a result of uncovering and sharing knowledge. For example, one of the panelists, Katy Dickinson (pictured below), has been part of an exciting project to improve public transport in Tunis, a North African city in Tunisia, where citizens have major pain points dealing with sporadic bus schedules and routes. Her team will be making a map of the city’s public transport and its schedules. But benefits rarely come without costs. Behind the numbers are people, and the people who provide their information must be considered and protected through a human-centered approach. The panel weighed the risks of handling personally identifiable information (PII), the consequences of integrating data, and differential privacy, with examples such as using tip line data to help human trafficking victims.
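
For readers new to differential privacy, here is a minimal sketch of one standard technique, the Laplace mechanism, applied to a simple count. It is not something the panel presented; the function name, the epsilon value, and the count of 42 are assumptions for illustration. A counting query changes by at most 1 when one person is added or removed, so adding Laplace noise with scale 1/epsilon protects any single individual’s presence in the data.

```python
import random


def noisy_count(true_count, epsilon, rng=random):
    """Return true_count plus Laplace noise with scale 1/epsilon.

    Assumes a counting query, whose sensitivity is 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = 1.0 / epsilon
    # The difference of two independent exponential draws with mean
    # `scale` follows a Laplace distribution centered at 0.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


# Hypothetical example: publish how many tips mention a given city.
rng = random.Random(0)  # seeded only to make the demo repeatable
print(noisy_count(42, epsilon=0.5, rng=rng))
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means a more accurate count and weaker privacy. Choosing it is a policy decision as much as a technical one.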

With panelist Katy Dickinson at GHC16

Overall this was a profound session for more reasons than one, and most importantly, it inspired and expanded my thinking. As mentioned before, there’s a deluge of open data in developed countries, along with access to tools for computing and storing this “big data”, so much so that tinkering is easily the norm, which is phenomenal. I would love to say the same for the rest of the world. Depending on where you stand, the reach of the digital age might seem widespread, but from a global viewpoint, we are not there yet. My heart and mind went out to developing and undeveloped countries that can neither access this wealth of data nor create their own. I don’t believe in the saying that ignorance is bliss. Their challenges and priorities are as different as they are valid. I talked with Katy and Erin Akred after the session and learned about the TechWomen initiative (supported by the U.S. Department of State’s Bureau of Educational and Cultural Affairs), as well as DataKind’s mission, involvement and outreach. As my call to action, I’d love to be a part of expanding the ecosystem and improving the chances of success where people are otherwise not so privileged.


Enjoyed this post? If you have ideas or interests around collaborating for data-driven impact, whether new or existing, for-profit or NGO, please connect in the comments or tweet @nvictta.
Follow the conversation on Twitter using #CivicData.
GHC Presentation Slides (Source: FeelingElephants)
Disclaimer: I am a community volunteer for Grace Hopper Celebration of Women In Computing. Opinions are my own.