Should your business be on blockchain?

It depends.

Blockchain has evolved from a platform for trading cryptocurrencies into one for exchanging all kinds of digital assets, from flowers and luxury goods to medical records and supply chain documents, often governed by smart contracts. When considering a private blockchain model for your business transactions, you should ask yourself these four questions:

What are my transactions?

A first step is identifying the assets that you will be exchanging on a blockchain network. A transaction is simply an exchange, trade or transfer of an asset, a digital one in this case, between two or more parties. If your asset will change hands, then you will need to keep a chain of records documenting the activities of the exchange. This is the essence of a blockchain: to manage and secure digital transactions as part of an open, transparent and decentralized system of record. And don't worry: on a private blockchain, your data remains confidential, visible only to approved members.

A blockchain is a shared ledger of key-value hash chains, representing transactions, distributed amongst approved parties. It records member transactions and updates the ledger securely and efficiently, in a way that is both verifiable and permanent. This system of record is visible to every approved member of the network. When written to the blockchain, each new transaction record is linked to the record before it, making the system auditable and immutable; it cannot be altered as long as the encryption on the blockchain remains intact. Computational algorithms guarantee that the record of transactions is permanent, chronologically ordered, and available to all members of the network.
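The hash-chaining idea is simple enough to sketch in a few lines of Python. This is a toy illustration, not any real blockchain's record format; the field names and transactions are made up. It shows how linking each record to the hash of the one before it makes any later tampering detectable:

```python
import hashlib
import json

def record_hash(payload: dict) -> str:
    """Deterministically hash a record (sorted keys keep the output stable)."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

def append_block(chain: list, transaction: dict) -> None:
    """Link a new record to the hash of the previous one."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    payload = {"prev_hash": prev, "transaction": transaction}
    chain.append({**payload, "hash": record_hash(payload)})

def verify(chain: list) -> bool:
    """Recompute every link; any altered record breaks the chain."""
    for i, rec in enumerate(chain):
        expected_prev = chain[i - 1]["hash"] if i else "0" * 64
        payload = {"prev_hash": rec["prev_hash"], "transaction": rec["transaction"]}
        if rec["prev_hash"] != expected_prev or rec["hash"] != record_hash(payload):
            return False
    return True

chain = []
append_block(chain, {"from": "alice", "to": "bob", "asset": "doc-42"})
append_block(chain, {"from": "bob", "to": "carol", "asset": "doc-42"})
assert verify(chain)

# Tampering with an earlier record is immediately detectable:
chain[0]["transaction"]["to"] = "mallory"
assert not verify(chain)
```

Real blockchains add consensus, signatures and block batching on top, but the auditability property rests on exactly this kind of link.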

Will I leverage network effect?

A network effect is the effect, described in economics and business, that one user of a good or service has on the value of that good or service to others. When a network effect is present, the value of a good or service is dependent on the number of others using it.[1] A greater number of users increases the value to each.
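One common heuristic for quantifying this (Metcalfe's law, an assumption not mentioned in the original) values a network by the number of possible pairwise connections between its users:

```python
def pairwise_connections(n_users: int) -> int:
    """Metcalfe's heuristic: network value scales with the n*(n-1)/2 possible links."""
    return n_users * (n_users - 1) // 2

# Doubling the user base roughly quadruples the number of connections:
assert pairwise_connections(10) == 45
assert pairwise_connections(20) == 190
```

The exact exponent is debated, but the point stands: value grows much faster than linearly in the number of users.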

The strength of a blockchain network is in its numbers, and this is where the concept of consensus comes in. (In his lucid blog post for Evergreen, @EricJorgenson explains the nuances between virality, network effects, and economies of scale.) The network of users acts as a consensus mechanism: more users or nodes on the network generally means more distributed operators review and agree on every addition before the data is permanently committed to the blockchain. Each member is thus building the same shared system of record simultaneously. Because the database is distributed, each party has access to the entire database and its complete history, making data manipulation very difficult, if not near impossible. This also eliminates the potential for a single point of failure (SPOF) in any one part of the system, making the overall system more robust than a traditional centralized database could offer.

Can I pool resources?

For the purpose of pooling resources to achieve a common goal, business networks can come together to form a consortium that benefits from sharing reference data. A consortium blockchain platform is essentially a private blockchain model that allows its members to retain privacy and control while reducing transaction time and cost. This model might be appealing to parties spread across industrial, geographical or regulatory boundaries.

Each member of a consortium blockchain platform plays a different role in meeting the common goal, with a visionary responsible for leading the pack. Hyperledger Fabric, for example, obviates the need for a visionary to build the code, removing a potential barrier to adoption.

What is the business use case?

Beyond building a minimum viable product that will be deployed on the blockchain, what is the overarching vision?

  1. Faster payment settlements – no reliance on centralized servers to do all of the work.
  2. Increased security – virtually impossible to reproduce (or tamper with) the entire blockchain, and to get member nodes to accept the hijacked version.
  3. More open and transparent – with permission, members are able to view the entire history of transactions, increasing auditability and general trust amongst participants.
  4. Lower cost – expensive infrastructure of transaction processing, reviews and approvals by intermediaries is eliminated.

In summary, many businesses today could benefit from being on blockchain, but not all businesses qualify. Blockchains are well-suited to certain functions and poorly suited to others.



[1] – Network effect (n.d.). In Wikipedia. Retrieved November 5, 2017.


Curated list of blockchain services and exchanges – Imbaniac, Github


New York State: Tomorrow Starts Today (#MamaWeMadeIt)

This month, I am featured in a series of TV commercials for New York State, themed Tomorrow Starts Today. The commercials are currently airing on prime time TV nationwide in the United States and Canada.


See more on YouTube:

New York State: Rust Belt to Brain Belt
“Millennials are heading to Upstate New York for rapidly expanding job markets in fields like technology, engineering, and even gaming.”

New York State The Millennial Effect
“Millennials are reshaping Upstate New York by creating vibrant communities full of diverse restaurants and nightlife.”

New York State: The New American Dream
“New York is evolving into a hub for Millennials to chase the new American Dream with an array of career opportunities and a high quality of life.”

Hadoop At Scale – Challenges Drive Innovation GHC16

Dr. Elephant, popular for Hadoop diagnostics

LinkedIn is a data-driven company that relies heavily on Big Data processing and storage systems like Hadoop to drive critical decisions and features for its business applications. In this session, software engineer Vinitha Gankidi shares LinkedIn's experiences with the challenges of, and solutions for, scalability and performance when allocating and monitoring resources and running workflows on Hadoop. Some of the results impacted the following areas:

  • Developer productivity: Production reviews at the company once posed a series of challenges, including cluster inefficiency and reduced productivity, while chewing up valuable resources and time. Teams at LinkedIn developed a solution that improved their review timelines from an average of six weeks to one day: Dr. Elephant (now open sourced on GitHub), a performance tuning tool that helps users of Hadoop and Spark understand, analyze, and improve the performance of their flows by providing job-level suggestions rather than cluster-level statistics[1].
  • Capacity Planning and Resource Management: Over the years, LinkedIn has dealt with unprecedented growth of its user base and, in turn, its resource usage. These growth spikes were usually accompanied by issues that went unnoticed until workflows bled or clusters under-performed due to over-utilization. To take proactive steps in selecting new hardware and anticipating capital expenditure, LinkedIn teams built analytics pipelines on PrestoDB (an open source distributed SQL query engine for running interactive analytic queries against data sources), with dashboards for metrics using tools such as Apache Hive and Avro.
  • Cost-effective scale testing: A software performance regression is a situation where the software functions correctly but performs slowly or uses more memory compared to previous versions[2]. Dynamometer is a tool developed at LinkedIn to address cluster degradation, slow-running jobs, and performance regressions. It simulates test clusters using 10–100x less hardware while replaying realistic production workloads and monitoring metrics. Be on the lookout: this tool will also be open sourced soon.


  1. Open Sourcing Dr. Elephant: Self-Serve Performance Tuning for Hadoop and Spark (2016, January 8). Retrieved from
  2. Software regression (2016, July 28). Retrieved from
Disclaimer: GHC Speaker Vinitha Gankidi is a Software Engineer at LinkedIn. I am a community volunteer for Grace Hopper Celebration of Women In Computing. Opinions are my own.

When Data is Your Product GHC16

The role of (information) technology in business is to facilitate business outcomes. If this statement is true, then it settles the chicken-or-egg causality dilemma (tech for business, or business for tech?), while also making a case for "hybrids," i.e. people in an organization who combine business acumen with tech savvy. Hybrids are the real MVPs saving companies from waste, and analytics solutions are the tools that enable these superhero abilities. The "data" in the title refers to viewing data systems as a product, and the goal of this session is to provide a guide that empowers business users to make better data-driven decisions.

Behind business goals are Key Performance Indicators (KPIs) dressed up as questions that are crucial to organizational success. Business users are desperate for data and tools that will provide these answers. They are curious to know, for example, how well their marketing campaigns are performing, the elapsed time between orders, and so on. Placing analytics solutions in the hands of product teams that need to be business smart and data savvy will not only maximize effectiveness but also transform how teams work, making products smarter and decisions better.

To empower business users with data, it’s important to start where the organization is. Below are some pointers to consider when planning a data product.

  • Communication and inclusion. See the business users as customers, and make them an active part of requirements, implementation and testing phases. Start the conversation between the tech and non-tech teams, with a clear vision. Seek to understand the problem and goals, and resist trying to dazzle with tech.
  • Consider the options. Having understood the goals, then proceed to pick the tech or use what you have, bearing in mind that there are many possible paths to success. Make the most of your expertise to solve the problem, harnessing data models and structure. To avoid over- or under-engineering a solution, stay flexible and focused on solving problems without getting too attached to a single idea.
  • Be open-minded about choosing software tools. What matters most is getting the job done, and while there is a variety of good answers, paying attention to needs and success criteria will help find the best without interfering with reaching the goal.
  • Be rigorous in ensuring correctness and availability of your data and product. Apply engineering and operational rigor through SDLC processes (testing, validation, quality, anomalies) and SLAs (availability, stability, performance, scalability) respectively.

Bringing it all together, remember which comes first as you combine and bridge business acumen with technology and let these work in tandem to get the job done.

Disclaimer: GHC speaker Denise McInerney is a Data Architect at Intuit. I am a community volunteer for Grace Hopper Celebration of Women In Computing. Opinions are my own.

Doing Good with Data – Human-Centered Data Science For Social Good GHC16

Measuring civic life is critical to improving it. This was the core message I got from an enlightening panel of five women in the field of data science, who introduced themselves by their favorite data sets, and Civic Data to the audience – what it is, who owns it and how citizens could leverage it. At the intersection of analytics and social good, the conversation touched on topics of real life data sets and issues that spread across transportation and safety in New York City, water consumption in cities like Flint, Michigan and historic droughts in California, human trafficking and labor exploitation, racial and ethnic bias in patent filing and rates of arrests.

In developed countries today, we live in data-saturated environments. The deluge of open data comes with tons of benefits, and it's impactful when conversations are reoriented as a result of uncovering and sharing knowledge. For example, one of the panelists, Katy Dickinson (pictured below), has been part of an exciting project to improve public transport in Tunis, the capital of Tunisia, where citizens' major pain points are sporadic bus schedules and routes. Her team will be mapping the city's public transport routes and schedules. But benefits rarely come without costs. Behind the numbers are people, and the people who provide their information must be considered and protected through a human-centered approach. The panel weighed the risks of personally identifiable information (PII), the consequences of data integration, and differential privacy, with examples such as using tip-line data to help human trafficking victims.

With panelist Katy Dickinson at GHC16

Overall this was a profound session for more reasons than one, and most importantly, it inspired and expanded my thinking. As mentioned before, there's a deluge of open data in developed countries, along with access to tools for computing and storing this "big data," so much so that tinkering is easily the norm, which is phenomenal. On the other hand, I would love to say the same for the rest of the world. Depending on where you stand, the reach of the digital age might seem widespread, but from a global viewpoint, we are not there yet. My heart and mind went out to developing and underdeveloped countries that are neither privy to this wealth of data nor equipped to create their own. I don't believe the saying that ignorance is bliss. Their challenges and priorities are as different as they are valid. I talked with Katy and Erin Akred after the session and learned about the TechWomen initiative (supported by the U.S. Department of State's Bureau of Educational and Cultural Affairs), as well as DataKind's mission, involvement and outreach. As my call to action, I'd love to be a part of expanding the ecosystem and improving the chances of success for people who are otherwise not so privileged.


Enjoyed this post? Got ideas or an interest in collaborating for data-driven impact, whether new or existing, for-profit or NGO? Please connect in the comments or tweet @JackOfMyTrades
Follow the conversation on Twitter using #CivicData.
GHC Presentation Slides (Source: FeelingElephants)
Disclaimer: I am a community volunteer for Grace Hopper Celebration of Women In Computing. Opinions are my own.

LinkedIn Analytics: Another Person You May Know

This past Labor Day, my college friend was in town for a month and invited me to hang out with friends over grilled food and games. It was a good time. I got to meet new people, enjoy their company, and all of us – guests and the host couple – ended up having a blast. A couple of days later, I logged into LinkedIn (via web), and a picture of the host popped up in the bottom right corner of the webpage as Another Person You May Know.

I was curious. Certainly, there's a lot to keep up with, as a ton of changes have occurred over time in LinkedIn's algorithms, which, back when I joined, categorized and recommended potential contacts based on one's existing network, work and school history. However, this was not the case here. The host and I had this one mutual friend, whom we had both been on- and offline friends with for years, and I wanted to figure out what suddenly roused this latent connection into a social recommendation. So I started to mentally retrace my steps:

  • My friend Peter mentions he has a friend he wants me to meet, Joyce. While he's talking about the impressive work Joyce has been involved with, I google her name on my phone's browser. Cookies, page browsing tracking. (The relevant setting was automatically set to Yes by default.)
  • A couple of hours later, Joyce and her husband show up to pick Peter up from my place, and right there and then, all four of us are at the same location. Our phones, which have IP addresses, are oozing geo-spatial data. The three go off to spend some time together, and later return to drop Peter off at mine. Again, another geo-spatial count.
  • This time around, we talk for a bit and Joyce and I exchange numbers before they leave. Phone numbers/Contacts. By the way, I don’t have my contacts synced to my LinkedIn account. Maybe she does, I didn’t ask. A couple days to the party, we trade texts and a phone call.
    What my default Data privacy settings look like


  • Peter posts a picture on Instagram, tags me in it and Joyce leaves a comment. I check her profile, ‘like’ some of her pictures and follow her. She does the same. Social media data.
  • Finally it’s party time! I visit her home, sharing geo-spatial data again. And a couple days later, she’s in my LinkedIn feed.


Without knowing whether or how heavily LinkedIn uses a 360-degree view of its users, my best bet is to solve by elimination. I checked the default privacy settings on their site, and the glaring culprit appears to be the phone number exchange. I reckon if it were geo-spatial data, Joyce's husband's profile (that's if he has one) would also pop up in my feed, since I met both her and her husband at the same time and place; however, I only exchanged numbers with her. The same goes for the interaction on Instagram. Plus, the next time we all regrouped at the party, there were more people, none of whom have popped up in my feed. At least not yet. There was no number swapping (or Instagram following), but then again, what if they do not have LinkedIn accounts?

It gets more interesting to break this up into chunks and identify the underlying technologies that enable this interaction.

Understanding the value of Data Lake(s) I

Tech loves buzzwords. Data lake is a big one with Big Data. Before we go crazy technical, let's try to simplify the term. What comes to mind when you think of a data lake? First, it's two words: data and lake.

Data: You remember this one from primary school. Unprocessed or raw information, facts and figures, in all their various formats: .MP3, .DOC, .CSV, .JPG, etc.

Lake: A large body of water, simply put. I think of my weekend camping trip at Lake George, but in this context, a lake is storage for data, not water.

Now think of yourself. What do you need data for? To get work done. For example, I need to write a research paper. I generate some original ideas and source some more. I need to document and compile my findings. I research, type things up, save what I want, and store the content in some kind of database, to use the term loosely, for future use or reference. For now I'll just save in folders on my laptop or my Google Drive (Box sometimes, my Dropbox is full). Whenever I need it, all I have to do is search (query) and retrieve.

Right! Data, lake, base. But I'm really thinking about a hard drive. When do I need one or the other?

Nuances. A database is really a collection or repository of information that’s organized so that it can easily be accessed, managed and updated. [1] My hard drive is a storage medium. It is my local data repository, my local database on my laptop.

Now picture several people with needs similar to yours, not limited to students across the globe writing research papers, or people working in an organization, all constantly needing data to get all sorts of work done, and you'll start to see why a hard drive is no longer enough. This is where the client-server model comes in: entire computers are dedicated to providing database services, via computer programs, to other computer programs. That is what a database server does.

My hard drive or disk is clearly not a server. However, servers are made up of hard disks. Local storage media include disk, tape (the magnetic kind, such as floppies and cassettes) and memory. Data is stored on the hard disk in the form of files, each a collection of bytes.

Having cleared up these nuances, or at least attempted to, let's crank it up a notch and think enterprise. By that I mean scale, which is where a Database Management System (DBMS) comes in. Of course, the database and the DBMS are two separate things. The DBMS is system software: it creates and manages the database and interfaces between the database and the end users or an app. Fundamentally, it manages the data, the database engine, and the database schema, and it lets users Create, Read, Update and Delete (CRUD) data in a database.

In the enterprise, more users are doing CRUD more frequently and need a single version of the truth. In that case, the data(base) needs to be ACID (atomic, consistent, isolated, durable). So you see why news of the SWIFT hack was a big deal.
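Both ideas, CRUD and ACID, are easy to see with Python's built-in sqlite3 module, which wraps statements in transactions. This is a minimal sketch (the table, names and amounts are made up for illustration), showing that a failed multi-step update rolls back atomically:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")  # Create
conn.commit()

# Atomicity: a transfer either fully happens or not at all.
try:
    with conn:  # opens a transaction; rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")  # Update
        conn.execute("UPDATE accounts SET balance = balance + 70 WHERE name = 'bob'")
        raise RuntimeError("simulated failure mid-transfer")
except RuntimeError:
    pass

# Both balances are unchanged: the partial transfer was rolled back.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))  # Read
assert balances == {"alice": 100, "bob": 50}

conn.execute("DELETE FROM accounts WHERE name = 'bob'")  # Delete
conn.commit()
```

An enterprise DBMS does the same thing at vastly greater scale, with many concurrent users hitting the same tables.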

That helps. Now that I know better, I can ask better questions, like what is the difference between a database and a data lake?

I think this is enough for one post. Now that we've covered the introduction, let's look at that in a sequel. (How does Wikipedia do it?)



  1. Database,

Rethinking commodity hardware for Hadoop

Traditionally, the idea of deploying Hadoop on commodity hardware was genius. With this option of low-cost infrastructure, Hadoop was in fact designed to be the RAID of compute farms: made up of homogeneous, generally cheap and easily replaceable servers. This model does work, but when big data starts to really scale, and I mean really scale, the terms 'commodity' and 'cheap' go from hand-in-hand to tongue-in-cheek. In other words, it would be an oversight and in poor taste to build long-term plans around the two.

Get to your point already?

Infrastructure solutions come as servers with internal storage. Stripped down to the basics, they perform two functions: compute and storage. The issue with this is the very reason behind the term 'Big Data': the data will grow. As this growth occurs (the technical term is scale), more and more storage is needed to house the data. You could store this data on servers, of course, but if your need is really just storage, then you probably don't need the compute resources. And that's where the challenge begins. Because traditional commodity servers have their storage and compute resources joined at the hip, this is a quick way to under-utilize resources, which is not cost effective.

So hey, why don’t you use external storage?

Well, great question. That's definitely an option, but there's a whole "movement" about moving the analytics to the data, which we could pick apart. Keeping the storage local means faster jobs. The moment you move data across a network to external storage, you are susceptible to the issues that accompany network and storage bandwidth: the good, the bad, and the potentially real ugly (e.g. complexity, latency, loss of features, loss of data governance, more infrastructure). So ideally, you want your data local, on the same disks where your programs run.

Having said that, at the enterprise level, data resides on SANs – Storage Area Networks, and must be moved to compute nodes for processing. To optimize infrastructure and reduce data bottlenecks, the nodes should be on the same cluster.

So what else is there to Hadoop and commodity hardware?

High availability is essential to Hadoop; it is simply a mechanism to avoid a single point of failure (SPOF) in the overall system. To incorporate redundancy, the recommendation is to keep more than one copy of the data, and with Hadoop, specifically three copies. And just as you see the good intentions, you also see the implications of scale. Replicating the data three times means multiplying local disks, and scaling server and disk together. Again the demon of under-utilization rears its head, multiplying the physical footprint of the datacenter. What a mess!
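The storage math behind that complaint is easy to sketch. Three-way replication is HDFS's default; the extra headroom for intermediate/shuffle data below is an illustrative assumption, not a Hadoop constant:

```python
def raw_storage_needed(logical_tb: float, replication: int = 3,
                       temp_overhead: float = 0.25) -> float:
    """Raw disk required for a given amount of logical data.

    replication: copies of each block (HDFS default is 3)
    temp_overhead: assumed headroom for intermediate/shuffle data
    """
    return logical_tb * replication * (1 + temp_overhead)

# 100 TB of logical data needs 375 TB of raw disk under these assumptions:
print(raw_storage_needed(100))  # 375.0
```

So a cluster sized for the data alone is off by nearly 4x before you even account for growth, which is exactly why scaling disk and server in lockstep gets expensive.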

One popular use of Hadoop is as a data warehouse. If you're considering doing this, i.e. modernizing your existing/traditional data warehouse solution, also be mindful of the impact of scaling on your datacenter and the associated costs, which include but are not limited to:

  • Personnel: administrators for deployment and day to day operations
  • Network: bottlenecks, bandwidth and (re)solutions
  • Workloads: nature of jobs and needs (data streaming or data at rest)
  • Software licensing (operating systems, applications) per node/cluster