Oil is oil. Data is dynamic.

Background

Data has recently been lauded as a commodity that's more valuable than oil, and it's easy to see why. More and more of the products and services that affect everyday life are fundamentally driven by data.

  • Amazon mastered logistics by optimizing around supply chain data.
  • Google ranks search results by indexing website data.
  • Netflix knows which shows and movies to recommend by classifying video data.
  • Tesla navigates the road autonomously by processing telemetry data.
  • Illumina enables new drugs and treatments by sequencing and analyzing genetic data.

This list keeps going, and the theme applies to large and small organizations alike. Just as oil fueled innovation during the second industrial revolution, data is fueling innovation in the Information Age, and the world of tomorrow will rely on it to an even greater extent. Marc Andreessen, a prominent technology investor, famously coined the phrase "software is eating the world." Software and data are tightly linked: much software exists solely to act on data, and data is often the output of software systems. The relationship between software and data is circular, so by extension, data is eating the world as well.

Problem

This is where the comparison between data and oil breaks down. Oil is oil. Unsurprisingly, the website data that Google indexes looks very different from the telemetry data that Tesla processes, and the video data that Netflix classifies looks very different from the supply chain data that Amazon optimizes. Data is dynamic. Data is produced by people, whether through automated or manual processes, and that production process is different for every organization. Typically, teams of engineers build the software systems that produce and interact with much of today's data, and their effectiveness is influenced by a number of external factors.

  • Levels of expertise vary between team members.
  • Budgets and requirements are volatile.
  • Teams depend on external people, processes, and software.
  • Edge cases are nearly impossible to predict ahead of time.
  • Team composition changes over time.

These are complex systems that change all the time, and they carry a significant amount of risk.

In the examples above, optimizing, indexing, classifying, processing, and analyzing are a few of the actions that well-known companies perform on their data. Other common actions include organizing, cataloging, and transforming data. Companies invest large sums of money into developing systems, or data pipelines, that enable them to perform these actions. The scope and scale of these systems vary dramatically, but components in data pipelines often expect data to follow a well-defined pattern, or schema.

For example, Component A in Acme Inc.'s data pipeline reads in customer data of the following form:

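The field names here are illustrative; what matters is that the record is expected to include an "age" field:

  {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "age": 34
  }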

Component A responds with an error if the "age" field is missing, and that error creates a cascade of failures in the rest of the system. This is an oversimplified example, but it illustrates how unhandled errors in data pipelines cause downtime in production infrastructure. It would be trivial to write a validation for the age field by hand, but consider writing validations for every field, across every possible case, across every component in a data pipeline, and couple that with the fact that it is nearly impossible to predict all edge cases ahead of time. This presents a massive challenge for teams.
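To make that burden concrete, here is a minimal sketch in Python of a single hand-written check; the function name and the checks beyond "age" are illustrative, not part of Acme's actual pipeline:

  # A hand-written guard for one field on one component.
  def validate_customer(record: dict) -> None:
      if "age" not in record:
          raise ValueError("missing required field: age")
      if not isinstance(record["age"], int) or record["age"] < 0:
          raise ValueError("'age' must be a non-negative integer")
      # ...and so on for every other field, every edge case,
      # and every component in the pipeline.

Multiply checks like these across an entire pipeline and the validation code quickly becomes one of the hardest parts of the system to maintain.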

Solution

OrgStack reduces risk for data-driven teams by preventing unexpected or malformed data from making its way into data pipelines. Our goal is to eliminate downtime in production infrastructure. We've developed a platform, along with a set of developer tools, that makes it easy to establish schema constraints on data components, manage data sources, and receive critical alerts.
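To illustrate the idea behind schema constraints, here is a generic sketch using the open-source jsonschema library; it is not OrgStack's actual API, and the CUSTOMER_SCHEMA, guard, and send_alert names are hypothetical. A single declaration replaces the pile of hand-written checks and raises an alert the moment a malformed record reaches a component:

  from jsonschema import ValidationError, validate

  # A declarative constraint for the customer record shown earlier.
  CUSTOMER_SCHEMA = {
      "type": "object",
      "properties": {
          "name": {"type": "string"},
          "email": {"type": "string"},
          "age": {"type": "integer", "minimum": 0},
      },
      "required": ["name", "email", "age"],
  }

  def send_alert(message: str) -> None:
      # Placeholder for whatever alerting channel the team uses.
      print(f"ALERT: {message}")

  def guard(record: dict) -> None:
      # Reject malformed records before they enter the pipeline.
      try:
          validate(instance=record, schema=CUSTOMER_SCHEMA)
      except ValidationError as err:
          send_alert(f"Schema violation at Component A: {err.message}")
          raise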

Get started here.