Part Three: Designing a Data Architecture to Support Both Fast and Big Data
In post one of this series, we introduced the ideas that a Corporate Data Architecture was taking shape and that working with Fast Data is different from working with Big Data. In the second post we looked at examples of Fast Data and what is required of applications that interact with Fast Data. In this post, I will illustrate how I envision the corporate architecture that will enable companies to achieve the data dream that integrates Fast and Big.
The following diagram depicts a basic view of how the “Big” side of the picture is starting to fill out. At the center is a Data Lake, or pool, or reservoir… there is no shortage of clever names, or of debate over what to call it. What is clear is that this is the spot in which the enterprise will dump ALL of its data. This component is not necessarily unique because of its design or functionality, but because it is an enormously cost-effective system for storing everything. Essentially, it is a distributed file system on cheap commodity machines.
There may or may not be a single winning technology here. It may be HDFS or some other store (maybe S3 if you’re on Amazon), but the point is, this is where all data will go. This platform will:
- Store data that will be sent to other data management products, and
- Support frameworks for executing jobs directly against the data in the file system.
Moving around the outside of our Data Lake are the complementary pieces of technology that allow people to gain insight and value from the data stored in the Data Lake. Starting at 12 o’clock in the diagram above and moving clockwise:
BI – Reporting: Data warehouses do an excellent job of reporting, and will continue to offer this capability. Some data will be exported to those systems and temporarily stored there, while other data will be accessed directly from the Data Lake in a hybrid fashion. These data warehouse systems were specifically designed to run complex report analytics, and do this well.
SQL on Hadoop: There is a lot of innovation here. The goal of many of these products is to displace the data warehouse. Advances have been made with the likes of Hawq and Impala. But make no mistake, there is a long way to go for these systems to get near the speed and efficiency of the data warehouses, especially those with columnar designs. SQL-on-Hadoop systems exist for a couple of important reasons:
- SQL is still the best way to get at data, and
- Processing can occur without moving big chunks of data around.
Exploratory Analytics: This is the realm of the data scientist. These tools offer the ability to “find” things in data – patterns, obscure relationships, statistical rules, etc. Mahout and R are popular tools in this category.
MapReduce: This is a lazily-named group of all the job scheduling and management tasks that often occur on Hadoop (I really should come up with something more accurate). Many Hadoop use cases today involve pre-processing or cleaning data prior to the use of the analytics tools described above. These are the tools and interfaces that allow that to happen.
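As a rough illustration of the kind of pre-processing job described above, here is a minimal map/reduce-style sketch in plain Python. It is not Hadoop code: the function names (`map_clean`, `reduce_sum`, `run_job`) and the `key,value` record format are assumptions made for this example only.

```python
from collections import defaultdict


def map_clean(line):
    """Map step: drop malformed records and emit normalized (key, value) pairs."""
    parts = line.strip().split(",")
    if len(parts) != 2:
        return []  # wrong number of fields: discard the record
    key, raw = parts[0].strip().lower(), parts[1].strip()
    try:
        return [(key, float(raw))]
    except ValueError:
        return []  # non-numeric value: discard the record


def reduce_sum(pairs):
    """Reduce step: sum the values for each key."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = [pair for line in lines for pair in map_clean(line)]
    return reduce_sum(pairs)
```

Calling `run_job` on a handful of raw lines returns per-key totals with the malformed records removed, which is the shape of data the downstream analytics tools typically expect.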
ETL of Enterprise Apps: Last, at 6 o’clock, is the ETL process that will help get all the legacy data from our trusty enterprise applications into the Data Lake that stores everything. These applications will slowly migrate to full-fledged Fast+Big Data apps in time, which I will discuss in a future post. But suffice it to say: once I add sensors to a manufacturing line, I have a Fast+Big Data problem.
OK, we now have analytics … so what?
Why do we do analytics in the first place? Simple. We want:
- Better decisions
- Better personalization
- Better detection
- Better … interaction.
Interaction is what the application is responsible for, and the most valuable improvements come when you can do these interactions accurately and in real-time. This brings us to the second half of the architecture where we deal with Fast Data to make better, faster real-time applications, depicted in the diagram below.
The first thing to notice is that there is a tight coupling of Fast and Big, although they are separate systems. They have to be separate, at least at scale. The database system designed to handle millions of event decisions per second is wholly different from the system designed to hold petabytes of data and generate extensive reports.
The nature of Fast Data produces a number of critical requirements to get the most out of it. These include the ability to:
- Ingest / interact with the data feed
- Make decisions on each event in the feed
- Provide visibility into fast-moving data with real-time analytics
- Seamlessly integrate into the systems designed to store Big Data
- Serve analytic results and knowledge from the Big Data systems quickly to users and applications, closing the data loop.
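The requirements listed above can be sketched as a toy, in-memory pipeline. This is only an illustration under stated assumptions, not an operational database: the class `FastDataPipeline`, its fields, the threshold-based decision rule, and the event format are all hypothetical, and the `exported` list merely stands in for the Big Data store.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FastDataPipeline:
    """Toy stand-in for the Fast Data tier (all names are hypothetical)."""
    threshold: float                                   # assumed decision rule parameter
    batch_size: int = 3                                # events buffered before export
    counts: Counter = field(default_factory=Counter)   # real-time aggregate
    buffer: list = field(default_factory=list)         # pending export batch
    exported: list = field(default_factory=list)       # stands in for the Data Lake

    def ingest(self, event: dict) -> str:
        # Ingest the event and make a decision on it.
        decision = "alert" if event["value"] > self.threshold else "ok"
        # Update a running aggregate for real-time visibility.
        self.counts[decision] += 1
        # Buffer the event and hand full batches to the Big Data side.
        self.buffer.append({**event, "decision": decision})
        if len(self.buffer) >= self.batch_size:
            self.exported.append(self.buffer)
            self.buffer = []
        return decision

    def update_threshold(self, new_threshold: float) -> None:
        # Close the loop: apply knowledge derived from Big Data analytics.
        self.threshold = new_threshold
```

For example, after creating `FastDataPipeline(threshold=10.0)` and calling `ingest` once per incoming event, `counts` gives a live view of decisions so far, `exported` holds the batches handed off for long-term storage, and `update_threshold` models analytic results flowing back into the real-time path.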
There is no better technology to meet these requirements than an operational database. The challenge we have faced is that there hasn’t been an operational database that can manage this kind of throughput. As a result, there have been a number of Band-Aids people have used to attempt to meet their needs, often giving up capabilities and always adding complexity.
In the next post, I will detail the capabilities I see customers looking for to support their Fast Data applications. Then we will take a look at the results of attempting this solution with a popular alternative, stream processing.