Big Data And The Smart Grid: Is Hadoop The Answer?
The smart grid is advancing at a rapid rate. The market was nascent at the beginning of the 21st century, yet by the end of 2013 over 310 million smart meters had been installed globally. That number will more than triple by 2022, reaching nearly 1.1 billion, according to Navigant Research. Although smart meters represent only a fraction of the sensors on the grid infrastructure, their installation numbers provide a good indication of the penetration and rate of growth of the smart grid.
The truth is that the smart devices themselves provide little utility: they simply make it possible to sense a device’s state remotely. Collectively, however, these devices generate massive amounts of information. To realize the economic, social, and environmental value of the smart grid, utilities need a solution that can aggregate these data, then correlate and analyze all of the information generated by the smart grid infrastructure in real time. Hadoop is often touted as the go-to big data solution, but is it right for the complex analytics requirements of the smart grid?
Hadoop and Big Data Tools
Hadoop is a collection of open source tools, managed by the Apache Software Foundation (ASF) and designed for processing big data. Although the original Hadoop architecture focused largely on just one mode of computation (the MapReduce model developed by Google, Yahoo!, and others), “Hadoop” has come to refer to a number of related ASF projects. These include the original Hadoop MapReduce, the Hadoop Distributed File System, and Hadoop YARN, in addition to projects that are distinct from Hadoop itself, such as Cassandra, Storm, Spark, and many others. To avoid confusion, in this article we will refer to the collection of these tools as the “Apache stack”.
More fundamental than particular pieces of software are the underlying programming paradigms that they support. These include not only batch MapReduce processing techniques but also stream and iterative processing models, both of which are extremely important in the context of energy systems. While there are projects within the Apache stack (Storm and Spark, respectively) that can address these processing models, they are independent and complex projects in their own right. Simply saying, “We use Hadoop (really meaning the Apache stack) to handle big data” is akin to saying, “We use Linux to handle computation.” The systems in the Apache stack provide basic functionality, but the question is how to successfully and cost-effectively develop applications and integrations that bind these together in a manner specific to energy system data.
At C3 Energy, we developed a unified data analytics platform that can handle all of these processing paradigms. Where advantageous, we have made extensive use of components from the Apache stack. However, in designing our platform we identified areas where a custom system to support particular programming paradigms would be more powerful and efficient. By combining the most suitable of the available tools, developing our own software framework, and focusing specifically on data relevant to energy systems, C3 Energy delivers a platform with the flexibility, scalability, and speed to meet the challenges of big data in energy systems and to unlock significant economic value for utilities and their customers.
Collectively, smart devices generate massive amounts of information (Image: Becky Lai, Flickr CC).
The Start for Big Data: Batch Processing
Batch processing techniques like Hadoop MapReduce were designed for the analysis of large, static, historical data sets. With MapReduce, a large data set is divided into many small sets for processing. The same tiny unit of work is done many times across many machines in parallel for each piece of data in the big data set.
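As a rough illustration of the map, group, and reduce steps described above, the following single-process Python sketch totals energy use per meter. The meter IDs, readings, and function names are invented for illustration; a real Hadoop job would distribute the map and reduce calls across many machines rather than run them in one loop.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical interval readings as (meter_id, kWh) pairs, invented for illustration.
readings = [
    ("meter-1", 1.0),
    ("meter-2", 0.5),
    ("meter-1", 2.0),
    ("meter-3", 4.0),
    ("meter-2", 1.5),
]

def map_phase(record):
    """Emit one (key, value) pair per record; this is the small unit of work."""
    meter_id, kwh = record
    return meter_id, kwh

def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(values):
    """Combine all values collected for one key into a single result."""
    return reduce(lambda a, b: a + b, values, 0.0)

def run_job(records):
    # In a real cluster the map calls run in parallel on separate machines.
    mapped = [map_phase(r) for r in records]
    grouped = shuffle(mapped)
    return {key: reduce_phase(vals) for key, vals in grouped.items()}

totals = run_job(readings)
print(totals)  # {'meter-1': 3.0, 'meter-2': 2.0, 'meter-3': 4.0}
```

Note that run_job always starts from the full list of readings: adding a single new reading means running the whole job again.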
In the case of power systems, this might be applied to rate case development, “static” customer segmentation and targeting, energy savings measure modeling, and other analytics that do not require real-time data. These types of batch workflows are well-supported by the MapReduce paradigm and scale very well with large numbers of work items (e.g. smart meters) since many machines can process the subsets in parallel.
The downside of this style of processing is that it must be invoked repeatedly, and it must reprocess all the data each time it is invoked. This makes it suitable for periodic batch processing, but not for frequent reprocessing of dynamic data sets.
As MapReduce was not designed to process streaming data and real-time sensor data, it cannot inherently address many requisite smart grid real-time or near real-time analytic processes. A common smart grid requirement is the ability to track thousands of data attributes and to subject those attributes to thousands of computations. The frequency at which those computations need to be performed varies from infrequently (e.g. monthly, daily, or hourly) in the case of customer energy efficiency programs to millisecond granularity in the case of maintaining grid load stability or identifying grid cybersecurity threats.
Analyzing Real-Time Data: Stream Processing
Stream processing consists of invoking dependent logic as new data arrive, rather than waiting for the next batch upload and then re-processing everything. Although C3 Energy provides a MapReduce capability natively in our analytics platform, we make more use of stream processing due to the operational requirement of smart grid analytics to provide timely results for accurate insights. Our model is to define analytics that take metrics as input, which allows us to re-evaluate those analytics as, and only when, the underlying data change.
For example, to predict the risk of grid asset failures (such as transformers or feeders), we compute and keep current a real-time risk index for millions of grid assets. This asset risk index takes into account hundreds of attributes such as the equipment model and rating, real-time asset loading, and other significant factors such as maintenance history and local weather. Components of the risk index that change on a second or sub-second level, such as asset loading, require real-time computation in order to detect potential critical asset failure.
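The idea of re-evaluating an analytic only when one of its input metrics changes can be sketched in a few lines of Python. This is a toy, single-process illustration and not C3 Energy's implementation; the class name, metric names, and thresholds are all hypothetical.

```python
from collections import defaultdict

class StreamEngine:
    """Toy engine: re-evaluates an analytic only when one of its input metrics changes."""

    def __init__(self):
        self.metrics = {}                    # latest value per metric name
        self.analytics = {}                  # analytic name -> (input names, function)
        self.dependents = defaultdict(list)  # metric name -> names of dependent analytics
        self.results = {}                    # latest result per analytic
        self.evaluations = defaultdict(int)  # evaluation counts, kept for demonstration

    def register(self, name, inputs, func):
        self.analytics[name] = (inputs, func)
        for metric in inputs:
            self.dependents[metric].append(name)

    def ingest(self, metric, value):
        """Accept one new data point and refresh only the analytics that depend on it."""
        if self.metrics.get(metric) == value:
            return  # unchanged data triggers no work
        self.metrics[metric] = value
        for name in self.dependents.get(metric, []):
            inputs, func = self.analytics[name]
            if all(m in self.metrics for m in inputs):
                self.results[name] = func(*(self.metrics[m] for m in inputs))
                self.evaluations[name] += 1

# Hypothetical metrics for a single transformer.
engine = StreamEngine()
engine.register("load_ratio", ["load_kva", "rating_kva"], lambda load, rating: load / rating)
engine.register("overheated", ["temp_c"], lambda temp: temp > 90.0)

engine.ingest("rating_kva", 50.0)  # load_ratio still lacks load_kva, so nothing runs
engine.ingest("load_kva", 40.0)    # load_ratio is evaluated: 0.8
engine.ingest("temp_c", 75.0)      # only "overheated" is evaluated
engine.ingest("load_kva", 40.0)    # same value as before: nothing is re-evaluated
engine.ingest("load_kva", 60.0)    # load_ratio is re-evaluated: 1.2

print(engine.results)            # {'load_ratio': 1.2, 'overheated': False}
print(dict(engine.evaluations))  # {'load_ratio': 2, 'overheated': 1}
```

Five data points arrive, but only three evaluations run in total. A batch job would instead recompute every analytic for every asset on each invocation.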
Stream processing provides the benefit of timely and efficient execution of analytics, which is not possible with batch processing. This is especially important as data arrive continuously from multiple data sources. Asset utilization and event data come through SCADA systems, while maintenance information arrives from asset management systems. There is never a time when “all data are ready” to kick off batch processing.
The Apache stack’s answer to stream processing is the Storm project, which grew out of a code base originally acquired by Twitter and later became an ASF project. While we are watching Storm with great interest, it is still a relatively new piece of technology, and issues have been identified when integrating it even with other components in the Apache stack. Furthermore, stream processing is a fundamental use case for a huge number of energy applications. Thus, just as with the MapReduce model, building stream processing directly into our analytics engine gives us faster, more streamlined, and more effective integration with our data model.
Providing More Insight into the Grid: Iterative Processing
There is a third set of workflows required by smart grid analytic applications that is not well addressed by either batch or stream models. We call this class of workflows “iterative” because the processing requires visiting data multiple times, frequently across a wide range of data types. Many machine learning techniques required to optimize smart grid operations fall into this category.
Even a simple technique such as clustering, used, for example, to identify equipment with a high likelihood of failure, requires iterating repeatedly through the data. MapReduce frameworks do not provide a suitable solution for these types of use cases because they require reading and writing data from and to disk at each iteration, which typically results in run times ~100x slower than methods designed for iterative work.
Rather than just horizontally scaling the processing (matching it to the data), iterative processing both scales the processing horizontally and keeps the data in memory (or provides that appearance) across the cluster. This makes it possible to use techniques that require iterating repeatedly through vast amounts of data.
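To make the access pattern concrete, here is a minimal one-dimensional k-means (Lloyd's algorithm) sketch in plain Python. It is an illustration under stated assumptions rather than a production method: the loading values and starting centroids are hypothetical, and everything runs in a single process instead of across a cluster.

```python
def kmeans_1d(points, centroids, max_iterations=100):
    """Lloyd's algorithm on 1-D data; every iteration rescans the in-memory points."""
    centroids = list(centroids)
    for iteration in range(1, max_iterations + 1):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:  # converged: assignments can no longer change
            return updated, iteration
        centroids = updated
    return centroids, max_iterations

# Hypothetical transformer loading values (fraction of rated capacity).
loadings = [0.25, 0.5, 0.75, 1.5, 1.75, 2.0]
centers, passes = kmeans_1d(loadings, centroids=[0.0, 1.0])
print(centers, passes)  # [0.5, 1.75] 3
```

Even this tiny example needs three full passes over the data before the centroids settle. When each pass has to read its input from disk and write its output back, as in a chain of MapReduce jobs, that per-pass cost is paid again on every iteration.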
The Apache Spark project is the leading implementation of the iterative processing model. Originally developed by the University of California at Berkeley AMPLab and now an Apache project, Spark provides the abstraction of an unlimited amount of memory over which processing can iterate. Unlike in the MapReduce and streaming settings, direct integration with the data model is less time-critical for iterative processing. We can pull the data from the platform once, perform long, time-consuming iterative computations (such as running machine learning algorithms, which take the vast majority of the compute time), and then write the results back to the platform. Embedding Spark within our analytics platform lets ad hoc processing and machine learning algorithms run in a natural way.
Unified Power Grid Platform
Building a smart grid software development platform requires going beyond the “Hadoop” buzzword. Hadoop (again, more correctly, the whole Apache stack) is not a solution: it is a set of underlying tools that can be used to build a system, much like the libraries that accompany an operating system. While it is possible to embark on a smart grid software development project based solely on the components that the Apache stack provides, doing so requires significant effort, is very expensive, and comes with a high risk of failure.
A data analytics platform for the smart grid should be capable of analyzing both slowly and rapidly changing data using a combination of batch and real-time data processing techniques. It must also apply machine learning to data sets that are characteristically large, dynamic, and rapidly expanding. These requirements necessitate a combination of MapReduce, stream, and iterative processing.
Cover image by Colin Behrens is licensed under CC0 Public Domain.