Ellen Friedman: Budgeting time for AI/ML projects

ABOUT THE AUTHOR: Ellen Friedman is a principal technologist at HPE focused on large-scale data analytics and machine learning. Ellen worked at MapR Technologies for seven years prior to her current role at HPE, where she was a committer for the Apache Drill and Apache Mahout open source projects. She is a co-author of multiple books published by O’Reilly Media, including AI & Analytics in Production, Machine Learning Logistics, and the Practical Machine Learning series.

Time matters, but mainly in ways that are often overlooked. It’s true that many AI/ML projects need to meet challenging time demands for low-latency responses. Models involved in machine-based decisions or discovery often must return results to meet strict latency SLAs. The very sophisticated deep learning models that drive (literally) AI systems for autonomous cars, for example, must deal with new data in near real time if the vehicle is to respond safely to the unpredictable world around it. Similarly, the AI and ML systems used in the telecommunications industry to help process call data for global devices and to optimize service by quickly readjusting bandwidth allocation and beamforming for cell phone towers all must react to changes in the world as they happen.

These requirements lead people to think of the time issue in terms of how fast an ML model runs, and how fast it delivers results. But what about the time it takes to build the AI/ML system in the first place? Or to rebuild it when the world changes? Ongoing maintenance also must be considered when building a time budget for machine learning systems.

Business goals impose strict time-to-market requirements. Time pressure also comes from outside circumstances. Building an intelligent system to help analyze data related to time-sensitive, real-world events (such as the COVID-19 pandemic) requires getting models up and running within a short time window. Even in simple situations, there’s almost always a time constraint in new system development. Time is a precious resource. 

How then can you budget end-to-end time effectively for developing an AI/ML project? 

Part of the answer may surprise you. It’s not the time you budget for the learning step itself (the “ML budget”) that matters most in your planning. Instead, it’s the much larger time period needed for all the rest of the project. The ML time budget for data scientists to actually do the specialized “smartness” that uses their expertise (working with algorithms and training and tuning models) is almost always only a small fraction of your total time budget. I’ve described this challenge in the short book Machine Learning Logistics, which I co-authored with Ted Dunning. A paper from Google titled Hidden Technical Debt in Machine Learning Systems identifies many of the risk factors that contribute to the long-term costs of ML systems, along with some best practices to mitigate the debt.

Your choice of infrastructure and technologies can help address this threat of technical debt. In an upcoming series of blog posts, I will explore the impact of technologies related to ML operations, data infrastructure, and the containerization of ML applications. For now, I am focusing mainly on several aspects of data infrastructure, touching lightly on the other two topics.

MLOps: Managing the machine learning lifecycle 

MLOps involves applying the principles of DevOps, such as agility and better intra-team communication, to the entire machine learning lifecycle. Remember, ML is more than just code. It also involves data, so MLOps goes beyond the scope of DevOps alone. An emerging class of tools supports and streamlines the efforts of MLOps teams, from data preparation for training to model deployment, monitoring, and maintenance. For example, HPE Machine Learning Ops (HPE ML Ops) is an end-to-end solution for the lifecycle management and operationalization of machine learning workflows. I’ll talk more about several aspects of the larger MLOps topic in future posts, but in this post I will concentrate on data logistics and infrastructure.

Data infrastructure: Impact on ML data logistics

Data logistics takes up a major part of the overall effort needed to build and maintain an AI/ML system, as illustrated here.

[Figure: AI/ML time budget]

This diagram shows the penalty paid for inefficiency in dealing with data logistics: the small ML time budget for specialized thinking about machine learning itself can be eaten up if the data engineering is not done efficiently.

If you are forced to spend even a small fraction of extra time on the data logistics, that will be time you don’t spend on machine learning. The leverage here is huge; a percentage-wise small overrun in logistics can completely consume your budget for machine learning.
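To make the leverage concrete, here is a minimal sketch of the arithmetic. The 10% share for ML work is an illustrative assumption, not a number taken from the diagram, and the function name is hypothetical.

```python
def logistics_overrun_that_consumes_ml(ml_share: float) -> float:
    """Return the fractional overrun in logistics time that would use up
    the entire ML budget, given the ML share of the total time budget.

    ml_share is an assumed input strictly between 0 and 1.
    """
    if not 0.0 < ml_share < 1.0:
        raise ValueError("ml_share must be strictly between 0 and 1")
    logistics_share = 1.0 - ml_share
    # The ML budget is gone once the extra logistics time equals ml_share.
    return ml_share / logistics_share


# Illustrative assumption: ML work is 10% of the total time budget.
overrun = logistics_overrun_that_consumes_ml(0.10)
print(f"Logistics overrun that consumes the whole ML budget: {overrun:.1%}")
```

Under that assumption, an overrun of roughly 11% in logistics is enough to leave no time at all for machine learning. The smaller the ML share, the smaller the overrun needed to wipe it out.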

How can you avoid this time trap?

Good data hygiene and efficient data infrastructure are essential to avoid having technical debt turn into technical bankruptcy. Here are three ways that an effective data fabric improves machine learning logistics:

Platform level vs. application level data orchestration

The best way to avoid an overrun is to have infrastructure that makes data orchestration and processing as efficient as possible. Data movement, for example, should be handled at the platform level rather than the application level. The HPE Data Fabric (formerly the MapR Data Platform) is a software solution that provides excellent data orchestration in addition to highly scalable data storage. We’ve seen the advantages afforded by handling data movement via native data fabric capabilities in industries as diverse as telecommunications, automotive manufacturing, oil and gas exploration, and the financial sector.

Even in extreme examples involving petabytes of data per day from multiple, geo-distributed data sources, the HPE Data Fabric can efficiently move data between edge and core and between multiple core data centers, on premises or in the cloud. The mirroring capability of the HPE Data Fabric does this by making atomically consistent copies of data appear at remote locations. Because this mirroring is incremental, it’s also fast and efficient. With HPE Data Fabric, mirroring can be set up to happen automatically on a pre-arranged schedule or done manually, on command. In addition to mirroring, data can also be moved efficiently using the event streams that are part of the HPE Data Fabric.

Use cases that are less extreme in terms of large scale and low latency requirements still need to move data efficiently between data centers or from edge sources (such as personal mobile devices) back to the core. Doing all this at the platform level using mirroring or event stream replication capabilities of the HPE Data Fabric frees up developers’ time and reduces the risk of human errors that ensue when data motion is handled manually or at the application level. 

Data versioning with snapshots

A serious potential time trap for AI/ML data logistics is the need for reproducibility of the entire training process. At its core, solving this requires data versioning, especially for training data but also for the raw data used to produce training data. Code versioning is standard practice, often taken care of using notebooks or GitHub repositories, but data versioning is just as important, and this need is all too often overlooked in early planning. It is essential to know exactly what data was used to train which model and how that data was produced, especially given that machine learning is an iterative process. Accurate data versioning can be a challenge given the massive scale of the data involved and the danger that it will be overwritten or “corrected” by someone else, making it useless for re-training or tuning models or for model-to-model comparisons. HPE Data Fabric removes this potential headache through convenient, truly point-in-time data snapshots.
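Whatever mechanism holds the data itself, it helps to record which data went with which model. The following is a minimal sketch of that bookkeeping using only the Python standard library. It does not call any HPE Data Fabric API; the snapshot name is just a label supplied by the caller, and the function names, model name, and file names are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file in chunks so large training files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path, model_name: str, snapshot_name: str) -> dict:
    """Record which files (and which named snapshot) a model was trained on."""
    files = {
        p.relative_to(data_dir).as_posix(): file_sha256(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }
    return {"model": model_name, "snapshot": snapshot_name, "files": files}


def data_unchanged(data_dir: Path, manifest: dict) -> bool:
    """Check that the data on disk still matches what the manifest recorded."""
    current = build_manifest(data_dir, manifest["model"], manifest["snapshot"])
    return current["files"] == manifest["files"]


# A temporary directory stands in for a training data directory.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "train.csv").write_text("x,y\n1,2\n")
    manifest = build_manifest(data_dir, "example-model-v1", "example-snapshot")
    print(json.dumps(manifest, indent=2))
    print(data_unchanged(data_dir, manifest))  # data is as recorded

    (data_dir / "train.csv").write_text("x,y\n1,3\n")  # someone "corrects" it
    print(data_unchanged(data_dir, manifest))  # manifest no longer matches
```

A manifest like this only detects that data has changed; it cannot bring the old version back. That is why it complements, rather than replaces, a point-in-time snapshot of the data.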

Efficient data collection and access

Another way to ensure efficiency for data logistics is to avoid unnecessary steps during data collection, data preparation for training, and data access by AI and machine learning models. Such steps waste time and effort and can make replication nearly impossible. Starting with data collection, HPE Data Fabric confers a huge time advantage. The distributed file system of the data fabric is unusual in that data from a wide range of sources can usually be written directly to the data fabric without the intermediate processing steps that are often required before landing data in other large-scale distributed systems. Being able to collect data directly improves the efficiency and productivity of teams working with data and thus protects the narrow time window allotted to the heart of machine learning and AI efforts.

Similarly, HPE Data Fabric makes it much easier for data engineers to prepare training data and for data scientists to access that data when they actually train, tune, and run AI/ML models. The challenge arises because the APIs that data engineers use to access large-scale data are typically different from the most efficient APIs for training models. That’s not a problem with the HPE Data Fabric. Tools for data preparation, including R, Python, and especially Apache Spark for very large data sets, can directly access the raw data that was collected into the HPE Data Fabric. And when it’s time for training models, AI and machine learning tools can use their preferred APIs to directly access data stored in the HPE Data Fabric, unlike other distributed storage systems (including HDFS-based platforms) that require copying data out to specialized data science platforms with their own data storage. With HPE Data Fabric, AI applications, analytics applications, and even legacy applications can be run directly on data in the data fabric. This unique capability requires both API compatibility and very high performance.

These advantages of the HPE Data Fabric for efficient machine learning logistics are documented by the experiences of companies already using it. A recent IDC study observed the impact of this technology for a group of customers using it across different sectors. (The name at the time of the study was “MapR Data Platform”; now it’s called HPE Data Fabric.) Key findings of this study, including substantially improved productivity for data science teams, are highlighted in the blog post How HPE Data Fabric (formerly MapR) maximizes the value of data. The full IDC report is also available.

A bonus: comprehensive view of data

In addition to efficient logistics, having large data sets collected together in the data fabric gives data scientists a comprehensive view of data, which is a big advantage for building AI and machine learning systems. In this way, HPE Data Fabric not only protects the precious time window of machine learning against overruns by logistical steps, it may also improve the accuracy and performance of the learning systems themselves.

Containerization of ML applications

Machine learning involves running multiple model applications simultaneously in specifically defined environments for model training and evaluation, model-to-model comparisons and models in production. Containerization of AI/ML applications makes this much more efficient and more feasible, as described in the article Containers as an Enabler of AI. In future blog posts, we’ll explore technologies that make a difference for AI/ML applications, including Kubernetes for orchestration of containerized applications and the HPE Container Platform that uses HPE Data Fabric as its data infrastructure.