The Machine Learning Lifecycle: An End-To-End Look
Artificial intelligence (AI) has taken the world by storm. Individuals are applying publicly available AIs to aid them in work tasks and organizations are increasing their investment in developing and then running AIs. For most organizations, that means they are working with the machine learning (ML) aspects of AI. This article overviews the machine learning lifecycle, from beginning to end (and back again).
The Machine Learning Lifecycle: From Training to Inferences
The AI literature is either incredibly dense regarding how to build AIs or incredibly sparse, depending on what you’re looking for. When it comes to the ML lifecycle, it seems to lean towards the sparse end of the spectrum. Figure 1 depicts the typical level of discourse I see in much of the literature: You train your models, deploy them into production, where they are used to make inferences/predictions, and any learnings are fed back into the model to improve its quality. This is a great overview of the ML lifecycle, but it’s very high-level, as you can see. Let’s flesh it out a bit.
Figure 1. A very high-level machine learning lifecycle (click to enlarge).
The Machine Learning Lifecycle: End-to-End Overview
A more robust machine learning lifecycle is presented in Figure 2. This version depicts the full system development lifecycle (SDLC) for ML with the following phases:
- Initiate. During this phase, which hopefully lasts a few days to a week or two, your aim is to develop a vision for what you hope to achieve and how you hope to do so. You will perform typical initiation activities, such as forming the initial team, initial modelling, initial estimation, and initial planning. Note the use of the word “initial” everywhere. In PMI’s Disciplined Agile (DA) toolkit, the Initiate phase is called Inception, and in Scrum, it is called “Sprint 0” (although some in the Scrum community struggle to recognize this).
- Construction. During this phase, you iteratively and collaboratively build your ML model. This is likely to take weeks or even months to develop the first release of your model. Over time, you may find that changes in your operational environment require you to frequently update and deploy new versions of your model(s). This phase is often called Training by AI practitioners and Development or Programming by non-AI developers.
- Deploy. This phase, better thought of as an activity, encompasses the efforts required to release your model(s) into production. I say activity because you should strive to fully automate the deployment of your model(s) via standard Data Ops techniques. This phase is sometimes referred to as Release and in the DA toolkit as Transition.
- Production. This is the period where your end users work with your AI, and you, in turn, perform the usual operations and support of that running solution. This phase is often called Inference by AI practitioners, Operations and Support (O&S), or simply Operations in IT methods.
- Retirement. All good things must come to an end. During this phase, you remove the AI from your production environment. This phase is also referred to as Decommissioning.
Figure 2. Overview of the machine learning lifecycle.
On the surface, this lifecycle may appear to be serial in nature, an approach sometimes referred to as waterfall or even “predictive”, but nothing could be further from the truth. This becomes apparent when going to a higher level of detail.
The Machine Learning Lifecycle: A More Detailed Look
A more detailed version of the machine learning lifecycle is depicted in Figure 3. A few interesting points about this rendering of the ML lifecycle:
- It depicts only key activities. Key activities, such as Explore Usage and Make Inferences, are explicitly called out. However, many other activities – such as building your team, obtaining funding, and reporting status – are not for the sake of simplicity. The key activities are described below.
- It’s a hybrid. ML initiatives require a hybrid strategy that embodies concepts from serial, agile, and lean ways of thinking (WoT). The serial aspect is evident in the lifecycle’s named phases described earlier. Agile is reflected in the iterative nature of activities within the phases, ideally performed collaboratively. Lean strategies should be applied to ensure the effective flow of work.
- Popular agile frameworks don’t map well to it. Frameworks such as Scrum and SAFe struggle to admit the need for initiation, and don’t address data activities well (not at all in the case of Scrum). Bottom line is that you need to choose your way of working (WoW).
Figure 3. A detailed machine learning lifecycle (click to enlarge).
Figure 3 calls out the following key activities:
- Explore usage. How will the end users of your AI work with it? Usage is often explored through a use case or user story, but frankly, there are many modelling options available to you. The key issue is to understand how your end users will interact with your solution. The goal at this point is to gain a high-level understanding of what you hope to accomplish; the details will evolve as your stakeholders learn during Construction.
- Explore data. What data will be required to train your model? Early in the initiative, you want to determine whether you have the data you need and its level of quality. If you don’t have some of the data you need, you want to identify a strategy to obtain it, or worst case, to create it yourself.
- Explore viability. AI initiatives suffer from a very high failure rate, roughly 85% fail. This implies that you want to regularly ask whether your initiative is technically, economically, and operationally viable. Given that the answer is likely no, you either need to get the initiative back on track or cancel it.
- Choose a training strategy. Given the nature of the opportunity that you’re hoping to address, you will need to develop an overall training strategy. This includes identifying one or more candidate algorithms, platform choices, testing strategies, and perhaps initial deployment choices (see below).
- Prepare data. Data preparation often proves an onerous task, involving data analysis, feature selection, and, more often than not, significant data cleansing. This activity is often called data wrangling.
- Evolve requirements. As your construction/training efforts progress, you will start to identify what is possible to do, what isn’t possible, and what might be possible if only the data were sufficient. As a result, your stakeholders will be motivated to evolve their requirements to reflect these new learnings.
- Train the model(s). This is mostly an automated activity, except for data labelling. Depending on your environment, you may find that you’re really training several models, not just one. For example, imagine an AI built to detect vehicle insurance fraud. The company may discover that it needs a collection of models to do that, perhaps one for each distinct territory that the insurance company operates in. The issue is that different laws apply in each territory, different economic environments exist, and different cultural norms. These factors, and others, lead to very different claim behaviours among customers.
- Validate the model. This is where you test your model, rating its effectiveness and actively exploring potential biases (there will be biases). See my blog Bias and Machine Learning: 7 Strategies for Better AI to better understand this model bias.
- Deploy. There are several deployment architectures to choose from. Will your model run in a centralized manner, either on the cloud or on premisesw Will it run on edge devices such as phones? Or will it run as a combination of those strategies? Regardless, you want to automate your deployment efforts appropriately, especially if you will often deploy model updates. My philosophy is that deployment should be automated to the extent that it is effectively risk-free, cost-free, and frictionless. In short, deployment done right is a non-event.
- Make inferences. Yay, this is the fun stuff! Your end users work with your AI solution to create value through effective (I hope!) inferences/predictions.
Other Renderings of the Machine Learning Lifecycle
Figure 4 depicts a DataOps version of the ML lifecycle, with the activities of Figure 3 mapped onto it. Many people prefer this sort of rendering because it reflects the continuous delivery, DevOps style thinking. That’s great once your team is up and running, but it unfortunately downplays the initiation/envisioning effort at the beginning that is required for AI/ML initiatives to be successful. When I’m coaching teams, I usually tell them we’re starting with the lifecycle in Figure 3, but want to evolve to the lifecycle in Figure 4. Figure 3 is more palatable to senior leadership, the people who are paying the bills. Figure 4 is more palatable to the people doing the actual work. I find it best to keep both groups happy.
Figure 4. A DataOps version (click to enlarge).
CRISP-DM is a data mining methodology that has an iterative lifecycle. Some people have modified it, as shown in Figure 5, to address the machine learning lifecycle. Personally, I like it, although not as much as what you see in Figure 3. Choose what works for you.
Figure 5. A CRISP-DM-based version (click to enlarge).
Where To Go To Learn More
If you’re looking for details regarding AI then look no further than the 1,000+ page Artificial Intelligence: A Modern Approach by Russell and Norvig. This book is an incredible technical textbook that is a must-read for anyone who is serious about learning about how to build AIs. I have also found the O’Reilly series of machine learning books to be great resources for anyone implementing ML models.
If you’re interested in software development lifecycles, you will find The Agile System Development Lifecycle (SDLC), The Scrum Lifecycles: Construction, Project, and Solution, and The Continuous Delivery Software Lifecycle to be of value.
Let Me Help
If you are about to embark on an AI initiative, or have run aground on data quality issues, and would like the advice of an unbiased outsider, you should consider reaching out to see if I’m currently available. In the meantime, you may find my artificial intelligence blog postings insightful.

