Introduction
Machine learning (ML) and artificial intelligence (AI) are no longer just buzzwords. Many companies actively use ML models, and many of the services you use every day are built on them. The product offers recommended to you, the chatbots you talk to, and even your Google search results are based on ML models. This post is for you if you are just starting to put your models into production, or if you already have models in production and want to scale your results.
This blog post series has two parts. This first part introduces MLOps and focuses on the challenges of developing an ML model and deploying it to production, together with ways to address them. The second part explains different aspects of life cycle management of your ML models and concludes the series.
Disclaimer: This blog post is not meant to provide a technical guideline or a manual for implementing an MLOps pipeline in your organisation. Rather, it provides a basic introduction to common MLOps principles and discusses our experiences of putting ML models into production at SEB.
As mentioned, ML is no longer just a research topic; it has proven its value in different business areas. In addition, the growing community of ML practitioners has applied best practices from other computer science fields and thereby accelerated the development, implementation, and industrialisation of ML algorithms in tangible applications. Widely used programming languages such as Python and R offer actively maintained packages and libraries that support a wide range of ML algorithms, which makes the powerful ML toolkit available to everyone in this field.
While these advances make the development of ML models relatively fast and cheap, maintaining such systems in production over time is neither simple nor cheap. An ML model has both data and code as its main building blocks, and this makes maintenance complex: while the code may remain relatively static, the data is perpetually changing, so your ML model needs to be constantly monitored, updated, adapted and refined to ensure it remains relevant and fit for purpose. Your infrastructure must not only support this ongoing work but facilitate it, and your organisation must embrace it. The structure of the organisation and the complexity of its business may complicate the problem further.
In this series of blog posts, I explain how MLOps can help to overcome these complexities and difficulties. But first let me explain what I mean by MLOps by borrowing this definition from the book Introducing MLOps by Mark Treveil, et al.:
"MLOps is a process that helps organisations to generate long-term value and reduce risk associated with Data Science, Machine Learning and AI Initiatives."
From business need to first prototype
Let us now consider an ML project in a real-world scenario. It usually starts from a business question or need. Once there is a rough idea of the question to be answered, your data scientist and business expert formalise it as a mathematical problem. Next, the data scientist starts the data exploration and model development phase. In this phase, you usually need data engineers to onboard data to your data science platform, the place where your data scientists do their work with their preferred tools. The data scientist then uses the data and usually comes up with a first prototype, along with a model evaluation (i.e. statistics on how well the model performs on previously unseen data). Various models may be created, compared, and evaluated against the original goals. How a model is judged ultimately depends not only on the mathematics but also on the nature of the question and the goals of the person asking it. Iteration and discussion are key to this phase.
Once your ML model is ready, it is time to deploy it to production, where its results will be used by business experts to address the original concern or question. It is at this stage that a common complication appears. During development of the model, you usually use historical data and freely try different features to build the model. However, those features may not be available when the model is deployed (for instance, there may not be sufficient history, some data may be corrupted, the features may no longer be tracked, or they may be tracked differently). In that case, the final model selected in the previous phase cannot be used in production, and you need to go back and iterate further under the new data availability constraints. Therefore, it is very important to assess and validate data availability in production already during development.
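One lightweight way to validate data availability during development is to compare the features a model was trained on with what the production data source can actually deliver. The following is a minimal sketch in Python, not a description of any particular platform: the schema dictionaries, the feature names, and the helper check_feature_availability are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AvailabilityReport:
    missing: list[str]        # features the model needs but production lacks
    type_mismatch: list[str]  # features present in production with another type

    @property
    def ok(self) -> bool:
        return not self.missing and not self.type_mismatch


def check_feature_availability(
    training_schema: dict[str, str],
    production_schema: dict[str, str],
) -> AvailabilityReport:
    """Compare the features used in training with those served in production."""
    missing = sorted(f for f in training_schema if f not in production_schema)
    type_mismatch = sorted(
        f
        for f, dtype in training_schema.items()
        if f in production_schema and production_schema[f] != dtype
    )
    return AvailabilityReport(missing=missing, type_mismatch=type_mismatch)


# Hypothetical schemas: feature name -> type name.
training_schema = {"age": "int", "balance": "float", "segment": "str"}
production_schema = {"age": "int", "balance": "str"}

report = check_feature_availability(training_schema, production_schema)
print(report.ok, report.missing, report.type_mismatch)
```

Running a check like this each time a new feature is tried out surfaces availability problems before a model has been selected, rather than at deployment time.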
From development to production
Going from development to deployment, you should also consider other potential differences between the two environments. First, while data in development is static (historical data), in production you usually need to handle dynamic and shifting data. Second, while in development your main criterion for choosing a model might be high accuracy (model performance), in production you should consider other factors too, e.g. speed of inference. Third, in production you face a bigger challenge: an entire system has to work properly, whereas in model development your data scientists focus only on building an algorithm with good performance, not on an entire system flow. For example, the model may perform poorly for certain ranges of input, and you may have excluded these cases as ‘errors’ in development. In production, by contrast, you must always return an answer, so how do you handle the error cases? Having addressed these challenges and chosen a model that is suitable for production, you can continue to the deployment phase.
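One possible answer to the question of error cases is to wrap the model so that every request gets a response, with a flag saying where the response came from. The sketch below assumes a model that takes a single numeric input; the function make_safe_predictor, the valid range, and the fallback value are hypothetical and would have to be agreed with the business in a real system.

```python
import math
from typing import Callable


def make_safe_predictor(
    model: Callable[[float], float],
    valid_range: tuple[float, float],
    fallback: float,
) -> Callable[[float], tuple[float, str]]:
    """Wrap a model so that every call returns (value, source)."""
    low, high = valid_range

    def predict(x: float) -> tuple[float, str]:
        # Inputs outside the range seen in development get the fallback value.
        if math.isnan(x) or not (low <= x <= high):
            return fallback, "fallback:out_of_range"
        try:
            return model(x), "model"
        except Exception:
            # A real system would also log the exception for later analysis.
            return fallback, "fallback:model_error"

    return predict


# Hypothetical model that fails for x == 0.
predict = make_safe_predictor(lambda x: 1.0 / x, valid_range=(0.0, 10.0), fallback=-1.0)

print(predict(2.0))   # handled by the model
print(predict(50.0))  # outside the valid range
print(predict(0.0))   # model raises, fallback is returned
```

Returning the source alongside the value lets downstream consumers and monitoring distinguish genuine model outputs from fallbacks, which matters when judging how the model behaves on live data.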
The deployment phase usually combines data engineering, software engineering and DevOps skills to set up the environment and deploy an ML pipeline. If you are planning to have multiple ML models in production (which I hope you do!), you will face many issues similar to those in more standard software development. For example, you might need to manage dependencies, make updates, conduct testing, stay compliant, and ensure your models are continuously available. At this point, MLOps can borrow strongly from DevOps. Look into containerisation technology to help resolve dependencies, or automate your Continuous Integration and Continuous Delivery (CI/CD) pipeline to speed up the deployment process. Do not forget, however, that an ML system is built around a model and not just code, so be sure to check that all requirements are met: for code, for documentation, and for validation of model accuracy.
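The validation of model accuracy mentioned above can be automated as one step in a CI/CD pipeline. The following is a minimal sketch of such a release gate in plain Python; the function names, the thresholds min_accuracy and max_regression, and the example labels are assumptions made for illustration, and a real pipeline would use metrics that fit the business question.

```python
def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def passes_release_gate(
    candidate_acc: float,
    production_acc: float,
    min_accuracy: float = 0.8,     # hypothetical absolute floor
    max_regression: float = 0.01,  # hypothetical tolerated drop vs. production
) -> bool:
    """Allow deployment only if the candidate is good enough in absolute
    terms and not noticeably worse than the model currently in production."""
    return (
        candidate_acc >= min_accuracy
        and candidate_acc >= production_acc - max_regression
    )


# Hypothetical hold-out labels and candidate predictions.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_candidate = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

candidate_acc = accuracy(y_true, y_candidate)
print(candidate_acc, passes_release_gate(candidate_acc, production_acc=0.88))
```

In a pipeline, a failed gate would stop the deployment in the same way a failed unit test does, so that the code checks and the model checks are enforced together.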
Up to this point, I have talked about the development and deployment of ML models, and the differences between these two phases of operationalising ML models. You might think that deployment is the final step of operationalising ML models. However, it is just the beginning of the main part, so-called ML life cycle management. This will be the focus of the second part of this blog post series, where I will explain different aspects of life cycle management of an ML model and conclude the series.
I would like to thank Diane Reynolds, Manne Fagerlind, and Julatouch Pipatanangura for their valuable comments and suggestions that led to improvement of this blog post series.