First, we use our project’s main branch to store the problem definition, documentation, data description, and project structure. This serves as a space for collaboration and discussions.
TIP: Begin by clearly defining the business problem, the desired outcome, the target values or labels and how they are obtained, and the evaluation metrics and requirements. This sets the project up for a successful start and gives collaborators a place to onboard and discuss.
We can also use the main branch for experiment tracking, collecting the results of everyone's experiments in one place. For example, MLflow's mlruns folder can be merged there for this purpose, and any collaborator can check out the branch and run the UI.
Alternatively, the tracking can be done in another branch.
Starting this way is simple, and as needs change over time, it is possible to upgrade to an MLflow server or a tracking platform like Weights & Biases with minimal changes.
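For instance, here is a minimal sketch of this file-based setup with MLflow (the experiment name and logged values are placeholders):

```python
import mlflow

# Point MLflow at the mlruns folder that lives in the repository,
# so runs are versioned together with the rest of the branch.
mlflow.set_tracking_uri("file:./mlruns")
mlflow.set_experiment("baseline")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("model_type", "logistic_regression")
    mlflow.log_metric("val_accuracy", 0.87)
```

After the mlruns folder is merged to main, a collaborator can check out the branch and run `mlflow ui` from the repository root to browse all runs.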
The data branches mainly contain data files, documentation, and transformation scripts, and they remain active throughout the project. You can think of them like S3 buckets, except that instead of uploading and downloading, you check out a branch and your files are there.
It is recommended to always commit (upload) incoming data to the raw branch first. This creates a source of truth, a place that is never edited or deleted, so we can always trace where data comes from and where it flows. It also makes it easy to create new flows and enables auditing and governance.
💡 If you add a commit message stating where the data comes from, you get even more granular observability over your data.
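As a sketch, assuming plain Git as the storage layer (in practice, large files would usually go through a data-versioning tool such as DVC, lakeFS, or Git LFS), the `commit_to_raw` helper below is hypothetical:

```python
import shutil
import subprocess
from pathlib import Path

def commit_to_raw(src_path: str, repo_dir: str, source_note: str) -> None:
    """Copy a new data file into the repo and commit it to the raw branch,
    recording where the data came from in the commit message."""
    subprocess.run(["git", "checkout", "raw"], cwd=repo_dir, check=True)
    Path(repo_dir, "data").mkdir(exist_ok=True)
    dest = shutil.copy(src_path, f"{repo_dir}/data/")
    subprocess.run(["git", "add", dest], cwd=repo_dir, check=True)
    subprocess.run(
        ["git", "commit", "-m", f"raw: ingest {src_path} (source: {source_note})"],
        cwd=repo_dir,
        check=True,
    )

# Hypothetical usage; the file name and source URI are illustrative only.
commit_to_raw("batch_2024_06.csv", ".", "s3://vendor-bucket/exports/2024-06")
```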
You can add a clean branch that contains only clean data. For example, broken images or empty text files that were uploaded to the raw branch do not appear in the clean branch.
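A promotion filter from raw to clean might look like the sketch below, assuming a hypothetical data/ folder of images and text files (Pillow's verify() raises on truncated or corrupt images):

```python
from pathlib import Path
from PIL import Image

def is_clean(path: Path) -> bool:
    """Decide whether a raw file qualifies for the clean branch."""
    if path.stat().st_size == 0:                  # drop empty files
        return False
    if path.suffix.lower() in {".png", ".jpg", ".jpeg"}:
        try:
            with Image.open(path) as img:
                img.verify()                      # raises if the image is broken
        except Exception:
            return False
    return True

# Files that pass the check get copied and committed to the clean branch.
clean_files = [p for p in Path("data").rglob("*") if p.is_file() and is_clean(p)]
```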
A split branch, where the data is divided into training, validation, and test sets, ensures that all teams and collaborators work on the same playing field.
This approach helps prevent data leakage and enables more robust feature engineering and collaboration. Minimizing the chance of test-set examples slipping into the training stages reduces the risk of introducing bias, and having all collaborators work on the same split makes experimental results consistent and comparable.
In a past classification project, I was part of a team of individual contributors where each person ran the whole pipeline from scratch. Each of us used different splitting percentages and seeds, which led to weaker models in production due to bugs and data biases.
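A minimal sketch of creating such a split once, with one shared seed, so every collaborator trains on identical files (the paths and the 60/20/20 ratio are assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42  # a single shared seed, committed together with the split

df = pd.read_csv("data/clean.csv")  # hypothetical clean-branch file
train_val, test = train_test_split(df, test_size=0.2, random_state=SEED)
train, val = train_test_split(train_val, test_size=0.25, random_state=SEED)

# Written once to the split branch; everyone checks out the same files.
train.to_csv("data/train.csv", index=False)
val.to_csv("data/val.csv", index=False)
test.to_csv("data/test.csv", index=False)
```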
💡 ML tip: the three-phase model development best practice
We use the train and validation sets to train the model and tune its hyperparameters. We then use train plus validation as the training set for the tuned model and evaluate it on the test set exactly once. Lastly, we train the model on all the data and save it as our final model.
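Sketched with scikit-learn, reusing the split files from above and a hypothetical "label" column:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

train = pd.read_csv("data/train.csv")
val = pd.read_csv("data/val.csv")
test = pd.read_csv("data/test.csv")
features = [c for c in train.columns if c != "label"]

def fit(frames, c):
    df = pd.concat(frames)
    return LogisticRegression(C=c, max_iter=1000).fit(df[features], df["label"])

# Phase 1: tune hyperparameters on train, score on validation.
scores = {c: accuracy_score(val["label"], fit([train], c).predict(val[features]))
          for c in (0.01, 0.1, 1.0, 10.0)}
best_c = max(scores, key=scores.get)

# Phase 2: retrain on train + validation, touch the test set exactly once.
tuned = fit([train, val], best_c)
test_accuracy = accuracy_score(test["label"], tuned.predict(test[features]))

# Phase 3: train on all the data and keep this as the final model.
final_model = fit([train, val, test], best_c)
```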
The stable branches are active branches for training and inference. Here you run your training; save your model, checkpoints, and model card; run tests; build and test the Docker image; commit everything at the end of a training cycle; and then tag it. They should be able to handle retrieving new data and re-training. This is where the automation takes place.
⚠️ No code is written in these branches.
This ensures that a model is coupled with the data it was trained on, the code used to train and run it in production (including feature engineering), and the resulting metrics. All of these components are combined into a single unified “snapshot”. Whenever you check out a tag, all the necessary pieces for that model are present.
💡 Tip: By choosing the tag name ahead of time, you can log it as a parameter in the tracking info during training. This ensures you can always retrieve the model-data-code “snapshot” from the tracking data, using any tracking tool.
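For instance, with MLflow as the tracker and a timestamp-based tag name chosen up front (the naming scheme is only an assumption):

```python
import subprocess
from datetime import datetime, timezone

import mlflow

# Choose the tag name before training so it can be logged as a parameter.
tag = datetime.now(timezone.utc).strftime("model-%Y%m%d-%H%M%S")

mlflow.set_tracking_uri("file:./mlruns")
with mlflow.start_run():
    mlflow.log_param("git_tag", tag)  # links the tracking data to the snapshot
    # ... train, evaluate, save the model and metrics, commit everything ...

# Tag the commit that holds the model, data, and code together.
subprocess.run(["git", "tag", tag], check=True)
subprocess.run(["git", "push", "origin", tag], check=True)
```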
After training, only the tracking data is merged (copied) back to your main branch.
In the simplest case, this can be a JSON text file that contains the hyperparameters and evaluation results, appended to a list in the main branch. In the case of MLflow, it involves copying the experiments from the mlruns folder to the main branch.
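A minimal sketch of the JSON variant (the file name and record fields are assumptions):

```python
import json
from pathlib import Path

def append_run(record: dict, log_path: str = "runs.json") -> None:
    """Append one run's hyperparameters and metrics to the list kept on main."""
    path = Path(log_path)
    runs = json.loads(path.read_text()) if path.exists() else []
    runs.append(record)
    path.write_text(json.dumps(runs, indent=2))

append_run({
    "git_tag": "model-20240601-1200",  # hypothetical tag name
    "params": {"C": 1.0},
    "metrics": {"test_accuracy": 0.91},
})
```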
The development branches are for code development and data exploration, training on sampled or small data until you have a working program. While developing, you are welcome to use all Git best practices. However, only branch out to a stable branch when no further changes to the code are required, even if additional data will be pulled in. These branches should include the inference code, the server, the Dockerfile, and tests.
There is always at least one development branch that remains active, where all new features, bug fixes, and other changes are merged.
💡 ML and MLOps engineers can collaborate on the training and inference sides.
For example, you can create a dev/model branch where you develop a baseline model. This can be the most frequent class for classification or the mean/median for regression. The focus is on setting up the code while thoroughly understanding your data.
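With scikit-learn, such baselines are one-liners:

```python
from sklearn.dummy import DummyClassifier, DummyRegressor

X, y = [[0], [1], [2], [3]], [0, 0, 0, 1]  # toy data

# Classification baseline: always predict the most frequent class.
baseline_clf = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline_clf.predict([[5]]))  # -> [0], the majority class

# Regression baseline: always predict the training mean (or median).
baseline_reg = DummyRegressor(strategy="mean").fit(X, [1.0, 2.0, 3.0, 4.0])
print(baseline_reg.predict([[5]]))  # -> [2.5]
```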
When it’s stable and tests pass, we branch out to stable/model, where we train, commit the model, code, and data together to the remote, and tag that commit. That is fast and easy to share, and it enables the DevOps, backend, and frontend teams to start development and exchange feedback. It also makes it possible to validate newly discovered requirements in a real-world environment as early as possible.
Next, we advance the model on the dev/model branch to a simple model like linear regression, and when it’s ready and tests pass, we merge it to stable/model, where we train, commit, and tag a release to prod.
This approach gives you the freedom to incrementally improve your model while preserving the full context of previous models in the stable branch.
From this point, we have three options:
- We can re-train when more data arrives by pulling data to the stable branch.
- We can start experimentation using feature engineering on the dev/linear-regression branch.
- We can create a new dev/new-approach branch for more sophisticated models.
In model monitoring, we care about the input data distribution, outliers, and the distribution of predictions.
In the monitoring branch, we save the queried data, the commit tag, and the model predictions from prod as files.
💡 You can use a separate monitoring branch for each environment: dev, stable, and prod.
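One logged record might look like the sketch below, assuming a JSON-lines file per environment (the field names and paths are assumptions):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def log_prediction(features: dict, prediction, model_tag: str,
                   log_dir: str = "monitoring/prod") -> None:
    """Append one prod prediction as a JSON-lines record."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_tag": model_tag,  # ties the log line back to a snapshot
        "features": features,
        "prediction": prediction,
    }
    path = Path(log_dir) / "predictions.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")
```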
We can set alerts on data commits that test for drift in feature distributions, outlier values, and calibration sanity, and we can save the alerting code in the branch as well. This enables more advanced solutions, such as an outlier detection model, since we can save that model in this branch too.
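For example, a drift alert on a single feature could be a two-sample Kolmogorov-Smirnov test comparing the training distribution against the logged prod data (the feature name "age" and the paths are assumptions):

```python
import pandas as pd
from scipy.stats import ks_2samp

def drift_alert(train_col: pd.Series, prod_col: pd.Series,
                p_threshold: float = 0.01) -> bool:
    """Flag drift when the two samples are unlikely to share a distribution."""
    _, p_value = ks_2samp(train_col, prod_col)
    return p_value < p_threshold

train = pd.read_csv("data/train.csv")
prod = pd.read_json("monitoring/prod/predictions.jsonl", lines=True)
prod_age = prod["features"].apply(lambda f: f["age"])  # flatten logged dicts

if drift_alert(train["age"], prod_age):
    print("ALERT: drift detected in feature 'age'")
```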
The monitoring branch could typically belong to another project, decoupled from the code responsible for creating the monitoring logs and from the data and model that generated them.
Data science and analysis is another aspect that is often split out into a separate project. This is where the analysis code and the data scientists’ non-training data are gathered.
A data scientist can check out and pull data from the monitoring branch to run analysis, A/B tests, and other online and offline experiments. They can also use data from the raw branch for these purposes.
Online experiments are simpler to organize, as each experiment group corresponds to a branch.
💡 Tip: Common online experiments:
- Forward test: comparing the current model on 99% of traffic vs. a candidate model on 1%.
- Backtest: after merging a new model, keep 1% of traffic on the former model to validate the expected effect in reverse.
Having the model tag as a parameter in the monitoring data helps you pinpoint the potential cause of every change in a metric.
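Here is a sketch of the routing side of a forward test, using deterministic hashing so a user always sees the same model, with the model tag attached to whatever gets logged (the user ids and tag names are hypothetical):

```python
import hashlib

def assign_group(user_id: str, candidate_share: float = 0.01) -> str:
    """Deterministically route ~1% of users to the candidate model."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "candidate" if bucket < candidate_share * 10_000 else "current"

group = assign_group("user-123")
model_tag = {"current": "model-20240601-1200",
             "candidate": "model-20240615-0900"}[group]
# Serve the prediction from the model behind model_tag and log the tag
# with the monitoring record, so metric shifts trace back to a snapshot.
```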