In my recent exploration of MLOps tools, I came across MindsDB, a user-friendly system designed for people who are not machine-learning experts. In this blog post, I will delve into how MindsDB operates and discuss its merits and limitations.
MindsDB is an open-source automated machine learning (AutoML) tool that simplifies the implementation and usage of AI models. With a focus on empowering users with little to no AI expertise, MindsDB utilizes automated data analysis and ML algorithms to generate predictive models.
How is it done?
Users can create models and utilize them using the SQL language in a manner quite similar to creating tables. For instance, creating a model to predict home rental prices is as straightforward as the SQL code showcased in the MindsDB quick start:
CREATE MODEL mindsdb.home_rentals_model
FROM example_data
  (SELECT * FROM demo_data.home_rentals)
PREDICT rental_price;
Using that model is as easy as submitting a SELECT query:
SELECT rental_price
FROM mindsdb.home_rentals_model
WHERE number_of_bathrooms = 2
  AND sqft = 1000;
The system’s beauty lies in its ability to treat the model as a table, offering exceptional simplicity. In fact, you can even perform batch predictions by joining the model with a table that contains the data you desire to predict:
SELECT m.rental_price, m.rental_price_explain
FROM mindsdb.home_rentals_model AS m
JOIN example_data.demo_data.home_rentals_to_predict AS d;
There is no need to explicitly specify a join column: the model looks for the columns it was trained on and makes predictions based on them, disregarding any other columns in the table. This batch prediction mechanism is well-suited for production integration. You can easily prepare a temporary table containing the data to predict, execute and save the predictions, and then delete the temporary table.
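As a sketch of that production flow (the staging table, source table, and `id` column here are hypothetical; the model and datasource names come from the earlier example):

```sql
-- Hypothetical staging table holding the rows we want predictions for.
CREATE TABLE example_data.rentals_tmp
  (SELECT * FROM example_data.new_listings);

-- Join the model to the staging table and persist the predictions.
CREATE TABLE example_data.rental_predictions
  (SELECT d.id, m.rental_price, m.rental_price_explain
   FROM mindsdb.home_rentals_model AS m
   JOIN example_data.rentals_tmp AS d);

-- Clean up once the predictions are saved.
DROP TABLE example_data.rentals_tmp;
```

The exact `CREATE TABLE datasource.table (SELECT ...)` form follows MindsDB's SQL syntax for writing back to a connected datasource; adapt it to whatever your integration supports.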
The core of the MindsDB system lies in the AI tables, which serve as virtual database tables representing AI models. This unique feature enables users to seamlessly interact with the models as if they were regular tables, whether it’s for individual predictions or batch processing (as demonstrated earlier using JOIN). I had previously come across a similar concept while reviewing BigQuery ML from Google Cloud. However, there are two key distinctions between BigQuery ML and MindsDB:
- BigQuery ML is limited to data stored in BigQuery, including data from external sources connected to BigQuery such as Google Cloud Storage files (an object store similar to S3) and Google Sheets. In contrast, MindsDB is designed to be cloud-agnostic, allowing it to operate on databases from various vendors. This flexibility makes MindsDB an ideal choice for scenarios where avoiding vendor lock-in is a priority.
- Both solutions offer support for an impressive range of ML engines. However, in my experience, the importance of selecting the right engine is often overstated in the industry. I have consistently found that proper feature selection and transformation have a greater impact on the results than hyperparameter tuning or the choice of model engine.
MindsDB utilizes data handlers to establish connections with various data sources, allowing it to efficiently read data for training or batch prediction purposes. The team at MindsDB has done an exceptional job in terms of data source integration. They have comprehensive support for most of the well-known databases and offer both an API and their own services for cases where a particular database is not supported out of the box.
By providing the vendor-specific connection details, you can establish connections, retrieve information, and write the predictions back to any database.
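For example, connecting a PostgreSQL database follows MindsDB's `CREATE DATABASE` syntax; the connection name and credentials below are placeholders:

```sql
-- Register a PostgreSQL datasource; "my_postgres" and the
-- connection details are hypothetical placeholders.
CREATE DATABASE my_postgres
WITH ENGINE = 'postgres',
PARAMETERS = {
  "host": "db.example.com",
  "port": 5432,
  "database": "rentals",
  "user": "mindsdb_user",
  "password": "secret"
};
```

Once registered, the datasource's tables can be queried, used for training, and written back to, all through the same SQL interface.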
MindsDB supports a wide range of ML engines for training and prediction, including AutoML, LLM, and time-series engines. Users can also bring their own model by serving it with MLflow or Ray Serve, or by uploading it directly through the BYOM (bring your own model) engine, and connecting it to MindsDB.
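The engine is chosen at model-creation time with a `USING` clause. A sketch, reusing the earlier rental example (the engine name matches MindsDB's default AutoML engine; the datasource name is the one assumed above):

```sql
CREATE MODEL mindsdb.home_rentals_model
FROM example_data
  (SELECT * FROM demo_data.home_rentals)
PREDICT rental_price
USING engine = 'lightwood';  -- MindsDB's default AutoML engine
```

Swapping `lightwood` for another registered engine is all it takes to change the underlying model family.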
When considering an ML system for production and large scale modeling, it is essential to pay attention to various areas.
Managed or Self-Hosted?
MindsDB offers two operational options: managed service and self-hosted. I will discuss both of these options in the upcoming sections.
For the managed service option, getting started is as simple as creating an account. The only requirement is to provide the database details and ensure that your database is accessible to the MindsDB servers.
For the self-hosted option, I discovered documentation on how to install MindsDB on any machine, whether it be local or on the cloud (AWS and GCP). The Docker option didn’t work well on my M1 Mac, but the Python option worked well. There is also a page that references articles about installing MindsDB on other platforms.
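For reference, the pip-based local setup I used boils down to a couple of commands (a recent Python version is assumed; the Docker image is the official one, and Apple Silicon users may need a platform flag):

```shell
# Install and start MindsDB locally via pip
pip install mindsdb
python -m mindsdb

# Or run the official Docker image, exposing the default HTTP port
docker run -p 47334:47334 mindsdb/mindsdb
```

After startup, the web editor is served on port 47334 by default.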
When evaluating an ML system, scaling is a significant concern. Efficiently handling scalability ensures optimal response times for users while also being cost-effective. Typically, the following factors are taken into consideration when contemplating scalability:
- Heterogeneous resource requirements — in real-life use cases, different models require different resources. For example, one model may require 4 cores and 16GB of RAM, while another needs 64 cores and 256GB of RAM; some models may also utilize GPU acceleration.
- Allocating separate resources for training and prediction — to guarantee low latency during prediction, it is advisable to allocate distinct resources for prediction. This way, even if the system is engaged in training large models, the prediction process can remain fast and responsive.
- Cost efficiency — spin up machines only when necessary. For example, when a training request is made, dedicated machine(s) should be spun up to fulfill the request, and then torn down once the task is complete. This approach ensures that machines are only active when they are truly needed, minimizing unnecessary costs.
When evaluating scaling options for the managed service, the documentation provides the following statement:
Whether in your private cloud or using MindsDB’s managed service, MindsDB enables you to handle large-scale AI/ML workloads efficiently. MindsDB can scale to meet the demands of your use case, ensuring optimal performance and responsiveness.
How the managed service implements these scaling concerns, and whether it properly separates resources for training and prediction, may require further investigation. Reaching out to MindsDB for specifics on how these concerns are addressed is recommended.
In the self-hosted option, it appears that the documentation for MindsDB lacks information regarding scaling. The installation instructions provided only cover setting up the software on a single instance, where the available resources are used for both model training and prediction purposes. It is important to note that without proper scaling mechanisms, such as dynamically allocating resources based on demand, the scalability and efficiency of the self-hosted solution may be limited.
An ML system must also provide easy and efficient ways to communicate and interact with it. A user-friendly dashboard that displays all models and lets users manage re-training or batch prediction settings is important. In addition, a programmatic interface or API should be available for querying and operating the system, providing flexibility and facilitating integration with other systems or workflows. This ensures that programmers have multiple options to interact with the ML system based on their specific needs and preferences.
In MindsDB, both the managed service and the self-hosted option offer several ways to communicate and interact with the system, including a SQL editor and HTTP APIs. The documentation for the different API options is excellent, and working with them appears straightforward.
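As a minimal sketch of programmatic access, the snippet below submits a SQL statement to a MindsDB instance over its HTTP API. The `/api/sql/query` endpoint and JSON body shape follow MindsDB's HTTP API documentation; the host and port are assumptions for a default local self-hosted install.

```python
import json
from urllib import request

# Default local endpoint for a self-hosted install (assumption).
MINDSDB_URL = "http://127.0.0.1:47334/api/sql/query"

def build_payload(sql: str) -> bytes:
    """Encode a SQL statement as the JSON body MindsDB's query endpoint expects."""
    return json.dumps({"query": sql}).encode("utf-8")

def run_query(sql: str) -> dict:
    """POST the statement to MindsDB and return the decoded JSON response."""
    req = request.Request(
        MINDSDB_URL,
        data=build_payload(sql),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

With a server running, `run_query("SELECT rental_price FROM mindsdb.home_rentals_model WHERE sqft = 1000;")` would return the prediction as JSON; the same helper works for `CREATE MODEL` and other statements.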
Predictions are not the end of the road
When considering any ML system, it is important to keep in mind that the act of predicting data is just one aspect of a comprehensive ML solution. Business experts play a crucial role in determining the actions to be taken based on those predictions and how they can effectively impact the operation of their business. ML Engineers are responsible for designing the architecture and infrastructure necessary to integrate these predictions into the broader business system.
As a low-code system, MindsDB simplifies the process of creating and utilizing models. However, the drawback of hiding complexity is the potential for limited flexibility. While users have the ability to specify options for model usage, they are unable to directly influence the data transformation prior to its ingestion by the model. In such cases, reliance on the system to make appropriate decisions on behalf of users becomes necessary. This raises questions about the practicality of low-code systems in real-world scenarios where high-stakes financial implications are involved.
MindsDB presents an intriguing approach to model creation and utilization. It effectively conceals a significant amount of complexity, enabling users who lack expertise in ML to effectively harness models for solving business problems. However, it is important to note that the self-hosted option may not be suitable for production purposes due to the absence of scaling capabilities. As a result, companies are often left with no choice but to opt for the managed service.
The question that looms above all others is whether low-code systems should be adopted at all, given that they may deny users the flexibility needed for important business use cases.