In recent years, Data Science (DS) has gained immense popularity and found applications in almost every industry. Despite that, many organisations have not yet been able to realise its full potential. This blog focuses on the different aspects of building successful data-driven solutions, which go well beyond training an ML model. In my opinion, it is as much an art as it is a science. We'll focus on integrating business knowledge with ML models, selecting the right data, converting a business problem into a technical problem, identifying the right set of features and algorithms, and the importance of model explainability and its possible solutions. We'll also cover the overall data science life cycle, including model maintenance, deployment, handling data drift, and rapid prototyping and serving of DS solutions.
The flow below represents what a typical data science life cycle looks like:
Every step (refer to fig. a) is vast in itself and will be covered in upcoming blogs. Here, we will focus mainly on the aspects of DS that we tend to overlook, but which are as important as training a model or using a pretrained one. In short, a high-level understanding of the aspects that go beyond ML models.
It’s very important to understand how an ML model can be made useful. There are a few aspects to it, as mentioned below and explained by the figure (refer to fig. b):
- Business aspect
- Data aspect
- Modeling aspect
- Engineering aspect
Let’s understand the Business aspect first. It involves addressing all the business-related questions at the start of any DS problem. Some of them are listed below:
i. Converting a business problem into an ML problem: To achieve this, we need to gather the complete requirements from the business and understand the problem in detail by asking questions as precisely as possible. We need to understand what the business really wants to gain from the data, how the analysis will be used, and how it would benefit the business.
ii. Right metric selection: It’s very critical to select the right metric for performance measurement. For example, we may want to use a confusion matrix to assess overall performance, while at other times we might need the F1 score to strike a balance between precision and recall. The choice also depends a lot on the type of data at hand (balanced or imbalanced), the type of classifier (binary or multi-class), and sometimes on the sort of ML problem we are trying to solve (regression, classification, etc.).
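As a minimal sketch of why this matters (using scikit-learn on hypothetical labels for an imbalanced binary problem), accuracy can look deceptively good while the F1 score exposes weak minority-class performance:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hypothetical imbalanced labels: 18 negatives, 2 positives
y_true = [0] * 18 + [1] * 2
y_pred = [0] * 19 + [1] * 1  # model misses one of the two positives

print(accuracy_score(y_true, y_pred))    # 0.95 -- looks great
print(f1_score(y_true, y_pred))          # ~0.67 -- reveals the missed positive
print(confusion_matrix(y_true, y_pred))  # [[18, 0], [1, 1]]
```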
The second important aspect is Data. This is the key step in the data science life cycle (DSLC).
i) Right data to analyse: The right data plays an important role in achieving what we desire. Most of the time, we lack domain knowledge, so it’s very important to involve domain experts from the start and take inputs from subject matter experts. Validating the data against source systems and performing EDA to understand and check the data summary, distribution, and other statistics should be an unavoidable step.
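A minimal EDA sketch with pandas might look like this (the file name and the `region` column are hypothetical):

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical dataset

df.info()                # column types and non-null counts
print(df.describe())     # summary statistics for numeric columns
print(df.isna().mean())  # fraction of missing values per column
print(df["region"].value_counts(normalize=True))  # distribution of an assumed categorical column
```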
ii) Checking data quality & data preparation: Data sanity checks are needed right at the beginning. There are Python libraries like deepchecks that can be used for this purpose, and sometimes ML itself can be used for the same. Some of the checks are: duplicated records, redundant features, data standardisation, units of analysis, assessing relevance, detecting anomalies, and filling data gaps. At the same time, we need to know the cost of bad data.
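A few of these checks are easy to sketch directly in pandas (the dataset, the 3-sigma outlier rule, and the median imputation below are illustrative assumptions, not universal choices):

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical dataset

# Duplicated records
print("duplicate rows:", df.duplicated().sum())

# Redundant features: constant columns carry no signal
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
print("constant columns:", constant_cols)

# Naive anomaly check: values more than 3 standard deviations from the mean
num = df.select_dtypes("number")
print(((num - num.mean()).abs() > 3 * num.std()).sum())

# Filling data gaps with the median (the right strategy is domain-specific)
df = df.fillna(num.median())
```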
iii) Selecting relevant features: I have observed that model performance can often be improved simply by using a few relevant features and dropping the unnecessary ones that lead to high dimensionality. Features that are highly correlated with each other, or that have almost the same impact on the dependent or target variable, should also be removed. Another advantage of keeping only relevant features is the reduced computational cost of modelling.
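As an illustration, here is a minimal sketch that drops one feature out of every highly correlated pair (the 0.9 threshold is an assumption, not a universal rule):

```python
import numpy as np
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one column from every pair whose absolute correlation exceeds the threshold."""
    corr = df.corr(numeric_only=True).abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Usage: reduced = drop_correlated(df)
```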
Let’s also briefly talk about the third and very important aspect, i.e. modeling. People often have the misconception that this is the only step involved in a data science problem, which is not really the case, as we have already seen above and will also see after this.
i) Model selection & performance evaluation: A model should be selected keeping in mind the following pointers (refer to fig. c).
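In practice, a simple way to compare candidate models on the metric agreed with the business is cross-validation. A minimal scikit-learn sketch, using generated stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)  # stand-in data

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
}

# Compare candidates on the metric chosen earlier (F1 here)
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```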
ii) Interpreting the results: It’s important to explain the predictions of our model. Explainability indicates the relative importance of each feature in a prediction and, at the same time, helps build trust in the model. A model should be fair and reliable. There are multiple tools for this, such as LIME and SHAP.
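A minimal SHAP sketch for a tree-based model (continuing with stand-in data; the exact API may differ slightly across SHAP versions):

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=42)  # stand-in data
model = RandomForestClassifier(random_state=42).fit(X, y)

# TreeExplainer computes per-feature contributions to each prediction
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Global view: which features drive the model's predictions overall
shap.summary_plot(shap_values, X)
```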
Last but not least is the Engineering aspect. It deals with managing the deployment of the analytical model, model maintenance, model monitoring, and rapid prototyping.
i) Managing deployment: It is important to deploy a model in order to make practical business decisions based on real data and to make the model available to users. Based on the initial requirements discussed with the business, we may need to deploy the model as a web service, an embedded service, or an on-demand batch prediction service. There are tools available for end-to-end deployment like MLflow, Kubeflow, AWS SageMaker, etc.
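As a minimal sketch with MLflow (the run name and logged parameter below are illustrative), a trained scikit-learn model can be tracked and stored for later serving:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=42)  # stand-in data
model = LogisticRegression(max_iter=1000).fit(X, y)

with mlflow.start_run(run_name="demo-run"):  # illustrative run name
    mlflow.log_param("max_iter", 1000)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")  # stores the model as a run artifact
```

The logged model can then be served as a local REST endpoint with `mlflow models serve -m runs:/<run_id>/model`.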
ii) Model maintenance & monitoring: Once the model is deployed, it needs maintenance and monitoring. Model monitoring is essential to capture data drift, control bias, monitor performance shifts, maintain data integrity, and ensure fairness. Some of the tools that can be used for this are Grafana & Prometheus, Amazon SageMaker Model Monitor, NannyML, etc.
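A back-of-the-envelope drift check (one technique among many; the tools above are far more complete) is to compare a feature's training distribution against live data with a two-sample Kolmogorov-Smirnov test:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)  # distribution at training time
live_feature = rng.normal(loc=0.5, scale=1.0, size=1000)   # simulated drifted production data

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.05:  # illustrative significance threshold
    print(f"Possible data drift (KS statistic={stat:.3f}, p={p_value:.4f})")
```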
iii) Rapid prototyping: To create a minimum viable product in a short time, we can make use of tools like PyCaret, H2O AutoML, Vertex AI, Lobe, Microsoft Power Apps, and many more.
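With PyCaret, for example, a baseline classifier can be prototyped in a few lines. A sketch assuming PyCaret 3.x and one of its bundled sample datasets (the dataset and target names here are taken from PyCaret's tutorials and may vary by version):

```python
from pycaret.datasets import get_data
from pycaret.classification import compare_models, finalize_model, setup

data = get_data("juice")                            # sample dataset shipped with PyCaret
s = setup(data, target="Purchase", session_id=42)   # declares the experiment and target column

best = compare_models()       # trains and ranks a library of candidate models
final = finalize_model(best)  # refits the best model on the full dataset
```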
Apart from everything discussed above, identifying the right set of tools is crucial. For example, if we are handling Big Data, we know its characteristics, i.e. volume, variety, and velocity. Based on the type of data, we need to select the tools, for instance:
- Hadoop, Hive, SQL, etc., for high-volume data.
- NoSQL databases, SQL, AWS RDS, etc., for high-variety data.
- Kafka, Apache Storm, Amazon Kinesis, Flink, etc., for high-velocity data.
For data science work, we can select from a variety of available options. Some of them are listed below:
- Reporting and business intelligence: Power BI, MicroStrategy, Google Analytics, Tableau, etc.
- Predictive modeling, analysis, ML & AI: Python, R, Spark, Julia, Jupyter, PyTorch, Keras, TensorFlow, etc.
Some of the key aspects of building a successful AI solution that we have seen through this blog are business, data, modelling, and engineering. DS is as much an art as it is a science: it provides knowledge obtained by the scientific method, yet it is also an expression of creative skill and imagination.