Data Science Methodology
Data science methodology is one of the most important subjects for any data scientist to know. I got stuck on it many times, and I kept wondering how the data science cycle actually runs and how big companies design their data science methodology. My search ended when I found a course on the topic at Rakamin Academy.
It is one of the best methodologies for converting a business problem into a data science solution, and it covers the whole project cycle. This article collects what I learned from the course. After reading it, you will know how to convert business problems into data science based solutions.
Outline of this article
I started the data science course at Rakamin Academy to learn how to solve data science problems with Python and to get a basic understanding of data science methodology.
10 Steps of Data Science Methodology
- Business understanding
- Analytic approach
- Data requirements
- Data collection
- Data understanding
- Data preparation
- Modeling
- Evaluation
- Deployment
- Feedback
1. Business understanding
Before solving any problem in the business domain, you need to understand it adequately. Business understanding forms a concrete base: it makes questions easier to resolve and makes clear exactly which problem we will solve. Identifying and stating the business problem clearly is the most crucial step in any data science project. This step sets the objectives and guides the rest of the project and the team.
To build that understanding, data scientists must ask what problem the business is trying to solve and how solving it will affect business objectives.
Some steps that help ensure this:
- Establish a clearly defined data science problem by asking stakeholders or business leaders clearly defined questions about the business objective and value creation.
- Involve people with both business and data understanding in defining the problem.
- Leadership should allow time for a rigorous definition of the problem.
- Analyze the problem in terms of data complexity, data availability, and data liability.
- Aim to define the problem clearly and to solve it in a way that benefits the business.
2. Analytic approach
Once you have a solid business understanding, you know what kind of problem you are trying to solve. The analytic approach is the step where you decide how to use data to answer the questions identified in the previous step.
Based on your business understanding, there are generally five types of analytic approaches you can use.
- Descriptive approach: If the question is to show the current status based on retrospective information, for example by using statistical analysis to show relationships or by tracking specific key performance indicators with a business intelligence tool.
- Predictive approach: If the question is to determine the probability of future actions based on retrospective information.
- Prescriptive approach: If the question is to determine an optimal course of action using the data.
- Diagnostic approach: If the question is to understand why a particular change or event happened in the data. The diagnostic analytics approach generally uses data discovery, drill-down, mining, and correlation techniques. For example, diagnostic analytics can help companies answer questions such as:
  - Why did our company sales decrease compared to the previous year?
  - Why is user engagement down compared to the previous month?
  - Why did demand for a specific product category increase compared to the previous year?
- Cognitive approach: You can think of the cognitive analytics approach as analytics with human-like intelligence. Cognitive analytics reveals specific patterns and connections that simple analytics cannot. Given a large amount of data, this approach can include understanding the context and meaning of a sentence, or visually recognizing the content of an image or video.
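To make the diagnostic approach a bit more concrete, here is a minimal sketch of a drill-down: it breaks a year-over-year sales change into per-category changes, so you can see which category drives the drop. The sales figures, category names, and the function `yoy_change_by_category` are all made up for this illustration.

```python
from collections import defaultdict

# Hypothetical sales records: (year, category, revenue). Made-up numbers.
sales = [
    (2022, "electronics", 120_000), (2022, "fashion", 80_000), (2022, "grocery", 50_000),
    (2023, "electronics", 90_000), (2023, "fashion", 82_000), (2023, "grocery", 55_000),
]

def yoy_change_by_category(records, prev_year, curr_year):
    """Return {category: revenue change from prev_year to curr_year}."""
    totals = defaultdict(lambda: defaultdict(int))
    for year, category, revenue in records:
        totals[category][year] += revenue
    return {
        category: by_year[curr_year] - by_year[prev_year]
        for category, by_year in totals.items()
    }

changes = yoy_change_by_category(sales, 2022, 2023)
total_change = sum(changes.values())

# Drill down: list categories from the biggest drop to the biggest gain.
for category, delta in sorted(changes.items(), key=lambda item: item[1]):
    print(f"{category:12s} {delta:+,d}")
print(f"{'total':12s} {total_change:+,d}")
```

In this toy data the total is down, but the drill-down shows that only one category fell while the others grew, which is the kind of "why" answer a diagnostic question is after.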
3. Data Requirements
You can’t get good results in data science without good-quality data. Getting data of the right quality from multiple sources is crucial.
The chosen analytic approach determines which data sources, formats, and volumes are suitable. To understand the data requirements in detail, you must answer the following questions before moving on to data collection:
- Which type of data is required?
- How can a suitable source be identified, or how can the data be collected?
- How should the data be explored and worked with?
- How should the data be prepared to meet the desired outcome?
Data requirement methodology includes identifying the necessary data content, formats, and sources for initial data collection.
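One possible way to pin down data requirements is to write them as a small specification and check sample records against it. The sketch below assumes a customer churn question; the field names, sources, `FieldRequirement`, and `check_record` are hypothetical and only illustrate the idea of listing content, format, and source up front.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldRequirement:
    name: str              # column the analysis needs
    dtype: type            # expected Python type of each value
    source: str            # where the field is expected to come from
    required: bool = True  # optional fields may be absent

# Hypothetical requirements for a customer churn question.
REQUIREMENTS = [
    FieldRequirement("customer_id", str, "CRM export"),
    FieldRequirement("monthly_spend", float, "billing database"),
    FieldRequirement("support_tickets", int, "helpdesk API", required=False),
]

def check_record(record, requirements):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for req in requirements:
        value = record.get(req.name)
        if value is None:
            if req.required:
                problems.append(
                    f"missing required field '{req.name}' (source: {req.source})"
                )
        elif not isinstance(value, req.dtype):
            problems.append(
                f"field '{req.name}' should be {req.dtype.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems

good = {"customer_id": "C001", "monthly_spend": 42.5}
bad = {"customer_id": "C002", "monthly_spend": "42.5", "support_tickets": 3}
print(check_record(good, REQUIREMENTS))
print(check_record(bad, REQUIREMENTS))
```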
4. Data Collection
The collected data may arrive in any format. It should therefore be validated against the chosen approach, and the output approved. Where necessary, additional data can be gathered and unnecessary data discarded.
The data requirements are reviewed throughout this phase, and decisions are made about whether the collection needs more or less data. Once the data components are gathered, the data scientist knows what they will be working with.
Descriptive statistics and visualization techniques can be applied to the collected data to examine its content, quality, and early insights. Gaps in the data will be detected, and plans must be made to fill them or to find substitutes.
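As a small illustration of detecting data gaps, the sketch below counts how often each column is missing in a set of collected rows. The rows, the column names, and the helper `missing_share` are made up, and `None` is assumed to mark a value that was not captured.

```python
# Hypothetical collected rows; None marks a value that was not captured.
rows = [
    {"age": 34, "city": "Jakarta", "income": 5200.0},
    {"age": None, "city": "Bandung", "income": None},
    {"age": 29, "city": None, "income": None},
    {"age": 41, "city": "Jakarta", "income": 6100.0},
]

def missing_share(rows):
    """Return {column: fraction of rows where the value is missing}."""
    columns = sorted({key for row in rows for key in row})
    return {
        col: sum(1 for row in rows if row.get(col) is None) / len(rows)
        for col in columns
    }

gaps = missing_share(rows)
for col, share in gaps.items():
    print(f"{col}: {share:.0%} missing")
```

A column with a high missing share is a candidate for collecting more data or for finding a substitute source.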
5. Data Understanding
The data understanding step of the data science methodology essentially answers one question: is the data you collected representative of the problem to be solved? Descriptive statistics are computed on the data to assess its content and quality. This step may require going back to the previous step to make adjustments.
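Here is a minimal sketch of the descriptive statistics mentioned above, using only Python's built-in `statistics` module. The `monthly_spend` values and the `describe` helper are made up for the example.

```python
import statistics

# Hypothetical monthly spend values for a sample of customers.
monthly_spend = [42.5, 38.0, 51.2, 47.9, 40.3, 39.8, 250.0, 44.1]

def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

summary = describe(monthly_spend)
for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

In this toy sample the mean sits well above the median because of the single 250.0 value. A gap like that is a hint to check whether the outlier is an error or a real part of the problem you are trying to solve.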
6. Data Preparation
Data preparation is the most time-consuming phase of a data science project. Together with data collection and data understanding, it typically takes 70–80% of the overall project time.
Automating some of the data collection and preparation procedures in the database can cut this time in half. The time saved gives data scientists more time to spend on model creation.
Data preparation is the process of making sure that raw data is correct and consistent before it is processed and analyzed, so that the output of BI and analytics apps will be valid. The data preparation step of the data science methodology, in particular, answers the question: how is the data prepared?
The data must be free of missing or incorrect values and duplicates, and it must be structured properly, before you can work with it effectively. Data preparation also includes feature engineering: the process of using domain knowledge of the data to create features that make machine learning algorithms work. A feature is a property of the data that can be useful in solving the problem. Features are vital to predictive models and affect the results you get, so feature engineering is essential whenever you use machine learning methods to analyze data.
The data preparation phase lays the groundwork for the subsequent stages in answering the business question. While this step may take some time, the outcomes will benefit the project if it is done correctly. If this step is skipped, the end result will be subpar, and you may have to start over.
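The preparation tasks described above (removing duplicates, handling missing values, and feature engineering) can be sketched in a few lines. Everything here is an assumption for illustration: the records, the choice of median imputation, and BMI as the engineered feature. A real project would pick these based on the data and the problem.

```python
import statistics

# Hypothetical raw records with a duplicate row and missing values.
raw = [
    {"id": 1, "height_cm": 170.0, "weight_kg": 65.0},
    {"id": 2, "height_cm": 160.0, "weight_kg": None},
    {"id": 2, "height_cm": 160.0, "weight_kg": None},  # duplicate row
    {"id": 3, "height_cm": 180.0, "weight_kg": 81.0},
]

def prepare(records):
    """Deduplicate by id, impute missing weights, and add a BMI feature."""
    # 1. Drop duplicates, keeping a copy of the first record seen per id.
    unique = {}
    for rec in records:
        unique.setdefault(rec["id"], dict(rec))
    cleaned = list(unique.values())

    # 2. Impute missing weights with the median of the known weights.
    known = [r["weight_kg"] for r in cleaned if r["weight_kg"] is not None]
    median_weight = statistics.median(known)
    for r in cleaned:
        if r["weight_kg"] is None:
            r["weight_kg"] = median_weight

    # 3. Feature engineering: derive BMI from height and weight.
    for r in cleaned:
        r["bmi"] = r["weight_kg"] / (r["height_cm"] / 100) ** 2
    return cleaned

prepared = prepare(raw)
for row in prepared:
    print(row)
```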
7. Modeling
The modeling step shows whether the prepared data is ready to use or whether it needs further refinement and preprocessing. This phase focuses on developing predictive, descriptive, or prescriptive models.
“Data modeling is mainly concerned with creating either descriptive or predictive models.”
A descriptive model could investigate questions such as: what are the top ten selling products in a category? A predictive model is a mathematical process that predicts future events or outcomes by analyzing patterns in a given set of input data, for example to predict yes/no or multi-class outcomes. These models depend on the analytic approach used, which may be statistics-driven or machine learning-driven.
For predictive modeling, the data scientist uses a training set, which is a collection of data with known outcomes. The data scientist will experiment with various techniques to confirm which variables are needed.
The effectiveness of data gathering, preparation, and modeling depends on a thorough grasp of the problem at hand and a suitable analytic approach.
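To show what "a training set with known outcomes" looks like in practice, here is a minimal sketch of a yes/no predictive model. It uses a nearest-centroid classifier purely because it is short enough to read in full; the course does not prescribe this technique, and the data and function names are made up.

```python
import math
from collections import defaultdict

# Hypothetical training set: (features, known outcome).
training_set = [
    ((1.0, 1.2), "no"), ((0.8, 1.0), "no"), ((1.3, 0.9), "no"),
    ((3.0, 3.2), "yes"), ((3.4, 2.9), "yes"), ((2.8, 3.5), "yes"),
]

def fit_centroids(examples):
    """Learn one centroid (mean feature vector) per outcome label."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(col) / len(col) for col in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(centroids, features):
    """Predict the label whose centroid is closest to the features."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))

model = fit_centroids(training_set)
print(predict(model, (1.1, 1.0)))
print(predict(model, (3.1, 3.0)))
```

Swapping in a different technique only changes `fit_centroids` and `predict`; the training set with known outcomes stays the same, which is what makes it easy to experiment.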
8. Evaluation
Model evaluation goes hand in hand with model building. It assesses the model’s quality and whether it meets the business needs, and it consists of a diagnostic measures phase and statistical significance testing.
Model evaluation may be divided into two stages.
- The diagnostic measures phase confirms that the model is working as intended. For example, if the model is predictive, a decision tree can be used to check whether the model’s output is consistent with the initial design. If the model is descriptive, a testing set with known results can be applied and the model adjusted as needed.
- Statistical significance testing is a possible second phase. It helps ensure that the data is handled and interpreted correctly within the model, which avoids unnecessary second-guessing once the answer is revealed.
Ten standard predictive model evaluation metrics in data science:
- Mean Squared Error (MSE): The most widely used and simplest evaluation metric for regression models. It is the average of the squared differences between actual and predicted values.
- Root Mean Square Error (RMSE): An evaluation metric for regression models, equal to the square root of the MSE. Its value is in the same unit as the output variable, which makes the error easy to interpret.
- Precision: An evaluation metric for classification models. It measures what proportion of predicted positives are truly positive.
- Recall: An evaluation metric for classification models. It measures what proportion of actual positives are correctly classified.
- F1 score: An evaluation metric for classification models, equal to the harmonic mean of precision and recall. The F1 score balances precision and recall.
- AUC ROC: An evaluation metric for classification models that indicates how well the predicted probabilities separate the positive class from the negative class.
- Log loss/Binary cross entropy: Used when the output of a classifier is a predicted probability. Log loss penalizes each prediction according to how far its probability is from the actual label.
- Categorical cross entropy: A generalization of log loss to multi-class classification problems.
- Average Precision (AP): An essential object detection evaluation metric. It summarizes the precision-recall curve as a weighted mean of the precision at each threshold, with the increase in recall from the previous threshold as the weight. This metric makes model comparison easier.
- Mean Average Precision (mAP): An extension of average precision. AP is computed for an individual object class, while mAP averages AP over all classes to give a single score for the entire model.
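Several of the metrics above are short enough to write out by hand, which helps in understanding what they measure. The sketch below implements MSE, RMSE, precision, recall, F1, and binary log loss in plain Python. The function names and sample values are my own; in practice you would more likely use a library implementation. Labels are assumed to be 1 for positive and 0 for negative.

```python
import math

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of MSE, in the same unit as the target variable."""
    return math.sqrt(mse(y_true, y_pred))

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)

# Made-up regression and classification examples.
print(mse([3.0, 5.0, 2.0, 7.0], [2.0, 5.0, 4.0, 7.0]))
print(rmse([3.0, 5.0, 2.0, 7.0], [2.0, 5.0, 4.0, 7.0]))
print(precision_recall_f1([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1]))
print(log_loss([1, 0], [0.9, 0.1]))
```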
9. Deployment
If you have reached this stage, the model has been thoroughly assessed and is ready to be implemented in the production environment. This is the ultimate test of how well the model performs on external data and how scalable it is. Depending on the model’s goal, it may first be rolled out to a small set of users or a test environment, to gain confidence before applying the results across the entire customer production environment.
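Rolling a model out to a small set of users first can be sketched as a deterministic split on the user id. This is just one possible way to do it; `in_rollout`, `predict_with_rollout`, and the 10% default are hypothetical, and the two models are assumed to be plain callables.

```python
import hashlib

def in_rollout(user_id: str, rollout_percent: int) -> bool:
    """Deterministically decide whether a user is in the rollout group.

    Hashing the user id keeps the assignment stable between requests,
    so the same user always gets the same model version.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # bucket in the range 0..99
    return bucket < rollout_percent

def predict_with_rollout(user_id, features, new_model, old_model,
                         rollout_percent=10):
    """Serve the new model to the rollout group and the old one otherwise."""
    model = new_model if in_rollout(user_id, rollout_percent) else old_model
    return model(features)

# Usage with stand-in models that just report which version answered.
new_model = lambda features: "new"
old_model = lambda features: "old"
print(predict_with_rollout("user-123", [1.0, 2.0], new_model, old_model))
```

Raising `rollout_percent` step by step lets you compare the new model against the old one on real traffic before switching everyone over.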
10. Feedback
Feedback is essential for monitoring model performance in production. It also helps data scientists understand model robustness, for example how well the model will perform in the long term. One of the main purposes of this step is to help refine the model and assess its performance and impact.
Feedback steps include defining the review procedure, tracking the record (data drift), measuring efficacy, and reviewing and improving.
Once you deploy the model in production, its predictions stay reliable only as long as the data submitted to it in production mimics the data used to test and validate it. When it no longer does, we call it data drift.
In other words, data drift is a variation in the production data relative to the data used to test and validate the model before it was deployed.
Data drift can have several causes: a significant time gap (weeks, months, or years, depending on the complexity of the problem) between when the data was gathered and when the deployed model is used to predict on real data; errors in data collection; and seasonality. For example, if the data was collected before COVID and the model is deployed after COVID, the data will drift. You can identify data drift using sequential analysis methods, model-based methods, and time distribution-based methods.
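As one simple example of comparing distributions to spot drift, the sketch below computes the two-sample Kolmogorov-Smirnov statistic between reference data and current production data for a single numeric feature. This is only one of many possible checks; the sample values and the `DRIFT_THRESHOLD` cut-off are made up and would need tuning for a real feature.

```python
from bisect import bisect_right

def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic.

    It is the largest gap between the empirical CDFs of the two samples:
    0.0 means identical distributions, 1.0 means no overlap at all.
    """
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    largest_gap = 0.0
    for x in set(ref_sorted) | set(cur_sorted):
        ref_cdf = bisect_right(ref_sorted, x) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, x) / len(cur_sorted)
        largest_gap = max(largest_gap, abs(ref_cdf - cur_cdf))
    return largest_gap

DRIFT_THRESHOLD = 0.3  # hypothetical cut-off, not a standard value

reference = [10, 12, 11, 13, 12, 11, 10, 12]  # data used for validation
similar = [11, 12, 10, 13, 11, 12, 12, 10]    # production, same pattern
shifted = [20, 22, 21, 23, 22, 21, 20, 22]    # production after a shift

for name, sample in [("similar", similar), ("shifted", shifted)]:
    score = ks_statistic(reference, sample)
    print(f"{name}: KS = {score:.2f}, drift = {score > DRIFT_THRESHOLD}")
```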
There are several ways to handle data drift:
- Check the data quality, and compare the current data with the reference data to get an idea of what changed.
- Investigate the drift to understand where it comes from.
- Live with the drift, provided it does not affect the business objectives.
- Retrain the model on the current data to refresh it.
- Calibrate or rebuild the model. This means making larger changes to the training pipeline, such as changing the prediction target, applying domain adaptation strategies, identifying new segments where the model fails, and reweighting samples in the training data.
- Pause or scrap the model. In this case you can use a fallback strategy, such as changing the nature of the solution.
- Tune the model and apply business logic on top of it to get relevant results.
Conclusion
Thanks a lot for reading the entire article.