The Forgotten Step in CRISP-DM and ASUM-DM Methodologies
Posted on 2018-08-17 by Majid Bahrepour
CRISP-DM stands for the Cross-Industry Standard Process for Data Mining, an open standard for data mining that has existed since 1999. It was developed by an industry consortium and later promoted by IBM. CRISP-DM suggests a set of steps for carrying out data mining projects, with the aim of maximizing the chance of success and minimizing the faults that commonly occur in data-oriented projects. In 2015, IBM proposed an extended version of CRISP-DM called ASUM-DM (the Analytics Solutions Unified Method). ASUM-DM keeps the same data mining (development) steps as CRISP-DM and adds an operational/deployment part.
I am personally very much a fan of CRISP-DM and ASUM-DM. In my daily consultancy life, I stick to the steps they provide because doing so minimizes the risk of project failure. I believe that following the CRISP-DM and ASUM-DM methodologies properly distinguishes a senior data scientist from a junior one. Many data scientists/data miners tend to model the data quickly to reach insights, skipping a proper understanding of the problem and the right data preparation. That is why CRISP-DM comes with clear steps: taking them minimizes the common failures in data science/data mining projects. Having been a data miner and later a data scientist for over 12 years, I believe CRISP-DM misses one crucial step. With this article I intend to add a new step to CRISP-DM/ASUM-DM, one that comes from some years of experience in data science.
CRISP-DM suggests these steps for data mining/data science:
1. Business understanding: the data scientist should properly understand the business of his/her client. Why is analytics important to them? How can analytics be of great value to the business?
2. Data understanding: the data scientist should go through all the fields within the data to understand the data like a domain expert. With a poor understanding of the data, a data scientist can barely provide high-quality data science solutions.
3. Data preparation: preparing the data in such a way that a model can ingest and understand it. This is the most time-consuming step in any data science project.
4. Modeling: the magical phase that turns raw data into (actionable) insights. With recent advances in data science and tooling such as AutoML and deep learning, modeling is less complicated than before.
5. Evaluation: checking the accuracy of the model with metrics such as a confusion matrix, RMSE, MAPE, and MdAPE.
6. Deployment: putting the model to use on new data.
As you can see in the picture, CRISP-DM is an iterative approach that matches agile methodology quite well. The steps can be taken in parallel, and they are flexible enough to be redone quickly once anything changes in a previous step.
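For the evaluation step, the error metrics named above (RMSE, MAPE, and MdAPE) can be made concrete with a small example. This is only a minimal sketch in plain Python: the function names and the sample numbers are made up for illustration, and skipping rows whose actual value is zero is one possible convention, not a rule.

```python
import math
import statistics


def rmse(actual, predicted):
    """Root mean squared error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def absolute_percentage_errors(actual, predicted):
    """Absolute percentage error per row, in percent.

    Rows with an actual value of 0 are skipped (assumption), because the
    percentage error is undefined there.
    """
    return [abs((a - p) / a) * 100 for a, p in zip(actual, predicted) if a != 0]


def mape(actual, predicted):
    """Mean absolute percentage error."""
    return statistics.mean(absolute_percentage_errors(actual, predicted))


def mdape(actual, predicted):
    """Median absolute percentage error, less sensitive to outliers than MAPE."""
    return statistics.median(absolute_percentage_errors(actual, predicted))


# Illustrative values only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

print("RMSE :", round(rmse(actual, predicted), 2))
print("MAPE :", round(mape(actual, predicted), 2))
print("MdAPE:", round(mdape(actual, predicted), 2))
```

Reporting MAPE and MdAPE side by side is a cheap way to see whether a few extreme rows dominate the average error.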
ASUM-DM adds a new deployment/operations wing to CRISP-DM. The development phase stays the same as in CRISP-DM; in deployment, however, new facets are added, such as collaboration, version control, security, and compliance.
The forgotten step in CRISP-DM and ASUM-DM:
CRISP-DM repeats itself in ASUM-DM as the development part; however, it misses an important step: data validation. My CRISP-DM version looks like this.
Why data validation?
Data validation happens immediately after data preparation/wrangling and before modeling. The reason is that during data preparation there is a high chance of things going wrong, especially in complex scenarios. Data validation ensures that modeling happens on the right data. Faulty data as input to the model would generate faulty insights!
How is data validation done?
Data validation should be done by involving at least one external person who has a proper understanding of the data and the business. In my situation, this is usually a client who is technically good enough to check my data. Once I have gone through data preparation, and just before modeling, I usually make data visualizations and hand my newly prepared data to the client. The client then uses SQL queries or any other tools to check that my output contains no errors. When CRISP-DM/ASUM-DM is combined with the agile methodology, steps can be taken in parallel, meaning you do not have to wait for the green light from data validation to start modeling. But once the domain expert tells you that there are faults in the data, you need to correct the data by redoing the data preparation and then re-model the data.
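One way to make this hand-over easier is to ship a short summary of the prepared data that the client can reconcile against their own SQL queries (row counts, duplicate keys, missing values, a control total). The snippet below is a minimal sketch, not a fixed procedure: `validation_summary`, the column names, and the sample rows are all hypothetical.

```python
from collections import Counter


def validation_summary(rows, key, numeric_column):
    """Summarize prepared data so a domain expert can cross-check it.

    rows: list of dicts, one per record of the prepared dataset.
    key: column expected to be unique per record (hypothetical).
    numeric_column: column whose total the client can compare with a SUM() query.
    """
    key_counts = Counter(row[key] for row in rows)
    duplicate_keys = sorted(k for k, n in key_counts.items() if n > 1)

    # Count empty or missing values per column.
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                missing[column] += 1

    control_total = sum(
        row[numeric_column] for row in rows if row[numeric_column] is not None
    )
    return {
        "row_count": len(rows),
        "duplicate_keys": duplicate_keys,
        "missing_values": dict(missing),
        "control_total": control_total,
    }


# Illustrative prepared data with a duplicated key and two missing values.
prepared = [
    {"order_id": 1, "customer": "A", "amount": 120.0},
    {"order_id": 2, "customer": "", "amount": 80.0},
    {"order_id": 2, "customer": "B", "amount": 80.0},
    {"order_id": 3, "customer": "C", "amount": None},
]

summary = validation_summary(prepared, key="order_id", numeric_column="amount")
for name, value in summary.items():
    print(f"{name}: {value}")
```

The summary does not replace the external reviewer; it only gives them concrete numbers to compare with the source system.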
What are the common causes leading to a faulty output from data preparation?
Common causes are:
1. A lack of proper understanding of the data, so that the logic of the data preparation is not correct.
2. Common bugs in the programming/data preparation pipeline that lead to a faulty output.
3. Data formats that cause trouble within the data preparation step and generate faulty outputs without any error trace that the data scientist/engineer could catch during data preparation.
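The third cause is the hardest to spot, because nothing fails. As a hedged illustration, assume a source system that exports dates as day/month/year while the preparation code reads them as month/day/year. The sample values and the assumption about the source format are invented for this sketch; Python's standard `datetime.strptime` accepts both readings without raising an error.

```python
from datetime import datetime

# Assumption for this sketch: the source system writes day/month/year.
raw_dates = ["03/04/2018", "05/06/2018", "11/12/2018"]


def parse_dates(values, fmt):
    """Parse date strings with the given format; raises only on impossible dates."""
    return [datetime.strptime(value, fmt).date() for value in values]


# Both calls succeed, so the pipeline shows no error trace at all.
as_month_first = parse_dates(raw_dates, "%m/%d/%Y")  # wrong reading
as_day_first = parse_dates(raw_dates, "%d/%m/%Y")    # intended reading

mismatches = [
    (raw, wrong, right)
    for raw, wrong, right in zip(raw_dates, as_month_first, as_day_first)
    if wrong != right
]

for raw, wrong, right in mismatches:
    print(f"{raw}: parsed as {wrong}, should be {right}")
```

Every value here parses cleanly under both formats, so only someone who knows what the dates should be, such as the domain expert doing the validation, can tell that the output is wrong.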
In this article, I have proposed extending CRISP-DM/ASUM-DM with a new step. The whole idea of these methodologies is to formalize the steps that help data scientists/data miners improve the success of their projects and reduce failures. My version of CRISP-DM adds a “data validation” step, which further improves the chance of success and further reduces the failures and faults of data science/data mining projects.