Principles of Data Science project

Sathish Manthani
3 min read · Mar 31, 2020

Humans have a natural habit of learning from past experiences; we even try to identify patterns in them. Data science emulates that behavior, applying modern technology to datasets far larger than a human brain can process. "Smart," "targeted," and "personalized" are keywords that often indicate a data science project. The primary goal of such a project is to provide actionable insights that lead to better decision making and, ultimately, better outcomes.

The CRISP-DM methodology lays out the steps of a data science project in a structured manner. Typical steps include understanding the business problem, preparing the data accordingly, creating ML models, and using those models to find patterns that yield actionable insights for business users.
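The CRISP-DM phases can be written down as a simple ordered list. The helper below is only a sketch of the idea that CRISP-DM is a sequence teams iterate through (cycling back to earlier phases as needed); the function itself is illustrative, not part of the methodology.

```python
# The six CRISP-DM phases, in their usual order. The methodology is
# iterative, so real projects loop back to earlier phases as needed.
CRISP_DM_PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]

def next_phase(current: str) -> str:
    """Return the phase that typically follows `current` in the cycle."""
    i = CRISP_DM_PHASES.index(current)
    return CRISP_DM_PHASES[(i + 1) % len(CRISP_DM_PHASES)]

print(next_phase("Modeling"))  # Evaluation
```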

CRISP-DM steps. PC: GitHub

Success of a Data Science project

What exactly defines the success of a data science project? A successful project starts with a clear understanding of the business problem and needs, and concludes with results delivered on a usable platform in a form stakeholders can understand. During the project, the scrum master must convince the leadership team that the project is worth doing; the business manager's buy-in is key to the project's success.

A successful data science project requires a team of people with technical skills and clear business understanding.

In most organizations, the Business Intelligence and Data Warehousing team collects data from source OLTP systems. It stages the data, cleanses it by applying data-quality checks, transforms it by applying business rules, and finally loads it into the data warehouse through an ETL process. Data warehouses (DWs) are typically designed using dimensional modeling to support slice-and-dice reporting. It is therefore sensible to source data from the data warehouse, since a DW holds organized data in a structured format. This data often requires minimal cleansing because anomalies flowing in from source systems have already been removed. That said, every business use case is different, so further data profiling and cleansing may be needed to fit the use case.
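The extract, cleanse, transform, and load flow described above can be sketched in a few lines of Python. The rows, field names, and business rule here are hypothetical placeholders, not a real ETL implementation:

```python
# Hypothetical rows extracted from a source OLTP system.
oltp_orders = [
    {"order_id": 1, "amount": "120.50", "country": "us"},
    {"order_id": 2, "amount": None,     "country": "US"},  # fails quality check
    {"order_id": 3, "amount": "80.00",  "country": "uk"},
]

def quality_check(row):
    """Data-quality check: reject rows missing a mandatory field."""
    return row["amount"] is not None

def transform(row):
    """Business rules: cast amounts to numbers, normalize country codes."""
    return {
        "order_id": row["order_id"],
        "amount": float(row["amount"]),
        "country": row["country"].upper(),
    }

# "Load": here the warehouse table is just a list; a real ETL job
# would write to dimensional tables in the DW.
warehouse_orders = [transform(r) for r in oltp_orders if quality_check(r)]
print(warehouse_orders)
```

The row that fails the quality check never reaches the warehouse, which is why downstream consumers of the DW see fewer anomalies than the source systems contain.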

Another advantage of sourcing data from a DW is historical data. It is not operationally feasible for OLTP systems to hold data beyond their current processing needs, so most data science projects source data from DWs or data lakes, which store a significant amount of history — in some cases going back to the point of inception. Predictive modeling needs that history to produce a best-fit model that accurately predicts future trends or patterns. All in all, sourcing data from the data warehouse is the more cost-effective approach with a faster time to market.
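As a toy illustration of why history matters, the sketch below fits a simple linear trend to made-up yearly figures from a warehouse and projects one step ahead. Real predictive modeling would use far richer features and proper validation; this only shows that a forecast is derived from past observations.

```python
# Hypothetical yearly revenue history sourced from the data warehouse.
history = [(2016, 10.0), (2017, 12.0), (2018, 14.0), (2019, 16.0)]

n = len(history)
mean_x = sum(x for x, _ in history) / n
mean_y = sum(y for _, y in history) / n

# Ordinary least squares for a single predictor (the year).
slope = (sum((x - mean_x) * (y - mean_y) for x, y in history)
         / sum((x - mean_x) ** 2 for x, _ in history))
intercept = mean_y - slope * mean_x

forecast_2020 = slope * 2020 + intercept
print(forecast_2020)  # 18.0
```

With only one or two years of OLTP retention, the trend would be much harder to estimate — which is the operational argument for sourcing from the DW.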

Max Shron [1] talks about the importance of stakeholder-driven data science. A few points from his article:

1. Data scientists should spend more time listening to business users and spend less time doing data preparation and other technical tasks.

2. Start your project with mockups: deliverables with no real analytical content, whose only purpose is to verify that if we put this in front of a decision-maker, they could actually derive actionable insight from it. Creating mockups early on lets the business team see right away whether the work adds value to the business.

3. The life cycle of a project should follow these steps:

  • Discuss the idea: a Subject Matter Expert surfaces candidate ideas
  • Determine the business value of the project
  • Rapid iteration using mockup data
  • Condense the plan and get the business team's sign-off
  • Begin exploratory data analysis, feature engineering, and modeling
  • Deliver code and set up monitoring
  • Retrospective
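As a small, hypothetical taste of the exploratory-data-analysis and feature-engineering step above, the sketch below summarizes one column and derives a new feature. The customer data and the `orders_per_month` feature are invented for illustration:

```python
# Hypothetical customer rows pulled for modeling.
rows = [
    {"customer": "a", "orders": 4, "months_active": 2},
    {"customer": "b", "orders": 9, "months_active": 3},
    {"customer": "c", "orders": 0, "months_active": 1},
]

# EDA: basic summary statistics for the `orders` column.
orders = [r["orders"] for r in rows]
summary = {
    "min": min(orders),
    "max": max(orders),
    "mean": sum(orders) / len(orders),
}

# Feature engineering: derive an order rate per active month,
# a ratio a model can use that neither raw column provides alone.
for r in rows:
    r["orders_per_month"] = r["orders"] / r["months_active"]

print(summary)
```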

Finally, data science is essential for any modern organization that wants to stay relevant and ahead of the game, and well-planned execution of data science projects is what makes that success possible.

References:
