Mobilize Data To Your Advantage – An analytics practitioner point of view

Our CEO, Gowri Selka, presented live on BrightTalk on Dec 17 at 7:00 pm.

The following is a summary of her talk.


Data Literacy is a widely discussed topic in the industry.

Volantsys Analytics hosted a week-long survey on LinkedIn to share a practical perspective on Data Literacy, framed around the concepts that a diverse Data Science team needs to understand to apply analytics effectively to business needs. Before we get into the details of the survey and its results, we want to establish a common understanding of two terms we use: “Diverse Data Science team” and “Data Science concepts”.

Diverse Data Science Team

Developing solutions for analytics business problems is a team sport. This attitude enables a cross-functional team with different skill sets to implement solutions not simply for technology's sake, such as adopting cloud or fancy AI tools, but to further the mission of improving business outcomes.

Team members in both business and technology functions must be equipped with appropriate data literacy to help build the right analytics solutions.

Who is part of the diverse team, and what are their roles?

  • Business Analyst: Applies context to data to develop hypotheses, translates data into an analytics problem, and establishes success criteria for each phase.
  • Project Manager: Facilitates the five-phase methodology to improve solution quality and efficiency, and ensures program success.
  • Data Analyst/Scientist: Collects and analyzes data, and develops data science models/algorithms.

What is the purpose of this diverse team?

To analyze data to identify trends, apply business context to transform the data to fit the new reality (such as unexpected COVID situations), and use these insights to make future decisions and scale them responsibly.

Data Science Solution Development Concepts

The team must understand the following five concepts to formulate data science solutions for real-world problems and to avoid carrying existing bias in the data into the new solution.

Data Analysis:

  • EDA (Exploratory Data Analysis): Analyze data using various techniques to understand its patterns, dependencies, and trends.
  • Data Visualization: A picture is worth a thousand words. Charts and other visual representations of data, individually and as a group, help convey the context of the data and the relationships within it.
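
As a rough illustration of the EDA step described above, the following minimal sketch summarizes one variable and measures how strongly two variables move together. The `ad_spend` and `sales` figures are made-up sample data, and the helper names are hypothetical.

```python
from statistics import mean, median, stdev

def summarize(values):
    """Basic descriptive statistics for one numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),  # sample standard deviation
    }

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Made-up sample data for illustration only.
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 41, 62, 79, 101]

print(summarize(sales))
print(round(pearson(ad_spend, sales), 3))
```

A correlation close to 1 suggests a dependency worth discussing with the business analyst. It does not by itself show that one variable drives the other.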

Data Transformation:

Data Transformation is a set of processes to improve Data Quality. It relies on various statistical methods and tools to prepare a balanced, clean dataset for the model to consume.

Model selection:

Algorithms, or models, are key to using complex data patterns to develop predictive insights, which in turn help scale business decisions. Selecting the right model for the business problem is key to developing the right solution.

Model performance & Evaluation:

While there are multiple models/algorithms available for a given type of business solution, the right model is selected based on the business risk and sensitivity of the solution. Each type of model has multiple performance factors, and it is important to understand model behavior to select a solution that fits business needs.

Exploring the Survey and Response

The survey questions were designed to highlight the nuances of the five concepts listed above from a business application perspective rather than a technical one. A question may have more than one right answer, depending on the cognitive process applied. We have provided our rationale along with each answer below.

  1. Exploratory Data Analysis (EDA) Concept — What steps are typically NOT included in Data Science development process?

a. Data Collection

b. Performance Assessment

c. Data Compliance

d. Data Transformation

Answer: b

Performance assessment is a technique for assessing model performance. It is typically performed in a later phase, after model development, and not in the EDA phase. Data collection, data compliance, and data transformation are performed during exploratory data analysis. The purpose is to ensure that the data is collected from repeatable, trustworthy sources, complies with the organization's data policies, and is transformed to fit the needs of the business solution. Transformation typically includes replacing missing values and converting some data types so the algorithm can use them effectively.
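
The two transformation steps mentioned above, replacing missing values and converting data types, might look like the following minimal sketch. The records, field names, and the choice of median imputation are assumptions made for illustration; the right strategy depends on the dataset and the business context.

```python
from statistics import median

# Hypothetical raw records: ages arrive as strings and some values are missing.
raw_rows = [
    {"customer_id": "C1", "age": "34", "plan": "basic"},
    {"customer_id": "C2", "age": None, "plan": "premium"},
    {"customer_id": "C3", "age": "52", "plan": "basic"},
    {"customer_id": "C4", "age": "", "plan": None},
]

def transform(rows):
    """Convert age to a number, impute missing ages with the median,
    and label missing plans as 'unknown'."""
    known_ages = [int(r["age"]) for r in rows if r["age"]]
    fallback_age = median(known_ages)
    cleaned = []
    for r in rows:
        age = int(r["age"]) if r["age"] else fallback_age
        plan = r["plan"] if r["plan"] else "unknown"
        cleaned.append({"customer_id": r["customer_id"], "age": age, "plan": plan})
    return cleaned

clean_rows = transform(raw_rows)
for row in clean_rows:
    print(row)
```

Imputing with the median rather than the mean is one common choice because it is less affected by extreme values.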

2. Data Quality Concept — What factors are considered to assess initial data quality of the dataset?

a. Relevancy

b. Relevancy and Availability

c. Outliers and Missing Value

d. All of the above

Answer: d

“All of the above” is the correct answer, as all these factors are critical when assessing data quality. Data quality is a key concept that involves inspecting datasets to ensure they are relevant to the business problem, consistently available so that tests can be repeated as needed, and free of excessive outliers and missing values, so that the data is significant for the insights being developed.
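
Two of the quality factors above, missing values and outliers, can be checked with a few lines of code. This is a minimal sketch, assuming the common interquartile-range (IQR) rule for outliers with a multiplier of 1.5; the `order_values` list is made-up sample data. Relevancy and availability are judgment calls that code alone cannot settle.

```python
from statistics import quantiles

def missing_ratio(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring missing entries."""
    present = [v for v in values if v is not None]
    q1, _, q3 = quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < low or v > high]

# Made-up sample data: two missing entries and one suspicious value.
order_values = [20, 22, 19, None, 21, 23, 250, 20, None, 22]

print(missing_ratio(order_values))
print(iqr_outliers(order_values))
```

Whether a flagged value is an error or a genuine but rare event is a question for the business analyst, not the code.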

3. Data Visualization Concept — What factor should be considered first when selecting charts to represent data?

a. Data Types

b. Data Presentation

c. Data Storytelling

d. All of the above

Answer: a

The right answer is Data Types, as the question specifically asked about the first factor to consider when selecting visuals. Selecting a visual representation for a dataset typically starts with choosing a chart based on data type, since specific charts apply only to specific data types, such as categorical, ordinal, or numerical data. This subset of visuals is then assessed to determine whether each one conveys the right message for the purpose and suits the audience. “All of the above” could also be seen as a right answer, since all these factors need to be considered when selecting visuals to explain data.
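
The two-step selection described above, data type first and then purpose and audience, can be sketched as a simple lookup followed by a filter. The chart lists and the audience preference below are illustrative assumptions, not a definitive catalogue.

```python
# Hypothetical lookup: candidate charts per data type (illustrative only).
CHARTS_BY_DATA_TYPE = {
    "categorical": ["bar chart", "pie chart"],
    "ordinal": ["ordered bar chart", "stacked bar chart"],
    "numerical": ["histogram", "box plot", "scatter plot"],
}

def candidate_charts(data_type):
    """Step 1: charts that apply to the given data type."""
    if data_type not in CHARTS_BY_DATA_TYPE:
        raise ValueError(f"Unknown data type: {data_type!r}")
    return CHARTS_BY_DATA_TYPE[data_type]

def shortlist(data_type, suits_audience):
    """Step 2: keep only the candidates that pass the audience check."""
    return [c for c in candidate_charts(data_type) if suits_audience(c)]

# Assumed example: an audience that prefers simple, familiar charts.
simple_charts = {"bar chart", "ordered bar chart", "histogram"}
print(shortlist("numerical", lambda chart: chart in simple_charts))
```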

4. Model Selection Concept — Association Rule Mining algorithm is used to develop solutions to recommend products to customers.

a. True

b. False

Answer: a

Association rule mining is used to identify relationships between two or more products purchased by a customer. Apriori is an algorithm that helps find related data patterns, such as whether two products are associated. It is one way to perform market basket analysis, and the product relationships it identifies can be used to recommend product bundles. There are about ten popular business patterns where data science models are used across various industries. Examples include identifying recommendation offers for customers, predicting weather data such as temperature, and forecasting demand. There are multiple models for each of these business patterns. It is important for business analysts and project managers to be aware of the algorithms used for the business problem they are addressing. This helps the diverse team develop clear success criteria and select appropriate pre-built AI products that fit the expected model behaviors.
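
To make the market basket idea concrete, here is a minimal sketch that scores product pairs by support, confidence, and lift. It is not the full Apriori algorithm, which also prunes larger itemsets level by level; it only counts pairs. The baskets and the `min_support` threshold are made-up assumptions.

```python
from collections import Counter
from itertools import combinations

# Made-up transactions for illustration only.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def pair_rules(baskets, min_support=0.4):
    """Return (lhs, rhs, support, confidence, lift) for frequent product pairs."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]     # P(rhs | lhs)
            lift = confidence / (item_counts[rhs] / n)  # > 1 means positive association
            rules.append((lhs, rhs, support, confidence, lift))
    return rules

for lhs, rhs, support, confidence, lift in pair_rules(baskets):
    print(f"{lhs} -> {rhs}: support={support:.2f}, "
          f"confidence={confidence:.2f}, lift={lift:.2f}")
```

A rule with lift above 1 is a candidate for a product bundle or recommendation. A lift at or below 1 suggests the two products are bought together no more often than chance would predict.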

5. Model Performance Concept — Select a performance parameter that is NOT available in the Confusion matrix.

a. F1 Score

b. Specificity

c. Sensitivity

d. Root Mean Square Error

Answer: d

For regression-based models, Root Mean Square Error is used to measure the performance and effectiveness of the solution. This parameter is not part of the confusion matrix, which is used to validate the performance of classification models. In general, models vary, and each type has its own way of assessing performance, such as the accuracy of results and sensitivity to changes in data. The purpose of the confusion matrix is to show how accurately the model predicted whether an outcome will or will not occur. F1 score, specificity, and sensitivity are measures derived from these predictions, which the models make on both training and validation datasets.
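
The distinction above can be seen in code: sensitivity, specificity, and F1 score all come from the four counts in a confusion matrix, while Root Mean Square Error is computed from numeric prediction errors. This is a minimal sketch using made-up labels and values. It assumes binary labels coded as 1 (outcome occurs) and 0 (outcome does not occur), and that both classes appear in the data so no division by zero occurs.

```python
from math import sqrt

def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels coded as 1 and 0."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    return tp, fp, fn, tn

def classification_metrics(actual, predicted):
    """Metrics derived from the confusion matrix (classification models)."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    sensitivity = tp / (tp + fn)  # share of actual positives found
    specificity = tn / (tn + fp)  # share of actual negatives found
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}

def rmse(actual, predicted):
    """Root Mean Square Error (regression models)."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))

# Made-up classification labels.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(classification_metrics(y_true, y_pred))

# Made-up regression values, for example temperatures.
print(rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0]))
```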

Conclusion

Data science and machine learning solutions use standard algorithms to analyze datasets, gain business insights, and develop predictive insights for making business decisions at scale, with the collective intelligence built into the algorithm.

To develop the right solution and maximize the power of data, it is critical for a diverse team to perform these activities collaboratively.

We hope you enjoyed this short exercise. We look forward to sharing more best practices.
