Principle of Data Science - Exercises 2

21a. Machine learning is not a solution for every type of problem. There are certain cases where robust solutions can be developed without using ML techniques. Discuss TWO situations where machine learning is useful.
Ans:
ML is not needed if you can determine a target value by using simple rules, computations, or predetermined steps that can be programmed without needing any data-driven learning.
Use machine learning for the following situations:
• You cannot code the rules: Many human tasks (such as recognizing whether an email is spam or not) cannot be adequately solved with a simple, deterministic rule-based solution, because a large number of factors could influence the answer. When rules depend on too many factors, and many of these rules overlap or need to be tuned very finely, it soon becomes difficult for a human to code them accurately. ML can solve such problems effectively.
• You cannot scale: You might be able to manually recognize a few hundred emails and decide whether they are spam or not. However, this task becomes tedious for millions of emails. ML solutions are effective at handling large-scale problems.

21b. Formal ML is defined as: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Suppose your email program watches which emails you do or do not mark as spam, and based on that learns how to better filter spam.
What is the task T, experience E and performance measure P in this setting?

Ans:
Spam Filter
T: Classifying emails as spam or not spam
E: Watching you label emails as spam or not spam.
P: The number (or fraction) of emails correctly classified as spam/not spam.
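The T/E/P framing can be sketched in code. This is a toy, hypothetical keyword-based "learner" (not a realistic spam filter): the labeled emails play the role of experience E, classifying a new email is the task T, and accuracy is the performance measure P. All data and function names here are made up for illustration.

```python
def learn_spam_words(labeled_emails):
    """E: the experience -- emails the user has already marked spam / not spam."""
    spam_words, ham_words = set(), set()
    for text, is_spam in labeled_emails:
        (spam_words if is_spam else ham_words).update(text.lower().split())
    return spam_words - ham_words  # words seen only in spam

def classify(text, spam_words):
    """T: the task -- label a new email as spam (True) or not spam (False)."""
    return any(w in spam_words for w in text.lower().split())

def accuracy(test_emails, spam_words):
    """P: the performance measure -- fraction of emails classified correctly."""
    correct = sum(classify(t, spam_words) == y for t, y in test_emails)
    return correct / len(test_emails)

train = [("win a free prize now", True), ("meeting notes attached", False)]
test = [("free prize inside", True), ("lunch meeting today", False)]
spam_words = learn_spam_words(train)
print(accuracy(test, spam_words))  # 1.0 on this tiny toy dataset
```

More labeled emails (more E) would let the learner refine its word lists, which is exactly what "performance on T improves with experience E" means.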

22. The phrase “data storytelling” has been associated with many things - data visualizations, infographics, dashboards, data presentations, and so on. Too often data storytelling is interpreted as just visualizing data effectively; however, it is much more than just creating visually-appealing data charts.
Present a structured approach to data storytelling, based on THREE key elements, for communicating data insights. Summarize your presentation with a diagram.

Ans:
Data storytelling is a structured approach for communicating data insights, and it involves a combination of three key elements: data, visuals, and narrative.

It’s important to understand how these different elements combine and work together in data storytelling. When narrative is coupled with data, it helps to explain to your audience what’s happening in the data and why a particular insight is important. Ample context and commentary is often needed to fully appreciate an insight. When visuals are applied to data, they can enlighten the audience to insights that they wouldn’t see without charts or graphs. Many interesting patterns and outliers in the data would remain hidden in the rows and columns of data tables without the help of data visualizations.

Finally, when narrative and visuals are merged together, they can engage or even entertain an audience. It’s no surprise we collectively spend billions of dollars each year at the movies to immerse ourselves in different lives, worlds, and adventures. When you combine the right visuals and narrative with the right data, you have a data story that can influence and drive change.

[Diagram: data, narrative, and visuals overlapping to form a data story]

23. With the accelerated growth of tools allowing for easy implementation of powerful machine learning algorithms, it can become tempting for an amateur data scientist to skip exploratory data analysis.
Anticipate the effects of skipping exploratory data analysis in a data science project.

Ans:
Skipping EDA can leave skewed data, outliers, and missing values undetected, and therefore lead to bad outcomes for the project:
• generating inaccurate models;
• generating accurate models on the wrong data;
• choosing the wrong variables for the model;
• inefficient use of the resources, including the rebuilding of the model.
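A minimal sketch of the checks that skipping EDA would miss, using the standard library on a made-up "student age" column (the data and the 1.5 × IQR outlier rule are illustrative assumptions, not project specifics):

```python
import statistics

ages = [21, 22, None, 23, 20, None, 150, 22, 21]  # toy column with problems

# 1. Count missing values before modeling.
missing = sum(1 for v in ages if v is None)

# 2. Flag outliers with the common 1.5 * IQR rule on the non-missing values.
values = sorted(v for v in ages if v is not None)
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(missing)   # 2 missing entries
print(outliers)  # [150] -- an implausible age worth investigating
```

A model fit directly on `ages` would silently absorb both problems; a few lines of EDA surface them before they distort the model.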

24. Suggest FOUR (4) ways you can use exploratory graphs to begin viewing what your own data can reveal. Your suggestion must include the type of exploratory graph, what it shows and its purpose in exploring your data.
Ans:
i. Start exploring with box plots - Box plots divide data into its quartiles. The “box” shows the data set between the first and third quartiles. The median is drawn somewhere inside the box, and whiskers extend to the most extreme non-outliers to finish the plot. Box plots give your data a broad shape without sacrificing the ability to look at any piece and ask more questions.
ii. Measure your categories with bar charts - A bar chart lets you see individual categories and how big those categories are. A uniform bar chart can tell you there is a lot of variety in your data, while a bar chart with an uneven range can show you what might be responsive (or not) in the future. With a bar chart, you can see how different things are between separate categories of data. That is good when you want to know what separates your variables. If you have a lot of categories, you may want to compare a limited set of them and see how things stack up.
iii. See data range with histograms - The key is that a histogram looks solely at quantitative variables while a bar chart looks at categorical variables. That’s why the bars in a histogram are typically grouped together without spacing in between. The values are listed in order, so you can see the overall range and skew of the data, while a bar chart’s discrete categories may change depending on how they are arranged. Since histograms let you view data sets in ranges, you can tailor your histogram to show differing extremes.
iv. Identify patterns with scatter plots - Scatter plots let you see how closely your data may be correlated. If there is an apparent relationship between pieces of your data, then there may be a single cause that accounts for multiple variables.

25. Clarify the difference between a training set, a test set and a validation set in the machine learning model.
Ans:
Training Dataset: The sample of data used to fit the model.
Validation Dataset: The sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation dataset is incorporated into the model configuration.
Test Dataset: The sample of data used to provide an unbiased evaluation of a final model fit on the training dataset.
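The three-way split can be sketched in a few lines of plain Python. The 60/20/20 ratios and the toy data below are assumptions for illustration, not fixed rules:

```python
import random

data = list(range(100))        # stand-in for 100 labeled examples
random.seed(42)                # fixed seed so the split is repeatable
random.shuffle(data)           # shuffle before splitting to avoid ordering bias

n = len(data)
train = data[: int(0.6 * n)]               # fit the model
val   = data[int(0.6 * n): int(0.8 * n)]   # tune hyperparameters
test  = data[int(0.8 * n):]                # final, unbiased evaluation -- used once

print(len(train), len(val), len(test))     # 60 20 20
```

The three subsets are disjoint by construction; keeping the test set untouched until the very end is what makes its evaluation unbiased.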

26. Evaluating the accuracy of a predictive model is one of the most important tasks in a data science project. It indicates how good the predictions are. In classification problems we look at metrics called precision and recall.
Illustrate precision and recall using the confusion matrix.

Ans:
Once you understand the confusion matrix, calculating precision and recall is easy.
The confusion matrix for binary classification is made of four simple counts:
o True Negative (TN): case was actually negative and predicted negative
o True Positive (TP): case was actually positive and predicted positive
o False Negative (FN): case was actually positive but predicted negative
o False Positive (FP): case was actually negative but predicted positive
[Diagram: confusion matrix]
                  Predicted positive   Predicted negative
Actual positive          TP                   FN
Actual negative          FP                   TN
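The four counts, and the precision and recall built from them, can be computed directly from example predictions (the labels below are made up; 1 = positive, 0 = negative):

```python
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

# Tally each cell of the confusion matrix.
tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

precision = tp / (tp + fp)   # what fraction of positive predictions were correct
recall    = tp / (tp + fn)   # what fraction of actual positives were caught

print(tp, tn, fp, fn)        # 3 3 1 1
print(precision, recall)     # 0.75 0.75
```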

Precision – the ratio of correctly predicted positive observations to the total predicted positive observations, or: what percent of positive predictions were correct?
Precision = TP / (TP + FP)

Recall – also called sensitivity, the ratio of correctly predicted positive observations to all observations in the actual positive class, or: what percent of the positive cases did you catch?
Recall = TP / (TP + FN)

27. Using a diagram, explain the concept of reproducible research which you are going to adopt for your data science project.
Ans:
[Diagram: reproducible research workflow]

28. What are recommender systems?
Ans:
A subclass of information filtering systems that predict the preferences or ratings a user would give to an item. Recommender systems are widely used for movies, news, research articles, products, social tags, music, etc.
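One common approach can be sketched in a few lines: user-based collaborative filtering, predicting a rating from similar users' ratings. The users, items, ratings, and similarity formula below are all invented for illustration; production systems are far more sophisticated.

```python
ratings = {                      # user -> {item: rating on a 1..5 scale}
    "ann": {"MovieA": 5, "MovieB": 3, "MovieC": 4},
    "bob": {"MovieA": 4, "MovieB": 3, "MovieC": 5, "MovieD": 4},
    "eve": {"MovieA": 1, "MovieB": 5, "MovieD": 2},
}

def similarity(u, v):
    """Agreement over co-rated items: 1 / (1 + mean absolute difference)."""
    common = ratings[u].keys() & ratings[v].keys()
    if not common:
        return 0.0
    diff = sum(abs(ratings[u][i] - ratings[v][i]) for i in common) / len(common)
    return 1.0 / (1.0 + diff)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    pairs = [(similarity(user, v), r[item])
             for v, r in ratings.items() if v != user and item in r]
    total = sum(s for s, _ in pairs)
    return sum(s * r for s, r in pairs) / total if total else None

print(round(predict("ann", "MovieD"), 2))  # 3.41 -- pulled toward bob's rating
```

Because ann's tastes agree with bob's far more than with eve's, the prediction for the unseen MovieD lands much closer to bob's rating of 4 than to eve's rating of 2.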

29a. What gave birth to data science?
Ans:
Data is the new oil for all the industries and data science is the electricity that powers the industry.
In today’s world data has become super-abundant and is going to increase exponentially for the next two decades. Two or three decades ago, the data we had was tiny, structured, and mostly in a single format, so the analytics performed on it was quite simple.
But with the rise of technology, this data started to explode, multiple sources started to generate huge amounts of unstructured data of different formats.
In other words, we had a lot of data with us, but we were not able to find out any insights from it. The need to understand and analyze data to make better decisions is what gave birth to Data Science.

29b. Getting a ride from Grab is easy. Relate where data science is applied in this app.
Ans:
Getting a ride from Grab is easy. You simply open the app, set your pickup and drop-off locations, book a taxi, get picked up, and pay with your phone.
Whenever you book a taxi through Grab you get an estimated fare and time to cover the specific distance. How are these apps able to show all this information? The answer is data science: predictive analytics helps Grab estimate pickup and drop-off locations and arrival times.

29c. Referring to the three statements below, identify what is X, Y and Z. Assuming you have a students dataset, give ONE related example for each X,Y and Z.
Something that is apparent from the data / data set is called X.
A conclusion drawn from X is known as Y.
An action taken/to be taken from the Y is referred as Z.

Ans:
X = insight: something apparent from the data, e.g. students who attend more classes tend to score higher.
Y = conclusion: drawn from the insight, e.g. attendance influences academic performance.
Z = action/decision: taken from the conclusion, e.g. introduce a minimum-attendance policy.

30a. We need to rebuild our education system to support data-driven education. Discuss what data-driven education means, including its prerequisites.
Ans:
Data-driven means that progress in an activity is compelled by data rather than by intuition or personal experience. Its prerequisites are:
• the organization must be collecting data;
• the data must be accessible and queryable;
• there must be people with the skills to extract the right data and use it to inform next steps.

30b. The hallmark of a data-driven organization is an effective “analytics value chain”. Draw the Analytics Value Chain.
Ans:
[Diagram] Data → Reports → Analysis → Decision making → Value / Impact

The analytics value chain. In a data-driven organization, the data feeds reports, which stimulate deeper analysis. The analysis is placed in the hands of the decision makers, who incorporate it into their decision-making process, influencing the direction the company takes and providing value and impact.

31a. Why was it not possible for data science to exist 20 years ago?
Ans:
Data science was driven by technological change, so it could not exist 20 years ago: computers were slow, computational power was low, programming languages were primitive, etc.

31b. Distinguish between data analysis, data analytics and data mining. Include a diagram in your explanation.
Ans:
Data Analysis involves extracting, cleaning, transforming, modeling and visualizing data with the intention of uncovering meaningful and useful information that can help in drawing conclusions and making decisions. Data analysis as a process has been around since the 1960s.

Analytics is about applying a mechanical or algorithmic process to derive the insights for example running through various data sets looking for meaningful correlations between them.

Data analysis is a broader term that refers to the process of compiling and analysing data in order to present findings to management to help inform business decision making. Data analytics is a subcomponent of data analysis that involves the use of technical tools and data analysis techniques.

Data mining is a systematic and sequential process of identifying and discovering hidden patterns and information in a large dataset. It is also known as Knowledge Discovery in Databases. It has been a buzzword since the 1990s.

32a. Digital disruption happens when advances in technology change our markets and our societies. Digital disruption has already happened.
Digital disruption is threatening the survival of many businesses and industries.
Identify TWO industries most vulnerable to digital disruption and show THREE successful digital disruption examples.

Ans:
Industries most vulnerable to digital disruption include:
• Media and Entertainment
• Technology products and services
• Financial services
• Retail
• Communications
• Education

Successful digital disruption examples:
[Diagram: successful digital disruption examples]

32b. Why is datafication important? Provide TWO examples of datafication?
Ans:
Datafication is a modern technological trend turning many aspects of our life into computerised data and transforming this information into new forms of value.
The transition of the world into ever increasing usage and applications of digitization has created the need to create a system that can effectively handle all the information that is flowing around the globe. Therefore, datafication has evolved as a necessity of increasing digitization, aimed at creating value for businesses and individuals.
Some examples of datafication are:
• Facebook datafies our friendships and posts
• Twitter datafies our followers, following, Tweets, time of day, and interactions
• LinkedIn datafies our professional contacts, locations, likes, posts
• Fitbit datafies our physical activities to derive useful information
• GPS devices on smartphones, such as Google maps, are able to track where we are at certain times of the day.



