Linked Data and examples, why they matter. Data driven strategies. Data mining: laws and applications. Data aggregation and fundamentals of data representation (table, bar chart, histogram, pie chart, line graph, scatter plot). Data science definition and job roles (who does what).
3. LESSON 5
A COUPLE OF DIGRESSIONS
▸ storage issues
▸ http://blog.odsi.co.uk/wp-content/uploads/2013/08/History-of-computer-data-storage.png.jpg
▸ the rise of data centers
▸ computational power
▸ the Internet
DATA CENTER CLOUD (4,563 IN 2019)
▸ https://www.digitalic.it/tecnologia/data-center-cloud-numeri-e-diffusione-nel-mondo-litalia-tra-i-paesi-europei-che-ne-ospita-di-piu
DEFINITION
The term “big data” refers to data that is so large, fast or complex that it’s difficult or impossible to
process using traditional methods. The concept of big data gained momentum in the early 2000s
when industry analyst Doug Laney articulated the definition of big data as the three V’s:
▸ Volume: Organizations collect data from a variety of sources, including business transactions,
smart (IoT) devices, industrial equipment, videos, social media and more. In the past, storing it
would have been a problem.
▸ Velocity: With the growth of the Internet of Things, data streams into businesses at an
unprecedented speed and must be handled in a timely manner, in near-real time.
▸ Variety: Data comes in all types of formats, from structured, numeric data in traditional
databases to unstructured text documents, emails, videos, audio, stock ticker data and financial
transactions.
CORRELATION
When two sets of data are strongly linked together, we say they have a high correlation.
▸ Correlation is Positive when the values increase together, and
▸ Correlation is Negative when one value decreases as the other increases
The correlation coefficient takes a value between -1 and 1:
▸ 1 is a perfect positive correlation
▸ 0 is no correlation (the values don't seem linked at all)
▸ -1 is a perfect negative correlation
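As a minimal sketch of how such a value can be computed, the snippet below implements the Pearson correlation coefficient, one common measure of linear correlation. The slides do not name a specific coefficient, so that choice is an assumption; the function name `pearson` and the sample data are purely illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical data, invented for illustration only.
sunshine_hours = [1, 2, 3, 4, 5]
ice_creams_sold = [10, 20, 30, 40, 50]   # rises with sunshine
umbrellas_sold = [50, 40, 30, 20, 10]    # falls as sunshine rises

r_pos = pearson(sunshine_hours, ice_creams_sold)
r_neg = pearson(sunshine_hours, umbrellas_sold)
print(r_pos, r_neg)  # 1.0 -1.0
```

Real data rarely reaches exactly 1 or -1; values closer to 0 indicate a weaker linear link.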
CORRELATION
Correlation is one of the most widely used statistical concepts.
The term "correlation" refers to a mutual relationship or association between
quantities. Why is it a useful metric?
▸ Correlation can help in predicting one quantity from another
▸ Correlation can (but often does not) indicate the presence of a causal
relationship
▸ Correlation is used as a basic quantity and foundation for many other
modeling techniques
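To illustrate the first point, predicting one quantity from another, here is a hedged sketch that fits a straight line to two correlated variables by ordinary least squares. Least squares is just one possible technique, chosen here for illustration; the helper `fit_line` and the temperature and sales figures are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: daily temperature (°C) and units sold.
temperature = [14, 16, 20, 22, 28]
sales = [118, 152, 212, 238, 330]

slope, intercept = fit_line(temperature, sales)


def predict(temp):
    """Estimate sales for a temperature that was not observed."""
    return slope * temp + intercept


print(slope, intercept)  # 15.0 -90.0
print(predict(25))       # 285.0
```

The prediction is only as good as the correlation is strong, and it says nothing about whether temperature causes the sales.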
LINKED DATA / LOD
Linked data is structured data which is interlinked with other data so it becomes
more useful through semantic queries. It builds upon standard Web technologies
but rather than using them to serve web pages only for human readers, it extends
them to share information in a way that can be read automatically by computers.
Part of the vision of linked data is for the Internet to become a global database.
Linked data may also be open data, in which case it is usually described as linked
open data (LOD).
▸ https://en.wikipedia.org/wiki/Linked_data
WHY LINKED DATA MATTERS
Linked data is a method for publishing structured data using vocabularies like
schema.org that can be connected together and interpreted by machines. Using
linked data, statements encoded in triples can be spread across different
websites.
This enables data from different sources to be connected and queried.
▸ https://wordlift.io/blog/en/entity/linked-data/
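The idea of triples spread across websites can be sketched in a few lines of plain Python. This is a toy model, not a real RDF library: the `example.org` URIs, the two "sites" and the helper `objects` are all hypothetical, and only the schema.org property names (`name`, `birthPlace`) come from the actual vocabulary.

```python
SCHEMA = "https://schema.org/"
ADA = "https://example.org/person/ada"        # hypothetical identifier
LONDON = "https://example.org/place/london"   # hypothetical identifier

# Each statement is a (subject, predicate, object) triple.
# Site A publishes facts about a person ...
site_a = {
    (ADA, SCHEMA + "name", "Ada Lovelace"),
    (ADA, SCHEMA + "birthPlace", LONDON),
}
# ... and site B, independently, publishes facts about a place.
site_b = {
    (LONDON, SCHEMA + "name", "London"),
}

# Because both sites use the same identifiers, their triples merge into one graph.
graph = site_a | site_b


def objects(triples, subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


# Follow a link across sources: where was Ada born, and what is that place called?
birth_place = objects(graph, ADA, SCHEMA + "birthPlace")[0]
birth_place_name = objects(graph, birth_place, SCHEMA + "name")[0]
print(birth_place_name)  # London
```

Neither site alone can answer the question; the answer appears only once the two sets of statements are connected through the shared URI.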
CONTEXT
You don’t have to be a fancy statistician to do data mining, but you do
have to know something about what the data signifies and how the
business works.
Only when you understand the data and the problem that you need to
solve can data-mining processes help you to discover useful
information and put it to use.
NINE LAWS OF DATA MINING - 1
Pioneering data miner Thomas Khabaza developed his “Nine Laws of Data Mining”
to guide new data miners as they get down to work.
▸ 1. “Business Goals Law”
Business objectives are the origin of every data mining solution.
A data miner is someone who discovers useful information from data to support
specific business goals. Data mining isn’t defined by the tool you use.
▸ 2. “Business Knowledge Law”
Business Knowledge is central to every step of the data mining process.
You don’t have to be a fancy statistician to do data mining, but you do have to
know something about what the data signifies and how the business works.
NINE LAWS OF DATA MINING - 2
▸ 3. “Data Preparation Law”
Data preparation is more than half of every data mining process.
Pretty much every data miner will spend more time on data preparation than on
analysis.
▸ 4. “No Free Lunch for the Data Miner”
The right model for a given application can only be discovered by experiment.
In data mining, models are selected through trial and error.
▸ 5. “Patterns”
There are always patterns in the data.
As a data miner, you explore data in search of useful patterns. Understanding patterns
in the data enables you to influence what happens in the future.
NINE LAWS OF DATA MINING - 3
▸ 6. “Insight Law”
Data mining amplifies perception in the business domain.
Data mining methods enable you to understand your business better than you
could have done without them.
▸ 7. “Prediction Law”
Prediction increases information locally by generalization.
Data mining helps us use what we know to make better predictions (or
estimates) of things we don’t know.
NINE LAWS OF DATA MINING - 4
▸ 8. “Value Law”
The value of data mining results is not determined by the accuracy or stability
of predictive models.
What matters is whether the model produces predictions that are good enough, consistently, to serve the business goal.
▸ 9. “Law of Change”
All patterns are subject to change.
Any model that gives you great predictions today may be useless tomorrow.
PHASES OF THE DATA MINING PROCESS
The Cross-Industry Standard Process for
Data Mining (CRISP-DM) is the dominant
data-mining process framework. It’s an
open standard; anyone may use it.
BUSINESS UNDERSTANDING
Get a clear understanding of the problem you’re out to solve, how it impacts your
organization, and your goals for addressing it.
Tasks in this phase include:
▸ Identifying your business goals
▸ Assessing your situation
▸ Defining your data mining goals
▸ Producing your project plan
DATA UNDERSTANDING
Review the data that you have, document it, identify data management and data quality
issues.
Tasks in this phase include:
▸ Gathering data
▸ Describing data
▸ Exploring data
▸ Verifying data quality
DATA PREPARATION
Get your data ready to use for modeling.
Tasks in this phase include:
▸ Selecting data
▸ Cleaning data
▸ Constructing data
▸ Integrating data
▸ Formatting data
MODELING
Use mathematical techniques to identify patterns within your data.
Tasks in this phase include:
▸ Selecting techniques
▸ Designing tests
▸ Building models
▸ Assessing models
EVALUATION
Review the patterns you have discovered and assess their potential for business
use.
Tasks in this phase include:
▸ Evaluating results
▸ Reviewing the process
▸ Determining the next steps
DEPLOYMENT
Put your discoveries to work in everyday business.
Tasks in this phase include:
▸ Planning deployment (your methods for integrating data mining discoveries
into use)
▸ Reporting final results
▸ Reviewing final results
DATA AGGREGATION
Data aggregation is the process where raw data is gathered and expressed in a summary
form for statistical analysis.
For example, raw data can be aggregated over a given time period to provide statistics. After
the data is aggregated and written to a view or report, you can analyze the aggregated data
to gain insights about particular resources or resource groups.
There are two types of data aggregation:
▸ Time aggregation - All data points for a single resource over a specified time period.
▸ Spatial aggregation - All data points for a group of resources over a specified
geographical area.
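The two kinds of aggregation can be sketched as the same grouping operation applied with different keys. The readings, resource names, regions and the helper `aggregate` below are hypothetical, and the mean is just one possible summary; this is an illustration under those assumptions rather than any particular tool's behavior.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw readings: (resource, region, hour, cpu_percent).
readings = [
    ("server-1", "eu", 0, 40.0),
    ("server-1", "eu", 1, 60.0),
    ("server-2", "eu", 0, 20.0),
    ("server-2", "eu", 1, 40.0),
    ("server-3", "us", 0, 80.0),
    ("server-3", "us", 1, 70.0),
]


def aggregate(rows, key):
    """Group rows by key(row) and replace each group with the mean of its values."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row[3])
    return {group: mean(values) for group, values in groups.items()}


# Time aggregation: all data points of a single resource over the period (hours 0-1).
time_agg = aggregate(readings, key=lambda row: row[0])

# Spatial aggregation: all data points of a group of resources in the same area.
spatial_agg = aggregate(readings, key=lambda row: row[1])

print(time_agg)     # {'server-1': 50.0, 'server-2': 30.0, 'server-3': 75.0}
print(spatial_agg)  # {'eu': 40.0, 'us': 75.0}
```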
SUMMARY STATISTICS
When data is aggregated, groups of observations are replaced with summary statistics based on those observations.
Summary statistics are used to communicate the largest amount of information as simply as possible.
▸ Mean
▸ Count
▸ Maximum
▸ Median
▸ Minimum
▸ Mode
▸ Range
▸ Sum
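Each of the statistics listed above can be computed with Python's built-ins and the standard `statistics` module. The observations below are invented for illustration.

```python
from statistics import mean, median, mode

# Hypothetical group of observations to be replaced by a summary.
observations = [4, 8, 8, 15, 16, 23, 45]

summary = {
    "count": len(observations),
    "sum": sum(observations),
    "mean": mean(observations),
    "median": median(observations),    # middle value of the sorted data
    "mode": mode(observations),        # most frequent value
    "minimum": min(observations),
    "maximum": max(observations),
    "range": max(observations) - min(observations),
}

for name, value in summary.items():
    print(f"{name:>8}: {value}")
```

Here the mean (17) is higher than the median (15) because the single large value 45 pulls it upward, which is why it often helps to report more than one statistic.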
TABLES
Tables are the format in which most numerical data are initially stored and analysed and
are likely to be the means you use to organise data collected during experiments and
dissertation research.
Tables are an effective way of presenting data:
• when you wish to show how a single category of information varies when
measured at different points (in time or space).
• when the dataset contains relatively few numbers.
• when the precise value is crucial to your argument and a graph would not convey
BAR CHARTS
Bar charts are one of the most commonly
used types of graph and are used to display
and compare the number, frequency or other
measure for different discrete categories or
groups.
The bars can be drawn either vertically or
horizontally depending upon the number of
categories and length or complexity of the
category labels.
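As a rough sketch of what a bar chart encodes, the snippet below counts how often each discrete category occurs and prints one horizontal bar per category as text. The survey answers are hypothetical, and a plotting library would normally be used instead of `print`.

```python
from collections import Counter

# Hypothetical survey answers: how respondents travel to work.
answers = ["bus", "car", "bike", "car", "bus", "car", "walk"]

# Frequency of each discrete category.
counts = Counter(answers)

# One horizontal bar per category, longest first.
for category, frequency in counts.most_common():
    print(f"{category:<5} {'#' * frequency} {frequency}")
```

Horizontal bars are used here because category labels sit naturally to the left of each bar, which is the same reason they are preferred when labels are long.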
HISTOGRAMS
Histograms are a special form of bar chart
where the data represent continuous rather
than discrete categories. Since a
continuous category may have a large
number of possible values the data are
often grouped to reduce the number of data
points.
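The grouping step can be sketched as follows: continuous values are assigned to equal-width intervals (bins) and only the count per bin is kept. The function `histogram`, the choice of four bins and the height measurements are assumptions made for illustration.

```python
def histogram(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width intervals of [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        if not low <= value <= high:
            continue  # ignore values outside the chosen range
        # A value equal to `high` is placed in the last bin.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    return counts


# Hypothetical continuous measurements (heights in cm).
heights_cm = [151.2, 158.9, 160.0, 163.4, 167.8, 169.9, 170.0, 174.5, 181.3, 189.9]

low, high, bins = 150, 190, 4
counts = histogram(heights_cm, low, high, bins)

# Text rendering: one bar per interval.
width = (high - low) / bins
for i, count in enumerate(counts):
    start = low + i * width
    print(f"{start:.0f}-{start + width:.0f} cm | {'#' * count}")
```

Ten individual data points are reduced to four counts; choosing more or fewer bins trades detail against readability.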
PIE CHARTS
Pie charts are a visual way of displaying how
the total data are distributed between different
categories. Pie charts should only be used for
displaying nominal data. They are generally
best for showing information grouped into a
small number of categories and are a
graphical way of displaying data that might
otherwise be presented as a simple table.
Figure: pie chart of populations of English native speakers.
LINE GRAPHS
Line graphs are usually used to show time
series data, that is, how one or more
variables vary over a continuous period of
time. Line graphs are particularly useful for
identifying patterns and trends in the data
such as seasonal effects, large changes and
turning points. As well as time series data,
line graphs can also be appropriate for
displaying data that are measured over other
continuous variables such as distance.
DEFINITION
Data Science is a blend of various tools, algorithms, and machine learning
principles with the goal of discovering hidden patterns in raw data and solving
analytically complex problems.
EXPLAINING VS PREDICTING
By 2020, more than 80% of data is expected to be unstructured. This data is
generated from sources such as financial logs, text files, multimedia
forms, sensors, and instruments.
▸ The Data Scientist handles raw data using the latest technologies and
techniques, performs the necessary analysis, and presents the acquired
knowledge to colleagues in an informative way.
▸ The Data Analyst works with R, Python and SQL; the role combines technical
and analytical knowledge.
▸ The Data Architect integrates, centralizes, protects and maintains data
sources.
▸ The Statistician can be seen as the pioneer of the data science field. It is often
the statistician who extracts information from the data and turns it into actionable
insights.
▸ The Database Administrator ensures that the database is accessible to every
stakeholder in the organization and takes the necessary measures
to keep the stored data safe.
▸ The Business Analyst is probably the least technical profile but has a deep
understanding of the various business processes that are in place. The Business
Analyst often acts as the intermediary between the business side and the
technicians.
▸ The Data and Analytics Manager steers the direction of the data science
team. The role combines strong, specialized skills in a range of
technologies (SQL, R, SAS, …) with the interpersonal skills required to
manage a team.