Understanding your Data
Data Analytics Lifecycle and Machine Learning
Dr. Abzetdin ADAMOV
Director, Center for Data Analytics Research
School of IT & Engineering
ADA University
aadamov@ada.edu.az
Content
• Why now?
• Data Analytics Lifecycle
• Data Acquisition
• Data Repository
• Data Preprocessing
• Data Analytics and Machine Learning
• Data Visualization
• Data Governance
BIG DATA AT ADA
4th Big Data Day Baku 2018
Understanding your Data - Data Analytics Lifecycle and Machine Learning
Computing Facilities at CeDSRT
Characteristics of computing cluster:
Processing Cores: 102
Memory: 1.568 TB
Storage: 136 TB
AAdamov, CeDAR, ADA University
Student Research - SDP Topics
1. Development of Lexical and Morphological Analysis System
2. Effective Installation of Multi-node Cluster based on Hadoop 3.0
3. Utilizing Artificial Intelligence (AI) to improve quality of life for people with Dementia
4. Statistical Analysis and Data Visualization of DTS Data
5. Development of N-gram Model
6. Development of Semantic Similarity System
7. Development of Sentiment Analysis System
8. Personalized Offers and Customer Retention Platform in Banking
9. Data Retrieval, Storage and Manipulation of DTS Data
10. Network Security and IDS using Machine Learning
11. Development of a Spell Correction System
WHY NOW?
Where Data Comes From
Data is produced by:
• People
• Social Media, Public Web, Smartphones, …
• Organizations (Employer)
• OLTP, OLAP, BI, …
• Machines
• IoT, Satellites, Vehicles, Science, …
Modern Data Sources
→ Internet of Anything (IoAT)
• Wind Turbines, Oil Rigs, Cars
• Weather Stations, Smart Grids
• RFID Tags, Beacons, Wearables
→ User Generated Content (Web & Mobile)
• Twitter, Facebook, Snapchat, YouTube
• Clickstream, Ads, User Engagement
• Payments: PayPal, Venmo
Addressing Data – Digital Universe
[Figure: Digital Universe Growth over time. X-axis: year, 1995–2018; y-axis: data growth in zettabytes, 0–35.]
Addressing Data – Hard Disk Capacity
[Figure: Hard Drive Capacity Growth over time. X-axis: year, 1991–2017; y-axis: capacity in gigabytes, 0–70,000.]
Addressing Data – Storage Cost
[Figure: Data Storage Cost per Gigabyte. X-axis: year, 1980–2020 in 5-year steps; y-axis: price in $ per GB. Data labels in order: 1,200,000; 100,000; 10,000; 800; 10; 1; 0.1; 0.003; 0.002.]
Computation Power CPU and GPU
[Figure: Computation Power CPU and GPU. X-axis: year, 2001–2018; y-axis: GFLOPS, 0–4,500; series: GPU, CPU.]
Data Growth vs. Processing Power
Addressing Data – Transfer Rate
[Figure: Hard Drive Data Transfer Rate. X-axis: year, 1991–2016; y-axis: transfer speed in MB/sec, 0–10,000.]
DATA ANALYTICS LIFECYCLE
Data Analytics Life-Cycle
Data Acquisition → Data Repository → Data Processing → Data Analytics / ML → Data Visualization
• Data Acquisition: Web Crawling, Data Mining, Information Retrieval, …
• Data Repository: Hadoop HDFS, Microsoft Azure, Amazon EC2, Warehouse
• Data Processing: ETL, Parsing, Indexing, Searching, Ranking, NLP, …
• Data Analytics / ML: Statistical Analysis, Machine Learning, R Programming, Python, RapidMiner, Weka, …
Big Data Management involves Data Science and Data Engineering areas for
implementing Data Mining Techniques
DATA ACQUISITION
Data Acquisition Techniques
1. Operational Systems
2. Data Warehouses and Data Marts
3. Online Analytical Processing (OLAP) / BI
4. Web Crawling
5. Data Brokers (Commercial Data)
6. Open Data Sources
7. Experimental Data Collection
8. Online Surveys
Data Acquisition Considerations
• Business Needs
• Data Standards (ISO, ITIS, FGDC, ISDM)
• Accuracy Requirements
• Currency of Data
• Time Constraints
• Format (CSV, XLS, XML, JSON, …)
• Cost
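As a small illustration of the format consideration, the following Python sketch loads the same records from CSV text and from JSON text into one common structure. The field names (`city`, `temp_c`) and the sample values are hypothetical and only the standard library is used.

```python
import csv
import io
import json

# Hypothetical sample data; the field names are illustrative only.
CSV_TEXT = "city,temp_c\nBaku,31.5\nGanja,29.0\n"
JSON_TEXT = '[{"city": "Baku", "temp_c": 31.5}, {"city": "Ganja", "temp_c": 29.0}]'


def load_csv(text):
    """Parse CSV text into a list of dicts, converting temp_c to float."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({"city": row["city"], "temp_c": float(row["temp_c"])})
    return rows


def load_json(text):
    """Parse a JSON array of objects into the same list-of-dicts shape."""
    return [{"city": r["city"], "temp_c": float(r["temp_c"])}
            for r in json.loads(text)]


csv_records = load_csv(CSV_TEXT)
json_records = load_json(JSON_TEXT)
print(csv_records == json_records)  # both formats yield identical records
```

CSV carries every value as text, so the numeric conversion has to be explicit, while JSON already distinguishes numbers from strings.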
DATA REPOSITORY
Traditional Approach in Data Management
[Diagram: a 3 TB file does not fit on a 1 TB hard drive and is split into three 1 TB pieces. Labels: STORAGE (DATA) and PROCESSING (Processor); Raw Data, Processed Data.]
The “Big Data” Problem
Problem
→ A single machine cannot process or even store all the data!
Solution
→ Distribute data over large clusters
Difficulty
→ How to split work across machines?
→ Moving data over the network is expensive
→ Must consider data & network locality
→ How to deal with failures?
→ How to deal with slow nodes?
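One common answer to "how to split work across machines" is the map/reduce pattern: each worker processes only its own partition, and only the small partial results are merged. The sketch below simulates the workers inside a single Python process. The function names and the sample lines are hypothetical, and failures and slow nodes are not handled.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(lines, n_workers):
    """Assign lines to n_workers chunks round-robin (stand-in for partitions)."""
    return [lines[i::n_workers] for i in range(n_workers)]


def map_chunk(chunk):
    """Map step: a worker counts the words of its own chunk only."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the small per-worker results into one total."""
    return reduce(lambda a, b: a + b, partials, Counter())


lines = ["big data is big", "data moves slowly", "move code to data"]
partials = [map_chunk(c) for c in split_into_chunks(lines, n_workers=2)]
totals = reduce_counts(partials)
print(totals["data"], totals["big"])
```

Only the per-chunk counters cross the (simulated) network, which is the point of keeping computation close to the data.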
Addressing Data
• Standard hard drive data transfer speed: 60 – 100 MB/sec
• Solid State Drive (SSD): 250 – 500 MB/sec
• Hard drive capacity is growing RAPIDLY (4 – 60 TB)
• Online data is growing (doubles every 18 months)
• Processing speed is growing at roughly the same rate
• Hard drive transfer speed is relatively FLAT
Moving data IN and OUT of disk is the bottleneck
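A quick back-of-the-envelope calculation shows why disk transfer speed is the bottleneck. The sketch below assumes decimal units (1 TB = 1,000,000 MB), a 100 MB/sec drive from the range quoted above, and perfectly even parallel reads; the function name is hypothetical.

```python
def read_time_seconds(size_tb, speed_mb_per_s, n_disks=1):
    """Time to scan size_tb of data spread evenly over n_disks that are
    read in parallel, each at speed_mb_per_s."""
    size_mb = size_tb * 1_000_000  # decimal units: 1 TB = 1,000,000 MB
    return size_mb / (speed_mb_per_s * n_disks)


one_disk = read_time_seconds(1, 100)            # one hard drive at 100 MB/sec
hundred_disks = read_time_seconds(1, 100, 100)  # same data over 100 drives
print(f"1 disk: {one_disk / 3600:.1f} h, 100 disks: {hundred_disks:.0f} s")
```

Reading 1 TB from a single drive takes close to three hours, while the same data spread over 100 drives can in principle be scanned in under two minutes.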
BIG DATA and TRADITIONAL SYSTEMS
Hadoop Input/Output Model
File NEWS.txt (512 MB) is divided into 4 blocks (A, B, C, D) of 128 MB each.
The CLIENT writes and reads the blocks to and from HDFS sequentially, not in parallel. That is why Hadoop does not improve I/O performance significantly.
The SOLUTION is the data striping technique…
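The block arithmetic from the NEWS.txt example can be sketched in a few lines of Python. The 128 MB block size is taken from the slide; the function name is hypothetical, and the sketch only computes block sizes rather than touching HDFS.

```python
BLOCK_SIZE_MB = 128  # block size shown on the slide


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes of the blocks a file of file_size_mb is cut into.
    Every block is full except possibly the last one."""
    full, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full
    if remainder:
        blocks.append(remainder)
    return blocks


print(split_into_blocks(512))  # NEWS.txt from the slide: four full blocks
print(split_into_blocks(300))  # a file that does not divide evenly
```

A file whose size is not a multiple of the block size simply ends with a smaller final block.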
Distributed Architecture of HDFS
Rack 1 (Switch): DN1, DN2, DN3, DN4
Rack 2 (Switch): DN11, DN12, DN13, DN14
Rack 3 (Switch): DN21, DN22, DN23, DN24
Rack 4 (Switch): DN31, DN32, DN33, DN34
Where to write file ADA.txt (blocks A, B, C, D) in HDFS?
The CLIENT asks the NAMENODE, which returns three DataNodes for each block:
A – DN32, 11, 14
B – DN01, 22, 23
C – DN12, 02, 04
D – DN34, 12, 14
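The placements in this example follow a recognizable pattern: one replica on a node in one rack, and the other two on different nodes of a second rack. The Python sketch below imitates that pattern for the four-rack layout of the slide. It is a simplified illustration under that assumption, not the NameNode's actual placement code, and `place_block` is a hypothetical name.

```python
import random

# Cluster layout from the slide: four racks with four DataNodes each.
RACKS = {
    "Rack 1": ["DN1", "DN2", "DN3", "DN4"],
    "Rack 2": ["DN11", "DN12", "DN13", "DN14"],
    "Rack 3": ["DN21", "DN22", "DN23", "DN24"],
    "Rack 4": ["DN31", "DN32", "DN33", "DN34"],
}


def place_block(racks, rng):
    """Pick three DataNodes: one in a first rack, two in a second rack."""
    first_rack, second_rack = rng.sample(sorted(racks), 2)
    first = rng.choice(racks[first_rack])
    second, third = rng.sample(racks[second_rack], 2)
    return [first, second, third]


rng = random.Random(2018)  # fixed seed so the sketch is repeatable
placement = {block: place_block(RACKS, rng) for block in "ABCD"}
for block, nodes in placement.items():
    print(block, "-", ", ".join(nodes))
```

Spreading replicas over two racks lets a block survive the loss of a whole rack, while keeping two replicas behind one switch limits cross-rack traffic during the write.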
Big Data and Virtualization
Layers are listed from bottom to top:
• Traditional Architecture: HARDWARE → Operating System → App, App, App
• Virtualized Architecture: HARDWARE → HYPERVISOR → OS, OS, OS → App, App, App
• Distributed Architecture: HARDWARE, HARDWARE, HARDWARE → OS, OS, OS → HADOOP HDFS + YARN → App, App, App, App, App, App
BIG DOES NOT MEAN SLOW
DATA PREPROCESSING
Good decisions require Good Data
Why Data Preprocessing?
• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names
• No quality data, no quality Analytics results!
• Quality decisions must be based on quality data
• Data warehouse needs consistent integration of quality data
Multi-Dimensional Measure of Data Quality
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility
Major Tasks in Data Preprocessing
• Data Cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data Integration
• Integration of multiple databases, data cubes, or files
• Data Transformation
• Normalization and aggregation
• Data Reduction
• Obtains reduced representation in volume but produces the same or similar
analytical results
• Data Discretization
• Part of data reduction but with particular importance, especially for numerical data
Data Cleaning and Transformation
• Data Cleaning Tasks:
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Data Transformation Tasks:
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Normalization: scaled to fall within a small, specified range
• Generalization: concept hierarchy climbing
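Two of the tasks above, filling in missing values and normalization, can be sketched with the Python standard library. Median imputation and min-max scaling are only one possible choice for each task, and the sample attribute is hypothetical.

```python
import statistics


def fill_missing(values):
    """Data cleaning: replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Data transformation: linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


raw = [20, None, 30, 40, None, 60]  # hypothetical attribute with gaps
clean = fill_missing(raw)
print(clean)
print(min_max_normalize(clean))
```

The median is used here instead of the mean because it is less sensitive to outliers in the observed values.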
Data Reduction Strategies
• Why data reduction?
• Warehouse may store terabytes of data
• Complex data analysis/mining may take a very long time to run on the complete data
set
• Data reduction
• Obtains a reduced representation of the data set that is much smaller in volume but
yet produces the same (or almost the same) analytical results
• Data reduction strategies
• Data Compression
• Sampling
• Data cube aggregation
• Dimensionality reduction
• Numerosity reduction
• Discretization and concept hierarchy generation
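Two of these strategies, sampling and discretization, are easy to sketch in Python. The data below is hypothetical, simple random sampling and equal-width binning are only one variant of each strategy, and the function names are illustrative.

```python
import random
import statistics


def simple_random_sample(values, k, seed=0):
    """Sampling: draw k values without replacement (seeded for repeatability)."""
    return random.Random(seed).sample(values, k)


def equal_width_bins(values, n_bins):
    """Discretization: map each value to a bin index 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # all values identical: everything falls into bin 0
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum lies exactly on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


data = list(range(1, 10001))             # hypothetical attribute, 10,000 values
sample = simple_random_sample(data, 500)  # 5% of the original volume
print(statistics.mean(data), statistics.mean(sample))  # full vs. reduced data
print(equal_width_bins([2, 5, 9, 14, 20], n_bins=3))
```

The sample mean should land close to the full-data mean, which is the sense in which a reduced representation gives almost the same analytical result.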
DATA ANALYTICS
You can't manage what you can't measure
Skills Requirements for Data Analytics
[Diagram: Data Analytics sits at the intersection of Statistics, Business Domain, and Computer Science.]
Meaning of Statistics
• The word statistics is used in either of two senses:
• Commonly used to refer to data.
• Principles and methods of handling numerical data.
• Statistics is defined as a branch of mathematics that deals with the
collection, analysis and interpretation of numerical information
• Statistics changes numbers into information
• deciding how to collect data efficiently
• using data to give information
• using data to answer questions
• using data to make decisions.
• Statistics is the science of learning from data
Kinds of Data
• Quantitative – Data that is numerical, counted, or compared
• Demographic data
• Answers to closed-ended survey items
• Attendance data
• Scores on standardized instruments
• Qualitative – Narratives, logs, experience
• Interviews
• Open-ended survey items
• Categories
Statistical Measures
• Measure of central tendency
• Mean
• Median
• Mode
• Measure of variation
• Range
• Variance and standard deviation
• Interquartile range
• Proportion, Percentage
• Ratio, Rate
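The measures listed above map directly onto Python's standard `statistics` module. The sketch below uses the example vector from the Mean, Median, Quantile slide; the variable names are illustrative.

```python
import statistics

vec = [5, 3, 6, 8, 9, 2, 11, 8, 3, 8, 10, 9]  # example vector from the slides

# Measures of central tendency
mean = statistics.mean(vec)
median = statistics.median(vec)
mode = statistics.mode(vec)

# Measures of variation
value_range = max(vec) - min(vec)
variance = statistics.variance(vec)  # sample variance (n - 1), like R's var()
std_dev = statistics.stdev(vec)      # sample standard deviation, like R's sd()
q1, _, q3 = statistics.quantiles(vec, n=4, method="inclusive")
iqr = q3 - q1                        # interquartile range

print(round(mean, 2), median, mode, value_range, iqr)
```

The `method="inclusive"` option interpolates between order statistics in the same way as the default of R's `quantile()`.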
Statistical Analytics in R
• mean(), max(), min()
• median()
• var(), sd()
• cor()
• quantile()
• summary()
• hist()
• plot()
Mean, Median, Quantile
vec <- c(5, 3, 6, 8, 9, 2, 11, 8, 3, 8, 10, 9)
mean(vec)    # 6.83
median(vec)  # 8
vec <- c(5, 3, 6, 8, 9, 2, 11, 8, 3, 8, 10, 1000)  # one outlier
mean(vec)    # 89.42
median(vec)  # 8
quantile(vec)
     0%     25%     50%     75%    100%
   2.00    4.50    8.00    9.25 1000.00
quantile(vec, probs = c(0, .75, 1))
     0%     75%    100%
   2.00    9.25 1000.00
quantile(vec, probs = seq(0, 1, .1))
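The quartiles above can be reproduced by hand. The Python sketch below implements linear interpolation between order statistics, which is the rule R's `quantile()` applies by default (type 7). The function name is illustrative.

```python
def quantile(values, p):
    """Quantile by linear interpolation between order statistics
    (the rule R's quantile() uses by default, type 7)."""
    xs = sorted(values)
    h = (len(xs) - 1) * p                # fractional position in the sorted data
    lower = int(h)                       # index of the order statistic below h
    upper = min(lower + 1, len(xs) - 1)  # stay in range when p == 1
    return xs[lower] + (h - lower) * (xs[upper] - xs[lower])


vec = [5, 3, 6, 8, 9, 2, 11, 8, 3, 8, 10, 1000]  # slide vector with the outlier
print([quantile(vec, p) for p in (0, 0.25, 0.5, 0.75, 1)])
```

For p = 0.25 the position is 11 × 0.25 = 2.75, so the result lies three quarters of the way from the third sorted value (3) to the fourth (5), which gives 4.5.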
Correlation
cor(mtcars[, c("mpg", "cyl", "disp", "wt")])
            mpg        cyl       disp         wt
mpg   1.0000000 -0.8521620 -0.8475514 -0.8676594
cyl  -0.8521620  1.0000000  0.9020329  0.7824958
disp -0.8475514  0.9020329  1.0000000  0.8879799
wt   -0.8676594  0.7824958  0.8879799  1.0000000
Sample rows of mtcars:
                   mpg cyl  disp    wt
Mazda RX4         21.0   6 160.0 2.620
Mazda RX4 Wag     21.0   6 160.0 2.875
Datsun 710        22.8   4 108.0 2.320
Hornet 4 Drive    21.4   6 258.0 3.215
Hornet Sportabout 18.7   8 360.0 3.440
Valiant           18.1   6 225.0 3.460
Duster 360        14.3   8 360.0 3.570
Merc 240D         24.4   4 146.7 3.190
Merc 230          22.8   4 140.8 3.150
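For readers who want to see what `cor()` computes, here is a minimal Python sketch of the Pearson correlation coefficient. It uses only the nine mtcars rows printed on the slide, so its result will differ from the -0.868 that R reports for mpg and wt over the full data set.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the
    standard deviations (the default method of R's cor())."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# The nine mtcars rows printed on the slide (mpg and wt columns only).
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8]
wt = [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150]
print(round(pearson(mpg, wt), 3))
```

A negative value means that heavier cars in this sample tend to have lower fuel efficiency, in line with the sign of the full correlation matrix.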
Summary Function
summary(mtcars[, c("mpg", "cyl", "disp", "wt")])
mpg cyl disp wt
Min. :10.40 Min. :4.000 Min. : 71.1 Min. :1.513
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.:2.581
Median :19.20 Median :6.000 Median :196.3 Median :3.325
Mean :20.09 Mean :6.188 Mean :230.7 Mean :3.217
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:3.610
Max. :33.90 Max. :8.000 Max. :472.0 Max. :5.424
DATA VISUALIZATION
Anscombe's Quartet
I II III IV
x y x y x y x y
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47
14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04
6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25
4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50
12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56
7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91
5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89
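The reason this table belongs in a visualization section becomes clear once the summary statistics are computed. The Python sketch below takes the four data sets exactly as listed in the table above and prints the mean of x, mean of y, variance of x, and correlation for each. The helper name `pearson` is illustrative.

```python
import math
import statistics

# Anscombe's quartet as listed in the table above.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))


summary = {}
for name, (x, y) in quartet.items():
    # (mean of x, mean of y, sample variance of x, correlation), rounded
    summary[name] = (statistics.mean(x), round(statistics.mean(y), 2),
                     round(statistics.variance(x), 2), round(pearson(x, y), 2))
    print(name, summary[name])
```

All four data sets share the same rounded statistics, yet they look completely different when plotted, which is why summary numbers alone are not enough.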
Visualization
MACHINE LEARNING
What is Machine Learning?
• “Machine learning refers to a system capable of the autonomous
acquisition and integration of knowledge.”
• “Learning denotes changes in a system that ... enable a system to do
the same task … more efficiently the next time.” - Herbert Simon
• Automating automation
• Getting computers to program themselves
• Writing software is the bottleneck
• Let the data do the work instead!
• Machine learning is primarily concerned with the accuracy and
effectiveness of the computer system.
Why Machine Learning?
• No human experts
• industrial/manufacturing control
• mass spectrometer analysis, drug design, astronomic discovery
• Black-box human expertise
• face/handwriting/speech recognition
• driving a car, flying a plane
• Rapidly changing phenomena
• credit scoring, financial modeling
• diagnosis, fraud detection
• Need for customization/personalization
• personalized news reader
• movie/book recommendation
Machine Learning Algorithms Categories
• Supervised Learning algorithms
• Logistic Regression
• Neural Networks
• Support Vector Machines (SVMs), Naive Bayes classifiers, Random Forests
• Unsupervised Learning algorithms
• K-means, Hierarchical clustering
• Semi-supervised Learning algorithms
• Reinforcement Learning algorithms (e.g., self-driving cars)
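To make the unsupervised category concrete, here is a minimal pure-Python sketch of k-means on one-dimensional data. No labels are given; the algorithm finds the groups itself. The data, the function name, and the simple spread-out initialization are assumptions made for illustration, and production libraries use more careful initialization.

```python
def k_means_1d(points, k, iterations=10):
    """Plain k-means on numbers: assign each point to its nearest centroid,
    then move every centroid to the mean of its assigned points."""
    xs = sorted(points)
    step = max(1, len(xs) // k)
    centroids = xs[::step][:k]  # spread-out starting centroids
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]  # hypothetical unlabeled data
centroids, clusters = k_means_1d(points, k=2)
print(centroids, clusters)
```

A supervised algorithm would instead receive the group label of every training point and learn to predict it for new points.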
ML vs Traditional Programming
Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program
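The contrast can be shown with a toy example in Python. In the first function a person writes the rule; in the second, a least-squares fit recovers the same rule from example inputs and outputs. The temperature-conversion task and all names are chosen for illustration only.

```python
def celsius_to_fahrenheit(c):
    """Traditional programming: a person writes the rule."""
    return 1.8 * c + 32


def fit_line(xs, ys):
    """Machine learning in miniature: least squares finds the slope and
    intercept from example inputs and outputs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


# Data plus known outputs go in ...
inputs = [-40, 0, 37, 100]
outputs = [-40, 32, 98.6, 212]
# ... and the "program" (the two coefficients) comes out.
slope, intercept = fit_line(inputs, outputs)
print(round(slope, 3), round(intercept, 3))
```

The fitted coefficients match the hand-written rule because the examples are noise-free; with real data the learned program is only an approximation.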
Machine Learning in Python
DATA GOVERNANCE
Data Governance Metrics
• Digital Culture
• Naming Standard
• Professional Terms and Abbreviations
• Data Model, Documentation and Relationship
• Data Quality Rules and Metrics
• Hierarchy of Data Artifacts / Entities
• Classify your Data:
• Master Data, Transactional Data, Reference Data
DATA is the NEW OIL!
But do you have the capacity to refine it?
Information is the oil of the 21st century,
and Analytics is the Combustion Engine
Q & A ?
Dr. Abzetdin Adamov
Email me at: aadamov@ada.edu.az
Follow me at: @
Link to me at: www.linkedin.com/in/adamov
Visit my blog at: aadamov.wordpress.com

Weitere ähnliche Inhalte

Was ist angesagt?

How to Build Data Governance Programs That Last: A Business-First Approach
How to Build Data Governance Programs That Last: A Business-First ApproachHow to Build Data Governance Programs That Last: A Business-First Approach
How to Build Data Governance Programs That Last: A Business-First ApproachPrecisely
 
Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Simplilearn
 
DATA & ANALYTICS
DATA & ANALYTICSDATA & ANALYTICS
DATA & ANALYTICSfireflylabz
 
Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysisAmazon Web Services
 
Data Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital TransformationData Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital TransformationDATAVERSITY
 
Accelerate and modernize your data pipelines
Accelerate and modernize your data pipelinesAccelerate and modernize your data pipelines
Accelerate and modernize your data pipelinesPaul Van Siclen
 
data-analytics-strategy-ebook.pptx
data-analytics-strategy-ebook.pptxdata-analytics-strategy-ebook.pptx
data-analytics-strategy-ebook.pptxMohamedHendawy17
 
FAIRy stories: the FAIR Data principles in theory and in practice
FAIRy stories: the FAIR Data principles in theory and in practiceFAIRy stories: the FAIR Data principles in theory and in practice
FAIRy stories: the FAIR Data principles in theory and in practiceCarole Goble
 
The Death of the Star Schema
The Death of the Star SchemaThe Death of the Star Schema
The Death of the Star SchemaDATAVERSITY
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overviewJames Serra
 
Let’s get to know Snowflake
Let’s get to know SnowflakeLet’s get to know Snowflake
Let’s get to know SnowflakeKnoldus Inc.
 
Demystifying Data Warehouse as a Service
Demystifying Data Warehouse as a ServiceDemystifying Data Warehouse as a Service
Demystifying Data Warehouse as a ServiceSnowflake Computing
 
Data Governance and Metadata Management
Data Governance and Metadata ManagementData Governance and Metadata Management
Data Governance and Metadata Management DATAVERSITY
 
Data strategy demistifying data
Data strategy demistifying dataData strategy demistifying data
Data strategy demistifying dataHans Verstraeten
 
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021Tristan Baker
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
Five Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data GovernanceFive Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data GovernanceDATAVERSITY
 

Was ist angesagt? (20)

How to Build Data Governance Programs That Last: A Business-First Approach
How to Build Data Governance Programs That Last: A Business-First ApproachHow to Build Data Governance Programs That Last: A Business-First Approach
How to Build Data Governance Programs That Last: A Business-First Approach
 
Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...Data Science Training | Data Science For Beginners | Data Science With Python...
Data Science Training | Data Science For Beginners | Data Science With Python...
 
DATA & ANALYTICS
DATA & ANALYTICSDATA & ANALYTICS
DATA & ANALYTICS
 
Streaming data for real time analysis
Streaming data for real time analysisStreaming data for real time analysis
Streaming data for real time analysis
 
Data Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital TransformationData Architecture Strategies: Data Architecture for Digital Transformation
Data Architecture Strategies: Data Architecture for Digital Transformation
 
Accelerate and modernize your data pipelines
Accelerate and modernize your data pipelinesAccelerate and modernize your data pipelines
Accelerate and modernize your data pipelines
 
data-analytics-strategy-ebook.pptx
data-analytics-strategy-ebook.pptxdata-analytics-strategy-ebook.pptx
data-analytics-strategy-ebook.pptx
 
FAIRy stories: the FAIR Data principles in theory and in practice
FAIRy stories: the FAIR Data principles in theory and in practiceFAIRy stories: the FAIR Data principles in theory and in practice
FAIRy stories: the FAIR Data principles in theory and in practice
 
The Death of the Star Schema
The Death of the Star SchemaThe Death of the Star Schema
The Death of the Star Schema
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
Let’s get to know Snowflake
Let’s get to know SnowflakeLet’s get to know Snowflake
Let’s get to know Snowflake
 
Demystifying Data Warehouse as a Service
Demystifying Data Warehouse as a ServiceDemystifying Data Warehouse as a Service
Demystifying Data Warehouse as a Service
 
Data Governance and Metadata Management
Data Governance and Metadata ManagementData Governance and Metadata Management
Data Governance and Metadata Management
 
Big Data Ecosystem
Big Data EcosystemBig Data Ecosystem
Big Data Ecosystem
 
Data strategy demistifying data
Data strategy demistifying dataData strategy demistifying data
Data strategy demistifying data
 
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
Intuit's Data Mesh - Data Mesh Leaning Community meetup 5.13.2021
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Hybrid Data Platform
Hybrid Data Platform Hybrid Data Platform
Hybrid Data Platform
 
Five Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data GovernanceFive Things to Consider About Data Mesh and Data Governance
Five Things to Consider About Data Mesh and Data Governance
 
Predictive Analytics - An Introduction
Predictive Analytics - An IntroductionPredictive Analytics - An Introduction
Predictive Analytics - An Introduction
 

Ähnlich wie Understanding your Data - Data Analytics Lifecycle and Machine Learning

Big Data and High Performance Computing
Big Data and High Performance ComputingBig Data and High Performance Computing
Big Data and High Performance ComputingAbzetdin Adamov
 
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data TorrentSeagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data TorrentSeeling Cheung
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Tech Triveni
 
AzureDay - Introduction Big Data Analytics.
AzureDay  - Introduction Big Data Analytics.AzureDay  - Introduction Big Data Analytics.
AzureDay - Introduction Big Data Analytics.Łukasz Grala
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousingEr. Nawaraj Bhandari
 
Architectures styles and deployment on the hadoop
Architectures styles and deployment on the hadoopArchitectures styles and deployment on the hadoop
Architectures styles and deployment on the hadoopAnu Ravindranath
 
The Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They NeedThe Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They NeedDunn Solutions Group
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2RojaT4
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3Simon Ambridge
 
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...DATAVERSITY
 
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...Big Data Value Association
 
Data Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptxData Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptxPriyadarshini648418
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
Pitfalls of Data Warehousing_2019-04-24
Pitfalls of Data Warehousing_2019-04-24Pitfalls of Data Warehousing_2019-04-24
Pitfalls of Data Warehousing_2019-04-24Martin Bém
 
Gse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedGse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedcedrinemadera
 

Ähnlich wie Understanding your Data - Data Analytics Lifecycle and Machine Learning (20)

Big Data and High Performance Computing
Big Data and High Performance ComputingBig Data and High Performance Computing
Big Data and High Performance Computing
 
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data TorrentSeagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
 
Dw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhanDw 07032018-dr pl pradhan
Dw 07032018-dr pl pradhan
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
 
AzureDay - Introduction Big Data Analytics.
AzureDay  - Introduction Big Data Analytics.AzureDay  - Introduction Big Data Analytics.
AzureDay - Introduction Big Data Analytics.
 
Introduction to data mining and data warehousing
Introduction to data mining and data warehousingIntroduction to data mining and data warehousing
Introduction to data mining and data warehousing
 
Architectures styles and deployment on the hadoop
Architectures styles and deployment on the hadoopArchitectures styles and deployment on the hadoop
Architectures styles and deployment on the hadoop
 
The Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They NeedThe Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They Need
 
Bigdata analytics
Bigdata analyticsBigdata analytics
Bigdata analytics
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
 
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
 
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Dam...
 
Data Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptxData Science Machine Lerning Bigdat.pptx
Data Science Machine Lerning Bigdat.pptx
 
Big Data Boom
Big Data BoomBig Data Boom
Big Data Boom
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Big data
Big dataBig data
Big data
 
Pitfalls of Data Warehousing_2019-04-24
Pitfalls of Data Warehousing_2019-04-24Pitfalls of Data Warehousing_2019-04-24
Pitfalls of Data Warehousing_2019-04-24
 
Gse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-sharedGse uk-cedrinemadera-2018-shared
Gse uk-cedrinemadera-2018-shared
 
SoftServe BI/BigData Workshop in Utah
SoftServe BI/BigData Workshop in UtahSoftServe BI/BigData Workshop in Utah
SoftServe BI/BigData Workshop in Utah
 

Mehr von Abzetdin Adamov

Big Data Ecosystem for Data-Driven Decision Making
Big Data Ecosystem for Data-Driven Decision MakingBig Data Ecosystem for Data-Driven Decision Making
Big Data Ecosystem for Data-Driven Decision MakingAbzetdin Adamov
 
Latest Trends in Technology: BigData Analytics, Virtualization, Cloud Computi...
Latest Trends in Technology:BigData Analytics, Virtualization, Cloud Computi...Latest Trends in Technology:BigData Analytics, Virtualization, Cloud Computi...
Latest Trends in Technology: BigData Analytics, Virtualization, Cloud Computi...Abzetdin Adamov
 
Steps and Tips to Protect Yourself and your Private Information while Online....
Steps and Tips to Protect Yourself and your Private Information while Online....Steps and Tips to Protect Yourself and your Private Information while Online....
Steps and Tips to Protect Yourself and your Private Information while Online....Abzetdin Adamov
 
Technical, Legal and Political Issues of Combating Terrorism on the Internet.
Technical, Legal and Political Issues of Combating Terrorism on the Internet.Technical, Legal and Political Issues of Combating Terrorism on the Internet.
Technical, Legal and Political Issues of Combating Terrorism on the Internet.Abzetdin Adamov
 
Introduction to object oriented programming
Introduction to object oriented programmingIntroduction to object oriented programming
Introduction to object oriented programmingAbzetdin Adamov
 
Qafqaz university-inegrated-management-information-system
Qafqaz university-inegrated-management-information-systemQafqaz university-inegrated-management-information-system
Qafqaz university-inegrated-management-information-systemAbzetdin Adamov
 
Üniversite Bilgi Sistemi - Birimlerin İşbirliği Platformu
Üniversite Bilgi Sistemi - Birimlerin İşbirliği PlatformuÜniversite Bilgi Sistemi - Birimlerin İşbirliği Platformu
Üniversite Bilgi Sistemi - Birimlerin İşbirliği PlatformuAbzetdin Adamov
 
INFORMATION TECHNOLOGIES AS THE BASE OF THE BUSINESS PROCESS MANAGEMENT IMPLE...
INFORMATION TECHNOLOGIES AS THE BASE OF THE BUSINESS PROCESS MANAGEMENT IMPLE...INFORMATION TECHNOLOGIES AS THE BASE OF THE BUSINESS PROCESS MANAGEMENT IMPLE...
INFORMATION TECHNOLOGIES AS THE BASE OF THE BUSINESS PROCESS MANAGEMENT IMPLE...Abzetdin Adamov
 
e-Government Strategy. Government Transformation in Developing Countries of t...
e-Government Strategy. Government Transformation in Developing Countries of t...e-Government Strategy. Government Transformation in Developing Countries of t...
e-Government Strategy. Government Transformation in Developing Countries of t...Abzetdin Adamov
 
The Truth about Cloud Computing as new Paradigm in IT
The Truth about Cloud Computing  as new Paradigm in ITThe Truth about Cloud Computing  as new Paradigm in IT
The Truth about Cloud Computing as new Paradigm in ITAbzetdin Adamov
 
The Role of Business Process Management in Success of the e-Government Projec...
The Role of Business Process Management in Success of the e-Government Projec...The Role of Business Process Management in Success of the e-Government Projec...
The Role of Business Process Management in Success of the e-Government Projec...Abzetdin Adamov
 
University Management Information System
University Management Information SystemUniversity Management Information System
University Management Information SystemAbzetdin Adamov
 

Understanding your Data - Data Analytics Lifecycle and Machine Learning

  • 1. Understanding your Data Data Analytics Lifecycle and Machine Learning Dr. Abzetdin ADAMOV Director, Center for Data Analytics Research School of IT & Engineering ADA University aadamov@ada.edu.az
  • 2. Content • Why now? • Data Analytics Lifecycle • Data Acquisition • Data Repository • Data Preprocessing • Data Analytics and Machine Learning • Data Visualization • Data Governance
  • 4. 4th Big Data Day Baku 2018
  • 6. Computing Facilities at CeDSRT Characteristics of computing cluster: Processing Cores: 102 Memory: 1,568 TB Storage: 136 TB
  • 7. AAdamov, CeDAR, ADA University
  • 8. Student Research - SDP Topics 1. Development of Lexical and Morphological Analysis System 2. Effective Installation of Multi-node Cluster based on Hadoop 3.0 3. Utilizing Artificial Intelligence (AI) to improve quality of life for people with Dementia 4. Statistical Analysis and Data Visualization of DTS Data 5. Development of N-gram Model 6. Development of Semantic Similarity System 7. Development of Sentiment Analysis System 8. Personalized Offers and Customer Retention Platform in Banking 9. Data Retrieval, Storage and Manipulation of DTS Data 10. Network Security and IDS using Machine Learning 11. Development of Spell Correction System
  • 10. Where Data Comes From Data is produced by: • People • Social Media, Public Web, Smartphones, … • Organizations (Employer) • OLTP, OLAP, BI, … • Machines • IoT, Satellites, Vehicles, Science, …
  • 11. Modern Data Sources → Internet of Anything (IoAT) • Wind Turbines, Oil Rigs, Cars • Weather Stations, Smart Grids • RFID Tags, Beacons, Wearables → User Generated Content (Web & Mobile) • Twitter, Facebook, Snapchat, YouTube • Clickstream, Ads, User Engagement • Payments: PayPal, Venmo
  • 12. Addressing Data – Digital Universe [Chart: Digital Universe Growth over time; x-axis: years 1995 to 2018; y-axis: data growth in zettabytes, 0 to 35]
  • 13. Addressing Data – Hard Disk Capacity [Chart: Hard Drive Capacity Growth over time; x-axis: years 1991 to 2017; y-axis: capacity in gigabytes, 0 to 70,000]
  • 14. Addressing Data – Storage Cost [Chart: Data Storage Cost per Gigabyte; x-axis: years 1980 to 2020; y-axis: price in $ per gigabyte, 0 to 1,400,000; data labels in chart order: 1200000, 100000, 10000, 800, 10, 1, 0.1, 0.003, 0.002]
  • 15. Computation Power CPU and GPU [Chart: two series, GPU and CPU; x-axis: years 2001 to 2018; y-axis: GFLOPS, 0 to 4,500]
  • 16. Data Growth vs. Processing Power
  • 17. Addressing Data – Transfer Rate [Chart: Hard Drive Data Transfer Rate; x-axis: years 1991 to 2016; y-axis: transfer speed in MB/sec, 0 to 10,000]
  • 19. Data Analytics Life-Cycle: Data Acquisition → Data Repository → Data Processing → Data Analytics / ML → Data Visualization. Data Acquisition: Web Crawling, Data Mining, Information Retrieval, …. Data Repository: Hadoop HDFS, Microsoft Azure, Amazon EC2, Warehouse. Data Processing: ETL, Parsing, Indexing, Searching, Ranking, NLP, …. Data Analytics / ML: Statistical Analysis, Machine Learning, R Programming, Python, RapidMiner, Weka, …. Big Data Management involves Data Science and Data Engineering areas for implementing Data Mining Techniques
  • 21. Data Acquisition Techniques 1. Operational Systems 2. Data Warehouses and Data Marts 3. Online Analytical Processing (OLAP) / BI 4. Web Crawling 5. Data Brokers (Commercial Data) 6. Open Data Sources 7. Experimental Data Collection 8. Online Surveys
  • 22. Data Acquisition Considerations • Business Needs • Data Standards (ISO, ITIS, FGDC, ISDM) • Accuracy Requirements • Currency of Data • Time Constraints • Format (CSV, XLS, XML, JSON, …) • Cost
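As a small illustration of the Format consideration on slide 22, the sketch below loads the same records from CSV and from JSON using only the Python standard library. The field names (`id`, `city`) and the sample values are made up for the example and do not come from the deck.

```python
import csv
import io
import json

# Hypothetical sample data: the same two records in two exchange formats.
CSV_TEXT = "id,city\n1,Baku\n2,Ganja\n"
JSON_TEXT = '[{"id": 1, "city": "Baku"}, {"id": 2, "city": "Ganja"}]'

def load_csv(text):
    """Parse CSV text into a list of dicts, converting id to int."""
    rows = csv.DictReader(io.StringIO(text))
    return [{"id": int(r["id"]), "city": r["city"]} for r in rows]

def load_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)

csv_records = load_csv(CSV_TEXT)
json_records = load_json(JSON_TEXT)
print(csv_records == json_records)
```

CSV carries no type information, so the loader has to convert `id` explicitly, while JSON preserves the number as written. That difference is one reason format is worth deciding before acquisition starts.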
  • 24. Traditional Approach in Data Management [Diagram labels: 1TB Hard Drive; 3 TB file; 1TB of Data (×3); STORAGE; PROCESSING; DATA; Processor; Raw Data; Processed Data]
  • 25. The “Big Data” Problem. Problem: → A single machine cannot process or even store all the data! Solution: → Distribute data over large clusters. Difficulty: → How to split work across machines? → Moving data over network is expensive → Must consider data & network locality → How to deal with failures? → How to deal with slow nodes?
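The "how to split work across machines" question on slide 25 can be sketched with a word count in the map/reduce style. This is only a single-process illustration: the function names and the sample lines are assumptions, each chunk stands in for one machine, and the sketch does not address failures, slow nodes, or data locality.

```python
from collections import Counter
from functools import reduce

def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks nearly equal contiguous parts."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def map_count(lines):
    """Map step: count the words in one chunk (one simulated machine)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def reduce_counts(a, b):
    """Reduce step: merge two partial counts."""
    return a + b

# Made-up input lines for the illustration.
lines = ["big data is big", "data is the new oil", "big does not mean slow"]
partials = [map_count(chunk) for chunk in split_into_chunks(lines, 3)]
total = reduce(reduce_counts, partials, Counter())
print(total["big"], total["data"])
```

Because each partial count depends only on its own chunk, the map step could run on separate nodes, and only the small partial results would need to cross the network.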
  • 26. Addressing Data • Standard Hard Drive data transmission speed: 60 – 100 MB/sec • Solid State Drive (SSD): 250 – 500 MB/sec • Hard Drive capacity growing RAPIDLY (4 – 60 TB) • Online data growth (doubles every 18 months) • Processing Speed (relatively same growth) • Hard Drive transmission speed is relatively FLAT. Moving Data IN and OUT of disk is the Bottleneck
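A quick back-of-the-envelope calculation shows why disk throughput is the bottleneck named on slide 26. The sketch assumes decimal units (1 TB = 1,000,000 MB) and uses the upper ends of the speed ranges quoted on the slide; the ten-drive scenario is an illustrative assumption.

```python
def read_time_seconds(size_tb, speed_mb_per_s):
    """Time to stream size_tb terabytes at speed_mb_per_s (1 TB = 1,000,000 MB)."""
    return size_tb * 1_000_000 / speed_mb_per_s

def as_hours(seconds):
    """Convert seconds to hours."""
    return seconds / 3600

hdd = read_time_seconds(1, 100)   # upper end of the hard drive range on the slide
ssd = read_time_seconds(1, 500)   # upper end of the SSD range on the slide
print(f"HDD: {as_hours(hdd):.1f} h, SSD: {as_hours(ssd):.1f} h")

# Assumed scenario: ten hard drives read in parallel, each holding a tenth of the data.
parallel = read_time_seconds(0.1, 100)
print(f"10 HDDs in parallel: {as_hours(parallel):.2f} h")
```

Reading a single terabyte from one hard drive takes hours, so spreading the data over many drives that are read at the same time is what brings the wall-clock time down.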
  • 27. BIG DATA and TRADITIONAL SYSTEMS
  • 28. Hadoop Input/Output Model. File NEWS.txt (512 MB) is divided into 4 blocks of 128 MB each (A, B, C, D). The CLIENT writes and reads blocks to and from HDFS sequentially, not in parallel. That is why Hadoop does not affect I/O performance significantly. The SOLUTION is the Data Striping technique…
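The block split described on slide 28 can be sketched as follows. The 128 MB block size and the A, B, C, D labels come from the slide; the function name is an assumption, and the lettering only makes sense for files of up to 26 blocks.

```python
BLOCK_SIZE_MB = 128  # block size shown on the slide

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return a list of (label, size_mb) blocks; the last block may be smaller."""
    blocks = []
    offset = 0
    index = 0
    while offset < file_size_mb:
        size = min(block_size_mb, file_size_mb - offset)
        blocks.append((chr(ord("A") + index), size))
        offset += size
        index += 1
    return blocks

blocks = split_into_blocks(512)   # the 512 MB NEWS.txt example
print(blocks)
```

A file whose size is not a multiple of the block size simply ends with a shorter final block, for example a 300 MB file yields blocks of 128, 128, and 44 MB.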
  • 29. Distributed Architecture of HDFS. [Diagram: four racks, each behind its own switch: Rack 1 (DN1, DN2, DN3, DN4), Rack 2 (DN11, DN12, DN13, DN14), Rack 3 (DN21, DN22, DN23, DN24), Rack 4 (DN31, DN32, DN33, DN34); a CLIENT holding blocks A, B, C, D and a NAMENODE.] Where to write file ADA.txt (blocks A, B, C, D) in HDFS? A – DN32, 11, 14; B – DN01, 22, 23; C – DN12, 02, 04; D – DN34, 12, 14
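The replica lists on slide 29 all follow one pattern: the first replica sits on one rack and the other two share a different rack (for example A – DN32, 11, 14). The sketch below reproduces only that pattern with random choices. It is a simplification and not the NameNode's actual algorithm, which also takes the writer's location, node load, and free space into account; the function names are assumptions.

```python
import random

# Cluster layout taken from the slide: four racks of four DataNodes each.
RACKS = {
    "Rack 1": ["DN1", "DN2", "DN3", "DN4"],
    "Rack 2": ["DN11", "DN12", "DN13", "DN14"],
    "Rack 3": ["DN21", "DN22", "DN23", "DN24"],
    "Rack 4": ["DN31", "DN32", "DN33", "DN34"],
}

def rack_of(node):
    """Return the name of the rack that contains the given DataNode."""
    return next(rack for rack, nodes in RACKS.items() if node in nodes)

def place_replicas(rng):
    """Pick three DataNodes: one on a first rack, two on a different rack."""
    first_rack, second_rack = rng.sample(list(RACKS), 2)
    first = rng.choice(RACKS[first_rack])
    second, third = rng.sample(RACKS[second_rack], 2)
    return [first, second, third]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
placement = {block: place_replicas(rng) for block in "ABCD"}
for block, nodes in placement.items():
    print(block, nodes, [rack_of(n) for n in nodes])
```

Keeping two replicas on one rack limits cross-rack traffic during the write, while the replica on the other rack means the block survives the loss of a whole rack.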
  • 30. Big Data and Virtualization [Diagram comparing three stacks. Traditional Architecture: Apps on one Operating System on HARDWARE. Virtualized Architecture: Apps on several OS instances on a HYPERVISOR on HARDWARE. Distributed Architecture: Apps on HADOOP (HDFS + YARN) across several OS instances, each on its own HARDWARE.]
  • 31. BIG DOES NOT MEAN SLOW
  • 33. Why Data Preprocessing? • Data in the real world is dirty • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data • noisy: containing errors or outliers • inconsistent: containing discrepancies in codes or names • No quality data, no quality Analytics results! • Quality decisions must be based on quality data • Data warehouse needs consistent integration of quality data
  • 34. Multi-Dimensional Measure of Data Quality • Accuracy • Completeness • Consistency • Timeliness • Believability • Value added • Interpretability • Accessibility
  • 35. Major Tasks in Data Preprocessing • Data Cleaning • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies • Data Integration • Integration of multiple databases, data cubes, or files • Data Transformation • Normalization and aggregation • Data Reduction • Obtains reduced representation in volume but produces the same or similar analytical results • Data Discretization • Part of data reduction but with particular importance, especially for numerical data
  • 36. Data Cleaning and Transformation • Data Cleaning Tasks: • Fill in missing values • Identify outliers and smooth out noisy data • Correct inconsistent data • Data Transformation Tasks: • Smoothing: remove noise from data • Aggregation: summarization, data cube construction • Normalization: scaled to fall within a small, specified range • Generalization: concept hierarchy climbing
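Two of the tasks on slide 36, filling in missing values and normalization, can be sketched in a few lines. Mean imputation and min-max scaling are used here as common choices; the slide does not prescribe a particular method, and the sample values are made up.

```python
from statistics import mean

def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so map everything to the lower bound.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

raw = [10, None, 30, 20, None, 40]   # made-up sample with gaps
cleaned = fill_missing(raw)
scaled = min_max_normalize(cleaned)
print(cleaned)
print(scaled)
```

Mean imputation keeps the column average unchanged but shrinks its variance, so it is a reasonable default only when few values are missing.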
  • 37. Data Reduction Strategies • Why data reduction? • Warehouse may store terabytes of data • Complex data analysis/mining may take a very long time to run on the complete data set • Data reduction • Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results • Data reduction strategies • Data Compression • Sampling • Data cube aggregation • Dimensionality reduction • Numerosity reduction • Discretization and concept hierarchy generation
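Sampling, one of the reduction strategies listed on slide 37, can be illustrated with a 1% simple random sample. The data below is synthetic (normally distributed with an assumed mean of 50 and standard deviation of 10), so the sketch only shows the idea that a small sample can give almost the same summary as the full data set.

```python
import random
from statistics import mean

rng = random.Random(42)   # fixed seed for repeatability
population = [rng.gauss(50, 10) for _ in range(100_000)]   # synthetic data
sample = rng.sample(population, 1_000)   # 1% simple random sample

full_mean = mean(population)
sample_mean = mean(sample)
print(f"full mean:   {full_mean:.2f}")
print(f"sample mean: {sample_mean:.2f}")
```

How close the two means are depends on the sample size and on the spread of the data, so the sample fraction should be chosen with the required accuracy in mind.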
  • 38. DATA ANALYTICS You can't manage what you can't measure
  • 39. Skills Requirements for Data Analytics [Diagram: Data Analytics at the intersection of Statistics, Business Domain, and Computer Science]
  • 40. Meaning of Statistics • The word statistics is used in either of two senses: • Commonly used to refer to data. • Principles and methods of handling numerical data. • Statistics is defined as a branch of mathematics that deals with the collection, analysis and interpretation of numerical information • Statistics changes numbers into information • deciding how to collect data efficiently • using data to give information • using data to answer questions • using data to make decisions. • Statistics is the science of learning from data
  • 41. Kinds of Data • Quantitative – Data that is numerical, counted, or compared • Demographic data • Answers to closed-ended survey items • Attendance data • Scores on standardized instruments • Qualitative – Narratives, logs, experience • Interviews • Open-ended survey items • Categories
  • 42. Statistical Measures • Measure of central tendency • Mean • Median • Mode • Measure of variation • Range • Variance and standard deviation • Interquartile range • Proportion, Percentage • Ratio, Rate
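The measures listed on slide 42 can be computed with Python's standard `statistics` module, as sketched below. The sample vector is the one that appears on slide 44; the choice of the inclusive quantile method for the interquartile range is an assumption, since the slide does not name a method.

```python
import statistics as st

data = [5, 3, 6, 8, 9, 2, 11, 8, 3, 8, 10, 9]   # the vector used on slide 44

# Quartiles with linear interpolation between the sorted data points.
q1, q2, q3 = st.quantiles(data, n=4, method="inclusive")

measures = {
    "mean": st.mean(data),
    "median": st.median(data),
    "mode": st.mode(data),
    "range": max(data) - min(data),
    "variance": st.variance(data),   # sample variance (n - 1 denominator)
    "stdev": st.stdev(data),
    "iqr": q3 - q1,
}
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

The range uses only the two extreme values, while the interquartile range ignores the outer quarters of the data, which makes it the more robust of the two measures of variation.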
  • 43. Statistical Analytics in R • mean(), max(), min() • median() • var(), sd() • cor() • quantile() • summary() • hist() • plot()
  • 44. Mean, Median, Quantile. vec = 5,3,6,8,9,2,11,8,3,8,10,9: mean(vec) 6.83, median(vec) 8. vec = 5,3,6,8,9,2,11,8,3,8,10,1000: mean(vec) 89.42, median(vec) 8. quantile(vec) 0% 25% 50% 75% 100%: 2.00 4.50 8.00 9.25 1000.00. quantile(vec, probs = c(0, .75, 1)) 0% 75% 100%: 2.00 9.25 1000.00. quantile(vec, probs=seq(0, 1, .1))
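Slide 44 shows that one outlier moves the mean a long way while the median stays put. A Python counterpart of the R calls might look like the sketch below; `method="inclusive"` is chosen because it reproduces the quartile values printed on the slide for this vector.

```python
import statistics as st

vec = [5, 3, 6, 8, 9, 2, 11, 8, 3, 8, 10, 9]
vec_outlier = vec[:-1] + [1000]   # last value replaced by an outlier, as on the slide

for name, v in [("original", vec), ("with outlier", vec_outlier)]:
    print(f"{name}: mean={st.mean(v):.2f} median={st.median(v)}")

# Quartiles (25%, 50%, 75%) of the vector that contains the outlier.
quartiles = st.quantiles(vec_outlier, n=4, method="inclusive")
print("quartiles:", quartiles)
```

Like the median, the quartiles are based on rank rather than magnitude, which is why the value 1000 shows up only as the maximum.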
  • 45. Correlation
    cor(mtcars[, c("mpg", "cyl", "disp", "wt")])
                mpg        cyl       disp         wt
    mpg   1.0000000 -0.8521620 -0.8475514 -0.8676594
    cyl  -0.8521620  1.0000000  0.9020329  0.7824958
    disp -0.8475514  0.9020329  1.0000000  0.8879799
    wt   -0.8676594  0.7824958  0.8879799  1.0000000
    Data (first rows of mtcars):
                       mpg cyl  disp    wt
    Mazda RX4         21.0   6 160.0 2.620
    Mazda RX4 Wag     21.0   6 160.0 2.875
    Datsun 710        22.8   4 108.0 2.320
    Hornet 4 Drive    21.4   6 258.0 3.215
    Hornet Sportabout 18.7   8 360.0 3.440
    Valiant           18.1   6 225.0 3.460
    Duster 360        14.3   8 360.0 3.570
    Merc 240D         24.4   4 146.7 3.190
    Merc 230          22.8   4 140.8 3.150
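To show what `cor()` computes, here is a hand-written Pearson correlation applied to the `mpg` and `wt` columns of the nine rows listed on slide 45. The result will differ from the -0.8676594 in the matrix, because the matrix is computed over the full mtcars data set rather than these nine rows.

```python
from math import sqrt

# The nine mtcars rows shown on the slide (mpg and wt columns only).
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8]
wt = [2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

r = pearson(mpg, wt)
print(f"cor(mpg, wt) on the nine listed rows: {r:.3f}")
```

A negative coefficient matches the sign in the slide's matrix: heavier cars tend to travel fewer miles per gallon.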
  • 46. Summary Function
    summary(mtcars[, c("mpg", "cyl", "disp", "wt")])
         mpg             cyl             disp             wt
    Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   :1.513
    1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.:2.581
    Median :19.20   Median :6.000   Median :196.3   Median :3.325
    Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :3.217
    3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:3.610
    Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :5.424
  • 48. Anscombe Quartet
          I            II           III           IV
       x      y     x      y     x      y     x      y
    10.0   8.04  10.0   9.14  10.0   7.46   8.0   6.58
     8.0   6.95   8.0   8.14   8.0   6.77   8.0   5.76
    13.0   7.58  13.0   8.74  13.0  12.74   8.0   7.71
     9.0   8.81   9.0   8.77   9.0   7.11   8.0   8.84
    11.0   8.33  11.0   9.26  11.0   7.81   8.0   8.47
    14.0   9.96  14.0   8.10  14.0   8.84   8.0   7.04
     6.0   7.24   6.0   6.13   6.0   6.08   8.0   5.25
     4.0   4.26   4.0   3.10   4.0   5.39  19.0  12.50
    12.0  10.84  12.0   9.13  12.0   8.15   8.0   5.56
     7.0   4.82   7.0   7.26   7.0   6.42   8.0   7.91
     5.0   5.68   5.0   4.74   5.0   5.73   8.0   6.89
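The point of the Anscombe Quartet is that the four data sets have nearly identical summary statistics even though they look very different when plotted. The sketch below computes the mean, sample variance, and Pearson correlation for each set using the values from slide 48; the helper function `pearson` is written here for the illustration.

```python
from math import sqrt
from statistics import mean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # x is shared by sets I, II and III
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

summary = {}
print("set  mean_x  mean_y  var_x  var_y  cor")
for name, (x, y) in quartet.items():
    summary[name] = (mean(x), mean(y), variance(x), variance(y), pearson(x, y))
    print(name, " ".join(f"{v:.2f}" for v in summary[name]))
```

Since the printed rows are almost indistinguishable, only a plot reveals that set II is curved, set III has one outlier on a line, and set IV has a single point that determines the whole correlation.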
  • 51. What is Machine Learning? • “Machine learning refers to a system capable of the autonomous acquisition and integration of knowledge.” • “Learning denotes changes in a system that ... enable a system to do the same task … more efficiently the next time.” - Herbert Simon • Automating automation • Getting computers to program themselves • Writing software is the bottleneck • Let the data do the work instead! • Machine learning is primarily concerned with the accuracy and effectiveness of the computer system.
  • 52. Why Machine Learning? • No human experts • industrial/manufacturing control • mass spectrometer analysis, drug design, astronomic discovery • Black-box human expertise • face/handwriting/speech recognition • driving a car, flying a plane • Rapidly changing phenomena • credit scoring, financial modeling • diagnosis, fraud detection • Need for customization/personalization • personalized news reader • movie/book recommendation
  • 53. Machine Learning Algorithm Categories • Supervised Learning algorithms (Logistic Regression, Neural Networks, Support Vector Machines (SVMs), Naive Bayes classifiers, Random Forests) • Unsupervised Learning algorithms (K-means, Hierarchical clustering) • Semi-supervised Learning algorithms • Reinforcement Learning algorithms (self-driving cars)
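K-means, named on slide 53 as an unsupervised algorithm, can be sketched on one-dimensional data in plain Python. The points and the starting centroids are made up for the illustration, and a production implementation would normally choose the starting centroids more carefully and work in more than one dimension.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break   # converged: the assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

points = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]   # made-up, unlabeled data
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

No labels are supplied anywhere: the two groups emerge from the distances between the points alone, which is what separates unsupervised from supervised learning.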
  • 54. ML vs Traditional Programming. Traditional Programming: Data + Program → Computer → Output. Machine Learning: Data + Output → Computer → Program
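The contrast on slide 54 can be made concrete with a toy example. In the traditional case a person writes the rule; in the machine learning case the rule is recovered from data and outputs. The rule `2 * x + 1` and the least-squares line fit are assumptions chosen to keep the sketch short.

```python
def traditional_program(x):
    """Traditional programming: a person writes the rule by hand."""
    return 2 * x + 1

def learn_program(inputs, outputs):
    """Machine learning: fit y = slope * x + intercept by least squares."""
    n = len(inputs)
    mx, my = sum(inputs) / n, sum(outputs) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(inputs, outputs))
             / sum((x - mx) ** 2 for x in inputs))
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

data = [0, 1, 2, 3, 4]
outputs = [traditional_program(x) for x in data]   # stands in for observed outputs
learned = learn_program(data, outputs)
print(traditional_program(10), learned(10))
```

Here `learn_program` takes data and outputs and returns a program, which mirrors the second row of the slide, while `traditional_program` takes data and produces output directly.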
  • 57. Data Governance Metrics • Digital Culture • Naming Standard • Professional Terms and Abbreviations • Data Model, Documentation and Relationship • Data Quality Rules and Metrics • Hierarchy of Data Artifacts / Entities • Classify your Data: • Master Data, Transactional Data, Reference Data
  • 58. But do you have the capacity to refine it? DATA is the NEW OIL!
  • 59. Information is the oil of the 21st century, and Analytics is the Combustion Engine
  • 60. Q & A ? Dr. Abzetdin Adamov Email me at: aadamov@ada.edu.az Follow me at: @ Link to me at: www.linkedin.com/in/adamov Visit my blog at: aadamov.wordpress.com