1. Understanding your Data
Data Analytics Lifecycle and Machine Learning
Dr. Abzetdin ADAMOV
Director, Center for Data Analytics Research
School of IT & Engineering
ADA University
aadamov@ada.edu.az
2. Content
• Why now?
• Data Analytics Lifecycle
• Data Acquisition
• Data Repository
• Data Preprocessing
• Data Analytics and Machine Learning
• Data Visualization
• Data Governance
8. Student Research - SDP Topics
1. Development of Lexical and Morphological Analysis System
2. Effective Installation of Multi-node Cluster based on Hadoop 3.0
3. Utilizing Artificial Intelligence (AI) to improve quality of life for people with
Dementia
4. Statistical Analysis and Data Visualization of DTS Data
5. Development of N-gram Model
6. Development of Semantic Similarity System
7. Development of Sentiment Analysis System
8. Personalized Offers and Customer Retention Platform in Banking
9. Data Retrieval, Storage and Manipulation of DTS Data
10. Network Security and IDS using Machine Learning
11. Development of Spell Correction System
10. Where Data Comes From
Data is produced by:
• People
• Social Media, Public Web, Smartphones, …
• Organizations (Employer)
• OLTP, OLAP, BI, …
• Machines
• IoT, Satellites, Vehicles, Science, …
AAdamov, CeDAR, ADA University
11. Modern Data Sources
→ Internet of Anything (IoAT)
• Wind Turbines, Oil Rigs, Cars
• Weather Stations, Smart Grids
• RFID Tags, Beacons, Wearables
→ User Generated Content (Web & Mobile)
• Twitter, Facebook, Snapchat, YouTube
• Clickstream, Ads, User Engagement
• Payments: Paypal, Venmo
12. Addressing Data – Digital Universe
[Chart: Digital Universe growth over time, 1995–2018, data growth in zettabytes, rising to roughly 35 ZB]
13. Addressing Data – Hard Disk Capacity
[Chart: hard drive capacity growth over time, 1991–2017, in gigabytes, rising to roughly 70,000 GB]
14. Addressing Data – Storage Cost
1200000
100000
10000 800 10 1 0.1 0.003 0.0020
200000
400000
600000
800000
1000000
1200000
1400000
1980 1985 1990 1995 2000 2005 2010 2015 2020
Price$perGBytes
Data Storage Cost per Gigabyte
AAdamov, CeDAR, ADA University
15. Computation Power CPU and GPU
[Chart: CPU vs. GPU computation power in GFLOPS, 2001–2018; GPU performance grows far faster than CPU, reaching several thousand GFLOPS by 2018]
16. Data Growth vs. Processing Power
17. Addressing Data – Transfer Rate
[Chart: hard drive data transfer rate over time, 1991–2016, in MB/sec, rising to roughly 10,000 MB/sec]
19. Data Analytics Life-Cycle
• Data Acquisition
  - Web Crawling
  - Data Mining
  - Information Retrieval
  - …
• Data Repository
  - Hadoop HDFS
  - Microsoft Azure
  - Amazon EC2
  - Warehouse
• Data Processing
  - ETL
  - Parsing
  - Indexing
  - Searching
  - Ranking
  - NLP
  - …
• Data Analytics / ML
  - Statistical Analysis
  - Machine Learning
  - R Programming
  - Python
  - RapidMiner
  - Weka
  - …
• Data Visualization
Big Data Management involves the Data Science and Data Engineering areas for implementing Data Mining techniques.
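The lifecycle stages above can be sketched as a chain of functions. This is an illustrative toy, not code from the deck: the function names and the two weather records are invented, and the repository stage (e.g. HDFS or a warehouse, where acquired data would be stored between steps) is omitted for brevity.

```python
# Toy sketch of the lifecycle: acquire -> preprocess -> analyze -> visualize.

def acquire():
    """Data Acquisition: stand-in for web crawling or pulling from an OLTP system."""
    return [{"city": "Baku", "temp": "31"}, {"city": "Ganja", "temp": None}]

def preprocess(records):
    """Data Preprocessing: drop incomplete rows and parse types (a tiny ETL)."""
    return [{**r, "temp": int(r["temp"])} for r in records if r["temp"] is not None]

def analyze(records):
    """Data Analytics / ML: here just a summary statistic."""
    temps = [r["temp"] for r in records]
    return {"n": len(temps), "mean_temp": sum(temps) / len(temps)}

def visualize(result):
    """Data Visualization: stand-in for an actual chart."""
    print(f"{result['n']} records, mean temp {result['mean_temp']:.1f}")

visualize(analyze(preprocess(acquire())))
```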
21. Data Acquisition Techniques
1. Operational Systems
2. Data Warehouses and Data Marts
3. Online Analytical Processing (OLAP) / BI
4. Web Crawling
5. Data Brokers (Commercial Data)
6. Open Data Sources
7. Experimental Data Collection
8. Online Surveys
22. Data Acquisition Considerations
• Business Needs
• Data Standards (ISO, ITIS, FGDC, ISDM)
• Accuracy Requirements
• Currency of Data
• Time Constraints
• Format (CSV, XLS, XML, JSON, …)
• Cost
24. Traditional Approach in Data Management
[Diagram: a 3 TB file cannot fit on a single 1 TB hard drive; in the traditional model, raw data is moved from storage to a single processor and written back as processed data]
25. The “Big Data” Problem
• Problem: a single machine cannot process, or even store, all the data!
• Solution: distribute the data over large clusters
• Difficulties:
  - How to split work across machines? Moving data over the network is expensive, so data and network locality must be considered
  - How to deal with failures?
  - How to deal with slow nodes?
26. Addressing Data
• Standard hard drive data transfer speed: 60–100 MB/sec
• Solid-state drive (SSD): 250–500 MB/sec
• Hard drive capacity is growing RAPIDLY (4–60 TB)
• Online data doubles roughly every 18 months
• Processing speed shows roughly comparable growth
• Hard drive transfer speed is relatively FLAT
Moving data IN and OUT of disk is the bottleneck.
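A back-of-the-envelope calculation, using the drive speeds quoted above, makes the bottleneck concrete. The function is an illustrative helper, not from the deck:

```python
# Rough arithmetic behind the bottleneck claim, at 100 MB/sec per drive.
TB = 1024 * 1024  # megabytes per terabyte

def read_time_hours(data_mb, drives, mb_per_sec=100):
    """Time to scan data_mb striped evenly across `drives` disks read in parallel."""
    return data_mb / drives / mb_per_sec / 3600

# Scanning one 4 TB drive end to end:
single = read_time_hours(4 * TB, drives=1)
# The same 4 TB striped across 100 drives read in parallel:
parallel = read_time_hours(4 * TB, drives=100)
print(f"1 drive:    {single:.1f} h")        # about 11.7 hours
print(f"100 drives: {parallel * 60:.1f} min")  # about 7 minutes
```

This is exactly the motivation for distributing data over clusters: the read is disk-bound, so adding disks (and reading them in parallel) is the only way to speed it up.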
27. BIG DATA and TRADITIONAL SYSTEMS
28. Hadoop Input/Output Model
[Diagram: the client splits file NEWS.txt (512 MB) into four 128 MB blocks A, B, C, D and writes them into HDFS]
Hadoop reads/writes blocks sequentially, not in parallel, which is why Hadoop on its own does not significantly affect I/O performance.
The SOLUTION is the data striping technique…
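The block split in the NEWS.txt example can be sketched as follows. `split_into_blocks` is an illustrative helper, not a Hadoop API:

```python
import math

BLOCK_MB = 128  # HDFS block size used in the slide's example

def split_into_blocks(file_size_mb, block_mb=BLOCK_MB):
    """Return (block_index, block_size_mb) pairs for an HDFS-style split."""
    n = math.ceil(file_size_mb / block_mb)
    blocks = []
    remaining = file_size_mb
    for i in range(n):
        size = min(block_mb, remaining)  # last block may be smaller
        blocks.append((i, size))
        remaining -= size
    return blocks

# NEWS.txt from the slide: 512 MB -> 4 full blocks (A, B, C, D)
print(split_into_blocks(512))  # [(0, 128), (1, 128), (2, 128), (3, 128)]
# A 300 MB file -> 2 full blocks plus a 44 MB tail block
print(split_into_blocks(300))  # [(0, 128), (1, 128), (2, 44)]
```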
29. Distributed Architecture of HDFS
[Diagram: four racks, each with a switch and data nodes (DN1–DN34). The client asks the NameNode where to write file ADA.txt (blocks A, B, C, D); the NameNode returns a placement with three replicas per block:]
• A – DN32, DN11, DN14
• B – DN01, DN22, DN23
• C – DN12, DN02, DN04
• D – DN34, DN12, DN14
30. Big Data and Virtualization
[Diagram comparing three stacks:]
• Traditional architecture: applications run on a single operating system on a single hardware box
• Virtualized architecture: a hypervisor on shared hardware hosts multiple OS instances, each running its own applications
• Distributed architecture: Hadoop HDFS + YARN spans multiple hardware/OS nodes, with applications running on top of the cluster
31. BIG DOES NOT MEAN SLOW
33. Why Data Preprocessing?
• Data in the real world is dirty
• incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• noisy: containing errors or outliers
• inconsistent: containing discrepancies in codes or names
• No quality data, no quality Analytics results!
• Quality decisions must be based on quality data
• Data warehouse needs consistent integration of quality data
34. Multi-Dimensional Measure of Data Quality
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility
35. Major Tasks in Data Preprocessing
• Data Cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data Integration
• Integration of multiple databases, data cubes, or files
• Data Transformation
• Normalization and aggregation
• Data Reduction
• Obtains reduced representation in volume but produces the same or similar
analytical results
• Data Discretization
• Part of data reduction but with particular importance, especially for numerical data
36. Data Cleaning and Transformation
• Data Cleaning Tasks:
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Data Transformation Tasks:
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Normalization: scaled to fall within a small, specified range
• Generalization: concept hierarchy climbing
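Two of the tasks above, filling in missing values with the mean and normalization to a small specified range (min-max scaling), can be sketched in a few lines. The helper names are illustrative:

```python
import statistics

def fill_missing(values):
    """Data cleaning: fill missing values (None) with the mean of observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def min_max_normalize(values, lo=0.0, hi=1.0):
    """Data transformation: scale values to fall within the range [lo, hi]."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

raw = [10, None, 30, 50]
clean = fill_missing(raw)          # [10, 30, 30, 50]
scaled = min_max_normalize(clean)  # [0.0, 0.5, 0.5, 1.0]
print(clean, scaled)
```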
37. Data Reduction Strategies
• Why data reduction?
• Warehouse may store terabytes of data
• Complex data analysis/mining may take a very long time to run on the complete data
set
• Data reduction
• Obtains a reduced representation of the data set that is much smaller in volume but
yet produces the same (or almost the same) analytical results
• Data reduction strategies
• Data Compression
• Sampling
• Data cube aggregation
• Dimensionality reduction
• Numerosity reduction
• Discretization and concept hierarchy generation
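Two of these strategies, sampling and discretization, are easy to sketch. The helpers below are illustrative, not from any particular library:

```python
import random

def simple_random_sample(data, fraction, seed=0):
    """Sampling: simple random sample without replacement (seeded for repeatability)."""
    rng = random.Random(seed)
    k = max(1, int(len(data) * fraction))
    return rng.sample(data, k)

def equal_width_discretize(values, n_bins):
    """Discretization: map each value to an equal-width bin index 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

population = list(range(1_000))
sample = simple_random_sample(population, fraction=0.05)
print(len(sample))  # 50 rows kept instead of 1000

print(equal_width_discretize([0, 2.5, 5, 7.5, 10], n_bins=4))  # [0, 1, 2, 3, 3]
```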
39. Skills Requirements for Data Analytics
Statistics
Business
Domain
Computer
Science
Data Analytics
40. Meaning of Statistics
• The word statistics is used in either of two senses:
• commonly, to refer to data;
• the principles and methods of handling numerical data.
• Statistics is defined as a branch of mathematics that deals with the
collection, analysis and interpretation of numerical information
• Statistics changes numbers into information
• deciding how to collect data efficiently
• using data to give information
• using data to answer questions
• using data to make decisions.
• Statistics is the science of learning from data
41. Kinds of Data
• Quantitative – Data that is numerical, counted, or compared
• Demographic data
• Answers to closed-ended survey items
• Attendance data
• Scores on standardized instruments
• Qualitative – Narratives, logs, experience
• Interviews
• Open-ended survey items
• Categories
42. Statistical Measures
• Measure of central tendency
• Mean
• Median
• Mode
• Measure of variation
• Range
• Variance and standard deviation
• Interquartile range
• Proportion, Percentage
• Ratio, Rate
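All of these measures are available in Python's standard `statistics` module. For example, on a small toy sample:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of central tendency
print(statistics.mean(data))    # 5.0
print(statistics.median(data))  # 4.5
print(statistics.mode(data))    # 4

# Measures of variation
print(max(data) - min(data))       # range: 7
print(statistics.pvariance(data))  # population variance: 4.0
print(statistics.pstdev(data))     # population standard deviation: 2.0
q1, q2, q3 = statistics.quantiles(data, n=4)
print(q3 - q1)                     # interquartile range: 2.5
```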
46. Summary Function
summary(mtcars[, c("mpg", "cyl", "disp", "wt")])
mpg cyl disp wt
Min. :10.40 Min. :4.000 Min. : 71.1 Min. :1.513
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.:2.581
Median :19.20 Median :6.000 Median :196.3 Median :3.325
Mean :20.09 Mean :6.188 Mean :230.7 Mean :3.217
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:3.610
Max. :33.90 Max. :8.000 Max. :472.0 Max. :5.424
51. What is Machine Learning?
• “Machine learning refers to a system capable of the autonomous
acquisition and integration of knowledge.”
• “Learning denotes changes in a system that ... enable a system to do
the same task … more efficiently the next time.” - Herbert Simon
• Automating automation
• Getting computers to program themselves
• Writing software is the bottleneck
• Let the data do the work instead!
• Machine learning is primarily concerned with the accuracy and
effectiveness of the computer system.
52. Why Machine Learning?
• No human experts
• industrial/manufacturing control
• mass spectrometer analysis, drug design, astronomic discovery
• Black-box human expertise
• face/handwriting/speech recognition
• driving a car, flying a plane
• Rapidly changing phenomena
• credit scoring, financial modeling
• diagnosis, fraud detection
• Need for customization/personalization
• personalized news reader
• movie/book recommendation
57. Data Governance Metrics
• Digital Culture
• Naming Standard
• Professional Terms and Abbreviations
• Data Model, Documentation and Relationship
• Data Quality Rules and Metrics
• Hierarchy of Data Artifacts / Entities
• Classify your Data:
• Master Data, Transactional Data, Reference Data
58. DATA is the NEW OIL! But do you have the capacity to refine it?
59. Information is the oil of the 21st century,
and Analytics is the Combustion Engine
60. Q & A ?
Dr. Abzetdin Adamov
Email me at: aadamov@ada.edu.az
Follow me at: @
Link to me at: www.linkedin.com/in/adamov
Visit my blog at: aadamov.wordpress.com