DATA CLEANING & PROFILING 
UNDERSTANDING INDIA … CENSUS 
2011 
Datameet 4 Bhavin Dalal
What is Data Quality?
• Data quality is a perception or an assessment of data's fitness to serve its purpose in a given context.
• Aspects of data quality include:
  - Accuracy: How accurate is the data?
  - Completeness: Is all the data present?
  - Update status: How old is the data?
  - Relevance: Is the data relevant to the purpose?
  - Consistency: Is the data consistent across different sources?
  - Reliability: How much can we rely on the data?
  - Appropriate presentation: Is the data presented in a way that makes it usable?
  - Accessibility: Is the data accessible to all those who require it?
Data Quality Problems
• Referential integrity
• Use of NULL
• Value checking for reasonableness
  - Date values, for example
• Values constrained to a pre-defined domain, e.g. Salutation
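The checks listed above could be sketched for a single record roughly as follows. The field names, the list of allowed salutations and the function check_record are assumptions made for illustration only.

```python
from datetime import date

# Hypothetical domain of allowed salutations (an assumption for illustration).
ALLOWED_SALUTATIONS = {"Mr.", "Mrs.", "Ms.", "Dr."}

def check_record(record, today=None):
    """Return a list of data quality problems found in one record (a dict)."""
    today = today or date.today()
    problems = []
    # Use of NULL: flag missing or empty values.
    for field, value in record.items():
        if value is None or value == "":
            problems.append(f"{field}: missing value")
    # Reasonableness of a date value: a birth date cannot lie in the future.
    dob = record.get("date_of_birth")
    if isinstance(dob, date) and dob > today:
        problems.append("date_of_birth: lies in the future")
    # Value constrained to a pre-defined domain.
    salutation = record.get("salutation")
    if salutation and salutation not in ALLOWED_SALUTATIONS:
        problems.append(f"salutation: {salutation!r} not in allowed domain")
    return problems

record = {"salutation": "Sir", "name": "", "date_of_birth": date(2999, 1, 1)}
for problem in check_record(record, today=date(2024, 1, 1)):
    print(problem)
```

Referential integrity is left out here because it needs a second table to check against.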
Before doing data quality
• Profiling of data
• Conformity check
• Standardization
  - Gender -> M/F, Male/Female, Unknown or Null?
• Duplicate values
• Survivorship
  - Best quality set from different records
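A minimal sketch of profiling and standardizing a gender column. The mapping table and both function names are assumptions; a real project would build the mapping from the values that profiling actually turns up.

```python
from collections import Counter

# Assumed mapping of raw spellings to a standard code (illustrative only).
GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}

def profile(values):
    """Count how often each distinct raw value occurs, None included."""
    return Counter(values)

def standardize_gender(value):
    """Map a raw gender value to 'M', 'F' or 'Unknown'."""
    if value is None:
        return "Unknown"
    return GENDER_MAP.get(value.strip().lower(), "Unknown")

raw = ["M", "male", "F", "Female ", None, "unknown", "M"]
print(profile(raw))                           # shows every spelling in use
print([standardize_gender(v) for v in raw])   # one consistent code per value
```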
Basic Data Cleaning Steps
• Removing spaces and nonprinting characters
• Fixing numbers and number signs
• Fixing dates and times
• Merging and splitting columns
  - E.g. names (First Name + Last Name / Full Name)
• Need for transformation
• Checking data quality through joining and matching
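The first few steps could look like this in Python. This is a sketch under stated assumptions: the accepted number and date formats are examples chosen for illustration, and the helper names are hypothetical.

```python
import re
from datetime import datetime

def clean_text(value):
    """Replace nonprinting characters with spaces and collapse whitespace."""
    printable = "".join(ch if ch.isprintable() else " " for ch in value)
    return re.sub(r"\s+", " ", printable).strip()

def parse_number(value):
    """Parse numbers written like '1,234', '(500)' or '500-' into a float."""
    text = clean_text(value).replace(",", "")
    negative = False
    if text.startswith("(") and text.endswith(")"):   # accounting style
        negative, text = True, text[1:-1]
    elif text.endswith("-"):                          # trailing minus sign
        negative, text = True, text[:-1]
    number = float(text)
    return -number if negative else number

def parse_date(value, formats=("%d/%m/%Y", "%Y-%m-%d", "%d-%m-%Y")):
    """Try a few assumed input formats and return an ISO date string."""
    for fmt in formats:
        try:
            return datetime.strptime(clean_text(value), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # leave unparseable dates for manual review

def split_name(full_name):
    """Split a full name into (first names, last name) at the last space."""
    first, _, last = clean_text(full_name).rpartition(" ")
    return (first, last) if first else (last, "")

print(clean_text("  Sujit\u00a0 Joshi\x00 "))
print(parse_number("(1,234.50)"))
print(parse_date("31/03/2011"))
print(split_name("Beth Christine Parker"))
```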
Finding duplicate values
• The algorithms below find likely duplicates by measuring string distance or phonetic similarity
  - Hamming()
  - Jaro-Winkler()
  - Levenshtein()
  - Damerau-Levenshtein(): Levenshtein that also counts transpositions of adjacent characters
  - Q-gram()
  - Cosine()
  - Soundex()
Hamming
• Number of positions at which the two strings have different symbols. Only defined for strings of equal length.
  - distance('abcdd', 'abbcd') = 2
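A minimal sketch of the Hamming distance, counting the positions where the two strings differ. The function name is illustrative.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

print(hamming("abcdd", "abbcd"))  # 2: positions 3 and 4 differ
```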
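Jaro-Winkler
• This distance is a formula of 5 parameters determined by the two compared strings (A, B, m, t, l) and a prefix weight p chosen from [0, 0.25].
• m is the number of matching characters, t the number of transpositions and l the length of the common prefix.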
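One common formulation of Jaro-Winkler is sketched below. It assumes the usual matching window of half the longer string minus one and a common prefix capped at 4 characters; library implementations may differ in such details. The function names are illustrative.

```python
def jaro(a, b):
    """Jaro similarity in [0, 1]; 1 means the strings are identical."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    m = 0  # number of matching characters
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    a_seq = [ch for ch, ok in zip(a, a_matched) if ok]
    b_seq = [ch for ch, ok in zip(b, b_matched) if ok]
    t = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    return (m / len(a) + m / len(b) + (m - t) / m) / 3

def jaro_winkler_distance(a, b, p=0.1):
    """Jaro-Winkler distance: 0 for identical strings, 1 for no similarity."""
    if not 0 <= p <= 0.25:
        raise ValueError("p must lie in [0, 0.25]")
    similarity = jaro(a, b)
    l = 0  # length of the common prefix, capped at 4
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        l += 1
    return 1 - (similarity + l * p * (1 - similarity))

print(round(jaro_winkler_distance("MARTHA", "MARHTA"), 4))
```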
Levenshtein
• Minimal number of insertions, deletions and replacements needed for transforming string a into string b.
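The Levenshtein distance can be computed with the standard dynamic programming table, kept here one row at a time. The function name is illustrative.

```python
def levenshtein(a, b):
    """Minimal number of insertions, deletions and replacements from a to b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # replace ca by cb if they differ
            ))
        previous = current
    return previous[-1]

print(levenshtein("Ahmedabad", "Amdavad"))  # 3: drop 'h', drop 'e', b -> v
```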
N-gram / Q-gram
• Sum of absolute differences between N-gram vectors of both strings.
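A small sketch of the q-gram distance, with each string turned into a count vector of its substrings of length q. The default q=2 and the helper names are assumptions.

```python
from collections import Counter

def qgrams(text, q=2):
    """Count every substring of length q in text."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))

def qgram_distance(a, b, q=2):
    """Sum of absolute differences between the two q-gram count vectors."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return sum(abs(ga[g] - gb[g]) for g in ga.keys() | gb.keys())

# 'abcde' has ab, bc, cd, de; 'abdce' has ab, bd, dc, ce; only 'ab' is shared.
print(qgram_distance("abcde", "abdce"))  # 6
```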
Cosine
• 1 minus the cosine similarity of both N-gram vectors.
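The cosine distance can be sketched on the same kind of q-gram count vectors. The default q=2 and the helper names are assumptions.

```python
import math
from collections import Counter

def qgrams(text, q=2):
    """Count every substring of length q in text."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))

def cosine_distance(a, b, q=2):
    """1 minus the cosine similarity of the two q-gram count vectors."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    dot = sum(ga[g] * gb[g] for g in ga.keys() & gb.keys())
    norm_a = math.sqrt(sum(v * v for v in ga.values()))
    norm_b = math.sqrt(sum(v * v for v in gb.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("both strings need at least one q-gram")
    return 1 - dot / (norm_a * norm_b)

# One shared bigram out of four on each side: similarity 1/4, distance 3/4.
print(cosine_distance("abcde", "abdce"))
```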
Soundex
• SOUNDEX converts an alphanumeric string to a four-character code that is based on how the string sounds when spoken. The letters A, E, I, O, U, H, W, and Y are ignored unless they are the first letter of the string. Zeroes are added at the end if necessary to produce a four-character code.
  - SOUNDEX('Ahmedabad') = A531
  - SOUNDEX('Amdavad') = A531
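A sketch of Soundex in Python. It assumes the common American Soundex rules, in which vowels separate repeated codes but H and W do not; SQL implementations of SOUNDEX can differ in that detail.

```python
# Consonant groups and their Soundex digits.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Return the four-character Soundex code of name ('' if no letters)."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0]                     # the first letter is kept as-is
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")         # vowels, H, W and Y have no digit
        if digit and digit != previous:
            code += digit
        if ch not in "HW":                # H and W do not separate equal codes
            previous = digit
    return (code + "000")[:4]             # pad with zeroes, cut to four

print(soundex("Ahmedabad"), soundex("Amdavad"))  # A531 A531
```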
Steps to Data Cleaning
Components of Address
Before cleaning:
Sujit Joshi
88 Ashoka Appts
Juhu
Bombay
Tel: 6201670
Cell: 998054046
Email: Sujit.Joshi@gml.com
Problems found:
• Missing salutation
• Abbreviated house name
• Missing postcode
• Old telephone number
• Incorrect email id
After cleaning:
Mr. Sujit Joshi
88 Ashoka Apartments
Gandhigram Road
Juhu
Mumbai - 400 049
India
Tel: (22) 26201670
Cell: 998054046X
Email: Sujit.Joshi@gmail.com
Corrections applied:
• Salutation added
• House name standardised
• Postcode & country added
• Telephone number corrected for known changes (add 2 to 7 digit numbers; include STD code for the city)
• Cell number tagged as being of invalid format
• Email id typo corrected
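The two telephone rules in this example could be sketched as below. The helper names are hypothetical, the default STD code is taken from the example, and the check for exactly 10 digits in a cell number is an assumption about what counts as a valid format.

```python
import re

def fix_landline(number, std_code="22"):
    """Prefix 7-digit landline numbers with 2 and add the city STD code."""
    digits = re.sub(r"\D", "", number)
    if len(digits) == 7:          # old-style number: apply the known change
        digits = "2" + digits
    return f"({std_code}) {digits}"

def tag_cell(number):
    """Append 'X' to cell numbers that do not have exactly 10 digits."""
    digits = re.sub(r"\D", "", number)
    return digits if len(digits) == 10 else digits + "X"

print(fix_landline("6201670"))   # (22) 26201670
print(tag_cell("998054046"))     # 998054046X
```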
Steps in Data Cleansing
• Parsing
• Correcting
• Standardizing
• Matching
• Consolidating
Parsing
• Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
Parsing 
Input Data from Source File 
Beth Christine Parker, SLS MGR 
Regional Port Authority 
Federal Building 
12800 Lake Calumet 
Hedgewisch, IL 
Parsed Data in Target File 
First Name: Beth 
Middle Name: Christine 
Last Name: Parker 
Title: SLS MGR 
Firm: Regional Port Authority 
Location: Federal Building 
Number: 12800 
Street: Lake Calumet 
City: Hedgewisch 
State: IL 
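A minimal sketch of parsing this particular record. It assumes the five-line layout of the example (name and title, firm, location, street line, city and state) and a name with exactly three parts; real parsers have to cope with far more variation.

```python
import re

def parse_address(lines):
    """Parse the five-line layout of the example into named fields."""
    name_line, firm, location, street_line, city_line = lines
    name, _, title = name_line.partition(",")
    first, middle, last = name.split()   # assumes exactly three name parts
    number, street = re.match(r"(\d+)\s+(.*)", street_line).groups()
    city, state = [part.strip() for part in city_line.split(",")]
    return {
        "First Name": first, "Middle Name": middle, "Last Name": last,
        "Title": title.strip(), "Firm": firm, "Location": location,
        "Number": number, "Street": street, "City": city, "State": state,
    }

source = [
    "Beth Christine Parker, SLS MGR",
    "Regional Port Authority",
    "Federal Building",
    "12800 Lake Calumet",
    "Hedgewisch, IL",
]
for field, value in parse_address(source).items():
    print(f"{field}: {value}")
```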
Correcting
• Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.
Correcting 
Corrected Data 
First Name: Beth 
Middle Name: Christine 
Last Name: Parker 
Title: SLS MGR 
Firm: Regional Port Authority 
Location: Federal Building 
Number: 12800 
Street: South Butler Drive 
City: Chicago 
State: IL 
Zip: 60633 
Zip+Four: 2398 
Parsed Data 
First Name: Beth 
Middle Name: Christine 
Last Name: Parker 
Title: SLS MGR 
Firm: Regional Port Authority 
Location: Federal Building 
Number: 12800 
Street: Lake Calumet 
City: Hedgewisch 
State: IL 
Standardizing
• Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
Standardizing 
Corrected Data 
First Name: Beth 
Middle Name: Christine 
Last Name: Parker 
Title: SLS MGR 
Firm: Regional Port Authority 
Location: Federal Building 
Number: 12800 
Street: South Butler Drive 
City: Chicago 
State: IL 
Zip: 60633 
Zip+Four: 2398 
Standardized Data
Pre-name: Ms.
First Name: Beth
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine 
Last Name: Parker 
Title: Sales Mgr. 
Firm: Regional Port Authority 
Location: Federal Building 
Number: 12800 
Street: S. Butler Dr. 
City: Chicago 
State: IL 
Zip: 60633 
Zip+Four: 2398 
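The conversion routines for this record could be sketched as simple lookup tables. The tables below are hypothetical and contain only what the example needs; they stand in for the standard and custom business rules.

```python
# Hypothetical rule tables, sized for this example only.
STREET_WORDS = {"South": "S.", "North": "N.", "Drive": "Dr.", "Street": "St."}
TITLES = {"SLS MGR": "Sales Mgr."}
NAME_STANDARDS = {"Beth": ["Elizabeth", "Bethany", "Bethel"]}

def standardize_street(street):
    """Replace each known street word with its standard abbreviation."""
    return " ".join(STREET_WORDS.get(word, word) for word in street.split())

def standardize(record):
    """Return a copy of record with street, title and name standards applied."""
    result = dict(record)
    result["Street"] = standardize_street(record["Street"])
    result["Title"] = TITLES.get(record["Title"], record["Title"])
    result["1st Name Match Standards"] = NAME_STANDARDS.get(record["First Name"], [])
    return result

corrected = {"First Name": "Beth", "Title": "SLS MGR", "Street": "South Butler Drive"}
print(standardize(corrected))
```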
Matching
• Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications.
Match Patterns
[Table: match patterns. Columns: Business Name, Street, Branch Type, Customer #/Tax ID, City, Vendor Code, Pattern, Pattern I.D. Each row gives a match level per field (Exact, VClose, Close or Blanks), a pattern string (AAAAAA, ABAAA-ABA-, AA, ABCCAA, BBACAA) and a pattern I.D. (P110, P115, P120, S300, S310).]
Matching 
Corrected Data (Data Source #1) 
Pre-name: Ms. 
First Name: Beth 
1st Name Match Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine 
Last Name: Parker 
Title: Sales Mgr. 
Firm: Regional Port Authority 
Location: Federal Building 
Number: 12800 
Street: S. Butler Dr. 
City: Chicago 
State: IL 
Zip: 60633 
Zip+Four: 2398 
Corrected Data (Data Source #2) 
Pre-name: Ms. 
First Name: Elizabeth 
1st Name Match Standards: Beth, Bethany, Bethel
Middle Name: Christine 
Last Name: Parker-Lewis 
Title: 
Firm: Regional Port Authority 
Location: Federal Building 
Number: 12800 
Street: S. Butler Dr., Suite 2 
City: Chicago 
State: IL 
Zip: 60633 
Zip+Four: 2398 
Phone: 708-555-1234 
Fax: 708-555-5678 
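A sketch of rule-based matching for these two records. The three rules used (first names match through the listed standards, last names match through parts of a hyphenated name, firm, number and zip must be equal) are assumptions standing in for the predefined business rules.

```python
def first_names_match(a, b):
    """Names match if equal or if one is a listed variant of the other."""
    return (a["First Name"] == b["First Name"]
            or b["First Name"] in a.get("Standards", [])
            or a["First Name"] in b.get("Standards", []))

def last_names_match(a, b):
    """Last names match if equal or if one is a part of a hyphenated name."""
    la, lb = a["Last Name"], b["Last Name"]
    return la == lb or la in lb.split("-") or lb in la.split("-")

def is_match(a, b, exact_fields=("Firm", "Number", "Zip")):
    """Apply all rules; every rule has to hold for a match."""
    return (first_names_match(a, b) and last_names_match(a, b)
            and all(a[f] == b[f] for f in exact_fields))

source_1 = {"First Name": "Beth", "Standards": ["Elizabeth", "Bethany", "Bethel"],
            "Last Name": "Parker", "Firm": "Regional Port Authority",
            "Number": "12800", "Zip": "60633"}
source_2 = {"First Name": "Elizabeth", "Standards": ["Beth", "Bethany", "Bethel"],
            "Last Name": "Parker-Lewis", "Firm": "Regional Port Authority",
            "Number": "12800", "Zip": "60633"}
print(is_match(source_1, source_2))  # True
```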
Consolidating
• Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.
Consolidating 
Corrected Data (Data Source #1) 
Corrected Data (Data Source #2) 
Consolidated Data 
Name: Ms. Beth (Elizabeth) 
Christine Parker-Lewis 
Title: Sales Mgr. 
Firm: Regional Port Authority 
Location: Federal Building 
Address: 12800 S. Butler Dr., Suite 2 
Chicago, IL 60633-2398 
Phone: 708-555-1234 
Fax: 708-555-5678 
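A minimal sketch of consolidating two matched records. The survivorship rule used here (keep the longer, more complete value of each field) is an assumption; combining name variants such as "Beth (Elizabeth)" would need its own rule and is left out.

```python
def consolidate(a, b):
    """Merge two matched records, keeping the more complete value per field."""
    merged = {}
    for field in sorted(a.keys() | b.keys()):
        va, vb = a.get(field, ""), b.get(field, "")
        merged[field] = va if len(va) >= len(vb) else vb
    return merged

source_1 = {"Title": "Sales Mgr.", "Last Name": "Parker",
            "Street": "S. Butler Dr.", "Phone": ""}
source_2 = {"Title": "", "Last Name": "Parker-Lewis",
            "Street": "S. Butler Dr., Suite 2", "Phone": "708-555-1234"}
for field, value in consolidate(source_1, source_2).items():
    print(f"{field}: {value}")
```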
Sometimes Such Algorithms Don't Work!
• So we do manual cleaning
Example of Manual Cleaning
Car Name       Correct Name
Waganer        Wagon R
Sujhuki        Suzuki
Benj           Mercedes Benz
Faurtuner      Fortuner
Scopeio        Scorpio
Sevrole        Chevrolet
Furrarree      Ferrari
Landcrusher    Land Cruiser
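Once the corrections have been worked out by hand, they can be applied as a lookup table. The sketch below uses the pairs from the example; the function name and the case-insensitive lookup are assumptions.

```python
# Manually curated corrections, keyed by the lower-cased misspelling.
CORRECTIONS = {
    "waganer": "Wagon R",
    "sujhuki": "Suzuki",
    "benj": "Mercedes Benz",
    "faurtuner": "Fortuner",
    "scopeio": "Scorpio",
    "sevrole": "Chevrolet",
    "furrarree": "Ferrari",
    "landcrusher": "Land Cruiser",
}

def correct_car_name(raw):
    """Return the curated name, or the cleaned input if no correction exists."""
    cleaned = " ".join(raw.split())
    return CORRECTIONS.get(cleaned.lower(), cleaned)

print(correct_car_name(" sujhuki "))   # Suzuki
print(correct_car_name("Tata Nano"))   # unchanged: not in the table
```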
Car Cleaning Approach 
Other data that we have cleaned
• Occupation
• Marital Status
• Gender
• And many other fields …
Data Capture Tips
Top Ten Data Capture Tips
• Every contact is a data capture opportunity
• Make it easy for the end user to give you information
• Incentivise your end users to part with their details
• Collect data in line with privacy regulations
• Decide what data you need and prioritise
Top Ten Data Capture Tips
• Don't ask everything at once; build it over time
• Set targets for breadth, depth and quality
• Collect data in a standardized format
• Streamline the data from point of capture to storage
• If you can't collect it, BUY it!
End of Part 1 
Understanding India … Census 
2011 
Census in India 
The first census in India in modern times was 
conducted in 1872. 
Population census has been carried out every 
10 years. 
The census is carried out by the office of the 
Registrar General and Census Commissioner of 
India, Delhi, an office in the Ministry of Home 
Affairs, Government of India, under the 1948 
Census of India Act.
CENSUS
• The 15th Indian National census was conducted in two phases
  - House listing
  - Population enumeration
• The Census covered
  - 640 districts
  - 5767 tehsils
  - 7742 towns
  - More than 6 lakh villages
• 2.7 million officials visited households in 7,742 towns and 6,40,867 villages, classifying the population in different segments
POPULATION COMPARISON
[Chart: population of India in 2001, 2011 and 2021]
The population of India has
increased by more than 181 million
during the decade 2001-2011. This
addition is slightly lower than the
population of Brazil, the fifth most
populous country in the world!
India as compared to the world 
The gap between India, the 
country with the second largest 
population in the world and 
China, the country with the 
largest population in the world 
has narrowed from 238 million 
in 2001 to nearly 131 million in 
2011. On the other hand, the 
gap between India and the 
United States of America, 
which has the third largest 
population, has now widened 
to about 902 million from 741 
million in 2001.
State-wise population 2001
State-wise population 2011
Census report of 2011
Sex Ratio
The sex ratio of India is 940 females per 1000 males. The sex ratio at the
national level has risen by seven points since the last Census in 2001.
This is the highest since 1971.
Sex Ratio Trend in India 
The sex ratio in India has been historically negative or in 
other words, unfavourable to females. Sex ratio reached its 
lowest in 1991 but since then has kept rising.
State-wise Sex Ratios
Census Facts 2011
• Thane district of Maharashtra is the most populated district of India.
• Dibang Valley of Arunachal Pradesh is the least populated.
• Kurung Kumey of Arunachal Pradesh registered the highest population growth rate, 111.01 percent.
• Longleng district of Nagaland registered a negative population growth rate of -58.39 percent.
• Mahe district of Puducherry has the highest sex ratio, 1176 females per 1000 males.
• Daman district has the lowest sex ratio, 533 females per 1000 males.
• Serchhip district of Mizoram has the highest literacy rate, 98.76 percent.
• Alirajpur of MP is the least literate district of India, at only 37.22 percent.
• North East Delhi has the highest density, 37346 persons per square kilometer.
• Dibang Valley has the lowest density, 1 person per sq. km.
States having highest population
• Uttar Pradesh (19.96 crore) increased at the rate of 20% from 2001.
• Maharashtra (11.24 crore) increased at the rate of 15% since the last census.
• Delhi is the most densely populated, with a density of 11297 per sq km (an increase of 21% from 2001).
• Bihar is the most densely populated state, with a density of 1102 per sq km (an increase of 25% from 2001).
States with highest literacy
Interesting Facts
Interesting facts - Telecom
• "More phones than toilets": Census 2011 sheds light on changing India.
• 63.2 per cent of households in India now have a telephone/mobile facility (82 per cent in urban and 54 per cent in rural areas).
• The penetration of mobile phones is 59 per cent and of landlines 10 per cent.
• More than half of Indian households (some 53.1 per cent) do not have access to something as basic as a toilet.
Facts - Communication
• The penetration of computers and laptops in India is only 9.4 per cent, or less than one out of 10 households, with only 3 per cent having an internet facility.
• The penetration of internet is 8 per cent in urban areas as compared to less than 1 per cent in rural areas.
• Maharashtra is the biggest Indian internet market with 18%.
• 47.2% of Indians own a television.
• 19.9% of Indians own a radio/transistor.
• 13.42 million broadband connections (homes and offices combined).
Facts - Literacy and Population
• Uttar Pradesh is the most populous state, and the combined population of Uttar Pradesh and Maharashtra is more than that of the USA.
• Ten states and union territories have attained a literacy rate of above 85 per cent.
• According to the Census report, India's population is now bigger than the combined population of USA, Indonesia, Brazil, Pakistan and Bangladesh.
• 74% of Indians can now read, write and do basic maths (like adding, subtracting), which means that 3 out of every 4 Indians are literate.
Facts : General
• Females outnumber males in Kerala and Puducherry.
• Population
  - 50% <= 25 yrs of age
  - 65% <= 35 yrs of age
• It is anticipated that the median age of an Indian citizen will be 29 years in 2020, in comparison to 48 for Japan and 37 for China.
• India covers 2.4% of the land territory of the world and represents more than 17.5% of the population of the world.
Facts : General
Total expenditure and materials used:
• Cost: Rs. 2200 crore
• Cost per person: Rs. 18.33
• No. of Census functionaries: 2.7 million
• No. of languages in which schedules were canvassed: 16
• No. of languages in which training manuals were prepared: 18
• No. of schedules printed: 340 million
• No. of training manuals printed: 5.4 million
• Paper utilised: 12,000 MTs
• Material moved: 10,500 MTs
What do we do with Census?
• Census is more than population, literacy and sex ratio.
• Census can provide insights about various dimensions!
  - The data is available in the xls format
  - The data is available free of cost
  - The data is clean
  - It has proper database architecture with codes in place
Two stages of Census
• Houselisting
• Population Enumeration
Houselisting questionnaire
Population enumeration
References
• http://www.census2011.co.in/
• http://articles.timesofindia.indiatimes.com/2011-03-31/
• http://censusindia.gov.in/
• http://en.wikipedia.org/wiki/2011_census_of_India
• http://www.mapsofindia.com/census2011/
THANK YOU

Weitere ähnliche Inhalte

Was ist angesagt?

Statistics And Probability Tutorial | Statistics And Probability for Data Sci...
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...Statistics And Probability Tutorial | Statistics And Probability for Data Sci...
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...Edureka!
 
Exploratory data analysis with Python
Exploratory data analysis with PythonExploratory data analysis with Python
Exploratory data analysis with PythonDavis David
 
3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysis3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysismlong24
 
Principles of data visualisation 2021
Principles of data visualisation 2021Principles of data visualisation 2021
Principles of data visualisation 2021Marié Roux
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data AnalyticsUtkarsh Sharma
 
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationData mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationDr. Abdul Ahad Abro
 
Introduction to Data Management
Introduction to Data ManagementIntroduction to Data Management
Introduction to Data ManagementAmanda Whitmire
 
Data Mining Tools / Orange
Data Mining Tools / OrangeData Mining Tools / Orange
Data Mining Tools / OrangeYasemin Karaman
 
Exploratory data analysis data visualization
Exploratory data analysis data visualizationExploratory data analysis data visualization
Exploratory data analysis data visualizationDr. Hamdan Al-Sabri
 
DATASCIENCE vs BUSINESS INTELLIGENCE.pptx
DATASCIENCE vs BUSINESS INTELLIGENCE.pptxDATASCIENCE vs BUSINESS INTELLIGENCE.pptx
DATASCIENCE vs BUSINESS INTELLIGENCE.pptxOTA13NayabNakhwa
 
Data Visualisation.pdf
Data Visualisation.pdfData Visualisation.pdf
Data Visualisation.pdfThiyagu K
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information RetrievalRoi Blanco
 

Was ist angesagt? (20)

Lecture #01
Lecture #01Lecture #01
Lecture #01
 
Analytical tools
Analytical toolsAnalytical tools
Analytical tools
 
Data analytics
Data analyticsData analytics
Data analytics
 
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...Statistics And Probability Tutorial | Statistics And Probability for Data Sci...
Statistics And Probability Tutorial | Statistics And Probability for Data Sci...
 
Lecture 01 Data Mining
Lecture 01 Data MiningLecture 01 Data Mining
Lecture 01 Data Mining
 
Data science
Data scienceData science
Data science
 
Exploratory data analysis with Python
Exploratory data analysis with PythonExploratory data analysis with Python
Exploratory data analysis with Python
 
01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.
 
3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysis3.5 Exploratory Data Analysis
3.5 Exploratory Data Analysis
 
Principles of data visualisation 2021
Principles of data visualisation 2021Principles of data visualisation 2021
Principles of data visualisation 2021
 
Introduction to Data Analytics
Introduction to Data AnalyticsIntroduction to Data Analytics
Introduction to Data Analytics
 
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationData mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, Classification
 
Introduction to Data Management
Introduction to Data ManagementIntroduction to Data Management
Introduction to Data Management
 
Data Mining Tools / Orange
Data Mining Tools / OrangeData Mining Tools / Orange
Data Mining Tools / Orange
 
Exploratory data analysis data visualization
Exploratory data analysis data visualizationExploratory data analysis data visualization
Exploratory data analysis data visualization
 
Data Mining: Data processing
Data Mining: Data processingData Mining: Data processing
Data Mining: Data processing
 
DATASCIENCE vs BUSINESS INTELLIGENCE.pptx
DATASCIENCE vs BUSINESS INTELLIGENCE.pptxDATASCIENCE vs BUSINESS INTELLIGENCE.pptx
DATASCIENCE vs BUSINESS INTELLIGENCE.pptx
 
Text MIning
Text MIningText MIning
Text MIning
 
Data Visualisation.pdf
Data Visualisation.pdfData Visualisation.pdf
Data Visualisation.pdf
 
Introduction to Information Retrieval
Introduction to Information RetrievalIntroduction to Information Retrieval
Introduction to Information Retrieval
 

Andere mochten auch

Theory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefineTheory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefineBertram Ludäscher
 
Comprehensive Validation with Laravel 4
Comprehensive Validation with Laravel 4Comprehensive Validation with Laravel 4
Comprehensive Validation with Laravel 4Kirk Bushell
 
HOW TO PROCESS DATA IN VARIOUS GEO'S A COMPARATIVE ANALYSIS BY SANJEEV SINGH...
HOW TO PROCESS DATA IN VARIOUS GEO'S A  COMPARATIVE ANALYSIS BY SANJEEV SINGH...HOW TO PROCESS DATA IN VARIOUS GEO'S A  COMPARATIVE ANALYSIS BY SANJEEV SINGH...
HOW TO PROCESS DATA IN VARIOUS GEO'S A COMPARATIVE ANALYSIS BY SANJEEV SINGH...Sanjeev Bharwan
 
HIPPA COMPLIANCE (SANJEEV.S.BHARWAN)
HIPPA COMPLIANCE (SANJEEV.S.BHARWAN)HIPPA COMPLIANCE (SANJEEV.S.BHARWAN)
HIPPA COMPLIANCE (SANJEEV.S.BHARWAN)Sanjeev Bharwan
 
Introduction to data pre-processing and cleaning
Introduction to data pre-processing and cleaning Introduction to data pre-processing and cleaning
Introduction to data pre-processing and cleaning Matteo Manca
 
The Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop AdoptionThe Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop AdoptionInside Analysis
 
Probability and basic statistics with R
Probability and basic statistics with RProbability and basic statistics with R
Probability and basic statistics with RAlberto Labarga
 
Basics of html5, data_storage, css3
Basics of html5, data_storage, css3Basics of html5, data_storage, css3
Basics of html5, data_storage, css3Sreejith Nair
 
Data Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data DiscoveryData Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data DiscoveryInside Analysis
 
Basic Data Storage
Basic Data StorageBasic Data Storage
Basic Data Storageneptonia
 
Bba203 unit 2data processing concepts
Bba203   unit 2data processing conceptsBba203   unit 2data processing concepts
Bba203 unit 2data processing conceptskinjal patel
 
Data cleaning and screening
Data cleaning and screeningData cleaning and screening
Data cleaning and screeningHassan Hussein
 
data warehousing & minining 1st unit
data warehousing & minining 1st unitdata warehousing & minining 1st unit
data warehousing & minining 1st unitbhagathk
 
Impact of health education on tuberculosis drug adherence
Impact of health education on tuberculosis drug adherenceImpact of health education on tuberculosis drug adherence
Impact of health education on tuberculosis drug adherenceSkillet Tony
 
Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...
Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...
Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...DataStax
 
Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Peter Skomoroch
 

Andere mochten auch (20)

Theory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefineTheory & Practice of Data Cleaning: Introduction to OpenRefine
Theory & Practice of Data Cleaning: Introduction to OpenRefine
 
Comprehensive Validation with Laravel 4
Comprehensive Validation with Laravel 4Comprehensive Validation with Laravel 4
Comprehensive Validation with Laravel 4
 
HOW TO PROCESS DATA IN VARIOUS GEO'S A COMPARATIVE ANALYSIS BY SANJEEV SINGH...
HOW TO PROCESS DATA IN VARIOUS GEO'S A  COMPARATIVE ANALYSIS BY SANJEEV SINGH...HOW TO PROCESS DATA IN VARIOUS GEO'S A  COMPARATIVE ANALYSIS BY SANJEEV SINGH...
HOW TO PROCESS DATA IN VARIOUS GEO'S A COMPARATIVE ANALYSIS BY SANJEEV SINGH...
 
HIPPA COMPLIANCE (SANJEEV.S.BHARWAN)
HIPPA COMPLIANCE (SANJEEV.S.BHARWAN)HIPPA COMPLIANCE (SANJEEV.S.BHARWAN)
HIPPA COMPLIANCE (SANJEEV.S.BHARWAN)
 
Intro to open refine
Intro to open refineIntro to open refine
Intro to open refine
 
Introduction to data pre-processing and cleaning
Introduction to data pre-processing and cleaning Introduction to data pre-processing and cleaning
Introduction to data pre-processing and cleaning
 
Data Cleaning
Data CleaningData Cleaning
Data Cleaning
 
The Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop AdoptionThe Role of Data Wrangling in Driving Hadoop Adoption
The Role of Data Wrangling in Driving Hadoop Adoption
 
Real time analytics in Big Data
Real time analytics in Big DataReal time analytics in Big Data
Real time analytics in Big Data
 
Probability and basic statistics with R
Probability and basic statistics with RProbability and basic statistics with R
Probability and basic statistics with R
 
Basics of html5, data_storage, css3
Basics of html5, data_storage, css3Basics of html5, data_storage, css3
Basics of html5, data_storage, css3
 
Data Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data DiscoveryData Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data Discovery
 
Basic Data Storage
Basic Data StorageBasic Data Storage
Basic Data Storage
 
Data Wrangling
Data WranglingData Wrangling
Data Wrangling
 
Bba203 unit 2data processing concepts
Bba203   unit 2data processing conceptsBba203   unit 2data processing concepts
Bba203 unit 2data processing concepts
 
Data cleaning and screening
Data cleaning and screeningData cleaning and screening
Data cleaning and screening
 
data warehousing & minining 1st unit
data warehousing & minining 1st unitdata warehousing & minining 1st unit
data warehousing & minining 1st unit
 
Impact of health education on tuberculosis drug adherence
Impact of health education on tuberculosis drug adherenceImpact of health education on tuberculosis drug adherence
Impact of health education on tuberculosis drug adherence
 
Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...
Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...
Adaptive Data Cleansing with StreamSets and Cassandra (Pat Patterson, StreamS...
 
Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011Geo Analytics Tutorial - Where 2.0 2011
Geo Analytics Tutorial - Where 2.0 2011
 

Ähnlich wie DataMeet 4: Data cleaning & census data

In:Confidence 2019 - Balancing the conflicting objectives of data access and ...
In:Confidence 2019 - Balancing the conflicting objectives of data access and ...In:Confidence 2019 - Balancing the conflicting objectives of data access and ...
In:Confidence 2019 - Balancing the conflicting objectives of data access and ...Privitar
 
Data Services Company Presentation
Data Services Company PresentationData Services Company Presentation
Data Services Company PresentationData Services, Inc.
 
BigDansing presentation slides for KAUST
BigDansing presentation slides for KAUSTBigDansing presentation slides for KAUST
BigDansing presentation slides for KAUSTZuhair khayyat
 
Tapping the Data Deluge with R
Tapping the Data Deluge with RTapping the Data Deluge with R
Tapping the Data Deluge with RJeffrey Breen
 
Database fundamentals
Database fundamentalsDatabase fundamentals
Database fundamentalscrystalpullen
 
Geocoding Best Practices for Financial Services – Revealing Actionable Insigh...
Geocoding Best Practices for Financial Services – Revealing Actionable Insigh...Geocoding Best Practices for Financial Services – Revealing Actionable Insigh...
Geocoding Best Practices for Financial Services – Revealing Actionable Insigh...Precisely
 
NCEDC Presentation
NCEDC Presentation NCEDC Presentation
NCEDC Presentation FCBR
 
American Community Survey and the Census
American Community Survey and the CensusAmerican Community Survey and the Census
American Community Survey and the CensusLynda Kellam
 
Census2010presentation
Census2010presentationCensus2010presentation
Census2010presentationLynda Kellam
 
Enhance Subscriber Growth with a Modern Approach to Network Planning
Enhance Subscriber Growth with a Modern Approach to Network PlanningEnhance Subscriber Growth with a Modern Approach to Network Planning
Enhance Subscriber Growth with a Modern Approach to Network PlanningPrecisely
 
Enhance Subscriber Growth with a Modern Approach to Network Planning
Enhance Subscriber Growth with a Modern Approach to Network PlanningEnhance Subscriber Growth with a Modern Approach to Network Planning
Enhance Subscriber Growth with a Modern Approach to Network PlanningPrecisely
 
Enhance Subscriber Growth with a Modern Approach to Network Planning
Enhance Subscriber Growth with a Modern Approach to Network PlanningEnhance Subscriber Growth with a Modern Approach to Network Planning
Enhance Subscriber Growth with a Modern Approach to Network PlanningPrecisely
 
United Way releases needs assessment Garofolo, Chris . Th.docx
United Way releases needs assessment  Garofolo, Chris . Th.docxUnited Way releases needs assessment  Garofolo, Chris . Th.docx
United Way releases needs assessment Garofolo, Chris . Th.docxdickonsondorris
 
Final%20Analysis%20Code%20Displayed.html
Final%20Analysis%20Code%20Displayed.htmlFinal%20Analysis%20Code%20Displayed.html
Final%20Analysis%20Code%20Displayed.htmlRyan Haeri
 
Preserving Privacy and Utility in Text Data Analysis
Preserving Privacy and Utility in Text Data AnalysisPreserving Privacy and Utility in Text Data Analysis
Preserving Privacy and Utility in Text Data AnalysisTom Diethe
 
2013 GISCO Track, Quality Assessment and Improvement for Addressed Locations ...
2013 GISCO Track, Quality Assessment and Improvement for Addressed Locations ...2013 GISCO Track, Quality Assessment and Improvement for Addressed Locations ...
2013 GISCO Track, Quality Assessment and Improvement for Addressed Locations ...GIS in the Rockies
 
Melissa data overview
Melissa data overviewMelissa data overview
Melissa data overviewAmi_Surati
 
The data bath
The data bathThe data bath
The data bathrjdudley
 

Ähnlich wie DataMeet 4: Data cleaning & census data (20)

In:Confidence 2019 - Balancing the conflicting objectives of data access and ...
In:Confidence 2019 - Balancing the conflicting objectives of data access and ...In:Confidence 2019 - Balancing the conflicting objectives of data access and ...
In:Confidence 2019 - Balancing the conflicting objectives of data access and ...
 
Data Services Company Presentation
Data Services Company PresentationData Services Company Presentation
Data Services Company Presentation
 
BigDansing presentation slides for KAUST
DataMeet 4: Data cleaning & census data

  • 1. DATA CLEANING & PROFILING UNDERSTANDING INDIA … CENSUS 2011 Datameet 4 Bhavin Dalal
  • 2. What is Data Quality?
      • Data quality is a perception or an assessment of data's fitness to serve its purpose in a given context.
      • Aspects of data quality include:
          • Accuracy – How accurate is the data?
          • Completeness – Is all the data present?
          • Update status – How old is the data?
          • Relevance – Is the data relevant to the purpose?
          • Consistency – Is the data consistent across different sources?
          • Reliability – How much can we rely on the data?
          • Appropriate presentation – Is the data presented in a way that makes it usable?
          • Accessibility – Is the data accessible to all those who require it?
  • 3. Data Quality Problems
      • Referential integrity
      • Use of NULL
      • Value checking for reasonableness
          • Date values, for example
      • Values constrained to a pre-defined domain, e.g. Salutation
  • 4. Before doing data quality
      • Profiling of data
      • Conformity check
      • Standardization
          • Gender -> M/F or Male/Female or Unknown or Null?
      • Duplicate values
      • Survivorship
          • Best-quality set from different records
  • 5. Basic Data Cleaning Steps
      • Removing spaces and nonprinting characters
      • Fixing numbers and number signs
      • Fixing dates and times
      • Merging and splitting columns
          • E.g. names (First Name + Last Name / Full Name)
      • Need for transformation
      • Checking data quality through joining and matching
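The first two cleaning steps on slide 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `clean_text` and `parse_number` are hypothetical, and the number formats handled (thousands separators, a trailing minus sign, accounting-style parentheses) are guesses at what "fixing number signs" usually involves, not something the slides specify.

```python
import unicodedata


def clean_text(value: str) -> str:
    """Drop nonprinting characters and collapse runs of whitespace."""
    # Turn every Unicode separator (e.g. a non-breaking space) into a plain space.
    spaced = "".join(
        " " if unicodedata.category(ch).startswith("Z") else ch for ch in value
    )
    # Keep whitespace and printable characters; drop control/format characters.
    printable = "".join(
        ch
        for ch in spaced
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    # split() without arguments also trims leading and trailing whitespace.
    return " ".join(printable.split())


def parse_number(value: str) -> float:
    """Parse a number that may carry separators, a trailing minus or parentheses."""
    text = clean_text(value).replace(",", "")
    negative = False
    if text.startswith("(") and text.endswith(")"):
        # Accounting style: (42) means -42 (assumed convention).
        negative, text = True, text[1:-1]
    elif text.endswith("-"):
        # Trailing sign, as some exports write it: 1234.50-
        negative, text = True, text[:-1]
    number = float(text)
    return -number if negative else number
```

Because commas are simply removed, Indian-style grouping such as `6,40,867` parses the same way as `640,867`.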
  • 6. Finding duplicate values
      • The algorithms below find likely duplicates by measuring string similarity; Soundex is the one based on phonetics.
          • Hamming()
          • Jaro-Winkler()
          • Levenshtein()
          • Damerau-Levenshtein() (advanced version of Levenshtein)
          • Q-gram()
          • Cosine()
          • Soundex()
  • 7. Hamming
      • Number of positions at which the two strings have different symbols. Only defined for strings of equal length.
      • distance('abcdd', 'abbcd') = 2 (the strings differ at positions 3 and 4 and agree at the other three)
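As a small sketch of slide 7: Hamming distance is usually defined as the number of positions at which the symbols differ, so for 'abcdd' and 'abbcd' it is 2 (three positions agree, two do not). The function name below is illustrative.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is only defined for strings of equal length")
    # True counts as 1 when summed, so this adds one per mismatching position.
    return sum(x != y for x, y in zip(a, b))
```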
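  • 8. Jaro-Winkler
      • This distance is a formula of 5 parameters determined by the two compared strings (A, B, m, t, l) and p chosen from [0, 0.25]. A and B are the lengths of the two strings, m is the number of matching characters, t the number of transpositions, l the length of the common prefix (at most 4), and p the weight given to that prefix.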
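A minimal sketch of the Jaro-Winkler distance, following the commonly published definition rather than any particular library: the Jaro similarity is (m/A + m/B + (m - t)/m) / 3, and the Winkler step adds l * p * (1 - similarity) for a common prefix of length l (capped at 4). The function names and the default p = 0.1 are assumptions made for illustration.

```python
def jaro_similarity(a: str, b: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if a == b:
        return 1.0
    len_a, len_b = len(a), len(b)
    if len_a == 0 or len_b == 0:
        return 0.0
    # Characters only count as matching if they are at most `window` positions apart.
    window = max(max(len_a, len_b) // 2 - 1, 0)
    matched_a = [False] * len_a
    matched_b = [False] * len_b
    m = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len_b, i + window + 1)):
            if not matched_b[j] and b[j] == ch:
                matched_a[i] = matched_b[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t is half the number of matched characters that appear in a different order.
    seq_a = [a[i] for i in range(len_a) if matched_a[i]]
    seq_b = [b[j] for j in range(len_b) if matched_b[j]]
    t = sum(x != y for x, y in zip(seq_a, seq_b)) / 2
    return (m / len_a + m / len_b + (m - t) / m) / 3


def jaro_winkler_distance(a: str, b: str, p: float = 0.1) -> float:
    """1 minus the Jaro-Winkler similarity; p is the prefix weight from [0, 0.25]."""
    if not 0 <= p <= 0.25:
        raise ValueError("p must lie in [0, 0.25]")
    similarity = jaro_similarity(a, b)
    # l is the length of the common prefix, capped at 4 characters.
    l = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        l += 1
    return 1 - (similarity + l * p * (1 - similarity))
```

For the textbook pair "MARTHA" and "MARHTA" this gives a Jaro similarity of 17/18 (about 0.944) and a Jaro-Winkler distance of about 0.039.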
  • 9. Levenshtein
      • Minimal number of insertions, deletions and replacements needed for transforming string a into string b.
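The Levenshtein definition on slide 9 can be sketched with the standard dynamic-programming recurrence, keeping only one previous row in memory. The function name is illustrative.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimal number of insertions, deletions and replacements turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # replace (or keep) the character
                )
            )
        previous = current
    return previous[-1]
```

For example, the misspelling "Sujhuki" is two edits away from "Suzuki": replace j with z and delete h.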
  • 10. N-gram / Q-gram
      • Sum of absolute differences between the N-gram vectors of both strings.
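A short sketch of the Q-gram distance from slide 10, assuming the "N-gram vector" of a string is the count of each overlapping substring of length q. The default q = 2 and the function names are assumptions.

```python
from collections import Counter


def qgram_profile(text: str, q: int = 2) -> Counter:
    """Count every overlapping substring of length q."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_distance(a: str, b: str, q: int = 2) -> int:
    """Sum of absolute differences between the two q-gram count vectors."""
    profile_a, profile_b = qgram_profile(a, q), qgram_profile(b, q)
    # A Counter returns 0 for q-grams it has not seen, so missing keys are handled.
    return sum(
        abs(profile_a[gram] - profile_b[gram])
        for gram in profile_a.keys() | profile_b.keys()
    )
```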
  • 11. Cosine
      • 1 minus the cosine similarity of both N-gram vectors.
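The cosine distance on slide 11 can be sketched the same way, again assuming bigram count vectors (q = 2). Returning 1.0 when either string is too short to produce any q-gram is a choice made here for illustration, not something the slides define.

```python
import math
from collections import Counter


def cosine_distance(a: str, b: str, q: int = 2) -> float:
    """1 minus the cosine similarity of the q-gram count vectors of a and b."""
    profile_a = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    profile_b = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    # Only q-grams present in both strings contribute to the dot product.
    dot = sum(profile_a[g] * profile_b[g] for g in profile_a.keys() & profile_b.keys())
    norm_a = math.sqrt(sum(count * count for count in profile_a.values()))
    norm_b = math.sqrt(sum(count * count for count in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumed convention for strings shorter than q
    return 1 - dot / (norm_a * norm_b)
```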
  • 12. Soundex
      • SOUNDEX converts an alphanumeric string to a four-character code that is based on how the string sounds when spoken. The letters A, E, I, O, U, H, W, and Y are ignored unless they are the first letter of the string. Zeroes are added at the end if necessary to produce a four-character code.
      • SOUNDEX('Ahmedabad') = A531
      • SOUNDEX('Amdavad') = A531
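A minimal sketch of Soundex as described on slide 12, using the usual American Soundex letter groups. It assumes plain ASCII letters; the rule that H and W do not separate letters with the same code comes from the common definition of the algorithm, not from the slide.

```python
_SOUNDEX_CODES = {}
for _letters, _digit in (
    ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
    ("L", "4"), ("MN", "5"), ("R", "6"),
):
    for _letter in _letters:
        _SOUNDEX_CODES[_letter] = _digit


def soundex(name: str) -> str:
    """Return the four-character Soundex code of a name."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = _SOUNDEX_CODES.get(first, "")
    for ch in letters[1:]:
        code = _SOUNDEX_CODES.get(ch, "")  # vowels, H, W and Y have no code
        if code and code != previous:
            digits.append(code)
        if ch not in "HW":
            # Vowels reset `previous`, so a repeated code after a vowel is kept;
            # H and W leave it untouched, so a repeated code across them is dropped.
            previous = code
    # Pad with zeroes and cut to one letter plus three digits.
    return (first + "".join(digits) + "000")[:4]
```

Both spellings from the slide, 'Ahmedabad' and 'Amdavad', map to A531 with this sketch, which is why Soundex can pair them as likely duplicates.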
  • 13. Steps to Data Cleaning
  • 14. Components of Address
      • Original record: Sujit Joshi; 88 Ashoka Appts; Juhu; Bombay; Tel: 6201670; Cell: 998054046; Email: Sujit.Joshi@gml.com
      • Cleaned record: Mr. Sujit Joshi; 88 Ashoka Apartments; Gandhigram Road; Juhu; Mumbai – 400 049; India; Tel: (22) 26201670; Cell: 998054046X; Email: Sujit.Joshi@gmail.com
      • Problems in the original: missing salutation; abbreviated house name; missing postcode; old telephone number; incorrect email id
      • Corrections applied: salutation added; house name standardised; postcode and country added; telephone number corrected for known changes (add 2 to 7-digit numbers; include the STD code for the city); cell number tagged as being of invalid format; email id typo corrected
  • 15. Steps in Data Cleansing
      • Parsing
      • Correcting
      • Standardizing
      • Matching
      • Consolidating
  • 16. Parsing
      • Parsing locates and identifies individual data elements in the source files and then isolates these data elements in the target files.
  • 17. Parsing
      • Input data from source file: Beth Christine Parker, SLS MGR / Regional Port Authority / Federal Building / 12800 Lake Calumet / Hedgewisch, IL
      • Parsed data in target file: First Name: Beth; Middle Name: Christine; Last Name: Parker; Title: SLS MGR; Firm: Regional Port Authority; Location: Federal Building; Number: 12800; Street: Lake Calumet; City: Hedgewisch; State: IL
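The parsing step on slide 17 can be sketched for this one record layout. This assumes the source record always arrives as five lines in the order name and title, firm, location, street address, city and state; real parsers need rules for many more layouts. The function name `parse_contact` is hypothetical.

```python
import re


def parse_contact(lines: list[str]) -> dict[str, str]:
    """Split a five-line contact block into named fields (fixed layout assumed)."""
    # Line 1: "First Middle Last, TITLE"
    name_part, title = (part.strip() for part in lines[0].split(",", 1))
    first, middle, last = name_part.split()
    # Line 4: house number followed by the street name
    number, street = re.match(r"(\d+)\s+(.+)", lines[3]).groups()
    # Line 5: "City, ST"
    city, state = (part.strip() for part in lines[4].split(",", 1))
    return {
        "First Name": first,
        "Middle Name": middle,
        "Last Name": last,
        "Title": title,
        "Firm": lines[1].strip(),
        "Location": lines[2].strip(),
        "Number": number,
        "Street": street.strip(),
        "City": city,
        "State": state,
    }
```

Applied to the five input lines from the slide, this yields the same ten fields as the "parsed data" column.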
  • 18. Correcting
      • Corrects parsed individual data components using sophisticated data algorithms and secondary data sources.
  • 19. Correcting
      • Parsed data: First Name: Beth; Middle Name: Christine; Last Name: Parker; Title: SLS MGR; Firm: Regional Port Authority; Location: Federal Building; Number: 12800; Street: Lake Calumet; City: Hedgewisch; State: IL
      • Corrected data: First Name: Beth; Middle Name: Christine; Last Name: Parker; Title: SLS MGR; Firm: Regional Port Authority; Location: Federal Building; Number: 12800; Street: South Butler Drive; City: Chicago; State: IL; Zip: 60633; Zip+Four: 2398
  • 20. Standardizing
      • Standardizing applies conversion routines to transform data into its preferred (and consistent) format using both standard and custom business rules.
  • 21. Standardizing
      • Corrected data: First Name: Beth; Middle Name: Christine; Last Name: Parker; Title: SLS MGR; Firm: Regional Port Authority; Location: Federal Building; Number: 12800; Street: South Butler Drive; City: Chicago; State: IL; Zip: 60633; Zip+Four: 2398
      • Standardized data: Pre-name: Ms.; First Name: Beth; 1st Name Match Standards: Elizabeth, Bethany, Bethel; Middle Name: Christine; Last Name: Parker; Title: Sales Mgr.; Firm: Regional Port Authority; Location: Federal Building; Number: 12800; Street: S. Butler Dr.; City: Chicago; State: IL; Zip: 60633; Zip+Four: 2398
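The standardizing step on slide 21 amounts to applying lookup tables. The sketch below uses tiny hypothetical tables that cover only the values in the slide's example; a real rule set would be far larger and would also handle the pre-name (Ms.).

```python
# Hypothetical rule tables covering only the example record.
STREET_ABBREVIATIONS = {"South": "S.", "Drive": "Dr."}
TITLE_STANDARDS = {"SLS": "Sales", "MGR": "Mgr."}
NAME_MATCH_STANDARDS = {"Beth": ["Elizabeth", "Bethany", "Bethel"]}


def standardize(record: dict) -> dict:
    """Return a copy of the record with street, title and name variants standardized."""
    result = dict(record)
    # Replace each word that has a preferred form; leave other words unchanged.
    result["Street"] = " ".join(
        STREET_ABBREVIATIONS.get(word, word) for word in record["Street"].split()
    )
    result["Title"] = " ".join(
        TITLE_STANDARDS.get(word, word) for word in record["Title"].split()
    )
    # Attach the known variants of the first name for use in later matching.
    result["1st Name Match Standards"] = NAME_MATCH_STANDARDS.get(
        record["First Name"], []
    )
    return result
```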
  • 22. Matching
      • Searching and matching records within and across the parsed, corrected and standardized data based on predefined business rules to eliminate duplications.
  • 23. Match Patterns
      • Fields compared: Business Name, Street, Branch Type, Customer #/Tax ID, City, Vendor Code
      • Match grades used per field: Exact, VClose, Close, Blanks
      • Patterns: AAAAAA, ABAAA-, ABA-AA, ABCCAA, BBACAA
      • Pattern I.D.s: P110, P115, P120, S300, S310
  • 24. Matching
      • Corrected Data (Data Source #1): Pre-name: Ms.; First Name: Beth; 1st Name Match Standards: Elizabeth, Bethany, Bethel; Middle Name: Christine; Last Name: Parker; Title: Sales Mgr.; Firm: Regional Port Authority; Location: Federal Building; Number: 12800; Street: S. Butler Dr.; City: Chicago; State: IL; Zip: 60633; Zip+Four: 2398
      • Corrected Data (Data Source #2): Pre-name: Ms.; First Name: Elizabeth; 1st Name Match Standards: Beth, Bethany, Bethel; Middle Name: Christine; Last Name: Parker-Lewis; Title: (blank); Firm: Regional Port Authority; Location: Federal Building; Number: 12800; Street: S. Butler Dr., Suite 2; City: Chicago; State: IL; Zip: 60633; Zip+Four: 2398; Phone: 708-555-1234; Fax: 708-555-5678
  • 25. Consolidating
      • Analyzing and identifying relationships between matched records and consolidating/merging them into ONE representation.
  • 26. Consolidating
      • Corrected Data (Data Source #1) + Corrected Data (Data Source #2) -> Consolidated Data
      • Consolidated data: Name: Ms. Beth (Elizabeth) Christine Parker-Lewis; Title: Sales Mgr.; Firm: Regional Port Authority; Location: Federal Building; Address: 12800 S. Butler Dr., Suite 2, Chicago, IL 60633-2398; Phone: 708-555-1234; Fax: 708-555-5678
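Matching and consolidating the two source records can be sketched as follows. Both rules are assumptions chosen to reproduce the slide's example, not the presenter's actual business rules: two records match when firm, street number and zip agree and their first names are known variants of each other, and the survivorship rule keeps the longer non-empty value of each field.

```python
# Hypothetical variant table, mirroring the "1st Name Match Standards" on the slides.
NAME_VARIANTS = {
    "Beth": {"Elizabeth", "Bethany", "Bethel"},
    "Elizabeth": {"Beth", "Bethany", "Bethel"},
}


def first_names(record: dict) -> set:
    """The record's first name together with its known variants."""
    name = record["First Name"]
    return {name} | NAME_VARIANTS.get(name, set())


def is_match(a: dict, b: dict) -> bool:
    """Assumed rule: same firm, number and zip, plus overlapping first-name variants."""
    same_site = all(a.get(f) == b.get(f) for f in ("Firm", "Number", "Zip"))
    return same_site and not first_names(a).isdisjoint(first_names(b))


def consolidate(a: dict, b: dict) -> dict:
    """Assumed survivorship rule: keep the longer value for every field."""
    merged = {}
    for field in a.keys() | b.keys():
        value_a, value_b = a.get(field, ""), b.get(field, "")
        merged[field] = value_a if len(value_a) >= len(value_b) else value_b
    return merged


source_1 = {
    "First Name": "Beth", "Last Name": "Parker", "Title": "Sales Mgr.",
    "Firm": "Regional Port Authority", "Number": "12800",
    "Street": "S. Butler Dr.", "Zip": "60633",
}
source_2 = {
    "First Name": "Elizabeth", "Last Name": "Parker-Lewis", "Title": "",
    "Firm": "Regional Port Authority", "Number": "12800",
    "Street": "S. Butler Dr., Suite 2", "Zip": "60633",
    "Phone": "708-555-1234",
}
```

With these rules, `is_match(source_1, source_2)` is true, and `consolidate(source_1, source_2)` keeps "Parker-Lewis", "Sales Mgr.", the street with "Suite 2" and the phone number. The slide's consolidated record additionally keeps both first names, "Beth (Elizabeth)", which this simple rule does not attempt.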
  • 27. Sometimes such algorithms don't work!
      • So we do manual cleaning
  • 28. Example of Manual Cleaning (Car Name -> Correct Name)
      • Waganer -> Wagon R
      • Sujhuki -> Suzuki
      • Benj -> Mercedes Benz
      • Faurtuner -> Fortuner
      • Scopeio -> Scorpio
      • Sevrole -> Chevrolet
      • Furrarree -> Ferrari
      • Landcrusher -> Land Cruiser
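Manual cleaning of this kind usually ends up as a hand-maintained lookup table. A minimal sketch, assuming the corrections are keyed case-insensitively and that unknown values are returned unchanged so they can be reviewed later; the names below are illustrative.

```python
# Hand-curated corrections, taken from the examples on slide 28.
CAR_NAME_CORRECTIONS = {
    "waganer": "Wagon R",
    "sujhuki": "Suzuki",
    "benj": "Mercedes Benz",
    "faurtuner": "Fortuner",
    "scopeio": "Scorpio",
    "sevrole": "Chevrolet",
    "furrarree": "Ferrari",
    "landcrusher": "Land Cruiser",
}


def correct_car_name(raw: str) -> str:
    """Look up a manually curated correction; return the trimmed input if none exists."""
    trimmed = " ".join(raw.split())
    return CAR_NAME_CORRECTIONS.get(trimmed.lower(), trimmed)
```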
  • 30. Other data that we have cleaned
      • Occupation
      • Marital Status
      • Gender
      • And many other fields …
  • 32. Top Ten Data Capture Tips
      • Every contact is a data capture opportunity
      • Make it easy for the end user to give you information
      • Incentivise your end users to part with their details
      • Collect data in line with privacy regulations
      • Decide what data you need and prioritise
  • 33. Top Ten Data Capture Tips
      • Don't ask everything at once – build it over time
      • Set targets for breadth, depth and quality
      • Collect data in a standardized format
      • Streamline the data from point of capture to storage
      • If you can't collect it, BUY it!
  • 34. End of Part 1
  • 35. Understanding India … Census 2011
  • 36. Census in India
      • The first census in India in modern times was conducted in 1872. A population census has been carried out every 10 years. The census is carried out by the office of the Registrar General and Census Commissioner of India, Delhi, an office in the Ministry of Home Affairs, Government of India, under the 1948 Census of India Act.
  • 37. Census
      • The 15th Indian National census was conducted in two phases:
          • House listing
          • Population enumeration
      • The Census covered:
          • 640 districts
          • 5767 tehsils
          • 7742 towns
          • More than 6 lakh villages
      • 2.7 million officials visited households in 7,742 towns and 6,40,867 villages, classifying the population in different segments.
  • 38. Population Comparison (chart panels: 2021, 2011, 2001)
      • The population of India has increased by more than 181 million during the decade 2001-2011. This addition is slightly lower than the population of Brazil, the fifth most populous country in the world!
  • 39. India as compared to the world
      • The gap between India, the country with the second largest population in the world, and China, the country with the largest population in the world, has narrowed from 238 million in 2001 to nearly 131 million in 2011. On the other hand, the gap between India and the United States of America, which has the third largest population, has now widened to about 902 million from 741 million in 2001.
  • 42. Census report of 2011
  • 43. Sex Ratio
      • The sex ratio of India is 940. The sex ratio at the national level has risen by seven points since the last Census in 2001. This is the highest since 1971.
  • 44. Sex Ratio Trend in India
      • The sex ratio in India has been historically negative or, in other words, unfavourable to females. The sex ratio reached its lowest in 1991 but has kept rising since then.
  • 46. Census Facts 2011
      • Thane district of Maharashtra is the most populated district of India.
      • Dibang Valley of Arunachal Pradesh is the least populated.
      • Kurung Kumey of Arunachal Pradesh registered the highest population growth rate, 111.01 percent.
      • Longleng district of Nagaland registered a negative population growth rate (-58.39 percent).
      • Mahe district of Puducherry has the highest sex ratio, 1176 females per 1000 males.
      • Daman district has the lowest sex ratio, 533 females per 1000 males.
      • Serchhip district of Mizoram has the highest literacy rate, 98.76 percent.
      • Alirajpur of MP is the least literate district of India, at only 37.22 percent.
      • North East Delhi has the highest density, 37346 persons per square kilometer.
      • Dibang Valley has the lowest density, 1 person per sq. km.
  • 47. States having highest population
      • Uttar Pradesh (19.96 crore) increased at the rate of 20% from 2001.
      • Maharashtra (11.24 crore) increased at the rate of 15% since the last census.
      • Delhi is the most densely populated, with a density of 11297 per sq km (an increase of 21% from 2001).
      • Bihar is the most densely populated state, with a density of 1102 per sq km (an increase of 25% from 2001).
  • 48. States with highest literacy
  • 50. Interesting facts – Telecom
      • "More phones than toilets": Census 2011 sheds light on changing India.
      • 63.2 per cent of households in India now have a telephone/mobile facility (82 per cent in urban and 54 per cent in rural areas).
      • The penetration of mobile phones is 59 per cent and of landlines 10 per cent.
      • More than half of Indian households (some 53.1 per cent) do not have access to something as basic as a toilet.
  • 51. Facts – Communication
      • The penetration of computers and laptops in India is only 9.4 per cent, or less than one out of 10 households, with only 3 per cent having an internet facility.
      • The penetration of internet is 8 per cent in urban areas as compared to less than 1 per cent in rural areas.
      • Maharashtra is the biggest Indian internet market with 18%.
      • 47.2% of Indians own a television.
      • 19.9% of Indians own a radio/transistor.
      • 13.42 million broadband connections (homes and offices combined).
  • 52. Facts – Literacy and Population
      • Uttar Pradesh is the most populous state, and the combined population of Uttar Pradesh and Maharashtra is more than that of the USA.
      • Ten states and union territories have attained a literacy rate of above 85 per cent.
      • According to the Census report, India's population is now bigger than the combined population of the USA, Indonesia, Brazil, Pakistan and Bangladesh.
      • 74% of Indians can now read, write and do basic maths (like adding and subtracting); that means that 3 out of every 4 Indians are literate.
  • 53. Facts: General
      • Females outnumber males in Goa.
      • Population:
          • 50% <= 25 yrs of age
          • 65% <= 35 yrs of age
      • It is anticipated that the median age of an Indian citizen will be 29 years in 2020, in comparison to 48 for Japan and 37 for China.
      • India covers 2.4% of the land territory of the world and represents more than 17.5% of the population of the world.
  • 54. Facts: General
      • Total expenditure and materials used:
          • Cost: Rs. 2200 crore
          • Cost per person: Rs. 18.33
          • No. of Census functionaries: 2.7 million
          • No. of languages in which schedules were canvassed: 16
          • No. of languages in which training manuals were prepared: 18
          • No. of schedules printed: 340 million
          • No. of training manuals printed: 5.4 million
          • Paper utilised: 12,000 MTs
          • Material moved: 10,500 MTs
  • 55. What do we do with Census?
      • Census is more than population, literacy and sex ratio.
      • Census can provide insights about various dimensions!
      • The data is available in the xls format
      • The data is available free of cost
      • The data is clean
      • It has proper database architecture with codes in place
  • 56. Two stages of Census
      • Houselisting
      • Population Enumeration
  • 59. References
      • http://www.census2011.co.in/
      • http://articles.timesofindia.indiatimes.com/2011-03-31/
      • http://censusindia.gov.in/
      • http://en.wikipedia.org/wiki/2011_census_of_India
      • http://www.mapsofindia.com/census2011/

Editor's notes

  1. http://www.computerweekly.com/feature/How-clean-is-your-data
  2. https://support.office.com/en-ie/article/Top-ten-ways-to-clean-your-data-2844b620-677c-47a7-ac3e-c2e157d1db19
  3. http://stackoverflow.com/questions/6683380/techniques-for-finding-near-duplicate-records http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms/
  4. http://www.adma.com.au/connect/articles/top-ten-data-capture-tips/
  5. http://www.adma.com.au/connect/articles/top-ten-data-capture-tips/
  6. POP IN 2001
  7. POPULATION IN 2011 – MAHARASHTRA , UP , BIHAR POP INCREASES MOST. OTHER STATES ALSO ON UPWARD TREND. NAGALAND POP REDUCES. Arrows show the growth rate.