1. DATA CLEANING & PROFILING
UNDERSTANDING INDIA … CENSUS
2011
Datameet 4 Bhavin Dalal
2. What is Data Quality??
Data quality is a perception or an assessment of
data's fitness to serve its purpose in a given context.
Aspects of data quality include:
Accuracy – How accurate is the data?
Completeness – Is all the data present?
Update status – How old is the data?
Relevance – Is the data relevant to the purpose?
Consistency – Is the data consistent across different sources?
Reliability – How much can we rely on the data?
Appropriate presentation – Is the data presented in a way
that makes it usable?
Accessibility – Is the data accessible to all those who
require it?
3. Data Quality Problems
Referential Integrity
Use of NULL
Value checking for reasonableness
Eg: Date values
Value constrained to a pre-defined domain
Eg: Salutation
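The problem types listed on this slide can each be checked with a few lines of code. The following is a minimal sketch in Python, not a definitive implementation: the `customers` and `orders` records, the `ALLOWED_SALUTATIONS` set and the `find_problems` helper are all hypothetical and invented for illustration.

```python
from datetime import date

# Hypothetical sample records, invented for illustration only.
customers = [
    {"id": 1, "salutation": "Mr.", "dob": date(1980, 5, 17)},
    {"id": 2, "salutation": "Mister", "dob": date(2090, 1, 1)},
    {"id": 3, "salutation": None, "dob": None},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 9},
]

ALLOWED_SALUTATIONS = {"Mr.", "Mrs.", "Ms.", "Dr."}  # assumed domain


def find_problems(customers, orders, today):
    """Return a list of (check name, record key) pairs for every failed check."""
    problems = []
    known_ids = {c["id"] for c in customers}
    for c in customers:
        # Use of NULL: required fields must not be empty.
        if c["salutation"] is None or c["dob"] is None:
            problems.append(("null", c["id"]))
        # Domain check: the value must come from the pre-defined list.
        if c["salutation"] is not None and c["salutation"] not in ALLOWED_SALUTATIONS:
            problems.append(("domain", c["id"]))
        # Reasonableness: a date of birth cannot lie in the future.
        if c["dob"] is not None and c["dob"] > today:
            problems.append(("unreasonable_date", c["id"]))
    for o in orders:
        # Referential integrity: every order must point to a known customer.
        if o["customer_id"] not in known_ids:
            problems.append(("orphan_order", o["order_id"]))
    return problems


problems = find_problems(customers, orders, today=date(2024, 1, 1))
print(problems)
```

Passing `today` in explicitly keeps the reasonableness check repeatable, which is one possible design choice among several.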
4. Before doing data quality
Profiling of data
Conformity check
Standardization
Gender -> M/F or Male/Female or Unknown or Null ?
Duplicate Values
Survivorship
Best quality set from different records
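Profiling and standardization can be illustrated on the gender example from this slide. The sketch below assumes a hypothetical list of raw values and an assumed target code set of M, F and U (unknown); neither comes from the original slides.

```python
from collections import Counter

# Hypothetical raw values for a gender column, invented for illustration.
raw = ["M", "male", "F", "Female ", "", None, "m", "unknown", "F"]

# Assumed mapping to one standard code set: M, F, U (unknown).
STANDARD = {"m": "M", "male": "M", "f": "F", "female": "F"}


def standardize_gender(value):
    """Map a raw value to M, F or U (unknown or missing)."""
    if value is None:
        return "U"
    return STANDARD.get(value.strip().lower(), "U")


# Profiling: count how many distinct spellings exist before cleaning.
profile_before = Counter(raw)
cleaned = [standardize_gender(v) for v in raw]
profile_after = Counter(cleaned)

print(len(profile_before), "distinct raw values")
print(dict(profile_after))
```

Whether Null and "Unknown" should share one code is a business decision; here they are merged purely to keep the sketch short.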
5. Basic Data Cleaning Steps
Removing spaces and nonprinting characters
Fixing numbers and number signs
Fixing dates and times
Merging and splitting columns
Eg: Names (First Name + Last Name / Full Name)
Need for transformation
Checking data quality through joining and
matching
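A few of the basic steps above can be sketched in Python using only the standard library. The helper names (`clean_text`, `parse_number`, `split_name`) and the number formats handled are assumptions chosen for illustration, not a complete cleaning routine.

```python
import re
import unicodedata


def clean_text(value):
    """Drop nonprinting characters, then collapse and trim whitespace."""
    value = value.replace("\u00a0", " ")      # non-breaking space -> space
    value = re.sub(r"[\t\r\n]", " ", value)   # keep line breaks as separators
    # Unicode categories starting with "C" cover control and format characters.
    value = "".join(ch for ch in value if not unicodedata.category(ch).startswith("C"))
    return re.sub(r"\s+", " ", value).strip()


def parse_number(value):
    """Parse numbers such as '1,234.50', '(250)' or '75-' into a float."""
    text = clean_text(value).replace(",", "")
    negative = False
    if text.startswith("(") and text.endswith(")"):  # accounting-style negative
        negative, text = True, text[1:-1]
    elif text.endswith("-"):                         # trailing minus sign
        negative, text = True, text[:-1]
    number = float(text)
    return -number if negative else number


def split_name(full_name):
    """Split a full name on its last space; a single word has no last name."""
    first, _, last = clean_text(full_name).rpartition(" ")
    return (first, last) if first else (last, "")


print(clean_text("  Sujit\u00a0 Joshi\x07\n"))
print(parse_number("(250)"), parse_number("75-"))
print(split_name("Sujit Joshi"))
```

Splitting on the last space is only one possible rule; names with multi-word surnames would need a different convention.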
6. Finding duplicate values
Below are algorithms for finding duplicates. Most of
them measure string distance; Soundex is phonetic.
Hamming()
Jaro-Winkler()
Levenshtein()
Damerau-Levenshtein() --- Levenshtein plus transpositions of adjacent characters
Q-gram()
Cosine()
Soundex()
7. Hamming
Number of positions at which the two strings have
different symbols. Only defined for strings of equal
length.
distance('abcdd','abbcd') = 2
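As a minimal sketch, the Hamming distance can be computed by comparing the two strings position by position. For the slide's example the strings differ in 2 positions and agree in the other 3. The function name `hamming` is chosen here for illustration.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(hamming("abcdd", "abbcd"))  # 2 differing positions, 3 matching ones
```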
8. Jaro-winkler
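This distance is a formula of 5 parameters
determined by the two compared strings
(A, B, m, t, l), where m is the number of matching
characters, t the number of transpositions and l the
length of the common prefix, plus a scaling factor p
chosen from [0, 0.25].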
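One way to read the parameters is sketched below in Python. This is a hedged illustration of the commonly used Jaro-Winkler definition, not necessarily the exact variant the slide had in mind. The default p = 0.1, the prefix cap of 4 characters and expressing the result as 1 minus the similarity are assumptions.

```python
def jaro_winkler_distance(a, b, p=0.1):
    """Return 1 - Jaro-Winkler similarity (0.0 means identical strings)."""
    if not a and not b:
        return 0.0
    # Characters count as matching only within this window of positions.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    m = 0
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                m += 1
                break
    if m == 0:
        return 1.0
    # t: half the number of matched characters that appear in a different order.
    a_seq = [ch for ch, hit in zip(a, a_matched) if hit]
    b_seq = [ch for ch, hit in zip(b, b_matched) if hit]
    t = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    jaro = (m / len(a) + m / len(b) + (m - t) / m) / 3
    # l: length of the common prefix, capped at 4 characters.
    l = 0
    for x, y in zip(a, b):
        if x != y or l == 4:
            break
        l += 1
    similarity = jaro + l * p * (1 - jaro)
    return 1 - similarity


print(round(jaro_winkler_distance("MARTHA", "MARHTA"), 4))
```

With the prefix capped at 4, keeping p at or below 0.25 ensures the similarity never exceeds 1, which is consistent with the range given on the slide.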
9. Levenshtein
Minimal number of insertions, deletions and
replacements needed for transforming string a
into string b.
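A minimal sketch of the Levenshtein distance, using the standard dynamic-programming approach with one row of the table kept at a time. The function name and the example strings are chosen for illustration; "Ahmedabad" and "Amdavad" are borrowed from the Soundex slide.

```python
def levenshtein(a, b):
    """Minimal number of insertions, deletions and replacements from a to b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # replace ca with cb (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Ahmedabad", "Amdavad"))  # delete h, delete e, replace b with v
```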
10. N-gram / Q-gram
Sum of absolute differences between N-gram
vectors of both strings.
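The definition above can be sketched as follows, assuming q = 2 (bigrams) as the default. The helper names `qgrams` and `qgram_distance` are hypothetical.

```python
from collections import Counter


def qgrams(text, q=2):
    """Count every substring of length q in text."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def qgram_distance(a, b, q=2):
    """Sum of absolute differences between the q-gram counts of a and b."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    # A Counter returns 0 for q-grams that are missing from one string.
    return sum(abs(ga[g] - gb[g]) for g in ga.keys() | gb.keys())


print(qgram_distance("abcd", "abce"))  # "cd" and "ce" each occur in one string only
```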
11. Cosine
1 minus the cosine similarity of both N-gram
vectors.
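A short sketch of the cosine distance on N-gram count vectors, again assuming bigrams (q = 2). Returning 1.0 when a string is too short to have any N-gram is an assumption made here to avoid dividing by zero.

```python
import math
from collections import Counter


def qgrams(text, q=2):
    """Count every substring of length q in text."""
    return Counter(text[i:i + q] for i in range(len(text) - q + 1))


def cosine_distance(a, b, q=2):
    """1 minus the cosine similarity of the two q-gram count vectors."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    norm_a = math.sqrt(sum(n * n for n in ga.values()))
    norm_b = math.sqrt(sum(n * n for n in gb.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumed convention for strings shorter than q
    dot = sum(ga[g] * gb[g] for g in ga)
    # Clamp at 0 so rounding errors never produce a tiny negative distance.
    return max(0.0, 1 - dot / (norm_a * norm_b))


print(round(cosine_distance("abcd", "abce"), 4))
```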
12. Soundex
SOUNDEX converts an alphanumeric string to a
four-character code that is based on how the
string sounds when spoken. The letters A, E, I,
O, U, H, W, and Y are ignored unless they are
the first letter of the string. Zeroes are added at
the end if necessary to produce a four-character
code.
SOUNDEX('Ahmedabad') = A531
SOUNDEX('Amdavad') = A531
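The rules above can be sketched in Python as follows. This follows the common American Soundex convention, in which H and W do not separate repeated codes; ignoring digits and other non-letters is an assumption, and real database implementations of SOUNDEX may differ in such details.

```python
# Consonant groups and their Soundex digits.
CODES = {}
for digit, group in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                     "4": "L", "5": "MN", "6": "R"}.items():
    for letter in group:
        CODES[letter] = digit


def soundex(name):
    """Return the four-character Soundex code of name."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0]                   # the first letter is always kept
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")          # vowels, H, W and Y have no code
        if code and code != previous:
            result += code
        if ch not in "HW":                # H and W do not break a run of equal codes
            previous = code
    return (result + "000")[:4]           # pad with zeroes, cut to four characters


print(soundex("Ahmedabad"), soundex("Amdavad"))
```

Both spellings of the city produce the same code, which is what makes Soundex useful for finding duplicates that differ only in spelling.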
14. Components of Address
Before cleansing:
Sujit Joshi
88 Ashoka Appts
Juhu
Bombay
Tel: 6201670
Cell: 998054046
Email: Sujit.Joshi@gml.com
Problems found:
Missing salutation
Abbreviated house name
Missing postcode
Old telephone number
Incorrect email id
After cleansing:
Mr. Sujit Joshi
88 Ashoka Apartments
Gandhigram Road
Juhu
Mumbai – 400 049
India
Tel: (22) 26201670
Cell: 998054046X
Email: Sujit.Joshi@gmail.com
Corrections applied:
Salutation added
House name standardised
Postcode & Country added
Correct telephone number for known changes (add 2 to 7 digit numbers; include STD code for the city)
Tag Cell Number to be of invalid format
Email id typo corrected
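The telephone corrections described on this slide can be sketched as two small rules. The `STD_CODES` lookup contains only the city from the example, and treating a cell number as valid only when it has exactly 10 digits is an assumption about the intended format check.

```python
import re

# Assumed lookup table; only the city from the slide's example is listed.
STD_CODES = {"Mumbai": "22"}


def fix_landline(number, city):
    """Prefix old 7-digit numbers with 2 and add the city's STD code."""
    digits = re.sub(r"\D", "", number)  # keep digits only
    if len(digits) == 7:
        digits = "2" + digits
    return f"({STD_CODES[city]}) {digits}"


def is_valid_cell(number):
    """Assumed rule: a cell number must consist of exactly 10 digits."""
    return re.fullmatch(r"\d{10}", number) is not None


print(fix_landline("6201670", "Mumbai"))
print(is_valid_cell("998054046"))  # 9 digits, so it is tagged as invalid
```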
15. Steps in Data Cleansing
Parsing
Correcting
Standardizing
Matching
Consolidating
16. Parsing
Parsing locates and identifies individual data
elements in the source files and then isolates
these data elements in the target files.
17. Parsing
Input Data from Source File
Beth Christine Parker, SLS MGR
Regional Port Authority
Federal Building
12800 Lake Calumet
Hedgewisch, IL
Parsed Data in Target File
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL
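The parsing example on this slide can be reproduced with a small sketch. It assumes exactly the five-line layout shown (name and title, firm, location, street line, city and state); the `parse_record` helper is hypothetical, and real parsers must cope with far more variation.

```python
import re

source = [
    "Beth Christine Parker, SLS MGR",
    "Regional Port Authority",
    "Federal Building",
    "12800 Lake Calumet",
    "Hedgewisch, IL",
]


def parse_record(lines):
    """Split the five source lines into named target fields."""
    name_part, title = [p.strip() for p in lines[0].split(",", 1)]
    first, middle, last = name_part.split()  # assumes three name parts
    number, street = re.fullmatch(r"(\d+)\s+(.+)", lines[3]).groups()
    city, state = [p.strip() for p in lines[4].split(",", 1)]
    return {
        "First Name": first, "Middle Name": middle, "Last Name": last,
        "Title": title, "Firm": lines[1], "Location": lines[2],
        "Number": number, "Street": street, "City": city, "State": state,
    }


parsed = parse_record(source)
for field, value in parsed.items():
    print(f"{field}: {value}")
```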
18. Correcting
Corrects parsed individual data components
using sophisticated data algorithms and
secondary data sources.
19. Correcting
Parsed Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: Lake Calumet
City: Hedgewisch
State: IL
Corrected Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
20. Standardizing
Standardizing applies conversion routines to
transform data into its preferred (and
consistent) format using both standard and
custom business rules.
21. Standardizing
Corrected Data
First Name: Beth
Middle Name: Christine
Last Name: Parker
Title: SLS MGR
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: South Butler Drive
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Standardized Data
Pre-name: Ms.
First Name: Beth
1st Name Match
Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
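Standardizing usually comes down to applying lookup tables and conversion rules. The sketch below covers three of the changes visible in this example (title, street abbreviations, first-name match standards); the tables hold only the values from the slide, and the `standardize` helper is hypothetical.

```python
# Lookup tables holding only the values that appear in the example.
TITLE_STANDARDS = {"SLS MGR": "Sales Mgr."}
STREET_ABBREVIATIONS = {"South": "S.", "Drive": "Dr."}
NAME_STANDARDS = {"Beth": ["Elizabeth", "Bethany", "Bethel"]}


def standardize(record):
    """Return a copy of record with preferred, consistent formats applied."""
    out = dict(record)
    out["Title"] = TITLE_STANDARDS.get(record["Title"], record["Title"])
    out["Street"] = " ".join(
        STREET_ABBREVIATIONS.get(word, word) for word in record["Street"].split()
    )
    out["1st Name Match Standards"] = NAME_STANDARDS.get(record["First Name"], [])
    return out


corrected = {"First Name": "Beth", "Title": "SLS MGR", "Street": "South Butler Drive"}
standardized = standardize(corrected)
print(standardized)
```

Values missing from a lookup table pass through unchanged, so new rules can be added without breaking records that are already in the preferred form.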
22. Matching
Searching and matching records within and
across the parsed, corrected and standardized
data based on predefined business rules to
eliminate duplications.
23. Match Patterns
Fields compared: Business Name, Street, Branch Type, Customer #/Tax ID, City, Vendor Code
Match grades per field: Exact, VClose, Close, Blanks
Pattern    Pattern I.D.
AAAAAA     P110
ABAAA-     P115
ABA-AA     P120
ABCCAA     S300
BBACAA     S310
24. Matching
Corrected Data (Data Source #1)
Pre-name: Ms.
First Name: Beth
1st Name Match
Standards: Elizabeth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr.
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Corrected Data (Data Source #2)
Pre-name: Ms.
First Name: Elizabeth
1st Name Match
Standards: Beth, Bethany, Bethel
Middle Name: Christine
Last Name: Parker-Lewis
Title:
Firm: Regional Port Authority
Location: Federal Building
Number: 12800
Street: S. Butler Dr., Suite 2
City: Chicago
State: IL
Zip: 60633
Zip+Four: 2398
Phone: 708-555-1234
Fax: 708-555-5678
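A predefined business rule for matching these two records might look like the sketch below. The rule itself is an assumption made for illustration: address fields must match exactly, first names may match through the name standards, and last names match when they share a part of a hyphenated name.

```python
source_1 = {
    "First Name": "Beth", "Name Standards": ["Elizabeth", "Bethany", "Bethel"],
    "Last Name": "Parker", "Firm": "Regional Port Authority",
    "Number": "12800", "City": "Chicago", "State": "IL", "Zip": "60633",
}
source_2 = {
    "First Name": "Elizabeth", "Name Standards": ["Beth", "Bethany", "Bethel"],
    "Last Name": "Parker-Lewis", "Firm": "Regional Port Authority",
    "Number": "12800", "City": "Chicago", "State": "IL", "Zip": "60633",
}

EXACT_FIELDS = ("Firm", "Number", "City", "State", "Zip")  # assumed rule


def first_names_match(a, b):
    """First names match directly or through either record's name standards."""
    return (a["First Name"] == b["First Name"]
            or a["First Name"] in b["Name Standards"]
            or b["First Name"] in a["Name Standards"])


def last_names_match(a, b):
    """Last names match when they share at least one hyphen-separated part."""
    return bool(set(a["Last Name"].split("-")) & set(b["Last Name"].split("-")))


def is_match(a, b):
    """Apply the assumed business rule to two standardized records."""
    return (all(a[f] == b[f] for f in EXACT_FIELDS)
            and first_names_match(a, b)
            and last_names_match(a, b))


print(is_match(source_1, source_2))
```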
25. Consolidating
Analyzing and identifying relationships
between matched records and
consolidating/merging them into ONE
representation.
26. Consolidating
Corrected Data (Data Source #1)
Corrected Data (Data Source #2)
Consolidated Data
Name: Ms. Beth (Elizabeth) Christine Parker-Lewis
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Address: 12800 S. Butler Dr., Suite 2
Chicago, IL 60633-2398
Phone: 708-555-1234
Fax: 708-555-5678
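Consolidation needs a survivorship rule that decides which value wins for each field. The sketch below assumes a deliberately simple rule, that the longest non-empty value survives; it does not reproduce the combined "Beth (Elizabeth)" display shown in the example, which would need a name-specific rule.

```python
def consolidate(records):
    """Merge matched records; the longest non-empty value of each field survives."""
    merged = {}
    for record in records:
        for field, value in record.items():
            # On a tie the value from the earlier record is kept.
            if len(value) > len(merged.get(field, "")):
                merged[field] = value
    return merged


source_1 = {"Last Name": "Parker", "Title": "Sales Mgr.", "Street": "S. Butler Dr."}
source_2 = {"Last Name": "Parker-Lewis", "Title": "",
            "Street": "S. Butler Dr., Suite 2", "Phone": "708-555-1234"}

consolidated = consolidate([source_1, source_2])
print(consolidated)
```

Other survivorship rules, such as preferring the most recent or the most trusted source, are equally plausible and depend on the business context.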
28. Example of Manual Cleaning
Car Name       Correct Name
Waganer        Wagon R
Sujhuki        Suzuki
Benj           Mercedes Benz
Faurtuner      Fortuner
Scopeio        Scorpio
Sevrole        Chevrolet
Furrarree      Ferrari
Landcrusher    Land Cruiser
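Once corrections like these have been worked out by hand, they can be stored in a lookup table and reused. A minimal sketch, in which the `correct_car_name` helper and its review flag are hypothetical:

```python
# Manually curated corrections taken from the table on this slide.
CORRECTIONS = {
    "waganer": "Wagon R", "sujhuki": "Suzuki", "benj": "Mercedes Benz",
    "faurtuner": "Fortuner", "scopeio": "Scorpio", "sevrole": "Chevrolet",
    "furrarree": "Ferrari", "landcrusher": "Land Cruiser",
}


def correct_car_name(raw):
    """Return (name, needs_review); unknown names are flagged for manual review."""
    key = raw.strip().lower()
    if key in CORRECTIONS:
        return CORRECTIONS[key], False
    return raw.strip(), True


print(correct_car_name(" Sujhuki "))
print(correct_car_name("Tata Nano"))
```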
32. Top Ten Data Capture Tips
Every contact is a data capture opportunity
Make it easy for the end user to give you
information
Incentivise your end users to part with their
details
Collect data in line with privacy regulations
Decide what data you need and prioritise
33. Top Ten Data Capture Tips
Don’t ask everything at once – build it over
time
Set targets for breadth, depth and quality
Collect data in standardized format
Streamline the data from point of capture to
storage
If you can't collect it, BUY it!
36. Census in India
The first census in India in modern times was
conducted in 1872.
Population census has been carried out every
10 years.
The census is carried out by the office of the
Registrar General and Census Commissioner of
India, Delhi, an office in the Ministry of Home
Affairs, Government of India, under the 1948
Census of India Act.
37. CENSUS
The 15th Indian National Census was conducted in two phases:
House listing
Population enumeration
The Census covered:
640 districts
5767 tehsils
7742 towns
More than 6 lakh villages
2.7 million officials visited households in 7,742 towns and 6,40,867
villages, classifying the population in different segments
38. POPULATION COMPARISON
[Chart: population by year (2001, 2011, 2021)]
The population of India has
increased by more than 181 million
during the decade 2001-2011. This
addition is slightly lower than the
population of Brazil, the fifth most
populous country in the world!
39. India as compared to the world
The gap between India, the
country with the second largest
population in the world and
China, the country with the
largest population in the world
has narrowed from 238 million
in 2001 to nearly 131 million in
2011. On the other hand, the
gap between India and the
United States of America,
which has the third largest
population, has now widened
to about 902 million from 741
million in 2001.
43. Sex Ratio
The sex ratio of India is 940. The sex ratio at the
National level has risen by seven points since the
last Census in 2001. This is the highest since 1971.
44. Sex Ratio Trend in India
The sex ratio in India has been historically negative or in
other words, unfavourable to females. Sex ratio reached its
lowest in 1991 but since then has kept rising.
46. Census Facts 2011
Thane district of Maharashtra is the most populated district of India.
Dibang Valley of Arunachal Pradesh is the least populated.
Kurung Kumey of Arunachal Pradesh registered highest population growth
rate of 111.01 percent.
Longleng district of Nagaland registered a negative population growth rate of
-58.39 percent.
Mahe district of Puducherry has highest sex ratio of 1176 females per 1000
males.
Daman district has lowest sex ratio of 533 females per 1000 males.
Serchhip district of Mizoram has highest literacy rate of 98.76 percent.
Alirajpur of MP is the least literate district of India with figure of 37.22
percent only.
North East Delhi has the highest density, at 37346 persons per
square kilometer.
Dibang Valley has the least density of 1 person per sq. km.
47. States having highest population
Uttar Pradesh - (19.96 Crore) increased at the
rate of 20% from 2001
Maharashtra- (11.24 Crore) increased at the
rate of 15% since last census.
Delhi is most densely populated with a
density of 11297 per sq km ( an increase of
21% from 2001)
Bihar is the most densely populated state with
a density of 1102 per sq km ( an increase of
25% from 2001).
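Several figures in these slides use Indian number units and digit grouping (crore, lakh or lac, and values such as 6,40,867). A small sketch for converting them to plain integers before analysis, relying on 1 lakh = 100,000 and 1 crore = 10,000,000; the helper names are hypothetical.

```python
from decimal import Decimal

UNITS = {"lakh": 10**5, "lac": 10**5, "crore": 10**7}


def to_number(amount, unit):
    """Convert an amount given as text, such as '19.96', in lakh or crore."""
    # Decimal avoids the rounding surprises of binary floating point.
    return int(Decimal(amount) * UNITS[unit.lower()])


def parse_grouped(text):
    """Parse a number written with Indian digit grouping, such as '6,40,867'."""
    return int(text.replace(",", ""))


print(to_number("19.96", "Crore"))  # Uttar Pradesh, from this slide
print(to_number("11.24", "Crore"))  # Maharashtra, from this slide
print(parse_grouped("6,40,867"))    # villages, from the CENSUS slide
```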
50. Interesting facts- Telecom
“More phones than toilets” Census 2011
sheds light on changing India.
63.2 per cent of households in India now have a
telephone/mobile facility (82 per cent in urban
and 54 per cent in rural areas).
The penetration of mobile phone is 59 per cent
and landline is 10 per cent.
More than half of Indian households (some
53.1 per cent) do not have access to
something as basic as a toilet.
51. Facts- Communication
The penetration of computers and laptops in India
is only 9.4 per cent or less than one out of 10
households with only 3 per cent having internet
facility.
The penetration of internet is 8 per cent in urban
as compared to less than 1 per cent in rural area.
Maharashtra is the biggest Indian Internet market
with 18%.
47.2 % of Indians own a Television
19.9 % of Indians own a Radio/Transistors
13.42 Million broadband connections (Home +
Offices ) combined.
52. Facts- Literacy and Population
Uttar Pradesh is the most populous state and
the combined population of Uttar Pradesh and
Maharashtra is more than that of the USA.
Ten states and union territories have attained
literacy rate of above 85 per cent.
According to the Census report India's population
is now bigger than the combined population of
USA, Indonesia, Brazil, Pakistan and Bangladesh.
74% of Indians can now read, write and do basic
maths (like adding, subtracting) — that means
that 3 out of every 4 Indians are literate.
53. Facts : General
Females outnumber males in Goa.
Population
50% <=25 yrs of age
65% <=35 yrs of age
It is anticipated that the median age of an Indian
citizen will be 29 years in 2020, in comparison to
48 for Japan and 37 for China.
India covers 2.4% of the land territory of the world
and represents more than 17.5% of the population
of the world.
54. Facts : General
Total expenditure and materials used :
• Cost Rs. 2200 crore
• Cost per person Rs. 18.33
• No. of Census Functionaries 2.7 million
• No. of Languages in which Schedules were canvassed 16
• No. of Languages in which Training Manuals prepared 18
• No. of Schedules Printed 340 million
• No. of Training Manuals Printed 5.4 million
• Paper Utilised 12,000 MTs
• Material Moved 10,500 MTs
55. What do we do with the Census?
The Census is more than population, literacy and
sex ratio.
The Census can provide insights about various
dimensions!
The data is available in the xls format
The data is available free of cost
The data is clean
It has proper database architecture with codes
in place
56. Two stages of Census
Houselisting
Population Enumeration
Population in 2011: Maharashtra, UP and Bihar populations increase most. Other states are also on an upward trend. Nagaland's population reduces. Arrows show the growth rate.