1. In the name of Allah, the Most Gracious, the Most Merciful
An Introduction to Statistical Tools and SPSS
used in Social Research
Presented by
Professor Dr. Md. Nazrul Islam Mondal
Visiting Scholar
ERASMUS+ Fellowship Program
Department of Sociology
Middle East Technical University
Ankara, TURKEY
E-mail: nazrulupm@gmail.com
2. Outline
i. Data presentation
ii. Central tendency
iii. Skewness and kurtosis
iv. Measures of dispersion
v. Correlation
vi. Regression
3. DATA PRESENTATION
• Statistics:
Statistics is a branch of scientific methodology.
Data: Data is a collection of facts or information from
which conclusions may be drawn.
• Stages of statistical investigation:
Collection,
Organization,
Presentation,
Analysis, and
Interpretation.
4. • Population and sample
Essential purposes of statistics:
i. to describe the numerical properties of
populations, and
ii. to draw inferences about the population from
samples.
• Population: It is the entire category under
consideration. Its size is usually denoted by N.
• Sample: A sample is a representative part of
the population. It is a subset or portion of the
population. Its size is denoted by n.
5. • Parameter: It is a characteristic or measure
obtained from a population.
• Statistic: It is a characteristic or measure obtained
from a sample.
• Descriptive statistics: They provide simple
summaries about the sample and the measures.
• Statistical inference: Drawing conclusions about
population parameters from a sample.
It is used to:
i. test hypotheses,
ii. determine relationships between
variables, and
iii. make predictions.
6. • Variables: Measurable characteristics are
called variables.
• Types:
i. Qualitative, and
ii. Quantitative
• Random Variable: A variable whose values are
determined by chance.
• Constant: A constant is a particular type of
variable, which does not vary from one
member of a group to another.
7. • Discrete variables: Usually obtained by
counting; they take integer values.
• Examples: number of children, number of students in METU, etc.
• Continuous variables: Usually obtained by
measurement; they take real-number values.
• Examples: height, weight, etc.
• Data types (by method of collection)
– Primary data: Primary data come mainly from
direct field operations.
– Secondary data: Secondary data are usually
obtained from already published or unpublished
documents.
8. • Types of data
i. categorical, and
ii. numerical.
• Categorical data types:
i. nominal (gender, religion, room numbers, etc.),
ii. ordinal (education level, book chapters, health status, etc.)
• Numerical data types:
i. interval scale (age groups, dates, etc.)
ii. ratio scale (BMI, CWR, etc.)
9. • Frequency: The number of times a certain
value or class of values occurs.
• Frequency distribution: A tabular
arrangement of data by classes together with
the corresponding class frequencies is called
frequency distribution.
10. • Data types:
i. Ungrouped data (without frequency),
ii. Grouped data (with frequency).
• Data presentation by tables, graphs
– frequency distributions,
– bar diagram,
– histograms,
– multiple bar diagram,
– Box plot,
– Frequency polygon,
– other graphs.
11. Examples
The students in the Department of Sociology, METU were involved in the following activities:
Activity              Number of Students
Play Sports           45
Talk on Phone         53
Visit With Friends    99
Earn Money            44
Chat Online           66
School Clubs          22
Watch TV              37
Total                 n=366
12. Bar diagram
[Figure: bar diagram of the number of students by activity (Play Sports 45, Talk on Phone 53, Visit With Friends 99, Earn Money 44, Chat Online 66, School Clubs 22, Watch TV 37); frequency axis from 0 to 120.]
Note: A bar diagram can be presented vertically or horizontally
to show comparisons among categories.
13. Histogram: A graphical representation of how many times different, mutually
exclusive events are observed in an experiment.
The data represents ages of a group of people.
36 25 38 46 55 68 72 55 36 38
67 45 22 48 91 46 52 61 58 55
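A histogram of these ages needs a frequency count per class. The sketch below is one possible way to do that in Python; the class width of 10 years and the helper name `histogram_rows` are assumptions for illustration, since the slide does not state how the ages were grouped.

```python
from collections import Counter

# The 20 ages listed on the slide.
ages = [36, 25, 38, 46, 55, 68, 72, 55, 36, 38,
        67, 45, 22, 48, 91, 46, 52, 61, 58, 55]

WIDTH = 10  # assumed class width; the slide does not give one

# Map each age to the lower limit of its class, e.g. 36 -> 30 (class 30-39).
counts = Counter((age // WIDTH) * WIDTH for age in ages)


def histogram_rows(counts, width):
    """Return (class label, frequency) pairs from the lowest to the highest class."""
    rows = []
    for lower in range(min(counts), max(counts) + width, width):
        rows.append((f"{lower}-{lower + width - 1}", counts.get(lower, 0)))
    return rows


# Print a simple text histogram, one row per class.
for label, freq in histogram_rows(counts, WIDTH):
    print(f"{label:>6} | {'#' * freq} ({freq})")
```

Empty classes (here 80-89) are kept in the output so that the gaps in the distribution stay visible.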
15. Multiple bar diagram: Data on several variables in respect of different places
or time points may be presented by multiple bar diagram.
Grades     1st year   2nd year   3rd year   4th year
Grade A    5          7          9          10
Grade B    15         18         15         12
Grade C    20         15         10         8
[Figure: multiple bar diagram of Grade A, Grade B and Grade C for the 1st to 4th years; frequency axis from 0 to 25.]
16. Pie chart: Different components of the data may be exhibited by splitting a circle. The
angle at the center of the circle is divided in proportion to the relative magnitude of the
components, and the circle is split into sectors accordingly. The division may also be
expressed in percentages. Usually the components are distinguished by different colors.
Grades 1st year
Grade A 5
Grade B 15
Grade C 20
Total 40
[Figure: pie chart titled "Social Science" showing Grade A 12%, Grade B 38%, Grade C 50%.]
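The proportional division of the circle can be checked with a few lines of Python. This is a minimal sketch using the 1st-year counts from the table; the function name `pie_shares` is hypothetical.

```python
# 1st-year counts from the table on this slide.
grades = {"Grade A": 5, "Grade B": 15, "Grade C": 20}


def pie_shares(counts):
    """Return {name: (percentage, central angle in degrees)} for each component."""
    total = sum(counts.values())
    return {name: (100 * c / total, 360 * c / total) for name, c in counts.items()}


for name, (percent, angle) in pie_shares(grades).items():
    print(f"{name}: {percent:.1f}% of the total, central angle {angle:.0f} degrees")
```

The exact shares are 12.5%, 37.5% and 50%; the chart labels show them rounded to whole numbers.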
17. Frequency polygon: A frequency polygon gives the idea about the shape of
the data distribution. The two end points of a frequency polygon always lie on
the x-axis.
18. Box plot: The box plot (box-and-whisker diagram) displays the five-number
summary: minimum, first quartile, median, third quartile, and maximum.
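As an illustration, the five-number summary behind a box plot can be computed with Python's standard library. The sketch reuses the 20 ages from the histogram slide; the choice of `method="inclusive"` is an assumption, because quartile conventions differ between textbooks and software packages and the slides do not fix one.

```python
import statistics

# The 20 ages from the histogram slide, reused here as sample data.
ages = [36, 25, 38, 46, 55, 68, 72, 55, 36, 38,
        67, 45, 22, 48, 91, 46, 52, 61, 58, 55]


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    values = sorted(data)
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


print(five_number_summary(ages))
```

The box spans Q1 to Q3 with a line at the median, and the whiskers extend toward the minimum and maximum.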
19. Scatter plot: A plot of the data values on a coordinate system. A
scatter plot (or scatter diagram) is used to show the relationship
between two variables. The independent variable is graphed
along the x-axis and the dependent variable along the y-axis.
20. Cumulative frequency polygon or ogive: A graph showing the cumulative
frequency less than any upper class boundary plotted against the upper class
boundary is called a cumulative frequency polygon or ogive.
21. CENTRAL TENDENCY
• Central Tendency: It is a single value that
attempts to describe a set of data by identifying
the central position within that set of data.
• Measures are:
–Mean,
–Median,
–Mode,
–Quartiles,
–Deciles, and
–Percentiles.
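24. • Median: The median is the middle score for a set of
data that has been arranged in order of magnitude.
• Note: First arrange the data in order of magnitude.
\[
\text{Median } (Me) =
\begin{cases}
\left(\dfrac{n+1}{2}\right)\text{th term}, & \text{when } n \text{ is an odd number},\\[2ex]
\dfrac{\left(\dfrac{n}{2}\right)\text{th term} + \text{next term}}{2}, & \text{when } n \text{ is an even number}.
\end{cases}
\]
Example: The median of 0, 1, 2, 3, 7 is the \(\frac{5+1}{2}\)th = 3rd term, i.e., 2.
Example: The median of 0, 1, 2, 3, 7, 9 is \(\dfrac{\left(\frac{6}{2}\right)\text{th term} + \text{next term}}{2} = \dfrac{2+3}{2} = 2.5\).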
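The odd/even rule for the median translates directly into code. The following is a minimal sketch (the function name `median` is chosen for illustration; Python's `statistics.median` does the same job):

```python
def median(values):
    """Median by the slide's rule: middle term if n is odd,
    mean of the (n/2)-th term and the next term if n is even."""
    ordered = sorted(values)          # first arrange the data by magnitude
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # the ((n+1)/2)-th term in 1-based counting
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([0, 1, 2, 3, 7]))        # the slide's odd-n example
print(median([0, 1, 2, 3, 7, 9]))     # the slide's even-n example
```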
25. • Mode (M0)= Most frequent number in the observation.
– Example: 1, 3, 4, 5, 9, 3. Mode=3
• Unimodal: A distribution having only one mode is called
unimodal.
• Bimodal: A distribution having two modes is called
bimodal.
Example: 1, 3, 4, 5, 9, 3, 4. Mode=3 and 4
• Empirical relation (approximate, for moderately skewed distributions): Mode = mean - 3(mean - median)
• Midrange: The mean of the highest and lowest values.
(Max + Min) / 2.
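The two mode examples and the midrange can be reproduced with a short sketch. It relies on `statistics.multimode` from the Python standard library, which returns every value that shares the highest frequency; the wrapper names `modes` and `midrange` are hypothetical.

```python
from statistics import multimode


def modes(values):
    """Return all modes; one entry means unimodal, two entries mean bimodal."""
    return multimode(values)


def midrange(values):
    """Mean of the highest and lowest values: (Max + Min) / 2."""
    return (max(values) + min(values)) / 2


print(modes([1, 3, 4, 5, 9, 3]))       # unimodal example from the slide
print(modes([1, 3, 4, 5, 9, 3, 4]))    # bimodal example from the slide
print(midrange([1, 3, 4, 5, 9, 3]))
```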
26. • QUANTILES: Quantiles divide the total frequency into a number of equal parts.
Types:
i. Quartiles,
ii. Deciles, and
iii. Percentiles.
• Quartiles: Quartiles divide the total frequency into four equal parts.
• Types:
i. 1st quartile, Q1,
ii. 2nd quartile, Q2, and
iii. 3rd quartile, Q3.
• The 2nd quartile is identical to the median,
• the 1st quartile is the value at or below which one-fourth (25%) of all items in the
series lie, and
• the 3rd quartile is the value at or below which three-fourths (75%) of the items lie.
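28. Deciles: Deciles divide the total frequency into ten equal parts. There
are nine deciles: 1st decile, 2nd decile, …, 9th decile.
Calculation methods
\[
D_j =
\begin{cases}
\dfrac{j(n+1)}{10}\text{th term}, & \text{when } n \text{ is an odd number},\ j = 1, 2, 3, \ldots, 9;\\[2ex]
\dfrac{\dfrac{jn}{10}\text{th term} + \text{next term}}{2}, & \text{when } n \text{ is an even number},\ j = 1, 2, 3, \ldots, 9.
\end{cases}
\]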
29. Percentiles: Percentiles divide the total frequency into 100 equal parts.
There are 99 percentiles: 1st percentile, P1; 2nd
percentile, P2; …, 99th percentile, P99.
Calculation methods
\[
P_k =
\begin{cases}
\dfrac{k(n+1)}{100}\text{th term}, & \text{when } n \text{ is an odd number},\ k = 1, 2, 3, \ldots, 99;\\[2ex]
\dfrac{\dfrac{kn}{100}\text{th term} + \text{next term}}{2}, & \text{when } n \text{ is an even number},\ k = 1, 2, 3, \ldots, 99.
\end{cases}
\]
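Quartiles, deciles and percentiles can all be obtained from one routine. The sketch below uses `statistics.quantiles` from the Python standard library; its default `"exclusive"` method places the j-th cut point at position j(n+1)/k in the ordered data and interpolates linearly when that position is not a whole number. The interpolation step is an assumption of this sketch, since the slides do not say how fractional positions are handled, and the sample data (the integers 1 to 9) are purely illustrative.

```python
import statistics

# Illustrative data only: the integers 1..9 (n = 9, an odd number).
data = list(range(1, 10))

quartiles = statistics.quantiles(data, n=4)      # Q1, Q2, Q3
deciles = statistics.quantiles(data, n=10)       # D1 ... D9
percentiles = statistics.quantiles(data, n=100)  # P1 ... P99

print("Quartiles:", quartiles)
print("Deciles:  ", deciles)
print("P50:      ", percentiles[49])  # the 50th percentile equals the median
```

With n = 9 the j-th decile sits at position j(9+1)/10 = j, so each decile is simply the j-th ordered value.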
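31. For grouped data:
Arithmetic mean (AM),
\[
\text{AM (simple mean)},\quad \bar{x} = \frac{\sum f_i x_i}{\sum f_i}, \quad \text{where } \sum f_i = N, \text{ the total number of observations.}
\]
Again, for class-interval grouped data, the mean is
\[
\bar{x} = \frac{\sum f_i X_i}{\sum f_i},
\]
where \(X_i\) are the mid-values of the classes.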
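For class-interval data, the grouped mean is the frequency-weighted mean of the class mid-values. The sketch below illustrates this; the frequency table is an assumption made for the example (the 20 ages from the histogram slide grouped into 10-year classes), and `grouped_mean` is a hypothetical helper name.

```python
def grouped_mean(classes):
    """Mean of grouped data: sum(f_i * X_i) / sum(f_i),
    where X_i is the mid-value of class i.

    classes: list of (lower limit, upper limit, frequency) tuples.
    """
    total_f = sum(f for _, _, f in classes)
    if total_f == 0:
        raise ValueError("total frequency must be positive")
    return sum(f * (lower + upper) / 2 for lower, upper, f in classes) / total_f


# Assumed frequency table: the 20 ages grouped into classes of width 10.
age_classes = [(20, 30, 2), (30, 40, 4), (40, 50, 4), (50, 60, 5),
               (60, 70, 3), (70, 80, 1), (80, 90, 0), (90, 100, 1)]

print(grouped_mean(age_classes))
```

The grouped mean (50.5) is close to, but not identical with, the mean of the raw ages (50.7), because every observation is replaced by its class mid-value.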
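32. • Geometric mean (GM),
\[
GM = \left( x_1^{f_1}\, x_2^{f_2} \cdots x_k^{f_k} \right)^{1/n}, \quad \text{only for non-zero and positive numbers.}
\]
• Harmonic mean (HM),
\[
HM = \frac{n}{\dfrac{f_1}{x_1} + \dfrac{f_2}{x_2} + \cdots + \dfrac{f_k}{x_k}} = \frac{n}{\sum \dfrac{f_i}{x_i}}, \quad \text{only for non-zero numbers.}
\]
Here \(n = \sum f_i\) is the total number of observations.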
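A small sketch of the geometric and harmonic means for data with frequencies follows. The function names and the sample values are illustrative assumptions; the geometric mean is computed through logarithms, which is algebraically the same as taking the n-th root of the product but avoids overflow for long data sets.

```python
import math


def geometric_mean(values, freqs):
    """GM = (x1^f1 * x2^f2 * ... * xk^fk)^(1/n), with n = sum of frequencies."""
    if any(x <= 0 for x in values):
        raise ValueError("geometric mean needs positive, non-zero values")
    n = sum(freqs)
    return math.exp(sum(f * math.log(x) for x, f in zip(values, freqs)) / n)


def harmonic_mean(values, freqs):
    """HM = n / sum(f_i / x_i), with n = sum of frequencies."""
    if any(x == 0 for x in values):
        raise ValueError("harmonic mean needs non-zero values")
    n = sum(freqs)
    return n / sum(f / x for x, f in zip(values, freqs))


# Illustrative data: the values 1, 2, 4, each observed once.
values, freqs = [1, 2, 4], [1, 1, 1]
print(geometric_mean(values, freqs))   # close to 2, the cube root of 1*2*4
print(harmonic_mean(values, freqs))    # 3 / (1 + 1/2 + 1/4)
```

For positive data the three means satisfy AM ≥ GM ≥ HM, which is a quick sanity check on any implementation.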
39. SKEWNESS AND KURTOSIS
Moments
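• The r-th moment of a variable x for ungrouped
data,
\[ \overline{x^{r}} = \frac{\sum x_i^{r}}{N}, \]
• for grouped data,
\[ \overline{x^{r}} = \frac{\sum f_i x_i^{r}}{N}, \quad \text{where } \sum f_i = N \text{ is the total number of observations.} \]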
40. • The r-th moment of a variable x for
ungrouped data about the AM is given by
\[ \mu_r = \frac{\sum (x_i - \bar{x})^{r}}{N}, \]
• also for grouped data,
\[ \mu_r = \frac{\sum f_i (x_i - \bar{x})^{r}}{N}. \]
41. • Symmetrical curve: A frequency curve is said to be
symmetrical if it can be folded along a vertical line at
the centre so that the two halves of the figure coincide.
In a symmetrical distribution, the values of mean,
median and mode coincide.
42. • Skewness: Skewness means lack of symmetry
of a curve. It indicates whether the curve is
turned more to one side than to the other.
• Measures of skewness
• Karl Pearson’s Coefficient of Skewness:
\[ Sk_p = \frac{\bar{x} - Mo}{\sigma} = \frac{3(\bar{x} - Me)}{\sigma}, \]
where \(\sigma\) is the standard deviation.
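44. • Measure of skewness based on moments,
\[ \beta_1 = \frac{\mu_3^{2}}{\mu_2^{3}}, \qquad \gamma_1 = \sqrt{\beta_1} = \frac{\mu_3}{\sigma^{3}}; \]
• if \(\gamma_1 > 0\), then the distribution is positively skewed;
• if \(\gamma_1 < 0\), then the distribution is negatively skewed;
• if \(\gamma_1 = 0\), then the distribution is symmetric.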
45. • Kurtosis: the sharpness of the peak of a
frequency-distribution curve.
• Measures of kurtosis
Moment coefficient of kurtosis,
\[ \beta_2 = \frac{\mu_4}{\mu_2^{2}}. \]
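The moment-based measures of skewness and kurtosis can be sketched in a few lines of Python. The code assumes the usual definitions built from central moments about the mean (skewness as the third moment over the second moment to the power 3/2, kurtosis as the fourth moment over the squared second moment); the function names and sample data are illustrative.

```python
def central_moment(values, r):
    """r-th moment about the arithmetic mean: sum((x - mean)^r) / N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** r for x in values) / n


def moment_skewness(values):
    """Third central moment divided by the second central moment to the power 3/2."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def moment_kurtosis(values):
    """Fourth central moment divided by the squared second central moment."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


symmetric = [1, 2, 3, 4, 5]     # illustrative symmetric data
right_tailed = [1, 1, 1, 10]    # illustrative data with a long right tail

print(moment_skewness(symmetric), moment_kurtosis(symmetric))
print(moment_skewness(right_tailed))
```

A skewness of zero indicates symmetry, while the long right tail in the second data set gives a positive value.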
47. MEASURES OF DISPERSION
Measures of dispersion describe how spread
out a set of data is.
• Types of measures:
i. Absolute measures, and ii. Relative measures.
• Absolute measures:
i. Range,
ii. Mean deviation,
iii. Quartile deviation,
iv. Standard deviation,
v. Variance, and
vi. Standard error.
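48. • Calculation methods, examples:
* Mean deviation, \( MD(\bar{x}) = \dfrac{\sum f_i\,|x_i - \bar{x}|}{n} \), for grouped data.
* Quartile deviation, \( QD = \dfrac{Q_3 - Q_1}{2} \).
* Inter quartile range, \( IQR = Q_3 - Q_1 \).
* Standard deviation,
\[ SD = \sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N}} \ \text{for ungrouped data}, \qquad \sigma = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{N}} \ \text{for grouped data}, \]
where N is the total number of observations and \(\bar{x}\) is the AM.
* Variance, \( \operatorname{var}(x) = \sigma^2 \).
* Standard error of mean, \( SE(\bar{x}) = \dfrac{\sigma}{\sqrt{n}} \), where n is the number of observations.
* Coefficient of variation, \( CV = \dfrac{\sigma}{\bar{x}} \times 100 \).
* Covariance, \( \operatorname{Cov}(x, y) = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n} \). It is a measure for two sets of data.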
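The absolute and relative measures of dispersion can be computed together for ungrouped data. The sketch below assumes the population forms (division by the number of observations, not by n - 1); the function names and the sample data are hypothetical.

```python
import math


def dispersion_summary(values):
    """Mean deviation, variance, SD, standard error of the mean and CV (%)
    for ungrouped data, dividing by the number of observations."""
    n = len(values)
    mean = sum(values) / n
    mean_deviation = sum(abs(x - mean) for x in values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    sd = math.sqrt(variance)
    return {
        "mean_deviation": mean_deviation,
        "variance": variance,
        "sd": sd,
        "standard_error": sd / math.sqrt(n),
        "cv_percent": 100 * sd / mean,
    }


def covariance(xs, ys):
    """Cov(x, y) = sum((x - mean_x) * (y - mean_y)) / n."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


sample = [2, 4, 4, 4, 5, 5, 7, 9]   # illustrative data with mean 5
print(dispersion_summary(sample))
print(covariance([1, 2, 3], [2, 4, 6]))
```

Statistical packages often report the sample versions (division by n - 1), so their output may differ slightly from these values.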
49. CORRELATION ANALYSIS
• Correlation: It is a statistical measure of the
(linear) relationship between two variables.
• Pearson's Correlation Coefficient, r: It is a statistic or parameter
which measures the strength and direction of a relationship
between two variables.
50. The value of r denotes the strength of the
association, as illustrated by the following scale, where \( -1 \le r \le 1 \).
r = -1             perfect correlation (indirect)
-1 to -0.75        strong
-0.75 to -0.25     intermediate
-0.25 to 0         weak
r = 0              no relation
0 to 0.25          weak
0.25 to 0.75       intermediate
0.75 to 1          strong
r = 1              perfect correlation (direct)
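51. Calculation
\[
r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}
  = \frac{\sum x_i y_i - \dfrac{\sum x_i \sum y_i}{n}}{\sqrt{\left[\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}\right]\left[\sum y_i^2 - \dfrac{(\sum y_i)^2}{n}\right]}}
  = \frac{\sum XY}{\sqrt{\sum X^2 \sum Y^2}},
\]
where \(X = x_i - \bar{x}\) and \(Y = y_i - \bar{y}\).
r_ij is the simple or zero order correlation coefficient.
In partial correlation, r_ij.k is called the
first order correlation coefficient, and so on.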
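Pearson's r can be computed either from deviations about the means or from raw sums; both forms give the same value. The sketch below implements the two forms side by side so they can be compared; the function names and the sample data are illustrative assumptions.

```python
import math


def pearson_r(xs, ys):
    """Deviation form: sum(XY) / sqrt(sum(X^2) * sum(Y^2)),
    with X = x - mean_x and Y = y - mean_y."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def pearson_r_raw(xs, ys):
    """Raw-sum form of the same coefficient."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    num = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    den_x = sum(x * x for x in xs) - sum_x ** 2 / n
    den_y = sum(y * y for y in ys) - sum_y ** 2 / n
    return num / math.sqrt(den_x * den_y)


xs = [1, 2, 3, 4, 5]   # illustrative data
ys = [2, 1, 4, 3, 5]
print(pearson_r(xs, ys), pearson_r_raw(xs, ys))
```

The raw-sum form is convenient for hand calculation, while the deviation form is usually more stable numerically when the values are large.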
52. Probable error (PE) of r
• It helps in determining the accuracy and
reliability of the value of r.
• Upper limit of r= r + PE, lower limit of r=r - PE
• If r*<r, then r is significant, otherwise it is
insignificant.
\[ PE(r) = r^{*} = 0.6745\,\frac{1 - r^{2}}{\sqrt{N}} \]
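53. • Significance test of r
\( H_0: \rho = 0 \); there is no significant correlation.
\( H_1: \rho \neq 0 \); there is a significant correlation.
Formula for the t-test for r (Table t-value)
\[ t = r\sqrt{\frac{n-2}{1-r^{2}}}, \quad \text{with } df = (n-2). \]
If \( t_{\text{calculated}} > t_{\text{tabulated}} \), then \(H_0\) is rejected.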
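The test statistic for the significance of r is simple to compute once r and n are known. This sketch only calculates t and its degrees of freedom; the tabulated critical value still has to be looked up in a t-table (or taken from a statistics package), and the function name and example numbers are hypothetical.

```python
import math


def t_statistic_for_r(r, n):
    """Return (t, df) for testing H0: rho = 0, using t = r * sqrt((n - 2) / (1 - r^2))."""
    if n <= 2:
        raise ValueError("at least 3 observations are needed")
    if abs(r) >= 1:
        raise ValueError("t is undefined when |r| = 1")
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return t, n - 2


# Hypothetical example: r = 0.6 observed in a sample of n = 27 pairs.
t, df = t_statistic_for_r(0.6, 27)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

If the absolute value of the calculated t exceeds the tabulated value for the chosen significance level and df, the null hypothesis of no correlation is rejected.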
54. Partial Correlation
• It is a measure of association between two
variables, while controlling the effect of one or
more additional variables.
• Partial correlation coefficient: The correlation
coefficient of x1 and x2 holding the effect of x3
constant,
\[ r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{(1 - r_{13}^{2})(1 - r_{23}^{2})}}. \]
It is a first order correlation.
55. Multiple correlation coefficient
• It shows the correlation among more than
two variables and it is denoted by R.
• Suppose that there are 3 variables x1
(dependent), and x2, x3 (independent) then
the multiple correlation coefficient is R1.23
and it is determined by
\[ R_{1.23} = \sqrt{\frac{r_{12}^{2} + r_{13}^{2} - 2\, r_{12}\, r_{13}\, r_{23}}{1 - r_{23}^{2}}}. \]
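Both the first-order partial correlation and the multiple correlation coefficient for three variables depend only on the three pairwise correlations. The sketch below assumes the standard three-variable formulas, r12.3 = (r12 - r13 r23) / sqrt((1 - r13²)(1 - r23²)) and R1.23 = sqrt((r12² + r13² - 2 r12 r13 r23) / (1 - r23²)); the function names and input values are hypothetical.

```python
import math


def partial_r(r12, r13, r23):
    """Partial correlation of x1 and x2, holding the effect of x3 constant."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))


def multiple_r(r12, r13, r23):
    """Multiple correlation of x1 (dependent) with x2 and x3 (independent)."""
    return math.sqrt((r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2))


# Hypothetical pairwise correlations.
r12, r13, r23 = 0.5, 0.5, 0.0
print(partial_r(r12, r13, r23))
print(multiple_r(r12, r13, r23))
```

When the two independent variables are uncorrelated (r23 = 0), the squared multiple correlation is just r12² + r13².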
56. Coefficient of determination, R2
Examples
• It measures the proportion of the variation in
Y explained by X.
• It ranges from 0 to 1.0 (or, 0% to 100%)
• R2 is actually equal to r2 for simple regression
model.
\[ \text{Coefficient of determination},\ R^{2} = \frac{\text{Explained variance}}{\text{Total variance}}. \]
\[ \text{Coefficient of non-determination},\ k^{2} = 1 - R^{2} = \frac{\text{Unexplained variance}}{\text{Total variance}}. \]
57. REGRESSION ANALYSIS
• Regression: It is a statistical process for estimating the
relationships between a dependent variable (DV) and one or more independent
variables (IVs).
• Under suitable study designs and assumptions, it can be used to infer causal relationships between the DV and IVs.
• Regression line: The best-fit line.
59. Linear regression and its types
• Linear regression is a common statistical
data analysis technique. It is used to
determine the extent to which there is
a linear relationship between a DV and one or
more IVs.
• Types:
i. Simple linear regression (one IV), and
ii. Multiple linear regression (two or more IVs).
60. Simple linear regression
• It allows us to summarize and study relationships
between two continuous (quantitative) variables.
• A regression equation takes the form
y=a+bx+c, where a is the intercept of the line, b is the
slope coefficient, and c is a value called the regression
residual (mean 0).
• Y: is referred to as DV, response variable or
predicted variable.
• X: is referred to as IV, explanatory variable,
factor, carrier, covariate, regressor or predictor
variable.
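61. • Calculation, example
\[ y = \alpha + \beta x \ \text{ is the regression line, or } \ y_i = \alpha + \beta x_i + \varepsilon_i. \]
Then
\[
\hat{\beta} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^{2}}
            = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^{2} - \left(\sum x_i\right)^{2}}
            = \frac{SP(XY)}{SS(X)}, \quad \text{where } X = x_i - \bar{x},\ Y = y_i - \bar{y},
\]
\[ \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}, \quad \text{where } \bar{y} = \frac{\sum y_i}{n},\ \bar{x} = \frac{\sum x_i}{n}. \]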
62. • The estimates \(\hat{\alpha}\) and \(\hat{\beta}\) are called the least
squares estimates of \(\alpha\) and \(\beta\) because they are
the solution to the least squares method.
• The fitted line \(\hat{y} = \hat{\alpha} + \hat{\beta}x\) is called the least squares
regression line.
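The least squares estimates for a simple linear regression can be computed by hand from sums of products and sums of squares. The sketch below assumes the usual closed-form solution (slope = SP(XY) / SS(X), intercept = mean of y minus slope times mean of x); the function name and the sample data are illustrative.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least squares regression line of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))  # SP(XY)
    ss_x = sum((x - mean_x) ** 2 for x in xs)                         # SS(X)
    if ss_x == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sp_xy / ss_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Illustrative data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
intercept, slope = least_squares_line(xs, ys)
print(f"fitted line: y = {intercept:.2f} + {slope:.2f} x")
```

With real data the points do not fall exactly on a line, and the differences between observed and fitted y values are the residuals.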
63. Multiple regression
In some cases the DV is influenced by several IVs. The method of
estimating the average rate of change in the DV associated with two or
more IVs is known as multiple regression.
\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i, \]
where \(\beta_0\) is the intercept,
\(\beta_j\) (j = 1, 2, 3, ..., k) are the coefficients of partial regression of \(x_j\),
and \(\varepsilon_i\) is the random error term.
\(\beta_j = 0\) means that there is no linear relationship between y and \(x_j\).
64. Interpreting parameter values (model coefficients)
• “Intercept, \(\beta_0\)” - value of y when all predictors
are 0.
• \(\beta_j\) - describes the expected change in y per unit
increment in xj when all other predictors in the
model are held at a constant value.
65. Estimating model parameters of multiple regression
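Assuming a random sample of n observations (yi, xi1, xi2, ..., xik), i = 1, 2, ..., n, the
estimates of the parameters for the best predicting equation
\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_k x_{ik} \]
are found by choosing the values \(\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k\) which minimize the expression
\[ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^{2} = \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \cdots - \hat{\beta}_k x_{ik} \right)^{2}. \]
Take the partial derivatives of the SSE function with respect to \(\hat{\beta}_j\), j = 0, 1, …, k, and equate
each to 0. This gives a system of k+1 equations in k+1 unknowns; solve it to obtain the
parameter estimates:
\[
\begin{aligned}
\sum_{i=1}^{n} y_i &= n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_{i1} + \cdots + \hat{\beta}_k \sum_{i=1}^{n} x_{ik},\\
\sum_{i=1}^{n} x_{i1} y_i &= \hat{\beta}_0 \sum_{i=1}^{n} x_{i1} + \hat{\beta}_1 \sum_{i=1}^{n} x_{i1}^{2} + \cdots + \hat{\beta}_k \sum_{i=1}^{n} x_{i1} x_{ik},\\
&\ \ \vdots\\
\sum_{i=1}^{n} x_{ik} y_i &= \hat{\beta}_0 \sum_{i=1}^{n} x_{ik} + \hat{\beta}_1 \sum_{i=1}^{n} x_{ik} x_{i1} + \cdots + \hat{\beta}_k \sum_{i=1}^{n} x_{ik}^{2}.
\end{aligned}
\]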
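Minimizing SSE leads to k+1 linear "normal equations" in the k+1 unknown coefficients, which can be solved numerically. The sketch below builds those equations from the data and solves them by Gauss-Jordan elimination in plain Python; the function name, the tolerance and the sample data are assumptions for illustration, and in practice a statistics package (such as SPSS) or a linear algebra library would do this step.

```python
def fit_multiple_regression(rows, y):
    """Least squares estimates [b0, b1, ..., bk] from the normal equations.

    rows: one tuple (x_i1, ..., x_ik) per observation; y: the observed DV values.
    """
    design = [[1.0] + [float(v) for v in r] for r in rows]  # leading 1 for the intercept
    p = len(design[0])                                      # p = k + 1 unknowns
    # Augmented matrix of the normal equations: sums of cross-products | sums with y.
    a = [[sum(d[i] * d[j] for d in design) for j in range(p)]
         + [sum(d[i] * yi for d, yi in zip(design, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular (perfectly collinear predictors?)")
        a[col], a[pivot] = a[pivot], a[col]
        a[col] = [v / a[col][col] for v in a[col]]
        for r in range(p):
            if r != col:
                factor = a[r][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
    return [a[i][p] for i in range(p)]


# Illustrative data generated exactly from y = 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [9, 8, 19, 18, 26]
print(fit_multiple_regression(rows, y))
```

The singular-matrix check is also where perfect multicollinearity shows up: if one predictor is an exact linear combination of the others, the system has no unique solution.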
66. Multicollinearity
• The predictors (x1, x2, ... xk) are statistically highly correlated.
• It leads to
– Numerical instability in the estimates of the regression parameters
– Regression coefficients that no longer have a simple interpretation in the additive model.
• Ways to detect multicollinearity
– Scatterplots of the predictor variables.
– Correlation matrix for the predictor variables – the higher these correlations the worse the
problem.
– Variance Inflation Factors (VIFs) reported by software packages. Values larger than 10 usually
signal a substantial amount of collinearity.
• What can be done
– Regression estimates are still OK, but the resulting confidence/prediction intervals are very
wide.
– Choose explanatory variables wisely! (E.g. consider omitting one of two highly correlated
variables.)
– More advanced solutions: principal components analysis; ridge regression.
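To make the "larger than 10" rule of thumb concrete: the slides do not define the VIF, but its usual definition is 1 / (1 - R²), where R² comes from regressing one predictor on all the others. In the special case of exactly two predictors, that R² equals the squared correlation between them, which is the case sketched below (the function name is hypothetical).

```python
def vif_two_predictors(r12):
    """Variance inflation factor when the model has exactly two predictors
    whose pairwise correlation is r12: VIF = 1 / (1 - r12^2)."""
    if abs(r12) >= 1:
        raise ValueError("perfectly correlated predictors give an infinite VIF")
    return 1 / (1 - r12 ** 2)


for r in (0.0, 0.5, 0.9, 0.95):
    print(f"r = {r:.2f} -> VIF = {vif_two_predictors(r):.2f}")
```

Under this definition a VIF above 10 corresponds to a correlation of roughly 0.95 or more between the two predictors.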
67. Stepwise regression
• It is an automated tool used in the exploratory
stages of model building to identify a useful
subset of predictors. The process systematically
adds the most significant variable or removes
the least significant variable during each step.