An easy-to-understand primer on the "split-apply-combine" concept popularized by Hadley Wickham, applied to data visualization. After that, I give a simple introduction to the perceptual variables available for data visualization and go through some common mistakes.
3. Men of great rank, or active business, can only
pay attention to particulars of use […] it is hoped
that with the assistance of these Charts,
information will be got, without the fatigue and
trouble of studying the particulars [...]
William Playfair - Commercial and Political Atlas, 1786
4. Data visualization is the art of
*reducing information in a data set while
preserving the knowledge contained in it.
*we can talk about what “reducing information” means in this case...
5. Conceptual data analysis workflow
[Diagram: Data Preparation, Data Visualization, Discovery of knowledge]
6. Hadley Wickham popularized a concept called
split-apply-combine
as a way of thinking about data querying.
http://www.jstatsoft.org/v40/i01/paper
7. For the four countries generating the
most revenue, what are the top three
revenue-generating categories?
Country        Venue Type  Sum Revenue
United States  Fast Food   $16
               Street      $10
               Restaurant  $9
France         Cafe        $18
               Pub         $12
               Restaurant  $2
Canada         Cafe        $10
               Fast Food   $4
               Street      $3
Japan          Street      $5
               Fast Food   $4
               Pub         $1
8. The basics of split-apply-combine
data
split by country: Canada, United States, Germany, France, Japan
apply: Sum Revenue
Canada         Sum Revenue = $36
United States  Sum Revenue = $83
Germany        Sum Revenue = $8
France         Sum Revenue = $42
Japan          Sum Revenue = $18
combine: sort descending by Sum Revenue, limit 4
Country        Sum Revenue
United States  $83
France         $42
Canada         $36
Japan          $18
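The three steps on this slide can be sketched in plain Python. This is only an illustration, not the method used to build the slides: the function names `split`, `apply_sum` and `combine` are my own, and the individual rows are made up so that the per-country totals match the slide.

```python
from collections import defaultdict

# Hypothetical raw rows (country, revenue). The individual amounts are
# invented; only the per-country totals match the slide.
rows = [
    ("Canada", 20), ("United States", 50), ("Germany", 8),
    ("France", 30), ("Japan", 18), ("United States", 33),
    ("Canada", 16), ("France", 12),
]

def split(rows):
    """Split: group the revenue values by country."""
    groups = defaultdict(list)
    for country, revenue in rows:
        groups[country].append(revenue)
    return groups

def apply_sum(groups):
    """Apply: reduce each group to its Sum Revenue."""
    return {country: sum(values) for country, values in groups.items()}

def combine(sums, limit):
    """Combine: sort descending by Sum Revenue and keep the first `limit`."""
    ranked = sorted(sums.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]

top_countries = combine(apply_sum(split(rows)), limit=4)
for country, total in top_countries:
    print(f"{country:<15} ${total}")
```

Germany has the smallest total, so it is the country dropped by the limit of 4.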
9. The basics of split-apply-combine
data
split by country: Canada, United States, Germany, France, Japan
split each country by venue type: bus stop, fastfood, park, restaurant, hair salon, pub, street, cafe, ...
Country        Venue type  Sum Revenue
United States  fastfood    $16
               street      $10
               restaurant  $9
France         cafe        $18
               pub         $12
               restaurant  $2
Canada         cafe        $10
               fastfood    $4
               park        $3
Japan          street      $5
               fastfood    $4
               pub         $1
...
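Nesting one split-apply-combine inside another answers the question from slide 7. The sketch below is one possible way to write it in plain Python, with a hypothetical helper `top_n`. The first twelve rows come from the slide 7 table; the last two rows are invented so that both limits have something to cut. Country totals here are computed only from the rows listed, so they differ from the totals on slide 8.

```python
from collections import defaultdict

rows = [
    ("United States", "Fast Food", 16), ("United States", "Street", 10),
    ("United States", "Restaurant", 9),
    ("France", "Cafe", 18), ("France", "Pub", 12), ("France", "Restaurant", 2),
    ("Canada", "Cafe", 10), ("Canada", "Fast Food", 4), ("Canada", "Street", 3),
    ("Japan", "Street", 5), ("Japan", "Fast Food", 4), ("Japan", "Pub", 1),
    # Hypothetical rows, not from the slides, added to exercise the limits.
    ("United States", "Park", 1),
    ("Germany", "Pub", 8),
]

def top_n(pairs, n):
    """Sum the values per key, sort descending by the sum, keep the first n."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Outer level: split by country, apply sum, combine into the top 4.
top_countries = top_n(((country, rev) for country, _, rev in rows), 4)

# Inner level: within each selected country, split by venue type,
# apply sum, combine into the top 3.
result = {}
for country, _ in top_countries:
    venue_pairs = [(venue, rev) for c, venue, rev in rows if c == country]
    result[country] = top_n(venue_pairs, 3)

for country, venues in result.items():
    for venue, revenue in venues:
        print(f"{country:<15} {venue:<11} ${revenue}")
```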
10. Split-apply-combine thinking translates to visualizations
[Bar chart: Sum Revenue by Country (United States, France, Canada, Japan)]
Vertical axis: split by country, combine by sorting desc. on Sum Revenue, map to the vertical axis using an ordinal scale, add labels. Use `Country` as label.
Bars: apply: sum revenue, call it Sum Revenue, plot rectangles and map length to the horizontal axis using a linear scale. Color with #45808E.
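To make the two scales concrete, here is a minimal sketch that turns the combined table into bar geometry without any plotting library. The helper names `ordinal_scale` and `linear_scale` and the pixel sizes (30, 24, 400) are assumptions chosen for illustration; only the data, the sort order and the color come from the slides.

```python
# Output of the combine step: already sorted descending by Sum Revenue.
data = [("United States", 83), ("France", 42), ("Canada", 36), ("Japan", 18)]

def ordinal_scale(categories, band_height):
    """Map each category, in order, to the top y coordinate of its band."""
    return {category: i * band_height for i, category in enumerate(categories)}

def linear_scale(domain_max, range_max):
    """Return a function mapping [0, domain_max] linearly onto [0, range_max]."""
    return lambda value: value / domain_max * range_max

y = ordinal_scale([country for country, _ in data], band_height=30)
x = linear_scale(domain_max=max(value for _, value in data), range_max=400)

# One rectangle per country: position comes from the ordinal scale,
# length from the linear scale, and the label from the Country column.
bars = [
    {"label": country, "x": 0, "y": y[country],
     "width": x(value), "height": 24, "fill": "#45808E"}
    for country, value in data
]

for bar in bars:
    print(bar)
```

The ordinal scale only cares about order, so swapping two countries moves their bars but does not change any length. The linear scale starts at zero so that bar length stays proportional to revenue.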
11. Nested split-apply-combine underpins more complex visualizations
1. split on state
apply: sum population
combine: sort desc. by population; limit 6
2. split on age (bin by 5 year)
apply: sum population
combine: sort by age
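The two levels above can be sketched as follows. The records are placeholders (made-up state names and counts, not census figures), and `age_bin`, `population_by_state` and `population_by_age_bin` are hypothetical helper names. The slide uses a limit of 6; the demo uses 2 only because there are three placeholder states.

```python
from collections import defaultdict

# Placeholder records (state, age, population); all values are invented.
records = [
    ("State A", 3, 100), ("State A", 7, 120), ("State A", 4, 80),
    ("State B", 12, 50), ("State B", 14, 70),
    ("State C", 1, 10),
]

def age_bin(age, width=5):
    """Return the lower bound of the bin containing `age` (0, 5, 10, ...)."""
    return age // width * width

def population_by_state(records, limit):
    """Split on state, apply sum, combine: sort desc. and keep `limit`."""
    totals = defaultdict(int)
    for state, _, population in records:
        totals[state] += population
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:limit]

def population_by_age_bin(records, state):
    """Within one state: split on binned age, apply sum, combine: sort by age."""
    bins = defaultdict(int)
    for s, age, population in records:
        if s == state:
            bins[age_bin(age)] += population
    return sorted(bins.items())

top_states = population_by_state(records, limit=2)
nested = {state: population_by_age_bin(records, state) for state, _ in top_states}
print(top_states)
print(nested)
```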
12. Data Visualization can be thought of as a
visual mapping function applied
during the *Apply and Combine steps.
*although it can also be thought of as applied exclusively during the combine step…
15. Types of data
ID       Timestamp            Location       Name   Operation  Lines  Pass Test?
0000001  11-05-2013 10.45 am  San Francisco  Vadim  Added      100    Yes
0000002  11-05-2013 11.12 am  San Bruno      Luca   Removed    34     Yes
0000003  11-05-2013 11.30 am  San Francisco  Vadim  Added      65     Yes
0000004  11-05-2013 11.34 am  San Francisco  Vadim  Removed    5      Yes
0000005  11-05-2013 11.43 am  San Bruno      Luca   Added      24     No
0000006  11-05-2013 11.45 am  San Francisco  Vadim  Removed    71     Yes
0000007  11-05-2013 12.51 pm  San Francisco  Luca   Removed    45     Yes
0000008  11-05-2013 12.55 pm  San Francisco  Vadim  Added      7      No
...      ...                  ...            ...    ...        ...    ...
Types shown: Categorical, # Discrete, # Continuous, Boolean
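As a rough illustration of the four types, the sketch below classifies the values of the first table row by their Python type. The function `data_type` and its rules are my own simplification, not a general definition: integers count as discrete, floats and timestamps as continuous, and everything else as categorical. The month-day-year reading of the timestamp is also an assumption.

```python
from datetime import datetime

def data_type(value):
    """Classify a single value into one of the four types from the slide."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "Boolean"
    if isinstance(value, int):
        return "Discrete"
    if isinstance(value, (float, datetime)):
        return "Continuous"
    return "Categorical"

# First row of the table (ID left out). The timestamp assumes
# month-day-year order, which the slide does not confirm.
row = {
    "Timestamp": datetime(2013, 11, 5, 10, 45),
    "Location": "San Francisco",
    "Name": "Vadim",
    "Operation": "Added",
    "Lines": 100,
    "Pass Test?": True,
}

classified = {column: data_type(value) for column, value in row.items()}
for column, kind in classified.items():
    print(f"{column:<11} {kind}")
```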
16. There are other ways to classify data,
but this one will get you very far.
pick up a good statistics book and just start reading...
17. Types of variables
1. Independent
a. A variable that isn't changed by the other
variables you are trying to measure. It
usually goes on the x axis.
2. Dependent
a. A variable that changes depending on
other variable(s). It usually goes on the y
axis.
19. Variables of a visualization
1. Position (x,y)
2. Size (big, small…)
3. Value (bright, dark…)
4. Texture (hatched, dotted…)
5. Color (blue, red…)
6. Orientation (degree)
7. Shape (triangle, circle…)
20. Optimal mappings by type
[Four x/y panels, one per type: # Discrete, # Continuous, Categorical, Boolean]
21. [Chart: Added and Removed line totals per Name (Luca, Vadim); vertical axis from -2k to 2k; data labels -960, 1531, -321, 739]
Name   Operation  Lines
Vadim  Added      100
Luca   Removed    34
Vadim  Added      65
Vadim  Removed    5
Luca   Added      24
Vadim  Removed    71
Luca   Removed    45
Vadim  Added      7
...    ...        ...
Split on Name
Split on Operation
Apply Sum(Added)
Apply Sum(Removed)
Combine: -Removed, map to Red, value to size
Combine: Added, map to Green, value to size
Combine: Name, map to x axis
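The steps on this slide can be sketched in plain Python using only the eight rows visible in the table. Because the table is truncated ("..."), the totals below are smaller than the numbers shown in the chart. The `marks` structure is a hypothetical stand-in for whatever a plotting tool would draw.

```python
from collections import defaultdict

# The eight visible rows of the table: (Name, Operation, Lines).
rows = [
    ("Vadim", "Added", 100), ("Luca", "Removed", 34),
    ("Vadim", "Added", 65), ("Vadim", "Removed", 5),
    ("Luca", "Added", 24), ("Vadim", "Removed", 71),
    ("Luca", "Removed", 45), ("Vadim", "Added", 7),
]

# Split on Name, then on Operation; apply Sum(Lines) to each group.
totals = defaultdict(lambda: {"Added": 0, "Removed": 0})
for name, operation, lines in rows:
    totals[name][operation] += lines

# Combine: Name goes to the x axis, Added is drawn upward in green,
# Removed is negated and drawn downward in red, and the magnitude
# of each value is mapped to size.
marks = []
for name in sorted(totals):
    added = totals[name]["Added"]
    removed = totals[name]["Removed"]
    marks.append({"x": name, "value": added, "size": added, "color": "green"})
    marks.append({"x": name, "value": -removed, "size": removed, "color": "red"})

for mark in marks:
    print(mark)
```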
22. Apply the minimum number of mappings
that illustrates the underlying question
you are trying to answer.
24. Golden rules - Part 1
1. Label your axes
2. Include measurement units
3. Explain your encodings (add a legend)
4. Remove redundant information
5. Don’t distort the axis, especially with time series
25. Golden rules - Part 2
1. If you are trying to visualize rate of change, then do it
2. Remove outliers, but know they are there
3. Tools have their own biases and quirks; know them.
4. The solution to 80% of your problems is bar charts and
histograms
5. Data Tables are visualizations too
...there are thousands of good rules, but the best one is still “keep it simple”