This document discusses expectations and challenges when visualizing data. The key points are:
1. Expect to find the real need by understanding the audience and goals better than the client. Expect to clean data, which can take a significant amount of time due to multiple sources and formats.
2. Prepare to iterate as the initial visualization may not meet needs or deadlines. Celebrate failures as learning opportunities.
3. Visualization projects include storytelling projects with strict deadlines and analytical tools to support data exploration by technical teams over the long term. The project lifecycle involves identifying needs, prototyping, refining, and maintaining the visualization.
6. (P.S. These are actually not my robots, but our competitors’.)
Krist Wongsuphasawat / @kristw
Computer Engineer
Bangkok, Thailand
9. PhD in Computer Science
Information Visualization
Univ. of Maryland
IBM
Krist Wongsuphasawat / @kristw
Computer Engineer
Bangkok, Thailand
Data Scientist
Analytics, Experiment
Twitter
Microsoft
10. PhD in Computer Science
Information Visualization
Univ. of Maryland
IBM
Krist Wongsuphasawat / @kristw
Computer Engineer
Bangkok, Thailand
Engineering Manager
Data Experience
Airbnb
Microsoft
Twitter
25. GOALS
Present data
Communicate information effectively
Analyze data
Exploratory data analysis
Tools to analyze data
Reusable tools for exploration
Enjoy
Combination of above
Who is the audience?
What do you want to tell?
What are the questions?
Who will use this?
What would they use this for?
Who is the audience?
35. DATA SOURCES
Open data
Publicly available
Internal data
Private, owned by clients’ organization
Self-collected data
Manual, site scraping, etc.
Combine the above
36. DATA FORMAT
Standalone files
txt, csv, tsv, json, Google Docs, …, pdf*
Databases
doesn’t necessarily mean they are organized
API
better quality with more overhead
Website
Big data*
44. IS THIS CLEAN?
USER  RESTAURANT  RATING
========================
A     MCDONALD’S  3
B     MCDONALDS   3
C     MCDONALD    4
D     MCDONALDS   5
E     IHOP        4
F     SUBWAY      4
How many reviews are there?
Clean.
How many restaurants are there?
Not clean.
McDonald, McDonald’s, McDonalds
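One way to make the restaurant question answerable is to normalize names before counting. The following is a minimal Python sketch using the rows from the slide. The `normalize` helper and the `ALIASES` table are hypothetical names introduced here for illustration; a real alias table would have to be curated by hand.

```python
import re

# Rows from the slide: (user, restaurant, rating).
reviews = [
    ("A", "MCDONALD’S", 3),
    ("B", "MCDONALDS", 3),
    ("C", "MCDONALD", 4),
    ("D", "MCDONALDS", 5),
    ("E", "IHOP", 4),
    ("F", "SUBWAY", 4),
]

# Hypothetical, hand-maintained alias table (an assumption, not from the talk).
ALIASES = {"MCDONALD": "MCDONALDS"}


def normalize(name: str) -> str:
    """Uppercase, drop apostrophes and other punctuation, then apply aliases."""
    key = re.sub(r"[^A-Z0-9 ]", "", name.upper()).strip()
    return ALIASES.get(key, key)


review_count = len(reviews)
raw_restaurants = {name for _, name, _ in reviews}
clean_restaurants = {normalize(name) for _, name, _ in reviews}

print(review_count)            # number of reviews
print(len(raw_restaurants))    # distinct names before cleaning
print(len(clean_restaurants))  # distinct names after cleaning
```

Counting reviews needs no cleaning at all, while counting restaurants only works after the spelling variants collapse into one key.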
51. GETTING BIG DATA
Hadoop Cluster
Data Storage
Tool: Scalding (slow)
Smaller dataset
Your laptop
Tool: node.js / python / excel (fast)
Final dataset
52. CHALLENGES
Slow
Long processing time (hours)
Get relevant Tweets
hashtag: #oscars
keywords: “parasite” (movie name)
Too big
Need to aggregate & reduce size
Harder to spot problems
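The filter-then-aggregate step might look like the Python sketch below. The sample tweets, the `minute` field, and the helper names are made up for illustration; a real job would read records from the cluster rather than from an in-memory list.

```python
from collections import Counter

# Made-up records for illustration only, not real Tweets.
tweets = [
    {"minute": "20:01", "text": "Best Picture goes to Parasite! #Oscars"},
    {"minute": "20:01", "text": "What a night #oscars"},
    {"minute": "20:02", "text": "Parasite deserved it"},
    {"minute": "20:02", "text": "My dog has a parasite, off to the vet"},
    {"minute": "20:03", "text": "Making dinner"},
]

HASHTAGS = {"#oscars"}
KEYWORDS = {"parasite"}


def is_relevant(text: str) -> bool:
    """Case-insensitive match on any tracked hashtag or keyword."""
    lowered = text.lower()
    return any(term in lowered for term in HASHTAGS | KEYWORDS)


def count_per_minute(records):
    """Reduce raw tweets to one count per minute."""
    return Counter(r["minute"] for r in records if is_relevant(r["text"]))


counts = count_per_minute(tweets)
print(sorted(counts.items()))
```

The tweet about the dog is counted too, because a bare keyword match cannot tell the movie from the organism. Once the data is aggregated, that kind of problem is much harder to spot.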
57. RECOMMENDATIONS
Always assume that you will have to do it again
document the process, automate it
Reusable scripts
break a gigantic do-it-all function into smaller ones
Reusable data
keep for future projects
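As a rough illustration of splitting a do-it-all function into smaller ones, the Python sketch below chains three small steps. The function names, the `character` column, and the sample input are all assumptions made for this example.

```python
import csv
import io
import json
from collections import Counter


def load_rows(text: str) -> list[dict]:
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_rows(rows):
    """Trim whitespace and drop rows without a character name."""
    cleaned = []
    for row in rows:
        name = (row.get("character") or "").strip()
        if name:
            cleaned.append({"character": name})
    return cleaned


def count_by_character(rows) -> dict:
    """Count how many rows mention each character."""
    return dict(Counter(row["character"] for row in rows))


def run_pipeline(text: str) -> str:
    """Chain the small steps; each one can be reused or tested alone."""
    counts = count_by_character(clean_rows(load_rows(text)))
    return json.dumps(counts, sort_keys=True)


# Hypothetical input with stray whitespace and a blank-looking row.
sample = "character\nHodor\n Jon Snow \n   \nHodor\n"
print(run_pipeline(sample))
```

Because each step takes plain data in and returns plain data out, the cleaning function can be rerun when the data changes, and the intermediate output can be saved for a future project.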
64. TIPS
Don’t give up.
If stuck, look for inspiration.
The vis that gives you insights may or may not be the best vis for sharing.
Exploration vs. Communication
Keep it as simple as possible
but not simpler.
Set milestones and deadlines.
66. STORYTELLING PROJECTS
timely
Deadlines are strict, and events can be unexpected.
wide audience
easy to explain and understand, multi-device support
one-off project
scope
analyze data to find stories and the best way to present them
85. While humans are busy killing each other,
ice zombies, the “White Walkers”, are invading from the North.
The only group that seems to care about this
is a neutral group called the Night’s Watch.
86. HBO’s Game of Thrones
Based on the book series “A Song of Ice and Fire”
Medieval Fantasy. Knights, magic and dragons.
Many characters.
Anybody can die.
8 seasons
Multiple storylines in each episode
94. Sample data
Character Count
Hodor 10000
Jon Snow 5000
Daenerys 4000
Bran Stark 3000
… …
*These numbers are made up for presentation, not real data.
96. + episodes
The Guardian & Google Trends
http://www.theguardian.com/news/datablog/ng-interactive/2016/apr/22/game-of-thrones-the-most-googled-characters-episode-by-episode
101. Sample data
INDIVIDUALS + top emojis
Character         Count
Hodor             10000
Jon Snow          5000
Daenerys          4000
…                 …
CONNECTIONS + top emojis
Character         Count
Jon Snow+Sansa    1000
Tormund+Brienne   500
Bran Stark+Hodor  300
…                 …
*These numbers are made up for presentation, not real data.
102. Graph
NODES + top emojis
Character         Count
Hodor             1000
Jon Snow          500
Daenerys          400
…                 …
LINKS + top emojis
Character         Count
Jon Snow+Sansa    1000
Tormund+Brienne   500
Bran Stark+Hodor  300
…                 …
*These numbers are made up for presentation, not real data.
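Turning the two tables into nodes and links could be sketched in Python as follows. The counts are the made-up numbers from the slide. The `build_graph` helper and the choice to give a character a count of 0 when it appears only in a pair are assumptions made for this example.

```python
# Made-up counts copied from the slide, not real data.
individuals = {"Hodor": 1000, "Jon Snow": 500, "Daenerys": 400}
connections = {
    "Jon Snow+Sansa": 1000,
    "Tormund+Brienne": 500,
    "Bran Stark+Hodor": 300,
}


def build_graph(individuals, connections):
    """Return (nodes, links) built from individual and pair counts."""
    nodes = {
        name: {"id": name, "count": count}
        for name, count in individuals.items()
    }
    links = []
    for pair, count in connections.items():
        source, target = pair.split("+")
        for name in (source, target):
            # Assumption: a character seen only in a pair still gets a node.
            nodes.setdefault(name, {"id": name, "count": 0})
        links.append({"source": source, "target": target, "count": count})
    return list(nodes.values()), links


nodes, links = build_graph(individuals, connections)
print(len(nodes), len(links))
```

The resulting lists of dictionaries with `id`, `source`, and `target` keys can be serialized to JSON and handed to a graph layout in the browser.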
119. Colors
Default: D3 category10
Distinct, but says nothing about the context
Custom palette
Colors related to the groups/houses.
Black = Night’s Watch
Blue = North
Red = Daenerys
Gold = Lannister
…
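A custom palette like this can be kept as a simple lookup with a fallback. In the Python sketch below, the hex values are illustrative assumptions, since the slide names only the colors. `FALLBACK_COLOR` and `color_for` are hypothetical names, and only the four groups listed on the slide are included.

```python
# Hex values are illustrative assumptions; the slide names only the colors.
HOUSE_COLORS = {
    "Night's Watch": "#000000",  # black
    "North": "#1f77b4",          # blue
    "Daenerys": "#d62728",       # red
    "Lannister": "#ffd700",      # gold
}

# Hypothetical neutral gray for groups without an assigned color.
FALLBACK_COLOR = "#999999"


def color_for(group: str) -> str:
    """Return the palette color for a group, or the fallback if unlisted."""
    return HOUSE_COLORS.get(group, FALLBACK_COLOR)


print(color_for("Lannister"))
print(color_for("Unlisted group"))
```

Keeping the mapping in one place means every chart in the project uses the same color for the same group.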
135. VISUAL ANALYTICS TOOL PROJECTS
richer, more features
to support exploration of complex data
more technical audience
product managers, engineers, data scientists
accuracy
designed for dynamic input
long-term projects
136. PROJECT LIFECYCLE
Identify needs
Design and prototype
Make it work for sample dataset
Refine, generalize and productionize
Make it work for other cases
Document and release
Maintain and support
Keep it running, feature requests & bug fixes
169. “The first 90% of the code
accounts for the first 90% of the development time.
The remaining 10% of the code
accounts for the other 90% of the development time.”
— Tom Cargill, Bell Labs
170. REFINE & POLISH
UX / UI + Mobile Support
Color
Animation / Transition
Metadata for SEO
Social media preview images
Performance
Loading time, Data file size
“The little of visualisation design” by Andy Kirk
http://www.visualisingdata.com/2016/03/little-visualisation-design/
177. THE ORIGIN
From a paper “Interim pre-pandemic planning guidance:
community strategy for pandemic influenza mitigation in the
United States: early, targeted, layered use of
nonpharmaceutical interventions”
published in 2007 by the CDC
https://stacks.cdc.gov/view/cdc/11425
178. REVIVAL
Rosamund Pearce, a data journalist
at The Economist, rebuilt it for a
piece about COVID-19.
Changed the labeling scheme to
assist colorblind readers.
https://www.economist.com/briefing/2020/02/29/covid-19-is-now-in-50-countries-and-things-will-get-worse
179. THE LINE
Drew Harris, an assistant
professor at Thomas
Jefferson University, came across
the graphic in The Economist.
He recalled using it a decade
earlier as a pandemic
preparedness trainer.
So he added the dotted line
“healthcare system capacity”
https://www.nytimes.com/article/flatten-curve-coronavirus.html
188. HOW TO BE BETTER?
Retrospective
What could have been better?
Wishlist
Expand skillset
Learning opportunities
Get help
Grow the team
Improve tooling
Solve a problem once and for all
Automate repetitive tasks
200. 6 STEPS
1. Expect to find the real need
2. Expect to clean data a lot
3. Prepare to iterate
4. Reserve time for refinement
5. Plan for feedback
6. Look back for improvement
Krist Wongsuphasawat / @kristw
kristw.yellowpigz.com
201. ACKNOWLEDGEMENT
My former and current colleagues at Twitter and Airbnb
for their collaboration and support in these projects;
and my wife for taking care of our two kids
while I make these slides.