At Redbubble, millions of users browse independent artwork in pursuit of that elusive, unique piece of art, which they want to own on a high-quality product such as a T-shirt or a cushion cover. At any point there are lots of ideas for improving the customer experience, most of them unquantified and backed by unvalidated assumptions. From a product development point of view, this is understandable: all great ideas are seeded by some assumptions. However, we strive to validate those assumptions so that we can get a better idea of impact and opportunity size.
This led us to statistical modelling of customer data, because we want to draw inferences and quantify the relative impact of specific user visit attributes. We had been dabbling with statistical analysis of big consumer data sets, so we decided to take the plunge and see what we could unearth.
3. Product Development
● Millions of users
● Lots of ideas
● Lots of unquantified & unvalidated assumptions
● What are the biggest problems?
● What should we pursue first = the best opportunity
● We want to build the right thing
4. Existing Techniques
● User interviews and surveys
○ Interpretation of wants and needs is tricky
■ Not dependable
○ Expensive & time consuming
● Analytics tools (Google Analytics, Flurry) provide high-level views
○ Difficult to gauge the effect of each variable on its own: with lots of factors at play, how much did a single thing affect the outcome?
5. What is lacking
● Ability to get insights from real user actions/visits
● Make it quick and cheap to support/reject assumptions
● Confidence, like probabilities, external factors and stuff :-)
7. What we do
● Statistical modelling of customer data to draw inferences
● Quantification of the relative impact of user behaviours and visit attributes
“Let's put some science in data analysis”
8. Strongest Hypotheses
● Give a starting point
● Define the goal for measuring success
● Keeps you focussed and honest
● Hunches are powerful - use domain knowledge
9. Identify hypotheses
○ HypothesisA: “Users jumping along & looking at multiple search result pages are having a bad experience”
○ HypothesisB: “Users navigating to a listing from search results are having a good experience”
○ HypothesisC: “Users typing in keywords in search box multiple times are not having a good experience”
10. Measurable User Journeys
● Identify particular user journeys in a visit
○ hypothesisA: SPPPSPSP
○ hypothesisB: SLL
● Journeys don’t need to be exclusive - they are not!
● Lots of log parsing and MapReduce
● Usually the process varies for each business
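One way to flag such journeys can be sketched in R. This is only an illustration under assumptions: each visit is encoded as a string with one letter per page view, and the letters are assumed here to mean S = search, P = paging to another results page, L = listing view (the slides do not define them). The `visits` data frame and the regex patterns are hypothetical.

```r
# Hypothetical visits: one string per visit, one letter per page view.
# Assumed encoding (not defined in the slides): S = search,
# P = paging to another results page, L = listing view.
visits <- data.frame(
  visitId = 1:4,
  path    = c("SPPPSPSP", "SLL", "SPLL", "SSSL"),
  stringsAsFactors = FALSE
)

# Illustrative patterns; real thresholds would come from domain knowledge.
# Two or more pagings in a row.
visits$pagingAroundSearchResults <- grepl("P{2,}", visits$path)
# A results page followed directly by a listing.
visits$clickThroughToListings <- grepl("[SP]L", visits$path)
# Three or more searches in one visit.
searchCount <- lengths(regmatches(visits$path, gregexpr("S", visits$path)))
visits$usingSearchBoxTooMuch <- searchCount >= 3

print(visits)
```

Visit 1 is flagged for both paging and repeated searching, which matches the point that journeys are not exclusive.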
11. Data Preparation
● Start with a small sample size
● Focus more on quality
● Look out for anomalies & outliers
● Remove correlated variables - noise
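The preparation steps above could look roughly like this in base R. The data is simulated and every column name is hypothetical; the 1.5 * IQR rule and the 0.9 correlation cut-off are common conventions assumed for the sketch, not choices stated in the slides.

```r
# Hypothetical per-visit sample; column names are illustrative only.
set.seed(42)
n <- 200
pageViews <- rpois(n, lambda = 8)
prep.df <- data.frame(
  pageViews      = pageViews,
  timeOnSiteMins = pageViews * 0.5 + rnorm(n, sd = 0.2),  # nearly a copy of pageViews
  searches       = rpois(n, lambda = 2)
)
prep.df$pageViews[1] <- 500  # plant one obvious anomaly, e.g. a crawler

# Flag outliers with the 1.5 * IQR rule (one convention among several).
q     <- quantile(prep.df$pageViews, c(0.25, 0.75))
upper <- q[2] + 1.5 * (q[2] - q[1])
clean.df <- prep.df[prep.df$pageViews <= upper, ]

# Inspect pairwise correlations, then drop one variable of a highly correlated pair.
corMatrix <- cor(clean.df)
print(round(corMatrix, 2))
highCor <- abs(corMatrix["pageViews", "timeOnSiteMins"]) > 0.9  # assumed cut-off
if (highCor) clean.df$timeOnSiteMins <- NULL
```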
12. Data Visualization
● Visualize your data
○ A simple histogram will tell you a lot
○ Scatter plots are good for identifying outliers
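As a minimal sketch of both plot types, using base R graphics on simulated data (the variables and the planted outlier are made up for illustration):

```r
# Hypothetical data: page views per visit, with one planted outlier.
set.seed(1)
pageViews  <- c(rpois(500, lambda = 6), 400)
timeOnSite <- pageViews * 30 + rnorm(length(pageViews), sd = 20)  # seconds, simulated

# Histogram: shows the shape of the distribution (skew, long tails, spikes).
# The outlier is left out here so the bulk of the data stays readable.
hist(pageViews[pageViews < 50], breaks = 20,
     main = "Page views per visit", xlab = "Page views")

# Scatter plot: the planted outlier sits far away from the main cloud.
plot(pageViews, timeOnSite,
     main = "Page views vs. time on site",
     xlab = "Page views", ylab = "Seconds")
```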
13. Regression analysis
● A statistical process for estimating the relationships among variables
● The choice of method largely depends on the form of the data and the variable types
● Linear regression is your go-to method for initial pokes
● Poisson and logit models are also very useful tools for most ecommerce-related datasets
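To make the choice of method concrete, here is a hedged sketch that fits the three model types named above with base R's `lm` and `glm`. The data is simulated, and all variable names and effect sizes are invented for the example.

```r
# Simulated visits; all names and effect sizes are made up for illustration.
set.seed(7)
n <- 1000
sim.df <- data.frame(listingViews = rpois(n, lambda = 3))
# Count outcome: number of items added to the cart.
sim.df$itemsInCart <- rpois(n, lambda = exp(-1 + 0.2 * sim.df$listingViews))
# Binary outcome: did the visit add anything to the cart?
sim.df$addToCart <- as.integer(sim.df$itemsInCart > 0)

# Linear regression: a quick first poke.
linearFit <- lm(itemsInCart ~ listingViews, data = sim.df)

# Poisson regression: suited to count outcomes.
poissonFit <- glm(itemsInCart ~ listingViews, family = poisson, data = sim.df)

# Logit model: suited to yes/no outcomes.
logitFit <- glm(addToCart ~ listingViews, family = binomial, data = sim.df)

print(coef(linearFit))
print(coef(poissonFit))
print(coef(logitFit))
```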
14. Example (Using R)
Independent Variables       Estimate   Std. Error   z value   Pr(>|z|)   Significance
clickThroughToListings       0.34065   0.12654       2.692    0.00710    **
pagingAroundSearchResults   -0.28925   0.08688      -3.329    0.00087    ***
usingSearchBoxTooMuch        0.12038   0.12608       0.955    0.33967
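# Logit model: does each journey raise or lower the odds of an add-to-cart?
fit <- glm(
  formula = addToCart ~ clickThroughToListings +
    pagingAroundSearchResults +
    usingSearchBoxTooMuch,
  family = "binomial",
  data = summary.df
)
summary(fit)  # prints the coefficient table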
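The call above depends on `summary.df`, which the slides do not show. The sketch below builds a simulated stand-in, assuming one row per visit with a 0/1 flag per journey, so that the model can be run end to end. The effect sizes are invented for the simulation and are not Redbubble data.

```r
# Simulated stand-in for summary.df: one row per visit, 0/1 flags per journey.
# The "true" effects below are invented for the simulation.
set.seed(123)
n <- 5000
summary.df <- data.frame(
  clickThroughToListings    = rbinom(n, 1, 0.4),
  pagingAroundSearchResults = rbinom(n, 1, 0.3),
  usingSearchBoxTooMuch     = rbinom(n, 1, 0.2)
)
logOdds <- -2 + 0.8 * summary.df$clickThroughToListings -
  0.8 * summary.df$pagingAroundSearchResults
summary.df$addToCart <- rbinom(n, 1, plogis(logOdds))

fit <- glm(
  addToCart ~ clickThroughToListings + pagingAroundSearchResults +
    usingSearchBoxTooMuch,
  family = "binomial",
  data = summary.df
)

# Coefficient matrix with columns Estimate, Std. Error, z value, Pr(>|z|).
print(coef(summary(fit)))
```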
15. How to interpret signal: Direction
● The sign of each Estimate gives the direction of the effect on addToCart
○ clickThroughToListings (0.34065): positive
○ pagingAroundSearchResults (-0.28925): negative
16. How to interpret signal: Significance
● Pr(>|z|) and the significance stars show which effects can be told apart from noise
○ clickThroughToListings (0.00710, **) and pagingAroundSearchResults (0.00087, ***) are significant
○ usingSearchBoxTooMuch (0.33967) is not
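The direction and significance readings can be reproduced from the Estimate and Std. Error columns alone. The sketch below copies those values from the slides and applies the standard formulas for a logit model: z is the estimate divided by its standard error, the p-value is two-sided under a normal distribution, and `exp()` of an estimate gives an odds ratio. The variable names `est`, `se`, `oddsRatio`, `z` and `p` are just labels for this example.

```r
# Values copied from the coefficient table in the slides.
est <- c(clickThroughToListings    = 0.34065,
         pagingAroundSearchResults = -0.28925,
         usingSearchBoxTooMuch     = 0.12038)
se <- c(0.12654, 0.08688, 0.12608)

# Direction: the sign of the estimate; exp() turns log-odds into an odds ratio.
oddsRatio <- exp(est)

# Significance: z = estimate / std. error, with a two-sided normal p-value.
z <- est / se
p <- 2 * pnorm(-abs(z))

print(round(cbind(est, oddsRatio, z, p), 5))
```

Read this way, clicking through to listings multiplies the odds of an add-to-cart by about 1.41, and paging around search results multiplies them by about 0.75, with the other variables held fixed.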
17. Concrete Direction
● Now we know which user segments present a real opportunity to make improvements
● How big is the customer segment = problem size
● Knowing problem size helps in prioritizing
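Problem size can be estimated as the share of visits that fall into each segment. A minimal sketch, assuming a hypothetical `visitFlags` data frame with one TRUE/FALSE column per journey:

```r
# Hypothetical per-visit flags (TRUE = the visit contains that journey).
visitFlags <- data.frame(
  pagingAroundSearchResults = c(TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE),
  usingSearchBoxTooMuch     = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
)

# Share of visits in each segment = a rough measure of problem size.
segmentShare <- sapply(visitFlags, mean)

# Largest segment first, to help with prioritizing.
print(sort(segmentShare, decreasing = TRUE))
```

Combining this share with the direction and significance of each effect gives a simple basis for deciding what to pursue first.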