Copyright ©2014 Visible Technologies, Inc. All rights reserved.
Data Mining and Engineering
Presented by Lucas Parker
Senior Software Development Engineer, Research & Development
About Visible
Our Mission: Customer Value
Global Authoritative Content
• Most comprehensive global content sourcing model
• Clean & accurate data
Powerful Search & Discovery
• Only the most pertinent results, based on your criteria
• Pivot and drill to identify discussion drivers
Engagement/Social CRM
• Social media workflow and engagement for individuals or teams
• Integrate with CRM for continuous relationship management
Sophisticated Social Analytics
• Measure, compare, and contrast program & communication results
• Segment results by product attributes, reputation drivers, etc.
Actionable Insights
• Discovery & analytics to uncover insights in real time
• Holistic consumer insights, integrated with other market research
What We Do
• Domain is “social media”: Twitter, Facebook, forums, blogs, etc.
• Huge data sets, lots of noise.
• Enrichment, aggregation, reporting.
Visible – Target Business Groups
• Customer Servicing: interactions between business and customer.
• Marketing: brand effort, campaigns, periodic messaging.
• Corporate Communications: PR, reputation of company and stakeholders.
• Research: audience definition, demographics, psychographics.
Articulating the Problem
“Marketing analysts need to understand the impact of their campaigns and we can provide them an avenue to do so.”
- Surf and turf
“We should totally Hadoop something!”
- Knuckle sandwich
Feature Engineering
• Your concise data features are easy to grasp, but do
they provide for an adequate model?
• Your 600-dimension model is totally awesome, but
does it scale?
• How much is “good enough”?
Proposing Solutions to the Business
• Understand scale issues.
• Provide alternatives:
• There is no such thing as a perfect system.
• Communicate clearly about real and opportunity costs.
The Hazards of Third Party Data
• Data might not be available forever.
• Vendors might change terms.
• Entrenchment can impede growth and change as quality degrades
over time (data sources can decay, vendors may slack on
maintenance).
Productionalizing Prototypes
• Isn’t that a fancy word?
• Strike balance between awesome and simple.
• This is almost impossible to get right.
• Even if you get it right once, it won’t last.
• Better for everybody if you give me as simple a
mechanism as possible.
Expanding and Maintaining 1
• Data drift: How does data change organically over time?
• Bit rot: Does anybody even remember how to refit the model?
• Split maintenance: Keeping the research model up to date with the
production model never happens.
Expanding and Maintaining 2
• Horizontal expansion exposes original scope
assumptions.
• “We have it in English. What do you mean we can’t get
it in Swahili?”
• Value trumps veracity. Sacrifices of purity cause
degradation.
• Business needs result in an accretion of surrounding goo.
Document Tone: “NLP” versus “Statistical”
• NLP/Probabilistic Grammars:
• Effective.
• Slow.
• Costly reference grammars. Consider a vendor.
• Vector space modeling (term vectors/n-grams)
• Very fast at runtime.
• Works best with lots of training data.
• You can fit the model yourself, so long as you can afford to maintain it.
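As a rough illustration of the vector space approach, the sketch below builds word n-gram term vectors, sums them into one reference vector per tone label, and assigns a new document to the label with the highest cosine similarity. The function names, the toy training sentences, and the "positive"/"negative" labels are assumptions made for this example; the slides do not describe the actual implementation.

```python
import math
from collections import Counter


def term_vector(text, n_max=2):
    """Count word n-grams (unigrams up to n_max-grams) in a text."""
    tokens = text.lower().split()
    vec = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            vec[" ".join(tokens[i:i + n])] += 1
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def fit(labeled_docs):
    """Sum the term vectors of all training documents for each label."""
    references = {}
    for label, text in labeled_docs:
        references.setdefault(label, Counter()).update(term_vector(text))
    return references


def classify(text, references):
    """Return the label whose reference vector is closest to the text."""
    vec = term_vector(text)
    return max(references, key=lambda label: cosine(vec, references[label]))


# Hypothetical toy training data, for illustration only.
training = [
    ("positive", "great flight and friendly crew"),
    ("positive", "love the great service"),
    ("negative", "terrible delay and rude crew"),
    ("negative", "lost my bag terrible service"),
]
model = fit(training)
label = classify("the crew was friendly and great", model)
```

Classification here is only counting and a dot product, which is why this family of models is fast at runtime. The quality of the reference vectors depends entirely on how much labeled data goes into `fit`.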
Language-Detection: Features
• Supports 53 languages.
• Fitted on Wikipedia corpora.
• Classic “one-versus-all” classification.
Language-Detection: Mechanism
• Determines the frequency with which n-grams of 1-3
characters appear inside of a labeled corpus.
“To what extent does each 1-3 character n-gram participate in a label?”
"tho":134583,"thr":87801,"the":3415279,"thi":110969,"tha":240340
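The counting step can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's actual code: the function names, the two-sentence toy corpus, and the scoring rule (summing each n-gram's share of occurrences per label) are all hypothetical. A real model would be fitted on far larger corpora, such as the Wikipedia data mentioned earlier.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Yield every character n-gram of length n_min..n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def fit_counts(labeled_corpus):
    """Count n-gram occurrences per language label."""
    counts = defaultdict(Counter)
    for label, text in labeled_corpus:
        counts[label].update(char_ngrams(text))
    return counts


def participation(counts, ngram):
    """Share of an n-gram's total occurrences that falls under each label."""
    total = sum(c[ngram] for c in counts.values())
    if total == 0:
        return {}  # n-gram never seen during fitting
    return {label: c[ngram] / total for label, c in counts.items()}


def detect(counts, text):
    """Sum participation shares over the text's n-grams; return the top label."""
    scores = Counter()
    for ngram in char_ngrams(text):
        for label, share in participation(counts, ngram).items():
            scores[label] += share
    return scores.most_common(1)[0][0] if scores else None


# Hypothetical toy corpus; real fitting needs much more text per language.
corpus = [
    ("en", "the quick brown fox jumps over the lazy dog"),
    ("de", "der schnelle braune fuchs springt über den faulen hund"),
]
counts = fit_counts(corpus)
```

With these counts, `participation(counts, "th")` assigns the whole share to `"en"`, because "th" never occurs in the German sentence. `detect` aggregates that kind of evidence over every n-gram in the input.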
Language-Detection: Practicalities
• Downsides?
• Twitter and Facebook!
• Letter casing (“I love you” versus “i love you”).
• Mixed-language documents (e.g. Chinese documents
with English words).
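The casing issue can be made concrete with a small sketch. Case-folding the text before extracting n-grams is one possible mitigation, shown here as an assumption; the slides do not say how the problem is handled in practice. The helper names are hypothetical, and the sketch does not address mixed-language documents.

```python
def char_ngrams(text, n_max=3):
    """Return the set of character n-grams of length 1..n_max."""
    return {
        text[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(text) - n + 1)
    }


def normalize(text):
    """Fold case so that 'I love you' and 'i love you' look the same."""
    return text.casefold()


raw_a = char_ngrams("I love you")
raw_b = char_ngrams("i love you")

# Without normalization the two inputs yield different n-gram sets,
# because "I", "I " and "I l" never match "i", "i " and "i l".
differs_raw = raw_a != raw_b

# After case-folding, the n-gram sets are identical.
same_folded = char_ngrams(normalize("I love you")) == char_ngrams(normalize("i love you"))
```

Folding is a trade-off rather than a fix: in some languages capitalization carries signal (German capitalizes nouns, for example), so discarding case can remove evidence the model would otherwise use.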
Overview
• Airline passengers found sewing needles in
sandwiches.
• Airline attempted to redirect the conversation and
measure the results.
• Visible tracked this event in social media.
Delta Airlines: Needle Sandwiches
[Figure: prominent terms at week, month, and three-month views. Annotated events: purchased a refinery to reduce fuel costs; passengers found needles in their in-flight sandwiches; free tickets given away as a promotion.]
Delta Volumes Over Time
[Figure: volume over time. Annotated events: purchased a refinery to reduce fuel costs; needles found in in-flight turkey sandwiches; free tickets given away as a promotion.]
PR Case Study: Conclusion
• Contest didn’t pay off in the long term.
• Attempts to redirect the conversation may be
ham-fisted.
• Thoughts? Conjecture?