Copyright ©2014 Visible Technologies, Inc. All rights reserved.
Data Mining and Engineering
Presented by Lucas Parker
Senior Software Development Engineer, Research & Development
About Visible
Our Mission: Customer Value
Global Authoritative Content
• Most comprehensive global content sourcing model
• Clean & accurate data
Powerful Search & Discovery
• Only the most pertinent results, based on your criteria
• Pivot and drill to identify discussion drivers
Engagement/Social CRM
• Social media workflow and engagement for individuals or teams
• Integrate with CRM for continuous relationship management
Sophisticated Social Analytics
• Measure, compare, and contrast program & communication results
• Segment results by product attributes, reputation drivers, etc.
Actionable Insights
• Discovery & analytics to uncover insights in real time
• Holistic consumer insights, integrated with other market research
What We Do
• Domain is “social media”
• Twitter, Facebook, forums, blogs, etc.
• Huge data sets, lots of noise.
• Enrichment, aggregation, reporting.
Visible – Target Business Groups
• Customer Servicing
  • Interactions between business and customer.
• Marketing
  • Brand effort, campaigns, periodic messaging.
• Corporate Communications
  • PR, reputation of company and stakeholders.
• Research
  • Audience definition, demographics, psychographics.
Data Mining Meets Engineering
Articulating the Problem
“Marketing analysts need to understand the impact of their campaigns and we can provide them an avenue to do so.”
- Surf and turf
“We should totally Hadoop something!”
- Knuckle sandwich
Feature Engineering
• Your concise data features are easy to grasp, but do they provide for an adequate model?
• Your 600-dimension model is totally awesome, but does it scale?
• How much is “good enough”?
Proposing Solutions to the Business
• Understand scale issues.
• Provide alternatives:
• There is no such thing as a perfect system.
• Communicate clearly about real and opportunity costs.
The Hazards of Third Party Data
• Data might not be available forever.
• Vendors might change terms.
• Entrenchment can impede growth and change as quality degrades over time (data sources can decay; vendors may slack on maintenance).
Bonini’s Paradox
Productionalizing Prototypes
• Isn’t that a fancy word?
• Strike a balance between awesome and simple.
• This is almost impossible to get right.
• Even if you get it right once, it won’t last.
• Better for everybody if you give me as simple a mechanism as possible.
Expanding and Maintaining 1
• Data drift
  • How does data change organically over time?
• Bit rot
  • Does anybody even remember how to refit the model?
• Split maintenance
  • Keeping the research model up to date with the production model never happens.
Expanding and Maintaining 2
• Horizontal expansion exposes original scope assumptions.
• “We have it in English. What do you mean we can’t get it in Swahili?”
• Value trumps veracity. Sacrifices of purity cause degradation.
• Business needs result in an accretion of surrounding goo.
Document Tone: “NLP” versus “Statistical”
• NLP/Probabilistic Grammars:
  • Effective.
  • Slow.
  • Costly reference grammars. Consider a vendor.
• Vector space modeling (term vectors/n-grams):
  • Very fast at runtime.
  • Works best with lots of training data.
  • You can fit it yourself, so long as you can afford to maintain it.
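As a rough illustration of the vector space approach to document tone, the sketch below builds bag-of-words term vectors, sums them into one centroid per tone label, and assigns a new document to the label whose centroid has the highest cosine similarity. The function names, the whitespace tokenizer, and the four training sentences are all assumptions made up for this example; nothing here is taken from Visible's system.

```python
import math
from collections import Counter

def term_vector(text):
    """Bag-of-words term vector: lowercased token -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def fit(labeled_docs):
    """Sum the term vectors of each label into a single centroid."""
    centroids = {}
    for label, text in labeled_docs:
        centroids.setdefault(label, Counter()).update(term_vector(text))
    return centroids

def classify(text, centroids):
    """Return the label whose centroid is most similar to the text."""
    vec = term_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Hypothetical training data, far smaller than anything usable in practice.
training = [
    ("positive", "great flight friendly crew"),
    ("positive", "love the great service"),
    ("negative", "terrible delay rude crew"),
    ("negative", "awful service terrible food"),
]
centroids = fit(training)
print(classify("the crew was great", centroids))
```

Classification is a handful of dictionary lookups per document, which is why this style is fast at runtime; the cost moves to collecting and refreshing enough labeled text to fit it.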
Engineering Case Study: Language Detection
Language-Detection
Language-Detection: Features
• Supports 53 languages.
• Fitted on Wikipedia corpora.
• Classic “one-versus-all” classification.
Language-Detection: Mechanism
• Determines the frequency with which n-grams of 1-3 characters appear inside of a labeled corpus.
“To what extent does each 1-3 character n-gram participate in a label?”
"tho":134583,"thr":87801,"the":3415279,"thi":110969,"tha":240340
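The n-gram counting described on the Mechanism slide can be sketched as follows. This is a minimal illustration, not the Language-Detection library's code: the function names, the two-language toy corpus, and the `participation` ratio (one possible reading of "to what extent does each n-gram participate in a label?") are assumptions introduced here.

```python
from collections import Counter

def char_ngrams(text, min_n=1, max_n=3):
    """Yield every character n-gram of length min_n..max_n in text."""
    for n in range(min_n, max_n + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]

def build_profile(documents):
    """Count n-gram occurrences across all documents sharing one label."""
    counts = Counter()
    for doc in documents:
        counts.update(char_ngrams(doc))
    return counts

def participation(ngram, profiles):
    """Share of an n-gram's total occurrences that falls under each label."""
    total = sum(profile[ngram] for profile in profiles.values())
    if total == 0:
        return {label: 0.0 for label in profiles}
    return {label: profile[ngram] / total for label, profile in profiles.items()}

# Toy labeled corpus (hypothetical; the slides mention Wikipedia corpora).
corpus = {
    "en": ["the cat sat on the mat", "that is the thing"],
    "de": ["der hund ist in dem haus", "das ist das ding"],
}
profiles = {label: build_profile(docs) for label, docs in corpus.items()}

print(profiles["en"]["the"], profiles["de"]["the"])
print(participation("is", profiles))
```

Each profile is a table of the same shape as the `"the":3415279` excerpt above, just with far smaller counts.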
Language-Detection: Practicalities
• Downsides?
• Twitter and Facebook!
• Letter casing (“I love you” versus “i love you”).
• Mixed-language documents (e.g. Chinese documents with English words).
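Two of these downsides lend themselves to simple preprocessing, sketched below under stated assumptions. `normalize` strips URLs, @mentions, and #hashtags and case-folds the text, so that “I love you” and “i love you” yield the same n-grams; this only helps if the n-gram profiles were built from case-folded text as well. `script_shares` estimates how much of a document is written in CJK ideographs, as a crude flag for mixed-language content. Both helpers and the single Unicode range checked are hypothetical choices for this example, not something the slides describe.

```python
import re

# Social-media tokens that carry little language signal (assumed pattern).
URL_OR_HANDLE = re.compile(r"https?://\S+|[@#]\w+")

def normalize(text):
    """Strip URLs, @mentions, and #hashtags, then case-fold and tidy spaces."""
    text = URL_OR_HANDLE.sub(" ", text)
    return " ".join(text.casefold().split())

def is_cjk(ch):
    """True for characters in the CJK Unified Ideographs block only."""
    return "\u4e00" <= ch <= "\u9fff"

def script_shares(text):
    """Fraction of alphabetic characters that are CJK versus everything else."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return {"cjk": 0.0, "other": 0.0}
    cjk = sum(1 for ch in letters if is_cjk(ch)) / len(letters)
    return {"cjk": cjk, "other": 1.0 - cjk}

print(normalize("I LOVE you @Delta http://t.co/abc #fail"))
print(script_shares("我爱 NLP"))
```

A document whose shares are split between scripts could be segmented and classified per segment rather than forced into a single label.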
PR Case Study: Delta Airlines and “Needle Sandwiches”
Overview
• Airline passengers found sewing needles in sandwiches.
• Airline attempted to redirect the conversation and measure the results.
• Visible tracked this event in social media.
Delta Airlines: Needle Sandwiches
[Figure: prominent terms at week, month, and three-month views, annotated with three events: purchased a refinery to reduce fuel costs; passengers found needles in their in-flight sandwiches; free tickets given away as a promotion.]
Delta Volumes Over Time
[Figure: volumes over time, annotated with three events: purchased a refinery to reduce fuel costs; needles found in in-flight turkey sandwiches; free tickets given away as a promotion.]
Delta Volumes Over Time
Month View
3 Month View
PR Case Study: Conclusion
• Contest didn’t pay off in the long term.
• Attempts to redirect the conversation may be ham-fisted.
• Thoughts? Conjecture?
Conclusion
Questions?
Thank You
www.visibletechnologies.com
info@visibletechnologies.com
Twitter: @Visible
Phone: (888) 852-0320
