Dealing with unstructured data at scale


Great Wide Open

John Hammink, Developer Evangelist, Treasure Data Inc. Great Wide Open 2016, Atlanta, GA, March 16th, 2016


Dealing with Unstructured Data
Scaling to Infinity
Image: Boykung/Shutterstock

Copyright ©2014 Treasure Data. All Rights Reserved.
Big Data Simplified: One Approach
[Diagram labels]
Collection: App Servers, Mobile SDKs, Web SDK, Embedded SDKs, Server-side Agents
Multi-structured Events:
• register
• login
• start_event
• purchase
• etc
Cloud Data Store (Infinite & Economical; Familiar & Table-oriented):
✓ App log data
✓ Mobile event data
✓ Sensor data
✓ Telemetry
SQL and Results Push:
• SQL-based Ad-hoc Queries
• SQL-based Dashboards
• DBs & Data Marts
• Other Apps

What is the point of all this data?
BI: Business Intelligence Using Very Large Sets of Data

Copyright ©2015 Treasure Data. All Rights Reserved.
Treasure Data By the Numbers (Jan-2015):
• 13T+ records of data imported since launch
• 500K+ records imported each second
• 1.5 Trillion+ records imported each month
• 12B records sent per day by one customer
[Chart: Data Records Stored in the Treasure Data Cloud Service, last 2 years (Aug-12 to Dec-14), vertical axis from 0 to 14 trillion records. Annotations: Service Launched, Series A Funding, 100 Customers, Selected by Gartner as Cool Vendor in Big Data, 5 Trillion Records, 10 Trillion Records, 13 Trillion Records, Series B Funding]
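As a rough consistency check on the figures above, the per-second rate can be converted into a monthly volume. This is only a back-of-the-envelope sketch: the 30-day month and the reading of "500K+" and "12B" as exact values are assumptions, not numbers from the slide.

```python
# Back-of-the-envelope check of the slide's throughput figures.
# Assumptions (not from the slide): a 30-day month; "500K+" and "12B" taken as exact.

SECONDS_PER_DAY = 24 * 60 * 60      # 86,400
DAYS_PER_MONTH = 30                 # assumed

records_per_second = 500_000        # "500K+ records imported each second"
records_per_month = records_per_second * SECONDS_PER_DAY * DAYS_PER_MONTH

customer_per_day = 12_000_000_000   # "12B records sent per day by one customer"
customer_per_second = customer_per_day / SECONDS_PER_DAY

print(f"{records_per_month:,} records per month")
print(f"{customer_per_second:,.0f} records per second from one customer")
```

At exactly 500,000 records per second this comes to roughly 1.3 trillion records per month, so the "1.5 Trillion+" monthly figure implies a sustained rate somewhat above 500K per second, which fits the "+" on the slide.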

Statistics
• Total Records Stored: 25 Trillion
• Managed & Supported: 24 * 7 * 365
• Uptime: 99.99%
• New Records / second: 1 Million
• Daily Twitter volume: 100x

A solution?
• There are trade-offs to consider
• Any trade-off should make it easy to collect data
• Easy does it! Un- and semi-structured data (multi-structured data)
• Open source means it’s free; it also means that you need someone on hand to implement and maintain it
• Cloud storage means you don’t have to scale and/or shard; the trade-off is a performance hit compared with bare metal
Image: John Hammink
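To illustrate why multi-structured data can be easy to collect, here is a minimal sketch in which events of different shapes are appended as JSON lines with no schema declared up front. The event names (register, login, purchase) come from the slides; the field names, the in-memory sink, and the `collect` helper are hypothetical.

```python
import io
import json

def collect(sink, event_type, **fields):
    """Append one event as a JSON line; no schema is declared up front."""
    record = {"event": event_type, **fields}
    sink.write(json.dumps(record) + "\n")

# An in-memory buffer stands in for a log file or a cloud data store.
sink = io.StringIO()
collect(sink, "register", user="u1", plan="free")
collect(sink, "login", user="u1", device="ios")
collect(sink, "purchase", user="u1", amount=9.99, currency="USD")

# Structure is applied at read time: each record carries its own fields.
events = [json.loads(line) for line in sink.getvalue().splitlines()]
purchases = [e for e in events if e["event"] == "purchase"]
all_fields = sorted({key for e in events for key in e})

print(all_fields)
print(purchases[0]["amount"])
```

Each event type brings its own fields, and the set of columns is only worked out when the data is read back.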

Images: Lightspring/Shutterstock, John Hammink, Treasure Data
There are a few intro to Data Science blog posts at blog.treasuredata.com!

Open vs. Closed source
Image: Heather Craig/Shutterstock

[Three images, separated by “or”, ending with “?”]
Images: PC World, Data-Hive, Wallpapersmela

# logs from a file
<source>
  type tail
  path /var/log/httpd.log
  format apache2
  tag web.access
</source>

# logs from client libraries
<source>
  type forward
  port 24224
</source>

# store logs to ES and HDFS
<match *.*>
  type copy
  <store>
    type elasticsearch
    logstash_format
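The configuration above tags each log line (`web.access`) and routes events by tag: `<match *.*>` selects them and `type copy` sends each event to every `<store>`. The Python sketch below mimics that flow for illustration only. It is not Fluentd code: the regular expression is a simplified stand-in for the `apache2` format, the matcher handles only `*` as exactly one tag part, and the two lists standing in for Elasticsearch and HDFS are hypothetical.

```python
import re

# Simplified stand-in for an Apache access-log line (assumption: the real
# apache2 format captures more fields, such as referer and user agent).
LINE_RE = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<code>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Turn one raw log line into a structured record, or None if it does not match."""
    m = LINE_RE.match(line)
    return m.groupdict() if m else None

def tag_matches(pattern, tag):
    """Match a tag against a pattern where '*' stands for exactly one tag part."""
    p_parts, t_parts = pattern.split("."), tag.split(".")
    return len(p_parts) == len(t_parts) and all(
        p == "*" or p == t for p, t in zip(p_parts, t_parts)
    )

def route(tag, record, pattern, stores):
    """Copy the record to every store when the tag matches the pattern."""
    if tag_matches(pattern, tag):
        for store in stores:
            store.append((tag, dict(record)))

# Hypothetical in-memory stand-ins for the two destinations.
es_store, hdfs_store = [], []

line = '192.0.2.1 - alice [16/Mar/2016:10:00:00 -0400] "GET /index.html HTTP/1.1" 200 512'
record = parse_line(line)
route("web.access", record, "*.*", [es_store, hdfs_store])
route("web", record, "*.*", [es_store, hdfs_store])  # one tag part only: not matched

print(len(es_store), len(hdfs_store))
print(es_store[0][1]["path"])
```

The two-part tag `web.access` reaches both stores, while the one-part tag `web` is not matched by `*.*`.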

Multi-structured data
• Unstructured data: better for data ultimately used in statistics

Embulk: an open-source bulk data loader that helps data transfer between various databases, storages, file formats, and cloud services
embulk.org/docs
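Embulk itself is driven by configuration files and plugins, and the sketch below does not use it. It only illustrates the bulk-loading idea in plain Python: read records from one format (CSV) and write them to another store (SQLite) in batches. The table name, column names, batch size, and the `bulk_load` helper are all hypothetical.

```python
import csv
import io
import sqlite3
from itertools import islice

def bulk_load(reader, conn, batch_size=2):
    """Insert rows from a csv.DictReader into SQLite in fixed-size batches."""
    loaded = 0
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            break
        conn.executemany(
            "INSERT INTO events (user, action) VALUES (:user, :action)", batch
        )
        conn.commit()
        loaded += len(batch)
    return loaded

# In-memory stand-ins for a source file and a destination database.
source = io.StringIO("user,action\nu1,login\nu2,purchase\nu3,login\n")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, action TEXT)")

count = bulk_load(csv.DictReader(source), conn)
logins = conn.execute(
    "SELECT COUNT(*) FROM events WHERE action = 'login'"
).fetchone()[0]
print(count, logins)
```

Loading in batches keeps memory use bounded regardless of how large the source is.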

Hivemall
Hivemall is a scalable machine learning library that runs on Apache Hive.
Hivemall is designed to be scalable to the number of training instances as well as the number of training features.
• Classification
• Regression
• Recommendation
• k-nearest neighbor
• Anomaly Detection
• Feature Engineering
https://github.com/myui/hivemall
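As a generic illustration of the first item in the list, classification, here is a small logistic-regression classifier trained with stochastic gradient descent on sparse features. It is plain Python, not Hivemall's API or implementation (Hivemall provides its learners as functions called from Hive queries); the data, feature names, learning rate, and epoch count are made up for the example.

```python
import math

def predict(weights, features):
    """Return P(label = 1) for a sparse feature dict {name: value}."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(rows, epochs=50, learning_rate=0.5):
    """Fit weights with plain stochastic gradient descent, one row at a time."""
    weights = {}
    for _ in range(epochs):
        for features, label in rows:
            error = label - predict(weights, features)
            for name, value in features.items():
                weights[name] = weights.get(name, 0.0) + learning_rate * error * value
    return weights

# Toy data (made up): did a session end in a purchase?
rows = [
    ({"bias": 1.0, "viewed_cart": 1.0}, 1),
    ({"bias": 1.0, "viewed_cart": 1.0, "mobile": 1.0}, 1),
    ({"bias": 1.0, "mobile": 1.0}, 0),
    ({"bias": 1.0}, 0),
]

weights = train(rows)
p_buy = predict(weights, {"bias": 1.0, "viewed_cart": 1.0})
p_skip = predict(weights, {"bias": 1.0})
print(round(p_buy, 2), round(p_skip, 2))
```

Because only the features present in a row are touched on each update, the same loop works when the number of distinct features is large and each row is sparse.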

The Hadoop Story on MongoDB
Image courtesy of Steven Francia @ Docker


Dealing with unstructured data at scale


2. Image: John Hammink

4. There are many sources of information


11. Image: Dreamstime


13. What does a pipeline need?


16. LAMBDA ARCHITECTURE


18. LESS SIMPLE FORWARDING

19. Before fluentd


21. fluentd! http://www.fluentd.org/

22. http://msgpack.org/
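MessagePack is a compact binary serialization format. As a rough illustration of what "compact" means, the sketch below hand-encodes a tiny subset of the format (nil, booleans, integers 0 to 127, strings shorter than 32 bytes, and maps with fewer than 16 entries) and compares the size with JSON. The `pack` helper is hypothetical and is not the API of any MessagePack library.

```python
import json

def pack(value):
    """Encode a small subset of MessagePack: nil, bool, 0..127, short str, small map."""
    if value is None:
        return b"\xc0"
    if value is True:
        return b"\xc3"
    if value is False:
        return b"\xc2"
    if isinstance(value, int) and 0 <= value <= 127:
        return bytes([value])                      # positive fixint
    if isinstance(value, str):
        raw = value.encode("utf-8")
        if len(raw) < 32:
            return bytes([0xA0 | len(raw)]) + raw  # fixstr
    if isinstance(value, dict) and len(value) < 16:
        body = b"".join(pack(k) + pack(v) for k, v in value.items())
        return bytes([0x80 | len(value)]) + body   # fixmap
    raise ValueError("outside the subset this sketch supports")

doc = {"compact": True, "schema": 0}
packed = pack(doc)
as_json = json.dumps(doc, separators=(",", ":")).encode("utf-8")
print(packed.hex(), len(packed), len(as_json))
```

For this small document the binary form takes 18 bytes against 27 bytes for the most compact JSON encoding.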


28. Questions?
