Big Data 101
    Bouvet BigOne, 2013-03-14
    Lars Marius Garshol, larsga@bouvet.no, http://twitter.com/larsga

What is big data?

     “Big Data is any thing which is crash Excel.”
     “Small Data is when is fit in RAM. Big Data is when is crash because is
     not fit in RAM.”
     https://twitter.com/devops_borat

     Or, in other words, Big Data is data in volumes too great to process by
     traditional methods.

Data accumulation

    • Today, data is accumulating at tremendous
      rates
       –   click streams from web visitors
       –   supermarket transactions
       –   sensor readings
       –   video camera footage
       –   GPS trails
       –   social media interactions
       –   ...
    • It really is becoming a challenge to store
      and process it all in a meaningful way

From WWW to VVV

    • Volume
      – data volumes are becoming unmanageable
    • Variety
      – data complexity is growing
      – more types of data captured than previously
    • Velocity
      – some data is arriving so rapidly that it must either
        be processed instantly, or lost
      – this is a whole subfield called “stream processing”
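
A minimal sketch of the "process instantly, or lose it" idea: each reading updates a running mean and variance (Welford's method) and is then discarded, so memory use stays constant however long the stream runs. The class name, the function names, and the sample values are assumptions made for illustration only.

```python
from typing import Iterable, Iterator


class RunningStats:
    """Mean and variance of a stream, updated one reading at a time."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; with fewer than two readings we return 0.0.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


def sensor_readings() -> Iterator[float]:
    """Hypothetical stand-in for a live feed; the values are made up."""
    yield from [20.1, 20.4, 19.8, 35.0, 20.2]


def summarize(stream: Iterable[float]) -> RunningStats:
    stats = RunningStats()
    for reading in stream:
        stats.update(reading)  # the reading itself is never stored
    return stats


if __name__ == "__main__":
    stats = summarize(sensor_readings())
    print(stats.count, stats.mean, stats.variance)
```

Real stream processing systems add windowing, distribution, and fault tolerance on top of this one-pass pattern.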




The promise of Big Data

• Data contains information of great
  business value
• If you can extract those insights you can
  make far better decisions
• ...but is data really that valuable?
“quadrupling the average cow's
     milk production since your parents
     were born”



     "When Freddie [as he is known]
     had no daughter records our
     equations predicted from his DNA
     that he would be the best bull,"
     USDA research geneticist Paul
     VanRaden emailed me with a
     detectable hint of pride. "Now he is
     the best progeny tested bull (as
     predicted)."




Ok, ok, but ... does it apply to our customers?
     • Norwegian Food Safety Authority
        – accumulates data on all farm animals
        – birth, death, movements, medication, samples, ...
     • Hafslund
        – time series from hydroelectric dams, power prices,
          meters of individual customers, ...
     • Social Security Administration
        – data on individual cases, actions taken, outcomes...
     • Statoil
        – massive amounts of data from oil exploration,
          operations, logistics, engineering, ...
     • Retailers
        – see Target example above
        – also, connection between what people buy, weather
          forecast, logistics, ...
How to extract insight from data?




     [Figure: Monthly Retail Sales in New South Wales (NSW) Retail Department Stores]
Estimating real estate prices

     • Take parameters
        –   x1    square meters
        –   x2    number of rooms
        –   x3    number of floors
        –   x4    energy cost per year
        –   x5    meters to nearest subway station
        –   x6    years since built
        –   x7    years since last refurbished
        –   ...
     • a x1 + b x2 + c x3 + ... = price
        – strip out the x-es and you have a vector
        – collect N samples of real flats with prices = matrix
        – welcome to the world of linear algebra
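
The "vector and matrix" step above can be sketched as an ordinary least-squares fit. This is one possible way to estimate the weights a, b, c, not necessarily the method the talk had in mind. The sample flats are invented, and their prices are generated from known weights so the fit can be checked. Only three of the parameters are used to keep it short.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]

# Made-up samples: [square meters, number of rooms, years since built].
FLATS: Matrix = [
    [50.0, 2.0, 30.0],
    [70.0, 3.0, 10.0],
    [90.0, 4.0, 5.0],
    [60.0, 2.0, 40.0],
    [120.0, 5.0, 1.0],
]
# Prices computed from the (assumed) weights 3.0, 10.0 and -0.5.
PRICES: Vector = [155.0, 235.0, 307.5, 180.0, 409.5]


def solve(a: Matrix, b: Vector) -> Vector:
    """Solve a w = b by Gaussian elimination; assumes a is non-singular."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - known) / m[i][i]
    return w


def fit(samples: Matrix, prices: Vector) -> Vector:
    """Least-squares weights from the normal equations (X^T X) w = X^T y."""
    k = len(samples[0])
    xtx = [[sum(row[i] * row[j] for row in samples) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * p for row, p in zip(samples, prices)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights: Vector, flat: Sequence[float]) -> float:
    """Price estimate: a*x1 + b*x2 + c*x3 + ..."""
    return sum(w * x for w, x in zip(weights, flat))


if __name__ == "__main__":
    weights = fit(FLATS, PRICES)
    print(weights, predict(weights, [80.0, 3.0, 20.0]))
```

With real, noisy prices the fit will not be exact, and a numerical library would normally be used instead of hand-written elimination.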
Types of algorithms

     •   Clustering
     •   Association learning
     •   Parameter estimation
     •   Recommendation engines
     •   Support Vector Machines
     •   Similarity matching
     •   Neural networks
     •   Bayesian networks
     •   Genetic algorithms


Basically, it’s all maths...

     •   Linear algebra
     •   Calculus
     •   Probability theory
     •   Graph theory
     •   ...

     “Only 10% in devops are know how of work with Big Data. Only 1% are
     realize they are need 2 Big Data for fault tolerance”
     https://twitter.com/devops_borat

Big data skills gap

     • Hardly anyone knows this stuff
     • It’s a big field, with lots and lots of theory
     • And it’s all maths, so it’s tricky to learn




     http://www.ibmbigdatahub.com/blog/addressing-big-data-skills-gap
     http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond#The_Big_Data_Skills_Gap

Two orthogonal aspects

     • Analytics / machine learning
       – learning insights from data
     • Big data
       – handling massive data volumes
     • Can be combined, or used separately




How to process Big Data?

     • If relational databases are not enough,
       what is?

     “Mining of Big Data is problem solve in 2013 with zgrep”
     https://twitter.com/devops_borat

MapReduce

     • A framework for writing massively parallel
       code
     • Simple, straightforward model
     • Based on “map” and “reduce” functions
       from functional programming (LISP)
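
A single-process sketch of the model, using the classic word-count example. It only illustrates the shape of the computation: a real framework would run the map and reduce calls in parallel across many machines and do the grouping ("shuffle") itself. All function names here are illustrative assumptions, not any framework's API.

```python
from collections import defaultdict
from functools import reduce
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, as happens between the two phases."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values: List[int]) -> int:
    """Reduce: fold all values for one key into a single count."""
    return reduce(lambda a, b: a + b, values, 0)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    return {word: reduce_phase(counts)
            for word, counts in shuffle(mapped).items()}


if __name__ == "__main__":
    print(word_count(["big data is big", "small data is small"]))
```

Because each map call sees one line and each reduce call sees one key, neither needs to know about the rest of the data, which is what makes the model easy to parallelize.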




Things you can do in MapReduce

     • Google’s PageRank algorithm
       – easily expressible in MapReduce
       – one of the first applications of MapReduce
     • SQL
       – relational algebra has straightforward translation
         to the MapReduce model
     • Linear algebra
       – matrix operations are easily MapReducible
       – (PageRank is just a bunch of matrix operations)
     • Recommendation engines
       – also MapReducible (e.g. frequent itemsets via the SON algorithm)
       – ...
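
To make "PageRank is just a bunch of matrix operations" concrete, here is a small power-iteration sketch on a made-up four-page graph. Each round is one sparse matrix-vector multiplication: handing a share of rank to every link target is the "map" step, and summing the shares per page is the "reduce" step. The damping factor 0.85 and the iteration count are conventional choices assumed here, and this is a simplified illustration rather than Google's production algorithm.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power iteration; every link target must also be a key in `links`."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank evenly over all pages.
            targets = outlinks or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical link graph: page -> pages it links to.
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for page, score in sorted(pagerank(graph).items()):
        print(page, score)
```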
NoSQL and Big Data

     • Not really that relevant
     • Traditional databases handle big data sets,
       too
     • NoSQL databases have poor analytics
     • MapReduce often works from text files
        – can obviously work from SQL and NoSQL, too
     • NoSQL is more for high throughput
        – basically, AP from the CAP theorem, instead of CP
     • In practice, really Big Data is likely to be a
       mix
        – text files, NoSQL, and SQL
The 4th V: Veracity

     “The greatest enemy of knowledge is not
     ignorance, it is the illusion of knowledge.”
                        Daniel Boorstin, in The Discoverers (1983)



     “95% of time, when is clean Big Data is get Little Data”
     https://twitter.com/devops_borat

Data quality

     • A huge problem in practice
       – any manually entered data is suspect
       – most data sets are in practice deeply problematic
     • Even automatically gathered data can be a
       problem
       – systematic problems with sensors
       – errors causing data loss
       – incorrect metadata about the sensor
     • Never, never, never trust the data without
       checking it!
       – garbage in, garbage out, etc
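
A hedged sketch of what "checking it" might look like for one sensor's readings: flag missing values, values outside a plausible range, duplicate timestamps, and long gaps that suggest data loss. The `Reading` fields, the thresholds, and the sample data are all assumptions for illustration; real checks depend on the sensor and the domain.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Reading:
    sensor_id: str
    timestamp: int           # seconds since some epoch
    value: Optional[float]   # None when the reading was lost


def check_readings(readings: Iterable[Reading], low: float, high: float,
                   max_gap: int) -> List[str]:
    """Return human-readable problems found in one sensor's readings."""
    problems: List[str] = []
    previous: Optional[Reading] = None
    for r in sorted(readings, key=lambda r: r.timestamp):
        if r.value is None:
            problems.append(f"{r.timestamp}: missing value")
        elif not low <= r.value <= high:
            problems.append(
                f"{r.timestamp}: value {r.value} outside [{low}, {high}]")
        if previous is not None:
            gap = r.timestamp - previous.timestamp
            if gap == 0:
                problems.append(f"{r.timestamp}: duplicate timestamp")
            elif gap > max_gap:
                problems.append(
                    f"{r.timestamp}: gap of {gap}s since previous reading")
        previous = r
    return problems


if __name__ == "__main__":
    sample = [
        Reading("s1", 0, 20.0),
        Reading("s1", 60, None),
        Reading("s1", 60, 21.0),
        Reading("s1", 600, 999.0),
    ]
    for problem in check_readings(sample, low=-40.0, high=60.0, max_gap=120):
        print(problem)
```

Checks like these catch only the obvious problems. Systematic sensor errors and wrong metadata usually need domain knowledge to spot.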

Conclusion

     • Vast potential
        – to both big data and machine learning
     • Very difficult to realize that potential
        – requires mathematics, which nobody knows
     • We need to wake up!




Where to learn more

     • University of Oslo
       – has courses on linear algebra, probability, graph
         theory, ...
     • Stanford University
       – https://www.coursera.org/course/ml
     • Mining Massive Datasets
       – http://infolab.stanford.edu/~ullman/mmds.html




Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 

Big data 101

  • 1. Big Data 101
      Bouvet BigOne, 2013-03-14
      Lars Marius Garshol, larsga@bouvet.no, http://twitter.com/larsga
  • 2. 2
  • 3. 3
  • 4. What is big data?
      "Big Data is any thing which is crash Excel."
      "Small Data is when is fit in RAM. Big Data is when is crash because is not fit in RAM."
      https://twitter.com/devops_borat
      Or, in other words, Big Data is data in volumes too great to process by traditional methods.
  • 5. Data accumulation
      • Today, data is accumulating at tremendous rates
          – click streams from web visitors
          – supermarket transactions
          – sensor readings
          – video camera footage
          – GPS trails
          – social media interactions
          – ...
      • It really is becoming a challenge to store and process it all in a meaningful way
  • 6. From WWW to VVV
      • Volume
          – data volumes are becoming unmanageable
      • Variety
          – data complexity is growing
          – more types of data captured than previously
      • Velocity
          – some data is arriving so rapidly that it must either be processed instantly, or lost
          – this is a whole subfield called “stream processing”
  • 7. The promise of Big Data
      • Data contains information of great business value
      • If you can extract those insights you can make far better decisions
      • ...but is data really that valuable?
  • 8. 8
  • 9. 9
  • 10. “quadrupling the average cow's milk production since your parents were born”
      "When Freddie [as he is known] had no daughter records our equations predicted from his DNA that he would be the best bull," USDA research geneticist Paul VanRaden emailed me with a detectable hint of pride. "Now he is the best progeny tested bull (as predicted)."
  • 11. Ok, ok, but ... does it apply to our customers?
      • Norwegian Food Safety Authority
          – accumulates data on all farm animals
          – birth, death, movements, medication, samples, ...
      • Hafslund
          – time series from hydroelectric dams, power prices, meters of individual customers, ...
      • Social Security Administration
          – data on individual cases, actions taken, outcomes...
      • Statoil
          – massive amounts of data from oil exploration, operations, logistics, engineering, ...
      • Retailers
          – see Target example above
          – also, connection between what people buy, weather forecast, logistics, ...
  • 12. How to extract insight from data?
      [Chart: Monthly Retail Sales in New South Wales (NSW), Retail Department Stores]
  • 13. Estimating real estate prices
      • Take parameters
          – x1 square meters
          – x2 number of rooms
          – x3 number of floors
          – x4 energy cost per year
          – x5 meters to nearest subway station
          – x6 years since built
          – x7 years since last refurbished
          – ...
      • a x1 + b x2 + c x3 + ... = price
          – strip out the x-es and you have a vector
          – collect N samples of real flats with prices = matrix
          – welcome to the world of linear algebra
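The pricing model on slide 13 can be written out in matrix form. This is only a sketch: the symbols X, w and p, the count k of parameters, and the choice of an ordinary least-squares fit are assumptions made here for illustration; the slide itself only says that the coefficients form a vector and the N samples form a matrix.

```latex
\[
\underbrace{\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1k} \\
x_{21} & x_{22} & \cdots & x_{2k} \\
\vdots & \vdots &        & \vdots \\
x_{N1} & x_{N2} & \cdots & x_{Nk}
\end{pmatrix}}_{X}
\underbrace{\begin{pmatrix} a \\ b \\ \vdots \end{pmatrix}}_{w}
\approx
\underbrace{\begin{pmatrix} \mathrm{price}_1 \\ \mathrm{price}_2 \\ \vdots \\ \mathrm{price}_N \end{pmatrix}}_{p},
\qquad
\hat{w} = \arg\min_{w} \lVert Xw - p \rVert_2^2 = (X^\top X)^{-1} X^\top p
\]
```

Each row of X holds the parameters x1, x2, ... of one flat, and w holds the unknown coefficients a, b, c, .... The closed form on the right holds only when X^T X is invertible, which in practice requires at least as many samples as parameters and no redundant columns.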
  • 14. Types of algorithms
      • Clustering
      • Association learning
      • Parameter estimation
      • Recommendation engines
      • Support Vector Machines
      • Similarity matching
      • Neural networks
      • Bayesian networks
      • Genetic algorithms
  • 15. Basically, it’s all maths...
      • Linear algebra
      • Calculus
      • Probability theory
      • Graph theory
      • ...
      "Only 10% in devops are know how of work with Big Data. Only 1% are realize they are need 2 Big Data for fault tolerance"
      https://twitter.com/devops_borat
  • 16. Big data skills gap
      • Hardly anyone knows this stuff
      • It’s a big field, with lots and lots of theory
      • And it’s all maths, so it’s tricky to learn
      http://www.ibmbigdatahub.com/blog/addressing-big-data-skills-gap
      http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond#The_Big_Data_Skills_Gap
  • 17. Two orthogonal aspects
      • Analytics / machine learning
          – learning insights from data
      • Big data
          – handling massive data volumes
      • Can be combined, or used separately
  • 18. How to process Big Data?
      • If relational databases are not enough, what is?
      "Mining of Big Data is problem solve in 2013 with zgrep"
      https://twitter.com/devops_borat
  • 19. MapReduce
      • A framework for writing massively parallel code
      • Simple, straightforward model
      • Based on “map” and “reduce” functions from functional programming (LISP)
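The map/reduce model on slide 19 can be illustrated with a single-process word count. This is a minimal sketch, not how any particular framework is implemented: the function names `map_phase`, `shuffle`, `reduce_phase` and `map_reduce` are hypothetical, and a real framework would run the map and reduce calls in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    """Run map, shuffle and reduce sequentially over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: two tiny "documents".
counts = map_reduce(["big data is big", "small data is small"])
```

Because each `map_phase` call sees only one record and each `reduce_phase` call sees only one key, both phases can be distributed without the calls needing to communicate.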
  • 20. Things you can do in MapReduce
      • Google’s PageRank algorithm
          – easily expressible in MapReduce
          – one of the first applications of MapReduce
      • SQL
          – relational algebra has straightforward translation to the MapReduce model
      • Linear algebra
          – matrix operations are easily MapReducible
          – (PageRank is just a bunch of matrix operations)
      • Recommendation engines
          – also MapReducible (the SON algorithm)
          – ...
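As one example of slide 20's claim that matrix operations are easily MapReducible, a matrix-vector product can be phrased as a map over the nonzero matrix entries followed by a per-row sum. This is a sketch under assumptions: the sparse `(row, col, value)` layout, the function names and the sample matrix are all hypothetical, and everything runs in one process.

```python
from collections import defaultdict


def map_entry(entry, vector):
    """Map: one nonzero entry (row, col, value) -> (row, partial product)."""
    row, col, value = entry
    return row, value * vector[col]


def reduce_row(partials):
    """Reduce: sum the partial products that share a row index."""
    return sum(partials)


def mat_vec(entries, vector, n_rows):
    """Multiply a sparse matrix by a vector using map and reduce steps."""
    groups = defaultdict(list)
    for entry in entries:
        row, partial = map_entry(entry, vector)
        groups[row].append(partial)
    # Rows with no nonzero entries reduce to 0.
    return [reduce_row(groups[row]) for row in range(n_rows)]


# Hypothetical 2x2 matrix [[1, 2], [0, 3]] in sparse (row, col, value) form.
entries = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
result = mat_vec(entries, [10.0, 20.0], n_rows=2)
```

Repeating such a product until the vector stops changing is the kind of iterated matrix operation the slide refers to when it describes PageRank.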
  • 21. NoSQL and Big Data
      • Not really that relevant
      • Traditional databases handle big data sets, too
      • NoSQL databases have poor analytics
      • MapReduce often works from text files
          – can obviously work from SQL and NoSQL, too
      • NoSQL is more for high throughput
          – basically, AP from the CAP theorem, instead of CP
      • In practice, really Big Data is likely to be a mix
          – text files, NoSQL, and SQL
  • 22. The 4th V: Veracity
      “The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge.”
      Daniel Boorstin, in The Discoverers (1983)
      "95% of time, when is clean Big Data is get Little Data"
      https://twitter.com/devops_borat
  • 23. Data quality
      • A huge problem in practice
          – any manually entered data is suspect
          – most data sets are in practice deeply problematic
      • Even automatically gathered data can be a problem
          – systematic problems with sensors
          – errors causing data loss
          – incorrect metadata about the sensor
      • Never, never, never trust the data without checking it!
          – garbage in, garbage out, etc
  • 24. Conclusion
      • Vast potential
          – to both big data and machine learning
      • Very difficult to realize that potential
          – requires mathematics, which nobody knows
      • We need to wake up!
  • 25. Where to learn more
      • University of Oslo
          – has courses on linear algebra, probability, graph theory, ...
      • Stanford University
          – https://www.coursera.org/course/ml
      • Mining Massive Datasets
          – http://infolab.stanford.edu/~ullman/mmds.html