Scraping data to drive
content marketing
campaigns
(without knowing how to code)
@jeremycabral
Data-driven content
DOMINATES
Insights from
Analyzing 1
Million Articles
“Original research based content
has the potential to achieve much
higher numbers of domain links
than other forms of content”
- Steve Rayson (Director -
BuzzSumo)
BuzzSumo Study
Priceonomics
From price guides to content
marketing
Pivot to data-driven
content marketing
23,000+ linking root domains
Price comparison: Airbnb vs Hotels
125
Linking root
domains
URL: https://priceonomics.com/hotels/
The Hipster Music Index
204,219
views
92
Linking root domains
URL: https://priceonomics.com/the-hipster-music-index/
Data mining fuels fast,
cheap and repeatable
content marketing ideas
But… what if the data you
need isn’t available by API
or downloadable?
Disclaimer
Seek legal advice before
committing to a scraping
project
Scraping data could breach the
terms of service of a website
Scraping at a disruptive rate
could slow down or even crash
a website
What is data
scraping?
Data scraping is an automated way of
using scripts and crawlers to:
1. Fetch a page
2. Parse the data in that page to
extract information
3. Format the data in an
organised way
4. Store or export that data to
create a dataset (DB, CSV,
TXT etc)
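The four steps above can be sketched in Python using only the standard library. This is a minimal illustration, not a production scraper: the sample HTML, the `price` class name and the `PriceParser` helper are assumptions made up for the example, and the `fetch` function is defined but not called so the sketch runs offline.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch(url):
    """Step 1: fetch a page (not called below, so the sketch runs offline)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class PriceParser(HTMLParser):
    """Step 2: collect the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())


# Hypothetical page content standing in for the result of fetch(...)
SAMPLE_HTML = (
    '<ul><li>Hotel <span class="price">$120</span></li>'
    '<li>Airbnb <span class="price">$95</span></li></ul>'
)

parser = PriceParser()
parser.feed(SAMPLE_HTML)

# Step 3: format the values in an organised way (one row per price)
rows = [{"position": i, "price": p} for i, p in enumerate(parser.prices, start=1)]

# Step 4: export to CSV (an in-memory buffer here; a file works the same way)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["position", "price"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```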
Patterns in HTML & CSS
It’s easier to scrape content broken up by a unique id or class assigned to the
element you want to extract
Basic overview of XPath
XPath can be used to navigate through
elements and attributes in a document
Important to understand how tags are nested as
a scraper will follow this tree
Learn more:
https://www.slideshare.net/scrapinghub/xpath-for-web-scraping
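As a rough illustration of how an XPath query follows the nesting of tags, the sketch below uses Python's built-in `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup (real pages usually call for a dedicated HTML parser). The document, ids and class names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment
DOC = """
<html>
  <body>
    <div id="listing">
      <h2 class="title">Hotel A</h2>
      <span class="price">120</span>
    </div>
    <div id="footer">
      <span class="price">ignored</span>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(DOC)

# Only the price nested directly inside the div with id="listing"
listing_prices = [
    el.text for el in root.findall(".//div[@id='listing']/span[@class='price']")
]

# Any h2 with class="title", wherever it sits in the tree
titles = [el.text for el in root.findall(".//h2[@class='title']")]
```

The first query shows why nesting matters: both `span` elements share the `price` class, but only the one under the `listing` div is matched.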
Finding an API
Learn more: http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/
Important
Excel analysis
skills
1. Match the same data across
multiple spreadsheets:
a. VLOOKUP
b. INDEX MATCH
2. Summarising data
a. Pivot Tables
b. Charts
3. Cleaning data
a. =TRIM()
b. =SPLIT() (Google Sheets; in Excel use Text to Columns or =TEXTSPLIT())
Learn more:
● https://www.distilled.net/excel-for-seo/
● https://trumpexcel.com/clean-data-in-excel/
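For readers who end up doing the same three jobs in a script rather than a spreadsheet, here is a small Python sketch of the equivalents: cleaning (like =TRIM()), matching (like VLOOKUP) and summarising (like a pivot table). All rows and numbers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical rows from two "spreadsheets"
articles = [
    {"url": "/hotels/", "topic": " travel "},
    {"url": "/music/", "topic": "culture"},
    {"url": "/flights/", "topic": "travel"},
]
links = {"/hotels/": 10, "/music/": 4}  # url -> made-up link counts

# Cleaning, like =TRIM(): strip stray whitespace
for row in articles:
    row["topic"] = row["topic"].strip()

# Matching, like VLOOKUP / INDEX MATCH: look up each url in the other sheet
for row in articles:
    row["links"] = links.get(row["url"], 0)  # 0 when there is no match

# Summarising, like a pivot table: total links per topic
links_by_topic = defaultdict(int)
for row in articles:
    links_by_topic[row["topic"]] += row["links"]
```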
Data sources
Web Apps
● Engines / Listings (product
data, reviews)
● Search results (with filters
applied)
Calculators
● Automatically input values
with scripts
● Store every calculator
results combination
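Storing every combination of calculator inputs amounts to looping over the Cartesian product of the input values. A minimal Python sketch follows; the `repayment` function is a hypothetical stand-in for whatever the real calculator returns, and the input ranges are invented.

```python
from itertools import product


def repayment(principal, annual_rate, years):
    """Hypothetical stand-in for the calculator: simple-interest total."""
    return round(principal * (1 + annual_rate * years), 2)


# Made-up input values to feed into the calculator
principals = [10000, 20000]
rates = [0.05, 0.10]
terms = [1, 2]

# One stored result per combination of inputs (2 x 2 x 2 = 8 rows)
results = [
    {"principal": p, "rate": r, "years": y, "total": repayment(p, r, y)}
    for p, r, y in product(principals, rates, terms)
]
```

With a real web calculator, `repayment` would be replaced by a script that submits the inputs and reads back the result.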
Public
Datasets
● Upside: easy to download,
regularly maintained by
others
● Downside: everyone else has
easy access to the same
data as you
APIs
● Upside: everything is
structured and (often)
documented
● Downside: sometimes not all
data is available in an API
How to get the data
Scraping
Frameworks &
Languages
Popular languages
PHP
Python
Ruby
Perl
Node.js
These are important for your
own development or choosing a
freelancer
Try to use a language your
developers are familiar with
Simulating the
user in the
browser
● Selenium Web Driver
● PhantomJS w/CasperJS
Data scraping tools
Desktop tools
Scrape Similar
artoo.js
Tabula - extract tables from PDFs
Parsehub (free & paid versions)
Screaming Frog
URL Profiler
Scripts run on your local machine
Hosted Services
Google Sheets (ImportXML,
ImportJSON, ImportHTML)
Import.io - automatic page scraper
Mozenda - point and click screen
scraping (Windows only)
Diffbot (artificial intelligence)
Connotate
Scraping with Google Sheets
Google Sheets Formulas (built in)
=importXML(url, xpath_query) -- imports
structured data using XPath
=importHTML(url, query, index) – imports data
from a table or list within an HTML page. Index
identifies which table in the source code
Learn more:
https://www.distilled.net/blog/distilled/guide-to-google-docs-importxml/
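To make the role of the index argument concrete, here is a rough Python sketch of what an ImportHTML-style lookup does: collect every table on the page, then return the one at the requested position (counting from 1, as Sheets does). The `TableParser` and `import_table` names and the sample page are assumptions for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self.tables[-1][-1].append("")
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.tables[-1][-1][-1] += data.strip()


def import_table(html, index):
    """Return the index-th table (1-based) from an HTML string."""
    parser = TableParser()
    parser.feed(html)
    return parser.tables[index - 1]


# Hypothetical page containing two tables
PAGE = (
    "<table><tr><td>first</td></tr></table>"
    "<table><tr><th>City</th><th>Price</th></tr>"
    "<tr><td>Sydney</td><td>120</td></tr></table>"
)

second_table = import_table(PAGE, 2)
```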
=ImportJSON(url, query, parseOptions) --
imports JSON feeds into Google Sheets
(a custom script you add to the sheet, not a built-in formula)
http://date.jsontest.com/
{
"time": "11:35:24 AM",
"milliseconds_since_epoch": 1493552124786,
"date": "04-30-2017"
}
Learn more:
http://blog.fastfedora.com/projects/import-json
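Outside Google Sheets, the same kind of JSON feed can be read with a few lines of Python. The sketch below parses a payload shaped like the sample response (trimmed to two fields and hard-coded so it runs offline) and converts the epoch milliseconds into a readable UTC timestamp.

```python
import json
from datetime import datetime, timezone

# Payload shaped like the sample response, hard-coded for the example
payload = '{"time": "11:35:24 AM", "milliseconds_since_epoch": 1493552124786}'

record = json.loads(payload)

# Split the epoch value into whole seconds and leftover milliseconds
seconds, millis = divmod(record["milliseconds_since_epoch"], 1000)
timestamp = datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
    microsecond=millis * 1000
)

# One flat row, ready to append to a spreadsheet or CSV
row = [record["time"], timestamp.isoformat()]
```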
Scraping with
Screaming
Frog
Using custom extraction and
filters
Learn more: http://www.seerinteractive.com/blog/screaming-frog-guide/
Import.io example
Run on a frequency you set + stores data historically
Predictive model for real estate
value
Learn more: http://www.louisdorard.com/guest/everyone-can-do-data-science
Realtor.com scraped by import.io => cleaned with Pandas => model built by BigML
Scrape Similar (“Scraper”)
Learn more: http://ipullrank.com/how-to-scrape-every-single-page-on-the-web/
Diffbot.com
4 main APIs that use artificial
intelligence for data extraction
1. Article: clean text from article,
html, author, date info, related
images, videos
2. Discussion: content of forum
threads, article comments,
product reviews
3. Product: pricing information,
product IDs, images, product
specs
4. Video: Author/uploader,
duration, title, description, date
uploaded, stats.
Getting help with data
scraping
Find scraping
experts
Upwork
Freelancer.com
Codementor.io (CodementorX)
Briefing a freelancer
Inputs:
1. Project Goal
2. List of URLs
1. Provide it yourself
2. Provide an endpoint and a pattern of URLs
that you’d like captured
3. Specific inputs into any filters/data input fields
which may be required to capture all the data
combinations
1. Form values (numbers, sliders, etc)
2. Login details
4. Technical requirements
1. Location of IP when scraping
2. Frequency of scrape
3. Scraping language
Outputs:
1. Where will the data be stored?
a. Local file (CSV, TXT)
b. Database (SQLite)
c. Stored on webserver
2. Provide an example spreadsheet showing
how you would like the data presented
3. Specify any data manipulation needed to
have clean output from the scrape
4. Specify how the data will be used
a. HTML Table or
b. Single page application (React/Angular JS)
embedded with oEmbed
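Two items from the brief, a URL pattern as input and a SQLite database as output, can be sketched together in Python. Everything specific here is hypothetical: the URL pattern, the `listings` table and the placeholder rows stand in for whatever the real scrape would capture.

```python
import sqlite3

# Hypothetical URL pattern of the kind a brief might specify
URL_PATTERN = "https://example.com/listings?page={page}"
urls = [URL_PATTERN.format(page=n) for n in range(1, 4)]

# Placeholder rows; a real scrape would fill these in per URL
rows = [(url, f"Listing {i}", 100 + i) for i, url in enumerate(urls, start=1)]

conn = sqlite3.connect(":memory:")  # pass a file path instead to keep the data
conn.execute(
    "CREATE TABLE listings (url TEXT PRIMARY KEY, title TEXT, price INTEGER)"
)
conn.executemany("INSERT INTO listings VALUES (?, ?, ?)", rows)
conn.commit()

stored = conn.execute(
    "SELECT title, price FROM listings ORDER BY price"
).fetchall()
```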
Avoid getting
blocked
● Spoof header as Googlebot
● Run scrape from multiple IP
addresses
● Run the scrape slowly
● Be careful scraping behind a
login
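Running the scrape slowly usually just means pausing between requests. A minimal Python sketch is below; the `crawl_slowly` helper, its two-second default delay and the stand-in `fetch` function are assumptions for illustration, and a suitable delay depends on the site being scraped.

```python
import time


def crawl_slowly(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results[url] = fetch(url)
    return results


# Usage with a stand-in fetch function and a recorded (not real) sleep
pauses = []
pages = crawl_slowly(
    ["a", "b", "c"], fetch=str.upper, delay_seconds=5, sleep=pauses.append
)
```

Passing the sleep function in as a parameter keeps the example quick to run; in a real scrape the default `time.sleep` does the waiting.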
Data scraping
services
Typically $2k+ per project...
ouch!
Priceonomics
Promptcloud
Scrapinghub
Datahen
How to turn your data into
visually appealing content
Maps
In order to do this you need to
capture addresses or
latitude/longitude
Batchgeo
Turn spreadsheets into maps
Charts
Easiest way to visualise data,
hardest to make look sexy with
Excel & Google Sheets
Source: https://www.labnol.org/software/find-right-chart-type-for-your-data/6523/
Tip: Tableau
suggests
charts
Place your data set in Tableau and
use the ‘show me’ functionality
Interactive
Tables
Helpful to use a database tool
for larger datasets
https://tablepress.org/
Interactive
visualisations
Highly engaging and allows the
user to filter the data
Source:
https://www.lowyinstitute.org/lowyinstitutepollinteractive/feelings-towards-other-nations/
Inspect
element to find
frameworks
This visualisation is using
amcharts.com
Interactive
visualisation
brief example
Download the full brief template
http://bit.ly/datavizbrief
Data
visualisation
inspiration
Graphiq.com
TheAtlas.com
Flowingdata.com
Storybench.org
Dribbble
reddit.com/r/Dataisbeautiful
Blueprints for data-driven
content marketing
Provide a new dimension on a
dataset
How? IMDB + the idea that people want their favourite TV shows to come back on air
335,830
views
142
Linking root domains
Recognise patterns and service
them
How? Combined results from NBN map search + real estate listings
1,000+
New users within 72
hours
Display data in an accessible format
How? Allflicks.net combining IMDB with Netflix library plus filters
1.13k
Linking root domains
● Filterable
● Sortable
● Categorised
● Indexable!
Visualise trends
How? Twitter API + Maptimize mapping engine - onemilliontweetmap.com
426
Linking root domains
Big data analysis ‘taster’
How? Scraped Google to analyse rich snippets + blog post with ‘taste’ of the data
128
Linking root domains
+
Lead source
Want more
ideas?
1. Scrape an online community to get a
list of URLs and their
a. Post titles
b. # of Upvotes
c. # of comments
d. Date posted
2. Mash together the data with social
shares, link data using URL Profiler
3. Analyse the data using pivot tables in
Excel or Google Sheets
Learn how: https://blog.parsehub.com/boost-your-content-marketing-with-web-scraping-and-pivot-tables/
Scrape reddit,
growthhackers.com,
inbound.org, Hacker News
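The analysis step can also be done in a short script instead of a pivot table. The Python sketch below groups scraped posts by the domain they link to and averages the upvotes; the posts are invented sample data, not real results.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Made-up rows standing in for scraped community posts
posts = [
    {"url": "https://example.com/a", "title": "Post A", "upvotes": 120, "comments": 30},
    {"url": "https://example.com/b", "title": "Post B", "upvotes": 80, "comments": 10},
    {"url": "https://example.org/c", "title": "Post C", "upvotes": 50, "comments": 5},
]

# Pivot-style summary: number of posts and total upvotes per linked domain
totals = defaultdict(lambda: {"posts": 0, "upvotes": 0})
for post in posts:
    domain = urlparse(post["url"]).netloc
    totals[domain]["posts"] += 1
    totals[domain]["upvotes"] += post["upvotes"]

average_upvotes = {d: t["upvotes"] / t["posts"] for d, t in totals.items()}
```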
Promoting data-driven
content
Content
Distribution
Supernodes:
Reddit
Digg
Hacker News
Slashdot
Inbound.org
Q&A websites (Quora, etc)
Online communities
Forums
Subreddits
Facebook Groups/Pages
List of content distribution websites: bit.ly/content-distribution-list
Good ol’
fashioned
reachout
Find websites with audiences that
will be interested in your data
Give journalists and bloggers a
unique angle and potentially a
different dimension on the
dataset so they can write their
own unique story
Make contact - don’t be afraid to
use the phone or go for a
coffee
Build email
lists
Even small email lists can be
powerful to spread your content
online
We are always hiring!
finder.com.au/careers
jeremy@finder.com

Data-driven content marketing strategies using web scraping

  • 1. Scraping data to drive content marketing campaigns (without knowing how to code) @jeremycabral
  • 3. Insights from Analyzing 1 Million Articles “Original research based content has the potential to achieve much higher numbers of domain links than other forms of content” - Steve Rayson (Director - BuzzSumo) BuzzSumo Study
  • 4. Priceonomics From price guides to content marketing Pivot to data-driven content marketing 23,000+ linking root domains
  • 5. Price comparison: Airbnb vs Hotels 125 Linking root domains URL: https://priceonomics.com/hotels/
  • 6. The Hipster Music Index 204,219 views 92 Linking root domains URL: https://priceonomics.com/the-hipster-music-index/
  • 7. Data mining fuels fast, cheap and repeatable content marketing ideas
  • 8. But… what if the data you need isn’t available by API or downloadable?
  • 9. Disclaimer Seek legal advice before committing to a scraping project Scraping data could breach the terms of service of a website Scraping at a disruptive rate could slow down or even crash a website
  • 10. What is data scraping? Data scraping is an automated way using scripts and crawlers to 1. Fetch a page 2. Parse the data in that page to extract information 3. Format the data in an organised way 4. Store or export that data to create a dataset (DB, CSV, TXT etc)
  • 11. Patterns in HTML & CSS It’s easier to scrape content broken up by a unique id or class assigned to the element you want to extract
  • 12. Basic overview of XPath XPath can be used to navigate through elements and attributes in a document. It is important to understand how tags are nested, as a scraper will follow this tree. Learn more: https://www.slideshare.net/scrapinghub/xpath-for-web-scraping
  • 13. Finding an API Learn more: http://www.gregreda.com/2015/02/15/web-scraping-finding-the-api/
  • 14. Important Excel analysis skills 1. Match the same data across multiple spreadsheets: a. VLOOKUP b. INDEX MATCH 2. Summarising data a. Pivot Tables b. Charts 3. Cleaning data a. =TRIM() b. =SPLIT() Learn more: ● https://www.distilled.net/excel-for-seo/ ● https://trumpexcel.com/clean-data-in-excel/
  • 16. Web Apps ● Engines / Listings (product data, reviews) ● Search results (with filters applied)
  • 17. Calculators ● Automatically input values with scripts ● Store every calculator results combination
  • 18. Public Datasets ● Upside: easy to download, regularly maintained by others ● Downside: everyone else has easy access to the same data as you
  • 19. APIs ● Upside: everything is structured and (often) documented ● Downside: sometimes not all data is available in an API
  • 20. How to get the data
  • 21. Scraping Frameworks & Languages Popular languages PHP Python Ruby Perl Node.js These are important for your own development or choosing a freelancer Try and use a language your developers are familiar with
  • 22. Simulating the user in the browser ● Selenium Web Driver ● PhantomJS w/CasperJS
  • 23. Data scraping tools. Desktop tools: Scrape Similar, artoo.js, Tabula (extract tables from PDFs), ParseHub (free & paid versions), Screaming Frog, URL Profiler, scripts run on your local machine. Hosted services: Google Sheets (ImportXML, ImportJSON, ImportHTML), Import.io (automatic page scraper), Mozenda (point-and-click screen scraping, Windows only), Diffbot (artificial intelligence), Connotate.
  • 24. Scraping with Google Sheets. Google Sheets formulas (built in): =importXML(url, xpath_query) imports structured data using XPath; =importHTML(url, query, index) imports data from a table or list within an HTML page, where index identifies which table or list in the source code. Learn more: https://www.distilled.net/blog/distilled/guide-to-google-docs-importxml/ =ImportJSON(url, query, parseOptions) imports JSON feeds into Google Sheets (this one is a custom script function rather than a built-in formula). Example feed: http://date.jsontest.com/ { "time": "11:35:24 AM", "milliseconds_since_epoch": 1493552124786, "date": "02-14-2014" } Learn more: http://blog.fastfedora.com/projects/import-json
  • 25. Scraping with Screaming Frog Using custom extraction and filters Learn more: http://www.seerinteractive.com/blog/screaming-frog-guide/
  • 26. Import.io example Run on a frequency you set + stores data historically
  • 27. Predictive model for real estate value Learn more: http://www.louisdorard.com/guest/everyone-can-do-data-science Realtor.com scraped by import.io => cleaned with Pandas => model built by BigML
  • 28. Scrape Similar (“Scraper”) Learn more: http://ipullrank.com/how-to-scrape-every-single-page-on-the-web/
  • 29. Diffbot.com 4 main APIs that use artificial intelligence for data extraction 1. Article: clean text from article, html, author, date info, related images, videos 2. Discussion: content of forum threads, article comments, product reviews 3. Product: pricing information, product IDs, images, product specs 4. Video: Author/uploader, duration, title, description, date uploaded, stats.
  • 30.
  • 31. Getting help with data scraping
  • 33. Briefing a freelancer. Inputs: (1) Project goal. (2) List of URLs: either provide it yourself, or provide an endpoint and a pattern of URLs that you’d like captured. (3) Specific inputs for any filters or data input fields required to capture all the data combinations: form values (numbers, sliders, etc.) and login details. (4) Technical requirements: location of IP when scraping, frequency of scrape, scraping language. Outputs: (1) Where the data will be stored: local file (CSV, TXT), database (SQLite), or on a web server. (2) An example spreadsheet showing how you would like the data presented. (3) Any data manipulation needed to get clean output from the scrape. (4) How the data will be used: an HTML table, or a single-page application (React/AngularJS) embedded with oEmbed.
  • 34. Avoid getting blocked ● Spoof header as Googlebot ● Run scrape from multiple IP addresses ● Run the scrape slowly ● Be careful scraping behind a login
  • 35. Data scraping services. Typically $2k+ per project... ouch! Priceonomics, PromptCloud, Scrapinghub, DataHen
  • 36. How to turn your data into visually appealing content
  • 37. Maps In order to do this you need to capture addresses or latitude/longitude
  • 39. Charts Easiest way to visualise data, hardest to make look sexy with Excel & Google Sheets Source: https://www.labnol.org/software/find-right-chart-type-for-your-data/6523/
  • 40. Tip: Tableau suggests charts Place your data set in Tableau and use the ‘show me’ functionality
  • 41. Interactive Tables Helpful to use a database tool for larger datasets https://tablepress.org/
  • 42. Interactive visualisations Highly engaging and allows the user to filter the data Source: https://www.lowyinstitute.org/lowyinstitutepollinteractive/feelings-towards-other-nations/
  • 43. Inspect element to find frameworks This visualisation is using amcharts.com
  • 44. Interactive visualisation brief example Download the full brief template http://bit.ly/datavizbrief
  • 47. Provide a new dimension on a dataset How? IMDB + the idea that people want their fav tv shows to come back on air 335,830 views 142 Linking root domains
  • 48. Recognise patterns and service them How? Combined results from NBN map search + real estate listings 1,000+ New users within 72 hours
  • 49. Display data in an accessible format How? Allflicks.net combining IMDB with Netflix library plus filters 1.13k Linking root domains ● Filterable ● Sortable ● Categorised ● Indexable!
  • 50. Visualise trends How? Twitter API + Maptimize mapping engine - onemilliontweetmap.com 426 Linking root domains
  • 51. Big data analysis ‘taster’ How? Scraped Google to analyse rich snippets + blog post with ‘taste’ of the data 128 Linking root domains + Lead source
  • 52. Want more ideas? 1. Scrape an online community to get a list of URLs and their a. Post titles b. # of upvotes c. # of comments d. Date posted 2. Mash together the data with social shares and link data using URL Profiler 3. Analyse the data using pivot tables in Excel or Google Sheets Learn how: https://blog.parsehub.com/boost-your-content-marketing-with-web-scraping-and-pivot-tables/ Scrape Reddit, growthhackers.com, inbound.org, Hacker News
  • 54. Content Distribution Supernodes: Reddit Digg Hacker News Slashdot Inbound.org Q&A websites (Quora, etc) Online communities Forums Subreddits Facebook Groups/Pages List of content distribution websites: bit.ly/content-distribution-list
  • 55. Good ol’ fashioned reachout Find websites with audiences that will be interested in your data Give journalists and bloggers a unique angle and potentially a different dimension on the dataset so they can write their own unique story Make contact - don’t be afraid to use the phone or go for a coffee List of content distribution websites: bit.ly/content-distribution-list
  • 56. Build email lists Even small email lists can be powerful to spread your content online
  • 57. We are always hiring! finder.com.au/careers jeremy@finder.com
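
The four steps from slide 10 (fetch, parse, format, store) can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the tooling used in the talk: the class names `name` and `price`, the sample HTML, the User-Agent string and the two-second delay are all assumptions made for the example. The parser relies on the point from slide 11 that content marked with a distinctive class is easy to pick out.

```python
import csv
import io
import time
from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


def fetch(url):
    """Step 1: fetch a page and return its HTML as text."""
    # Hypothetical User-Agent; identify your own scraper honestly.
    request = Request(url, headers={"User-Agent": "example-research-bot"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class ClassTextParser(HTMLParser):
    """Step 2: collect the text of every element carrying a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.target_class in classes:
            self.depth = 1
            self.items.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.items[-1] += data


def extract(html, css_class):
    parser = ClassTextParser(css_class)
    parser.feed(html)
    return [" ".join(item.split()) for item in parser.items]


def to_rows(html):
    """Step 3: format the extracted values as one record per listing."""
    names = extract(html, "name")    # assumed class name
    prices = extract(html, "price")  # assumed class name
    return [{"name": n, "price": p} for n, p in zip(names, prices)]


def to_csv(rows):
    """Step 4: export the records as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def scrape(urls, delay_seconds=2.0):
    """Run all four steps over several pages, pausing between requests."""
    rows = []
    for url in urls:
        rows.extend(to_rows(fetch(url)))
        time.sleep(delay_seconds)  # scrape slowly so the site is not disrupted
    return to_csv(rows)


# Made-up sample page so the parsing steps can be tried without a network call.
SAMPLE_HTML = """
<ul>
  <li><span class="name">Listing A</span> <span class="price">$120</span></li>
  <li><span class="name">Listing B</span> <span class="price">$95</span></li>
</ul>
"""

if __name__ == "__main__":
    print(to_csv(to_rows(SAMPLE_HTML)))
```

Running the file prints a small CSV built from the sample page; swapping `SAMPLE_HTML` for `fetch(url)` applies the same steps to a live page, subject to the legal and rate caveats on slide 9.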

Editor's notes

  1. Originally part of Y-Combinator Blogs Airbnb rates
  2. Combined data from the Pitchfork music review website + Facebook likes for each review. Pitchfork: music reviews for independent music. Facebook likes: artists with the least likes were the most hipster, because the second criterion is that it should be a band you’ve never heard of
  3. Understanding this will help you understand page structures and what’s possible
  4. Data.gov.au. Kaggle, the data science community recently bought by Google, publishes a lot of their datasets
  5. Talk about they aren’t niche Second opinion
  6. https://blog.hartleybrody.com/web-scraping/
  7. Talk about other JSON examples
  8. 30 seconds to produce this
  9. Point and click Can run on a regular basis
  10. Easy-to-use tool for intermediate to advanced users who are comfortable with XPath. More advanced than ImportXML. Allows you to capture more information than is possible in Google Sheets
  11. Pricey. Expect to pay upwards of $2k per project
  12. Where to live next, where they can get the NBN
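
As a rough companion to note 7 and the JSON feed on slide 24, the snippet below shows what importing a JSON feed amounts to: parse the text, then lay the keys out as a header row and the values as a data row. It is a sketch using Python's standard `json` module on the sample response copied from the slide; the helper name `flatten` is made up for the example, and this is not how ImportJSON itself is implemented.

```python
import json
from datetime import datetime, timezone

# Sample response copied from slide 24 (http://date.jsontest.com/).
SAMPLE = (
    '{"time": "11:35:24 AM", '
    '"milliseconds_since_epoch": 1493552124786, '
    '"date": "02-14-2014"}'
)


def flatten(record):
    """Return (headers, values) for a flat JSON object, like one spreadsheet row."""
    headers = list(record.keys())
    values = [record[key] for key in headers]
    return headers, values


record = json.loads(SAMPLE)
headers, values = flatten(record)

# The epoch field is in milliseconds; split off whole seconds before converting.
seconds, _millis = divmod(record["milliseconds_since_epoch"], 1000)
timestamp = datetime.fromtimestamp(seconds, tz=timezone.utc)

if __name__ == "__main__":
    print(headers)
    print(values)
    print(timestamp.isoformat())
```

Nested feeds need more work than this (each nested object has to be expanded into its own columns), which is the part a tool like ImportJSON handles for you.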