JR Oakes | @jroakes | #TechSEOBoost
JR Oakes
Building a Simple Crawler on
a Toy Internet
About Me
Senior Director, Technical SEO Research, at
@LocomotiveSEO
Passionate about:
• Development
• Learning
• Community
• Technology
About Me
• Write some and do the Twitter thing.
• Share as much as I can on GitHub.
• Love to organize meetups
• Always testing something
• Love the brilliant team at Locomotive
What we will learn
• Overview of Crawling Landscape
• Key Components of Crawler
• Building a Toy Internet
• Building a Crawler and Renderer
Overview of Crawling
Landscape
The Web is Big
We have worked on sites with as many as a
billion potential pages. Google only crawls
(or knows about) a fraction of those.
• Crawled
• Want to Crawl (frontier)
• Unseen (or not wanted to be seen)
Ref: [Crawling and Duplicates, Chris Manning and Pandu Nayak] (http://web.stanford.edu/class/cs276/19handouts/lecture18-crawling.ppt)
The Web is Big
PageRank (or another node popularity metric) is a
good way to decide how deep to crawl.
The hypothesis is that a measure of node
popularity can be used to deprioritize links
from very unpopular nodes.
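To make the popularity idea concrete, here is a minimal PageRank sketch using plain power iteration. The link graph, the function name `pagerank`, and the damping factor of 0.85 are illustrative assumptions, not the code from the talk.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each node to the list of nodes it links to."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page site; "orphan" has no inbound links.
toy_links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

A crawler could use scores like these to push links found on low-ranked pages toward the back of the frontier.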
The Web is Big
Google has over 25 BILLION results in
their inverted index.
What a crawler must do
• Be robust. Handle spider traps and malicious behavior.
• Be distributed. Run across many machines.
• Be scalable. Easy to add more machines.
• Be efficient. Use network and processing resources wisely.
• Prioritize. Know the quality and priority of pages.
• Operate continuously.
• Be adaptable. Easy to change with new data / web needs.
• Be a good citizen. Respect robots.txt and server load.
Ref: [Crawling and Duplicates, Chris Manning and Pandu Nayak] (http://web.stanford.edu/class/cs276/19handouts/lecture18-crawling.ppt)
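The "good citizen" requirement can be sketched with Python's standard-library `urllib.robotparser`. The robots.txt rules, the `toy-crawler` user agent, and the example.com URLs below are made up for illustration; a real crawler would fetch the file from each host instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_policy(robots_txt):
    """Parse robots.txt text into a parser that can answer can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = build_policy(ROBOTS_TXT)
allowed = policy.can_fetch("toy-crawler", "https://example.com/blog/post")
blocked = policy.can_fetch("toy-crawler", "https://example.com/private/data")
print(allowed, blocked)
```

Checking `can_fetch` before every request, and spacing out requests per host, covers the two halves of politeness named above.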
Key Components of
Crawler
Basic Crawl Architecture
Ref: [Crawling and Duplicates, Chris Manning and Pandu Nayak] (http://web.stanford.edu/class/cs276/19handouts/lecture18-crawling.ppt)
My Inferred Crawl Architecture
Hard to believe Google is wasting
resources to render something
that has not changed in 40 years.
Key Learnings
• The frontier is split into two sections: a Front Queue that manages priority and a Back Queue that manages politeness
• All queues are FIFO
• Each host has its own Back Queue
• MinHashes (sketches) are an effective way of deduping content
• Duplicates vs. near duplicates are measured by edit distance
• Everything is cached to reduce latency
• URL normalization is handled at the parser (e.g. /page-path/ to https://domain/page-path/)
• Interesting things can happen in the DOM that are missed by just parsing the retrieved URL
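The front queue / back queue split can be sketched in a few lines. This is a simplified, single-threaded illustration: the class name `Frontier`, the fixed per-host delay, and the eager routing from front to back queues are assumptions made for brevity, not the architecture from the referenced lecture or the talk's code.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    """Toy frontier: FIFO front queues for priority, per-host back queues for politeness."""

    def __init__(self, num_priorities=3, delay=1.0):
        self.front = [deque() for _ in range(num_priorities)]  # index 0 = highest priority
        self.back = {}          # host -> FIFO deque of URLs waiting for that host
        self.heap = []          # (earliest allowed fetch time, host)
        self.next_allowed = {}  # host -> earliest time we may hit it again
        self.delay = delay
        self.seen = set()

    def add(self, url, priority=0):
        if url not in self.seen:
            self.seen.add(url)
            self.front[priority].append(url)

    def _route_to_back_queues(self):
        # Drain front queues in priority order into the matching host queue.
        for queue in self.front:
            while queue:
                url = queue.popleft()
                host = urlsplit(url).netloc
                if host not in self.back:
                    self.back[host] = deque()
                    heapq.heappush(self.heap, (self.next_allowed.get(host, 0.0), host))
                self.back[host].append(url)

    def next_url(self, now):
        """Return the next URL that may be fetched at time `now`, or None."""
        self._route_to_back_queues()
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        url = self.back[host].popleft()
        self.next_allowed[host] = now + self.delay
        if self.back[host]:
            heapq.heappush(self.heap, (self.next_allowed[host], host))
        else:
            del self.back[host]
        return url


# Hypothetical URLs; times are plain floats so the example is deterministic.
frontier = Frontier(delay=1.0)
frontier.add("https://a.example/low", priority=1)
frontier.add("https://a.example/high", priority=0)
frontier.add("https://b.example/page", priority=1)
order = [frontier.next_url(now=t) for t in (0.0, 0.0, 0.5, 1.0)]
print(order)
```

The heap holds one entry per host with pending URLs, so a busy host never blocks other hosts and is never hit again before its delay has passed.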
Building a Toy Internet
Criteria
• Build quickly with topically similar pages for
each site
• Exist on separate domains
• Linked to each other, but not to any other
pages on the internet
• Contain basic SEO elements like title,
description, canonical, etc
Solution
• GitHub Pages
• Jekyll
• Wikipedia
• Python
• search-engine-optimization-blog.github.io
• data-science-blog.github.io
• python-software.github.io
PBN Maker 3000
Building a Crawler and
Renderer
Step One
I have no idea how to start. So
let’s do some research.
I <3 Github
Step Two
I don’t want to reinvent the wheel,
so let’s see what is already out
there that I can use.
Step Three
A lot of coffee
… and some beer.
A little help along the way
Streamlit is the first app
framework specifically for
Machine Learning and
Data Science teams.
So you can stop spending time on
frontend development and get
back to what you do best.
Criteria
• Use existing libraries where possible
• Be hardy enough to crawl my toy internet
• Make it as simple and approachable as possible (e.g. I use Pandas
a lot)
• Stay as true as possible to what Google is known to do
• Process linearly. No threading or extra services
• Include unit testing
• Include a Jupyter Notebook
• Include READMEs
• Include a simple indexer and search apparatus to play with results
(Thanks John M.!)
Parts
• PageRank
• Chrome Headless Rendering
• Text NLP Normalization
• BERT Embeddings
• Robots
• Duplicate Content Shingling
• URL Hashing
• Document Frequency Functions (BM25 and TF-IDF)
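As an illustration of the document frequency functions in the list above, here is a small sketch of Okapi BM25 scoring. The whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the three sample documents are assumptions for the example; the crawler's own indexer may well differ.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


# Made-up documents loosely themed on the three toy sites.
docs = [
    "python crawler for a toy internet",
    "data science blog about pandas",
    "search engine optimization blog",
]
scores = bm25_scores("python crawler", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Rare terms get a higher IDF weight, and the length normalization keeps long documents from winning just by repeating a term.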
Learnings
Embeddings
https://github.com/huggingface/transformers
Learnings
• Applying PageRank to clusters of similar documents is an effective way of picking the right one.
• Deciding where to process and where (and when) to update values is hard (e.g. canonical tags for crawling and consolidation in HTML vs. rendered).
• Index compression techniques made my eyes glaze over.
• BERT models need all (or most) of the content.
• BERT is easily accessible.
• I made some things way simpler than they would be in real life.
• SentencePiece and BPE encoding are revolutionary for indexes and NLG.
• A minor code change can make the crawler go crazy. Hats off to Google and Screaming Frog.
• MinHash comparison made it easy to compare the rendered page against the crawled one.
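The MinHash comparison in the last point can be sketched as follows: split each text into word shingles, keep the minimum hash per seed as a signature, and compare signatures position by position. The shingle size, the 64 hash functions, the use of salted `hashlib.blake2b`, and the sample strings are all assumptions for illustration, not the crawler's actual implementation.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(shingle_set, num_hashes=64):
    """One minimum hash per seed; each salt acts as a different hash function."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest()
            for s in shingle_set
        ))
    return signature


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


# Hypothetical texts: raw HTML text, the rendered version, and an unrelated page.
raw_html = "building a simple crawler on a toy internet with python"
rendered = raw_html + " and pandas"
unrelated = "index compression techniques made my eyes glaze over"

sig_raw = minhash_signature(shingles(raw_html))
sim_same = estimated_similarity(sig_raw, sig_raw)
sim_near = estimated_similarity(sig_raw, minhash_signature(shingles(rendered)))
sim_unrelated = estimated_similarity(sig_raw, minhash_signature(shingles(unrelated)))
print(sim_same, sim_near, sim_unrelated)
```

A high score between the fetched and rendered text suggests rendering changed little, while a low score flags pages where JavaScript altered the content.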
Result
A crawler written in Python that we are releasing as
open source.
Keep in mind:
1. This was written in a month
2. Google engineers would laugh at it
3. It probably has bugs
4. It is really fun to play around with
Result
We also built a simple UI in
Streamlit so you can play
around with the results and
parameters.
Result
Complete with Ads!
Thank You
Start playing at the link below
https://locomotive.agency/coal-crawler-renderer-indexer-caboose
–
Find me on Twitter at: @jroakes
