SlideShare ist ein Scribd-Unternehmen logo
1 von 64
Downloaden Sie, um offline zu lesen
Web Spam Research: Good Robots vs Bad Robots
Matthew Peters
Scientist
SEOmoz
@mattthemathman
Penguin (and Panda)
Practical SEO considerations
SEOmoz engineering challenges
SEOmoz engineering challenges
Processing
(~4 weeks,
40-200
computers)
Mozscape
Index
Open Site Explorer, Mozscape
API, PRO app
SEOmoz engineering challenges
Due to scale, need an algorithmic approach.
Processing
(~4 weeks,
40-200
computers)
Mozscape
Index
Open Site Explorer, Mozscape
API, PRO app
Goals
Machine Learning 101
Web
crawler
Machine Learning 101
“Features”
Web
crawler
Machine Learning 101
BLACK
BOX
SPAM
NOT
SPAM
??
Machine learning
algorithm
“Features”
Web
crawler
In-link and on-page features
Spam sites (link
farms, fake blogs,
etc)
Legitimate sites
that may have
some spam in-
links
In-link and on-page features
Spam sites (link
farms, fake blogs,
etc)
Legitimate sites
that may have
some spam in-
links
A spam
site with
spam
in/out
links
In-link and on-page features
Spam sites (link
farms, fake blogs,
etc)
Legitimate sites
that may have
some spam in-
links
A legit site
with spam
in links
On-page features
Organized research conferences:
WEBSPAM-UK2006/7, ECML/PKDD 2010 Discovery challenge
On-page features
Ntoulas et al: Detecting Spam Web Pages through Content Analysis, WWW „06
Number of words in title
On-page features
Ntoulas et al: Detecting Spam Web Pages through Content Analysis, WWW „06
Number of words in title
Histogram
(probability
density) of all
pages
On-page features
Ntoulas et al: Detecting Spam Web Pages through Content Analysis, WWW „06
Number of words in title
Histogram
(probability
density) of all
pages
Percent of spam
for each title
length
On-page features
Ntoulas et al: Detecting Spam Web Pages through Content Analysis, WWW „06
Percent of anchor text words
These few features are remarkably effective
(assuming your model is complex enough)
Erdélyi et al: Web Spam Classification: a Few Features Worth More, WebQuality 2011
On-page features > in-link features
Erdélyi et al: Web Spam Classification: a Few Features Worth More, WebQuality 2011
In-link features (mozTrust)
Gyöngyi et al: Combating Web Spam with TrustRank, 2004
See also: Abernethy et al: Graph regularization methods for Web spam detection, 2010
Seed
site
High
mozTrust
mozTrust (TrustRank)
measures the average
distance from a trusted set
of “seed” sites
Moderate
mozTrust
Moderate
mozTrust
Anchor text
Ryan Kent: http://www.seomoz.org/blog/identifying-link-penalties-in-2012
Are these still relevant today?
Banned: manual penalty and removed from index.
KurtisBohrnstedt: http://www.seomoz.org/blog/web-directory-submission-danger
Google penalized sites
Penalized sites: algorithmic penalty
demoted off first page.
I will group both banned an penalized
sites together and call them simply
“penalized.”
KurtisBohrnstedt: http://www.seomoz.org/blog/web-directory-submission-danger
Data sources
47K sites
Mozscape (200 mil)
Stratified
sample by
mozRank
Directory +
suspected SPAM (3K)
Data sources
47K sites
Mozscape (200 mil)
Wikipedia,
SEMRush
5 pages /
site
Stratified
sample by
mozRank
Directory +
suspected SPAM (3K)
Data sources
47K sites
Mozscape (200 mil)
Wikipedia,
SEMRush
Filter by
HTTP 200,
English
22K sites
5 pages /
site
Stratified
sample by
mozRank
Directory +
suspected SPAM (3K)
Data sources
47K sites
Mozscape (200 mil)
Wikipedia,
SEMRush
Filter by
HTTP 200,
English
22K sites
5 pages /
site
Stratified
sample by
mozRank
Directory +
suspected SPAM (3K)
Results
(show me the graphs!)
Overall results
Overall 17% of sites are penalized, 5% banned
mozTrust
mozTrust is a strong predictor of spam
mozTrustvsmozRank
mozRank is also a strong predictor, although not a good as mozTrust
In-links
In-links increase = Spam decrease, except for some sites with many internal links
The trend in domain size is similar
Domain size
Linking root domains exhibit the same overall trend as linking URLs
Link diversity – Linking root domains
Anchor text
Simple heuristic for
branded/organic anchor text:
(1) Strip off all sub-domains, TLD
extensions, paths from URLs.
Remove white space.
(2) Exact or partial match
between the result and the
target domain.
(3) Use a specified list for
“organic” (click, here, …)
(4) Compute the percentage of
unbranded anchor text.
There are some more technical details (certain symbols are removed, another heuristic for acronyms), but this is the main idea.
Anchor text
Large percent of
unbranded anchor
text is a spam signal.
Anchor text
Large percent of
unbranded anchor
text is a spam signal.
Mix of branded and
unbranded anchor
text is best.
Entire in-link profile
18
55
48
22
18
28
37
62
68
Entire in-link profile
On-page features – Anchor text
Internal vs External Anchor text
Higher spam percent
for sites without
internal anchor text
Internal vs External Anchor text
Higher spam percent
for sites without
internal anchor text
Spam increases with
external anchor text
Title characteristics
Unlike in 2006, the length of the title isn‟t very informative.
Number of words
Short documents are much more likely to be spam now.
Visible ratio
Spam percent increases for visible ratio above 25%.
Plus lots of other features…
Commercial intent
Commercial intent
Idea: measure the “commercial
intent” using lists of high CPC
and search volume queries
Commercial intent
Unfortunately the results are inconclusive. With hindsight, we need a larger data set.
What features are missing?
What features are missing?
“Clean money
provides detailed
info concerning
online monetary
unfold Betting
corporations”
Huh?
What features are missing?
0 comments, no
shares or tweets.
If this was a real blog, it
would have some user
interaction
What features are missing?
There’s something
strange about these
sidebar links…
How well can we model spam with these features?
How well can we model spam with these features?
Quite well!
Using a logistic regression model, we can obtain
86%1accuracy and 0.82 AUC using just 32 features
(11 in-link features and 21 on-page features).
How well can we model spam with these features?
Quite well!
Using a logistic regression model, we can obtain
86%1accuracy and 0.82 AUC using just 32 features
(11 in-link features and 21 on-page features).
1 Well, we can get 83% accuracy by always choosing not-spam so
accuracy isn’t the best measure. The 0.82 AUC is quite good for such
a simple model.
Overfitting was controlled with L2 regularization and k-fold cross-validation.
More sophisticated modeling
Logistic
SPAM
NOT
SPAM
??
Logistic
In-link features
On-page features
M
i
x
t
u
r
e
Can use a mixture of logistic models, one for in-link
and one for on-page. Use EM to set parameters.
More sophisticated modeling
90% penalized
50% in-link
50% on-page
More sophisticated modeling
65% penalized
85% in-link
15% on-page
A mixture of logistic models attributes “responsibility” to both the in-link and on-page
features as well as predicts the likelihood of a penalty.
Takeaways!
“Unnatural” sites or link profiles
With lots of data,
“unnatural” sites or link
profiles are moderately easy
to detect algorithmically.
You are at risk to be
penalized if you build
obvious low quality links.
MozTrust Rules!
mozTrust is a good predictor of spam. Be
careful if you are building links from low
mozTrust sites.
mozTrust, an engineering feat of
awesomeness.
SEOmoz Tools Future
We hope to have a spam score of
some sort available in Mozscape in the
future.
In the more near term, we plan to
repurpose some of this work for
improving Freshscape.
Matthew Peters
Scientist
SEOmoz
@mattthemathman

Weitere ähnliche Inhalte

Was ist angesagt?

How to quickly spy on your competition
How to quickly spy on your competitionHow to quickly spy on your competition
How to quickly spy on your competitionSteve Williams
 
Onpage and Offpage SEO
Onpage and Offpage SEOOnpage and Offpage SEO
Onpage and Offpage SEONikhil Pevekar
 
A SURVEY ON WEB SPAM DETECTION METHODS: TAXONOMY
A SURVEY ON WEB SPAM DETECTION METHODS: TAXONOMYA SURVEY ON WEB SPAM DETECTION METHODS: TAXONOMY
A SURVEY ON WEB SPAM DETECTION METHODS: TAXONOMYIJNSA Journal
 
Seojocktoberfest - link health audits and organic traffic recovery - Scott McLay
Seojocktoberfest - link health audits and organic traffic recovery - Scott McLaySeojocktoberfest - link health audits and organic traffic recovery - Scott McLay
Seojocktoberfest - link health audits and organic traffic recovery - Scott McLayYard
 
Done rerea dlink spam alliances good
Done rerea dlink spam alliances goodDone rerea dlink spam alliances good
Done rerea dlink spam alliances goodJames Arnold
 
SEO 101 - Google Search Console Explained
SEO 101 - Google Search Console Explained SEO 101 - Google Search Console Explained
SEO 101 - Google Search Console Explained Steve Weber
 
Quick Technical SEO Audit Checklist - Peter Handley Brighton SEO April 2014
Quick Technical SEO Audit Checklist - Peter Handley Brighton SEO April 2014Quick Technical SEO Audit Checklist - Peter Handley Brighton SEO April 2014
Quick Technical SEO Audit Checklist - Peter Handley Brighton SEO April 2014Peter Handley
 
A Simple Guide to SEO
A Simple Guide to SEOA Simple Guide to SEO
A Simple Guide to SEOChris Walton
 
Recruiting using Social media Cpl
Recruiting using Social media Cpl Recruiting using Social media Cpl
Recruiting using Social media Cpl Peter Cosgrove
 
This presentation is based on alan november’s book
This presentation is based on alan november’s bookThis presentation is based on alan november’s book
This presentation is based on alan november’s bookcampbelltricia
 
SEO Tools For Marketers - Seo tools for you
SEO Tools For Marketers - Seo tools for youSEO Tools For Marketers - Seo tools for you
SEO Tools For Marketers - Seo tools for youUy Hoàng
 
Social Websites And Seo Social Dev Camp Chicago2008 By John Fairley
Social Websites And Seo Social Dev Camp Chicago2008 By John FairleySocial Websites And Seo Social Dev Camp Chicago2008 By John Fairley
Social Websites And Seo Social Dev Camp Chicago2008 By John FairleyJohn Fairley
 
Link Building Clinic - A Website Backlink Health Clinic
Link Building Clinic - A Website Backlink Health ClinicLink Building Clinic - A Website Backlink Health Clinic
Link Building Clinic - A Website Backlink Health Clinicinteractivitymarketing
 
Advanced internet search
Advanced internet searchAdvanced internet search
Advanced internet searchMegan Heuer
 
ARTDM 171, Week 15: Search Engine Optimization (SEO)
ARTDM 171, Week 15: Search Engine Optimization (SEO)ARTDM 171, Week 15: Search Engine Optimization (SEO)
ARTDM 171, Week 15: Search Engine Optimization (SEO)Gilbert Guerrero
 
CSM Module 2: Branding and identity
CSM Module 2: Branding and identityCSM Module 2: Branding and identity
CSM Module 2: Branding and identityJulian Matthews
 

Was ist angesagt? (20)

How to quickly spy on your competition
How to quickly spy on your competitionHow to quickly spy on your competition
How to quickly spy on your competition
 
Seo onpage & offpage
Seo onpage & offpage Seo onpage & offpage
Seo onpage & offpage
 
Onpage and Offpage SEO
Onpage and Offpage SEOOnpage and Offpage SEO
Onpage and Offpage SEO
 
A SURVEY ON WEB SPAM DETECTION METHODS: TAXONOMY
A SURVEY ON WEB SPAM DETECTION METHODS: TAXONOMYA SURVEY ON WEB SPAM DETECTION METHODS: TAXONOMY
A SURVEY ON WEB SPAM DETECTION METHODS: TAXONOMY
 
prestiva_blackhat
prestiva_blackhatprestiva_blackhat
prestiva_blackhat
 
Seojocktoberfest - link health audits and organic traffic recovery - Scott McLay
Seojocktoberfest - link health audits and organic traffic recovery - Scott McLaySeojocktoberfest - link health audits and organic traffic recovery - Scott McLay
Seojocktoberfest - link health audits and organic traffic recovery - Scott McLay
 
Done rerea dlink spam alliances good
Done rerea dlink spam alliances goodDone rerea dlink spam alliances good
Done rerea dlink spam alliances good
 
SEO 101 - Google Search Console Explained
SEO 101 - Google Search Console Explained SEO 101 - Google Search Console Explained
SEO 101 - Google Search Console Explained
 
Quick Technical SEO Audit Checklist - Peter Handley Brighton SEO April 2014
Quick Technical SEO Audit Checklist - Peter Handley Brighton SEO April 2014Quick Technical SEO Audit Checklist - Peter Handley Brighton SEO April 2014
Quick Technical SEO Audit Checklist - Peter Handley Brighton SEO April 2014
 
A Simple Guide to SEO
A Simple Guide to SEOA Simple Guide to SEO
A Simple Guide to SEO
 
Recruiting using Social media Cpl
Recruiting using Social media Cpl Recruiting using Social media Cpl
Recruiting using Social media Cpl
 
This presentation is based on alan november’s book
This presentation is based on alan november’s bookThis presentation is based on alan november’s book
This presentation is based on alan november’s book
 
SEO Tools For Marketers - Seo tools for you
SEO Tools For Marketers - Seo tools for youSEO Tools For Marketers - Seo tools for you
SEO Tools For Marketers - Seo tools for you
 
Social Websites And Seo Social Dev Camp Chicago2008 By John Fairley
Social Websites And Seo Social Dev Camp Chicago2008 By John FairleySocial Websites And Seo Social Dev Camp Chicago2008 By John Fairley
Social Websites And Seo Social Dev Camp Chicago2008 By John Fairley
 
Seoptimizing
SeoptimizingSeoptimizing
Seoptimizing
 
Link Building Clinic - A Website Backlink Health Clinic
Link Building Clinic - A Website Backlink Health ClinicLink Building Clinic - A Website Backlink Health Clinic
Link Building Clinic - A Website Backlink Health Clinic
 
Advanced internet search
Advanced internet searchAdvanced internet search
Advanced internet search
 
ARTDM 171, Week 15: Search Engine Optimization (SEO)
ARTDM 171, Week 15: Search Engine Optimization (SEO)ARTDM 171, Week 15: Search Engine Optimization (SEO)
ARTDM 171, Week 15: Search Engine Optimization (SEO)
 
CSM Module 2: Branding and identity
CSM Module 2: Branding and identityCSM Module 2: Branding and identity
CSM Module 2: Branding and identity
 
Basic SEO
Basic SEO Basic SEO
Basic SEO
 

Andere mochten auch

SMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning ApproachSMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning ApproachReza Rahimi
 
Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014StampedeCon
 
Spamming and Spam Filtering
Spamming and Spam FilteringSpamming and Spam Filtering
Spamming and Spam FilteringiNazneen
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningLior Rokach
 

Andere mochten auch (6)

SMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning ApproachSMS Spam Filter Design Using R: A Machine Learning Approach
SMS Spam Filter Design Using R: A Machine Learning Approach
 
Spam Filtering
Spam FilteringSpam Filtering
Spam Filtering
 
Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014
 
What is SPAM?
What is SPAM?What is SPAM?
What is SPAM?
 
Spamming and Spam Filtering
Spamming and Spam FilteringSpamming and Spam Filtering
Spamming and Spam Filtering
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 

Ähnlich wie Algorithmic Web Spam detection - Matt Peters MozCon

Getting the Most out of Linkscape
Getting the Most out of LinkscapeGetting the Most out of Linkscape
Getting the Most out of LinkscapeNick Gerner
 
Diagnose SEO Issues with Live Search Webmaster Tools
Diagnose SEO Issues with Live Search Webmaster ToolsDiagnose SEO Issues with Live Search Webmaster Tools
Diagnose SEO Issues with Live Search Webmaster ToolsNathan Buggia
 
Introduction to search_marketing
Introduction to search_marketingIntroduction to search_marketing
Introduction to search_marketingBill Hunt
 
SEO Presentation
SEO PresentationSEO Presentation
SEO Presentationganeh17
 
SEO-HIGH TRAFFIC ROUTING
SEO-HIGH TRAFFIC ROUTINGSEO-HIGH TRAFFIC ROUTING
SEO-HIGH TRAFFIC ROUTINGBUDNET
 
Seo Audit Report Mudgal Analytics
Seo Audit Report Mudgal AnalyticsSeo Audit Report Mudgal Analytics
Seo Audit Report Mudgal AnalyticsBruce Clay
 
Search engine optimisation
Search engine optimisationSearch engine optimisation
Search engine optimisationrobclarkson
 
Crawling, Indicizzazione e SEO - Paolo Ramazzotti
Crawling, Indicizzazione e SEO - Paolo RamazzottiCrawling, Indicizzazione e SEO - Paolo Ramazzotti
Crawling, Indicizzazione e SEO - Paolo RamazzottiGimasi Sa
 
Il processo di Crawilng e Indexing di Google - Paolo Ramazzotti
Il processo di Crawilng e Indexing di Google - Paolo RamazzottiIl processo di Crawilng e Indexing di Google - Paolo Ramazzotti
Il processo di Crawilng e Indexing di Google - Paolo RamazzottiPaolo Ramazzotti
 
Search Engine Optimization Tips: SEO Tips For Beginners in 2015
Search Engine Optimization Tips: SEO Tips For Beginners in 2015Search Engine Optimization Tips: SEO Tips For Beginners in 2015
Search Engine Optimization Tips: SEO Tips For Beginners in 2015waqas ahmad
 
Search Engine Optimization (SEO)
Search Engine Optimization (SEO)Search Engine Optimization (SEO)
Search Engine Optimization (SEO)Christopher Mbinda
 
Search Engine Optimisation
Search Engine OptimisationSearch Engine Optimisation
Search Engine Optimisationneilmoreilly
 
SEO: search Engine Optimization
SEO: search Engine OptimizationSEO: search Engine Optimization
SEO: search Engine Optimizationphoolchand yadav
 
Www snapdeal com-report
Www snapdeal com-reportWww snapdeal com-report
Www snapdeal com-reportMahipSingh13
 

Ähnlich wie Algorithmic Web Spam detection - Matt Peters MozCon (20)

Getting the Most out of Linkscape
Getting the Most out of LinkscapeGetting the Most out of Linkscape
Getting the Most out of Linkscape
 
Diagnose SEO Issues with Live Search Webmaster Tools
Diagnose SEO Issues with Live Search Webmaster ToolsDiagnose SEO Issues with Live Search Webmaster Tools
Diagnose SEO Issues with Live Search Webmaster Tools
 
TrustRank.PDF
TrustRank.PDFTrustRank.PDF
TrustRank.PDF
 
Introduction to search_marketing
Introduction to search_marketingIntroduction to search_marketing
Introduction to search_marketing
 
Seo questions
Seo questionsSeo questions
Seo questions
 
SEO Presentation
SEO PresentationSEO Presentation
SEO Presentation
 
SEO-HIGH TRAFFIC ROUTING
SEO-HIGH TRAFFIC ROUTINGSEO-HIGH TRAFFIC ROUTING
SEO-HIGH TRAFFIC ROUTING
 
Seo Audit Report Mudgal Analytics
Seo Audit Report Mudgal AnalyticsSeo Audit Report Mudgal Analytics
Seo Audit Report Mudgal Analytics
 
Search engine optimisation
Search engine optimisationSearch engine optimisation
Search engine optimisation
 
Crawling, Indicizzazione e SEO - Paolo Ramazzotti
Crawling, Indicizzazione e SEO - Paolo RamazzottiCrawling, Indicizzazione e SEO - Paolo Ramazzotti
Crawling, Indicizzazione e SEO - Paolo Ramazzotti
 
Il processo di Crawilng e Indexing di Google - Paolo Ramazzotti
Il processo di Crawilng e Indexing di Google - Paolo RamazzottiIl processo di Crawilng e Indexing di Google - Paolo Ramazzotti
Il processo di Crawilng e Indexing di Google - Paolo Ramazzotti
 
seo.ppt
seo.pptseo.ppt
seo.ppt
 
Search Engine Optimization Tips: SEO Tips For Beginners in 2015
Search Engine Optimization Tips: SEO Tips For Beginners in 2015Search Engine Optimization Tips: SEO Tips For Beginners in 2015
Search Engine Optimization Tips: SEO Tips For Beginners in 2015
 
Seo beginners
Seo beginners Seo beginners
Seo beginners
 
Search Engine Optimization (SEO)
Search Engine Optimization (SEO)Search Engine Optimization (SEO)
Search Engine Optimization (SEO)
 
Search Engine Optimisation
Search Engine OptimisationSearch Engine Optimisation
Search Engine Optimisation
 
SEO: search Engine Optimization
SEO: search Engine OptimizationSEO: search Engine Optimization
SEO: search Engine Optimization
 
Search engine optimization
Search engine optimization Search engine optimization
Search engine optimization
 
Beginners seo gs v3
Beginners seo gs v3Beginners seo gs v3
Beginners seo gs v3
 
Www snapdeal com-report
Www snapdeal com-reportWww snapdeal com-report
Www snapdeal com-report
 

Kürzlich hochgeladen

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.francesco barbera
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?SANGHEE SHIN
 
RAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIRAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIUdaiappa Ramachandran
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceMartin Humpolec
 

Kürzlich hochgeladen (20)

How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?
 
RAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIRAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AI
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your Salesforce
 

Algorithmic Web Spam detection - Matt Peters MozCon

Hinweis der Redaktion

  1. ** EMPHASIZE SCALE PROBLEM. Explain what we mean by a web crawler
  2. ** EMPHASIZE SCALE PROBLEM
  3. ** EMPHASIZE SCALE PROBLEM
  4. Concerns:(1) Is it relevant/how does it compare to Google?This talk directly compares against sites banned/penalized by Google. Not sure final form of the metric yet, but will compare with Google’s.(2) You should be focusing on other challenges instead.We are focusing on a larger, fresher more consistent index first. This just a research project now that needs some engineering work to scale. Once we have solved the scaling challenges in a fresher more consistent index, we will address those scaling problems.(3) What about outing millions of sites?Two things: (a) delta score (b) search engines have the information now, we believe in transparency to make it available to all.
  5. ** feature engineering is interesting from intuitive perspective** black box isn’t black at all, rather transparent since I coded it myself. But for purposes of this talk is a black box** in practice this is iterative
  6. ** feature engineering is interesting from intuitive perspective** black box isn’t black at all, rather transparent since I coded it myself. But for purposes of this talk is a black box** in practice this is iterative
  7. ** feature engineering is interesting from intuitive perspective** black box isn’t black at all, rather transparent since I coded it myself. But for purposes of this talk is a black box** in practice this is iterative
  8. ** emphasize guidelines, organized publically available data
  9. ** explain both mozTrust and mozRank** mention graph regularization paper
  10. ** stratified sample by domainmozRank
  11. ** stratified sample by domainmozRank
  12. ** stratified sample by domainmozRank
  13. ** stratified sample by domainmozRank