Googlebot has been put on a diet of URLs by the Scheduler in the web crawling system. If your URLs are not on the list, Googlebot is not coming in. The ever-increasing flood of content onto the internet means the web pages and files Googlebot visits have to be prioritised. Are you telling Googlebot and the URL Scheduler that your content is less important than it really is through technical SEO and architectural blunders? To 'talk to the spider' and gain more from your time with it, you need to understand the personas of Googlebot and the Scheduler, and the jobs they do.
SAScon Beta 2015, Dawn Anderson - Talk To The Spider
1. TALK TO THE SPIDER
Why Googlebot & The URL Scheduler Should Be Amongst Your Key Personas And How To Train Them
Dawn Anderson @dawnieando
2. THE KEY PERSONAS
9 types of Googlebot
SUPPORTING ROLES: Indexer / Ranking Engine, The URL Scheduler, History Logs, Link Logs, Anchor Logs
3. GOOGLEBOT'S JOBS
'Ranks nothing at all'
Takes a list of URLs to crawl from the URL Scheduler
Job varies based on 'bot' type
Runs errands & makes deliveries for the URL server, indexer / ranking engine and logs
Makes notes of outbound linked pages and additional links for future crawling
Takes notes of 'hints' from the URL Scheduler when crawling
Tells tales of URL accessibility status and server response codes, notes relationships between links, and collects content checksums (the binary data equivalent of web content) for comparison with past visits by the history and link logs (see the checksum sketch below)
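To illustrate the checksum idea above (a minimal sketch, not Google's actual implementation; the URL list and the store of previous hashes are hypothetical), a crawler can hash the fetched body and compare it with the hash recorded on the previous visit to decide whether the page has materially changed:

    import hashlib
    import urllib.request

    def content_checksum(url):
        # Fetch the page body and reduce it to a fixed-length fingerprint.
        body = urllib.request.urlopen(url, timeout=10).read()
        return hashlib.sha256(body).hexdigest()

    # Hypothetical store of checksums recorded on the previous visit.
    previous_checksums = {"https://example.com/": "..."}

    for url, old in previous_checksums.items():
        new = content_checksum(url)
        print(url, "changed" if new != old else "unchanged since last visit")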
4. ROLES – MAJOR PLAYERS – A 'BOSS' – THE URL SCHEDULER
Think of it as Google's line manager or 'air traffic controller' for Googlebots in the web crawling system
Schedules Googlebot visits to URLs
Decides which URLs to 'feed' to Googlebot
Uses data from the history logs about past visits
Assigns visit regularity of Googlebot to URLs
Drops 'hints' to Googlebot to guide on types of content NOT to crawl, and excludes some URLs from schedules
Analyses past 'change' periods and predicts future 'change' periods for URLs for the purposes of scheduling Googlebot visits
Checks 'page importance' when scheduling visits
Assigns URLs to 'layers / tiers' for crawling schedules
5. TOO MUCH CONTENT
The indexed web contains at least 4.73 billion pages (13/11/2015)
[Chart: total number of websites, 2000-2014, rising towards 1,000,000,000]
SINCE 2013 THE WEB IS THOUGHT TO HAVE INCREASED IN SIZE BY 1/3
6. TOO MUCH CONTENT
Capacity limits on Google's crawling system
How have search engines responded?
By prioritising URLs for crawling
By assigning crawl period intervals to URLs
By creating work 'schedules' for Googlebots
7. GOOGLE CRAWL SCHEDULER PATENTS
Include:
'Managing items in a crawl schedule'
'Scheduling a recrawl'
'Web crawler scheduler that utilizes sitemaps from websites'
'Document reuse in a search engine crawler'
'Minimizing visibility of stale content in web searching including revising web crawl intervals of documents'
'Scheduler for search engine'
8. MANAGING ITEMS IN A CRAWL SCHEDULE (GOOGLE PATENT)
3 layers / tiers:
Real Time Crawl – crawled multiple times daily
Daily Crawl – crawled daily or bi-daily
Base Layer Crawl – crawled least, on a 'round robin' basis; split into segments on random rotation, and only the 'active' segment is crawled
URLs are moved in and out of layers based on past visits data (see the sketch below)
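A toy sketch of the layered-scheduling idea described on this slide (the thresholds, scores and field names are assumptions for illustration, not values from the patent): URLs with high importance and a high predicted probability of critical change move towards the real-time layer, while the rest fall back to the base layer.

    # Toy model of the three crawl layers; thresholds are illustrative only.
    def assign_layer(url_record):
        importance = url_record["importance"]            # link-equity style score, 0..1
        change_prob = url_record["change_probability"]   # predicted from history logs, 0..1
        score = importance * change_prob
        if score > 0.6:
            return "real_time"
        if score > 0.3:
            return "daily"
        return "base_layer"

    urls = [
        {"url": "/home", "importance": 0.9, "change_probability": 0.8},
        {"url": "/old-press-release", "importance": 0.2, "change_probability": 0.05},
    ]
    for record in urls:
        print(record["url"], "->", assign_layer(record))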
9. GOOGLEBOT'S BEEN PUT ON A URL-CONTROLLED DIET
The Scheduler checks URLs for 'importance', 'boost factor' candidacy and 'probability of modification'
The URL Scheduler controls the meal planner
It carefully controls the list of URLs Googlebot visits
'Budgets' are allocated
10. CRAWL BUDGET
WHAT IS A CRAWL BUDGET? An allocation of 'crawl visit frequency' apportioned to the URLs on a site
Apportioned by the URL Scheduler to Googlebots
Roughly proportionate to page importance (link equity) & speed
Pages with a lot of healthy links get crawled more (can include internal links??)
But there are other factors affecting the frequency of Googlebot visits aside from importance / speed
The vast majority of URLs on the web don't get a lot of budget allocated to them (a toy allocation sketch follows below)
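As a back-of-the-envelope illustration of 'roughly proportionate to importance and speed' (the weighting is an assumption made purely for illustration, not a documented formula), a fixed number of daily crawl visits could be shared out like this:

    # Hypothetical: share a fixed daily number of crawl visits across URLs
    # in proportion to an importance score, discounted by slow response times.
    def allocate_crawl_budget(urls, total_visits_per_day=1000):
        weights = {
            u["url"]: u["importance"] / max(u["response_time_s"], 0.1)
            for u in urls
        }
        total_weight = sum(weights.values())
        return {
            url: round(total_visits_per_day * w / total_weight)
            for url, w in weights.items()
        }

    site = [
        {"url": "/", "importance": 1.0, "response_time_s": 0.4},
        {"url": "/category", "importance": 0.6, "response_time_s": 0.4},
        {"url": "/faceted?colour=red&size=9", "importance": 0.05, "response_time_s": 1.8},
    ]
    print(allocate_crawl_budget(site))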
12. POSITIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY
Current capacity of the web crawling system is high
Your URL is 'important'
Your URL is in the real time crawl, the daily crawl or an 'active' base layer segment
Your URL changes a lot, with critical material content change
Probability and predictability of critical material content change is high for your URL
Your website speed is fast and Googlebot gets the time to visit your URL
Your URL has been 'upgraded' to a daily or real time crawl layer
13. NEGATIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY
Current capacity of the web crawling system is low
Your URL has been detected as a 'spam' URL
Your URL is in an 'inactive' base layer segment
Your URLs are 'tripping hints' built into the system to detect non-critical-change dynamic content
Probability and predictability of critical material content change is low for your URL
Your website speed is slow and Googlebot doesn't get the time to visit your URL
Your URL has been 'downgraded' to an 'inactive' base layer segment
Your URL has returned an 'unreachable' server response code recently
14. IT'S NOT JUST ABOUT 'FRESHNESS'
It's about the probability & predictability of future 'freshness'
BASED ON DATA FROM THE HISTORY LOGS - HOW CAN WE INFLUENCE THEM TO ESCAPE THE BASE LAYER?
15. CRAWL OPTIMISATION – STAGE 1 - UNDERSTAND GOOGLEBOT & URL SCHEDULER - LIKES & DISLIKES
LIKES: Going 'where the action is' in sites; The 'need for speed'; Logical structure; Correct 'response' codes; XML sitemaps; Successful crawl visits; 'Seeing everything' on a page; Taking 'hints' ('hints' are built in by the search engine systems); Clear, unique, single 'URL fingerprints' (no duplicates); Predicting likelihood of 'future change'
DISLIKES: Slow sites; Too many redirects; Being bored (Meh); Being lied to (e.g. on XML sitemap priorities); Crawl traps and dead ends; Going round in circles (infinite loops); Spam URLs; Crawl-wasting minor-content-change URLs; 'Hidden' and blocked content; Uncrawlable URLs
CHANGE IS KEY: Not just any change – critical material change; Predicting future change; Dropping 'hints' to Googlebot; Sending Googlebot where 'the action is'
17. LOOK THROUGH 'SPIDER EYES' VIA LOG ANALYSIS – ANALYSE GOOGLEBOT
PREPARE TO BE HORRIFIED
Incorrect URL header response codes (e.g. 302s)
301 redirect chains
Old files or XML sitemaps left on the server from years ago
Infinite / endless loops (circular dependency) on parameter-driven sites
URLs crawled which produce the same output
URLs generated by spammers
Dead image files being visited
Old CSS files still being crawled
Identify your 'real time', 'daily' and 'base layer' URLs
ARE THEY THE ONES YOU WANT THERE? (A log analysis sketch follows below)
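A minimal log-analysis sketch (assuming a common combined-format access log named access.log; in practice you would also verify Googlebot by reverse DNS rather than trusting the user-agent string): it counts which URLs Googlebot requests most often and which status codes it receives.

    import re
    from collections import Counter

    # Matches the request and status fields of a typical combined-format log line.
    LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]+" (?P<status>\d{3})')

    url_hits, status_counts = Counter(), Counter()
    with open("access.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            if "Googlebot" not in line:   # user-agent check only; verify with reverse DNS too
                continue
            match = LINE.search(line)
            if match:
                url_hits[match.group("path")] += 1
                status_counts[match.group("status")] += 1

    print("Status codes seen by Googlebot:", dict(status_counts))
    print("Most-crawled URLs:", url_hits.most_common(10))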
18. FIX GOOGLEBOT'S JOURNEY
SPEED UP YOUR SITE TO 'FEED' GOOGLEBOT MORE
TECHNICAL 'FIXES'
Speed up your site: implement compression, minification, caching
Fix incorrect header response codes
Fix nonsensical 'infinite loops' generated by database-driven parameters or 'looping' relative URLs
Use absolute rather than relative internal links
Ensure no parts of content are blocked from crawlers (e.g. in carousels, concertinas and tabbed content)
Ensure no CSS or JavaScript files are blocked from crawlers
Unpick 301 redirect chains (see the redirect-chain sketch below)
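To spot incorrect response codes and 301 redirect chains from the list above, a small checker like this can help (a sketch using the third-party requests library; the URLs are hypothetical):

    import requests
    from urllib.parse import urljoin

    def redirect_chain(url, max_hops=10):
        # Follow redirects manually so the full chain and each status code are visible.
        chain = []
        for _ in range(max_hops):
            response = requests.head(url, allow_redirects=False, timeout=10)
            chain.append((url, response.status_code))
            if response.status_code not in (301, 302, 303, 307, 308):
                break
            url = urljoin(url, response.headers["Location"])
        return chain

    for start in ["https://example.com/old-page", "https://example.com/"]:
        hops = redirect_chain(start)
        if len(hops) > 2:
            print("Redirect chain to unpick:", hops)
        else:
            print("OK:", hops)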
19. FIX GOOGLEBOT'S JOURNEY – SAVE BUDGET
Minimise 301 redirects
Minimise canonicalisation
Use 'if modified' headers on low-importance 'hygiene' pages
Use 'expires after' headers on content with a short shelf life (e.g. auctions, job sites, event sites)
Noindex low-search-volume or near-duplicate URLs (use the noindex directive in robots.txt)
Use 410 'gone' headers on dead URLs liberally
Revisit the .htaccess file and review legacy pattern-matched 301 redirects
Combine CSS and JavaScript files
(A conditional-request / 410 sketch follows below)
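As a minimal sketch of the 'if modified' and 410 ideas above (using Python's standard http.server purely for illustration; the paths and dates are hypothetical, and a real site would configure this in the web server or CMS): return 304 Not Modified when a page hasn't changed since the crawler's If-Modified-Since date, and 410 Gone for permanently dead URLs.

    from email.utils import format_datetime, parsedate_to_datetime
    from datetime import datetime, timezone
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical data: when each path last materially changed, and which paths are dead.
    LAST_MODIFIED = {"/terms": datetime(2015, 1, 10, tzinfo=timezone.utc)}
    GONE = {"/discontinued-product"}

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path in GONE:
                self.send_response(410)          # tell crawlers the URL is gone for good
                self.end_headers()
                return
            modified = LAST_MODIFIED.get(self.path, datetime.now(timezone.utc))
            since = self.headers.get("If-Modified-Since")
            if since and parsedate_to_datetime(since) >= modified:
                self.send_response(304)          # unchanged: save the crawler a full download
                self.end_headers()
                return
            body = b"<html><body>Hygiene page</body></html>"
            self.send_response(200)
            self.send_header("Last-Modified", format_datetime(modified, usegmt=True))
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("localhost", 8000), Handler).serve_forever()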
20. TRAIN GOOGLEBOT – 'TALK TO THE SPIDER' (PROMOTE URLS TO HIGHER CRAWL LAYERS)
EMPHASISE PAGE IMPORTANCE
Revisit 'votes for self' via internal links in GSC
Clear 'unique' URL fingerprints
Use XML sitemaps for your important URLs (don't put everything on them) – see the sitemap sketch after this slide
Use 'mega menus' (very selectively) to key pages
Use 'breadcrumbs' (for hierarchical structure)
Build 'bridges' and 'shortcuts' via HTML sitemaps and supplementary content for 'cross-modular' 'related' internal linking to key pages
Consolidate (merge) important but similar content (e.g. merge FAQs)
Consider flattening your site structure so 'importance' flows further
Reduce internal linking to low-priority URLs
BE CLEAR TO GOOGLEBOT WHICH ARE YOUR MOST IMPORTANT PAGES
TRAIN ON CHANGE
Not just any change – critical material change
Keep the 'action' in the key areas - NOT JUST THE BLOG
Use relevant 'supplementary content' to keep key pages 'fresh'
Remember the negative impact of 'crawl hints'
Regularly update key content
Consider 'updating' rather than replacing seasonal content URLs
Build 'dynamism' into your web development (sites that 'move' win)
GOOGLEBOT GOES WHERE THE ACTION IS AND IS LIKELY TO BE IN THE FUTURE
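A small sketch of an XML sitemap limited to important URLs, with lastmod reflecting real content change (standard library only; the URL list and output file name are hypothetical):

    import xml.etree.ElementTree as ET

    # Hypothetical list: only key pages, each with the date of its last material change.
    IMPORTANT_URLS = [
        ("https://example.com/", "2015-11-10"),
        ("https://example.com/key-category/", "2015-11-08"),
        ("https://example.com/key-product/", "2015-11-01"),
    ]

    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod in IMPORTANT_URLS:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod   # only update when content really changes

    ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)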
21. TOOLS YOU CAN USE
SPEED: YSlow; Pingdom; Google Page Speed tests; Minification – JS Compress and CSS Minifier; Image compression – compressjpeg.com, tinypng.com
SPIDER EYES: GSC Crawl Stats; DeepCrawl; Screaming Frog; Server logs; SEMrush (auditing tools); Webconfs (header responses / similarity checker); PowerMapper (bird's eye view of site)
URL IMPORTANCE: GSC Internal Links report (URL importance); Link Research Tools (strongest sub-pages reports); GSC Internal Links (add site categories and sections as additional profiles); PowerMapper
SAVINGS & CHANGE: GSC Index levels (over-indexation checks); GSC Crawl Stats; last-accessed tools (versus competitors); Server logs
Webmaster Hangout Office Hours
22. WARNING SIGNS – TOO MANY VOTES BY SELF FOR WRONG PAGES
IS THIS YOUR BLOG?? HOPE NOT
[Chart labels: Most Important Page 1, Most Important Page 2, Most Important Page 3]
24. WARNING SIGNS – TAG MAN
Tags: I, must, tag, this, blog, post, with, every, possible, word, that, pops, into, my, head, when, I, look, at, it, and, dilute, all, relevance, from, it, to, a, pile, of, mush, cow, shoes, sheep, the, and, me, of, it
Image credit: Buzzfeed
Creating 'thin' content and even more URLs to crawl