Googlebot has been put on a diet of URLs by the Scheduler in the web crawling system. If your URLs are not on the list, Googlebot is not coming in. The ever-increasing flood of content onto the internet means the web pages and files Googlebot visits have to be prioritised. Are you telling Googlebot and the URL Scheduler that your content is less important than it really is through technical SEO and architectural blunders? To 'talk to the spider' and gain more from your time with it, you need to understand the personas of Googlebot and the Scheduler, and the jobs they do.
SAScon Beta 2015, Dawn Anderson - Talk To The Spider
1. TALK TO THE SPIDER
Why Googlebot & The URL Scheduler Should Be Amongst Your Key Personas And How To Train Them
Dawn Anderson @dawnieando
2. THE KEY PERSONAS
9 types of Googlebot
SUPPORTING ROLES: Indexer / Ranking Engine, The URL Scheduler, History Logs, Link Logs, Anchor Logs
3. GOOGLEBOT'S JOBS
'Ranks nothing at all'
Takes a list of URLs to crawl from the URL Scheduler
Job varies based on 'bot' type
Runs errands & makes deliveries for the URL server, indexer / ranking engine and logs
Makes notes of outbound linked pages and additional links for future crawling
Takes notes of 'hints' from the URL Scheduler when crawling
Tells tales of URL accessibility status and server response codes, notes relationships between links, and collects content checksums (the binary data equivalent of web content) for comparison with past visits by the history and link logs (see the checksum sketch below)
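To illustrate the checksum idea above (a minimal sketch, not Google's actual implementation; the URL list and the store of previous hashes are hypothetical), a crawler can hash the fetched body and compare it with the hash recorded on the previous visit to decide whether the page has materially changed:

    import hashlib
    import urllib.request

    def content_checksum(url):
        # Fetch the page body and reduce it to a fixed-length fingerprint.
        body = urllib.request.urlopen(url, timeout=10).read()
        return hashlib.sha256(body).hexdigest()

    # Hypothetical store of checksums recorded on the previous visit.
    previous_checksums = {"https://example.com/": "..."}

    for url, old in previous_checksums.items():
        new = content_checksum(url)
        print(url, "changed" if new != old else "unchanged since last visit")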
4. ROLES – MAJOR PLAYERS – A 'BOSS' – THE URL SCHEDULER
Think of it as Google's line manager or 'air traffic controller' for Googlebots in the web crawling system
Schedules Googlebot visits to URLs
Decides which URLs to 'feed' to Googlebot
Uses data from the history logs about past visits
Assigns visit regularity of Googlebot to URLs
Drops 'hints' to Googlebot to guide on types of content NOT to crawl, and excludes some URLs from schedules
Analyses past 'change' periods and predicts future 'change' periods for URLs for the purposes of scheduling Googlebot visits
Checks 'page importance' when scheduling visits
Assigns URLs to 'layers / tiers' for crawling schedules
5. TOO MUCH CONTENT
The indexed web contains at least 4.73 billion pages (13/11/2015)
[Chart: total number of websites, 2000-2014, rising towards 1,000,000,000]
SINCE 2013 THE WEB IS THOUGHT TO HAVE INCREASED IN SIZE BY 1/3
6. TOO MUCH CONTENT
Capacity limits on Google's crawling system
How have search engines responded?
By prioritising URLs for crawling
By assigning crawl period intervals to URLs
By creating work 'schedules' for Googlebots
7. GOOGLE CRAWL SCHEDULER PATENTS
Include:
'Managing items in a crawl schedule'
'Scheduling a recrawl'
'Web crawler scheduler that utilizes sitemaps from websites'
'Document reuse in a search engine crawler'
'Minimizing visibility of stale content in web searching including revising web crawl intervals of documents'
'Scheduler for search engine'
8. MANAGING ITEMS IN A CRAWL SCHEDULE (GOOGLE PATENT)
3 layers / tiers:
Real Time Crawl – crawled multiple times daily
Daily Crawl – crawled daily or bi-daily
Base Layer Crawl – crawled least, on a 'round robin' basis; split into segments on random rotation, and only the 'active' segment is crawled
URLs are moved in and out of layers based on past visits data (see the sketch below)
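A toy sketch of the layered-scheduling idea described on this slide (the thresholds, scores and field names are assumptions for illustration, not values from the patent): URLs with high importance and a high predicted probability of critical change move towards the real-time layer, while the rest fall back to the base layer.

    # Toy model of the three crawl layers; thresholds are illustrative only.
    def assign_layer(url_record):
        importance = url_record["importance"]            # link-equity style score, 0..1
        change_prob = url_record["change_probability"]   # predicted from history logs, 0..1
        score = importance * change_prob
        if score > 0.6:
            return "real_time"
        if score > 0.3:
            return "daily"
        return "base_layer"

    urls = [
        {"url": "/home", "importance": 0.9, "change_probability": 0.8},
        {"url": "/old-press-release", "importance": 0.2, "change_probability": 0.05},
    ]
    for record in urls:
        print(record["url"], "->", assign_layer(record))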
9. GOOGLEBOT'S BEEN PUT ON A URL-CONTROLLED DIET
The Scheduler checks URLs for 'importance', 'boost factor' candidacy and 'probability of modification'
The URL Scheduler controls the meal planner
It carefully controls the list of URLs Googlebot visits
'Budgets' are allocated
10. CRAWL BUDGET
WHAT IS A CRAWL BUDGET? An allocation of 'crawl visit frequency' apportioned to the URLs on a site
Apportioned by the URL Scheduler to Googlebots
Roughly proportionate to page importance (link equity) & speed
Pages with a lot of healthy links get crawled more (can include internal links??)
But there are other factors affecting the frequency of Googlebot visits aside from importance / speed
The vast majority of URLs on the web don't get a lot of budget allocated to them (a toy allocation sketch follows below)
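As a back-of-the-envelope illustration of 'roughly proportionate to importance and speed' (the weighting is an assumption made purely for illustration, not a documented formula), a fixed number of daily crawl visits could be shared out like this:

    # Hypothetical: share a fixed daily number of crawl visits across URLs
    # in proportion to an importance score, discounted by slow response times.
    def allocate_crawl_budget(urls, total_visits_per_day=1000):
        weights = {
            u["url"]: u["importance"] / max(u["response_time_s"], 0.1)
            for u in urls
        }
        total_weight = sum(weights.values())
        return {
            url: round(total_visits_per_day * w / total_weight)
            for url, w in weights.items()
        }

    site = [
        {"url": "/", "importance": 1.0, "response_time_s": 0.4},
        {"url": "/category", "importance": 0.6, "response_time_s": 0.4},
        {"url": "/faceted?colour=red&size=9", "importance": 0.05, "response_time_s": 1.8},
    ]
    print(allocate_crawl_budget(site))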
12. POSITIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY
Current capacity of the web crawling system is high
Your URL is 'important'
Your URL is in the real time crawl, the daily crawl or an 'active' base layer segment
Your URL changes a lot, with critical material content change
Probability and predictability of critical material content change is high for your URL
Your website speed is fast and Googlebot gets the time to visit your URL
Your URL has been 'upgraded' to a daily or real time crawl layer
13. NEGATIVE FACTORS AFFECTING GOOGLEBOT VISIT FREQUENCY
Current capacity of the web crawling system is low
Your URL has been detected as a 'spam' URL
Your URL is in an 'inactive' base layer segment
Your URLs are 'tripping hints' built into the system to detect non-critical-change dynamic content
Probability and predictability of critical material content change is low for your URL
Your website speed is slow and Googlebot doesn't get the time to visit your URL
Your URL has been 'downgraded' to an 'inactive' base layer segment
Your URL has returned an 'unreachable' server response code recently
14. IT'S NOT JUST ABOUT 'FRESHNESS'
It's about the probability & predictability of future 'freshness'
BASED ON DATA FROM THE HISTORY LOGS - HOW CAN WE INFLUENCE THEM TO ESCAPE THE BASE LAYER?
15. CRAWL OPTIMISATION – STAGE 1 - UNDERSTAND GOOGLEBOT & URL SCHEDULER - LIKES & DISLIKES
LIKES: Going 'where the action is' in sites; The 'need for speed'; Logical structure; Correct 'response' codes; XML sitemaps; Successful crawl visits; 'Seeing everything' on a page; Taking 'hints' ('hints' are built in by the search engine systems); Clear, unique, single 'URL fingerprints' (no duplicates); Predicting likelihood of 'future change'
DISLIKES: Slow sites; Too many redirects; Being bored (Meh); Being lied to (e.g. on XML sitemap priorities); Crawl traps and dead ends; Going round in circles (infinite loops); Spam URLs; Crawl-wasting minor-content-change URLs; 'Hidden' and blocked content; Uncrawlable URLs
CHANGE IS KEY: Not just any change – critical material change; Predicting future change; Dropping 'hints' to Googlebot; Sending Googlebot where 'the action is'
17. LOOK THROUGH 'SPIDER EYES' VIA LOG ANALYSIS – ANALYSE GOOGLEBOT
PREPARE TO BE HORRIFIED
Incorrect URL header response codes (e.g. 302s)
301 redirect chains
Old files or XML sitemaps left on the server from years ago
Infinite / endless loops (circular dependency) on parameter-driven sites
URLs crawled which produce the same output
URLs generated by spammers
Dead image files being visited
Old CSS files still being crawled
Identify your 'real time', 'daily' and 'base layer' URLs
ARE THEY THE ONES YOU WANT THERE? (A log analysis sketch follows below)
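A minimal log-analysis sketch (assuming a common combined-format access log named access.log; in practice you would also verify Googlebot by reverse DNS rather than trusting the user-agent string): it counts which URLs Googlebot requests most often and which status codes it receives.

    import re
    from collections import Counter

    # Matches the request and status fields of a typical combined-format log line.
    LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]+" (?P<status>\d{3})')

    url_hits, status_counts = Counter(), Counter()
    with open("access.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            if "Googlebot" not in line:   # user-agent check only; verify with reverse DNS too
                continue
            match = LINE.search(line)
            if match:
                url_hits[match.group("path")] += 1
                status_counts[match.group("status")] += 1

    print("Status codes seen by Googlebot:", dict(status_counts))
    print("Most-crawled URLs:", url_hits.most_common(10))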
18. FIX GOOGLEBOT'S JOURNEY
SPEED UP YOUR SITE TO 'FEED' GOOGLEBOT MORE
TECHNICAL 'FIXES'
Speed up your site: implement compression, minification, caching
Fix incorrect header response codes
Fix nonsensical 'infinite loops' generated by database-driven parameters or 'looping' relative URLs
Use absolute rather than relative internal links
Ensure no parts of content are blocked from crawlers (e.g. in carousels, concertinas and tabbed content)
Ensure no CSS or JavaScript files are blocked from crawlers
Unpick 301 redirect chains (see the redirect-chain sketch below)
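To spot incorrect response codes and 301 redirect chains from the list above, a small checker like this can help (a sketch using the third-party requests library; the URLs are hypothetical):

    import requests
    from urllib.parse import urljoin

    def redirect_chain(url, max_hops=10):
        # Follow redirects manually so the full chain and each status code are visible.
        chain = []
        for _ in range(max_hops):
            response = requests.head(url, allow_redirects=False, timeout=10)
            chain.append((url, response.status_code))
            if response.status_code not in (301, 302, 303, 307, 308):
                break
            url = urljoin(url, response.headers["Location"])
        return chain

    for start in ["https://example.com/old-page", "https://example.com/"]:
        hops = redirect_chain(start)
        if len(hops) > 2:
            print("Redirect chain to unpick:", hops)
        else:
            print("OK:", hops)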
19. FIX GOOGLEBOT'S JOURNEY – SAVE BUDGET
Minimise 301 redirects
Minimise canonicalisation
Use 'if modified' headers on low-importance 'hygiene' pages
Use 'expires after' headers on content with a short shelf life (e.g. auctions, job sites, event sites)
Noindex low-search-volume or near-duplicate URLs (use the noindex directive in robots.txt)
Use 410 'gone' headers on dead URLs liberally
Revisit the .htaccess file and review legacy pattern-matched 301 redirects
Combine CSS and JavaScript files
(A conditional-request / 410 sketch follows below)
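As a minimal sketch of the 'if modified' and 410 ideas above (using Python's standard http.server purely for illustration; the paths and dates are hypothetical, and a real site would configure this in the web server or CMS): return 304 Not Modified when a page hasn't changed since the crawler's If-Modified-Since date, and 410 Gone for permanently dead URLs.

    from email.utils import format_datetime, parsedate_to_datetime
    from datetime import datetime, timezone
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Hypothetical data: when each path last materially changed, and which paths are dead.
    LAST_MODIFIED = {"/terms": datetime(2015, 1, 10, tzinfo=timezone.utc)}
    GONE = {"/discontinued-product"}

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path in GONE:
                self.send_response(410)          # tell crawlers the URL is gone for good
                self.end_headers()
                return
            modified = LAST_MODIFIED.get(self.path, datetime.now(timezone.utc))
            since = self.headers.get("If-Modified-Since")
            if since and parsedate_to_datetime(since) >= modified:
                self.send_response(304)          # unchanged: save the crawler a full download
                self.end_headers()
                return
            body = b"<html><body>Hygiene page</body></html>"
            self.send_response(200)
            self.send_header("Last-Modified", format_datetime(modified, usegmt=True))
            self.end_headers()
            self.wfile.write(body)

    HTTPServer(("localhost", 8000), Handler).serve_forever()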
20. TRAIN GOOGLEBOT – 'TALK TO THE SPIDER' (PROMOTE URLS TO HIGHER CRAWL LAYERS)
EMPHASISE PAGE IMPORTANCE
Revisit 'votes for self' via internal links in GSC
Clear 'unique' URL fingerprints
Use XML sitemaps for your important URLs (don't put everything on them) – see the sitemap sketch after this slide
Use 'mega menus' (very selectively) to key pages
Use 'breadcrumbs' (for hierarchical structure)
Build 'bridges' and 'shortcuts' via HTML sitemaps and supplementary content for 'cross-modular' 'related' internal linking to key pages
Consolidate (merge) important but similar content (e.g. merge FAQs)
Consider flattening your site structure so 'importance' flows further
Reduce internal linking to low-priority URLs
BE CLEAR TO GOOGLEBOT WHICH ARE YOUR MOST IMPORTANT PAGES
TRAIN ON CHANGE
Not just any change – critical material change
Keep the 'action' in the key areas - NOT JUST THE BLOG
Use relevant 'supplementary content' to keep key pages 'fresh'
Remember the negative impact of 'crawl hints'
Regularly update key content
Consider 'updating' rather than replacing seasonal content URLs
Build 'dynamism' into your web development (sites that 'move' win)
GOOGLEBOT GOES WHERE THE ACTION IS AND IS LIKELY TO BE IN THE FUTURE
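A small sketch of an XML sitemap limited to important URLs, with lastmod reflecting real content change (standard library only; the URL list and output file name are hypothetical):

    import xml.etree.ElementTree as ET

    # Hypothetical list: only key pages, each with the date of its last material change.
    IMPORTANT_URLS = [
        ("https://example.com/", "2015-11-10"),
        ("https://example.com/key-category/", "2015-11-08"),
        ("https://example.com/key-product/", "2015-11-01"),
    ]

    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod in IMPORTANT_URLS:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod   # only update when content really changes

    ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)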
21. TOOLS YOU CAN USE
SPEED: YSlow; Pingdom; Google Page Speed tests; Minification – JS Compress and CSS Minifier; Image compression – compressjpeg.com, tinypng.com
SPIDER EYES: GSC Crawl Stats; DeepCrawl; Screaming Frog; Server logs; SEMrush (auditing tools); Webconfs (header responses / similarity checker); PowerMapper (bird's eye view of site)
URL IMPORTANCE: GSC Internal Links report (URL importance); Link Research Tools (strongest sub-pages reports); GSC Internal Links (add site categories and sections as additional profiles); PowerMapper
SAVINGS & CHANGE: GSC Index levels (over-indexation checks); GSC Crawl Stats; last-accessed tools (versus competitors); Server logs
Webmaster Hangout Office Hours
22. WARNING SIGNS – TOO MANY VOTES BY SELF FOR WRONG PAGES
IS THIS YOUR BLOG?? HOPE NOT
[Chart labels: Most Important Page 1, Most Important Page 2, Most Important Page 3]
24. WARNING SIGNS – TAG MAN
Tags: I, must, tag, this, blog, post, with, every, possible, word, that, pops, into, my, head, when, I, look, at, it, and, dilute, all, relevance, from, it, to, a, pile, of, mush, cow, shoes, sheep, the, and, me, of, it
Image credit: Buzzfeed
Creating 'thin' content and even more URLs to crawl