WHY YOU SHOULD CARE ABOUT TAKING CARE OF CRAWLS (INTELLIGENT USE OF CRAWL ALLOCATION (BUDGET)). Investigating 'crawl budget', 'crawl rank', 'crawl tank' and 'crawl scheduling by Search Engines'
SEO Crawl Rank And Crawl Tank - Brighton SEO April 2016
1. SEO ‘Crawl Tank’ - ‘Death and Resurrection’
WHY YOU SHOULD CARE ABOUT TAKING CARE OF CRAWLS (INTELLIGENT USE OF CRAWL ALLOCATION (BUDGET))
THE QUEST FOR ‘CRAWL RANK’
Dawn Anderson @dawnieando
2. THE WEB IS ‘BIG’
Indexed Web contains at least 4.73 billion pages (13/11/2015)
[Chart: total number of websites, 2000 to 2014, y-axis from 250,000,000 to 1,000,000,000]
SINCE 2013 THE WEB IS THOUGHT TO HAVE INCREASED IN SIZE BY 1/3
3. THE ABILITY TO ‘SELF PUBLISH’ EASILY HAS CLEARLY INFLUENCED THIS - WE ALL ‘LOVE CONTENT’
IMPORTANT TO NOTE THAT 75% OF WEBSITES ONLINE ARE DORMANT (E.G. PARKED DOMAINS)
IMAGINE HOW MANY UNIQUE URLs COMBINED THIS AMOUNTS TO? - A LOT
http://www.internetlivestats.com/total-number-of-websites/
4. TOO MUCH CONTENT
Capacity limits on Google’s crawling system
How have search engines responded?
• By prioritising URLs for crawling
• By assigning crawl period intervals to URLs
• By creating work ‘schedules’ for Googlebots
5. HERE’S WHY -> EVERYTHING HAS A FINITE CAPACITY (EVEN CRAWLING)
“While web pages can be manually selected for crawling, this becomes impracticable as the number of web pages grows. Moreover, to keep within the capacity limits of the crawler, automated selection mechanisms are needed to determine not only which web pages to crawl, but which web pages to avoid crawling. For instance, as of the end of 2003, the WWW is believed to include well in excess of 10 billion distinct documents or web pages, while a search engine may have a crawling capacity that is less than half as many documents.”
- Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al)
6. SOME GOOGLE CRAWL SCHEDULER PATENTS
Include:
• ‘Managing items in crawl schedule’ - US 8666964 B1
• ‘Scheduling a recrawl’ - US 8386459 B1
• ‘Web crawler scheduler that utilizes sitemaps from websites’ - US 8037054 B2
• ‘Document reuse in a search engine crawler’ - US 8707312 B1
• ‘Minimizing visibility of stale content in web searching including revising web crawl intervals of documents’ - US 8407204 B2
• ‘Scheduler for search engine crawler’ - US 8042112 B1
• ‘Distributed crawling of hyperlinked documents’ - US 7305610 B1
IT SEEMS PRIORITIZATION AND GOOGLEBOT CRAWL EFFICIENCY ARE IMPORTANT TO SEARCH ENGINES
7. “MANAGING ITEMS IN A CRAWL SCHEDULE” (GOOGLE PATENT US 8666964 B1)
PAGE ‘IMPORTANCE’ AND URL SCHEDULING
3 layers / tiers:
• Real Time Crawl: crawled multiple times daily
• Daily Crawl: crawled daily or bi-daily
• Base Layer Crawl: crawled least, on a ‘round robin’ basis. Split into segments on random rotation; only the ‘active’ segment is crawled
URLs are moved in and out of layers based on past visits data (retrieved from logs)
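The three-layer idea can be sketched in a few lines of Python. This is only an illustration of moving URLs between layers using past visit data: the `VisitHistory` fields, the `assign_layer` function and every threshold are assumptions made for the example, not values taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class VisitHistory:
    url: str
    visits: int          # past crawl visits on record (from the logs)
    changes_seen: int    # visits on which the content had changed
    importance: float    # hypothetical page importance score, 0..1


def assign_layer(history: VisitHistory) -> str:
    """Pick a crawl layer from past visit data (all thresholds are assumed)."""
    change_rate = history.changes_seen / history.visits if history.visits else 0.0
    if change_rate >= 0.8 and history.importance >= 0.7:
        return "real_time"
    if change_rate >= 0.3 or history.importance >= 0.5:
        return "daily"
    # Everything else waits in the base layer for its segment to become active.
    return "base_layer"
```

Re-running `assign_layer` after each crawl is what would let a URL climb out of (or sink into) the base layer as its history changes.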
8. THE KEY SEARCH ENGINE (THE APPLIANCE) CHARACTERS
• 10 types of Googlebot
• The URL Scheduler
• Indexer / Ranking Engine
SUPPORTING ROLES (LOG MANAGERS & PAGE RANKERS)
• History Logs
• Link Logs / Link Maps
• Anchor Logs / Anchor Maps
• Status Logs
• Page Rankers
9. THE ‘LOG’ MANAGERS (‘The Clerks’)
Consider these as ‘record-keepers’ (record info on the crawled URLs)
History Logs - JOBS INCLUDE:
• Retrieves previous copies of documents for comparison with newly retrieved copies for purposes of ‘change frequency’ and ‘change weight’ calculation (last modified & update rate)
Link Logs - JOBS INCLUDE:
• “identifies all the links (e.g., URLs, also called outbound links) that are found in the document associated with the record and the text that surrounds the link” (Brawer et al, Google Patent)
• INFO USED TO MAKE LINK MAPS
Other Logs - Include:
• Anchor Logs & Maps
• Status Logs
A LOT MORE INFO ON LOGS AT: Scheduler for Search Engine Crawler US 20100241621 A1
10. SUPERVISOR - TEAM LEADER - ‘THE URL SCHEDULER’
Think of it as Google’s line manager or ‘air traffic controller’ for Googlebots in the web crawling system
JOBS
• Schedules Googlebot visits to URLs
• Decides which URLs to ‘feed’ to Googlebot
• Uses data from the history logs about past visits
• Assigns visit regularity of Googlebot to URLs
• Drops ‘hints’ to Googlebot to guide on types of content NOT to crawl and excludes some URLs from schedules
• Analyses past ‘change’ periods and predicts future ‘change’ periods for URLs for the purposes of scheduling Googlebot visits (BASED ON PAST VISIT DATA)
• Checks ‘page importance’ in scheduling visits (PRIORITIES)
• Assigns URLs to ‘layers / tiers’ for crawling schedules (REAL TIME, DAILY, BASE LAYER SEGMENT)
The URL Scheduler controls the meal planner
• Scheduler checks URLs for ‘importance’, ‘boost factor’ candidacy, ‘probability of modification’
• ‘Budgets’ are allocated
• Carefully controls the list of URLs Googlebot visits
11. THE 10 GOOGLEBOTS
GOOGLEBOT WEB SEARCH
MEDIA TYPES: Image, Video, News
PAID SEARCH TYPES: Adsense, Adsbot
MOBILE TYPES: Smartphone, Apps, Featurephone, Mobile Adsense
BOT TYPES HAVE VARYING DEGREES OF ‘BUSY-NESS’
Crawls images only
Quality Checks
Babybot (‘the Noob’)
12. GOOGLEBOT JOBS
• ‘Ranks nothing at all’
• Takes a list of URLs to crawl from URL Scheduler
• Job varies based on ‘bot’ type (e.g. Image bot seems a bit of a ‘part timer’ (images change less frequently))
• Runs errands & makes deliveries for the URL server, indexer / ranking engine and logs
• Makes notes of outbound linked pages and additional links for future crawling (in order for them to be assigned to future crawling schedules)
• Takes notes of ‘hints’ from URL scheduler when crawling
• Tells tales of URL accessibility status, server response codes, notes relationships between links and collects content checksums (binary data equivalent of web content) for comparison with past visits by history and link logs
13. ‘INDEXER’
Looks at all of the evidence from the various logs (and the page rankers) of the search engine to index the URLs
• Uses the combined data collected in order to index the results for a given query
• TAKES DATA FROM THE LOGS TO GENERATE INDEXES
“The indexer(s) 724 use the anchor maps 718 and other logs 716 to generate index(es) 726. The index(es) are used by the search engine to identify documents matching queries entered by users of the search engine.” (Web crawler scheduler that utilizes sitemaps from websites, US 8037054 B2, Google Patent, Brawer et al, pub 2011)
14. I ASKED JOHN MUELLER AT WEBMASTER HANGOUT ABOUT URL QUEUES
GOOGLE WEBMASTER HANGOUT QUESTION ON ‘URL QUEUEING’
“URLS ARE NOT ALL CRAWLED IN ORDER, BUT THAT SOME RECEIVE MULTIPLE DAILY CRAWLS, SOME DAILY, SOME WEEKLY AND SOME VERY INFREQUENTLY”
https://www.seroundtable.com/google-explains-why-the-search-console-has-reporting-delays-21688.html
LOW IMPORTANCE URLs APPEAR TO BE ‘QUEUED FOR LATER’ AND VISITED INFREQUENTLY WHEN THERE IS SPARE CAPACITY (LOWER PRIORITY) (SCHEDULES)
BUT WHAT OTHER EVIDENCE DO WE HAVE TO SUPPORT OUR THEORIES?
15. WHICH APPEARED TO SUPPORT…
“Priority scores are computed for each remaining document identifier based on predetermined criteria (e.g., a page importance score of the document).” (Zhu et al, 2011)
PATENT - Scheduler for search engine crawler US 8042112 B1
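To make ‘priority scores’ and ‘queued for later’ concrete, here is a minimal sketch of a capacity-limited scheduler built on Python's standard `heapq` module. The `schedule` function, the example scores and the idea of a fixed per-run capacity are assumptions for illustration only; the patent does not publish its actual criteria.

```python
import heapq


def schedule(urls_with_importance, capacity):
    """Split URLs into those crawled now and those queued for later.

    `urls_with_importance` is an iterable of (url, importance_score) pairs;
    `capacity` is the assumed number of crawl slots available in this run.
    """
    # heapq is a min-heap, so scores are negated to pop the highest first.
    heap = [(-score, url) for url, score in urls_with_importance]
    heapq.heapify(heap)

    crawl_now = []
    while heap and len(crawl_now) < capacity:
        _, url = heapq.heappop(heap)
        crawl_now.append(url)

    # Whatever is left over waits until there is spare capacity.
    queued_for_later = sorted(url for _, url in heap)
    return crawl_now, queued_for_later
```

With limited capacity, low-importance URLs never reach the front of the heap, which is one way the ‘visited infrequently when there is spare capacity’ behaviour could arise.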
16. CRAWL BUDGET
1. CRAWL BUDGET - “AN ALLOCATION OF CRAWL VISITS TO A HOST”
2. ROUGHLY PROPORTIONATE TO PAGERANK AND HOST SPEED / CAPACITY
3. PAGES WITH A LOT OF LINKS GET CRAWLED MORE
4. THE VAST MAJORITY OF URLS ON THE WEB DON’T GET A LOT OF BUDGET ALLOCATED TO THEM (LOW TO 0 PAGERANK URLS).
Mostly taken from Eric Enge’s interview with Matt Cutts (@mattcutts) from 2010
https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
17. I ASKED SOME STUFF ABOUT CRAWL BUDGET ALLOCATION
DISTRIBUTED CRAWLING OF HYPERLINKED DOCUMENTS - Patent Abstract - “Hyperlinked documents to be crawled are grouped by host and the host to be crawled next is selected according to a stall time of the host. The stall time can indicate the earliest time that the host should be crawled and the stall times can be a predetermined amount of time, vary by host and be adjusted according to actual retrieval times from the host” (Dean et al (Google, 2014))
IT SEEMS - BUDGET IS ASSIGNED TO THE HOST (I.P) AND THEN SHARED BETWEEN THE SITES THERE
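The stall-time idea in the abstract can be sketched as a queue of hosts ordered by the earliest time each may be crawled again. The `HostQueue` class, its method names and the `factor` multiplier are all hypothetical; the abstract only says stall times can vary by host and be adjusted according to actual retrieval times.

```python
import heapq


class HostQueue:
    """Hosts ordered by stall time (earliest allowed next crawl)."""

    def __init__(self):
        self._heap = []  # entries are (stall_time, host)

    def add(self, host, stall_time):
        heapq.heappush(self._heap, (stall_time, host))

    def next_host(self, now):
        """Return the host with the earliest stall time if it is due, else None."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[1]
        return None

    def reschedule(self, host, now, retrieval_seconds, factor=10.0):
        # Assumed policy: wait a multiple of the actual retrieval time,
        # so slower hosts are revisited less often.
        self.add(host, now + factor * retrieval_seconds)
```

Under this assumed policy a slow host pushes its own next visit further into the future, which fits the slide's point that host speed and capacity shape the budget.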
18. I ASKED SOME STUFF ABOUT LINKS AND CRAWL BUDGET (in light of 2012 ‘DISAVOW TOOL’)
TIP (IMHO - DAWN) - YOU MAY NEED TO RESTRUCTURE / FLATTEN SO ‘BUDGET’ CAN REACH IMPORTANT URLS
“Thanks John” - Waving :)
19. IT SEEMS THERE ARE MORE FACTORS AFFECTING ‘CRAWL BUDGET’??
WEB PROMOS Q & A WITH GOOGLE’S ANDREY LIPATTSEV
Transcript: https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/
Andrey chatting with Ammon J seemed to imply that a lot more things affect crawl frequency now than just PageRank
20. ARE THERE OTHER FACTORS AFFECTING BUDGET AND / OR ‘CRAWL RANK’ AS WELL AS PAGERANK AND SPEED?
I ASKED @johnmu IF I COULD ASK WHETHER THE FACTORS AFFECTING CRAWL BUDGET HAD CHANGED?
JOHN SAID - “Sure…You can always ask” :) :) - “But, he didn’t tell me what they were (if any)”
SO I ASKED IF I COULD ASK IF FACTORS AFFECTING CRAWL BUDGET / CRAWL FREQUENCY HAD CHANGED - I.E. ADDITIONAL FACTORS?
21. GOOGLE PATENT - ‘NOT ALL ‘CHANGE’ IS CONSIDERED EQUAL’ (CRITICAL & NON-CRITICAL)
“Changes can be described as critical or non-critical and that
determination may depend on the portion of the document changed, or
the context of the changes, rather than the amount of text or content
changed. Sometimes a change to a document may be insubstantial,
e.g., the change of advertisements associated with a document. In this
case, it is more appropriate to ignore those accessory materials in a
document prior to making content comparisons. In other cases, e.g., as
part of a product search, not every piece of information in a
document is weighted equally by a potential user. For instance, the
user may care more about the unit price of the product and the
availability of the product. In this case, it is more appropriate to focus
on the changes associated with information that is deemed critical
to a potential user rather than something that is less significant,
e.g., a change in a product's colour” (Minimizing Visibility of Stale Content in Web Searching Including Revising Web Crawl Intervals of Documents - Anton Carver, Google Patent - US 20130226897 A1, pub 2013)
‘CHANGE RATE & CHANGE WEIGHT THRESHOLDS’
Probability & predictability of future ‘freshness’ (newness or critical material change) (‘CHANGE RATE’ APPEARS TO BE ‘LEARNED’)
22. CRITICAL MATERIAL CONTENT CHANGE (IMPORTANT CHANGE) & FEATURE WEIGHTS
$C = \sum_{i=0}^{n-1} \text{weight}_i \cdot \text{feature}_i$
NOT JUST ‘RANDOM’ CHANGE like Shuffle($variable) or RAND($variable)
NOT ALL ‘FEATURES’ ARE CREATED EQUAL ACCORDING TO THIS LINE IN PATENTS - “weight_i * feature_i”
EXAMPLE FEATURES - E.G. A CHANGE IN PRICE (FEATURE) MAY BE WEIGHTED HIGHER THAN A CHANGE IN COLOUR (FEATURE) - FEATURE WEIGHT PRICE > FEATURE WEIGHT COLOUR
“DEPENDS ON HOW OFTEN THE PAGE CHANGES” IS MENTIONED A LOT IN WEBMASTER HANGOUTS
Minimizing Visibility of Stale Content in Web Searching Including Revising Web Crawl Intervals of Documents - Anton Carver, Google Patent - US 20130226897 A1, pub 2013
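As a worked example of the weighted sum, the sketch below scores a change between two crawls of a product page. Treating each feature as 1 when it changed and 0 when it did not is an assumption made for the example, as are the `change_score` function and the weights; the patent does not publish actual values.

```python
# Hypothetical feature weights: price matters most, colour least.
WEIGHTS = {"price": 0.6, "availability": 0.3, "colour": 0.1}


def change_score(old, new, weights):
    """C = sum(weight_i * feature_i) over all weighted features.

    feature_i is taken to be 1 if the i-th feature differs between the
    previous and the current copy of the page, and 0 otherwise (assumed).
    """
    return sum(
        weight * (1 if old.get(name) != new.get(name) else 0)
        for name, weight in weights.items()
    )
```

With these assumed weights a colour-only change scores lower than a price-only change, so a scheduler comparing C against a threshold could ignore the former and act on the latter.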
23. “BE CONSISTENT” - (@johnmu, Nov 2015)
SMX MILAN (November 2015), reported here by SERoundtable on quote from Google’s John Mueller @johnmu: https://www.seroundtable.com/google-number-one-seo-advice-be-consistent-21196.html
DA - I HAVE A FEELING CONSISTENCY IS IMPORTANT FOR ‘HISTORY LOGS’ TO ‘LEARN’ CHANGE RATES / THRESHOLDS
24. URL EXCLUSIONS FOR ‘TRIPPING ‘MINIMUM-CRAWL-THRESHOLD’ REVISIT ‘HINTS’ AND ‘SPAM’ URLs
‘RANDOM’ CHANGE created programmatically like Shuffle($variable) or RAND($variable) may even be seen as ‘hints’ TO GOOGLEBOT TO ‘NOT’ CRAWL
HINTS = ‘MEH CHANGES’ (E.G. PATTERNS OF ‘SAME OLD, SAME OLD STUFF’, DUPLICATES, PROGRAMMATICALLY GENERATED CONTENT)
“Hints may also be employed on pages that are automatically generated and/or contain dynamically generated elements that result in the page having a different checksum every time it is crawled” (Managing Items In A Crawl Schedule, Google Patent - US 8666964 B1)
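The checksum problem is easy to reproduce. In the sketch below a page carries one dynamically generated block, so its raw checksum differs on every render even though nothing important changed. The `<!--dyn-->` markers, the `render` helper and the use of SHA-256 are assumptions for illustration; how a search engine actually fingerprints content is not stated in the source.

```python
import hashlib
import re

# Hypothetical markers wrapping a dynamically generated block (e.g. an ad).
DYNAMIC_BLOCK = re.compile(r"<!--dyn-->.*?<!--/dyn-->", re.DOTALL)


def checksum(html: str) -> str:
    """Checksum of the page exactly as served."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def stable_checksum(html: str) -> str:
    """Checksum after dropping the blocks marked as dynamic."""
    return checksum(DYNAMIC_BLOCK.sub("", html))


def render(ad: str) -> str:
    """Toy page: fixed product content plus a rotating ad slot."""
    return f"<h1>Widgets</h1><!--dyn-->{ad}<!--/dyn--><p>Price: 10</p>"
```

Two renders with different ads give different raw checksums but the same stable checksum, which is the kind of ‘meh change’ a crawler has reason to discount.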
25. GOOGLE THINKS CRAWL BUDGET IS IMPORTANT FOR SEO
CIRCA JULY 2015
BUT… NO ONE HAS EVER OFFICIALLY SAID THAT THERE’S ANY KIND OF RANKING BENEFIT FROM POSITIVE CRAWL ACTIVITY
26. ENTER ‘CRAWL RANK’ - A BENEFIT OF CRAWL OPTIMISATION??
“The pages that aren’t crawled as often are pages with little to no PageRank. CrawlRank is the difference in this very large pool of pages. You win if you get your low PageRank pages crawled more frequently than the competition.”
“I’m still not entirely convinced this is what is happening, but I’m seeing success using this philosophy.” - A J Kohn @ajkohn
OTHERS SEEM TO BE TRACKING IT TOO - E.G. SEO CLARITY
DOES THE MYTHOLOGICAL ‘CRAWL RANK’ BENEFIT EVEN EXIST?
27. DOES ‘CRAWL RANK’ STILL APPLY?
I ASKED A J KOHN IF HE STILL THOUGHT IT APPLIED NOW?
“I still see evidence that getting pages crawled frequently (within 7-10 days) seems to have an impact on their ability to rank well” (AJ Kohn, 2016)
“Thanks A.J” - Waving :)
28. IS LONG-TAIL ‘LEAP-FROGGING’ (AND SOME CLUSTERING) WHAT ‘CRAWL RANK’ LOOKS LIKE?
SITES JUMPING OVER EACH OTHER ON ‘LONG TAILED QUERIES’ IN AN ENDLESS LAST LAP RACE?
29. HOW IT APPEARS TO WORK - ‘YOU DON’T ALWAYS HAVE TO FIGHT THE ‘BOSS’ URLS’
Why fight with the Hulk when you can be Yoda?
Image Credit: Flickr
30. EVEN STRONGER DOMAINS HAVE WEAKER URLS
THE SITES MAY ALL BE STRONGER THAN YOU BUT THERE ARE A LOT OF PAGES ON BIG SITES WITH NO STRENGTH
YOU WON’T BEAT THE STRONG URLs WITH CRAWL OPTIMISATION ALONE
Strong URLs: You are unlikely to beat these URLs with crawl optimisation techniques alone. These URLs are not the intended target for these tactics - TOO STRONG
SAVE SOME BATTLES FOR LATER
31. FIGHT AT A URL V URL OR TEMPLATE V TEMPLATE LEVEL WITH LOW TO 0 PAGE RANK URLS
PICK OFF THE WEAKER URLS WHEN BATTLING WITH A BIG SITE - LOW TO NO PAGE RANK URLS
• TARGETS THE LOW STRENGTH PAGES FURTHER DOWN IN THE SITES OF COMPETITORS (SUBCATEGORY PAGES E.G. IN ECOMMERCE SITES)
• THERE ARE A LOT OF PAGES (MILLIONS WITH LITTLE TO NO PAGE RANK)
• YOU’RE AIMING TO BEAT THOSE: VIRTUALLY NO STRENGTH IN 1,000s OF URLS
Weak URLs:
• POWERFUL WELL KNOWN BRANDS BUT NO STRENGTH LOWER DOWN THE ARCHITECTURE
• MANY LOW VOL / DEEP URLs ARE COMPLETE WEEDS ON BEHEMOTH SITES
32. A BIG FACTOR? - ‘EMPHASIS OF ‘URL IMPORTANCE’’ (E.G. ON PARAMETERS)
THIS WAS IN THE ORIGINAL INTERVIEW WITH MATT CUTTS
FULL TRANSCRIPT - https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
ALSO LOTS OF THE PATENTS MENTION “PAGE IMPORTANCE (WHICH MAY INCLUDE PAGERANK)”
33. WHICH SEEMS TO SUPPORT THIS PAPER BY PAGE ET AL ON IMPORTANCE
Efficient Crawling Through URL Ordering - Page et al
THIS REFERENCES THE PROBLEM OF THE SIZE OF THE WEB AND PRIORITIZES IMPORTANT PAGES
“Thanks Bill” - Waving :)
34. ‘POINT TO THE NEEDLE IN THE HAY’ - EMPHASISE IMPORTANCE
• Googlebot is also ‘hunting’… Hunting for relevant ‘needles’ in 1,000,000,000s of straws of ‘hay’ on the web
• It’s about making your ‘one needle’ stand out in importance in not just your own site’s haystack, but tens of thousands of competing similar straws of hay in other sites’ haystacks… (DON’T JUST MAKE YOUR HAYSTACK BIGGER)
“Hey, you Googlebot… This is the needle” via architectural internal linking without blur of duplication or too many redirects or canonicalization
35. WHICH OF YOUR URLs ARE IMPORTANT?
“If you don’t consistently indicate via clean internal individual URL importance emphasis, the importance of your URLs, how will Googlebot know which are the most important?”
36. INTERNAL LINKS COUNT (A LOT) (RELATIVE IMPORTANCE VOTES ON URL IMPORTANCE FROM YOUR OWN SITE)
THESE ARE YOUR ‘VOTES’ TO GOOGLEBOT ON THE IMPORTANCE OF EACH URL
EMPLOY ‘CONSISTENT’ INTERNAL LINK STRATEGIES
THINK OF THESE AS ‘WALL-TIES’ HOLDING YOUR BUILDING (SITE ARCHITECTURE) TOGETHER
STOP VOTING FOR THE WRONG URLS FROM WITHIN YOUR OWN SITE.
WRONG TARGETS RANKING?… CHECK INTERNAL LINKS
From Google Support Pages: Consistent internal & external emphasis of a URL’s ‘IMPORTANCE’
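One quick way to see where your own ‘votes’ are going is to count internal links per target URL from a crawl export. A minimal sketch, assuming the export has already been reduced to (source URL, target URL) pairs; the `internal_inlinks` name is hypothetical and self-links are ignored by choice.

```python
from collections import Counter


def internal_inlinks(links):
    """Count internal links pointing at each target URL.

    `links` is an iterable of (source_url, target_url) pairs taken from
    a crawl of your own site. Links from a page to itself are skipped.
    """
    return Counter(target for source, target in links if source != target)
```

Sorting the result with `most_common()` shows which URLs the site itself votes for most; if the top entries are not the pages you want ranking, the internal linking is sending a mixed signal.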
37. BUT IS THERE PERHAPS AN OPPOSITE OF ‘CRAWL RANK’? - ‘CRAWL TANK’??
IS THERE ADVERSE EFFECT WHEN CRAWLING GOES BAD?
NEGATIVE CONSEQUENCES FROM POOR CRAWL VISITS (E.G. SPIDER TRAPS (INFINITE LOOPS), INDIVIDUAL URLS VISITED LESS AND LESS FREQUENTLY BECAUSE THERE ARE TOO MANY)
38. WELL - I’VE SEEN ‘CRAWL TANK’ - IT AIN’T PRETTY
SITE SEO DEATH BY TOO MANY URLS AND INSUFFICIENT CRAWL BUDGET TO SUPPORT (EITHER DUMPING A NEW THIN PARAMETER INTO A SITE OR INFINITE LOOP (CODING ERROR) (SPIDER TRAP))
“BEEN THERE, DONE THAT”
39. IT KIND OF LOOKS A BIT LIKE THIS
“BEEN THERE, DONE THAT” DEFINITELY
40. ‘EXPONENTIAL URL UNIMPORTANCE’?
Your URLs exponentially, CONSISTENTLY confirmed unimportant to queries with each iterative crawl visit to other similar or duplicate content checksum URLs?
MULTIPLE RANDOM URLs competing for same query confirm irrelevance of all competing in-site URLs with no dominant relevant IMPORTANT URL?
42. LIKES
• Going ‘where the action is’ in sites
• The ‘need for speed’
• Logical structure
• Correct ‘response’ codes
• XML sitemaps
• ‘Successful crawl visits’
• ‘Seeing everything’ on a page
• Taking ‘hints’
• Clear unique single ‘URL fingerprints’ (no duplicates)
• Predicting likelihood of ‘future change’
DISLIKES
• Slow sites
• Too many redirects
• Being bored (Meh) (‘Hints’ are built in by the search engine systems - Takes ‘hints’)
• Being lied to (e.g. on XML sitemap priorities)
• Crawl traps and dead ends
• Going round in circles (Infinite loops)
• Spam URLs
• Crawl wasting minor content change URLs
• ‘Hidden’ and blocked content
• Uncrawlable URLs
• Duplicate URLs
CHANGE IS KEY
• Not just any change
• Critical material change
• Predicting future change
• Dropping ‘hints’ to Googlebot
• Sending Googlebot where ‘the action is’
BASED ON DATA FROM THE HISTORY LOGS - CAN WE INFLUENCE VIA CRAWL OPTIMISATION TO ESCAPE THE ‘BASE LAYER HOME’ OF THE ‘UNIMPORTANT’ URLS?
43. HERE’S ONE I MADE EARLIER… SOME CAVEATS
• THIS IS A PERSONAL PROJECT - MY 20 IN 70:20:10 MIX
• IT’S NOT MOBILE FRIENDLY OR HTTPS (HANGS HEAD IN SHAME), AND YES, IT NEEDS A MAKEOVER… BUT… TIME…, RESOURCES, BUDGET… BLAH BLAH
• THERE IS NO ‘BIG BRAND’ MARKETING, VC BACKING, TV OR RADIO ADS (LIKE COMPETITORS) - JUST ME - ‘CHIPPING AWAY’
• 90%+ OF TRAFFIC IS NON-BRANDED GENERIC ORGANIC
44. URL CRAWL FREQUENCY ‘CLOCKING’
Spreadsheet provided by @johnmu during Webmaster Hangout: https://goo.gl/1pToL8
ARE THE URLS THAT YOU WANT BEING CRAWLED ‘REAL TIME’, DAILY OR INFREQUENTLY? (REGULAR LOG ANALYSIS AND INTERVENTION TO EMPHASISE IMPORTANCE)
MY THOUGHTS (DA) - You need to find out which ones are getting crawled in the ‘real time’ schedule, the ‘daily crawl’ schedule and via random selection in the ‘dross’ (or UNLIKELY TO CHANGE A LOT / UNIMPORTANT) ‘base layer’ section. If it’s not the URLs that you want to be there, then formulate a plan to improve the ‘importance’ of URLs. (NOTE: JOHN DID NOT SAY THIS)
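A rough first pass at ‘clocking’ can be done from server access logs. The sketch below assumes logs in the common Apache/Nginx ‘combined’ format, and counts requests per URL where the user agent contains ‘Googlebot’. The bucket names and thresholds are assumptions for illustration, not Google's layers, and the user-agent string alone can be spoofed, so treat the output as indicative only.

```python
import re
from collections import defaultdict

# Matches a 'combined' format line whose user agent mentions Googlebot.
LOG_LINE = re.compile(
    r'\[[^\]]+\] "(?:GET|HEAD) (?P<url>\S+) [^"]*" \d{3} .*Googlebot'
)


def crawl_buckets(lines, days_in_log):
    """Bucket URLs by how often Googlebot requested them per day."""
    hits = defaultdict(int)
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            hits[match.group("url")] += 1

    buckets = {}
    for url, count in hits.items():
        per_day = count / days_in_log
        if per_day > 1:
            buckets[url] = "multiple daily"
        elif per_day >= 0.5:          # assumed cut-off: at least every other day
            buckets[url] = "roughly daily"
        else:
            buckets[url] = "infrequent"
    return buckets
```

Comparing the buckets against the URLs you consider important shows where intervention (internal links, sitemaps, pruning) is most needed. URLs that never appear in the log at all are worth listing separately.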
45. LOSE THE ‘DEAD WOOD’ SO GOOGLEBOT DETECTS ‘IMPORTANCE’
FIX IT FOR A BETTER CRAWL
EMBRACE THE ‘410 GONE’
FLATTENING ARCHITECTURES, CONSISTENTLY AVOIDING CANNIBALISATION, INTERNAL LINK STRATEGIES, LINKING RELEVANT CONTENT TO RELEVANT CONTENT, UTILISING XML & FRONT FACING SITEMAPS AND STRONG HUB PAGES TO ‘HERD’ GOOGLEBOT AROUND THE SITE
46. 40,000 TOWNS, CITIES & VILLAGES
40,000+ towns, cities and villages across the UK multiplied by X site categories (THAT’S A LOT OF LONG TAIL QUERY VOLUME)
47. FWIW - LONG TAIL CRAWL TECHNIQUES SEEM TO APPLY TO OTHER SEARCH ENGINES TOO
By shortening crawl paths and crawl frequency intervals, and emphasising the importance of subcategory URLs on frequently changed (fresh) URLs, it appears you may gain a competitive advantage on long tail queries
48. IT’S ALIVE… NEEDS WORK… BUT ALIVE
CAVEAT: IT’S TOO COMPLEX TO ANSWER WITH A SIMPLE FEW EXAMPLES OF COURSE (TOO MANY FACTORS) - BUT… FOOD FOR THOUGHT
‘CRITICAL MATERIAL CHANGE FREQUENCY’ (FRESHNESS) AND DETECTED URL IMPORTANCE EMPHASIS VIA EXTERNAL OR INTERNAL SIGNALS (INC PAGERANK) SEEM KEY
IS IT ‘CRAWL RANK’ OR ‘EMPHASISING URL IMPORTANCE’ BETTER THAN COMPETITORS EMPHASISE IMPORTANCE OF LOW TO NO PAGERANK PAGES, WHERE FEW OTHER FACTORS SEPARATE?
49. CRAWL BUDGET & ‘CRAWL RANK’ - OTHER FACTORS??
1. IT APPEARS TO BE APPORTIONED BY THE URL SCHEDULER (BUDGET)
2. PAGES WITH A LOT OF (HEALTHY??) LINKS GET CRAWLED MORE (EXTERNAL AND INTERNAL?) (BUDGET AND RANK?)
3. THERE ARE URL EXCLUSIONS - (‘HINT TRIPPERS’, OBJECTIONABLE CONTENT AND ‘SPAM URLS’??) (BUDGET)
4. ‘CRITICAL MATERIAL CHANGE’ (FRESHNESS) AND THE PROBABILITY AND PREDICTABILITY OF CHANGE CORRELATE (BUDGET)
5. ‘CONSISTENT’ EMPHASIS OF URL IMPORTANCE (BUT I THINK THAT THIS WAS ALWAYS THERE) MAY BE ‘CRAWL RANK’ (BUDGET AND RANK??)
‘CRAWL RANK’ - IS IT CORRELATION OR CAUSATION? (DO IMPORTANT PAGES GET CRAWLED MORE, OR IS IT BECAUSE THEY ARE CRAWLED MORE THEY ARE IMPORTANT?)
50. CAN WEB PAGES CRAWLED INFREQUENTLY STILL RANK?
YES - THEY CAN STILL BE ‘IMPORTANT’
IT’S THE ONES YOU’RE INDICATING ARE UNIMPORTANT THAT YOU WANT TO KEEP AN EYE ON - #JUSTSAYING ;)
51. “BE SMART ABOUT YOUR TAGS AND SITE ARCHITECTURE, STAY FRESH AND RELEVANT” (@maileohye, 2016)
SLIDE FROM APRIL 2016’S SEJSUMMIT ON SEO INSTRUCTIONS 2016 FROM GOOGLE’S @maileohye
52. EITHER WAY - ARE ALL THE CHECKS AND BALANCES INDICATING YOU ARE STILL ON TRACK?
BECAUSE - BRINGING A ROCKET BACK ON COURSE IS ‘CHALLENGING’
REGULAR TESTS AND EARLY DIAGNOSIS ARE CRUCIAL - STOP, CHECK AND KEEP CHECKING
‘TANK’ OR ‘RANK’? - YOU DECIDE
53. THANKS FOR LISTENING FOLKS :)
Dawn Anderson @dawnieando
TWITTER - @dawnieando
GOOGLE+ - +DawnAnderson888
LINKEDIN - msdawnanderson
ENJOY BRIGHTON SEO
54. REFERENCES
• Total number of websites - http://www.internetlivestats.com/total-number-of-websites/
• Scheduler for search engine crawler, Google Patent US 8042112 B1 (Zhu et al) - https://www.google.com/patents/US8707313
• Managing items in crawl schedule, Google Patent (Alpert) - http://www.google.ch/patents/US8666964
• Document reuse in a search engine crawler, Google Patent (Zhu et al) - https://www.google.com/patents/US8707312
• Web crawler scheduler that utilizes sitemaps (Brawer et al) - http://www.google.com/patents/US8037054
• Distributed crawling of hyperlinked documents (Dean et al) - http://www.google.co.uk/patents/US7305610
• Minimizing visibility of stale content (Carver) - http://www.google.ch/patents/US20130226897
55. REFERENCES
• Efficient Crawling Through URL Ordering (Page et al) - http://oak.cs.ucla.edu/~cho/papers/cho-order.pdf
• Crawl Optimisation (Blind Five Year Old - A J Kohn - @ajkohn) - http://www.blindfiveyearold.com/crawl-optimization
• Scheduling a recrawl (Auerbach) - http://www.google.co.uk/patents/US8386459
• Scheduler for search engine crawler (Zhu et al) - http://www.google.co.uk/patents/US8042112
• Google Explains Why The Search Console Reporting Is Not Real Time (SERoundtable) - https://www.seroundtable.com/google-explains-why-the-search-console-has-reporting-delays-21688.html
• Crawl Data Aggregation Propagation (Mueller) - https://goo.gl/1pToL8
• Matt Cutts Interviewed By Eric Enge - https://www.stonetemple.com/matt-cutts-interviewed-by-eric-enge-2/
• Web Promo Q and A with Google’s Andrey Lipattsev - https://searchenginewatch.com/2016/04/06/webpromos-qa-with-googles-andrey-lipattsev-transcript/
• Google Number 1 SEO Advice - Be Consistent - https://www.seroundtable.com/google-number-one-seo-advice-be-consistent-21196.html