9. @dawnieando from @MoveItMarketing
‘URL CRUFT’ IS A THING
“characters relevant or meaningful only to the people who created the site, such as implementation details of the computer system which serves the page. Examples of URL cruft include filename extensions such as .php or .html, and internal organizational details such as /public/ or /Users/john/work/drafts/.”
(Wikipedia Definition)
10. ALL THE RANDOM CRAP PEOPLE ADD TO QUERY STRINGS, PARAMETERS, DIRECTORY FOLDERS AND URL STRUCTURES
12. “COOL URIs DON’T CHANGE”
Sir Tim Berners-Lee (Inventor of the World Wide Web)
https://www.w3.org/Provider/Style/URI
Attribution: By Uldis Bojārs (Flickr) [CC BY-SA 2.0 (http://creativecommons.org/licenses/by-sa/2.0)], via Wikimedia Commons
16. 404 NOT FOUND & 410 GONE
§ “Of course, we won’t redirect everything…”
§ “Not everything will be worth redirecting”
19. 302 == Default
301 == Intentional
404 == Default
410 == Intentional
“The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed.” (RFC 7231)
https://tools.ietf.org/html/rfc7231#section-6.5.9
ARE YOU SURE?
MAYBE YES
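The default-versus-intentional distinction can be sketched as a small routing function. This is a minimal illustration rather than any real server's behaviour: the lookup tables `PERMANENT_MOVES`, `RETIRED` and `KNOWN`, and every path in them, are hypothetical.

```python
from http import HTTPStatus

# Hypothetical lookup tables, for illustration only.
PERMANENT_MOVES = {"/old-page": "/new-page"}  # deliberate, lasting moves
RETIRED = {"/discontinued-product"}           # deliberately removed resources
KNOWN = {"/", "/new-page"}                    # resources that still exist


def respond(path):
    """Return (status, redirect_target) for a requested path."""
    if path in PERMANENT_MOVES:
        # 301: an intentional statement that the resource has moved for good.
        return HTTPStatus.MOVED_PERMANENTLY, PERMANENT_MOVES[path]
    if path in RETIRED:
        # 410: an intentional statement that the resource was removed.
        return HTTPStatus.GONE, None
    if path in KNOWN:
        return HTTPStatus.OK, None
    # 404: the default answer when nothing is known about the path.
    return HTTPStatus.NOT_FOUND, None
```

The point of the sketch is that 301 and 410 only appear where someone made a decision and recorded it; everything else falls through to the default.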
21. DO NOT THINK 410s WON’T BE RECRAWLED AGAIN
Source: https://www.docsplace.org/4578/09/410-gone-stops-crawling-dead-urls/
22. “We knew there was content there at some point so we just swing by every now and then to see if anything came back” (John Mueller, 2016)
In Reality… Gone Is Never Gone
23. ZOMBIES ARE NEVER GONE
NO URLS ARE EVER GONE
ONLY THE RESOURCE THERE IS GONE
https://www.seroundtable.com/google-410-indexing-22584.html
5 YEARS LATER
30. INCREMENTAL CRAWLING NEVER ENDS
“Crawling method based on crawl frequency based on URL historical change & importance rate”
Crawling Which Never Ends
Ongoing
34. PAST DATA ON CHANGE IS A GREAT PREDICTOR OF FUTURE DATA
PREDICTION BASED PRIORITY SCHEDULING … WHEN THERE IS CONSISTENCY
“past changes to a page are a good predictor of future changes. This result has practical implications for incremental web crawlers that seek to maximize the freshness of a web page collection or index.”
(
40. History Log Records Include:
• URL fingerprint
• Timestamp (last crawl or download attempt)
• Crawl status (success or error) (Response code)
• Content checksum (binary code)
• Source ID (accessed from cache or downloaded)
• Segment identifier (Crawl segment assigned to??)
• Page importance (a measure of importance assigned to the URL)
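The fields listed above could be modelled as a simple record type. This is only a sketch of the idea: the field names, the Python types, and the `content_changed` helper are assumptions made for illustration, not the patent's actual storage format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class HistoryRecord:
    """One crawl-history entry for a URL (field types are assumed)."""
    url_fingerprint: int      # hash identifying the URL
    timestamp: datetime       # last crawl or download attempt
    crawl_status: int         # response code, e.g. 200, 404, 410
    content_checksum: str     # checksum of the downloaded content
    source_id: str            # e.g. "cache" or "download"
    segment_id: int           # crawl segment the URL was assigned to
    page_importance: float    # importance score assigned to the URL


def content_changed(previous, current):
    """Hypothetical helper: did the content differ between two crawls?"""
    return previous.content_checksum != current.content_checksum
```

Comparing checksums across consecutive records is one plausible way such a log could feed a change-rate estimate.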
41. “The URL page importance score can be retrieved from the … URL history log … or it can be obtained by obtaining the historical page importance score for the URL for a predefined number of prior crawls and then performing a predefined filtering function on those values to obtain the URL page importance score.”
Scheduler for Search Engine Crawler
https://www.google.com/patents/US8042112
Importance record per crawl:
DOC ID     CRAWL 1  CRAWL 2  CRAWL 3  CRAWL 4  CRAWL 5  CRAWL 6
DOC ID 1   1        0.8      0.6      0.4      0.2      0
DOC ID 2   0        0.2      0.4      0.6      0.8      1
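To make the "predefined filtering function" concrete, here is a sketch that averages the most recent crawls. The patent text quoted above does not say which filter is used, so the simple moving average and the `window` parameter are assumptions; the input values are the two rows from the table.

```python
def filtered_importance(history, window=3):
    """Average the last `window` importance scores (assumed filter)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = history[-window:]
    if not recent:
        raise ValueError("history must contain at least one score")
    return sum(recent) / len(recent)


doc_1 = [1, 0.8, 0.6, 0.4, 0.2, 0]   # importance falling crawl by crawl
doc_2 = [0, 0.2, 0.4, 0.6, 0.8, 1]   # importance rising crawl by crawl

score_1 = filtered_importance(doc_1)  # averages 0.4, 0.2 and 0
score_2 = filtered_importance(doc_2)  # averages 0.6, 0.8 and 1
```

With a three-crawl window the falling document ends up near 0.2 and the rising one near 0.8, so recent history outweighs where each URL started.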
42. URL_SEEN TEST
YOU CAN’T JUST KEEP TRYING TO JUMP THE INDEXING QUEUE EITHER
PUSH INDEXING: E.G. FETCH AS GOOGLEBOT & SUBMIT TO INDEX
PULL INDEXING: VISITS BY NATURAL CRAWLING & DISCOVERY OF URLS / URL VISIT SCHEDULING / REVISITS
43. ‘Sampling’ in Crawling for Efficiency
‘SMALL TEST VISITS TO A SITE TO UNDERSTAND WHETHER IT IS WORTH CRAWLING & UNDERSTAND URL PATTERNS & RESOURCES THERE’
46. Aged ‘Patchwork Quilt’ Sites
A LITTLE BIT OF THIS CMS AND A LITTLE BIT OF THAT CMS
MANY HISTORICAL PARAMETERS CREATED & CRAWLING SAMPLE PATTERNS
47. Every Version of Your Past Ecommerce Sites
Had potential to spew… at some point…
“Exponentially multiplicative URLs”
DIFFERENT PARAMETERS & URL PATTERNS WHICH ARE LEARNED BY CRAWLERS… AND REMEMBERED… FOREVER
51. REUSE: LOW IMPORTANCE and / or DOESN’T CHANGE OFTEN
REUSE IF NOT MODIFIED SINCE: LIKELY TO CHANGE BY X DATE (SINCE DATE)
DOWNLOAD: CHANGES FREQUENTLY WITH IMPORTANT CHANGE OR IS AN IMPORTANT DOCUMENT
https://www.google.com/patents/US8042112
53. YOU BROKE YOUR SILO STRUCTURE
Image credit: https://www.slideshare.net/patrickstox/nlp-sitemap-smx-2016-patrick-stox-latest-in-advanced-technical-seo
SEMANTIC LOSS
54. ‘CONCEPT DRIFT’ IS A THING
fuzzy: difficult to perceive; indistinct or vague.
synonyms: blurry, blurred, indistinct, unclear, bleary, misty, distorted, out of focus, unfocused, lacking definition, low resolution, nebulous; ill-defined, indefinite, vague, hazy, imprecise, inexact, loose, woolly
"a fuzzy picture"
https://en.wikipedia.org/wiki/Concept_drift
AI ALERT
67. DIAGNOSE: Validate & Retain in GSC ALL Past Domains & Past Site Versions (Protocols: HTTPS / HTTP)
THERE MAY STILL BE UNDETECTED ACTIVITY GOING ON THERE
68. URL Parameter Handling is Your Friend
Help Google Build ‘Crawling Rules’ for your site rather than wasting time on ‘sampling’ and giving a bad impression
GIVE HELP AND GUIDANCE WITH THE CRAWL RULE AND HINT BUILDING
69. Help Google Build ‘Crawling Rules’ for your site rather than wasting time on ‘sampling’ and giving a bad impression
BE VERY CAREFUL
75. REVIEW & UNDERSTAND - THE CANONICAL LINK RELATION
§ 30X redirects
§ Canonical tag
§ Hreflang
§ HTTPS protocol
§ Global canonicalization rules
§ URL normalization
In ‘ALL’ its forms
78. DIAGNOSE: SERVER LOG FILE ANALYSIS
BUT WATCH OUT FOR OTHER TOOLS EMULATING GOOGLEBOT AND FILTER THEM OUT
ANALYSE THE LOGS FOR ‘ALL’ YOUR SITES AND ‘ALL’ PROTOCOLS TO SEE THE PATTERNS EMERGE
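One way to filter out tools that only claim to be Googlebot is the reverse-then-forward DNS check that Google documents for verifying its crawler. The sketch below assumes IPv4 addresses and assumes the hostname suffixes in `GOOGLE_SUFFIXES`; the function name and the injectable `reverse` / `forward` resolvers are hypothetical and exist only so the logic can be exercised without network access.

```python
import socket

# Assumed hostname suffixes for genuine Googlebot reverse DNS records.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_genuine_googlebot(ip, reverse=None, forward=None):
    """Reverse-resolve the IP, check the domain, then forward-confirm it."""
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or socket.gethostbyname  # IPv4 only
    try:
        host = reverse(ip)
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The hostname must resolve back to the IP seen in the log line.
        return forward(host) == ip
    except OSError:
        # Failed lookups are treated as "not verified".
        return False
```

Log lines whose user agent says Googlebot but whose IP fails this check can be dropped before looking for crawl patterns.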
80. REVISIT ALL PAST .HTACCESS FILES
Can you rewrite the rules to be more efficient with regex or cut out some old rules still firing unnecessarily? (CREATE SHORTCUTS)
REMEMBER .HTACCESS RULES RUN IN ORDER OF THEIR APPEARANCE IN THE FILE. CAN YOU USE WILDCARDS TO OPTIMIZE OR SKIP STEPS?
.HTACCESS SITE 1
.HTACCESS SITE 2
.HTACCESS SITE 3
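The "create shortcuts" idea can be illustrated by collapsing redirect chains gathered from several generations of rules, so that every old URL points straight at its final destination. This is a sketch under stated assumptions: the redirects are modelled as a plain source-to-target dictionary, and the function name and sample paths are hypothetical. It does not parse real .htaccess syntax.

```python
def flatten_redirects(redirects):
    """Map every source directly to the end of its redirect chain."""
    flat = {}
    for source in redirects:
        seen = {source}
        target = redirects[source]
        # Follow the chain until the target is no longer redirected itself.
        while target in redirects:
            if target in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        flat[source] = target
    return flat


# Hypothetical rules accumulated over three site versions.
legacy = {"/site1/page": "/site2/page", "/site2/page": "/site3/page"}
shortcuts = flatten_redirects(legacy)
```

After flattening, a request for the oldest URL needs one hop instead of two, and loops left behind by old rules surface as errors instead of silent chains.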
82. Help Googlebot Get Round its Shopping List
OPEN MORE CHECKOUTS
WIDEN THE AISLES
MAKE THINGS EASY TO FIND
DON’T CONFUSE GOOGLEBOT
HELP FILL THE TROLLEY QUICKLY
SPEED, SPEED, SPEED
83. XML Sitemaps Are Your Friend… (Strong Foundations)
They help to pass ‘importance’ signals to URLs
But… never leave them to just autogenerate without periodically checking
‘The foundations’ underneath a site
84. EXTERNALLY HOSTED XML SITEMAPS
• Take back control
• Jump the dev queue
• Allows for custom configuration of optimal canonical click paths
• Allows for consistent signals of importance to included URLs
• Forget about setting priority
• Forget about last modified
• Even a simple list of URLs will do FTW
• Keep them organised for granular analysis of problem site sections
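As the list says, a sitemap can be little more than a list of URLs. The sketch below builds such a file with Python's standard library, leaving out priority and last-modified on purpose. The function name and the example URLs are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal XML sitemap containing only <loc> entries."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # Special characters such as & are escaped by ElementTree.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical URLs for one site section.
sitemap_xml = build_sitemap([
    "https://www.example.com/",
    "https://www.example.com/shoes?colour=red&size=9",
])
```

Generating one file per site section with a function like this keeps the sitemaps organised for the granular analysis mentioned in the last bullet.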
85. INSTEAD OF REMOVE… CONSIDER… DISTRACT & ITERATIVELY IMPROVE
STRATEGIC USE OF INTERNAL LINK POPULARITY
REDUCE IMPORTANCE SIGNALS TO DIFFERENT PAGES
INCLUDE IMPORTANT PAGES IN XML SITEMAPS
INCLUDE IMPORTANT PAGES IN HTML SITEMAPS
86. BUILD WELL CATEGORIZED AND CONCEPTUALLY STRUCTURED SITEMAPS
https://www.slideshare.net/patrickstox/nlp-sitemap-smx-2016-patrick-stox-latest-in-advanced-technical-seo
87. SOLUTION: Increase ‘Importance’ quickly of target URLs
• Internal link optimization
• Canonicalise to (if relevant)
• Strengthen up importance signals
• Inclusion in front facing HTML and XML sitemaps
• Improve the content & keep it updated
• 301 redirect to (if relevant redundant content)
• Topical hubs and strong information views to navigate users & add relevance
88. SOLUTION: Reduce ‘Importance’ quickly of old URLs
• Internal link UNOPTIMIZATION
• 410
• Dig out URLs with links to them
• Orphan URLs
• Canonicals to HTTPS
• EXCLUSION from XML sitemaps (even old ones on the server)
• Archiving of content
90. IT’S VERY IMPORTANT… YOU STAY OUT OF SERVER ERROR STATUS 500
‘Try again’ intervals likely extended between each failed connection attempt
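The slide only says the retry intervals are "likely extended", so the following is an illustration of what growing intervals look like, not a description of how any search engine actually schedules retries. The doubling factor, the base delay, and the function name are all assumptions.

```python
def retry_delays(base_seconds, failures, factor=2.0, cap_seconds=None):
    """List the wait before each retry, growing after every failure."""
    delays = []
    delay = base_seconds
    for _ in range(failures):
        # Optionally cap the wait so it cannot grow without bound.
        delays.append(delay if cap_seconds is None else min(delay, cap_seconds))
        delay *= factor
    return delays


# Hypothetical example: four failed attempts starting from one minute.
waits = retry_delays(60, 4)
```

Under these assumed numbers the fourth retry waits eight times as long as the first, which is why a run of 500 errors can push a URL's next visit far into the future.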
93. 410s Likely Get Deindexed Quicker
https://plus.google.com/+JohnMueller/posts/NEsqE7Sr4Z4
“Usually seeing it (410) 1-2 times is enough for us to drop those URLs from the index”
John M on Google+ (https://plus.google.com/u/0/+JohnMueller/posts/NEsqE7Sr4Z4)
94. LEGACY ISSUES VIA CANONICALS OR REDIRECTION (COMMON MISTAKES)
• PAGE CANONICALIZED TO IS NOT A SUPERSET OR DUPLICATIVE (IT IS NOT RELEVANT ENOUGH)
• 301s TO IRRELEVANT PAGES BECOME SOFT 404
• FOLDING UP PRODUCT PAGES TO CATEGORIES (PEOPLE WERE LOOKING FOR A SPECIFIC PRODUCT)
• CANONICALIZATION TO PAGES WHICH IN THE FUTURE 301 REDIRECT TO ANOTHER URL, THEREFORE NEGATING THE PAGES CANONICALIZING TO THEM
• CONFLICTS BETWEEN HREFLANG AND CANONICALIZATION
95. MORE CAUSES
SEARCH ENGINES ARE CRAWLING MORE CODE THAN YOU MIGHT HAVE INTENDED IN THE FIRST PLACE
JAVASCRIPT ERRORS FROM LEGACY CODE & LIBRARIES
LEGACY 302s FROM REDIRECTED LEGACY DOMAINS WHICH CONFUSE INTERMEDIATE SIGNALS BETWEEN 301s (WHICH ARE INTENDED DEFINITE REDIRECTIONS)
ABANDONED URLS
AJAX URLS (NOT THE SAME AS THE NAMED ANCHOR) – DEPRECATION OF AJAX CRAWLING (ASYNCHRONOUS JAVASCRIPT & XML)
96. “If “change” means “any change”, then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].” (Broder, A.Z., Najork, M. and Wiener, J.L., 2003)
EVEN AS FAR BACK AS 2003
40% of ALL web pages changed weekly
7% of web pages changed a 1/3 of their page content or more weekly
97. HOW MUCH BIGGER & DYNAMIC IS THE WEB NOW IN 2017?
http://www.internetlivestats.com/total-number-of-websites/
99. THESE THINGS ADD UP
THEY ALSO STILL NEED TO BE DISCOVERED WHICH REQUIRES INITIAL CRAWLING
https://twitter.com/dawnieando/status/906465965029969920
100. “404 vs 410 doesn't affect the recrawl rate: we'll still occasionally check to see if these pages are still gone, especially when we spot a new link to them”
John Mueller, Google+ 2015
https://plus.google.com/u/0/+JohnMueller/posts/NEsqE7Sr4Z4
ESPECIALLY IF THERE ARE LINKS TO IT
102. THINK CAREFULLY ABOUT URL CREATION
Not EVERYTHING is worthy of its own URL
VARIANTS
STEMMINGS
PLURALS
RANDOM TAGS
LONG, LONG, LONG TAIL PARAMETERS
103. ONLY DOWNLOAD IF THERE IS SUBSTANTIVE CHANGE
TAKE SOME CONTROL WITH 304 & EXPIRES AFTER HEADERS ON LESS IMPORTANT PAGES
https://developers.google.com/web/fundamentals/performance/optimizing-content-efficiency/http-caching
VALID REPRESENTATION
THE URL WILL STILL BE VISITED BUT 0 (ZERO) WILL BE DOWNLOADED SO IT IS STRAIGHT ON TO THE NEXT URL VERY QUICKLY
https://webmasters.googleblog.com/2006/09/better-details-about-when-googlebot.html
https://tools.ietf.org/html/rfc7232#section-4.1
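The 304 behaviour described above comes down to comparing the page's last-modified time with the `If-Modified-Since` request header. The sketch below shows that comparison only; the function name is hypothetical, it assumes the server knows a timezone-aware last-modified time for the page, and it ignores ETags and other validators.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_status(last_modified, if_modified_since):
    """Return 304 if the page is unchanged since the header date, else 200."""
    if if_modified_since is None:
        return 200  # no validator sent: serve the full page
    try:
        since = parsedate_to_datetime(if_modified_since)
        # HTTP dates have one-second resolution, so drop microseconds.
        unchanged = last_modified.replace(microsecond=0) <= since
    except (TypeError, ValueError):
        return 200  # unparseable or incomparable date: ignore the header
    return 304 if unchanged else 200


# Hypothetical page last changed on 5 February 2004 (UTC).
page_last_modified = datetime(2004, 2, 5, tzinfo=timezone.utc)
status = conditional_status(page_last_modified, "Thu, 05 Feb 2004 00:00:00 GMT")
```

When the function returns 304, the response carries no body, which is the "zero downloaded" outcome the slide describes.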
104. A URI is like a fine wine
Maturing over time
“COOL URIs DON’T CHANGE”
Sir Tim Berners-Lee (Inventor of the World Wide Web)
https://www.w3.org/Provider/Style/URI
105. A LONG, LONG TIME AGO
• You need to go right back to the beginning
• What domains did the organisation EVER register?
• Where do they redirect to?
• Is it via 301, 302 or are they merely parked domains?
• Who would know? Who is responsible?
• Verify them all in Google Search Console
• Some of these may EVEN HAVE PENALTIES HISTORICALLY
• If there are links to any there is likely still crawling activity there
• Analyse logs across multiple subdomains & protocols
106. QUESTIONS TO ASK
HOW MANY MICRO-SITES HAVE YOU HAD?
HOW MANY SUBDOMAINS?
HOW MANY OTHER DOMAINS?
WHO IS RESPONSIBLE FOR DOMAIN REG
WHO KNOWS WITHIN THE ORGANISATION?
WHO REGISTERED THE DOMAINS?
WHO CAN UPDATE DNS RECORDS?
ARE THESE SITES STILL ON SERVERS?
HAVE ANY OF THESE SITES HAD MANUAL ACTIONS?
HOW ARE THESE SITES REDIRECTED?
ARE THEY PARKED DOMAINS?
108. SOLUTION: REVISITING BLOATED APPENDED .HTACCESS FILES ON ALL LEGACY SITES (IF NOT REDIRECTING AT A DNS LEVEL)
NOT JUST THE .HTACCESS FILE ON THE EXISTING SITE EITHER. GOOGLEBOT MAY HIT .HTACCESS ON PAST SITES SO THEY MAY ALSO NEED OPTIMIZING
.HTACCESS RULES RUN IN ORDER SO PROVIDE OPPORTUNITY FOR SHORT CUTS
109. SOME TYPES OF URL CRUFT
• INCORRECTLY APPLIED CANONICAL TAGS
• CONFLICTING HREFLANG & CANONICAL TAGS
• MIXED CONTENT
• URL SHORTENERS
• SESSION IDS
• UTM TAGGING
• OLD AJAX FRAGMENTS
• PARAMETERS FROM MULTI FACET DROP DOWN CHOICES
• .html, .php, .index.html, .aspx
• LEGACY URL REWRITING & PARAMETERS IN .HTACCESS FILES
• LEGACY FOLDERS WHICH CONTRIBUTE NO MEANING TO SITE ONTOLOGY
UNCRUFTY: www.myeasyurlwillmakeyouwonder.com/resume
CRUFTY: www.myeasyurlwillmakeyouwonder.com/resume.html
CRUFTY: http://nymag.com/scienceofus/2015/07/how-to-recover-from-an-all-nighter.html?om_rid=AAENcg&om_mid=_BTtFa0B869PyJp&utm_content=buffer8fdd1&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
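Tracking parameters like those in the last example can be stripped when normalising URLs for analysis. The sketch below removes query parameters by prefix; the `TRACKING_PREFIXES` tuple is an assumption chosen to match the example above, and the function name is hypothetical. It does not address file extensions or legacy folders.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed prefixes of parameters that carry no meaning for the page itself.
TRACKING_PREFIXES = ("utm_", "om_")


def strip_tracking(url):
    """Drop tracking parameters and keep everything else unchanged."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


crufty = (
    "http://nymag.com/scienceofus/2015/07/how-to-recover-from-an-all-nighter.html"
    "?om_rid=AAENcg&om_mid=_BTtFa0B869PyJp&utm_content=buffer8fdd1"
    "&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer"
)
cleaned = strip_tracking(crufty)
```

Parameters that do change the content, such as a product ID, pass through untouched because only the listed prefixes are removed.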
110. INDEX TIERING
Presented by B Cambazoglu at European Summer School Information Retrieval 2017 – (Cambazoglu, B.B. and Baeza-Yates, R., 2011. Scalability challenges in web search engines. In Advanced topics in information retrieval (pp. 27-50). Springer Berlin Heidelberg.)
112. TWO-PHASE RANKING IN A SEARCH NODE
Presented by B Cambazoglu at European Summer School Information Retrieval 2017 – (Cambazoglu, B.B. and Baeza-Yates, R., 2011. Scalability challenges in web search engines. In Advanced topics in information retrieval (pp. 27-50). Springer Berlin Heidelberg.)
114. EVERY SINGLE TIME YOU MIGRATE, CHANGE DESIGN, REDIRECT, REINVENT A SITE / URL
A CLEAN START
REDIRECTIONS
ANOTHER STRUCTURE
FIRST SITE STRUCTURE
NEW CRAWLING ‘RULES’ BUILT
CRAWLING ‘RULES’ BUILT
EVERYTHING IS ‘200 OK’
MORE URLs
MIXED RESPONSE CODES
REDIRECTIONS
‘FUZZINESS’ IS EMERGING
NEW CRAWLING ‘RULES’ BUILT
MORE URLs
REDIRECT CHAINS & MIXED RESPONSE CODES
NEW SEO’s DON’T KNOW THE ‘HISTORY’
TARGET URLs NOW ‘VERY FUZZY’
118. The Generational ‘Snail Trail’
• Old XML sitemaps
• Redirects drop away on old site .htaccess
• DNS issues
• People link to old site but wrong protocol
• Old sites not verified in GSC
• Not all protocols redirecting
Leaving its slithery footprint
120. REDUCTION & REPOPULATION OF INTERNAL LINK POPULARITY (IBP) BETWEEN URL SCHEDULING
IT’S NOT ONLY THEIR ‘INTERNAL PAGE RANK’ BUT ALSO THE ANCHORS, INTER-CONNECTING CONCEPTUAL / TOPIC RELEVANCE IN CONTENT AND THE TEXT SURROUNDING INTERNAL LINK ANCHORS (AND PROBABLY OTHER THINGS TOO)
SEMANTIC ‘CLUES’ WERE LOST ALONG THE WAY
SEMANTIC ‘CONTEXT’ & IBP BUCKET IS LEAKING
123. THE USE OF REUSE TABLES
TABLE I: Reuse Table Example
URL Record No.   URL Fingerprint (FP)   Reuse Type                     If Modified Since
...              ...                    ...                            ...
1                2123242                REUSE
2                2323232                REUSE IF NOT MODIFIED SINCE    Feb. 5, 2004
3                3343433                DOWNLOAD
...              ...                    ...                            ...
https://www.google.com/patents/US8042112
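The reuse table can be read as a lookup that tells the crawler whether a fresh download is needed. The sketch below encodes the three example rows; the `should_download` function, the fallback for unknown fingerprints, and the way `last_modified` is supplied are assumptions made for illustration and are not taken from the patent.

```python
from datetime import datetime
from enum import Enum


class ReuseType(Enum):
    REUSE = "REUSE"
    REUSE_IF_NOT_MODIFIED_SINCE = "REUSE IF NOT MODIFIED SINCE"
    DOWNLOAD = "DOWNLOAD"


# The three example rows: fingerprint -> (reuse type, if-modified-since date).
REUSE_TABLE = {
    2123242: (ReuseType.REUSE, None),
    2323232: (ReuseType.REUSE_IF_NOT_MODIFIED_SINCE, datetime(2004, 2, 5)),
    3343433: (ReuseType.DOWNLOAD, None),
}


def should_download(fingerprint, last_modified=None):
    """Decide whether to fetch a fresh copy or reuse the stored one."""
    # Assumption: URLs missing from the table are always downloaded.
    reuse_type, since = REUSE_TABLE.get(fingerprint, (ReuseType.DOWNLOAD, None))
    if reuse_type is ReuseType.REUSE:
        return False
    if reuse_type is ReuseType.REUSE_IF_NOT_MODIFIED_SINCE:
        # Assumption: with no known modification time, download to be safe.
        return last_modified is None or last_modified > since
    return True
```

Only the middle row depends on anything the server reports; the other two are decided from the table alone.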
126. Sources & References
• Bar-Yossef, Z., Keidar, I. and Schonfeld, U., 2009. Do not crawl in the dust: different urls with similar text. ACM Transactions on the Web (TWEB), 3(1), p.3.
• Broder, A.Z., Najork, M. and Wiener, J.L., 2003, May. Efficient URL caching for world wide web crawling. In Proceedings of the 12th international conference on World Wide Web (pp. 679-689). ACM.
• Cambazoglu, B.B. and Baeza-Yates, R., 2011. Scalability challenges in web search engines. In Advanced topics in information retrieval (pp. 27-50). Springer Berlin Heidelberg.
• Cho, J., Garcia-Molina, H. and Page, L., 1998. Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1), pp.161-172.
• Fetterly, D., Manasse, M., Najork, M. and Wiener, J., 2003, May. A large-scale study of the evolution of web pages. In Proceedings of the 12th international conference on World Wide Web (pp. 669-678). ACM.
127. Sources & References
• Olston, C. and Najork, M., 2010. Web crawling. Foundations and Trends® in Information Retrieval, 4(3), pp.175-246.
• Pandey, S. and Olston, C., 2008, February. Crawl ordering by search impact. In Proceedings of the 2008 International Conference on Web Search and Data Mining (pp. 3-14). ACM.
• Olston, C. and Pandey, S., 2008, April. Recrawl scheduling based on information longevity. In Proceedings of the 17th international conference on World Wide Web (pp. 437-446). ACM.
• Pandey, S. and Olston, C., 2005, May. User-centric web crawling. In Proceedings of the 14th international conference on World Wide Web (pp. 401-411). ACM.
128. Sources & References
• https://patentimages.storage.googleapis.com/US8042112B1/US08042112-20111018-D00000.png
• Randall, K.H., Google Inc., 2010. Scheduler for search engine crawler. U.S. Patent 7,725,452.