SlideShare ist ein Scribd-Unternehmen logo
1 von 23
Downloaden Sie, um offline zu lesen
Spanner: Google’s Globally- 
Distributed Database 
J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. 
Hochschild, and others, “Spanner: Google’s globally distributed database,” ACM Transactions on Computer 
Systems (TOCS), vol. 31, no. 3, p. 8, 2013.
Helpful Definitions 
Sharding: is a type of database partitioning that separates very large 
databases the into smaller, faster, more easily managed parts called 
data shards. The word shard means a small part of a whole. 
Replication: Sharing of information to improve availability, durability 
1. Transactional replication. This is the model for replicating transactional 
data, for example a database or some other form of transactional storage 
structure with one-copy serializability. 
2. State machine replication. This model assumes that replicated process is a 
deterministic finite automaton and that atomic broadcast of every event is 
possible, usually implemented by a replicated log consisting of multiple 
subsequent rounds of the Paxos algorithm. 
3. Virtual synchrony. This computational model is used when a group of 
processes cooperate to replicate in-memory data or to coordinate actions.
Helpful Definitions 
Clock drift: several related phenomena where a clock does not run at 
the exact right speed compared to another clock. That is, after some 
time the clock "drifts apart" from the other clock. 
- Everyday clocks such as wristwatches have finite precision. 
Eventually they require correction to remain accurate. The rate of 
drift depends on the clock's quality, sometimes the stability of the 
power source, the ambient temperature, and other subtle 
environmental variables. Thus the same clock can have different 
drift rates at different occasions. 
- Atomic clocks are very precise and have nearly no clock drift. Even 
the Earth's rotation rate has more drift and variation in drift than an 
atomic clock. 
- As Einstein predicted, relativistic effects can also cause clock drift 
due to time dilation or gravitational distortions.
Wilson Hsieh, Google Inc presents Spanner at OSDI 2012 
https://www.usenix.org/conference/osdi12/technical-sessions/presentation/corbett
Spanner in a Nutshell 
1. Lock-Free distributed read only transactions 
2. External consistency of writes 
3. Temporal multiple versioning 
4. Schematized, semi-relational database with 
associated structured query language 
5. Applications control replication and placement 
6. All of this because of TrueTime
Implementation 
Spanner “Universe” 
Zones 
DataCenters
Zone Server Organization 
- zonemaster (1 per zone) assigns 
data to the span servers (similar 
to NameNode in GFS) 
- spanserver (100 - 3000 per 
zone) serves data to the clients 
- Proxies are used by clients to 
locate span servers with the 
correct data (shard lookup) these 
are used to prevent bottleneck at 
zonemaster 
1 per universe: 
- universemaster has a console 
with status information 
- placement driver handles data 
movement for replication or load 
balancing purposes
Storage Data Model 
(key:string, timestamp:int64) → 
string 
- Tablet: a bag of timestamped key,val pairs as above 
- Tablet’s state is tracked and stored with B-tree-like files and a 
write-ahead log. Data storage is on a distributed filesystem 
(Colossus, the successor to GFS at Google). 
Tablets are optimized data structures first implemented by BigTable to 
track data by row/column position (the keys) and their locations on 
GFS. In BigTable (and presumably Spanner as well) Tablets are 
compressed using Snappy to keep them approximately 200 MB in size. 
Tablets (for now) can be seen as the basic unit of replication/sharding.
Zone Software Stack 
- each tablet gets a Paxos 
state machine that maintains 
its log and state in the tablet 
(Paxos writes twice) 
- Paxos has long lived leaders 
(10 seconds) and is pipelined 
to improve for WAN latencies 
- Paxos is used to maintain 
consistency among replicas; 
every write must initiate at the 
Paxos leader. 
- The set of replicas is 
collectively a Paxos group.
Zone Software Stack Continued 
- Each leader maintains a lock 
table with state for two phase 
locking, mapping range of 
keys to lock states. 
- Long lived leader and is 
critical, design for long-lived 
transactions (reports). 
- Transaction manager is used 
to implement distributed 
transactions and to select a 
participant leader, which 
coordinates Paxos between 
participant groups.
Buckets* and Data Locality 
- Directories: collection of contiguous keys that share a prefix. 
- Prefixes allow applications to control the locality of their data. 
- All data in a directory has same replica configuration; movement of 
data between Paxos groups is via directory, in order to store data in 
a group closer to its accessors. 
- Spanner Tablet != BigTable Tablet, Spanner containers 
encapsulate multiple partitions of row space. 
- Smallest unit of geographic replication properties (placement)
Application Data Model 
- schematized, semi-relational 
tables and a structured query 
language + general purpose 
transactions. 
- Is 2PC too expensive? Leave 
it to the application to deal 
with performance. 
- On top of Buckets an 
application creates a 
database in the universe, 
composed of tables. 
- But not actually relational; all 
rows must have primary keys.
TrueTime 
- Time is an interval with timestamp endpoints 
- TT.now() → returns a TTinterval: [earliest, latest] 
- TT.after(t) → returns True iff t > latest 
- TT.before(t) → returns True iff t < earliest 
-  is the instantaneous error bound (half the intervals width) 
- Clients poll TimeMasters (GPS and Atomic Clocks) and use 
Marzullo’s algorithm to detect and reject liars. Also detect bad local 
times or clocks. 
- Between polls,  gets increasingly larger until the next poll and 
this increase is derived from worst-case local clock drift 
(significantly larger than expected).
Marzullo’s Algorithm 
- An agreement algorithm used to select sources for estimating 
accurate time from a number of noisy time sources. 
- A refined version of it, renamed the intersection algorithm, forms 
part of the modern NTP. 
- Usually the best estimate is taken to be the smallest interval 
consistent with the largest number of sources.
Concurrency Control 
- read-only transactions 
- Not simply read-write transactions w/o writes 
- execute at a system chosen timestamp 
- non-blocking (incoming writes aren’t locked) 
- proceeds on any replica sufficiently up to date 
- snapshot-reads 
- client specified timestamp 
- proceeds on any replica sufficiently up to date 
- commit is inevitable once timestamp is 
chosen so clients can avoid buffering.
scope LastTS() 
sread slates 
Read Only Transactions 
t 
searlies 
t 
Paxos Group Leader 
Up to date replica
2PC and RW Transactions 
- timestamps assigned anytime when locks have been 
acquired but before lock is released. 
- within each Paxos group, Spanner assigns timestamps in 
monotonically increasing order, even across leaders. 
- A leader must only assign timestamps within its leader lease 
External consistency for T1 and T2: tabs(e1 
commit)  tabs(e2 
start) 
- The coordinator leader for a write Ti assigns a commit 
timestamp si no less than the value of TT.now().latest 
- Coordinator ensures clients see no data committed by Ti until 
TT.after(si) is true using a commit wait.
RW Transactions - Other Details 
- Writes are buffered until a commit; reads in a transaction do 
not see the effects of the transaction’s writes. 
- Uncommitted data is not assigned a timestamp 
- Reads in RWX use wound-wait to avoid deadlocks 
(preference given to older transaction) 
Client: 
1. Acquire locks, read most recent data 
2. Send keep alive messages to avoid timeouts 
3. Begin two phase commit 
4. Choose coordinator group, send commit to each leader 
Having client drive 2PC avoids sending data twice
Acquired 
Locks 
Compute s 
Read-Write Transactions 
Tc 
TP1 
Release 
Locks 
Committed 
Release 
Locks 
Compute s Prepared 
send s
Schema Change Transactions 
- The number of groups in the database could be in the 
millions, how do you make an atomic schema change? 
(BigTable only supports changes in one data center, but 
locks everything) 
- Spanner has a non-blocking variant of standard transactions 
for schema changes: 
1. Assign a timestamp in the future (registered during prepare) 
2. Reads and Writes synchronize with any registered schema at 
the future timestamp, but not before 
3. If they have reached the timestamp, they must block if their 
schemas are after timestamp t
Performance
F1 Case Study 
- Spanner started being experimentally evaluated under 
production workloads in early 2011, as part of a rewrite of 
Google’s advertising backend called F1. 
- Replaced a MySQL database that was manually sharded. 
- The uncompressed dataset is tens of terabytes. 
- Spanner removes need to manually reshard 
- Spanner provides synchronous replication and auto failover 
- F1 requires strong transactional semantics 
- Fairly invisible transfer from MySQL to Spanner; data center 
loss and swapping required only an update to schema to 
move leaders to closer front ends.
An Overview of Spanner: Google's Globally Distributed Database

Weitere ähnliche Inhalte

Was ist angesagt?

Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta LakeKnoldus Inc.
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsAlluxio, Inc.
 
Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...
Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...
Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...VMware Tanzu
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiManish Gupta
 
An Overview of Apache Cassandra
An Overview of Apache CassandraAn Overview of Apache Cassandra
An Overview of Apache CassandraDataStax
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Julien Le Dem
 
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Amy W. Tang
 
Consistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignConsistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignArinto Murdopo
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overviewJames Serra
 
CAP Theorem and Split Brain Syndrome
CAP Theorem and Split Brain SyndromeCAP Theorem and Split Brain Syndrome
CAP Theorem and Split Brain SyndromeDilum Bandara
 
Advanced Percona XtraDB Cluster in a nutshell... la suite
Advanced Percona XtraDB Cluster in a nutshell... la suiteAdvanced Percona XtraDB Cluster in a nutshell... la suite
Advanced Percona XtraDB Cluster in a nutshell... la suiteKenny Gryp
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBill Liu
 

Was ist angesagt? (20)

Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
CockroachDB
CockroachDBCockroachDB
CockroachDB
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Deep Dive on Amazon Aurora
Deep Dive on Amazon AuroraDeep Dive on Amazon Aurora
Deep Dive on Amazon Aurora
 
Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...
Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...
Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
 
An Overview of Apache Cassandra
An Overview of Apache CassandraAn Overview of Apache Cassandra
An Overview of Apache Cassandra
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013Parquet Hadoop Summit 2013
Parquet Hadoop Summit 2013
 
Amazon Aurora: Under the Hood
Amazon Aurora: Under the HoodAmazon Aurora: Under the Hood
Amazon Aurora: Under the Hood
 
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
 
Aurora Deep Dive | AWS Floor28
Aurora Deep Dive | AWS Floor28Aurora Deep Dive | AWS Floor28
Aurora Deep Dive | AWS Floor28
 
Consistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignConsistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System Design
 
Azure data platform overview
Azure data platform overviewAzure data platform overview
Azure data platform overview
 
CAP Theorem and Split Brain Syndrome
CAP Theorem and Split Brain SyndromeCAP Theorem and Split Brain Syndrome
CAP Theorem and Split Brain Syndrome
 
Advanced Percona XtraDB Cluster in a nutshell... la suite
Advanced Percona XtraDB Cluster in a nutshell... la suiteAdvanced Percona XtraDB Cluster in a nutshell... la suite
Advanced Percona XtraDB Cluster in a nutshell... la suite
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 

Ähnlich wie An Overview of Spanner: Google's Globally Distributed Database

Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureGabriele Modena
 
Spanner Google’s Globally-Distributed DatabaseJames C. Corbett,.docx
Spanner Google’s Globally-Distributed DatabaseJames C. Corbett,.docxSpanner Google’s Globally-Distributed DatabaseJames C. Corbett,.docx
Spanner Google’s Globally-Distributed DatabaseJames C. Corbett,.docxrafbolet0
 
Embedded Intro India05
Embedded Intro India05Embedded Intro India05
Embedded Intro India05Rajesh Gupta
 
Real Time OS For Embedded Systems
Real Time OS For Embedded SystemsReal Time OS For Embedded Systems
Real Time OS For Embedded SystemsHimanshu Ghetia
 
Real Time Operating Systems
Real Time Operating SystemsReal Time Operating Systems
Real Time Operating SystemsPawandeep Kaur
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencingGuy Coates
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
 
Bab 4
Bab 4Bab 4
Bab 4n k
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applicationsDing Li
 
Salesforce enabling real time scenarios at scale using kafka
Salesforce enabling real time scenarios at scale using kafkaSalesforce enabling real time scenarios at scale using kafka
Salesforce enabling real time scenarios at scale using kafkaThomas Alex
 

Ähnlich wie An Overview of Spanner: Google's Globally Distributed Database (20)

Os Concepts
Os ConceptsOs Concepts
Os Concepts
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
 
Cloud Spanner
Cloud SpannerCloud Spanner
Cloud Spanner
 
Spanner Google’s Globally-Distributed DatabaseJames C. Corbett,.docx
Spanner Google’s Globally-Distributed DatabaseJames C. Corbett,.docxSpanner Google’s Globally-Distributed DatabaseJames C. Corbett,.docx
Spanner Google’s Globally-Distributed DatabaseJames C. Corbett,.docx
 
Cassandra admin
Cassandra adminCassandra admin
Cassandra admin
 
Embedded os
Embedded osEmbedded os
Embedded os
 
Rtos Concepts
Rtos ConceptsRtos Concepts
Rtos Concepts
 
Embedded Intro India05
Embedded Intro India05Embedded Intro India05
Embedded Intro India05
 
Real Time OS For Embedded Systems
Real Time OS For Embedded SystemsReal Time OS For Embedded Systems
Real Time OS For Embedded Systems
 
Real Time Operating Systems
Real Time Operating SystemsReal Time Operating Systems
Real Time Operating Systems
 
Storage for next-generation sequencing
Storage for next-generation sequencingStorage for next-generation sequencing
Storage for next-generation sequencing
 
Making Linux do Hard Real-time
Making Linux do Hard Real-timeMaking Linux do Hard Real-time
Making Linux do Hard Real-time
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Bab 4
Bab 4Bab 4
Bab 4
 
An Optics Life
An Optics LifeAn Optics Life
An Optics Life
 
Software architecture for data applications
Software architecture for data applicationsSoftware architecture for data applications
Software architecture for data applications
 
Spanner (may 19)
Spanner (may 19)Spanner (may 19)
Spanner (may 19)
 
Interpreting the Data:Parallel Analysis with Sawzall
Interpreting the Data:Parallel Analysis with SawzallInterpreting the Data:Parallel Analysis with Sawzall
Interpreting the Data:Parallel Analysis with Sawzall
 
Salesforce enabling real time scenarios at scale using kafka
Salesforce enabling real time scenarios at scale using kafkaSalesforce enabling real time scenarios at scale using kafka
Salesforce enabling real time scenarios at scale using kafka
 
Lecture1
Lecture1Lecture1
Lecture1
 

Mehr von Benjamin Bengfort

Visual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learningVisual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learningBenjamin Bengfort
 
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Benjamin Bengfort
 
Dynamics in graph analysis (PyData Carolinas 2016)
Dynamics in graph analysis (PyData Carolinas 2016)Dynamics in graph analysis (PyData Carolinas 2016)
Dynamics in graph analysis (PyData Carolinas 2016)Benjamin Bengfort
 
Visualizing the Model Selection Process
Visualizing the Model Selection ProcessVisualizing the Model Selection Process
Visualizing the Model Selection ProcessBenjamin Bengfort
 
A Primer on Entity Resolution
A Primer on Entity ResolutionA Primer on Entity Resolution
A Primer on Entity ResolutionBenjamin Bengfort
 
An Interactive Visual Analytics Dashboard for the Employment Situation Report
An Interactive Visual Analytics Dashboard for the Employment Situation ReportAn Interactive Visual Analytics Dashboard for the Employment Situation Report
An Interactive Visual Analytics Dashboard for the Employment Situation ReportBenjamin Bengfort
 
Graph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataGraph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataBenjamin Bengfort
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 
Evolutionary Design of Swarms (SSCI 2014)
Evolutionary Design of Swarms (SSCI 2014)Evolutionary Design of Swarms (SSCI 2014)
Evolutionary Design of Swarms (SSCI 2014)Benjamin Bengfort
 
Graph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkXGraph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkXBenjamin Bengfort
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with PythonBenjamin Bengfort
 
Beginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBeginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBenjamin Bengfort
 
Building Data Products with Python (Georgetown)
Building Data Products with Python (Georgetown)Building Data Products with Python (Georgetown)
Building Data Products with Python (Georgetown)Benjamin Bengfort
 
Building Data Apps with Python
Building Data Apps with PythonBuilding Data Apps with Python
Building Data Apps with PythonBenjamin Bengfort
 

Mehr von Benjamin Bengfort (19)

Getting Started with TRISA
Getting Started with TRISAGetting Started with TRISA
Getting Started with TRISA
 
Visual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learningVisual diagnostics for more effective machine learning
Visual diagnostics for more effective machine learning
 
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
Visualizing Model Selection with Scikit-Yellowbrick: An Introduction to Devel...
 
Dynamics in graph analysis (PyData Carolinas 2016)
Dynamics in graph analysis (PyData Carolinas 2016)Dynamics in graph analysis (PyData Carolinas 2016)
Dynamics in graph analysis (PyData Carolinas 2016)
 
Visualizing the Model Selection Process
Visualizing the Model Selection ProcessVisualizing the Model Selection Process
Visualizing the Model Selection Process
 
Data Product Architectures
Data Product ArchitecturesData Product Architectures
Data Product Architectures
 
A Primer on Entity Resolution
A Primer on Entity ResolutionA Primer on Entity Resolution
A Primer on Entity Resolution
 
An Interactive Visual Analytics Dashboard for the Employment Situation Report
An Interactive Visual Analytics Dashboard for the Employment Situation ReportAn Interactive Visual Analytics Dashboard for the Employment Situation Report
An Interactive Visual Analytics Dashboard for the Employment Situation Report
 
Graph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational DataGraph Based Machine Learning on Relational Data
Graph Based Machine Learning on Relational Data
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 
Evolutionary Design of Swarms (SSCI 2014)
Evolutionary Design of Swarms (SSCI 2014)Evolutionary Design of Swarms (SSCI 2014)
Evolutionary Design of Swarms (SSCI 2014)
 
Graph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkXGraph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkX
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
 
Beginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix FactorizationBeginners Guide to Non-Negative Matrix Factorization
Beginners Guide to Non-Negative Matrix Factorization
 
Building Data Products with Python (Georgetown)
Building Data Products with Python (Georgetown)Building Data Products with Python (Georgetown)
Building Data Products with Python (Georgetown)
 
Annotation with Redfox
Annotation with RedfoxAnnotation with Redfox
Annotation with Redfox
 
Rasta processing of speech
Rasta processing of speechRasta processing of speech
Rasta processing of speech
 
Building Data Apps with Python
Building Data Apps with PythonBuilding Data Apps with Python
Building Data Apps with Python
 

Kürzlich hochgeladen

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 

Kürzlich hochgeladen (20)

SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 

An Overview of Spanner: Google's Globally Distributed Database

  • 1. Spanner: Google’s Globally- Distributed Database J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, and others, “Spanner: Google’s globally distributed database,” ACM Transactions on Computer Systems (TOCS), vol. 31, no. 3, p. 8, 2013.
  • 2. Helpful Definitions Sharding: is a type of database partitioning that separates very large databases the into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole. Replication: Sharing of information to improve availability, durability 1. Transactional replication. This is the model for replicating transactional data, for example a database or some other form of transactional storage structure with one-copy serializability. 2. State machine replication. This model assumes that replicated process is a deterministic finite automaton and that atomic broadcast of every event is possible, usually implemented by a replicated log consisting of multiple subsequent rounds of the Paxos algorithm. 3. Virtual synchrony. This computational model is used when a group of processes cooperate to replicate in-memory data or to coordinate actions.
  • 3. Helpful Definitions Clock drift: several related phenomena where a clock does not run at the exact right speed compared to another clock. That is, after some time the clock "drifts apart" from the other clock. - Everyday clocks such as wristwatches have finite precision. Eventually they require correction to remain accurate. The rate of drift depends on the clock's quality, sometimes the stability of the power source, the ambient temperature, and other subtle environmental variables. Thus the same clock can have different drift rates at different occasions. - Atomic clocks are very precise and have nearly no clock drift. Even the Earth's rotation rate has more drift and variation in drift than an atomic clock. - As Einstein predicted, relativistic effects can also cause clock drift due to time dilation or gravitational distortions.
  • 4. Wilson Hsieh, Google Inc presents Spanner at OSDI 2012 https://www.usenix.org/conference/osdi12/technical-sessions/presentation/corbett
  • 5. Spanner in a Nutshell 1. Lock-Free distributed read only transactions 2. External consistency of writes 3. Temporal multiple versioning 4. Schematized, semi-relational database with associated structured query language 5. Applications control replication and placement 6. All of this because of TrueTime
  • 7. Zone Server Organization - zonemaster (1 per zone) assigns data to the span servers (similar to NameNode in GFS) - spanserver (100 - 3000 per zone) serves data to the clients - Proxies are used by clients to locate span servers with the correct data (shard lookup) these are used to prevent bottleneck at zonemaster 1 per universe: - universemaster has a console with status information - placement driver handles data movement for replication or load balancing purposes
  • 8. Storage Data Model (key:string, timestamp:int64) → string - Tablet: a bag of timestamped key,val pairs as above - Tablet’s state is tracked and stored with B-tree-like files and a write-ahead log. Data storage is on a distributed filesystem (Colossus, the successor to GFS at Google). Tablets are optimized data structures first implemented by BigTable to track data by row/column position (the keys) and their locations on GFS. In BigTable (and presumably Spanner as well) Tablets are compressed using Snappy to keep them approximately 200 MB in size. Tablets (for now) can be seen as the basic unit of replication/sharding.
  • 9. Zone Software Stack - each tablet gets a Paxos state machine that maintains its log and state in the tablet (Paxos writes twice) - Paxos has long lived leaders (10 seconds) and is pipelined to improve for WAN latencies - Paxos is used to maintain consistency among replicas; every write must initiate at the Paxos leader. - The set of replicas is collectively a Paxos group.
  • 10. Zone Software Stack Continued - Each leader maintains a lock table with state for two phase locking, mapping range of keys to lock states. - Long lived leader and is critical, design for long-lived transactions (reports). - Transaction manager is used to implement distributed transactions and to select a participant leader, which coordinates Paxos between participant groups.
  • 11. Buckets* and Data Locality - Directories: collection of contiguous keys that share a prefix. - Prefixes allow applications to control the locality of their data. - All data in a directory has same replica configuration; movement of data between Paxos groups is via directory, in order to store data in a group closer to its accessors. - Spanner Tablet != BigTable Tablet, Spanner containers encapsulate multiple partitions of row space. - Smallest unit of geographic replication properties (placement)
  • 12. Application Data Model - schematized, semi-relational tables and a structured query language + general purpose transactions. - Is 2PC too expensive? Leave it to the application to deal with performance. - On top of Buckets an application creates a database in the universe, composed of tables. - But not actually relational; all rows must have primary keys.
  • 13. TrueTime - Time is an interval with timestamp endpoints - TT.now() → returns a TTinterval: [earliest, latest] - TT.after(t) → returns True iff t > latest - TT.before(t) → returns True iff t < earliest - is the instantaneous error bound (half the intervals width) - Clients poll TimeMasters (GPS and Atomic Clocks) and use Marzullo’s algorithm to detect and reject liars. Also detect bad local times or clocks. - Between polls, gets increasingly larger until the next poll and this increase is derived from worst-case local clock drift (significantly larger than expected).
  • 14. Marzullo’s Algorithm - An agreement algorithm used to select sources for estimating accurate time from a number of noisy time sources. - A refined version of it, renamed the intersection algorithm, forms part of the modern NTP. - Usually the best estimate is taken to be the smallest interval consistent with the largest number of sources.
  • 15. Concurrency Control - read-only transactions - Not simply read-write transactions w/o writes - execute at a system chosen timestamp - non-blocking (incoming writes aren’t locked) - proceeds on any replica sufficiently up to date - snapshot-reads - client specified timestamp - proceeds on any replica sufficiently up to date - commit is inevitable once timestamp is chosen so clients can avoid buffering.
  • 16. scope LastTS() sread slates Read Only Transactions t searlies t Paxos Group Leader Up to date replica
  • 17. 2PC and RW Transactions - timestamps assigned anytime when locks have been acquired but before lock is released. - within each Paxos group, Spanner assigns timestamps in monotonically increasing order, even across leaders. - A leader must only assign timestamps within its leader lease External consistency for T1 and T2: tabs(e1 commit) tabs(e2 start) - The coordinator leader for a write Ti assigns a commit timestamp si no less than the value of TT.now().latest - Coordinator ensures clients see no data committed by Ti until TT.after(si) is true using a commit wait.
  • 18. RW Transactions - Other Details - Writes are buffered until a commit; reads in a transaction do not see the effects of the transaction’s writes. - Uncommitted data is not assigned a timestamp - Reads in RWX use wound-wait to avoid deadlocks (preference given to older transaction) Client: 1. Acquire locks, read most recent data 2. Send keep alive messages to avoid timeouts 3. Begin two phase commit 4. Choose coordinator group, send commit to each leader Having client drive 2PC avoids sending data twice
  • 19. Acquired Locks Compute s Read-Write Transactions Tc TP1 Release Locks Committed Release Locks Compute s Prepared send s
  • 20. Schema Change Transactions - The number of groups in the database could be in the millions, how do you make an atomic schema change? (BigTable only supports changes in one data center, but locks everything) - Spanner has a non-blocking variant of standard transactions for schema changes: 1. Assign a timestamp in the future (registered during prepare) 2. Reads and Writes synchronize with any registered schema at the future timestamp, but not before 3. If they have reached the timestamp, they must block if their schemas are after timestamp t
  • 22. F1 Case Study - Spanner started being experimentally evaluated under production workloads in early 2011, as part of a rewrite of Google’s advertising backend called F1. - Replaced a MySQL database that was manually sharded. - The uncompressed dataset is tens of terabytes. - Spanner removes need to manually reshard - Spanner provides synchronous replication and auto failover - F1 requires strong transactional semantics - Fairly invisible transfer from MySQL to Spanner; data center loss and swapping required only an update to schema to move leaders to closer front ends.