This set of slides gives you an overview of Galera, configuration basics and deployment best practices.
The following topics are covered:
- Concepts
- Node provisioning
- Network partitioning
- Configuration example
- Benchmarks
- Deployment best practices
- Galera monitoring and management
2. Copyright Severalnines AB
Agenda
About Severalnines
What is Galera Replication?
Galera Concepts
Node Provisioning
Network partitioning/Split brain
Configuration Example
Benchmarks & Performance Metrics
Best Practices
Monitoring and Management
Confidential 2
3. Copyright Severalnines AB
About Us
Stockholm, Tokyo and Singapore
Database Automation and DBaaS software vendor
Over 7,000 deployments to date
Commercial product launched Q1 2011
Winner Best Startup EuroCloud Europe 2011
Launched Europe’s first Data Cloud in Nov 2011
Press coverage 2011: CIO Magazine, eWeek, PC-World, IDG News, Le Figaro, LeMondeInformatique, heise.de, Computerwelt, silicon.de, etc …
4. Copyright Severalnines AB
What is Galera Replication?
Synchronous (Virtually) Multi-Master Replication
Read and Write on any Node
No Master Failover! No Slave Lag!
Guaranteed write consistency
Cluster wide conflict resolution (certification)
Highly Available and Scalable
No SPOF
Read and Write (Parallel Applier threads) scalability
Geographical Replication (Mix MySQL Async & Galera Sync)
Automatic Node Provisioning, QoS
[Diagram: Application, MySQL Server, WSREP API, WSREP Provider (wsrep plugin), Replication, Cluster (Group Communication Protocol)]
5. Copyright Severalnines AB
Galera Cluster for MySQL
Codership patches for MySQL
Binaries and source available at Launchpad
InnoDB (& MyISAM experimental)
No need to change DB schema/queries
Local queries
Parallel Replication!
Multiple Applier Threads (1-512)
Row events, row level locks
Asynchronous Replication
In/Out of the cluster
[Diagram: three clients connect through a load balancer (LB) to three R/W MySQL [WSREP] nodes linked by Galera Replication (Synchronous)]
6. Copyright Severalnines AB
Galera Cluster for MySQL cont.
Higher probability for “deadlocks”
Cluster wide optimistic locking
Locking conflicts detected at commit
First to commit succeeds
Minimum 3 nodes recommended
“Donor” node blocks writes during full sync of joining/recovering node
3rd node is then available for service
Gotcha: 2 recovering nodes will block the last node
Replication performance depends on
Network latency
Performance of the “slowest” or the farthest Node (RTT)
Number of deployed nodes
[Diagram: clients, LB, three R/W MySQL [WSREP] nodes, Galera Replication (Synchronous)]
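Because conflicts are detected only at commit and the first committer wins, an application that writes to several nodes should be prepared to retry a transaction that loses. The following is a minimal sketch, not a driver API: `DeadlockError` and `run_with_retry` are hypothetical names, with `DeadlockError` standing in for whatever exception your MySQL driver raises for a deadlock (error 1213).

```python
class DeadlockError(Exception):
    """Hypothetical stand-in for the driver's deadlock error (MySQL error 1213)."""


def run_with_retry(transaction, max_attempts=3):
    """Call transaction() until it commits or max_attempts is exhausted.

    transaction is any callable that runs BEGIN ... COMMIT and raises
    DeadlockError when its commit loses a cluster-wide conflict.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DeadlockError:
            if attempt == max_attempts:
                raise  # give up and let the caller see the conflict


# Usage: a fake transaction that conflicts once, then commits.
attempts = []


def flaky_transaction():
    attempts.append(1)
    if len(attempts) == 1:
        raise DeadlockError("conflict detected at commit")
    return "committed"


result = run_with_retry(flaky_transaction)
```

The whole transaction is re-run from the start, since a rolled-back transaction keeps none of its statements.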
7. Copyright Severalnines AB
Synchronous Replication
[Timeline diagram: on Node 1, transaction t1 runs BEGIN, its statements, COMMIT (REQ) and finally COMMIT (ACK/returns); the span between the two COMMIT points is the commit response time. At COMMIT (REQ) the write set (WS) is sent as a replication event, certification answers OK or Conflict, and the transaction commits or rolls back. Node 2 and Node 3 each receive the WS, certify it and apply it (transaction applied, virtually synchronous). All nodes 100% sync.]
8. Copyright Severalnines AB
Galera Concepts
Application State
A set of data that application decides to replicate
By default this is all MySQL databases. Every node is a complete replica
Application state is identified by a Global Transaction ID
Global Transaction ID (GTID)
f7720ae0-6f9b-11e1-0800-598d1b386dce:32520198989
CLUSTER/HISTORY/STATE UUID:TRX/STATE/SEQNO
All replicated transactions can be uniquely referenced in any node
Initial state: f7720ae0-6f9b-11e1-0800-598d1b386dce:0
Undefined state: 00000000-0000-0000-0000-000000000000:-1
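The GTID layout above (state UUID, a colon, then a sequence number) can be parsed with a few lines of code. This is a sketch under assumptions: `Gtid`, `parse_gtid` and `is_undefined` are hypothetical helper names, not part of Galera or MySQL.

```python
from typing import NamedTuple
from uuid import UUID

# The all-zero UUID paired with seqno -1 marks the undefined state.
UNDEFINED_UUID = UUID("00000000-0000-0000-0000-000000000000")


class Gtid(NamedTuple):
    state_uuid: UUID  # cluster/history/state UUID
    seqno: int        # transaction/state sequence number


def parse_gtid(text):
    """Split 'UUID:SEQNO' at the last colon and validate both parts."""
    uuid_part, _, seqno_part = text.rpartition(":")
    return Gtid(UUID(uuid_part), int(seqno_part))


def is_undefined(gtid):
    """True for 00000000-0000-0000-0000-000000000000:-1."""
    return gtid.state_uuid == UNDEFINED_UUID and gtid.seqno == -1


current = parse_gtid("f7720ae0-6f9b-11e1-0800-598d1b386dce:32520198989")
initial = parse_gtid("f7720ae0-6f9b-11e1-0800-598d1b386dce:0")
undefined = parse_gtid("00000000-0000-0000-0000-000000000000:-1")
```

Two GTIDs are comparable by sequence number only when their state UUIDs match, which is why the UUID is kept as a separate field.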
9. Copyright Severalnines AB
Galera Concepts cont.
Primary Component - PC
The whole cluster is a PC during normal operation
Node and network failures
Split the cluster into several components
Only the PC can continue to modify state
Quorum algorithm invoked to select a PC during cluster partitioning
Majority rules
Minority tries to reconnect with PC
[Diagram: three MySQL [WSREP] nodes forming the Primary Component]
10. Copyright Severalnines AB
Galera Concepts cont.
State Snapshot Transfer - SST
A transfer of a consistent snapshot of a node state corresponding to a certain GTID
Initializes the state of a newly joining cluster node from an already initialized node (donor)
Incremental State Transfer - IST
Catch up with the cluster by replaying missing transactions
Requires a known initial node state
Requires enough transactions cached at the donor
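The two IST preconditions can be read as a simple decision rule. The sketch below is a simplified model for illustration only, not Galera's actual logic; the function name and all of its parameters are hypothetical.

```python
def choose_state_transfer(joiner_uuid, joiner_seqno, cluster_uuid,
                          cluster_seqno, donor_cache_first_seqno):
    """Return 'none', 'IST' or 'SST' for a joining node (simplified model).

    donor_cache_first_seqno is the oldest transaction the donor still
    has cached; the joiner needs everything after joiner_seqno.
    """
    # Precondition 1: the joiner's state must be known and belong to
    # the same cluster history.
    known_state = joiner_uuid == cluster_uuid and joiner_seqno >= 0
    if not known_state:
        return "SST"
    if joiner_seqno == cluster_seqno:
        return "none"  # already up to date
    # Precondition 2: the donor must still cache every missing transaction.
    if donor_cache_first_seqno <= joiner_seqno + 1:
        return "IST"
    return "SST"


# A node that was briefly down and whose gap is still cached at the donor.
plan = choose_state_transfer("f7720ae0", 100, "f7720ae0", 200, 50)
```

A larger transaction cache on the donor widens the window in which a returning node can use IST instead of a full snapshot.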
11. Copyright Severalnines AB
Galera Concepts cont.
Node Failures
A peer crash is indistinguishable from network failure
A node is considered failed when the other nodes can no longer communicate with it
Node health is verified by receiving messages or keepalives
evs.inactive_timeout
sets the timeout after which a node is considered inactive (dead)
evs.suspect_timeout
sets the timeout after which a node can be pronounced dead if everyone else agrees
12. Copyright Severalnines AB
Galera Concepts cont.
LAN vs WAN replication
No notion of local or remote node
Works as long as TCP works
May need tuning to be more tolerant to network latency/issues
Network params sample
evs.keepalive_period = PT3S
evs.inactive_check_period = PT10S
evs.suspect_timeout = PT30S
evs.inactive_timeout = PT1M
evs.consensus_timeout = PT1M
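The sample values are ISO 8601 durations (PT3S is 3 seconds, PT1M is 1 minute). When tuning them it can help to convert them to seconds and check that they are ordered sensibly. A minimal sketch, assuming only the hour/minute/second form shown here; `duration_seconds` is a hypothetical helper and the ordering check reflects the timeout descriptions in this deck, not a rule enforced by this code on Galera's behalf.

```python
import re

# Matches the time-only ISO 8601 form, e.g. PT3S, PT1M, PT1M30S.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+(?:\.\d+)?)S)?$")


def duration_seconds(value):
    """Convert a 'PT..H..M..S' duration string to seconds."""
    match = _DURATION.match(value)
    if not match or value == "PT":
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60 + float(seconds or 0)


params = {
    "evs.keepalive_period": "PT3S",
    "evs.inactive_check_period": "PT10S",
    "evs.suspect_timeout": "PT30S",
    "evs.inactive_timeout": "PT1M",
    "evs.consensus_timeout": "PT1M",
}
seconds = {name: duration_seconds(value) for name, value in params.items()}

# Keepalives must be sent more often than a node is suspected, and a
# node should be suspected no later than it is declared inactive.
ordered = (seconds["evs.keepalive_period"]
           < seconds["evs.suspect_timeout"]
           <= seconds["evs.inactive_timeout"])
```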
13. Copyright Severalnines AB
Node Provisioning
Automatic node (re)synchronization
A ‘donor’ is chosen to provision a ‘joiner’ node
‘Donor’ node is blocked (write operations) until SST completes
State Snapshot Transfer - SST
Scriptable interface
mysqldump (slow)
rsync (fast)
Percona Xtrabackup (faster and non-blocking)
14. Copyright Severalnines AB
Node Provisioning cont.
[Diagram: three clients connect through a load balancer to Node 1 and Node 2, each running MySQL [WSREP]]
15. Copyright Severalnines AB
Node Provisioning cont.
[Diagram: a ‘Joiner’ Node 3 (MySQL [WSREP]) appears next to Node 1 and Node 2, which serve the clients through the load balancer]
16. Copyright Severalnines AB
Node Provisioning cont.
[Diagram: ‘Joiner’ Node 3 is started with wsrep_cluster_address=Node 2, sends an SST Request and waits in rsync receive]
17. Copyright Severalnines AB
Node Provisioning cont.
[Diagram: Node 2 is in ‘donor mode’ and its write operations are blocked; Node 2 runs rsync send while ‘Joiner’ Node 3 runs rsync receive]
18. Copyright Severalnines AB
Node Provisioning cont.
[Diagram: Node 3 catches up and sits alongside Node 1 and Node 2 behind the load balancer]
19. Copyright Severalnines AB
Network Partitioning/Split Brain
Quorum based system
“Majority >50%” partition continues operation
“Minority” partition blocks operations
Until reconnected with Primary Component
Use odd number of nodes
Minimum 3 (5, 7, 9 etc)
Galera Arbitrator (garbd)
Useful if you have even number of nodes
Nodes across DCs
Replication relay
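The "majority >50%" rule explains both the odd-node advice and the role of garbd. The sketch below is a simplified model of that rule, not Galera's implementation; `has_quorum` is a hypothetical helper, and it assumes the arbitrator counts as one more voting member.

```python
def has_quorum(partition_size, cluster_size):
    """True if a partition holds strictly more than half of the members."""
    return 2 * partition_size > cluster_size


# 3 nodes split 2/1: the pair continues, the single node blocks.
three_nodes = (has_quorum(2, 3), has_quorum(1, 3))

# 4 nodes split 2/2: neither side has a majority, so both block.
four_nodes = (has_quorum(2, 4), has_quorum(2, 4))

# 4 nodes plus garbd (5 voting members): the half that can still
# reach the arbitrator has 3 of 5 and continues.
four_nodes_with_garbd = (has_quorum(2 + 1, 5), has_quorum(2, 5))
```

With an even node count an even split leaves no Primary Component at all, which is the case the arbitrator is meant to break.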
20. Copyright Severalnines AB
Network Partitioning/Split Brain cont.
[Diagram: clients and load balancer; three MySQL [WSREP] nodes spread over DC1 and DC2 form 1 Primary Component]
21. Copyright Severalnines AB
Network Partitioning/Split Brain cont.
[Diagram: the same three MySQL [WSREP] nodes in DC1 and DC2; Primary Component? Nodes block operations until reconnected with the PC]
22. Copyright Severalnines AB
Network Partitioning/Split Brain cont.
[Diagram: three MySQL [WSREP] nodes in DC1 and DC2, plus a Galera Arbitrator in DC3]
23. Copyright Severalnines AB
Network Partitioning/Split Brain cont.
[Diagram: three MySQL [WSREP] nodes in DC1 and DC2, a Replication Relay, and the Galera Arbitrator in DC3]
24. Copyright Severalnines AB
Network Partitioning/Split Brain cont.
[Diagram: three MySQL [WSREP] nodes in DC1 and DC2 with the Galera Arbitrator in DC3; Primary Component?]
25. Copyright Severalnines AB
Galera Configuration Example
[mysqld]
wsrep_provider='/usr/lib64/galera/libgalera_smm.so'
wsrep_cluster_address=gcomm:// # NOTE: This must be changed to peer address ASAP!
wsrep_node_name=node1
wsrep_provider_options='gcache.size=1G;socket.ssl_key=my_key;socket.ssl_cert=my_cert'
wsrep_slave_threads=16
wsrep_sst_method=xtrabackup
wsrep_sst_auth=root:
innodb_buffer_pool_size=1G
innodb_log_file_size=256M
innodb_autoinc_lock_mode=2
innodb_flush_log_at_trx_commit=0
innodb_doublewrite=0
innodb_file_per_table=1
binlog_format=ROW
datadir=/var/lib/mysql
log-bin = mysql-bin
server-id = 2
relay-log = mysql-relay-bin
#read-only = 1
log-slave-updates = 1
26. Copyright Severalnines AB
wsrep variables
wsrep_provider
Path to wsrep provider library
wsrep_cluster_address
URI form: 'gcomm://another_node_address?opt1=val1&opt2=val2'
'gcomm://' has a special meaning: it initializes a new cluster (never leave it in my.cnf)
wsrep_node_address
An optional address of the node. A shortcut for configuring the listen addresses for replication and state transfers
By default it is initialized to the first network interface returned by ifconfig. This can be unreliable.
For best results, set it explicitly
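Since leaving a bare 'gcomm://' in my.cnf makes a restarted node bootstrap a new cluster, a small pre-start check can catch it. The following is a sketch under assumptions: `parse_cluster_address` and `bootstraps_new_cluster` are hypothetical helpers that handle only the single-peer URI form shown above.

```python
def parse_cluster_address(address):
    """Split 'gcomm://peer?opt1=val1&opt2=val2' into (peer, options)."""
    scheme, sep, rest = address.partition("://")
    if not sep or scheme != "gcomm":
        raise ValueError(f"not a gcomm:// address: {address!r}")
    peer, _, query = rest.partition("?")
    options = {}
    for pair in query.split("&"):
        if pair:
            key, _, value = pair.partition("=")
            options[key] = value
    return peer, options


def bootstraps_new_cluster(address):
    """True for a bare 'gcomm://', which names no peer to join."""
    peer, _ = parse_cluster_address(address)
    return peer == ""


peer, options = parse_cluster_address(
    "gcomm://another_node_address?opt1=val1&opt2=val2")
```

A deployment script could refuse to write a my.cnf for which `bootstraps_new_cluster` returns True unless the node is deliberately the first one in the cluster.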
27. Copyright Severalnines AB
wsrep variables cont.
wsrep_node_name
An optional name for the node. It is used in logging and to identify the desired donor for state transfer
By default it is initialized to the hostname
wsrep_provider_options
Semicolon-separated list of options specific to provider
Ex:
gcache.size – a size of the permanent transaction on-disk cache
socket.ssl_key, socket.ssl_cert – SSL key and certificate files
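Because the provider options are packed into one semicolon-separated string, a stray quote or missing separator is easy to overlook. A minimal sketch of splitting that string into key/value pairs; `parse_provider_options` is a hypothetical helper, not a Galera or MySQL function.

```python
def parse_provider_options(text):
    """Turn 'key1=val1;key2=val2' into a dict, rejecting malformed items."""
    options = {}
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"option without '=': {item!r}")
        options[key.strip()] = value.strip()
    return options


provider_options = parse_provider_options(
    "gcache.size=1G;socket.ssl_key=my_key;socket.ssl_cert=my_cert")
```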
28. Copyright Severalnines AB
wsrep variables cont.
wsrep_slave_threads
Parallel applying threads (1-512)
Values above 1 require certain InnoDB settings. Applying of STATEMENT-based events is always serialized
wsrep_sst_method
The base package contains scripts for mysqldump, rsync and xtrabackup based state snapshot transfers. Custom scripts can also be used
Default is mysqldump
29. Copyright Severalnines AB
Performance Metrics
wsrep_flow_control_paused
Fraction of the time replication was paused
wsrep_flow_control_sent
How many times this node paused replication
wsrep_local_recv_queue_avg
Average length of slave trx queue – a sign of slave side bottleneck
wsrep_cert_deps_distance
How many transactions can be applied in parallel
wsrep_local_send_queue_avg
A sign of network bottleneck
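These counters can be turned into simple warnings. The sketch below assumes the status values arrive as a dict of strings, for example from SHOW GLOBAL STATUS LIKE 'wsrep_%'; `replication_warnings` is a hypothetical helper, and the thresholds `paused_limit` and `queue_limit` are illustrative assumptions that should be tuned per workload.

```python
def replication_warnings(status, paused_limit=0.1, queue_limit=1.0):
    """Return human-readable warnings derived from wsrep status values."""
    warnings = []
    if float(status.get("wsrep_flow_control_paused", 0)) > paused_limit:
        warnings.append("replication is paused too often (flow control)")
    if float(status.get("wsrep_local_recv_queue_avg", 0)) > queue_limit:
        warnings.append("slave-side bottleneck (receive queue is long)")
    if float(status.get("wsrep_local_send_queue_avg", 0)) > queue_limit:
        warnings.append("network bottleneck (send queue is long)")
    return warnings


# Usage with made-up sample values for one node.
sample_status = {
    "wsrep_flow_control_paused": "0.25",
    "wsrep_local_recv_queue_avg": "3.5",
    "wsrep_local_send_queue_avg": "0.0",
}
found = replication_warnings(sample_status)
```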
30. Copyright Severalnines AB
Number of conflicts/“deadlocks”
wsrep_last_committed
Last committed transaction
wsrep_local_cert_failures, wsrep_local_bf_aborts
Rollbacks, conflicts detected
33. Copyright Severalnines AB
Benchmarks: Comparing NDB vs Galera
Note: No optimizations done for the NDB storage engine (DB schema nor queries)
http://codership.com/content/whats-difference-kenneth
35. Copyright Severalnines AB
Best Practices
Dedicated switch/network for Galera Nodes (1 GBit min)
Connection pools/Load balancing with applications
Gives best performance
Use static/elastic IPs for the Galera nodes
Con: Need to handle node membership changes
Con: JDBC/PHP etc are not aware of Galera specific Node states
Load Balancers
Hardware, e.g., F5
SW load balancer
HAProxy with Galera specific health check scripts
IP dispatching in the kernel, for example Linux LVS
GLB (Galera Load Balancer)
Con: Need to set up LB redundancy
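A Galera-specific health check has to look beyond "MySQL is accepting connections", because a node can be up but not usable (for example while acting as a donor or while cut off from the Primary Component). A minimal sketch of the decision such a script might make; `node_accepts_traffic` is a hypothetical helper, and it assumes the wsrep status variables have already been fetched into a dict of strings.

```python
def node_accepts_traffic(status):
    """True only if the node is ready, in the Primary Component and synced."""
    return (
        status.get("wsrep_ready") == "ON"
        and status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_local_state_comment") == "Synced"
    )


synced_node = {
    "wsrep_ready": "ON",
    "wsrep_cluster_status": "Primary",
    "wsrep_local_state_comment": "Synced",
}
# Same node, but currently in a non-synced state.
busy_node = dict(synced_node, wsrep_local_state_comment="Donor/Desynced")

results = (node_accepts_traffic(synced_node), node_accepts_traffic(busy_node))
```

A load balancer check would map True to "healthy" and anything else to "take the node out of rotation".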
36. Copyright Severalnines AB
Best Practices cont.
Reference Node
Acts as a ‘donor’ node
Backup node
No client connections
[Diagram: three clients connect through a load balancer (LB) to R/W MySQL [WSREP] nodes; a separate MySQL [WSREP] node is the Donor & Backup Node]
37. Copyright Severalnines AB
Best Practices cont.
Minimize probability of deadlocks
Writes go only to 1 Node
Applications use connection pool or load balancer on read only nodes
Have 1 “reference” Node for write failover and donor
[Diagram: clients and LB; two read (R) nodes and one write (W) “Master” Node, all MySQL [WSREP]]
38. Copyright Severalnines AB
Galera Limitations
MyISAM replication is experimental
DDL statements are replicated at statement level
Direct writes to other table types, including system (mysql.*) tables, are not replicated
CREATE USER... is replicated as DDL, but issuing INSERT INTO mysql.user... will not be replicated
Non-deterministic functions like NOW() are supported, since changes are replicated as row events
Query log cannot be directed to table
LOCK/UNLOCK TABLES cannot be supported in multi-master setups
Lock functions (GET_LOCK(), RELEASE_LOCK()...) are not supported either
Maximum allowed transaction size is defined by wsrep_max_ws_rows and wsrep_max_ws_size
XA transactions cannot be supported due to possible rollback on commit