Network Architecture Design for Microservices on GCP
2
About me
@lainra (GitHub) / @la1nra (Twitter)
SRE at Mercari, microservices platform team
3
Target network architecture design for microservices on GCP
Thank you for coming! See you next time!
Thank you for coming! See you next time!
Just kidding!
6
More seriously
If getting the solution right from the start isn’t thrilling enough, please stay with me to understand the journey that led to it!
7
Table of contents
● Infrastructure introduction
● Issues leading to the architecture redesign
● Defining the new architecture goals
● Challenges and solutions
● Final design
● Wrap-up
Infrastructure introduction
9
Mercari infrastructure in a few numbers
● 100+ microservices
● 100+ VPCs (1 microservice = 1 VPC)
● 2 main Google Kubernetes Engine (GKE) clusters (1 Production and 1 Development)
● 5+ secondary GKE clusters
● 2 countries (Japan and USA)
● 200+ developers
● 3k+ pods
10
Our microservices multi-tenancy model
Source: https://speakerdeck.com/mercari/mtc2018-microservices-platform-at-mercari
Issues leading to the architecture redesign
12
Issues leading to the architecture redesign
Cluster-internal cross-microservice communication worked fine, but we had issues with outgoing traffic, especially the following:
● Traffic destined for internal services in other VPCs
● Traffic destined for GCP managed services
● Traffic destined for external tenants (third parties, the Internet…)
● Traffic destined for our on-premises datacenter and AWS
13
Traffic destined for internal services in other VPCs
Our unmanaged network:
➔ All communications are public!
➔ Costs more than private traffic: $0.01 per GB vs. free within the same zone
➔ Less secure than private traffic
➔ Default VPC subnet IP ranges overlap -> cannot use VPC Peering to privatise traffic
14
Traffic destined for GCP managed services
There are 2 kinds of GCP managed services:
- Services accessed through an API (e.g. `*.googleapis.com`)
- Services provisioned in either another customer VPC or a GCP-managed VPC (e.g. Cloud Memorystore, private Cloud SQL)
While there is no issue with the first kind, the second requires consumers to call the service with an IP address from the VPC range of their instance.
GKE pods use a different CIDR than the VPC subnet, so their traffic gets dropped when leaving the VPC.
➔ Need to make GKE pods use the same IP range as the VPC
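As a sketch of what this implies on the Terraform side (illustrative names and CIDRs, assuming a `google_compute_network.shared` defined elsewhere): a VPC-native (Alias IP) GKE cluster takes its pod and Service ranges from secondary ranges on the subnet, which makes pod IPs routable VPC addresses.

```hcl
# Illustrative VPC-native (Alias IP) setup: pod IPs come from a VPC
# secondary range, so managed services in the VPC accept pod traffic.
resource "google_compute_subnetwork" "gke" {
  name          = "gke-subnet"
  region        = "asia-northeast1"
  network       = google_compute_network.shared.id  # assumed to exist
  ip_cidr_range = "10.0.0.0/22"                     # node range

  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.4.0.0/14"                   # pod range
  }
  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.8.0.0/20"                   # Service range
  }
}

resource "google_container_cluster" "main" {
  name               = "main"
  location           = "asia-northeast1"
  network            = google_compute_network.shared.id
  subnetwork         = google_compute_subnetwork.gke.id
  initial_node_count = 1

  # Binding the cluster to the secondary ranges is what turns on Alias IP.
  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"
    services_secondary_range_name = "services"
  }
}
```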
15
Traffic destined for external tenants (third-party, Internet…)
We saw earlier that all GCE instances and GKE nodes had public IP addresses.
There are several problems with this:
1. Public IP leaking
- Instances are exposed to the Internet through their public IPs
- When communicating with external tenants, the public IP is advertised
2. Lack of security mitigation options
- GKE opens many ports for `type: LoadBalancer` and `NodePort` Services
- This makes it hard to mitigate the security risk with firewall rules
➔ Need to stop using public IP addresses for GCE instances
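A minimal sketch of the instance side, reusing the hypothetical subnet above: on GCE, simply omitting the `access_config` block on the network interface is what leaves the VM without an external IP.

```hcl
resource "google_compute_instance" "private_vm" {
  name         = "private-vm"         # illustrative
  machine_type = "n1-standard-1"
  zone         = "asia-northeast1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
    }
  }

  network_interface {
    subnetwork = google_compute_subnetwork.gke.id
    # No access_config block -> no external IP; outbound traffic
    # will need NAT (covered in Challenge 7).
  }
}
```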
16
Traffic destined for our on-premises datacenter and AWS
Microservices are in GCP and the monolith is on-premises, as we are still migrating:
Source: https://speakerdeck.com/mercari/mtc2018-microservices-platform-at-mercari
17
Traffic destined for our on-premises datacenter and AWS
Microservices are in GCP and the monolith is on-premises, as we are still migrating.
We currently use Google Edge Peering to get a direct BGP route between GCP and our datacenter, with L7 proxies to ensure security.
- Requires using our own public IPv4 address block
- Cannot advertise private subnets from both locations
➔ Need a better way to provide private connectivity
Also, we want to offer some AWS services to our developers.
➔ Need to build a reliable, high-performance link between GCP and AWS
18
Issues summary
● Cross-VPC security
● Cross-VPC traffic cost
● Cross-VPC traffic reliability
● GCE Instances security
● Inability of GKE pods to perform cross-VPC connectivity
● On-premises and multi-cloud connectivity
● Lack of network resources management
19
Issues summary
● Cross-VPC security
● Cross-VPC traffic cost
● Cross-VPC traffic reliability
● GCE Instances security
● Inability of GKE pods to perform cross-VPC connectivity
● On-premises and multi-cloud connectivity
● Lack of network resources management
How can we solve these issues?
New architecture goals definition
21
New architecture goals definition
A 1:1 mapping between issues and goals ensures we solve the right problems:
● Harden East-West security between GCP projects
● Reduce East-West traffic cost
● Make East-West traffic more reliable
● Disable GCE instances’ public IPs and enforce internal traffic
● Enable cross-VPC connectivity for GKE pods
● Enable production-grade on-premises and multi-cloud connectivity
● Define a multi-tenancy network management design
Challenges and solutions
23
Challenges and solutions
During our research, we had to solve many challenges to reach an architecture design that fulfils our goals:
● Challenge 1: Multi-tenancy design
● Challenge 2: Which network ownership model to use to enforce IP address management?
● Challenge 3: How big do we need to think?
● Challenge 4: Private IPv4 addresses exhaustion
● Challenge 5: Identifying edge cases
● Challenge 6: Managing multiple regions in a Shared VPC
● Challenge 7: Making GCE instances private only
Challenge 1: Multi-tenancy design
25
Challenge 1: Multi-tenancy design
Giving developers flexibility while providing adequate guardrails is a core concept of our microservices platform.
● We use Terraform and GitHub repositories to manage microservice tenants’ GCP and GKE resources with full automation (see the sketch below)
● We manage these resources in a central project and our GKE cluster, limiting the operations microservices developer teams need to perform to get bootstrapped
➔ With 100+ microservices GCP projects and VPCs, what is the best way to handle this?
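To illustrate the automation mentioned above (not our actual code), bootstrapping a tenant can be reduced to one module call per microservice; the module path and variables below are hypothetical.

```hcl
# Hypothetical internal module that creates the tenant's GCP project,
# attaches it to the network and provisions its GKE namespace.
module "microservice_foo" {
  source = "./modules/microservice"  # hypothetical module path

  name      = "foo"
  team      = "foo-team"
  env       = "production"
  folder_id = var.microservices_folder_id
}
```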
26
Challenge 1: Multi-tenancy design
Option 1: VPC Peering - Connect VPCs with a direct internal link
+ Costs nothing (uses routes) and is secure
+ Cross-VPC traffic is treated the same as internal VPC traffic
- Limited to 25 VPCs per peering group, but we have 100+ microservices…
- Need to manage the link inside each project on both sides, harder to automate
- Peered VPC networks cannot have overlapping IP ranges -> default VPCs cannot be peered...
➔ This is not the solution we’re looking for.
27
Challenge 1: Multi-tenancy design
Option 2: Cloud VPN - Create a VPN tunnel between VPCs
+ Can connect VPCs beyond the VPC Peering limit (25)
+ VPC IP ranges can overlap
- Need to manage the VPN tunnel for each project
- Impossible to automate and self-serve
- Very costly: $0.05 per tunnel per hour, doubled for HA mode
➔ This is not the solution we’re looking for.
28
Challenge 1: Multi-tenancy design
Option 3: Shared VPC - Share a central VPC network across GCP projects
+ Simplest management: all firewall rules and projects are centralized
+ Easy to attach GCP projects to the Shared VPC
+ Fine-grained permissions model with subnets per GCP project
+ Automatable
+ Costs nothing, as all projects belong to the same VPC
+ Scales with multi-tenancy
- Easier to hit VPC limitations, since everything uses one VPC network and one GCP project
- VPC subnet IP ranges cannot overlap -> requires a good IP Address Management strategy
➔ This looks like the best solution for our use case!
29
Challenge 1: Multi-tenancy design
Solution: Use Shared VPC for our multi-tenancy model
However, Shared VPC requires that no participating GCP project has overlapping IP ranges, which leads us to the next challenge.
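For reference, a minimal Terraform sketch of the Shared VPC wiring, with hypothetical project IDs: the host project owns the network, and each microservice project is attached as a service project.

```hcl
# Central host project exposes its VPC to attached service projects.
resource "google_compute_shared_vpc_host_project" "host" {
  project = "network-host-prod"               # hypothetical project ID
}

# One attachment per microservice project.
resource "google_compute_shared_vpc_service_project" "foo" {
  host_project    = google_compute_shared_vpc_host_project.host.project
  service_project = "microservice-foo-prod"   # hypothetical project ID
}
```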
Challenge 2: Which network ownership model to use to enforce IP address management?
31
Challenge 2: Which network ownership model to use to enforce IP address management?
Enterprises usually have a dedicated network team managing network resources through standard procedures and manual operations.
1 development team -> OK
10 development teams -> OK…
100 development teams -> in the red… -> BOTTLENECK
It might even happen before reaching 100 teams!
➔ How do we prevent that?
relying on standard procedures and manual operations.
32
Challenge 2: Which network ownership model to use to enforce IP address management?
Solution: Automate network-related processes and operations
● Make microservices teams self-sufficient in handling the network for their scope
● They don’t need full network control, only to fulfil their requirements easily
● The network team manages IP addresses centrally and provides generic IP blocks to users on a self-service basis
➔ Consequently, the network team needs to provide automated configuration that interfaces with the other automation tools used to provision microservices infrastructure.
33
Challenge 2: Which network ownership model to use to enforce IP address management?
Network is usually one common layer across entities. What happens when
multiple teams try to manage it separately?
What about multiple entities and companies in the same group?
34
Challenge 2: Which network ownership model to use to enforce IP address management?
Have you heard of Conway’s law?
‘Organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations.’
35
Challenge 2: Which network ownership model to use to enforce IP address management?
Symptoms of Conway’s law in network management:
1. Naming convention drift:
○ e.g. “vpc-network-tokyo-01” vs “vpc-tokyo-01” vs “vpc-asia-northeast-01”
2. Drift in base architecture pattern definitions, e.g. having:
○ User-facing applications in the development network
○ Overly strict isolation between environments
○ Different IP address assignment strategies
3. Conflicts of ownership, siloing, etc. between teams
36
Challenge 2: Which network ownership model to use to enforce IP address management?
Solution: One central team manages the network for all entities
● One way to handle and enforce IP address assignment -> ensures no IP overlap
● Still able to collect requirements and use cases from all entities
● Ensures the best architecture/solutions for most stakeholders
● Can automate the provisioning of almost all network components
With our automated Shared VPC provisioning, microservices teams get:
● An attachment between the Shared VPC Host Project and their GCP project
● A subnet to use for GCE, Cloud SQL, Cloud Memorystore and private Cloud Functions
● Secondary IP ranges when using GKE (though it is hard to call them microservices when doing so)
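As an illustrative sketch of what that hand-out looks like in Terraform: a centrally assigned subnet in the host project, plus subnet-level `roles/compute.networkUser` for the owning team (all names and ranges are hypothetical).

```hcl
resource "google_compute_subnetwork" "foo" {
  project       = "network-host-prod"     # Shared VPC host project
  name          = "microservice-foo"
  region        = "asia-northeast1"
  network       = google_compute_network.shared.id
  ip_cidr_range = "10.64.0.0/24"          # assigned centrally, no overlap
}

# Subnet-level permission: the team can use this subnet and nothing else.
resource "google_compute_subnetwork_iam_member" "foo_user" {
  project    = "network-host-prod"
  region     = "asia-northeast1"
  subnetwork = google_compute_subnetwork.foo.name
  role       = "roles/compute.networkUser"
  member     = "group:foo-team@example.com"  # hypothetical group
}
```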
Challenge 3: How big do we need to think?
38
Challenge 3: How big do we need to think?
A drawback of designing such an architecture as a single team is the size of the scope.
It raises several questions, such as:
● How to process all this information?
● How to understand what we need?
● How to get things done?
39
Challenge 3: How big do we need to think?
Solution 1: Define a “rough” capacity plan
Capacity planning should be one of the requirements for an architecture design, as the design should not prevent scalability.
To create it, we need input from all infrastructure stakeholders.
Some guidelines for defining a capacity plan:
● Understand how much infrastructure is used now, and how much will be in 3 and 5 years
● Extrapolate forecasts from business expectations and the roadmap
● Keep some extra margin for unexpected growth/events
● Get challenged by many stakeholders to improve the plan’s quality
40
Challenge 3: How big do we need to think?
Solution 2: Keep flexibility in the process
It is easy to be too conservative when designing such an architecture and capacity plan, and thus never get them done.
Some advice to keep in mind during the design process:
1. Identify as many two-way-door decisions as possible while keeping the base of your architecture a high-quality decision.
2. No part of the design is ever set in stone, even less so nowadays.
3. Identify which decisions are one-way doors and which are two-way doors first, and tackle the one-way doors first.
41
Challenge 3: How big do we need to think?
Other good questions to ask:
● Does our design enable future technologies such as serverless?
● How much capacity do we need for Disaster Recovery?
● Is our design future-proof? Can it evolve?
● What would be the pain points in managing such a design?
Challenge 4: Private IPv4 addresses exhaustion
43
Challenge 4: Private IPv4 addresses exhaustion
IPv4 is a scarce resource: only ~18M private IPv4 addresses are available.
Kubernetes loves IP addresses, since it gives each pod a unique one.
● No issue when using overlay networks such as Flannel or Calico overlay…
● But in GKE, pods are now first-class citizens with Alias IP
Alias IP gives each pod in a GKE cluster a VPC subnet IP address. Great for a lot of reasons!
However, it can end badly, very badly...
44
Challenge 4: Private IPv4 addresses exhaustion
Breakdown of Kubernetes IP address usage (for a 1000-node GKE cluster with default settings):
● GKE nodes CIDR: /22 (1,024 IPs)
● Pods CIDR: /14 (262,144 IPs), each node gets a /24 (256 IPs) portion
● Services CIDR: /20 (4,096 IPs)
Total: ~267k IP addresses, which is ~1.5% of the total RFC 1918 IPv4 pool!
When scaling out to 8 clusters, with Disaster Recovery, almost 25% of it is used!
➔ Kubernetes is extremely “IPvore” (IP-hungry), so we had to find solutions to make it use fewer IP addresses.
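The arithmetic is easy to keep next to the plan; a small Terraform `locals` block reproducing it (just math, no provider resources):

```hcl
locals {
  node_ips    = pow(2, 32 - 22)  # /22 -> 1,024
  pod_ips     = pow(2, 32 - 14)  # /14 -> 262,144
  service_ips = pow(2, 32 - 20)  # /20 -> 4,096

  cluster_total = local.node_ips + local.pod_ips + local.service_ips  # 267,264

  # RFC 1918: 10/8 + 172.16/12 + 192.168/16 = ~17.9M addresses
  rfc1918_total = pow(2, 24) + pow(2, 20) + pow(2, 16)

  cluster_share = local.cluster_total / local.rfc1918_total  # ~1.5%
}
```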
45
Challenge 4: Private IPv4 addresses exhaustion
Solution: Use Flexible Pod CIDR for GKE clusters
Flexible Pod CIDR trades pod density for IP address savings by limiting the number of pods running per node:
- Default /24 per node -> up to 110 pods per node
- /25 pod CIDR per node -> up to 64 pods per node, 128 IPs saved per node
- /26 pod CIDR per node -> up to 32 pods per node, 192 IPs saved per node
Applied to the earlier calculation, a /26 pod CIDR means:
- Max cluster IP usage: 267k -> 70k, a 74% decrease
- Max pod capacity: 110k -> 32k, a 71% decrease
➔ Depending on your use case you may choose different values; we chose /26 since it fitted our capacity plan.
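On GKE this is a single attribute on the cluster from the earlier sketch; `default_max_pods_per_node = 32` is what makes GKE reserve a /26 per node instead of the default /24 (names are still illustrative):

```hcl
resource "google_container_cluster" "main" {
  name               = "main"
  location           = "asia-northeast1"
  network            = google_compute_network.shared.id
  subnetwork         = google_compute_subnetwork.gke.id
  initial_node_count = 1

  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"
    services_secondary_range_name = "services"
  }

  # Flexible Pod CIDR: at most 32 pods per node -> a /26 (64 IPs)
  # is allocated per node instead of the default /24 (256 IPs).
  default_max_pods_per_node = 32
}
```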
Challenge 5: Identifying edge-cases
47
Challenge 5: Identifying edge-cases
Technical limitations are everywhere, even within cloud providers.
Edge-cases are cases that could invalidate your design, expectedly or unexpectedly.
We researched the GCP documentation thoroughly to find possible limitations of the design.
Main limitations we identified with Shared VPC on GCP (as of August 2019):
● Max Shared VPC Service Projects per Host Project: 100
● Max number of subnets per project: 275
● Max secondary IP ranges per subnet: 30
● Max number of VM instances per VPC network: 15,000
● Max number of firewall rules per project: 500
● Max number of Internal Load Balancers per VPC network: 50
● Max nodes for GKE when using GCLB Ingress: 1,000
48
Challenge 5: Identifying edge-cases
Solution: Research limitations extensively but with moderation
1. Understand the limitations that apply to the architecture design, e.g.:
● 15k VMs max = at most 15 GKE clusters with 1,000 nodes
● 275 subnets max = at most 275 microservices (requiring GCE use)
2. Match these against the capacity plan to ensure they align.
We accepted these limitations on the assumption that:
● They would be lifted in the future, with as few redesigns as possible
● We might not reach this scale (though of course we want to!)
● We made many two-way-door decisions, so it is a calculated risk
➔ The important takeaway is finding the consensus between edge-cases, your capacity plan and your risk assessment.
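One way to keep that consensus honest is to encode the documented ceilings next to the capacity plan, so a change that outgrows a limit is visible in review; the numbers below are the August 2019 values from the previous slide:

```hcl
locals {
  # Documented GCP limits (as of August 2019).
  max_vms_per_vpc = 15000
  max_subnets     = 275

  # Capacity-plan assumption (illustrative).
  nodes_per_cluster = 1000

  # Derived ceilings the plan must stay under.
  max_clusters    = floor(local.max_vms_per_vpc / local.nodes_per_cluster)  # 15
  max_gce_tenants = local.max_subnets                                       # 275
}
```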
Challenge 6: Shared VPC multi-region design
50
Challenge 6: Shared VPC multi-region design
Shared VPC is designed for multi-region, but there are several ways to achieve it. Taking our challenges into account, we defined 4 options for the Shared VPC multi-region design:
● Option 1: 1 Global Shared VPC Host Project, 1 Shared VPC network per region connected with VPC Peering
● Option 2: 1 Global Shared VPC Host Project, 1 Global Shared VPC network
● Option 3: 1 Shared VPC Host Project per region with VPC Peering
● Option 4: 1 Shared VPC Host Project per region without VPC Peering
51
Challenge 6: Shared VPC multi-region design
Option 1: 1 Global Shared VPC Host Project, 1 Shared VPC network per region connected with VPC Peering
Option 2: 1 Global Shared VPC Host Project, 1 Global Shared VPC network
52
Challenge 6: Shared VPC multi-region design
Option 3: 1 Shared VPC Host Project per region with VPC Peering
Option 4: 1 Shared VPC Host Project per region without VPC Peering
53
Challenge 6: Shared VPC multi-region design
After weighing each option’s pros and cons, we chose Option 2 for the following reasons:
● It has the simplest management, with a centralized Shared VPC Host Project for the entire group
● It is the easiest way to implement the infrastructure logic in GitHub and Terraform
● Interconnection between regions is straightforward and leverages GCP’s global VPC network
● It fulfils the architecture goals and our assumptions from Solution 5
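A sketch of Option 2 in Terraform, reusing the hypothetical host project from earlier: one custom-mode VPC (VPC networks are global on GCP) with one subnet per region; regions and ranges are illustrative.

```hcl
resource "google_compute_network" "shared" {
  project                 = "network-host-prod"  # hypothetical host project
  name                    = "shared-vpc"
  auto_create_subnetworks = false                # custom mode
}

# Subnets are regional; the network itself spans all regions.
resource "google_compute_subnetwork" "tokyo" {
  project       = "network-host-prod"
  name          = "tokyo"
  region        = "asia-northeast1"
  network       = google_compute_network.shared.id
  ip_cidr_range = "10.32.0.0/16"                 # illustrative
}

resource "google_compute_subnetwork" "us_central" {
  project       = "network-host-prod"
  name          = "us-central"
  region        = "us-central1"
  network       = google_compute_network.shared.id
  ip_cidr_range = "10.48.0.0/16"                 # illustrative
}
```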
Challenge 7: Making GCE instances private only
55
Challenge 7: Making GCE instances private only
The only way to make instances private is to not give them a public IP address. But when GCE instances only have private IP addresses, they have no outbound Internet connectivity.
To enable it, we need NAT (Network Address Translation).
➔ In a Shared VPC architecture, how can we provide scalable NAT across multiple regions?
56
Challenge 7: Making GCE instances private only
Solution: Use Cloud NAT in each region
Cloud NAT is a scalable regional NAT service for outbound traffic from GCE instances:
- It is embedded in the SDN (Software-Defined Network) and decoupled from standard traffic
- One public IP serves up to 64k TCP and 64k UDP ports
- It integrates within a VPC, so all Shared VPC projects can use a central Cloud NAT
- Useful for getting static IP addresses for third parties using IP whitelists
- Default setting of 64 ports per GCE instance
For GKE nodes, 64 ports may be a bit low given how many pods they host.
➔ Need to fine-tune the number of NAT IPs and the number of ports allocated per VM to find a good balance for GKE nodes.
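A sketch of the per-region setup with hypothetical names; `min_ports_per_vm` is the knob for raising the 64-port default for pod-dense GKE nodes (the value below is illustrative, not a recommendation):

```hcl
# One Cloud Router + Cloud NAT per region in the Shared VPC host project.
resource "google_compute_router" "tokyo" {
  name    = "nat-router-tokyo"
  region  = "asia-northeast1"
  network = google_compute_network.shared.id
}

resource "google_compute_router_nat" "tokyo" {
  name                               = "nat-tokyo"
  router                             = google_compute_router.tokyo.name
  region                             = "asia-northeast1"
  nat_ip_allocate_option             = "AUTO_ONLY"
  source_subnetwork_ip_ranges_to_nat = "ALL_SUBNETWORKS_ALL_IP_RANGES"

  # 64 ports per VM (the default) is low for nodes running ~32 pods;
  # raise it and size the NAT IP pool accordingly.
  min_ports_per_vm = 1024
}
```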
Final design
58
Final network architecture design
Wrap-up
60
Wrap-up
● Lots of issues with default network settings -> the network must be redesigned
● Solving these issues should define the new architecture goals
● Design research brings many unexpected challenges
● The network ownership model must align with the network architecture
● Be strategic when designing IP assignment for Shared VPC and GKE
● Identifying edge-cases is crucial to evaluating the architecture design
● Multi-region + Shared VPC is not straightforward
● Ensure NAT capacity when making instances private
Thank you for coming!
(We’re hiring!!!)