How We Made Fast Cloud-to-Cloud Connections Possible

The traditional way to enable things hosted in different clouds to talk to each other has been to order a physical cross-connect (a literal cable) from a colo provider that hosts the clouds’ “onramps” (private connections to the clouds’ networks) to link them.

Various software-defined alternatives have emerged in recent years. Some do essentially the same thing but virtually, while others, like typical SD-WANs, rely on things like IPsec or other encryption protocols to secure connections traversing the internet.

There are other, less mature approaches, including some cloud providers’ own intercloud connectivity services. And there’s always the option to send your cloud-to-cloud API requests over the public internet and hope for the best.

The traditional ways tend to be costly and complex to set up, requiring specialized networking knowledge. A little more than a year ago, we decided to build a cloud-to-cloud private connectivity service that would be easier to use. The goal was to give developers who are proficient in cloud but not in networking a way to create multicloud connections and manage them with familiar tools, like Terraform or Pulumi, without hosting any infrastructure in our data centers.

It was an interesting challenge, not least from an architecture perspective. We’re going to highlight some of the key architectural decisions we made as we built the service and explain the reasons behind them.

Abstracting Network Configuration

The service, Fabric Cloud Router (FCR), is named after Equinix Fabric, our software-defined network that automates private connections between network nodes hosted in our data centers (cloud providers, network carriers, ISPs, enterprises and others). Fabric and FCR both run on the same carrier-grade hardware routers, a deliberate architecture decision that we will return to after we cover a couple of others.

We wanted speed and flexibility in both the FCR user experience and our own process of building and improving that experience over time. We achieved both goals by decoupling the collection of microservices that comprise the user-facing product from low-level network configuration management, creating an abstraction layer for the microservices to interact with instead of interacting directly with the network infrastructure.

This way, the developer team working on the product doesn’t know (and doesn’t need to know) anything about what those hardcore low-level configurations in the carrier-grade Border Gateway Protocol (BGP) stack look like. The team can focus on the product without worrying about the effects of their decisions on functionality at the network level.

This also allowed us to decouple the process of requesting and configuring connections by the user from provisioning those connections, making the process faster and more flexible. The key to this is holding off on applying any configurations in the network until all the necessary details are gathered from the user. They order an FCR, choose what endpoints to connect it to, select connection bandwidth and redundancy and set their routing protocols. There is no waiting for any actions to take place at the network level after each of those steps. No events get sent and consumed unnecessarily, which avoids polluting our network with config messages that aren’t yet needed.
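
To make that idea concrete, here is a minimal sketch of the pattern, not FCR’s actual code; every class, field and method name below is hypothetical. A product-layer service accumulates the user’s choices in memory and only hands the completed intent to the network abstraction layer at the end:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: collect all of the user's choices first, provision once at the end.
public class FcrOrderDraft {

    public record ConnectionDraft(String endpoint, int bandwidthMbps,
                                  boolean redundant, String routingProtocol) {}

    private final String metro;
    private final List<ConnectionDraft> connections = new ArrayList<>();

    public FcrOrderDraft(String metro) {
        this.metro = metro;
    }

    // Each step only updates the draft; nothing touches the network yet
    // and no config events are published.
    public FcrOrderDraft addConnection(String endpoint, int bandwidthMbps,
                                       boolean redundant, String routingProtocol) {
        connections.add(new ConnectionDraft(endpoint, bandwidthMbps, redundant, routingProtocol));
        return this;
    }

    // Only when the draft is complete does a single provisioning request go
    // to the network abstraction layer (represented here by a simple interface).
    public void submit(NetworkAbstractionLayer network) {
        if (connections.isEmpty()) {
            throw new IllegalStateException("At least one connection is required before provisioning");
        }
        network.provision(metro, List.copyOf(connections));
    }

    public interface NetworkAbstractionLayer {
        void provision(String metro, List<ConnectionDraft> connections);
    }
}
```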

This is very different from the traditionally sluggish process for configuring network connections, where each config step has to materialize in the network before the next one can be taken.

An Event-Driven Architecture

While the Java-based microservices that comprise the FCR product layer communicate with each other via REST APIs, the product layer communicates with the network abstraction layer using an asynchronous event-driven architecture. Services publish events to Apache Kafka, where other services find and consume events that are relevant to them. (Fabric microservices also use both REST APIs and Kafka events).
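
As a generic illustration of that pattern (using the standard Apache Kafka client, not FCR’s actual services; the topic name and JSON payload are made up), a product-layer service might publish a connection-requested event like this:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Generic sketch: a product-layer service publishing an event for downstream
// services to consume. Topic name and payload are hypothetical.
public class ConnectionEventPublisher {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String connectionId = "conn-123";
            String payload = "{\"connectionId\":\"conn-123\",\"endpoint\":\"azure-onramp\",\"bandwidthMbps\":1000}";
            // Keyed by connection ID so all events for one connection land on the same partition.
            producer.send(new ProducerRecord<>("connection.requested", connectionId, payload));
        }
    }
}
```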

A microservice called Cloud Router Manager, for example, handles all operations around creating, updating and retrieving data about customer FCRs and their connections. Another microservice, called Connection Manager, handles connection requests, collecting information from downstream services that have more insight into the network. The Routing Protocol Manager stores users’ routing protocols in a database. This is just a handful of examples from the dozens of microservices that comprise the solution.
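
Continuing the same hypothetical sketch, a service like Connection Manager would sit on the consuming side of such a topic; again, the topic and group names are illustrative only:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Generic sketch of the consuming side: a service reacting to connection-request events.
public class ConnectionRequestConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "connection-manager");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("connection.requested"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Here the real service would gather details from downstream
                    // services and kick off provisioning via the abstraction layer.
                    System.out.printf("Handling connection request %s: %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```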

No Extra Hardware, No VMs

Let’s return to the decision to run FCR on the same hardware routers that Fabric runs on. First, not having to deploy more hardware or virtual machines (the conventional but higher-latency way to host virtual network functions) is in itself a win. Plus, any new sites where Fabric is launched in the future can support FCR out of the gate.

A single FCR is a virtual routing function distributed across physical routers in multiple Equinix data centers within each metro where the service is available. Any endpoints accessible via Fabric within a metro are accessible to any FCR created there. You can route traffic between the endpoints, which peer with your FCR, using BGP.

The closest construct to FCR from the cloud world is the virtual private cloud. You can create a VPC in a cloud availability region that includes multiple subnets, each running in a separate availability zone (its own data center) in that region. You can create another VPC for a different set of workloads and have the two VPCs either completely isolated from each other or configured to connect and exchange data. Likewise, you can create multiple FCRs for different purposes.

One way an FCR differs from a VPC is that it abstracts the individual data centers away from the user. They spin up an FCR within a metro and choose an endpoint to connect (an Azure onramp, for example). A Layer 2 connection is then provisioned between Azure and Fabric using existing network-to-network interfaces (NNIs). At this point, the user can configure their Layer 3 connection to Microsoft Azure via BGP.
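
At that Layer 3 step, what the user supplies is essentially a BGP peering definition. A hedged sketch of what that might look like as a simple data object (the field names are our own for illustration, not FCR’s actual API):

```java
// Hypothetical sketch of the BGP settings a user supplies for the Layer 3 step.
// Field names are illustrative only, not FCR's actual API.
public record BgpPeeringConfig(
        String connectionId,   // the Layer 2 connection this peering rides on
        long customerAsn,      // the user's autonomous system number
        long fcrAsn,           // the ASN the FCR peers from
        String customerPeerIp, // e.g. "169.254.0.1/30"
        String fcrPeerIp,      // e.g. "169.254.0.2/30"
        String md5AuthKey      // optional BGP session password, may be null
) {}
```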

When the user wants to connect to another node in the metro (say, an AWS onramp), they make all the necessary selections, and their configuration gets automatically distributed over our MPLS network to connect their FCR to wherever in the metro that onramp is hosted.

Because Fabric latency between data centers within a metro is very low – for example, the latency on FCR connections in Ashburn (our Washington, D.C. metro) to AWS and Azure in their respective US East cloud regions is less than 2 milliseconds – the physical location of the endpoints doesn’t affect performance.

We’ve only provided a few major highlights of FCR’s architecture here. We haven’t covered all the other microservices in the product layer or how we go about abstracting network configuration. We also haven’t touched on its other capabilities, like connecting hybrid cloud environments, routing between metros or advanced BGP management tools like AS path prepending and MED attributes.

It’s been rewarding to get to think through what multicloud connectivity should look like from the perspective of a user who isn’t a networking expert but needs to build or maintain environments that straddle multiple platforms. If you’re curious about FCR, it’s free to try for AWS customers.
