By Jacob Meyers and Rob Zienert
Temporal is a Durable Execution platform that lets you write code “as if failures don’t exist”. It has become increasingly critical to Netflix since its initial adoption in 2021, with users ranging from the operators of our Open Connect global CDN to our Live reliability teams now depending on Temporal to operate their business-critical services. In this post, I’ll give a high-level overview of what Temporal offers users, the problems we were experiencing operating Spinnaker that motivated its initial adoption at Netflix, and how Temporal helped us reduce the number of transient deployment failures at Netflix from 4% to 0.0001%.
A Crash Course on (some of) Spinnaker
Spinnaker is a multi-cloud continuous delivery platform that powers the vast majority of Netflix’s software deployments. It’s composed of several (mostly nautical-themed) microservices. Let’s double-click on two in particular to understand the problems we were facing that led us to adopt Temporal.
If you’re completely new to Spinnaker, its fundamental tool for deployments is the Pipeline. A Pipeline is composed of a sequence of steps called Stages, which themselves can be decomposed into one or more Tasks or other Stages. An example deployment pipeline for a production service may consist of these stages: Find Image -> Run Smoke Tests -> Run Canary -> Deploy to us-east-2 -> Wait -> Deploy to us-east-1.

Pipeline configuration is extremely flexible. You can have Stages run completely serially, one after another, or you can have a mix of concurrent and serial Stages. Stages can also be executed conditionally based on the result of previous stages. This brings us to our first Spinnaker service: Orca. Orca is the orca-stration engine of Spinnaker. It’s responsible for managing the execution of the Stages and Tasks that a Pipeline unrolls into and coordinating with other Spinnaker services to actually execute them.
One of those collaborating services is called Clouddriver. In the example Pipeline above, some of the Stages will require interfacing with cloud infrastructure. For example, the canary deployment involves creating ephemeral hosts to run an experiment, and a full deployment of a new version of the service may involve spinning up new servers and then tearing down the old ones. We call these sorts of operations that mutate cloud infrastructure Cloud Operations. Clouddriver’s job is to decompose and execute Cloud Operations sent to it by Orca as part of a deployment. Cloud Operations sent from Orca to Clouddriver are relatively high level (for example: createServerGroup), so Clouddriver understands how to translate these into lower-level cloud provider API calls.
Pain points in the interaction between Orca and Clouddriver and the implementation details of Cloud Operation execution in Clouddriver are what led us to look for new solutions and ultimately migrate to Temporal, so we’ll next look at the anatomy of a Cloud Operation. Cloud Operations in the OSS version of Spinnaker still work as described below, so motivated readers can follow along in the source code. Our migration to Temporal, however, is entirely closed-source: we forked from OSS in 2020 to allow Netflix to make larger pivots to the product such as this one.
The Original Cloud Operation Flow
A Cloud Operation’s execution goes something like this:
- Orca, in orchestrating a Pipeline execution, decides a particular Cloud Operation needs to be performed. It sends a POST request to Clouddriver’s /ops endpoint with an untyped bag-of-fields.
- Clouddriver attempts to resolve the operation Orca sent into a set of AtomicOperations, internal operations that only Clouddriver understands.
- If the payload was valid and Clouddriver successfully resolved the operation, it will immediately return a Task ID to Orca.
- Orca will immediately begin polling Clouddriver’s GET /task/<id> endpoint to keep track of the status of the Cloud Operation.
- Asynchronously, Clouddriver begins executing AtomicOperations using its own internal orchestration engine. Ultimately, the AtomicOperations resolve into cloud provider API calls. As the Cloud Operation progresses, Clouddriver updates an internal state store to surface progress to Orca.
- Eventually, if all went well, Clouddriver marks the Cloud Operation complete, which surfaces to Orca on a subsequent poll. Orca considers the Cloud Operation finished, and the deployment can progress.
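Seen from Orca's side, the flow above amounts to submit-then-poll. The following is a minimal in-memory sketch of that pattern, not Spinnaker's actual code: `FakeClouddriver`, `submit`, `status`, and `awaitCompletion` are hypothetical names standing in for the HTTP calls, and the "operation" simply completes after a fixed number of polls.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** In-memory sketch of the submit-then-poll pattern; all names are hypothetical. */
final class PollingSketch {
    enum Status { RUNNING, SUCCEEDED, FAILED }

    /** Stand-in for Clouddriver: accepts an operation and reports its status when polled. */
    static final class FakeClouddriver {
        private final Map<String, Integer> pollsRemaining = new HashMap<>();

        /** Mimics submitting an operation: a task ID is returned immediately. */
        String submit(Map<String, Object> untypedPayload, int pollsUntilDone) {
            String taskId = UUID.randomUUID().toString();
            pollsRemaining.put(taskId, pollsUntilDone);
            return taskId;
        }

        /** Mimics the status endpoint: reports RUNNING until the countdown reaches zero. */
        Status status(String taskId) {
            Integer remaining = pollsRemaining.get(taskId);
            if (remaining == null) {
                return Status.FAILED; // unknown task ID: nothing to report
            }
            if (remaining <= 0) {
                return Status.SUCCEEDED;
            }
            pollsRemaining.put(taskId, remaining - 1);
            return Status.RUNNING;
        }
    }

    /** Caller-side loop: poll until a terminal status or until maxPolls is exhausted. */
    static Status awaitCompletion(FakeClouddriver clouddriver, String taskId, int maxPolls) {
        for (int i = 0; i < maxPolls; i++) {
            Status s = clouddriver.status(taskId);
            if (s != Status.RUNNING) {
                return s;
            }
            // A real caller would sleep between polls; omitted to keep the sketch fast.
        }
        return Status.FAILED; // treated here as a polling timeout
    }
}
```

Even in this toy form, the caller has to own the polling loop, the timeout, and the interpretation of each status, which is the coupling the next section describes.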

This works well enough on the happy path, but veer off the happy path and dragons begin to emerge:
- Clouddriver has its own internal orchestration system, independent of Orca, so that Orca can query the progress of Cloud Operations. This is largely undifferentiated lifting relative to Clouddriver’s goal of actuating cloud infrastructure changes, and ultimately adds complexity and surface area for bugs to the application. Additionally, Orca is tightly coupled to Clouddriver’s orchestration system: it must understand how to poll Clouddriver, interpret the status, and handle errors returned by Clouddriver.
- Distributed systems are messy: networks and external services are unreliable. While executing a Cloud Operation, Clouddriver could experience transient network issues, or the cloud provider it’s attempting to call into may be having an outage, or any number of issues in between. Despite all of this, Clouddriver must be as reliable as reasonably possible as a core platform service. To deal with this type of problem, Clouddriver internally evolved complex retry logic, further adding cognitive complexity to the system.
- Remember how a Cloud Operation gets decomposed by Clouddriver into AtomicOperations? Sometimes, if there’s a failure in the middle of a Cloud Operation, we need to be able to roll back what was done in AtomicOperations prior to the failure. This led to a homegrown Saga framework being implemented inside Clouddriver. While this did result in a big step forward in the reliability of Cloud Operations facing transient failures, because the Saga framework also allowed replaying partially failed Cloud Operations, it added yet more undifferentiated lifting inside the service.
- The task state kept by Clouddriver was instance-local. In other words, if the Clouddriver instance carrying out a Cloud Operation crashed, that Cloud Operation state was lost, and Orca would eventually time out polling for the task status. The Saga implementation mentioned above mitigated this for certain operations, but was not widely adopted across all cloud providers supported by Spinnaker.
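The rollback requirement in the third point is the core of the Saga pattern: each step is paired with a compensating action, and on failure the completed steps are compensated in reverse order. The sketch below is a generic illustration under that assumption, not Clouddriver's Saga framework; `SagaSketch`, `Step`, and `run` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Generic Saga-style sketch; not Clouddriver's implementation. */
final class SagaSketch {
    /** A unit of work paired with the action that undoes it. */
    static final class Step {
        final String name;
        final Runnable action;
        final Runnable compensation;

        Step(String name, Runnable action, Runnable compensation) {
            this.name = name;
            this.action = action;
            this.compensation = compensation;
        }
    }

    /**
     * Runs steps in order. If one throws, compensates the already completed
     * steps newest-first and returns false. Returns true if every step succeeds.
     */
    static boolean run(List<Step> steps, List<String> log) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            try {
                step.action.run();
                completed.push(step);
                log.add("did " + step.name);
            } catch (RuntimeException e) {
                log.add("failed " + step.name);
                while (!completed.isEmpty()) {
                    Step done = completed.pop();
                    done.compensation.run();
                    log.add("undid " + done.name);
                }
                return false;
            }
        }
        return true;
    }
}
```

Note what the sketch leaves out: persisting progress so that another instance can resume after a crash. That persistence is the hard part, and it is the part the last point above says was missing for many operations.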
We introduced a lot of incidental complexity into Clouddriver in an effort to keep Cloud Operation execution reliable, and despite all this, deployments still failed around 4% of the time due to transient Cloud Operation failures.
Now, I can already hear you saying: “So what? Can’t people retry their deployments if they fail?” While true, some pipelines take days to complete for complex deployments, and a failed Cloud Operation midway through requires re-running the whole thing. This was detrimental to engineering productiveness at Netflix in a non-trivial way. Rather than continue trying to build a faster horse, we began to look elsewhere for our reliable orchestration requirements, which is where Temporal comes in.
Temporal: Basic Concepts
Temporal is an open source product that offers a durable execution platform for your applications. Durable execution means that the platform will ensure your programs run to completion despite adverse conditions. With Temporal, you organize your business logic into Workflows, which are a deterministic series of steps. The steps inside of Workflows are called Activities, which is where you encapsulate all the non-deterministic logic that needs to happen in the course of executing your Workflows. As your Workflows execute in processes called Workers, the Temporal server durably stores their execution state so that in the event of failures your Workflows can be retried or even migrated to a different Worker. This makes Workflows highly resilient to the sorts of transient failures Clouddriver was susceptible to. Here’s a simple example Workflow in Java that runs an Activity to send an email once every 30 days:
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;
import java.time.Duration;

@WorkflowInterface
public interface SleepForDaysWorkflow {
  @WorkflowMethod
  void run();
}

public class SleepForDaysWorkflowImpl implements SleepForDaysWorkflow {
  // Stub through which the Workflow invokes Activities; each call may take up to 10 seconds.
  private final SendEmailActivities emailActivities =
      Workflow.newActivityStub(
          SendEmailActivities.class,
          ActivityOptions.newBuilder()
              .setStartToCloseTimeout(Duration.ofSeconds(10))
              .build());

  @Override
  public void run() {
    while (true) {
      // Activities already carry retries/timeouts via options.
      emailActivities.sendEmail();
      // Pause the workflow for 30 days before sending the next email.
      Workflow.sleep(Duration.ofDays(30));
    }
  }
}

@ActivityInterface
public interface SendEmailActivities {
  void sendEmail();
}
There are some interesting things to note about this Workflow:
- Workflows and Activities are just code, so you can test them using the same techniques and processes as the rest of your codebase.
- Activities are automatically retried by Temporal with configurable exponential backoff.
- Temporal manages all the execution state of the Workflow, including timers (like the one used by Workflow.sleep). If the Worker executing this workflow were to have its power cable unplugged, Temporal would ensure another Worker continues to execute it (even during the 30-day sleep).
- Workflow sleeps are not compute-intensive, and they don’t tie up the process.
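To make "configurable exponential backoff" concrete, the sketch below computes the wait before each retry as the initial interval multiplied by a backoff coefficient once per prior retry, capped at a maximum interval. The parameters mirror common retry-policy settings, but the class `BackoffSketch` and the sample values are assumptions for illustration, not Netflix's configuration or the Temporal SDK's code.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/** Illustrative backoff arithmetic; not taken from the Temporal SDK. */
final class BackoffSketch {
    /**
     * Returns the wait before each of the first `retries` retries:
     * initial, initial * coefficient, initial * coefficient^2, ... capped at max.
     */
    static List<Duration> schedule(Duration initial, double coefficient, Duration max, int retries) {
        List<Duration> waits = new ArrayList<>();
        double millis = initial.toMillis();
        for (int i = 0; i < retries; i++) {
            long capped = (long) Math.min(millis, (double) max.toMillis());
            waits.add(Duration.ofMillis(capped));
            millis *= coefficient;
        }
        return waits;
    }
}
```

With an initial interval of 1 second, a coefficient of 2.0, and a 10-second cap, the first five waits are 1, 2, 4, 8, and 10 seconds.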
You might already begin to see how Temporal solves some of the problems we had with Clouddriver. Ultimately, we decided to pull the trigger on migrating Cloud Operation execution to Temporal.
Cloud Operations with Temporal
Today, we execute Cloud Operations as Temporal workflows. Here’s what that looks like.
1. Orca, using a Temporal client, sends a request to Temporal to execute an UntypedCloudOperationRunner Workflow. The contract of the Workflow looks something like this:
@WorkflowInterface
interface UntypedCloudOperationRunner {
  /**
   * Runs a cloud operation given an untyped payload.
   *
   * WorkflowResult is a thin wrapper around OutputType providing a standard contract for
   * clients to determine if the CloudOperation was successful and to fetch any errors.
   */
  @WorkflowMethod
  fun <OutputType : CloudOperationOutput> run(stageContext: Map<String, Any?>, operationType: String): WorkflowResult<OutputType>
}
2. The Clouddriver Temporal worker is constantly polling Temporal for work. A worker will eventually see a task for an UntypedCloudOperationRunner Workflow and start executing it.
3. Similar to the earlier resolution into AtomicOperations, Clouddriver does some pre-processing of the bag-of-fields in stageContext and resolves it to a strongly typed implementation of the CloudOperation Workflow interface based on the operationType input and the stageContext:
interface CloudOperation<I : CloudOperationInput, O : CloudOperationOutput> {
  @WorkflowMethod
  fun operate(input: I, credentials: AccountCredentials<out Any>): O
}
4. Clouddriver starts a Child Workflow execution of the CloudOperation implementation it resolved. The child workflow will execute Activities which handle the actual cloud provider API calls to mutate infrastructure.
5. Orca uses its Temporal Client to await completion of the UntypedCloudOperationRunner Workflow. Once it’s complete, Temporal notifies the client and sends the result, and Orca can continue progressing the deployment.
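Step 3, the resolution from an untyped bag-of-fields to a typed operation, can be pictured as a registry keyed by operationType. This is only a guess at the general shape, written in plain Java without the Temporal SDK: `ResolverSketch`, `ResizeInput`, the `resizeServerGroup` key, and the field names read from the context are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of resolving an untyped payload into a typed input. */
final class ResolverSketch {
    /** Typed input for a made-up resize operation. */
    static final class ResizeInput {
        final String serverGroup;
        final int capacity;

        ResizeInput(String serverGroup, int capacity) {
            this.serverGroup = serverGroup;
            this.capacity = capacity;
        }
    }

    /** One converter per operation type; each turns the untyped context into a typed input. */
    private static final Map<String, Function<Map<String, Object>, Object>> RESOLVERS = new HashMap<>();

    static {
        RESOLVERS.put("resizeServerGroup", ctx -> new ResizeInput(
            (String) ctx.get("serverGroupName"),
            ((Number) ctx.get("capacity")).intValue()));
    }

    /** Looks up the converter for operationType and applies it; rejects unknown types. */
    static Object resolve(String operationType, Map<String, Object> stageContext) {
        Function<Map<String, Object>, Object> resolver = RESOLVERS.get(operationType);
        if (resolver == null) {
            throw new IllegalArgumentException("unknown operation type: " + operationType);
        }
        return resolver.apply(stageContext);
    }
}
```

Doing the conversion once at the boundary means everything downstream of it can work with typed inputs instead of raw maps.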

Results and Lessons Learned from the Migration
A shiny new architecture is great, but equally important is the non-glamorous work of refactoring legacy systems to fit the new architecture. How did we integrate Temporal into critical dependencies of all Netflix engineers transparently?
The answer, of course, is a combination of abstraction and dynamic configuration. We built a CloudOperationRunner interface in Orca to encapsulate whether the Cloud Operation was being executed via the legacy path or Temporal. At runtime, Fast Properties (Netflix’s dynamic configuration system) determined which path a stage that needed to execute a Cloud Operation would take. We could set these properties quite granularly: by Stage type, cloud provider account, Spinnaker application, Cloud Operation type (for example, createServerGroup), and cloud provider (either AWS or Titus in our case). The Spinnaker services themselves were the first to be deployed using Temporal, and within two quarters, all applications at Netflix were onboarded.
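As a rough sketch of that abstraction: an interface hides which path runs the operation, and a lookup against dynamic properties picks the implementation per request. The `CloudOperationRunner` name comes from the post; everything else here (the method signature, the property keys, the precedence of application over operation type, and the `Map` standing in for Fast Properties) is assumed for illustration.

```java
import java.util.Map;

/** Hypothetical sketch of property-driven routing between two execution paths. */
final class RoutingSketch {
    interface CloudOperationRunner {
        String run(String operationType);
    }

    static final class LegacyRunner implements CloudOperationRunner {
        public String run(String operationType) {
            return "legacy:" + operationType;
        }
    }

    static final class TemporalRunner implements CloudOperationRunner {
        public String run(String operationType) {
            return "temporal:" + operationType;
        }
    }

    /**
     * Picks a runner from the most specific matching property:
     * a per-application flag wins over a per-operation flag, and the
     * legacy path is the default when nothing is configured.
     */
    static CloudOperationRunner select(Map<String, Boolean> props, String application, String operationType) {
        Boolean byApp = props.get("temporal.enabled.app." + application);
        if (byApp != null) {
            return byApp ? new TemporalRunner() : new LegacyRunner();
        }
        Boolean byOperation = props.get("temporal.enabled.operation." + operationType);
        if (byOperation != null && byOperation) {
            return new TemporalRunner();
        }
        return new LegacyRunner();
    }
}
```

Because callers only see the interface, flipping a property moves traffic between paths without a code change, which is what makes an incremental rollout (and a quick rollback) practical.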
Impact
What did we have to show for it all? With Temporal as the orchestration engine for Cloud Operations, the percentage of deployments that failed due to transient Cloud Operation failures dropped from 4% to 0.0001%. For those keeping track at home, that’s a 40,000x reduction, or roughly four and a half orders of magnitude. Virtually eliminating this failure mode for deployments was a huge win for developer productivity, especially for teams with long and complex deployment pipelines.
Beyond the improvement in deployment success metrics, we saw a number of other benefits:
- Orca no longer needs to communicate directly with Clouddriver to start Cloud Operations or poll their status; Temporal is now the intermediary. The services are less coupled, which is a win for maintainability.
- Speaking of maintainability, with Temporal doing the heavy lifting of orchestration and retries inside of Clouddriver, we got to remove a lot of the homegrown logic we’d built up over the years for the same purpose.
- Since Temporal manages execution state, Clouddriver instances became stateless and Cloud Operation execution can bounce between instances with impunity. We can treat Clouddriver instances more like cattle and enable things like Chaos Monkey for the service, which we were previously prevented from doing.
- Migrating Cloud Operation steps into Activities was a forcing function to rewrite the logic to be idempotent. Since Temporal retries Activities by default, it’s generally recommended that they be idempotent. This alone fixed a number of issues that existed previously when operations were retried in Clouddriver.
- We set the retry timeout for Activities in Clouddriver to be two hours by default. This gives us a long leash to fix forward or roll back Clouddriver if we introduce a regression before customer deployments fail. To them, it might just look like a deployment is taking longer than usual.
- Cloud Operations are much easier to introspect than before. Temporal ships with a great UI to help visualize Workflow and Activity executions, which is a huge boon for debugging live Workflows executing in production. The Temporal SDKs and server also emit a lot of useful metrics.
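The idempotency point deserves a small illustration. If a step can be retried after a timeout, it must tolerate having already succeeded once. One common approach, shown here as a generic sketch and not as Clouddriver's code, is to check for the desired end state before acting; `FakeCloud` and `ensureServerGroup` are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

/** Generic check-before-act sketch of an idempotent step. */
final class IdempotencySketch {
    /** Stand-in for a cloud provider API whose create call fails on duplicates. */
    static final class FakeCloud {
        final Set<String> serverGroups = new HashSet<>();
        int createCalls = 0;

        void create(String name) {
            createCalls++;
            if (!serverGroups.add(name)) {
                throw new IllegalStateException("already exists: " + name);
            }
        }
    }

    /** Safe to call repeatedly: creates the server group only if it is missing. */
    static void ensureServerGroup(FakeCloud cloud, String name) {
        if (cloud.serverGroups.contains(name)) {
            return; // a previous attempt already succeeded; nothing to do
        }
        cloud.create(name);
    }
}
```

Calling `ensureServerGroup` twice with the same name issues a single create call, so a retry after an ambiguous failure does not turn into a duplicate-resource error.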

Lessons Learned
With the benefit of hindsight, there are also some lessons we can share from this migration:
1. Avoid unnecessary Child Workflows: Structuring Cloud Operations as an UntypedCloudOperationRunner Workflow that starts Child Workflows to actually execute the Cloud Operation’s logic was unnecessary, and the indirection made troubleshooting more difficult. There are situations where Child Workflows are appropriate, but in this case we were using them as a tool for code organization, which is generally unnecessary. We could’ve achieved the same effect with class composition in the top-level parent Workflow.
2. Use single argument objects: At first, we structured Workflow and Activity functions with multiple arguments, much as you’d write normal functions. This can be problematic because of Temporal’s determinism constraints. Adding or removing an argument from a function signature is not a backward-compatible change, and doing so can break long-running workflows. It’s also not immediately obvious in code review that such a change is problematic. The preferred pattern is to use a single serializable class to hold all the arguments for Workflows and Activities, since its fields can be changed more freely without breaking determinism.
3. Separate business failures from workflow failures: We like the pattern of the WorkflowResult type that UntypedCloudOperationRunner returns in the interface above. It allows us to communicate business process failures without failing the Workflow itself and to have more nuance overall in error handling. This is a pattern we’ve carried over to other Workflows we’ve implemented since.
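Lessons 2 and 3 can be sketched together. The shapes below are assumptions for illustration: the post names `WorkflowResult` but does not show its fields, and `ResizeInput` and `resize` are hypothetical. The input is a single object that can gain fields without changing the method signature, and the result carries business errors as data instead of throwing.

```java
import java.util.Collections;
import java.util.List;

/** Hypothetical shapes for a single-argument input and a result wrapper. */
final class LessonsSketch {
    /** Single argument object: add fields here instead of adding method parameters. */
    static final class ResizeInput {
        final String serverGroup;
        final int desiredCapacity;

        ResizeInput(String serverGroup, int desiredCapacity) {
            this.serverGroup = serverGroup;
            this.desiredCapacity = desiredCapacity;
        }
    }

    /** Result wrapper: either an output or a list of business errors. */
    static final class WorkflowResult<T> {
        final T output;
        final List<String> errors;

        private WorkflowResult(T output, List<String> errors) {
            this.output = output;
            this.errors = errors;
        }

        static <T> WorkflowResult<T> success(T output) {
            return new WorkflowResult<>(output, Collections.<String>emptyList());
        }

        static <T> WorkflowResult<T> failure(String error) {
            return new WorkflowResult<>(null, Collections.singletonList(error));
        }

        boolean isSuccess() {
            return errors.isEmpty();
        }
    }

    /** A business-rule violation is reported in the result rather than thrown. */
    static WorkflowResult<Integer> resize(ResizeInput input) {
        if (input.desiredCapacity < 0) {
            return WorkflowResult.failure("desiredCapacity must be non-negative");
        }
        return WorkflowResult.success(input.desiredCapacity);
    }
}
```

The caller inspects `isSuccess()` and `errors` to decide what to do next, while exceptions stay reserved for failures of the execution itself.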
Temporal at Netflix Today
Temporal adoption has skyrocketed at Netflix since its initial introduction for Spinnaker. Today, we have hundreds of use cases, and we’ve seen adoption double in the last year with no signs of slowing down.
One major difference between initial adoption and today is that Netflix migrated from an on-prem Temporal deployment to Temporal Cloud, which is Temporal’s SaaS offering of the Temporal server. This has let us scale Temporal adoption while running a lean team. We’ve also built up a robust internal platform around Temporal Cloud to integrate with Netflix’s internal ecosystem and make onboarding for our developers as easy as possible. Stay tuned for a future post digging into more specifics of our Netflix Temporal platform.
Acknowledgement
We all stand on the shoulders of giants in software. I want to call out that in this post I’m retelling the work of my two stellar colleagues, Chris Smalley and Rob Zienert, the two engineers who introduced Temporal and carried out the migration.
How Temporal Powers Reliable Cloud Operations at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.