100X Faster: How We Supercharged Netflix Maestro’s Workflow Engine | by Netflix Technology Blog | Sep, 2025

By Team Entertainer · September 29, 2025 · 26 Mins Read


Netflix Technology Blog

By Jun He, Yingyi Zhang, Ely Spears

TL;DR

We recently upgraded the Maestro engine to go beyond scalability and improved its performance by 100X! The overall overhead is reduced from seconds to milliseconds. We have updated the Maestro open source project with this improvement! Please visit the Maestro GitHub repository to get started. If you find it useful, please give us a star.

Introduction

In our earlier blog post, we introduced Maestro as a horizontally scalable workflow orchestrator designed to manage large-scale Data/ML workflows at Netflix. Over the past two and a half years, Maestro has achieved its design goal and successfully supported massive workflows with hundreds of thousands of jobs, managing millions of executions daily. As the adoption of Maestro increases at Netflix, new use cases have emerged, driven by Netflix's evolving business needs, such as Live, Ads, and Games. To meet these needs, some of the workflows are now scheduled on a sub-hourly basis. Additionally, Maestro is increasingly being used for low-latency use cases, such as ad hoc queries, beyond traditional daily or hourly scheduled ETL data pipeline use cases.

While Maestro excels at orchestrating diverse heterogeneous workflows and managing users' end-to-end development experience, users have encountered noticeable speed bumps (i.e., roughly ten seconds of overhead) from the Maestro engine during workflow executions and development, affecting overall efficiency and productivity. Although fully scalable to support Netflix-scale use cases, the processing overhead from Maestro's internal engine state transitions and lifecycle actions has become a bottleneck, particularly during development cycles. Users have expressed the need for a high-performance workflow engine to support iterative development use cases.

To visualize our end users' needs for the workflow orchestrator, we created the 5-layer structure shown below. Before the change, Maestro reached level 4 but faced challenges meeting users' needs at level 5. With the new engine design, Maestro is able to empower users to work at their highest capacity and spark joy for end users during their development on Maestro.

Figure 1. A 5-layer structure showing needs for the workflow orchestrator.

In this blog post, we will share the details of the new engine, explain our design trade-off decisions, and share learnings from this redesign work.

Architectural Evolution of Maestro

Before the change

To understand the improvements, we will first revisit the original architecture of Maestro and see why the overhead was high. The system was divided into three main layers, as illustrated in the diagram below. In the sections that follow, we will explain each layer and the role it played in our performance optimization.

Figure 2. The architecture diagram before the evolution.

Maestro API and Step Runtime Layer

This layer provides seamless integrations with other Netflix services (e.g., compute engines like Spark and Trino). Using Maestro, thousands of practitioners build production workflows using a paved path to access platform services. They can focus mainly on their business logic while relying on Maestro to manage the lifecycle of jobs and workflows, plus the integration with data platform services and required integrations such as authentication, monitoring, and alerting. This layer functioned well without introducing significant overhead.

Maestro Engine Layer

The Maestro engine serves several critical functions:

  • Managing the lifecycle of workflows and their steps, and maintaining their state machines
  • Supporting all user actions (e.g., start, restart, stop, pause) on workflow and step entities
  • Translating complex Maestro workflow graphs into parallel flows, where each flow is an array of sequentially chained flow tasks, translating every step into a flow task, and then executing the transformed flows using the internal flow engine
  • Acting as a middle layer to maintain isolation between the Maestro step runtime layer and the underlying flow engine layer
  • Implementing required data access patterns and writing Maestro data into the database

In terms of speed, this layer had acceptable overhead but faced edge cases (e.g., a step might be concurrently executed by two workers at the same time, causing race conditions) due to the lack of a strong guarantee from the internal flow engine and the external distributed job queue.

Maestro Internal Flow Engine Layer

The Maestro internal flow engine performed two major functions:

  • Calling a task's execution functions at a given interval.
  • Starting the next tasks in an array of sequential task flows (not a graph), if applicable.

This foundational layer was based on Netflix OSS Conductor 2.x (deprecated since Apr 2021), which requires a dedicated set of separate database tables and distributed job queues.

The existing implementation of this layer introduced an impactful overhead (e.g., several seconds to tens of seconds of overall delay). The lack of strong guarantees (e.g., exactly-once publishing) from this layer led to race conditions that caused stuck jobs or lost executions.

Options to consider

We evaluated three options to address these existing issues:

  • Option 1: Implement an internal flow engine optimized for Maestro-specific use cases
  • Option 2: Upgrade the Conductor library to 4.0, which addresses the overheads and provides other improvements and enhancements compared with Conductor 2.X
  • Option 3: Use Temporal as the internal flow engine

One aspect that influenced our evaluation of Option 2 is that Conductor 2 provided a final callback capability in the state machine, contributed specifically for Maestro's use case to ensure database synchronization between the Conductor and Maestro engine states. This functionality would have to be ported to Conductor 4, as it had been dropped because no Conductor use cases other than Maestro relied on it. Rewriting the flow engine, by contrast, would allow eliminating several complex internal databases and their synchronization requirements, which was attractive for simplifying operational reliability. Given that Maestro did not need the full set of state engine features offered by Conductor, this motivated us to consider a flow engine rewrite as the higher priority.

The decision on Temporal was more straightforward. Temporal is optimized towards facilitating inter-process orchestration and would involve calling an external service to interact with the Temporal flow engine. Given that Maestro is running more than a million tasks per day, many of which are long running, we felt it was an unnecessary source of risk to couple the DAG engine execution with an external service call. If our requirements went beyond lightweight state transition management we would reconsider, because Temporal is a very robust control plane orchestration system, but for our needs it introduced complexity and potential reliability weak spots with no direct need for the advanced feature set it offered.

After considering Option 2 and Option 3, we developed more conviction that Maestro's architecture could be greatly simplified by not using a full DAG evaluation engine and not having to maintain state machines for two systems (Maestro and Conductor/Temporal). Therefore, we decided to go with Option 1.

After the change

To address these issues, we completely rewrote the Maestro internal flow engine layer to meet Maestro's specific needs and optimize its performance. The new flow engine is lightweight with minimal dependencies, focusing on excelling at the two major functions mentioned above. We also replaced the existing distributed job queues with internal ones to provide a strong guarantee.

The new engine is highly performant, efficient, scalable, and fault-tolerant. It is the foundation for all upper components of Maestro and provides the following guarantees to avoid race conditions:

  • A single step should only be executed by a single worker at any given time
  • Step state should never be rolled back
  • Steps should always eventually run to a terminal state
  • The internal flow state should be eventually consistent with the Maestro workflow state
  • External API and user actions should not cause race conditions on the workflow execution

Here is the new architecture diagram after the change, which is much simpler with fewer dependencies:

Figure 3. The architecture diagram after the evolution.

New Flow Engine Optimization

The new flow engine significantly boosts speed by maintaining state in memory. It ensures consistency by using the Maestro engine's database as the source of truth for workflow and step states. During bootstrapping, the flow engine rebuilds its in-memory state from the database, improving performance and simplifying the overall architecture. This is in contrast to the previous design, in which multiple databases had to be reconciled against one another (Conductor's tables and Maestro's tables) or else suffer race conditions and rare orphaned task statuses.

The flow engine operates on in-memory flow states, resembling a write-through caching pattern. Updates to workflow or step state in the database also update the in-memory flow state. If the in-memory state is lost, the flow engine rebuilds it from the database, ensuring eventual consistency and resolving race conditions.
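The write-through pattern described here can be sketched roughly as follows. This is an illustrative sketch, not Maestro's actual code: `FlowStateStore`, `FlowState`, and the database accessors are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of the write-through pattern: the database is the source of truth,
// and the in-memory map is a cache updated in the same code path as the
// database write, rebuilt from the database if lost.
public class FlowStateStore {
    private final Map<String, FlowState> inMemory = new ConcurrentHashMap<>();
    private final Function<String, FlowState> loadFromDb; // DB read, source of truth

    public FlowStateStore(Function<String, FlowState> loadFromDb) {
        this.loadFromDb = loadFromDb;
    }

    // Write-through: persist to the database first, then update the cache.
    public void update(String flowId, FlowState newState) {
        persistToDb(flowId, newState);  // commit to the source of truth
        inMemory.put(flowId, newState); // keep the in-memory view consistent
    }

    // On a miss (e.g., after a node restart), rebuild state from the database.
    public FlowState get(String flowId) {
        return inMemory.computeIfAbsent(flowId, loadFromDb);
    }

    private void persistToDb(String flowId, FlowState state) {
        // placeholder for the real database write
    }

    public record FlowState(String status) {}
}
```

Because every write goes to the database before the cache, a node that loses its memory can always reconstruct a consistent view, which is what makes the single-source-of-truth design safe.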

This design delivers lower latency and higher throughput, avoids inconsistencies from dual persistence, simplifies the architecture, and keeps the in-memory view eventually consistent with the database.

Maintaining Scalability While Gaining Speed

With the new engine, we significantly boost performance by collocating flows and their tasks on the same node throughout their lifecycle. Consequently, the states of a flow and its tasks stay in a single node's memory without persisting to the database. This stickiness and locality bring great performance benefits but inevitably impact scalability, since tasks are no longer reassigned to a new worker from the whole cluster in every polling cycle.

To maintain horizontal scalability, we introduced a flow group concept to partition running flows into groups. In this way, each Maestro flow engine instance only needs to maintain ownership of groups rather than individual flows, reducing maintenance costs (e.g., heartbeats) and simplifying reconciliation by allowing each Maestro node to load flows for a group in batches. Each Maestro node claims ownership of a group of flows through a flow group actor and manages their entire lifecycle via child flow actors. If ownership is lost due to node failure or a long JVM GC pause, another node can claim the group and resume flow executions by reconciling internal state from the Maestro database. The following diagram illustrates the ownership maintenance.

Figure 4. Ownership maintenance sequence diagram.

Flow Partitioning

To efficiently distribute traffic, Maestro assigns a consistent group ID to flows/workflows through a simple stable ID assignment strategy, as shown in the diagram's Partitioning Function box. We chose this simpler partitioning strategy over advanced ones, e.g., consistent hashing, mainly due to execution and reconciliation costs and consistency challenges in a distributed system.

Since Maestro decomposes workflows into hierarchical internal flows (e.g., foreach), parent flows need to interact with child flows across different groups. To enable this, the maximal group number from the parent, denoted as N' in the diagram, is passed down to all child flows. This lets child flows, such as subworkflows or foreach iterations, recompute their own group IDs, and also ensures that a parent flow can always determine the group ID of its child flows using only their workflow identifiers.

Figure 5. Flow group partitioning mechanism diagram.

After a flow's group ID is determined, the flow operator routes the flow request to the appropriate node. Each node owns a specific range of group IDs. For example, in the diagram, Node 1 owns groups 0, 1, and 2, while Node 3 owns groups 6, 7, and 8. The groups then contain the individual flows (e.g., Flow A, Flow B).
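A stable ID assignment like the one described can be sketched as below. The post does not show Maestro's exact function; hashing the workflow identifier modulo the maximal group number is an illustrative stand-in, and `GroupPartitioner` is a hypothetical name.

```java
// Sketch of a stable group-ID assignment: the same workflow ID and maximal
// group number always yield the same group ID, so any node (or a parent flow
// holding only a child's workflow identifier and N') can recompute it.
public final class GroupPartitioner {
    private GroupPartitioner() {}

    // Deterministic mapping of a workflow to a group in [0, maxGroupNum).
    public static long groupId(String workflowId, long maxGroupNum) {
        return Math.floorMod(workflowId.hashCode(), maxGroupNum);
    }

    // Each node owns a contiguous range of group IDs; route a group to its owner.
    public static int ownerNode(long groupId, long groupsPerNode) {
        return (int) (groupId / groupsPerNode);
    }
}
```

With 9 groups and 3 groups per node, groups 0-2 map to the first node and groups 6-8 to the third, mirroring the node ranges in the example above (the sketch counts nodes from 0 rather than 1).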

In this design, the group size is configurable and nodes can even have different group size configurations. The following diagram shows a flow group partitioning example where the maximal group number is changed during engine execution without impacting any existing workflows.

Figure 6. A flow group partitioning example.

In short, the Maestro flow engine shares the group information across parent and child workflows to provide a flexible and stable partitioning mechanism for distributing work across the cluster.

Queue Optimization

We replaced both external distributed job queues in the existing system with internal ones, preserving the same fault-tolerance and recovery guarantees while reducing latency and boosting throughput.

For the internal flow engine, the queue is a simple in-memory Java blocking queue. It requires no persistence and can be rebuilt from Maestro state during reconciliation.

For the Maestro engine, we implemented a database-backed in-memory queue that provides exactly-once publishing and at-least-once delivery guarantees, addressing several edge cases that previously required manual state correction.

This design is similar to the transactional outbox pattern. In the same transaction that updates Maestro tables, a row is inserted into the `maestro_queue` table. Upon transaction commit, the job is immediately pushed to a queue worker on the same node, eliminating polling latency. After successful processing, the worker deletes the row from the database. A periodic sweeper re-enqueues any rows whose timeout has expired, ensuring another worker picks them up if a worker stalls or a node fails.

This design handles failures cleanly. If the transaction fails, both data and message roll back atomically, with no partial publishing. If a worker or node fails after commit, the timeout mechanism ensures the job is retried elsewhere. On restart, a node rebuilds its in-memory queue from the queue table, providing an at-least-once delivery guarantee.
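The publish, ack, and sweep steps can be condensed into a sketch like the one below. The real `maestro_queue` is a database table inside a transaction; here an in-memory deque stands in for it, `simulateRestart` models losing the in-memory queue on a crash, and all names are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the outbox-style queue: a row persisted alongside the state
// update, an in-memory hand-off that skips polling, deletion as the ack,
// and a sweeper that re-enqueues expired rows.
public class OutboxQueue {
    record QueueRow(long id, String payload, long timeoutAtMillis) {}

    private final Deque<QueueRow> queueTable = new ArrayDeque<>(); // stands in for maestro_queue
    private final Deque<QueueRow> inMemory = new ArrayDeque<>();   // local worker queue
    private long nextId = 1;

    // Insert a row in the same "transaction" as the state update; on commit,
    // push it straight to a worker on this node (no polling latency).
    public void publish(String payload, long timeoutAtMillis) {
        QueueRow row = new QueueRow(nextId++, payload, timeoutAtMillis);
        queueTable.add(row);
        inMemory.add(row);
    }

    // After successful processing, delete the row from the "table" (the ack).
    public String process() {
        QueueRow row = inMemory.poll();
        if (row == null) return null;
        queueTable.removeIf(r -> r.id() == row.id());
        return row.payload();
    }

    // Periodic sweeper: re-enqueue rows whose timeout expired, so a stalled
    // worker or dead node cannot lose a message (at-least-once delivery).
    public int sweep(long nowMillis) {
        int requeued = 0;
        for (QueueRow row : queueTable) {
            if (row.timeoutAtMillis() <= nowMillis && !inMemory.contains(row)) {
                inMemory.add(row);
                requeued++;
            }
        }
        return requeued;
    }

    // Simulates a node crash: the in-memory queue is lost, the table survives.
    public void simulateRestart() {
        inMemory.clear();
    }
}
```

Partitioning these rows by an event-type `queue_id`, as described below, would simply shard this structure to avoid contention.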

To enhance scalability and avoid contention across event types, each event type is assigned a `queue_id`. Job messages are then partitioned by `queue_id`, optimizing performance and maintaining system efficiency under heavy load.

From Stateless Worker Model to Stateful Actor Model

Maestro previously used a shared-nothing stateless worker model with a polling mechanism. When a task started, its identifier was enqueued to a distributed job queue. A worker from the flow engine would pick the task identifier from the queue, load the entire state of the whole workflow (including the flow itself and every task), execute the task interface method once, write the updated task data back to the database, and put the task back in the queue with a polling delay. The worker would then forget this task and start polling the next one.

That architecture was simple and horizontally scalable (excluding database scalability concerns), but it had drawbacks. The approach introduced considerable overhead due to polling intervals and state loading. The time spent in a single polling cycle on distributed queues, loading full states, and other DB queries was significant.

As the Maestro engine decomposes complex workflow graphs into multiple flows, actions might involve multiple flows spanning multiple polling cycles, adding up to significant overhead (around ten seconds in the worst cases). Also, this design did not offer strong execution guarantees, mainly because the distributed job queue could only provide at-least-once guarantees. Tasks might be dequeued and dispatched to multiple workers; workers might reset states in certain race conditions, or load stale states of other tasks and make incorrect decisions. For example, after a long garbage-collection pause or network hiccup, two workers can pick up the same task: one sets the task status to completed and then unblocks the downstream steps to move forward. However, the other worker, operating on stale state, resets the task status back to running, leaving the whole workflow in a conflicting state.

In the new design, we developed a stateful actor model, holding internal states in memory. All tasks of a workflow are collocated on the same Maestro node, providing the best performance as states are in the same JVM.

Actor-Based Model

The new flow engine fits well into an actor model. We also deliberately designed it to allow sharing certain local states (read-only) between parent, child, and sibling actors. This optimization gains performance benefits without losing thread safety, given Maestro's use cases. We used Java 21's virtual thread support to implement it with minimal dependencies.

The new actor-based flow engine is fully message/event-driven and can take actions immediately when events are received, eliminating polling interval delays. To maintain compatibility with the existing polling-based logic, we developed a wakeup mechanism. This model requires flow actors and their child task actors to be collocated in the same JVM for communication over the in-memory queue. Since the Maestro engine already decomposes large-scale workflow instances into many small flows, each flow has a limited number of tasks that fit well into memory.

Below is a high-level overview of the Maestro execution flow based on the actor model.

Figure 7. A high-level overview of the Maestro execution.
  • When a workflow starts or during reconciliation, the flow engine inserts (if not present) or loads the Maestro workflow and step instance from the database, transforming it into the internal flow and task state. This state stays in JVM memory until evicted (e.g., when the workflow instance reaches a terminal state).
  • A virtual thread is created for each entity (workflow instance or step attempt) as an actor to handle all updates or actions for that entity, ensuring thread safety and eliminating distributed locks and potential race conditions.
  • Each virtual thread actor contains an in-memory state, a thread-safe blocking queue, and a state machine to update states, ensuring thread safety and high efficiency.
  • Actors are organized hierarchically, with flow actors managing all their task actors. Flow actors and their task actors are kept in the same JVM for locality benefits, with the ability to relocate flow instances to other nodes if needed.
  • An event can wake up a virtual thread by pushing a message to the actor's task queue, enabling Maestro to move toward an event-driven approach alongside the existing polling-based approach.
  • A reconciliation process transforms the Maestro data model into the internal flow data.
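The per-entity actor pattern above can be sketched with one virtual thread draining a blocking-queue mailbox; since a single thread performs all state transitions for the entity, no locks are needed. `StepActor` and its message protocol are illustrative inventions, not Maestro's API.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a per-entity actor: a Java 21 virtual thread reads messages from
// a thread-safe blocking queue, so every state transition for this entity
// happens on exactly one thread.
public class StepActor {
    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();
    private final AtomicReference<String> state = new AtomicReference<>("CREATED");
    private final CountDownLatch terminated = new CountDownLatch(1);

    public StepActor() {
        // Virtual threads are cheap enough to create one per entity.
        Thread.ofVirtual().name("step-actor").start(this::run);
    }

    // Other components "wake up" the actor by enqueueing a message.
    public void tell(String message) {
        mailbox.add(message);
    }

    private void run() {
        try {
            while (true) {
                String msg = mailbox.take(); // blocking is cheap on a virtual thread
                state.set(msg);              // single-threaded state transition
                if ("SUCCEEDED".equals(msg) || "FAILED".equals(msg)) {
                    terminated.countDown();  // terminal state: actor can be evicted
                    return;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public String awaitTerminalState() throws InterruptedException {
        terminated.await();
        return state.get();
    }
}
```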

Virtual Thread Based Implementation

We chose Java virtual threads to implement the various actors (e.g., group actors and flow actors), which simplified the actor model implementation. With a smaller amount of code, we developed a fully functional and highly performant event-driven distributed flow engine. Virtual threads fit very well in use cases like state machine transitions within actors. They are lightweight enough to be created in large numbers without Out-Of-Memory risks.

However, virtual threads can potentially deadlock. They are not suitable for executing user-provided logic or complex step runtime logic that might depend on external libraries or services outside our control. To address this, we separate flow engine execution from task execution logic by adding a separate worker thread pool (not virtual threads) to run the actual step runtime business logic, like launching containers or making external API calls. Flow/task actors can wait indefinitely for the thread pool executor's future to complete but do not perform the actual execution, allowing us to benefit from virtual threads while avoiding deadlock issues.
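That separation can be sketched as below: the actor (on a virtual thread) submits the work and blocks cheaply on a `Future`, while the business logic runs on platform threads. The class name, pool size, and `executeBusinessLogic` placeholder are assumptions for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of separating flow-engine execution from step execution: actors run
// on virtual threads, while potentially blocking or pinning business logic
// (containers, external APIs) runs on a platform worker thread pool.
public class StepExecutionSketch {
    // Platform threads for step runtime logic outside the engine's control.
    private static final ExecutorService WORKER_POOL = Executors.newFixedThreadPool(4);

    public static String runStep(String stepId) throws Exception {
        // Hand the real work to the worker pool...
        Future<String> result = WORKER_POOL.submit(() -> executeBusinessLogic(stepId));
        // ...and let the virtual-thread actor wait on the Future, which is
        // cheap and cannot deadlock the business logic itself.
        return result.get();
    }

    private static String executeBusinessLogic(String stepId) {
        // placeholder for launching a container or calling an external API
        return stepId + ":DONE";
    }

    public static void shutdown() {
        WORKER_POOL.shutdown();
    }
}
```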

Figure 8. Virtual thread and worker thread separation.

Providing Strong Execution Guarantees

To provide strong execution guarantees, we implemented a generation ID-based solution to ensure that a single flow or task is executed by only one actor at any time, with states that never roll back and eventually reach a terminal state.

When a node claims a new group or a group with an expired heartbeat, it updates the database table row and increments the group generation ID. During node bootstrap, the group actor updates all its owned flows' generation IDs while rebuilding internal flow states. When creating a new flow, the group actor verifies that the database generation ID matches its in-memory generation ID, otherwise rejecting the creation and reporting a retryable error to the caller. Please check the source code for the implementation details.
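The fencing idea can be reduced to a small sketch. Here the group's database row is collapsed to a single atomic counter, and `otherNodeClaims` simulates a competing node taking over (e.g., after this node's long GC pause); all names are illustrative.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of generation-ID fencing: claiming a group bumps the generation in
// the database, and a stale owner's subsequent writes are rejected because
// its in-memory generation no longer matches.
public class GroupOwnership {
    private final AtomicLong dbGenerationId = new AtomicLong(0); // stands in for the group row
    private long inMemoryGenerationId;

    // Claiming a new group, or one with an expired heartbeat, increments the
    // generation ID; any older owner is fenced off from that point on.
    public void claimGroup() {
        inMemoryGenerationId = dbGenerationId.incrementAndGet();
    }

    // Flow creation is only allowed while this owner's generation is current;
    // otherwise the caller gets a retryable error and retries elsewhere.
    public void createFlow(String flowId) {
        if (dbGenerationId.get() != inMemoryGenerationId) {
            throw new IllegalStateException("stale owner, retryable: " + flowId);
        }
        // ...insert the flow under this group...
    }

    // Simulates another node claiming the group while this one was paused.
    public void otherNodeClaims() {
        dbGenerationId.incrementAndGet();
    }
}
```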

Figure 9. An example sequence diagram showing how the generation ID provides a strong guarantee.

Additionally, the new flow engine supports both event-driven execution and polling-based periodic reconciliation. Event-driven support allows us to extend polling intervals for state reconciliation at a very low cost, while polling-based reconciliation relaxes event delivery requirements to at-most-once.

Testing, Validation and Rollout

Migrating hundreds of thousands of Netflix data processing jobs to a new workflow engine required meticulous planning and execution to avoid data corruption, unexpected traffic patterns, and edge cases that could hinder performance gains. We adopted a principled approach to ensure a smooth transition:

  1. Realistic Testing: Our testing mirrored real-world use cases as closely as possible.
  2. Balanced Approach: We balanced the need for rapid delivery with comprehensive testing.
  3. Minimal User Disruption: The goal was for users to be unaware of the underlying changes.
  4. Clear Communication: For cases requiring user involvement, clear communication was provided.

Maestro Test Framework

To achieve our testing goals, we developed an adaptable testing framework for Maestro. This framework addresses the limitations of static unit and integration tests by providing a more dynamic and comprehensive approach, mimicking organic production traffic. It complements existing tests to instill confidence when rolling out major changes, such as new DAG engines.

The framework is designed to sample real user workflows, disconnecting business logic from external side effects like data reads or writes. This allows us to run workflow graphs of various shapes and sizes, reflecting the diverse use cases across Netflix. While system integrations are handled by deployment pipeline integration tests, the ability to exercise a wide variety of workflow topologies (e.g., parallel executions, for-each jobs, conditional branching, and parameter passing between jobs) was critical for ensuring the new flow engine's correctness and performance.

The prototype workflow for the test framework focuses on auto-testing parameters, involving two main steps:

1. Caching Production Workflows:

  • Successful production instances are queried from a historical Maestro feed table over a specified interval.
  • Run parameters, initiator, and instance IDs are extracted and organized into an instance data map.
  • YAML definitions and subworkflow IDs are pulled from S3 storage.
  • Both workflow definitions and instance data are cached on S3 for subsequent steps.

2. Pushing, Running, and Monitoring Workflows:

  • Cached workflow definitions and instance data are loaded.
  • Notebook-based jobs are replaced with custom notebooks, and certain job types (e.g., vanilla container runtime jobs, templated data movement jobs) and signal triggers are converted to a special no-op job type or skipped.
  • Abstract job types like Write-Audit-Publish are expressed as a single step template but are translated to multiple reified nodes of the DAG when executed. These are auto-translated into multiple custom notebook job types to replace the generated nodes.
  • Workflows and subworkflows are pushed, with only non-subworkflows being run using original production instance data.
  • 1. In the parent workflow, each sub-workflow is replaced with a special no-op placeholder so that the overall topology is preserved, but without executing any side effects of child workflows and avoiding cases that use dynamic runtime parameter logic.
  • 2. Each sub-workflow is then individually treated like a top-level parent workflow not initiated from its parent, to exercise the actual workflow steps of the sub-workflow.
  • The custom notebook internally compares all passed parameters for each job.
  • Workflow instances are monitored until termination (success or failure).
  • An email detailing failed workflow instances is generated.

Future phases of the test framework aim to extend support for native steps and more templates, cover Titus and Metaflow workflows, and include more robust signal testing. Further integration with the ecosystem, including dedicated Genie clusters for no-op jobs and DGS for our internal workflow UI feature verification, is also being explored.

Rollout Plan

Our rollout strategy prioritized minimal user disruption. We determined that an entire workflow, from its root instance, must reside in either the old or the new flow engine, preventing mixed operations that could lead to complex failure modes and manual data reconciliation.

To facilitate this, we established a parallel infrastructure for the new workflow engine and leveraged our orchestrator gateway API to hide any routing or redirection logic from users. This approach provided excellent isolation for managing the migration. Initially, specific workflows could explicitly opt in via a system flag, allowing us to monitor their execution and gain confidence. By scaling up traffic to the parallel infrastructure in direct proportion to what was scaled down from the original infrastructure, the dual-infrastructure cost increase was negligible.

Once confident, we transitioned to a percentage-based cutover. In the event of a sustained failure in the new engine, our team could roll back a workflow by removing it from the new engine's database and restarting it in the original stack. However, one consequence of rollback was that failed workflows had to restart from the beginning, recomputing previously successful steps, to ensure all artifacts were generated from a consistent flow engine.

Leveraging Maestro's 10-day workflow timeout, we migrated users without disruption. Existing executions would either complete or time out. Upon restarting (due to failure/timeout) or triggering a new instance (due to success), the workflow would be picked up by the new engine. This effectively allowed us to gradually "drain" traffic from the old engine to the new one with no user involvement.

While the plan generally proceeded as expected with limited edge cases, we did encounter several challenges:

  • Stuck Workflows: Around 50 workflows with defunct or incorrect ownership information entered a stuck state. In some cases, a backlog of queued instances behind a stuck instance created a race condition in which a new instance would be started immediately when an old instance was terminated, perpetually holding the workflow on the old engine. For these, we proactively contacted users to negotiate manual stop-and-restart events, forcing them onto the new engine.
  • Configuration Discrepancies: A significant lesson learned was the importance of meticulous record-keeping and management of parallel infrastructure components. We discovered alerts, system flags, and feature flags configured for one stack but not the other. This led to a failure in a partner team's system that dynamically rolled out a Python migration by analyzing workflow configurations. The absence of a required feature flag in the new engine stack caused the process to be silently skipped, resulting in incorrect Python version configurations for about 40 workflows. Although quickly remediated, this caused user inconvenience, as affected workflows needed to be restarted and verified for no lingering data corruption issues. This issue also highlighted limitations in the testing framework, since runtime configuration based on external API calls to the configuration service was not exercised in simulated workflow executions.

Despite these challenges, the migration was a success. We migrated over 60,000 active workflows producing over one million data processing tasks daily with almost no user involvement. By observing the flow engine’s lifecycle management latency, we validated a reduction in step launch overhead from around 5 seconds to 50 milliseconds. Workflow start overhead (incurred once per workflow execution) also improved from 200 milliseconds to 50 milliseconds. Aggregated over one million daily step executions, this translates to saving roughly 57 days of flow engine overhead per day, leading to a snappier user experience, more timely workflow status for data practitioners, and higher overall job throughput at the same infrastructure scale.
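The aggregate figure quoted above follows directly from the per-step numbers: one million daily steps, each saving roughly 4.95 seconds of launch overhead, works out to about 57 days of engine time reclaimed every day.

```java
// Back-of-the-envelope check of the "~57 days saved per day" figure above.
public class OverheadSavings {
  public static void main(String[] args) {
    double oldStepOverheadSec = 5.0;    // step launch overhead, old engine
    double newStepOverheadSec = 0.050;  // step launch overhead, new engine
    double dailySteps = 1_000_000;      // daily step executions

    double savedSecondsPerDay =
        dailySteps * (oldStepOverheadSec - newStepOverheadSec);
    double savedDaysPerDay = savedSecondsPerDay / 86_400; // seconds per day

    System.out.printf("Saved engine time: %.1f days per day%n", savedDaysPerDay);
    // → Saved engine time: 57.3 days per day
  }
}
```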


We additionally realized significant internal benefits from reduced maintenance effort, thanks to the new flow engine’s simplified set of database components. We were able to delete nearly 40TB of obsolete tables related to the previous stateless flow engine and observed a 90% reduction in internal database query traffic, which had previously been a significant source of system alerts for the team.

Conclusion

The architectural evolution of Maestro represents a significant leap in performance, reducing overhead from seconds to milliseconds. This redesign around a stateful actor model not only enhances speed by 100X but also maintains scalability and reliability, ensuring Maestro continues to meet the diverse needs of Netflix’s data and ML workflows.

Key takeaways from this evolution include:

  • Performance matters: Even in a system designed for scale, the speed of individual operations significantly impacts user experience and productivity.
  • Simplicity wins: Reducing dependencies and simplifying architecture not only improved performance but also enhanced reliability and maintainability.
  • Strong guarantees are essential: Providing strong execution guarantees eliminates race conditions and edge cases that previously required manual intervention.
  • Locality optimizations pay off: Collocating related flows and tasks in the same JVM dramatically reduces overhead from the Maestro engine.
  • Modern language features help: Java 21’s virtual threads enabled an elegant actor-based implementation with minimal code complexity and dependencies.
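To illustrate the last two points, a minimal actor on Java 21 virtual threads needs little more than a mailbox queue and a dedicated virtual thread draining it. This sketch is illustrative only, not Maestro’s actual implementation; the `Actor` class and its methods are assumptions for the example.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Minimal actor: one virtual thread per actor drains a mailbox sequentially,
// so each actor's state is only ever touched by its own thread (no locks).
// Virtual threads are cheap enough to dedicate one per flow or step actor.
class Actor<M> {
  private final BlockingQueue<M> mailbox = new LinkedBlockingQueue<>();
  private final Thread worker;

  Actor(Consumer<M> handler) {
    worker = Thread.ofVirtual().start(() -> {
      try {
        while (true) {
          handler.accept(mailbox.take()); // process messages one at a time
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // shut down cleanly on interrupt
      }
    });
  }

  // Enqueue a message; the actor's own thread will process it in order.
  void tell(M message) {
    mailbox.add(message);
  }

  void stop() {
    worker.interrupt();
  }
}
```

With actors this cheap, related flow and step actors can be collocated in the same JVM and communicate by posting in-memory messages rather than polling a shared queue, which is the locality win described above.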

We’re excited to share these improvements with the open-source community and look forward to seeing how Maestro continues to evolve. The performance gains we’ve achieved open new possibilities for low-latency workflow orchestration use cases while continuing to support the massive scale that Netflix and other organizations require.

Visit the Maestro GitHub repository to explore these improvements. If you have any questions, thoughts, or comments about Maestro, please feel free to create a GitHub issue in the Maestro repository. We are eager to hear from you. If you are passionate about solving large-scale orchestration problems, please join us.

Acknowledgements

Special thanks to Big Data Orchestration team members for foundational contributions to Maestro and the diligent review, discussion, and incident response required to make this project successful: Davis Shepherd, Natallia Dzenisenka, Praneeth Yenugutala, Brittany Truong, Jonathan Indig, Deepak Ramalingam, Binbing Hou, Zhuoran Dong, Victor Dusa, and Gabriel Ikpaetuk — and internal partners Yun Li and Romain Cledat.

Thanks to Anoop Panicker and Aravindan Ramkumar from our partner group that leads Conductor development at Netflix. They helped us understand issues in Conductor 2.X that initially motivated the rearchitecture and provided context on later versions of Conductor that defined some of the core trade-offs in the decision to implement a custom DAG engine in Maestro.

We’d also like to thank our partners on the Data Security & Infrastructure and Engineering Support teams who helped identify and rapidly fix the configuration discrepancy error encountered during the production rollout: Amer Hesson, Ye Ji, Sungmin Lee, Brandon Quan, Anmol Khurana, and Manav Garekar.

A special thanks also goes out to partners from the Data Experience team, including Jeff Bothe, Justin Wei, and Andrew Seier. The flow engine speed improvement was so dramatic that it broke some integrations with our internal workflow UI that reported state transition durations. Our partners helped us catch and fix UI regressions before they shipped, avoiding impact to users.

We also thank Prashanth Ramdas, Anjali Norwood, Eva Tse, Charles Zhao, Sumukh Shivaprakash, Joey Lynch, Harikrishna Menon, Marcelo Mayworm, Charles Smith, and other leaders for their constructive feedback and guidance on the Maestro project.


