By Jun He, Natallia Dzenisenka, Praneeth Yenugutala, Yingyi Zhang, and Anjali Norwood
We are thrilled to announce that the Maestro source code is now open to the public! Please visit the Maestro GitHub repository to get started. If you find it useful, please give us a star.
What is Maestro
Maestro is a general-purpose, horizontally scalable workflow orchestrator designed to manage large-scale workflows such as data pipelines and machine learning model training pipelines. It oversees the entire lifecycle of a workflow, from start to finish, including retries, queuing, and task distribution to compute engines. Users can package their business logic in various formats such as Docker images, notebooks, bash scripts, SQL, Python, and more. Unlike traditional workflow orchestrators that only support Directed Acyclic Graphs (DAGs), Maestro supports both acyclic and cyclic workflows and also includes multiple reusable patterns, such as foreach loops, subworkflows, and conditional branches.
Our Journey with Maestro
Since we first introduced Maestro in this blog post, we have successfully migrated hundreds of thousands of workflows to it on behalf of users with minimal interruption. The transition was seamless, and Maestro has met our design goals by handling our ever-growing workloads. Over the past year, we have seen a remarkable 87.5% increase in executed jobs. Maestro now launches thousands of workflow instances and runs half a million jobs daily on average, and has completed around 2 million jobs on particularly busy days.
Scalability and Versatility
Maestro is a fully managed workflow orchestrator that provides Workflow-as-a-Service to thousands of end users, applications, and services at Netflix. It supports a wide range of workflow use cases, including ETL pipelines, ML workflows, AB test pipelines, and pipelines that move data between different storages. Maestro’s horizontal scalability ensures it can manage both a large number of workflows and a large number of jobs within a single workflow.
At Netflix, workflows are intricately connected. Splitting them into smaller groups and managing them across different clusters adds unnecessary complexity and degrades the user experience. This approach also requires additional mechanisms to coordinate these fragmented workflows. Since Netflix’s data tables are housed in a single data warehouse, we believe a single orchestrator should handle all workflows accessing it.
Join us on this exciting journey by exploring the Maestro GitHub repository and contributing to its ongoing development. Your support and feedback are invaluable as we continue to improve the Maestro project.
Netflix Maestro offers a comprehensive set of features designed to meet the diverse needs of both engineers and non-engineers. It includes the common functions and reusable patterns applicable to various use cases in a loosely coupled way.
A workflow definition is written in JSON. Maestro combines user-supplied fields with those managed by Maestro to form a flexible and powerful orchestration definition. An example can be found in the Maestro repository wiki.
A Maestro workflow definition comprises two main sections: properties and the versioned workflow, including its metadata. Properties include author and owner information and execution settings. Maestro preserves key properties across workflow versions, such as author and owner information, run strategy, and concurrency settings. This consistency simplifies management and aids troubleshooting. If the ownership of the current workflow changes, the new owner can claim ownership of the workflow without creating a new workflow version. Users can also enable the triggering or alerting features for a given workflow through the properties.
The versioned workflow includes attributes like a unique identifier, name, description, tags, timeout settings, and criticality levels (low, medium, high) for prioritization. Each workflow change creates a new version, enabling tracking and easy reversion, with the active or the latest version used by default. A workflow consists of steps, which are the nodes in the workflow graph defined by users. Steps can represent jobs, another workflow using a subworkflow step, or a loop using a foreach step. Steps consist of unique identifiers, step types, tags, input and output step parameters, step dependencies, retry policies, failure mode, step outputs, and so on. Maestro supports configurable retry policies based on error types to enhance step resilience.
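To make the split between properties and the versioned workflow concrete, here is a minimal Python sketch. It is not Maestro's actual JSON schema: every class and field name below is a hypothetical stand-in chosen for illustration. It only shows the behavior described above, where each workflow change creates a new version while properties such as the owner live outside the versions.

```python
from dataclasses import dataclass, field


@dataclass
class Properties:
    # Hypothetical fields; these are preserved across workflow versions.
    owner: str
    run_strategy: str = "sequential"
    concurrency: int = 1


@dataclass
class WorkflowVersion:
    # Hypothetical fields for the versioned part of a definition.
    version: int
    name: str
    criticality: str  # "low", "medium", or "high"
    steps: list = field(default_factory=list)


@dataclass
class Workflow:
    workflow_id: str
    properties: Properties
    versions: list = field(default_factory=list)

    def push_version(self, name, criticality, steps):
        """Every workflow change creates a new version; properties are untouched."""
        new_version = WorkflowVersion(len(self.versions) + 1, name, criticality, steps)
        self.versions.append(new_version)
        return new_version

    def change_owner(self, new_owner):
        """An ownership change does not create a new workflow version."""
        self.properties.owner = new_owner

    def latest(self):
        return self.versions[-1]


wf = Workflow("demo_workflow", Properties(owner="alice"))
wf.push_version("demo", "low", ["extract"])
wf.push_version("demo", "high", ["extract", "load"])
wf.change_owner("bob")
```

After these calls the workflow has two versions, and the ownership change is visible without a third version having been created.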
This high-level overview of Netflix Maestro’s workflow definition and properties highlights its flexibility to define complex workflows. Next, we dive into some of the useful features in the following sections.
Workflow Run Strategy
Users want to automate data pipelines while retaining control over the execution order. This is crucial when workflows cannot run in parallel or must halt current executions when new ones occur. Maestro uses predefined run strategies to decide whether a workflow instance should run or not. Here is the list of predefined run strategies Maestro offers.
Sequential Run Strategy
This is the default strategy used by Maestro, which runs workflows one at a time in First-In-First-Out (FIFO) order. With this run strategy, Maestro runs workflows in the order they are triggered. Note that an execution does not depend on the state of previous executions. Once a workflow instance reaches one of the terminal states, whether succeeded or not, Maestro starts the next one in the queue.
Strict Sequential Run Strategy
With this run strategy, Maestro runs workflows in the order they are triggered but blocks execution if there is a blocking error in the workflow instance history. Newly triggered workflow instances are queued until the error is resolved by manually restarting the failed instances or marking the failed ones as unblocked.
For example, if run5 fails at 5 AM, later runs are queued but do not run. When someone manually marks run5 as unblocked or restarts it, workflow execution resumes. This run strategy is useful for time-insensitive but business-critical workflows. It gives workflow owners the option to review the failures at a later time and unblock the executions after verifying correctness.
First-only Run Strategy
With this run strategy, Maestro ensures that the running workflow is complete before queueing a new workflow instance. If a new workflow instance is queued while the current one is still running, Maestro removes the queued instance. Maestro executes a new workflow instance only if there is no workflow instance currently running, effectively turning off queuing with this run strategy. This approach helps to avoid idempotency issues by not queuing new workflow instances.
Last-only Run Strategy
With this run strategy, Maestro ensures the running workflow is the latest triggered one and keeps only the last instance. If a new workflow instance is queued while an existing workflow instance is already running, Maestro stops the running instance and executes the newly triggered one. This is useful if a workflow is designed to always process the latest data, such as processing the latest snapshot of an entire table each time.
Parallel with Concurrency Limit Run Strategy
With this run strategy, Maestro runs multiple triggered workflow instances in parallel, constrained by a predefined concurrency limit. This helps to fan out and distribute the execution, enabling the processing of large amounts of data within the time limit. A common use case for this strategy is backfilling old data.
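The five run strategies can be compared side by side with a small decision function. This is a simplified sketch, not Maestro's scheduler: the function name, the strategy strings, and the `blocked` flag (standing in for "an earlier failed run has not been unblocked") are all assumptions made for illustration.

```python
from collections import deque


def apply_run_strategy(strategy, running, queue, new_run, limit=1, blocked=False):
    """Return (running, queue, stopped, dropped) after `new_run` is triggered.

    running: run ids currently executing
    queue:   run ids waiting to start, oldest first
    blocked: True if an earlier failed run has not been restarted or unblocked
    """
    running, queue = list(running), deque(queue)
    stopped, dropped = [], []
    if strategy == "first_only":
        if running:
            dropped.append(new_run)          # queuing is effectively turned off
        else:
            running.append(new_run)
    elif strategy == "last_only":
        stopped, running = running, [new_run]  # stop the old run, keep the newest
    else:
        # sequential and strict_sequential run one at a time; parallel uses `limit`
        cap = limit if strategy == "parallel" else 1
        queue.append(new_run)
        can_start = not (strategy == "strict_sequential" and blocked)
        while can_start and queue and len(running) < cap:
            running.append(queue.popleft())  # FIFO order
    return running, list(queue), stopped, dropped
```

For example, `apply_run_strategy("last_only", ["r1"], [], "r2")` reports `r1` as stopped and `r2` as running, while the same trigger under `"first_only"` reports `r2` as dropped.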
Parameters and Expression Language Support
In Maestro, parameters play an important role. Maestro supports dynamic parameters with code injection, which is very useful and powerful. This feature significantly enhances the flexibility and dynamism of workflows, allowing parameters to control execution logic and enable state sharing between workflows and their steps, as well as between upstream and downstream steps. Together with other Maestro features, it makes workflow definitions dynamic and lets users define parameterized workflows for complex use cases.
However, code injection introduces significant security and safety concerns. For example, users might unintentionally write an infinite loop that creates an array and appends items to it, eventually crashing the server with out-of-memory (OOM) issues. While one approach could be to ask users to embed the injected code within their business logic instead of the workflow definition, this would impose additional work on users and tightly couple their business logic with the workflow. In certain cases, this approach prevents users from designing some complex parameterized workflows.
To mitigate these risks and help users build parameterized workflows, we developed our own customized expression language parser, a simple, secure, and safe expression language (SEL). SEL supports code injection while incorporating validations during syntax tree parsing to protect the system. It leverages the Java Security Manager to restrict access, ensuring a secure and controlled environment for code execution.
Simple, Secure, and Safe Expression Language (SEL)
SEL is a homemade simple, secure, and safe expression language that addresses the risks associated with code injection within Maestro parameterized workflows. It is a simple expression language whose grammar and syntax follow the JLS (Java Language Specification). SEL supports a subset of the JLS, focusing on Maestro use cases. For example, it supports data types for all Maestro parameter types, raising errors, datetime handling, and many predefined utility methods. SEL also includes additional runtime checks, such as loop iteration limits, array size checks, and object memory size limits, to enhance security and reliability. For more details about SEL, please refer to the Maestro GitHub documentation.
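The idea of validating an expression while walking its syntax tree can be illustrated with a toy evaluator. The sketch below is not SEL: SEL follows Java syntax and has its own parser, whereas this example uses Python's standard `ast` module as an analogy. The allowed node list and both limits are hypothetical, and a whitelist like this is a teaching aid rather than a production sandbox.

```python
import ast

# Syntax this toy language accepts; everything else is rejected before evaluation.
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant, ast.Name, ast.Load,
    ast.Add, ast.Sub, ast.Mult, ast.USub,
)
MAX_EXPR_LENGTH = 200  # hypothetical limit on the expression text
MAX_NODES = 50         # hypothetical limit on syntax tree size


def safe_eval(expr, params):
    """Evaluate a small arithmetic expression over workflow parameters."""
    if len(expr) > MAX_EXPR_LENGTH:
        raise ValueError("expression too long")
    tree = ast.parse(expr, mode="eval")
    nodes = list(ast.walk(tree))
    if len(nodes) > MAX_NODES:
        raise ValueError("expression too complex")
    for node in nodes:
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
        if isinstance(node, ast.Constant) and not isinstance(node.value, (int, float)):
            raise ValueError("only numeric constants are allowed")
        if isinstance(node, ast.Name) and node.id not in params:
            raise ValueError(f"unknown parameter: {node.id}")
    # The tree has been validated, so only whitelisted operations can run here.
    return eval(compile(tree, "<expr>", "eval"), {"__builtins__": {}}, dict(params))
```

With parameters `{"base": 1, "offset": 3}`, the expression `base + offset * 2` is accepted, while a function call such as `__import__('os')` is rejected during validation and never evaluated.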
Output Parameters
To further enhance parameter support, Maestro allows for callable step execution, which returns output parameters from user execution back to the system. The output data is transmitted to Maestro via its REST API, ensuring that the step runtime does not have direct access to the Maestro database. This approach significantly reduces security concerns.
Parameterized Workflows
Thanks to the powerful parameter support, users can easily create parameterized workflows in addition to static ones. Users enjoy defining parameterized workflows because they are easy to manage and troubleshoot while being powerful enough to solve complex use cases.
- Static workflows are simple and easy to use but come with limitations. Often, users have to duplicate the same workflow multiple times to accommodate minor changes. Additionally, workflows and jobs cannot share state without using parameters.
- On the other hand, completely dynamic workflows can be challenging to manage and support. They are difficult to debug or troubleshoot and hard for others to reuse.
- Parameterized workflows strike a balance by being initialized step by step at runtime based on user-defined parameters. This approach gives users great flexibility to control the execution at runtime while remaining easy to manage and understand.
As we described in the previous Maestro blog post, parameter support enables the creation of complex parameterized workflows, such as backfill data pipelines.
Workflow Execution Patterns
Maestro provides multiple useful building blocks that allow users to easily define dataflow patterns or other workflow patterns. It provides support for common patterns directly within the Maestro engine. Direct engine support not only enables us to optimize these patterns but also ensures a consistent approach to implementing them. Next, we will talk about the three major building blocks that Maestro provides.
Foreach Support
In Maestro, the foreach pattern is modeled as a dedicated step within the original workflow definition. Each iteration of the foreach loop is internally treated as a separate workflow instance, which scales similarly to any other Maestro workflow based on the step executions (i.e. a sub-graph) defined within the foreach definition block. The execution of the sub-graph within a foreach step is delegated to a separate workflow instance. The foreach step then monitors and collects the status of these foreach workflow instances, each managing the execution of a single iteration. For more details, please refer to our previous Maestro blog post.
The foreach pattern is frequently used to repeatedly run the same jobs with different parameters, such as data backfilling or machine learning model tuning. It would be tedious and time consuming to ask users to explicitly define each iteration in the workflow definition (potentially hundreds of thousands of iterations). Additionally, users would need to create new workflows if the foreach range changes, further complicating the process.
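The foreach behavior described above, one unit of work per iteration with the parent step collecting statuses, can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the iterations run in-process rather than as separate workflow instances, and the zero-based `loop_index` is a guess at the convention.

```python
def expand_foreach(param_name, values):
    """Build one parameter set per iteration, each carrying its loop index."""
    return [{"loop_index": i, param_name: v} for i, v in enumerate(values)]


def run_foreach(param_name, values, run_iteration):
    """Run every iteration independently, then collect the statuses."""
    statuses = [run_iteration(params) for params in expand_foreach(param_name, values)]
    overall = "SUCCEEDED" if all(s == "SUCCEEDED" for s in statuses) else "FAILED"
    return overall, statuses


def backfill_one_day(params):
    # Stand-in for the sub-graph of steps defined inside the foreach block.
    return "SUCCEEDED" if params["date"] else "FAILED"


dates = ["2024-01-01", "2024-01-02", "2024-01-03"]
overall, statuses = run_foreach("date", dates, backfill_one_day)
```

Changing the backfill range only means passing a different `dates` list; the definition of the iteration itself stays the same, which is the point of having foreach as a built-in pattern.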
Conditional Branch Support
The conditional branch feature allows subsequent steps to run only if specific conditions in the upstream step are met. These conditions are defined using the SEL expression language and are evaluated at runtime. Combined with other building blocks, users can build powerful workflows, e.g. performing some remediation if the audit check step fails and then running the job again.
Subworkflow Support
The subworkflow feature allows a workflow step to run another workflow, enabling the sharing of common functions across multiple workflows. This effectively enables “workflow as a function” and allows users to build a graph of workflows. For example, we have observed complex workflows consisting of hundreds of subworkflows to process data across hundreds of tables, where subworkflows are provided by multiple teams.
These patterns can be combined to build composite patterns for complex workflow use cases. For instance, we can loop over a set of subworkflows or run nested foreach loops. One example that Maestro users developed is an auto-recovery workflow that uses both conditional branch and subworkflow features to handle errors and retry jobs automatically.
In this example, subworkflow `job1` runs another workflow consisting of extract-transform-load (ETL) and audit jobs. Next, a status check job leverages the Maestro parameter and SEL support to retrieve the status of the previous job. Based on this status, it can decide whether to complete the workflow or to run a recovery job to address any data issues. After resolving the issue, it then executes subworkflow `job2`, which runs the same workflow as subworkflow `job1`.
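The control flow of the auto-recovery example can be written out as plain Python to make the branching explicit. This is only a sketch of the logic: in Maestro the branch condition would be a SEL expression in the workflow definition, and the function names and status strings here are hypothetical.

```python
def run_etl_and_audit(table, data_ok):
    """Stand-in for the shared subworkflow (ETL jobs followed by an audit job)."""
    return "SUCCEEDED" if data_ok else "FAILED"


def auto_recovery(table, data_ok_initially):
    """Return the (step, status) pairs executed by the auto-recovery workflow."""
    trace = []
    status = run_etl_and_audit(table, data_ok_initially)  # subworkflow job1
    trace.append(("job1", status))
    if status == "SUCCEEDED":
        return trace                                      # status check: nothing to recover
    trace.append(("recovery", "SUCCEEDED"))               # recovery job fixes the data issue
    # subworkflow job2 runs the same workflow as job1, now on repaired data
    trace.append(("job2", run_etl_and_audit(table, True)))
    return trace
```

Because `job1` and `job2` call the same function, the sketch mirrors the "workflow as a function" idea: the ETL and audit logic is defined once and reused on both branches.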
Step Runtime and Step Parameter
Step Runtime Interface
In Maestro, we use step runtime to describe a job at execution time. The step runtime interface defines two pieces of information:
- A set of basic APIs to control the behavior of a step instance at execution runtime.
- Some simple data structures to track step runtime state and execution result.
Maestro offers a few step runtime implementations, such as the foreach step runtime and the subworkflow step runtime (mentioned in the previous section). Each implementation defines its own logic for start, execute, and terminate operations. At runtime, these operations control how to initialize a step instance, perform the business logic, and terminate the execution under certain conditions (e.g. manual intervention by users).
Also, Maestro step runtime internally keeps track of runtime state as well as the execution result of the step. The runtime state is used to determine the next state transition of the step and tell if it has failed or terminated. The execution result hosts both step artifacts and the timeline of step execution history, which are accessible by subsequent steps.
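A rough Python analogue of such an interface is shown below. Maestro itself is written in Java and its real interface will differ; the class names, method signatures, and state strings here are assumptions. The sketch only captures the two ingredients named above: start, execute, and terminate operations, plus a small result structure carrying state, artifacts, and timeline entries.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class RuntimeResult:
    state: str                                     # e.g. "DONE", "FAILED", "STOPPED"
    artifacts: dict = field(default_factory=dict)  # outputs visible to later steps
    timeline: list = field(default_factory=list)   # human-readable execution events


class StepRuntime(ABC):
    def start(self, params):
        """Initialize the step instance; the default does nothing special."""
        return RuntimeResult("DONE", timeline=["started"])

    @abstractmethod
    def execute(self, params):
        """Perform the business logic of the step."""

    def terminate(self, params):
        """Stop the execution, for example after manual intervention."""
        return RuntimeResult("STOPPED", timeline=["terminated"])


class EchoStepRuntime(StepRuntime):
    def execute(self, params):
        return RuntimeResult("DONE", artifacts={"echo": params["message"]}, timeline=["executed"])


def run_step(runtime, params):
    """Drive one step: start it, then execute only if start succeeded."""
    started = runtime.start(params)
    if started.state != "DONE":
        return started
    result = runtime.execute(params)
    result.timeline = started.timeline + result.timeline
    return result


result = run_step(EchoStepRuntime(), {"message": "hi"})
```

A new kind of step is added by subclassing `StepRuntime` and implementing `execute`; the driver loop does not need to change.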
Step Parameter Merging
To control step behavior in a dynamic way, Maestro supports both runtime parameters and tags injection in step runtime. This makes a Maestro step more flexible to absorb runtime changes (e.g. overridden parameters) before actually being started. Maestro internally maintains a step parameter map that is initially empty and is updated by merging step parameters in the order below:
- Default General Parameters: Parameter merging starts from default parameters that in general every step should have. For example, workflow_instance_id, step_instance_uuid, step_attempt_id, and step_id are required parameters for each Maestro step. They are internally reserved by Maestro and cannot be passed by users.
- Injected Parameters: Maestro then merges injected parameters (if present) into the parameter map. The injected parameters come from step runtime and are dynamically generated based on step schema. Each type of step can have its own schema with specific parameters associated with this step. The step schema can evolve independently with no need to update Maestro code.
- Default Typed Parameters: After injecting runtime parameters, Maestro tries to merge default parameters that are related to a specific type of step. For example, the foreach step has loop_params and loop_index default parameters, which are internally set by Maestro and used for the foreach step only.
- Workflow and Step Info Parameters: These parameters contain information about the step and the workflow it belongs to. This can be identity information, e.g. workflow_id, and is merged into the step parameter map if present.
- Undefined New Parameters: When starting or restarting a Maestro workflow instance, users can specify new step parameters that are not present in the initial step definition. ParamsManager merges these parameters to ensure they are available at execution time.
- Step Definition Parameters: These step parameters are defined by users at definition time and are merged if they are not empty.
- Run and Restart Parameters: When starting or restarting a Maestro workflow instance, users can override defined parameters by providing run or restart parameters. These two types of parameters are merged at the end so that step runtime can see the most recent and accurate parameter space.
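The merge order in the list above can be sketched as a layered dictionary merge in which later layers override earlier ones. This is a simplification under assumptions: the layer names are labels invented for this example, and plain "last writer wins" semantics stand in for whatever type and mode checks the real ParamsManager performs. The reserved names are the ones listed above.

```python
MERGE_ORDER = [
    "default_general",
    "injected",
    "default_typed",
    "workflow_and_step_info",
    "undefined_new",
    "step_definition",
    "run_and_restart",
]

# Layers that carry user-supplied values (an assumption for this sketch).
USER_LAYERS = {"undefined_new", "step_definition", "run_and_restart"}

# Parameters reserved by Maestro that users cannot pass.
RESERVED = {"workflow_instance_id", "step_instance_uuid", "step_attempt_id", "step_id"}


def merge_step_params(layers):
    """Merge parameter layers in order; later layers override earlier ones."""
    merged = {}
    for name in MERGE_ORDER:
        for key, value in layers.get(name, {}).items():
            if name in USER_LAYERS and key in RESERVED:
                raise ValueError(f"{key} is reserved and cannot be passed by users")
            merged[key] = value
    return merged


layers = {
    "default_general": {"step_id": "load"},
    "default_typed": {"loop_index": 0},
    "step_definition": {"table": "daily", "mode": "append"},
    "run_and_restart": {"mode": "overwrite"},
}
merged = merge_step_params(layers)
```

Here the restart parameter `mode` overrides the value from the step definition because run and restart parameters are merged last, while `step_id` can only come from the default general layer.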
The parameter merging logic can be visualized in the diagram below.
Step Dependencies and Signals
Steps in the Maestro execution workflow graph can express execution dependencies using step dependencies. A step dependency specifies the data-related conditions required by a step to start execution. These conditions are usually defined based on signals, which are messages carrying information such as parameter values and can be published through step outputs or external systems like SNS or Kafka messages.
Signals in Maestro serve both the signal trigger pattern and the signal dependencies (publisher-subscriber) pattern. One step can publish an output signal that unblocks the execution of multiple other steps that depend on it. A signal definition includes a list of mapped parameters, allowing Maestro to perform “signal matching” on a subset of fields. Additionally, Maestro supports signal operators like <, >, etc., on signal parameter values.
Netflix has built various abstractions on top of the concept of signals. For instance, an ETL workflow can update a table with data and send signals that unblock steps in downstream workflows dependent on that data. Maestro supports “signal lineage,” which allows users to navigate all historical instances of signals and the workflow steps that match (i.e. publishing or consuming) those signals. Signal triggering guarantees exactly-once execution for the workflow subscribing to a signal or a set of joined signals. This approach is efficient, as it conserves resources by only executing the workflow or step when the specified conditions in the signals are met. A signal service is implemented for those advanced abstractions. Please refer to the Maestro blog for further details on it.
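Signal matching on a subset of fields, with comparison operators on parameter values, can be sketched as below. The dictionary shapes, the signal name, and the set of operators beyond `<` and `>` are assumptions for illustration, not Maestro's signal schema.

```python
import operator

# Comparison operators a dependency may apply to a signal parameter value.
OPERATORS = {
    "=": operator.eq,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
}


def matches(signal, dependency):
    """Return True if the signal satisfies every condition of the dependency.

    Only the fields named in the dependency are checked, so a signal may carry
    extra parameters that the dependency ignores.
    """
    if signal["name"] != dependency["name"]:
        return False
    for field_name, (op, expected) in dependency["match"].items():
        if field_name not in signal["params"]:
            return False
        if not OPERATORS[op](signal["params"][field_name], expected):
            return False
    return True


signal = {"name": "table_updated", "params": {"table": "playback", "hour": 7, "rows": 1200}}
dependency = {"name": "table_updated", "match": {"table": ("=", "playback"), "hour": (">", 5)}}
unblocked = matches(signal, dependency)
```

The dependency above ignores the `rows` field entirely, which is what matching on a subset of fields means in practice.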
Breakpoint
Maestro allows users to set breakpoints on workflow steps, functioning similarly to code-level breakpoints in an IDE. When a workflow instance executes and reaches a step with a breakpoint, that step enters a “paused” state. This halts the workflow graph’s progression until a user manually resumes from the breakpoint. If multiple instances of a workflow step are paused at a breakpoint, resuming one instance affects only that specific instance, leaving the others in a paused state. Deleting the breakpoint causes all paused step instances to resume.
This feature is particularly useful during the initial development of a workflow, allowing users to inspect step executions and output data. It is also useful when running a step multiple times in a “foreach” pattern with various input parameters. Setting a single breakpoint on a step causes all iterations of the foreach loop to pause at that step for debugging purposes. Additionally, the breakpoint feature allows human intervention during the workflow execution and can also be used for other purposes, e.g. supporting mutating step states while the workflow is running.
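The pause, resume, and delete semantics described above fit in a small in-memory registry. The class and method names are hypothetical; the sketch only models which step instances are paused, not how Maestro persists or enforces breakpoints.

```python
class BreakpointRegistry:
    def __init__(self):
        self.breakpoints = set()  # step ids that currently have a breakpoint
        self.paused = {}          # step id -> set of paused step instance ids

    def add(self, step_id):
        self.breakpoints.add(step_id)

    def reach(self, step_id, instance_id):
        """Called when a step instance is about to run."""
        if step_id in self.breakpoints:
            self.paused.setdefault(step_id, set()).add(instance_id)
            return "PAUSED"
        return "RUNNING"

    def resume(self, step_id, instance_id):
        """Resume one paused instance; other paused instances stay paused."""
        self.paused.get(step_id, set()).discard(instance_id)

    def delete(self, step_id):
        """Remove the breakpoint and return every instance that resumes."""
        self.breakpoints.discard(step_id)
        return sorted(self.paused.pop(step_id, set()))
```

With a breakpoint on a step inside a foreach loop, every iteration that reaches the step calls `reach` and is paused; `resume` releases one iteration, and `delete` releases the rest.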
Timeline
Maestro includes a step execution timeline, capturing all significant events such as execution state machine changes and the reasoning behind them. This feature is useful for debugging, providing insights into the status of a step. For example, it logs transitions such as “Created” and “Evaluating params”. An example of a timeline is included here for reference. The implemented step runtimes can add timeline events to the timeline to surface execution information to end users.
Retry Policies
Maestro supports retry policies for steps that reach a terminal state due to failure. Users can specify the number of retries and configure retry policies, including delays between retries and exponential backoff strategies, in addition to fixed interval retries. Maestro distinguishes between two types of retries: “platform” and “user.” Platform retries address platform-level errors unrelated to user logic, while user retries are for user-defined conditions. Each type can have its own set of retry policies.
Automatic retries are useful for handling transient errors that can be resolved without user intervention. Maestro provides the flexibility to set retries to zero for non-idempotent steps to avoid retry. This feature ensures that users have control over how retries are managed based on their specific requirements.
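The difference between fixed interval and exponential backoff retries comes down to how the delay before each retry is computed. The helper below is a generic sketch: the parameter names, the doubling factor, and the optional cap are assumptions, not Maestro's retry policy fields.

```python
def retry_delays(max_retries, base_delay_secs, backoff="fixed", factor=2, max_delay_secs=None):
    """Return the delay, in seconds, to wait before each retry attempt."""
    delays = []
    for attempt in range(max_retries):
        if backoff == "exponential":
            delay = base_delay_secs * (factor ** attempt)  # base, base*factor, base*factor^2, ...
        else:
            delay = base_delay_secs                        # same delay every time
        if max_delay_secs is not None:
            delay = min(delay, max_delay_secs)             # optional upper bound
        delays.append(delay)
    return delays


# Separate policies for the two retry types, as hypothetical settings.
platform_delays = retry_delays(4, 10, backoff="exponential", max_delay_secs=50)
user_delays = retry_delays(2, 30)
no_retry = retry_delays(0, 30)  # zero retries, e.g. for a non-idempotent step
```

Setting `max_retries` to zero yields an empty schedule, which corresponds to turning retries off for a non-idempotent step.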
Aggregated View
Because a workflow instance can have multiple runs, it is important for users to see an aggregated state of all steps in the workflow instance. The aggregated view is computed by merging the base aggregated view with the step statuses of the current run. For example, as the figure below shows for a simple case, there is a first run in which step1 and step2 succeeded, step3 failed, and step4 and step5 have not started. When the user restarts the run, run 2 starts from step3 and skips step1 and step2, which succeeded in the previous run. After all steps succeed, the aggregated view shows the run states for all steps.
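The merge described in this example can be sketched with two dictionaries keyed by step id. The status strings and the function name are assumptions; the point is only that statuses from the current run override the base view, while steps skipped in the current run keep their earlier status.

```python
def aggregate_view(base_view, current_run):
    """Overlay the current run's step statuses on the base aggregated view."""
    merged = dict(base_view)
    merged.update(current_run)
    return merged


# Run 1: step3 fails, so step4 and step5 never start.
run1 = {
    "step1": "SUCCEEDED",
    "step2": "SUCCEEDED",
    "step3": "FAILED",
    "step4": "NOT_STARTED",
    "step5": "NOT_STARTED",
}
# Run 2: restarted from step3; step1 and step2 are skipped.
run2 = {"step3": "SUCCEEDED", "step4": "SUCCEEDED", "step5": "SUCCEEDED"}

view = aggregate_view(run1, run2)
```

After the merge, all five steps show as succeeded even though run 2 executed only three of them.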
Rollup
Rollup provides a high-level summary of a workflow instance, detailing the status of each step and the count of steps in each status. It flattens steps across the current instance and any nested non-inline workflows like subworkflows or foreach steps. For instance, if a successful workflow has three steps, one of which is a subworkflow corresponding to a five-step workflow, the rollup will indicate that seven steps succeeded. Only leaf steps are counted in the rollup, as other steps serve merely as pointers to concrete workflows.
Rollup also retains references to any non-successful steps, offering a clear overview of step statuses and facilitating easy navigation to problematic steps, even within nested workflows. The aggregated rollup for a workflow instance is calculated by combining the current run’s runtime data with a base rollup. The current state is derived from the statuses of active steps, including aggregated rollups for foreach and subworkflow steps. The base rollup is established when the workflow instance starts and includes statuses of inline steps (excluding foreach and subworkflows) from the previous run that are not part of the current run.
For subworkflow steps, the rollup simply reflects the rollup of the subworkflow instance. For foreach steps, the rollup combines the base rollup of the foreach step with the current state rollup. The base is derived from the previous run’s aggregated rollup, excluding the iterations to be restarted in the new run. The current state is periodically updated by aggregating rollups of running iterations until all iterations reach a terminal state.
Due to these processes, the rollup model is eventually consistent. While the figure below illustrates a straightforward example of rollup, the calculations can become complex and recursive, especially with multiple levels of nested foreaches and subworkflows.
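The leaf-counting rule can be checked against the three-step example above (two plain steps plus a subworkflow of five steps gives seven). The nested dictionary layout and the `children` key are assumptions used only for this sketch; base rollups and eventual consistency are not modeled.

```python
from collections import Counter


def rollup(steps):
    """Count leaf steps by status, descending into nested workflows."""
    counts = Counter()
    for step in steps:
        if "children" in step:
            # A subworkflow or foreach step is a pointer; only its leaves count.
            counts += rollup(step["children"])
        else:
            counts[step["status"]] += 1
    return counts


subworkflow_steps = [{"id": f"sub_step{i}", "status": "SUCCEEDED"} for i in range(1, 6)]
workflow = [
    {"id": "step1", "status": "SUCCEEDED"},
    {"id": "step2", "children": subworkflow_steps},
    {"id": "step3", "status": "SUCCEEDED"},
]
summary = rollup(workflow)
```

Here `summary["SUCCEEDED"]` is 7: the subworkflow step itself is not counted, only the five leaf steps it points to and the two plain steps.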
Maestro Event Publishing
When a workflow definition, workflow instance, or step instance is changed, Maestro generates an event, processes it internally, and publishes the processed event to external system(s). Maestro has both internal and external events. The internal event tracks changes within the life cycle of a workflow, workflow instance, or step instance. It is published to an internal queue and processed within Maestro. After internal events are processed, some of them are transformed into external events and sent out to the external queue (e.g. SNS, Kafka). The external event carries Maestro status change information for downstream services. The event publishing flow is illustrated in the diagram below:
As shown in the diagram, the Maestro event processor bridges the two aforementioned Maestro events. It listens on the internal queue to get the published internal events. Within the processor, the internal job event is processed based on its type and gets converted to an external event if needed. The notification publisher at the end emits the external event so that downstream services can consume it.
The downstream services are mostly event-driven. The Maestro event carries the most useful message for downstream services to capture different changes in Maestro. In general, these changes can be classified into two categories: workflow change and instance status change. The workflow change event is associated with actions at the workflow level, e.g. the definition or properties of a workflow have changed. Meanwhile, instance status change tracks status transitions on a workflow instance or step instance.
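The flow from internal queue to external publisher can be sketched with an in-memory queue and a callback. All event type names and the mapping from internal types to the two external categories are hypothetical; a real deployment would publish to a system such as SNS or Kafka instead of appending to a list.

```python
from collections import deque

# Hypothetical internal event types that are converted to external events.
EXTERNAL_CATEGORIES = {
    "WORKFLOW_DEFINITION_CHANGED": "workflow_change",
    "WORKFLOW_INSTANCE_STATUS_CHANGED": "instance_status_change",
    "STEP_INSTANCE_STATUS_CHANGED": "instance_status_change",
}


def process_internal_events(internal_queue, publish):
    """Drain the internal queue and publish external events where needed."""
    handled = 0
    while internal_queue:
        event = internal_queue.popleft()
        handled += 1  # internal processing of the event would happen here
        category = EXTERNAL_CATEGORIES.get(event["type"])
        if category is not None:
            publish({"category": category, "id": event["id"], "status": event.get("status")})
    return handled


published = []
queue = deque([
    {"type": "WORKFLOW_DEFINITION_CHANGED", "id": "wf1"},
    {"type": "INTERNAL_CLEANUP", "id": "wf1"},  # internal only, never published
    {"type": "STEP_INSTANCE_STATUS_CHANGED", "id": "wf1/step3", "status": "SUCCEEDED"},
])
handled = process_internal_events(queue, published.append)
```

All three internal events are handled, but only two external events are emitted, one in each of the two categories that downstream services consume.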
Maestro has been extensively used within Netflix, and today, we are excited to make the Maestro source code publicly available. We hope that the scalability and usability that Maestro offers can expedite workflow development outside Netflix. We invite you to try Maestro, use it within your organization, and contribute to its development.
You can find the Maestro code repository at github.com/Netflix/maestro. If you have any questions, thoughts, or comments about Maestro, please feel free to create a GitHub issue in the Maestro repository. We are eager to hear from you.
We are taking workflow orchestration to the next level and constantly solving new problems and challenges, so please stay tuned for updates. If you are passionate about solving large-scale orchestration problems, please join us.
Thanks to other Maestro team members, Binbing Hou, Zhuoran Dong, Brittany Truong, Deepak Ramalingam, Moctar Ba, for their contributions to the Maestro project. Thanks to our Product Manager Ashim Pokharel for driving the strategy and requirements. We’d also like to thank Andrew Seier, Romain Cledat, Olek Gorajek, and other stunning colleagues at Netflix for their contributions to the Maestro project. We also thank Prashanth Ramdas, Eva Tse, David Noor, Charles Smith and other leaders of Netflix engineering organizations for their constructive feedback and suggestions on the Maestro project.