Dynamic Repartitioning for Time Series Workloads | by Netflix Technology Blog

11 min learn

18 hours in the past

By Rajiv Shringi, Kaidan Fullerton, Oleksii Tkachuk and Kartik Sathyanarayanan

Introduction

Netflix’s TimeSeries Abstraction is a scalable system for ingesting and querying petabytes of temporal occasion information with millisecond latency. We use Apache Cassandra 4.x because the underlying storage for these fundamental causes:

Throughput, latency, and value: Cassandra can deal with hundreds of thousands of low‑latency reads and writes in a cheap method.
Operational maturity: Our information platform crew has deep operational experience working massive Cassandra clusters in manufacturing.

Nonetheless, utilizing Cassandra at this scale introduces commerce‑offs for TimeSeries workloads. A key problem is broad partitions, as TimeSeries dataset partitions can develop fairly massive with occasions accumulating over time.

This drawback is additional compounded by the truth that TimeSeries servers routinely take care of a really excessive learn throughput:

This put up walks by way of our journey to cut back the impression of broad partitions in our TimeSeries datasets, the options we constructed, and the teachings we realized.

Word: Though this put up walks by way of re-partitioning in Cassandra, the identical strategies may be utilized extra broadly to different information shops.

Impression of Broad Partitions

For many of our datasets, we observe a mean learn latency within the order of single-digit milliseconds:

Nonetheless, in some datasets, as partitions develop too broad, we observe excessive learn latencies within the order of seconds, particularly in direction of the tail finish:

Excessive Tail Latency for Reads (seconds)

This may end up in timeouts:

In excessive instances, if many of the reads goal broad partitions, we are able to see Rubbish Assortment pauses, excessive CPU utilization and thread queueing.

Excessive CPU utilization and thread-queueing in Cassandra clusters

Scaling up the underlying Cassandra cluster is all the time an choice, however we want smarter alternate options than simply throwing more cash on the drawback.

TimeSeries Partitioning Technique

The TimeSeries Abstraction was designed to resolve the issue of broad partitions by dividing the info into discrete time chunks. For extra in-depth info, consult with our earlier weblog.

To summarize, right here is an illustration of how TimeSeries partitioning technique helps us break up broad partitions into manageable chunks.

Time Collection partitioning breaking apart a dataset into Time slices, time buckets and occasion buckets

This technique additional permits us to effectively question and drop information based mostly on time, with out having to take care of tombstones.

Choosing the Partitioning Technique

When a namespace (a.ok.a. dataset) is created, customers should specify their anticipated workload traits. This specification is then fed into our provisioning pipeline. The pipeline processes these inputs, runs Monte Carlo simulations, and produces an optimum infrastructure and partition configuration.

Provisioning picks optimum infra and configuration based mostly on person inputs

You may be taught extra about our methodology of capability planning on this insightful AWS re:Invent discuss given by certainly one of our beautiful colleagues.

The Downside with the Present Strategy

Though this methodology of provisioning is efficient in lots of conditions, it proves inadequate for TimeSeries workloads underneath these circumstances:

Workload is unknown or inaccurately estimated: Early on in a venture, customers can lack a dependable image of manufacturing visitors or just misestimate key parameters.
Workload evolves over time: Visitors patterns, consumer habits, and product necessities change. A “good” partitioning technique on day one can turn out to be inefficient months later.
Knowledge outliers exist: Not all TimeSeries IDs behave the identical. A small share of IDs can obtain a vastly larger quantity of occasions than the remaining.

Fortuitously, our design with discrete Time Slices provides us a pure escape hatch for the primary two situations; every new Time Slice can use a unique partitioning technique.

Every Time Slice can have a singular partition technique

Nonetheless, manually adjusting these configurations in a fleet that has 1000’s of TimeSeries datasets just isn’t sustainable. We want automation.

Resolution 1: Time Slice Re-Partitioning

Cassandra exposes helpful introspection APIs for understanding information utilization and entry patterns. For instance, nodetool tablehistograms present percentile distributions for partition sizes in a desk. Utilizing these instruments, we are able to detect instances of each over and underneath partitioning.

Under is an instance of over‑partitioning, the place the TimeSeries provisioning pipeline chosen very small time_bucket intervals based mostly on person offered inputs:

Provisioning chosen 60s time buckets based mostly on person inputs

inflicting partitions to have lower than 10 KB of information, resulting in excessive learn amplification and thread queueing:

Histogram of the given Cassandra desk displaying partition measurement percentiles

So as to tune partition methods effectively, we added a background employee, which displays partition histograms of Time Slices connected to a given software, and exposes it by way of a Cassandra digital desk:

Histograms uncovered by way of a Cassandra Digital desk

It then computes an adjustment issue when it detects partition sizes not assembly a configured density. This configured density is usually set between 2 MiB to 10 MiB relying on the workload.

DynamicTimeSliceConfigWorker: 
namespace: my_dataset_1
Noticed: TimeSlices have p99 partitions beneath configured goal of 10MB. 
Proposed: time_bucket interval: 60s -> 604800s

The employee can then replace future Time Slices with the brand new partition technique:

Partitioning adjusted for future Time Slice(s)

This technique has yielded actual leads to decreasing our learn latencies, in addition to decreasing the variety of timeouts attributable to thread queueing.

Discount in tail latency and thread queueing for

Nonetheless, this technique solely works if many of the information displays such habits that warrants re-partitioning of all the desk. It doesn’t work in instances the place solely a share of IDs inside the desk are broad.

We now have a few choices right here:

Do Nothing: That is generally the correct method if there is no such thing as a noticed impression to the appliance’s top-level metrics.
Partial Returns: We carried out a ‘Partial Return’ characteristic, which aborts an inflight request if it has breached a configured latency SLO, whereas returning no matter information it has collected up till that time. This can be a nice choice for shoppers who care extra about latency than fetching all the info.

Tail latency drops across the SLO cutoff as Partial Returns are enabled

Block IDs: That is an excessive step however value mentioning, as a result of we do take care of unhealthy information that often seeps into the system e.g. check or spam IDs that may make the system unstable.

dgwts.config.<dataset>.block.Ids: "<tsid-1>, <tsid-2>, <tsid-3>"

Finally, we encounter situations the place legitimate and vital TimeSeries IDs accumulate a excessive sufficient quantity of occasions, with callers needing to course of all of the associated information. Merely tolerating elevated latencies or timeouts when querying these IDs just isn’t a fascinating end result.

That is the place dynamic partitioning comes into play.

Resolution 2: Dynamic Partitioning per ID

Dynamic partitioning is an asynchronous pipeline that auto-detects and splits broad partitions on a TimeSeries ID stage somewhat than on the desk stage.

Get Netflix Know-how Weblog’s tales in your inbox

Be a part of Medium at no cost to get updates from this author.

Keep in mind me for quicker register

It has three fundamental phases:

Detection: Detects broad partitions for a given TimeSeries ID throughout the learn path.
Planning & Splitting: Plans and executes splits of these partitions into optimum sizes asynchronously.
Serving Reads: Re-routes the learn queries transparently to learn information from the cut up partitions when prepared.

That is the way it works at a excessive stage; we’ll dive into particulars after:

Dynamic Broad Partition Break up Async Pipeline

Listed below are the totally different phases of the pipeline:

Detection

Each TimeSeries learn operation tracks what number of bytes are learn for a given partition. If the bytes learn exceed a configured threshold, the server emits a detection occasion to Kafka:

{
"time_slice": "data_20260328", // the Cassandra desk this occasion was detected in
"time_series_id": "profileId:123", // the ID detected as broad
"time_bucket": 7, // the present time_bucket partition
"event_bucket": 2, // the present event_bucket partition
"immutable": true, // TimeSeries servers can compute if this partition is now not receiving writes
"model": "0" // reserved for future use e.g. invalidate if partition is now not immutable
}

Our determination to detect broad partitions on reads, versus writes, is predicated on our remark that almost all of the info within the wild doesn’t want this therapy. The slight draw back is that some reads on these massive partitions might endure sub-optimal efficiency for a really quick period (sometimes seconds) till this course of catches up.

Immutability

Though splitting mutable partitions is feasible, it’s inherently extra advanced. As a primary step in direction of fixing this drawback, we selected to cut back the floor space of this variation by specializing in immutable partitions, whereas nonetheless meaningfully decreasing caller timeouts.

Planning

Detection might happen based mostly on a partial learn, so the planner should nonetheless learn all the partition as soon as to compute an correct cut up plan. The checkpointing turns into essential right here. For planning reads that fail to course of all the partition, the method can all the time proceed from the final saved checkpoint.

Checkpointing

The wide_row metadata desk serves because the spine for state transitions and checkpointing of partition splits. It additionally shops info that’s used later by TimeSeries servers to correctly route Learn queries.

wide_row metadata for storing cut up states and checkpoints

Splitting

The Planner delegates the splitting of information to an applicable split-strategy. For instance, if EventBucketPartitionSplitStrategy is chosen, we cut up the partition by assigning extra occasion buckets to the identical time bucket. If the partition is ultra-wide, we cap the variety of occasion buckets we cut up into, so as to management the resultant learn amplification. Spreading into a number of partitions in such instances continues to be helpful so as to unfold the learn workload to a number of Cassandra replicas.

Break up by assigning extra occasion buckets for a given time bucket

Additional, because the Splitter has the total view of the partition, it may guarantee whole type order throughout all of the cut up buckets.

Validating Splits

The Planner shops a pre-split checksum of a given partition throughout the planning section, whereas the Splitter computes and shops the post-split checksum. The cut up standing is marked as accomplished provided that the 2 checksums match.

Guarantee checksums match pre- and post-split earlier than marking a cut up as COMPLETED

Monitoring Splits

The pre- and post-split partition sizes throughout totally different datasets are tracked to see how successfully the partition splits are being deliberate and executed:

Observe pre- and post-split partition sizes to make sure we’re splitting optimally

Serving Reads

The TimeSeries servers load the partition-keys of accomplished splits periodically into in-memory Bloom filters. Each learn operation checks the Bloom filter to see whether or not a question may be diverted to the cut up partitions.

Here’s what the Learn path appears to be like like:

Learn path for diverting reads to present or cut up partitions

The dimensions of the Bloom filters is monitored to make sure we have now sufficient reminiscence per server. Because of the compactness of partition keys, and ratio of broad partitions in a given dataset, the filters match comfortably in every server occasion.

Bloom filter approximate ingredient rely per namespace and time slice

The Bloom filter latency to test whether or not a given partition secret’s broad for each learn request is often in single-digit microseconds or higher, making this diversion virtually invisible to the callers.

Latency for checking Bloom filters is extraordinarily small for callers to note the diversion

For the instances that do find yourself with a Bloom filter hit, the TimeSeries servers lookup the wide_row metadata to see learn how to learn a selected broad partition:

{
"pre_split_data": {
"time_slice": "data_20260328",
"time_series_id": "6313825", → What to learn
"time_bucket": 0,
"event_bucket": 2
…
},
"post_split_data": {
"time_slice": "wide_data_20260328_0", → The place to learn it from
"event_bucket_partition_strategy": { → Technique to delegate to for studying
"target_event_buckets": 2,
"start_event_bucket": 32 → How ought to the technique learn it
}
…
}

This metadata learn is backed by a read-through cache, making it fairly performant:

Metadata fetch latency is sort of low to have an effect on learn operations

Lastly, the reads for the cut up partitions are delegated to our present PartitionReader, which reads N smaller partitions in parallel, somewhat than 1 massive partition, enhancing general efficiency and stability!

Learn a lot smaller partitions in parallel and merge outcomes

Fallbacks

The prevailing broad partition from the unique time slice isn’t deleted. This helps us in creating protected fallbacks in many various situations of partial failures and eventual consistency. The marginally bigger cupboard space we use consequently is well worth the operational security we achieve.

Constructing Extra Confidence

Serving incorrect reads can be disastrous. To ascertain belief past checksums, we leveraged extra mechanisms reminiscent of:

Utilizing our present Knowledge Bridge pipelines to confirm splits offline:

Spark job to make sure that the cut up information is a precise match to the unique information

Implementing a phased rollout technique to soundly advance by way of phases as our confidence within the system grew:

Advance by way of Learn modes as soon as earlier mode passes checks

A crucial a part of this phased rollout was the Comparability section, which in contrast bytes served by outdated learn path and the brand new learn path whereas in shadow mode:

A chart of bytes match vs bytes differ in a given shadow interval

Outcomes

On account of these dynamic splits, we see an enormous enchancment within the common learn latency of most broad partitions, bringing it down from seconds:

Current common latency for studying broad partitions

to low double-digit milliseconds!

Common latency for studying dynamically cut up partitions

Tail latencies of studying broad partitions dropped from a number of seconds:

Current tail latency for studying broad partitions

to round 200 ms or higher:

Tail latency for studying dynamically cut up partitions

leading to a drop in learn timeouts:

Total, this has resulted in a extra steady Cassandra cluster with decrease CPU utilization and little to no thread queuing:

Low CPU utilization and no thread-queueing

Additional, for excessive broad rows, the place a dataset would face fixed timeouts and unavailability blips, the service was in a position to paginate and question 500MB+ partitions whereas remaining accessible:

grpc … com.netflix.dgw.ts.TimeSeriesService/SearchEventRecords -d
'{"namespace": "...",
"search_query": {...},
"time_interval": {
"begin": "2026–05–11T23:42:51.484398Z",
"finish": "2026–05–12T00:13:50.694205Z"
},
"pageSize" : 1000,
}'
# Response:
{
"next_page_token" : ….,
"data": [
{
…
}
],
"response_context": [{
"namespace": "...",
…
# Trades elevated latency for being available
"time_taken": "41.072410142s"
}
]
}

Conclusion

There may be extra work deliberate round this characteristic, like splitting mutable broad partitions, or re-processing beforehand failed splits, however this has been a profitable begin in enhancing service efficiency and decreasing our assist burden.

Additional, we want to spotlight some key classes that we realized at totally different factors on this journey.

Decreasing Floor Space: As a primary step, discover less complicated options that may nonetheless ship significant impression. Additionally, decreasing the floor space of a fancy change and deploying incrementally pays off operationally.
Constructing Confidence: Make investments time and sources to construct confidence in new options, particularly when justified by the characteristic complexity, deployment blast radius, and/or potential impression.

Acknowledgements: Particular due to our beautiful colleagues who additional contributed to this characteristic’s success: Tom DeVoe, Chris Lohfink, Sumanth Pasupuleti and Joey Lynch.

Source link

What's Hot

The Best Science Fiction Film From Each Year

16 He-Man Action Figures of the ’80s and What Made Them Genius

17 ‘Obsession’ Secrets That Change Everything

Dynamic Repartitioning for Time Series Workloads | by Netflix Technology Blog | Jun, 2026

Al Pacino’s 1979 ‘…And Justice for All’ Headed to Netflix As TV Series

Dan Levy Refuses to Watch the Modern Wave of “Island” Dating Shows

Watch The Full Trailer For Netflix’s Harlan Coben Series

Is the HBO Series Coming Back After Season 3? – Hollywood Life

Subscribe to Updates