18 hours in the past
By Rajiv Shringi, Kaidan Fullerton, Oleksii Tkachuk and Kartik Sathyanarayanan
Introduction
Netflix’s TimeSeries Abstraction is a scalable system for ingesting and querying petabytes of temporal occasion information with millisecond latency. We use Apache Cassandra 4.x because the underlying storage for these fundamental causes:
- Throughput, latency, and value: Cassandra can deal with hundreds of thousands of low‑latency reads and writes in a cheap method.
- Operational maturity: Our information platform crew has deep operational experience working massive Cassandra clusters in manufacturing.
Nonetheless, utilizing Cassandra at this scale introduces commerce‑offs for TimeSeries workloads. A key problem is broad partitions, as TimeSeries dataset partitions can develop fairly massive with occasions accumulating over time.
This drawback is additional compounded by the truth that TimeSeries servers routinely take care of a really excessive learn throughput:
This put up walks by way of our journey to cut back the impression of broad partitions in our TimeSeries datasets, the options we constructed, and the teachings we realized.
Word: Though this put up walks by way of re-partitioning in Cassandra, the identical strategies may be utilized extra broadly to different information shops.
Impression of Broad Partitions
For many of our datasets, we observe a mean learn latency within the order of single-digit milliseconds:
Nonetheless, in some datasets, as partitions develop too broad, we observe excessive learn latencies within the order of seconds, particularly in direction of the tail finish:
This may end up in timeouts:
In excessive instances, if many of the reads goal broad partitions, we are able to see Rubbish Assortment pauses, excessive CPU utilization and thread queueing.
Scaling up the underlying Cassandra cluster is all the time an choice, however we want smarter alternate options than simply throwing more cash on the drawback.
TimeSeries Partitioning Technique
The TimeSeries Abstraction was designed to resolve the issue of broad partitions by dividing the info into discrete time chunks. For extra in-depth info, consult with our earlier weblog.
To summarize, right here is an illustration of how TimeSeries partitioning technique helps us break up broad partitions into manageable chunks.
This technique additional permits us to effectively question and drop information based mostly on time, with out having to take care of tombstones.
Choosing the Partitioning Technique
When a namespace (a.ok.a. dataset) is created, customers should specify their anticipated workload traits. This specification is then fed into our provisioning pipeline. The pipeline processes these inputs, runs Monte Carlo simulations, and produces an optimum infrastructure and partition configuration.
You may be taught extra about our methodology of capability planning on this insightful AWS re:Invent discuss given by certainly one of our beautiful colleagues.
The Downside with the Present Strategy
Though this methodology of provisioning is efficient in lots of conditions, it proves inadequate for TimeSeries workloads underneath these circumstances:
- Workload is unknown or inaccurately estimated: Early on in a venture, customers can lack a dependable image of manufacturing visitors or just misestimate key parameters.
- Workload evolves over time: Visitors patterns, consumer habits, and product necessities change. A “good” partitioning technique on day one can turn out to be inefficient months later.
- Knowledge outliers exist: Not all TimeSeries IDs behave the identical. A small share of IDs can obtain a vastly larger quantity of occasions than the remaining.
Fortuitously, our design with discrete Time Slices provides us a pure escape hatch for the primary two situations; every new Time Slice can use a unique partitioning technique.
Nonetheless, manually adjusting these configurations in a fleet that has 1000’s of TimeSeries datasets just isn’t sustainable. We want automation.
Resolution 1: Time Slice Re-Partitioning
Cassandra exposes helpful introspection APIs for understanding information utilization and entry patterns. For instance, nodetool tablehistograms present percentile distributions for partition sizes in a desk. Utilizing these instruments, we are able to detect instances of each over and underneath partitioning.
Under is an instance of over‑partitioning, the place the TimeSeries provisioning pipeline chosen very small time_bucket intervals based mostly on person offered inputs:
inflicting partitions to have lower than 10 KB of information, resulting in excessive learn amplification and thread queueing:
So as to tune partition methods effectively, we added a background employee, which displays partition histograms of Time Slices connected to a given software, and exposes it by way of a Cassandra digital desk:
It then computes an adjustment issue when it detects partition sizes not assembly a configured density. This configured density is usually set between 2 MiB to 10 MiB relying on the workload.
DynamicTimeSliceConfigWorker:
namespace: my_dataset_1
Noticed: TimeSlices have p99 partitions beneath configured goal of 10MB.
Proposed: time_bucket interval: 60s -> 604800sThe employee can then replace future Time Slices with the brand new partition technique:
This technique has yielded actual leads to decreasing our learn latencies, in addition to decreasing the variety of timeouts attributable to thread queueing.
Nonetheless, this technique solely works if many of the information displays such habits that warrants re-partitioning of all the desk. It doesn’t work in instances the place solely a share of IDs inside the desk are broad.
We now have a few choices right here:
- Do Nothing: That is generally the correct method if there is no such thing as a noticed impression to the appliance’s top-level metrics.
- Partial Returns: We carried out a ‘Partial Return’ characteristic, which aborts an inflight request if it has breached a configured latency SLO, whereas returning no matter information it has collected up till that time. This can be a nice choice for shoppers who care extra about latency than fetching all the info.
- Block IDs: That is an excessive step however value mentioning, as a result of we do take care of unhealthy information that often seeps into the system e.g. check or spam IDs that may make the system unstable.
dgwts.config.<dataset>.block.Ids: "<tsid-1>, <tsid-2>, <tsid-3>"Finally, we encounter situations the place legitimate and vital TimeSeries IDs accumulate a excessive sufficient quantity of occasions, with callers needing to course of all of the associated information. Merely tolerating elevated latencies or timeouts when querying these IDs just isn’t a fascinating end result.
That is the place dynamic partitioning comes into play.
Resolution 2: Dynamic Partitioning per ID
Dynamic partitioning is an asynchronous pipeline that auto-detects and splits broad partitions on a TimeSeries ID stage somewhat than on the desk stage.
Get Netflix Know-how Weblog’s tales in your inbox
Be a part of Medium at no cost to get updates from this author.
It has three fundamental phases:
- Detection: Detects broad partitions for a given TimeSeries ID throughout the learn path.
- Planning & Splitting: Plans and executes splits of these partitions into optimum sizes asynchronously.
- Serving Reads: Re-routes the learn queries transparently to learn information from the cut up partitions when prepared.
That is the way it works at a excessive stage; we’ll dive into particulars after:
Listed below are the totally different phases of the pipeline:
Detection
Each TimeSeries learn operation tracks what number of bytes are learn for a given partition. If the bytes learn exceed a configured threshold, the server emits a detection occasion to Kafka:
{
"time_slice": "data_20260328", // the Cassandra desk this occasion was detected in
"time_series_id": "profileId:123", // the ID detected as broad
"time_bucket": 7, // the present time_bucket partition
"event_bucket": 2, // the present event_bucket partition
"immutable": true, // TimeSeries servers can compute if this partition is now not receiving writes
"model": "0" // reserved for future use e.g. invalidate if partition is now not immutable
}Our determination to detect broad partitions on reads, versus writes, is predicated on our remark that almost all of the info within the wild doesn’t want this therapy. The slight draw back is that some reads on these massive partitions might endure sub-optimal efficiency for a really quick period (sometimes seconds) till this course of catches up.
Immutability
Though splitting mutable partitions is feasible, it’s inherently extra advanced. As a primary step in direction of fixing this drawback, we selected to cut back the floor space of this variation by specializing in immutable partitions, whereas nonetheless meaningfully decreasing caller timeouts.
Planning
Detection might happen based mostly on a partial learn, so the planner should nonetheless learn all the partition as soon as to compute an correct cut up plan. The checkpointing turns into essential right here. For planning reads that fail to course of all the partition, the method can all the time proceed from the final saved checkpoint.
Checkpointing
The wide_row metadata desk serves because the spine for state transitions and checkpointing of partition splits. It additionally shops info that’s used later by TimeSeries servers to correctly route Learn queries.
Splitting
The Planner delegates the splitting of information to an applicable split-strategy. For instance, if EventBucketPartitionSplitStrategy is chosen, we cut up the partition by assigning extra occasion buckets to the identical time bucket. If the partition is ultra-wide, we cap the variety of occasion buckets we cut up into, so as to management the resultant learn amplification. Spreading into a number of partitions in such instances continues to be helpful so as to unfold the learn workload to a number of Cassandra replicas.
Additional, because the Splitter has the total view of the partition, it may guarantee whole type order throughout all of the cut up buckets.
Validating Splits
The Planner shops a pre-split checksum of a given partition throughout the planning section, whereas the Splitter computes and shops the post-split checksum. The cut up standing is marked as accomplished provided that the 2 checksums match.
Monitoring Splits
The pre- and post-split partition sizes throughout totally different datasets are tracked to see how successfully the partition splits are being deliberate and executed:
Serving Reads
The TimeSeries servers load the partition-keys of accomplished splits periodically into in-memory Bloom filters. Each learn operation checks the Bloom filter to see whether or not a question may be diverted to the cut up partitions.
Here’s what the Learn path appears to be like like:
The dimensions of the Bloom filters is monitored to make sure we have now sufficient reminiscence per server. Because of the compactness of partition keys, and ratio of broad partitions in a given dataset, the filters match comfortably in every server occasion.
The Bloom filter latency to test whether or not a given partition secret’s broad for each learn request is often in single-digit microseconds or higher, making this diversion virtually invisible to the callers.
For the instances that do find yourself with a Bloom filter hit, the TimeSeries servers lookup the wide_row metadata to see learn how to learn a selected broad partition:
{
"pre_split_data": {
"time_slice": "data_20260328",
"time_series_id": "6313825", → What to learn
"time_bucket": 0,
"event_bucket": 2
…
},
"post_split_data": {
"time_slice": "wide_data_20260328_0", → The place to learn it from
"event_bucket_partition_strategy": { → Technique to delegate to for studying
"target_event_buckets": 2,
"start_event_bucket": 32 → How ought to the technique learn it
}
…
}This metadata learn is backed by a read-through cache, making it fairly performant:
Lastly, the reads for the cut up partitions are delegated to our present PartitionReader, which reads N smaller partitions in parallel, somewhat than 1 massive partition, enhancing general efficiency and stability!
Fallbacks
The prevailing broad partition from the unique time slice isn’t deleted. This helps us in creating protected fallbacks in many various situations of partial failures and eventual consistency. The marginally bigger cupboard space we use consequently is well worth the operational security we achieve.
Constructing Extra Confidence
Serving incorrect reads can be disastrous. To ascertain belief past checksums, we leveraged extra mechanisms reminiscent of:
- Utilizing our present Knowledge Bridge pipelines to confirm splits offline:
- Implementing a phased rollout technique to soundly advance by way of phases as our confidence within the system grew:
A crucial a part of this phased rollout was the Comparability section, which in contrast bytes served by outdated learn path and the brand new learn path whereas in shadow mode:
Outcomes
On account of these dynamic splits, we see an enormous enchancment within the common learn latency of most broad partitions, bringing it down from seconds:
to low double-digit milliseconds!
Tail latencies of studying broad partitions dropped from a number of seconds:
to round 200 ms or higher:
leading to a drop in learn timeouts:
Total, this has resulted in a extra steady Cassandra cluster with decrease CPU utilization and little to no thread queuing:
Additional, for excessive broad rows, the place a dataset would face fixed timeouts and unavailability blips, the service was in a position to paginate and question 500MB+ partitions whereas remaining accessible:
grpc … com.netflix.dgw.ts.TimeSeriesService/SearchEventRecords -d
'{"namespace": "...",
"search_query": {...},
"time_interval": {
"begin": "2026–05–11T23:42:51.484398Z",
"finish": "2026–05–12T00:13:50.694205Z"
},
"pageSize" : 1000,
}'
# Response:
{
"next_page_token" : ….,
"data": [
{
…
}
],
"response_context": [{
"namespace": "...",
…
# Trades elevated latency for being available
"time_taken": "41.072410142s"
}
]
}Conclusion
There may be extra work deliberate round this characteristic, like splitting mutable broad partitions, or re-processing beforehand failed splits, however this has been a profitable begin in enhancing service efficiency and decreasing our assist burden.
Additional, we want to spotlight some key classes that we realized at totally different factors on this journey.
- Decreasing Floor Space: As a primary step, discover less complicated options that may nonetheless ship significant impression. Additionally, decreasing the floor space of a fancy change and deploying incrementally pays off operationally.
- Constructing Confidence: Make investments time and sources to construct confidence in new options, particularly when justified by the characteristic complexity, deployment blast radius, and/or potential impression.
Acknowledgements: Particular due to our beautiful colleagues who additional contributed to this characteristic’s success: Tom DeVoe, Chris Lohfink, Sumanth Pasupuleti and Joey Lynch.
