Entertainer.newsEntertainer.news
  • Home
  • Celebrity
  • Movies
  • Music
  • Web Series
  • Podcast
  • OTT
  • Television
  • Interviews
  • Awards

Subscribe to Updates

Get the latest Entertainment News and Updates from Entertainer News

What's Hot

The Best Science Fiction Film From Each Year

June 4, 2026

16 He-Man Action Figures of the ’80s and What Made Them Genius

June 4, 2026

17 ‘Obsession’ Secrets That Change Everything

June 4, 2026
Facebook Twitter Instagram
Thursday, June 4
  • About us
  • Advertise with us
  • Submit Articles
  • Privacy Policy
  • Contact us
Facebook Twitter Tumblr LinkedIn
Entertainer.newsEntertainer.news
Subscribe Login
  • Home
  • Celebrity
  • Movies
  • Music
  • Web Series
  • Podcast
  • OTT
  • Television
  • Interviews
  • Awards
Entertainer.newsEntertainer.news
Home Dynamic Repartitioning for Time Series Workloads | by Netflix Technology Blog | Jun, 2026
Web Series

Dynamic Repartitioning for Time Series Workloads | by Netflix Technology Blog | Jun, 2026

Team EntertainerBy Team EntertainerJune 3, 2026Updated:June 4, 2026No Comments14 Mins Read
Facebook Twitter Pinterest LinkedIn Tumblr WhatsApp VKontakte Email
Dynamic Repartitioning for Time Series Workloads | by Netflix Technology Blog | Jun, 2026
Share
Facebook Twitter LinkedIn Pinterest Email


Netflix Technology Blog

11 min learn

·

18 hours in the past

By Rajiv Shringi, Kaidan Fullerton, Oleksii Tkachuk and Kartik Sathyanarayanan

Introduction

Netflix’s TimeSeries Abstraction is a scalable system for ingesting and querying petabytes of temporal occasion information with millisecond latency. We use Apache Cassandra 4.x because the underlying storage for these fundamental causes:

  • Throughput, latency, and value: Cassandra can deal with hundreds of thousands of low‑latency reads and writes in a cheap method.
  • Operational maturity: Our information platform crew has deep operational experience working massive Cassandra clusters in manufacturing.

Nonetheless, utilizing Cassandra at this scale introduces commerce‑offs for TimeSeries workloads. A key problem is broad partitions, as TimeSeries dataset partitions can develop fairly massive with occasions accumulating over time.

This drawback is additional compounded by the truth that TimeSeries servers routinely take care of a really excessive learn throughput:

Press enter or click on to view picture in full measurement
Reads/second for various datasets

This put up walks by way of our journey to cut back the impression of broad partitions in our TimeSeries datasets, the options we constructed, and the teachings we realized.

Word: Though this put up walks by way of re-partitioning in Cassandra, the identical strategies may be utilized extra broadly to different information shops.

Impression of Broad Partitions

For many of our datasets, we observe a mean learn latency within the order of single-digit milliseconds:

Press enter or click on to view picture in full measurement
Best Latency for Reads (ms)

Nonetheless, in some datasets, as partitions develop too broad, we observe excessive learn latencies within the order of seconds, particularly in direction of the tail finish:

Press enter or click on to view picture in full measurement
Excessive Tail Latency for Reads (seconds)

This may end up in timeouts:

Press enter or click on to view picture in full measurement
Learn timeouts / second

In excessive instances, if many of the reads goal broad partitions, we are able to see Rubbish Assortment pauses, excessive CPU utilization and thread queueing.

Press enter or click on to view picture in full measurement
Excessive CPU utilization and thread-queueing in Cassandra clusters

Scaling up the underlying Cassandra cluster is all the time an choice, however we want smarter alternate options than simply throwing more cash on the drawback.

TimeSeries Partitioning Technique

The TimeSeries Abstraction was designed to resolve the issue of broad partitions by dividing the info into discrete time chunks. For extra in-depth info, consult with our earlier weblog.

To summarize, right here is an illustration of how TimeSeries partitioning technique helps us break up broad partitions into manageable chunks.

Press enter or click on to view picture in full measurement
Time Collection partitioning breaking apart a dataset into Time slices, time buckets and occasion buckets

This technique additional permits us to effectively question and drop information based mostly on time, with out having to take care of tombstones.

Choosing the Partitioning Technique

When a namespace (a.ok.a. dataset) is created, customers should specify their anticipated workload traits. This specification is then fed into our provisioning pipeline. The pipeline processes these inputs, runs Monte Carlo simulations, and produces an optimum infrastructure and partition configuration.

Press enter or click on to view picture in full measurement
Provisioning picks optimum infra and configuration based mostly on person inputs

You may be taught extra about our methodology of capability planning on this insightful AWS re:Invent discuss given by certainly one of our beautiful colleagues.

The Downside with the Present Strategy

Though this methodology of provisioning is efficient in lots of conditions, it proves inadequate for TimeSeries workloads underneath these circumstances:

  • Workload is unknown or inaccurately estimated: Early on in a venture, customers can lack a dependable image of manufacturing visitors or just misestimate key parameters.
  • Workload evolves over time: Visitors patterns, consumer habits, and product necessities change. A “good” partitioning technique on day one can turn out to be inefficient months later.
  • Knowledge outliers exist: Not all TimeSeries IDs behave the identical. A small share of IDs can obtain a vastly larger quantity of occasions than the remaining.

Fortuitously, our design with discrete Time Slices provides us a pure escape hatch for the primary two situations; every new Time Slice can use a unique partitioning technique.

Press enter or click on to view picture in full measurement
Every Time Slice can have a singular partition technique

Nonetheless, manually adjusting these configurations in a fleet that has 1000’s of TimeSeries datasets just isn’t sustainable. We want automation.

Resolution 1: Time Slice Re-Partitioning

Cassandra exposes helpful introspection APIs for understanding information utilization and entry patterns. For instance, nodetool tablehistograms present percentile distributions for partition sizes in a desk. Utilizing these instruments, we are able to detect instances of each over and underneath partitioning.

Under is an instance of over‑partitioning, the place the TimeSeries provisioning pipeline chosen very small time_bucket intervals based mostly on person offered inputs:

Press enter or click on to view picture in full measurement
Provisioning chosen 60s time buckets based mostly on person inputs

inflicting partitions to have lower than 10 KB of information, resulting in excessive learn amplification and thread queueing:

Press enter or click on to view picture in full measurement
Histogram of the given Cassandra desk displaying partition measurement percentiles

So as to tune partition methods effectively, we added a background employee, which displays partition histograms of Time Slices connected to a given software, and exposes it by way of a Cassandra digital desk:

Press enter or click on to view picture in full measurement
Histograms uncovered by way of a Cassandra Digital desk

It then computes an adjustment issue when it detects partition sizes not assembly a configured density. This configured density is usually set between 2 MiB to 10 MiB relying on the workload.

DynamicTimeSliceConfigWorker: 
namespace: my_dataset_1
Noticed: TimeSlices have p99 partitions beneath configured goal of 10MB.
Proposed: time_bucket interval: 60s -> 604800s

The employee can then replace future Time Slices with the brand new partition technique:

Press enter or click on to view picture in full measurement
Partitioning adjusted for future Time Slice(s)

This technique has yielded actual leads to decreasing our learn latencies, in addition to decreasing the variety of timeouts attributable to thread queueing.

Press enter or click on to view picture in full measurement
Discount in tail latency and thread queueing for

Nonetheless, this technique solely works if many of the information displays such habits that warrants re-partitioning of all the desk. It doesn’t work in instances the place solely a share of IDs inside the desk are broad.

We now have a few choices right here:

  • Do Nothing: That is generally the correct method if there is no such thing as a noticed impression to the appliance’s top-level metrics.
  • Partial Returns: We carried out a ‘Partial Return’ characteristic, which aborts an inflight request if it has breached a configured latency SLO, whereas returning no matter information it has collected up till that time. This can be a nice choice for shoppers who care extra about latency than fetching all the info.
Press enter or click on to view picture in full measurement
Tail latency drops across the SLO cutoff as Partial Returns are enabled
  • Block IDs: That is an excessive step however value mentioning, as a result of we do take care of unhealthy information that often seeps into the system e.g. check or spam IDs that may make the system unstable.
dgwts.config.<dataset>.block.Ids: "<tsid-1>, <tsid-2>, <tsid-3>"

Finally, we encounter situations the place legitimate and vital TimeSeries IDs accumulate a excessive sufficient quantity of occasions, with callers needing to course of all of the associated information. Merely tolerating elevated latencies or timeouts when querying these IDs just isn’t a fascinating end result.

That is the place dynamic partitioning comes into play.

Resolution 2: Dynamic Partitioning per ID

Dynamic partitioning is an asynchronous pipeline that auto-detects and splits broad partitions on a TimeSeries ID stage somewhat than on the desk stage.

Get Netflix Know-how Weblog’s tales in your inbox

Be a part of Medium at no cost to get updates from this author.

It has three fundamental phases:

  • Detection: Detects broad partitions for a given TimeSeries ID throughout the learn path.
  • Planning & Splitting: Plans and executes splits of these partitions into optimum sizes asynchronously.
  • Serving Reads: Re-routes the learn queries transparently to learn information from the cut up partitions when prepared.

That is the way it works at a excessive stage; we’ll dive into particulars after:

Press enter or click on to view picture in full measurement
Dynamic Broad Partition Break up Async Pipeline

Listed below are the totally different phases of the pipeline:

Detection

Each TimeSeries learn operation tracks what number of bytes are learn for a given partition. If the bytes learn exceed a configured threshold, the server emits a detection occasion to Kafka:

{
"time_slice": "data_20260328", // the Cassandra desk this occasion was detected in
"time_series_id": "profileId:123", // the ID detected as broad
"time_bucket": 7, // the present time_bucket partition
"event_bucket": 2, // the present event_bucket partition
"immutable": true, // TimeSeries servers can compute if this partition is now not receiving writes
"model": "0" // reserved for future use e.g. invalidate if partition is now not immutable
}

Our determination to detect broad partitions on reads, versus writes, is predicated on our remark that almost all of the info within the wild doesn’t want this therapy. The slight draw back is that some reads on these massive partitions might endure sub-optimal efficiency for a really quick period (sometimes seconds) till this course of catches up.

Immutability

Though splitting mutable partitions is feasible, it’s inherently extra advanced. As a primary step in direction of fixing this drawback, we selected to cut back the floor space of this variation by specializing in immutable partitions, whereas nonetheless meaningfully decreasing caller timeouts.

Planning

Detection might happen based mostly on a partial learn, so the planner should nonetheless learn all the partition as soon as to compute an correct cut up plan. The checkpointing turns into essential right here. For planning reads that fail to course of all the partition, the method can all the time proceed from the final saved checkpoint.

Checkpointing

The wide_row metadata desk serves because the spine for state transitions and checkpointing of partition splits. It additionally shops info that’s used later by TimeSeries servers to correctly route Learn queries.

Press enter or click on to view picture in full measurement
wide_row metadata for storing cut up states and checkpoints

Splitting

The Planner delegates the splitting of information to an applicable split-strategy. For instance, if EventBucketPartitionSplitStrategy is chosen, we cut up the partition by assigning extra occasion buckets to the identical time bucket. If the partition is ultra-wide, we cap the variety of occasion buckets we cut up into, so as to management the resultant learn amplification. Spreading into a number of partitions in such instances continues to be helpful so as to unfold the learn workload to a number of Cassandra replicas.

Press enter or click on to view picture in full measurement
Break up by assigning extra occasion buckets for a given time bucket

Additional, because the Splitter has the total view of the partition, it may guarantee whole type order throughout all of the cut up buckets.

Validating Splits

The Planner shops a pre-split checksum of a given partition throughout the planning section, whereas the Splitter computes and shops the post-split checksum. The cut up standing is marked as accomplished provided that the 2 checksums match.

Press enter or click on to view picture in full measurement
Guarantee checksums match pre- and post-split earlier than marking a cut up as COMPLETED

Monitoring Splits

The pre- and post-split partition sizes throughout totally different datasets are tracked to see how successfully the partition splits are being deliberate and executed:

Press enter or click on to view picture in full measurement
Observe pre- and post-split partition sizes to make sure we’re splitting optimally

Serving Reads

The TimeSeries servers load the partition-keys of accomplished splits periodically into in-memory Bloom filters. Each learn operation checks the Bloom filter to see whether or not a question may be diverted to the cut up partitions.

Here’s what the Learn path appears to be like like:

Press enter or click on to view picture in full measurement
Learn path for diverting reads to present or cut up partitions

The dimensions of the Bloom filters is monitored to make sure we have now sufficient reminiscence per server. Because of the compactness of partition keys, and ratio of broad partitions in a given dataset, the filters match comfortably in every server occasion.

Press enter or click on to view picture in full measurement
Bloom filter approximate ingredient rely per namespace and time slice

The Bloom filter latency to test whether or not a given partition secret’s broad for each learn request is often in single-digit microseconds or higher, making this diversion virtually invisible to the callers.

Press enter or click on to view picture in full measurement
Latency for checking Bloom filters is extraordinarily small for callers to note the diversion

For the instances that do find yourself with a Bloom filter hit, the TimeSeries servers lookup the wide_row metadata to see learn how to learn a selected broad partition:

{
"pre_split_data": {
"time_slice": "data_20260328",
"time_series_id": "6313825", → What to learn
"time_bucket": 0,
"event_bucket": 2
…
},
"post_split_data": {
"time_slice": "wide_data_20260328_0", → The place to learn it from
"event_bucket_partition_strategy": { → Technique to delegate to for studying
"target_event_buckets": 2,
"start_event_bucket": 32 → How ought to the technique learn it
}
…
}

This metadata learn is backed by a read-through cache, making it fairly performant:

Press enter or click on to view picture in full measurement
Metadata fetch latency is sort of low to have an effect on learn operations

Lastly, the reads for the cut up partitions are delegated to our present PartitionReader, which reads N smaller partitions in parallel, somewhat than 1 massive partition, enhancing general efficiency and stability!

Press enter or click on to view picture in full measurement
Learn a lot smaller partitions in parallel and merge outcomes

Fallbacks

The prevailing broad partition from the unique time slice isn’t deleted. This helps us in creating protected fallbacks in many various situations of partial failures and eventual consistency. The marginally bigger cupboard space we use consequently is well worth the operational security we achieve.

Constructing Extra Confidence

Serving incorrect reads can be disastrous. To ascertain belief past checksums, we leveraged extra mechanisms reminiscent of:

  • Utilizing our present Knowledge Bridge pipelines to confirm splits offline:
Press enter or click on to view picture in full measurement
Spark job to make sure that the cut up information is a precise match to the unique information
  • Implementing a phased rollout technique to soundly advance by way of phases as our confidence within the system grew:
Press enter or click on to view picture in full measurement
Advance by way of Learn modes as soon as earlier mode passes checks

A crucial a part of this phased rollout was the Comparability section, which in contrast bytes served by outdated learn path and the brand new learn path whereas in shadow mode:

Press enter or click on to view picture in full measurement
A chart of bytes match vs bytes differ in a given shadow interval

Outcomes

On account of these dynamic splits, we see an enormous enchancment within the common learn latency of most broad partitions, bringing it down from seconds:

Press enter or click on to view picture in full measurement
Current common latency for studying broad partitions

to low double-digit milliseconds!

Press enter or click on to view picture in full measurement
Common latency for studying dynamically cut up partitions

Tail latencies of studying broad partitions dropped from a number of seconds:

Press enter or click on to view picture in full measurement
Current tail latency for studying broad partitions

to round 200 ms or higher:

Press enter or click on to view picture in full measurement
Tail latency for studying dynamically cut up partitions

leading to a drop in learn timeouts:

Press enter or click on to view picture in full measurement

Total, this has resulted in a extra steady Cassandra cluster with decrease CPU utilization and little to no thread queuing:

Press enter or click on to view picture in full measurement
Low CPU utilization and no thread-queueing

Additional, for excessive broad rows, the place a dataset would face fixed timeouts and unavailability blips, the service was in a position to paginate and question 500MB+ partitions whereas remaining accessible:

grpc … com.netflix.dgw.ts.TimeSeriesService/SearchEventRecords -d
'{"namespace": "...",
"search_query": {...},
"time_interval": {
"begin": "2026–05–11T23:42:51.484398Z",
"finish": "2026–05–12T00:13:50.694205Z"
},
"pageSize" : 1000,
}'
# Response:
{
"next_page_token" : ….,
"data": [
{
…
}
],
"response_context": [{
"namespace": "...",
…
# Trades elevated latency for being available
"time_taken": "41.072410142s"
}
]
}

Conclusion

There may be extra work deliberate round this characteristic, like splitting mutable broad partitions, or re-processing beforehand failed splits, however this has been a profitable begin in enhancing service efficiency and decreasing our assist burden.

Additional, we want to spotlight some key classes that we realized at totally different factors on this journey.

  • Decreasing Floor Space: As a primary step, discover less complicated options that may nonetheless ship significant impression. Additionally, decreasing the floor space of a fancy change and deploying incrementally pays off operationally.
  • Constructing Confidence: Make investments time and sources to construct confidence in new options, particularly when justified by the characteristic complexity, deployment blast radius, and/or potential impression.

Acknowledgements: Particular due to our beautiful colleagues who additional contributed to this characteristic’s success: Tom DeVoe, Chris Lohfink, Sumanth Pasupuleti and Joey Lynch.



Source link

Blog Dynamic Jun Netflix Repartitioning Series Technology time Workloads
Share. Facebook Twitter Pinterest LinkedIn Tumblr WhatsApp Email
Previous ArticleIs THIS Why Prince Harry Was Left Off Guest List For Cousin’s Wedding Despite Prince William Landing An Invite?!
Next Article Before Off Campus, Belmont Cameli Starred In A Failed Reboot Of A Classic ’90s Sitcom
Team Entertainer
  • Website

Related Posts

Al Pacino’s 1979 ‘…And Justice for All’ Headed to Netflix As TV Series

June 3, 2026

Dan Levy Refuses to Watch the Modern Wave of “Island” Dating Shows

June 2, 2026

Watch The Full Trailer For Netflix’s Harlan Coben Series

June 1, 2026

Is the HBO Series Coming Back After Season 3? – Hollywood Life

June 1, 2026
Recent Posts
  • The Best Science Fiction Film From Each Year
  • 16 He-Man Action Figures of the ’80s and What Made Them Genius
  • 17 ‘Obsession’ Secrets That Change Everything
  • Former Disney Channel megastars who completely turned their backs on fame

Archives

  • June 2026
  • May 2026
  • April 2026
  • March 2026
  • February 2026
  • January 2026
  • December 2025
  • November 2025
  • October 2025
  • September 2025
  • August 2025
  • July 2025
  • June 2025
  • May 2025
  • April 2025
  • March 2025
  • February 2025
  • January 2025
  • December 2024
  • November 2024
  • October 2024
  • September 2024
  • August 2024
  • July 2024
  • June 2024
  • May 2024
  • April 2024
  • March 2024
  • February 2024
  • January 2024
  • December 2023
  • November 2023
  • October 2023
  • September 2023
  • August 2023
  • July 2023
  • June 2023
  • May 2023
  • April 2023
  • March 2023
  • February 2023
  • January 2023
  • December 2022
  • November 2022
  • October 2022
  • September 2022
  • August 2022
  • July 2022
  • June 2022
  • May 2022
  • April 2022
  • March 2022
  • February 2022
  • January 2022
  • December 2021
  • November 2021
  • October 2021
  • September 2021
  • August 2021
  • July 2021

Categories

  • Actress
  • Awards
  • Behind the Camera
  • BollyBuzz
  • Celebrity
  • Edit Picks
  • Glam & Style
  • Global Bollywood
  • In the Frame
  • Insta Inspector
  • Interviews
  • Movies
  • Music
  • News
  • News & Gossip
  • News & Gossips
  • OTT
  • Podcast
  • Power & Purpose
  • Press Release
  • Spotlight Stories
  • Spotted!
  • Star Luxe
  • Television
  • Trending
  • Uncategorized
  • Web Series
NAVIGATION
  • About us
  • Advertise with us
  • Submit Articles
  • Privacy Policy
  • Contact us
  • About us
  • Disclaimer
  • Privacy Policy
  • DMCA
  • Cookie Privacy Policy
  • Terms and Conditions
  • Contact us
Copyright © 2026 Entertainer.

Type above and press Enter to search. Press Esc to cancel.

Sign In or Register

Welcome Back!

Login to your account below.

Lost password?