Ram Srivasta Kannan, Wale Akintayo, Jay Bharadwaj, John Crimmins, Shengwei Wang, Zhitao Zhu
Introduction
In 2024, the Online Data Stores team at Netflix conducted a comprehensive evaluation of the relational database technologies used across the company. This assessment examined functionality, performance, and total cost of ownership across our database ecosystem. Based on this analysis, we decided to standardize on Amazon Aurora PostgreSQL as the primary relational database offering for Netflix teams.
Several key factors influenced this decision:
- PostgreSQL already underpinned the majority of our relational workloads, which made it a natural foundation for standardization. Internal evaluations showed that Aurora PostgreSQL could support over 95% of the applications and workloads running on other relational databases across our internal services.
- Industry momentum had continued to shift toward PostgreSQL, driven by its open ecosystem, strong community support, and broad adoption across modern data platforms.
- Aurora's cloud-native, distributed architecture provided clear advantages in scalability, high availability, and elasticity compared to traditional single-node PostgreSQL deployments.
- Aurora PostgreSQL offered a rich feature set, along with a strong, forward-looking roadmap aligned with the needs of large-scale, globally distributed applications.
A Clear Migration Path Forward
As part of this strategic shift, one of our key initiatives for 2024/2025 was migrating existing users to Aurora PostgreSQL. This effort began with RDS PostgreSQL migrations and will expand to include migrations from other relational systems in subsequent phases.
As a data platform team, our goal is to make this evolution predictable, well-supported, and minimally disruptive. This allows teams to adopt Aurora PostgreSQL at a pace that aligns with their product and operational roadmaps, while we move toward a unified and scalable relational data platform across the organization.
Database Migration: More Than a Simple Transfer
Migrating a database involves far more than copying rows from one system to another. It is a coordinated process of transitioning both data and database functionality while preserving correctness, availability, and performance. At scale, a well-designed migration must minimize disruption to applications and ensure a clean, deterministic handoff from the old system to the new one.
Most database migrations follow a common set of high-level steps:
- Data Replication: Data is first copied from the source database to the destination, typically using replication, so that ongoing changes are continuously captured and applied.
- Quiescence: Write traffic to the source database is halted, allowing the destination to fully catch up and eliminate any remaining divergence.
- Validation: The system verifies that the source and destination databases are fully synchronized and contain identical data.
- Cutover: Client applications are reconfigured to point to the destination database, which becomes the new primary source of truth.
Challenges
Operational Challenges
Migrating to a new relational database at Netflix scale presents substantial operational challenges. With a fleet approaching 400 PostgreSQL clusters, manually migrating each one is not scalable for the data platform team. Such an approach would require a significant amount of time, introduce the risk of human error, and demand considerable hands-on engineering effort. Compounding the problem, coordinating downtime across the many interconnected services that depend on each database is extremely cumbersome at this scale.
To address these challenges, we designed a self-service migration workflow that allows service owners to run their own RDS PostgreSQL to Aurora PostgreSQL migrations. The workflow automatically handles orchestration, safety checks, and correctness guarantees end-to-end, resulting in lower operational overhead and a predictable, reliable migration experience.
Technical Challenges
- Zero data loss — We must guarantee that all data from the source cluster is fully and safely migrated to the destination within a very tight window, with no possibility of data loss.
- Minimal downtime — Some downtime is unavoidable during migration, as applications must briefly pause write traffic while cutting over to Aurora PostgreSQL. For higher-tier services that power critical parts of the Netflix ecosystem, this window must be kept extremely short to prevent user-facing impact and maintain service reliability.
- No control over client applications — As the platform team, we manage the databases, but application teams handle the read and write operations. We cannot assume that they have the ability to pause writes on demand, nor do we want to expose such controls to them, as mistakes could lead to data inconsistencies post-migration. Therefore, building a self-service migration pipeline requires creative control-plane solutions to halt traffic, ensuring that no writes occur during the validation and cutover phases.
- No direct access to RDS credentials — The migration automation must perform replication, quiescence, and validation without requesting database credentials from users or relying on manual authentication. Source databases are often tightly secured, allowing access only from client applications, but more importantly, requiring credential access — even if it were possible — would significantly increase operational overhead and risk. At the same time, the migration platform may operate in environments without direct access to the source database, making traditional verification or parity checks impossible.
- No Degradation in Performance — The migration process must not impact the performance or stability of production databases once they are running in the Aurora PostgreSQL ecosystem.
- Full Ecosystem Parity — Beyond migrating the core database, related components such as parameter groups, read replicas, and replication slots must also be migrated to ensure functional equivalence.
- Minimal User Effort — Since we rely on teams who are not database experts to perform migrations, the process must be simple, intuitive, and fully self-guided.
AWS-recommended migration strategies
Using a snapshot
One of the simplest AWS-recommended approaches for migrating from RDS PostgreSQL to Aurora PostgreSQL is based on snapshots. In this model, write traffic to the source PostgreSQL database is first stopped. A manual snapshot of the RDS PostgreSQL instance is then taken and migrated to Aurora, where AWS converts it into an Aurora-compatible format.
Once the conversion completes, a new Aurora PostgreSQL cluster is created from the snapshot. After the cluster is brought online and validated, application traffic is redirected to the Aurora endpoint, completing the migration.
Reference
Using an Aurora read replica
In the read-replica–based approach, an Aurora PostgreSQL read replica is created from an existing RDS PostgreSQL instance. AWS establishes continuous, asynchronous replication from the RDS source to the Aurora replica, allowing ongoing changes to be streamed in near real time.
Because replication runs continuously, the Aurora replica stays closely synchronized with the source database. This allows teams to provision and validate the Aurora environment — including configuration, connectivity, and performance characteristics — while production traffic continues to flow to the source.
When the replication lag is sufficiently low, write traffic is briefly paused to allow the replica to fully catch up. The Aurora read replica is then promoted to a standalone Aurora PostgreSQL cluster, and application traffic is redirected to the new Aurora endpoint. This approach significantly reduces downtime compared to snapshot-based migrations and is well-suited for production systems that require minimal disruption.
These differences represent the key considerations when choosing a migration strategy from RDS PostgreSQL to Aurora PostgreSQL. For our automation, we opted for the Aurora read replica approach, trading increased implementation complexity for a significantly shorter downtime window for client applications.
In Netflix's RDS setup, a Data Access Layer (DAL) sits between applications and backend databases, acting as middleware that centralizes database connectivity, security, and traffic routing on behalf of client applications.
On the client side, applications connect through a forward proxy that manages mutual TLS (mTLS) authentication and establishes a secure tunnel to the Data Gateway service. The Data Gateway, acting as a reverse proxy for database servers, terminates client connections, enforces centralized authentication and authorization, and forwards traffic to the appropriate RDS PostgreSQL instance.
This layered design ensures that applications never handle raw database credentials, provides a consistent and secure access pattern across all datastore types, and delivers isolated, transparent connectivity to managed PostgreSQL clusters. While the primary goal of this architecture is to enforce strong security controls and standardize how applications access external AWS data stores, it also allows backend databases to be switched transparently through configuration, enabling controlled, low-downtime migrations.
Migration Process
The platform team's goal is to deliver a fully automated, self-service workflow that supports the migration of customer RDS PostgreSQL instances to Aurora PostgreSQL clusters. This migration tool orchestrates the entire process — from preparing the source environment, initializing the Aurora read replica, and maintaining continuous synchronization, all the way through cutover — without requiring any database credentials or manual intervention from the customer.
Designed for minimal downtime and a seamless user experience, the workflow ensures full ecosystem parity between RDS and Aurora, preserving performance characteristics and operational behavior while enabling customers to benefit from Aurora's improved scalability, resilience, and cost efficiency.
Data Replication Phase
Enable Automated Backups
Automated backups must be enabled on the source database because the Aurora read replica is initialized from a consistent snapshot of the source and then kept in sync through continuous replication. Automated backups provide the stable snapshot required to bootstrap the replica, along with the continuous streaming of write-ahead log (WAL) records needed to keep the read replica closely synchronized with the source.
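As a rough sketch of how this precondition could be checked and enforced, the function below follows the boto3 RDS API shape (describe_db_instances / modify_db_instance). The function name and the 7-day retention default are our own illustrative choices, not part of the actual workflow; the client argument is assumed to be a boto3 RDS client.

```python
def ensure_automated_backups(rds, instance_id: str, retention_days: int = 7) -> bool:
    """Enable automated backups on the source RDS instance if disabled.

    A BackupRetentionPeriod of 0 means automated backups are off, which
    would block Aurora read-replica creation. Returns True if a change
    was requested, False if backups were already enabled.
    """
    desc = rds.describe_db_instances(DBInstanceIdentifier=instance_id)
    current = desc["DBInstances"][0]["BackupRetentionPeriod"]
    if current > 0:
        return False  # backups already enabled; nothing to do
    rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        BackupRetentionPeriod=retention_days,
        ApplyImmediately=True,
    )
    return True
```

In a real run, `rds` would be `boto3.client("rds")`; keeping the client as a parameter also makes the check easy to exercise against a stub.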
Port RDS parameters to an Aurora parameter group
We create a dedicated Aurora parameter group for each cluster and migrate all RDS-compatible parameters from the source RDS instance. This ensures that the Aurora cluster inherits the same configuration settings — such as memory configuration, connection limits, query planner behavior, and other PostgreSQL engine parameters that have equivalents in Aurora. Parameters that are unsupported or behave differently in Aurora are either omitted or adjusted according to Aurora best practices.
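The selection logic can be sketched as a pure filter over the parameter list that `describe_db_parameters` returns. The exclusion set below is illustrative only (Aurora manages storage- and archive-related settings itself); the real workflow's list is more extensive.

```python
# Illustrative, non-exhaustive set of settings Aurora manages or disallows.
AURORA_UNSUPPORTED = {"shared_buffers", "wal_buffers", "archive_command"}

def portable_parameters(rds_params, unsupported=AURORA_UNSUPPORTED):
    """Select user-modified RDS parameters that can be copied verbatim
    into an Aurora cluster parameter group.

    Only parameters the user actually changed (Source == 'user') are
    ported; everything else is left to Aurora's own defaults.
    """
    ported = []
    for p in rds_params:
        if p.get("Source") != "user":
            continue  # untouched parameter: keep the Aurora default
        if p["ParameterName"] in unsupported:
            continue  # Aurora manages or disallows this setting
        ported.append({
            "ParameterName": p["ParameterName"],
            "ParameterValue": p["ParameterValue"],
            "ApplyMethod": "pending-reboot",
        })
    return ported
```

The resulting list has the shape expected by `modify_db_cluster_parameter_group`, so it can be applied to the new Aurora parameter group directly.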
Create an Aurora read replica cluster and instance
Creating an Aurora read replica cluster is a critical step in migrating from RDS PostgreSQL to Aurora PostgreSQL. At this stage, the Aurora cluster is created and attached to the RDS PostgreSQL primary as a replica, establishing continuous replication from the source RDS PostgreSQL instance. These Aurora read replicas stay nearly in sync with ongoing changes by streaming write-ahead logs (WAL) from the source, enabling minimal downtime during cutover. The cluster is fully operational for validation and performance testing, but it is not yet writable — RDS remains the authoritative primary.
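In boto3 terms, this step is a `create_db_cluster` call with `ReplicationSourceIdentifier` pointing at the source RDS instance, followed by `create_db_instance` to add a node to the new cluster. The sketch below assumes a boto3 RDS client and uses hypothetical identifiers; the real automation layers waiters and safety checks around these calls.

```python
def create_aurora_replica(rds, source_arn, cluster_id, instance_class,
                          parameter_group, engine_version):
    """Create an Aurora PostgreSQL cluster as a read replica of an RDS
    instance, then add one DB instance to serve the cluster."""
    rds.create_db_cluster(
        DBClusterIdentifier=cluster_id,
        Engine="aurora-postgresql",
        EngineVersion=engine_version,
        ReplicationSourceIdentifier=source_arn,  # attach as replica of RDS
        DBClusterParameterGroupName=parameter_group,
    )
    rds.create_db_instance(
        DBInstanceIdentifier=f"{cluster_id}-instance-1",
        DBClusterIdentifier=cluster_id,
        Engine="aurora-postgresql",
        DBInstanceClass=instance_class,
    )
```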
Quiescence Phase
The goal of the quiescence phase is to transition client applications from the source RDS PostgreSQL instance to the Aurora PostgreSQL cluster as the new primary database, while preserving data consistency during cutover.
The first step in this process is to stop all write traffic to the source RDS PostgreSQL instance to guarantee consistency. To achieve this, we instruct users to halt application-level traffic, which helps prevent issues such as retry storms, queue backlogs, or unnecessary resource consumption when connectivity changes during cutover. This coordination also gives teams time to prepare operationally, for example, by suppressing alerts, notifying downstream consumers, or communicating planned maintenance to their customers.
However, relying solely on application-side controls is unreliable. Operational gaps, misconfigurations, or lingering connections can still modify the source database state, potentially resulting in changes that are not replicated to the destination and leading to data inconsistency or loss. To enforce a clean and deterministic cutover, we also block traffic at the infrastructure layer. This is done by detaching the RDS instance's security groups to prevent new inbound connections, followed by a reboot of the instance. With security groups removed, no new SQL sessions can be established, and the reboot forcibly terminates any existing connections.
This approach deliberately avoids requiring database credentials or logging into the PostgreSQL server to manually terminate connections. While it may be slower than application- or database-level intervention, it provides a reliably automated and repeatable mechanism to fully quiesce the source RDS PostgreSQL instance before Aurora promotion, eliminating the risk of divergent writes or an inconsistent WAL state.
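A minimal sketch of this infrastructure-level block, assuming a boto3 RDS client: since an RDS instance must keep at least one VPC security group, "detaching" is modeled here as swapping in a quarantine group with no inbound rules (our own interpretation, not necessarily how the workflow implements it). The function and security group names are hypothetical.

```python
def quiesce_rds_instance(rds, instance_id, quarantine_sg):
    """Block new connections by swapping in a no-ingress security group,
    then reboot to terminate existing sessions. No DB credentials needed."""
    rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        VpcSecurityGroupIds=[quarantine_sg],  # SG with no inbound rules
        ApplyImmediately=True,
    )
    # Wait for the modification to settle before rebooting.
    waiter = rds.get_waiter("db_instance_available")
    waiter.wait(DBInstanceIdentifier=instance_id)
    rds.reboot_db_instance(DBInstanceIdentifier=instance_id)
```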
Validation Phase
To determine whether the Aurora read replica has fully caught up with the source RDS PostgreSQL instance, we monitor replication progress using Aurora's OldestReplicationSlotLag metric. This metric represents how far the Aurora replica is behind the source in applying write-ahead log (WAL) records.
Once client traffic is halted during quiescence, the source RDS PostgreSQL instance stops generating meaningful WAL entries. At that point, the replication lag should converge to zero, indicating that all WAL records corresponding to real writes have been fully replayed on Aurora.
However, in practice, our experiments show that the metric never settles at a steady zero. Instead, it briefly drops to 0, then quickly returns to 64 MB, repeating this pattern every few minutes as shown in the figure below.
This behavior stems from how OldestReplicationSlotLag is calculated. Internally, the lag is derived using the following query:
SELECT
  slot_name,
  pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS slot_lag_bytes
FROM pg_replication_slots;

Conceptually, this translates to:

OldestReplicationSlotLag = current_WAL_position_on_RDS – restart_lsn

See AWS references here and here.
The restart_lsn represents the oldest write-ahead log (WAL) record that PostgreSQL must retain to ensure a replication client can safely resume replication.
When PostgreSQL performs a WAL segment switch, Aurora typically catches up almost immediately. At that moment, the restart_lsn briefly matches the source's current WAL position, causing the reported lag to drop to 0. During idle periods, PostgreSQL performs an empty WAL segment rotation roughly every 5 minutes, driven by the archive_timeout = 300s setting in the database parameter group.
Immediately afterward, PostgreSQL begins writing to the new WAL segment. Since this new segment has not yet been fully flushed or consumed by Aurora, the WAL position on the source RDS PostgreSQL advances ahead of Aurora PostgreSQL's restart_lsn by exactly one segment. As a result, OldestReplicationSlotLag jumps to 64 MB, which corresponds to the WAL segment size configured at database initialization, and stays there until the next segment switch occurs.
Because idle PostgreSQL performs an empty WAL rotation roughly every 5 minutes, this zero-then-64 MB oscillation is expected. Importantly, the moment the lag drops to 0 signifies that all meaningful WAL records have been fully replicated, and the Aurora read replica is fully caught up with the source.
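The catch-up rule described above can be expressed as a small predicate over a series of OldestReplicationSlotLag samples: once a 0 reading appears, every subsequent sample should be either 0 or exactly one empty segment (64 MB). This is our own sketch of the decision logic, not the workflow's actual code; the samples would come from CloudWatch.

```python
SEGMENT_BYTES = 64 * 1024 * 1024  # WAL segment size set at database init

def replica_caught_up(lag_samples):
    """Decide from OldestReplicationSlotLag samples (bytes, oldest first)
    whether the replica is caught up: at least one 0 reading, and every
    later sample is 0 or exactly one empty segment — the expected idle
    0 -> 64 MB oscillation. Any other value means real WAL is pending."""
    if 0 not in lag_samples:
        return False
    first_zero = lag_samples.index(0)
    return all(s in (0, SEGMENT_BYTES) for s in lag_samples[first_zero:])
```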
Cutover Phase
Once the Aurora read replica has fully caught up with the source RDS PostgreSQL instance — as confirmed through replication lag analysis — the final step is to promote the replica and redirect application traffic. Promoting the Aurora read replica converts it into an independent, writable Aurora PostgreSQL cluster with its own writer and reader endpoints. At this point, the source RDS PostgreSQL instance is no longer the authoritative primary and is made inaccessible.
Because Netflix's RDS ecosystem is fronted by a Data Access Layer (DAL), consisting of client-side forward proxies and a centralized Data Gateway, switching databases does not require application code changes or database credential access. Instead, traffic redirection is handled entirely through configuration updates in the reverse-proxy layer. Specifically, we update the runtime configuration of the Envoy-based Data Gateway to route traffic to the newly promoted Aurora cluster. Once this configuration change propagates, all client-initiated database connections are transparently routed through the DAL to the Aurora writer endpoint, completing the migration without requiring any application changes.
This proxy-level cutover, combined with Aurora promotion, enables a seamless transition for service owners, minimizes downtime, and preserves data consistency throughout the migration process.
Customer Experience: Migrating a Business-Critical Partner Platform
One of the key teams to adopt the RDS PostgreSQL to Aurora PostgreSQL migration workflow was the Enablement Applications team. This team owns a set of databases that model Netflix's entire ecosystem of partner integrations, including device manufacturers, discovery platforms, and distribution partners. These databases power a set of enterprise applications that partners worldwide rely on to build, test, certify, and launch Netflix experiences on their devices and services.
Because these databases sit at the center of Netflix's partner enablement and certification workflows, they are consumed by a diverse set of client applications across both internal and external organizations. Internally, reliability teams use this data to identify streaming failures for specific devices and configurations, supporting quality improvements across the device ecosystem. At the same time, these databases directly serve external partners operating across many regions. Device manufacturers rely on them to configure, test, and certify new hardware, while payment partners use them to set up and launch bundled offerings with Netflix.
Device Lifecycle Management
Netflix works with a wide range of device partners to ensure Netflix streams seamlessly across a diverse ecosystem of consumer devices. A core responsibility of Device Lifecycle Management is to provide tools and workflows that allow partners to develop, test, and certify Netflix integrations on their devices.
As part of the device lifecycle, partners run Netflix-provided test suites against their NRDP implementation. We store signals that represent the current stage of each device in the certification process. This certification data forms the backbone of Netflix's device enablement program, ensuring that only validated devices can launch Netflix experiences.
Partner Billed Integrations
In addition to device enablement, the same partner metadata is also consumed by Netflix's Partner Billed Integrations team. This team enables external partners to offer Netflix as part of bundled subscription and billing experiences.
Any disruption in these databases impacts partner integration workflows. If the database is unavailable, partners may be unable to configure or launch service bundles with Netflix. Maintaining high availability and data correctness is essential to keeping integration operations running smoothly.
The global nature of these workflows makes it difficult to schedule downtime windows. Any disruption would impact partner productivity and risk eroding trust in Netflix's integration and certification processes.
Preparation
Given the criticality of the Enablement Applications databases, thorough preparation was essential before initiating the migration. The team invested significant effort upfront to understand traffic patterns, identify all consumers, and establish clear communication channels.
Understand Client Fan-Out and Traffic Patterns
The first step was to gain a complete view of how the databases were being used in production. Using observability tools like CloudWatch metrics, the team analyzed PostgreSQL connection counts, read and write patterns, and overall load characteristics. This helped establish a baseline for normal behavior and ensured there were no unexpected traffic spikes or hidden dependencies that could complicate the migration.
Just as importantly, this baseline gave the Enablement Applications team a rough idea of the post-migration behavior on Aurora. For example, they expected to see a similar number of active database connections and comparable traffic patterns after cutover, making it easier to validate that the migration had preserved operational characteristics.
Identify and Enumerate All Database Users
Unlike most databases, where the set of clients is well-known to the owning team, these databases were accessed by a wide range of internal services and external-facing systems that were not fully enumerated upfront. To address this, we leveraged a tool called flowlogs, an eBPF-based network attribution tool that captures TCP flow data to identify the services and applications establishing connections to the database (link).
This approach allowed the team to enumerate active clients, including those that were not previously documented, ensuring no clients were missed during migration planning.
Establish Dedicated Communication Channels
Once all clients were identified, a dedicated communication channel was created to provide continuous updates throughout the migration process. This channel was used to share timelines, readiness checks, status updates, and cutover notifications, ensuring that all stakeholders remained aligned and could respond quickly if issues arose.
Migration Process
After completing application-side preparation, the Enablement Applications team initiated the data replication phase of the migration workflow. The automation successfully provisioned the Aurora read replica cluster and ported the RDS PostgreSQL parameter group to a corresponding Aurora parameter group, bringing the destination environment up with equivalent configuration.
Unexpected Replication Slot Behavior
However, shortly after replication began, we observed that the OldestReplicationSlotLag metric was unexpectedly high. This was counterintuitive, as Aurora read replicas are designed to remain closely synchronized with the source database by continuously streaming write-ahead logs (WAL).
Further investigation revealed the presence of an inactive logical replication slot on the source RDS PostgreSQL instance. An inactive replication slot can cause elevated OldestReplicationSlotLag because PostgreSQL must retain all WAL records required by the slot's last known position (restart_lsn), even when no consumer is actively reading from it. Replication slots are intentionally designed to prevent data loss by ensuring that a consumer can resume replication from where it left off. Consequently, PostgreSQL will not recycle or delete WAL segments needed by a replication slot until the slot advances. When a slot becomes inactive — such as when a consumer migration job is stopped or abandoned — the slot's position no longer moves forward. Meanwhile, the database continues to generate WAL, forcing PostgreSQL to retain increasingly older WAL files. This growing gap between the current WAL position and the slot's restart_lsn manifests as a high OldestReplicationSlotLag.
Identifying and addressing these inactive replication slots was a critical prerequisite to proceeding safely with the migration and ensuring a correct replication state during cutover.
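The slot audit itself is straightforward: query pg_replication_slots for inactive logical slots, then drop any that are confirmed abandoned with `SELECT pg_drop_replication_slot('name');`. As a sketch, the filter below operates on rows already fetched from that catalog view (the slot names are made up for illustration):

```python
# Rows would come from:
#   SELECT slot_name, slot_type, active FROM pg_replication_slots;
def stale_slots(slot_rows):
    """Return names of inactive logical replication slots, which pin WAL
    on the source and should be reviewed (and usually dropped) before
    migration. Physical slots (e.g. the Aurora replica's own slot) and
    active logical slots are left alone."""
    return [r["slot_name"] for r in slot_rows
            if r["slot_type"] == "logical" and not r["active"]]
```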
Successful Migration After Remediation
After identifying the inactive logical replication slot, the team safely cleaned it up on the source RDS PostgreSQL instance and resumed the migration workflow. With the stale slot removed, replication progressed as expected, and the Aurora read replica quickly converged with the source. The migration then proceeded smoothly through the quiescence phase, with no unexpected behavior or replication anomalies observed.
Following promotion, application traffic transitioned seamlessly to the newly writable Aurora PostgreSQL cluster. Through the Data Access Layer, new client connections were automatically routed to Aurora, and observability metrics confirmed healthy behavior — connection counts, read/write patterns, and overall load closely matched pre-migration baselines. From the application and partner perspective, the cutover was transparent, validating both the correctness of the migration workflow and the effectiveness of the preparation steps.
Open questions
How do we select target Aurora PostgreSQL instance types based on the existing production RDS PostgreSQL instance?
When selecting the target Aurora PostgreSQL instance type for a production migration, our guidance is deliberately conservative. We prioritize stability and performance first, and optimize for cost only after observing real workload behavior on Aurora.
In practice, the recommended approach is to adopt Graviton2-based instances (particularly the r6g family) whenever possible, keep the same instance family and size where feasible, and — at minimum — preserve the memory footprint of the existing RDS instance.
Unlike RDS PostgreSQL, Aurora does not support the m-series, making a direct family match impossible for these instances. In such cases, simply keeping the same "size" (e.g., 2xlarge → 2xlarge) is not meaningful because the memory profiles differ across families. Instead, we map instances by memory equivalence. For example, an Aurora r6g.xlarge provides a memory footprint comparable to an RDS m5.2xlarge, making it a practical replacement. This memory-aligned strategy offers a safer and more predictable baseline for production migrations.
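The memory-equivalence mapping can be sketched as choosing the smallest r6g instance whose memory meets or exceeds the source's. The memory figures below are the published AWS specs for these types, included here only to make the example self-contained; a production table would cover many more types.

```python
# Approximate memory (GiB) per instance type, from AWS public specs.
MEMORY_GIB = {
    "db.m5.large": 8, "db.m5.xlarge": 16, "db.m5.2xlarge": 32,
    "db.m5.4xlarge": 64,
    "db.r6g.large": 16, "db.r6g.xlarge": 32, "db.r6g.2xlarge": 64,
    "db.r6g.4xlarge": 128,
}

def aurora_target_for(rds_class):
    """Pick the smallest r6g instance with at least as much memory as the
    source RDS instance — the memory-equivalence rule described above."""
    needed = MEMORY_GIB[rds_class]
    candidates = sorted(
        (mem, name) for name, mem in MEMORY_GIB.items()
        if name.startswith("db.r6g.") and mem >= needed
    )
    if not candidates:
        raise ValueError(f"no r6g instance large enough for {rds_class}")
    return candidates[0][1]
```

For instance, this maps db.m5.2xlarge (32 GiB) to db.r6g.xlarge (32 GiB), matching the example in the text.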
Downtime During RDS → Aurora Cutover?
To achieve minimal downtime during an RDS PostgreSQL → Aurora PostgreSQL migration, we front-load as much work as possible into the preparation phase. By the time we reach cutover, the Aurora read replica is already provisioned and continuously replicating WAL from the source RDS instance. Before initiating downtime, we ensure that the replication lag between Aurora and RDS has stabilized within an acceptable threshold. If the lag is large or fluctuating significantly, forcing a cutover will only inflate downtime.
Downtime begins the moment we remove the security groups from the source RDS instance, blocking all inbound traffic. We then reboot the instance to forcibly terminate existing connections, which typically takes up to a minute. From this point forward, no writes can be performed.
After traffic is halted, the next objective is to verify that Aurora has fully replayed all meaningful WAL records from RDS. We monitor this using OldestReplicationSlotLag. We first wait for the metric to drop to 0, indicating that Aurora has consumed all WAL containing real writes. Under normal idle behavior, PostgreSQL triggers an empty WAL switch every 5 minutes. After observing one data point at 0, we wait for an additional idle WAL rotation and confirm that the lag oscillates within the expected 0 → 64 MB pattern — signifying that the only remaining WAL segments are empty ones produced during idle time. At this point, we know the Aurora replica is fully caught up and can be safely promoted.
While these validation steps run, we perform the configuration updates on the Envoy reverse proxy in parallel. Once promotion completes and Envoy is restarted with the new runtime configuration, all client-initiated connections begin routing to the Aurora cluster. In practice, the total write downtime observed across services averages around 10 minutes, dominated largely by the RDS reboot and the idle WAL switch interval.
Optimization: Reducing Idle-Time Wait
For services requiring stricter downtime budgets, waiting the full 5 minutes for an idle WAL switch can be prohibitively expensive. In such cases, we can force a WAL rotation immediately after traffic is cut off by issuing:
SELECT pg_switch_wal();
Once the switch occurs, OldestReplicationSlotLag will drop to 0 again as Aurora consumes the new (empty) WAL segment. This approach eliminates the need to wait for the default archive_timeout interval, which can significantly reduce overall downtime.
How do we migrate CDC consumers?
As part of the data platform team at Netflix, we provide a managed Change Data Capture (CDC) service across a variety of datastores. For PostgreSQL, logical replication slots are the mechanism for implementing change data capture. At Netflix, we build a managed abstraction on top of these replication slots called datamesh to manage the customers who leverage them (link).
Each logical replication slot tracks a consumer's position in the write-ahead log (WAL), ensuring that WAL records are retained until the consumer has successfully processed them. This guarantees ordered and reliable delivery of row-level changes to downstream systems. At the same time, it tightly couples the lifecycle of replication slots to database operations, making their management a critical consideration during database migrations.
A key challenge in migrating from RDS PostgreSQL to Aurora PostgreSQL is transitioning these CDC consumers safely — without data loss, stalled replication, or extended downtime — while ensuring that replication slots are correctly managed throughout the cutover process.
Each row-level change in PostgreSQL is emitted as a CDC event with an operation type of INSERT, UPDATE, DELETE, or REFRESH. REFRESH events are generated during backfills by querying the database directly and emitting the current state of rows in chunks. Downstream consumers are designed to be idempotent and eventually consistent, allowing them to safely process retries, replays, and backfills.
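The idempotency property can be illustrated with a toy consumer that materializes events into a key-to-row map: INSERT, UPDATE, and REFRESH are all applied as upserts, so replays and post-migration backfills converge to the same state. The event shape here is our own simplification, not the datamesh wire format.

```python
def apply_cdc_event(state, event):
    """Apply one CDC event to a key -> row mapping.

    INSERT/UPDATE/REFRESH are treated uniformly as upserts, which makes
    replays and backfills safe; DELETE removes the row if present."""
    op, key = event["op"], event["key"]
    if op in ("INSERT", "UPDATE", "REFRESH"):
        state[key] = event["row"]  # upsert: idempotent under replay
    elif op == "DELETE":
        state.pop(key, None)       # tolerate deletes of unseen keys
    else:
        raise ValueError(f"unknown CDC operation: {op}")
    return state
```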
Handling Replication Slots During Migration
Before initiating database cutover, we temporarily pause CDC consumption by stopping the infrastructure responsible for consuming from PostgreSQL replication slots and writing into the Data Mesh source. This also drops the replication slot from the database and cleans up our internal state around replication slot offsets, essentially resetting the connector to the state of a brand new one.
This step is critical for two reasons. First, it prevents replication slots from blocking WAL recycling during migration. Second, it ensures that no CDC consumers are left pointing at the source database once traffic is quiesced and cutover begins. While CDC consumers are paused, downstream systems temporarily stop receiving new change events, but remain stable. Once CDC consumers are paused, we proceed with stopping other client traffic and executing the RDS-to-Aurora cutover.
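The slot cleanup step can be sketched with standard PostgreSQL functions. The slot name below is a hypothetical placeholder, and in our setup this is performed by the CDC infrastructure rather than by hand:

```sql
-- A slot cannot be dropped while a walsender is attached, so the
-- consumer process must be stopped first (its backend can also be
-- terminated via pg_terminate_backend(active_pid) if needed).
SELECT active, active_pid
FROM pg_replication_slots
WHERE slot_name = 'datamesh_cdc_slot';  -- hypothetical slot name

-- Once inactive, drop the slot so it no longer pins WAL on the
-- source instance during the migration.
SELECT pg_drop_replication_slot('datamesh_cdc_slot');
```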
Reinitializing CDC After Cutover
After the Aurora PostgreSQL cluster has been promoted and traffic has been redirected, CDC consumers are reconfigured to point to the Aurora endpoint and restarted. Because their previous state was deliberately cleared, consumers initialize as if they are starting fresh.
On startup, new logical replication slots are created on Aurora, and a full backfill is performed by querying the database and emitting REFRESH events for all existing rows. These events signal to the consumer that a manual refresh was completed from Aurora and should be treated as an upsert operation. This establishes a clean and consistent baseline from which ongoing CDC can resume. Consumers are expected to handle these refresh events correctly as part of normal operation.
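The re-initialization on Aurora can be sketched as creating a fresh logical slot before the backfill begins. The slot name and output plugin here are illustrative; the actual plugin depends on the CDC connector in use:

```sql
-- Create a brand-new logical slot on the Aurora writer. WAL is
-- retained from this point on, so the slot is created *before* the
-- REFRESH backfill to guarantee no row-level change is missed.
SELECT pg_create_logical_replication_slot(
    'datamesh_cdc_slot',  -- hypothetical slot name
    'pgoutput'            -- or wal2json, depending on the connector
);

-- The backfill then emits REFRESH events for all existing rows,
-- after which streaming resumes from the slot's starting LSN.
```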
By explicitly managing PostgreSQL replication slots as part of the migration workflow, we are able to migrate CDC consumers safely and predictably, without leaving behind stalled slots, retained WAL, or consumers pointing to the wrong database. This approach allows CDC pipelines to be cleanly re-established on Aurora while preserving correctness and operational simplicity.
How do we roll back in the middle of the process?
Pre-quiescence
Rolling back before the quiescence phase is straightforward. At this stage, the primary RDS PostgreSQL instance continues to serve as the sole source of truth, and no client traffic has been redirected.
If a rollback is required, the migration can be safely aborted by deleting the newly created Aurora PostgreSQL cluster along with its associated parameter groups. No changes are needed on the application side, and normal operations on RDS PostgreSQL can continue without impact.
During-quiescence
Rolling back during the quiescence phase is more involved. At this point, client traffic to the source RDS PostgreSQL instance has already been stopped by detaching its security groups. To roll back safely, access must first be restored by reattaching the original security groups to the RDS instance, allowing client connections to resume. In addition, any logical replication slots removed during the migration must be recreated so that CDC consumers can continue processing changes from the source database.
Once connectivity and replication slots are restored, the RDS PostgreSQL instance can safely resume its role as the primary source of truth.
Post-quiescence
Rolling back after cutover, once the Aurora PostgreSQL cluster is serving production traffic, is significantly more complex. At this stage, Aurora has become the primary source of truth, and client applications may have already written new data to it.
In this scenario, rollback requires setting up replication in the opposite direction, with Aurora as the source and RDS PostgreSQL as the destination. This can be achieved using a service such as AWS Database Migration Service (DMS). AWS provides detailed guidance for setting up this reverse replication flow, which can be followed to migrate data back to RDS if necessary.
Conclusion
Standardizing and reducing the surface area of data technologies is crucial for any large-scale platform. For the Netflix platform organization, this strategy allows us to concentrate engineering effort, deliver deeper value on a smaller set of well-understood systems, and significantly decrease the operational overhead of running multiple database technologies that serve similar purposes. Within the relational database ecosystem, Aurora PostgreSQL has become the paved-path datastore, offering strong scalability, resilience, and consistent operational patterns across the fleet.
Migrations of this scale demand solutions that are reliable, low-touch, and minimally disruptive for service owners. Our automated RDS PostgreSQL → Aurora PostgreSQL workflow represents a major step forward, providing predictable cutovers, strong correctness guarantees, and a migration experience that works uniformly across diverse workloads.
As we continue this journey, the Relational Data Platform team is building higher-level abstractions and capabilities on top of Aurora, enabling service owners to focus less on the complexities of database internals and more on delivering product value. More to come; stay tuned.
Acknowledgements
Special thanks to our other wonderful colleagues and customers who contributed to the success of the RDS PostgreSQL to Aurora PostgreSQL migration: Sumanth Pasupuleti, Cole Perez, Ammar Khaku.
