Off to the Races: Race Conditions to Avoid in a Two-Way Live Sync

by Paul Bemis on March 06, 2020

You sit down at your desk to see a bug report: the "live" integration your team built hasn't synced any new data for the last 5 days. One of the developers on your team reports that there are 3 million items on the queue, and that number has been growing over the last 8 hours. You eventually accept that everything that happened in that time period is a lost cause and that you'll have to communicate that to customers. The developers start digging into what went wrong.

If you want to avoid this scenario, read on for a look at some types of race conditions and ways to avoid them. But first, why did you build this integration?

Syncing data between multiple systems is an important part of software development. Entering the same data into multiple applications is time-consuming and error-prone for users, so companies often seek to automate this. Major players like Dell, Oracle, IBM, and others have created Integration Platform as a Service (iPaaS) offerings to provide off-the-shelf solutions to this problem. Salesforce’s acquisition of MuleSoft in 2018 and Hubspot’s acquisition of PieSync in 2019 are examples of investment in this space.

But data integrations are not always facilitated by iPaaS vendors. They are often built in-house by the software companies who built the application on one end of the integration. Whether you’re working in healthcare, real estate, e-commerce, insurance, or something else, you may run into some of the pitfalls discussed below.

Data synchronization looks different depending on the system’s needs. The sync can be one-way or two-way. It can be scheduled, initiated by a user, or live (triggered by system events). When a sync is two-way and live, there are a few types of race conditions that can cause problems.

By “live” I specifically mean asynchronous and loosely-coupled, where the sync of data is triggered by events/webhooks, and the system initiating those events does not know about the data sync. I am also referring to a system where the sync is responsible for keeping track of the connection between the ID of the entity in System A and the ID in System B. Furthermore, this system does not trust the content of events as the source of truth for data, but first performs a Get on the origin system when processing events. If event data is trusted as the source of truth, there is another class of race conditions: same-direction updates.

When considering these race conditions, it may help to picture an example. One such example would be automating data flow between any CRM and any other software that acts on those accounts, like an ERP or a marketing platform. Users want data entered in the CRM to show up in the other software, but also want the flexibility to add new accounts in the other software and have it sync to the CRM. In the examples below, I will reference creation of or updates to “Customers” to leverage the CRM example.

Here is an example of how this sync might be built:

[Figure: Example sync architecture]

Race condition 1: same-direction creates

Syncing data is simplest if Create actions in the target system only happen as a result of Create events in the origin system. In practice it is not likely to be that simple. There may be business rules that dictate that an entity cannot sync until certain conditions are met.

For example, if syncing customers from a CRM into an online marketing solution like Constant Contact or MailChimp, an email address may be required. A customer that originally had no email address would be created in the target system on an Update if an email address is added.

It is also more defensive to check whether the entity exists in the target system regardless of whether the event was a create or an update. Even the most reliable systems will have some misses, and a 99.9% success rate on millions of entities would result in thousands of missing entities. Resolving that missing data is much easier on end users (and your support team) if they can simply re-save.

So let’s assume that for either or both of those reasons, an entity will be created in the target system on both create and update events. In this scenario, instead of relying on the action in the origin system (Create or Update) to determine which action to take, the existence or lack thereof of an ID mapping will be what dictates the action. If the entity ID from the origin system is already mapped to a target system ID, then the target system should be updated. If the origin system’s ID is not tied to one in the target system, then the entity needs to be created.
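The mapping-driven decision above can be sketched in a few lines. This is a minimal illustration, assuming the ID mapping is available as a simple dictionary; `decide_action` and the IDs are hypothetical names, not part of any real system.

```python
# Hypothetical sketch: choose Create vs. Update based on the ID mapping,
# not on the event type raised by the origin system.

def decide_action(id_map: dict, origin_id: str) -> str:
    """Return the action to take in the target system."""
    # If the origin ID is already mapped to a target ID, update; else create.
    return "update" if origin_id in id_map else "create"

id_map = {"crm-42": "erp-7"}  # illustrative mapping store
print(decide_action(id_map, "crm-42"))  # mapped -> "update"
print(decide_action(id_map, "crm-99"))  # unmapped -> "create"
```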

Problem: Sometimes two events arrive very close together, such as when an entity is created with a typo that is quickly fixed. In that case, the second event may be processed before the first job completes. For example:

  1. Customer is created with typo in origin system. Create event is raised.
  2. Processing of Create event begins. Seeing no matching customer, a new one will be created in the target system.
  3. Typo in customer name is corrected. Update event is raised.
  4. Processing of Update event begins. Seeing no matching customer, a new one will be created in the target system.
  5. First job saves the customer.
  6. First job saves ID mapping.
  7. Second job persists the (duplicate) customer.

Solution: Restrict processing to one job per entity at a time, or use shared context about in-flight processing per entity.

If possible, it would be ideal to limit your asynchronous event processing so that each entity is only processed once at any given time. This could be achieved by dynamically creating queues keyed on the origin system, tenant, entity type, and entity ID. That may prove prohibitively expensive, so another option is distributed context: leveraging a distributed cache or similar to let multiple in-flight processes know about one another and make decisions accordingly. In this case, if a Create is already in progress, it would be redundant for another one to start.
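As a minimal sketch of the first option, here is what one-job-per-entity serialization could look like within a single process. A real deployment would need a distributed lock or per-entity queues instead of in-process locks; `process_event` and the key shape are assumptions for illustration only.

```python
# Sketch, assuming a single worker process: serialize jobs per entity so a
# concurrent Create and Update for the same entity can no longer interleave.
import threading
from collections import defaultdict

# One lock per logical entity, created lazily on first use.
_entity_locks = defaultdict(threading.Lock)

def process_event(system: str, tenant: str, entity_type: str,
                  entity_id: str, handler):
    # Key by everything that identifies one logical entity.
    key = (system, tenant, entity_type, entity_id)
    with _entity_locks[key]:
        # Only one job for this entity runs at a time.
        return handler()

result = process_event("crm", "tenant-1", "customer", "42", lambda: "synced")
print(result)  # "synced"
```

A distributed variant of the same idea could use a cache entry as an in-flight marker instead of a lock, so that a second worker seeing the marker skips the redundant Create.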

Race condition 2: round-trip creates

The bigger problem with two-way live, asynchronous, loosely-coupled syncs is the round trip. First, let’s look at round-trip creates. When an entity is persisted in the target system, the sync then has to save the ID mapping between the origin and target systems. At the same time, the target system will raise a Create event, which will also be processed.

Problem: Sometimes, the round-trip Create event may get processed before the ID mapping has been saved. Since the existence of the ID mapping will be what tells the system whether to perform the Create (see above), now the round-trip Create will happen, causing a duplicate.

  1. Customer is created in System A. Create event is raised.
  2. A-to-B processing of Create event begins. Seeing no ID mapping, the customer will be created in System B.
  3. Customer is persisted in System B. System B returns the ID and raises a Create event.
  4. B-to-A processing of Create event begins. Seeing no ID mapping, this is treated as a Create, not an Update.
  5. First job saves ID mapping.
  6. Duplicate customer is created in System A as a result of the round trip.

Solution: Distributed context of recent jobs. Unfortunately, this problem exists precisely because the ID mapping has not yet been saved, so IDs can't be used to identify the scenario. Instead, save the combination of the action (Create) and the entity, or at least key properties of that entity. This action/entity combination can be kept in a distributed cache for a short time for future processes to check against.
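One way this could look, using a plain dictionary with timestamps to stand in for a distributed cache with a TTL (e.g. Redis). The fingerprint fields, the 30-second window, and all names are illustrative assumptions:

```python
# Sketch of a "recent jobs" cache keyed on (action, entity fingerprint),
# used to recognize a round-trip Create before any ID mapping exists.
import time

RECENT_TTL_SECONDS = 30  # illustrative window; tune to your event latency

_recent_jobs = {}  # (action, fingerprint) -> timestamp

def fingerprint(entity: dict) -> tuple:
    # Key properties that identify "the same" entity across systems.
    return (entity.get("name"), entity.get("email"))

def record_job(action: str, entity: dict) -> None:
    _recent_jobs[(action, fingerprint(entity))] = time.monotonic()

def is_round_trip(action: str, entity: dict) -> bool:
    ts = _recent_jobs.get((action, fingerprint(entity)))
    return ts is not None and time.monotonic() - ts < RECENT_TTL_SECONDS

customer = {"name": "Acme Co", "email": "ops@acme.test"}
record_job("create", customer)            # A-to-B job records its work
print(is_round_trip("create", customer))  # B-to-A job sees the echo: True
```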

Race condition 3: round-trip updates

Similar to round-trip creates are round-trip updates. Let's assume the system is already smart enough not to persist updates in an infinite loop: it checks for any differences before saving an update. Picture a phone number being changed in System A. When that event is processed, the update is made in System B, and an Update event is raised. The resulting B-to-A job will see that the entity is identical in System A and System B, so it won't persist an unnecessary update to System A. However, there are still potential issues.

Problem: If multiple updates are made rapidly, the resulting round-trip update may see a difference.

Assumption: Customer exists in both systems, and the phone number needs to be changed from 111-111-1111 to 123-456-7890.

  1. User updates Customer in System A. Phone number is changed from 111-111-1111 to 123-456-7899 (typo). Update event is raised.
  2. Processing of A-to-B Update event begins. Seeing an existing ID mapping, it is treated as an Update.
  3. Seeing a needed change in phone number, the update is persisted to System B. System B raises an Update event.
  4. Processing of B-to-A Update event begins. Seeing an existing ID mapping, it is treated as an Update.
  5. User updates Customer in System A. Phone number is changed to 123-456-7890 (typo fixed). Update event is raised.
  6. The B-to-A job sees a needed change in phone number. 123-456-7899 is different from 123-456-7890, so the update is persisted to System A (overwriting the correct value with the wrong one). An Update event is raised, and we go around again.

Solution: It depends on whether business rules let you ignore round-trip updates. If you never care about round-trip updates, the solution is the same as in the round-trip Create scenario. It can even be simpler, because now the IDs from both systems are available: the system can simply store that an Update happened to a given entity in a given direction, which lets the round trip be ignored.
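If round trips can be ignored outright, the check could be as simple as recording the direction of each recent update per ID mapping. A sketch under that assumption, with illustrative names and a hypothetical TTL:

```python
# Sketch: once IDs are mapped, a round-trip update can be recognized by
# (mapping ID, direction) alone. Assumes round trips can simply be dropped.
import time

TTL_SECONDS = 30  # illustrative window

_recent_updates = {}  # (mapping_id, direction) -> timestamp

def record_update(mapping_id: str, direction: str) -> None:
    _recent_updates[(mapping_id, direction)] = time.monotonic()

def should_ignore(mapping_id: str, direction: str) -> bool:
    # An update flowing B-to-A is an echo if we just pushed A-to-B.
    opposite = "b_to_a" if direction == "a_to_b" else "a_to_b"
    ts = _recent_updates.get((mapping_id, opposite))
    return ts is not None and time.monotonic() - ts < TTL_SECONDS

record_update("map-1", "a_to_b")          # A-to-B job records its direction
print(should_ignore("map-1", "b_to_a"))   # True: drop the round trip
```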

However, it is possible you care about round-trip updates. The two systems you are connecting may have different rules around field lengths, allowable values, number of values, and so on. The sync may sometimes change the data as it flows through; in these cases you might want that change synced back. For example, if System A holds a single phone number and System B holds multiple, you would likely sync the first or primary number from System B.

[Figure: Race Conditions Ex3 Solution Part 1]

But if the phone number is deleted from System A, and System B has another that now becomes the primary, it may be required to sync that back to System A. Otherwise, a later update in System A may cause an empty phone number to be seen as a difference, removing the now-primary number from System B.

The phone number is deleted from System A, and that change is synced to remove it from System B:

[Figure: Race Conditions Ex3 Solution Part 2]

Then, the remaining phone number syncs back from System B. Otherwise a later change to System A could cause the remaining phone number to be deleted from System B.

[Figure: Race Conditions Ex3 Solution Part 3]

If you do care about handling the round-trip update when needed, the solution is again distributed context of recent jobs, factored into the search for differences. As in the create scenario, entities must be mapped to a common format and saved to a distributed cache or similar. Then, when a round-trip update happens, properties that are the same in the A-to-B update and the B-to-A update should be removed before looking for differences.

If a phone number was modified in System A, here is how it would work:

[Figure: Race Conditions Ex3 Solution Part 4]

This cached information is consulted before comparing what is flowing from System B to what exists in System A, and any properties that match the cache are removed. Since there are no remaining differences, it doesn't matter that the phone number in System A is now 123-456-7890: it won't be checked.
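That field-level echo suppression can be sketched with the typo scenario above. The function and field names are assumptions; a real implementation would pull `sent` from the distributed cache of recent jobs:

```python
# Sketch: remove properties that match what we just sent before diffing,
# so an echo of our own update cannot overwrite a newer local change.

def remaining_diff(sent: dict, incoming: dict, current: dict) -> dict:
    """Fields from the round trip that still need to be written back."""
    changes = {}
    for field, value in incoming.items():
        if sent.get(field) == value:
            continue  # unchanged echo of our own update: ignore it
        if current.get(field) != value:
            changes[field] = value  # a real change made by the target system
    return changes

sent = {"phone": "123-456-7899"}      # what the A-to-B job pushed
incoming = {"phone": "123-456-7899"}  # what System B echoed back
current = {"phone": "123-456-7890"}   # System A meanwhile fixed the typo
print(remaining_diff(sent, incoming, current))  # {} -> no overwrite
```

If System B had genuinely changed a field (say, promoting a second phone number to primary), that field would not match `sent`, would survive the filter, and would be written back as intended.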

If a phone number was deleted in System A but System B had a second number, here is how it would work:

[Figure: Race Conditions Ex3 Solution Part 5]

Since there is an expected change to the data in System B that needed to be synced back, that difference will still be there after removing the matches from the cache. Then the remaining difference will be compared to what is in System A to determine whether the update should be made.

Conclusion

A two-way live sync is considerably more complex than one-way scheduled or on-demand syncs that can be guaranteed to run one at a time. As with any project, I would recommend proving the value of a sync by starting with the simpler case and adding complexity once justified. Even in a two-way live sync, incremental progress could be made by simplifying some of these scenarios: only doing Creates on Create events, or ignoring round-trip updates. Both of these have user impact but may be acceptable steps toward the ultimate vision. If you need to support a two-way live sync with all the bells and whistles, hopefully knowing about these race conditions helps you identify risks ahead of time.

