TransWikia.com

Why is SystemModStamp missing updates?

Salesforce Asked by SebTHU on October 4, 2021

I’m a bit of a beginner with Salesforce (speciality is SQL and datawarehousing). For the last few months I’ve been looking after an ETL process which pulls data out of SagePeople Cloud (which, confusingly, seems to be Salesforce inside the box).

Method is SSIS (SQL Integration Services) packages, using the Cozyroc Salesforce Source SSIS component. The loads are deltas, so I want to get "all rows updated since [a date time]". The component has an ObjectFilter property, which I set at runtime to

SystemModStamp>=[a date, e.g. 2020-06-30T23:00:00Z]

I’ve already worked out that:

  1. SystemModStamp is better than LastModifiedDate. The former can get updated by some updates, which leave LastModifiedDate untouched.
  2. Internally, both these are what I (in SQL world) would call a datetimeoffset: an unambiguous date/time with a specified timezone, which can be expressed in different ways but always means the same time. For example, the date/time above is actually midnight at the start of 1st July UK (summer) time (or 2020-07-01T00:00:00+01:00). But when the data comes through, both are translated to UK time without any timezone information – presumably because my user is set up as located in the UK.

The date parameter passed for each delta load execution is the date/time at which the last (successful) execution of this same load sent its SOQL to Salesforce to get the data (I save this in a variable, from SQL GETUTCDATE()) – less a safety margin, which I initially set to 5 minutes. To save confusion all these date/times are saved as datetimeoffsets at timezone Z. Logs show that this is working correctly:

Execution 1 started talking to Salesforce at 2020-07-07T06:00:00Z

Execution 2 used date parameter 2020-07-07T05:55:00Z, and started its execution at 2020-07-07T09:00:00Z

Execution 3 used date parameter 2020-07-07T08:55:00Z…. and so on.

The problem is: the delta loads are missing updates. New delta load runs 3-hourly. I’m comparing to the old delta load system, which runs daily, and takes the last 2 days of updates (actually last midnight before now, rolled back 2 days: lastmodifieddate = LAST_N_DAYS:2). That system is picking up updates – different column values – which the new system is missing: in every case, the LastModifiedDate and SystemModStamp in the pulled ETL data (old vs new) are identical, but the old system is correct (matches the current state of Salesforce).

There seem to be two types of problem cases:

  1. Rows created or updated almost exactly when a delta load starts. e.g. at 09:00:15. I can see how a load starting at 09:00:00 might miss these. But why are they not picked up by the next load, which winds back to 08:55?
  2. Rows created/updated at other times (according to SystemModStamp/LastModifiedDate).

The solution to (1) might be to wind back 10 or 15 minutes rather than 5. Something to do with long-running transactions inside Salesforce? How long should be allowed for these?

(2) is more mysterious. What I suspect is this:

  • A row gets updated at 08:00. SystemModStamp set, all good
  • New delta load picks it up at 09:00. All good
  • Old delta load, much later (at 02:00 the following morning) also picks it up. But it picks up other updates to the row that have happened since 08:00, but haven’t updated SystemModStamp. Some process inside Salesforce is "cheating", and updating things without setting SystemModStamp.

The only clue I have is that the columns with differences are mostly what I’d call "computed" columns: things like Months_Since_[employee]Start, which might be automatically updated in bulk periodically, rather than being instantly calculated like true computed columns. But some other columns with fewer differences aren’t of this kind: e.g. Email, Gender, Postcode.

There might be something wrong with my date logic (though from the logs, I doubt it). The worrying possibility is that actually SystemModStamp is not reliable for getting deltas out of Salesforce, because some processes don’t update it. Can anyone comment on this? Thanks!

One Answer

If you want true replication, use the getUpdated and getDeleted API calls. They come in both REST and SOAP flavors. They provide a latestDateCovered attribute that you can use to start the next delta check. Unlike using SOQL, you can guarantee that you'll get the correct delta every time, as long as you use the one attribute provided.

Depending on how you wrote your query, your SystemModStamp deal could miss records, including situations where you set a cap on the upper end as well as the lower end. For example, the SystemModStamp can be off by a few seconds, or may not be updated until after you do your query, etc.

In short, Replication Is Hard. The Replication APIs are specifically designed to make a best effort to ensure that no records are ever missed, no matter what crazy things might be going on in the background. If you have the ability to use them, you should do so. Details are in the relevant documentation (SOAP and REST API docs). Note that this API also gives you deletions, which a SOQL can't do reliably.

Correct answer by sfdcfox on October 4, 2021

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP