Upcoming product maintenance and downtime

We are performing an upgrade of a key backend messaging system in all regions starting February 12th, 2024. Each hosting environment will experience 2-4 hours of downtime during its scheduled maintenance window (see the schedule below).

This upgrade will enable SailPoint’s Identity Security Cloud to keep scaling to support our ever-growing set of customers and features. To prevent data or message loss, we will shut down all products in the environment being upgraded for the duration of the migration (2-4 hours). This is to make sure that there are no messages in flight that may get lost or misrouted as we switch our components to start using the upgraded cluster for messaging.

During the maintenance window, we will first shut down all frontend and backend services in the environment, as well as all AI pipelines. We will then ensure that all data is replicated to the new cluster and that no new messages are flowing. Once that is confirmed, we will reconfigure all components to use the new cluster and start up the environment.

We have tested this migration in our lower environments and have developed and tested rollback processes as well in case of unexpected setbacks.

The maintenance schedule is as follows. We will keep this table up to date if the schedule changes. Please reach out to your CSM or our Support team if you have any questions.

| Region | Environment | Maintenance Start Time (CST) | Maintenance Start Time (UTC) | Maintenance Start Time (Local Time) |
| --- | --- | --- | --- | --- |
| AP-Northeast-1 (Tokyo) | Sandbox | 2024-02-12 09:00 AM CST | 2024-02-12 15:00 UTC | 2024-02-13 00:00 JST |
| AP-Southeast-1 (Singapore) | Sandbox | 2024-02-12 09:00 AM CST | 2024-02-12 15:00 UTC | 2024-02-12 23:00 SGT |
| AP-Northeast-1 (Tokyo) | Production | 2024-02-14 09:00 AM CST | 2024-02-14 15:00 UTC | 2024-02-15 00:00 JST |
| AP-Southeast-1 (Singapore) | Production | 2024-02-14 09:00 AM CST | 2024-02-14 15:00 UTC | 2024-02-14 23:00 SGT |
| CA-Central-1 (Montreal, Canada) | Sandbox | 2024-02-14 01:00 AM CST | 2024-02-14 07:00 UTC | 2024-02-14 02:00 EST |
| AP-Southeast-2 (Sydney) | Sandbox | 2024-02-16 09:00 AM CST | 2024-02-16 15:00 UTC | 2024-02-17 02:00 AEDT |
| EU-West-2 (London) | Sandbox & Production | 2024-02-16 04:00 PM CST | 2024-02-16 22:00 UTC | 2024-02-16 22:00 GMT |
| US-East-1 (US) | Sandbox | 2024-02-18 (Sunday) 09:00 AM CST | 2024-02-18 15:00 UTC | 2024-02-18 09:00 CST |
| CA-Central-1 (Montreal, Canada) | Production | 2024-02-18 (Sunday) 09:00 AM CST | 2024-02-18 15:00 UTC | 2024-02-18 10:00 EST |
| US-West-2 (US, legacy) | Sandbox & Production | 2024-02-20 09:00 AM CST | 2024-02-20 15:00 UTC | 2024-02-20 07:00 PST |
| AP-Southeast-2 (Sydney) | Production | 2024-02-20 09:00 AM CST | 2024-02-20 15:00 UTC | 2024-02-21 02:00 AEDT |
| EU-Central-1 (Frankfurt) | Sandbox & Production | 2024-02-20 12:00 PM CST | 2024-02-20 18:00 UTC | 2024-02-20 19:00 CET |
| US-East-1 (US) | Production | DELAYED New Date TBD | DELAYED New Date TBD | DELAYED New Date TBD |
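
For anyone who wants to double-check a row against their own time zone, here is a minimal sketch of the conversion. It assumes "CST" in the table means US Central time (the IANA zone `America/Chicago`, which is on UTC-6 in February); the helper name `convert_start` is hypothetical and not part of any SailPoint tooling.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Assumption: the schedule's "CST" column is US Central time.
SCHEDULE_TZ = ZoneInfo("America/Chicago")


def convert_start(cst_start: str, local_zone: str) -> tuple[str, str]:
    """Convert a 'YYYY-MM-DD HH:MM' Central-time start into (UTC, local) strings."""
    start = datetime.strptime(cst_start, "%Y-%m-%d %H:%M").replace(tzinfo=SCHEDULE_TZ)
    fmt = "%Y-%m-%d %H:%M %Z"
    utc = start.astimezone(ZoneInfo("UTC"))
    local = start.astimezone(ZoneInfo(local_zone))
    return utc.strftime(fmt), local.strftime(fmt)


# Sydney Sandbox row: 2024-02-16 09:00 AM CST
print(convert_start("2024-02-16 09:00", "Australia/Sydney"))
# ('2024-02-16 15:00 UTC', '2024-02-17 02:00 AEDT')
```

Pass any IANA zone name as `local_zone` to get the start time where you are.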

Hi @alex_derzhi,

That is quite a long downtime. I do have some specific questions to get some clarification.

Will identities still be able to log in and perform read operations, such as checking their accounts? Can we still call the APIs, for example to collect audit data? Or is it the case that anything related to IdentityNow will be down from SailPoint’s side?

“To prevent data or message loss, we will shut down all products in the environment being upgraded for the duration of the migration (2-4 hours).”

Does the above mean that products that do not get upgraded will stay on? If so, what about dependencies on products that are down because they are being upgraded? I am not sure which products will be upgraded, but as an example: if identity refresh is not upgraded and stays on, and it changes an identity attribute, could that change fail to trigger a workflow execution because workflows are down? Or will identity refresh also be turned off, so that it can never trigger a workflow that then fails by default without being run again?

Should we prepare some queries to determine after the downtime if we need to manually run tasks which would otherwise be executed directly?

Also, if you create a role, it exists in v3/roles, but it will take over a day before it shows up in search. Does the migration take this into account, or might it create errors for these half-existing objects? Are we being advised not to make certain changes before the downtime as a consequence?

Kind regards,
Angelo Mekenkamp

Hi @alex_derzhi,

Could you please take a look at the questions above?

We asked our CSM these questions as you suggested, and they asked us to forward the questions to Support. While we wait for Support to review our questions, research the answers with the different SailPoint teams and get back to us, perhaps you could answer them directly in this forum post? That way, more people interested in the answers can find them here, and many CSMs or Support staff will not have to answer the same questions separately. It would also remove SailPoint Support as the middleman in the communication chain between you and us, which speeds up the information flow and decreases the chances of miscommunication.

Kind regards,
Angelo

Hi Angelo,

All SaaS components and products EXCEPT ARM and NERM will be shut down fully. No UI or API access will be available, and no backend processing will occur. Since even read actions produce audit events, we cannot offer a true “read only” mode as a way to avoid a shutdown, for example.

Processes like Aggregations, Refresh, Workflows, Provisioning, etc. will not trigger during the downtime. Any in-flight jobs will naturally pause as their processing services shut down OR, and this is unlikely, will be retried once services boot back up. A Refresh will clear any identities in an error state, but we do not expect this to happen, and we did not see it in our testing.

Scheduled tasks will fire on their next scheduled run and will not run multiple times if multiple scheduled starts are missed (such as hourly jobs). I can’t speak to the role and search issue directly, but I do know that this uses the same message bus functionality as most of our backend processing, so it will also pause processing and then resume ingestion of data once things boot back up. Since it’s a full shutdown, we do not expect issues with “half created objects”, and did not find any during Dev testing.
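
As an illustration only, the "fire once on the next scheduled run" behavior for missed starts can be sketched like this. This is not SailPoint's scheduler code; `next_run_after` is a hypothetical helper, and the fixed-interval schedule is an assumption made for the example.

```python
from datetime import datetime, timedelta


def next_run_after(first_run: datetime, interval: timedelta, now: datetime) -> datetime:
    """Return the next scheduled start strictly after `now`.

    Starts that fell inside the downtime are skipped, not replayed.
    """
    if now < first_run:
        return first_run
    elapsed_intervals = (now - first_run) // interval  # whole intervals already passed
    return first_run + (elapsed_intervals + 1) * interval


# Hourly job anchored at midnight; services come back up at 12:20.
back_up = datetime(2024, 2, 18, 12, 20)
print(next_run_after(datetime(2024, 2, 18, 0, 0), timedelta(hours=1), back_up))
# 2024-02-18 13:00:00
```

In this sketch, the hourly starts that fell inside the downtime are simply dropped, and the job runs once at 13:00 rather than once per missed start.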

Thanks,
Alex Derzhi

Hi @alex_derzhi,

Perfect, thank you for this clear clarification! :smiley:

Hi @alex_derzhi, I agree with Angelo that this outage is massive and as you describe the operation, it looks like we’ll have to stop some automatic jobs before the maintenance window starts and manually resume their execution once the system is available again.

However, “2-4h window” is a very vague statement, and I don’t want to stay in front of my computer all Sunday long just waiting for the system to be restored, so I emailed our CSM the following questions and he suggested I post them here:

Comms post-outage:

  • Is it expected that the outage will automatically end by the end of the window?
    (I know the notice says “for up to four (4) hours”, but I’m checking in case you’re using CI/CD pipelines that allow a fair estimate of the duration of the outage.)

  • If the window length is just a rough estimate and might be shorter or longer than 4h, will we get an email notice when the system is operational again?

  • If the window is actually variable and runs well beyond the estimated end time, will we receive ETA and progress updates so we can plan when to come back and restore the jobs we stopped?

  • Will there be any “cooldown” period during which we should hold off on critical syncs until the system is stable again?

Hypercare support:

  • Will there be any direct channels to request assistance post-outage?

  • What would be the SLA for the resolution of those incidents?

Thank you!

Hello @alex_derzhi,

I need clarification on the maintenance times for the us-east-1 region.

The chart above lists a start time of 2/25/2024 7:00 PM CST as well as a local time of 2/25/2024 8:00 PM CST. We received an in-application notification which states the maintenance window will start at 2/25/2024 at 9:00 AM. Which of these times is the correct time?

The chart appears to be correct for the us-east-1 sandbox environment listing a start time of 2/18/2024 9:00 PM CST and local start time of 2/18/2024 2100 CST. However, the in-application notification states the maintenance starts on 2/18/2024 at 9:00 AM CST. Which is correct?

Please confirm maintenance start times for sandbox and production us-east-1 regions ASAP.

Thanks,
Sabrina Cannon

Hi Elisa,

With our testing and the first two Sandbox migrations completed, we know that we need at least 2 hours to complete all changes and ensure data is synced. The average is just above 3 hours. We only went beyond the 4-hour window once in our early testing, but it is possible if we encounter unexpected issues. If we go past the 4-hour window, we will update our Status Page with the info and expected ETA.
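
For planning purposes, the 2-hour minimum and 4-hour planned maximum translate into clock times as in the rough sketch below. The helper `window_bounds` and the 09:00 start are illustrative assumptions, not a commitment to when any environment will actually be back.

```python
from datetime import datetime, timedelta


def window_bounds(
    start: datetime, min_hours: float = 2.0, max_hours: float = 4.0
) -> tuple[datetime, datetime]:
    """Return (earliest end, latest planned end) for a window starting at `start`."""
    return start + timedelta(hours=min_hours), start + timedelta(hours=max_hours)


# Hypothetical 09:00 start in local time.
earliest, latest = window_bounds(datetime(2024, 2, 18, 9, 0))
print(earliest.strftime("%H:%M"), latest.strftime("%H:%M"))
# 11:00 13:00
```

The later time is only the planned limit, since an overrun is possible if unexpected issues come up.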

We do not expect any actions to be necessary after the migration is complete. In our testing, the chance of actually missing a message and having an Identity go into an Error state is almost zero. (Such things are never truly zero, but we’ve done a lot of work to ensure this.) There is no cooldown period; you are welcome to kick off any process you need once the UI is up. In the very unlikely event that identities go into an Error state, a Refresh will clear that up.

Support is provided by our Support team as usual. Since this is an announced and scheduled downtime, this will not affect SLAs. However, if there is a subsequent outage due to this migration, it will be treated as a P0 (top-urgency) incident. This is why we are striking a balance between migrating during off-times and still having our key engineers awake and ready to help in case of unexpected issues.

Thanks,
Alex

Hi Sabrina, you’re totally right, we have a mismatch in our docs. We have been adjusting the migration times a bit here and there to reduce impact and increase engineer availability. Here are the updated times for US-East-1 and EU-West-2. I have updated the main post as well.

| Region | Environment | Maintenance Start Time (CST) | Maintenance Start Time (UTC) | Maintenance Start Time (Local Time) |
| --- | --- | --- | --- | --- |
| EU-West-2 (London) | Sandbox & Production | 2024-02-16 02:00 PM CST | 2024-02-16 20:00 UTC | 2024-02-16 20:00 GMT |
| US-East-1 (US) | Sandbox | 2024-02-18 (Sunday) 09:00 AM CST | 2024-02-18 15:00 UTC | 2024-02-18 09:00 CST |
| US-East-1 (US) | Production | 2024-02-25 (Sunday) 01:00 AM CST | 2024-02-25 07:00 UTC | 2024-02-25 01:00 CST |

EDIT: Updated US-East-1 Sandbox to 9AM. 9PM was not correct, our apologies for the confusion.
Edit 2: Updated US-East-1 Production to 1AM

Thanks @alex_derzhi, so if the outage in our production environment starts on Feb 25th at 9AM CST, it’s fair to say that I would be able to restore our scheduled jobs by 1PM?

When you say “we will update our Status Page” are you referring to the SailPoint Status page or another site especially built to communicate updates related to this outage?

Besides publishing updates to a page we need to monitor, are you going to email us such updates?

Apologies if this sounds a little pushy, but on Sundays I would rather spend my mornings having brunch with my family than refreshing a status page to see if there are any updates. I usually work long hours during the week, and I would really appreciate email alerts that I can watch on my mobile, to minimize the time spent hooked to the computer doing nothing.

Thank you very much!

UPDATE:

We have delayed the London (EU-West-2) migration by two hours to reduce impact based on historic load levels. Here are the updated times:

| Region | Environment | Maintenance Start Time (CST) | Maintenance Start Time (UTC) | Maintenance Start Time (Local Time) |
| --- | --- | --- | --- | --- |
| EU-West-2 (London) | Sandbox & Production | 2024-02-16 04:00 PM CST | 2024-02-16 22:00 UTC | 2024-02-16 22:00 GMT |

We apologize if this causes inconvenience. This delay will reduce the risk of the migration going over the 4 hour limit.

Thanks,
Alex

Hi Elisa,

Correct, that’s where we post updates for production environments. I will ask the team to post Sandbox updates via that page as well. You can subscribe to updates there via email. We will not be sending out any other kind of notifications beyond what was already announced and posted.

Thanks,
Alex

Thanks Alex, I just subscribed to the alerts from the support page. Good luck with the migration and thanks for all your answers here!

Hi Alex,
Could you confirm the maintenance window for US-East-1 SANDBOX? I’ve received an email from SailPoint saying it’s 9:00 AM CST and not PM.

Thank you,

John

John,

9AM is correct for US-East-1 Sandbox. I have updated the main table as well as my comment above. We had some crossed wires between all of the sessions. We want to do it in the same time frame as production to ensure that scheduled jobs behave the same. This also helps with engineer availability in case of issues. Sorry for the confusion here.

Alex

Hi @alex_derzhi, I thought you may want to know that after the Sandbox maintenance window our tenant got a source with an endless account aggregation.

I chose not to stop the aggregations in Sandbox so that I could test this scenario, about which you stated in your answer to Angelo:

“Scheduled tasks will fire on their next scheduled run and will not run multiple times if multiple scheduled starts are missed (such as hourly jobs).”

I ran a manual aggregation for the source, and while it finished successfully, I still see the other aggregation running. I have submitted a case with Support to report this, so your team can look into the cause and it doesn’t happen in Prod next week.

Feel free to DM me if you need any further details.

We have one final schedule update. Based on feedback from our customers, we have changed the US-East-1 Production migration to reduce impact to weekend operations across the globe. Here are the new times for the US-East-1 Production downtime:

| Region | Environment | Maintenance Start Time (CST) | Maintenance Start Time (UTC) | Maintenance Start Time (Local Time) |
| --- | --- | --- | --- | --- |
| US-East-1 (US) | Production | 2024-02-25 (Sunday) 01:00 AM CST | 2024-02-25 07:00 UTC | 2024-02-25 01:00 CST |

Elisa,

Would you please open a Support request with all the info that you have so we can investigate?

Thanks,
Alex

Hi @alex_derzhi, I already did: CS0271806

Hi @alex_derzhi, we just captured the notice below in our tenant. Is this yet another window, or is it an update to the already scheduled one?
Can you confirm, so I can plan accordingly with my colleagues in India, as the scheduled time is now harder to support from the US? Thanks!
