Rate Limits for Workflows

Announcing Workflows Rate Limits

To enhance your experience with SailPoint’s Identity Security Cloud, we are introducing updates to SaaS workflows starting January 16, 2025. Some workflows generate an exceptionally high number of executions daily, causing system-wide lag and impacting service quality for all tenants. To maintain system stability and equitable performance, we’re implementing standardized workflow execution rate limits that are fair and sufficient for meeting customer business needs.

New Workflow Execution Terms:

  • Initial Allowance: Each tenant can execute up to 400,000 workflows per day at the current (unthrottled) rate.

  • Rate Limiting Beyond the Limit: After reaching the 400,000-execution threshold, workflow executions will be limited to 5 invocations per second.

  • Quota Reset: The execution quota resets every 24 hours.
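
In effect, the policy is a daily quota with a throttle fallback. As a minimal illustrative sketch of the announced behavior (the class and constants below are a model, not SailPoint code):

```python
import time

class WorkflowRateLimiter:
    """Illustrative model of the announced policy: 400k full-speed
    executions per 24-hour window, then throttling to 5 per second."""

    DAILY_QUOTA = 400_000      # executions at full speed per 24h window
    THROTTLED_RATE = 5         # invocations per second once over quota

    def __init__(self):
        self.window_start = time.time()
        self.used = 0

    def on_execution(self) -> float:
        """Return the delay (seconds) to apply before this execution."""
        # Quota resets every 24 hours.
        if time.time() - self.window_start >= 24 * 60 * 60:
            self.window_start = time.time()
            self.used = 0
        self.used += 1
        if self.used <= self.DAILY_QUOTA:
            return 0.0                      # under quota: run immediately
        return 1.0 / self.THROTTLED_RATE    # over quota: ~0.2s spacing
```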

Optimizing Your Workflows:

High execution rates can slow down services for tenants in the same region, affecting processing times across the entire customer community. Optimizing your workflows can significantly reduce unnecessary executions and can be done in a few easy steps. To optimize your workflows:

  1. Use Filters: Apply filters to triggers (the events that start workflows) so a workflow executes only when specific conditions are met (see the trigger-filter sketch after this list).

  2. Review Best Practices: Visit this link for tips on streamlining workflows to align with your business needs.
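
As an example of item 1, workflow triggers accept a JSONPath filter in their JSON definition; the sketch below shows, as a Python dict, an Identity Attributes Changed trigger that fires only when the department attribute changes. The trigger id, filter key, and attribute name are illustrative, so verify them against the workflow JSON in your own tenant:

```python
# Illustrative trigger definition from a workflow's JSON, expressed as a
# Python dict. The "filter.$" JSONPath restricts executions to events
# where the "department" attribute actually changed; the trigger id and
# attribute name are examples and may not match your tenant.
trigger = {
    "type": "EVENT",
    "attributes": {
        "id": "idn:identity-attributes-changed",
        "filter.$": '$.changes[?(@.attribute == "department")]',
    },
}
```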

Why This Matters:

By implementing workflow rate limiting, we aim to provide fair, consistent, and reliable service for all customers. These limits also provide an opportunity to evaluate and refine your workflows, ensuring they operate efficiently, with the right balance of steps and executions aligned to your business goals.

If you have any questions, please reach out to your Customer Success Manager or Support Expert for assistance. You may also leverage the “Provide Feedback” link in Identity Security Cloud or SailPoint’s Ideas Portal.

5 Likes

Hi Colin!

Is there an alert, dashboard, or API endpoint that will let us know the number of executions used in the current period, or tell us when we’re approaching/have hit the limit?

3 Likes

I was also going to ask the same as Mark. Hoping that there’s some way to see the total number of executions in a day, etc.

Although it sounds like a reasonable thing to try and limit the amount of unnecessary computations, which could benefit us all, I also have my doubts:

Announcement too late
You announced this on the 16th of January 2025 and say this functionality goes into production on the 16th of January 2025. If organizations are already surpassing this limit, they are in for an unpleasant surprise, aren’t they?

Fairness
I doubt that the current strategy can be considered fair.
If organization A has 100 times as many identities and 10 times as many applications as organization B, wouldn’t it be fair for organization A to get a higher limit? After all, I can imagine more workflows need to be executed there, and while I am not involved with license costs, I doubt those will be equal either.

Workflow execution metrics
In addition, for some of our workflows we can’t see the number of executions, as the POST /v2024/workflow-metrics API returns a 502 Bad Gateway error for some of them (and not necessarily the ones we expect high numbers for).
Also this API is undocumented (@christina_gagnon).

Besides that, I don’t believe there is an API to see the number of daily executions, neither per workflow nor in total, which makes it harder to tell whether we will pass that number and to ensure we stay under it.
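
For what it’s worth, one possible workaround is to approximate the daily total yourself by paging through each workflow’s executions and counting those started in the last 24 hours. This is only a sketch: it assumes the v3 list-workflows and list-executions endpoints, a pre-obtained bearer token, and a startTime field on each execution record, any of which may differ in your tenant:

```python
import datetime
import requests

BASE = "https://{tenant}.api.identitynow.com"   # replace {tenant}
HEADERS = {"Authorization": "Bearer <token>"}    # replace <token>

def executions_last_24h() -> int:
    """Approximate total executions across all workflows in the past day.
    Assumes GET /v3/workflows and GET /v3/workflows/{id}/executions;
    paging parameters and field names may differ in your tenant."""
    cutoff = (datetime.datetime.now(datetime.timezone.utc)
              - datetime.timedelta(hours=24))
    total = 0
    workflows = requests.get(f"{BASE}/v3/workflows", headers=HEADERS).json()
    for wf in workflows:
        offset = 0
        while True:
            page = requests.get(
                f"{BASE}/v3/workflows/{wf['id']}/executions",
                headers=HEADERS,
                params={"limit": 250, "offset": offset},
            ).json()
            if not page:
                break
            # Count executions whose startTime falls inside the window.
            total += sum(
                1 for e in page
                if datetime.datetime.fromisoformat(
                    e["startTime"].replace("Z", "+00:00")) >= cutoff
            )
            offset += 250
    return total
```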

Spikes
And finally, I can imagine that occasional activities, like a migration or the onboarding of a new source with 400k+ accounts, can cause one-time spikes of executions. If we usually have 10k executions per day and require 1000k at the start of each month, we get a problem, even though this is fewer executions than if one has 100k executions each day. Spreading this type of execution over multiple days will require workarounds and can cause delays (see the back-of-the-envelope sketch below).
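
To put rough numbers on this scenario: under the announced policy, a one-time 1,000k-execution day would run its first 400k at full speed and queue the remaining 600k at 5 per second, which is over 33 hours of backlog on its own (ignoring the fresh full-speed budget granted at the 24-hour reset):

```python
DAILY_QUOTA = 400_000
THROTTLED_RATE = 5            # executions per second once over quota

spike = 1_000_000             # one-time monthly load from the example
over_quota = max(0, spike - DAILY_QUOTA)          # 600,000 executions
drain_seconds = over_quota / THROTTLED_RATE       # 120,000 seconds
print(f"backlog drain time: {drain_seconds / 3600:.1f} hours")  # ~33.3
```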

Alternative improvements
Since workflows are mainly there to compensate for functionality that SailPoint does not offer out of the box or in a desired manner, putting a limit like this could limit what we can do.
In addition, some workflows take many executions simply because of limited SailPoint functionality. For example, since you cannot filter over the pending approvals with a filter like created gt .... AND created lt ...., our workflow needs to iterate over them all. And since SailPoint APIs use pagination and workflows cannot apply recursion (unless executing another workflow), this increases the number of executions as well.

Definition of amount of executions
Many of our workflows require looping over arrays. I know that in the backend, these loops are handled as dynamically created child workflows. If we have a workflow where we loop over the owned roles of a single identity, and the identity has 40 owned roles, will this count as 41 executions (one for the parent workflow, 40 for the child workflow executions), or just as 1?

Passing the limit
What happens if we reach the limit? Will an error appear in the event logs for each failed workflow execution, perhaps such that we can retry them during a less busy day? How can we see which ones we missed and which workflows actually occurred?

Notification
We can get (email) notifications if sources get unhealthy.
Before this limitation you just announced goes into production (although I think this is already too late due to the late announcement), can we also get email notifications if we trigger this limit, as well as email notifications of increasing urgency as we approach it (say at 50%, 75%, and 90%)? If SailPoint just suddenly cuts off workflows at a certain limit, this could have critical effects, especially in production environments.

3 Likes

One way this is handled by some other platforms is through API credits. For periods where you’re well below your limit, you earn a certain number of credits, which you can bank for a specified duration. If you then get really bursty all of a sudden and blow past your limit, you automatically spend some of your credits to avoid being throttled. Once all credits are expended, though, throttle away. This would be a good way to handle @angelo_mekenkamp’s scenario.
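
As a purely illustrative sketch of that idea (not a SailPoint feature): unused daily quota accrues as credits with an expiry, and bursts spend the banked credits before throttling kicks in:

```python
from collections import deque

class CreditBank:
    """Illustrative credit model: unused daily quota is banked as credits
    that expire after a fixed number of days; bursts spend credits before
    throttling applies. Not a SailPoint feature - just the proposed idea."""

    def __init__(self, daily_quota: int, expiry_days: int = 30):
        self.daily_quota = daily_quota
        self.expiry_days = expiry_days
        self.credits = deque()  # (day_earned, amount) pairs, oldest first

    def end_of_day(self, day: int, used: int) -> None:
        """Bank today's unused quota and drop expired credits."""
        unused = max(0, self.daily_quota - used)
        if unused:
            self.credits.append((day, unused))
        while self.credits and self.credits[0][0] <= day - self.expiry_days:
            self.credits.popleft()

    def spend(self, needed: int) -> int:
        """Spend up to `needed` credits; return how many were covered.
        Anything not covered would be throttled as usual."""
        covered = 0
        while needed and self.credits:
            day, amount = self.credits[0]
            take = min(amount, needed)
            covered += take
            needed -= take
            if take == amount:
                self.credits.popleft()
            else:
                self.credits[0] = (day, amount - take)
        return covered
```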

It would also be good if there was a way to request burst capacity via a support ticket in situations where, like in his other scenario, we’re onboarding a new source or maybe processing a merger or acquisition and have a ton of new identities being onboarded as a one-time load. Put in a support ticket and ask for extra workflow capacity a few days in advance, maybe work with support to schedule the load for a certain time window, etc.

We as customers don’t have the data to back up our concerns here, obviously, and I’m sure that SailPoint has identified the biggest offenders here and ensured that this isn’t going to cause major disruption for the majority of folks. That said, it could be helpful for us if SailPoint could provide some statistics on how many executions the different-sized environments with 5, 25, 50, 100 workflows average on a typical day, and how many customers in those cohorts are exceeding these new limits and could be negatively impacted by this change.

1 Like

I have asked the workflows PM to weigh in on this feedback, but I would like to reiterate that Workflow executions will not be dropped. Executions over the daily limit will be put in a queue for slower processing.

As for the other points of concern, our workflows PM will provide feedback.

3 Likes

Hello Mark

We have an API, but it is not publicly available. Having said that, there are a number of safeguards in place to ensure you are notified.

  1. This limit was set higher than any burst loads observed over the past 12 months of monitoring.
  2. When a customer hits 200k, 1/2 of the existing limit, we are alerted internally and will initiate a reverse escalation support call.
  3. If you do hit the 400k, you will still execute workflows. Those workflows post-400k will run at a slower/throttled rate until a) the backlog burns down or b) the 24-hour reset applies.
  4. When a customer is in Rate Limit mode, there is a banner displayed on the Workflow Admin page.
  5. In addition to the banner, an audit message is generated which can be viewed in Search (see the sketch after this list).
  6. No events will ever be dropped. Even if a customer is consistently above the limit, events are stored and processed at the throttled rate.
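
For anyone wanting to automate on item 5, the audit message should be discoverable via the documented Search API (POST /v3/search). The query string below is a placeholder, since the exact event name emitted for rate limiting isn’t published here; confirm it in Search before relying on it:

```python
import requests

BASE = "https://{tenant}.api.identitynow.com"   # replace {tenant}
HEADERS = {"Authorization": "Bearer <token>"}    # replace <token>

# The endpoint and indices are the documented Search API; the query text
# is hypothetical - confirm the actual event name emitted for rate
# limiting by inspecting Search if your tenant enters rate-limit mode.
payload = {
    "indices": ["events"],
    "query": {"query": 'technicalName:"WORKFLOW_RATE_LIMIT*"'},  # placeholder
    "sort": ["-created"],
}

resp = requests.post(f"{BASE}/v3/search", headers=HEADERS, json=payload)
for event in resp.json():
    print(event.get("created"), event.get("name"))
```
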
1 Like

Hello Angelo, Please see my comments for each topic you raised:

Announcement too late
If organizations are already surpassing this limit, they are in for an unpleasant surprise, aren’t they?
TBReply:
No customers are hitting this limit. We have been tracking these metrics for the past 12 months to make a data-driven decision on where to set the limit. The 400k over 24 hours is significantly higher than anything we have tracked, so no customers will be in rate limiting.

Fairness
TBReply:
“Fairness” in this case is defined as not slowing other customers in the same region down due to one rogue set of workflows. This allows us to slow down ONLY the offending tenant and thus not affect other customers.

workflow execution metrics
TBReply:
We will be introducing Execution Logging improvements in the coming months.

We do have an internal API; however, we have not published it because no customers should hit this mark. If, however, a customer does have a large volume, we have internal monitoring/alerting and will be reverse escalating once the customer hits 200k.

Spikes
TBReply:
Again, we chose this number after 12 full months of monitoring executions, including burst/spike activities. These numbers are set higher than what we have observed for that reason.

Remember that no workflow execution event will ever be dropped. We are simply throttling the events to ensure fairness across other orgs in the region. If you have a spike of over 400k in a day, those events will still be processed, just at a slower rate, until a) the events clear, or b) the 400k limit is reset at the 24-hour mark.

Alternative improvements
TBReply:
Just as a reminder, this limit was chosen based on 12 full months of monitoring. The only times we saw event rates even close to this were when orgs had poorly designed workflows that were triggering unnecessarily, without trigger filtering in place.

Definition of amount of executions
TBReply:
This limit is strictly focused on the originating execution events. Downstream executions (child workflows) are not currently limited; however, we are gathering metrics/data on those and will apply limits in the future, mainly to protect downstream services from being overloaded.

Passing the limit
TBReply:
You will be notified through our support organization using a process called “Reverse Escalation” when you hit the 200k mark, 1/2 of the limit. We have alerting on our side so we can track any occurrence and be proactive. All events are stored, with zero dropped events, so that even if the 400k limit is hit, the events still process at a throttled rate of 5 per second until they are burned down or the 24-hour period resets the quota to 400k.

Notification
As mentioned above, we are alerting internally and will create a reverse escalation process through our support organization when you are at 50% of the 400k limit in a 24-hour period (200k). We are NOT cutting off workflows at all. Workflows will still execute and no events will be dropped.

1 Like

Hello Mark,

Yes, we have been carefully monitoring these metrics/data over the past 12 months. The numbers chosen for the rate limiting are based on those 12 months of data and are significantly higher than what we have monitored over that period. It is also an important reminder that events are not stopped at the 400k limit; they are merely slowed/throttled to 5 events per second until a) the events are all processed or b) the 24-hour period resets the quota to 400k.

1 Like

Thank you for the clarifying answers @tburt!

It is definitely good to know that once over the limit, executions will not get dropped but will instead be throttled to 5 invocations per second (similar to the API rate limit) and queued with slower execution times. Is there also a hardcoded invocation per second limit when still under the limit? 100 per second, something else?

And please make such announcements earlier, so that we can clarify the impact and address concerns like these before the change is live and, where applicable, make the required preparations. Even if the conclusion is that there will be no immediate impact, we will know that in advance, and we can inform SailPoint if we think otherwise.

@angelo_mekenkamp
" Is there also a hardcoded invocation per second limit when still under the limit? 100 per second, something else?"

No, they should execute instantaneously up to the 400k limit.

Thank you for the additional detail! I appreciate your responsiveness to the expressed concerns. I personally was not worried, as I know none of the environments I’ve ever touched would come anywhere near those numbers, but glad to see that you have used data to drive the thresholds for this as I assumed.

I think the biggest concern here was just the timing of the announcement. Since it came the same day as the limit, it seemed like this was an attempt to address an active issue and would be an impactful change to at least some customers.

Thanks for the clarifications!
-Mark

1 Like

Is there a limit to the queue?

Per 24 hours, each tenant can process at most 832,000 workflow executions (400,000 + (5 × 60 × 60 × 24) = 832,000). Given this, “if” there’s a tenant that goes beyond this daily, for whatever reason, that queue size will grow perpetually. I guess we’re not there yet to truly have to think about this? I’m guessing there’s going to be a pricing model to address this in the product roadmap, just like the number of workflows, number of steps in a workflow…etc…and everything else in the SaaS space.

Given the number of 400k, can I also loosely deduce that ISC currently doesn’t have a client / tenant with multi-million-user CIAM use cases? (And I don’t think ISC has CIAM in mind just yet, right?)

Lastly, at what time does this limit get reset? I’m guessing it’s relative to the tenant’s hosting region instead of every tenant globally all at once at GMT 00:00.

EDIT: Come to think of it…even within the same region, maybe each tenant has its own reset time (?)…a kind of load leveling in the multi-tenant environment.