The recording for this livestream can be found here:
Workflows may fail for a number of reasons. Your workflow logic may be flawed and produce an unforeseen error during an execution, a downstream API that you invoke may be unavailable, or the workflow service itself may have an outage. Actions that would have been automated by the failed workflow execution now require you to find and remediate them manually… or do they? Join us as SailPoint Developer Advocate Colin McKibben demonstrates how to retry failed workflows.
There are no plans for a single retry endpoint, but I am releasing the blog and Colab item that will accompany this livestream soon. The blog will demonstrate how you can use two endpoints to achieve the same functionality.
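Roughly speaking, the idea is to recover a failed execution's original input from ISC and feed it back into the workflow test endpoint. Here is a bare-bones Python sketch of that shape, assuming you already have the ID of the failed execution; the endpoint paths, payload shapes, and where the original input lives in the history are my working assumptions here, and the blog will walk through the exact calls:

```python
import requests

BASE_URL = "https://{tenant}.api.identitynow.com"  # replace {tenant} with your tenant
HEADERS = {"Authorization": "Bearer <access_token>"}

WORKFLOW_ID = "<workflow-id>"    # the workflow that failed
EXECUTION_ID = "<execution-id>"  # the failed execution you want to retry

# 1. Pull the failed execution's history to recover the original trigger input.
#    (Endpoint path and response shape are assumptions; the blog will show the exact call.)
history = requests.get(
    f"{BASE_URL}/beta/workflow-executions/{EXECUTION_ID}/history",
    headers=HEADERS,
).json()
start_event = next(
    (e for e in history if "input" in (e.get("attributes") or {})), None
)
if start_event is None:
    raise SystemExit("Could not recover the original input from the execution history")
original_input = start_event["attributes"]["input"]

# 2. Re-run the workflow with the recovered input via the test endpoint.
requests.post(
    f"{BASE_URL}/beta/workflows/{WORKFLOW_ID}/test",
    headers=HEADERS,
    json={"input": original_input},
)
```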
Sources can perform API calls. We can configure which error types should be retried, and the source will then retry the call up to the number of retries we configured.
I would prefer it if workflows also had built-in retry mechanisms for the APIs they call. If an endpoint is temporarily unavailable, it would be nice if the workflow could retry a configurable number of times with configurable wait intervals.
To me, SailPoint providing built-in retry functionality is preferable to SailPoint providing a workaround that circumvents this missing functionality.
Of course the livestream is not available yet, but I would like to offer my thoughts in advance, as perhaps parts of them can be covered in the presentation where applicable:
If the workaround needs to call two endpoints to provide an automatic retry mechanism for a different, potentially failing action within our workflow, the workflow's success now depends on those two endpoints not failing, which merely shifts the failure risk from one endpoint to another.
If the workaround you provide is not to retry a single action but to retry the whole workflow, there is also the issue of the workflow not being atomic, which is usually the case. For example, if the workflow creates an object (a certification campaign, an access request, a role, etc.) and then fails to activate/modify/use that object, a full retry would try to create the same object again, which could fail because the object already exists, or could send the workflow down a different logic branch than originally intended.
A possible outage of the workflow service itself is also mentioned. If this occurs, I can imagine two things happening: currently running workflows failing, and new workflows not being started when their trigger occurs. I wonder how you can retry the latter case if the workflow never even started when the trigger occurred. How could we then fetch the input to retry the workflow without manually finding and remediating it? Can we call those two endpoints to retry all failed and never-started workflow executions across all workflows that failed within the last hour?
This is excellent feedback, and I will incorporate these talking points into my blog post.
A dedicated retry feature is definitely preferable to a workaround. I know the workflow team is designing better error handling features at both the workflow and individual action levels to help with this, but that won’t cover the case where you want to retry an entire workflow execution.
If the APIs needed to retry the workflow are temporarily unavailable, you could just wait until they are available again, or create a custom solution that keeps retrying until they are available.
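As a rough illustration, that custom solution could be as small as a retry helper with a wait between attempts. This is purely a sketch with placeholder values, not a prescribed implementation:

```python
import time
import requests

def call_with_retries(url, headers, max_attempts=5, wait_seconds=30):
    """Call an endpoint, retrying on transient failures with a fixed wait between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, headers=headers, timeout=30)
            # Retry only on server-side errors; anything else is returned to the caller.
            if response.status_code < 500:
                return response
        except (requests.ConnectionError, requests.Timeout):
            pass  # endpoint unreachable or timed out; treat as retryable
        if attempt < max_attempts:
            time.sleep(wait_seconds)
    raise RuntimeError(f"{url} was still unavailable after {max_attempts} attempts")
```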
The solution I provide is to retry the entire workflow. This is a point that I’m going to make clear in my blog. If you have a workflow execution that partially succeeded and performed some sort of action, like a create, update, or delete, then you probably don’t want to retry the entire workflow again. However, if the workflow failed early without performing any actions, then retrying is acceptable. Partially retrying a workflow just doesn’t seem very feasible. I think the error handling features that the workflow team is working on will provide better facilities for handling errors in actions to reduce the need for retrying the entire workflow.
If the execution starts, then it most likely can be retried. If the service is down and the execution for an event never even started, then you can’t retry it. The only recourse is to somehow trigger the event yourself, or create a script that queries the appropriate data from ISC and then crafts the trigger input that can be used to retry the workflow. For example, if the event is “Identity Created”, you wouldn’t want to delete those identities and then run the aggregation again. Instead, you could run a search query for identities created within the outage window, craft an object that matches the input of the “Identity Created” trigger, and then invoke the test workflow API with those inputs.
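Here is a rough sketch of that idea in Python. The search call uses the v3 Search API, but the trigger input I construct and the test endpoint path are illustrative assumptions, so check the “Identity Created” trigger’s actual input schema before relying on something like this:

```python
import requests

BASE_URL = "https://{tenant}.api.identitynow.com"  # replace {tenant} with your tenant
HEADERS = {"Authorization": "Bearer <access_token>"}
WORKFLOW_ID = "<workflow-id>"  # the workflow normally fired by the Identity Created trigger

# 1. Find identities created during the outage window using the v3 Search API
#    (adjust the dates to your actual outage window).
search_body = {
    "indices": ["identities"],
    "query": {"query": "created:[2024-05-01 TO 2024-05-02]"},
}
identities = requests.post(
    f"{BASE_URL}/v3/search", headers=HEADERS, json=search_body
).json()

# 2. For each identity, craft an input that mimics the Identity Created trigger payload
#    (field names here are assumptions; match them to the trigger's real schema)
#    and re-run the workflow via the test endpoint.
for identity in identities:
    trigger_input = {
        "identity": {
            "id": identity["id"],
            "name": identity["name"],
            "type": "IDENTITY",
        },
        "attributes": identity.get("attributes", {}),
    }
    requests.post(
        f"{BASE_URL}/beta/workflows/{WORKFLOW_ID}/test",
        headers=HEADERS,
        json={"input": trigger_input},
    )
```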
Thank you @colin_mckibben for the crystal-clear answers on what your suggested approach can and cannot do. That is very helpful!
I think I know which endpoints you are going to utilize. Looking forward to the livestream!
To ensure that retrying workflows is also possible for less tech-savvy people who prefer not to work with APIs, it might be nice if SailPoint added a ‘Retry’ button to the execution history in the workflows UI that retries the whole workflow with the same input. At a later stage, we might even be able to retry the workflow from the specific node that previously failed.
I love your suggested UI placement. Hoping to see something like that eventually. I am comfortable with the APIs, but I do take time off, and I can’t expect anyone else on my team to follow a more complex workaround like that should an outage or a large number of failures occur.
Thanks for putting this together! Do you happen to have an estimated date for the Colab post? This is a pressing item on one of my implementations and we are determining if we should begin creating a solution of our own or wait to see what you have in store. (I am guessing this would be related to your previous forum posts on the matter)
I have posted the blog and corresponding Colab workflow. A big thank you to @angelo_mekenkamp for forcing me to think through the side effects of retrying executions. I added a section about “Safely retrying workflows” to share my thoughts on how it can be done.