How to Retry Workflows in Identity Security Cloud

Introduction

Identity Security Cloud provides a service called Workflows, which allows you to execute custom functionality in response to events that occur in your system. As with any automation, you may encounter issues with your workflows that cause them to fail, such as a bug in the logic, a downstream API being unavailable, or the Workflows service itself being unavailable. It is important to be able to recover from these failures and retry the workflow so that you do not have to manually perform the steps that would have been automated.

In this blog, I will demonstrate how to leverage the Workflow APIs to detect failed workflows and retry them with the correct inputs so that you don’t have to manually resolve failed executions. If you would like to understand the underlying concepts so that you can implement your own retry logic, please continue reading this blog. However, if you just want a working solution that uses forms and workflows to retry failed executions, then please see this Colab post for more details.

Detecting failed executions

Filtering by startTime

Before a failed workflow execution can be retried, you first need to identify the executions that failed. ISC provides two APIs that assist in this process. First, use the list workflows endpoint to get a list of all workflows in your tenant. You will need the id of each workflow that you want to inspect for failed executions. Once you have the id of a workflow, use the list workflow executions endpoint to get a list of its executions that have failed.

https://sailpoint.api.identitynow.com/v2024/workflows/:id/executions?filters=status eq "Failed"

This query will return all executions of the workflow, for the past 90 days, that have failed. It will sort the results by startTime in descending order, so the most recent failure will be first. Depending on the workflow, there could be a lot of executions (e.g., thousands). Since this endpoint will only return 250 executions per API call, you will need to use pagination to ensure you identify every execution that failed.

This query is likely not what you want, though, since it will return all failed executions. In a real-world scenario, you will likely only want to retry failed executions within a certain date range to avoid retrying executions that you already retried. For this use case, you can add additional conditions to the filter.

https://sailpoint.api.identitynow.com/v2024/workflows/:id/executions?filters=status eq "Failed" and start_time ge "2024-02-07T08:00:00Z" and start_time le "2024-02-07T09:00:00Z"
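
Because the filter contains spaces and double quotes, it generally needs to be percent-encoded before it is sent. Below is a minimal Python sketch of building that URL. It assumes the tenant base URL used in the examples in this post, and build_failed_executions_url is a hypothetical helper name, not part of any SailPoint SDK.

```python
from urllib.parse import quote

# Base URL copied from the examples in this post; replace with your tenant's URL.
BASE_URL = "https://sailpoint.api.identitynow.com/v2024"


def build_failed_executions_url(workflow_id: str, start: str, end: str) -> str:
    """Build the list-executions URL for failures that started within [start, end]."""
    filters = (
        f'status eq "Failed" and start_time ge "{start}" '
        f'and start_time le "{end}"'
    )
    # Percent-encode spaces, quotes, and colons so the whole filter
    # travels as a single query parameter.
    return f"{BASE_URL}/workflows/{workflow_id}/executions?filters={quote(filters)}"
```

Many HTTP clients will do this encoding for you if you pass the filter as a query parameter instead of building the URL by hand.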

This query returns only the failed executions that started within your specified date range. You will still need to paginate through the API if there are more than 250 failed executions within that range. To paginate, simply increment the offset query parameter by 250 each time you call the API until you receive a response with fewer than 250 executions.

API call 1
https://sailpoint.api.identitynow.com/v2024/workflows/:id/executions?filters=status eq "Failed" and start_time ge "2024-02-07T08:00:00Z" and start_time le "2024-02-07T09:00:00Z"&offset=0

API call 2
https://sailpoint.api.identitynow.com/v2024/workflows/:id/executions?filters=status eq "Failed" and start_time ge "2024-02-07T08:00:00Z" and start_time le "2024-02-07T09:00:00Z"&offset=250
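
The pagination loop described above could be sketched in Python as follows. This is only an illustration: fetch_page is a hypothetical function that you would supply, which calls the list workflow executions endpoint with the given offset and returns the parsed JSON array.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 250  # executions returned per call, as described in this post


def fetch_all_failed(fetch_page: Callable[[int], List[Dict]]) -> List[Dict]:
    """Collect every page of failed executions.

    fetch_page(offset) is a hypothetical helper: it should call the list
    workflow executions endpoint with that offset and return the parsed list.
    """
    executions: List[Dict] = []
    offset = 0
    while True:
        page = fetch_page(offset)
        executions.extend(page)
        if len(page) < PAGE_SIZE:  # a short (or empty) page means we are done
            return executions
        offset += PAGE_SIZE
```

Passing the fetcher in as an argument keeps the loop independent of whichever HTTP client and authentication method you use.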

Get the original input of an execution

Before you can retry a workflow execution, you need to know the original input that was used to invoke the execution. The input can vary depending on the trigger used for the workflow and details of the specific event that triggered the workflow. The get workflow execution history endpoint gets the details of an individual execution, including the trigger input. To invoke this endpoint, you need to use the execution ID that you obtained from the list executions endpoint.

https://sailpoint.api.identitynow.com/v2024/workflow-executions/:id/history

The response will contain an array of objects that informs you of the details of every step in the workflow. You can ignore most of this information. The particular object that you will need has a type of WorkflowExecutionStarted. This object will contain an attributes object which contains the trigger input that was used to invoke the failed execution.

[
    {
        "type": "WorkflowExecutionStarted",
        "timestamp": "2024-08-13T13:14:42.914349209Z",
        "attributes": {
            "input": {
                "actor": {
                    "id": "ee769173319b41d19ccec6cea52f237b",
                    "name": "john.doe",
                    "type": "IDENTITY"
                },
                "connector": "active-directory",
                "created": "2021-03-29T22:01:50.474Z",
                "id": "2c9180866166b5b0016167c32ef31a66",
                "name": "Test source",
                "type": "DIRECT_CONNECT"
            }
        }
    },
    ...
    ...
]

Save the value of attributes for use in the next step. In the example above, you would save the following value:

{
    "input": {
        "actor": {
            "id": "ee769173319b41d19ccec6cea52f237b",
            "name": "john.doe",
            "type": "IDENTITY"
        },
        "connector": "active-directory",
        "created": "2021-03-29T22:01:50.474Z",
        "id": "2c9180866166b5b0016167c32ef31a66",
        "name": "Test source",
        "type": "DIRECT_CONNECT"
    }
}

Do this for every execution that you want to retry.
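
Pulling the trigger input out of the history response can be automated. Here is a small Python sketch based on the response shape shown above. The function name extract_trigger_input is my own, and history is assumed to be the already-parsed JSON array.

```python
from typing import Any, Dict, List, Optional


def extract_trigger_input(history: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the attributes of the WorkflowExecutionStarted event, or None if absent."""
    for event in history:
        if event.get("type") == "WorkflowExecutionStarted":
            # attributes holds the {"input": {...}} object to save for the retry
            return event.get("attributes")
    return None
```

Returning None instead of raising lets the calling program skip, or log, any execution whose history does not contain a start event.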

Retry an execution

Once you have the original input of the workflow execution you want to retry, you can invoke the workflow with the desired input using the test workflow endpoint. You will need to supply the ID of the workflow that you want to trigger along with the input payload. An example invocation of this endpoint would look like this:

POST https://sailpoint.api.identitynow.com/v2024/workflows/:id/test

Request body

{
    "input": {
        "actor": {
            "id": "ee769173319b41d19ccec6cea52f237b",
            "name": "john.doe",
            "type": "IDENTITY"
        },
        "connector": "active-directory",
        "created": "2021-03-29T22:01:50.474Z",
        "id": "2c9180866166b5b0016167c32ef31a66",
        "name": "Test source",
        "type": "DIRECT_CONNECT"
    }
}

This will invoke the workflow with the provided input. As long as you have resolved whatever issue caused the workflow to fail, be it a logic error or a downstream service being unavailable, the workflow should succeed this time.
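
As a rough illustration, the request above could be built with Python's standard library like this. The sketch assumes you already have an access token for your tenant and that it is sent as a Bearer token. The helper name build_retry_request is hypothetical, and the function only builds the request without sending it.

```python
import json
import urllib.request

# Base URL copied from the examples in this post; replace with your tenant's URL.
BASE_URL = "https://sailpoint.api.identitynow.com/v2024"


def build_retry_request(workflow_id: str, attributes: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) the POST request that re-invokes a workflow.

    attributes is the saved {"input": {...}} object from the execution history.
    token is assumed to be a valid access token for the tenant.
    """
    return urllib.request.Request(
        url=f"{BASE_URL}/workflows/{workflow_id}/test",
        data=json.dumps(attributes).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # assumption: Bearer token auth
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To actually send it, you would pass the returned object to urllib.request.urlopen, or translate the same URL, headers, and body to the HTTP client you prefer.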

Accounting for delayed executions

You may have workflows that use a wait step to pause execution for hours or days, or execution may be delayed for other reasons. If one or more of these delayed executions fails, then you may miss it when retrying failed executions. For example, let’s say you want to retry executions that started between 8 AM and 12 PM on a certain day. You run your API queries and they return 50 failed executions. You retry the failed executions and they work. However, if the workflow in question uses a dynamic wait action that caused some of the executions to delay until the next day, and a few of them failed, then you will have missed them. Since the above approach uses the startTime to filter failed executions, you can’t run the same query again without also retrying the 50 executions that you already retried. Since the workflow executions endpoint doesn’t support closeTime as a filterable property at this time, we have to take a different approach.

You can modify your program to keep track of the executions it retried in a local file or database. Each time an execution is retried, the program should save the execution ID in a local file. When running the program with the same start and end date, the program could compare the executions it retrieved from the API with the ones it saved in the local file. If it can’t find the execution in the local file, then it’s because the execution was delayed and only now finished with a failure status. The program can then retry these delayed executions and add them to the local file. This approach would allow you to keep retrying the same date range without the risk of retrying an execution that you already retried.
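
One possible way to implement this bookkeeping is a plain text file with one execution ID per line. The Python sketch below is only an illustration: the function names and the one-ID-per-line file format are my own choices, and a database would work just as well.

```python
from pathlib import Path
from typing import Iterable, List, Set


def load_retried(path: Path) -> Set[str]:
    """Read the execution IDs that were already retried (one per line)."""
    if not path.exists():
        return set()
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def select_new_failures(failed_ids: Iterable[str], path: Path) -> List[str]:
    """Keep only the failed execution IDs that are not yet in the local file."""
    already_retried = load_retried(path)
    return [eid for eid in failed_ids if eid not in already_retried]


def record_retried(execution_id: str, path: Path) -> None:
    """Append an execution ID to the local file after retrying it."""
    with path.open("a") as f:
        f.write(execution_id + "\n")
```

On each run, you would call select_new_failures with the IDs returned by the API, retry only those, and call record_retried after each retry.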

Safely retrying workflows

If the workflow you are retrying performs any modifications to data (create, update, or delete), either within SailPoint or in an external system, additional precautions should be taken to avoid additional failures or unintended side effects. For example, if the workflow creates a certification campaign, and the workflow execution failed on a step that comes after the creation of the cert campaign, then retrying the workflow may result in two certification campaigns being created. However, not all modification actions have negative impacts. For example, sending an email is considered a modification action, but retrying a workflow execution that ends up sending the same email to an admin may be OK. It is up to you to decide which modification actions should not be executed twice and to take the necessary precautions when retrying executions. This section describes how to categorize safe vs unsafe failed executions as well as a method for identifying which executions are unsafe.

Categorizing safe and unsafe executions

Safe executions may have one or more of the following characteristics:

  • The execution did not perform any modification actions before it failed (i.e. it only queried data).
  • The execution did not perform any modification actions that could result in negative consequences if performed again. This includes sending an email to a person that is aware they may receive another email, or updating a role where running the update again won’t change the final state of the role.
  • The workflow has built-in error handling to make sure that modification actions can’t be retried. For example, the workflow could get the current state of a role and check if it already contains the necessary changes. If it doesn’t contain the necessary changes, then it will perform the update. This type of error handling will ensure that a retry of the workflow won’t result in a duplicate update action.

Unsafe executions may have one or more of the following characteristics:

  • The execution performed one or more modifications that will cause issues in the target systems if retried. This includes submitting access requests, creating cert campaigns, and sending emails to people who aren’t expecting another email.
  • The execution performed one or more safe modifications that, if retried, would result in an error that fails the execution again. For example, if the execution deleted a role, then trying to delete the role again would not cause any issues in the target system, but the workflow would fail again because the role no longer exists. This could be mitigated by performing a comparison to check if the role exists or not.

Handling unsafe executions

If you are creating a program to retry failed executions, you can make it safer by identifying executions that are potentially unsafe and not automatically retrying them. You can then provide this list of unsafe executions to the user for further review before attempting to retry them. The following method can be used to identify unsafe executions.

Get the execution history

Given a workflow execution ID, use the ID to query the execution history. This history includes a complete listing of all the actions that were completed along with the action that failed.

Identify any unsafe actions that completed

The execution history contains a lot of data, but each action will have three events.

  • ActivityTaskScheduled means the action was queued for execution. It can contain useful information, like the method type of an HTTP Request.
  • ActivityTaskStarted means the action has started executing. You can ignore this.
  • ActivityTaskCompleted means the action completed without error.

Loop through all of the completed actions and programmatically identify any actions that could be unsafe. The ActivityTaskCompleted object will contain attributes such as the action name and the input/output attributes. For the out-of-the-box workflow actions, you will need to create a list of unsafe actions, such as Create Certification Campaign, Manage Accounts, and Manage Access, to name a few. For the HTTP Request action, consider it unsafe if it uses the POST, PATCH, PUT, or DELETE method.

If one or more unsafe actions completed in a failed execution, then your program should not attempt to retry the execution. Instead, it could log the unsafe executions to a file for further review by an admin. If the admin determines they are safe to retry, then your program could retry the safe executions.
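
The check described above might look something like the following Python sketch. Please treat the field names as assumptions: I am assuming each event carries attributes.stepName to tie the scheduled and completed events of a step together, attributes.displayName for the action name, and attributes.input.method for the HTTP method. Inspect a real history response from your tenant and adjust the keys accordingly.

```python
from typing import Any, Dict, List

# Example lists from this post; extend them for your own workflows.
UNSAFE_ACTIONS = {"Create Certification Campaign", "Manage Accounts", "Manage Access"}
UNSAFE_HTTP_METHODS = {"POST", "PATCH", "PUT", "DELETE"}


def find_unsafe_steps(history: List[Dict[str, Any]]) -> List[Any]:
    """Return the steps that completed and look unsafe to run a second time.

    ASSUMED field names (verify against a real response): attributes.stepName,
    attributes.displayName, and attributes.input.method.
    """
    scheduled_methods: Dict[Any, str] = {}  # HTTP method seen when a step was scheduled
    unsafe: List[Any] = []
    for event in history:
        attrs = event.get("attributes") or {}
        step = attrs.get("stepName")
        if event.get("type") == "ActivityTaskScheduled":
            method = (attrs.get("input") or {}).get("method")
            if method:
                scheduled_methods[step] = str(method).upper()
        elif event.get("type") == "ActivityTaskCompleted":
            if attrs.get("displayName") in UNSAFE_ACTIONS:
                unsafe.append(step)
            elif scheduled_methods.get(step) in UNSAFE_HTTP_METHODS:
                unsafe.append(step)
    return unsafe
```

An empty result suggests the execution can be retried automatically. Anything else is a candidate for the admin review file.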

Simulating events

In some cases, it may be desirable to simulate event triggers in order to trigger a workflow. For example, if you build a brand new workflow that uses an Identity Created trigger and want to run that workflow on previous identities, then simulating the events will help you achieve this. Alternatively, if the Workflow service is temporarily unavailable and some events were never processed, you may wish to simulate those events once the Workflow service is available again so that the workflow can process them.

Query the data

To simulate an event trigger, you will need to leverage the SailPoint APIs to extract the necessary data as well as write a program that can transform the extracted data to the exact input produced by the event trigger in question. To demonstrate this technique, let’s assume you have just built a workflow that uses the Identity Created event trigger, and you want to simulate events for identities that were created in the last month. You can start by crafting a search query that will fetch the identities created in the last month along with their relevant attributes.

POST /v2024/search?limit=10000

Request body

{
    "indices": [
        "identities"
    ],
    "query": {
        "query": "created:[now-1M TO now]"
    },
    "queryResultFilter": {
        "includes": [
            "id",
            "name",
            "attributes"
        ]
    }
}

Transform the data

Next, you will need to write a program in your favorite language to transform the data into the same input structure that the event trigger will provide. You can click here to find a list of event triggers and their respective inputs. The Identity Created trigger will produce this input:

{
  "identity": {
    "type": "IDENTITY",
    "id": "2c91808568c529c60168cca6f90c1313",
    "name": "William Wilson"
  },
  "attributes": {
    "firstname": "John"
  }
}

The program needs to loop over every identity returned by the search query and transform the data into the structure above. Fortunately, the search query gives us all the data we need, and transforming it is relatively simple. An example python script might look like this:

def transform_identities(identities):
    """Reshape search results into the Identity Created trigger input."""
    transformed_identities = []
    for identity in identities:
        transformed_identities.append({
            "identity": {
                "id": identity["id"],
                "type": "IDENTITY",
                "name": identity["name"]
            },
            # The search result already carries the identity attributes
            "attributes": identity["attributes"]
        })
    return transformed_identities


# You will need to build getIdentities() to execute the search query
transformed_identities = transform_identities(getIdentities())

Invoke the workflow

The final step is to invoke the test workflow endpoint for the newly created workflow for each identity returned by the search query. You can invoke this endpoint in the same script you used to fetch and transform the identity data. The transformed identity data will look like it came from the event trigger service itself, but they are actually simulated events.
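
A sketch of that final loop in Python is shown below. It assumes the test workflow endpoint expects each simulated event wrapped in the same "input" envelope used in the retry example earlier in this post. invoke is a hypothetical function you would supply that POSTs one request body to the endpoint and raises an exception on failure.

```python
from typing import Any, Callable, Dict, Iterable, List


def simulate_events(
    payloads: Iterable[Dict[str, Any]],
    invoke: Callable[[Dict[str, Any]], Any],
) -> List[Dict[str, Any]]:
    """Invoke the workflow once per simulated event; return payloads that failed to send.

    invoke(body) is a hypothetical helper that POSTs body to the test workflow
    endpoint of the target workflow and raises if the call fails.
    """
    failures: List[Dict[str, Any]] = []
    for payload in payloads:
        try:
            # Assumption: the body uses the {"input": ...} envelope
            invoke({"input": payload})
        except Exception:
            # Broad catch on purpose: keep going and report failures at the end
            failures.append(payload)
    return failures
```

Collecting the failures instead of stopping at the first error means one bad identity does not block the rest of the batch.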

Handling different scenarios

Now that you know the process for retrying a failed workflow, let’s examine the different scenarios in which we would want to retry a workflow.

Logic error

Let’s say you build a workflow that passed your initial testing and you enable it in production, only to find that there are edge cases in production that cause the workflow to fail. Once you identify the issue and update the workflow logic to handle the edge cases, you may wish to retry those failed workflows so that you don’t have to manually trigger them again or manually fulfill the tasks. In this scenario, you would identify the date range in which the failures occurred, which may be a few minutes to a few hours or days. Set your start date and your end date and run your retry logic. The retry logic will only retry the workflows that failed in that time period, providing the same input to your newly updated workflow, which should now succeed.

If the retries fail again and you have to further update your workflow, simply set the start date to when you ran the retries and the end date to be when the last workflow execution started. Since these were run in a batch, it might only be a few seconds between the start and end date.

Downstream API error

If you are using the HTTP Request action to invoke API services, those API services may experience downtime that causes your workflow to fail. You can wait for the downstream API to become available again before retrying the executions. Just make note of when the executions started failing and when they began working again, and that will be the date range you use to retry the workflow executions.

Workflow service is unavailable

If the Workflow service itself experiences issues that prevent workflows from running successfully, you can follow the same procedure as Downstream API error. Wait for the Workflow service to become available again before retrying the executions. Just make note of when the executions started failing and when they began working again, and that will be the date range you use to retry the workflow executions.

It is important to note that if the execution never started because the Workflow service was unavailable, then your only recourse is to simulate the event as described in the previous section.

When to NOT retry an execution

While there are many reasons you may want to retry executions of a workflow, there are also a few reasons why you would not want to retry executions. The primary reasons are as follows:

  • Do not retry the same date range multiple times. This can lead to executions succeeding more than one time, resulting in potential issues such as sending out duplicate emails, creating duplicate certification campaigns, etc. If you are retrying a single execution, then you can retry it until it succeeds, but if you are retrying a date range, then always increment the start/end date to avoid retrying successful attempts.
  • Do not automatically retry executions that were partially completed with actions that are unsafe. For example, if your workflow creates cert campaigns, submits access requests, or performs any other modification to ISC or external systems, and the executions are known to have completed any of those actions before failing, you may not want to retry the workflow. Retrying might cause those actions to be performed again, creating duplicate objects or failing the workflow again. You can read more about unsafe executions and how to handle them in the section entitled “Safely retrying workflows”.
  • Do not retry executions that were intentionally stopped by you or SailPoint. This may occur if, for example, an event triggered the unwanted deletion of a batch of identities, or if a massive number of executions were cancelled due to performance issues. These events are probably rare, but they are worth considering when retrying executions so as not to cause the same issue again.

Example application

To help demonstrate the process of retrying workflow executions, I have created the following proof-of-concept workflow and form. It is important to note that this workflow does not safely retry executions. Given a start and end date, it will retry all failed executions within the date range, which may cause issues if a particular execution successfully completed one or more modification actions. Please exercise caution when using this in your ISC tenant.

Great stuff @colin_mckibben :slight_smile:

Some remarks from my side:

If you actually want to include the exact time of the end date, I would recommend using le (less than or equal), instead of adding a second and using lt (less than, but not equal)

If you call the endpoint like this, you do get a lot of unneeded data which did not need to be transferred to the client. Why don’t you write the filter such that it takes both end points into account?
Assuming every day you want to include the exact timestamp of begin_date itself and exclude the exact timestamp of the end_date itself (as that one will be included when calling the script the day after), you can do:

https://sailpoint.api.identitynow.com/v2024/workflows/:id/executions?filters=status eq "Failed" and start_time ge "2024-09-03T00:00:00Z" and start_time lt "2024-09-04T00:00:00Z"

In this way, you don’t have to iterate through the results to compare them to the begin_date, which, in your pseudo code, means that you can get rid of a complete for loop.

Also note that you are comparing with the start_time. However, if each day we retry the previous day’s failed executions, and we compare using the start_time, chances are that you will miss some failed executions. Since workflow executions could take a few days to execute (months even), it could be that you are retrying the executions while one of them is still pending and has not failed yet. It can become difficult to spot those later without also accidentally retrying the previously retried workflows again.

Instead, I recommend comparing to the end_time. If every day you retry the executions that failed (and ended) the previous day, you guarantee that you are not missing any executions.

To me the name test for the endpoint /workflows/:id/test is the wrong name. You don’t use this API necessarily for testing. You use it to manually trigger the workflow. Calling it test-workflow gives the suggestion that it does not actually perform the actions, that the actions performed are not going to be stored for audit purposes, and that it should not be used for actual triggering of the workflow. I would suggest renaming it to /workflows/:id/execute or something similar instead.

And last but not least: I have mentioned it in a different post, but I think it is worth writing here as well. I think that SailPoint should offer built-in retry mechanisms for workflows. If an HTTP action fails with a timeout error, for example, we should be able to tell ISC to wait for a bit, retry the action itself, and then continue the workflow from that point rather than retry the whole workflow, which can become impractical if we have a complex workflow. It would also be a big improvement if we could point an action to a previous action in workflows, without having to call another workflow, so that we get some kind of while loop: we could call an API to fetch a status, check afterwards whether the status is pending, and if so, call the same action again until it returns success or failure. If functionality like that were available, we wouldn’t need to depend on complex workarounds and we could build even greater things with it. :slight_smile:

Kind regards,
Angelo

This is great feedback. I totally missed the le and ge operators, and have reworked the blog accordingly. Same with pagination. I added a section on paginating.

For the delayed workflow execution issue, I also added a section with one possible approach. There are probably other ways to handle it, but I just addressed one potential solution.