Late Friday afternoon, my ServiceNow team was working with the ServiceNow Engineering team to troubleshoot an issue that took down the ServiceNow API host. After some investigation, the ServiceNow engineer determined that the cause of the issue was the SailPoint Identity Governance Connector. Has anyone run into this issue before, or also experienced it on 03/08/2024?
This resulted in users not being able to submit tickets within ServiceNow. I am starting to dig into the logs from the VA cluster to better understand what happened.
When I reviewed the ccg logs and compared multiple days of data against this date, nothing seemed out of the norm. I did not see the large number of exceptions I was originally expecting. Since this is our production ServiceNow instance, I will not be able to share the logs due to sensitive data about my user base. The issue seemed to resolve itself late on 03/08/2024, and we have not seen it since.
The only exceptions I am getting are "user not found" and "insufficient permissions", which in my case are normal based on how we have the connection set up on the ServiceNow side. I did not see any connectivity errors at all when I reviewed the 1.4 GB of data from our VA cluster.
Here are the results of our investigation into what caused this issue:
Start Time: 2020-03-18 10:46 am US/Central
End Time: 2020-03-18 03:00 pm US/Central
Affected Systems: ssmhcprod
Symptoms: Users were unable to submit items via the portal; the integration experienced 429 rejections.
Business Impact: Critical
Root Cause of the issue: It was identified that SailPoint initiated several inbound transactions from ~10:46 onward, several of which took minutes to execute, while the later transactions started backing up. Compared to each day this week and last week, this was an unusual event. Monroe explained that SailPoint aggregation was started around 10:26 that morning, after it had been stopped for a couple of days. This increased the load on the ServiceNow API and flooded the available API threads, causing regular transactions to backlog (the queue depth filled up to 50); all other incoming transactions received 429 responses.
We reviewed non-SailPoint API calls and found that the other large ones were regular and not out of the usual, and several were running together.
Relief Measure:
The system recovered on its own. The 429 responses are designed to avoid overloading and other issues, and they worked as expected. The queue continued processing and cleared up.
Solution Provided:
SailPoint needs to throttle when such a large incoming load is introduced; perhaps such a process should run during off-hours.
Preventative Measure(s):
Team to plan ahead when such actions are performed by SailPoint. They are one-off but can have an impact on limited API capacity, or SailPoint needs to throttle against API_INT (Integration Semaphores, 4 per node).
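For anyone looking at a client-side mitigation while waiting on a connector-level fix, the sketch below illustrates the kind of throttling and 429 handling the ServiceNow engineers described: cap the request rate so the integration semaphores are not exhausted, and back off (honoring Retry-After) when a 429 does come back. This is only a rough illustration, not the connector's actual code; the instance URL, credentials, table, rate limit, and retry counts are placeholders you would tune to your own environment.

```python
import time
import requests

# Hypothetical values for illustration only; tune to your instance's capacity.
INSTANCE_URL = "https://example.service-now.com"   # placeholder instance
TABLE_PATH = "/api/now/table/sys_user"             # standard Table API endpoint
MAX_REQUESTS_PER_SECOND = 2                        # stay well under the semaphore limit
MAX_RETRIES = 5

session = requests.Session()
session.auth = ("integration_user", "password")    # placeholder credentials
session.headers.update({"Accept": "application/json"})


def throttled_get(params, retries=MAX_RETRIES):
    """GET with a fixed request rate and backoff on HTTP 429."""
    for attempt in range(retries):
        # Simple fixed-rate throttle: never exceed MAX_REQUESTS_PER_SECOND.
        time.sleep(1.0 / MAX_REQUESTS_PER_SECOND)
        resp = session.get(INSTANCE_URL + TABLE_PATH, params=params)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Honor Retry-After if the instance sends it; otherwise back off exponentially.
        wait = int(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError("Gave up after repeated 429 responses")


# Example: page through users 100 at a time instead of issuing one huge burst.
offset = 0
while True:
    page = throttled_get({"sysparm_limit": 100, "sysparm_offset": offset})
    if not page.get("result"):
        break
    offset += 100
```

The point is simply that a paced, paginated aggregation spreads the load over time, whereas an unthrottled run after a multi-day gap dumps everything onto the API at once, which is what filled the queue and triggered the 429s in our case.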