Stuck aggregation tasks - Monitoring

Which IIQ version are you inquiring about?

8.4 P1

Share all details about your problem, including any error messages you may have received.

We have observed that aggregation and identity refresh tasks, particularly those configured with partitioning, occasionally become stuck. These tasks appear to be running but do not progress: the number of entries processed remains static, and no errors are logged. This behaviour disrupts downstream processes and requires manual intervention.

The issue is intermittent and typically affects partitioned tasks.
The task status remains in the “Running” state indefinitely.

Do we have any built-in mechanism in IdentityIQ to detect and recover from stalled tasks automatically?

One approach we have explored is to monitor long-running tasks with a scheduled Run Rule task and, if a run exceeds the expected average duration, alert the team by email.

Hi @vinnysail,

You’re correct that IdentityIQ doesn’t have a fully “built-in” automated detection and recovery mechanism for this specific scenario. The TaskResult object’s status changing to Failed or Terminated is the primary built-in indicator, but in your case it remains Running indefinitely.

Your approach of monitoring long-running tasks via a scheduled rule is a good starting point and a common practice. Let’s delve deeper into potential causes and more robust detection/recovery strategies.

Understanding Why Partitioned Tasks Get “Stuck” (Common Causes)

When partitioned tasks hang without errors, it’s often due to:

Database Issues:

  • Deadlocks/Lock Contention: One or more partitions might be waiting for a database lock held by another process (IIQ or external), leading to a deadlock.
  • Connection Exhaustion/Stalling: The database connection pool might be exhausted, or a database connection might have silently dropped/stalled for a specific partition thread.
  • Slow Queries: A particular data set within a partition might hit a very inefficient query, causing it to run excessively long without erroring out.

Application Server (JVM) Issues:

  • Thread Starvation/Deadlock: A thread executing a partition might enter a deadlock with another thread within the JVM, or simply get stuck in a busy-wait state.
  • Memory Exhaustion: While less common for “no error” scenarios, a subtle memory leak within a partition’s processing could lead to a slow crawl before a full crash.
  • JVM Pauses: Long garbage collection (GC) pauses can make tasks appear stuck.

Network Issues:

  • Transient network problems between IIQ server(s) and the database, or between IIQ and the target application (for aggregation), can cause indefinite waits.

Partitioning Configuration:

  • Uneven Distribution: If partitions are heavily uneven in data size, one very large partition might naturally take a long time, making it seem stuck.
  • Too Many Partitions/Threads: Over-partitioning or setting too many threads can sometimes exacerbate contention issues.

Advanced Detection and Recovery Strategies

Your current approach of a “long-running task monitoring” rule is excellent. Let’s enhance it.

Enhanced Monitoring Task (Scheduled Rule)

Instead of just checking average time, build a more sophisticated monitoring rule:

What to Monitor:

  • TaskResult state: a result that is still running has no completed date (taskResult.getCompleted() == null); the completion status is only set once the task finishes.
  • Progress counters: aggregation and refresh results record processed counts in the TaskResult attributes map (and in getProgress() / getPercentComplete() where the task reports them). If these haven’t changed for a defined period (e.g., the last 15-30 minutes), that’s a strong indicator of a stall.
  • TaskResult.getCreated() / getLaunched(): Compare new Date().getTime() with taskResult.getLaunched().getTime() to determine total elapsed time.
  • taskResult.getAttributes(): For partitioned tasks, examine the attributes map. You might find partitionsProcessed, currentPartition, or similar metrics, which can provide more granular insights into which partition is stuck.
  • TaskResult.getMessages(): Look for recent messages that might indicate issues, even if not formal errors.
  • Actionable Alerts:
    • Email: Your current plan to send an email is good. Include task name, ID, current status, processed count, and elapsed time.
    • Pager/SMS: For critical tasks, escalate to an on-call system.
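
A minimal sketch of such a monitoring rule (Beanshell) follows. The task names, the "total" progress attribute, and the "Task Monitor State" Custom object used to persist state between runs are assumptions; adapt them to whatever your tasks and environment actually use.
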
import sailpoint.object.TaskResult;
import sailpoint.object.Custom;
import sailpoint.object.QueryOptions;
import sailpoint.object.Filter;
import sailpoint.api.TaskManager;
import java.util.Arrays;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.text.SimpleDateFormat;

// Configuration for the monitoring rule
long maxStaticTimeMillis = 30 * 60 * 1000L;       // 30 minutes of no progress
long maxTotalRunTimeMillis = 4 * 60 * 60 * 1000L; // 4 hours total run time (adjust as needed)
List tasksToMonitor = Arrays.asList("My Aggregation Task", "My Identity Refresh Task"); // Specific task names

// Email alert parameters
String emailTo = "support@yourcompany.com";
String emailSubjectPrefix = "[IIQ Alert] Stalled Task: ";

try {
    TaskManager tm = new TaskManager(context);

    // A TaskResult that is still running has no completed date yet
    QueryOptions qo = new QueryOptions();
    qo.addFilter(Filter.isnull("completed"));
    List activeTasks = context.getObjects(TaskResult.class, qo);

    for (TaskResult taskResult : activeTasks) {
        String taskName = (taskResult.getDefinition() != null)
                ? taskResult.getDefinition().getName() : taskResult.getName();

        // Only monitor the configured tasks
        if (!tasksToMonitor.contains(taskName)) {
            continue;
        }

        long currentTime = new Date().getTime();
        long launchedTime = (taskResult.getLaunched() != null) ? taskResult.getLaunched().getTime() : currentTime;

        // Progress counter: partitioned aggregation/refresh results expose counters in the
        // attributes map. The "total" key is an assumption; use whatever counter your task
        // writes (check the TaskResult XML of a completed run).
        long lastProcessedCount = (taskResult.getAttributes() != null)
                ? taskResult.getAttributes().getInt("total") : 0L;

        // Persist the previous observation so stagnation can be detected across runs.
        // A Custom object is used for this; the object name "Task Monitor State" is an
        // assumption (the rule creates it on first use).
        Custom monitorState = context.getObjectByName(Custom.class, "Task Monitor State");
        if (monitorState == null) {
            monitorState = new Custom();
            monitorState.setName("Task Monitor State");
        }

        Map taskState = (Map) monitorState.get(taskResult.getId());
        if (taskState == null) {
            taskState = new HashMap();
            taskState.put("lastProcessedCount", lastProcessedCount);
            taskState.put("lastUpdateTime", currentTime);
            monitorState.put(taskResult.getId(), taskState);
            context.saveObject(monitorState);
            context.commitTransaction();
            continue; // Skip the check on the first observation of this task
        }

        // toString()/parseLong tolerates the values coming back as Long, Integer, or String
        long storedProcessedCount = Long.parseLong(taskState.get("lastProcessedCount").toString());
        long storedUpdateTime = Long.parseLong(taskState.get("lastUpdateTime").toString());

        boolean stalled = false;
        String reason = "";

        if (lastProcessedCount == storedProcessedCount && (currentTime - storedUpdateTime > maxStaticTimeMillis)) {
            stalled = true;
            reason = "No progress for " + (currentTime - storedUpdateTime) / 1000 / 60 + " minutes.";
        }

        if ((currentTime - launchedTime) > maxTotalRunTimeMillis) {
            stalled = true;
            reason = "Exceeded max total runtime of " + maxTotalRunTimeMillis / 1000 / 60 / 60 + " hours.";
        }

        if (stalled) {
            String subject = emailSubjectPrefix + taskName + " (ID: " + taskResult.getId() + ")";
            StringBuilder body = new StringBuilder();
            body.append("Task '").append(taskName).append("' (ID: ").append(taskResult.getId()).append(") appears to be stalled.\n");
            body.append("Reason: ").append(reason).append("\n");
            body.append("Status: Running\n");
            body.append("Processed Entries: ").append(lastProcessedCount).append("\n");
            body.append("Launched At: ").append(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(taskResult.getLaunched())).append("\n");
            body.append("Last Modified/Updated At: ").append(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(taskResult.getModified())).append("\n");
            body.append("Elapsed Time: ").append((currentTime - launchedTime) / 1000 / 60).append(" minutes\n");

            // Send email
            // Replace with your actual notification logic, e.g. render an EmailTemplate and
            // send it with context.sendEmailNotification(template, options)
            log.warn("ALERT: " + subject + " - " + body.toString()); // Log for now

            // Optional: attempt to terminate the task.
            // CAUTION: only enable this if you are confident in automatic termination;
            // it could kill a legitimate long-running task.
            // tm.terminate(taskResult);
            // log.warn("Attempted to terminate stalled task: " + taskName + " (ID: " + taskResult.getId() + ")");

            // Reset state after alert/action to avoid spamming
            taskState.put("lastProcessedCount", -1L); // Mark as processed for next check
            taskState.put("lastUpdateTime", currentTime);
        } else {
            // Update stored state for the next run
            taskState.put("lastProcessedCount", lastProcessedCount);
            taskState.put("lastUpdateTime", currentTime);
        }

        // Persist the updated monitoring state
        context.saveObject(monitorState);
        context.commitTransaction();
    }
} catch (Exception e) {
    log.error("Error in task monitoring rule: " + e.getMessage(), e);
}
return null; // For a rule
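
To deploy this, save it as a Rule and schedule it with a Run Rule task that fires more often than maxStaticTimeMillis (for example every 10-15 minutes), so stagnation is caught promptly. Start with alert-only behaviour; only enable the termination call once you trust the thresholds, since the rule cannot distinguish a genuine stall from a legitimately slow partition.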

Root Cause Analysis (Crucial for Partitioned Tasks)

Automating restarts is a band-aid. Understanding the root cause is key.

  • Database Health: Work with your DBA. Monitor database sessions during affected tasks. Look for:
    • Long-running queries.
    • Blocked sessions/deadlocks.
    • Connection pool usage.
  • Thread Dumps: When a task is stuck, immediately take a series of thread dumps from your application server’s JVM (e.g., using jstack or kill -3 <pid>). Analyze these dumps. If multiple threads are in the same waiting state (e.g., waiting for a database connection, network IO), it points to a bottleneck. Look for:
    • BLOCKED or WAITING states.
    • Threads named like QuartzScheduler_Worker-X or specific task threads.
    • What libraries/classes they are executing (e.g., java.sql, sailpoint.connector).
  • Heap Dumps: If memory exhaustion is suspected, take a heap dump (jmap).
  • IIQ Logs (DEBUG/TRACE): Temporarily set logging for sailpoint.task, sailpoint.server.scheduler, sailpoint.connector, and sailpoint.object to DEBUG/TRACE before running the problematic task. Review logs carefully after a hang.
  • Application-Specific Logging: If the task connects to an application, check the logs on that application’s side as well.
  • Partition Strategy Review (a sketch after this list shows how to check which partition Requests are still open):
    • Are your partitioning rules/queries (RequestDefinition objects) efficient?
    • Are partitions balanced in size? Highly skewed partitions can cause long waits.
    • Are there enough maxThreads configured for RequestDefinition objects like Aggregate Partition and Identity Refresh Partition? (Typically 1-2x CPU cores).
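
Because each partition runs as a Request against these RequestDefinitions, you can also check which partitions of a stuck run are still open. A minimal sketch, runnable from a rule or console Beanshell; the stock "Aggregate Partition" definition name and the dotted definition.name filter path are assumptions to verify in your environment:

import java.util.List;
import sailpoint.object.Filter;
import sailpoint.object.QueryOptions;
import sailpoint.object.Request;

// Partition Requests that have not finished yet still have no completed date
QueryOptions qo = new QueryOptions();
qo.addFilter(Filter.isnull("completed"));
qo.addFilter(Filter.eq("definition.name", "Aggregate Partition")); // assumption: stock partition RequestDefinition name
List openPartitions = context.getObjects(Request.class, qo);

for (Request req : openPartitions) {
    log.warn("Open partition: " + req.getName()
            + ", launched: " + req.getLaunched()
            + ", host: " + req.getHost());
}

If one partition stays open far longer than its siblings, that is usually the skewed or blocked one worth correlating with thread dumps and database sessions.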

Proactive Measures

  • Database Maintenance: Regular database indexing, statistics updates, and cleanup are vital.
  • Resource Tuning: Optimize iiq.properties (dataSourceMaxActive, dataSourceMaxWait), application server heap size, and garbage collection settings.
  • Network Stability: Ensure stable, high-bandwidth, low-latency network connectivity between IIQ nodes and the database.
  • Connector/Rule Optimization: Profile custom rules and connectors. Ensure they handle timeouts gracefully and don’t make blocking calls without limits.
  • Scheduler Host Affinity: If you have multiple IIQ nodes, consider dedicating specific nodes as “Task Servers” (ServiceDefinition objects) to prevent UI performance issues and better manage task load.
  • Task Definition Optimization:
    • Review all options in your aggregation/refresh tasks. Some options can significantly impact performance or cause contention (e.g., “Refresh all application account attributes”).
    • Consider breaking very large tasks into smaller, more frequent ones.

While automated recovery is a valuable goal, the intermittent and unlogged nature of the stall suggests an underlying environmental or configuration issue that needs dedicated root cause analysis (starting with thread dumps and database monitoring). The enhanced monitoring rule will at least provide timely alerts and can optionally automate termination.
