VA Critical Error – vector container failing with "vector-env failed" on every restart

Hi everyone,

I’ve been dealing with a persistent Critical Error on one of my Virtual Appliances and would appreciate any insights.


SETUP

  • 1 VA cluster with 2 VAs
  • VA 1: Connected (healthy, handling all workloads normally)
  • VA 2: Connected + Critical Error
  • VA 2 vector version: 0.53.0
  • OS: Flatcar Linux

SYMPTOM

VA 2 triggered the following Critical Error:

“vector service has restarted 53 times in the last 30 minutes”
category: container / severity: errors

After investigating on the VA directly, I found:

  • vector container status: Exited (ExitCode 1) immediately on every start attempt
  • vector.log: 0 bytes (container never actually starts)
  • vector-start.log: filled entirely with “ERROR: vector-env failed”, nothing else
  • All other containers (ccg, va_agent, charon, fluent, otel_agent) are healthy
  • Memory: 12 Gi free, Disk: 104 G free — not a resource issue
  • OOMKilled: false
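
To confirm that vector-start.log really holds nothing but the one message, a tally of distinct lines is a quick check. This is only a minimal Python sketch: the function name is made up for illustration, and it assumes the log lines carry no timestamps (otherwise every line would count as distinct).

```python
from collections import Counter


def tally_lines(log_text):
    """Count how often each distinct non-empty line appears in a log."""
    lines = (line.strip() for line in log_text.splitlines())
    return Counter(line for line in lines if line)
```

Calling it on the contents of vector-start.log should return a single key if the file contains only “ERROR: vector-env failed”.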

WHAT I FOUND

The vector container has these environment variables set:

VA_CERTIFICATE_PATH=/opt/sailpoint/share/secure/va-gateway.crt
VA_PRIVATE_KEY_PATH=/opt/sailpoint/share/secure/va-gateway.key

But when I checked the directory:

drwx------. 2 root root 4096 Jun 24 2025 /opt/sailpoint/share/secure

The secure/ directory is root:root 700 — completely inaccessible to the sailpoint user or the container’s entrypoint script. My theory is that vector-env is trying to read the certificate/key files at startup and failing immediately because of this permission.
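
As a sanity check on that theory, here is a rough Python sketch of how the classic Unix permission bits decide whether a process may enter a directory. It is an illustration only: the function name is hypothetical, and it ignores ACLs and SELinux (the trailing "." in the ls output indicates an SELinux context is set on the directory), either of which could change the real outcome on the VA.

```python
import stat


def can_traverse(mode, owner_uid, owner_gid, uid, gids):
    """Return True if a process with this uid/gids may enter a directory.

    Only the owner/group/other execute bits are considered.
    """
    if uid == 0:
        # root normally bypasses the permission bits on directories
        return True
    if uid == owner_uid:
        return bool(mode & stat.S_IXUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IXGRP)
    return bool(mode & stat.S_IXOTH)
```

With mode 0o700 and owner root:root, any non-root uid is refused, but a process running as root is not. So whether this matters depends on which user the vector entrypoint actually runs as, which I have not confirmed.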

I also noticed that otel_agent has been logging 503 errors when trying to push metrics to:

https://edge-prometheus-us-east-1.identitynow-demo.com/api/v1/write

Last error was around 2026-05-21. Not sure if this is related.


WHAT I TRIED

  • Cluster restart — issue reproduced after about 30 minutes
  • Checked vector-start.log, vector.log, otel_agent.log
  • Confirmed no other errors in vector-start.log besides “vector-env failed”

QUESTIONS

  1. Has anyone seen “ERROR: vector-env failed” before? What was the root cause in your case?
  2. Is the secure/ directory being root:root 700 expected, or does it suggest something went wrong during a VA update?
  3. Did a VA re-deployment resolve this for anyone, or is this something that requires SailPoint Support intervention?
  4. Any idea whether the otel_agent 503 errors are related to the vector issue?

Thanks in advance — any experience or pointers would be really helpful!

Hi,

I would suggest raising a SailPoint Support ticket for this; they should be able to guide you through the next steps.

This looks like a permission error causing the container startup failure: vector-env is probably failing because it cannot access the files it needs, in which case a chown/chmod on the directory might be the fix.
But for a quick resolution, please raise a ticket with the SailPoint Support team so they can guide you properly, since most VA-related matters are in their hands.

From what you’ve described, this looks more like a broken Vector startup/configuration issue than a resource problem.

A few observations:

  • secure/ being root:root 700 is typically expected on the VA. Containers that need those certs are usually granted access through mounts/permissions, so I wouldn’t assume that’s the root cause by itself.

  • vector.log remaining 0 bytes and vector-start.log only showing vector-env failed suggests Vector is failing before the service even initializes.

  • The fact that only one VA in the cluster is affected points more toward local corruption, a failed update, or a bad container/image state on that specific VA.

  • The OTEL 503s are likely a symptom rather than the cause. If telemetry components can’t start correctly, you’ll often see downstream metric export failures.

Given that:

  1. Compare the Vector container image/version and environment variables between VA1 and VA2.

  2. Check whether the certificate/key files referenced by VA_CERTIFICATE_PATH and VA_PRIVATE_KEY_PATH actually exist on VA2.

  3. If everything matches VA1, I would open a support case with SailPoint and attach the VA support bundle.

  4. In practice, I’ve seen redeploying the affected VA resolve similar single-node container corruption issues faster than prolonged troubleshooting.
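
For step 1, the comparison can be sketched in a few lines of Python. This assumes you have already dumped each container's environment to text as one KEY=VALUE per line, by whatever means you have on the VA; the function names are hypothetical.

```python
def parse_env(text):
    """Parse KEY=VALUE lines (one variable per line) into a dict."""
    env = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key:
            env[key] = value
    return env


def diff_env(va1, va2):
    """Return {key: (va1_value, va2_value)} for keys that differ.

    None means the variable is not set on that VA.
    """
    return {
        key: (va1.get(key), va2.get(key))
        for key in sorted(set(va1) | set(va2))
        if va1.get(key) != va2.get(key)
    }
```

An empty result means the two environments match, which would point away from configuration drift and toward something local to VA2.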

Since the issue survives a restart and is isolated to one VA, my next step would be support bundle + VA redeployment.
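
For step 2, a small Python sketch of the existence check might look like this. The function name and status strings are made up for illustration. It would need to run as a user that can enter secure/ (likely root), because from any other user a file behind a 700 directory looks the same as a missing one.

```python
import os


def check_cert_paths(env, keys=("VA_CERTIFICATE_PATH", "VA_PRIVATE_KEY_PATH")):
    """Report whether each path variable points at a readable regular file."""
    status = {}
    for key in keys:
        path = env.get(key)
        if not path:
            status[key] = "unset"
        elif not os.path.isfile(path):
            # Also reached when a parent directory cannot be entered
            status[key] = "missing or unreachable"
        elif not os.access(path, os.R_OK):
            status[key] = "exists but unreadable"
        else:
            status[key] = "ok"
    return status
```

Running the same check on VA1 gives a baseline: if VA1 reports "ok" for both variables and VA2 does not, that is useful evidence to attach to the support case.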