MCM by Revdau
v1.1 is unreleased — see v1.0 for the current stable release.

Troubleshooting Guide

Diagnostics, command references, and troubleshooting scenarios for resolving deployment issues with the MCM Platform.

MCM Platform — Troubleshooting Guide

This guide provides diagnostics, resolution steps, and common troubleshooting scenarios for the Multi-Cloud Management (MCM) platform. It covers Common scenarios (applicable to all deployments), Single VM Specific issues, and 3-Tier Specific cluster issues.

Running Container Commands (3-Tier Only)

To run commands or inspect logs for services, use cluster-wide commands (e.g., docker service logs, docker service ls) directly on the Access Node (VM1). To execute commands inside a specific container (e.g., docker exec), first identify which node hosts that container replica via docker stack ps mcm, SSH into that host node, and run the command locally.
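
The lookup described above can be scripted. The following is a minimal sketch, assuming the stack is named mcm as in this guide. The helper name node_for_service is hypothetical and is not part of the MCM tooling. It reads "NAME NODE" pairs, as printed by docker stack ps with a --format template, and prints the node or nodes hosting a given service.

```shell
# node_for_service: hypothetical helper (not part of the MCM scripts).
# Reads "TASK_NAME NODE" pairs on stdin and prints the node(s) that host
# tasks of the given service. Task names look like "<service>.<slot>".
node_for_service() {
  local service="$1"
  awk -v svc="$service" '
    {
      name = $1
      sub(/\.[^.]*$/, "", name)   # strip the ".<slot>" suffix
      if (name == svc) print $2
    }' | sort -u
}

# Usage on the Access Node (VM1):
#   docker stack ps mcm --filter desired-state=running \
#     --format '{{.Name}} {{.Node}}' | node_for_service mcm_mongodb
```

Once the node is known, SSH into it and run docker exec there.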


1. Common Troubleshooting Scenarios (All Deployments)

These scenarios apply to both Single VM and 3-Tier orchestrated deployments.

Container Fails to Start or Boot-Loops

Symptom: One or more containers show a status of Restarting or Exited in the docker ps list.

Diagnosis Steps

  • Inspect application logs for the failing service (run on Access Node for 3-tier):
    docker service logs mcm_<service-name> --tail 50 --follow
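  • List only the services that currently have no running replicas (run on Access Node for 3-tier). docker service ls has no replica filter (it filters only by id, label, mode, and name), so filter the formatted output instead:
    docker service ls --format '{{.Name}} {{.Replicas}}' | grep ' 0/'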
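  • Verify whether host ports 80 or 443 are already occupied by another web server (e.g., Apache, Nginx) (run on Access Node for 3-tier). The trailing whitespace match keeps ports such as 8080 or 4430 out of the results, and sudo lets ss show the owning process:
    sudo ss -tlnp | grep -E ":(80|443)[[:space:]]"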

Common Causes & Resolutions

| Cause | Resolution |
| --- | --- |
| Dependent service not healthy | Wait for MongoDB, Keycloak, or Elasticsearch to become fully healthy before starting application microservices. |
| Host port conflict | Stop the conflicting application service, or change the port mappings in /etc/mcm/user_config.env. |
| Out of memory | Increase host VM memory, or disable unused compose profiles to reduce the RAM footprint. |
| Missing certificates | Regenerate missing certificates using /opt/mcm/scripts/generate-certs.sh. |
| Missing secrets | Verify /etc/mcm/secrets.env exists and contains generated passwords. |
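
The "missing secrets" check can be done without printing any secret values. This is a minimal sketch, assuming the file uses plain KEY=VALUE lines. The helper name missing_env_keys is hypothetical, and the key names in the usage comment are the ones mentioned elsewhere in this guide, so adjust the list to your deployment.

```shell
# missing_env_keys: hypothetical helper (not part of the MCM scripts).
# Prints each requested key that is absent from, or empty in, a KEY=VALUE file.
missing_env_keys() {
  local file="$1"
  shift
  local key
  for key in "$@"; do
    # Match "KEY=<at least one character>" at the start of a line.
    grep -Eq "^${key}=.+" "$file" 2>/dev/null || echo "$key"
  done
}

# Usage (reading the file may require sudo):
#   missing_env_keys /etc/mcm/secrets.env KC_MCM_CLIENT_SECRET MONGO_INITDB_MCM_PASSWORD
```

No output means every listed key is present with a non-empty value.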

Certificate and Truststore Errors

Symptom: Service logs show PKIX path building failed, SSL handshake failed, or certificate verify failed when communicating internally or externally.

Diagnosis Steps

  • Verify if certificates exist in the configuration directory for a specific service:
    ls -la /var/lib/mcm/mcm-api/certs/
  • Verify the validity and expiration dates of a specific service certificate file:
    openssl x509 -in /var/lib/mcm/mcm-api/certs/mcm-api.crt -noout -dates
  • List certificates and details contained inside the Java keystore:
    keytool -list -v -keystore /var/lib/mcm/mcm-api/certs/keystore.p12 -storepass <KEYSTORE_PASSWORD>
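
To catch certificates that are close to expiry without reading the dates by eye, the -checkend option of openssl x509 can be used. This is a minimal sketch, assuming openssl is installed on the node. The helper name cert_valid_for_days is hypothetical and is not an MCM script.

```shell
# cert_valid_for_days: hypothetical helper (not part of the MCM scripts).
# Succeeds if the certificate in $1 is still valid $2 days from now.
cert_valid_for_days() {
  local cert="$1" days="$2"
  # -checkend N exits 0 if the certificate does not expire within N seconds.
  openssl x509 -in "$cert" -noout -checkend "$(( days * 86400 ))" >/dev/null 2>&1
}

# Usage:
#   cert_valid_for_days /var/lib/mcm/mcm-api/certs/mcm-api.crt 30 \
#     || echo "certificate expires within 30 days (or could not be read)"
```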

Resolution

To force-regenerate all certificates and reload the truststores, perform these steps on the node hosting the configuration (Access Node):

  1. Remove the current configuration fingerprint:
    sudo rm /var/lib/mcm/.config_fingerprint
  2. Restart the platform to generate new certificates and recreate keystores/truststores:
    sudo bash /opt/mcm/scripts/restart.sh

Keycloak Authentication Issues

Symptom: Users cannot log in, the UI returns 401 Unauthorized, or logs report Invalid client secret.

Diagnosis Steps

  • View Keycloak logs to check for client authorization errors (run on Access Node for 3-tier):
    docker service logs mcm_keycloak --tail 50
  • Inspect the PostgreSQL database logs associated with Keycloak storage (run on Access Node for 3-tier):
    docker service logs mcm_keycloak-postgres --tail 20

Resolution

  1. Verify that the KC_MCM_CLIENT_SECRET value in /etc/mcm/secrets.env matches the configuration inside Keycloak.
  2. Verify that the Keycloak realm import file at /opt/mcm/keycloak/import/mcm.json was loaded successfully on initial startup.
  3. Restart Keycloak individually to reload configurations (run on Access Node for 3-tier):
    docker service update --force mcm_keycloak

MongoDB Connection Failures

Symptom: Backend services throw a MongoSocketException or report authentication failed.

Diagnosis Steps

  • Check if the MongoDB database service is up and responding to pings (run on DB Node for 3-tier):
    docker exec $(docker ps -q -f name=mcm_mongodb) mongosh --eval "db.adminCommand('ping')"
  • Test application user authentication directly inside the database (run on DB Node for 3-tier):
    docker exec $(docker ps -q -f name=mcm_mongodb) mongosh -u mcm -p <MONGO_INITDB_MCM_PASSWORD> --authenticationDatabase mcm --eval "db.getName()"
  • Check MongoDB logs for connection limits or auth failures (run on Access Node for 3-tier):
    docker service logs mcm_mongodb --tail 50

Resolution

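  1. Confirm that the database password MONGO_INITDB_MCM_PASSWORD in /etc/mcm/secrets.env matches the password stored for the mcm user inside MongoDB.
  2. If the password was changed in secrets.env after the database was first initialized, the two values no longer match, because database initialization typically runs only once against an empty data directory. Either restore the original value in secrets.env or update the mcm user's password inside MongoDB so that both agree.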
  3. Restart the MongoDB container (run on Access Node for 3-tier):
    docker service update --force mcm_mongodb

Elasticsearch Cluster Health Issues

Symptom: Search operations fail, or cluster health endpoints return errors.

Diagnosis Steps

  • Check the overall cluster health status (run on DB Node for 3-tier). The URL is quoted so that the shell does not treat ? as a glob character:
    curl -sk -u elastic:<ELASTIC_PASSWORD> "https://localhost:9200/_cluster/health?pretty"
  • Check shard allocation and see if there are unassigned shards (run on DB Node for 3-tier):
    curl -sk -u elastic:<ELASTIC_PASSWORD> "https://localhost:9200/_cat/allocation?v"
  • View the status, size, and document count of all Elasticsearch indices (run on DB Node for 3-tier):
    curl -sk -u elastic:<ELASTIC_PASSWORD> "https://localhost:9200/_cat/indices?v"

Resolution

  • Status Yellow: Normal for a Single VM deployment (since replica shards cannot be allocated to other nodes). No action is required.
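  • Status Red: At least one primary shard is unassigned, so some data is unavailable. Check whether the VM is running out of disk space. With default settings, Elasticsearch stops allocating new shards to a node above 85% disk usage (low watermark), moves shards away from it above 90% (high watermark), and marks indices read-only above 95% (flood-stage watermark).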
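
When disk pressure is suspected, it helps to compare the node's disk usage with the Elasticsearch disk watermarks. This is a minimal sketch, assuming the default thresholds (low 85%, high 90%, flood stage 95%) have not been overridden in the cluster settings. The helper name es_disk_stage and the DATA_DIR variable are hypothetical.

```shell
# es_disk_stage: hypothetical helper (not part of the MCM scripts).
# Maps a disk-usage percentage to the Elasticsearch watermark it has crossed,
# assuming the default thresholds: low 85%, high 90%, flood stage 95%.
es_disk_stage() {
  local pct="$1"
  if   [ "$pct" -ge 95 ]; then echo "flood"
  elif [ "$pct" -ge 90 ]; then echo "high"
  elif [ "$pct" -ge 85 ]; then echo "low"
  else echo "ok"
  fi
}

# Usage (DATA_DIR is an assumption; point it at the Elasticsearch data volume):
#   es_disk_stage "$(df -P "$DATA_DIR" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')"
```

A result of "flood" means indices are likely write-blocked until disk space is freed.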

APISIX Gateway Routing Errors

Symptom: Accessing the UI or APIs returns 502 Bad Gateway, 404 Not Found, or routing fails.

Diagnosis Steps

  • Inspect the APISIX proxy error and access logs (run on Access Node for 3-tier):
    docker service logs mcm_apisix --tail 50
  • View the initialization logs of APISIX to verify routes were imported (run on Access Node for 3-tier):
    docker service logs mcm_init-apisix
  • Check if the backend etcd configuration store is running and healthy (run on Access Node for 3-tier):
    docker exec $(docker ps -q -f name=mcm_etcd) etcdctl endpoint health

Resolution

  1. Check that the init-apisix initialization container exited successfully with code 0.
  2. Verify that upstream backend microservices are healthy and reachable.
  3. Restart the APISIX service (run on Access Node for 3-tier):
    docker service update --force mcm_apisix

Service Health Check Failures

Symptom: One or more containers remain in an (unhealthy) or bootstrapping state, or the platform does not fully initialize.

Diagnosis Steps

  1. Verify overall service status (Run on VM1 - Access Node): Execute the health check script to get a summary table of service states:

    sudo bash /opt/mcm/scripts/healthcheck.sh

    If any service shows ❌ UNHEALTHY / STARTING, check its cluster task list to find which VM is hosting the container and its current error message:

    docker stack ps mcm --no-trunc
  2. Verify expected Docker images are loaded on the hosting nodes: Run docker images on the respective VM node and cross-reference with the expected list below:

    | Hosting Node | Expected Images (verify via docker images) |
    | --- | --- |
    | VM1 (Access Node) | mcm-ui, apache/apisix, bitnamilegacy/etcd, elastic/kibana, wazuh/wazuh-dashboard |
    | VM2 (App Node) | mcm-api, mcm-governance-api, mcm-finops-api, mcm-secops-api, mcm-orchestration-api, mcm-discovery-api, mcm-observability-api, mcm-ai-api |
    | VM3 (DB & Security) | postgres, quay.io/keycloak/keycloak, elastic/elasticsearch, mongo, elastic/filebeat, wazuh/wazuh-manager, wazuh/wazuh-indexer |
  3. Reload missing images and restart the platform: If any image is missing, push the correct node-specific tarball from VM1 (Access Node) to the target VM using the local SSH key, then load it locally:

    • On VM1 (Access Node):

      cd /home/[VM_SSH_USER]/mcm_artifacts/images
      docker load -i vm1_images.tar
    • On VM2 (App Node):

      # 1. Run on VM1 to push the tarball to VM2 using the cluster SSH key:
      scp -i /home/[VM_SSH_USER]/3-tier.key /home/[VM_SSH_USER]/mcm_artifacts/images/vm2_images.tar [VM_SSH_USER]@[VM2_PRIVATE_IP]:/tmp/
      
      # 2. Run on VM2 to load the image and clean up:
      docker load -i /tmp/vm2_images.tar && rm /tmp/vm2_images.tar
    • On VM3 (DB Node):

      # 1. Run on VM1 to push the tarball to VM3 using the cluster SSH key:
      scp -i /home/[VM_SSH_USER]/3-tier.key /home/[VM_SSH_USER]/mcm_artifacts/images/vm3_images.tar [VM_SSH_USER]@[VM3_PRIVATE_IP]:/tmp/
      
      # 2. Run on VM3 to load the image and clean up:
      docker load -i /tmp/vm3_images.tar && rm /tmp/vm3_images.tar
    • Restart the platform (Run on VM1 - Access Node):

      sudo bash /opt/mcm/scripts/restart.sh
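
Cross-referencing the image list by hand is error-prone, so the comparison in step 2 can be sketched as a small filter. The helper name missing_images is hypothetical. It assumes the repository names reported by docker images match the names in the table exactly, which may not hold if your images carry a registry prefix.

```shell
# missing_images: hypothetical helper (not part of the MCM scripts).
# Reads repository names on stdin (one per line) and prints every expected
# repository, given as arguments, that is not in that list.
missing_images() {
  local present expected
  present="$(cat)"
  for expected in "$@"; do
    printf '%s\n' "$present" | grep -Fxq -- "$expected" || echo "$expected"
  done
}

# Usage on VM3, with a few names from the table above:
#   docker images --format '{{.Repository}}' \
#     | missing_images postgres mongo elastic/elasticsearch
```

Any name printed is an image that still needs to be loaded on that node.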

Systemd Service Failure

Symptom: The background service managing the platform fails to launch.

  • Cause: Failed Systemd Unit Startup
    • Explanation: The host operating system service unit (mcm.service) encountered an error during startup, which prevents the platform's containers (Compose services or Swarm tasks) from being started.
    • Diagnosis & Resolution:
      1. Check the service status:
        sudo systemctl status mcm.service
      2. View the system journal logs to isolate the exact startup error:
        sudo journalctl -u mcm.service --since "1 hour ago" --no-pager

Network Port Connectivity Verification (nc / Netcat)

When troubleshooting cluster communication, database connections, or gateway access, use the nc (netcat) utility to verify if ports are open and accessible between nodes.

Commands

  • Test TCP Port connectivity:
    nc -zv <TARGET_IP> <PORT>
  • Test UDP Port connectivity:
    nc -zuv <TARGET_IP> <PORT>
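    (Parameters: -z makes netcat only check whether the port accepts connections, without sending data; -v enables verbose output; -u switches from TCP to UDP mode. Because UDP is connectionless, a UDP "succeeded" result only means that no ICMP "port unreachable" reply came back; it does not prove that a service is listening.)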

Interpreting Netcat Output

  • Success Output:
    Connection to 10.0.1.20 2377 port [tcp/*] succeeded!
    What it means: The target port is open, a service is active and listening, and there are no firewalls blocking the traffic.
  • Failure Output (Connection Refused):
    nc: connect to 10.0.1.20 port 2377 (tcp) failed: Connection refused
    What it means: The host VM is reachable, but no service is listening on that port. Ensure the service (e.g., Docker Swarm, APISIX, MongoDB) is running on the target node.
  • Failure Output (Connection Timed Out):
    nc: connect to 10.0.1.20 port 2377 (tcp) failed: Connection timed out
    What it means: The network packets are being silently dropped. This indicates that a firewall (either cloud VPC Security Groups/NSGs or host-level UFW/iptables) is blocking traffic on this port. Double-check your firewall ingress and egress configuration rules.
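
The three outcomes above can be told apart automatically when checking many ports. This is a minimal sketch that matches the message wording shown in this guide. Other netcat implementations may word their messages differently, and the helper name classify_nc is hypothetical.

```shell
# classify_nc: hypothetical helper (not part of the MCM scripts).
# Reads nc output on stdin and prints "open", "refused", "timeout" or "unknown".
classify_nc() {
  local out
  out="$(cat)"
  case "$out" in
    *succeeded*)            echo "open" ;;
    *"Connection refused"*) echo "refused" ;;
    *"timed out"*)          echo "timeout" ;;
    *)                      echo "unknown" ;;
  esac
}

# Usage (nc writes its verbose messages to stderr, hence 2>&1;
# -w 5 limits the wait to five seconds):
#   nc -zv -w 5 <TARGET_IP> <PORT> 2>&1 | classify_nc
```

"refused" points at the service on the target node, while "timeout" points at a firewall or security group.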

2. 3-Tier Specific Troubleshooting

These scenarios apply to deployments distributed across the 3-tier orchestrated cluster.

VXLAN Overlay Routing Failures (504 Gateway Timeout)

Symptom: Accessing the gateway results in a 504 Gateway Time-out. Core APIs on VM2 cannot communicate with databases on VM3.

  • Cause: Blocked UDP Port 4789 (VXLAN)
    • Explanation: The cluster virtual network requires UDP Port 4789 to be open inbound/outbound on all cluster nodes. If it is blocked or configured as TCP, inter-node container routing fails.
    • Diagnosis & Resolution:
      1. Test UDP port connectivity manually between nodes using netcat:
        # From VM1 to VM2 App Node
        nc -zuv <VM2_PRIVATE_IP> 4789
        
        # From VM2 to VM3 DB Node
        nc -zuv <VM3_PRIVATE_IP> 4789
        (Refer to the Network Port Connectivity Verification section above to analyze and resolve connection timeouts or refusal messages).
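      2. Re-apply configurations and redeploy the stack once the port is open. docker stack rm returns before the stack's containers and networks are fully removed, so wait for removal to finish before running the deploy command: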
        docker stack rm mcm
        docker stack deploy -c /opt/mcm/docker-compose-stack.yml mcm --resolve-image never
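
VXLAN is only one of the ports Docker Swarm needs between nodes. The others are 2377/tcp for cluster management and 7946/tcp and 7946/udp for node-to-node communication. This is a minimal sketch that prints one nc check per port for a target node. The helper name swarm_port_checks is hypothetical, and the UDP results are only indicative because UDP is connectionless.

```shell
# swarm_port_checks: hypothetical helper (not part of the MCM scripts).
# Prints (does not run) one nc command per port that Docker Swarm uses
# between nodes: 2377/tcp, 7946/tcp, 7946/udp and 4789/udp.
swarm_port_checks() {
  local ip="$1"
  echo "nc -zv -w 5 $ip 2377"    # cluster management
  echo "nc -zv -w 5 $ip 7946"    # node-to-node communication (TCP)
  echo "nc -zuv -w 5 $ip 7946"   # node-to-node communication (UDP)
  echo "nc -zuv -w 5 $ip 4789"   # VXLAN overlay traffic
}

# Usage: review the printed commands, then pipe them to a shell to run them:
#   swarm_port_checks <VM2_PRIVATE_IP> | sh
```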

SSH and Key Authorization Failures

Symptom: The installer or upgrade scripts fail with SSH connection errors to VM2 or VM3.

  • Cause: Incorrect Key Permissions or Missing Sudoers Rules
    • Explanation: The VM1 Access Node connects to workers over SSH. This fails if key permissions are insecure or the SSH user lacks passwordless sudo privileges.
    • Resolution:
      1. Restrict private key permissions on VM1:
        chmod 400 /home/ubuntu/3-tier.pem
      2. Ensure the SSH user (e.g., ubuntu) has passwordless sudo configured on VM2 and VM3:
        echo "ubuntu ALL=(ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/90-init-users
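
Both conditions can be verified before re-running the installer. This is a minimal sketch, assuming a Linux host with GNU stat. The helper name key_mode_ok is hypothetical, and the key path, user, and IP in the usage comment are placeholders.

```shell
# key_mode_ok: hypothetical helper (not part of the MCM scripts).
# Succeeds if the file grants no permissions to group or others
# (for example mode 400 or 600). Relies on GNU stat.
key_mode_ok() {
  local mode
  mode="$(stat -c '%a' "$1" 2>/dev/null)" || return 1
  case "$mode" in
    0|*00) return 0 ;;
    *)     return 1 ;;
  esac
}

# Usage on VM1 (paths, user and IP are placeholders):
#   key_mode_ok /home/ubuntu/3-tier.pem || echo "key permissions too open"
#   ssh -i /home/ubuntu/3-tier.pem ubuntu@<VM2_PRIVATE_IP> 'sudo -n true' \
#     || echo "passwordless sudo is not configured on VM2"
```

sudo -n fails instead of prompting, so the second check returns an error when a password would be required.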

Stuck or Pending Container Tasks

Symptom: Services are deployed, but some replicas show 0/1 and remain in Pending state indefinitely.

  • Cause: Missing Node Labels (Placement Constraint Mismatches)
    • Explanation: The orchestrator relies on node labels (tier=access, tier=app, tier=db) to schedule containers. If a node is missing its label, its tasks cannot schedule.
    • Diagnosis & Resolution:
      1. View the scheduling error reason:
        docker stack ps mcm --no-trunc
      2. Inspect the labels of each node (run on VM1):
        docker node inspect <NODE_HOSTNAME> --format '{{json .Spec.Labels}}'
      3. Re-apply the placement labels if missing:
        docker node update --label-add tier=access <VM1_HOSTNAME>
        docker node update --label-add tier=app <VM2_HOSTNAME>
        docker node update --label-add tier=db <VM3_HOSTNAME>
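
To find unlabeled nodes in one pass, the inspect output for every node can be filtered. This is a minimal sketch. The helper name nodes_missing_tier is hypothetical, and it only checks that a tier label exists, not that its value is correct.

```shell
# nodes_missing_tier: hypothetical helper (not part of the MCM scripts).
# Reads "HOSTNAME LABELS_JSON" lines and prints the hostnames whose labels
# contain no "tier" key.
nodes_missing_tier() {
  local host labels
  while read -r host labels; do
    case "$labels" in
      *'"tier":'*) ;;           # label present, nothing to report
      *) echo "$host" ;;
    esac
  done
}

# Usage on VM1:
#   for n in $(docker node ls --format '{{.Hostname}}'); do
#     echo "$n $(docker node inspect "$n" --format '{{json .Spec.Labels}}')"
#   done | nodes_missing_tier
```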

VM2 Memory Exhaustion & JVM Crash Loops (Exit Code 137)

Symptom: Microservices on VM2 (App Node) crash-loop with exit code 137.

  • Cause: Out-Of-Memory (OOM) Termination
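    • Explanation: Exit code 137 means the container's main process was killed with SIGKILL (128 + 9). On a memory-constrained node this is usually the kernel OOM killer terminating the container due to RAM exhaustion.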
    • Resolution:
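      1. Prune unused images and containers on VM2. Pruning mainly frees disk space; if the crash loop continues, increase the memory available to VM2: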

        sudo docker system prune -af

        CAUTION: Avoid the --volumes flag during active deployments

        Running docker system prune with the --volumes flag will delete all unused local Docker volumes. If the Swarm stack is stopped or a database replica is temporarily offline, this will permanently delete your database storage (PostgreSQL, MongoDB, etc.) resulting in irreversible data loss. For routine space cleanup, run the prune command without the --volumes flag.
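
Before attributing exit code 137 to memory pressure, it is worth confirming it from the container state, since any SIGKILL produces the same code. This is a minimal sketch. The helper name signal_from_exit_code is hypothetical, and CONTAINER_ID in the usage comment is a placeholder.

```shell
# signal_from_exit_code: hypothetical helper (not part of the MCM scripts).
# Exit codes above 128 mean "terminated by signal (code - 128)".
# Prints the signal number, or 0 if the code does not encode a signal.
signal_from_exit_code() {
  local code="$1"
  if [ "$code" -gt 128 ]; then
    echo $(( code - 128 ))
  else
    echo 0
  fi
}

# Usage on VM2 (CONTAINER_ID is a placeholder for the exited container):
#   docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' "$CONTAINER_ID"
#   signal_from_exit_code 137
```

If OOMKilled is true, the container hit its memory limit or the host ran out of RAM.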


Offline Container Image Loading Failures

Symptom: Tasks fail to deploy on VM2 or VM3 with "No such image" errors.

  • Cause: Interrupted scp Transfer or Out of Disk Space
    • Explanation: The offline installation copies image archives over network connections. Disk exhaustion on target nodes blocks complete extraction.
    • Resolution:
      1. Manually transfer the image from VM1 to the target worker:
        scp -i <KEY> /home/ubuntu/mcm-artifacts/images/mcm-ui.tar ubuntu@<VM2_IP>:/tmp/
      2. Load the archive into the target node's local Docker image store:
        ssh -i <KEY> ubuntu@<VM2_IP> "sudo docker load -i /tmp/mcm-ui.tar && rm /tmp/mcm-ui.tar"
      3. Rerun the upgrade or restart command on the VM1 Access Node to complete image deployment and schedule the newly loaded service container task:
        • To complete an upgrade:
          sudo ./upgrade.sh
        • To redeploy the stack:
          sudo /opt/mcm/scripts/restart.sh
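
Because the usual cause is an interrupted transfer, comparing checksums before loading the archive can save a failed docker load. This is a minimal sketch, assuming sha256sum is available on both nodes. The helper name file_sha256 is hypothetical, and KEY, USER_AT_HOST, and the archive paths are placeholders.

```shell
# file_sha256: hypothetical helper (not part of the MCM scripts).
# Prints only the SHA-256 digest of a file.
file_sha256() {
  sha256sum "$1" | awk '{ print $1 }'
}

# Usage from VM1 (KEY, USER_AT_HOST and the archive paths are placeholders):
#   local_sum="$(file_sha256 /path/to/image.tar)"
#   remote_sum="$(ssh -i "$KEY" "$USER_AT_HOST" 'sha256sum /tmp/image.tar' | awk '{ print $1 }')"
#   [ "$local_sum" = "$remote_sum" ] || echo "transfer incomplete or corrupted; copy again"
#   ssh -i "$KEY" "$USER_AT_HOST" 'df -h /tmp /var/lib/docker'   # check free space
```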
