This guide provides diagnostics, resolution steps, and common troubleshooting scenarios for the Multi-Cloud Management (MCM) platform. It covers common scenarios (applicable to all deployments), issues specific to Single VM deployments, and issues specific to 3-Tier clusters.
Running Container Commands (3-Tier Only)
To run commands or inspect logs for services, use cluster-wide commands (e.g., docker service logs, docker service ls) directly on the Access Node (VM1). To execute commands inside a specific container (e.g., docker exec), first identify which node hosts that container replica via docker stack ps mcm, SSH into that host node, and run the command locally.
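As a rough sketch of the lookup described above, the helper below picks the hosting node for a service out of the task list. The service name mcm_api is a placeholder rather than a name confirmed by this guide; substitute one reported by docker service ls.

```shell
# Print the node that hosts the first listed task of a service.
# Reads "task-name node" pairs on stdin, as produced by:
#   docker stack ps mcm --filter desired-state=running \
#     --format '{{.Name}} {{.Node}}'
node_for_service() {
  awk -v svc="$1" 'index($1, svc ".") == 1 { print $2; exit }'
}

# Usage on VM1 (mcm_api is a hypothetical service name):
#   docker stack ps mcm --filter desired-state=running \
#     --format '{{.Name}} {{.Node}}' | node_for_service mcm_api
# Then, after SSH-ing into the node it prints:
#   docker exec -it "$(docker ps -q --filter name=mcm_api | head -n 1)" sh
```

Matching on the task-name prefix (service name followed by a dot) avoids picking up services whose names merely start with the same characters.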
Status Yellow: Normal for a Single VM deployment (since replica shards cannot be allocated to other nodes). No action is required.
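Status Red: At least one primary shard is unassigned or corrupt. Check whether the VM has run out of disk space: Elasticsearch marks indices read-only once disk usage crosses the flood-stage watermark (95% by default in Elasticsearch; check the value configured for your deployment).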
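The disk check above could be scripted roughly as follows. This is a sketch under assumptions: the Elasticsearch endpoint localhost:9200 and the path /var/lib/docker are guesses about a typical deployment, not values taken from this guide, and the threshold is passed in explicitly so that you can use whatever watermark applies to your cluster.

```shell
# Percentage of disk used on the filesystem that holds a path.
disk_usage_pct() {
  df -P "$1" | awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Succeeds when usage (arg 1) has reached the threshold (arg 2).
# Pass the flood-stage watermark configured for your cluster.
usage_at_or_over() {
  [ "$1" -ge "$2" ]
}

# Usage (endpoint and path are assumptions; add credentials if required):
#   curl -s 'http://localhost:9200/_cluster/health?pretty'
#   disk_usage_pct /var/lib/docker
# After freeing space, clear the read-only block if it is still set:
#   curl -s -X PUT 'http://localhost:9200/_all/_settings' \
#     -H 'Content-Type: application/json' \
#     -d '{"index.blocks.read_only_allow_delete": null}'
```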
Verify overall service status (Run on VM1 - Access Node):
Execute the health check script to get a summary table of service states:
sudo bash /opt/mcm/scripts/healthcheck.sh
If any service shows ❌ UNHEALTHY / STARTING, check its cluster task list to find which VM is hosting the container and its current error message:
docker stack ps mcm --no-trunc
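On a busy cluster the full task list can be long. One possible way to narrow it to failing tasks is sketched below; it relies only on the standard --format placeholders of docker stack ps, and the helper name is illustrative.

```shell
# Keep only tasks that report an error.
# Reads tab-separated "name<TAB>node<TAB>error" lines on stdin.
tasks_with_errors() {
  awk -F '\t' '$3 != "" { printf "%s on %s: %s\n", $1, $2, $3 }'
}

# Usage on VM1:
#   docker stack ps mcm --no-trunc \
#     --format '{{.Name}}\t{{.Node}}\t{{.Error}}' | tasks_with_errors
```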
Verify expected Docker images are loaded on the hosting nodes:
Run docker images on the respective VM node and cross-reference with the expected list below:
Reload missing images and restart the platform:
If any image is missing, push the correct node-specific tarball from VM1 (Access Node) to the target VM using the local SSH key, then load it locally:
On VM1 (Access Node):
cd /home/[VM_SSH_USER]/mcm_artifacts/images
docker load -i vm1_images.tar
On VM2 (App Node):
# 1. Run on VM1 to push the tarball to VM2 using the cluster SSH key:
scp -i /home/[VM_SSH_USER]/3-tier.key /home/[VM_SSH_USER]/mcm_artifacts/images/vm2_images.tar [VM_SSH_USER]@[VM2_PRIVATE_IP]:/tmp/
# 2. Run on VM2 to load the image and clean up:
docker load -i /tmp/vm2_images.tar && rm /tmp/vm2_images.tar
On VM3 (DB Node):
# 1. Run on VM1 to push the tarball to VM3 using the cluster SSH key:
scp -i /home/[VM_SSH_USER]/3-tier.key /home/[VM_SSH_USER]/mcm_artifacts/images/vm3_images.tar [VM_SSH_USER]@[VM3_PRIVATE_IP]:/tmp/
# 2. Run on VM3 to load the image and clean up:
docker load -i /tmp/vm3_images.tar && rm /tmp/vm3_images.tar
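If several nodes need images reloaded, the two-step copy and load above could be wrapped in a single helper run from VM1. The sketch below is illustrative only: the function name and the DRY_RUN switch are assumptions, and it presumes the SSH user may run docker on the target node without sudo, as the manual steps do.

```shell
# Copy a node-specific tarball from VM1 to a target node, load it there,
# and remove the temporary copy. With DRY_RUN=1 the commands are printed
# instead of executed.
push_and_load() {
  key="$1"; user="$2"; host="$3"; tarball="$4"
  remote="/tmp/$(basename "$tarball")"
  run=""
  [ "${DRY_RUN:-0}" = "1" ] && run="echo"
  $run scp -i "$key" "$tarball" "$user@$host:/tmp/" &&
    $run ssh -i "$key" "$user@$host" "docker load -i $remote && rm $remote"
}

# Usage on VM1 (preview first, then run without DRY_RUN):
#   DRY_RUN=1 push_and_load /home/[VM_SSH_USER]/3-tier.key [VM_SSH_USER] \
#     [VM2_PRIVATE_IP] /home/[VM_SSH_USER]/mcm_artifacts/images/vm2_images.tar
```

Previewing with DRY_RUN=1 makes it easy to confirm the key path and target address before anything is copied.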
When troubleshooting cluster communication, database connections, or gateway access, use the nc (netcat) utility to verify if ports are open and accessible between nodes.
(Parameters: -z instructs netcat to only scan for listening ports/daemons without sending data, -v enables verbose output, and -u switches from TCP to UDP mode).
Success Output:
Connection to 10.0.1.20 2377 port [tcp/*] succeeded!
What it means: The target port is open, a service is active and listening, and there are no firewalls blocking the traffic.
Failure Output (Connection Refused):
nc: connect to 10.0.1.20 port 2377 (tcp) failed: Connection refused
What it means: The host VM is reachable, but no service is listening on that port. Ensure the service (e.g., Docker Swarm, APISIX, MongoDB) is running on the target node.
Failure Output (Connection Timed Out):
nc: connect to 10.0.1.20 port 2377 (tcp) failed: Connection timed out
What it means: The network packets are being silently dropped. This indicates that a firewall (either cloud VPC Security Groups/NSGs or host-level UFW/iptables) is blocking traffic on this port. Double-check your firewall ingress and egress configuration rules.
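The three outcomes above can be told apart automatically. The following sketch assumes the OpenBSD netcat message wording shown in this section (other netcat variants phrase their messages differently) and adds a 3-second timeout with -w so that a filtered port does not hang the check.

```shell
# Map netcat's message to the three cases described above.
classify_nc_output() {
  case "$1" in
    *succeeded*)            echo "open" ;;
    *"Connection refused"*) echo "refused" ;;
    *"timed out"*)          echo "timeout" ;;
    *)                      echo "unknown" ;;
  esac
}

# Probe one TCP port, waiting at most 3 seconds, and classify the result.
check_port() {
  classify_nc_output "$(nc -zv -w 3 "$1" "$2" 2>&1)"
}

# Usage:
#   check_port 10.0.1.20 2377
```

A result of refused points at the service on the target node, while timeout points at a firewall between the nodes.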
Symptom: Accessing the gateway results in a 504 Gateway Time-out. Core APIs on VM2 cannot communicate with databases on VM3.
Cause: Blocked UDP Port 4789 (VXLAN)
Explanation: The cluster virtual network requires UDP Port 4789 to be open inbound/outbound on all cluster nodes. If it is blocked or configured as TCP, inter-node container routing fails.
Diagnosis & Resolution:
Test UDP port connectivity manually between nodes using netcat:
# From VM1 to VM2 App Node
nc -zuv <VM2_PRIVATE_IP> 4789
# From VM2 to VM3 DB Node
nc -zuv <VM3_PRIVATE_IP> 4789
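If the test points to a host-level block, a rule along the following lines may resolve it. This is a sketch that assumes UFW is the host firewall; cloud Security Groups or NSGs need an equivalent UDP rule, and the peer addresses are placeholders. The helper only prints the commands so that they can be reviewed before being run.

```shell
# Print one UFW rule per peer node that admits VXLAN traffic (UDP 4789).
# Review the output, then run it on each cluster node.
vxlan_ufw_rules() {
  for peer in "$@"; do
    echo "sudo ufw allow from $peer to any port 4789 proto udp"
  done
}

# Usage on VM2 (peer addresses are placeholders):
#   vxlan_ufw_rules 10.0.1.10 10.0.1.30
# Because UDP has no handshake, nc can report success even when packets are
# dropped. To confirm delivery, watch for VXLAN packets on the receiving node
# while containers on different nodes exchange traffic:
#   sudo tcpdump -ni any udp port 4789
```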
Symptom: The installer or upgrade scripts fail with SSH connection errors to VM2 or VM3.
Cause: Incorrect Key Permissions or Missing Sudoers Rules
Explanation: The VM1 Access Node connects to workers over SSH. This fails if key permissions are insecure or the SSH user lacks passwordless sudo privileges.
Resolution:
Restrict private key permissions on VM1 (substitute the path of your cluster SSH key):
chmod 400 /home/ubuntu/3-tier.pem
Ensure the SSH user (e.g., ubuntu) has passwordless sudo configured on VM2 and VM3:
echo "ubuntu ALL=(ALL) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/90-init-users
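Both fixes can be verified before re-running the installer. The sketch below assumes GNU stat (as shipped on Ubuntu); the helper name is illustrative, and the key path, user, and address in the usage notes are the same examples used above.

```shell
# Succeed only if the key file is accessible by its owner alone
# (mode 400 or 600), which is what the SSH client requires.
key_perms_ok() {
  mode="$(stat -c '%a' "$1")" || return 1
  [ "$mode" = "400" ] || [ "$mode" = "600" ]
}

# Usage on VM1 (key path, user, and address are placeholders):
#   key_perms_ok /home/ubuntu/3-tier.pem && echo "key permissions OK"
#   ssh -i /home/ubuntu/3-tier.pem ubuntu@<VM2_PRIVATE_IP> 'sudo -n true' \
#     && echo "passwordless sudo OK"
# On VM2 and VM3, validate the drop-in file after writing it:
#   sudo visudo -cf /etc/sudoers.d/90-init-users
```

sudo -n fails immediately instead of prompting, so the SSH check cannot hang waiting for a password.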
Explanation: The orchestrator relies on node labels (tier=access, tier=app, tier=db) to schedule containers. If a node is missing its label, tasks constrained to that tier cannot be scheduled on it and remain pending.
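One way to check the labels is sketched below, using standard docker node commands run on a Swarm manager. The helper name is illustrative and <NODE_HOSTNAME> is a placeholder.

```shell
# List nodes that have no "tier" label.
# Reads "hostname label-value" lines on stdin; the value is empty when unset.
nodes_missing_tier() {
  awk 'NF < 2 { print $1 }'
}

# Usage on VM1 (a Swarm manager):
#   docker node ls -q | xargs docker node inspect \
#     --format '{{ .Description.Hostname }} {{ index .Spec.Labels "tier" }}' \
#     | nodes_missing_tier
# Re-apply a missing label, for example on the App Node:
#   docker node update --label-add tier=app <NODE_HOSTNAME>
```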
Symptom: Microservices on VM2 (App Node) crash-loop with exit code 137.
Cause: Out-Of-Memory (OOM) Termination
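Explanation: Exit code 137 means the container's main process was killed with SIGKILL (128 + 9). On a crash-looping service, this usually indicates that the kernel's OOM killer terminated the container because the VM's RAM or the container's memory limit was exhausted.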
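Before acting on that explanation, it may be worth confirming that memory was really the cause, since a manual docker kill produces the same exit code. A minimal sketch using standard Docker and Linux tooling follows; the helper name is illustrative and <CONTAINER_ID> is a placeholder.

```shell
# Translate a container exit code into the signal that ended it (0 if none).
exit_code_signal() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  else
    echo 0
  fi
}

# Usage on VM2 (<CONTAINER_ID> is a placeholder):
#   exit_code_signal 137
#   docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' <CONTAINER_ID>
#   sudo dmesg -T | grep -i 'out of memory'
#   free -h
```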
Resolution:
Prune unused images and containers on VM2:
sudo docker system prune -af
CAUTION: Avoid the --volumes flag during active deployments
Running docker system prune with the --volumes flag also deletes unused local Docker volumes. If the Swarm stack is stopped or a database replica is temporarily offline, its volumes count as unused, so this permanently deletes your database storage (PostgreSQL, MongoDB, etc.), resulting in irreversible data loss. For routine space cleanup, run the prune command without the --volumes flag.