I hope this message finds you well. I'm currently facing a significant issue with my Windows High-Performance Computing (HPC) setup, severely impacting my work. I am hoping someone here can help me resolve it.
Here's the situation: I have a small HPC cluster running Windows Server 2019, which I use for intensive computational tasks. The cluster consists of one head node and four compute nodes. Recently, the compute nodes have stopped communicating reliably with the head node. The problem started after I applied the latest Windows updates and has caused job failures and considerable downtime.
The main symptoms are:
Compute nodes intermittently drop out of the cluster: The compute nodes occasionally disappear from the cluster manager, which disrupts ongoing computations. This happens randomly but more frequently under heavy load.
Job scheduling failures: Jobs are getting stuck in the queue and not being dispatched to the compute nodes. The job scheduler logs show repeated "unable to connect" errors.
Increased network latency: Data transfer between nodes, which used to be seamless, now experiences significant delays. Ping tests show variable and high latency values, with occasional timeouts.
Details of my setup:
Head node specifications: Intel Xeon E5-2640, 64GB RAM, Windows Server 2019, Microsoft HPC Pack 2016 Update 3
Compute node specifications: Intel Xeon E5-2620, 32GB RAM, Windows Server 2019
It's clear that you're experiencing significant issues with your HPC setup after the recent Windows updates. Here are some steps and best practices to troubleshoot and resolve the communication and performance issues on your cluster:
Step-by-Step Troubleshooting Guide
Step 1: Verify Network Configuration and Performance
Check Network Switch and Cabling:
Ensure all network cables are properly connected and not damaged.
Verify that the switch configuration hasn't changed, and check for any available firmware updates for your switch (e.g., the Cisco SG350-28, if that's the model in use).
Network Performance Tests:
Use ping to test basic connectivity between the head node and each compute node (substitute each node's hostname or IP for the placeholder):
Code:
ping <compute-node> -n 20
Use iperf to measure network throughput:
Code:
iperf3 -s                     (on one node, start the server)
iperf3 -c <server-node-ip>    (on another node, start the client to measure bandwidth)
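To quantify the intermittent latency, you can parse saved ping output and summarize the round-trip times. This is a generic Python sketch, not part of HPC Pack; the sample output and node address below are invented for illustration:

```python
import re
import statistics

def parse_ping_times(ping_output):
    """Extract round-trip times in ms from Windows ping output (lines like 'time=3ms')."""
    return [float(m.group(1)) for m in re.finditer(r"time[=<](\d+)ms", ping_output)]

def latency_report(times):
    """Summarize latency samples: mean, jitter (sample stdev), and worst case."""
    return {
        "mean_ms": statistics.mean(times),
        "jitter_ms": statistics.stdev(times) if len(times) > 1 else 0.0,
        "max_ms": max(times),
    }

# Hypothetical output captured from: ping 10.0.0.11 -n 3
sample = (
    "Reply from 10.0.0.11: bytes=32 time=2ms TTL=128\n"
    "Reply from 10.0.0.11: bytes=32 time=148ms TTL=128\n"
    "Reply from 10.0.0.11: bytes=32 time=3ms TTL=128\n"
)
print(latency_report(parse_ping_times(sample)))
```

A jitter value that is large relative to the mean, as in this sample, is the numeric signature of the intermittent delays you're describing.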
Check Network Settings:
Ensure there are no IP address conflicts.
Verify the network interface cards (NICs) are configured correctly.
Ensure that the Ethernet adapter settings are optimal for HPC workloads (check for offload settings, jumbo frames, etc.).
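One quick way to rule out IP conflicts is to dump each node's address into an inventory and check for duplicates. A minimal sketch, with hypothetical hostnames and addresses:

```python
from collections import Counter

def find_ip_conflicts(node_ips):
    """Return IPs claimed by more than one node, mapped to the conflicting hostnames."""
    counts = Counter(node_ips.values())
    return {
        ip: [host for host, addr in node_ips.items() if addr == ip]
        for ip, n in counts.items()
        if n > 1
    }

# Hypothetical inventory gathered from each node (e.g., via ipconfig).
inventory = {
    "HEADNODE":  "10.0.0.1",
    "COMPUTE01": "10.0.0.11",
    "COMPUTE02": "10.0.0.12",
    "COMPUTE03": "10.0.0.11",  # duplicate address: a classic cause of intermittent drops
    "COMPUTE04": "10.0.0.14",
}
print(find_ip_conflicts(inventory))
```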
Step 2: Review System and Application Logs
Event Viewer Logs:
Check the Windows Event Viewer on the head node and compute nodes for any errors or warnings.
Look for patterns or specific errors related to network, cluster service, or job scheduling.
HPC Pack Logs:
Review the HPC Pack logs for any errors or warnings that could provide more details on communication issues.
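Once the scheduler or event logs are exported to text, a small script can tally the "unable to connect" errors per node, which quickly shows whether one compute node is failing far more often than the others. The log line format below is invented for illustration; adjust the pattern to match your actual logs:

```python
import re
from collections import Counter

def count_connect_errors(log_lines):
    """Tally 'unable to connect' errors per target node from exported log lines."""
    counts = Counter()
    for line in log_lines:
        match = re.search(r"unable to connect to (\S+)", line, re.IGNORECASE)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical exported log lines.
lines = [
    "2024-05-01 10:02:13 ERROR Scheduler unable to connect to COMPUTE02",
    "2024-05-01 10:04:41 INFO  Job 118 queued",
    "2024-05-01 10:05:02 ERROR Scheduler unable to connect to COMPUTE02",
]
print(count_connect_errors(lines))
```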
Step 3: Check for and Roll Back Problematic Updates
Identify Recent Updates:
Go to Settings > Update & Security > Windows Update > View update history to identify updates installed recently.
Note any updates installed around the time the issues began.
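Correlating install dates with the day the symptoms began makes it easier to pick out the suspect. A quick sketch; the KB numbers and dates are hypothetical, so substitute your actual update history:

```python
from datetime import date, timedelta

def updates_near_onset(history, onset, window_days=3):
    """Return KB ids installed within window_days before (and including) the onset date."""
    earliest = onset - timedelta(days=window_days)
    return [kb for kb, installed in history if earliest <= installed <= onset]

# Hypothetical update history: (KB id, install date).
history = [
    ("KB5005030", date(2021, 8, 10)),
    ("KB5005101", date(2021, 9, 1)),
    ("KB5005565", date(2021, 9, 14)),
]
print(updates_near_onset(history, onset=date(2021, 9, 15)))
```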
Roll Back Updates:
If a specific update is suspected, you can uninstall it:
Code:
wusa /uninstall /kb:<KB-number>
Pause Further Updates:
Pause updates temporarily to prevent reinstallation until the issue is resolved (Settings > Update & Security > Windows Update > Pause updates for 7 days, or via Group Policy).
Step 4: Verify HPC Pack Configuration
Network Configuration:
Ensure the correct network interface is being used for HPC communication.
Go to HPC Cluster Manager > Configuration > Network Configuration and verify the settings.
Cluster Nodes Configuration:
Check each compute node’s status in HPC Cluster Manager.
Ensure that the node status is healthy and no alerts are raised.
Step 5: Reconfigure and Test Job Scheduler
Restart HPC Services:
Restart the HPC services on the head node and compute nodes:
Code:
net stop hpcscheduler
net start hpcscheduler
net stop hpcmanagement
net start hpcmanagement
Submit Test Jobs:
Submit small test jobs to verify if they are scheduled and executed correctly.
Monitor the logs and job status for any errors or delays.
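When monitoring the test jobs, the thing to flag is any job that stays in the Queued state past a reasonable dispatch window. A generic sketch; the job records below are invented, and in practice you'd pull state and submit time from HPC Cluster Manager or the job CLI:

```python
from datetime import datetime, timedelta

def stuck_jobs(jobs, now, max_queue_minutes=10):
    """Return ids of jobs sitting in the Queued state longer than max_queue_minutes."""
    cutoff = timedelta(minutes=max_queue_minutes)
    return [
        job["id"] for job in jobs
        if job["state"] == "Queued" and now - job["submitted"] > cutoff
    ]

# Hypothetical job records.
now = datetime(2024, 5, 1, 12, 0)
jobs = [
    {"id": 117, "state": "Finished", "submitted": datetime(2024, 5, 1, 11, 0)},
    {"id": 118, "state": "Queued",   "submitted": datetime(2024, 5, 1, 11, 30)},
    {"id": 119, "state": "Queued",   "submitted": datetime(2024, 5, 1, 11, 55)},
]
print(stuck_jobs(jobs, now))
```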
Step 6: Advanced Configuration Checks
Firewall and Security Software:
Ensure firewall settings allow necessary traffic between the head node and compute nodes.
Temporarily disable security software to rule out interference.
DNS and Name Resolution:
Verify DNS configuration and ensure that all nodes can resolve each other’s names.
Use nslookup to test name resolution.
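The nslookup checks can also be scripted across all nodes at once so a single failing name stands out. A minimal sketch using Python's standard resolver; replace the list with the actual node names shown in HPC Cluster Manager:

```python
import socket

def check_resolution(hostnames):
    """Try to resolve each hostname; a None value marks a resolution failure."""
    results = {}
    for name in hostnames:
        try:
            results[name] = socket.gethostbyname(name)
        except socket.gaierror:
            results[name] = None
    return results

# Substitute your node names, e.g. ["HEADNODE", "COMPUTE01", ...] (hypothetical).
print(check_resolution(["localhost"]))
```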
Shared Storage Performance:
Ensure that the SMB shared storage is functioning correctly and not causing delays. Test access times and throughput.
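For a rough storage check, timing a sequential write gives a throughput figure you can compare between local disk and the SMB share. A sketch that demos against a local temp file; point the path at the share for the real test (the UNC path in the comment is hypothetical):

```python
import os
import tempfile
import time

def measure_write_throughput(path, size_mb=16):
    """Write size_mb of data to path and return throughput in MB/s."""
    chunk = os.urandom(1024 * 1024)  # 1 MB of random data
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(size_mb):
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())  # force the data out before stopping the clock
    return size_mb / (time.perf_counter() - start)

# Demo against a local temp file; for the share, use something like
# target = r"\\HEADNODE\ClusterShare\probe.bin" (hypothetical share name).
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "probe.bin")
    print(f"{measure_write_throughput(target):.1f} MB/s")
```

If the share's figure is far below the local-disk figure, or varies wildly between runs, the storage path is worth investigating before blaming the scheduler.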
Final Thoughts
By following these steps, you should be able to identify and resolve the issues causing the communication breakdowns and job scheduling failures in your HPC cluster. Here's a summarized action plan:

Step 1. Verify Network Configuration: check cables and switch settings, perform ping and iperf tests, ensure optimal NIC settings.
Step 2. Review System and Application Logs: check Event Viewer and HPC Pack logs for errors or warnings.
Step 3. Roll Back Problematic Updates: identify and uninstall recent updates, pause further updates.
Step 4. Verify HPC Pack Configuration: ensure the correct network interface is used and node status is healthy in HPC Cluster Manager.
Step 5. Reconfigure and Test Job Scheduler: restart HPC services, submit test jobs, and monitor logs.
Step 6. Advanced Configuration Checks: ensure firewall and DNS settings are correct, verify shared storage performance.
Example Commands and Tools
ping:
Code:
ping <compute-node> -n 20
iperf:
Code:
iperf3 -s                     (on the server node)
iperf3 -c <server-node-ip>    (on the client node)
wusa (to uninstall updates):
Code:
wusa /uninstall /kb:<KB-number>
Restart HPC Services:
Code:
net stop hpcscheduler
net start hpcscheduler
net stop hpcmanagement
net start hpcmanagement
If you need further assistance with any specific step or encounter any particular issues, feel free to ask for more detailed guidance!