Issue with HPC Computing on Windows

williambglowacki · New Member · Joined: Jun 2, 2024 · Messages: 3
Hello HPC Community,

I hope this message finds you well. I'm currently facing a significant issue with my Windows High-Performance Computing (HPC) setup that is severely impacting my work, and I'm hoping someone here can help me resolve it.

Here's the situation: I have a small HPC cluster running on Windows Server 2019, which I use for intensive computational tasks. The cluster consists of one head node and four compute nodes. Recently, the compute nodes have stopped communicating properly with the head node. The problem started after applying the latest Windows updates and is causing job failures and considerable downtime.

The main symptoms are:

  1. Compute nodes intermittently drop out of the cluster: The compute nodes occasionally disappear from the cluster manager, which disrupts ongoing computations. This happens randomly but more frequently under heavy load.
  2. Job scheduling failures: Jobs are getting stuck in the queue and not being dispatched to the compute nodes. The job scheduler logs show repeated "unable to connect" errors.
  3. Increased network latency: Data transfer between nodes, which used to be seamless, now experiences significant delays. Ping tests show variable and high latency values, with occasional timeouts.
Details of my setup:

  • Head node specifications: Intel Xeon E5-2640, 64GB RAM, Windows Server 2019, Microsoft HPC Pack 2016 Update 3
  • Compute node specifications: Intel Xeon E5-2620, 32GB RAM, Windows Server 2019
  • Network setup: 10 Gbps Ethernet, static IP addresses, Cisco SG350-28 switch
  • Storage: Shared storage via SMB on a dedicated NAS device (Synology DS1819+)
  • Software versions: Windows Server 2019 (Build 17763), Microsoft HPC Pack 2016 Update 3
 
Solution
It's clear that you're experiencing significant issues with your HPC setup after the recent Windows updates. Here are some steps and best practices to troubleshoot and resolve the communication and performance issues on your cluster:

Step-by-Step Troubleshooting Guide

Step 1: Verify Network Configuration and Performance

  1. Check Network Switch and Cabling:
    • Ensure all network cables are properly connected and not damaged.
    • Verify that the switch configurations haven’t changed. Check for any firmware updates for the Cisco SG350-28 switch.
  2. Network Performance Tests:
    • Use ping to test basic connectivity from the head node to each compute node:
      Code:
      ping <compute-node-IP>
    • Use iperf to measure network throughput:
      Code:
      :: On one node, start the server:
      iperf3 -s
      :: On another node, run the client against it to measure bandwidth:
      iperf3 -c <server-IP>
  3. Check Network Settings:
      • Ensure there are no IP address conflicts.
      • Verify the network interface cards (NICs) are configured correctly.
      • Ensure that the Ethernet adapter settings are optimal for HPC workloads (check for offload settings, jumbo frames, etc.).
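To quantify the intermittent latency and timeouts, the ping test above can be scripted across all compute nodes. Below is a rough Python sketch (assuming Python is available on the head node; the node addresses are placeholders for your own static IPs):

```python
import platform
import re
import subprocess

# Placeholder addresses -- substitute your compute nodes' static IPs.
NODES = ["192.168.1.11", "192.168.1.12", "192.168.1.13", "192.168.1.14"]

def parse_rtt_ms(ping_output):
    """Extract the round-trip time in ms from ping output, or None if absent."""
    match = re.search(r"time[=<]([\d.]+)\s*ms", ping_output)
    return float(match.group(1)) if match else None

def ping_node(host):
    """Ping a host once; return its RTT in ms, or None on timeout/failure."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            capture_output=True, text=True, timeout=5,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None
    return parse_rtt_ms(result.stdout)

# Usage: for node in NODES: print(node, ping_node(node))
```

Running this periodically under load and logging the output gives you a timeline of latency spikes and dropouts to correlate with Event Viewer entries.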

Step 2: Review System and Application Logs

    • Event Viewer Logs:
      • Check the Windows Event Viewer on the head node and compute nodes for any errors or warnings.
      • Look for patterns or specific errors related to network, cluster service, or job scheduling.
    • HPC Pack Logs:
      • Review the HPC Pack logs for any errors or warnings that could provide more details on communication issues.
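To triage large log files from either source quickly, a small script can pull out just the lines that mention connectivity problems. This is a generic text-filtering sketch; the keyword list and the commented-out path are assumptions to adapt to your environment:

```python
import re

# Keywords drawn from the symptoms described above; extend as needed.
KEYWORDS = ("unable to connect", "error", "warning", "timeout")

def find_issue_lines(log_lines, keywords=KEYWORDS):
    """Return only the lines that mention any keyword, case-insensitively."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [line for line in log_lines if pattern.search(line)]

# Example: scan a log file copied off a compute node (path is hypothetical).
# with open(r"C:\logs\scheduler.log") as f:
#     for line in find_issue_lines(f):
#         print(line.rstrip())
```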

Step 3: Check for and Roll Back Problematic Updates

    • Identify Recent Updates:
      • Go to Settings > Update & Security > Windows Update > View update history to identify updates installed recently.
      • Note any updates installed around the time the issues began.
    • Roll Back Updates:
      • If a specific update is suspected, you can uninstall it:
        Code:
        wusa /uninstall /kb:<KB-number>
    • Pause Further Updates:
      • Pause updates temporarily to prevent reinstallation until the issue is resolved (run in an elevated PowerShell session):
        Code:
        Set-Service wuauserv -StartupType Disabled
        Stop-Service wuauserv
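To cross-reference installed updates with the date the problems began, you can dump the hotfix list (for example with `wmic qfe list full` or PowerShell's `Get-HotFix`) and extract the KB numbers programmatically. A sketch that pulls KB IDs out of such a listing (the sample format in the comment is an assumption; adjust to your actual output):

```python
import re

def extract_kb_ids(listing_text):
    """Return the unique KB article IDs in an update listing, in order of appearance.

    Works on any text containing IDs like KB1234567, e.g. lines of the form
    "HotFixID=KB1234567" from `wmic qfe list full`.
    """
    seen = []
    for kb in re.findall(r"KB\d{6,7}", listing_text):
        if kb not in seen:
            seen.append(kb)
    return seen
```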

Step 4: Verify HPC Pack Configuration

    • Network Configuration in HPC Pack:
      • Ensure the correct network interface is being used for HPC communication.
      • Go to HPC Cluster Manager > Configuration > Network Configuration and verify the settings.
    • Cluster Nodes Configuration:
      • Check each compute node’s status in HPC Cluster Manager.
      • Ensure that the node status is healthy and no alerts are raised.
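If you prefer the command line to the GUI, HPC Pack's `node list` command prints each node with its state, and a small filter can flag anything that is not online. The line format assumed below (`name  state  health`) is a guess; check it against your actual output before relying on it:

```python
def unhealthy_nodes(listing_lines):
    """Return (name, state) pairs for nodes whose state is not 'Online'.

    Assumes one node per line in the form:  NODE01   Online   OK
    (an assumption -- verify against your `node list` output).
    """
    flagged = []
    for line in listing_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        name, state = parts[0], parts[1]
        if state.lower() != "online":
            flagged.append((name, state))
    return flagged
```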

Step 5: Reconfigure and Test Job Scheduler

    • Restart HPC Services:
      • Restart the HPC services on the head node and compute nodes:
        Code:
        net stop hpcscheduler
        net start hpcscheduler
        net stop hpcmanagement
        net start hpcmanagement
    • Submit Test Jobs:
      • Submit small test jobs to verify if they are scheduled and executed correctly.
      • Monitor the logs and job status for any errors or delays.
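Submitting and watching a test job can also be automated with the HPC Pack `job` command-line tool. The sketch below wraps `job submit` and parses the state field from `job view` output; the `State : <value>` format is an assumption to verify against your cluster:

```python
import re
import subprocess

def submit_test_job(command="echo hpc-test"):
    """Submit a trivial job via the HPC Pack CLI and return the raw output.

    Assumes the `job` tool (installed with HPC Pack) is on PATH.
    """
    result = subprocess.run(["job", "submit", command],
                            capture_output=True, text=True, timeout=30)
    return result.stdout

def parse_job_state(job_view_output):
    """Pull the job state (e.g. Queued, Running, Finished, Failed) from `job view` output.

    Assumes a line of the form "State : Finished".
    """
    match = re.search(r"State\s*:\s*(\w+)", job_view_output)
    return match.group(1) if match else None
```

Polling `parse_job_state` on a submitted test job makes it easy to spot jobs that sit in Queued indefinitely, which matches the dispatching symptom described above.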

Step 6: Advanced Configuration Checks

    • Firewall and Security Software:
      • Ensure firewall settings allow necessary traffic between the head node and compute nodes.
      • Temporarily disable security software to rule out interference.
    • DNS and Name Resolution:
      • Verify DNS configuration and ensure that all nodes can resolve each other’s names.
      • Use nslookup to test name resolution.
    • Shared Storage Performance:
      • Ensure that the SMB shared storage is functioning correctly and not causing delays. Test access times and throughput.
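For the shared-storage check, a rough throughput number can be obtained by timing a bulk write and read against the SMB share. A minimal sketch (the share path in the usage comment is a placeholder, and numbers will be approximate because of OS caching):

```python
import os
import time

def measure_throughput(dir_path, size_mb=64):
    """Write then read size_mb of data in dir_path; return (write_MBps, read_MBps)."""
    chunk = os.urandom(1024 * 1024)  # 1 MiB of incompressible data
    test_file = os.path.join(dir_path, "hpc_io_test.bin")
    try:
        start = time.perf_counter()
        with open(test_file, "wb") as f:
            for _ in range(size_mb):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # force data onto the share, not just the cache
        write_s = time.perf_counter() - start

        start = time.perf_counter()
        with open(test_file, "rb") as f:
            while f.read(1024 * 1024):
                pass
        read_s = time.perf_counter() - start
    finally:
        if os.path.exists(test_file):
            os.remove(test_file)
    return size_mb / write_s, size_mb / read_s

# Usage (UNC path is a placeholder for your Synology share):
# w, r = measure_throughput(r"\\nas\hpc-share")
# print(f"write: {w:.1f} MB/s, read: {r:.1f} MB/s")
```

Comparing these numbers against a local disk on the same node helps separate network-induced storage delays from NAS-side problems.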

Final Thoughts

By following these steps, you should be able to identify and resolve the issues causing communication breakdowns and job scheduling failures in your HPC cluster. Here's a summarized action plan:

Step | Action Items
Verify Network Configuration | Check cables and switch settings, perform ping and iperf tests, ensure optimal NIC settings.
Review System and Application Logs | Check Event Viewer and HPC Pack logs for errors or warnings.
Roll Back Problematic Updates | Identify and uninstall recent updates, pause further updates.
Verify HPC Pack Configuration | Ensure the correct network interface and healthy node status in HPC Cluster Manager.
Reconfigure and Test Job Scheduler | Restart HPC services, submit test jobs, and monitor logs.
Advanced Configuration Checks | Ensure firewall and DNS settings are correct, verify shared storage performance.

Example Commands and Tools

  • ping (basic connectivity test):
    Code:
    ping <compute-node-IP>
  • iperf (throughput test):
    Code:
    :: On the server node:
    iperf3 -s
    :: On the client node:
    iperf3 -c <server-IP>
  • wusa (to uninstall updates):
    Code:
    wusa /uninstall /kb:<KB-number>
  • Restart HPC services:
    Code:
    net stop hpcscheduler
    net start hpcscheduler
    net stop hpcmanagement
    net start hpcmanagement
If you need further assistance with any specific step or encounter any particular issues, feel free to ask for more detailed guidance!
 