Issue with HPC Computing on Windows

williambglowacki · New Member · Joined: Jun 2, 2024 · Messages: 3
Hello HPC Community,

I hope this message finds you well. I'm currently facing a significant issue with my Windows High-Performance Computing (HPC) setup that is severely impacting my work, and I'm hoping someone here can help me resolve it.

Here's the situation: I have a small HPC cluster running on Windows Server 2019, which I use for intensive computational tasks. The cluster consists of one head node and four compute nodes. Recently, the compute nodes have stopped communicating properly with the head node. The problem started after applying the latest Windows updates and is causing job failures and considerable downtime.

The main symptoms are:

  1. Compute nodes intermittently drop out of the cluster: The compute nodes occasionally disappear from the cluster manager, which disrupts ongoing computations. This happens randomly but more frequently under heavy load.
  2. Job scheduling failures: Jobs are getting stuck in the queue and not being dispatched to the compute nodes. The job scheduler logs show repeated "unable to connect" errors.
  3. Increased network latency: Data transfer between nodes, which used to be seamless, now experiences significant delays. Ping tests show variable and high latency values, with occasional timeouts.
Details of my setup:

  • Head node specifications: Intel Xeon E5-2640, 64GB RAM, Windows Server 2019, Microsoft HPC Pack 2016 Update 3
  • Compute node specifications: Intel Xeon E5-2620, 32GB RAM, Windows Server 2019
  • Network setup: 10 Gbps Ethernet, static IP addresses, Cisco SG350-28 switch
  • Storage: Shared storage via SMB on a dedicated NAS device (Synology DS1819+)
  • Software versions: Windows Server 2019 (Build 17763), Microsoft HPC Pack 2016 Update 3
 
Solution
It's clear that you're experiencing significant issues with your HPC setup after the recent Windows updates. Here are some steps and best practices to troubleshoot and resolve the communication and performance issues on your cluster:

Step-by-Step Troubleshooting Guide

Step 1: Verify Network Configuration and Performance

  1. Check Network Switch and Cabling:
    • Ensure all network cables are properly connected and not damaged.
    • Verify that the switch configurations haven’t changed. Check for any firmware updates for the Cisco SG350-28 switch.
  2. Network Performance Tests:
    • Use ping to test basic connectivity from the head node to each compute node:
      Code:
      ping <compute-node-IP>
    • Use iperf to measure network throughput:
      Code:
      :: On one node, start the server:
      iperf3 -s
      :: On another node, run the client against it to measure bandwidth:
      iperf3 -c <server-IP>
  3. Check Network Settings:
      • Ensure there are no IP address conflicts.
      • Verify the network interface cards (NICs) are configured correctly.
      • Ensure that the Ethernet adapter settings are optimal for HPC workloads (check for offload settings, jumbo frames, etc.).
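To quantify the intermittent latency and timeouts, the ping test above can be scripted across all compute nodes. Below is a rough Python sketch (assuming Python is available on the head node; the node addresses are placeholders for your own static IPs):

```python
import platform
import re
import subprocess

# Placeholder addresses -- substitute your compute nodes' static IPs.
NODES = ["192.168.1.11", "192.168.1.12", "192.168.1.13", "192.168.1.14"]

def parse_rtt_ms(ping_output):
    """Extract the round-trip time in ms from ping output, or None if absent."""
    match = re.search(r"time[=<]([\d.]+)\s*ms", ping_output)
    return float(match.group(1)) if match else None

def ping_node(host):
    """Ping a host once; return its RTT in ms, or None on timeout/failure."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            capture_output=True, text=True, timeout=5,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    if result.returncode != 0:
        return None
    return parse_rtt_ms(result.stdout)

# Usage: for node in NODES: print(node, ping_node(node))
```

Running this periodically under load and logging the output gives you a timeline of latency spikes and dropouts to correlate with Event Viewer entries.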

Step 2: Review System and Application Logs

    • Event Viewer Logs:
      • Check the Windows Event Viewer on the head node and compute nodes for any errors or warnings.
      • Look for patterns or specific errors related to network, cluster service, or job scheduling.
    • HPC Pack Logs:
      • Review the HPC Pack logs for any errors or warnings that could provide more details on communication issues.
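To triage large log files from either source quickly, a small script can pull out just the lines that mention connectivity problems. This is a generic text-filtering sketch; the keyword list and the commented-out path are assumptions to adapt to your environment:

```python
import re

# Keywords drawn from the symptoms described above; extend as needed.
KEYWORDS = ("unable to connect", "error", "warning", "timeout")

def find_issue_lines(log_lines, keywords=KEYWORDS):
    """Return only the lines that mention any keyword, case-insensitively."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [line for line in log_lines if pattern.search(line)]

# Example: scan a log file copied off a compute node (path is hypothetical).
# with open(r"C:\logs\scheduler.log") as f:
#     for line in find_issue_lines(f):
#         print(line.rstrip())
```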

Step 3: Check for and Roll Back Problematic Updates

    • Identify Recent Updates:
      • Go to Settings > Update & Security > Windows Update > View update history to identify updates installed recently.
      • Note any updates installed around the time the issues began.
    • Roll Back Updates:
      • If a specific update is suspected, you can uninstall it:
        Code:
        wusa /uninstall /kb:<KB-number>
    • Pause Further Updates:
      • Pause updates temporarily to prevent reinstallation until the issue is resolved (run in an elevated PowerShell session):
        Code:
        Set-Service wuauserv -StartupType Disabled
        Stop-Service wuauserv
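To cross-reference installed updates with the date the problems began, you can dump the hotfix list (for example with `wmic qfe list full` or PowerShell's `Get-HotFix`) and extract the KB numbers programmatically. A sketch that pulls KB IDs out of such a listing (the sample format in the comment is an assumption; adjust to your actual output):

```python
import re

def extract_kb_ids(listing_text):
    """Return the unique KB article IDs in an update listing, in order of appearance.

    Works on any text containing IDs like KB1234567, e.g. lines of the form
    "HotFixID=KB1234567" from `wmic qfe list full`.
    """
    seen = []
    for kb in re.findall(r"KB\d{6,7}", listing_text):
        if kb not in seen:
            seen.append(kb)
    return seen
```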

Step 4: Verify HPC Pack Configuration

    • Network Configuration in HPC Pack:
      • Ensure the correct network interface is being used for HPC communication.
      • Go to HPC Cluster Manager > Configuration > Network Configuration and verify the settings.
    • Cluster Nodes Configuration:
      • Check each compute node’s status in HPC Cluster Manager.
      • Ensure that the node status is healthy and no alerts are raised.
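If you prefer the command line to the GUI, HPC Pack's `node list` command prints each node with its state, and a small filter can flag anything that is not online. The line format assumed below (`name  state  health`) is a guess; check it against your actual output before relying on it:

```python
def unhealthy_nodes(listing_lines):
    """Return (name, state) pairs for nodes whose state is not 'Online'.

    Assumes one node per line in the form:  NODE01   Online   OK
    (an assumption -- verify against your `node list` output).
    """
    flagged = []
    for line in listing_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        name, state = parts[0], parts[1]
        if state.lower() != "online":
            flagged.append((name, state))
    return flagged
```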

Step 5: Reconfigure and Test Job Scheduler

    • Restart HPC Services:
      • Restart the HPC services on the head node and compute nodes:
        Code:
        net stop hpcscheduler
        net start hpcscheduler
        net stop hpcmanagement
        net start hpcmanagement
    • Submit Test Jobs:
      • Submit small test jobs to verify if they are scheduled and executed correctly.
      • Monitor the logs and job status for any errors or delays.
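Submitting and watching a test job can also be automated with the HPC Pack `job` command-line tool. The sketch below wraps `job submit` and parses the state field from `job view` output; the `State : <value>` format is an assumption to verify against your cluster:

```python
import re
import subprocess

def submit_test_job(command="echo hpc-test"):
    """Submit a trivial job via the HPC Pack CLI and return the raw output.

    Assumes the `job` tool (installed with HPC Pack) is on PATH.
    """
    result = subprocess.run(["job", "submit", command],
                            capture_output=True, text=True, timeout=30)
    return result.stdout

def parse_job_state(job_view_output):
    """Pull the job state (e.g. Queued, Running, Finished, Failed) from `job view` output.

    Assumes a line of the form "State : Finished".
    """
    match = re.search(r"State\s*:\s*(\w+)", job_view_output)
    return match.group(1) if match else None
```

Polling `parse_job_state` on a submitted test job makes it easy to spot jobs that sit in Queued indefinitely, which matches the dispatching symptom described above.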

Step 6: Advanced Configuration Checks

    • Firewall and Security Software:
      • Ensure firewall settings allow necessary traffic between the head node and compute nodes.
      • Temporarily disable security software to rule out interference.
    • DNS and Name Resolution:
      • Verify DNS configuration and ensure that all nodes can resolve each other’s names.
      • Use nslookup to test name resolution.
    • Shared Storage Performance:
      • Ensure that the SMB shared storage is functioning correctly and not causing delays. Test access times and throughput.
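For the shared-storage check, a rough throughput number can be obtained by timing a bulk write and read against the SMB share. A minimal sketch (the share path in the usage comment is a placeholder, and numbers will be approximate because of OS caching):

```python
import os
import time

def measure_throughput(dir_path, size_mb=64):
    """Write then read size_mb of data in dir_path; return (write_MBps, read_MBps)."""
    chunk = os.urandom(1024 * 1024)  # 1 MiB of incompressible data
    test_file = os.path.join(dir_path, "hpc_io_test.bin")
    try:
        start = time.perf_counter()
        with open(test_file, "wb") as f:
            for _ in range(size_mb):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # force data onto the share, not just the cache
        write_s = time.perf_counter() - start

        start = time.perf_counter()
        with open(test_file, "rb") as f:
            while f.read(1024 * 1024):
                pass
        read_s = time.perf_counter() - start
    finally:
        if os.path.exists(test_file):
            os.remove(test_file)
    return size_mb / write_s, size_mb / read_s

# Usage (UNC path is a placeholder for your Synology share):
# w, r = measure_throughput(r"\\nas\hpc-share")
# print(f"write: {w:.1f} MB/s, read: {r:.1f} MB/s")
```

Comparing these numbers against a local disk on the same node helps separate network-induced storage delays from NAS-side problems.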

Final Thoughts

By following these steps, you should be able to identify and resolve the issues causing communication breakdowns and job scheduling failures in your HPC cluster. Here's a summarized action plan:

Step | Action Items
Verify Network Configuration | Check cables and switch settings, perform ping and iperf tests, ensure optimal NIC settings.
Review System and Application Logs | Check Event Viewer and HPC Pack logs for errors or warnings.
Roll Back Problematic Updates | Identify and uninstall recent updates, pause further updates.
Verify HPC Pack Configuration | Ensure the correct network interface and healthy node status in HPC Cluster Manager.
Reconfigure and Test Job Scheduler | Restart HPC services, submit test jobs, and monitor logs.
Advanced Configuration Checks | Ensure firewall and DNS settings are correct, verify shared storage performance.

Example Commands and Tools

  • ping (basic connectivity test):
    Code:
    ping <compute-node-IP>
  • iperf (throughput test):
    Code:
    :: On the server node:
    iperf3 -s
    :: On the client node:
    iperf3 -c <server-IP>
  • wusa (to uninstall updates):
    Code:
    wusa /uninstall /kb:<KB-number>
  • Restart HPC services:
    Code:
    net stop hpcscheduler
    net start hpcscheduler
    net stop hpcmanagement
    net start hpcmanagement
If you need further assistance with any specific step or encounter any particular issues, feel free to ask for more detailed guidance!
 