williambglowacki
Hello HPC Community,
I hope this message finds you well. I'm currently facing a significant issue with my Windows High-Performance Computing (HPC) setup that is severely impacting my work, and I am hoping someone here can help me resolve it.
Here's the situation: I have a small HPC cluster running on Windows Server 2019, which I've been using for intensive computational tasks. The cluster consists of one head node and four compute nodes. Recently, the compute nodes have stopped communicating properly with the head node. The issue started after I applied the latest Windows updates, and it has been causing job failures and considerable downtime.
The main symptoms are:
- Compute nodes intermittently drop out of the cluster: The compute nodes occasionally disappear from the cluster manager, which disrupts ongoing computations. This happens randomly but more frequently under heavy load.
- Job scheduling failures: Jobs are getting stuck in the queue and not being dispatched to the compute nodes. The job scheduler logs show repeated "unable to connect" errors.
- Increased network latency: Data transfer between nodes, which used to be seamless, now experiences significant delays. Ping tests show variable and high latency values, with occasional timeouts.
Here are the details of my setup:
- Head node specifications: Intel Xeon E5-2640, 64GB RAM, Windows Server 2019, Microsoft HPC Pack 2016 Update 3
- Compute node specifications: Intel Xeon E5-2620, 32GB RAM, Windows Server 2019
- Network setup: 10 Gbps Ethernet, static IP addresses, Cisco SG350-28 switch
- Storage: Shared storage via SMB on a dedicated NAS device (Synology DS1819+)
- Software versions: Windows Server 2019 (Build 17763), Microsoft HPC Pack 2016 Update 3
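One way to put numbers on the latency and timeout symptoms described above is to time repeated TCP connections from the head node to each compute node. The following is only a minimal sketch under stated assumptions, not part of the original setup: the node names (`COMPUTE01` to `COMPUTE04`) are hypothetical placeholders, and port 445 (SMB) is assumed to be reachable on each node. Substitute whatever hostnames and open port apply to your cluster.

```python
import socket
import statistics
import time


def tcp_connect_ms(host, port, timeout=2.0):
    """Return the TCP connect time in milliseconds, or None on failure or timeout."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return None


def summarize(samples):
    """Summarize latency samples in ms; None entries are counted as failures."""
    ok = [s for s in samples if s is not None]
    return {
        "sent": len(samples),
        "failed": len(samples) - len(ok),
        "min_ms": min(ok) if ok else None,
        "median_ms": statistics.median(ok) if ok else None,
        "max_ms": max(ok) if ok else None,
    }


if __name__ == "__main__":
    # Hypothetical node names; replace with the real compute node hostnames or IPs.
    nodes = ["COMPUTE01", "COMPUTE02", "COMPUTE03", "COMPUTE04"]
    port = 445  # Assumption: SMB is reachable on every node.

    for node in nodes:
        samples = []
        for _ in range(20):
            samples.append(tcp_connect_ms(node, port))
            time.sleep(0.5)
        print(node, summarize(samples))
```

Running this once while the cluster is idle and once under heavy load would show whether the failure count and the gap between median and maximum latency grow with load, which may help narrow the problem down to the network path rather than the scheduler.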