Cluster

Manage and monitor your compute cluster nodes for ML workloads.

Overview

The Cluster section lets you manage the compute nodes that run your ML experiments, training jobs, and deployments. Use it to monitor resource utilization, change node configuration, and track cluster health.

Cluster Dashboard

The dashboard displays key cluster metrics at a glance:

Summary Cards:

Total Nodes: Total number of nodes in the cluster
Online: Number of nodes currently online
Busy: Number of nodes actively processing workloads
CPU Usage (%): Average CPU usage across cluster
Memory Usage (%): Average memory usage across cluster
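
The summary cards are simple aggregates over the node list. The sketch below shows one way such values could be derived. The NodeSnapshot type, its field names, and the choice to count nodes purely by their status label and to average usage over all nodes are assumptions made for illustration, not the platform's documented behavior.

```python
from dataclasses import dataclass


@dataclass
class NodeSnapshot:
    # Hypothetical record; the field names are not taken from the platform.
    name: str
    status: str          # "Online", "Offline", "Busy", or "Maintenance"
    cpu_usage: float     # percent, 0-100
    memory_usage: float  # percent, 0-100


def summarize(nodes):
    """Compute dashboard-style summary values for a list of nodes."""
    total = len(nodes)
    # Assumption: a plain mean over every node, 0.0 for an empty cluster.
    avg_cpu = sum(n.cpu_usage for n in nodes) / total if total else 0.0
    avg_mem = sum(n.memory_usage for n in nodes) / total if total else 0.0
    return {
        "total_nodes": total,
        "online": sum(1 for n in nodes if n.status == "Online"),
        "busy": sum(1 for n in nodes if n.status == "Busy"),
        "cpu_usage_pct": avg_cpu,
        "memory_usage_pct": avg_mem,
    }
```

Whether a Busy node also counts toward Online is not stated on this page, so the sketch keeps the two counts separate.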

Node List View

The cluster table shows all nodes with the following information:

Columns:

Node: Node name and details (hostname, IP address)
Type: Node type (GPU, CPU)
Status: Current status (Online, Offline, Busy, Maintenance)
Resources: Available resources (CPU cores, RAM, GPU)
Usage: Real-time CPU and Memory usage with progress bars
Jobs: Running jobs count
Uptime: Node uptime duration
Health: Health status (Healthy, Warning, Critical)
Actions: Quick actions menu

Filtering and Search:

Search by node name or IP
Filter by Type (GPU, CPU, All)
Filter by Status (Online, Offline, Busy, Maintenance)
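
If you export the node list and want to reproduce the same search and filters offline, the logic can be sketched as below. The dictionary keys (name, ip, type, status) are hypothetical, and the substring matching is an assumption about how the search box behaves.

```python
def filter_nodes(nodes, search="", node_type="All", status=None):
    """Return the nodes that match the search text and both filters."""
    needle = search.strip().lower()
    matches = []
    for node in nodes:
        # Search by node name or IP (assumed to be a substring match).
        if needle and needle not in node["name"].lower() and needle not in node["ip"]:
            continue
        # "All" disables the type filter.
        if node_type != "All" and node["type"] != node_type:
            continue
        # status=None disables the status filter.
        if status is not None and node["status"] != status:
            continue
        matches.append(node)
    return matches
```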

Creating a Cluster Node

Navigate to Deep Learning Platform → Cluster, then click Create.

Basic Information

Node Name* (Required)

Unique identifier for the cluster node
Example: gpu-node-01, cpu-node-high-mem

Hostname* (Required)

Network hostname
Example: gpu01.cluster.local

IP Address* (Required)

IPv4 address of the node
Example: 192.168.1.101

Node Type* (Required)

Select from dropdown: GPU, CPU
Default: GPU

Status* (Required)

Select from dropdown: Online, Offline
Default: Online

CPU Resources

CPU Cores* (Required)

Total number of CPU cores
Example: 16

Total number of CPU Cores (Helper text)

Available CPU Cores* (Required)

Number of available CPU cores
Example: 8

Number of available CPU cores (Helper text)

Memory Resources

Total Memory (GB)* (Required)

Total RAM in GB
Example: 64

Total RAM in GB (Helper text)

Available Memory (GB)* (Required)

Available RAM in GB
Example: 32

Available memory (GB) (Helper text)

GPU Resources (Optional)

GPU Count

Number of GPUs (0 for CPU-only nodes)
Example: 0, 4, 8

Number of GPUs (0 for CPU-only nodes) (Helper text)

GPU Type

Select GPU model from dropdown
Options: NVIDIA A100, NVIDIA V100, NVIDIA T4, etc.

GPU Memory per GPU (GB)

Memory per GPU in GB
Example: 80 (for A100)

VRAM per GPU in GB (Helper text)

Storage & Network

Total Storage (GB)* (Required)

Total disk storage in GB
Example: 1000

Total disk space in GB (Helper text)

Network Bandwidth (Mbps)* (Required)

Network bandwidth in Mbps
Example: 10000 (10 Gbps)

Network speed in Mbps (Helper text)

Network Latency (ms)* (Required)

Average network latency in milliseconds
Example: 1

Average network latency (Helper text)

Configuration

Max Concurrent Jobs* (Required)

Maximum number of jobs that can run simultaneously
Example: 4

Maximum number of jobs that can run simultaneously (Helper text)

Priority

Node priority (1-10, higher is better)
Example: 5

Node priority (1-10, higher is better) (Helper text)

Tags

Comma-separated tags for categorization
Example: production,high-memory,gpu
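
When preparing tag strings in a script, it can help to normalize them first. The helper below is a minimal sketch and is hypothetical; how the platform itself treats whitespace, empty entries, or duplicate tags is not documented here.

```python
def parse_tags(raw):
    """Split a comma-separated tag string into a clean, ordered list."""
    tags = []
    for part in raw.split(","):
        tag = part.strip()
        # Skip empty entries and keep only the first copy of each tag.
        if tag and tag not in tags:
            tags.append(tag)
    return tags
```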

Location (Optional)

Datacenter

Datacenter location
Example: Datacenter

Rack

Rack identifier
Example: R-10

Zone

Availability zone
Example: Zone-A

Actions

Cancel: Discard and close
Create Cluster Node: Submit and create the node
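
Before submitting the form, you may want to check a node definition in a script. The sketch below validates a dictionary that mirrors the fields above. All key names are hypothetical, and the rule that available resources must not exceed totals is an assumption rather than a documented constraint. The IPv4 check uses Python's standard ipaddress module.

```python
import ipaddress

# Hypothetical keys mirroring the fields marked (Required) in the form.
REQUIRED_FIELDS = (
    "node_name", "hostname", "ip_address", "node_type", "status",
    "cpu_cores", "available_cpu_cores",
    "total_memory_gb", "available_memory_gb",
    "total_storage_gb", "network_bandwidth_mbps", "network_latency_ms",
    "max_concurrent_jobs",
)


def validate_node(form):
    """Return a list of problems found in the form; empty means it looks valid."""
    errors = [f"{field} is required"
              for field in REQUIRED_FIELDS
              if form.get(field) in (None, "")]
    if errors:
        return errors

    try:
        ipaddress.IPv4Address(form["ip_address"])
    except ValueError:
        errors.append("ip_address must be a valid IPv4 address")

    if form["node_type"] not in ("GPU", "CPU"):
        errors.append("node_type must be GPU or CPU")
    # Assumption: available resources cannot exceed the totals.
    if form["available_cpu_cores"] > form["cpu_cores"]:
        errors.append("available_cpu_cores exceeds cpu_cores")
    if form["available_memory_gb"] > form["total_memory_gb"]:
        errors.append("available_memory_gb exceeds total_memory_gb")
    # Priority is optional; when given it must be within 1-10.
    priority = form.get("priority")
    if priority is not None and not 1 <= priority <= 10:
        errors.append("priority must be between 1 and 10")
    if form.get("gpu_count", 0) < 0:
        errors.append("gpu_count cannot be negative")
    return errors
```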

Viewing Node Details

To view detailed information about a cluster node:

Navigate to Deep Learning Platform → Cluster
Click on a node from the list
View comprehensive details in the modal dialog

Details Panel Sections:

Basic Information:

Node Name: e.g., "gpu-node-01"
Hostname: e.g., "gpu01.cluster.local"
IP Address: e.g., "192.168.1.101"
Node Type: GPU or CPU
Status: Online, Offline, Busy, Maintenance

CPU Resources:

CPU Cores: Total cores (e.g., 32)
Total number of CPU cores
Available CPU Cores: Available cores (e.g., 11)
Number of available CPU cores

Memory Resources:

Total Memory (GB): Total RAM (e.g., 128)
Total RAM in GB
Available Memory (GB): Available RAM (e.g., 32)
Available memory (GB)

GPU Resources (Optional):

GPU Count: Number of GPUs (e.g., 4)
Number of GPUs (0 for CPU-only nodes)
GPU Type: GPU model (e.g., A100)
GPU Memory per GPU (GB): VRAM per GPU (e.g., 80)
VRAM per GPU in GB

Storage & Network:

Total Storage (GB): Total disk space
Total disk space in GB
Network Bandwidth (Mbps): Network speed
Network speed in Mbps
Network Latency (ms): Average latency
Average network latency

Configuration:

Max Concurrent Jobs: Maximum simultaneous jobs
Maximum number of jobs that can run simultaneously
Priority: Node priority
Node priority (1-10, higher is better)
Tags: Comma-separated tags for categorization

Location (Optional):

Datacenter: Datacenter location
Rack: Rack identifier
Zone: Availability zone

Editing a Node

To update node configuration:

Open the node details
Click the Edit button (or three-dot menu → Edit)
Modify the editable fields in the Edit Cluster Node modal
Click Update Cluster Node to save changes

[!NOTE] The Edit form is identical to the View form, but with editable fields and an "Update Cluster Node" button.

Editable Fields:

✅ Hostname
✅ Status (Online, Offline, Maintenance)
✅ Available CPU Cores
✅ Available Memory (GB)
✅ GPU configuration
✅ Storage & Network settings
✅ Max Concurrent Jobs
✅ Priority
✅ Tags
✅ Location information
❌ Node Name (cannot edit)
❌ IP Address (cannot edit)
❌ Node Type (cannot edit)
❌ Total CPU Cores (cannot edit)
❌ Total Memory (cannot edit)
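
The split between editable and locked fields can be mirrored in a script that prepares updates. This is a minimal sketch; the key names and the apply_update helper are hypothetical and only model the rules listed above.

```python
# Hypothetical keys for the fields marked "cannot edit" above.
IMMUTABLE_FIELDS = {
    "node_name", "ip_address", "node_type", "cpu_cores", "total_memory_gb",
}


def apply_update(node, changes):
    """Return a copy of node with changes applied, rejecting locked fields."""
    blocked = sorted(IMMUTABLE_FIELDS.intersection(changes))
    if blocked:
        raise ValueError("cannot edit: " + ", ".join(blocked))
    # The original dictionary is left untouched.
    return {**node, **changes}
```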

Node Management

Changing Node Status

Setting to Maintenance:

Open node details
Click Edit
Change Status to "Maintenance"
Save changes
Node will stop accepting new jobs

Bringing Node Online:

Open node details
Click Edit
Change Status to "Online"
Save changes
Node will start accepting jobs
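
To illustrate how status interacts with Max Concurrent Jobs and Priority, the sketch below picks a node for a new job. It is not the platform's scheduler, whose behavior is not described on this page. It assumes that only Online nodes with a free job slot are eligible and that the highest priority wins, and all key names are hypothetical.

```python
def pick_node(nodes):
    """Return the eligible node with the highest priority, or None."""
    candidates = [
        n for n in nodes
        # Assumption: nodes in Maintenance or Offline never receive new jobs.
        if n["status"] == "Online"
        and n["running_jobs"] < n["max_concurrent_jobs"]
    ]
    if not candidates:
        return None
    # Priority is 1-10, higher is better; ties keep list order.
    return max(candidates, key=lambda n: n["priority"])
```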

Monitoring Node Health

Health Indicators:

Healthy (Green): All systems normal
Warning (Orange): High resource usage or minor issues
Critical (Red): Node failure or severe issues

When to Check:

High CPU/Memory usage (>90%)
Jobs failing frequently
Network connectivity issues
Hardware errors
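
The three health levels can be approximated in your own monitoring scripts. The sketch below is an assumption-heavy illustration: it treats an unreachable node as Critical and usage above 90% as Warning. The exact rules the platform uses to assign health are not documented here, and the function and parameter names are hypothetical.

```python
def classify_health(reachable, cpu_usage, memory_usage, threshold=90.0):
    """Map basic node readings to Healthy, Warning, or Critical."""
    if not reachable:
        # Assumption: a node that cannot be reached counts as failed.
        return "Critical"
    if cpu_usage > threshold or memory_usage > threshold:
        # High resource usage maps to the Warning level.
        return "Warning"
    return "Healthy"
```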

Deleting a Node

To remove a node from the cluster:

Navigate to node details
Click Delete button
Confirm deletion

[!WARNING] You cannot delete a node with running jobs. Stop or migrate jobs first.

Before Deleting:

Ensure no jobs are running
Migrate important jobs to other nodes
Backup any local data
Update cluster capacity planning
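
The rule that a node with running jobs cannot be deleted can be expressed as a guard in automation scripts. This is a minimal sketch over an in-memory dictionary; the cluster structure, the running_jobs key, and the delete_node helper are hypothetical and do not call any platform API.

```python
def delete_node(cluster, name):
    """Remove a node from the cluster mapping if it has no running jobs."""
    node = cluster[name]
    if node["running_jobs"] > 0:
        raise RuntimeError(
            f"{name} has {node['running_jobs']} running job(s); "
            "stop or migrate them first"
        )
    del cluster[name]
```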

Best Practices

Resource Allocation:

Reserve some resources for system overhead
Don't allocate 100% of available resources
Monitor usage patterns and adjust
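
As a worked example of reserving headroom, the sketch below computes how much of a resource to expose after setting aside a share for system overhead. The 10% default is an arbitrary illustration, not a recommendation from the platform, and the function name is hypothetical.

```python
import math


def allocatable(total, reserve_fraction=0.1):
    """Return the whole units left after reserving a share for the system."""
    if not 0 <= reserve_fraction < 1:
        raise ValueError("reserve_fraction must be in [0, 1)")
    # Round the reserve up so the system never gets less than its share.
    reserve = math.ceil(total * reserve_fraction)
    return total - reserve
```

For a 16-core node with the default reserve, this sets aside 2 cores and leaves 14, which could be entered as Available CPU Cores.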

Node Naming:

Use descriptive names: gpu-node-01, cpu-highmem-02
Include node type in name
Use consistent naming convention

Maintenance:

Schedule regular maintenance windows
Update node status before maintenance
Monitor health indicators
Keep firmware and drivers updated

Tagging Strategy:

Use tags for organization: production, development
Tag by capability: high-memory, gpu, fast-storage
Tag by location: datacenter-a, rack-10

Next Steps

Submit Jobs to cluster nodes
Run Experiments on GPU nodes
Monitor resource usage in Analytics
Deploy models via Deployments
