# Cluster

Manage and monitor your compute cluster nodes for ML workloads.

![Cluster Overview](/files/fSUcedV7YSaF7Hl1cHwZ)

## Overview

The Cluster section provides comprehensive management of compute nodes that power your ML experiments, training jobs, and deployments. Monitor resource utilization, manage node configurations, and ensure optimal cluster health.

## Cluster Dashboard

The dashboard displays key cluster metrics at a glance:

**Summary Cards**:

* **Total Nodes**: Total number of nodes in the cluster
* **Online**: Number of nodes currently online
* **Busy**: Number of nodes actively processing workloads
* **CPU Usage (%)**: Average CPU usage across cluster
* **Memory Usage (%)**: Average memory usage across cluster
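
As a rough illustration of how these five values relate to per-node data, the sketch below derives them from a list of node snapshots. The `NodeSnapshot` type and its field names are hypothetical, and the choices to treat Busy as a status distinct from Online and to average over all nodes are assumptions; the platform may compute its cards differently.

```python
from dataclasses import dataclass


@dataclass
class NodeSnapshot:
    # Hypothetical record; field names are illustrative, not a platform schema.
    name: str
    status: str          # "Online", "Offline", "Busy", or "Maintenance"
    cpu_usage: float     # percent, 0-100
    memory_usage: float  # percent, 0-100


def summarize(nodes):
    """Derive the five summary-card values from a list of snapshots."""
    total = len(nodes)
    # Assumption: "Busy" is counted separately from "Online".
    online = sum(1 for n in nodes if n.status == "Online")
    busy = sum(1 for n in nodes if n.status == "Busy")
    # Assumption: averages are taken over every node, including offline ones.
    avg_cpu = sum(n.cpu_usage for n in nodes) / total if total else 0.0
    avg_mem = sum(n.memory_usage for n in nodes) / total if total else 0.0
    return {
        "total_nodes": total,
        "online": online,
        "busy": busy,
        "cpu_usage_pct": avg_cpu,
        "memory_usage_pct": avg_mem,
    }
```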

## Node List View

The cluster table shows all nodes with the following information:

**Columns**:

* **Node**: Node name and details (hostname, IP address)
* **Type**: Node type (GPU, CPU)
* **Status**: Current status (Online, Offline, Busy, Maintenance)
* **Resources**: Available resources (CPU cores, RAM, GPU)
* **Usage**: Real-time CPU and Memory usage with progress bars
* **Jobs**: Running jobs count
* **Uptime**: Node uptime duration
* **Health**: Health status (Healthy, Warning, Critical)
* **Actions**: Quick actions menu

**Filtering and Search**:

* Search by node name or IP
* Filter by Type (GPU, CPU, All)
* Filter by Status (Online, Offline, Busy, Maintenance)
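
The search and filter options combine as a logical AND. A minimal sketch of that behavior follows, assuming nodes are plain dictionaries with hypothetical `name`, `ip`, `type`, and `status` keys and that search is a case-insensitive substring match; the UI's actual matching rules are not documented here.

```python
def filter_nodes(nodes, query="", node_type="All", status=None):
    """Return nodes matching the search text and both filters."""
    q = query.strip().lower()
    matched = []
    for node in nodes:
        # Search applies to the node name or the IP address.
        if q and q not in node["name"].lower() and q not in node["ip"]:
            continue
        # "All" disables the type filter.
        if node_type != "All" and node["type"] != node_type:
            continue
        # None disables the status filter.
        if status is not None and node["status"] != status:
            continue
        matched.append(node)
    return matched
```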

## Creating a Cluster Node

Navigate to **Deep Learning Platform** → **Cluster**, then click **Create**.

![Create Cluster Node](/files/kCQTaK2iRc7Cdn8dKb4O)

### Basic Information

**Node Name**\* (Required)

* Unique identifier for the cluster node
* Example: `gpu-node-01`, `cpu-node-high-mem`

**Hostname**\* (Required)

* Network hostname
* Example: `gpu01.cluster.local`

**IP Address**\* (Required)

* IPv4 address of the node
* Example: `192.168.1.101`

**Node Type**\* (Required)

* Select from dropdown: GPU, CPU
* Default: `GPU`

**Status**\* (Required)

* Select from dropdown: Online, Offline
* Default: `Online`

### CPU Resources

**CPU Cores**\* (Required)

* Total number of CPU cores
* Example: `16`

**Total number of CPU Cores** (Helper text)

**Available CPU Cores**\* (Required)

* Number of available CPU cores
* Example: `8`

**Number of available CPU cores** (Helper text)

### Memory Resources

**Total Memory (GB)**\* (Required)

* Total RAM in GB
* Example: `64`

**Total RAM in GB** (Helper text)

**Available Memory (GB)**\* (Required)

* Available RAM in GB
* Example: `32`

**Available memory (GB)** (Helper text)

### GPU Resources (Optional)

**GPU Count**

* Number of GPUs (0 for CPU-only nodes)
* Example: `0`, `4`, `8`

**Number of GPUs (0 for CPU-only nodes)** (Helper text)

**GPU Type**

* Select GPU model from dropdown
* Options: NVIDIA A100, NVIDIA V100, NVIDIA T4, etc.

**GPU Memory per GPU (GB)**

* Memory per GPU in GB
* Example: `80` (for A100)

**VRAM per GPU in GB** (Helper text)

### Storage & Network

**Total Storage (GB)**\* (Required)

* Total disk storage in GB
* Example: `1000`

**Total disk space in GB** (Helper text)

**Network Bandwidth (Mbps)**\* (Required)

* Network bandwidth in Mbps
* Example: `10000` (10 Gbps)

**Network speed in Mbps** (Helper text)

**Network Latency (ms)**\* (Required)

* Average network latency in milliseconds
* Example: `1`

**Average network latency** (Helper text)

### Configuration

**Max Concurrent Jobs**\* (Required)

* Maximum number of jobs that can run simultaneously
* Example: `4`

**Maximum number of jobs that can run simultaneously** (Helper text)

**Priority**

* Node priority (1-10, higher is better)
* Example: `5`

**Node priority (1-10, higher is better)** (Helper text)

**Tags**

* Comma-separated tags for categorization
* Example: `production,high-memory,gpu`
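
If you generate tag strings from scripts, a small helper like the one below can keep them tidy. It is only a sketch: trimming whitespace, lowercasing, and dropping duplicates are assumptions about what makes a clean tag list, not documented platform behavior.

```python
def parse_tags(raw):
    """Split a comma-separated tag string into a clean, ordered list."""
    tags = []
    for part in raw.split(","):
        tag = part.strip().lower()  # assumption: tags are case-insensitive
        if tag and tag not in tags:  # skip empty entries and duplicates
            tags.append(tag)
    return tags


def format_tags(tags):
    """Join tags back into the comma-separated form the field expects."""
    return ",".join(tags)
```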

### Location (Optional)

**Datacenter**

* Datacenter location
* Example: `Datacenter`

**Rack**

* Rack identifier
* Example: `R-10`

**Zone**

* Availability zone
* Example: `Zone-A`

### Actions

* **Cancel**: Discard and close
* **Create Cluster Node**: Submit and create the node
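
Before filling in the form for many nodes, it can help to sanity-check the values offline. The sketch below checks the required fields and a few constraints implied by the field descriptions (a valid IPv4 address, available resources not exceeding totals, priority between 1 and 10). The snake_case keys are hypothetical names chosen for this example, and the form's real validation rules may differ.

```python
import ipaddress

# Hypothetical keys mirroring the fields marked as required in the form.
REQUIRED_FIELDS = (
    "node_name", "hostname", "ip_address", "node_type", "status",
    "cpu_cores", "available_cpu_cores",
    "total_memory_gb", "available_memory_gb",
    "total_storage_gb", "network_bandwidth_mbps", "network_latency_ms",
    "max_concurrent_jobs",
)


def validate_node_spec(spec):
    """Return a list of problems; an empty list means the spec looks sane."""
    errors = [f"{f} is required" for f in REQUIRED_FIELDS
              if spec.get(f) in (None, "")]
    if errors:
        return errors  # skip further checks until required fields exist
    try:
        ipaddress.IPv4Address(spec["ip_address"])
    except ValueError:
        errors.append("ip_address must be a valid IPv4 address")
    if spec["node_type"] not in ("GPU", "CPU"):
        errors.append("node_type must be GPU or CPU")
    if spec["available_cpu_cores"] > spec["cpu_cores"]:
        errors.append("available_cpu_cores exceeds cpu_cores")
    if spec["available_memory_gb"] > spec["total_memory_gb"]:
        errors.append("available_memory_gb exceeds total_memory_gb")
    priority = spec.get("priority")  # optional field
    if priority is not None and not 1 <= priority <= 10:
        errors.append("priority must be between 1 and 10")
    if spec.get("gpu_count", 0) < 0:
        errors.append("gpu_count cannot be negative")
    return errors
```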

## Viewing Node Details

To view detailed information about a cluster node:

1. Navigate to **Deep Learning Platform** → **Cluster**
2. Click on a node from the list
3. View comprehensive details in the modal dialog

![View Cluster Node](/files/crG6gWpOJHY0gRjm2MTk)

**Details Panel Sections**:

**Basic Information**:

* Node Name: e.g., "gpu-node-01"
* Hostname: e.g., "gpu01.cluster.local"
* IP Address: e.g., "192.168.1.101"
* Node Type: GPU or CPU
* Status: Online, Offline, Busy, Maintenance

**CPU Resources**:

* CPU Cores: Total cores (e.g., 32)
* Total number of CPU cores
* Available CPU Cores: Available cores (e.g., 11)
* Number of available CPU cores

**Memory Resources**:

* Total Memory (GB): Total RAM (e.g., 128)
* Total RAM in GB
* Available Memory (GB): Available RAM (e.g., 32)
* Available memory (GB)

**GPU Resources (Optional)**:

* GPU Count: Number of GPUs (e.g., 4)
* Number of GPUs (0 for CPU-only nodes)
* GPU Type: GPU model (e.g., A100)
* GPU Memory per GPU (GB): VRAM per GPU (e.g., 80)
* VRAM per GPU in GB

**Storage & Network**:

* Total Storage (GB): Total disk space
* Total disk space in GB
* Network Bandwidth (Mbps): Network speed
* Network speed in Mbps
* Network Latency (ms): Average latency
* Average network latency

**Configuration**:

* Max Concurrent Jobs: Maximum simultaneous jobs
* Maximum number of jobs that can run simultaneously
* Priority: Node priority
* Node priority (1-10, higher is better)
* Tags: Comma-separated tags for categorization

**Location (Optional)**:

* Datacenter: Datacenter location
* Rack: Rack identifier
* Zone: Availability zone

## Editing a Node

To update node configuration:

1. Open node details page
2. Click **Edit** button (or three-dot menu → Edit)
3. Modify editable fields in the Edit Cluster Node modal

![Edit Cluster Node](/files/SV5WQSb3LfqziHJaHF7V)

4. Click **Update Cluster Node** to save changes

> \[!NOTE] The Edit form is identical to the View form, but with editable fields and an "Update Cluster Node" button.

**Editable Fields**:

* ✅ Hostname
* ✅ Status (Online, Offline, Maintenance)
* ✅ Available CPU Cores
* ✅ Available Memory (GB)
* ✅ GPU configuration
* ✅ Storage & Network settings
* ✅ Max Concurrent Jobs
* ✅ Priority
* ✅ Tags
* ✅ Location information
* ❌ Node Name (cannot edit)
* ❌ IP Address (cannot edit)
* ❌ Node Type (cannot edit)
* ❌ Total CPU Cores (cannot edit)
* ❌ Total Memory (cannot edit)
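
If you script updates outside the UI, the same editable versus read-only split can be enforced up front. The following is a minimal sketch under the assumption that a node is a dictionary with hypothetical snake_case keys; it is not the platform's API.

```python
# Hypothetical keys for the fields marked as not editable above.
IMMUTABLE_FIELDS = frozenset({
    "node_name", "ip_address", "node_type", "cpu_cores", "total_memory_gb",
})


def apply_update(node, changes):
    """Return an updated copy of node, rejecting edits to read-only fields."""
    # Assumption: resubmitting an unchanged read-only value is harmless.
    blocked = sorted(
        key for key, value in changes.items()
        if key in IMMUTABLE_FIELDS and value != node.get(key)
    )
    if blocked:
        raise ValueError("cannot edit: " + ", ".join(blocked))
    updated = dict(node)  # leave the original record untouched
    updated.update(changes)
    return updated
```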

## Node Management

### Changing Node Status

**Setting to Maintenance**:

1. Open node details
2. Click **Edit**
3. Change Status to "Maintenance"
4. Save changes
5. The node stops accepting new jobs

**Bringing Node Online**:

1. Open node details
2. Click **Edit**
3. Change Status to "Online"
4. Save changes
5. The node starts accepting jobs
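
To make the effect of status on job placement concrete, here is a small sketch of how a scheduler might choose a node: only Online nodes below their Max Concurrent Jobs limit are eligible, and the highest Priority wins. This page does not describe the platform's actual scheduler, so the selection rule, the dictionary keys, and the exclusion of Busy nodes are all assumptions.

```python
def pick_node(nodes):
    """Pick the highest-priority node that can accept a new job, or None."""
    eligible = [
        n for n in nodes
        # Assumption: only "Online" nodes accept new jobs.
        if n["status"] == "Online"
        and n["running_jobs"] < n["max_concurrent_jobs"]
    ]
    if not eligible:
        return None
    # Priority runs from 1 to 10 and higher is better.
    return max(eligible, key=lambda n: n["priority"])
```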

### Monitoring Node Health

**Health Indicators**:

* **Healthy** (Green): All systems normal
* **Warning** (Orange): High resource usage or minor issues
* **Critical** (Red): Node failure or severe issues

**When to Check**:

* High CPU/Memory usage (>90%)
* Jobs failing frequently
* Network connectivity issues
* Hardware errors
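
For external alerting, the indicators can be approximated from usage figures. The sketch below maps usage above 90% to Warning and an unreachable node to Critical. This mapping is an assumption made for illustration; the platform's own health rules are not specified on this page.

```python
def classify_health(cpu_usage, memory_usage, reachable=True,
                    high_usage_threshold=90.0):
    """Approximate the Healthy / Warning / Critical indicator."""
    if not reachable:
        # Assumption: an unreachable node counts as a node failure.
        return "Critical"
    if max(cpu_usage, memory_usage) > high_usage_threshold:
        return "Warning"
    return "Healthy"
```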

### Deleting a Node

To remove a node from the cluster:

1. Navigate to node details
2. Click **Delete** button
3. Confirm deletion

> \[!WARNING] You cannot delete a node with running jobs. Stop or migrate jobs first.

**Before Deleting**:

* Ensure no jobs are running
* Migrate important jobs to other nodes
* Backup any local data
* Update cluster capacity planning
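
The running-jobs rule is easy to check ahead of time in automation. A minimal sketch, assuming a node dictionary with a hypothetical `running_jobs` count:

```python
def can_delete(node):
    """Return (allowed, reason) for a deletion request."""
    running = node.get("running_jobs", 0)
    if running > 0:
        # Mirrors the warning above: nodes with running jobs cannot be deleted.
        return False, f"{running} job(s) still running; stop or migrate them first"
    return True, "ok"
```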

## Best Practices

**Resource Allocation**:

* Reserve some resources for system overhead
* Don't allocate 100% of available resources
* Monitor usage patterns and adjust
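
One way to apply this is to compute the Available values from the totals with a fixed reserve. The 10% default below is an arbitrary assumption for illustration, not a platform recommendation; integer arithmetic is used so the result rounds down.

```python
def allocatable(total, reserve_percent=10):
    """Return the amount left after reserving a share for system overhead."""
    if not 0 <= reserve_percent < 100:
        raise ValueError("reserve_percent must be in the range 0-99")
    # Floor division rounds down so the reserve is never under-provisioned.
    return total * (100 - reserve_percent) // 100


# Example: a node with 16 cores and 64 GB RAM.
available_cores = allocatable(16)      # 14
available_memory_gb = allocatable(64)  # 57
```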

**Node Naming**:

* Use descriptive names: `gpu-node-01`, `cpu-highmem-02`
* Include node type in name
* Use consistent naming convention

**Maintenance**:

* Schedule regular maintenance windows
* Update node status before maintenance
* Monitor health indicators
* Keep firmware and drivers updated

**Tagging Strategy**:

* Use tags for organization: `production`, `development`
* Tag by capability: `high-memory`, `gpu`, `fast-storage`
* Tag by location: `datacenter-a`, `rack-10`

## Next Steps

* Submit [Jobs](/kaisar-network/kaisar-ai-ops/deep-learning-platform/jobs.md) to cluster nodes
* Run [Experiments](/kaisar-network/kaisar-ai-ops/deep-learning-platform/experiments.md) on GPU nodes
* Monitor resource usage in [Analytics](/kaisar-network/kaisar-ai-ops/deep-learning-platform/analytics.md)
* Deploy models via [Deployments](/kaisar-network/kaisar-ai-ops/deep-learning-platform/deployments.md)

