# Jobs

Submit and manage training jobs on cluster resources.

![Jobs Overview](/files/hDnBvlQ76FEF0S8Ycpsx)

## Overview

The Jobs section lets you submit, monitor, and manage ML training jobs that run on cluster nodes. You can track job status, progress, and resource utilization in real time.

## Jobs Dashboard

The dashboard displays key job metrics at a glance:

**Summary Cards**:

* **Total Jobs**: Total number of jobs in the system
* **Running**: Number of jobs currently running
* **Completed**: Number of jobs completed successfully
* **Failed**: Number of jobs that failed

## Job List View

The jobs table shows all submitted jobs with the following information:

**Columns**:

* **Job Name**: Job name and type (Training, Inference, etc.)
* **Status**: Current status with color-coded badges
* **Priority**: Job priority (High, Medium, Low)
* **Experiment**: Associated experiment name
* **Progress**: Progress bar with percentage
* **Created**: Job creation date
* **Actions**: Quick actions menu

**Filtering and Search**:

* Search by job name
* Filter by status, priority, or experiment
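The search and filter behavior described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the platform's implementation: the record fields (`name`, `status`, `priority`, `experiment`) and the case-insensitive substring match are assumptions.

```python
def filter_jobs(jobs, search="", status=None, priority=None, experiment=None):
    """Return jobs whose name contains `search` and that match every given filter."""
    needle = search.lower()  # assumption: search is case-insensitive
    result = []
    for job in jobs:
        if needle and needle not in job["name"].lower():
            continue
        if status is not None and job["status"] != status:
            continue
        if priority is not None and job["priority"] != priority:
            continue
        if experiment is not None and job["experiment"] != experiment:
            continue
        result.append(job)
    return result


# Hypothetical job records used only for this example.
jobs = [
    {"name": "bert-sentiment-imdb-v1", "status": "Running", "priority": "High", "experiment": "sentiment"},
    {"name": "bert-ner-conll-v1", "status": "Failed", "priority": "Medium", "experiment": "ner"},
    {"name": "resnet-cifar10-v2", "status": "Running", "priority": "Low", "experiment": "vision"},
]

matches = filter_jobs(jobs, search="bert", status="Running")
print([job["name"] for job in matches])
```

Filters left as `None` are ignored, so calling `filter_jobs(jobs)` with no arguments returns the full list.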

## Creating a Job

Navigate to **Deep Learning Platform** → **Jobs**, then click **Create**.

![Create Job](/files/HQXeH8NXqlcg4Ww2afDy)

### Basic Information

**Job Name**\* (Required)

* Enter a descriptive name for the job
* Example: `BERT Fine-tuning - Sentiment Analysis`
* Helper text: "Enter a descriptive name for the job"

**Description**

* Detailed description of the job
* Example: "Fine-tuning BERT-base model for sentiment analysis on IMDB movie reviews"

**Job Type**\* (Required)

* Select from dropdown: Training, Inference, Hyperparameter Tuning, etc.
* Default: `Training`
* Helper text: "training"

**Priority**\* (Required)

* Select priority level: High, Medium, Low
* Default: `High`
* Helper text: "high"

### Configuration

**Epochs**

* Number of training epochs
* Example: `3`
* Helper text: "Number of training epochs"

**Batch Size**

* Training batch size
* Example: `16`
* Helper text: "Training batch size"

**Learning Rate**

* Initial learning rate
* Example: `0.00002`
* Helper text: "Initial learning rate"

**Optimizer**

* Select optimizer from dropdown
* Options: AdamW, Adam, SGD, etc.
* Default: `AdamW`
* Helper text: "adamw"

### Resources

**CPU Cores**

* Number of CPU cores required
* Example: `4`

**Memory (GB)**

* Memory allocation in GB
* Example: `16`

**GPU Count**

* Number of GPUs required
* Example: `0` (for CPU-only jobs)

### Actions

* **Cancel**: Discard and close
* **Create Job**: Submit the job to the queue
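If you keep job settings alongside your training code, the form fields in this section can be mirrored as a single record. The sketch below is a hypothetical helper, not a platform API: the `JobSpec` class, its field names, and the range checks in `validate` are assumptions. Only the defaults for job type, priority, and optimizer come from the form described above.

```python
from dataclasses import dataclass, asdict

PRIORITIES = {"High", "Medium", "Low"}


@dataclass
class JobSpec:
    # Basic information and numeric settings without a documented default
    name: str
    epochs: int
    batch_size: int
    learning_rate: float
    cpu_cores: int
    memory_gb: int
    gpu_count: int
    description: str = ""
    # Defaults taken from the Create Job form
    job_type: str = "Training"
    priority: str = "High"
    optimizer: str = "AdamW"

    def validate(self):
        """Return a list of problems; an empty list means the spec looks sane."""
        problems = []
        if not self.name.strip():
            problems.append("name is required")
        if self.priority not in PRIORITIES:
            problems.append(f"unknown priority: {self.priority}")
        # The bounds below are illustrative assumptions, not platform limits.
        if self.epochs < 1 or self.batch_size < 1:
            problems.append("epochs and batch size must be at least 1")
        if self.learning_rate <= 0:
            problems.append("learning rate must be positive")
        if self.cpu_cores < 1 or self.memory_gb < 1 or self.gpu_count < 0:
            problems.append("invalid resource request")
        return problems


spec = JobSpec(
    name="BERT Fine-tuning - Sentiment Analysis",
    description="Fine-tuning BERT-base model for sentiment analysis on IMDB movie reviews",
    epochs=3,
    batch_size=16,
    learning_rate=0.00002,
    cpu_cores=4,
    memory_gb=16,
    gpu_count=0,  # CPU-only job
)
problems = spec.validate()
print(problems or asdict(spec))
```

Checking values locally before filling in the form makes it easier to catch typos such as a learning rate of `0` or a missing name.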

## Viewing Job Details

To view detailed information about a job:

1. Navigate to **Deep Learning Platform** → **Jobs**
2. Click on a job from the list
3. View comprehensive details in the modal dialog

![View Job Details](/files/LiIKhbu66TgHTRyE6N2l)

**Details Panel Sections**:

**Basic Information**:

* **Job Name**: e.g., "BERT Fine-tuning - Sentiment Analysis"
* **Description**: Full description of the job
* **Job Type**: Training, Inference, etc.
* **Priority**: High, Medium, Low

**Configuration**:

* **Epochs**: Number of training epochs (e.g., 3)
* **Batch Size**: Training batch size (e.g., 16)
* **Learning Rate**: Initial learning rate (e.g., 0.00002)
* **Optimizer**: Optimizer used (e.g., AdamW)

**Resources**:

* **CPU Cores**: Allocated CPU cores (e.g., 4)
* **Memory (GB)**: Allocated memory (e.g., 16)
* **GPU Count**: Number of GPUs (e.g., 0)

## Editing a Job

To update job configuration:

1. Open job details page
2. Click **Edit** button (or three-dot menu → Edit)
3. Modify editable fields in the Edit Job modal
4. Click **Update Job** to save changes

![Edit Job Form](/files/q36QKAvrmpBsfESQCff1)

> \[!NOTE] You can only edit jobs that are in Pending or Failed status. Running or Completed jobs cannot be edited.

**Editable Fields**:

* ✅ Job Name
* ✅ Description
* ✅ Priority (can change to expedite or delay)
* ✅ Configuration (epochs, batch size, learning rate, optimizer)
* ✅ Resources (CPU, memory, GPU)
* ❌ Job Type (cannot edit)
* ❌ Status (managed by system)
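The editing rules above (status restriction plus the editable field list) can be expressed as one small check. This is a sketch for illustration only; the function name `check_edit` and the field identifiers are hypothetical, not part of the platform.

```python
EDITABLE_STATUSES = {"Pending", "Failed"}
EDITABLE_FIELDS = {
    "name", "description", "priority",
    "epochs", "batch_size", "learning_rate", "optimizer",
    "cpu_cores", "memory_gb", "gpu_count",
}


def check_edit(status, changes):
    """Return the reasons an edit would be rejected (empty list if allowed)."""
    if status not in EDITABLE_STATUSES:
        return [f"jobs in {status} status cannot be edited"]
    return [
        f"field '{field}' is not editable"
        for field in sorted(changes)
        if field not in EDITABLE_FIELDS
    ]


print(check_edit("Pending", {"priority": "Low"}))
print(check_edit("Running", {"priority": "Low"}))
print(check_edit("Failed", {"job_type": "Inference"}))
```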

## Job Status

**Status Types**:

**Pending** (Orange):

* Job is queued, waiting for resources
* Will start when cluster resources become available

**Running** (Blue):

* Job is actively executing
* Resources are allocated
* Progress is being tracked

**Completed** (Green):

* Job finished successfully
* Results are available
* Resources have been released

**Failed** (Red):

* Job encountered an error
* Check logs for error details
* Can be restarted or debugged
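One way to reason about these statuses is as a small state machine. The sketch below encodes only the transitions this page describes (a pending job starts, a running job completes or fails, a failed job is resubmitted on restart). The status that a manually stopped job ends up in is not stated here, so it is left out. All names are illustrative assumptions.

```python
from enum import Enum


class JobStatus(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"


# Transitions taken from the descriptions on this page.
ALLOWED = {
    JobStatus.PENDING: {JobStatus.RUNNING},
    JobStatus.RUNNING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.FAILED: {JobStatus.PENDING},  # restart resubmits the job to the queue
    JobStatus.COMPLETED: set(),
}


def can_transition(current, target):
    """Return True if the lifecycle sketched above allows current -> target."""
    return target in ALLOWED[current]


print(can_transition(JobStatus.FAILED, JobStatus.PENDING))
print(can_transition(JobStatus.COMPLETED, JobStatus.RUNNING))
```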

## Managing Jobs

### Stopping a Running Job

To stop a job that's currently running:

1. Open job details or click actions menu
2. Click **Stop** button
3. Confirm action
4. Job will be terminated and resources released

> \[!WARNING] Stopping a job discards all of its progress. Consider checkpointing your training jobs.

### Restarting a Failed Job

To restart a job that failed:

1. Open failed job details
2. Click **Restart** button
3. Job will be resubmitted to the queue
4. Monitor for success

### Deleting a Job

To remove a job:

1. Navigate to job details or list
2. Click **Delete** button
3. Confirm deletion

> \[!WARNING] Deleting a job will permanently remove:
>
> * Job configuration
> * Training logs
> * Checkpoints and outputs
>
> This action cannot be undone!

**Before Deleting**:

* Download important logs
* Save model checkpoints
* Export results if needed

## Job Monitoring

### Real-time Progress

Monitor job progress in real-time:

* Progress bar shows completion percentage
* View live logs in job details
* Track resource utilization
* Monitor metrics and loss curves

### Job Logs

Access job logs:

1. Open job details
2. Navigate to **Logs** tab
3. View stdout/stderr output
4. Filter by log level
5. Download logs for offline analysis
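Once logs are downloaded, the same level filtering can be repeated offline. The sketch below assumes each line has the form `<date> <time> <LEVEL> <message>`; the actual log format is not documented on this page, so adjust the parsing to match your files.

```python
LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}


def filter_log_lines(lines, min_level="WARNING"):
    """Keep lines whose level is at or above min_level.

    Assumes the third whitespace-separated token is the level name.
    Lines that do not match this layout are dropped.
    """
    threshold = LEVELS[min_level]
    kept = []
    for line in lines:
        parts = line.split(maxsplit=3)
        if len(parts) >= 3 and LEVELS.get(parts[2], -1) >= threshold:
            kept.append(line)
    return kept


# Hypothetical log excerpt used only for this example.
sample = [
    "2024-05-01 10:00:00 INFO Epoch 1/3 started",
    "2024-05-01 10:04:12 WARNING Gradient norm is unusually large",
    "2024-05-01 10:05:30 ERROR Out of memory while allocating batch",
    "2024-05-01 10:05:31 DEBUG Releasing worker resources",
]

important = filter_log_lines(sample, min_level="WARNING")
for line in important:
    print(line)
```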

### Resource Usage

Monitor resource consumption:

* CPU utilization
* Memory usage
* GPU utilization (if applicable)
* Network I/O
* Disk I/O

## Job Scheduling

**Priority-based Scheduling**:

* **High** priority jobs run first
* **Medium** priority jobs run when high-priority queue is empty
* **Low** priority jobs run when resources are available

**Resource Allocation**:

* Jobs are matched to suitable cluster nodes
* GPU jobs require GPU-enabled nodes
* CPU/Memory requirements must be met
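The two ideas above, priority ordering and resource matching, can be illustrated with a priority queue and a first-fit node search. This is a simplified model under stated assumptions, not the platform's scheduler: ties are broken by submission order, free resources are not deducted after a placement, and every name in the code is hypothetical.

```python
import heapq
from itertools import count

PRIORITY_RANK = {"High": 0, "Medium": 1, "Low": 2}


class JobQueue:
    """Priority queue that is first-in, first-out within a priority level."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker that preserves submission order

    def submit(self, name, priority, cpu, memory_gb, gpus=0):
        job = {"name": name, "cpu": cpu, "memory_gb": memory_gb, "gpus": gpus}
        heapq.heappush(self._heap, (PRIORITY_RANK[priority], next(self._seq), job))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def find_node(job, nodes):
    """Return the first node with enough free CPU, memory, and GPUs, else None."""
    for node in nodes:
        if (node["free_cpu"] >= job["cpu"]
                and node["free_memory_gb"] >= job["memory_gb"]
                and node["free_gpus"] >= job["gpus"]):
            return node
    return None


queue = JobQueue()
queue.submit("nightly-eval", "Low", cpu=2, memory_gb=8)
queue.submit("prod-retrain", "High", cpu=4, memory_gb=16, gpus=1)
queue.submit("ablation-a", "Medium", cpu=4, memory_gb=16)

nodes = [
    {"name": "cpu-node-1", "free_cpu": 8, "free_memory_gb": 32, "free_gpus": 0},
    {"name": "gpu-node-1", "free_cpu": 16, "free_memory_gb": 64, "free_gpus": 2},
]

placements = []
while len(queue):
    job = queue.pop()
    node = find_node(job, nodes)
    placements.append((job["name"], node["name"] if node else None))
print(placements)
```

The GPU job skips `cpu-node-1` because that node has no free GPUs, which mirrors the rule that GPU jobs require GPU-enabled nodes.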

**Queue Management**:

* View pending jobs in queue
* Estimate wait time based on current load
* Adjust priority if needed

## Best Practices

**Job Naming**:

* Use descriptive names: `bert-sentiment-imdb-v1`
* Include model, task, and version
* Keep names concise but informative
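A small helper can keep names consistent across a team. The function below is a hypothetical convenience, not a platform requirement; the optional `dataset` part is an assumption based on the `imdb` segment of the example name.

```python
import re


def job_name(model, task, version, dataset=None):
    """Build a lowercase, hyphen-separated name such as 'bert-sentiment-imdb-v1'."""
    parts = [model, task] + ([dataset] if dataset else []) + [f"v{version}"]
    # Replace anything that is not a letter or digit with a hyphen.
    slugs = [re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-") for part in parts]
    return "-".join(slug for slug in slugs if slug)


print(job_name("BERT", "sentiment", 1, dataset="IMDB"))
print(job_name("ResNet 50", "image classification", 2))
```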

**Resource Requests**:

* Request only what you need
* Don't over-allocate resources
* Monitor actual usage and adjust

**Checkpointing**:

* Save checkpoints regularly
* Enable auto-save in training code
* Store checkpoints in persistent storage
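As a framework-agnostic illustration of these points, the sketch below saves a checkpoint after each epoch and resumes from the last one on restart. It uses only the standard library and stores a JSON file; a real job would save model and optimizer state with its framework's own serializer and write to persistent storage. The function names and checkpoint layout are assumptions.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: write a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # a crash mid-write never leaves a partial file


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return json.load(handle)


def train(path, total_epochs, save_every=1):
    """Run (or resume) a training loop, checkpointing every `save_every` epochs."""
    state = load_checkpoint(path) or {"epoch": 0}
    for epoch in range(state["epoch"], total_epochs):
        # ... one epoch of real training would run here ...
        state = {"epoch": epoch + 1}
        if (epoch + 1) % save_every == 0:
            save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "checkpoint.json")
    train(ckpt, total_epochs=2)                # simulates a run that ended after 2 epochs
    resumed_from = load_checkpoint(ckpt)["epoch"]
    final = train(ckpt, total_epochs=3)        # picks up at epoch 2 and runs one more
print(resumed_from, final["epoch"])
```

Writing to a temporary file and renaming it means a job that is stopped or fails mid-save still leaves the previous checkpoint intact.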

**Error Handling**:

* Implement retry logic
* Log errors comprehensively
* Set up failure notifications
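Retry logic inside training code might look like the following sketch, which retries a failing step with exponential backoff and logs each failure. The helper name `retry`, its parameters, and the simulated `flaky_step` are illustrative assumptions; tune the attempt count and delays to your workload.

```python
import time


def retry(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on failure wait base_delay * 2**n seconds and try again.

    The last failure is re-raised so the job still fails visibly.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as error:
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"attempt {attempt + 1} failed ({error}); retrying in {delay}s")
            sleep(delay)


# Simulated step that fails twice before succeeding.
calls = {"count": 0}


def flaky_step():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("storage temporarily unavailable")
    return "ok"


# The sleep function is replaced here so the example finishes instantly.
result = retry(flaky_step, attempts=3, base_delay=0.5, sleep=lambda seconds: None)
print(result, calls["count"])
```

Retrying is best reserved for transient errors such as network or storage hiccups; errors caused by bad configuration will fail the same way on every attempt.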

**Priority Usage**:

* Use High priority sparingly
* Reserve for urgent production jobs
* Most jobs should be Medium priority

## Next Steps

* Run jobs on [Cluster](/kaisar-network/kaisar-ai-ops/deep-learning-platform/cluster.md) nodes
* Link jobs to [Experiments](/kaisar-network/kaisar-ai-ops/deep-learning-platform/experiments.md)
* Monitor performance in [Analytics](/kaisar-network/kaisar-ai-ops/deep-learning-platform/analytics.md)
* Deploy successful models via [Deployments](/kaisar-network/kaisar-ai-ops/deep-learning-platform/deployments.md)


