Last Update: April 23, 2025

Monitoring a Fine-Tuning Job

Monitoring the fine-tuning process is essential to confirm that the model is learning effectively and to make adjustments as needed. With Hyperstack Gen AI Platform, users can track the progress of a fine-tuning job and gain insight into the training process through either the API or the UI.

Monitor Through API

You can monitor a fine-tuning job by making a GET request to the training info endpoint:

MODEL_NAME="finetuned-mistral-7b"
curl -X GET "https://api.genai.hyperstack.cloud/tailor/v1/named-training-info-log/$MODEL_NAME" \
  -H "X-API-Key: $API_KEY" \
  -H "Content-Type: application/json"
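For scripted monitoring, the same request can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the endpoint and `X-API-Key` header behave exactly as in the curl example above. The helper names `build_request` and `fetch_training_info` are hypothetical and not part of any Hyperstack SDK.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the curl example; the model name is appended as a path segment.
BASE_URL = "https://api.genai.hyperstack.cloud/tailor/v1/named-training-info-log"


def build_request(model_name: str, api_key: str) -> urllib.request.Request:
    """Build the GET request for a model's training info log (hypothetical helper)."""
    # Percent-encode the model name so unusual characters stay inside one path segment.
    url = f"{BASE_URL}/{urllib.parse.quote(model_name, safe='')}"
    return urllib.request.Request(url, headers={"X-API-Key": api_key}, method="GET")


def fetch_training_info(model_name: str, api_key: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body. This performs a network call."""
    request = build_request(model_name, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_training_info("finetuned-mistral-7b", api_key)` with a valid key should return the same JSON structure shown in the responses that follow, parsed into a dictionary.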

If the fine-tuning job succeeded, the response will look like this:

{
  "metrics": {
    "end_train_message": ["Training job ended"],
    "end_train_status": ["dormant"],
    "eval_loss": [2.644, 0.474],
    "loss": [1.354, 1.985, 1.938, 0.114, 0.669, 0.434],
    "perplexity": []
  },
  "status": "success"
}

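If the fine-tuning job failed, the response will look like this. Note that the top-level status is still "success" in this example; the failure is reported in end_train_status: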

{
  "metrics": {
    "end_train_message": ["message"],
    "end_train_status": ["failed_training"],
    "eval_loss": [],
    "eval_perplexity": [],
    "loss": [],
    "perplexity": []
  },
  "status": "success"
}
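Because both sample responses carry a top-level status of "success", a script has to look at `end_train_status` to tell the outcomes apart. The sketch below does this for a parsed response. It relies only on the two values that appear in the examples ("dormant" and "failed_training"); treating an empty or missing list as "no outcome reported yet" is an assumption, and the function names are hypothetical.

```python
from typing import Optional


def job_outcome(response: dict) -> Optional[str]:
    """Return the last reported end_train_status, or None if none is reported."""
    statuses = response.get("metrics", {}).get("end_train_status", [])
    # Assumption: an empty list means the job has not reported an outcome yet.
    return statuses[-1] if statuses else None


def job_failed(response: dict) -> bool:
    """True when the job reports the failure status seen in the sample response."""
    return job_outcome(response) == "failed_training"


if __name__ == "__main__":
    failed_example = {
        "metrics": {"end_train_message": ["message"], "end_train_status": ["failed_training"]},
        "status": "success",
    }
    print(job_outcome(failed_example), job_failed(failed_example))
```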

Monitor Through the UI

1. Access the Training Metrics Page

Navigate to the Training Metrics page to view the status of training jobs.

2. Check Job Status

Training Jobs: View the list of jobs that are currently training.
Failed Jobs: Identify and review jobs where the training has failed.

3. Review Metrics

The primary metric collected during training is loss, which is central to evaluating a model's performance. Two values are reported:

Training Loss: Indicates how well the model is learning from the training data.
Validation Loss: Measures the model's performance on a separate validation dataset.

4. Analyze Visualizations

The Training Metrics page offers two main types of visualizations to help users understand their model's performance:

Performance Comparison Chart: This bar chart compares the training and validation loss before and after fine-tuning, showing the reduction in loss as a result of the training.
Model Performance Over Steps Chart: This line chart displays the loss over the steps of the training process, allowing users to see how the loss decreases as the training progresses.
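The before/after comparison can be approximated from the `loss` and `eval_loss` arrays returned by the API. The sketch below takes the first and last logged values as the "before" and "after" points; that choice is an assumption, since the platform may compute its chart differently (for example, by evaluating the base model separately). The function name `loss_reduction` is hypothetical.

```python
from typing import Optional, Sequence


def loss_reduction(values: Sequence[float]) -> Optional[float]:
    """First logged loss minus last logged loss; None with fewer than two points."""
    if len(values) < 2:
        return None
    return values[0] - values[-1]


if __name__ == "__main__":
    # Values copied from the successful sample response.
    train_loss = [1.354, 1.985, 1.938, 0.114, 0.669, 0.434]
    eval_loss = [2.644, 0.474]
    print(f"training loss reduction:   {loss_reduction(train_loss):.3f}")
    print(f"validation loss reduction: {loss_reduction(eval_loss):.3f}")
```

A positive result means the loss went down over the run; a value near zero or below zero suggests the job did not improve on that metric.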

Example Metrics Display

Training Details:
- Model Name: Legal-1.0
- Tags: legal, tax
- Base Model: Mistral-7B
Training Status: Running
Training Duration: 1 hour 30 minutes
Current Metrics:
- Training Loss: 5.6282
- Validation Loss: 10.8603
Hyperparameters Used:
- Learning Rate: auto
- Batch Size: auto
- Epochs: auto
- Percentage of Dataset for Eval: 5% (auto)
Performance Comparison:
- Training Loss Reduction: 5.2321
- Validation Loss Reduction: 0.0000

Example Charts

Performance Comparison (Start/End of Fine-Tuning)

Metric	Before Fine-Tuning	After Fine-Tuning
Training Loss	10.860	5.628
Validation Loss	10.861	5.860

Monitoring the training and validation loss shows how well the model is learning and how well it generalizes to new data. Consistently decreasing loss values indicate that training is progressing effectively.
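Per-step loss is usually noisy, so a simple way to check for a downward trend is to compare the average of the earliest points with the average of the latest ones. The following is an illustrative heuristic, not the platform's own logic; the function name and the default window of 3 are assumptions.

```python
from statistics import mean
from typing import Sequence


def is_loss_decreasing(values: Sequence[float], window: int = 3) -> bool:
    """Compare the mean of the first `window` points with the mean of the last `window`."""
    if len(values) < 2:
        return False  # not enough data to call a trend
    # Shrink the window for short series so the two ends never overlap.
    w = max(1, min(window, len(values) // 2))
    return mean(values[-w:]) < mean(values[:w])


if __name__ == "__main__":
    # Training loss copied from the successful sample response.
    print(is_loss_decreasing([1.354, 1.985, 1.938, 0.114, 0.669, 0.434]))
```

Averaging over a window keeps a single noisy step, such as the rise from 1.354 to 1.985 in the sample data, from being read as a failure to learn.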

Best Practices

Monitor regularly: Check the fine-tuning job metrics throughout training to confirm that the model is learning effectively.
Compare metrics: Compare the metrics across different fine-tuning jobs to identify the most effective hyperparameters.
Adjust as needed: Adjust the hyperparameters, model, or training data as needed based on the metrics.
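Comparing jobs can be automated once the training info for each job has been fetched. The sketch below ranks jobs by their final `eval_loss` value and returns the name of the lowest one. Using final validation loss as the selection criterion is an assumption made for illustration, and the function names and example job names are hypothetical.

```python
from typing import Dict, Optional


def final_eval_loss(response: dict) -> Optional[float]:
    """Last eval_loss value in a training info response, or None if there is none."""
    values = response.get("metrics", {}).get("eval_loss", [])
    return values[-1] if values else None


def best_job(responses: Dict[str, dict]) -> Optional[str]:
    """Name of the job with the lowest final eval_loss, skipping jobs without one."""
    scored = {}
    for name, response in responses.items():
        loss = final_eval_loss(response)
        if loss is not None:
            scored[name] = loss
    return min(scored, key=scored.get) if scored else None


if __name__ == "__main__":
    # Hypothetical job names; eval_loss values reuse the sample responses.
    jobs = {
        "job-a": {"metrics": {"eval_loss": [2.644, 0.474]}},
        "job-b": {"metrics": {"eval_loss": []}},
    }
    print(best_job(jobs))
```

Jobs that failed or have no validation data yet return no final loss and are left out of the comparison instead of being treated as zero.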