Monitoring a Fine-Tuning Job
Monitoring the fine-tuning process is essential to ensure that the model is learning effectively and to make adjustments as needed. With the Hyperstack Gen AI Platform, users can monitor the progress of a fine-tuning job and gain insight into the training process through either the API or the UI.
Monitor Through API
You can monitor a fine-tuning job by making a GET request to the training info endpoint:
MODEL_NAME="finetuned-mistral-7b"
curl -X GET "https://api.genai.hyperstack.cloud/tailor/v1/named-training-info-log/$MODEL_NAME" \
-H "X-API-Key: $API_KEY" \
-H "Content-Type: application/json"
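For scripted monitoring, the same request can be issued from Python. The following is a minimal sketch using only the standard library; it mirrors the URL and headers of the curl command above. The function names, the `BASE_URL` constant, and the default timeout are illustrative choices, not part of the platform's API.

```python
import json
import urllib.parse
import urllib.request

# Base path taken from the curl example above.
BASE_URL = "https://api.genai.hyperstack.cloud/tailor/v1"


def build_training_info_request(model_name: str, api_key: str) -> urllib.request.Request:
    """Build the same GET request as the curl command (hypothetical helper)."""
    # Percent-encode the model name so unusual characters cannot break the path.
    path = urllib.parse.quote(model_name, safe="")
    url = f"{BASE_URL}/named-training-info-log/{path}"
    headers = {"X-API-Key": api_key, "Content-Type": "application/json"}
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_training_info(model_name: str, api_key: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON body into a dict."""
    request = build_training_info_request(model_name, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `fetch_training_info("finetuned-mistral-7b", api_key)` with a valid key should return a dictionary shaped like the JSON responses shown in this section.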
If the fine-tuning job succeeded, the response will look like this:
{
  "metrics": {
    "end_train_message": ["Training job ended"],
    "end_train_status": ["dormant"],
    "eval_loss": [2.644, 0.474],
    "loss": [1.354, 1.985, 1.938, 0.114, 0.669, 0.434],
    "perplexity": []
  },
  "status": "success"
}
If the fine-tuning job failed, the response will look like this:
{
  "metrics": {
    "end_train_message": ["message"],
    "end_train_status": ["failed_training"],
    "eval_loss": [],
    "eval_perplexity": [],
    "loss": [],
    "perplexity": []
  },
  "status": "success"
}
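Both example responses carry a top-level "status" of "success", so the outcome of the job itself has to be read from "end_train_status". The sketch below shows one way to summarize a response. It makes several assumptions that the examples alone do not confirm: any status beginning with "failed" is treated as a failure, any other end status (such as "dormant") as a finished job, and an empty "end_train_status" list as a job that has not ended yet. The function name and the returned keys are illustrative.

```python
def summarize_job(response: dict) -> dict:
    """Summarize a training-info response (hypothetical helper)."""
    metrics = response.get("metrics", {})
    statuses = metrics.get("end_train_status", [])
    messages = metrics.get("end_train_message", [])
    losses = metrics.get("loss", [])
    eval_losses = metrics.get("eval_loss", [])

    end_status = statuses[-1] if statuses else None
    if end_status is None:
        state = "in_progress_or_unknown"  # assumption: no end status logged yet
    elif end_status.startswith("failed"):
        state = "failed"  # assumption: covers "failed_training" and similar
    else:
        state = "finished"  # assumption: e.g. "dormant" after a normal end

    return {
        "state": state,
        "end_status": end_status,
        "message": messages[-1] if messages else None,
        "last_loss": losses[-1] if losses else None,
        "last_eval_loss": eval_losses[-1] if eval_losses else None,
    }
```

Keeping the raw `end_status` in the result makes it easy to spot status values that the sketch does not anticipate.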
Monitor Through The UI
1. Access the Training Metrics Page
Navigate to the Training Metrics page to view the status of your training jobs.
2. Check Job Status
- Training Jobs: View the list of jobs that are currently training.
- Failed Jobs: Identify and review jobs where the training has failed.
3. Review Metrics
The primary metric collected during training is loss, which is the main signal for evaluating how the model is performing. Two values are reported:
- Training Loss: Indicates how well the model is learning from the training data.
- Validation Loss: Measures the model's performance on a separate validation dataset.
4. Analyse Visualizations
The Training Metrics page offers two main types of visualizations to help users understand their model's performance:
- Performance Comparison Chart: This bar chart compares the training and validation loss before and after fine-tuning, showing the reduction in loss as a result of the training.
- Model Performance Over Steps Chart: This line chart displays the loss over the steps of the training process, allowing users to see how the loss decreases as the training progresses.
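The before/after comparison shown in the Performance Comparison Chart can be approximated from the API metrics. This is a rough sketch that assumes the first and last logged values stand in for "before" and "after" fine-tuning, and that the API's "eval_loss" series corresponds to the validation loss shown in the UI; the UI may compute its figures differently. The helper names are illustrative.

```python
def loss_reduction(values):
    """Return first-minus-last loss, or None with fewer than two points."""
    if len(values) < 2:
        return None
    return values[0] - values[-1]


def comparison_rows(metrics):
    """Build (label, before, after, reduction) rows from a metrics dict."""
    rows = []
    # Assumption: "loss" is training loss and "eval_loss" is validation loss.
    for label, key in (("Training Loss", "loss"), ("Validation Loss", "eval_loss")):
        series = metrics.get(key, [])
        if len(series) >= 2:
            rows.append((label, series[0], series[-1], loss_reduction(series)))
    return rows


def format_rows(rows):
    """Render the rows as plain-text lines with three decimal places."""
    return [
        f"{label}: {before:.3f} -> {after:.3f} (reduction {reduction:.3f})"
        for label, before, after, reduction in rows
    ]
```

A positive reduction means the loss went down over the run; series with fewer than two points are skipped rather than reported as zero.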
Example Metrics Display
- Training Details:
  - Model Name: Legal-1.0
  - Tags: legal, tax
  - Base Model: Mistral-7B
- Training Status: Running
- Training Duration: 1 hour 30 minutes
- Current Metrics:
  - Training Loss: 5.6282
  - Validation Loss: 10.8603
- Hyperparameters Used:
  - Learning Rate: auto
  - Batch Size: auto
  - Epochs: auto
  - Percentage of Dataset for Eval: 5% (auto)
- Performance Comparison:
  - Training Loss Reduction: 5.2321
  - Validation Loss Reduction: 0.0000
Example Charts
Performance Comparison (Start/End of Fine-Tuning)
Metric | Before Fine-Tuning | After Fine-Tuning |
---|---|---|
Training Loss | 10.860 | 5.628 |
Validation Loss | 10.861 | 5.860 |
Monitoring the training and validation loss is essential for understanding how well a user's model is learning and generalizing to new data. Consistently decreasing loss values indicate effective training progress.
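Per-step loss is usually noisy, so a single step that goes up does not mean training has stalled. The sketch below is one simple heuristic, not the platform's own logic: it compares the average of the first few training-loss values with the average of the last few, and flags possible overfitting when training loss falls while evaluation loss ends higher than it started. The window size and the returned labels are arbitrary illustrative choices.

```python
from statistics import mean


def smoothed_change(values, window=2):
    """Mean of the last `window` values minus mean of the first `window`.

    Returns None when there are too few points to compare.
    """
    if len(values) < 2 * window:
        return None
    return mean(values[-window:]) - mean(values[:window])


def diagnose(train_loss, eval_loss, window=2):
    """Classify a run from its loss series using a coarse heuristic."""
    train_change = smoothed_change(train_loss, window)
    if train_change is None:
        return "insufficient data"
    if train_change >= 0:
        return "not improving"
    # Evaluation loss is typically logged less often, so compare endpoints only.
    if len(eval_loss) >= 2 and eval_loss[-1] > eval_loss[0]:
        return "possible overfitting"
    return "improving"
```

Applied to the successful response shown earlier, the training loss falls on average even though individual steps rise, and the evaluation loss also falls, so the heuristic reports the run as improving.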
Best Practices
- Monitor regularly: Check the fine-tuning job metrics throughout training to confirm the model is learning effectively.
- Compare metrics: Compare the metrics across different fine-tuning jobs to identify the most effective hyperparameters.
- Adjust as needed: Adjust the hyperparameters, model, or training data as needed based on the metrics.
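Comparing jobs can be as simple as ranking them by their final evaluation loss. The sketch below assumes you have already collected one training-info response per job in a dictionary keyed by model name; the job names in the usage note are hypothetical, and final evaluation loss is only one possible ranking criterion.

```python
def rank_jobs_by_eval_loss(responses):
    """Return (job name, final eval loss) pairs, lowest loss first.

    Jobs with no logged evaluation loss (for example failed jobs) are skipped.
    """
    ranked = []
    for name, response in responses.items():
        eval_loss = response.get("metrics", {}).get("eval_loss", [])
        if eval_loss:
            ranked.append((name, eval_loss[-1]))
    return sorted(ranked, key=lambda item: item[1])
```

For example, given responses for hypothetical jobs "job-a" and "job-b", the first entry of the returned list is the job whose last logged evaluation loss is lowest. Such a comparison is only meaningful when the jobs were evaluated on the same data.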