Error Handling & Recovery
Overview
AtlasOS handles errors through structured error responses, graceful degradation, and automatic recovery mechanisms, which keeps the system reliable and gives clients consistent, predictable responses.
Error Types
Structured API Errors
All API errors follow a consistent structure:
{
"code": "NOT_FOUND",
"message": "dataset not found",
"details": "tank/missing"
}
Error Codes
- INTERNAL_ERROR - Unexpected server errors (500)
- NOT_FOUND - Resource not found (404)
- BAD_REQUEST - Invalid request parameters (400)
- CONFLICT - Resource conflict (409)
- UNAUTHORIZED - Authentication required (401)
- FORBIDDEN - Insufficient permissions (403)
- SERVICE_UNAVAILABLE - Service temporarily unavailable (503)
- VALIDATION_ERROR - Input validation failed (400)
Error Handling Patterns
1. Structured Error Responses
All errors use the errors.APIError type for consistent formatting:
if resource == nil {
	writeError(w, errors.ErrNotFound("dataset").WithDetails(datasetName))
	return
}
2. Graceful Degradation
Service operations (SMB/NFS/iSCSI) use graceful degradation:
- Desired State Stored: Configuration is always stored in the store
- Service Application: Service configuration is applied asynchronously
- Non-Blocking: Service failures don't fail API requests
- Retry Ready: Failed operations can be retried later
Example:
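// Store the desired configuration first
share, err := a.smbStore.Create(...)
if err != nil {
	// A store failure is a real error: the desired state was not saved
	writeError(w, errors.ErrInternal("failed to create share").WithDetails(err.Error()))
	return
}
// Apply to service (may fail, but doesn't block)
if err := a.smbService.ApplyConfiguration(shares); err != nil {
	// Log but don't fail - desired state is stored
	log.Printf("SMB service configuration failed (non-fatal): %v", err)
}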
3. Panic Recovery
All HTTP handlers are wrapped with panic recovery middleware:
func (a *App) errorMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer recoverPanic(w, r)
		next.ServeHTTP(w, r)
	})
}
Panics are caught and converted to proper error responses instead of crashing the server.
4. Atomic Operations with Rollback
Service configuration operations are atomic with automatic rollback:
- Write to temporary file (*.atlas.tmp)
- Back up existing config (.backup)
- Atomically replace config file
- Reload service
- On failure: Automatically restore backup
Example (SMB):
// Write to temp file (error checks omitted for brevity)
os.WriteFile(tmpPath, config, 0644)
// Back up existing config
old, _ := os.ReadFile(configPath)
os.WriteFile(backupPath, old, 0644)
// Atomic replace
os.Rename(tmpPath, configPath)
// Reload service
if err := reloadService(); err != nil {
	// Restore backup automatically
	os.Rename(backupPath, configPath)
	return err
}
Retry Mechanisms
Retry Configuration
The errors.Retry function provides configurable retry logic:
config := errors.DefaultRetryConfig() // 3 attempts with exponential backoff
err := errors.Retry(func() error {
	return serviceOperation()
}, config)
Default Retry Behavior
- Max Attempts: 3
- Backoff: Exponential (100ms, 200ms, 400ms)
- Use Case: Transient failures (network, temporary service unavailability)
Error Recovery
Service Configuration Recovery
When service configuration fails:
- Configuration is stored (desired state preserved)
- Error is logged (for debugging)
- Operation continues (API request succeeds)
- Retry remains possible (manually via the API, or by a later automatic retry)
Database Recovery
- Connection failures: Logged and retried
- Transaction failures: Rolled back automatically
- Schema errors: Detected during migration
ZFS Operation Recovery
- Command failures: Returned as errors to caller
- Partial failures: State is preserved, operation can be retried
- Validation: Performed before destructive operations
Error Logging
All errors are logged with context:
log.Printf("create SMB share error: %v", err)
log.Printf("%s service error: %v", serviceName, err)
Error logs include:
- Error message
- Operation context
- Resource identifiers
- Timestamp (via standard log)
Best Practices
1. Always Use Structured Errors
// Good
writeError(w, errors.ErrNotFound("pool").WithDetails(poolName))
// Avoid
writeJSON(w, http.StatusNotFound, map[string]string{"error": "not found"})
2. Handle Service Errors Gracefully
// Good - graceful degradation
if err := service.Apply(); err != nil {
	log.Printf("service error (non-fatal): %v", err)
	// Continue - desired state is stored
}
// Avoid - failing the request
if err := service.Apply(); err != nil {
	return err // Don't fail the whole request
}
3. Validate Before Operations
// Good - validate first
if !datasetExists {
	writeError(w, errors.ErrNotFound("dataset"))
	return
}
// Then perform operation
4. Use Context for Error Details
// Good - include context
writeError(w, errors.ErrInternal("failed to create pool").WithDetails(err.Error()))
// Avoid - generic errors
writeError(w, errors.ErrInternal("error"))
Error Response Format
All error responses follow this structure:
{
"code": "ERROR_CODE",
"message": "Human-readable error message",
"details": "Additional context (optional)"
}
HTTP status codes match error types:
- 400 - Bad Request / Validation Error
- 401 - Unauthorized
- 403 - Forbidden
- 404 - Not Found
- 409 - Conflict
- 500 - Internal Error
- 503 - Service Unavailable
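The mapping from error code to HTTP status can be expressed as a simple lookup table. The sketch below uses the codes and statuses listed in this document; the names statusByCode and statusFor, and the fallback to 500 for unknown codes, are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
)

// statusByCode maps each documented error code to its HTTP status.
var statusByCode = map[string]int{
	"BAD_REQUEST":         http.StatusBadRequest,
	"VALIDATION_ERROR":    http.StatusBadRequest,
	"UNAUTHORIZED":        http.StatusUnauthorized,
	"FORBIDDEN":           http.StatusForbidden,
	"NOT_FOUND":           http.StatusNotFound,
	"CONFLICT":            http.StatusConflict,
	"INTERNAL_ERROR":      http.StatusInternalServerError,
	"SERVICE_UNAVAILABLE": http.StatusServiceUnavailable,
}

// statusFor returns the status for a code. Unknown codes fall back to
// 500, which is an assumption of this sketch.
func statusFor(code string) int {
	if status, ok := statusByCode[code]; ok {
		return status
	}
	return http.StatusInternalServerError
}

func main() {
	fmt.Println(statusFor("CONFLICT"), statusFor("VALIDATION_ERROR")) // prints: 409 400
}
```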
Future Enhancements
- Error Tracking: Centralized error tracking and alerting
- Automatic Retry Queue: Background retry for failed operations
- Error Metrics: Track error rates by type and endpoint
- User-Friendly Messages: More descriptive error messages
- Error Correlation: Link related errors for debugging