# Error Handling & Recovery

## Overview

AtlasOS implements comprehensive error handling with structured error responses, graceful degradation, and automatic recovery mechanisms to ensure system reliability and a good user experience.

## Error Types

### Structured API Errors

All API errors follow a consistent structure:

```json
{
  "code": "NOT_FOUND",
  "message": "dataset not found",
  "details": "tank/missing"
}
```

### Error Codes

- `INTERNAL_ERROR` - Unexpected server errors (500)
- `NOT_FOUND` - Resource not found (404)
- `BAD_REQUEST` - Invalid request parameters (400)
- `CONFLICT` - Resource conflict (409)
- `UNAUTHORIZED` - Authentication required (401)
- `FORBIDDEN` - Insufficient permissions (403)
- `SERVICE_UNAVAILABLE` - Service temporarily unavailable (503)
- `VALIDATION_ERROR` - Input validation failed (400)

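
The mapping from error codes to HTTP statuses listed above can be sketched as a single lookup function. This is a minimal illustration, not the AtlasOS implementation: the name `statusForCode` and the fallback to 500 for unknown codes are assumptions.

```go
package main

import (
	"fmt"
	"net/http"
)

// statusForCode maps an error code from the list above to its HTTP status.
// Hypothetical helper: unknown codes fall back to 500 by assumption.
func statusForCode(code string) int {
	switch code {
	case "BAD_REQUEST", "VALIDATION_ERROR":
		return http.StatusBadRequest
	case "UNAUTHORIZED":
		return http.StatusUnauthorized
	case "FORBIDDEN":
		return http.StatusForbidden
	case "NOT_FOUND":
		return http.StatusNotFound
	case "CONFLICT":
		return http.StatusConflict
	case "SERVICE_UNAVAILABLE":
		return http.StatusServiceUnavailable
	default: // INTERNAL_ERROR and anything unrecognized
		return http.StatusInternalServerError
	}
}

func main() {
	for _, code := range []string{"NOT_FOUND", "VALIDATION_ERROR", "INTERNAL_ERROR"} {
		fmt.Println(code, statusForCode(code))
	}
}
```

Keeping the mapping in one place means two codes can share a status (as `BAD_REQUEST` and `VALIDATION_ERROR` do) while clients still distinguish them by the `code` field.
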
## Error Handling Patterns

### 1. Structured Error Responses

All errors use the `errors.APIError` type for consistent formatting:

```go
if resource == nil {
	writeError(w, errors.ErrNotFound("dataset").WithDetails(datasetName))
	return
}
```

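
To make the snippet above concrete, here is one way the pieces it relies on could fit together. The struct fields, the copy-on-`WithDetails` behavior, and the body of `writeError` are assumptions for illustration; the real `errors` package in AtlasOS (a project package, not the standard library's `errors`) may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// APIError is a hypothetical stand-in for errors.APIError.
type APIError struct {
	Code       string `json:"code"`
	Message    string `json:"message"`
	Details    string `json:"details,omitempty"`
	HTTPStatus int    `json:"-"` // used for the response status, not serialized
}

// Error makes APIError satisfy the built-in error interface.
func (e *APIError) Error() string { return e.Code + ": " + e.Message }

// ErrNotFound builds a 404 error for the named resource kind.
func ErrNotFound(resource string) *APIError {
	return &APIError{
		Code:       "NOT_FOUND",
		Message:    resource + " not found",
		HTTPStatus: http.StatusNotFound,
	}
}

// WithDetails returns a copy carrying extra context; the receiver is unchanged.
func (e *APIError) WithDetails(details string) *APIError {
	c := *e
	c.Details = details
	return &c
}

// writeError sends the error as JSON with its HTTP status.
func writeError(w http.ResponseWriter, e *APIError) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(e.HTTPStatus)
	json.NewEncoder(w).Encode(e)
}

func main() {
	rec := httptest.NewRecorder()
	writeError(rec, ErrNotFound("dataset").WithDetails("tank/missing"))
	fmt.Println(rec.Code, rec.Body.String())
}
```

Returning a copy from `WithDetails` is a design choice in this sketch: it lets a shared base error be reused across requests without one request's details leaking into another.
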
### 2. Graceful Degradation

Service operations (SMB/NFS/iSCSI) use graceful degradation:

- **Desired State Stored**: Configuration is always stored in the store
- **Service Application**: Service configuration is applied asynchronously
- **Non-Blocking**: Service failures don't fail API requests
- **Retry Ready**: Failed operations can be retried later

Example:

```go
// Store the configuration (always succeeds)
share, err := a.smbStore.Create(...)

// Apply to service (may fail, but doesn't block)
if err := a.smbService.ApplyConfiguration(shares); err != nil {
	// Log but don't fail - desired state is stored
	log.Printf("SMB service configuration failed (non-fatal): %v", err)
}
```

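
The excerpt above depends on the surrounding handler, so here is a self-contained sketch of the same pattern that can be run on its own. The in-memory `shareStore`, the `applyFunc` callback, and `createShare` are hypothetical names standing in for the real store and service.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errServiceDown simulates a service that cannot be reconfigured right now.
var errServiceDown = errors.New("smbd is not running")

// shareStore is a hypothetical in-memory stand-in for the SMB share store.
type shareStore struct{ shares []string }

func (s *shareStore) Create(name string) { s.shares = append(s.shares, name) }

// applyFunc pushes the desired state to the running service and may fail.
type applyFunc func(shares []string) error

// createShare records the desired state first, then tries to apply it.
// An apply failure is logged and reported through the bool result, but it is
// not treated as a request failure, so the caller can still respond with success.
func createShare(store *shareStore, apply applyFunc, name string) (applied bool) {
	store.Create(name)
	if err := apply(store.shares); err != nil {
		log.Printf("SMB service configuration failed (non-fatal): %v", err)
		return false
	}
	return true
}

func main() {
	store := &shareStore{}
	down := func([]string) error { return errServiceDown }
	applied := createShare(store, down, "media")
	fmt.Println("stored:", store.shares, "applied:", applied)
}
```

Because the stored state is the source of truth, a later retry only has to call the apply step again with the current list of shares.
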
### 3. Panic Recovery

All HTTP handlers are wrapped with panic recovery middleware:

```go
func (a *App) errorMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer recoverPanic(w, r)
		next.ServeHTTP(w, r)
	})
}
```

Panics are caught and converted to proper error responses instead of crashing the server.

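
The middleware above calls `recoverPanic`, whose body is not shown. A plausible sketch follows; the log line and the response message are assumptions, and the middleware is written as a plain function rather than an `App` method to keep the example self-contained.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
)

// recoverPanic must be the function invoked by defer: recover() only stops a
// panic when called directly from a deferred function.
// The response body follows the structured error format; its text is assumed.
func recoverPanic(w http.ResponseWriter, r *http.Request) {
	if rec := recover(); rec != nil {
		log.Printf("panic in %s %s: %v", r.Method, r.URL.Path, rec)
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusInternalServerError)
		fmt.Fprint(w, `{"code":"INTERNAL_ERROR","message":"internal server error"}`)
	}
}

func errorMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer recoverPanic(w, r)
		next.ServeHTTP(w, r)
	})
}

// servePanicking sends one request through a handler that always panics
// and returns the status code the client would see.
func servePanicking() int {
	h := errorMiddleware(http.HandlerFunc(func(http.ResponseWriter, *http.Request) {
		panic("boom")
	}))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/api/pools", nil))
	return rec.Code
}

func main() {
	fmt.Println("status after panic:", servePanicking())
}
```

One limitation of this approach: if the handler has already written its status line before panicking, the 500 set here cannot replace it.
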
### 4. Atomic Operations with Rollback

Service configuration operations are atomic with automatic rollback:

1. **Write to temporary file** (`*.atlas.tmp`)
2. **Backup existing config** (`.backup`)
3. **Atomically replace** config file
4. **Reload service**
5. **On failure**: Automatically restore backup

Example (SMB):

```go
// Write the new config to a temp file
if err := os.WriteFile(tmpPath, config, 0644); err != nil {
	return err
}

// Back up the existing config by copying it
existing, err := os.ReadFile(configPath)
if err != nil {
	return err
}
if err := os.WriteFile(backupPath, existing, 0644); err != nil {
	return err
}

// Atomically replace the config file
if err := os.Rename(tmpPath, configPath); err != nil {
	return err
}

// Reload service
if err := reloadService(); err != nil {
	// Restore backup automatically
	os.Rename(backupPath, configPath)
	return err
}
```

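
The five steps above can be exercised end to end with a small runnable sketch. `applyConfig`, the `reload` callback, and the `demo` helper are hypothetical; the sketch also assumes the config file already exists and that the temp file lives on the same filesystem as the config, which is what makes the rename atomic on POSIX systems.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// errReload simulates a service that rejects the new configuration.
var errReload = errors.New("reload failed")

// applyConfig writes newConfig next to configPath, backs up the current file,
// swaps the new file in, and restores the backup if reload reports an error.
func applyConfig(configPath string, newConfig []byte, reload func() error) error {
	tmpPath := configPath + ".atlas.tmp"
	backupPath := configPath + ".backup"

	if err := os.WriteFile(tmpPath, newConfig, 0o644); err != nil {
		return err
	}
	old, err := os.ReadFile(configPath)
	if err != nil {
		return err
	}
	if err := os.WriteFile(backupPath, old, 0o644); err != nil {
		return err
	}
	if err := os.Rename(tmpPath, configPath); err != nil {
		return err
	}
	if err := reload(); err != nil {
		if rbErr := os.Rename(backupPath, configPath); rbErr != nil {
			return fmt.Errorf("reload failed: %v; rollback failed: %v", err, rbErr)
		}
		return err
	}
	return nil
}

// demo applies "new" over "old" in a temp directory and returns the file
// content afterwards; reloadErr controls whether the reload step fails.
func demo(reloadErr error) string {
	dir, err := os.MkdirTemp("", "atlas")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "smb.conf")
	if err := os.WriteFile(path, []byte("old"), 0o644); err != nil {
		panic(err)
	}
	applyConfig(path, []byte("new"), func() error { return reloadErr })
	data, _ := os.ReadFile(path)
	return string(data)
}

func main() {
	fmt.Println("after failed reload:", demo(errReload))
	fmt.Println("after successful reload:", demo(nil))
}
```

A production version would likely also reload the service again after restoring the backup, so the running service matches the restored file.
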
## Retry Mechanisms

### Retry Configuration

The `errors.Retry` function provides configurable retry logic:

```go
config := errors.DefaultRetryConfig() // 3 attempts with exponential backoff
err := errors.Retry(func() error {
	return serviceOperation()
}, config)
```

### Default Retry Behavior

- **Max Attempts**: 3
- **Backoff**: Exponential (100ms, 200ms, 400ms)
- **Use Case**: Transient failures (network, temporary service unavailability)

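
For readers who want to see the shape of such a helper, here is a minimal sketch of retry with exponential backoff. The `RetryConfig` fields and the `flaky` test helper are assumptions, and the functions are defined at package level because the real ones live in the project's own `errors` package. This sketch sleeps only between attempts, so with three attempts it waits 100ms and then 200ms; whether the real implementation also applies the 400ms delay is not shown in this document.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// RetryConfig is a hypothetical shape for the retry settings.
type RetryConfig struct {
	MaxAttempts  int
	InitialDelay time.Duration
}

// DefaultRetryConfig mirrors the defaults listed above.
func DefaultRetryConfig() RetryConfig {
	return RetryConfig{MaxAttempts: 3, InitialDelay: 100 * time.Millisecond}
}

// Retry calls fn until it succeeds or MaxAttempts is reached, doubling the
// delay after each failed attempt. It does not sleep after the final attempt.
func Retry(fn func() error, cfg RetryConfig) error {
	var err error
	delay := cfg.InitialDelay
	for attempt := 1; attempt <= cfg.MaxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if attempt < cfg.MaxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("after %d attempts: %w", cfg.MaxAttempts, err)
}

var errTransient = errors.New("transient failure")

// flaky returns an operation that fails its first n calls, plus a call counter.
func flaky(n int) (func() error, *int) {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errTransient
		}
		return nil
	}, &calls
}

func main() {
	op, calls := flaky(2)
	err := Retry(op, DefaultRetryConfig())
	fmt.Println("err:", err, "calls:", *calls)
}
```

Wrapping the last error with `%w` lets callers still inspect the underlying cause with `errors.Is` or `errors.As`.
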
## Error Recovery

### Service Configuration Recovery

When service configuration fails:

1. **Configuration is stored** (desired state preserved)
2. **Error is logged** (for debugging)
3. **Operation continues** (API request succeeds)
4. **Retry remains available** (manually via the API, or automatically later)

### Database Recovery

- **Connection failures**: Logged and retried
- **Transaction failures**: Rolled back automatically
- **Schema errors**: Detected during migration

### ZFS Operation Recovery

- **Command failures**: Returned as errors to caller
- **Partial failures**: State is preserved, operation can be retried
- **Validation**: Performed before destructive operations

## Error Logging

All errors are logged with context:

```go
log.Printf("create SMB share error: %v", err)
log.Printf("%s service error: %v", serviceName, err)
```

Error logs include:

- Error message
- Operation context
- Resource identifiers
- Timestamp (via standard log)

## Best Practices

### 1. Always Use Structured Errors

```go
// Good
writeError(w, errors.ErrNotFound("pool").WithDetails(poolName))

// Avoid
writeJSON(w, http.StatusNotFound, map[string]string{"error": "not found"})
```

### 2. Handle Service Errors Gracefully

```go
// Good - graceful degradation
if err := service.Apply(); err != nil {
	log.Printf("service error (non-fatal): %v", err)
	// Continue - desired state is stored
}

// Avoid - failing the request
if err := service.Apply(); err != nil {
	return err // Don't fail the whole request
}
```

### 3. Validate Before Operations

```go
// Good - validate first
if !datasetExists {
	writeError(w, errors.ErrNotFound("dataset"))
	return
}
// Then perform operation
```

### 4. Use Context for Error Details

```go
// Good - include context
writeError(w, errors.ErrInternal("failed to create pool").WithDetails(err.Error()))

// Avoid - generic errors
writeError(w, errors.ErrInternal("error"))
```

## Error Response Format

All error responses follow this structure:

```json
{
  "code": "ERROR_CODE",
  "message": "Human-readable error message",
  "details": "Additional context (optional)"
}
```

HTTP status codes match error types:

- `400` - Bad Request / Validation Error
- `401` - Unauthorized
- `403` - Forbidden
- `404` - Not Found
- `409` - Conflict
- `500` - Internal Error
- `503` - Service Unavailable

## Future Enhancements

1. **Error Tracking**: Centralized error tracking and alerting
2. **Automatic Retry Queue**: Background retry for failed operations
3. **Error Metrics**: Track error rates by type and endpoint
4. **User-Friendly Messages**: More descriptive error messages
5. **Error Correlation**: Link related errors for debugging