242
docs/ERROR_HANDLING.md
Normal file
@@ -0,0 +1,242 @@
# Error Handling & Recovery

## Overview

AtlasOS implements error handling with structured error responses, graceful degradation, and automatic recovery mechanisms to keep the system reliable and the user experience consistent.

## Error Types

### Structured API Errors

All API errors follow a consistent structure:

```json
{
  "code": "NOT_FOUND",
  "message": "dataset not found",
  "details": "tank/missing"
}
```

### Error Codes

- `INTERNAL_ERROR` - Unexpected server errors (500)
- `NOT_FOUND` - Resource not found (404)
- `BAD_REQUEST` - Invalid request parameters (400)
- `CONFLICT` - Resource conflict (409)
- `UNAUTHORIZED` - Authentication required (401)
- `FORBIDDEN` - Insufficient permissions (403)
- `SERVICE_UNAVAILABLE` - Service temporarily unavailable (503)
- `VALIDATION_ERROR` - Input validation failed (400)

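
The code list can be read as a lookup from error code to HTTP status. The following is a minimal sketch of how such a type might look, not the actual `errors` package: the field layout follows the JSON structure shown earlier, while `statusByCode` and `HTTPStatus` are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"net/http"
)

// APIError mirrors the documented JSON structure. The exact definition in
// AtlasOS may differ; this layout is an assumption.
type APIError struct {
	Code    string `json:"code"`
	Message string `json:"message"`
	Details string `json:"details,omitempty"`
}

// Error implements the error interface.
func (e *APIError) Error() string {
	if e.Details == "" {
		return e.Code + ": " + e.Message
	}
	return e.Code + ": " + e.Message + " (" + e.Details + ")"
}

// WithDetails returns a copy of the error with Details set.
func (e *APIError) WithDetails(details string) *APIError {
	c := *e
	c.Details = details
	return &c
}

// ErrNotFound builds a NOT_FOUND error for the named resource.
func ErrNotFound(resource string) *APIError {
	return &APIError{Code: "NOT_FOUND", Message: resource + " not found"}
}

// statusByCode maps each documented error code to its HTTP status
// (hypothetical helper, not taken from the source).
var statusByCode = map[string]int{
	"INTERNAL_ERROR":      http.StatusInternalServerError,
	"NOT_FOUND":           http.StatusNotFound,
	"BAD_REQUEST":         http.StatusBadRequest,
	"CONFLICT":            http.StatusConflict,
	"UNAUTHORIZED":        http.StatusUnauthorized,
	"FORBIDDEN":           http.StatusForbidden,
	"SERVICE_UNAVAILABLE": http.StatusServiceUnavailable,
	"VALIDATION_ERROR":    http.StatusBadRequest,
}

// HTTPStatus returns the status for the error's code, falling back to 500
// for codes that are not in the table.
func (e *APIError) HTTPStatus() int {
	if status, ok := statusByCode[e.Code]; ok {
		return status
	}
	return http.StatusInternalServerError
}

func main() {
	err := ErrNotFound("dataset").WithDetails("tank/missing")
	fmt.Println(err.HTTPStatus(), err)
}
```

Keeping the mapping in one table means a new error code only needs one entry, and unknown codes degrade to a 500 rather than an unset status.
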
## Error Handling Patterns

### 1. Structured Error Responses

All errors use the `errors.APIError` type for consistent formatting:

```go
if resource == nil {
	writeError(w, errors.ErrNotFound("dataset").WithDetails(datasetName))
	return
}
```

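
The `writeError` helper itself is not shown in this document. As a rough sketch of what it might do, assuming the error value carries its own HTTP status, the version below sets the content type, writes the status, and encodes the error as JSON. The `apiError` struct, its `Status` field, and the `render` helper are assumptions made for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// apiError is a stand-in for errors.APIError. Carrying the HTTP status on
// the struct is an assumption of this sketch.
type apiError struct {
	Status  int    `json:"-"`
	Code    string `json:"code"`
	Message string `json:"message"`
	Details string `json:"details,omitempty"`
}

// writeError sends the error as a JSON body with the matching status code.
// Headers must be set before WriteHeader is called.
func writeError(w http.ResponseWriter, e apiError) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(e.Status)
	// The status line is already sent, so an encoding failure cannot be
	// reported to the client; it is ignored here for brevity.
	_ = json.NewEncoder(w).Encode(e)
}

// render runs writeError against an in-memory recorder and returns what a
// client would see.
func render(e apiError) (int, string) {
	rec := httptest.NewRecorder()
	writeError(rec, e)
	return rec.Code, rec.Body.String()
}

func main() {
	status, body := render(apiError{
		Status:  http.StatusNotFound,
		Code:    "NOT_FOUND",
		Message: "dataset not found",
		Details: "tank/missing",
	})
	fmt.Print(status, " ", body)
}
```
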
### 2. Graceful Degradation

Service operations (SMB/NFS/iSCSI) use graceful degradation:

- **Desired State Stored**: The configuration is always written to the store first
- **Service Application**: Service configuration is applied asynchronously
- **Non-Blocking**: Service failures don't fail API requests
- **Retry Ready**: Failed operations can be retried later

Example:
```go
// Store the configuration (the desired state)
share, err := a.smbStore.Create(...)
if err != nil {
	writeError(w, errors.ErrInternal("failed to store SMB share").WithDetails(err.Error()))
	return
}

// Apply to service (may fail, but doesn't block)
if err := a.smbService.ApplyConfiguration(shares); err != nil {
	// Log but don't fail - desired state is stored
	log.Printf("SMB service configuration failed (non-fatal): %v", err)
}
```

### 3. Panic Recovery

All HTTP handlers are wrapped with panic recovery middleware:

```go
func (a *App) errorMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer recoverPanic(w, r)
		next.ServeHTTP(w, r)
	})
}
```

Panics are caught and converted to proper error responses instead of crashing the server.

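
The body of `recoverPanic` is not shown in this document. A minimal sketch of what it might look like follows; the response body, the log format, the request path, and the `serve` and `boom` helpers are assumptions for illustration. Note that `recover()` only has an effect when called directly by the deferred function, which is why `recoverPanic` is deferred itself rather than wrapped in another call.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
)

// recoverPanic turns a panic in the current handler into a 500 response.
// It must be the deferred function itself for recover() to see the panic.
func recoverPanic(w http.ResponseWriter, r *http.Request) {
	if rec := recover(); rec != nil {
		log.Printf("panic in %s %s: %v", r.Method, r.URL.Path, rec)
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusInternalServerError)
		fmt.Fprint(w, `{"code":"INTERNAL_ERROR","message":"internal server error"}`)
	}
}

// errorMiddleware wraps a handler with panic recovery, as in the snippet above
// (written as a plain function here instead of a method on App).
func errorMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer recoverPanic(w, r)
		next.ServeHTTP(w, r)
	})
}

// boom is a handler that always panics.
func boom(w http.ResponseWriter, r *http.Request) {
	panic("boom")
}

// serve sends one request through the middleware and returns the recorded
// status and body. The request path is a made-up example.
func serve(h http.HandlerFunc) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/example", nil)
	errorMiddleware(h).ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	status, body := serve(boom)
	fmt.Println(status, body)
}
```

One limitation of this approach: if the handler has already written a status and part of the body before panicking, the recovery code cannot change the status, so the client may receive a truncated response.
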
### 4. Atomic Operations with Rollback

Service configuration operations are atomic with automatic rollback:

1. **Write to temporary file** (`*.atlas.tmp`)
2. **Backup existing config** (`.backup`)
3. **Atomically replace** config file
4. **Reload service**
5. **On failure**: Automatically restore backup

Example (SMB):
```go
// Write the new configuration to a temp file
if err := os.WriteFile(tmpPath, config, 0644); err != nil {
	return err
}

// Back up the existing config (copyFile stands in for a file-copy helper)
if err := copyFile(configPath, backupPath); err != nil {
	return err
}

// Atomic replace
if err := os.Rename(tmpPath, configPath); err != nil {
	return err
}

// Reload service
if err := reloadService(); err != nil {
	// Restore backup automatically
	os.Rename(backupPath, configPath)
	return err
}
```

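
The five steps can also be sketched as one self-contained function. This is an illustration under stated assumptions, not the AtlasOS implementation: `applyConfig`, its `reload` callback, and the `roundTrip` demo helper are hypothetical names, and the sketch chooses to delete the new file when a reload fails and no previous config existed.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// applyConfig replaces configPath with data and rolls back if reload fails.
// The name, signature, and reload callback are illustrative.
func applyConfig(configPath string, data []byte, reload func() error) error {
	tmpPath := configPath + ".atlas.tmp"
	backupPath := configPath + ".backup"

	// 1. Write the new config next to the target so the rename stays on one filesystem.
	if err := os.WriteFile(tmpPath, data, 0o644); err != nil {
		return fmt.Errorf("write temp config: %w", err)
	}
	defer os.Remove(tmpPath) // harmless once the rename has moved the file

	// 2. Back up the existing config, if there is one.
	old, err := os.ReadFile(configPath)
	hadConfig := err == nil
	if err != nil && !errors.Is(err, os.ErrNotExist) {
		return fmt.Errorf("read existing config: %w", err)
	}
	if hadConfig {
		if err := os.WriteFile(backupPath, old, 0o644); err != nil {
			return fmt.Errorf("write backup: %w", err)
		}
	}

	// 3. Atomically replace the config file.
	if err := os.Rename(tmpPath, configPath); err != nil {
		return fmt.Errorf("replace config: %w", err)
	}

	// 4. Reload the service; 5. on failure, restore the previous state.
	if err := reload(); err != nil {
		if hadConfig {
			if rerr := os.Rename(backupPath, configPath); rerr != nil {
				return fmt.Errorf("reload failed (%v); restore failed: %w", err, rerr)
			}
		} else {
			os.Remove(configPath) // nothing to restore, so remove the new file
		}
		return fmt.Errorf("reload service: %w", err)
	}
	return nil
}

// roundTrip applies "new" over a config containing "old" in a temp directory
// and returns the final file content together with applyConfig's error.
func roundTrip(failReload bool) (string, error) {
	dir, err := os.MkdirTemp("", "atlas-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "smb.conf")
	if err := os.WriteFile(path, []byte("old"), 0o644); err != nil {
		return "", err
	}
	applyErr := applyConfig(path, []byte("new"), func() error {
		if failReload {
			return errors.New("reload failed")
		}
		return nil
	})
	content, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	return string(content), applyErr
}

func main() {
	content, err := roundTrip(true)
	fmt.Printf("content=%q err=%v\n", content, err)
}
```

Writing the temp file in the same directory as the target matters because `os.Rename` is only atomic within a single filesystem.
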
## Retry Mechanisms

### Retry Configuration

The `errors.Retry` function provides configurable retry logic:

```go
config := errors.DefaultRetryConfig() // 3 attempts with exponential backoff
err := errors.Retry(func() error {
	return serviceOperation()
}, config)
```

### Default Retry Behavior

- **Max Attempts**: 3
- **Backoff**: Exponential (100ms, 200ms, 400ms)
- **Use Case**: Transient failures (network, temporary service unavailability)

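
The internals of `errors.Retry` are not shown here. The sketch below is one way such a helper could work, assuming a config with `MaxAttempts` and `InitialDelay` fields (both field names are hypothetical) and a delay that doubles after each failed attempt. In this sketch the delay is only slept between attempts, so three attempts wait 100ms and then 200ms; whether the real implementation also uses the 400ms step is not stated in this document.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// RetryConfig holds the retry parameters. Field names are assumptions.
type RetryConfig struct {
	MaxAttempts  int
	InitialDelay time.Duration
}

// DefaultRetryConfig returns the documented defaults: 3 attempts, 100ms base delay.
func DefaultRetryConfig() RetryConfig {
	return RetryConfig{MaxAttempts: 3, InitialDelay: 100 * time.Millisecond}
}

// Retry calls op until it succeeds or MaxAttempts is reached, doubling the
// delay after every failed attempt. It does not sleep after the last attempt.
func Retry(op func() error, cfg RetryConfig) error {
	attempts := cfg.MaxAttempts
	if attempts < 1 {
		attempts = 1 // always try at least once
	}
	delay := cfg.InitialDelay
	var err error
	for attempt := 1; attempt <= attempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if attempt < attempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("after %d attempts: %w", attempts, err)
}

// countAttempts runs Retry against an operation that fails the first
// `failures` times and reports how often it was called. It uses a 1ms base
// delay so the demo finishes quickly.
func countAttempts(failures int) (int, error) {
	calls := 0
	cfg := RetryConfig{MaxAttempts: 3, InitialDelay: time.Millisecond}
	err := Retry(func() error {
		calls++
		if calls <= failures {
			return errors.New("transient failure")
		}
		return nil
	}, cfg)
	return calls, err
}

func main() {
	calls, err := countAttempts(2)
	fmt.Println("calls:", calls, "err:", err)

	calls, err = countAttempts(5)
	fmt.Println("calls:", calls, "err:", err)
}
```

Wrapping the last error with `%w` keeps it available to `errors.Is` and `errors.As` in the caller.
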
## Error Recovery

### Service Configuration Recovery

When service configuration fails:

1. **Configuration is stored** (desired state preserved)
2. **Error is logged** (for debugging)
3. **Operation continues** (API request succeeds)
4. **Retry available** (manually via the API, or by a later automatic retry)

### Database Recovery

- **Connection failures**: Logged and retried
- **Transaction failures**: Rolled back automatically
- **Schema errors**: Detected during migration

### ZFS Operation Recovery

- **Command failures**: Returned as errors to the caller
- **Partial failures**: State is preserved, so the operation can be retried
- **Validation**: Performed before destructive operations

## Error Logging

All errors are logged with context:

```go
log.Printf("create SMB share error: %v", err)
log.Printf("%s service error: %v", serviceName, err)
```

Error logs include:
- Error message
- Operation context
- Resource identifiers
- Timestamp (via standard log)

## Best Practices

### 1. Always Use Structured Errors

```go
// Good
writeError(w, errors.ErrNotFound("pool").WithDetails(poolName))

// Avoid
writeJSON(w, http.StatusNotFound, map[string]string{"error": "not found"})
```

### 2. Handle Service Errors Gracefully

```go
// Good - graceful degradation
if err := service.Apply(); err != nil {
	log.Printf("service error (non-fatal): %v", err)
	// Continue - desired state is stored
}

// Avoid - failing the request
if err := service.Apply(); err != nil {
	return err // Don't fail the whole request
}
```

### 3. Validate Before Operations

```go
// Good - validate first
if !datasetExists {
	writeError(w, errors.ErrNotFound("dataset"))
	return
}
// Then perform operation
```

### 4. Use Context for Error Details

```go
// Good - include context
writeError(w, errors.ErrInternal("failed to create pool").WithDetails(err.Error()))

// Avoid - generic errors
writeError(w, errors.ErrInternal("error"))
```

## Error Response Format

All error responses follow this structure:

```json
{
  "code": "ERROR_CODE",
  "message": "Human-readable error message",
  "details": "Additional context (optional)"
}
```

HTTP status codes match error types:
- `400` - Bad Request / Validation Error
- `401` - Unauthorized
- `403` - Forbidden
- `404` - Not Found
- `409` - Conflict
- `500` - Internal Error
- `503` - Service Unavailable

## Future Enhancements

1. **Error Tracking**: Centralized error tracking and alerting
2. **Automatic Retry Queue**: Background retry for failed operations
3. **Error Metrics**: Track error rates by type and endpoint
4. **User-Friendly Messages**: More descriptive error messages
5. **Error Correlation**: Link related errors for debugging