atlas/docs/ERROR_HANDLING.md
othman.suseno df475bc85e
logging and diagnostic features added
2025-12-15 00:45:14 +07:00

# Error Handling & Recovery
## Overview
AtlasOS handles errors through structured error responses, graceful degradation, and automatic recovery mechanisms, so that failures stay contained and the user experience remains good.
## Error Types
### Structured API Errors
All API errors follow a consistent structure:
```json
{
  "code": "NOT_FOUND",
  "message": "dataset not found",
  "details": "tank/missing"
}
```
### Error Codes
- `INTERNAL_ERROR` - Unexpected server errors (500)
- `NOT_FOUND` - Resource not found (404)
- `BAD_REQUEST` - Invalid request parameters (400)
- `CONFLICT` - Resource conflict (409)
- `UNAUTHORIZED` - Authentication required (401)
- `FORBIDDEN` - Insufficient permissions (403)
- `SERVICE_UNAVAILABLE` - Service temporarily unavailable (503)
- `VALIDATION_ERROR` - Input validation failed (400)
## Error Handling Patterns
### 1. Structured Error Responses
All errors use the `errors.APIError` type for consistent formatting:
```go
if resource == nil {
	writeError(w, errors.ErrNotFound("dataset").WithDetails(datasetName))
	return
}
```
### 2. Graceful Degradation
Service operations (SMB/NFS/iSCSI) use graceful degradation:
- **Desired State Stored**: Configuration is always stored in the store
- **Service Application**: Service configuration is applied asynchronously
- **Non-Blocking**: Service failures don't fail API requests
- **Retry Ready**: Failed operations can be retried later
Example:
```go
// Store the desired state; only a store failure fails the request
share, err := a.smbStore.Create(...)
if err != nil {
	writeError(w, errors.ErrInternal("failed to create SMB share").WithDetails(err.Error()))
	return
}
// Apply to service (may fail, but doesn't block)
if err := a.smbService.ApplyConfiguration(shares); err != nil {
	// Log but don't fail - desired state is stored
	log.Printf("SMB service configuration failed (non-fatal): %v", err)
}
```
### 3. Panic Recovery
All HTTP handlers are wrapped with panic recovery middleware:
```go
func (a *App) errorMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer recoverPanic(w, r)
		next.ServeHTTP(w, r)
	})
}
```
Panics are caught and converted to proper error responses instead of crashing the server.
### 4. Atomic Operations with Rollback
Service configuration operations are atomic with automatic rollback:
1. **Write to temporary file** (`*.atlas.tmp`)
2. **Back up existing config** (`.backup`)
3. **Atomically replace** config file
4. **Reload service**
5. **On failure**: Automatically restore backup
Example (SMB):
```go
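// Write the new config to a temp file
if err := os.WriteFile(tmpPath, config, 0644); err != nil {
	return err
}
// Back up the existing config
old, err := os.ReadFile(configPath)
if err != nil {
	return err
}
if err := os.WriteFile(backupPath, old, 0644); err != nil {
	return err
}
// Atomic replace
if err := os.Rename(tmpPath, configPath); err != nil {
	return err
}
// Reload service
if err := reloadService(); err != nil {
	// Restore backup automatically (best effort)
	os.Rename(backupPath, configPath)
	return err
}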
```
## Retry Mechanisms
### Retry Configuration
The `errors.Retry` function provides configurable retry logic:
```go
config := errors.DefaultRetryConfig() // 3 attempts with exponential backoff
err := errors.Retry(func() error {
	return serviceOperation()
}, config)
```
### Default Retry Behavior
- **Max Attempts**: 3
- **Backoff**: Exponential (100ms, 200ms, 400ms)
- **Use Case**: Transient failures (network, temporary service unavailability)
## Error Recovery
### Service Configuration Recovery
When service configuration fails:
1. **Configuration is stored** (desired state preserved)
2. **Error is logged** (for debugging)
3. **Operation continues** (API request succeeds)
4. **Retry remains possible** (manually via the API, or by a later automatic retry)
### Database Recovery
- **Connection failures**: Logged and retried
- **Transaction failures**: Rolled back automatically
- **Schema errors**: Detected during migration
### ZFS Operation Recovery
- **Command failures**: Returned as errors to caller
- **Partial failures**: State is preserved, operation can be retried
- **Validation**: Performed before destructive operations
## Error Logging
All errors are logged with context:
```go
log.Printf("create SMB share error: %v", err)
log.Printf("%s service error: %v", serviceName, err)
```
Error logs include:
- Error message
- Operation context
- Resource identifiers
- Timestamp (added by the standard `log` package)
## Best Practices
### 1. Always Use Structured Errors
```go
// Good
writeError(w, errors.ErrNotFound("pool").WithDetails(poolName))
// Avoid
writeJSON(w, http.StatusNotFound, map[string]string{"error": "not found"})
```
### 2. Handle Service Errors Gracefully
```go
// Good - graceful degradation
if err := service.Apply(); err != nil {
	log.Printf("service error (non-fatal): %v", err)
	// Continue - desired state is stored
}
// Avoid - failing the request
if err := service.Apply(); err != nil {
	return err // Don't fail the whole request
}
```
### 3. Validate Before Operations
```go
// Good - validate first
if !datasetExists {
	writeError(w, errors.ErrNotFound("dataset"))
	return
}
// Then perform operation
```
### 4. Use Context for Error Details
```go
// Good - include context
writeError(w, errors.ErrInternal("failed to create pool").WithDetails(err.Error()))
// Avoid - generic errors
writeError(w, errors.ErrInternal("error"))
```
## Error Response Format
All error responses follow this structure:
```json
{
  "code": "ERROR_CODE",
  "message": "Human-readable error message",
  "details": "Additional context (optional)"
}
```
HTTP status codes match error types:
- `400` - Bad Request / Validation Error
- `401` - Unauthorized
- `403` - Forbidden
- `404` - Not Found
- `409` - Conflict
- `500` - Internal Error
- `503` - Service Unavailable
## Future Enhancements
1. **Error Tracking**: Centralized error tracking and alerting
2. **Automatic Retry Queue**: Background retry for failed operations
3. **Error Metrics**: Track error rates by type and endpoint
4. **User-Friendly Messages**: More descriptive error messages
5. **Error Correlation**: Link related errors for debugging