# Error Handling & Recovery

## Overview

AtlasOS implements comprehensive error handling with structured error responses, graceful degradation, and automatic recovery mechanisms to ensure system reliability and a good user experience.

## Error Types

### Structured API Errors

All API errors follow a consistent structure:

```json
{
  "code": "NOT_FOUND",
  "message": "dataset not found",
  "details": "tank/missing"
}
```

### Error Codes

- `INTERNAL_ERROR` - Unexpected server errors (500)
- `NOT_FOUND` - Resource not found (404)
- `BAD_REQUEST` - Invalid request parameters (400)
- `CONFLICT` - Resource conflict (409)
- `UNAUTHORIZED` - Authentication required (401)
- `FORBIDDEN` - Insufficient permissions (403)
- `SERVICE_UNAVAILABLE` - Service temporarily unavailable (503)
- `VALIDATION_ERROR` - Input validation failed (400)

## Error Handling Patterns

### 1. Structured Error Responses

All errors use the `errors.APIError` type for consistent formatting:

```go
if resource == nil {
	writeError(w, errors.ErrNotFound("dataset").WithDetails(datasetName))
	return
}
```

### 2. Graceful Degradation

Service operations (SMB/NFS/iSCSI) use graceful degradation:

- **Desired State Stored**: Configuration is always stored in the store
- **Service Application**: Service configuration is applied asynchronously
- **Non-Blocking**: Service failures don't fail API requests
- **Retry Ready**: Failed operations can be retried later

Example:

```go
// Store the configuration (always succeeds)
share, err := a.smbStore.Create(...)

// Apply to service (may fail, but doesn't block)
if err := a.smbService.ApplyConfiguration(shares); err != nil {
	// Log but don't fail - desired state is stored
	log.Printf("SMB service configuration failed (non-fatal): %v", err)
}
```

### 3. Panic Recovery

All HTTP handlers are wrapped with panic recovery middleware:

```go
func (a *App) errorMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer recoverPanic(w, r)
		next.ServeHTTP(w, r)
	})
}
```

Panics are caught and converted to proper error responses instead of crashing the server.

### 4. Atomic Operations with Rollback

Service configuration operations are atomic with automatic rollback:

1. **Write to temporary file** (`*.atlas.tmp`)
2. **Back up existing config** (`.backup`)
3. **Atomically replace** config file
4. **Reload service**
5. **On failure**: Automatically restore backup

Example (SMB):

```go
// Write the new configuration to a temporary file
if err := os.WriteFile(tmpPath, config, 0644); err != nil {
	return err
}
// Back up the existing configuration
// (copyFile is an illustrative helper, equivalent to `cp config.conf config.conf.backup`)
if err := copyFile(configPath, backupPath); err != nil {
	return err
}
// Atomically replace the live configuration
if err := os.Rename(tmpPath, configPath); err != nil {
	return err
}
// Reload the service
if err := reloadService(); err != nil {
	// Restore the backup automatically
	os.Rename(backupPath, configPath)
	return err
}
```

## Retry Mechanisms

### Retry Configuration

The `errors.Retry` function provides configurable retry logic:

```go
config := errors.DefaultRetryConfig() // 3 attempts with exponential backoff
err := errors.Retry(func() error {
	return serviceOperation()
}, config)
```

### Default Retry Behavior

- **Max Attempts**: 3
- **Backoff**: Exponential (100ms, 200ms, 400ms)
- **Use Case**: Transient failures (network, temporary service unavailability)

## Error Recovery

### Service Configuration Recovery

When service configuration fails:

1. **Configuration is stored** (desired state preserved)
2. **Error is logged** (for debugging)
3. **Operation continues** (API request succeeds)
4. **Manual retry available** (via API or automatic retry later)

### Database Recovery

- **Connection failures**: Logged and retried
- **Transaction failures**: Rolled back automatically
- **Schema errors**: Detected during migration

### ZFS Operation Recovery

- **Command failures**: Returned as errors to caller
- **Partial failures**: State is preserved, operation can be retried
- **Validation**: Performed before destructive operations

## Error Logging

All errors are logged with context:

```go
log.Printf("create SMB share error: %v", err)
log.Printf("%s service error: %v", serviceName, err)
```

Error logs include:

- Error message
- Operation context
- Resource identifiers
- Timestamp (via standard log)

## Best Practices

### 1. Always Use Structured Errors

```go
// Good
writeError(w, errors.ErrNotFound("pool").WithDetails(poolName))

// Avoid
writeJSON(w, http.StatusNotFound, map[string]string{"error": "not found"})
```

### 2. Handle Service Errors Gracefully

```go
// Good - graceful degradation
if err := service.Apply(); err != nil {
	log.Printf("service error (non-fatal): %v", err)
	// Continue - desired state is stored
}

// Avoid - failing the request
if err := service.Apply(); err != nil {
	return err // Don't fail the whole request
}
```

### 3. Validate Before Operations

```go
// Good - validate first
if !datasetExists {
	writeError(w, errors.ErrNotFound("dataset"))
	return
}
// Then perform operation
```

### 4. Use Context for Error Details

```go
// Good - include context
writeError(w, errors.ErrInternal("failed to create pool").WithDetails(err.Error()))

// Avoid - generic errors
writeError(w, errors.ErrInternal("error"))
```

## Error Response Format

All error responses follow this structure:

```json
{
  "code": "ERROR_CODE",
  "message": "Human-readable error message",
  "details": "Additional context (optional)"
}
```

HTTP status codes match error types:

- `400` - Bad Request / Validation Error
- `401` - Unauthorized
- `403` - Forbidden
- `404` - Not Found
- `409` - Conflict
- `500` - Internal Error
- `503` - Service Unavailable

## Future Enhancements

1. **Error Tracking**: Centralized error tracking and alerting
2. **Automatic Retry Queue**: Background retry for failed operations
3. **Error Metrics**: Track error rates by type and endpoint
4. **User-Friendly Messages**: More descriptive error messages
5. **Error Correlation**: Link related errors for debugging