atlas/docs/ERROR_HANDLING.md
othman.suseno df475bc85e
logging and diagnostic features added
2025-12-15 00:45:14 +07:00

Error Handling & Recovery

Overview

AtlasOS handles errors through structured API error responses, graceful degradation of service operations, and automatic recovery mechanisms (panic recovery, rollback, and retries), so that failures stay contained and clients receive consistent responses.

Error Types

Structured API Errors

All API errors follow a consistent structure:

{
  "code": "NOT_FOUND",
  "message": "dataset not found",
  "details": "tank/missing"
}

Error Codes

  • INTERNAL_ERROR - Unexpected server errors (500)
  • NOT_FOUND - Resource not found (404)
  • BAD_REQUEST - Invalid request parameters (400)
  • CONFLICT - Resource conflict (409)
  • UNAUTHORIZED - Authentication required (401)
  • FORBIDDEN - Insufficient permissions (403)
  • SERVICE_UNAVAILABLE - Service temporarily unavailable (503)
  • VALIDATION_ERROR - Input validation failed (400)

Error Handling Patterns

1. Structured Error Responses

All errors use the errors.APIError type for consistent formatting:

if resource == nil {
    writeError(w, errors.ErrNotFound("dataset").WithDetails(datasetName))
    return
}

2. Graceful Degradation

Service operations (SMB/NFS/iSCSI) use graceful degradation:

  • Desired State Stored: Configuration is always stored in the store
  • Service Application: Service configuration is applied asynchronously
  • Non-Blocking: Service failures don't fail API requests
  • Retry Ready: Failed operations can be retried later

Example:

// Store the desired configuration first; only a store failure fails the request
share, err := a.smbStore.Create(...)
if err != nil {
    writeError(w, errors.ErrInternal("failed to create SMB share").WithDetails(err.Error()))
    return
}

// Apply to service (may fail, but doesn't block)
if err := a.smbService.ApplyConfiguration(shares); err != nil {
    // Log but don't fail - desired state is stored
    log.Printf("SMB service configuration failed (non-fatal): %v", err)
}

3. Panic Recovery

All HTTP handlers are wrapped with panic recovery middleware:

func (a *App) errorMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        defer recoverPanic(w, r)
        next.ServeHTTP(w, r)
    })
}

Panics are caught and converted to proper error responses instead of crashing the server.

4. Atomic Operations with Rollback

Service configuration operations are atomic with automatic rollback:

  1. Write to temporary file (*.atlas.tmp)
  2. Back up existing config (.backup)
  3. Atomically replace config file
  4. Reload service
  5. On failure: Automatically restore backup

Example (SMB):

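// Write to temp file (same directory as the config, so the rename stays atomic)
if err := os.WriteFile(tmpPath, config, 0644); err != nil {
    return err
}

// Backup existing (copyFile is a hypothetical helper that copies configPath to backupPath)
if err := copyFile(configPath, backupPath); err != nil {
    return err
}

// Atomic replace
if err := os.Rename(tmpPath, configPath); err != nil {
    return err
}

// Reload service
if err := reloadService(); err != nil {
    // Restore backup automatically
    os.Rename(backupPath, configPath)
    return err
}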

Retry Mechanisms

Retry Configuration

The errors.Retry function provides configurable retry logic:

config := errors.DefaultRetryConfig() // 3 attempts with exponential backoff
err := errors.Retry(func() error {
    return serviceOperation()
}, config)

Default Retry Behavior

  • Max Attempts: 3
  • Backoff: Exponential (100ms, 200ms, 400ms)
  • Use Case: Transient failures (network, temporary service unavailability)

Error Recovery

Service Configuration Recovery

When service configuration fails:

  1. Configuration is stored (desired state preserved)
  2. Error is logged (for debugging)
  3. Operation continues (API request succeeds)
  4. Retry remains possible (manually via the API, or automatically at a later point)

Database Recovery

  • Connection failures: Logged and retried
  • Transaction failures: Rolled back automatically
  • Schema errors: Detected during migration

ZFS Operation Recovery

  • Command failures: Returned as errors to caller
  • Partial failures: State is preserved, operation can be retried
  • Validation: Performed before destructive operations

Error Logging

All errors are logged with context:

log.Printf("create SMB share error: %v", err)
log.Printf("%s service error: %v", serviceName, err)

Error logs include:

  • Error message
  • Operation context
  • Resource identifiers
  • Timestamp (via standard log)

Best Practices

1. Always Use Structured Errors

// Good
writeError(w, errors.ErrNotFound("pool").WithDetails(poolName))

// Avoid
writeJSON(w, http.StatusNotFound, map[string]string{"error": "not found"})

2. Handle Service Errors Gracefully

// Good - graceful degradation
if err := service.Apply(); err != nil {
    log.Printf("service error (non-fatal): %v", err)
    // Continue - desired state is stored
}

// Avoid - failing the request
if err := service.Apply(); err != nil {
    return err // Don't fail the whole request
}

3. Validate Before Operations

// Good - validate first
if !datasetExists {
    writeError(w, errors.ErrNotFound("dataset"))
    return
}
// Then perform operation

4. Use Context for Error Details

// Good - include context
writeError(w, errors.ErrInternal("failed to create pool").WithDetails(err.Error()))

// Avoid - generic errors
writeError(w, errors.ErrInternal("error"))

Error Response Format

All error responses follow this structure:

{
  "code": "ERROR_CODE",
  "message": "Human-readable error message",
  "details": "Additional context (optional)"
}

HTTP status codes match error types:

  • 400 - Bad Request / Validation Error
  • 401 - Unauthorized
  • 403 - Forbidden
  • 404 - Not Found
  • 409 - Conflict
  • 500 - Internal Error
  • 503 - Service Unavailable

Future Enhancements

  1. Error Tracking: Centralized error tracking and alerting
  2. Automatic Retry Queue: Background retry for failed operations
  3. Error Metrics: Track error rates by type and endpoint
  4. User-Friendly Messages: More descriptive error messages
  5. Error Correlation: Link related errors for debugging