othman.suseno/atlas

othman.suseno df475bc85e

logging and diagnostic features added

2025-12-15 00:45:14 +07:00

Error Handling & Recovery

Overview

AtlasOS handles errors through structured error responses, graceful degradation, and automatic recovery mechanisms, so that the system stays reliable and API clients receive consistent, actionable failures.

Error Types

Structured API Errors

All API errors follow a consistent structure:

{
  "code": "NOT_FOUND",
  "message": "dataset not found",
  "details": "tank/missing"
}

Error Codes

INTERNAL_ERROR - Unexpected server errors (500)
NOT_FOUND - Resource not found (404)
BAD_REQUEST - Invalid request parameters (400)
CONFLICT - Resource conflict (409)
UNAUTHORIZED - Authentication required (401)
FORBIDDEN - Insufficient permissions (403)
SERVICE_UNAVAILABLE - Service temporarily unavailable (503)
VALIDATION_ERROR - Input validation failed (400)
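
The mapping from error code to HTTP status can be sketched in Go as follows. This is a minimal illustration, not the project's actual `errors` package: the struct fields, the `statusFor` helper, and the message format are assumptions; only the names `APIError`, `ErrNotFound`, and `WithDetails` appear elsewhere in this document.

```go
package main

import (
	"fmt"
	"net/http"
)

// APIError mirrors the JSON structure shown above.
// The field names and tags are assumed for illustration.
type APIError struct {
	Code    string `json:"code"`
	Message string `json:"message"`
	Details string `json:"details,omitempty"`
}

// Error implements the error interface.
func (e *APIError) Error() string {
	if e.Details != "" {
		return fmt.Sprintf("%s: %s (%s)", e.Code, e.Message, e.Details)
	}
	return fmt.Sprintf("%s: %s", e.Code, e.Message)
}

// WithDetails returns a copy of the error carrying extra context.
func (e *APIError) WithDetails(details string) *APIError {
	c := *e
	c.Details = details
	return &c
}

// ErrNotFound builds a NOT_FOUND error for the named resource.
func ErrNotFound(resource string) *APIError {
	return &APIError{Code: "NOT_FOUND", Message: resource + " not found"}
}

// statusFor (hypothetical helper) maps an error code to the HTTP status
// listed above; unknown codes fall back to 500.
func statusFor(code string) int {
	switch code {
	case "NOT_FOUND":
		return http.StatusNotFound
	case "BAD_REQUEST", "VALIDATION_ERROR":
		return http.StatusBadRequest
	case "CONFLICT":
		return http.StatusConflict
	case "UNAUTHORIZED":
		return http.StatusUnauthorized
	case "FORBIDDEN":
		return http.StatusForbidden
	case "SERVICE_UNAVAILABLE":
		return http.StatusServiceUnavailable
	default:
		return http.StatusInternalServerError
	}
}

func main() {
	err := ErrNotFound("dataset").WithDetails("tank/missing")
	fmt.Println(statusFor(err.Code), err)
}
```

`WithDetails` returns a copy so that a shared base error is never mutated by a request handler.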

Error Handling Patterns

1. Structured Error Responses

All errors use the errors.APIError type for consistent formatting:

if resource == nil {
    writeError(w, errors.ErrNotFound("dataset").WithDetails(datasetName))
    return
}
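
For readers who want to see what a helper like `writeError` might do, here is a small sketch. The `apiError` struct, its `Status` field, and the `respond` test helper are assumptions made for this example; the real `writeError` signature is not shown in this document.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// apiError is a stand-in for errors.APIError; its fields are assumed.
type apiError struct {
	Status  int    `json:"-"` // HTTP status, not serialized
	Code    string `json:"code"`
	Message string `json:"message"`
	Details string `json:"details,omitempty"`
}

// writeError sends the error as JSON with its matching HTTP status.
func writeError(w http.ResponseWriter, e apiError) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(e.Status)
	// The status line is already sent, so an encoding failure
	// cannot be reported to the client and is ignored here.
	_ = json.NewEncoder(w).Encode(e)
}

// respond runs writeError against an in-memory recorder and returns
// the status code and body, which is convenient in handler tests.
func respond(e apiError) (int, string) {
	rec := httptest.NewRecorder()
	writeError(rec, e)
	return rec.Code, rec.Body.String()
}

func main() {
	status, body := respond(apiError{
		Status:  http.StatusNotFound,
		Code:    "NOT_FOUND",
		Message: "dataset not found",
		Details: "tank/missing",
	})
	fmt.Println(status, body)
}
```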

2. Graceful Degradation

Service operations (SMB/NFS/iSCSI) use graceful degradation:

Desired State Stored: The requested configuration is always persisted to the store first
Service Application: Service configuration is applied asynchronously
Non-Blocking: Service failures don't fail API requests
Retry Ready: Failed operations can be retried later

Example:

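// Store the configuration first (this is the desired state)
share, err := a.smbStore.Create(...)
if err != nil {
    writeError(w, errors.ErrInternal("failed to store SMB share").WithDetails(err.Error()))
    return
}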

// Apply to service (may fail, but doesn't block)
if err := a.smbService.ApplyConfiguration(shares); err != nil {
    // Log but don't fail - desired state is stored
    log.Printf("SMB service configuration failed (non-fatal): %v", err)
}

3. Panic Recovery

All HTTP handlers are wrapped with panic recovery middleware:

func (a *App) errorMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        defer recoverPanic(w, r)
        next.ServeHTTP(w, r)
    })
}

Panics are caught and converted to proper error responses instead of crashing the server.
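
The body of `recoverPanic` is not shown above. A minimal sketch of how such a function could work, assuming it logs the panic and writes an `INTERNAL_ERROR` response, is given below; the response body, the request path, and the `statusOfPanickingHandler` helper are illustrative only.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
)

// recoverPanic must be the deferred function itself: recover() only
// stops a panic when called directly by a deferred function.
func recoverPanic(w http.ResponseWriter, r *http.Request) {
	if rec := recover(); rec != nil {
		log.Printf("panic in %s %s: %v", r.Method, r.URL.Path, rec)
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(http.StatusInternalServerError)
		fmt.Fprint(w, `{"code":"INTERNAL_ERROR","message":"internal server error"}`)
	}
}

// errorMiddleware wraps a handler so that panics become 500 responses.
func errorMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer recoverPanic(w, r)
		next.ServeHTTP(w, r)
	})
}

// statusOfPanickingHandler sends one request through a handler that
// always panics and reports the status code the client would see.
func statusOfPanickingHandler() int {
	h := errorMiddleware(http.HandlerFunc(func(http.ResponseWriter, *http.Request) {
		panic("boom")
	}))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/example", nil))
	return rec.Code
}

func main() {
	fmt.Println(statusOfPanickingHandler())
}
```

One limitation of this approach: if the handler has already written a status line before panicking, the 500 status cannot replace it, so the log entry is the only reliable record of the failure.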

4. Atomic Operations with Rollback

Service configuration operations are atomic with automatic rollback:

Write to temporary file (*.atlas.tmp)
Backup existing config (.backup)
Atomically replace config file
Reload service
On failure: Automatically restore backup

Example (SMB):

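// Write the new config to a temp file
if err := os.WriteFile(tmpPath, config, 0644); err != nil {
    return err
}

// Back up the existing config (copyFile is an illustrative helper)
if err := copyFile(configPath, backupPath); err != nil {
    return err
}

// Atomic replace (rename is atomic when both paths share a filesystem)
if err := os.Rename(tmpPath, configPath); err != nil {
    return err
}

// Reload service
if err := reloadService(); err != nil {
    // Restore backup automatically
    os.Rename(backupPath, configPath)
    return err
}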
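
The same sequence can be written as a self-contained function and exercised against a temporary directory. This is a sketch under stated assumptions: `applyConfig`, its parameters, and `demoRollback` are hypothetical names, and the reload step is passed in as a function so that a failure can be simulated.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// applyConfig writes data to path via a temp file, keeps a backup of the
// previous contents, and restores that backup if reload returns an error.
func applyConfig(path string, data []byte, reload func() error) error {
	tmpPath := path + ".atlas.tmp"
	backupPath := path + ".backup"

	if err := os.WriteFile(tmpPath, data, 0o644); err != nil {
		return fmt.Errorf("write temp file: %w", err)
	}
	// Copy the current config aside; a missing file means nothing to back up.
	old, readErr := os.ReadFile(path)
	if readErr != nil && !errors.Is(readErr, os.ErrNotExist) {
		os.Remove(tmpPath)
		return fmt.Errorf("read existing config: %w", readErr)
	}
	hadOld := readErr == nil
	if hadOld {
		if err := os.WriteFile(backupPath, old, 0o644); err != nil {
			os.Remove(tmpPath)
			return fmt.Errorf("write backup: %w", err)
		}
	}
	// Rename is atomic when both paths are on the same filesystem.
	if err := os.Rename(tmpPath, path); err != nil {
		os.Remove(tmpPath)
		return fmt.Errorf("replace config: %w", err)
	}
	if err := reload(); err != nil {
		if hadOld {
			if rbErr := os.Rename(backupPath, path); rbErr != nil {
				return fmt.Errorf("reload failed: %v; rollback failed: %w", err, rbErr)
			}
		}
		return fmt.Errorf("reload failed: %w", err)
	}
	return nil
}

// demoRollback applies a new config with a failing reload and returns
// the file contents afterwards together with the error from applyConfig.
func demoRollback() (string, error) {
	dir, err := os.MkdirTemp("", "atlas-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "smb.conf")
	if err := os.WriteFile(path, []byte("old"), 0o644); err != nil {
		return "", err
	}
	applyErr := applyConfig(path, []byte("new"), func() error {
		return errors.New("service did not reload")
	})
	got, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	return string(got), applyErr
}

func main() {
	contents, err := demoRollback()
	fmt.Printf("contents=%q err=%v\n", contents, err)
}
```

Checking the error from the rollback rename matters in practice: if the restore itself fails, the service is left with a config it could not load, and the caller should be told about both failures.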

Retry Mechanisms

Retry Configuration

The errors.Retry function provides configurable retry logic:

config := errors.DefaultRetryConfig() // 3 attempts with exponential backoff
err := errors.Retry(func() error {
    return serviceOperation()
}, config)

Default Retry Behavior

Max Attempts: 3
Backoff: Exponential (100ms, 200ms, 400ms)
Use Case: Transient failures (network, temporary service unavailability)
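
A retry loop with exponential backoff along these lines might look like the sketch below. The `RetryConfig` field names, the `failTimes` helper, and the wrapped error message are assumptions; only `Retry` and `DefaultRetryConfig` are named in this document. Note that this sketch sleeps only between attempts, so three attempts wait 100ms and then 200ms; a 400ms delay would only occur before a fourth attempt.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// RetryConfig controls the retry loop. Field names are illustrative.
type RetryConfig struct {
	MaxAttempts  int
	InitialDelay time.Duration
	Multiplier   float64
}

// DefaultRetryConfig returns 3 attempts with a doubling delay from 100ms.
func DefaultRetryConfig() RetryConfig {
	return RetryConfig{MaxAttempts: 3, InitialDelay: 100 * time.Millisecond, Multiplier: 2}
}

// Retry calls fn until it succeeds or MaxAttempts is reached, sleeping
// between attempts and multiplying the delay after each failure.
func Retry(fn func() error, cfg RetryConfig) error {
	if cfg.MaxAttempts < 1 {
		cfg.MaxAttempts = 1 // always try at least once
	}
	delay := cfg.InitialDelay
	var err error
	for attempt := 1; attempt <= cfg.MaxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if attempt < cfg.MaxAttempts {
			time.Sleep(delay)
			delay = time.Duration(float64(delay) * cfg.Multiplier)
		}
	}
	return fmt.Errorf("after %d attempts: %w", cfg.MaxAttempts, err)
}

// failTimes returns an operation that fails n times and then succeeds.
func failTimes(n int) func() error {
	return func() error {
		if n > 0 {
			n--
			return errors.New("transient failure")
		}
		return nil
	}
}

func main() {
	cfg := DefaultRetryConfig()
	fmt.Println("two failures:", Retry(failTimes(2), cfg))
	fmt.Println("three failures:", Retry(failTimes(3), cfg))
}
```

Retrying is only appropriate for operations that are safe to repeat; a non-idempotent command should be validated or made idempotent before being wrapped this way.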

Error Recovery

Service Configuration Recovery

When service configuration fails:

Configuration is stored (desired state preserved)
Error is logged (for debugging)
Operation continues (API request succeeds)
Retry remains possible (manually via the API, or by a later automatic retry)
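
Because the desired state is never rolled back, a later retry only has to re-send what is already stored. The sketch below models that flow with hypothetical types (`share`, `store`, `applier`, `flakyService`) and a hypothetical `reapply` function; it does not reflect the actual AtlasOS store or service interfaces beyond the `ApplyConfiguration` method name used earlier.

```go
package main

import (
	"errors"
	"fmt"
)

// share is a simplified stand-in for a stored SMB share definition.
type share struct{ Name, Path string }

// store holds the desired state (hypothetical, in-memory).
type store struct{ shares []share }

// applier pushes the desired state to a running service.
type applier interface {
	ApplyConfiguration(shares []share) error
}

// reapply re-sends the stored desired state to the service. The store is
// left untouched on failure, so the call can be repeated safely.
func reapply(s *store, svc applier) error {
	if err := svc.ApplyConfiguration(s.shares); err != nil {
		return fmt.Errorf("reapply %d shares: %w", len(s.shares), err)
	}
	return nil
}

// flakyService rejects configuration until it is marked healthy.
type flakyService struct {
	healthy bool
	applied []share
}

func (f *flakyService) ApplyConfiguration(shares []share) error {
	if !f.healthy {
		return errors.New("service not running")
	}
	f.applied = append([]share(nil), shares...)
	return nil
}

func main() {
	s := &store{shares: []share{{Name: "media", Path: "/tank/media"}}}
	svc := &flakyService{}

	fmt.Println("first attempt:", reapply(s, svc))
	svc.healthy = true
	fmt.Println("second attempt:", reapply(s, svc), "applied:", len(svc.applied))
}
```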

Database Recovery

Connection failures: Logged and retried
Transaction failures: Rolled back automatically
Schema errors: Detected during migration

ZFS Operation Recovery

Command failures: Returned as errors to caller
Partial failures: State is preserved, operation can be retried
Validation: Performed before destructive operations

Error Logging

All errors are logged with context:

log.Printf("create SMB share error: %v", err)
log.Printf("%s service error: %v", serviceName, err)

Error logs include:

Error message
Operation context
Resource identifiers
Timestamp (via standard log)
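
One way to keep these fields consistent is a small helper that always logs the operation, the resource identifier, and the error together. The helper name `logOpError`, its log format, and the `capture` function are assumptions for this sketch; the document itself only shows direct `log.Printf` calls.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"log"
)

// logOpError (hypothetical) records the operation, the resource
// identifier, and the error in a single line.
func logOpError(l *log.Logger, op, resource string, err error) {
	l.Printf("%s error: resource=%q: %v", op, resource, err)
}

// capture logs to an in-memory buffer and returns the resulting line.
// The logger is created with flags 0, so no timestamp is added and the
// output is stable; the standard logger would prefix date and time.
func capture(op, resource, msg string) string {
	var buf bytes.Buffer
	logOpError(log.New(&buf, "", 0), op, resource, errors.New(msg))
	return buf.String()
}

func main() {
	fmt.Print(capture("create SMB share", "media", "permission denied"))
}
```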

Best Practices

1. Always Use Structured Errors

// Good
writeError(w, errors.ErrNotFound("pool").WithDetails(poolName))

// Avoid
writeJSON(w, http.StatusNotFound, map[string]string{"error": "not found"})

2. Handle Service Errors Gracefully

// Good - graceful degradation
if err := service.Apply(); err != nil {
    log.Printf("service error (non-fatal): %v", err)
    // Continue - desired state is stored
}

// Avoid - failing the request
if err := service.Apply(); err != nil {
    return err // Don't fail the whole request
}

3. Validate Before Operations

// Good - validate first
if !datasetExists {
    writeError(w, errors.ErrNotFound("dataset"))
    return
}
// Then perform operation

4. Use Context for Error Details

// Good - include context
writeError(w, errors.ErrInternal("failed to create pool").WithDetails(err.Error()))

// Avoid - generic errors
writeError(w, errors.ErrInternal("error"))

Error Response Format

All error responses follow this structure:

{
  "code": "ERROR_CODE",
  "message": "Human-readable error message",
  "details": "Additional context (optional)"
}

HTTP status codes match error types:

400 - Bad Request / Validation Error
401 - Unauthorized
403 - Forbidden
404 - Not Found
409 - Conflict
500 - Internal Error
503 - Service Unavailable

Future Enhancements

Error Tracking: Centralized error tracking and alerting
Automatic Retry Queue: Background retry for failed operations
Error Metrics: Track error rates by type and endpoint
User-Friendly Messages: More descriptive error messages
Error Correlation: Link related errors for debugging
