atlas/docs/BACKGROUND_JOBS.md

# Background Job System

The AtlasOS API includes a background job system that automatically executes snapshot policies and manages long-running operations.

## Architecture

### Components

1. **Job Manager** (`internal/job/manager.go`)
   - Tracks job lifecycle (pending, running, completed, failed, cancelled)
   - Stores job metadata and progress
   - Thread-safe job operations

2. **Snapshot Scheduler** (`internal/snapshot/scheduler.go`)
   - Automatically creates snapshots based on policies
   - Prunes old snapshots based on retention rules
   - Runs every 15 minutes (the interval is hardcoded; see Configuration)

3. **Integration**
   - Scheduler starts automatically when API server starts
   - Gracefully stops on server shutdown
   - Jobs are accessible via API endpoints

## How It Works

### Snapshot Creation

The scheduler checks all enabled snapshot policies every 15 minutes and:

1. **Frequent snapshots**: Creates every 15 minutes if `frequent > 0`
2. **Hourly snapshots**: Creates every hour if `hourly > 0`
3. **Daily snapshots**: Creates daily if `daily > 0`
4. **Weekly snapshots**: Creates weekly if `weekly > 0`
5. **Monthly snapshots**: Creates monthly if `monthly > 0`
6. **Yearly snapshots**: Creates yearly if `yearly > 0`

Snapshot names follow the pattern `{type}-{timestamp}`, where the timestamp is formatted as `YYYYMMDD-HHMMSS` (e.g., `hourly-20241214-143000`).

### Snapshot Pruning

When `autoprune` is enabled, the scheduler:

1. Groups snapshots by type (frequent, hourly, daily, etc.)
2. Sorts by creation time (newest first)
3. Keeps the newest N snapshots of each type, where N is that type's count in the policy (e.g., `hourly: 24` keeps 24 hourly snapshots)
4. Deletes the remaining, older snapshots of that type

### Job Tracking

Every snapshot operation creates a job that tracks:
- Status (pending → running → completed/failed)
- Progress (0-100%)
- Error messages (if failed)
- Timestamps (created, started, completed)

## API Endpoints

### List Jobs

Filter by job status with the optional `status` query parameter.

```http
GET /api/v1/jobs
GET /api/v1/jobs?status=running
```

### Get Job
```http
GET /api/v1/jobs/{id}
```

### Cancel Job
```http
POST /api/v1/jobs/{id}/cancel
```
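A small Go client for these three endpoints might look like the following. The helper names are hypothetical, `listJobs` and `getJob` return the raw JSON body because the response schema is not documented here, and treating any 2xx status as a successful cancel is an assumption.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
)

// jobsURL builds GET /api/v1/jobs, adding ?status=... when status is set.
func jobsURL(base, status string) string {
	u := strings.TrimRight(base, "/") + "/api/v1/jobs"
	if status != "" {
		u += "?" + url.Values{"status": {status}}.Encode()
	}
	return u
}

// jobURL builds /api/v1/jobs/{id}, escaping the ID for use in a path.
func jobURL(base, id string) string {
	return strings.TrimRight(base, "/") + "/api/v1/jobs/" + url.PathEscape(id)
}

// getJSON performs a GET and returns the raw body for the caller to decode.
func getJSON(c *http.Client, u string) ([]byte, error) {
	resp, err := c.Get(u)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("GET %s: %s", u, resp.Status)
	}
	return body, nil
}

func listJobs(c *http.Client, base, status string) ([]byte, error) {
	return getJSON(c, jobsURL(base, status))
}

func getJob(c *http.Client, base, id string) ([]byte, error) {
	return getJSON(c, jobURL(base, id))
}

// cancelJob sends POST /api/v1/jobs/{id}/cancel with an empty body.
func cancelJob(c *http.Client, base, id string) error {
	resp, err := c.Post(jobURL(base, id)+"/cancel", "application/json", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("cancel job %s: %s", id, resp.Status)
	}
	return nil
}
```

For example, `listJobs(http.DefaultClient, "http://localhost:8080", "running")` issues `GET /api/v1/jobs?status=running`.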

## Configuration

The scheduler interval is hardcoded to 15 minutes. To change it, edit the `scheduler.Start` call and rebuild the server:

```go
// In internal/httpapp/app.go
scheduler.Start(15 * time.Minute)  // Change interval here
```

## Example Workflow

1. **Create a snapshot policy:**
```bash
curl -X POST http://localhost:8080/api/v1/snapshot-policies \
  -H "Content-Type: application/json" \
  -d '{
    "dataset": "pool/dataset",
    "hourly": 24,
    "daily": 7,
    "autosnap": true,
    "autoprune": true
  }'
```

2. **Scheduler automatically:**
   - Creates hourly snapshots (keeps 24)
   - Creates daily snapshots (keeps 7)
   - Prunes old snapshots beyond retention

3. **Monitor jobs:**
```bash
curl http://localhost:8080/api/v1/jobs
```

## Job Statuses

- `pending`: Job created but not yet started
- `running`: Job is currently executing
- `completed`: Job finished successfully
- `failed`: Job encountered an error
- `cancelled`: Job was cancelled by a user via `POST /api/v1/jobs/{id}/cancel`
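The status set above can be modeled as a string type with an explicit transition table. Which transitions the job manager actually permits is not documented, so the table below is an assumption; in particular, it is unclear whether a `running` snapshot job can be cancelled, given that snapshot operations are synchronous.

```go
package main

import "fmt"

type JobStatus string

const (
	StatusPending   JobStatus = "pending"
	StatusRunning   JobStatus = "running"
	StatusCompleted JobStatus = "completed"
	StatusFailed    JobStatus = "failed"
	StatusCancelled JobStatus = "cancelled"
)

// allowed lists the transitions this sketch permits (an assumption).
var allowed = map[JobStatus][]JobStatus{
	StatusPending: {StatusRunning, StatusCancelled},
	StatusRunning: {StatusCompleted, StatusFailed, StatusCancelled},
}

// IsTerminal reports whether a job in this status can no longer change.
func (s JobStatus) IsTerminal() bool {
	return s == StatusCompleted || s == StatusFailed || s == StatusCancelled
}

// transition returns an error if moving from one status to another is not
// in the allowed table.
func transition(from, to JobStatus) error {
	for _, next := range allowed[from] {
		if next == to {
			return nil
		}
	}
	return fmt.Errorf("invalid job status transition %s -> %s", from, to)
}
```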

## Notes

- Jobs are stored in memory and are lost when the server restarts
- Scheduler runs in a background goroutine
- Snapshot operations are synchronous (blocking)
- For production, consider:
  - Database persistence for jobs
  - Async job execution with worker pool
  - Job history retention policies
  - Metrics/alerting for failed jobs