hephaestus-hpc-api/srs-sds/HCP_SDS_v1.md
2025-12-30 13:00:47 +07:00


Hypervisor Control Plane (HCP)

Software Design Specification (SDS)

Version: 1.0 (V1 Enterprise Foundation)


1. Architecture Overview

HCP implements a compute control-plane pattern built from the following layers:

  • Northbound API: Stable and provider-agnostic; consumed by the Central Gateway and the console.
  • Core: Job orchestration, policy hooks, state persistence, and audit/event handling.
  • Southbound Provider Layer: One adapter/driver per hypervisor/provider.
  • Workers/Agents: Execute the jobs that affect infrastructure.

HCP supports two execution modes:

  1. Direct mode: the worker calls the provider API directly (quick to bootstrap).
  2. Agent mode: jobs are dispatched to an agent located near the cluster (better suited to enterprise deployments: multi-site, firewall-friendly).

2. Main Components

2.1 HCP API Service

Responsibilities:

  • Expose the REST API (tenant/ops) for compute
  • Enforce AuthN/AuthZ (scope-based)
  • Validate requests and enforce idempotency
  • Persist desired state and job records
  • Publish jobs to the queue/stream

2.2 HCP Worker Service

Responsibilities:

  • Subscribe to the job queue
  • Run the job state machine (RUNNING/RETRYING/FAILED/SUCCEEDED)
  • Call the provider adapter
  • Update VM/job state in the datastore
  • Emit audit and metering events

2.3 Provider Adapter Layer

Responsibilities:

  • Implement the generic provider contract
  • Map the generic spec to the provider-specific API
  • Normalize provider errors into the HCP error taxonomy
  • Normalize the VM's actual state into the internal model
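
As an illustration of the error-normalization responsibility, the following minimal sketch maps a raw provider outcome onto a small taxonomy. The specification does not enumerate the HCP error codes, so the `HcpError` members and the status-code mapping are assumptions made for this example only.

```python
from enum import Enum
from typing import Optional


class HcpError(Enum):
    # Hypothetical taxonomy; the specification does not list the actual codes.
    NOT_FOUND = "NOT_FOUND"
    CONFLICT = "CONFLICT"
    PROVIDER_TIMEOUT = "PROVIDER_TIMEOUT"
    PROVIDER_UNAVAILABLE = "PROVIDER_UNAVAILABLE"
    INTERNAL = "INTERNAL"


def normalize_provider_error(status: Optional[int], timed_out: bool = False) -> HcpError:
    """Map a raw provider HTTP outcome onto the (assumed) HCP taxonomy."""
    if timed_out:
        return HcpError.PROVIDER_TIMEOUT
    if status == 404:
        return HcpError.NOT_FOUND
    if status == 409:
        return HcpError.CONFLICT
    if status is not None and 500 <= status <= 599:
        return HcpError.PROVIDER_UNAVAILABLE
    # Anything unrecognized falls back to a generic internal error.
    return HcpError.INTERNAL
```

Keeping this mapping inside each adapter means the worker and the API only ever reason about HCP codes, never about provider-specific responses.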

2.4 Data Store

  • PostgreSQL for persistent state (tenancy bindings, vms, jobs, catalog, locations)
  • Event store (append-only) for audit (either a dedicated table or a log pipeline)
  • Queue/stream for job distribution (NATS JetStream / RabbitMQ)

3. Domain Model

3.1 Core Entities

  • Location/Zone
  • Provider
  • ComputeCluster (pool/cluster per provider, bound to a location)
  • Image (catalog)
  • Flavor (catalog)
  • VM
  • Job
  • AuditEvent

3.2 Resource Fields (VM)

At minimum, a VM has:

  • id, org_id, project_id
  • name, status
  • image_id, flavor_id
  • placement (location_id, cluster_id optional)
  • addresses (read-model)
  • labels/tags
  • provider_id
  • provider_ref (opaque/internal)
  • timestamps

3.3 Job Fields

  • id, type
  • resource_type, resource_id
  • state (PENDING/RUNNING/SUCCEEDED/FAILED/RETRYING)
  • attempt, max_attempt
  • error_code, error_message
  • timestamps
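
The VM and job fields from sections 3.2 and 3.3 could be modeled roughly as follows. This is a sketch under assumptions: the field types, the flattening of placement into `location_id`/`cluster_id`, the single `created_at` timestamp, and the default of three attempts are not stated in the specification.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class VM:
    id: str
    org_id: str
    project_id: str
    name: str
    status: str
    image_id: str
    flavor_id: str
    location_id: str                                # placement.location_id
    provider_id: str
    cluster_id: Optional[str] = None                # placement.cluster_id is optional
    addresses: list = field(default_factory=list)   # read-model, filled from actual state
    labels: dict = field(default_factory=dict)
    provider_ref: Optional[dict] = None             # opaque, internal only
    created_at: datetime = field(default_factory=_now)


@dataclass
class Job:
    id: str
    type: str
    resource_type: str
    resource_id: str
    state: str = "PENDING"
    attempt: int = 0
    max_attempt: int = 3                            # assumed default, not from the spec
    error_code: Optional[str] = None
    error_message: Optional[str] = None
    created_at: datetime = field(default_factory=_now)
```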

4. Northbound API Design

4.1 Namespace

Separating the tenant and ops namespaces is recommended, even though HCP can be reached through the Central Gateway:

  • Tenant: /api/hcp/tenant/v1/*
  • Ops: /api/hcp/ops/v1/*
  • Common catalog (read): /api/hcp/common/v1/*

4.2 Async Pattern

  • Create/modify/delete operations return 202 Accepted with a job_id.
  • Job status can be polled: GET /jobs/{job_id}.
  • Resources can be polled: GET /vms/{id}.
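
The accept-then-poll flow can be sketched without any web framework. In this minimal example the in-memory `JOBS` dictionary stands in for the jobs table, and the function names, the response shapes, and the `NOT_FOUND` error code are assumptions rather than part of the specification.

```python
import uuid

JOBS: dict = {}  # in-memory stand-in for the jobs table


def create_vm(project_id: str, body: dict) -> tuple:
    """Accept the request, record a PENDING job, and answer 202 with its job_id."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {
        "id": job_id,
        "type": "provision_vm",
        "state": "PENDING",
        "project_id": project_id,
        "spec": body,
    }
    # A real service would also persist desired state and publish the job to the queue.
    return 202, {"job_id": job_id}


def get_job(job_id: str) -> tuple:
    """Polling handler for GET /jobs/{job_id}."""
    job = JOBS.get(job_id)
    if job is None:
        return 404, {"error": "NOT_FOUND"}
    return 200, {"id": job["id"], "state": job["state"]}
```

A client would call `create_vm`, keep the returned `job_id`, and poll `get_job` until the state is terminal.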

4.3 Core Endpoints (V1)

Tenant:

  • POST /projects/{projectId}/vms
  • GET /projects/{projectId}/vms
  • GET /projects/{projectId}/vms/{vmId}
  • POST /projects/{projectId}/vms/{vmId}:start
  • POST /projects/{projectId}/vms/{vmId}:stop
  • POST /projects/{projectId}/vms/{vmId}:reboot
  • DELETE /projects/{projectId}/vms/{vmId}
  • POST /projects/{projectId}/vms/{vmId}:console
  • GET /jobs/{jobId}

Ops:

  • POST /providers
  • POST /locations
  • POST /compute-clusters
  • GET /providers
  • GET /compute-clusters
  • POST /catalog/images (optional in v1, if the platform requires it)
  • POST /catalog/flavors (optional in v1)

Common:

  • GET /catalog/images
  • GET /catalog/flavors
  • GET /capabilities

5. Capability Model

HCP stores and exposes capability flags on providers/clusters, for example:

  • supports_cloud_init
  • supports_snapshot
  • supports_live_migration
  • supports_console_vnc
  • supports_console_spice
  • supports_uefi
  • supports_gpu_passthrough
  • supports_secure_boot
  • supports_tags

Usage:

  • The UI and upstream services can adapt the features they display.
  • The API returns FEATURE_NOT_SUPPORTED when an action is unavailable.
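
A capability gate in front of an action might look like the sketch below. The flag names and the FEATURE_NOT_SUPPORTED code come from this section; the action names and the action-to-flag mapping are hypothetical.

```python
class FeatureNotSupported(Exception):
    code = "FEATURE_NOT_SUPPORTED"


# Hypothetical mapping from an action to the capability flag it requires.
REQUIRED_CAPABILITY = {
    "snapshot": "supports_snapshot",
    "live_migrate": "supports_live_migration",
    "console_vnc": "supports_console_vnc",
}


def ensure_supported(action: str, capabilities: dict) -> None:
    """Raise FeatureNotSupported when the cluster lacks the required flag."""
    flag = REQUIRED_CAPABILITY.get(action)
    # A missing flag is treated as "not supported" rather than as an error.
    if flag is not None and not capabilities.get(flag, False):
        raise FeatureNotSupported(f"{action} requires {flag}")
```

Actions without an entry in the mapping pass through unchecked, which suits core operations such as start and stop that every provider is expected to implement.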

6. Provider Interface (Conceptual)

6.1 ComputeProvider (minimum)

  • CreateVM(spec) -> ProviderRef
  • DeleteVM(ref)
  • StartVM(ref)
  • StopVM(ref)
  • RebootVM(ref)
  • GetVM(ref) -> ActualState
  • ListVMs(scope) -> []ActualState (optional, for reconciliation)

6.2 ConsoleProvider

  • GetConsole(ref) -> ConsoleSession(type, url/token, expires_at)

6.3 Catalog Providers (optional)

  • ListImages(scope)
  • ImportImage(source)
  • DeleteImage(id)

ProviderRef is opaque:

  • provider + external_id + location_id + extra(json)
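
The minimum ComputeProvider contract can be expressed as a structural interface. The sketch below uses Python naming (`create_vm` for CreateVM, and so on) and represents ActualState as a plain dictionary, both of which are assumptions. `InMemoryProvider` is a test fake, not a real hypervisor driver.

```python
import itertools
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ProviderRef:
    provider: str
    external_id: str
    location_id: str
    extra: dict = field(default_factory=dict)


class ComputeProvider(Protocol):
    def create_vm(self, spec: dict) -> ProviderRef: ...
    def delete_vm(self, ref: ProviderRef) -> None: ...
    def start_vm(self, ref: ProviderRef) -> None: ...
    def stop_vm(self, ref: ProviderRef) -> None: ...
    def reboot_vm(self, ref: ProviderRef) -> None: ...
    def get_vm(self, ref: ProviderRef) -> dict: ...


class InMemoryProvider:
    """Fake provider that keeps VMs in a dictionary, for tests only."""

    def __init__(self, location_id: str) -> None:
        self.location_id = location_id
        self._vms: dict = {}
        self._ids = itertools.count(100)

    def create_vm(self, spec: dict) -> ProviderRef:
        external_id = str(next(self._ids))
        self._vms[external_id] = {"name": spec["name"], "power": "stopped"}
        return ProviderRef("inmemory", external_id, self.location_id)

    def delete_vm(self, ref: ProviderRef) -> None:
        self._vms.pop(ref.external_id, None)

    def start_vm(self, ref: ProviderRef) -> None:
        self._vms[ref.external_id]["power"] = "running"

    def stop_vm(self, ref: ProviderRef) -> None:
        self._vms[ref.external_id]["power"] = "stopped"

    def reboot_vm(self, ref: ProviderRef) -> None:
        self._vms[ref.external_id]["power"] = "running"

    def get_vm(self, ref: ProviderRef) -> dict:
        return dict(self._vms[ref.external_id])
```

Because the worker only depends on the protocol, a fake like this can exercise job handlers before any real driver exists.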

7. Job & Workflow

7.1 Job Types (V1)

  • provision_vm
  • start_vm
  • stop_vm
  • reboot_vm
  • delete_vm (attach NIC/volume job types can be added if they fall within the platform v1 scope)

7.2 State Machine

  • PENDING → RUNNING → SUCCEEDED
  • PENDING → RUNNING → FAILED
  • PENDING → RUNNING → RETRYING → RUNNING ...
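
These paths reduce to a small transition table. The sketch below encodes exactly the transitions listed above and rejects everything else; the function name and the choice of `ValueError` are assumptions.

```python
# Allowed next states for each job state, taken from the three paths above.
ALLOWED = {
    "PENDING": {"RUNNING"},
    "RUNNING": {"SUCCEEDED", "FAILED", "RETRYING"},
    "RETRYING": {"RUNNING"},
    "SUCCEEDED": set(),  # terminal
    "FAILED": set(),     # terminal
}


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal job transition {current} -> {target}")
    return target
```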

Retries apply only to transient errors:

  • provider timeout
  • temporary network error
  • 5xx upstream
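
One way to apply this rule is to have adapters raise distinct exception types for transient and permanent failures. In the following sketch the exception classes, the `run_job` helper, and the default of three attempts are assumptions; backoff between attempts is omitted for brevity.

```python
class TransientError(Exception):
    """Provider timeout, temporary network error, or upstream 5xx."""


class PermanentError(Exception):
    """Any failure that should not be retried, such as an invalid spec."""


def run_job(action, max_attempt: int = 3) -> dict:
    """Run `action`, retrying only transient failures up to max_attempt times."""
    attempt = 0
    while True:
        attempt += 1
        try:
            action()
            return {"state": "SUCCEEDED", "attempt": attempt}
        except TransientError as exc:
            if attempt >= max_attempt:
                return {"state": "FAILED", "attempt": attempt, "error_message": str(exc)}
            # The job would be RETRYING here; a real worker would back off first.
        except PermanentError as exc:
            return {"state": "FAILED", "attempt": attempt, "error_message": str(exc)}
```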

Idempotency:

  • Creating a VM must be safe to re-execute.
  • The handler must check provider_ref and the actual state before creating a new resource.
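
A minimal sketch of such a handler follows. The `provision_vm` function, the dictionary-based VM record, and the `FakeProvider` with its `create`/`exists` methods are all hypothetical and exist only to show the check-before-create step.

```python
def provision_vm(vm: dict, provider) -> dict:
    """Create the VM at the provider unless an earlier attempt already did."""
    ref = vm.get("provider_ref")
    if ref is not None and provider.exists(ref):
        return vm  # a previous attempt succeeded; nothing to create
    vm["provider_ref"] = provider.create(vm["name"])
    return vm


class FakeProvider:
    """Test double that records created VMs by external reference."""

    def __init__(self) -> None:
        self.created: dict = {}

    def create(self, name: str) -> str:
        ref = f"ext-{len(self.created) + 1}"
        self.created[ref] = name
        return ref

    def exists(self, ref: str) -> bool:
        return ref in self.created
```

This check alone does not cover a crash between the provider call and persisting provider_ref. A production handler would probably also need to look the VM up at the provider by a deterministic name or tag.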

8. Reconciliation Loop

Reconciliation runs periodically to:

  • Update VMs that are PENDING/RUNNING based on the provider's actual state.
  • Detect drift: a VM that is missing at the provider but still ACTIVE in the DB.
  • Flag an incident/alert (post-V1, this can integrate with an incident service).
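
The drift check can be sketched as a comparison between database rows and the identifiers a provider reports. The record shape (`id`, `status`, `external_id`) and the function name are assumptions for illustration.

```python
def find_drift(db_vms: list, provider_ids: set) -> list:
    """Return ids of VMs that are ACTIVE in the DB but absent at the provider."""
    return [
        vm["id"]
        for vm in db_vms
        if vm["status"] == "ACTIVE" and vm["external_id"] not in provider_ids
    ]
```

For example, with one healthy VM, one missing VM, and one VM that is still pending:

```python
db = [
    {"id": "vm-1", "status": "ACTIVE", "external_id": "100"},
    {"id": "vm-2", "status": "ACTIVE", "external_id": "101"},
    {"id": "vm-3", "status": "PENDING", "external_id": "102"},
]
drifted = find_drift(db, provider_ids={"100"})  # only vm-2 is reported
```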

9. Security Design

9.1 AuthN/AuthZ

  • JWT bearer token carrying at least these claims: org_id, project_bindings, roles, scopes
  • Tenant scopes must not grant access to ops endpoints.
  • Every request must be validated against the projectId path parameter.
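
These rules could be enforced after the JWT has been verified and decoded. In this sketch the scope names (`hcp.tenant.vm.read`, `hcp.ops.provider.write`) and the shape of project_bindings (a list of project ids) are assumptions; token signature verification is out of scope here.

```python
from typing import Optional


def authorize(claims: dict, required_scope: str, project_id: Optional[str] = None) -> bool:
    """Check scope and project binding on already-verified JWT claims."""
    # An ops endpoint requires an ops scope, so a tenant-only token fails here.
    if required_scope not in claims.get("scopes", []):
        return False
    # Tenant endpoints carry projectId in the path; it must be bound to the token.
    if project_id is not None and project_id not in claims.get("project_bindings", []):
        return False
    return True
```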

9.2 Secrets & Credentials

  • Provider credentials are stored encrypted.
  • Workers/agents use scoped credentials (per cluster/pool) where possible.
  • Console sessions use temporary (short-lived) tokens.
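
One possible shape for a short-lived console token is an HMAC-signed payload with an expiry. This is a sketch only: the token format, the function names, and the use of a shared secret are assumptions, and the specification does not prescribe a mechanism.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def issue_console_token(secret: bytes, vm_id: str, ttl_seconds: int,
                        now: Optional[float] = None) -> str:
    """Return '<base64 payload>.<hex signature>' that expires after ttl_seconds."""
    issued = time.time() if now is None else now
    payload = f"{vm_id}:{int(issued + ttl_seconds)}".encode()
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + signature


def verify_console_token(secret: bytes, token: str,
                         now: Optional[float] = None) -> Optional[str]:
    """Return the vm_id if the token is authentic and unexpired, else None."""
    encoded, _, signature = token.partition(".")
    try:
        payload = base64.urlsafe_b64decode(encoded.encode())
    except ValueError:
        return None
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking signature prefixes.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    vm_id, _, expires_at = payload.decode().rpartition(":")
    current = time.time() if now is None else now
    if current >= int(expires_at):
        return None
    return vm_id
```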

10. Observability

  • Metrics: RPS, latency, error rate, job success/fail, provider latency, retry count
  • Structured logs with trace_id, job_id, vm_id
  • Tracing end-to-end (OpenTelemetry ready)
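
Structured logs carrying these correlation fields can be produced with the standard `logging` module alone. The `JsonFormatter` class and the output field names `level` and `message` are assumptions; the correlation keys are the ones listed above.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object including correlation ids."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "message": record.getMessage()}
        # Copy the correlation fields only when the caller supplied them.
        for key in ("trace_id", "job_id", "vm_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("hcp.worker")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Correlation ids are passed through the standard `extra` argument.
logger.info("job started", extra={"job_id": "j1", "vm_id": "vm-1"})
```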

11. Deployment Notes (V1)

  • HCP API: stateless, autoscale-ready
  • HCP Worker: scales out according to job throughput
  • DB: PostgreSQL
  • Queue: NATS JetStream or RabbitMQ
  • Provider adapter: internal module within the worker (v1) or sidecar/agent (enterprise mode)

12. Provider Compatibility (Target)

  • Proxmox driver as the first implementation
  • VMware vSphere driver (post-V1 atau parallel development)
  • Libvirt/KVM driver
  • Hyper-V driver