

# Hypervisor Control Plane (HCP)
## Software Design Specification (SDS)
**Version: 1.0 (V1 Enterprise Foundation)**
---
## 1. Architecture Overview
HCP applies a **compute control plane** pattern with the following design:
- **Northbound API**: Stable, provider-agnostic, used by the Central Gateway and the console.
- **Core**: Job orchestration, policy hooks, state persistence, audit/events.
- **Southbound Provider Layer**: One adapter/driver per hypervisor/provider.
- **Workers/Agents**: Execute jobs that affect infrastructure.
HCP supports two execution modes:
1) **Direct mode**: the worker calls the provider API directly (fast for bootstrapping)
2) **Agent mode**: jobs are dispatched to an agent close to the cluster (more enterprise-oriented: multi-site, firewall-friendly)
---
## 2. Main Components
### 2.1 HCP API Service
Responsibilities:
- Expose the REST API (tenant/ops) for compute
- Enforce AuthN/AuthZ (scope-based)
- Validate requests and handle idempotency
- Persist desired state and job records
- Publish jobs to the queue/stream
### 2.2 HCP Worker Service
Responsibilities:
- Subscribe to the job queue
- Run the job state machine (RUNNING/RETRYING/FAILED/SUCCEEDED)
- Call the provider adapter
- Update VM/job state in the datastore
- Emit audit and metering events
### 2.3 Provider Adapter Layer
Responsibilities:
- Implement the generic provider contract
- Map the generic spec → provider-specific API
- Normalize provider errors → HCP error taxonomy
- Normalize VM actual state → internal model
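
The error normalization step could look roughly like the sketch below. It is only an illustration: the taxonomy members other than `FEATURE_NOT_SUPPORTED`, the `HcpError` class, and the choice of Python built-in exceptions as stand-ins for raw provider errors are all assumptions, not part of this specification.

```python
from enum import Enum


class HcpErrorCode(Enum):
    # Hypothetical taxonomy; only FEATURE_NOT_SUPPORTED is named in this document.
    PROVIDER_TIMEOUT = "PROVIDER_TIMEOUT"
    PROVIDER_UNAVAILABLE = "PROVIDER_UNAVAILABLE"
    NOT_FOUND = "NOT_FOUND"
    FEATURE_NOT_SUPPORTED = "FEATURE_NOT_SUPPORTED"
    INTERNAL = "INTERNAL"


class HcpError(Exception):
    """Normalized error carrying a taxonomy code and a retry hint."""

    def __init__(self, code: HcpErrorCode, message: str, transient: bool):
        super().__init__(message)
        self.code = code
        self.transient = transient


def normalize_provider_error(exc: Exception) -> HcpError:
    """Map a raw adapter exception onto the (assumed) HCP taxonomy."""
    if isinstance(exc, TimeoutError):
        return HcpError(HcpErrorCode.PROVIDER_TIMEOUT, str(exc), transient=True)
    if isinstance(exc, ConnectionError):
        return HcpError(HcpErrorCode.PROVIDER_UNAVAILABLE, str(exc), transient=True)
    if isinstance(exc, KeyError):
        return HcpError(HcpErrorCode.NOT_FOUND, str(exc), transient=False)
    if isinstance(exc, NotImplementedError):
        return HcpError(HcpErrorCode.FEATURE_NOT_SUPPORTED, str(exc), transient=False)
    # Anything unrecognized is treated as non-retryable.
    return HcpError(HcpErrorCode.INTERNAL, str(exc), transient=False)
```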
### 2.4 Data Store
- PostgreSQL for persistent state (tenancy binding, vms, jobs, catalog, locations)
- Event store (append-only) for audit (can be a dedicated table or a log pipeline)
- Queue/Stream for job distribution (NATS JetStream / RabbitMQ)
---
## 3. Domain Model
### 3.1 Core Entities
- `Location/Zone`
- `Provider`
- `ComputeCluster` (pool/cluster per provider, bound to a location)
- `Image` (catalog)
- `Flavor` (catalog)
- `VM`
- `Job`
- `AuditEvent`
### 3.2 Resource Fields (VM)
At minimum, a VM has:
- `id`, `org_id`, `project_id`
- `name`, `status`
- `image_id`, `flavor_id`
- `placement` (location_id, cluster_id optional)
- `addresses` (read-model)
- `labels/tags`
- `provider_id`
- `provider_ref` (opaque/internal)
- timestamps
### 3.3 Job Fields
- `id`, `type`
- `resource_type`, `resource_id`
- `state` (PENDING/RUNNING/SUCCEEDED/FAILED/RETRYING)
- `attempt`, `max_attempt`
- `error_code`, `error_message`
- timestamps
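
As a rough sketch, the VM and Job fields listed above could be modeled with dataclasses as follows. The `created_at`/`updated_at` names, the default `max_attempt` of 3, and the concrete Python types are assumptions; the specification only says "timestamps" and does not fix types.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class JobState(str, Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    RETRYING = "RETRYING"


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Placement:
    location_id: str
    cluster_id: Optional[str] = None  # optional, per section 3.2


@dataclass
class VM:
    id: str
    org_id: str
    project_id: str
    name: str
    status: str
    image_id: str
    flavor_id: str
    placement: Placement
    provider_id: str
    provider_ref: Optional[dict] = None  # opaque/internal, unset until provisioned
    addresses: list = field(default_factory=list)  # read-model
    labels: dict = field(default_factory=dict)
    created_at: datetime = field(default_factory=_now)  # assumed field name
    updated_at: datetime = field(default_factory=_now)  # assumed field name


@dataclass
class Job:
    id: str
    type: str
    resource_type: str
    resource_id: str
    state: JobState = JobState.PENDING
    attempt: int = 0
    max_attempt: int = 3  # assumed default
    error_code: Optional[str] = None
    error_message: Optional[str] = None
    created_at: datetime = field(default_factory=_now)  # assumed field name
    updated_at: datetime = field(default_factory=_now)  # assumed field name
```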
---
## 4. Northbound API Design
### 4.1 Namespace
Separating the tenant and ops namespaces is recommended, even though HCP can be accessed via the Central Gateway:
- Tenant: `/api/hcp/tenant/v1/*`
- Ops: `/api/hcp/ops/v1/*`
- Common catalog (read): `/api/hcp/common/v1/*`
### 4.2 Async Pattern
- Create/modify/delete return `202 Accepted` with a `job_id`.
- Job status can be polled: `GET /jobs/{job_id}`.
- Resources can be polled: `GET /vms/{id}`.
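
The accept-then-poll flow can be illustrated with an in-memory stand-in for the API service. This is a minimal sketch with no real HTTP layer, database, or queue; the `InMemoryJobApi` class, the `idempotency_key` parameter, and the `NOT_FOUND` error code are hypothetical names introduced for the example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class InMemoryJobApi:
    """Toy stand-in for the HCP API service."""

    jobs: dict = field(default_factory=dict)         # job_id -> job record
    by_idem_key: dict = field(default_factory=dict)  # idempotency key -> job_id

    def create_vm(self, project_id: str, body: dict, idempotency_key: str):
        """Handle POST /projects/{projectId}/vms and answer 202 with a job_id."""
        if idempotency_key in self.by_idem_key:
            # A repeated key returns the job created by the first request.
            return 202, {"job_id": self.by_idem_key[idempotency_key]}
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = {
            "id": job_id,
            "type": "provision_vm",
            "state": "PENDING",
            "resource_type": "vm",
            "project_id": project_id,
            "spec": body,
        }
        self.by_idem_key[idempotency_key] = job_id
        return 202, {"job_id": job_id}

    def get_job(self, job_id: str):
        """Handle GET /jobs/{job_id}, which clients poll until a terminal state."""
        job = self.jobs.get(job_id)
        if job is None:
            return 404, {"error_code": "NOT_FOUND"}
        return 200, job
```

A client would call `create_vm` once, keep the returned `job_id`, and poll `get_job` until the state is `SUCCEEDED` or `FAILED`.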
### 4.3 Core Endpoints (V1)
Tenant:
- `POST /projects/{projectId}/vms`
- `GET /projects/{projectId}/vms`
- `GET /projects/{projectId}/vms/{vmId}`
- `POST /projects/{projectId}/vms/{vmId}:start`
- `POST /projects/{projectId}/vms/{vmId}:stop`
- `POST /projects/{projectId}/vms/{vmId}:reboot`
- `DELETE /projects/{projectId}/vms/{vmId}`
- `POST /projects/{projectId}/vms/{vmId}:console`
- `GET /jobs/{jobId}`
Ops:
- `POST /providers`
- `POST /locations`
- `POST /compute-clusters`
- `GET /providers`
- `GET /compute-clusters`
- `POST /catalog/images` (optional in V1, if the platform needs it)
- `POST /catalog/flavors` (optional in V1)
Common:
- `GET /catalog/images`
- `GET /catalog/flavors`
- `GET /capabilities`
---
## 5. Capability Model
HCP stores and exposes capability flags on each provider/cluster, for example:
- `supports_cloud_init`
- `supports_snapshot`
- `supports_live_migration`
- `supports_console_vnc`
- `supports_console_spice`
- `supports_uefi`
- `supports_gpu_passthrough`
- `supports_secure_boot`
- `supports_tags`
Usage:
- The UI and upstream services can adapt the features they display.
- The API returns `FEATURE_NOT_SUPPORTED` if an action is not available.
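
A capability gate in front of an action might be sketched as below. The action names and the action-to-flag mapping are assumptions made for illustration; only the flag names and the `FEATURE_NOT_SUPPORTED` code come from this section. A flag that is absent is treated as unsupported.

```python
class FeatureNotSupported(Exception):
    error_code = "FEATURE_NOT_SUPPORTED"


# Hypothetical mapping from an action name to the capability flag it needs.
REQUIRED_CAPABILITY = {
    "console_vnc": "supports_console_vnc",
    "console_spice": "supports_console_spice",
    "snapshot": "supports_snapshot",
    "live_migration": "supports_live_migration",
}


def ensure_supported(action: str, capabilities: dict) -> None:
    """Raise FeatureNotSupported when the cluster lacks the required flag."""
    flag = REQUIRED_CAPABILITY.get(action)
    if flag is None:
        return  # actions without a mapped flag (e.g. start/stop) are always allowed
    if not capabilities.get(flag, False):
        raise FeatureNotSupported(f"{action} requires {flag}")
```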
---
## 6. Provider Interface (Conceptual)
### 6.1 ComputeProvider (minimum)
- `CreateVM(spec) -> ProviderRef`
- `DeleteVM(ref)`
- `StartVM(ref)`
- `StopVM(ref)`
- `RebootVM(ref)`
- `GetVM(ref) -> ActualState`
- `ListVMs(scope) -> []ActualState` (optional, for reconciliation)
### 6.2 ConsoleProvider
- `GetConsole(ref) -> ConsoleSession(type, url/token, expires_at)`
### 6.3 Catalog Providers (optional)
- `ListImages(scope)`
- `ImportImage(source)`
- `DeleteImage(id)`
`ProviderRef` is opaque and is composed of:
- `provider` + `external_id` + `location_id` + `extra(json)`
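
The minimum `ComputeProvider` contract could be expressed as an abstract base class, shown here with an in-memory fake that is handy for worker tests. This is a sketch under assumptions: method names are rendered in Python snake_case rather than the conceptual names above, `ActualState` is reduced to a plain dict, and `InMemoryProvider` with its `power` field is entirely hypothetical.

```python
import itertools
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ProviderRef:
    provider: str
    external_id: str
    location_id: str
    extra: dict = field(default_factory=dict)


class ComputeProvider(ABC):
    @abstractmethod
    def create_vm(self, spec: dict) -> ProviderRef: ...

    @abstractmethod
    def delete_vm(self, ref: ProviderRef) -> None: ...

    @abstractmethod
    def start_vm(self, ref: ProviderRef) -> None: ...

    @abstractmethod
    def stop_vm(self, ref: ProviderRef) -> None: ...

    @abstractmethod
    def reboot_vm(self, ref: ProviderRef) -> None: ...

    @abstractmethod
    def get_vm(self, ref: ProviderRef) -> dict: ...


class InMemoryProvider(ComputeProvider):
    """Fake provider; a real adapter would call the hypervisor API instead."""

    def __init__(self, name: str = "fake", location_id: str = "loc-1"):
        self.name = name
        self.location_id = location_id
        self._vms = {}
        self._ids = itertools.count(100)

    def create_vm(self, spec: dict) -> ProviderRef:
        external_id = str(next(self._ids))
        self._vms[external_id] = {"name": spec["name"], "power": "stopped"}
        return ProviderRef(self.name, external_id, self.location_id)

    def delete_vm(self, ref: ProviderRef) -> None:
        self._vms.pop(ref.external_id, None)  # deleting twice is harmless

    def start_vm(self, ref: ProviderRef) -> None:
        self._vms[ref.external_id]["power"] = "running"

    def stop_vm(self, ref: ProviderRef) -> None:
        self._vms[ref.external_id]["power"] = "stopped"

    def reboot_vm(self, ref: ProviderRef) -> None:
        self.stop_vm(ref)
        self.start_vm(ref)

    def get_vm(self, ref: ProviderRef) -> dict:
        return dict(self._vms[ref.external_id])  # copy, so callers cannot mutate state
```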
---
## 7. Job & Workflow
### 7.1 Job Types (V1)
- `provision_vm`
- `start_vm`
- `stop_vm`
- `reboot_vm`
- `delete_vm`
(NIC/volume attach can be added if it falls within the platform V1 scope)
### 7.2 State Machine
- PENDING → RUNNING → SUCCEEDED
- PENDING → RUNNING → FAILED
- PENDING → RUNNING → RETRYING → RUNNING ...
Retries apply only to transient errors:
- provider timeout
- temporary network error
- 5xx upstream
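
The transitions and the retry rule above can be sketched as a small driver loop. It is illustrative only: `TransientError`, the `RETRY_EXHAUSTED` and `INTERNAL` error codes, and the dict-based job record are assumptions, and a real worker would wait with backoff between attempts instead of retrying immediately.

```python
ALLOWED_TRANSITIONS = {
    "PENDING": {"RUNNING"},
    "RUNNING": {"SUCCEEDED", "FAILED", "RETRYING"},
    "RETRYING": {"RUNNING"},
    "SUCCEEDED": set(),
    "FAILED": set(),
}


class TransientError(Exception):
    """Hypothetical marker for provider timeouts, network blips, and upstream 5xx."""


def transition(job: dict, new_state: str) -> None:
    """Move the job to new_state, rejecting anything outside the state machine."""
    if new_state not in ALLOWED_TRANSITIONS[job["state"]]:
        raise ValueError(f"illegal transition {job['state']} -> {new_state}")
    job["state"] = new_state


def run_job(job: dict, handler) -> dict:
    """Drive one job to a terminal state, retrying only transient failures."""
    while True:
        transition(job, "RUNNING")
        job["attempt"] += 1
        try:
            handler(job)
        except TransientError as exc:
            if job["attempt"] < job["max_attempt"]:
                transition(job, "RETRYING")
                continue
            job["error_code"] = "RETRY_EXHAUSTED"  # hypothetical code
            job["error_message"] = str(exc)
            transition(job, "FAILED")
            return job
        except Exception as exc:
            # Non-transient errors fail the job on the first attempt.
            job["error_code"] = "INTERNAL"  # hypothetical code
            job["error_message"] = str(exc)
            transition(job, "FAILED")
            return job
        transition(job, "SUCCEEDED")
        return job
```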
Idempotency:
- VM creation must be safe to re-execute.
- The handler must check `provider_ref` and the actual state before creating a new resource.
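
One way to make `provision_vm` safe to re-run is sketched below. It assumes the adapter can tag a created VM with the HCP VM id and look it up again, which covers a crash that happens after the provider call but before `provider_ref` is saved. `FakeProvider`, `find_by_tag`, and the `tag` argument are hypothetical and not part of the provider contract in section 6.

```python
class FakeProvider:
    """In-memory stand-in used only to exercise the handler."""

    def __init__(self):
        self.created = {}  # external ref -> tag (the HCP VM id)

    def create(self, name: str, tag: str) -> str:
        ref = f"ext-{len(self.created) + 1}"
        self.created[ref] = tag
        return ref

    def exists(self, ref: str) -> bool:
        return ref in self.created

    def find_by_tag(self, tag: str):
        for ref, existing_tag in self.created.items():
            if existing_tag == tag:
                return ref
        return None


def provision_vm(vm: dict, provider) -> dict:
    """Create the VM at the provider only if no earlier attempt already did."""
    ref = vm.get("provider_ref")
    if ref is None:
        # provider_ref may be missing because a previous attempt crashed mid-way.
        ref = provider.find_by_tag(vm["id"])
    if ref is not None and provider.exists(ref):
        vm["provider_ref"] = ref
        return vm
    vm["provider_ref"] = provider.create(vm["name"], tag=vm["id"])
    return vm
```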
---
## 8. Reconciliation Loop
Reconciliation runs periodically to:
- Update VMs in `PENDING/RUNNING` based on the provider's actual state.
- Detect drift: a VM that is missing at the provider but still ACTIVE in the DB.
- Flag an incident/alert (post-V1 this can integrate with an incident service).
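
A single reconciliation pass might look like the following sketch. It assumes the provider's actual state has already been normalized to the same status vocabulary the DB uses, and that each DB row carries an `external_id`; the `MISSING_AT_PROVIDER` alert kind and the dict shapes are hypothetical.

```python
def reconcile(db_vms: list, provider_states: dict) -> list:
    """Update in-flight VMs in place and return drift alerts.

    provider_states maps a provider external id to its normalized status.
    """
    alerts = []
    for vm in db_vms:
        actual = provider_states.get(vm["external_id"])
        if actual is None:
            # Drift: the DB still believes the VM is ACTIVE, the provider has no record.
            if vm["status"] == "ACTIVE":
                alerts.append({"vm_id": vm["id"], "kind": "MISSING_AT_PROVIDER"})
            continue
        if vm["status"] in ("PENDING", "RUNNING") and actual != vm["status"]:
            vm["status"] = actual
    return alerts
```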
---
## 9. Security Design
### 9.1 AuthN/AuthZ
- JWT bearer token with, at minimum, the claims `org_id`, `project_bindings`, `roles`, `scopes`
- Tenant scopes must not access ops endpoints.
- Every request must be validated against the `projectId` path parameter.
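
The authorization rules above could be checked roughly as follows, operating on claims that have already been extracted from a verified JWT (signature and expiry checks are out of scope for this sketch). The `tenant:`/`ops:` scope prefix format and the shape of `project_bindings` as a list of project ids are assumptions.

```python
from typing import Optional


def authorize(claims: dict, namespace: str, required_scope: str,
              project_id: Optional[str] = None) -> bool:
    """Return True when the claims allow the request in the given namespace."""
    if required_scope not in claims.get("scopes", []):
        return False
    # A tenant-prefixed scope never grants access under the ops namespace.
    if not required_scope.startswith(f"{namespace}:"):
        return False
    if namespace == "tenant":
        # The projectId path parameter must be one the token is bound to.
        return project_id in claims.get("project_bindings", [])
    return True
```

For example, a token bound to project `p1` with scope `tenant:vm.write` would be allowed on `/projects/p1/vms` but rejected on `/projects/p2/vms` and on any ops endpoint.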
### 9.2 Secrets & Credentials
- Provider credentials are stored encrypted.
- Workers/agents use scoped credentials (per cluster/pool) where possible.
- Console sessions use temporary (short-lived) tokens.
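
A short-lived console token can be illustrated with an HMAC-signed value that embeds its own expiry. This is one possible scheme, not the one this specification mandates; the token layout `vm_id.expires_at.signature` and both function names are assumptions.

```python
import hashlib
import hmac
from typing import Optional


def issue_console_token(secret: bytes, vm_id: str, ttl_seconds: int, now: float) -> str:
    """Return a token that is valid for ttl_seconds starting at now (epoch seconds)."""
    expires_at = int(now) + ttl_seconds
    payload = f"{vm_id}.{expires_at}"
    signature = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{signature}"


def verify_console_token(secret: bytes, token: str, now: float) -> Optional[str]:
    """Return the vm_id for a valid, unexpired token, otherwise None."""
    try:
        vm_id, expires_at, signature = token.rsplit(".", 2)
    except ValueError:
        return None  # fewer than three dot-separated parts
    expected = hmac.new(secret, f"{vm_id}.{expires_at}".encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None
    if not expires_at.isdigit() or now >= int(expires_at):
        return None
    return vm_id
```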
---
## 10. Observability
- Metrics: RPS, latency, error rate, job success/fail, provider latency, retry count
- Structured logs with `trace_id`, `job_id`, `vm_id`
- Tracing end-to-end (OpenTelemetry ready)
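
Structured logs with the correlation fields above can be produced with the standard `logging` module, as in this minimal sketch. The JSON layout and the `hcp.worker` logger name are assumptions; a real deployment would more likely take `trace_id` from the active tracing context than pass it by hand.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the correlation IDs."""

    CORRELATION_KEYS = ("trace_id", "job_id", "vm_id")

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "message": record.getMessage()}
        for key in self.CORRELATION_KEYS:
            value = getattr(record, key, None)  # set via the `extra` argument
            if value is not None:
                entry[key] = value
        return json.dumps(entry)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("hcp.worker")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# Usage: write one structured line into an in-memory buffer.
buffer = io.StringIO()
log = build_logger(buffer)
log.info("job started", extra={"trace_id": "t-1", "job_id": "j-1", "vm_id": "vm-1"})
```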
---
## 11. Deployment Notes (V1)
- HCP API: stateless, autoscale-ready
- HCP Worker: scales out according to job throughput
- DB: PostgreSQL
- Queue: NATS JetStream or RabbitMQ
- Provider adapter: internal module within the worker (V1) or sidecar/agent (enterprise mode)
---
## 12. Provider Compatibility (Target)
- Proxmox driver as the first implementation
- VMware vSphere driver (post-V1 or parallel development)
- Libvirt/KVM driver
- Hyper-V driver
---