# Hypervisor Control Plane (HCP)

## Software Design Specification (SDS)

**Version: 1.0 (V1 – Enterprise Foundation)**

---

## 1. Overview

The HCP architecture applies a compute **Control Plane** pattern with the following design:

- **Northbound API**: stable, provider-agnostic, consumed by the Central Gateway and the console.
- **Core**: job orchestration, policy hooks, state persistence, audit/event.
- **Southbound Provider Layer**: one adapter/driver per hypervisor/provider.
- **Workers/Agents**: execute the jobs that affect infrastructure.

HCP supports two execution modes:

1) **Direct mode**: the worker calls the provider API directly (fast for bootstrapping).
2) **Agent mode**: jobs are dispatched to an agent close to the cluster (more enterprise-oriented: multi-site, firewall-friendly).

---

## 2. Main Components

### 2.1 HCP API Service

Responsibilities:

- Expose the REST API (tenant/ops) for compute
- Enforce AuthN/AuthZ (scope-based)
- Validate requests and enforce idempotency
- Persist desired state and job records
- Publish jobs to the queue/stream

### 2.2 HCP Worker Service

Responsibilities:

- Subscribe to the job queue
- Run the job state machine (RUNNING/RETRYING/FAILED/SUCCEEDED)
- Call the provider adapter
- Update VM/job state in the datastore
- Emit audit and metering events

### 2.3 Provider Adapter Layer

Responsibilities:

- Implement the generic provider contract
- Map the generic spec to the provider-specific API
- Normalize provider errors into the HCP error taxonomy
- Normalize the VM actual state into the internal model

### 2.4 Data Store

- PostgreSQL for persistent state (tenancy binding, vms, jobs, catalog, locations)
- Event store (append-only) for audit (either a dedicated table or a log pipeline)
- Queue/Stream for job distribution (NATS JetStream / RabbitMQ)

---

## 3. Domain Model

### 3.1 Core Entities

- `Location/Zone`
- `Provider`
- `ComputeCluster` (pool/cluster per provider, bound to a location)
- `Image` (catalog)
- `Flavor` (catalog)
- `VM`
- `Job`
- `AuditEvent`

### 3.2 Resource Fields (VM)

A VM has at least:

- `id`, `org_id`, `project_id`
- `name`, `status`
- `image_id`, `flavor_id`
- `placement` (location_id, cluster_id optional)
- `addresses` (read-model)
- `labels/tags`
- `provider_id`
- `provider_ref` (opaque/internal)
- timestamps

### 3.3 Job Fields

- `id`, `type`
- `resource_type`, `resource_id`
- `state` (PENDING/RUNNING/SUCCEEDED/FAILED/RETRYING)
- `attempt`, `max_attempt`
- `error_code`, `error_message`
- timestamps

---

## 4. Northbound API Design

### 4.1 Namespace

Separating the tenant and ops namespaces is recommended, even though HCP can be accessed through the Central Gateway:

- Tenant: `/api/hcp/tenant/v1/*`
- Ops: `/api/hcp/ops/v1/*`
- Common catalog (read): `/api/hcp/common/v1/*`

### 4.2 Async Pattern

- Create/modify/delete operations return `202 Accepted` with a `job_id`.
- Job status can be polled: `GET /jobs/{jobId}`.
- The resource can be polled: `GET /projects/{projectId}/vms/{vmId}`.

### 4.3 Core Endpoints (V1)

Tenant:

- `POST /projects/{projectId}/vms`
- `GET /projects/{projectId}/vms`
- `GET /projects/{projectId}/vms/{vmId}`
- `POST /projects/{projectId}/vms/{vmId}:start`
- `POST /projects/{projectId}/vms/{vmId}:stop`
- `POST /projects/{projectId}/vms/{vmId}:reboot`
- `DELETE /projects/{projectId}/vms/{vmId}`
- `POST /projects/{projectId}/vms/{vmId}:console`
- `GET /jobs/{jobId}`

Ops:

- `POST /providers`
- `POST /locations`
- `POST /compute-clusters`
- `GET /providers`
- `GET /compute-clusters`
- `POST /catalog/images` (optional in V1, if the platform requires it)
- `POST /catalog/flavors` (optional in V1)

Common:

- `GET /catalog/images`
- `GET /catalog/flavors`
- `GET /capabilities`

---

## 5. Capability Model

HCP stores and exposes capability flags on each provider/cluster, for example:

- `supports_cloud_init`
- `supports_snapshot`
- `supports_live_migration`
- `supports_console_vnc`
- `supports_console_spice`
- `supports_uefi`
- `supports_gpu_passthrough`
- `supports_secure_boot`
- `supports_tags`

Usage:

- The UI and upstream services can adapt the features they display.
- The API returns `FEATURE_NOT_SUPPORTED` when an action is not available.

---

## 6. Provider Interface (Conceptual)

### 6.1 ComputeProvider (minimum)

- `CreateVM(spec) -> ProviderRef`
- `DeleteVM(ref)`
- `StartVM(ref)`
- `StopVM(ref)`
- `RebootVM(ref)`
- `GetVM(ref) -> ActualState`
- `ListVMs(scope) -> []ActualState` (optional, for reconciliation)

### 6.2 ConsoleProvider

- `GetConsole(ref) -> ConsoleSession(type, url/token, expires_at)`

### 6.3 Catalog Providers (optional)

- `ListImages(scope)`
- `ImportImage(source)`
- `DeleteImage(id)`

ProviderRef is opaque and consists of:

- `provider` + `external_id` + `location_id` + `extra(json)`

---

## 7. Job & Workflow

### 7.1 Job Types (V1)

- `provision_vm`
- `start_vm`
- `stop_vm`
- `reboot_vm`
- `delete_vm`

(NIC/volume attach jobs can be added if they fall within the platform V1 scope.)

### 7.2 State Machine

- PENDING → RUNNING → SUCCEEDED
- PENDING → RUNNING → FAILED
- PENDING → RUNNING → RETRYING → RUNNING ...

Retries apply only to transient errors:

- provider timeout
- temporary network error
- 5xx upstream

Idempotency:

- Creating a VM must be safe to execute again.
- The handler must check `provider_ref` and the actual state before creating a new resource.

---

## 8. Reconciliation Loop

Reconciliation runs periodically to:

- Update VMs in `PENDING/RUNNING` based on the actual state reported by the provider.
- Detect drift: a VM that is gone at the provider but still ACTIVE in the DB.
- Raise an incident/alert (post-V1, this can integrate with an incident service).

---

## 9. Security Design

### 9.1 AuthN/AuthZ

- JWT bearer token with at least these claims: `org_id`, `project_bindings`, `roles`, `scopes`
- Tenant scope must not be able to access ops endpoints.
- Every request must be validated against the `projectId` path parameter.

### 9.2 Secrets & Credentials

- Provider credentials are stored encrypted.
- Workers/agents use scoped credentials (per cluster/pool) where possible.
- Console sessions use temporary (short-lived) tokens.

---

## 10. Observability

- Metrics: RPS, latency, error rate, job success/fail, provider latency, retry count
- Structured logs with `trace_id`, `job_id`, `vm_id`
- End-to-end tracing (OpenTelemetry ready)

---

## 11. Deployment Notes (V1)

- HCP API: stateless, autoscale-ready
- HCP Worker: scales out according to job throughput
- DB: PostgreSQL
- Queue: NATS JetStream or RabbitMQ
- Provider adapter: internal module inside the worker (V1) or sidecar/agent (enterprise mode)

---

## 12. Provider Compatibility (Target)

- Proxmox driver as the first implementation
- VMware vSphere driver (post-V1 or parallel development)
- Libvirt/KVM driver
- Hyper-V driver

---
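As an illustration of the capability model in section 5, the check behind `FEATURE_NOT_SUPPORTED` could look roughly like the sketch below. The action names, the `REQUIRED_CAPABILITY` mapping, and the `ComputeCluster` shape are assumptions made for this example; the specification only defines the flag names and the error code.

```python
from dataclasses import dataclass, field


class FeatureNotSupported(Exception):
    """Carries the FEATURE_NOT_SUPPORTED error code from section 5."""

    code = "FEATURE_NOT_SUPPORTED"


# Hypothetical mapping from an API action to the capability flag it needs.
REQUIRED_CAPABILITY = {
    "snapshot": "supports_snapshot",
    "live_migrate": "supports_live_migration",
    "console_vnc": "supports_console_vnc",
}


@dataclass
class ComputeCluster:
    """Assumed minimal shape: a name plus the set of flags the cluster exposes."""

    name: str
    capabilities: set = field(default_factory=set)


def require_capability(cluster: ComputeCluster, action: str) -> None:
    """Raise FeatureNotSupported if the cluster lacks the flag for this action."""
    flag = REQUIRED_CAPABILITY.get(action)
    if flag is not None and flag not in cluster.capabilities:
        raise FeatureNotSupported(f"{action} requires {flag} on {cluster.name}")
```

Actions without an entry in the mapping pass through unchecked here; a real service might prefer to reject unknown actions instead.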
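The job state machine and retry rule from section 7.2 can be sketched as a transition table plus a small decision function. The error code strings are hypothetical stand-ins for the three transient categories, and treating an exhausted `max_attempt` as FAILED is an assumption, since the specification does not spell out that case.

```python
from enum import Enum


class JobState(str, Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    RETRYING = "RETRYING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


# Transitions listed in section 7.2; SUCCEEDED and FAILED are terminal.
ALLOWED = {
    JobState.PENDING: {JobState.RUNNING},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED, JobState.RETRYING},
    JobState.RETRYING: {JobState.RUNNING},
    JobState.SUCCEEDED: set(),
    JobState.FAILED: set(),
}

# Hypothetical codes for: provider timeout, temporary network error, 5xx upstream.
TRANSIENT_ERRORS = {"PROVIDER_TIMEOUT", "NETWORK_TEMPORARY", "UPSTREAM_5XX"}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the target state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def next_state_after_error(error_code: str, attempt: int, max_attempt: int) -> JobState:
    """Decide where a RUNNING job goes after a failed attempt.

    Only transient errors are retried; running out of attempts is assumed
    to end in FAILED.
    """
    if error_code in TRANSIENT_ERRORS and attempt < max_attempt:
        return JobState.RETRYING
    return JobState.FAILED
```

For example, `next_state_after_error("PROVIDER_TIMEOUT", 1, 3)` yields `JobState.RETRYING`, while the same error on attempt 3 of 3 yields `JobState.FAILED`.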
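The idempotency requirement in section 7.2 (check `provider_ref` and the actual state before creating a resource) might be implemented along these lines. This is a minimal sketch: `ProviderRef` is reduced to a plain string, the method names are Python renderings of `CreateVM`/`GetVM` from section 6.1, and `InMemoryProvider` is a fake that exists only to exercise the handler.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass
class VM:
    id: str
    name: str
    provider_ref: Optional[str] = None  # opaque reference, None until created


class ComputeProvider(Protocol):
    """Subset of the section 6.1 contract needed by this sketch."""

    def create_vm(self, spec: dict) -> str: ...

    def get_vm(self, ref: str) -> Optional[dict]: ...


class InMemoryProvider:
    """Fake provider (assumption) that counts how often create_vm is called."""

    def __init__(self) -> None:
        self.vms: dict = {}
        self.create_calls = 0

    def create_vm(self, spec: dict) -> str:
        self.create_calls += 1
        ref = f"ext-{self.create_calls}"
        self.vms[ref] = {"name": spec["name"], "status": "RUNNING"}
        return ref

    def get_vm(self, ref: str) -> Optional[dict]:
        return self.vms.get(ref)


def provision_vm(vm: VM, provider: ComputeProvider) -> str:
    """Create the VM only if no live provider resource is already recorded."""
    if vm.provider_ref is not None and provider.get_vm(vm.provider_ref) is not None:
        return vm.provider_ref  # re-delivered job: nothing to create
    vm.provider_ref = provider.create_vm({"name": vm.name})
    return vm.provider_ref
```

Running `provision_vm` twice for the same VM leaves `create_calls` at 1. The sketch does not cover a crash between the provider call and persisting `provider_ref`; closing that gap would need an additional lookup (for instance by name or tag), which is outside what the specification describes.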
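The drift check in section 8 (a VM that is gone at the provider but still ACTIVE in the DB) reduces to a set comparison. In this sketch the row shape (`id`, `status`, `provider_ref`) follows section 3.2, while the function name and the idea of feeding it the references returned by `ListVMs` are assumptions.

```python
from typing import Iterable, List, Mapping, Set


def detect_missing_vms(db_vms: Iterable[Mapping], provider_refs: Set[str]) -> List[str]:
    """Return ids of VMs marked ACTIVE in the DB whose provider_ref is no
    longer reported by the provider."""
    missing = []
    for vm in db_vms:
        if vm["status"] == "ACTIVE" and vm["provider_ref"] not in provider_refs:
            missing.append(vm["id"])
    return missing


# Example rows as they might be read from the vms table.
rows = [
    {"id": "vm-1", "status": "ACTIVE", "provider_ref": "ext-1"},
    {"id": "vm-2", "status": "ACTIVE", "provider_ref": "ext-2"},
    {"id": "vm-3", "status": "PENDING", "provider_ref": None},
]
drifted = detect_missing_vms(rows, {"ext-1"})
```

Here only `vm-2` is reported as drifted: `vm-1` still exists at the provider and `vm-3` is not ACTIVE. What happens next (alerting, marking the VM as lost) is left open, as in the specification.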