add srs and sds documents

Othman H. Suseno
2025-12-30 13:00:47 +07:00
commit eefa9d7035
4 changed files with 716 additions and 0 deletions

srs-sds/HCP_SDS_v1.md Normal file

@@ -0,0 +1,244 @@
# Hypervisor Control Plane (HCP)
## Software Design Specification (SDS)
**Version: 1.0 (V1 Enterprise Foundation)**
---
## 1. Architecture Overview
HCP implements a **compute control plane** pattern with the following design:
- **Northbound API**: Stable and provider-agnostic; used by the Central Gateway and the console.
- **Core**: Job orchestration, policy hooks, state persistence, and audit/events.
- **Southbound Provider Layer**: One adapter/driver per hypervisor/provider.
- **Workers/Agents**: Execute jobs that affect infrastructure.
HCP supports two execution modes:
1) **Direct mode**: the worker calls the provider API directly (fast to bootstrap)
2) **Agent mode**: jobs are dispatched to an agent close to the cluster (better suited to enterprise deployments: multi-site, firewall-friendly)
---
## 2. Main Components
### 2.1 HCP API Service
Responsibilities:
- Expose the REST API (tenant/ops) for compute
- AuthN/AuthZ enforcement (scope-based)
- Validate requests and enforce idempotency
- Persist desired state and job records
- Publish jobs to the queue/stream
### 2.2 HCP Worker Service
Responsibilities:
- Subscribe to the job queue
- Run the job state machine (RUNNING/RETRYING/FAILED/SUCCEEDED)
- Call the provider adapter
- Update VM/job state in the datastore
- Emit audit and metering events
### 2.3 Provider Adapter Layer
Responsibilities:
- Implement the generic provider contract
- Map the generic spec → provider-specific API
- Normalize provider errors → HCP error taxonomy
- Normalize the VM's actual state → internal model
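The error-normalization responsibility could be sketched as below. This is only an illustration: the `HcpError` type, the `normalize_provider_error` function, and every error code shown are hypothetical (this specification names only `FEATURE_NOT_SUPPORTED`), and a real adapter would inspect provider-specific responses.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical codes treated as transient; not defined by this specification.
TRANSIENT_CODES = {"PROVIDER_TIMEOUT", "NETWORK_ERROR", "UPSTREAM_5XX"}


@dataclass(frozen=True)
class HcpError:
    code: str        # code from the (assumed) HCP error taxonomy
    message: str     # detail suitable for the job's error_message field
    transient: bool  # True if the worker may retry the job


def normalize_provider_error(exc: Exception,
                             http_status: Optional[int] = None) -> HcpError:
    """Map a raw provider failure to an HCP error (illustrative mapping only)."""
    if isinstance(exc, TimeoutError):
        code = "PROVIDER_TIMEOUT"
    elif isinstance(exc, ConnectionError):
        code = "NETWORK_ERROR"
    elif http_status is not None and 500 <= http_status <= 599:
        code = "UPSTREAM_5XX"
    elif http_status == 404:
        code = "RESOURCE_NOT_FOUND"
    else:
        code = "PROVIDER_ERROR"
    return HcpError(code=code, message=str(exc), transient=code in TRANSIENT_CODES)
```

Keeping the transient/permanent decision inside the adapter layer means the worker's retry logic never has to know about provider-specific exceptions.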
### 2.4 Data Store
- PostgreSQL for persistent state (tenancy bindings, vms, jobs, catalog, locations)
- Event store (append-only) for audit (either a dedicated table or a log pipeline)
- Queue/stream for job distribution (NATS JetStream / RabbitMQ)
---
## 3. Domain Model
### 3.1 Core Entities
- `Location/Zone`
- `Provider`
- `ComputeCluster` (pool/cluster per provider, bound to a location)
- `Image` (catalog)
- `Flavor` (catalog)
- `VM`
- `Job`
- `AuditEvent`
### 3.2 Resource Fields (VM)
A VM has at least the following fields:
- `id`, `org_id`, `project_id`
- `name`, `status`
- `image_id`, `flavor_id`
- `placement` (location_id, cluster_id optional)
- `addresses` (read-model)
- `labels/tags`
- `provider_id`
- `provider_ref` (opaque/internal)
- timestamps
### 3.3 Job Fields
- `id`, `type`
- `resource_type`, `resource_id`
- `state` (PENDING/RUNNING/SUCCEEDED/FAILED/RETRYING)
- `attempt`, `max_attempt`
- `error_code`, `error_message`
- timestamps
---
## 4. Northbound API Design
### 4.1 Namespace
Separating the tenant and ops namespaces is recommended, even when HCP is accessed via the Central Gateway:
- Tenant: `/api/hcp/tenant/v1/*`
- Ops: `/api/hcp/ops/v1/*`
- Common catalog (read): `/api/hcp/common/v1/*`
### 4.2 Async Pattern
- Create/modify/delete operations return `202 Accepted` with a `job_id`.
- Job status can be polled: `GET /jobs/{job_id}`.
- The resource can be polled: `GET /vms/{id}`.
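From the client's side, the polling half of this pattern might look like the sketch below. The `wait_for_job` helper and its parameters are hypothetical, and the response body is assumed to be a JSON object with a `state` field as listed in section 3.3. The HTTP call is passed in as `fetch_job` so the sketch does not depend on any particular HTTP library.

```python
import time
from typing import Callable, Dict

# Terminal job states, taken from the job state list in section 3.3.
TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def wait_for_job(fetch_job: Callable[[str], Dict], job_id: str,
                 interval_s: float = 2.0, timeout_s: float = 300.0) -> Dict:
    """Poll a job until it reaches a terminal state or the timeout expires.

    `fetch_job` stands in for an HTTP call to GET /jobs/{job_id} and must
    return the decoded job document (a dict with at least a "state" key).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job(job_id)
        if job["state"] in TERMINAL_STATES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['state']} after {timeout_s}s")
        time.sleep(interval_s)
```

A caller would first issue the mutating request, read `job_id` from the `202 Accepted` response, and then hand it to `wait_for_job`.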
### 4.3 Core Endpoints (V1)
Tenant:
- `POST /projects/{projectId}/vms`
- `GET /projects/{projectId}/vms`
- `GET /projects/{projectId}/vms/{vmId}`
- `POST /projects/{projectId}/vms/{vmId}:start`
- `POST /projects/{projectId}/vms/{vmId}:stop`
- `POST /projects/{projectId}/vms/{vmId}:reboot`
- `DELETE /projects/{projectId}/vms/{vmId}`
- `POST /projects/{projectId}/vms/{vmId}:console`
- `GET /jobs/{jobId}`
Ops:
- `POST /providers`
- `POST /locations`
- `POST /compute-clusters`
- `GET /providers`
- `GET /compute-clusters`
- `POST /catalog/images` (optional in v1, if the platform needs it)
- `POST /catalog/flavors` (optional in v1)
Common:
- `GET /catalog/images`
- `GET /catalog/flavors`
- `GET /capabilities`
---
## 5. Capability Model
HCP stores and exposes capability flags per provider/cluster, for example:
- `supports_cloud_init`
- `supports_snapshot`
- `supports_live_migration`
- `supports_console_vnc`
- `supports_console_spice`
- `supports_uefi`
- `supports_gpu_passthrough`
- `supports_secure_boot`
- `supports_tags`
Usage:
- The UI and upstream services can adapt the features they display.
- The API returns `FEATURE_NOT_SUPPORTED` if an action is not available.
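A capability gate in the API layer could be as small as the sketch below. The flag names come from the list above; the action names, the `REQUIRED_CAPABILITY` table, and the `require_capability` helper are assumptions made for illustration.

```python
from typing import Dict


class FeatureNotSupported(Exception):
    """Raised when the target provider/cluster lacks a required capability."""
    code = "FEATURE_NOT_SUPPORTED"


# Hypothetical mapping from an API action to the capability flag it needs.
REQUIRED_CAPABILITY = {
    "console_vnc": "supports_console_vnc",
    "console_spice": "supports_console_spice",
    "snapshot": "supports_snapshot",
    "live_migration": "supports_live_migration",
}


def require_capability(capabilities: Dict[str, bool], action: str) -> None:
    """Raise FeatureNotSupported if `action` needs a flag the cluster lacks.

    Actions without an entry in REQUIRED_CAPABILITY are treated as always
    available; a missing flag is treated as False.
    """
    flag = REQUIRED_CAPABILITY.get(action)
    if flag is not None and not capabilities.get(flag, False):
        raise FeatureNotSupported(f"action '{action}' requires {flag}")
```

Running this check before a job record is created lets the API reject the request synchronously rather than failing the job later.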
---
## 6. Provider Interface (Conceptual)
### 6.1 ComputeProvider (minimum)
- `CreateVM(spec) -> ProviderRef`
- `DeleteVM(ref)`
- `StartVM(ref)`
- `StopVM(ref)`
- `RebootVM(ref)`
- `GetVM(ref) -> ActualState`
- `ListVMs(scope) -> []ActualState` (optional, for reconciliation)
### 6.2 ConsoleProvider
- `GetConsole(ref) -> ConsoleSession(type, url/token, expires_at)`
### 6.3 Catalog Providers (optional)
- `ListImages(scope)`
- `ImportImage(source)`
- `DeleteImage(id)`
`ProviderRef` is opaque:
- `provider` + `external_id` + `location_id` + `extra(json)`
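One way to express this contract in code is sketched below in Python, with method names converted to snake_case. The `power_state` field and its values, the `InMemoryProvider` test double, and the choice to leave a new VM stopped are assumptions for illustration; `ListVMs`, `ConsoleProvider`, and the catalog providers are omitted for brevity.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, Protocol


@dataclass(frozen=True)
class ProviderRef:
    """Opaque handle; callers pass it back to the adapter without inspecting it."""
    provider: str
    external_id: str
    location_id: str
    extra: Dict[str, Any] = field(default_factory=dict)


@dataclass
class ActualState:
    ref: ProviderRef
    power_state: str  # assumed vocabulary: "running" or "stopped"


class ComputeProvider(Protocol):
    def create_vm(self, spec: Dict[str, Any]) -> ProviderRef: ...
    def delete_vm(self, ref: ProviderRef) -> None: ...
    def start_vm(self, ref: ProviderRef) -> None: ...
    def stop_vm(self, ref: ProviderRef) -> None: ...
    def reboot_vm(self, ref: ProviderRef) -> None: ...
    def get_vm(self, ref: ProviderRef) -> ActualState: ...


class InMemoryProvider:
    """Toy adapter for tests; it satisfies ComputeProvider structurally."""

    def __init__(self, location_id: str) -> None:
        self.location_id = location_id
        self._power: Dict[str, str] = {}  # external_id -> power state

    def _set(self, ref: ProviderRef, state: str) -> None:
        if ref.external_id not in self._power:
            raise KeyError(ref.external_id)
        self._power[ref.external_id] = state

    def create_vm(self, spec: Dict[str, Any]) -> ProviderRef:
        ref = ProviderRef("in-memory", uuid.uuid4().hex, self.location_id)
        self._power[ref.external_id] = "stopped"  # assumption: created VMs start stopped
        return ref

    def delete_vm(self, ref: ProviderRef) -> None:
        self._power.pop(ref.external_id, None)  # deleting twice is a no-op

    def start_vm(self, ref: ProviderRef) -> None:
        self._set(ref, "running")

    def stop_vm(self, ref: ProviderRef) -> None:
        self._set(ref, "stopped")

    def reboot_vm(self, ref: ProviderRef) -> None:
        self._set(ref, "running")

    def get_vm(self, ref: ProviderRef) -> ActualState:
        return ActualState(ref, self._power[ref.external_id])
```

An in-memory adapter of this kind can also be useful for exercising worker and reconciliation logic without a real hypervisor.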
---
## 7. Job & Workflow
### 7.1 Job Types (V1)
- `provision_vm`
- `start_vm`
- `stop_vm`
- `reboot_vm`
- `delete_vm`
(attach NIC/volume job types can be added if they fall within the platform's v1 scope)
### 7.2 State Machine
- PENDING → RUNNING → SUCCEEDED
- PENDING → RUNNING → FAILED
- PENDING → RUNNING → RETRYING → RUNNING ...
Retries apply only to transient errors:
- provider timeout
- temporary network error
- 5xx upstream
Idempotency:
- VM creation must be safe to re-execute.
- The handler must check `provider_ref` and the actual state before creating a new resource.
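The transitions and retry rule above can be sketched as follows. This is a simplified, in-process illustration: jobs and VMs are plain dicts, `run_job` and `provision_vm` are hypothetical names, and a real worker would persist each transition and wait or re-enqueue between attempts rather than loop immediately.

```python
from typing import Callable, Dict

# Allowed transitions, taken from the state machine listed above.
ALLOWED = {
    "PENDING": {"RUNNING"},
    "RUNNING": {"SUCCEEDED", "FAILED", "RETRYING"},
    "RETRYING": {"RUNNING"},
    "SUCCEEDED": set(),
    "FAILED": set(),
}


def transition(job: Dict, new_state: str) -> None:
    if new_state not in ALLOWED[job["state"]]:
        raise ValueError(f"illegal transition {job['state']} -> {new_state}")
    job["state"] = new_state


def run_job(job: Dict, handler: Callable[[], None],
            is_transient: Callable[[Exception], bool]) -> Dict:
    """Run `handler`, retrying transient failures up to job["max_attempt"]."""
    transition(job, "RUNNING")
    while True:
        job["attempt"] += 1
        try:
            handler()
        except Exception as exc:
            if is_transient(exc) and job["attempt"] < job["max_attempt"]:
                transition(job, "RETRYING")
                transition(job, "RUNNING")
                continue
            job["error_message"] = str(exc)
            transition(job, "FAILED")
            return job
        transition(job, "SUCCEEDED")
        return job


def provision_vm(vm: Dict, create: Callable[[Dict], str]) -> str:
    """Idempotent create: reuse provider_ref if an earlier attempt set it.

    This checks only provider_ref; a full handler would also look up the
    actual state at the provider, as required above.
    """
    if vm.get("provider_ref"):
        return vm["provider_ref"]
    vm["provider_ref"] = create(vm)
    return vm["provider_ref"]
```

The `provider_ref` check covers a retry after the reference was stored. The case where the provider created the VM but the worker crashed before storing the reference is why the handler must also consult the provider's actual state.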
---
## 8. Reconciliation Loop
Reconciliation runs periodically to:
- Update VMs in `PENDING/RUNNING` based on the provider's actual state.
- Detect drift: a VM that is missing at the provider but still ACTIVE in the DB.
- Flag incidents/alerts (post-V1, this can integrate with an incident service).
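A single reconciliation pass might be sketched as below. The mapping from provider power state to VM status, the `DRIFT_VM_MISSING` alert kind, and the use of plain dicts are all assumptions; this specification does not define the VM status vocabulary beyond the values mentioned above.

```python
from typing import Dict, List

# Assumed mapping from a provider's reported state to the HCP VM status.
STATUS_FROM_PROVIDER = {"running": "ACTIVE", "stopped": "STOPPED"}


def reconcile(db_vms: List[Dict], actual_by_ref: Dict[str, str]) -> List[Dict]:
    """Update in-flight VMs from actual state and report missing ACTIVE VMs.

    `db_vms` holds VM records with "id", "provider_ref" and "status" keys.
    `actual_by_ref` maps a provider reference to the state the provider
    reports, e.g. as gathered from ListVMs. Returns the alerts raised.
    """
    alerts: List[Dict] = []
    for vm in db_vms:
        actual = actual_by_ref.get(vm["provider_ref"])
        if actual is None:
            # Drift: the DB says ACTIVE but the provider no longer has the VM.
            if vm["status"] == "ACTIVE":
                alerts.append({"vm_id": vm["id"], "kind": "DRIFT_VM_MISSING"})
            continue
        if vm["status"] in ("PENDING", "RUNNING"):
            # Unknown provider states leave the stored status untouched.
            vm["status"] = STATUS_FROM_PROVIDER.get(actual, vm["status"])
    return alerts
```

The function only reports drift and leaves the DB record unchanged, so the decision to mark the VM as lost or to raise an incident stays with the caller.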
---
## 9. Security Design
### 9.1 AuthN/AuthZ
- JWT bearer token with at least the following claims: `org_id`, `project_bindings`, `roles`, `scopes`
- Tenant scopes must not grant access to ops endpoints.
- Every request must be validated against the `projectId` path parameter.
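These rules could be enforced with a check along the lines of the sketch below. It assumes the JWT signature and expiry have already been verified, that `project_bindings` is a list of project IDs, and that scopes are strings prefixed with `tenant:` or `ops:`. None of these formats, nor the `authorize` helper itself, are defined by this specification.

```python
from typing import Dict, Optional


def authorize(claims: Dict, namespace: str, required_scope: str,
              project_id: Optional[str] = None) -> bool:
    """Return True if verified token `claims` allow the request.

    `namespace` is "tenant", "ops" or "common"; `project_id` is the value of
    the projectId path parameter when the route has one.
    """
    if required_scope not in set(claims.get("scopes", [])):
        return False
    if namespace == "ops":
        # Only ops-prefixed scopes may reach ops endpoints (assumed prefix).
        return required_scope.startswith("ops:")
    if project_id is not None:
        # The path parameter must match one of the caller's bound projects.
        return project_id in claims.get("project_bindings", [])
    return True
```

For example, a token whose `project_bindings` contains only `p1` would be rejected on `GET /projects/p2/vms` even if it carries the required scope.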
### 9.2 Secrets & Credentials
- Provider credentials are stored encrypted.
- Workers/agents use scoped credentials (per cluster/pool) where possible.
- Console sessions use short-lived temporary tokens.
---
## 10. Observability
- Metrics: RPS, latency, error rate, job success/fail, provider latency, retry count
- Structured logs with `trace_id`, `job_id`, `vm_id`
- Tracing end-to-end (OpenTelemetry ready)
---
## 11. Deployment Notes (V1)
- HCP API: stateless, autoscale-ready
- HCP Worker: scales out according to job throughput
- DB: PostgreSQL
- Queue: NATS JetStream or RabbitMQ
- Provider adapter: an internal module inside the worker (v1) or a sidecar/agent (enterprise mode)
---
## 12. Provider Compatibility (Target)
- Proxmox driver as the first implementation
- VMware vSphere driver (post-V1 or developed in parallel)
- Libvirt/KVM driver
- Hyper-V driver
---