add srs and sds documents
srs-sds/HCP_SDS_v1.md
@@ -0,0 +1,244 @@
# Hypervisor Control Plane (HCP)
## Software Design Specification (SDS)
**Version: 1.0 (V1 – Enterprise Foundation)**

---

## 1. Architecture Overview

HCP applies a compute **Control Plane** pattern with the following design:
- **Northbound API**: stable, provider-agnostic, used by the Central Gateway and the console.
- **Core**: job orchestration, policy hooks, state persistence, audit/events.
- **Southbound Provider Layer**: one adapter/driver per hypervisor/provider.
- **Workers/Agents**: execute the jobs that affect infrastructure.

HCP supports two execution modes:
1) **Direct mode**: the worker calls the provider API directly (fast to bootstrap)
2) **Agent mode**: jobs are sent to an agent close to the cluster (more enterprise-grade: multi-site, firewall-friendly)

---

## 2. Main Components

### 2.1 HCP API Service
Responsibilities:
- Expose the REST API (tenant/ops) for compute
- Enforce AuthN/AuthZ (scope-based)
- Validate requests and enforce idempotency
- Persist desired state and job records
- Publish jobs to the queue/stream

### 2.2 HCP Worker Service
Responsibilities:
- Subscribe to the job queue
- Run the job state machine (RUNNING/RETRYING/FAILED/SUCCEEDED)
- Call the provider adapter
- Update VM/job state in the datastore
- Emit audit and metering events

### 2.3 Provider Adapter Layer
Responsibilities:
- Implement the generic provider contract
- Map the generic spec → provider-specific API
- Normalize provider errors → the HCP error taxonomy
- Normalize the VM's actual state → the internal model
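
The error-normalization responsibility can be illustrated with a minimal sketch. The specification does not enumerate the HCP error taxonomy, so the codes in `HcpErrorCode`, the `retryable` flag, and the choice of Python built-in exceptions as stand-ins for provider errors are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class HcpErrorCode(Enum):
    # Hypothetical taxonomy; the specification does not list the actual codes.
    PROVIDER_TIMEOUT = "PROVIDER_TIMEOUT"
    PROVIDER_UNAVAILABLE = "PROVIDER_UNAVAILABLE"
    NOT_FOUND = "NOT_FOUND"
    INTERNAL = "INTERNAL"


@dataclass(frozen=True)
class HcpError:
    code: HcpErrorCode
    message: str
    retryable: bool  # assumed flag so the worker can decide whether to retry


def normalize_error(exc: Exception) -> HcpError:
    """Map a raw adapter exception onto the (assumed) HCP taxonomy."""
    if isinstance(exc, TimeoutError):
        return HcpError(HcpErrorCode.PROVIDER_TIMEOUT, str(exc), retryable=True)
    if isinstance(exc, ConnectionError):
        return HcpError(HcpErrorCode.PROVIDER_UNAVAILABLE, str(exc), retryable=True)
    if isinstance(exc, LookupError):
        return HcpError(HcpErrorCode.NOT_FOUND, str(exc), retryable=False)
    # Anything unrecognized is treated as a permanent internal error.
    return HcpError(HcpErrorCode.INTERNAL, str(exc), retryable=False)
```

A real adapter would match on the provider SDK's own exception types or HTTP status codes; the point is only that callers above the adapter see one vocabulary regardless of provider.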

### 2.4 Data Store
- PostgreSQL for persistent state (tenancy bindings, vms, jobs, catalog, locations)
- Event store (append-only) for audit (can be a dedicated table or a log pipeline)
- Queue/stream for job distribution (NATS JetStream / RabbitMQ)

---

## 3. Domain Model

### 3.1 Core Entities
- `Location/Zone`
- `Provider`
- `ComputeCluster` (pool/cluster per provider, bound to a location)
- `Image` (catalog)
- `Flavor` (catalog)
- `VM`
- `Job`
- `AuditEvent`

### 3.2 Resource Fields (VM)
A VM has at minimum:
- `id`, `org_id`, `project_id`
- `name`, `status`
- `image_id`, `flavor_id`
- `placement` (location_id, cluster_id optional)
- `addresses` (read model)
- `labels/tags`
- `provider_id`
- `provider_ref` (opaque/internal)
- timestamps
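
As a rough illustration of the field list above, the VM record could be modeled as follows. The field names come from the list; the Python types, the default values, and the `created_at`/`updated_at` names chosen for "timestamps" are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Placement:
    location_id: str
    cluster_id: Optional[str] = None  # optional, per the field list


@dataclass
class VM:
    id: str
    org_id: str
    project_id: str
    name: str
    status: str
    image_id: str
    flavor_id: str
    placement: Placement
    provider_id: str
    # Opaque reference; assumed to stay unset until the provider has created the VM.
    provider_ref: Optional[dict] = None
    addresses: list[str] = field(default_factory=list)    # read model
    labels: dict[str, str] = field(default_factory=dict)  # labels/tags
    created_at: datetime = field(default_factory=_now)    # assumed timestamp name
    updated_at: datetime = field(default_factory=_now)    # assumed timestamp name
```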

### 3.3 Job Fields
- `id`, `type`
- `resource_type`, `resource_id`
- `state` (PENDING/RUNNING/SUCCEEDED/FAILED/RETRYING)
- `attempt`, `max_attempt`
- `error_code`, `error_message`
- timestamps

---

## 4. Northbound API Design

### 4.1 Namespace
Separating the tenant and ops namespaces is recommended, even though HCP can be accessed via the Central Gateway:
- Tenant: `/api/hcp/tenant/v1/*`
- Ops: `/api/hcp/ops/v1/*`
- Common catalog (read): `/api/hcp/common/v1/*`

### 4.2 Async Pattern
- Create/modify/delete operations return `202 Accepted` with a `job_id`.
- Job status can be polled: `GET /jobs/{job_id}`.
- The resource can be polled: `GET /vms/{id}`.
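
The accept-then-poll flow can be sketched without any web framework. Everything here is a stand-in: the in-memory `JOBS`, `IDEMPOTENCY`, and `QUEUE` structures replace the database and the queue, and the use of a client-supplied idempotency key is an assumption, since the specification requires idempotency but does not say how it is keyed.

```python
import uuid

JOBS: dict[str, dict] = {}        # stand-in for the jobs table
IDEMPOTENCY: dict[str, str] = {}  # assumed: idempotency key -> job_id
QUEUE: list[str] = []             # stand-in for the queue/stream


def create_vm(project_id: str, spec: dict, idempotency_key: str) -> tuple[int, dict]:
    """Record a job, enqueue it, and answer 202 with the job id."""
    if idempotency_key in IDEMPOTENCY:
        # Replayed request: return the original job instead of enqueueing again.
        return 202, {"job_id": IDEMPOTENCY[idempotency_key]}
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {
        "id": job_id,
        "type": "provision_vm",
        "state": "PENDING",
        "project_id": project_id,
        "spec": spec,
    }
    IDEMPOTENCY[idempotency_key] = job_id
    QUEUE.append(job_id)
    return 202, {"job_id": job_id}


def get_job(job_id: str) -> tuple[int, dict]:
    """Polling target corresponding to GET /jobs/{job_id}."""
    job = JOBS.get(job_id)
    if job is None:
        return 404, {"error": "job not found"}
    return 200, {"id": job["id"], "state": job["state"]}
```

A client would call `create_vm(...)`, keep the returned `job_id`, and poll `get_job(job_id)` until the state is terminal.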

### 4.3 Core Endpoints (V1)
Tenant:
- `POST /projects/{projectId}/vms`
- `GET /projects/{projectId}/vms`
- `GET /projects/{projectId}/vms/{vmId}`
- `POST /projects/{projectId}/vms/{vmId}:start`
- `POST /projects/{projectId}/vms/{vmId}:stop`
- `POST /projects/{projectId}/vms/{vmId}:reboot`
- `DELETE /projects/{projectId}/vms/{vmId}`
- `POST /projects/{projectId}/vms/{vmId}:console`
- `GET /jobs/{jobId}`

Ops:
- `POST /providers`
- `POST /locations`
- `POST /compute-clusters`
- `GET /providers`
- `GET /compute-clusters`
- `POST /catalog/images` (optional in v1, if the platform needs it)
- `POST /catalog/flavors` (optional in v1)

Common:
- `GET /catalog/images`
- `GET /catalog/flavors`
- `GET /capabilities`

---

## 5. Capability Model

HCP stores and exposes capability flags on providers/clusters, for example:
- `supports_cloud_init`
- `supports_snapshot`
- `supports_live_migration`
- `supports_console_vnc`
- `supports_console_spice`
- `supports_uefi`
- `supports_gpu_passthrough`
- `supports_secure_boot`
- `supports_tags`

Usage:
- The UI and upstream services can adapt which features they display.
- The API returns `FEATURE_NOT_SUPPORTED` if an action is not available.
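
A capability gate in front of an action might look like the sketch below. The flag names and the `FEATURE_NOT_SUPPORTED` code come from this section; the action names in `REQUIRED_CAPABILITY` and the exception class are hypothetical.

```python
class FeatureNotSupported(Exception):
    code = "FEATURE_NOT_SUPPORTED"


# Hypothetical mapping from an API action to the flag it depends on.
REQUIRED_CAPABILITY = {
    "snapshot": "supports_snapshot",
    "live_migrate": "supports_live_migration",
    "console_vnc": "supports_console_vnc",
}


def require_capability(action: str, capabilities: dict[str, bool]) -> None:
    """Raise FeatureNotSupported when the cluster lacks the flag an action needs."""
    flag = REQUIRED_CAPABILITY.get(action)
    # Actions without a mapped flag are assumed to be universally available.
    if flag is not None and not capabilities.get(flag, False):
        raise FeatureNotSupported(f"{action} requires {flag}")
```

Treating a missing flag as `False` is a deliberate choice in this sketch: a provider that has not declared a capability is assumed not to have it.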

---

## 6. Provider Interface (Conceptual)

### 6.1 ComputeProvider (minimum)
- `CreateVM(spec) -> ProviderRef`
- `DeleteVM(ref)`
- `StartVM(ref)`
- `StopVM(ref)`
- `RebootVM(ref)`
- `GetVM(ref) -> ActualState`
- `ListVMs(scope) -> []ActualState` (optional, for reconciliation)

### 6.2 ConsoleProvider
- `GetConsole(ref) -> ConsoleSession(type, url/token, expires_at)`

### 6.3 Catalog Providers (optional)
- `ListImages(scope)`
- `ImportImage(source)`
- `DeleteImage(id)`

ProviderRef is opaque:
- `provider` + `external_id` + `location_id` + `extra(json)`
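
One possible rendering of the minimum `ComputeProvider` contract is sketched below in Python, with snake_case method names mirroring the conceptual list. The `power_state` field on `ActualState` and the `InMemoryProvider` toy driver are assumptions added only to exercise the contract; they are not part of the specification and not a real hypervisor driver.

```python
import itertools
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ProviderRef:
    provider: str
    external_id: str
    location_id: str
    extra: dict = field(default_factory=dict)


@dataclass
class ActualState:
    ref: ProviderRef
    power_state: str  # assumed field; the spec leaves ActualState undefined


class ComputeProvider(Protocol):
    def create_vm(self, spec: dict) -> ProviderRef: ...
    def delete_vm(self, ref: ProviderRef) -> None: ...
    def start_vm(self, ref: ProviderRef) -> None: ...
    def stop_vm(self, ref: ProviderRef) -> None: ...
    def reboot_vm(self, ref: ProviderRef) -> None: ...
    def get_vm(self, ref: ProviderRef) -> ActualState: ...


class InMemoryProvider:
    """Toy driver that satisfies the protocol structurally."""

    def __init__(self, name: str = "fake", location_id: str = "loc-1") -> None:
        self.name = name
        self.location_id = location_id
        self._ids = itertools.count(1)
        self._power: dict[str, str] = {}

    def _set_power(self, ref: ProviderRef, state: str) -> None:
        if ref.external_id not in self._power:
            raise KeyError(ref.external_id)
        self._power[ref.external_id] = state

    def create_vm(self, spec: dict) -> ProviderRef:
        external_id = f"vm-{next(self._ids)}"
        self._power[external_id] = "stopped"
        return ProviderRef(self.name, external_id, self.location_id)

    def delete_vm(self, ref: ProviderRef) -> None:
        self._power.pop(ref.external_id, None)  # deleting twice is harmless

    def start_vm(self, ref: ProviderRef) -> None:
        self._set_power(ref, "running")

    def stop_vm(self, ref: ProviderRef) -> None:
        self._set_power(ref, "stopped")

    def reboot_vm(self, ref: ProviderRef) -> None:
        self._set_power(ref, "running")

    def get_vm(self, ref: ProviderRef) -> ActualState:
        return ActualState(ref, self._power[ref.external_id])
```

Because the worker only depends on `ComputeProvider`, a Proxmox or vSphere driver could be swapped in without touching job handlers.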

---

## 7. Job & Workflow

### 7.1 Job Types (V1)
- `provision_vm`
- `start_vm`
- `stop_vm`
- `reboot_vm`
- `delete_vm`

(attach nic/volume can be added if it falls within the platform v1 scope)

### 7.2 State Machine
- PENDING → RUNNING → SUCCEEDED
- PENDING → RUNNING → FAILED
- PENDING → RUNNING → RETRYING → RUNNING ...

Retries apply only to transient errors:
- provider timeout
- temporary network error
- upstream 5xx
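
The transitions and the retry rule above can be expressed as a small table-driven sketch. The error code strings in `TRANSIENT` are hypothetical names for the three transient categories, and moving to FAILED once `attempt` reaches `max_attempt` is an assumption; the specification defines those fields but does not state the exhaustion rule.

```python
# Allowed transitions, taken from the three paths listed in this section.
ALLOWED = {
    "PENDING": {"RUNNING"},
    "RUNNING": {"SUCCEEDED", "FAILED", "RETRYING"},
    "RETRYING": {"RUNNING"},
    "SUCCEEDED": set(),
    "FAILED": set(),
}

# Hypothetical codes for: provider timeout, temporary network error, upstream 5xx.
TRANSIENT = {"PROVIDER_TIMEOUT", "NETWORK_ERROR", "UPSTREAM_5XX"}


def transition(current: str, target: str) -> str:
    """Return the target state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target


def next_state_after_error(error_code: str, attempt: int, max_attempt: int) -> str:
    """Decide where a RUNNING job goes after a failed attempt."""
    if error_code in TRANSIENT and attempt < max_attempt:
        return transition("RUNNING", "RETRYING")
    # Permanent errors, or transient errors with no attempts left (assumed rule).
    return transition("RUNNING", "FAILED")
```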

Idempotency:
- VM creation must be safe to re-execute.
- The handler must check `provider_ref` and the actual state before creating a new resource.
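
A minimal sketch of the check-before-create rule follows. The dictionary-based VM record and the `CountingProvider` stub, including its `create` and `exists` methods, are hypothetical and exist only to show that a re-executed job does not create a second resource.

```python
class CountingProvider:
    """Stub provider that records every resource it creates (illustration only)."""

    def __init__(self) -> None:
        self.created: list[str] = []

    def create(self, name: str) -> str:
        ref = f"ext-{len(self.created) + 1}"
        self.created.append(ref)
        return ref

    def exists(self, ref: str) -> bool:
        return ref in self.created


def provision_vm(vm: dict, provider: CountingProvider) -> dict:
    """Create the VM only if no live provider resource is already recorded."""
    ref = vm.get("provider_ref")
    if ref is not None and provider.exists(ref):
        return vm  # an earlier attempt already created it; nothing to do
    vm["provider_ref"] = provider.create(vm["name"])
    return vm
```

This sketch does not cover a crash between the provider call and persisting `provider_ref`. Handling that window would need an additional lookup, for example by a deterministic name or tag, which the specification does not describe.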

---

## 8. Reconciliation Loop

Reconciliation runs periodically to:
- Update VMs that are `PENDING/RUNNING` based on the provider's actual state.
- Detect drift: a VM that is missing at the provider but still ACTIVE in the DB.
- Flag incidents/alerts (post-V1 this can integrate with an incident service).
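
One reconciliation pass could be sketched as below. It assumes the provider states have already been normalized into the same status vocabulary the DB uses, and that both sides are plain dictionaries keyed by VM id and provider reference; neither detail is specified in this document.

```python
def reconcile(db_vms: dict[str, dict], provider_states: dict[str, str]) -> list[str]:
    """Refresh in-flight VM statuses in place and return ids of drifted VMs.

    db_vms maps vm_id -> {"provider_ref": ..., "status": ...}.
    provider_states maps provider_ref -> normalized status (assumed shape).
    """
    drifted = []
    for vm_id, vm in db_vms.items():
        actual = provider_states.get(vm["provider_ref"])
        if actual is None:
            # Missing at the provider: drift only if the DB still says ACTIVE.
            if vm["status"] == "ACTIVE":
                drifted.append(vm_id)
            continue
        if vm["status"] in ("PENDING", "RUNNING"):
            vm["status"] = actual
    return drifted
```

The returned list is what an alerting hook would consume; the sketch itself raises no alerts.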

---

## 9. Security Design

### 9.1 AuthN/AuthZ
- JWT bearer token with these minimum claims: `org_id`, `project_bindings`, `roles`, `scopes`
- Tenant scope must not access ops endpoints.
- Every request must be validated against the `projectId` path parameter.
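
The two authorization rules above can be sketched as a single check over already-verified claims. Signature verification is out of scope here. The scope strings `hcp.tenant` and `hcp.ops`, and the assumption that `project_bindings` is a list of project ids, are hypothetical.

```python
from typing import Optional

# Hypothetical scope names; the specification does not define them.
REQUIRED_SCOPE = {"tenant": "hcp.tenant", "ops": "hcp.ops"}


def authorize(claims: dict, namespace: str, project_id: Optional[str] = None) -> bool:
    """Return True if verified JWT claims permit access to the namespace/project."""
    required = REQUIRED_SCOPE.get(namespace)
    if required is None or required not in claims.get("scopes", []):
        return False
    if namespace == "tenant":
        # The projectId path parameter must be one of the caller's bound projects.
        return project_id is not None and project_id in claims.get("project_bindings", [])
    return True
```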

### 9.2 Secrets & Credentials
- Provider credentials are stored encrypted.
- Workers/agents use scoped credentials (per cluster/pool) where possible.
- Console sessions use short-lived temporary tokens.

---

## 10. Observability

- Metrics: RPS, latency, error rate, job success/failure, provider latency, retry count
- Structured logs with `trace_id`, `job_id`, `vm_id`
- End-to-end tracing (OpenTelemetry-ready)
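
Structured logs carrying the three correlation fields can be produced with the standard `logging` module alone. The JSON layout and the logger name `hcp.worker` are assumptions in this sketch; the specification only names the fields.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object including the correlation ids."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for key in ("trace_id", "job_id", "vm_id"):
            value = getattr(record, key, None)
            if value is not None:
                payload[key] = value
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("hcp.worker")  # assumed logger name
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Callers attach the ids through the standard `extra` argument, for example `logger.info("job started", extra={"trace_id": "t-1", "job_id": "j-1", "vm_id": "v-1"})`.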

---

## 11. Deployment Notes (V1)

- HCP API: stateless, autoscale-ready
- HCP Worker: scales out with job throughput
- DB: PostgreSQL
- Queue: NATS JetStream or RabbitMQ
- Provider adapter: an internal module in the worker (v1) or a sidecar/agent (enterprise mode)

---

## 12. Provider Compatibility (Target)
- Proxmox driver as the first implementation
- VMware vSphere driver (post-V1 or parallel development)
- Libvirt/KVM driver
- Hyper-V driver

---