add srs and sds documents

Othman H. Suseno
2025-12-30 13:00:47 +07:00
commit eefa9d7035
4 changed files with 716 additions and 0 deletions

244
srs-sds/HCP_SDS_v1.md Normal file

@@ -0,0 +1,244 @@
# Hypervisor Control Plane (HCP)
## Software Design Specification (SDS)
**Version: 1.0 (V1 Enterprise Foundation)**
---
## 1. Architecture Overview
HCP implements a **compute Control Plane** pattern with the following design:
- **Northbound API**: Stable, provider-agnostic, used by the Central Gateway and the console.
- **Core**: Job orchestration, policy hooks, state persistence, audit/event.
- **Southbound Provider Layer**: One adapter/driver per hypervisor/provider.
- **Workers/Agents**: Execute jobs that affect infrastructure.
HCP supports two execution modes:
1) **Direct mode**: the worker calls the provider API directly (fast for bootstrapping)
2) **Agent mode**: jobs are sent to an agent close to the cluster (more enterprise-oriented: multi-site, firewall-friendly)
---
## 2. Main Components
### 2.1 HCP API Service
Responsibilities:
- Expose the REST API (tenant/ops) for compute
- AuthN/AuthZ enforcement (scope-based)
- Request validation + idempotency
- Persist desired state and job records
- Publish jobs to the queue/stream
### 2.2 HCP Worker Service
Responsibilities:
- Subscribe to the job queue
- Run the job state machine (RUNNING/RETRYING/FAILED/SUCCEEDED)
- Call the provider adapter
- Update VM/job state in the datastore
- Emit audit + metering events
### 2.3 Provider Adapter Layer
Responsibilities:
- Implement the generic provider contract
- Map the generic spec → provider-specific API
- Normalize provider errors → HCP error taxonomy
- Normalize VM actual state → internal model
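
The error normalization responsibility above can be sketched as follows. This is a minimal illustration, not the adapter's actual code: the error codes are taken from the HCP error taxonomy, while `HcpError`, `normalize_provider_error`, and the use of Python built-in exceptions as stand-ins for provider failures are assumptions.

```python
from enum import Enum


class HcpErrorCode(str, Enum):
    """Subset of the HCP error taxonomy."""
    NOT_FOUND = "NOT_FOUND"
    PROVIDER_TIMEOUT = "PROVIDER_TIMEOUT"
    PROVIDER_UNAVAILABLE = "PROVIDER_UNAVAILABLE"
    INTERNAL_ERROR = "INTERNAL_ERROR"


class HcpError(Exception):
    """Hypothetical provider-agnostic error raised by adapters."""

    def __init__(self, code: HcpErrorCode, message: str) -> None:
        super().__init__(message)
        self.code = code
        self.message = message


def normalize_provider_error(exc: Exception) -> HcpError:
    """Map an exception raised inside an adapter to an HCP error.

    Assumption: a real driver would inspect provider-specific status
    codes or fault types instead of Python built-in exceptions.
    """
    if isinstance(exc, TimeoutError):
        return HcpError(HcpErrorCode.PROVIDER_TIMEOUT, str(exc))
    if isinstance(exc, ConnectionError):
        return HcpError(HcpErrorCode.PROVIDER_UNAVAILABLE, str(exc))
    if isinstance(exc, KeyError):
        return HcpError(HcpErrorCode.NOT_FOUND, str(exc))
    return HcpError(HcpErrorCode.INTERNAL_ERROR, str(exc))
```

Keeping the mapping in one function per adapter means the worker only ever sees taxonomy codes, whichever provider raised the original failure.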
### 2.4 Data Store
- PostgreSQL for persistent state (tenancy binding, vms, jobs, catalog, locations)
- Event store (append-only) for audit (either a dedicated table or a log pipeline)
- Queue/Stream for job distribution (NATS JetStream / RabbitMQ)
---
## 3. Domain Model
### 3.1 Core Entities
- `Location/Zone`
- `Provider`
- `ComputeCluster` (pool/cluster per provider, bound to a location)
- `Image` (catalog)
- `Flavor` (catalog)
- `VM`
- `Job`
- `AuditEvent`
### 3.2 Resource Fields (VM)
At minimum, a VM has:
- `id`, `org_id`, `project_id`
- `name`, `status`
- `image_id`, `flavor_id`
- `placement` (location_id, optional cluster_id)
- `addresses` (read-model)
- `labels/tags`
- `provider_id`
- `provider_ref` (opaque/internal)
- timestamps
### 3.3 Job Fields
- `id`, `type`
- `resource_type`, `resource_id`
- `state` (PENDING/RUNNING/SUCCEEDED/FAILED/RETRYING)
- `attempt`, `max_attempt`
- `error_code`, `error_message`
- timestamps
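
As an illustration of the job fields above, here is a minimal in-memory sketch. The field names follow section 3.3; the timestamp names (`created_at`, `updated_at`), the UUID id, and the default `max_attempt` of 3 are assumptions.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class JobState(str, Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    RETRYING = "RETRYING"


def _now() -> datetime:
    """Timezone-aware UTC timestamp."""
    return datetime.now(timezone.utc)


@dataclass
class Job:
    type: str                      # e.g. "provision_vm"
    resource_type: str             # e.g. "vm"
    resource_id: str
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    state: JobState = JobState.PENDING
    attempt: int = 0
    max_attempt: int = 3           # assumed default, not specified above
    error_code: Optional[str] = None
    error_message: Optional[str] = None
    created_at: datetime = field(default_factory=_now)  # assumed name
    updated_at: datetime = field(default_factory=_now)  # assumed name
```

A new job starts in `PENDING` with no error fields set, for example `Job(type="provision_vm", resource_type="vm", resource_id="vm-1")`.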
---
## 4. Northbound API Design
### 4.1 Namespace
Separating tenant and ops namespaces is recommended, even though HCP can be accessed via the Central Gateway:
- Tenant: `/api/hcp/tenant/v1/*`
- Ops: `/api/hcp/ops/v1/*`
- Common catalog (read): `/api/hcp/common/v1/*`
### 4.2 Async Pattern
- Create/modify/delete return `202 Accepted` with a `job_id`.
- Job status can be polled: `GET /jobs/{job_id}`.
- Resources can be polled: `GET /vms/{id}`.
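
The async pattern above can be sketched without any web framework. The handlers below return a `(status, body)` pair; the in-memory dictionaries, the function names, and the response body shape are assumptions for illustration only.

```python
import uuid
from typing import Dict, Tuple

JOBS: Dict[str, dict] = {}  # stand-in for the jobs table
VMS: Dict[str, dict] = {}   # stand-in for the vms table


def create_vm(project_id: str, spec: dict) -> Tuple[int, dict]:
    """Persist desired state and a job record, then answer with 202."""
    vm_id = str(uuid.uuid4())
    job_id = str(uuid.uuid4())
    VMS[vm_id] = {**spec, "id": vm_id, "project_id": project_id, "status": "PENDING"}
    JOBS[job_id] = {
        "id": job_id,
        "type": "provision_vm",
        "resource_type": "vm",
        "resource_id": vm_id,
        "state": "PENDING",
    }
    # A real service would publish the job to the queue/stream here.
    return 202, {"resource_id": vm_id, "job_id": job_id}


def get_job(job_id: str) -> Tuple[int, dict]:
    """Polling endpoint: return the stored job or a NOT_FOUND error."""
    job = JOBS.get(job_id)
    if job is None:
        return 404, {"error_code": "NOT_FOUND"}
    return 200, job
```

The caller receives the `job_id` immediately and polls `get_job` until the state leaves `PENDING`/`RUNNING`, so provider latency never blocks the API response.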
### 4.3 Core Endpoints (V1)
Tenant:
- `POST /projects/{projectId}/vms`
- `GET /projects/{projectId}/vms`
- `GET /projects/{projectId}/vms/{vmId}`
- `POST /projects/{projectId}/vms/{vmId}:start`
- `POST /projects/{projectId}/vms/{vmId}:stop`
- `POST /projects/{projectId}/vms/{vmId}:reboot`
- `DELETE /projects/{projectId}/vms/{vmId}`
- `POST /projects/{projectId}/vms/{vmId}:console`
- `GET /jobs/{jobId}`
Ops:
- `POST /providers`
- `POST /locations`
- `POST /compute-clusters`
- `GET /providers`
- `GET /compute-clusters`
- `POST /catalog/images` (optional in v1, if the platform requires it)
- `POST /catalog/flavors` (optional in v1)
Common:
- `GET /catalog/images`
- `GET /catalog/flavors`
- `GET /capabilities`
---
## 5. Capability Model
HCP stores and exposes capability flags on the provider/cluster, for example:
- `supports_cloud_init`
- `supports_snapshot`
- `supports_live_migration`
- `supports_console_vnc`
- `supports_console_spice`
- `supports_uefi`
- `supports_gpu_passthrough`
- `supports_secure_boot`
- `supports_tags`
Usage:
- The UI and upstream services can adapt the features they display.
- The API returns `FEATURE_NOT_SUPPORTED` if an action is not available.
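
A capability check along these lines could guard each action before a job is created. This is a sketch under assumptions: the flag names come from the list above, while the cluster ids, the action-to-flag mapping, and the `FeatureNotSupported` class are hypothetical.

```python
from typing import Dict, Set

# Hypothetical capability sets per cluster.
CLUSTER_CAPABILITIES: Dict[str, Set[str]] = {
    "cluster-a": {"supports_cloud_init", "supports_console_vnc", "supports_snapshot"},
    "cluster-b": {"supports_cloud_init"},
}

# Assumed mapping from a requested action to the flag it needs.
REQUIRED_CAPABILITY: Dict[str, str] = {
    "snapshot": "supports_snapshot",
    "console_vnc": "supports_console_vnc",
}


class FeatureNotSupported(Exception):
    code = "FEATURE_NOT_SUPPORTED"


def is_supported(cluster_id: str, action: str) -> bool:
    """True if the cluster advertises the flag the action requires."""
    flag = REQUIRED_CAPABILITY[action]
    return flag in CLUSTER_CAPABILITIES.get(cluster_id, set())


def require_capability(cluster_id: str, action: str) -> None:
    """Raise FeatureNotSupported when the action is unavailable."""
    if not is_supported(cluster_id, action):
        raise FeatureNotSupported(f"{action} is not supported on {cluster_id}")
```

The same `is_supported` lookup can back the `GET /capabilities` read path, so the UI and the enforcement path agree on what is available.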
---
## 6. Provider Interface (Conceptual)
### 6.1 ComputeProvider (minimum)
- `CreateVM(spec) -> ProviderRef`
- `DeleteVM(ref)`
- `StartVM(ref)`
- `StopVM(ref)`
- `RebootVM(ref)`
- `GetVM(ref) -> ActualState`
- `ListVMs(scope) -> []ActualState` (optional, for reconciliation)
### 6.2 ConsoleProvider
- `GetConsole(ref) -> ConsoleSession(type, url/token, expires_at)`
### 6.3 Catalog Providers (optional)
- `ListImages(scope)`
- `ImportImage(source)`
- `DeleteImage(id)`
ProviderRef is opaque:
- `provider` + `external_id` + `location_id` + `extra(json)`
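
The conceptual `ComputeProvider` contract above could be expressed as a typed interface. The sketch below uses Python naming (`create_vm` rather than `CreateVM`); the `ActualState.power` field, its values, and the `InMemoryProvider` fake are assumptions added for illustration and talk to no real hypervisor.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, Protocol


@dataclass
class ProviderRef:
    provider: str
    external_id: str
    location_id: str
    extra: dict = field(default_factory=dict)


@dataclass
class ActualState:
    ref: ProviderRef
    power: str  # assumed values: "running" or "stopped"


class ComputeProvider(Protocol):
    def create_vm(self, spec: dict) -> ProviderRef: ...
    def delete_vm(self, ref: ProviderRef) -> None: ...
    def start_vm(self, ref: ProviderRef) -> None: ...
    def stop_vm(self, ref: ProviderRef) -> None: ...
    def reboot_vm(self, ref: ProviderRef) -> None: ...
    def get_vm(self, ref: ProviderRef) -> ActualState: ...


class InMemoryProvider:
    """Fake driver for tests; satisfies ComputeProvider structurally."""

    def __init__(self, location_id: str) -> None:
        self._location_id = location_id
        self._power: Dict[str, str] = {}
        self._ids = itertools.count(1)

    def create_vm(self, spec: dict) -> ProviderRef:
        external_id = f"fake-{next(self._ids)}"
        self._power[external_id] = "stopped"
        return ProviderRef("in-memory", external_id, self._location_id)

    def delete_vm(self, ref: ProviderRef) -> None:
        self._power.pop(ref.external_id, None)

    def start_vm(self, ref: ProviderRef) -> None:
        self._power[ref.external_id] = "running"

    def stop_vm(self, ref: ProviderRef) -> None:
        self._power[ref.external_id] = "stopped"

    def reboot_vm(self, ref: ProviderRef) -> None:
        self._power[ref.external_id] = "running"

    def get_vm(self, ref: ProviderRef) -> ActualState:
        return ActualState(ref, self._power[ref.external_id])
```

A fake like this lets worker logic be tested before the first real driver exists, and it shows that callers only ever handle a `ProviderRef`, never provider-native ids.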
---
## 7. Job & Workflow
### 7.1 Job Types (V1)
- `provision_vm`
- `start_vm`
- `stop_vm`
- `reboot_vm`
- `delete_vm`
(attach NIC/volume can be added if it falls within the platform v1 scope)
### 7.2 State Machine
- PENDING → RUNNING → SUCCEEDED
- PENDING → RUNNING → FAILED
- PENDING → RUNNING → RETRYING → RUNNING ...
Retries apply only to transient errors:
- provider timeout
- temporary network error
- 5xx upstream
Idempotency:
- create VM must be safe to execute again.
- the handler must check `provider_ref` and the actual state before creating a new resource.
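
The state machine and retry rule above can be sketched as a small loop. Assumptions: jobs are plain dictionaries, Python's `TimeoutError` and `ConnectionError` stand in for transient provider failures, and `provision_vm` covers only the `provider_ref` half of the idempotency check (a real handler would also consult the actual state at the provider).

```python
from typing import Callable

# Assumed stand-ins for transient provider errors.
TRANSIENT = (TimeoutError, ConnectionError)


def run_job(job: dict, handler: Callable[[], None]) -> dict:
    """Drive a job to SUCCEEDED or FAILED, retrying transient errors only."""
    job["state"] = "RUNNING"
    while True:
        job["attempt"] += 1
        try:
            handler()
        except TRANSIENT as exc:
            if job["attempt"] >= job["max_attempt"]:
                code = "PROVIDER_TIMEOUT" if isinstance(exc, TimeoutError) else "PROVIDER_UNAVAILABLE"
                job.update(state="FAILED", error_code=code, error_message=str(exc))
                return job
            job["state"] = "RETRYING"
            # A real worker would back off or requeue before resuming.
            job["state"] = "RUNNING"
        except Exception as exc:
            # Non-transient errors fail immediately, with no retry.
            job.update(state="FAILED", error_code="INTERNAL_ERROR", error_message=str(exc))
            return job
        else:
            job["state"] = "SUCCEEDED"
            return job


def provision_vm(vm: dict, create: Callable[[dict], str]) -> str:
    """Idempotent create: reuse provider_ref if an earlier attempt set it."""
    if vm.get("provider_ref") is None:
        vm["provider_ref"] = create(vm)
    return vm["provider_ref"]
```

Because delivery may repeat a job, the `provider_ref` check is what keeps a retried `provision_vm` from creating a second VM.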
---
## 8. Reconciliation Loop
Reconciliation runs periodically to:
- Update VMs that are `PENDING/RUNNING` based on the provider's actual state.
- Detect drift: a VM that is missing at the provider but still ACTIVE in the DB.
- Flag incidents/alerts (post-V1 this can integrate with an incident service).
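
One pass of the reconciliation loop could look like the sketch below. It assumes VM rows are dictionaries and that the provider's view has already been collected into a mapping from VM id to observed status, with `None` (or a missing key) meaning the provider no longer knows the VM; both shapes are illustrative.

```python
from typing import Dict, List, Optional


def reconcile(db_vms: List[dict], actual: Dict[str, Optional[str]]) -> List[str]:
    """Update in-flight VMs from provider state; return ids of drifted VMs."""
    drifted: List[str] = []
    for vm in db_vms:
        observed = actual.get(vm["id"])
        if vm["status"] in ("PENDING", "RUNNING") and observed is not None:
            # In-flight record: adopt what the provider reports.
            vm["status"] = observed
        elif vm["status"] == "ACTIVE" and observed is None:
            # Drift: the DB says ACTIVE but the provider has no such VM.
            drifted.append(vm["id"])
    return drifted
```

The returned ids are the candidates for an alert or incident; the sketch deliberately does not auto-correct drift.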
---
## 9. Security Design
### 9.1 AuthN/AuthZ
- JWT bearer token with at least these claims: `org_id`, `project_bindings`, `roles`, `scopes`
- Tenant scope must not access ops endpoints.
- Every request must be validated against the `projectId` path parameter.
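
The authorization rules above can be sketched as a pure function over already-verified claims. The claim names come from this section; the claim shapes (`scopes` as a list containing `"tenant"` or `"ops"`, `project_bindings` as a list of project ids) are assumptions, and signature verification of the JWT is out of scope here.

```python
from typing import Optional


def authorize(claims: dict, namespace: str, project_id: Optional[str] = None) -> bool:
    """Return True if the verified claims allow a request in the namespace."""
    scopes = set(claims.get("scopes", []))
    if namespace == "ops":
        # Only ops-scoped tokens may reach ops endpoints.
        return "ops" in scopes
    if namespace == "tenant":
        # The path projectId must be one the token is bound to.
        bindings = claims.get("project_bindings", [])
        return "tenant" in scopes and project_id in bindings
    return False
```

Running this check inside HCP, even when the Central Gateway already authenticated the caller, keeps the authorization boundary independent of the gateway.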
### 9.2 Secrets & Credentials
- Provider credentials are stored encrypted.
- Workers/agents use scoped credentials (per cluster/pool) where possible.
- Console sessions use temporary (short-lived) tokens.
---
## 10. Observability
- Metrics: RPS, latency, error rate, job success/fail, provider latency, retry count
- Structured logs with `trace_id`, `job_id`, `vm_id`
- Tracing end-to-end (OpenTelemetry ready)
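
Structured logs with the correlation fields above can be produced with the standard `logging` module alone. This is a sketch; the `JsonFormatter` class and the JSON key names `level` and `message` are assumptions.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render a log record as one JSON object with correlation ids."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        # Copy correlation fields passed through `extra=` when present.
        for key in ("trace_id", "job_id", "vm_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)
```

With a handler configured to use this formatter, a call such as `logger.info("job started", extra={"trace_id": "t-1", "job_id": "j-1"})` carries the ids into every log line.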
---
## 11. Deployment Notes (V1)
- HCP API: stateless, autoscale-ready
- HCP Worker: scales out according to job throughput
- DB: PostgreSQL
- Queue: NATS JetStream or RabbitMQ
- Provider adapter: internal module within the worker (v1) or sidecar/agent (enterprise mode)
---
## 12. Provider Compatibility (Target)
- Proxmox driver as the first implementation
- VMware vSphere driver (post-V1 atau parallel development)
- Libvirt/KVM driver
- Hyper-V driver
---

169
srs-sds/HCP_SRS_v1.md Normal file

@@ -0,0 +1,169 @@
# Hypervisor Control Plane (HCP)
## Software Requirements Specification (SRS)
**Version: 1.0 (V1 Enterprise Foundation)**
---
## 1. Summary
This document defines the Software Requirements Specification for the **Hypervisor Control Plane (HCP)**, a control-plane service that provides a *provider-agnostic* API for managing the lifecycle of Virtual Machines (VMs) and related compute resources.
HCP is designed to support many hypervisor/provider backends, including but not limited to:
- Proxmox VE
- VMware vSphere/ESXi
- KVM/QEMU (via libvirt or another API)
- Hyper-V
HCP is a core component accessed by the **Central API Gateway** and/or other services in the platform, using a **desired state + asynchronous jobs** pattern for operations that affect infrastructure.
---
## 2. Goals & Design Principles
### 2.1 Goals
- Provide a consistent compute API across different hypervisors.
- Support multi-tenant and multi-project use with strict access isolation.
- Provide a robust provisioning mechanism: idempotent, retryable, and reconcilable.
- Serve as the enterprise foundation for feature expansion (snapshot, migration, GPU, etc.) through *capability negotiation*.
### 2.2 Principles
- **Provider-agnostic Northbound API**: the API does not expose provider-specific details (e.g. `vmid`, `moid`, `datastore`, `node`).
- **Plugin/Driver model**: providers are integrated through adapters with a clear contract.
- **Async by default** for operations that take time (create, delete, start/stop, attach).
- **Strict authorization boundary**: even with a central gateway in front, HCP still verifies scope/claims itself.
- **Auditability**: every control-plane action can be audited.
---
## 3. Stakeholders & Roles
### 3.1 Tenant-side (via the Central Gateway)
- Project Owner / Admin
- Operator
- Viewer
### 3.2 Provider/Ops-side
- Cloud Operator
- Infrastructure Admin
- Security/Audit Admin
- Break-glass Admin
---
## 4. Scope (V1)
### 4.1 In Scope (MUST)
- VM lifecycle: create, start, stop, reboot, delete
- VM query: list/detail, status, addresses, metadata/tags
- Catalog: images & flavors (read-only for tenants, writable for ops according to platform policy)
- Console access: request a short-lived console session (VNC/SPICE/WebConsole/RDP via abstraction)
- Job system integration: every infrastructure-impacting operation produces a `job_id`
- Capability model: expose feature support per provider and/or per cluster/location
- Placement abstraction: zone/location/cluster (without exposing specific hosts/nodes)
- Audit event emission (minimum: request/started/succeeded/failed)
- An error taxonomy that is consistent across providers
### 4.2 Out of Scope (V1)
- Auto-scaling dan elasticity
- Full billing engine (HCP only emits metering events; final calculation happens in another service)
- Advanced SDN and network orchestration beyond basic NIC attach
- End-to-end backup orchestration (a hook/extension can be prepared)
- Automatic live migration/DRS (optional post-V1)
- Multi-region replication control-plane
---
## 5. Functional Requirements
### 5.1 API & Compute Operations
- HCP SHALL provide an endpoint to create a VM from an approved image/template.
- HCP SHALL provide the operations start, stop, reboot, and delete VM.
- HCP SHALL provide list/detail VM endpoints within the project scope.
- HCP SHALL keep VM state and job state in a persistent datastore.
### 5.2 Asynchronous Jobs
- All operations that modify infrastructure (create/start/stop/reboot/delete) SHALL run asynchronously.
- HCP SHALL return `202 Accepted` with a `resource_id` and a `job_id`.
- HCP SHALL provide an endpoint to query job status and error details.
### 5.3 Multi-Provider Support
- HCP SHALL support integrating multiple providers through adapters/drivers.
- HCP SHALL support provider selection based on placement (zone/location/cluster) and ops policy.
- HCP SHALL store the provider resource reference internally (`provider_ref`) without exposing it in the northbound API.
### 5.4 Capability Negotiation
- HCP SHALL define a *capabilities* model for compute features.
- HCP SHALL return a consistent error when a feature is not supported by the provider (e.g. `FEATURE_NOT_SUPPORTED`).
### 5.5 Console Access Abstraction
- HCP SHALL provide an endpoint to request a VM console session.
- Console sessions SHALL be short-lived (carry an expiry) and SHALL NOT leak long-lived provider credentials.
- HCP SHOULD support several console types (VNC, SPICE, WebConsole, RDP) through an abstract type.
### 5.6 Placement & Location Abstraction
- HCP SHALL support the concepts of `zone/location` and `compute_cluster/pool` for placement.
- The Tenant API SHOULD select placement generically (e.g. by zone) without seeing hosts/nodes.
- The Ops API SHALL manage the zone → provider cluster/pool mapping.
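
The zone → cluster mapping above could drive placement as in the sketch below. The mapping contents, the cluster ids, the `enabled` flag, and the first-enabled selection policy are all hypothetical; the requirement itself only states that ops manage the mapping.

```python
from typing import Dict, List

# Hypothetical ops-managed mapping: zone -> ordered list of provider clusters.
ZONE_MAP: Dict[str, List[dict]] = {
    "zone-a": [
        {"cluster_id": "pve-01", "provider": "proxmox", "enabled": False},
        {"cluster_id": "pve-02", "provider": "proxmox", "enabled": True},
    ],
}


def select_cluster(zone: str) -> dict:
    """Pick the first enabled cluster mapped to the tenant-visible zone."""
    for cluster in ZONE_MAP.get(zone, []):
        if cluster["enabled"]:
            return cluster
    raise LookupError(f"no enabled cluster for zone {zone!r}")
```

The tenant request names only the zone; the chosen `cluster_id` and provider stay internal to HCP.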
### 5.7 Security & Authorization
- HCP SHALL verify the token/JWT on every request.
- HCP SHALL enforce RBAC at the endpoint level and at the resource scope:
  - tenant scope cannot access ops scope
  - users can only access resources in projects they are bound to by a role binding
- HCP SHALL support an idempotency key for create operations so that they are safe to retry.
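
A minimal sketch of idempotency-key handling for create, assuming results are keyed by `(project_id, idempotency_key)`. The in-memory dictionary and the `create_once` helper are hypothetical, and the sketch is not concurrency-safe; a real service would more likely rely on a unique constraint in the datastore.

```python
import uuid
from typing import Callable, Dict, Tuple

# Stand-in for a persistent table of stored responses.
_RESULTS: Dict[Tuple[str, str], dict] = {}


def create_once(project_id: str, idempotency_key: str, create: Callable[[], dict]) -> dict:
    """Run `create` at most once per (project, key); replay the stored result after that."""
    cache_key = (project_id, idempotency_key)
    if cache_key not in _RESULTS:
        _RESULTS[cache_key] = create()
    return _RESULTS[cache_key]


def new_job() -> dict:
    """Hypothetical create operation returning fresh identifiers."""
    return {"resource_id": str(uuid.uuid4()), "job_id": str(uuid.uuid4())}
```

A client that retries `POST` with the same key after a timeout gets back the original `resource_id` and `job_id` instead of a second VM.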
### 5.8 Audit Logging
- HCP SHALL produce audit events for:
  - request received
  - job started
  - job succeeded
  - job failed
- An audit event contains at minimum: actor, action, target, timestamp, result, trace_id.
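
The minimum audit event above maps naturally onto an immutable record. In this sketch the field names come from the requirement; the ISO 8601 UTC timestamp format, the example action string, and the `make_audit_event` helper are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    actor: str
    action: str
    target: str
    timestamp: str
    result: str
    trace_id: str


def make_audit_event(actor: str, action: str, target: str, result: str, trace_id: str) -> AuditEvent:
    """Build an event stamped with the current UTC time (ISO 8601)."""
    stamp = datetime.now(timezone.utc).isoformat()
    return AuditEvent(actor, action, target, stamp, result, trace_id)


def to_json(event: AuditEvent) -> str:
    """Serialize one event as a JSON line for an append-only sink."""
    return json.dumps(asdict(event), sort_keys=True)
```

Freezing the dataclass means an event cannot be modified after creation, which matches the append-only intent of the audit store.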
### 5.9 Error Taxonomy
- HCP SHALL return uniform errors across providers, at minimum:
  - INVALID_REQUEST
  - NOT_FOUND
  - UNAUTHORIZED / FORBIDDEN
  - QUOTA_EXCEEDED
  - FEATURE_NOT_SUPPORTED
  - PROVIDER_UNAVAILABLE
  - PROVIDER_TIMEOUT
  - CONFLICT
  - INTERNAL_ERROR
---
## 6. Non-Functional Requirements
### 6.1 Reliability
- HCP SHALL withstand component restarts without losing state.
- Job processing SHALL be at-least-once, with idempotent handlers.
### 6.2 Availability
- The API service SHALL be stateless and horizontally scalable.
- State (VM/job) SHALL be stored in a persistent datastore.
### 6.3 Performance
- API reads (list/detail) SHOULD be responsive and may use a cached read-model where needed.
- Write operations use async jobs to decouple provider latency from API latency.
### 6.4 Security
- Provider credential secrets SHALL be stored encrypted (Vault-ready / envelope encryption).
- Console sessions SHALL use temporary (short-lived) tokens.
- Input validation SHALL be strict to prevent injection/abuse.
### 6.5 Observability
- HCP SHALL expose metrics (request rate, latency, job success/fail, provider errors).
- HCP SHALL use a trace_id/correlation_id to follow requests end-to-end.
---
## 7. Acceptance Criteria (V1)
- Tenants can create a VM from an available image and monitor its status via the job.
- Tenants can start/stop/reboot/delete a VM with job tracking.
- Ops can register providers/clusters and configure the zone mapping.
- The system can run with at least 1 provider driver (Proxmox) without locking the design against other providers.
- Errors and audit events are consistent and usable for troubleshooting.
---

160
srs-sds/SDS_v1.md Normal file

@@ -0,0 +1,160 @@
# Cloud Infrastructure Management Platform
## Software Design Specification (SDS)
**Version: 1.0 (V1 Enterprise Foundation)**
---
## 1. Architectural Overview
The platform adopts a Control Plane and Data Plane architecture.
- Control Plane manages APIs, identity, orchestration, policy, and state.
- Data Plane executes infrastructure operations via agents and providers.
---
## 2. High-Level Components
### 2.1 Management Layer
- Tenant Management Console
- Provider / Operations Console
In Version 1, both consoles MAY be implemented as a single UI with strict role-based access control.
---
### 2.2 API Gateway
Responsibilities:
- Authentication and authorization
- API namespace separation
- Request validation and rate limiting
- Centralized audit logging hook
---
### 2.3 Core Services
| Service | Responsibility |
|-------|----------------|
| Identity Service | Users, roles, RBAC |
| Resource Manager | Projects, quotas, metadata |
| Compute Service | Virtual machine lifecycle |
| Network Service | Virtual network management |
| Storage Service | Volume or object storage |
| Job Service | Workflow orchestration and retries |
| Audit Service | Append-only audit logging |
| Metering Service | Usage aggregation |
---
## 3. Data Model Overview
### Core Entities
- Organization
- Project
- User
- Role
- Role Binding
- Virtual Machine
- Network
- Volume / Bucket
- Job
- Audit Event
- Quota
- Provider
### Common Resource Attributes
```
id
organization_id
project_id
name
status
labels
provider_reference
created_at
updated_at
```
---
## 4. API Design Principles
- REST-based APIs
- Versioned endpoints
- Clear separation between tenant and provider APIs
### Namespace Examples
- /api/tenant/v1/*
- /api/ops/v1/*
- /api/common/v1/*
---
## 5. Job & Workflow Design
### Job Lifecycle States
- PENDING
- RUNNING
- SUCCEEDED
- FAILED
- RETRYING
### Design Characteristics
- Idempotent create operations
- Retry for transient failures only
- Persistent job state storage
---
## 6. Provider & Agent Architecture
### Provider Interfaces
- Compute Provider
- Network Provider
- Storage Provider
### Agent Responsibilities
- Execute infrastructure-level operations
- Report actual state to the control plane
- Emit audit and telemetry data
---
## 7. Reconciliation Mechanism
- Periodic reconciliation loop
- Desired state vs actual state comparison
- Drift handling via:
- Automated correction
- Operator alert and incident escalation
---
## 8. Security Architecture
- Token-based authentication
- RBAC enforcement across services
- Encrypted secret storage
- Distributed request tracing
---
## 9. Deployment Model (V1)
- Stateless API services
- PostgreSQL as primary datastore
- Message queue for job distribution
- Agent deployment per infrastructure cluster
---
## 10. Future Evolution
- Multi-cluster federation
- Kubernetes services
- Policy-as-Code
- Billing and invoicing
- Application marketplace
---

143
srs-sds/SRS_v1.md Normal file

@@ -0,0 +1,143 @@
# Cloud Infrastructure Management Platform
## Software Requirements Specification (SRS)
**Version: 1.0 (V1 Enterprise Foundation)**
---
## 1. Purpose & Vision
This document defines the Software Requirements Specification (SRS) for the Cloud Infrastructure Management Platform (CIMP).
The platform is designed to deliver enterprise-grade, IaaS-like cloud capabilities inspired by AWS, GCP, and Azure, primarily targeting private and managed cloud environments.
Version 1 focuses on strong architectural foundations, governance, and security, while maintaining a controlled and achievable feature scope.
---
## 2. Target Users & Roles
### 2.1 Tenant Roles
- Tenant Owner
- Project Admin
- Project Operator
- Project Viewer
### 2.2 Provider / Operator Roles
- Cloud Operator
- Infrastructure Administrator
- Security / Audit Administrator
- Break-glass Super Administrator
---
## 3. Scope Definition
### 3.1 In Scope (V1)
- Multi-tenant and multi-project architecture
- Identity and Access Management (RBAC)
- Compute service (Virtual Machine lifecycle)
- Basic virtual networking
- Basic storage service (block or object)
- Asynchronous job execution
- Audit logging (append-only)
- Usage metering and reporting
- Provider / operations management console
### 3.2 Out of Scope (V1)
- Public cloud federation
- Auto-scaling and elasticity
- Kubernetes and container orchestration
- Application marketplace
- Billing or payment gateway
- Advanced SDN automation (BGP / EVPN)
---
## 4. Functional Requirements
### 4.1 Identity & Access Management
- The system SHALL support Organizations (Tenants), Projects, Users, Roles, and Role Bindings.
- The system SHALL enforce strict separation between tenant and provider scopes.
- The system SHALL use token-based API authentication.
- The system SHOULD be extensible to support external Identity Providers (OIDC).
---
### 4.2 Project & Resource Management
- Tenants SHALL be able to create and manage projects.
- Projects SHALL support quota assignment.
- Every resource SHALL belong to exactly one project.
- All resources SHALL include ownership and lifecycle metadata.
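
Quota assignment per project implies a check at request time. The sketch below assumes quotas and usage are simple per-resource counters; the resource names (`vms`, `vcpus`), the `QuotaExceeded` class, and the rule that a resource without a limit is unrestricted are hypothetical.

```python
from typing import Dict


class QuotaExceeded(Exception):
    code = "QUOTA_EXCEEDED"


def check_quota(quota: Dict[str, int], usage: Dict[str, int], request: Dict[str, int]) -> None:
    """Raise QuotaExceeded if granting the request would pass any limit."""
    for resource, amount in request.items():
        limit = quota.get(resource)
        if limit is None:
            continue  # assumed: no limit configured means unrestricted
        if usage.get(resource, 0) + amount > limit:
            raise QuotaExceeded(f"{resource}: limit {limit} would be exceeded")
```

The check runs before any job is created, so a rejected request leaves no partial state behind.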
---
### 4.3 Compute Service
- Tenants SHALL be able to create Virtual Machines from predefined images.
- The system SHALL support start, stop, reboot, and delete operations.
- VM provisioning SHALL be asynchronous.
- VM lifecycle states SHALL be exposed through the API.
---
### 4.4 Network Service
- Tenants SHALL be able to create virtual networks per project.
- Virtual networks SHALL enforce isolation between projects.
- Virtual machines SHALL be attachable to one or more virtual networks.
---
### 4.5 Storage Service
- Tenants SHALL be able to create storage volumes or object buckets.
- Storage resources SHALL be attachable to compute resources where applicable.
- Snapshot functionality MAY be supported depending on backend capability.
---
### 4.6 Job & Workflow Management
- All infrastructure-impacting operations SHALL be executed via an asynchronous job system.
- Each job SHALL return a job identifier.
- Job execution status SHALL be queryable.
---
### 4.7 Audit Logging
- The system SHALL record all control-plane actions.
- Audit logs SHALL include actor, action, target resource, timestamp, and result.
- Audit logs SHALL be immutable and append-only.
---
### 4.8 Metering & Reporting
- The system SHALL collect usage metrics for compute, network, and storage.
- Usage reports SHALL be generated per project and tenant.
- Billing integration is out of scope for V1.
---
### 4.9 Provider / Operations Management
- Operators SHALL be able to onboard infrastructure clusters.
- Operators SHALL be able to define global policies and catalogs.
- Operators SHALL have visibility into tenant activities for auditing and troubleshooting.
---
## 5. Non-Functional Requirements
### 5.1 Security
- RBAC enforcement at the API layer.
- Encryption for sensitive data at rest.
- Full auditability of administrative actions.
### 5.2 Availability
- Control plane services SHALL be stateless.
- The system SHALL tolerate service restarts without data loss.
### 5.3 Scalability
- Horizontal scalability for API services.
- Asynchronous processing for long-running tasks.
### 5.4 Maintainability
- Modular service architecture.
- Clear separation between control plane and data plane.
---