add srs and sds documents
244
srs-sds/HCP_SDS_v1.md
Normal file
@@ -0,0 +1,244 @@
# Hypervisor Control Plane (HCP)
## Software Design Specification (SDS)
**Version: 1.0 (V1 – Enterprise Foundation)**

---

## 1. Architecture Overview

HCP implements a compute **control plane** pattern with the following design:
- **Northbound API**: stable and provider-agnostic; consumed by the Central Gateway and the console.
- **Core**: job orchestration, policy hooks, state persistence, and audit/event handling.
- **Southbound Provider Layer**: one adapter/driver per hypervisor/provider.
- **Workers/Agents**: execute the jobs that affect infrastructure.

HCP supports two execution modes:
1) **Direct mode**: the worker calls the provider API directly (fast to bootstrap).
2) **Agent mode**: jobs are dispatched to an agent close to the cluster (better suited to enterprise deployments: multi-site, firewall-friendly).

---

## 2. Main Components

### 2.1 HCP API Service
Responsibilities:
- Expose the compute REST API (tenant/ops)
- Enforce AuthN/AuthZ (scope-based)
- Validate requests and enforce idempotency
- Persist desired state and job records
- Publish jobs to the queue/stream

### 2.2 HCP Worker Service
Responsibilities:
- Subscribe to the job queue
- Run the job state machine (RUNNING/RETRYING/FAILED/SUCCEEDED)
- Call the provider adapter
- Update VM/job state in the datastore
- Emit audit and metering events

### 2.3 Provider Adapter Layer
Responsibilities:
- Implement the generic provider contract
- Map the generic spec to the provider-specific API
- Normalize provider errors into the HCP error taxonomy
- Normalize the VM's actual state into the internal model

### 2.4 Data Store
- PostgreSQL for persistent state (tenancy bindings, VMs, jobs, catalog, locations)
- Event store (append-only) for audit (either a dedicated table or a log pipeline)
- Queue/stream for job distribution (NATS JetStream / RabbitMQ)

---

## 3. Domain Model

### 3.1 Core Entities
- `Location/Zone`
- `Provider`
- `ComputeCluster` (a pool/cluster per provider, bound to a location)
- `Image` (catalog)
- `Flavor` (catalog)
- `VM`
- `Job`
- `AuditEvent`

### 3.2 Resource Fields (VM)
A VM has at least:
- `id`, `org_id`, `project_id`
- `name`, `status`
- `image_id`, `flavor_id`
- `placement` (`location_id`; `cluster_id` optional)
- `addresses` (read model)
- `labels/tags`
- `provider_id`
- `provider_ref` (opaque/internal)
- timestamps

### 3.3 Job Fields
- `id`, `type`
- `resource_type`, `resource_id`
- `state` (PENDING/RUNNING/SUCCEEDED/FAILED/RETRYING)
- `attempt`, `max_attempt`
- `error_code`, `error_message`
- timestamps

---

## 4. Northbound API Design

### 4.1 Namespace
Separating the tenant and ops namespaces is recommended, even when HCP is accessed through the Central Gateway:
- Tenant: `/api/hcp/tenant/v1/*`
- Ops: `/api/hcp/ops/v1/*`
- Common catalog (read): `/api/hcp/common/v1/*`

### 4.2 Async Pattern
- Create/modify/delete operations return `202 Accepted` with a `job_id`.
- Job status can be polled: `GET /jobs/{jobId}`.
- The resource can be polled: `GET /projects/{projectId}/vms/{vmId}`.
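
The async pattern above can be sketched as follows. This is a minimal illustration, not the HCP implementation: the in-memory `JOBS` dict stands in for the datastore, the function names `accept_create_vm` and `get_job` are hypothetical, and returning HTTP 404 with `NOT_FOUND` for an unknown job is an assumption.

```python
import uuid

# In-memory stand-in for the persistent datastore (assumption for this sketch).
JOBS: dict[str, dict] = {}


def accept_create_vm(project_id: str, spec: dict) -> tuple[int, dict]:
    """Record a PENDING job and answer immediately with 202 Accepted."""
    vm_id = str(uuid.uuid4())
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {
        "id": job_id,
        "type": "provision_vm",
        "resource_type": "vm",
        "resource_id": vm_id,
        "state": "PENDING",
    }
    # The caller gets identifiers to poll, not the finished VM.
    return 202, {"resource_id": vm_id, "job_id": job_id}


def get_job(job_id: str) -> tuple[int, dict]:
    """Polling endpoint: return the current job record."""
    job = JOBS.get(job_id)
    if job is None:
        return 404, {"error_code": "NOT_FOUND"}
    return 200, job
```

A client would call `accept_create_vm`, then poll `get_job` with the returned `job_id` until the state leaves `PENDING`/`RUNNING`.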

### 4.3 Core Endpoints (V1)
Tenant:
- `POST /projects/{projectId}/vms`
- `GET /projects/{projectId}/vms`
- `GET /projects/{projectId}/vms/{vmId}`
- `POST /projects/{projectId}/vms/{vmId}:start`
- `POST /projects/{projectId}/vms/{vmId}:stop`
- `POST /projects/{projectId}/vms/{vmId}:reboot`
- `DELETE /projects/{projectId}/vms/{vmId}`
- `POST /projects/{projectId}/vms/{vmId}:console`
- `GET /jobs/{jobId}`

Ops:
- `POST /providers`
- `POST /locations`
- `POST /compute-clusters`
- `GET /providers`
- `GET /compute-clusters`
- `POST /catalog/images` (optional in V1, if the platform needs it)
- `POST /catalog/flavors` (optional in V1)

Common:
- `GET /catalog/images`
- `GET /catalog/flavors`
- `GET /capabilities`

---

## 5. Capability Model

HCP stores and exposes capability flags per provider/cluster, for example:
- `supports_cloud_init`
- `supports_snapshot`
- `supports_live_migration`
- `supports_console_vnc`
- `supports_console_spice`
- `supports_uefi`
- `supports_gpu_passthrough`
- `supports_secure_boot`
- `supports_tags`

Usage:
- The UI and upstream services can adapt the features they display.
- The API returns `FEATURE_NOT_SUPPORTED` when an action is not available.
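
A capability check along these lines could guard each action. This is a sketch under assumptions: the action names and the `REQUIRED_CAPABILITY` mapping are hypothetical; only the flag names and the `FEATURE_NOT_SUPPORTED` code come from this document.

```python
class FeatureNotSupported(Exception):
    """Raised when the target provider/cluster lacks a required capability."""
    error_code = "FEATURE_NOT_SUPPORTED"


# Hypothetical mapping from an API action to the flag it depends on.
REQUIRED_CAPABILITY = {
    "snapshot": "supports_snapshot",
    "live_migrate": "supports_live_migration",
    "console_vnc": "supports_console_vnc",
}


def require_capability(cluster_caps: dict[str, bool], action: str) -> None:
    """Raise FeatureNotSupported if `action` needs a flag the cluster lacks."""
    flag = REQUIRED_CAPABILITY.get(action)
    # Actions without an entry (e.g. start/stop) need no special capability.
    if flag is not None and not cluster_caps.get(flag, False):
        raise FeatureNotSupported(f"{action} requires {flag}")
```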

---

## 6. Provider Interface (Conceptual)

### 6.1 ComputeProvider (minimum)
- `CreateVM(spec) -> ProviderRef`
- `DeleteVM(ref)`
- `StartVM(ref)`
- `StopVM(ref)`
- `RebootVM(ref)`
- `GetVM(ref) -> ActualState`
- `ListVMs(scope) -> []ActualState` (optional, for reconciliation)

### 6.2 ConsoleProvider
- `GetConsole(ref) -> ConsoleSession(type, url/token, expires_at)`

### 6.3 Catalog Providers (optional)
- `ListImages(scope)`
- `ImportImage(source)`
- `DeleteImage(id)`

`ProviderRef` is opaque:
- `provider` + `external_id` + `location_id` + `extra(json)`
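
One possible rendering of the minimum compute contract is shown below, using Python naming for the operations listed in 6.1. The `InMemoryProvider` class is a made-up toy adapter used only to exercise the contract, and modelling `ActualState` as a plain dict is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class ProviderRef:
    """Opaque handle; callers pass it back without inspecting it."""
    provider: str
    external_id: str
    location_id: str
    extra: dict = field(default_factory=dict)


class ComputeProvider(Protocol):
    """Minimum contract from 6.1 (ListVMs omitted because it is optional)."""
    def create_vm(self, spec: dict) -> ProviderRef: ...
    def delete_vm(self, ref: ProviderRef) -> None: ...
    def start_vm(self, ref: ProviderRef) -> None: ...
    def stop_vm(self, ref: ProviderRef) -> None: ...
    def reboot_vm(self, ref: ProviderRef) -> None: ...
    def get_vm(self, ref: ProviderRef) -> dict: ...


class InMemoryProvider:
    """Hypothetical adapter that keeps VMs in a dict instead of a hypervisor."""

    def __init__(self, location_id: str) -> None:
        self.location_id = location_id
        self._vms: dict[str, dict] = {}
        self._counter = 0

    def create_vm(self, spec: dict) -> ProviderRef:
        self._counter += 1
        external_id = f"vm-{self._counter}"
        self._vms[external_id] = {"name": spec["name"], "power": "stopped"}
        return ProviderRef("inmemory", external_id, self.location_id)

    def delete_vm(self, ref: ProviderRef) -> None:
        # Deleting an already-deleted VM is a no-op, which keeps retries safe.
        self._vms.pop(ref.external_id, None)

    def start_vm(self, ref: ProviderRef) -> None:
        self._vms[ref.external_id]["power"] = "running"

    def stop_vm(self, ref: ProviderRef) -> None:
        self._vms[ref.external_id]["power"] = "stopped"

    def reboot_vm(self, ref: ProviderRef) -> None:
        self._vms[ref.external_id]["power"] = "running"

    def get_vm(self, ref: ProviderRef) -> dict:
        # Return a copy so callers cannot mutate adapter state.
        return dict(self._vms[ref.external_id])
```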

---

## 7. Job & Workflow

### 7.1 Job Types (V1)
- `provision_vm`
- `start_vm`
- `stop_vm`
- `reboot_vm`
- `delete_vm`

(NIC/volume attach jobs can be added if they fall within the platform's V1 scope.)

### 7.2 State Machine
- PENDING → RUNNING → SUCCEEDED
- PENDING → RUNNING → FAILED
- PENDING → RUNNING → RETRYING → RUNNING ...

Retries apply only to transient errors:
- provider timeout
- temporary network error
- upstream 5xx

Idempotency:
- Creating a VM must be safe to re-execute.
- The handler must check `provider_ref` and the actual state before creating a new resource.
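
The transitions, retry rule, and idempotent create described above could look roughly like this. It is a sketch, not the worker's actual code: jobs and VMs are plain dicts, the helper names are hypothetical, and treating `PROVIDER_TIMEOUT` and `PROVIDER_UNAVAILABLE` as the transient error codes is an assumption.

```python
# Allowed transitions, taken from the three paths listed in 7.2.
ALLOWED = {
    "PENDING": {"RUNNING"},
    "RUNNING": {"SUCCEEDED", "FAILED", "RETRYING"},
    "RETRYING": {"RUNNING"},
    "SUCCEEDED": set(),
    "FAILED": set(),
}

# Assumed mapping of "transient" onto error codes.
TRANSIENT = {"PROVIDER_TIMEOUT", "PROVIDER_UNAVAILABLE"}


def transition(job: dict, new_state: str) -> None:
    """Move the job to new_state, rejecting transitions not listed above."""
    if new_state not in ALLOWED[job["state"]]:
        raise ValueError(f"illegal transition {job['state']} -> {new_state}")
    job["state"] = new_state


def start_attempt(job: dict) -> None:
    """Enter RUNNING and count the attempt."""
    transition(job, "RUNNING")
    job["attempt"] += 1


def on_error(job: dict, error_code: str) -> None:
    """Retry transient errors while attempts remain; otherwise fail."""
    job["error_code"] = error_code
    if error_code in TRANSIENT and job["attempt"] < job["max_attempt"]:
        transition(job, "RETRYING")
    else:
        transition(job, "FAILED")


def ensure_created(vm: dict, create_fn, exists_fn):
    """Idempotent create: reuse provider_ref if the resource already exists."""
    ref = vm.get("provider_ref")
    if ref is not None and exists_fn(ref):
        return ref
    vm["provider_ref"] = create_fn(vm)
    return vm["provider_ref"]
```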

---

## 8. Reconciliation Loop

Reconciliation runs periodically to:
- Update VMs in `PENDING/RUNNING` based on the provider's actual state.
- Detect drift: a VM that is missing at the provider but still ACTIVE in the DB.
- Flag an incident/alert (post-V1, this can integrate with an incident service).
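
A single reconciliation pass might be sketched as follows. The function name and data shapes are hypothetical, and the sketch assumes provider states have already been normalized into the same status vocabulary the DB uses.

```python
def reconcile(db_vms: dict[str, dict], provider_states: dict[str, str]) -> list[str]:
    """Refresh in-flight VMs and return the ids of VMs that have drifted.

    db_vms maps vm_id to its DB record; provider_states maps vm_id to the
    normalized state the provider reports (absent key: VM not found there).
    """
    drifted = []
    for vm_id, vm in db_vms.items():
        actual = provider_states.get(vm_id)
        if vm["status"] in ("PENDING", "RUNNING") and actual is not None:
            # In-flight VM: adopt whatever the provider reports.
            vm["status"] = actual
        elif vm["status"] == "ACTIVE" and actual is None:
            # Drift: the DB says ACTIVE but the provider no longer has the VM.
            drifted.append(vm_id)
    return drifted
```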

---

## 9. Security Design

### 9.1 AuthN/AuthZ
- JWT bearer tokens carry at least these claims: `org_id`, `project_bindings`, `roles`, `scopes`
- The tenant scope must not access ops endpoints.
- Every request must be validated against the `projectId` path parameter.
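
The two checks above could be combined as in the sketch below. It assumes the JWT has already been verified and decoded into a dict; the scope names `hcp.tenant` and `hcp.ops`, and the representation of `project_bindings` as a list of project ids, are assumptions rather than part of this specification.

```python
from typing import Optional


def authorize(claims: dict, namespace: str,
              path_project_id: Optional[str] = None) -> Optional[str]:
    """Return an error code, or None when the request is allowed.

    namespace is "tenant" or "ops", matching the API namespaces in 4.1.
    """
    required_scope = f"hcp.{namespace}"  # hypothetical scope naming
    if required_scope not in claims.get("scopes", []):
        # A tenant-scoped token never reaches ops endpoints, and vice versa.
        return "FORBIDDEN"
    if path_project_id is not None:
        # The projectId in the path must be one the caller is bound to.
        if path_project_id not in claims.get("project_bindings", []):
            return "FORBIDDEN"
    return None
```

Running this check inside HCP, rather than relying only on the gateway, matches the intent that HCP validates every request itself.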

### 9.2 Secrets & Credentials
- Provider credentials are stored encrypted.
- Workers/agents use scoped credentials (per cluster/pool) where possible.
- Console sessions use temporary (short-lived) tokens.

---

## 10. Observability

- Metrics: RPS, latency, error rate, job success/failure, provider latency, retry count
- Structured logs with `trace_id`, `job_id`, `vm_id`
- End-to-end tracing (OpenTelemetry-ready)

---

## 11. Deployment Notes (V1)

- HCP API: stateless, autoscale-ready
- HCP Worker: scales out with job throughput
- DB: PostgreSQL
- Queue: NATS JetStream or RabbitMQ
- Provider adapter: an internal module in the worker (V1) or a sidecar/agent (enterprise mode)

---

## 12. Provider Compatibility (Target)
- Proxmox driver as the first implementation
- VMware vSphere driver (post-V1 or parallel development)
- Libvirt/KVM driver
- Hyper-V driver

---
169
srs-sds/HCP_SRS_v1.md
Normal file
@@ -0,0 +1,169 @@
# Hypervisor Control Plane (HCP)
## Software Requirements Specification (SRS)
**Version: 1.0 (V1 – Enterprise Foundation)**

---

## 1. Summary

This document defines the software requirements for the **Hypervisor Control Plane (HCP)**, a control-plane service that provides a *provider-agnostic* API for managing the lifecycle of virtual machines (VMs) and related compute resources.

HCP is designed to support multiple hypervisor/provider backends, including but not limited to:
- Proxmox VE
- VMware vSphere/ESXi
- KVM/QEMU (via libvirt or another API)
- Hyper-V

HCP is a core component accessed by the **Central API Gateway** and/or other platform services. It uses a **desired state + asynchronous jobs** pattern for operations that affect infrastructure.

---

## 2. Goals & Design Principles

### 2.1 Goals
- Provide a consistent compute API across different hypervisors.
- Support multi-tenant and multi-project use with strict access isolation.
- Provide a robust provisioning mechanism: idempotent, retryable, and reconcilable.
- Serve as the enterprise foundation for feature expansion (snapshots, migration, GPU, etc.) through *capability negotiation*.

### 2.2 Principles
- **Provider-agnostic Northbound API**: the API does not expose provider-specific details (e.g. `vmid`, `moid`, `datastore`, `node`).
- **Plugin/Driver model**: providers are integrated through adapters with a clear contract.
- **Async by default** for operations that take time (create, delete, start/stop, attach).
- **Strict authorization boundary**: even with a central gateway in front, HCP still verifies scopes/claims itself.
- **Auditability**: every control-plane action can be audited.

---

## 3. Stakeholders & Roles

### 3.1 Tenant-side (via the Central Gateway)
- Project Owner / Admin
- Operator
- Viewer

### 3.2 Provider/Ops-side
- Cloud Operator
- Infrastructure Admin
- Security/Audit Admin
- Break-glass Admin

---

## 4. Scope (V1)

### 4.1 In Scope (MUST)
- VM lifecycle: create, start, stop, reboot, delete
- VM query: list/detail, status, addresses, metadata/tags
- Catalog: images & flavors (read-only for tenants, writable for ops subject to platform policy)
- Console access: request a short-lived console session (VNC/SPICE/WebConsole/RDP via an abstraction)
- Job system integration: every infrastructure-affecting operation produces a `job_id`
- Capability model: expose feature support per provider and/or per cluster/location
- Placement abstraction: zone/location/cluster (without exposing specific hosts/nodes)
- Audit event emission (minimum: request/started/succeeded/failed)
- A consistent error taxonomy across providers

### 4.2 Out of Scope (V1)
- Auto-scaling and elasticity
- A full billing engine (HCP only emits metering events; final calculation happens in another service)
- Advanced SDN and network orchestration beyond basic NIC attach
- End-to-end backup orchestration (hooks/extensions can be prepared)
- Automatic live migration/DRS (optional post-V1)
- Multi-region control-plane replication

---

## 5. Functional Requirements

### 5.1 API & Compute Operations
- HCP SHALL provide an endpoint to create a VM from an approved image/template.
- HCP SHALL provide the operations start, stop, reboot, and delete for VMs.
- HCP SHALL provide endpoints to list VMs and get VM details within a project's scope.
- HCP SHALL keep VM state and job state in a persistent datastore.

### 5.2 Asynchronous Jobs
- All operations that modify infrastructure (create/start/stop/reboot/delete) SHALL run asynchronously.
- HCP SHALL return `202 Accepted` with a `resource_id` and a `job_id`.
- HCP SHALL provide an endpoint to query job status and error details.

### 5.3 Multi-Provider Support
- HCP SHALL support integrating multiple providers through adapters/drivers.
- HCP SHALL support provider selection based on placement (zone/location/cluster) and ops policy.
- HCP SHALL store provider resource references internally (`provider_ref`) without exposing them in the northbound API.

### 5.4 Capability Negotiation
- HCP SHALL define a capability model for compute features.
- HCP SHALL return a consistent error when a feature is not supported by the provider (e.g. `FEATURE_NOT_SUPPORTED`).

### 5.5 Console Access Abstraction
- HCP SHALL provide an endpoint to request a VM console session.
- Console sessions SHALL be short-lived (have an expiry) and SHALL NOT leak long-lived provider credentials.
- HCP SHOULD support multiple console types (VNC, SPICE, WebConsole, RDP) through an abstract type.

### 5.6 Placement & Location Abstraction
- HCP SHALL support the concepts of `zone/location` and `compute_cluster/pool` for placement.
- The tenant API SHOULD let callers choose placement generically (e.g. by zone) without seeing hosts/nodes.
- The ops API SHALL manage the mapping zone → provider cluster/pool.

### 5.7 Security & Authorization
- HCP SHALL verify the token/JWT on every request.
- HCP SHALL enforce RBAC at the endpoint and resource-scope level:
  - the tenant scope cannot access the ops scope
  - users can only access resources in projects to which a role binding ties them
- HCP SHALL support an idempotency key for create operations so that they are safe to retry.
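
An idempotency key for create could be handled along these lines. This is only a sketch: the class name is hypothetical, the key store is an in-memory dict rather than the persistent datastore, and scoping the key per project is an assumption.

```python
import uuid


class IdempotentCreator:
    """Remembers the response for each (project, key) so retries are safe."""

    def __init__(self) -> None:
        self._seen: dict[tuple[str, str], dict] = {}

    def create_vm(self, project_id: str, idempotency_key: str, spec: dict) -> dict:
        cache_key = (project_id, idempotency_key)
        if cache_key in self._seen:
            # Retry of an earlier request: return the original identifiers
            # instead of creating a second VM and job.
            return self._seen[cache_key]
        response = {
            "resource_id": str(uuid.uuid4()),
            "job_id": str(uuid.uuid4()),
        }
        self._seen[cache_key] = response
        return response
```

A client that times out can resend the same request with the same key and receive the same `resource_id` and `job_id`.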

### 5.8 Audit Logging
- HCP SHALL produce an audit event for:
  - request received
  - job started
  - job succeeded
  - job failed
- An audit event contains at least: actor, action, target, timestamp, result, trace_id.

### 5.9 Error Taxonomy
- HCP SHALL return uniform errors across providers, at minimum:
  - INVALID_REQUEST
  - NOT_FOUND
  - UNAUTHORIZED / FORBIDDEN
  - QUOTA_EXCEEDED
  - FEATURE_NOT_SUPPORTED
  - PROVIDER_UNAVAILABLE
  - PROVIDER_TIMEOUT
  - CONFLICT
  - INTERNAL_ERROR
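
The taxonomy can be expressed as an enum, with each adapter normalizing its raw errors into it. In the sketch below, the raw error strings in `RAW_ERROR_MAP` are invented for illustration and do not correspond to any real provider's messages; falling back to `INTERNAL_ERROR` for unknown errors is likewise an assumption.

```python
from enum import Enum


class HcpError(str, Enum):
    """Error codes listed in 5.9."""
    INVALID_REQUEST = "INVALID_REQUEST"
    NOT_FOUND = "NOT_FOUND"
    UNAUTHORIZED = "UNAUTHORIZED"
    FORBIDDEN = "FORBIDDEN"
    QUOTA_EXCEEDED = "QUOTA_EXCEEDED"
    FEATURE_NOT_SUPPORTED = "FEATURE_NOT_SUPPORTED"
    PROVIDER_UNAVAILABLE = "PROVIDER_UNAVAILABLE"
    PROVIDER_TIMEOUT = "PROVIDER_TIMEOUT"
    CONFLICT = "CONFLICT"
    INTERNAL_ERROR = "INTERNAL_ERROR"


# Made-up raw provider errors mapped to the shared taxonomy.
RAW_ERROR_MAP = {
    "connection refused": HcpError.PROVIDER_UNAVAILABLE,
    "timed out": HcpError.PROVIDER_TIMEOUT,
    "no such vm": HcpError.NOT_FOUND,
    "vm is locked": HcpError.CONFLICT,
}


def normalize(raw_error: str) -> HcpError:
    """Translate a raw provider error; unknown errors become INTERNAL_ERROR."""
    return RAW_ERROR_MAP.get(raw_error.strip().lower(), HcpError.INTERNAL_ERROR)
```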

---

## 6. Non-Functional Requirements

### 6.1 Reliability
- HCP SHALL survive component restarts without losing state.
- Job processing SHALL be at-least-once, with idempotent handlers.

### 6.2 Availability
- The API service SHALL be stateless and horizontally scalable.
- State (VM/job) SHALL be stored in a persistent datastore.

### 6.3 Performance
- Read APIs (list/detail) SHOULD be responsive and can use a cached read model where needed.
- Write operations use asynchronous jobs to decouple provider latency from API latency.

### 6.4 Security
- Provider credentials and secrets SHALL be stored encrypted (Vault-ready / envelope encryption).
- Console sessions SHALL use temporary (short-lived) tokens.
- Input validation SHALL be strict to prevent injection/abuse.

### 6.5 Observability
- HCP SHALL expose metrics (request rate, latency, job success/failure, provider errors).
- HCP SHALL use a trace_id/correlation_id to follow requests end to end.

---

## 7. Acceptance Criteria (V1)
- A tenant can create a VM from an available image and monitor its status via the job.
- A tenant can start/stop/reboot/delete a VM with job tracking.
- Ops can register providers/clusters and configure the zone mapping.
- The system can run with at least one provider driver (Proxmox) without locking the design out of other providers.
- Errors and audit events are consistent and usable for troubleshooting.

---
160
srs-sds/SDS_v1.md
Normal file
@@ -0,0 +1,160 @@
# Cloud Infrastructure Management Platform
## Software Design Specification (SDS)
**Version: 1.0 (V1 – Enterprise Foundation)**

---

## 1. Architectural Overview

The platform adopts a Control Plane and Data Plane architecture.

- Control Plane manages APIs, identity, orchestration, policy, and state.
- Data Plane executes infrastructure operations via agents and providers.

---

## 2. High-Level Components

### 2.1 Management Layer
- Tenant Management Console
- Provider / Operations Console

In Version 1, both consoles MAY be implemented as a single UI with strict role-based access control.

---

### 2.2 API Gateway
Responsibilities:
- Authentication and authorization
- API namespace separation
- Request validation and rate limiting
- Centralized audit logging hook

---

### 2.3 Core Services

| Service | Responsibility |
|-------|----------------|
| Identity Service | Users, roles, RBAC |
| Resource Manager | Projects, quotas, metadata |
| Compute Service | Virtual machine lifecycle |
| Network Service | Virtual network management |
| Storage Service | Volume or object storage |
| Job Service | Workflow orchestration and retries |
| Audit Service | Append-only audit logging |
| Metering Service | Usage aggregation |

---

## 3. Data Model Overview

### Core Entities
- Organization
- Project
- User
- Role
- Role Binding
- Virtual Machine
- Network
- Volume / Bucket
- Job
- Audit Event
- Quota
- Provider

### Common Resource Attributes
```
id
organization_id
project_id
name
status
labels
provider_reference
created_at
updated_at
```

---

## 4. API Design Principles

- REST-based APIs
- Versioned endpoints
- Clear separation between tenant and provider APIs

### Namespace Examples
- `/api/tenant/v1/*`
- `/api/ops/v1/*`
- `/api/common/v1/*`

---

## 5. Job & Workflow Design

### Job Lifecycle States
- PENDING
- RUNNING
- SUCCEEDED
- FAILED
- RETRYING

### Design Characteristics
- Idempotent create operations
- Retry for transient failures only
- Persistent job state storage

---

## 6. Provider & Agent Architecture

### Provider Interfaces
- Compute Provider
- Network Provider
- Storage Provider

### Agent Responsibilities
- Execute infrastructure-level operations
- Report actual state to the control plane
- Emit audit and telemetry data

---

## 7. Reconciliation Mechanism

- Periodic reconciliation loop
- Desired state vs actual state comparison
- Drift handling via:
  - Automated correction
  - Operator alert and incident escalation

---

## 8. Security Architecture

- Token-based authentication
- RBAC enforcement across services
- Encrypted secret storage
- Distributed request tracing

---

## 9. Deployment Model (V1)

- Stateless API services
- PostgreSQL as primary datastore
- Message queue for job distribution
- Agent deployment per infrastructure cluster

---

## 10. Future Evolution

- Multi-cluster federation
- Kubernetes services
- Policy-as-Code
- Billing and invoicing
- Application marketplace

---
143
srs-sds/SRS_v1.md
Normal file
@@ -0,0 +1,143 @@
# Cloud Infrastructure Management Platform
## Software Requirements Specification (SRS)
**Version: 1.0 (V1 – Enterprise Foundation)**

---

## 1. Purpose & Vision

This document defines the Software Requirements Specification (SRS) for the Cloud Infrastructure Management Platform (CIMP).

The platform is designed to deliver enterprise-grade, IaaS-like cloud capabilities inspired by AWS, GCP, and Azure, primarily targeting private and managed cloud environments.

Version 1 focuses on strong architectural foundations, governance, and security, while maintaining a controlled and achievable feature scope.

---

## 2. Target Users & Roles

### 2.1 Tenant Roles
- Tenant Owner
- Project Admin
- Project Operator
- Project Viewer

### 2.2 Provider / Operator Roles
- Cloud Operator
- Infrastructure Administrator
- Security / Audit Administrator
- Break-glass Super Administrator

---

## 3. Scope Definition

### 3.1 In Scope (V1)
- Multi-tenant and multi-project architecture
- Identity and Access Management (RBAC)
- Compute service (Virtual Machine lifecycle)
- Basic virtual networking
- Basic storage service (block or object)
- Asynchronous job execution
- Audit logging (append-only)
- Usage metering and reporting
- Provider / operations management console

### 3.2 Out of Scope (V1)
- Public cloud federation
- Auto-scaling and elasticity
- Kubernetes and container orchestration
- Application marketplace
- Billing or payment gateway
- Advanced SDN automation (BGP / EVPN)

---

## 4. Functional Requirements

### 4.1 Identity & Access Management
- The system SHALL support Organizations (Tenants), Projects, Users, Roles, and Role Bindings.
- The system SHALL enforce strict separation between tenant and provider scopes.
- The system SHALL use token-based API authentication.
- The system SHOULD be extensible to support external Identity Providers (OIDC).

---

### 4.2 Project & Resource Management
- Tenants SHALL be able to create and manage projects.
- Projects SHALL support quota assignment.
- Every resource SHALL belong to exactly one project.
- All resources SHALL include ownership and lifecycle metadata.

---

### 4.3 Compute Service
- Tenants SHALL be able to create Virtual Machines from predefined images.
- The system SHALL support start, stop, reboot, and delete operations.
- VM provisioning SHALL be asynchronous.
- VM lifecycle states SHALL be exposed through the API.

---

### 4.4 Network Service
- Tenants SHALL be able to create virtual networks per project.
- Virtual networks SHALL enforce isolation between projects.
- Virtual machines SHALL be attachable to one or more virtual networks.

---

### 4.5 Storage Service
- Tenants SHALL be able to create storage volumes or object buckets.
- Storage resources SHALL be attachable to compute resources where applicable.
- Snapshot functionality MAY be supported depending on backend capability.

---

### 4.6 Job & Workflow Management
- All infrastructure-impacting operations SHALL be executed via an asynchronous job system.
- Each job submission SHALL return a job identifier.
- Job execution status SHALL be queryable.

---

### 4.7 Audit Logging
- The system SHALL record all control-plane actions.
- Audit logs SHALL include actor, action, target resource, timestamp, and result.
- Audit logs SHALL be immutable and append-only.
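
At the application level, the audit requirements above might be modelled as in this sketch. The class and method names are hypothetical, and real immutability would also need to be enforced by the storage layer; the sketch only shows frozen records and a log that exposes no update or delete operation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    """One audit record; frozen so fields cannot be changed after creation."""
    actor: str
    action: str
    target: str
    timestamp: datetime
    result: str


class AuditLog:
    """Append-only log: events can be added and read, never edited or removed."""

    def __init__(self) -> None:
        self._events: list[AuditEvent] = []

    def append(self, actor: str, action: str, target: str, result: str) -> AuditEvent:
        event = AuditEvent(actor, action, target, datetime.now(timezone.utc), result)
        self._events.append(event)
        return event

    def events(self) -> tuple[AuditEvent, ...]:
        # A tuple copy keeps callers from mutating the internal list.
        return tuple(self._events)
```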

---

### 4.8 Metering & Reporting
- The system SHALL collect usage metrics for compute, network, and storage.
- Usage reports SHALL be generated per project and tenant.
- Billing integration is out of scope for V1.

---

### 4.9 Provider / Operations Management
- Operators SHALL be able to onboard infrastructure clusters.
- Operators SHALL be able to define global policies and catalogs.
- Operators SHALL have visibility into tenant activities for auditing and troubleshooting.

---

## 5. Non-Functional Requirements

### 5.1 Security
- RBAC enforcement at the API layer.
- Encryption for sensitive data at rest.
- Full auditability of administrative actions.

### 5.2 Availability
- Control plane services SHALL be stateless.
- The system SHALL tolerate service restarts without data loss.

### 5.3 Scalability
- Horizontal scalability for API services.
- Asynchronous processing for long-running tasks.

### 5.4 Maintainability
- Modular service architecture.
- Clear separation between control plane and data plane.

---