add srs and sds documents
srs-sds/HCP_SDS_v1.md
@@ -0,0 +1,244 @@
# Hypervisor Control Plane (HCP)
## Software Design Specification (SDS)
**Version: 1.0 (V1 – Enterprise Foundation)**

---

## 1. Architecture Overview

HCP follows a compute **control plane** pattern built from four layers:
- **Northbound API**: stable and provider-agnostic; used by the Central Gateway and the console.
- **Core**: job orchestration, policy hooks, state persistence, audit/events.
- **Southbound Provider Layer**: one adapter/driver per hypervisor/provider.
- **Workers/Agents**: execute the jobs that affect infrastructure.

HCP supports two execution modes:
1) **Direct mode**: the worker calls the provider API directly (fast to bootstrap)
2) **Agent mode**: jobs are dispatched to an agent running close to the cluster (better suited to enterprise deployments: multi-site, firewall-friendly)


---

## 2. Main Components

### 2.1 HCP API Service
Responsibilities:
- Expose the compute REST API (tenant/ops)
- Enforce AuthN/AuthZ (scope-based)
- Validate requests and handle idempotency
- Persist desired state and job records
- Publish jobs to the queue/stream

### 2.2 HCP Worker Service
Responsibilities:
- Subscribe to the job queue
- Run the job state machine (RUNNING/RETRYING/FAILED/SUCCEEDED)
- Call the provider adapter
- Update VM/job state in the datastore
- Emit audit and metering events

### 2.3 Provider Adapter Layer
Responsibilities:
- Implement the generic provider contract
- Map the generic spec to the provider-specific API
- Normalize provider errors into the HCP error taxonomy
- Normalize the actual VM state into the internal model

### 2.4 Data Store
- PostgreSQL for persistent state (tenancy bindings, VMs, jobs, catalog, locations)
- Event store (append-only) for audit (either a dedicated table or a log pipeline)
- Queue/stream for job distribution (NATS JetStream / RabbitMQ)


---

## 3. Domain Model

### 3.1 Core Entities
- `Location/Zone`
- `Provider`
- `ComputeCluster` (a pool/cluster per provider, bound to a location)
- `Image` (catalog)
- `Flavor` (catalog)
- `VM`
- `Job`
- `AuditEvent`

### 3.2 Resource Fields (VM)
A VM has at least:
- `id`, `org_id`, `project_id`
- `name`, `status`
- `image_id`, `flavor_id`
- `placement` (`location_id`, optional `cluster_id`)
- `addresses` (read model)
- `labels/tags`
- `provider_id`
- `provider_ref` (opaque/internal)
- timestamps

### 3.3 Job Fields
- `id`, `type`
- `resource_type`, `resource_id`
- `state` (PENDING/RUNNING/SUCCEEDED/FAILED/RETRYING)
- `attempt`, `max_attempt`
- `error_code`, `error_message`
- timestamps

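The VM and Job field lists in this section could be modelled roughly as follows. This is a minimal sketch, not the project's schema: the `Placement` class, the `to_public` helper, the timestamp names `created_at`/`updated_at`, and the default `max_attempt = 3` are assumptions made for illustration. `to_public` shows one way to keep the internal `provider_ref` out of northbound responses.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class JobState(str, Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    RETRYING = "RETRYING"


def utc_now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Placement:
    location_id: str
    cluster_id: Optional[str] = None  # optional in the VM field list


@dataclass
class VM:
    id: str
    org_id: str
    project_id: str
    name: str
    status: str
    image_id: str
    flavor_id: str
    placement: Placement
    provider_id: str
    provider_ref: Optional[dict] = None  # opaque, internal only
    addresses: list = field(default_factory=list)  # read model
    labels: dict = field(default_factory=dict)
    created_at: datetime = field(default_factory=utc_now)  # assumed name
    updated_at: datetime = field(default_factory=utc_now)  # assumed name

    def to_public(self) -> dict:
        """Northbound view: every field except the internal provider_ref."""
        data = asdict(self)
        data.pop("provider_ref")
        return data


@dataclass
class Job:
    id: str
    type: str
    resource_type: str
    resource_id: str
    state: JobState = JobState.PENDING
    attempt: int = 0
    max_attempt: int = 3  # assumed default, not taken from the spec
    error_code: Optional[str] = None
    error_message: Optional[str] = None
    created_at: datetime = field(default_factory=utc_now)
    updated_at: datetime = field(default_factory=utc_now)
```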

---

## 4. Northbound API Design

### 4.1 Namespace
Tenant and ops namespaces should be kept separate, even when HCP is reached through the Central Gateway:
- Tenant: `/api/hcp/tenant/v1/*`
- Ops: `/api/hcp/ops/v1/*`
- Common catalog (read): `/api/hcp/common/v1/*`

### 4.2 Async Pattern
- Create/modify/delete return `202 Accepted` with a `job_id`.
- Job status can be polled: `GET /jobs/{jobId}`.
- The resource can be polled: `GET /projects/{projectId}/vms/{vmId}`.

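From the caller's side, the async pattern amounts to polling the job until it reaches a terminal state. The sketch below is illustrative only: `wait_for_job` and `FakeJobApi` are hypothetical names, the stub stands in for `GET /jobs/{job_id}`, and the choice of `SUCCEEDED`/`FAILED` as the terminal states is inferred from the job state list rather than stated by the spec.

```python
import time
from typing import Callable

TERMINAL_STATES = {"SUCCEEDED", "FAILED"}  # assumed terminal states


def wait_for_job(get_job: Callable[[str], dict], job_id: str, *,
                 interval: float = 1.0, timeout: float = 60.0,
                 sleep=time.sleep, clock=time.monotonic) -> dict:
    """Poll a job until it is terminal or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        job = get_job(job_id)
        if job["state"] in TERMINAL_STATES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['state']} after {timeout}s")
        sleep(interval)


class FakeJobApi:
    """Hypothetical stand-in for the job endpoint: replays a list of states."""

    def __init__(self, states):
        self._states = iter(states)

    def get(self, job_id: str) -> dict:
        return {"id": job_id, "state": next(self._states)}


api = FakeJobApi(["PENDING", "RUNNING", "SUCCEEDED"])
final = wait_for_job(api.get, "job-1", interval=0, sleep=lambda s: None)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays; a production client would more likely add backoff between polls.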
### 4.3 Core Endpoints (V1)
Tenant:
- `POST /projects/{projectId}/vms`
- `GET /projects/{projectId}/vms`
- `GET /projects/{projectId}/vms/{vmId}`
- `POST /projects/{projectId}/vms/{vmId}:start`
- `POST /projects/{projectId}/vms/{vmId}:stop`
- `POST /projects/{projectId}/vms/{vmId}:reboot`
- `DELETE /projects/{projectId}/vms/{vmId}`
- `POST /projects/{projectId}/vms/{vmId}:console`
- `GET /jobs/{jobId}`

Ops:
- `POST /providers`
- `POST /locations`
- `POST /compute-clusters`
- `GET /providers`
- `GET /compute-clusters`
- `POST /catalog/images` (optional in V1, if the platform needs it)
- `POST /catalog/flavors` (optional in V1)

Common:
- `GET /catalog/images`
- `GET /catalog/flavors`
- `GET /capabilities`


---

## 5. Capability Model

HCP stores and exposes capability flags per provider/cluster, for example:
- `supports_cloud_init`
- `supports_snapshot`
- `supports_live_migration`
- `supports_console_vnc`
- `supports_console_spice`
- `supports_uefi`
- `supports_gpu_passthrough`
- `supports_secure_boot`
- `supports_tags`

Usage:
- The UI and upstream services can adapt which features they display.
- The API returns `FEATURE_NOT_SUPPORTED` when an action is not available.

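A capability check before enqueueing a job might look like the sketch below. The action names and the `REQUIRED_CAPABILITY` mapping are hypothetical; only the flag names and the `FEATURE_NOT_SUPPORTED` code come from this section. Treating a missing flag as "not supported" is a design assumption that errs on the safe side.

```python
class FeatureNotSupported(Exception):
    code = "FEATURE_NOT_SUPPORTED"

    def __init__(self, capability: str):
        super().__init__(f"{capability} is not available on this cluster")
        self.capability = capability


# Hypothetical mapping from an API action to the flag it depends on.
REQUIRED_CAPABILITY = {
    "snapshot": "supports_snapshot",
    "live_migrate": "supports_live_migration",
    "console_vnc": "supports_console_vnc",
}


def require_capability(cluster_caps: dict, action: str) -> None:
    """Raise FeatureNotSupported if the cluster lacks the flag for `action`."""
    flag = REQUIRED_CAPABILITY.get(action)
    if flag is not None and not cluster_caps.get(flag, False):
        raise FeatureNotSupported(flag)


caps = {"supports_snapshot": False, "supports_console_vnc": True}
require_capability(caps, "console_vnc")  # passes silently
```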

---

## 6. Provider Interface (Conceptual)

### 6.1 ComputeProvider (minimum)
- `CreateVM(spec) -> ProviderRef`
- `DeleteVM(ref)`
- `StartVM(ref)`
- `StopVM(ref)`
- `RebootVM(ref)`
- `GetVM(ref) -> ActualState`
- `ListVMs(scope) -> []ActualState` (optional, for reconciliation)

### 6.2 ConsoleProvider
- `GetConsole(ref) -> ConsoleSession(type, url/token, expires_at)`

### 6.3 Catalog Providers (optional)
- `ListImages(scope)`
- `ImportImage(source)`
- `DeleteImage(id)`

`ProviderRef` is opaque:
- `provider` + `external_id` + `location_id` + `extra(json)`

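The minimum `ComputeProvider` contract can be expressed as an abstract base class with an in-memory test double behind it. This is a sketch under assumptions: the snake_case method names, the dict-shaped spec and actual state, and `InMemoryProvider` itself are illustrative and do not describe any real driver. A test double like this lets worker logic be exercised without a hypervisor.

```python
import itertools
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ProviderRef:
    provider: str
    external_id: str
    location_id: str
    extra: dict = field(default_factory=dict)


class ComputeProvider(ABC):
    """Minimum contract from section 6.1 (names adapted to Python style)."""

    @abstractmethod
    def create_vm(self, spec: dict) -> ProviderRef: ...

    @abstractmethod
    def delete_vm(self, ref: ProviderRef) -> None: ...

    @abstractmethod
    def start_vm(self, ref: ProviderRef) -> None: ...

    @abstractmethod
    def stop_vm(self, ref: ProviderRef) -> None: ...

    @abstractmethod
    def reboot_vm(self, ref: ProviderRef) -> None: ...

    @abstractmethod
    def get_vm(self, ref: ProviderRef) -> dict: ...


class InMemoryProvider(ComputeProvider):
    """Hypothetical test double that keeps VMs in a dict."""

    def __init__(self, location_id: str = "loc-1"):
        self._vms = {}
        self._ids = itertools.count(100)
        self._location_id = location_id

    def create_vm(self, spec: dict) -> ProviderRef:
        external_id = str(next(self._ids))
        self._vms[external_id] = {"name": spec["name"], "power": "stopped"}
        return ProviderRef("inmemory", external_id, self._location_id)

    def delete_vm(self, ref: ProviderRef) -> None:
        self._vms.pop(ref.external_id, None)  # deleting twice is harmless

    def start_vm(self, ref: ProviderRef) -> None:
        self._vms[ref.external_id]["power"] = "running"

    def stop_vm(self, ref: ProviderRef) -> None:
        self._vms[ref.external_id]["power"] = "stopped"

    def reboot_vm(self, ref: ProviderRef) -> None:
        self.stop_vm(ref)
        self.start_vm(ref)

    def get_vm(self, ref: ProviderRef) -> dict:
        return dict(self._vms[ref.external_id])
```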

---

## 7. Job & Workflow

### 7.1 Job Types (V1)
- `provision_vm`
- `start_vm`
- `stop_vm`
- `reboot_vm`
- `delete_vm`

(attach NIC/volume can be added if it falls within the platform V1 scope)

### 7.2 State Machine
- PENDING → RUNNING → SUCCEEDED
- PENDING → RUNNING → FAILED
- PENDING → RUNNING → RETRYING → RUNNING ...

Retries apply only to transient errors:
- provider timeout
- temporary network error
- 5xx from upstream

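The transitions and the retry rule above can be captured in a small table-driven sketch. The error codes used to mark an error as transient (`PROVIDER_TIMEOUT`, `PROVIDER_UNAVAILABLE`) are borrowed from the HCP error taxonomy; mapping "temporary network error" and "5xx from upstream" onto `PROVIDER_UNAVAILABLE` is an assumption, as is failing the job once `attempt` reaches `max_attempt`.

```python
ALLOWED_TRANSITIONS = {
    "PENDING": {"RUNNING"},
    "RUNNING": {"SUCCEEDED", "FAILED", "RETRYING"},
    "RETRYING": {"RUNNING"},
    "SUCCEEDED": set(),  # terminal
    "FAILED": set(),     # terminal
}

# Assumed classification of transient errors.
TRANSIENT_ERRORS = {"PROVIDER_TIMEOUT", "PROVIDER_UNAVAILABLE"}


def transition(current: str, new: str) -> str:
    """Return `new` if the move is allowed, otherwise raise ValueError."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal job transition {current} -> {new}")
    return new


def state_after_error(error_code: str, attempt: int, max_attempt: int) -> str:
    """Retry only transient errors, and only while attempts remain."""
    if error_code in TRANSIENT_ERRORS and attempt < max_attempt:
        return "RETRYING"
    return "FAILED"
```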
Idempotency:
- Creating a VM must be safe to execute more than once.
- The handler must check `provider_ref` and the actual state before creating a new resource.

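One way to satisfy the idempotency rule is for the handler to return early when the VM record already carries a `provider_ref` that still resolves at the provider. The sketch below uses a hypothetical `FakeProvider` and a plain dict as the VM record. It does not cover a crash between the provider call and persisting `provider_ref`; closing that gap would need something extra, such as looking the VM up by a deterministic name or tag, which is outside what this document specifies.

```python
class FakeProvider:
    """Hypothetical provider stub that counts how many VMs it created."""

    def __init__(self):
        self.created = 0
        self.vms = {}

    def create_vm(self, spec: dict) -> dict:
        self.created += 1
        external_id = f"ext-{self.created}"
        self.vms[external_id] = {"name": spec["name"]}
        return {"provider": "fake", "external_id": external_id}

    def get_vm(self, ref: dict):
        return self.vms.get(ref["external_id"])


def handle_provision_vm(vm: dict, provider) -> dict:
    """Provision a VM at most once, even if the job is delivered again."""
    ref = vm.get("provider_ref")
    if ref is not None and provider.get_vm(ref) is not None:
        return ref  # already provisioned: redelivery becomes a no-op
    ref = provider.create_vm({"name": vm["name"]})
    vm["provider_ref"] = ref  # a real handler would persist this before acking
    return ref


provider = FakeProvider()
vm_record = {"id": "vm-1", "name": "web-1", "provider_ref": None}
first = handle_provision_vm(vm_record, provider)
second = handle_provision_vm(vm_record, provider)  # simulated redelivery
```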

---

## 8. Reconciliation Loop

Reconciliation runs periodically to:
- Update VMs in `PENDING/RUNNING` based on the provider's actual state.
- Detect drift: a VM is missing at the provider but still ACTIVE in the DB.
- Raise an incident/alert (post-V1 this can integrate with an incident service).

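A single reconciliation pass can be sketched as a pure function that compares DB records with a snapshot of provider state and reports what to update and what has drifted. The record shape (`id`, `external_id`, `status`) and the assumption that provider state is already normalized to HCP status values are illustrative; a real loop would obtain the snapshot through `GetVM`/`ListVMs` and raise alerts for the drift list.

```python
def reconcile(db_vms: list, actual_by_external_id: dict) -> dict:
    """Compare desired/recorded state with actual provider state.

    db_vms: list of dicts with keys id, external_id, status.
    actual_by_external_id: external_id -> normalized actual status.
    """
    updates, drift = [], []
    for vm in db_vms:
        actual = actual_by_external_id.get(vm["external_id"])
        if actual is None:
            # Gone at the provider but still ACTIVE in the DB: drift.
            if vm["status"] == "ACTIVE":
                drift.append(vm["id"])
            continue
        if vm["status"] in ("PENDING", "RUNNING") and actual != vm["status"]:
            updates.append((vm["id"], actual))
    return {"updates": updates, "drift": drift}


db = [
    {"id": "vm-1", "external_id": "100", "status": "PENDING"},
    {"id": "vm-2", "external_id": "101", "status": "ACTIVE"},
    {"id": "vm-3", "external_id": "102", "status": "ACTIVE"},
]
report = reconcile(db, {"100": "ACTIVE", "102": "ACTIVE"})
```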

---

## 9. Security Design

### 9.1 AuthN/AuthZ
- JWT bearer token carrying at least these claims: `org_id`, `project_bindings`, `roles`, `scopes`
- Tenant scope must not be able to access ops endpoints.
- Every request must be validated against the `projectId` path parameter.

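The two authorization rules can be sketched as a check over already-verified JWT claims. Signature and expiry verification are deliberately left out, and several details are assumptions: the scope values `"tenant"` and `"ops"`, `project_bindings` being a list of project IDs, and the path prefixes taken from the namespace section.

```python
from typing import Optional


class Forbidden(Exception):
    code = "FORBIDDEN"


def authorize(claims: dict, path: str, project_id: Optional[str] = None) -> None:
    """Raise Forbidden unless the verified claims allow this request."""
    scopes = set(claims.get("scopes", []))
    if path.startswith("/api/hcp/ops/"):
        if "ops" not in scopes:
            raise Forbidden("ops scope required")
        return
    if path.startswith("/api/hcp/tenant/"):
        if "tenant" not in scopes:
            raise Forbidden("tenant scope required")
        bound = claims.get("project_bindings", [])
        if project_id is not None and project_id not in bound:
            raise Forbidden(f"no binding for project {project_id}")


tenant_claims = {
    "org_id": "org-1",
    "project_bindings": ["p-1"],
    "roles": ["operator"],
    "scopes": ["tenant"],
}
authorize(tenant_claims, "/api/hcp/tenant/v1/projects/p-1/vms", "p-1")  # allowed
```

Checking the `projectId` from the path against the token's bindings, rather than trusting the gateway, matches the requirement that HCP verifies scopes and claims itself.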
### 9.2 Secrets & Credentials
- Provider credentials are stored encrypted.
- Workers/agents use scoped credentials (per cluster/pool) where possible.
- Console sessions use temporary (short-lived) tokens.

---

## 10. Observability

- Metrics: RPS, latency, error rate, job success/failure, provider latency, retry count
- Structured logs carrying `trace_id`, `job_id`, `vm_id`
- End-to-end tracing (OpenTelemetry-ready)


---

## 11. Deployment Notes (V1)

- HCP API: stateless, autoscale-ready
- HCP Worker: scales out with job throughput
- DB: PostgreSQL
- Queue: NATS JetStream or RabbitMQ
- Provider adapter: an internal module inside the worker (V1), or a sidecar/agent (enterprise mode)

---

## 12. Provider Compatibility (Target)
- Proxmox driver as the first implementation
- VMware vSphere driver (post-V1 or developed in parallel)
- Libvirt/KVM driver
- Hyper-V driver

---

srs-sds/HCP_SRS_v1.md
@@ -0,0 +1,169 @@
# Hypervisor Control Plane (HCP)
## Software Requirements Specification (SRS)
**Version: 1.0 (V1 – Enterprise Foundation)**

---

## 1. Summary

This document defines the software requirements for the **Hypervisor Control Plane (HCP)**, a control-plane service that provides a *provider-agnostic* API for managing the lifecycle of virtual machines (VMs) and related compute resources.

HCP is designed to support multiple hypervisor/provider backends, including but not limited to:
- Proxmox VE
- VMware vSphere/ESXi
- KVM/QEMU (via libvirt or another API)
- Hyper-V

HCP is a core component accessed by the **Central API Gateway** and/or other services in the platform. It uses a **desired state + asynchronous jobs** pattern for operations that affect infrastructure.


---

## 2. Goals & Design Principles

### 2.1 Goals
- Provide a consistent compute API across hypervisors.
- Support multi-tenancy and multiple projects with strict access isolation.
- Provide robust provisioning: idempotent, retryable, and reconcilable.
- Serve as the enterprise foundation for feature expansion (snapshots, migration, GPU, etc.) through *capability negotiation*.

### 2.2 Principles
- **Provider-agnostic Northbound API**: the API does not expose provider-specific details (e.g. `vmid`, `moid`, `datastore`, `node`).
- **Plugin/Driver model**: providers are integrated through adapters with a clear contract.
- **Async by default** for operations that take time (create, delete, start/stop, attach).
- **Strict authorization boundary**: even behind a central gateway, HCP still verifies scopes/claims itself.
- **Auditability**: every control-plane action can be audited.


---

## 3. Stakeholders & Roles

### 3.1 Tenant-side (via the Central Gateway)
- Project Owner / Admin
- Operator
- Viewer

### 3.2 Provider/Ops-side
- Cloud Operator
- Infrastructure Admin
- Security/Audit Admin
- Break-glass Admin

---

## 4. Scope (V1)

### 4.1 In Scope (MUST)
- VM lifecycle: create, start, stop, reboot, delete
- VM queries: list/detail, status, addresses, metadata/tags
- Catalog: images & flavors (read-only for tenants, writable for ops subject to platform policy)
- Console access: request a short-lived console session (VNC/SPICE/WebConsole/RDP via an abstraction)
- Job system integration: every infrastructure-affecting operation yields a `job_id`
- Capability model: expose feature support per provider and/or per cluster/location
- Placement abstraction: zone/location/cluster (without exposing specific hosts/nodes)
- Audit event emission (at minimum: request/started/succeeded/failed)
- A consistent error taxonomy across providers

### 4.2 Out of Scope (V1)
- Auto-scaling and elasticity
- Full billing engine (HCP only emits metering events; final calculation happens in another service)
- Advanced SDN and network orchestration beyond basic NIC attach
- End-to-end backup orchestration (hooks/extensions can be prepared)
- Automatic live migration/DRS (optional post-V1)
- Multi-region control-plane replication


---

## 5. Functional Requirements

### 5.1 Compute API & Operations
- HCP SHALL provide an endpoint to create a VM from an approved image/template.
- HCP SHALL provide the operations start, stop, reboot, and delete for VMs.
- HCP SHALL provide VM list/detail endpoints scoped to a project.
- HCP SHALL keep VM state and job state in a persistent datastore.

### 5.2 Asynchronous Jobs
- All operations that modify infrastructure (create/start/stop/reboot/delete) SHALL run asynchronously.
- HCP SHALL return `202 Accepted` with a `resource_id` and a `job_id`.
- HCP SHALL provide an endpoint to query job status and error details.

### 5.3 Multi-Provider Support
- HCP SHALL support integrating multiple providers through adapters/drivers.
- HCP SHALL support provider selection based on placement (zone/location/cluster) and ops policy.
- HCP SHALL store provider resource references internally (`provider_ref`) without exposing them in the northbound API.

### 5.4 Capability Negotiation
- HCP SHALL define a capability model for compute features.
- HCP SHALL return a consistent error when a feature is not supported by the provider (e.g. `FEATURE_NOT_SUPPORTED`).

### 5.5 Console Access Abstraction
- HCP SHALL provide an endpoint to request a VM console session.
- Console sessions SHALL be short-lived (carry an expiry) and SHALL NOT leak long-lived provider credentials.
- HCP SHOULD support multiple console types (VNC, SPICE, WebConsole, RDP) through an abstract type.

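A short-lived console session can be sketched as a random token paired with an expiry. Everything here is illustrative: the `ConsoleSession` shape follows the `type`, `token`, `expires_at` triple from the design spec, while the 60-second default TTL, the lowercase type names, and the helper names are assumptions. The token is generated by HCP, so no provider credential is handed to the caller.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

CONSOLE_TYPES = {"vnc", "spice", "webconsole", "rdp"}


@dataclass(frozen=True)
class ConsoleSession:
    type: str
    token: str
    expires_at: datetime

    def is_valid(self, now: Optional[datetime] = None) -> bool:
        now = now or datetime.now(timezone.utc)
        return now < self.expires_at


def issue_console_session(console_type: str, ttl_seconds: int = 60,
                          now: Optional[datetime] = None) -> ConsoleSession:
    """Create a one-off session token that expires after ttl_seconds."""
    if console_type not in CONSOLE_TYPES:
        raise ValueError(f"unknown console type: {console_type}")
    now = now or datetime.now(timezone.utc)
    return ConsoleSession(
        type=console_type,
        token=secrets.token_urlsafe(32),  # random, unrelated to provider secrets
        expires_at=now + timedelta(seconds=ttl_seconds),
    )
```

Passing `now` explicitly makes expiry behaviour easy to test without waiting for real time to pass.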
### 5.6 Placement & Location Abstraction
- HCP SHALL support the concepts of `zone/location` and `compute_cluster/pool` for placement.
- The tenant API SHOULD let callers choose placement generically (e.g. by zone) without visibility into hosts/nodes.
- The ops API SHALL manage the mapping from zone to provider cluster/pool.

### 5.7 Security & Authorization
- HCP SHALL verify the token/JWT on every request.
- HCP SHALL enforce RBAC at the endpoint and resource-scope level:
  - tenant scope cannot access ops scope
  - users can access only resources in projects they are bound to through a role binding
- HCP SHALL support an idempotency key on create operations so that retries are safe.

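An idempotency key can be honoured by remembering the first response for each (project, key) pair and replaying it on retries. The sketch below keeps that memory in a dict, which is only suitable for illustration; a real deployment would need a persistent store. The class names, the payload fingerprint, and the choice to answer a reused key with a different payload as `CONFLICT` are assumptions.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    code = "CONFLICT"


class IdempotencyStore:
    """In-memory illustration; not safe across processes or restarts."""

    def __init__(self):
        self._seen = {}

    def run_once(self, project_id: str, key: str, payload: dict, create):
        fingerprint = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode("utf-8")
        ).hexdigest()
        slot = (project_id, key)
        if slot in self._seen:
            seen_fingerprint, response = self._seen[slot]
            if seen_fingerprint != fingerprint:
                raise IdempotencyConflict("key reused with a different payload")
            return response  # replay the original response
        response = create(payload)
        self._seen[slot] = (fingerprint, response)
        return response


calls = []


def fake_create(payload: dict) -> dict:
    calls.append(payload)
    return {"resource_id": "vm-1", "job_id": f"job-{len(calls)}"}


store = IdempotencyStore()
first = store.run_once("p-1", "key-1", {"name": "web"}, fake_create)
retry = store.run_once("p-1", "key-1", {"name": "web"}, fake_create)
```

Scoping the key by project keeps one tenant's keys from colliding with another's.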
### 5.8 Audit Logging
- HCP SHALL emit an audit event for:
  - request received
  - job started
  - job succeeded
  - job failed
- An audit event contains at least: actor, action, target, timestamp, result, trace_id.

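The minimum audit fields map naturally onto an immutable record serialized as one JSON line per event. In this sketch the use of the four lifecycle points as the allowed `result` values, the ISO 8601 timestamp format, and the helper names are assumptions; the requirement only fixes the six field names.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional

# Assumed encoding of the four lifecycle points listed above.
RESULTS = {"request_received", "job_started", "job_succeeded", "job_failed"}


@dataclass(frozen=True)
class AuditEvent:
    actor: str
    action: str
    target: str
    timestamp: str
    result: str
    trace_id: str


def make_audit_event(actor: str, action: str, target: str, result: str,
                     trace_id: str, now: Optional[datetime] = None) -> AuditEvent:
    if result not in RESULTS:
        raise ValueError(f"unknown audit result: {result}")
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return AuditEvent(actor, action, target, timestamp, result, trace_id)


def to_json_line(event: AuditEvent) -> str:
    """Serialize with sorted keys so lines are stable and easy to diff."""
    return json.dumps(asdict(event), sort_keys=True)


event = make_audit_event(
    actor="user-42", action="vm.start", target="vm-1",
    result="job_started", trace_id="trace-abc",
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
line = to_json_line(event)
```

Freezing the dataclass mirrors the append-only intent: an event is written once and never modified.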
### 5.9 Error Taxonomy
- HCP SHALL return uniform errors across providers, at minimum:
  - INVALID_REQUEST
  - NOT_FOUND
  - UNAUTHORIZED / FORBIDDEN
  - QUOTA_EXCEEDED
  - FEATURE_NOT_SUPPORTED
  - PROVIDER_UNAVAILABLE
  - PROVIDER_TIMEOUT
  - CONFLICT
  - INTERNAL_ERROR

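The taxonomy can be held in an enum, with each adapter translating its own failure conditions into it. The raw condition keys in `_PROVIDER_ERROR_MAP` are hypothetical placeholders and do not correspond to any real provider's error strings; falling back to `INTERNAL_ERROR` for anything unmapped is also an assumption.

```python
from enum import Enum


class ErrorCode(str, Enum):
    INVALID_REQUEST = "INVALID_REQUEST"
    NOT_FOUND = "NOT_FOUND"
    UNAUTHORIZED = "UNAUTHORIZED"
    FORBIDDEN = "FORBIDDEN"
    QUOTA_EXCEEDED = "QUOTA_EXCEEDED"
    FEATURE_NOT_SUPPORTED = "FEATURE_NOT_SUPPORTED"
    PROVIDER_UNAVAILABLE = "PROVIDER_UNAVAILABLE"
    PROVIDER_TIMEOUT = "PROVIDER_TIMEOUT"
    CONFLICT = "CONFLICT"
    INTERNAL_ERROR = "INTERNAL_ERROR"


# Hypothetical adapter-side condition names, for illustration only.
_PROVIDER_ERROR_MAP = {
    "timeout": ErrorCode.PROVIDER_TIMEOUT,
    "connection_refused": ErrorCode.PROVIDER_UNAVAILABLE,
    "vm_not_found": ErrorCode.NOT_FOUND,
    "name_in_use": ErrorCode.CONFLICT,
}


def normalize_provider_error(raw_kind: str) -> ErrorCode:
    """Map an adapter-specific condition to the shared taxonomy."""
    return _PROVIDER_ERROR_MAP.get(raw_kind, ErrorCode.INTERNAL_ERROR)
```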

---

## 6. Non-Functional Requirements

### 6.1 Reliability
- HCP SHALL tolerate component restarts without losing state.
- Job processing SHALL be at-least-once, with idempotent handlers.

### 6.2 Availability
- The API service SHALL be stateless and horizontally scalable.
- State (VM/job) SHALL be stored in a persistent datastore.

### 6.3 Performance
- Read APIs (list/detail) SHOULD be responsive and MAY use a cached read model where needed.
- Write operations use async jobs to decouple provider latency from API latency.

### 6.4 Security
- Provider credentials and secrets SHALL be stored encrypted (Vault-ready / envelope encryption).
- Console sessions SHALL use temporary (short-lived) tokens.
- Input validation SHALL be strict to prevent injection/abuse.

### 6.5 Observability
- HCP SHALL expose metrics (request rate, latency, job success/failure, provider errors).
- HCP SHALL propagate a `trace_id`/`correlation_id` across each request end to end.


---

## 7. Acceptance Criteria (V1)
- A tenant can create a VM from an available image and monitor its status through the job.
- A tenant can start/stop/reboot/delete a VM with job tracking.
- Ops can register providers/clusters and manage the zone mapping.
- The system runs with at least one provider driver (Proxmox) without locking the design out of other providers.
- Errors and audit events are consistent and usable for troubleshooting.

---

srs-sds/SDS_v1.md
@@ -0,0 +1,160 @@
# Cloud Infrastructure Management Platform
## Software Design Specification (SDS)
**Version: 1.0 (V1 – Enterprise Foundation)**

---

## 1. Architectural Overview

The platform adopts a Control Plane and Data Plane architecture.

- Control Plane manages APIs, identity, orchestration, policy, and state.
- Data Plane executes infrastructure operations via agents and providers.


---

## 2. High-Level Components

### 2.1 Management Layer
- Tenant Management Console
- Provider / Operations Console

In Version 1, both consoles MAY be implemented as a single UI with strict role-based access control.

---

### 2.2 API Gateway
Responsibilities:
- Authentication and authorization
- API namespace separation
- Request validation and rate limiting
- Centralized audit logging hook

---

### 2.3 Core Services

| Service | Responsibility |
|-------|----------------|
| Identity Service | Users, roles, RBAC |
| Resource Manager | Projects, quotas, metadata |
| Compute Service | Virtual machine lifecycle |
| Network Service | Virtual network management |
| Storage Service | Volume or object storage |
| Job Service | Workflow orchestration and retries |
| Audit Service | Append-only audit logging |
| Metering Service | Usage aggregation |


---

## 3. Data Model Overview

### Core Entities
- Organization
- Project
- User
- Role
- Role Binding
- Virtual Machine
- Network
- Volume / Bucket
- Job
- Audit Event
- Quota
- Provider

### Common Resource Attributes
```
id
organization_id
project_id
name
status
labels
provider_reference
created_at
updated_at
```


---

## 4. API Design Principles

- REST-based APIs
- Versioned endpoints
- Clear separation between tenant and provider APIs

### Namespace Examples
- /api/tenant/v1/*
- /api/ops/v1/*
- /api/common/v1/*

---

## 5. Job & Workflow Design

### Job Lifecycle States
- PENDING
- RUNNING
- SUCCEEDED
- FAILED
- RETRYING

### Design Characteristics
- Idempotent create operations
- Retry for transient failures only
- Persistent job state storage


---

## 6. Provider & Agent Architecture

### Provider Interfaces
- Compute Provider
- Network Provider
- Storage Provider

### Agent Responsibilities
- Execute infrastructure-level operations
- Report actual state to the control plane
- Emit audit and telemetry data

---

## 7. Reconciliation Mechanism

- Periodic reconciliation loop
- Desired state vs actual state comparison
- Drift handling via:
  - Automated correction
  - Operator alert and incident escalation


---

## 8. Security Architecture

- Token-based authentication
- RBAC enforcement across services
- Encrypted secret storage
- Distributed request tracing

---

## 9. Deployment Model (V1)

- Stateless API services
- PostgreSQL as primary datastore
- Message queue for job distribution
- Agent deployment per infrastructure cluster

---

## 10. Future Evolution

- Multi-cluster federation
- Kubernetes services
- Policy-as-Code
- Billing and invoicing
- Application marketplace

---

srs-sds/SRS_v1.md
@@ -0,0 +1,143 @@
# Cloud Infrastructure Management Platform
## Software Requirements Specification (SRS)
**Version: 1.0 (V1 – Enterprise Foundation)**

---

## 1. Purpose & Vision

This document defines the Software Requirements Specification (SRS) for the Cloud Infrastructure Management Platform (CIMP).

The platform is designed to deliver enterprise-grade, IaaS-like cloud capabilities inspired by AWS, GCP, and Azure, primarily targeting private and managed cloud environments.

Version 1 focuses on strong architectural foundations, governance, and security, while maintaining a controlled and achievable feature scope.


---

## 2. Target Users & Roles

### 2.1 Tenant Roles
- Tenant Owner
- Project Admin
- Project Operator
- Project Viewer

### 2.2 Provider / Operator Roles
- Cloud Operator
- Infrastructure Administrator
- Security / Audit Administrator
- Break-glass Super Administrator

---

## 3. Scope Definition

### 3.1 In Scope (V1)
- Multi-tenant and multi-project architecture
- Identity and Access Management (RBAC)
- Compute service (Virtual Machine lifecycle)
- Basic virtual networking
- Basic storage service (block or object)
- Asynchronous job execution
- Audit logging (append-only)
- Usage metering and reporting
- Provider / operations management console

### 3.2 Out of Scope (V1)
- Public cloud federation
- Auto-scaling and elasticity
- Kubernetes and container orchestration
- Application marketplace
- Billing or payment gateway
- Advanced SDN automation (BGP / EVPN)


---

## 4. Functional Requirements

### 4.1 Identity & Access Management
- The system SHALL support Organizations (Tenants), Projects, Users, Roles, and Role Bindings.
- The system SHALL enforce strict separation between tenant and provider scopes.
- The system SHALL use token-based API authentication.
- The system SHOULD be extensible to support external Identity Providers (OIDC).

---

### 4.2 Project & Resource Management
- Tenants SHALL be able to create and manage projects.
- Projects SHALL support quota assignment.
- Every resource SHALL belong to exactly one project.
- All resources SHALL include ownership and lifecycle metadata.

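Quota assignment per project implies an admission check before a resource is created. The following is a minimal sketch under stated assumptions: the resource names (`vcpu`, `vm`), the dict shapes, and treating a resource with no assigned limit as unlimited are all illustrative choices, and `QUOTA_EXCEEDED` is borrowed from the HCP error taxonomy.

```python
class QuotaExceeded(Exception):
    code = "QUOTA_EXCEEDED"


def check_quota(quota: dict, usage: dict, request: dict) -> None:
    """Raise QuotaExceeded if usage + request would pass any assigned limit."""
    for resource, amount in request.items():
        limit = quota.get(resource)
        if limit is None:
            continue  # assumption: no assigned limit means unlimited
        if usage.get(resource, 0) + amount > limit:
            raise QuotaExceeded(
                f"{resource}: {usage.get(resource, 0)} used + {amount} "
                f"requested exceeds limit {limit}"
            )


project_quota = {"vcpu": 8, "vm": 2}
project_usage = {"vcpu": 6, "vm": 1}
check_quota(project_quota, project_usage, {"vcpu": 2, "vm": 1})  # exactly at limit
```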

---

### 4.3 Compute Service
- Tenants SHALL be able to create Virtual Machines from predefined images.
- The system SHALL support start, stop, reboot, and delete operations.
- VM provisioning SHALL be asynchronous.
- VM lifecycle states SHALL be exposed through the API.

---

### 4.4 Network Service
- Tenants SHALL be able to create virtual networks per project.
- Virtual networks SHALL enforce isolation between projects.
- Virtual machines SHALL be attachable to one or more virtual networks.

---

### 4.5 Storage Service
- Tenants SHALL be able to create storage volumes or object buckets.
- Storage resources SHALL be attachable to compute resources where applicable.
- Snapshot functionality MAY be supported depending on backend capability.

---

### 4.6 Job & Workflow Management
- All infrastructure-impacting operations SHALL be executed via an asynchronous job system.
- Each job SHALL return a job identifier.
- Job execution status SHALL be queryable.

---

### 4.7 Audit Logging
- The system SHALL record all control-plane actions.
- Audit logs SHALL include actor, action, target resource, timestamp, and result.
- Audit logs SHALL be immutable and append-only.

---

### 4.8 Metering & Reporting
- The system SHALL collect usage metrics for compute, network, and storage.
- Usage reports SHALL be generated per project and tenant.
- Billing integration is out of scope for V1.

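Per-project and per-tenant usage reports can be derived from the same stream of metering records by aggregating on two different keys. The record shape used here (`organization_id`, `project_id`, `metric`, `quantity`) and the metric name are assumptions for illustration; the requirements do not define a metering schema.

```python
from collections import defaultdict


def aggregate_usage(records: list) -> tuple:
    """Sum metered quantities per (tenant, project) and per tenant."""
    per_project = defaultdict(lambda: defaultdict(float))
    per_tenant = defaultdict(lambda: defaultdict(float))
    for record in records:
        org = record["organization_id"]
        project_key = (org, record["project_id"])
        per_project[project_key][record["metric"]] += record["quantity"]
        per_tenant[org][record["metric"]] += record["quantity"]
    # Convert to plain dicts so callers do not create keys by accident.
    return (
        {key: dict(metrics) for key, metrics in per_project.items()},
        {key: dict(metrics) for key, metrics in per_tenant.items()},
    )


sample = [
    {"organization_id": "org-1", "project_id": "p-1", "metric": "vcpu_hours", "quantity": 2.0},
    {"organization_id": "org-1", "project_id": "p-1", "metric": "vcpu_hours", "quantity": 3.0},
    {"organization_id": "org-1", "project_id": "p-2", "metric": "vcpu_hours", "quantity": 1.0},
]
by_project, by_tenant = aggregate_usage(sample)
```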

---

### 4.9 Provider / Operations Management
- Operators SHALL be able to onboard infrastructure clusters.
- Operators SHALL be able to define global policies and catalogs.
- Operators SHALL have visibility into tenant activities for auditing and troubleshooting.

---

## 5. Non-Functional Requirements

### 5.1 Security
- RBAC enforcement at the API layer.
- Encryption for sensitive data at rest.
- Full auditability of administrative actions.

### 5.2 Availability
- Control plane services SHALL be stateless.
- The system SHALL tolerate service restarts without data loss.

### 5.3 Scalability
- Horizontal scalability for API services.
- Asynchronous processing for long-running tasks.

### 5.4 Maintainability
- Modular service architecture.
- Clear separation between control plane and data plane.

---
