hephaestus-hpc-api/srs-sds/SDS_v1.md
2025-12-30 13:00:47 +07:00


Cloud Infrastructure Management Platform

Software Design Specification (SDS)

Version: 1.0 (V1 Enterprise Foundation)


1. Architectural Overview

The platform adopts a Control Plane and Data Plane architecture.

  • The Control Plane manages APIs, identity, orchestration, policy, and state.
  • The Data Plane executes infrastructure operations via agents and providers.

2. High-Level Components

2.1 Management Layer

  • Tenant Management Console
  • Provider / Operations Console

In Version 1, both consoles MAY be implemented as a single UI with strict role-based access control.


2.2 API Gateway

Responsibilities:

  • Authentication and authorization
  • API namespace separation
  • Request validation and rate limiting
  • Centralized audit logging hook
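
The four responsibilities above can be read as an ordered pipeline. The following is a minimal sketch, not the platform's actual gateway: the `Request` and `Gateway` classes, the HTTP status codes chosen, and the fixed per-token request cap (a stand-in for real windowed rate limiting) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set

# Namespaces taken from the API design section of this document.
NAMESPACES = ("/api/tenant/v1/", "/api/ops/v1/", "/api/common/v1/")


@dataclass
class Request:
    path: str
    token: Optional[str]
    body: object


@dataclass
class Gateway:
    """Hypothetical gateway: checks run in order, every outcome is audited."""
    valid_tokens: Set[str]
    audit_hook: Callable[[str, int], None]
    max_requests: int = 100  # assumed cap per token, no time window
    _counts: Dict[str, int] = field(default_factory=dict)

    def handle(self, req: Request) -> int:
        status = self._check(req)
        # Centralized audit hook: called for accepted and rejected requests.
        self.audit_hook(req.path, status)
        return status

    def _check(self, req: Request) -> int:
        if req.token not in self.valid_tokens:       # authentication
            return 401
        if not req.path.startswith(NAMESPACES):      # namespace separation
            return 404
        if not isinstance(req.body, dict):           # request validation
            return 400
        count = self._counts.get(req.token, 0) + 1   # rate limiting
        self._counts[req.token] = count
        if count > self.max_requests:
            return 429
        return 200
```

Because the audit hook wraps the whole check, rejected requests leave a trail as well, which is usually what an audit log is expected to capture.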

2.3 Core Services

| Service          | Responsibility                     |
|------------------|------------------------------------|
| Identity Service | Users, roles, RBAC                 |
| Resource Manager | Projects, quotas, metadata         |
| Compute Service  | Virtual machine lifecycle          |
| Network Service  | Virtual network management         |
| Storage Service  | Volume or object storage           |
| Job Service      | Workflow orchestration and retries |
| Audit Service    | Append-only audit logging          |
| Metering Service | Usage aggregation                  |

3. Data Model Overview

Core Entities

  • Organization
  • Project
  • User
  • Role
  • Role Binding
  • Virtual Machine
  • Network
  • Volume / Bucket
  • Job
  • Audit Event
  • Quota
  • Provider

Common Resource Attributes

id
organization_id
project_id
name
status
labels
provider_reference
created_at
updated_at
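
As an illustration, the common attributes could be carried by a shared base record. This is a sketch under assumptions: the class name `ResourceBase`, the field types, the UUID-based `id`, and the `touch` helper are not specified by this document.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, Optional


def _utc_now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class ResourceBase:
    """Hypothetical base record holding the common resource attributes."""
    organization_id: str
    project_id: str
    name: str
    status: str  # allowed values are not defined here; left as free text
    labels: Dict[str, str] = field(default_factory=dict)
    provider_reference: Optional[str] = None  # unset until a provider creates it
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(default_factory=_utc_now)
    updated_at: datetime = field(default_factory=_utc_now)

    def touch(self) -> None:
        """Refresh updated_at after any mutation."""
        self.updated_at = _utc_now()
```

Entity-specific types (Virtual Machine, Network, Volume) could then extend this record with their own fields.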

4. API Design Principles

  • REST-based APIs
  • Versioned endpoints
  • Clear separation between tenant and provider APIs

Namespace Examples

  • /api/tenant/v1/*
  • /api/ops/v1/*
  • /api/common/v1/*
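
A small sketch of how a request path might be split into audience, version, and resource path under this scheme. The `Route` type and `parse_route` function are hypothetical names, and the `vms/42` path in the usage note is an invented example.

```python
import re
from typing import NamedTuple, Optional

# Matches /api/<audience>/v<version>/<rest>, for the three namespaces above.
_ROUTE = re.compile(r"^/api/(tenant|ops|common)/v(\d+)/(.*)$")


class Route(NamedTuple):
    audience: str
    version: int
    resource_path: str


def parse_route(path: str) -> Optional[Route]:
    """Return the parsed route, or None if the path is outside the namespaces."""
    match = _ROUTE.match(path)
    if match is None:
        return None
    return Route(match.group(1), int(match.group(2)), match.group(3))
```

For example, `parse_route("/api/tenant/v1/vms/42")` yields `Route("tenant", 1, "vms/42")`, while a path under an unknown audience yields `None`.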

5. Job & Workflow Design

Job Lifecycle States

  • PENDING
  • RUNNING
  • SUCCEEDED
  • FAILED
  • RETRYING
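
The states above can be encoded as a small state machine. The document lists the states but not the transitions between them, so the `ALLOWED` table below is an assumption: SUCCEEDED and FAILED are treated as terminal, and RETRYING leads either back to RUNNING or to FAILED.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    RETRYING = "RETRYING"


# Assumed transition table; not specified by this document.
ALLOWED = {
    JobState.PENDING: {JobState.RUNNING},
    JobState.RUNNING: {JobState.SUCCEEDED, JobState.FAILED, JobState.RETRYING},
    JobState.RETRYING: {JobState.RUNNING, JobState.FAILED},
    JobState.SUCCEEDED: set(),
    JobState.FAILED: set(),
}


def transition(current: JobState, target: JobState) -> JobState:
    """Return the target state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Rejecting illegal transitions at a single choke point keeps persisted job state consistent even when several workers report on the same job.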

Design Characteristics

  • Idempotent create operations
  • Retry for transient failures only
  • Persistent job state storage
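
A minimal sketch of how these characteristics might combine. It assumes a client-supplied idempotency key and two hypothetical exception types that separate transient from permanent failures; the in-memory `JobStore` stands in for persistent storage, and backoff between attempts is omitted.

```python
from typing import Callable, Dict


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. timeouts)."""


class PermanentError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""


class JobStore:
    """In-memory stand-in for persistent job state storage."""

    def __init__(self) -> None:
        self._by_key: Dict[str, dict] = {}

    def create(self, idempotency_key: str, payload: dict) -> dict:
        # Idempotent create: a repeated key returns the existing job.
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            return existing
        job = {"id": f"job-{len(self._by_key) + 1}", "payload": payload,
               "state": "PENDING", "attempts": 0}
        self._by_key[idempotency_key] = job
        return job


def run_with_retry(job: dict, operation: Callable[[], None],
                   max_attempts: int = 3) -> dict:
    """Run the operation, retrying only on TransientError."""
    while True:
        job["state"] = "RUNNING"
        job["attempts"] += 1
        try:
            operation()
        except TransientError:
            if job["attempts"] >= max_attempts:
                job["state"] = "FAILED"
                return job
            job["state"] = "RETRYING"
            continue
        except PermanentError:
            job["state"] = "FAILED"
            return job
        job["state"] = "SUCCEEDED"
        return job
```

A client that resubmits the same create request after a network timeout receives the original job rather than a duplicate, provided it reuses the key.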

6. Provider & Agent Architecture

Provider Interfaces

  • Compute Provider
  • Network Provider
  • Storage Provider
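
One way to express a provider interface is as a structural type that core services depend on, with concrete providers plugged in behind it. The method names (`create_vm`, `get_vm_status`, `delete_vm`), the status strings, and the in-memory implementation below are illustrative assumptions, not the platform's defined contract.

```python
import itertools
from typing import Dict, Protocol


class ComputeProvider(Protocol):
    """Hypothetical compute provider contract."""

    def create_vm(self, name: str) -> str: ...
    def get_vm_status(self, provider_reference: str) -> str: ...
    def delete_vm(self, provider_reference: str) -> None: ...


class InMemoryComputeProvider:
    """Fake provider, useful for tests; satisfies ComputeProvider structurally."""

    def __init__(self) -> None:
        self._vms: Dict[str, str] = {}
        self._ids = itertools.count(1)

    def create_vm(self, name: str) -> str:
        reference = f"mem-{next(self._ids)}"
        self._vms[reference] = "RUNNING"
        return reference

    def get_vm_status(self, provider_reference: str) -> str:
        return self._vms.get(provider_reference, "NOT_FOUND")

    def delete_vm(self, provider_reference: str) -> None:
        self._vms.pop(provider_reference, None)


def provision_vm(provider: ComputeProvider, name: str) -> str:
    """Service-side code sees only the interface, never a concrete provider."""
    return provider.create_vm(name)
```

The returned string plays the role of the `provider_reference` attribute listed in the data model.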

Agent Responsibilities

  • Execute infrastructure-level operations
  • Report actual state to the control plane
  • Emit audit and telemetry data

7. Reconciliation Mechanism

  • Periodic reconciliation loop
  • Desired state vs actual state comparison
  • Drift handling via:
    • Automated correction
    • Operator alert and incident escalation
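
A single pass of such a loop might look like the sketch below. It assumes desired and actual state are maps from resource ID to a status string, and that `correct` and `alert` are caller-supplied callbacks; these shapes and names are hypothetical. A scheduler would invoke the function periodically.

```python
from typing import Callable, Dict, List, Optional


def reconcile_once(
    desired: Dict[str, str],
    actual: Dict[str, str],
    correct: Callable[[str, str], bool],
    alert: Callable[[str, str, Optional[str]], None],
) -> List[str]:
    """Compare desired vs actual state and return the IDs that drifted.

    For each drifted resource, automated correction is attempted first;
    if `correct` reports failure (returns False), the operator is alerted.
    """
    drifted: List[str] = []
    for resource_id, want in desired.items():
        have = actual.get(resource_id)  # None means the resource is missing
        if have == want:
            continue
        drifted.append(resource_id)
        if not correct(resource_id, want):
            alert(resource_id, want, have)
    return drifted
```

This pass only inspects resources present in the desired state; detecting orphaned resources that exist only in the actual state would need a second comparison in the other direction.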

8. Security Architecture

  • Token-based authentication
  • RBAC enforcement across services
  • Encrypted secret storage
  • Distributed request tracing
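
RBAC enforcement can be sketched with the Role and Role Binding entities from the data model. The role names, the permission strings, and the project-scoped binding shape below are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable

# Hypothetical role catalogue; real roles and permissions are not defined here.
ROLE_PERMISSIONS: Dict[str, FrozenSet[str]] = {
    "viewer": frozenset({"vm:read"}),
    "operator": frozenset({"vm:read", "vm:write"}),
}


@dataclass(frozen=True)
class RoleBinding:
    user_id: str
    role: str
    project_id: str


def is_allowed(bindings: Iterable[RoleBinding], user_id: str,
               project_id: str, permission: str) -> bool:
    """Allow only if some binding grants the permission in that project."""
    return any(
        b.user_id == user_id
        and b.project_id == project_id
        and permission in ROLE_PERMISSIONS.get(b.role, frozenset())
        for b in bindings
    )
```

The check is deny-by-default: an unknown role or a binding in another project grants nothing.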

9. Deployment Model (V1)

  • Stateless API services
  • PostgreSQL as primary datastore
  • Message queue for job distribution
  • Agent deployment per infrastructure cluster

10. Future Evolution

  • Multi-cluster federation
  • Kubernetes services
  • Policy-as-Code
  • Billing and invoicing
  • Application marketplace