- 1 Executive Summary
- 2 Core Design Principles
- 3 Non-Goals
- 4 System Overview
- 5 Hardware Architecture (Recommended)
- 6 Network & Communication Diagram (Textual)
- 7 Storage Model
- 7.1 Invariant
- 7.2 Checkpoint Flow
- 7.3 Artifacts & Logs
- 8 Unified Checkpointing (JAX Pytrees)
- 9 Repository Structure (Tech Spec)
- 10 Execution Model
- 10.1 Job Lifecycle
- 10.2 Backend Interface
- 11 DAG / “Ray-lite” Model
- 12 Example YAML Specifications
- 12.1 Backend Inventory
- 12.2 Storage
- 12.3 Single Run Spec
- 12.4 DAG Spec (Tokenize → Train → Rollouts)
- 13 Implementation Plan
- 13.1 Phase 0 (2–3 weeks)
- 13.2 Phase 1 (4–6 weeks)
- 13.3 Phase 2 (3–4 weeks)
- 13.4 Phase 3 (optional)
- 14 Cost Estimates
- 14.1 One-Time Hardware (Target ~$50k)
- 14.2 Ongoing
- 15 Risks & Mitigations
- 16 Success Criteria
- 17 Conclusion
After a conversation with an LM (https://chatgpt.com/share/697143db-c3e0-8000-b56c-07cf7ca43795), the following proposal was generated.
1 Executive Summary
Research labs consistently suffer from fragile, bespoke infrastructure that fails under preemption, breaks across heterogeneous clusters, and cannot keep pace with rapid iteration. This project proposes AdventureTime: a minimal, JAX-first training and experimentation infrastructure designed for small frontier research groups (~10 people) with access to heterogeneous compute.
The system prioritizes two non-negotiable guarantees:
- If an experiment runs locally, it runs on any cluster by changing only the submit command.
- No experiment ever loses progress; all workloads are preemption-safe and resumable.
AdventureTime achieves this by unifying all workloads (training, eval, tokenization, API rollouts, DAG workflows) under a single abstraction: checkpointable JAX pytrees, paired with a lightweight control-plane scheduler and object-store-backed artifact system.
The total infrastructure cost target is ~$50k, with modest ongoing operational costs.
2 Core Design Principles
- JAX-first (Flax, Optax, Orbax)
- All resumable state is a JAX pytree
- Restart is cheap; elasticity is optional
- No Kubernetes, no Ray, no containers
- UV for dependency management
- SSH + SLURM + object storage as primary integration points
- Control plane orchestrates; workers are stateless
3 Non-Goals
- Enterprise multi-tenant auth
- Perfect elastic world-size training
- Replacing SLURM or cloud schedulers
- Building a general-purpose data lake
- Long-lived actors or services on workers
4 System Overview
AdventureTime consists of three layers:
- Control Plane (colo-hosted)
- Execution Backends (heterogeneous clusters)
- Runtime Library + CLI (monorepo)
All interaction is mediated via job specs, checkpoint manifests, and object storage.
5 Hardware Architecture (Recommended)
5.1 Goals (hardware)
- One always-on control node (scheduler + state + UI + adapters)
- One interactive “debug box” for SSH ingress and editing (e.g., Emacs/tmux)
- No GPU requirement in colo; GPUs live in external clusters
- Small footprint: ~2U rack total is acceptable; 1U is possible with tradeoffs
5.2 Option A: 2U total (recommended)
5.2.1 1U Control Plane Server (always-on)
- Role: scheduler/controller, DB, logging UI, artifact index, backend adapters
- Specs (baseline):
- CPU: 16–32 cores (e.g., EPYC / Xeon)
- RAM: 128–256 GB
- Storage:
- OS: mirrored SSD (e.g., 2×1TB)
- Local scratch/cache: 2–8TB NVMe (single or mirrored; not canonical)
- Network:
- 10GbE preferred (1GbE workable)
- Notes:
- This node should be stable, boring, and easy to replace.
5.2.2 1U Debug / SSH Bastion (interactive)
- Role: SSH endpoint, “human box,” editor, small-scale local runs, diagnostics
- Specs:
- CPU: 8–16 cores
- RAM: 64–128 GB
- Storage: 1–2TB SSD
- Notes:
- Can also host small services (docs preview, dashboards) if desired.
5.3 Option B: 1U total (aggressive)
- Single 1U machine runs everything (control + debug)
- Risk: interruptions/reboots/maintenance hit both scheduling and your “human box”
- Acceptable only if you’re okay with occasional coordination pauses.
5.4 Option C: “DGX Spark as Debug Box”
- If DGX Spark is available, treat it as:
- Debug / interactive SSH box
- Not mandatory for control-plane correctness
- Control plane remains a boring 1U server.
5.5 Storage in colo
- Weka is optional. For this project’s goals, treat Weka as a hot cache/staging layer, not a dependency.
- Canonical storage is S3-compatible object storage.
5.6 Power Budget (planning)
(Exact draw depends on chosen servers; below is a conservative sizing guide.)
- Control plane 1U server: ~150–350W typical, ~500W peak
- Debug/bastion 1U server: ~100–250W typical, ~400W peak
- 10GbE switch (small): ~20–60W typical
- Total typical: ~300–660W
- Total peak (safe provision): ~900–1,200W
Power provisioning recommendation:
- Budget 1.2kW on the PDU for comfort
- Single 120V/15A circuit can be tight at peak; prefer 120V/20A or 208V if available
5.7 Space Budget (rack units)
- Option A: 2U servers + optional 1U switch = 2U–3U total
- Option B: 1U total + optional 1U switch = 1U–2U total
- Cabling: plan front-to-back airflow, short DACs for 10GbE where possible
5.8 Colo notes
- Put the control plane on UPS-backed power (colo UPS or your own small UPS if permitted)
- Maintain remote serial / out-of-band management (iDRAC/iLO) for recovery
6 Network & Communication Diagram (Textual)
[Dev Laptop]
|
| adventuretime run/submit (SSH/HTTPS)
v
[Debug/Bastion Box] (Emacs/tmux, human ingress)
|
| (SSH, internal)
v
[Control Plane Server] (scheduler, state, UI)
|
|-- SSH --> [SLURM login A] --> sbatch --> compute nodes
|-- SSH --> [SLURM login B] --> sbatch --> compute nodes
|-- SSH --> [SLURM login C] --> sbatch --> compute nodes
|
|-- HTTPS --> [S3-compatible Object Store]
|
|-- LAN --> [Optional Hot Cache (Weka/NAS)]
Beyond optional heartbeats, workers never communicate with the control plane, and jobs never communicate with each other; coordination happens through job specs, checkpoints, and object storage.
7 Storage Model
7.1 Invariant
All resumable state is a JAX pytree plus a small JSON manifest.
7.2 Checkpoint Flow
[Worker Scratch Disk]
  └── ckpt.tmp/
        ├── orbax blobs
        └── metadata.json
          |
          | upload blobs
          v
[S3://runs/<run_id>/ckpt/<step>/...]
          |
          | upload manifest.json LAST
          v
Checkpoint committed atomically
The presence of a manifest indicates a valid checkpoint.
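A minimal sketch of this commit protocol, assuming boto3 for the S3 transport; the helper name and manifest fields are illustrative, not the actual adventuretime API:

```python
# Manifest-last checkpoint commit (sketch; assumes boto3, names illustrative).
import json
import pathlib

import boto3

s3 = boto3.client("s3")

def commit_checkpoint(local_dir: pathlib.Path, bucket: str, prefix: str) -> None:
    """Upload all blobs first, then the manifest. Readers treat a checkpoint
    without manifest.json as nonexistent, so a partial upload is never valid."""
    blobs = [p for p in local_dir.rglob("*")
             if p.is_file() and p.name != "manifest.json"]
    for path in blobs:
        key = f"{prefix}/{path.relative_to(local_dir)}"
        s3.upload_file(str(path), bucket, key)

    manifest = {"files": [str(p.relative_to(local_dir)) for p in blobs]}
    # The manifest goes up LAST; its presence marks the checkpoint as committed.
    s3.put_object(
        Bucket=bucket,
        Key=f"{prefix}/manifest.json",
        Body=json.dumps(manifest).encode("utf-8"),
    )
```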
7.3 Artifacts & Logs
- Artifacts (JSONL, images, tables) are written in parts
- Each part is immutable
- A manifest tracks completion
- The same mechanism covers API rollouts and training (see the sketch below)
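A minimal sketch of the part-based pattern for a local staging directory; the writer class and manifest fields are illustrative, and the object-store upload would reuse the §7.2 commit flow:

```python
# Immutable, part-based artifact writes (sketch; names illustrative).
import json
import pathlib

class ArtifactWriter:
    """Writes an artifact as numbered, immutable parts plus a manifest.
    A resumed job lists existing parts and continues from the next index."""

    def __init__(self, root: pathlib.Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def next_part_index(self) -> int:
        return len(list(self.root.glob("part-*.jsonl")))

    def write_part(self, records: list[dict]) -> pathlib.Path:
        idx = self.next_part_index()
        path = self.root / f"part-{idx:05d}.jsonl"
        path.write_text("\n".join(json.dumps(r) for r in records))
        return path

    def finalize(self) -> None:
        manifest = {"parts": sorted(p.name for p in self.root.glob("part-*.jsonl"))}
        (self.root / "manifest.json").write_text(json.dumps(manifest))
```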
8 Unified Checkpointing (JAX Pytrees)
All workloads checkpoint a pytree:
Training:
- model params
- optimizer state
- RNG state
- step counters
API rollouts / DAG nodes:
- cursor / index
- RNG seed
- cached responses (optional)
- progress metadata
CheckpointManager API:
- save(pytree, step) -> CheckpointRef
- latest() -> Optional[CheckpointRef]
- restore(target_pytree) -> pytree
Orbax is used under the hood; transport is abstracted.
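A minimal sketch of this surface as a Python protocol, plus a resumable-rollout usage example; everything beyond the three listed methods (the `CheckpointRef` fields, the driver loop, the `call_api` callable) is an assumption for illustration:

```python
# CheckpointManager surface from above, as a typing.Protocol (sketch only).
from dataclasses import dataclass
from typing import Any, Callable, Optional, Protocol

@dataclass(frozen=True)
class CheckpointRef:
    run_id: str
    step: int
    uri: str  # e.g. an s3://runs/<run_id>/ckpt/<step>/ prefix

class CheckpointManager(Protocol):
    def save(self, pytree: Any, step: int) -> CheckpointRef: ...
    def latest(self) -> Optional[CheckpointRef]: ...
    def restore(self, target_pytree: Any) -> Any: ...

def run_rollouts(ckpt: CheckpointManager,
                 prompts: list[str],
                 call_api: Callable[[str], str]) -> None:
    """Resumable API rollouts: the cursor/results pytree is the only state."""
    state = {"cursor": 0, "results": []}
    if ckpt.latest() is not None:
        state = ckpt.restore(state)          # restore into the target structure

    for i in range(state["cursor"], len(prompts)):
        state["results"].append(call_api(prompts[i]))
        state["cursor"] = i + 1
        if state["cursor"] % 100 == 0:       # checkpoint every 100 prompts
            ckpt.save(state, step=state["cursor"])
```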
9 Repository Structure (Tech Spec)
monorepo/
  adventuretime/
    cli/
      main.py           ; run/submit/status/logs
    core/
      env.py            ; RunEnv
      spec.py           ; JobSpec, DAGSpec, ResourceSpec
      registry.py       ; experiment discovery
      heartbeat.py      ; liveness + preemption hooks
    ckpt/
      pytree.py         ; pytree API
      orbax.py          ; orbax adapters
      manifest.py       ; atomic checkpoint manifests
      transport.py      ; local <-> object store
    io/
      datasets.py       ; dataset refs + caching
      artifacts.py      ; artifact refs
      cache.py          ; scratch cache mgmt
    log/
      events.py         ; structured metrics
      sink_wandb.py     ; optional wandb
      sink_selfhost.py  ; self-host UI client
    backends/
      base.py           ; Backend interface
      slurm.py          ; sbatch emitter + watcher
      ssh.py            ; direct SSH executor
      gcp.py            ; cloud fallback
    sched/
      controller.py     ; reconcile loop
      planner.py        ; backend selection
      state.py          ; sqlite/pg run state
      queue.py          ; DAG execution
    dag/
      model.py          ; Node, Edge, Resources
      exec.py           ; node runner
  experiments/
    <exp>.py            ; returns Job or DAG
  lego/
    datasets/
    layers/
    models/
    optimizers/
  configs/
    backends.yaml
    storage.yaml
    clusters/
      <cluster>.yaml
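A minimal sketch of how entry strings like "experiments/fork_mid.py:build" could be resolved (the job of registry.py above); the helper name is illustrative:

```python
# Entry-point resolution for "path/to/module.py:function" strings (sketch).
import importlib.util
from typing import Any, Callable

def load_entry(entry: str) -> Callable[..., Any]:
    """Resolve e.g. 'experiments/fork_mid.py:build' to the build() callable."""
    path, func_name = entry.rsplit(":", 1)
    spec = importlib.util.spec_from_file_location("adventuretime_entry", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, func_name)

# Usage: the returned callable builds the Job or DAG handed to the scheduler.
# job_or_dag = load_entry("experiments/fork_mid.py:build")()
```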
10 Execution Model
10.1 Job Lifecycle
- Each run has a stable run_id
- Each submission attempt increments attempt_id
- Scheduler reconciles desired vs observed state
- On failure or preemption:
- find latest checkpoint
- resubmit on the next viable backend (see the sketch below)
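A minimal sketch of that failure/preemption branch; the `run`, `backends`, and `state_db` collaborators and the `with_resume` helper are assumptions, not the real core/sched API:

```python
# One reconcile pass for a single run (sketch; collaborator APIs are assumed).
TERMINAL = {"FAILED", "PREEMPTED", "NODE_FAIL", "TIMEOUT"}

def reconcile_once(run, backends, state_db) -> None:
    handle = state_db.current_handle(run.run_id)         # None if never submitted
    observed = handle.backend.poll(handle) if handle else "NOT_SUBMITTED"

    if observed in TERMINAL or observed == "NOT_SUBMITTED":
        ckpt = state_db.latest_checkpoint(run.run_id)     # may be None (fresh run)
        backend = next(b for b in backends if b.is_viable(run.spec))  # selector order
        attempt_id = state_db.next_attempt(run.run_id)    # attempt_id increments
        spec = run.spec.with_resume(ckpt)                 # hypothetical helper
        state_db.record_attempt(run.run_id, attempt_id, backend.submit(spec))
```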
10.2 Backend Interface
Each backend implements:
- submit(JobSpec) -> JobHandle
- poll(JobHandle) -> state
- cancel(JobHandle)
- tail_logs(JobHandle)
The SLURM adapter parses exit codes and job state reasons to detect preemption.
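A minimal sketch of this interface as a typing.Protocol; the `JobState` values and `JobHandle` fields are assumptions:

```python
# Backend interface listed above (sketch; state and field names assumed).
from dataclasses import dataclass
from enum import Enum, auto
from typing import Iterator, Protocol

class JobState(Enum):
    PENDING = auto()
    RUNNING = auto()
    SUCCEEDED = auto()
    FAILED = auto()
    PREEMPTED = auto()

@dataclass(frozen=True)
class JobHandle:
    backend: str
    native_id: str      # e.g. the SLURM job id
    run_id: str
    attempt_id: int

class Backend(Protocol):
    def submit(self, spec: "JobSpec") -> JobHandle: ...
    def poll(self, handle: JobHandle) -> JobState: ...
    def cancel(self, handle: JobHandle) -> None: ...
    def tail_logs(self, handle: JobHandle) -> Iterator[str]: ...

# A SLURM adapter would map native states such as PREEMPTED or NODE_FAIL
# onto JobState when polling.
```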
11 DAG / “Ray-lite” Model
- DAG nodes are jobs, not actors
- Nodes request resources
- Nodes checkpoint state
- Retries happen at node granularity
- Within-node parallelism uses:
- JAX multihost
- Python multiprocessing
This avoids Ray’s complexity while retaining fault tolerance.
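A minimal sketch of the node model under these constraints; field names only loosely mirror dag/model.py and are assumptions:

```python
# Jobs-not-actors DAG model (sketch; field names assumed).
from dataclasses import dataclass

@dataclass(frozen=True)
class Resources:
    cpus: int = 0
    gpus: int = 0

@dataclass(frozen=True)
class Node:
    id: str
    entry: str                      # e.g. "experiments/tokenize.py:node"
    resources: Resources = Resources()
    needs: tuple[str, ...] = ()     # upstream node ids (the edges)
    max_retries: int = 3            # retries happen at node granularity

def ready_nodes(nodes: list[Node], done: set[str]) -> list[Node]:
    """Nodes whose dependencies are all complete and that haven't run yet."""
    return [n for n in nodes if n.id not in done and set(n.needs) <= done]
```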
12 Example YAML Specifications
12.1 Backend Inventory
backends:
  slurm_a:
    type: slurm
    ssh_host: login-a.example.edu
    ssh_user: houjun
    sbatch_defaults:
      partition: gpu
      time: "24:00:00"
  slurm_b:
    type: slurm
    ssh_host: login-b.example.org
    ssh_user: houjun
    sbatch_defaults:
      partition: preempt
      time: "12:00:00"
  gcp:
    type: gcp
    project: myproj
    region: us-central2
12.2 Storage
storage:
  object:
    type: s3
    endpoint: "https://s3.example.com"
    bucket: "adventuretime"
    prefix: "runs"
  hot_cache:
    type: weka
    mount: "/mnt/weka"
    enabled: true
12.3 Single Run Spec
run:
  id: "fork-mid-2026-01-21-001"
  experiment: "experiments/fork_mid.py:build"
  resources:
    gpus: 8
    gpu_type: "H100|A100|any"
  policy:
    preemptible: true
    checkpoint_interval_sec: 120
  backend_selector:
    order: ["slurm_a", "slurm_b", "gcp"]
12.4 DAG Spec (Tokenize → Train → Rollouts)
dag:
  id: "ragdoll-2026-01-21"
  nodes:
    - id: tokenize
      entry: "experiments/tokenize.py:node"
      resources: { cpus: 32 }
    - id: train
      entry: "experiments/fork_mid.py:build"
      needs: [tokenize]
      resources: { gpus: 16 }
    - id: rollouts
      entry: "experiments/ragdoll_api.py:node"
      needs: [train]
      resources: { cpus: 16 }
13 Implementation Plan
13.1 Phase 0 (2–3 weeks)
- CLI skeleton
- RunEnv
- Single SLURM backend
- Pytree checkpoint manager (local + S3)
13.2 Phase 1 (4–6 weeks)
- Scheduler reconcile loop
- Multi-backend failover
- Preemption detection
- Unified logging
13.3 Phase 2 (3–4 weeks)
- DAG execution
- API rollout support
- Dataset + artifact caching
13.4 Phase 3 (optional)
- Shrink-only topology changes
- Smarter backend planning
- UI polish
14 Cost Estimates
14.1 One-Time Hardware (Target ~$50k)
| Item | Cost (USD) |
|---|---|
| 1U Control Plane server | $6–12k |
| 1U Debug/SSH Bastion (interactive) | $3–8k |
| 10GbE switch + DACs | $0.5–2k |
| Colo + networking (1 yr) | $5–10k |
| Optional hot cache (NAS/Weka-like) | $3–12k |
| Buffer / spares / rails / misc | $2–5k |
| Total | ~$20–50k |
Notes:
- This budget intentionally does NOT include GPUs.
- If you already have colo networking/space, the bottom end is realistic.
- If you actually deploy Weka proper, it can push you toward the top end.
14.2 Ongoing
- Object storage: low-to-moderate (checkpoints + artifacts; depends on retention)
- Colo: recurring monthly fee (varies widely)
- Maintenance: ~0.25–0.5 FTE systems effort
15 Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Heterogeneous cluster quirks | Adapter isolation + retry semantics |
| Checkpoint corruption | Manifest-based atomic commits |
| Scheduler complexity | Narrow scope; no intra-cluster scheduling |
| Research velocity slowdown | Local-first workflow preserved |
16 Success Criteria
- Local debug → cluster run requires no code changes
- Preempted jobs resume automatically
- No lost experiments over 3+ months
- Researchers add lego modules without infra changes
17 Conclusion
AdventureTime is intentionally narrow, opinionated infrastructure that trades breadth for reliability. By unifying all workloads under JAX pytree checkpointing and delegating scheduling to a lightweight control plane, it provides frontier-grade robustness at a cost and complexity appropriate for small research labs.
