- 1 Executive Summary
- 2 Core Design Principles
- 3 Non-Goals
- 4 System Overview
- 5 Hardware Architecture (Recommended)
- 6 Network & Communication Diagram (Textual)
- 7 Storage Model
- 7.1 Invariant
- 7.2 Checkpoint Flow
- 7.3 Artifacts & Logs
- 8 Unified Checkpointing (JAX Pytrees)
- 9 Repository Structure (Tech Spec)
- 10 Execution Model
- 10.1 Job Lifecycle
- 10.2 Backend Interface
- 11 DAG / “Ray-lite” Model
- 12 Example YAML Specifications
- 12.1 Backend Inventory
- 12.2 Storage
- 12.3 Single Run Spec
- 12.4 DAG Spec (Tokenize → Train → Rollouts)
- 13 Implementation Plan
- 13.1 Phase 0 (2–3 weeks)
- 13.2 Phase 1 (4–6 weeks)
- 13.3 Phase 2 (3–4 weeks)
- 13.4 Phase 3 (optional)
- 14 Cost Estimates
- 14.1 One-Time Hardware (Target ~$50k)
- 14.2 Ongoing
- 15 Risks & Mitigations
- 16 Success Criteria
- 17 Conclusion
After a conversation with an LM (https://chatgpt.com/share/697143db-c3e0-8000-b56c-07cf7ca43795), the following proposal was generated.
1 Executive Summary
Research labs consistently suffer from fragile, bespoke infrastructure that fails under preemption, breaks across heterogeneous clusters, and cannot keep pace with rapid iteration. This project proposes AdventureTime: a minimal, JAX-first training and experimentation infrastructure designed for small frontier research groups (~10 people) with access to heterogeneous compute.
The system prioritizes two non-negotiable guarantees:
- If an experiment runs locally, it runs on any cluster by changing only the submit command.
- No experiment ever loses progress; all workloads are preemption-safe and resumable.
AdventureTime achieves this by unifying all workloads (training, eval, tokenization, API rollouts, DAG workflows) under a single abstraction: checkpointable JAX pytrees, paired with a lightweight control-plane scheduler and object-store-backed artifact system.
The total infrastructure cost target is ~$50k, with modest ongoing operational costs.
2 Core Design Principles
- JAX-first (Flax, Optax, Orbax)
- All resumable state is a JAX pytree
- Restart is cheap; elasticity is optional
- No Kubernetes, no Ray, no containers
- UV for dependency management
- SSH + SLURM + object storage as primary integration points
- Control plane orchestrates; workers are stateless
3 Non-Goals
- Enterprise multi-tenant auth
- Perfect elastic world-size training
- Replacing SLURM or cloud schedulers
- Building a general-purpose data lake
- Long-lived actors or services on workers
4 System Overview
AdventureTime consists of three layers:
- Control Plane (colo-hosted)
- Execution Backends (heterogeneous clusters)
- Runtime Library + CLI (monorepo)
All interaction is mediated via job specs, checkpoint manifests, and object storage.
5 Hardware Architecture (Recommended)
5.1 Goals (hardware)
- One always-on control node (scheduler + state + UI + adapters)
- One interactive “debug box” for SSH ingress and editing (e.g., Emacs/tmux)
- No GPU requirement in colo; GPUs live in external clusters
- Small footprint: ~2U rack total is acceptable; 1U is possible with tradeoffs
5.2 Option A: 2U total (recommended)
5.2.1 1U Control Plane Server (always-on)
- Role: scheduler/controller, DB, logging UI, artifact index, backend adapters
- Specs (baseline):
- CPU: 16–32 cores (e.g., EPYC / Xeon)
- RAM: 128–256 GB
- Storage:
- OS: mirrored SSD (e.g., 2×1TB)
- Local scratch/cache: 2–8TB NVMe (single or mirrored; not canonical)
- Network:
- 10GbE preferred (1GbE workable)
- Notes:
- This node should be stable, boring, and easy to replace.
5.2.2 1U Debug / SSH Bastion (interactive)
- Role: SSH endpoint, “human box,” editor, small-scale local runs, diagnostics
- Specs:
- CPU: 8–16 cores
- RAM: 64–128 GB
- Storage: 1–2TB SSD
- Notes:
- Can also host small services (docs preview, dashboards) if desired.
5.3 Option B: 1U total (aggressive)
- Single 1U machine runs everything (control + debug)
- Risk: interruptions/reboots/maintenance hit both scheduling and your “human box”
- Acceptable only if you’re okay with occasional coordination pauses.
5.4 Option C: “DGX Spark as Debug Box”
- If DGX Spark is available, treat it as:
- Debug / interactive SSH box
- Not mandatory for control-plane correctness
- Control plane remains a boring 1U server.
5.5 Storage in colo
- Weka is optional. For this project’s goals, treat Weka as a hot cache/staging layer, not a dependency.
- Canonical storage is S3-compatible object storage.
5.6 Power Budget (planning)
(Exact draw depends on chosen servers; below is a conservative sizing guide.)
- Control plane 1U server: ~150–350W typical, ~500W peak
- Debug/bastion 1U server: ~100–250W typical, ~400W peak
- 10GbE switch (small): ~20–60W typical
- Total typical: ~300–660W
- Total peak (safe provision): ~900–1,200W
Power provisioning recommendation:
- Budget 1.2kW on the PDU for comfort
- Single 120V/15A circuit can be tight at peak; prefer 120V/20A or 208V if available
5.7 Space Budget (rack units)
- Option A: 2U servers + optional 1U switch = 2U–3U total
- Option B: 1U total + optional 1U switch = 1U–2U total
- Cabling: plan front-to-back airflow, short DACs for 10GbE where possible
5.8 Colo notes
- Put the control plane on UPS-backed power (colo UPS or your own small UPS if permitted)
- Maintain remote serial / out-of-band management (iDRAC/iLO) for recovery
6 Network & Communication Diagram (Textual)
[Dev Laptop]
|
| adventuretime run/submit (SSH/HTTPS)
v
[Debug/Bastion Box] (Emacs/tmux, human ingress)
|
| (SSH, internal)
v
[Control Plane Server] (scheduler, state, UI)
|
|-- SSH --> [SLURM login A] --> sbatch --> compute nodes
|-- SSH --> [SLURM login B] --> sbatch --> compute nodes
|-- SSH --> [SLURM login C] --> sbatch --> compute nodes
|
|-- HTTPS --> [S3-compatible Object Store]
|
|-- LAN --> [Optional Hot Cache (Weka/NAS)]
Beyond optional heartbeats, workers never communicate with the control plane, and jobs never communicate with each other; coordination happens through job specs, checkpoints, and object storage.
7 Storage Model
7.1 Invariant
All resumable state is a JAX pytree plus a small JSON manifest.
7.2 Checkpoint Flow
[Worker Scratch Disk]
  └── ckpt.tmp/
        ├── orbax blobs
        └── metadata.json
          |
          | upload blobs
          v
[S3://runs/<run_id>/ckpt/<step>/...]
          |
          | upload manifest.json LAST
          v
Checkpoint committed atomically
The presence of a manifest indicates a valid checkpoint.
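A minimal sketch of this commit protocol, assuming boto3 for the S3 transport; the helper name and manifest fields are illustrative, not the actual adventuretime API:

```python
# Manifest-last checkpoint commit (sketch; assumes boto3, names illustrative).
import json
import pathlib

import boto3

s3 = boto3.client("s3")

def commit_checkpoint(local_dir: pathlib.Path, bucket: str, prefix: str) -> None:
    """Upload all blobs first, then the manifest. Readers treat a checkpoint
    without manifest.json as nonexistent, so a partial upload is never valid."""
    blobs = [p for p in local_dir.rglob("*")
             if p.is_file() and p.name != "manifest.json"]
    for path in blobs:
        key = f"{prefix}/{path.relative_to(local_dir)}"
        s3.upload_file(str(path), bucket, key)

    manifest = {"files": [str(p.relative_to(local_dir)) for p in blobs]}
    # The manifest goes up LAST; its presence marks the checkpoint as committed.
    s3.put_object(
        Bucket=bucket,
        Key=f"{prefix}/manifest.json",
        Body=json.dumps(manifest).encode("utf-8"),
    )
```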
7.3 Artifacts & Logs
- Artifacts (JSONL, images, tables) are written in parts
- Each part is immutable
- A manifest tracks completion
- The same mechanism covers API rollouts and training (see the sketch below)
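A minimal sketch of the part-based pattern for a local staging directory; the writer class and manifest fields are illustrative, and the object-store upload would reuse the §7.2 commit flow:

```python
# Immutable, part-based artifact writes (sketch; names illustrative).
import json
import pathlib

class ArtifactWriter:
    """Writes an artifact as numbered, immutable parts plus a manifest.
    A resumed job lists existing parts and continues from the next index."""

    def __init__(self, root: pathlib.Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def next_part_index(self) -> int:
        return len(list(self.root.glob("part-*.jsonl")))

    def write_part(self, records: list[dict]) -> pathlib.Path:
        idx = self.next_part_index()
        path = self.root / f"part-{idx:05d}.jsonl"
        path.write_text("\n".join(json.dumps(r) for r in records))
        return path

    def finalize(self) -> None:
        manifest = {"parts": sorted(p.name for p in self.root.glob("part-*.jsonl"))}
        (self.root / "manifest.json").write_text(json.dumps(manifest))
```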
8 Unified Checkpointing (JAX Pytrees)
All workloads checkpoint a pytree:
Training:
- model params
- optimizer state
- RNG state
- step counters
API rollouts / DAG nodes:
- cursor / index
- RNG seed
- cached responses (optional)
- progress metadata
CheckpointManager API:
- save(pytree, step) -> CheckpointRef
- latest() -> Optional[CheckpointRef]
- restore(target_pytree) -> pytree
Orbax is used under the hood; transport is abstracted.
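A minimal sketch of this surface as a Python protocol, plus a resumable-rollout usage example; everything beyond the three listed methods (the `CheckpointRef` fields, the driver loop, the `call_api` callable) is an assumption for illustration:

```python
# CheckpointManager surface from above, as a typing.Protocol (sketch only).
from dataclasses import dataclass
from typing import Any, Callable, Optional, Protocol

@dataclass(frozen=True)
class CheckpointRef:
    run_id: str
    step: int
    uri: str  # e.g. an s3://runs/<run_id>/ckpt/<step>/ prefix

class CheckpointManager(Protocol):
    def save(self, pytree: Any, step: int) -> CheckpointRef: ...
    def latest(self) -> Optional[CheckpointRef]: ...
    def restore(self, target_pytree: Any) -> Any: ...

def run_rollouts(ckpt: CheckpointManager,
                 prompts: list[str],
                 call_api: Callable[[str], str]) -> None:
    """Resumable API rollouts: the cursor/results pytree is the only state."""
    state = {"cursor": 0, "results": []}
    if ckpt.latest() is not None:
        state = ckpt.restore(state)          # restore into the target structure

    for i in range(state["cursor"], len(prompts)):
        state["results"].append(call_api(prompts[i]))
        state["cursor"] = i + 1
        if state["cursor"] % 100 == 0:       # checkpoint every 100 prompts
            ckpt.save(state, step=state["cursor"])
```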
9 Repository Structure (Tech Spec)
monorepo/
  adventuretime/
    cli/
      main.py           ; run/submit/status/logs
    core/
      env.py            ; RunEnv
      spec.py           ; JobSpec, DAGSpec, ResourceSpec
      registry.py       ; experiment discovery
      heartbeat.py      ; liveness + preemption hooks
    ckpt/
      pytree.py         ; pytree API
      orbax.py          ; orbax adapters
      manifest.py       ; atomic checkpoint manifests
      transport.py      ; local <-> object store
    io/
      datasets.py       ; dataset refs + caching
      artifacts.py      ; artifact refs
      cache.py          ; scratch cache mgmt
    log/
      events.py         ; structured metrics
      sink_wandb.py     ; optional wandb
      sink_selfhost.py  ; self-host UI client
    backends/
      base.py           ; Backend interface
      slurm.py          ; sbatch emitter + watcher
      ssh.py            ; direct SSH executor
      gcp.py            ; cloud fallback
    sched/
      controller.py     ; reconcile loop
      planner.py        ; backend selection
      state.py          ; sqlite/pg run state
      queue.py          ; DAG execution
    dag/
      model.py          ; Node, Edge, Resources
      exec.py           ; node runner
  experiments/
    <exp>.py            ; returns Job or DAG
  lego/
    datasets/
    layers/
    models/
    optimizers/
  configs/
    backends.yaml
    storage.yaml
    clusters/
      <cluster>.yaml
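A minimal sketch of how entry strings like "experiments/fork_mid.py:build" could be resolved (the job of registry.py above); the helper name is illustrative:

```python
# Entry-point resolution for "path/to/module.py:function" strings (sketch).
import importlib.util
from typing import Any, Callable

def load_entry(entry: str) -> Callable[..., Any]:
    """Resolve e.g. 'experiments/fork_mid.py:build' to the build() callable."""
    path, func_name = entry.rsplit(":", 1)
    spec = importlib.util.spec_from_file_location("adventuretime_entry", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, func_name)

# Usage: the returned callable builds the Job or DAG handed to the scheduler.
# job_or_dag = load_entry("experiments/fork_mid.py:build")()
```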
10 Execution Model
10.1 Job Lifecycle
- Each run has a stable run_id
- Each submission attempt increments attempt_id
- Scheduler reconciles desired vs observed state
- On failure or preemption:
- find latest checkpoint
- resubmit on the next viable backend (see the sketch below)
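A minimal sketch of that failure/preemption branch; the `run`, `backends`, and `state_db` collaborators and the `with_resume` helper are assumptions, not the real core/sched API:

```python
# One reconcile pass for a single run (sketch; collaborator APIs are assumed).
TERMINAL = {"FAILED", "PREEMPTED", "NODE_FAIL", "TIMEOUT"}

def reconcile_once(run, backends, state_db) -> None:
    handle = state_db.current_handle(run.run_id)         # None if never submitted
    observed = handle.backend.poll(handle) if handle else "NOT_SUBMITTED"

    if observed in TERMINAL or observed == "NOT_SUBMITTED":
        ckpt = state_db.latest_checkpoint(run.run_id)     # may be None (fresh run)
        backend = next(b for b in backends if b.is_viable(run.spec))  # selector order
        attempt_id = state_db.next_attempt(run.run_id)    # attempt_id increments
        spec = run.spec.with_resume(ckpt)                 # hypothetical helper
        state_db.record_attempt(run.run_id, attempt_id, backend.submit(spec))
```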
10.2 Backend Interface
Each backend implements:
- submit(JobSpec) -> JobHandle
- poll(JobHandle) -> state
- cancel(JobHandle)
- tail_logs(JobHandle)
The SLURM adapter parses exit codes and job state reasons to detect preemption.
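A minimal sketch of this interface as a typing.Protocol; the `JobState` values and `JobHandle` fields are assumptions:

```python
# Backend interface listed above (sketch; state and field names assumed).
from dataclasses import dataclass
from enum import Enum, auto
from typing import Iterator, Protocol

class JobState(Enum):
    PENDING = auto()
    RUNNING = auto()
    SUCCEEDED = auto()
    FAILED = auto()
    PREEMPTED = auto()

@dataclass(frozen=True)
class JobHandle:
    backend: str
    native_id: str      # e.g. the SLURM job id
    run_id: str
    attempt_id: int

class Backend(Protocol):
    def submit(self, spec: "JobSpec") -> JobHandle: ...
    def poll(self, handle: JobHandle) -> JobState: ...
    def cancel(self, handle: JobHandle) -> None: ...
    def tail_logs(self, handle: JobHandle) -> Iterator[str]: ...

# A SLURM adapter would map native states such as PREEMPTED or NODE_FAIL
# onto JobState when polling.
```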
11 DAG / “Ray-lite” Model
- DAG nodes are jobs, not actors
- Nodes request resources
- Nodes checkpoint state
- Retries happen at node granularity
- Within-node parallelism uses:
- JAX multihost
- Python multiprocessing
This avoids Ray’s complexity while retaining fault tolerance.
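A minimal sketch of the node model under these constraints; field names only loosely mirror dag/model.py and are assumptions:

```python
# Jobs-not-actors DAG model (sketch; field names assumed).
from dataclasses import dataclass

@dataclass(frozen=True)
class Resources:
    cpus: int = 0
    gpus: int = 0

@dataclass(frozen=True)
class Node:
    id: str
    entry: str                      # e.g. "experiments/tokenize.py:node"
    resources: Resources = Resources()
    needs: tuple[str, ...] = ()     # upstream node ids (the edges)
    max_retries: int = 3            # retries happen at node granularity

def ready_nodes(nodes: list[Node], done: set[str]) -> list[Node]:
    """Nodes whose dependencies are all complete and that haven't run yet."""
    return [n for n in nodes if n.id not in done and set(n.needs) <= done]
```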
12 Example YAML Specifications
12.1 Backend Inventory
backends:
  slurm_a:
    type: slurm
    ssh_host: login-a.example.edu
    ssh_user: houjun
    sbatch_defaults:
      partition: gpu
      time: "24:00:00"
  slurm_b:
    type: slurm
    ssh_host: login-b.example.org
    ssh_user: houjun
    sbatch_defaults:
      partition: preempt
      time: "12:00:00"
  gcp:
    type: gcp
    project: myproj
    region: us-central2
12.2 Storage
storage:
  object:
    type: s3
    endpoint: "https://s3.example.com"
    bucket: "adventuretime"
    prefix: "runs"
  hot_cache:
    type: weka
    mount: "/mnt/weka"
    enabled: true
12.3 Single Run Spec
run:
  id: "fork-mid-2026-01-21-001"
  experiment: "experiments/fork_mid.py:build"
  resources:
    gpus: 8
    gpu_type: "H100|A100|any"
  policy:
    preemptible: true
    checkpoint_interval_sec: 120
  backend_selector:
    order: ["slurm_a", "slurm_b", "gcp"]
12.4 DAG Spec (Tokenize → Train → Rollouts)
dag:
  id: "ragdoll-2026-01-21"
  nodes:
    - id: tokenize
      entry: "experiments/tokenize.py:node"
      resources: { cpus: 32 }
    - id: train
      entry: "experiments/fork_mid.py:build"
      needs: [tokenize]
      resources: { gpus: 16 }
    - id: rollouts
      entry: "experiments/ragdoll_api.py:node"
      needs: [train]
      resources: { cpus: 16 }
13 Implementation Plan
13.1 Phase 0 (2–3 weeks)
- CLI skeleton
- RunEnv
- Single SLURM backend
- Pytree checkpoint manager (local + S3)
13.2 Phase 1 (4–6 weeks)
- Scheduler reconcile loop
- Multi-backend failover
- Preemption detection
- Unified logging
13.3 Phase 2 (3–4 weeks)
- DAG execution
- API rollout support
- Dataset + artifact caching
13.4 Phase 3 (optional)
- Shrink-only topology changes
- Smarter backend planning
- UI polish
14 Cost Estimates
14.1 One-Time Hardware (Target ~$50k)
| Item | Cost (USD) |
|---|---|
| 1U Control Plane server | $6–12k |
| 1U Debug/SSH Bastion (interactive) | $3–8k |
| 10GbE switch + DACs | $0.5–2k |
| Colo + networking (1 yr) | $5–10k |
| Optional hot cache (NAS/Weka-like) | $3–12k |
| Buffer / spares / rails / misc | $2–5k |
| Total | ~$20–50k |
Notes:
- This budget intentionally does NOT include GPUs.
- If you already have colo networking/space, the bottom end is realistic.
- If you actually deploy Weka proper, it can push you toward the top end.
14.2 Ongoing
- Object storage: low-to-moderate (checkpoints + artifacts; depends on retention)
- Colo: recurring monthly fee (varies widely)
- Maintenance: ~0.25–0.5 FTE systems effort
15 Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Heterogeneous cluster quirks | Adapter isolation + retry semantics |
| Checkpoint corruption | Manifest-based atomic commits |
| Scheduler complexity | Narrow scope; no intra-cluster scheduling |
| Research velocity slowdown | Local-first workflow preserved |
16 Success Criteria
- Local debug → cluster run requires no code changes
- Preempted jobs resume automatically
- No lost experiments over 3+ months
- Researchers add lego modules without infra changes
17 Conclusion
AdventureTime is intentionally narrow, opinionated infrastructure that trades breadth for reliability. By unifying all workloads under JAX pytree checkpointing and delegating scheduling to a lightweight control plane, it provides frontier-grade robustness at a cost and complexity appropriate for small research labs.
