adventuretime

The following proposal grew out of a conversation with an LM: https://chatgpt.com/share/697143db-c3e0-8000-b56c-07cf7ca43795

1 Executive Summary

Research labs consistently suffer from fragile, bespoke infrastructure that fails under preemption, heterogeneous clusters, and rapid iteration. This project proposes AdventureTime: a minimal, JAX-first training and experimentation infrastructure designed for small frontier research groups (~10 people) with access to heterogeneous compute.

The system prioritizes two non-negotiable guarantees:

  1. If an experiment runs locally, it runs on any cluster by changing only the submit command.
  2. No experiment ever loses progress; all workloads are preemption-safe and resumable.

AdventureTime achieves this by unifying all workloads (training, eval, tokenization, API rollouts, DAG workflows) under a single abstraction: checkpointable JAX pytrees, paired with a lightweight control-plane scheduler and object-store-backed artifact system.

The total infrastructure cost target is ~$50k, with modest ongoing operational costs.

2 Core Design Principles

  • JAX-first (Flax, Optax, Orbax)
  • All resumable state is a JAX pytree
  • Restart is cheap; elasticity is optional
  • No Kubernetes, no Ray, no containers
  • UV for dependency management
  • SSH + SLURM + object storage as primary integration points
  • Control plane orchestrates; workers are stateless

3 Non-Goals

  • Enterprise multi-tenant auth
  • Perfect elastic world-size training
  • Replacing SLURM or cloud schedulers
  • Building a general-purpose data lake
  • Long-lived actors or services on workers

4 System Overview

AdventureTime consists of three layers:

  1. Control Plane (colo-hosted)
  2. Execution Backends (heterogeneous clusters)
  3. Runtime Library + CLI (monorepo)

All interaction is mediated via job specs, checkpoint manifests, and object storage.
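For concreteness, the job-spec side of that interface might be modeled as small frozen dataclasses. This is a sketch only: the field names mirror the YAML examples in section 12, but the exact types and defaults are assumptions.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ResourceSpec:
    # Resource request carried in the job spec; gpu_type uses the
    # "H100|A100|any" preference syntax from the YAML examples.
    gpus: int = 0
    cpus: int = 0
    gpu_type: str = "any"

@dataclass(frozen=True)
class JobSpec:
    run_id: str
    experiment: str                          # e.g. "experiments/fork_mid.py:build"
    resources: ResourceSpec = field(default_factory=ResourceSpec)
    preemptible: bool = True
    checkpoint_interval_sec: int = 120
    backend_order: tuple[str, ...] = ("slurm_a",)  # failover order
```

Frozen dataclasses make specs hashable and safe to pass between the scheduler and backend adapters without defensive copying.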

5 Colo Hardware

5.1 Goals (hardware)

  • One always-on control node (scheduler + state + UI + adapters)
  • One interactive “debug box” for SSH ingress and editing (e.g., Emacs/tmux)
  • No GPU requirement in colo; GPUs live in external clusters
  • Small footprint: ~2U rack total is acceptable; 1U is possible with tradeoffs

5.2 Option A: 2U total (two 1U servers)

5.2.1 1U Control Plane Server (always-on)

  • Role: scheduler/controller, DB, logging UI, artifact index, backend adapters
  • Specs (baseline):
    • CPU: 16–32 cores (e.g., EPYC / Xeon)
    • RAM: 128–256 GB
    • Storage:
      • OS: mirrored SSD (e.g., 2×1TB)
      • Local scratch/cache: 2–8TB NVMe (single or mirrored; not canonical)
    • Network:
      • 10GbE preferred (1GbE workable)
    • Notes:
      • This node should be stable, boring, and easy to replace.

5.2.2 1U Debug / SSH Bastion (interactive)

  • Role: SSH endpoint, “human box,” editor, small-scale local runs, diagnostics
  • Specs:
    • CPU: 8–16 cores
    • RAM: 64–128 GB
    • Storage: 1–2TB SSD
  • Notes:
    • Can also host small services (docs preview, dashboards) if desired.

5.3 Option B: 1U total (aggressive)

  • Single 1U machine runs everything (control + debug)
  • Risk: interrupts/reboots/maintenance hit both scheduling and your “human box”
  • Acceptable only if you’re okay with occasional coordination pauses.

5.4 Option C: “DGX Spark as Debug Box”

  • If DGX Spark is available, treat it as:
    • Debug / interactive SSH box
    • Not mandatory for control-plane correctness
  • Control plane remains a boring 1U server.

5.5 Storage in colo

  • Weka is optional. For this project’s goals, treat Weka as a hot cache/staging layer, not a dependency.
  • Canonical storage is S3-compatible object storage.

5.6 Power Budget (planning)

(Exact draw depends on chosen servers; below is a conservative sizing guide.)

  • Control plane 1U server: ~150–350W typical, ~500W peak
  • Debug/bastion 1U server: ~100–250W typical, ~400W peak
  • 10GbE switch (small): ~20–60W typical
  • Total typical: ~300–660W
  • Total peak (safe provision): ~900–1,200W

Power provisioning recommendation:

  • Budget 1.2kW on the PDU for comfort
  • Single 120V/15A circuit can be tight at peak; prefer 120V/20A or 208V if available

5.7 Space Budget (rack units)

  • Option A: 2U servers + optional 1U switch = 2U–3U total
  • Option B: 1U total + optional 1U switch = 1U–2U total
  • Cabling: plan front-to-back airflow, short DACs for 10GbE where possible

5.8 Colo notes

  • Put the control plane on UPS-backed power (colo UPS or your own small UPS if permitted)
  • Maintain remote serial / out-of-band management (iDRAC/iLO) for recovery

6 Network & Communication Diagram (Textual)

[Dev Laptop]
   |
   | adventuretime run/submit (SSH/HTTPS)
   v
[Debug/Bastion Box]  (Emacs/tmux, human ingress)
   |
   | (SSH, internal)
   v
[Control Plane Server] (scheduler, state, UI)
   |
   |-- SSH --> [SLURM login A] --> sbatch --> compute nodes
   |-- SSH --> [SLURM login B] --> sbatch --> compute nodes
   |-- SSH --> [SLURM login C] --> sbatch --> compute nodes
   |
   |-- HTTPS --> [S3-compatible Object Store]
   |
   |-- LAN --> [Optional Hot Cache (Weka/NAS)]

Workers never communicate with each other; their only contact with the control plane is an optional heartbeat.

7 Storage Model

7.1 Invariant

All resumable state is a JAX pytree plus a small JSON manifest.

7.2 Checkpoint Flow

[Worker Scratch Disk]
  └── ckpt.tmp/
        ├── orbax blobs
        └── metadata.json
        |
        | upload blobs
        v
[S3://runs/<run_id>/ckpt/<step>/...]
        |
        | upload manifest.json LAST
        v
Checkpoint committed atomically

The presence of a manifest indicates a valid checkpoint.
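The manifest-last commit can be sketched against a local directory standing in for the object store (the real path would upload blobs via the S3 API; the layout here is illustrative):

```python
import json
from pathlib import Path

def commit_checkpoint(tmp_dir: Path, dest_dir: Path) -> None:
    """Copy all blobs first, then write manifest.json LAST so that its
    presence marks the checkpoint as committed (local stand-in for the
    S3 upload flow described above)."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    names = []
    for blob in sorted(tmp_dir.iterdir()):
        (dest_dir / blob.name).write_bytes(blob.read_bytes())
        names.append(blob.name)
    # Written last: a reader that sees manifest.json knows every blob
    # it lists already landed, so partial uploads are never mistaken
    # for valid checkpoints.
    (dest_dir / "manifest.json").write_text(json.dumps({"blobs": names}))

def is_committed(ckpt_dir: Path) -> bool:
    return (ckpt_dir / "manifest.json").exists()
```

A crash at any earlier point leaves a directory without a manifest, which restore logic simply skips.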

7.3 Artifacts & Logs

  • Artifacts (JSONL, images, tables) are written in parts
  • Each part is immutable
  • A manifest tracks completion
  • Same mechanism for API rollouts and training
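A sketch of the part/manifest mechanism (file naming and manifest schema are assumptions, not the shipped format):

```python
import json
from pathlib import Path

class PartedArtifact:
    """Minimal sketch of part-based artifact writing: immutable JSONL
    parts plus a manifest that records which parts are complete."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def write_part(self, index: int, records: list[dict]) -> str:
        name = f"part-{index:05d}.jsonl"
        path = self.root / name
        # Parts are immutable: never overwrite an existing one.
        if path.exists():
            raise FileExistsError(name)
        path.write_text("".join(json.dumps(r) + "\n" for r in records))
        return name

    def finalize(self, part_names: list[str]) -> None:
        # Manifest tracks completion; written once all parts exist.
        (self.root / "manifest.json").write_text(
            json.dumps({"parts": sorted(part_names)}))

    def completed_parts(self) -> list[str]:
        m = self.root / "manifest.json"
        return json.loads(m.read_text())["parts"] if m.exists() else []
```

Because parts are immutable and tracked by manifest, a resumed API-rollout job can count completed parts and continue from the next index.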

8 Unified Checkpointing (JAX Pytrees)

All workloads checkpoint a pytree:

  • Training:
    • model params
    • optimizer state
    • RNG state
    • step counters
  • API rollouts / DAG nodes:
    • cursor / index
    • RNG seed
    • cached responses (optional)
    • progress metadata

CheckpointManager API:

  • save(pytree, step) -> CheckpointRef
  • latest() -> Optional[CheckpointRef]
  • restore(target_pytree) -> pytree

Orbax is used under the hood; transport is abstracted.
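To make the contract concrete, here is a toy local stand-in for that API: pickle in place of Orbax blobs, a directory layout invented for the sketch. The real manager delegates serialization to Orbax and transport to the object store.

```python
import pickle
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional

@dataclass(frozen=True)
class CheckpointRef:
    step: int
    path: Path

class CheckpointManager:
    """Toy implementation of the save/latest/restore contract above;
    Orbax and S3 transport are replaced by pickle on local disk."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, pytree: Any, step: int) -> CheckpointRef:
        path = self.root / f"step_{step:08d}.pkl"
        path.write_bytes(pickle.dumps(pytree))
        return CheckpointRef(step, path)

    def latest(self) -> Optional[CheckpointRef]:
        ckpts = sorted(self.root.glob("step_*.pkl"))
        if not ckpts:
            return None
        path = ckpts[-1]
        return CheckpointRef(int(path.stem.split("_")[1]), path)

    def restore(self, target_pytree: Any) -> Any:
        # With Orbax, target_pytree supplies the structure/shardings to
        # restore into; this toy only uses it when nothing is saved yet.
        ref = self.latest()
        if ref is None:
            return target_pytree
        return pickle.loads(ref.path.read_bytes())
```

Zero-padded step names keep lexicographic and numeric ordering aligned, so `latest()` is a plain sorted glob.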

9 Repository Structure (Tech Spec)

monorepo/
  adventuretime/
    cli/
      main.py                  ; run/submit/status/logs
    core/
      env.py                   ; RunEnv
      spec.py                  ; JobSpec, DAGSpec, ResourceSpec
      registry.py              ; experiment discovery
      heartbeat.py             ; liveness + preemption hooks
    ckpt/
      pytree.py                ; pytree API
      orbax.py                 ; orbax adapters
      manifest.py              ; atomic checkpoint manifests
      transport.py             ; local <-> object store
    io/
      datasets.py              ; dataset refs + caching
      artifacts.py             ; artifact refs
      cache.py                 ; scratch cache mgmt
    log/
      events.py                ; structured metrics
      sink_wandb.py            ; optional wandb
      sink_selfhost.py         ; self-host UI client
    backends/
      base.py                  ; Backend interface
      slurm.py                 ; sbatch emitter + watcher
      ssh.py                   ; direct SSH executor
      gcp.py                   ; cloud fallback
    sched/
      controller.py            ; reconcile loop
      planner.py               ; backend selection
      state.py                 ; sqlite/pg run state
      queue.py                 ; DAG execution
    dag/
      model.py                 ; Node, Edge, Resources
      exec.py                  ; node runner
  experiments/
    <exp>.py                   ; returns Job or DAG
  lego/
    datasets/
    layers/
    models/
    optimizers/
  configs/
    backends.yaml
    storage.yaml
    clusters/
      <cluster>.yaml

10 Execution Model

10.1 Job Lifecycle

  • Each run has a stable run_id
  • Each submission attempt increments attempt_id
  • Scheduler reconciles desired vs observed state
  • On failure or preemption:
    • find latest checkpoint
    • resubmit on next viable backend

10.2 Backend Interface

Each backend implements:

  • submit(JobSpec) -> JobHandle
  • poll(JobHandle) -> state
  • cancel(JobHandle)
  • tail_logs(JobHandle)

SLURM adapter parses exit codes and reasons to detect preemption.
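A minimal sketch of the interface and of state-based preemption detection. The Protocol shape follows the four methods above; the SLURM state names come from sacct/squeue output, but the mapping to scheduler outcomes is an assumption (the real adapter would also parse exit codes and reasons):

```python
from typing import Any, Protocol

class Backend(Protocol):
    """The four-method backend contract; the handle type is opaque
    (a SLURM handle might wrap a job id, an SSH handle a PID)."""
    def submit(self, spec: dict) -> Any: ...
    def poll(self, handle: Any) -> str: ...
    def cancel(self, handle: Any) -> None: ...
    def tail_logs(self, handle: Any) -> str: ...

# States treated as resumable interruptions rather than hard failures.
# TIMEOUT is included on the assumption that hitting a partition time
# limit should trigger resume-from-checkpoint, not a failure.
PREEMPTION_STATES = {"PREEMPTED", "NODE_FAIL", "TIMEOUT"}

def classify_slurm_state(state: str) -> str:
    # sacct may report e.g. "CANCELLED by 1000" or "CANCELLED+";
    # normalize to the bare state token first.
    base = state.split()[0].rstrip("+")
    if base in PREEMPTION_STATES:
        return "preempted"
    if base == "COMPLETED":
        return "succeeded"
    if base in {"PENDING", "RUNNING", "COMPLETING"}:
        return "active"
    return "failed"
```

Isolating this classification in the adapter is what keeps cluster quirks (section 15) from leaking into the scheduler.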

11 DAG / “Ray-lite” Model

  • DAG nodes are jobs, not actors
  • Nodes request resources
  • Nodes checkpoint state
  • Retries happen at node granularity
  • Within-node parallelism uses:
    • JAX multihost
    • Python multiprocessing

This avoids Ray’s complexity while retaining fault tolerance.
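The node-granularity retry policy can be sketched as a small topological executor. `run_node` and the retry budget are illustrative; the real queue would submit each ready node through a backend rather than call it inline.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: str
    needs: list[str] = field(default_factory=list)
    max_retries: int = 2

def run_dag(nodes: list[Node], run_node) -> dict[str, str]:
    """Execute nodes in dependency order; retry failures at node
    granularity (run_node returns True on success). Assumes the
    dependency graph is acyclic."""
    done: dict[str, str] = {}
    remaining = {n.id: n for n in nodes}
    while remaining:
        ready = [n for n in remaining.values()
                 if all(done.get(d) == "ok" for d in n.needs)]
        if not ready:
            # An upstream failure blocks everything downstream.
            for n in remaining.values():
                done[n.id] = "blocked"
            break
        for n in ready:
            # any() short-circuits on the first success, so retries
            # only happen after a failed attempt.
            ok = any(run_node(n) for _ in range(n.max_retries + 1))
            done[n.id] = "ok" if ok else "failed"
            del remaining[n.id]
    return done
```

Since each node is a job that checkpoints its own pytree, a retry resumes from the node's latest checkpoint rather than recomputing from scratch.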

12 Example YAML Specifications

12.1 Backend Inventory

backends:
  slurm_a:
    type: slurm
    ssh_host: login-a.example.edu
    ssh_user: houjun
    sbatch_defaults:
      partition: gpu
      time: "24:00:00"
  slurm_b:
    type: slurm
    ssh_host: login-b.example.org
    ssh_user: houjun
    sbatch_defaults:
      partition: preempt
      time: "12:00:00"
  gcp:
    type: gcp
    project: myproj
    region: us-central2

12.2 Storage

storage:
  object:
    type: s3
    endpoint: "https://s3.example.com"
    bucket: "adventuretime"
    prefix: "runs"
  hot_cache:
    type: weka
    mount: "/mnt/weka"
    enabled: true

12.3 Single Run Spec

run:
  id: "fork-mid-2026-01-21-001"
  experiment: "experiments/fork_mid.py:build"
  resources:
    gpus: 8
    gpu_type: "H100|A100|any"
  policy:
    preemptible: true
    checkpoint_interval_sec: 120
  backend_selector:
    order: ["slurm_a", "slurm_b", "gcp"]

12.4 DAG Spec (Tokenize → Train → Rollouts)

dag:
  id: "ragdoll-2026-01-21"
  nodes:
    - id: tokenize
      entry: "experiments/tokenize.py:node"
      resources: { cpus: 32 }
    - id: train
      entry: "experiments/fork_mid.py:build"
      needs: [tokenize]
      resources: { gpus: 16 }
    - id: rollouts
      entry: "experiments/ragdoll_api.py:node"
      needs: [train]
      resources: { cpus: 16 }

13 Implementation Plan

13.1 Phase 0 (2–3 weeks)

  • CLI skeleton
  • RunEnv
  • Single SLURM backend
  • Pytree checkpoint manager (local + S3)

13.2 Phase 1 (4–6 weeks)

  • Scheduler reconcile loop
  • Multi-backend failover
  • Preemption detection
  • Unified logging

13.3 Phase 2 (3–4 weeks)

  • DAG execution
  • API rollout support
  • Dataset + artifact caching

13.4 Phase 3 (optional)

  • Shrink-only topology changes
  • Smarter backend planning
  • UI polish

14 Cost Estimates

14.1 One-Time Hardware (Target ~$50k)

  Item                                 Cost (USD)
  1U Control Plane server              $6–12k
  1U Debug/SSH Bastion (interactive)   $3–8k
  10GbE switch + DACs                  $0.5–2k
  Colo + networking (1 yr)             $5–10k
  Optional hot cache (NAS/Weka-like)   $3–12k
  Buffer / spares / rails / misc       $2–5k
  Total                                ~$20–50k

Notes:

  • This budget intentionally does NOT include GPUs.
  • If you already have colo networking/space, the bottom end is realistic.
  • If you actually deploy Weka proper, it can push you toward the top end.

14.2 Ongoing

  • Object storage: low-to-moderate (checkpoints + artifacts; depends on retention)
  • Colo: recurring monthly fee (varies widely)
  • Maintenance: ~0.25–0.5 FTE systems effort

15 Risks & Mitigations

  • Heterogeneous cluster quirks → adapter isolation + retry semantics
  • Checkpoint corruption → manifest-based atomic commits
  • Scheduler complexity → narrow scope; no intra-cluster scheduling
  • Research velocity slowdown → local-first workflow preserved

16 Success Criteria

  • Local debug → cluster run requires no code changes
  • Preempted jobs resume automatically
  • No lost experiments over 3+ months
  • Researchers add lego modules without infra changes

17 Conclusion

AdventureTime is intentionally narrow, opinionated infrastructure that trades breadth for reliability. By unifying all workloads under JAX pytree checkpointing and delegating scheduling to a lightweight control plane, it provides frontier-grade robustness at a cost and complexity appropriate for small research labs.