Hyperion Compute

Work Orders · HPC Platform Ops

Command center for provisioning, cabling, and technician routing across NVIDIA-powered fabrics.

CLUSTER UTILIZATION

82%

+6%

GPU busy time, rolling 15m

WORK ORDERS

27

−3 open

8 ready for parallel execution

POWER ENVELOPE

9.8MW

safe margin

Load vs 11MW ceiling

QUEUE DEPTH

412 jobs

syncing

22 flagged for priority HPC

Admin · Tickets

Work order intelligence

Group by parallel-ready and ambiguous signals to unblock provisioning.

Known issue playbooks

WO-9823parallel A
critical

Replace PSU on rack P44

Telemetry caught undervolt on redundant PSU before failover; swap primary and run soak.

ETA • 00:15Generated from slack
WO-9824parallel B
high

Provision 10-node pod for SentiAI

Provision GPU pod with pre-approved network + storage wiring. Waiting on automation trigger.

ETA • 00:25Generated from manual
WO-9832parallel B
medium

Grant burst permissions for OmniBio

Grant storage + GPU slices for burst workloads via IAM template; pre-checks already green.

ETA • On-demandGenerated from manual
WO-9836parallel C
high

Patch Ceph gateway firmware

Firmware regression exposes stale locks; patch across gateway pair before peak jobs.

ETA • 00:45Generated from manual
WO-9837parallel D
medium

Expand cluster VLANs for FluxAI

FluxAI adding two inference clusters; need VLAN carve out and DHCP scopes.

ETA • 00:30Generated from manual
WO-9842parallel E
medium

Audit IAM keys for Atlas Lab

Quarterly IAM key rotation for Atlas Lab workloads.

ETA • On-demandGenerated from manual
WO-9844parallel F
medium

Deploy telemetry collector 5.2

Collector fleet needs upgrade to resolve timestamp skew bug.

ETA • 00:40Generated from manual

Ambiguous signals

SIG-442high

Node pool 7 intermittent resets

Sleds 7A-7C show 18% draw variance and intermittent PCIe faults without CRCs.

  • Cooling pump cavitation42%
  • NVLink harness pinch33%
  • Firmware regression25%
SIG-448medium

West aisle network flap

BER trending upward on QSFP lanes west aisle; no CRC yet but margin is shrinking.

  • Fiber bend radius violation47%
  • Faulty QSFP29%
  • Patch panel contamination24%
SIG-455medium

East pod cooling variance

Rack pairs E22/E23 heating faster under GPU saturations.

  • CRAC damper drift51%
  • Obstructed floor tile28%
  • Sensor calibration21%
SIG-458high

NVLink parity drift

Parity drift 12% between 7C/7D for 15 minutes.

  • Retimer thermal throttling40%
  • Link training regression34%
  • Telemetry bug26%

Admin · Forecasting

Parallelizable runs

Slot quick wins to keep fabric saturated without stepping on serial maintenance.

Redline Motors00:30-02:00

Provision 10x H100 nodes

Nodes • 10parallel safe
OmniBioNow-00:45

Secure permissions audit

Nodes • 4parallel safe
TopoSynth02:00-04:00

Scale image training pod

Nodes • 32serial only

Admin · Automation

Push-button fixes

For easy tasks like permissions or lightweight provisioning, issue creds instantly.

AUTO-12medium footprint
Provision 10 node pod

Spin up containers, wire storage paths, and push creds to customer vault.

AUTO-19low footprint
Set project permissions

Grant storage + GPU slices for burst workloads in 1 click.

Comms

Auto-generated updates

All parsed comms log their origin so everyone knows what bot created what.

08:46 CSTGenerated from slack

Alert: PSU P44 trending to fail — auto ticket WO-9823 created.

via #fabric-alerts

Linked ticket • WO-9823

07:55 CSTGenerated from email

Request to expand NVSwitch fabric — pending design approval.

via ops@toposynth.ai

Linked ticket • WO-9827

Technicians

Route + rank

Distance, floor elevation, and priority produce the go-now list.

1
WO-9830critical

Swap QSFP on spine 6

Distance • 58mFloor • L3 Aisle 6Elevation Δ • +7ft
2
WO-9828high

Inspect liquid loop Delta pod

Distance • 74mFloor • L2 Cold RowElevation Δ • -4ft
3
WO-9831standard

Verify cabling bundle 44B

Distance • 110mFloor • L2 Hot RowElevation Δ • +2ft

Cabling

Bundle health

Visualize installs, reroutes, and degraded loss in one glance.

Fabric A

Healthy · loss 0.4dB

Fabric B

Rerouted · awaiting sweep

NVLink spine

Investigate · loss 2.1dB

Inventory

Predictive restocking

Tracking burn so spare shelves stay ahead of the demand curve.

ItemOn-handThresholdDaysSignal
QSFP-DD 400G optics46306Rising consumption
Cold plate gaskets1409018Stable
Power shelf (N+1)8612Rising consumption
Fiber trunks (MTP-24)624025Cooling demand