Workload thread profiles

A pod is not one uniform workload. Workload profiles describe the thread structure inside a pod — connection threads, I/O chains, background housekeeping — and give each thread group its own scheduling treatment. This granularity exists nowhere above the kernel.

Why thread groups

Consider MySQL: connection threads want exclusive cores and instant wakeups; InnoDB I/O threads want latency treatment on short wake chains; purge and background threads should yield to everything else. Container-level tools see one CPU number for the whole pod and must treat those threads identically. Temper schedules at the layer where threads actually exist, so a profile can say: this group gets exclusive cores, that group gets latency treatment, the rest yield — all inside one pod, with the pod’s QoS tier still governing its standing against other pods.

The division of labor: tiers arbitrate between workloads; profiles structure the threads within one. Workload identity (image or annotation) selects which profile applies.

Builtin profiles

The agent ships with compiled-in profiles for common workload shapes — for example a PyTorch training profile (keeps DataLoader worker threads fed so the GPU never starves) and a MySQL/InnoDB profile (connection threads exclusive, I/O threads latency-treated, background threads yielding). Builtins apply automatically when detection matches and can be overridden by file-based profiles with the same id.

Detection: how a pod gets a profile

  1. Annotation (explicit, wins): temper.codes/workload-profile: mysql-innodb on the pod.
  2. Image match: each profile carries container-image patterns; a pod whose image matches gets the profile automatically.

Profiles are additionally keyed by machine shape — a profile tuned for an 8-core SMT node is not blindly applied to a 64-core one. Shape-matched file profiles override builtins with the same id, most specific match first.
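The resolution order above can be sketched in Python. This is a minimal illustration, not the agent's implementation. The `Profile` fields, the glob-style image matching, and the use of "number of pinned shape fields" as the specificity score are all assumptions made for the example. Only the precedence rules come from the text: annotation beats image match, shape-matched file profiles beat builtins with the same id, and the most specific match comes first.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch
from typing import Optional

ANNOTATION_KEY = "temper.codes/workload-profile"  # annotation named in the text


@dataclass
class Profile:
    # All field names are hypothetical; they do not mirror the real schema.
    id: str
    image_patterns: list[str] = field(default_factory=list)
    shape: dict[str, object] = field(default_factory=dict)  # {} matches any node
    from_file: bool = False  # False means a compiled-in builtin


def shape_matches(profile: Profile, node: dict[str, object]) -> bool:
    # A profile matches when every shape field it pins equals the node's value.
    return all(node.get(key) == value for key, value in profile.shape.items())


def resolve(
    profiles: list[Profile],
    annotations: dict[str, str],
    image: str,
    node: dict[str, object],
) -> Optional[Profile]:
    # 1. An explicit annotation wins. 2. Otherwise fall back to image patterns.
    wanted = annotations.get(ANNOTATION_KEY)
    if wanted is None:
        wanted = next(
            (p.id for p in profiles
             if any(fnmatch(image, pattern) for pattern in p.image_patterns)),
            None,
        )
    if wanted is None:
        return None
    # Among shape-compatible profiles with that id, prefer file over builtin,
    # then the one that pins the most shape fields (assumed specificity score).
    candidates = [p for p in profiles if p.id == wanted and shape_matches(p, node)]
    return max(candidates, key=lambda p: (p.from_file, len(p.shape)), default=None)
```

With a builtin `mysql-innodb` profile and a file profile of the same id pinned to an 8-core SMT shape, an 8-core node resolves to the file profile. A 64-core node falls back to the builtin, because the file profile's shape does not match.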

File-based profiles

Custom profiles are TOML files (schema v1) with four sections: fingerprint (how to detect the workload), machine_shape (what hardware the tuning was measured on), traits (workload-level characteristics), and one or more thread_groups (the per-group treatments). An illustrative skeleton:

# my-service.toml — illustrative sketch; the shipped profiles are the
# authoritative schema reference
schema_version = 1
id = "my-service"

[fingerprint]      # how pods are matched to this profile
# image patterns and/or annotation id

[machine_shape]    # the node shape this tuning was measured on
# core count, SMT topology

[traits]           # workload-level characteristics

[[thread_groups]]  # one block per thread group:
# how to identify the group's threads, and its scheduling treatment
# (exclusive cores / latency treatment / yield)

Deploy profiles with the Helm chart. They render into a ConfigMap mounted into the agent, and edits roll the DaemonSet automatically:

helm upgrade temper deploy/helm/temper --reuse-values \
  --set-file 'agent.profiles.my-service\.toml'=./my-service.toml

Training mode: profiles you don’t write by hand

Writing a thread-group profile from first principles requires knowing your workload’s thread structure. Training mode measures it instead:

  1. Observe — capture a bounded kernel trace plus an /observe snapshot while the workload runs under representative load.
  2. Analyze — cluster the workload’s threads by runtime distribution, wake rate, and waker→wakee relationships; classify each cluster (sync compute, latency critical, I/O wake chain, sporadic).
  3. Synthesize — emit a profile TOML keyed to the machine shape it was measured on.
  4. Evaluate & refine — benchmark the profile against baseline, hill-climb one parameter at a time, and keep only measured improvements.
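Step 4 can be sketched as a coordinate-wise hill climb. This is a hedged illustration of "one parameter at a time, keep only measured improvements", not the shipped pipeline. The parameter names (`slice_ms`, `exclusive_cores`), the candidate grids, and the toy benchmark are hypothetical. A real run would replace `toy_benchmark` with a measurement against the live workload.

```python
from typing import Callable, Mapping, Sequence

Params = dict[str, float]


def hill_climb(
    baseline: Params,
    candidates: Mapping[str, Sequence[float]],
    benchmark: Callable[[Params], float],
    min_gain: float = 0.0,
) -> tuple[Params, float]:
    """Vary one parameter at a time; keep a change only if the benchmark
    score (higher is better) improves by more than min_gain."""
    best = dict(baseline)
    best_score = benchmark(best)
    improved = True
    while improved:  # repeat full passes until no single change helps
        improved = False
        for name, values in candidates.items():
            for value in values:
                if value == best[name]:
                    continue
                trial = {**best, name: value}  # change exactly one parameter
                score = benchmark(trial)
                if score > best_score + min_gain:
                    best, best_score = trial, score
                    improved = True
    return best, best_score


def toy_benchmark(params: Params) -> float:
    # Stand-in for a real measurement: peaks at slice_ms=3, exclusive_cores=2.
    return -(params["slice_ms"] - 3) ** 2 - (params["exclusive_cores"] - 2) ** 2


tuned, score = hill_climb(
    baseline={"slice_ms": 1, "exclusive_cores": 0},
    candidates={"slice_ms": [1, 2, 3, 4], "exclusive_cores": [0, 1, 2, 4]},
    benchmark=toy_benchmark,
)
```

Because a change is accepted only when the score strictly improves, the loop terminates on any finite candidate grid. With noisy benchmarks, setting `min_gain` above the run-to-run variance keeps noise from being accepted as an improvement.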

The pipeline is automated end to end on a live cluster. Perfetto trace bursts are bounded and run only during training or canary cycles, so the always-on cost stays under 1% CPU.

Measured effect

Tier QoS alone already carries most headline results: the memcached density runs held p99 latency flat with no profile at all. Profiles are the second stage. They close the peak-throughput gaps that whole-core confinement creates and encode intra-pod structure that container averages hide. The profile evidence (ONNX, llama.cpp, MySQL, and the bug that a profile validation caught) is written up in “deep dive: thread-level scheduling with workload profiles”; the GPU DataLoader case is in “deep dive: keeping accelerators fed”.