OPS dispatch runbook (pre-fase2)

Scope

This runbook covers the detached launcher worker for interactive tasks in OPS Center.

  • API base: https://ops.solucionesabiertas.net/api
  • Main worker service: ops-dispatch-launcher.service
  • Worker logs: /var/log/ops-dispatch-launcher.log
  • Detached process logs: /root/.ops-launcher-logs/*.log
  • Service token header: X-OPS-Service-Token (key service_api_token in ops_settings)
  • OPS Center env (recommended): TRUST_PROXY=true when running behind a reverse proxy
  • Optional launcher env overrides: AGENT_CMD_OPENCODE, AGENT_CMD_CLAUDECODE, AGENT_CMD_CODEX, AGENT_CMD_HERMES, SUPPORTED_AGENT_SLUGS, SUPPORTED_MODELS
  • Dispatch stale watchdog: setting dispatch_stale_minutes (default runtime fallback: 20)
  • Worker heartbeat: POST /api/dispatch/heartbeat, stored in worker_heartbeats
  • Worker stale threshold: setting dispatch_worker_stale_seconds (default 90)

Related guides:

  • docs/README.md
  • docs/ops-daily-mode.md
  • docs/templates-subagents-playbook.md
  • docs/ops-efficient-workflow.md
  • docs/sa-ops-system/README.md
  • docs/sa-ops-system/operating-modes.md

1) Quick health check

Run on sa-tools:


systemctl is-active ops-dispatch-launcher.service
systemctl is-active ops-center.service

Expected: both services return active.
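The two checks above can be folded into one helper that fails fast when either service is down. This is a minimal sketch; check_services is a hypothetical name, not something shipped in the repo.

```shell
# check_services <unit>...
# Prints the state of each unit and returns non-zero if any is not "active".
# check_services is a hypothetical helper name used only for illustration.
check_services() {
    rc=0
    for svc in "$@"; do
        state=$(systemctl is-active "$svc" 2>/dev/null)
        printf '%s: %s\n' "$svc" "${state:-unknown}"
        [ "$state" = "active" ] || rc=1
    done
    return "$rc"
}
```

Possible usage: check_services ops-dispatch-launcher.service ops-center.service || echo "fix services before continuing"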

Then run runtime doctor from project root:


npm run worker:doctor

Expected: [doctor] PASS all runtimes available.

If doctor fails with MISSING, install the missing CLI before enabling dispatch.
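When the doctor runs from a script, its output can be gated on the two markers mentioned above (the PASS line and the word MISSING). The sketch below assumes those markers appear verbatim in the output; doctor_ok is a hypothetical helper name.

```shell
# doctor_ok: reads doctor output on stdin and echoes it back.
# Succeeds only if the "[doctor] PASS" marker is present and no line
# mentions MISSING. The marker strings are taken from this runbook;
# the rest of the output format is not assumed.
doctor_ok() {
    output=$(cat)
    printf '%s\n' "$output"
    case "$output" in
        *MISSING*) return 1 ;;
    esac
    case "$output" in
        *"[doctor] PASS"*) return 0 ;;
    esac
    return 1
}
```

Possible usage: npm run worker:doctor 2>&1 | doctor_ok || echo "install the missing CLI first"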

If a new agent relies on fallback command resolution, make sure its slug is covered by the launcher env map, or set an explicit launch_command in OPS.

Current VPS note, 2026-05-05:

  • claudecode is intentionally inactive in the DB because the VPS has neither ANTHROPIC_API_KEY nor a claude auth login session.
  • Reactivate it only after one of those credentials is configured and npm run worker:doctor passes.
  • Until then, dispatch runs with the fresh worker capabilities reported by heartbeat, currently opencode, codex and hermes.

Health with worker capabilities:


curl -H "X-OPS-Service-Token: $OPS_SERVICE_TOKEN" \
  https://ops.solucionesabiertas.net/api/dispatch/health

Expected:

  • ok=true
  • dispatch_enabled matches the intended operating mode
  • worker_summary.fresh >= 1 when dispatch is enabled
  • queue.stale_dispatching = 0
  • At least one fresh worker has agent_slugs matching authenticated runtimes on that host

The launcher only requests tasks for usable agents reported by heartbeat.
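The expected values listed above can be checked mechanically. The sketch below assumes jq is installed and that the health response is a JSON object exposing the fields under exactly those names; the response shape and the helper name health_ok are assumptions, not confirmed by the API.

```shell
# health_ok: reads a /dispatch/health response on stdin.
# Succeeds when ok is true, no task is stale in dispatching, and at least
# one fresh worker exists whenever dispatch is enabled.
# Assumption: field names and nesting match the list in this runbook.
# dispatch_enabled is compared as text so both true and "true" are accepted.
health_ok() {
    jq -e '
        (.ok == true)
        and (.queue.stale_dispatching == 0)
        and (((.dispatch_enabled | tostring) != "true")
             or (.worker_summary.fresh >= 1))
    ' >/dev/null
}
```

Possible usage: pipe the curl command above (with -s added) into health_ok and branch on the exit status.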

From the UI, open Operacion -> Automatizaciones to see the same diagnostics without SSH. Workspace admins can refresh health, toggle dispatch, inspect stale tasks and run the manual watchdog.

For CLI/API operations, export the service token once per shell session:


export OPS_SERVICE_TOKEN='<token>'

If auth was just migrated, rotate defaults before enabling dispatch:

  1. Set OPS_ADMIN_USER and OPS_ADMIN_PASSWORD in OPS Center service env.
  2. Set a random OPS_SERVICE_TOKEN in launcher/doctor env.
  3. Update ops_settings.service_api_token with that same token.
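For step 2, one way to produce a random token is to read bytes from /dev/urandom. This is only a sketch: the runbook does not prescribe a token length or format, so the 32-byte hex encoding and the name gen_service_token are assumptions.

```shell
# gen_service_token: prints a 64-character lowercase hex string
# built from 32 random bytes. Length and encoding are assumptions;
# the runbook only asks for a random value.
gen_service_token() {
    head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
    printf '\n'
}
```

Possible usage: export OPS_SERVICE_TOKEN="$(gen_service_token)", then store the same value in ops_settings.service_api_token as described in step 3.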

2) Safe pilot execution

For API-level dispatch smoke without launching an AI runtime:


OPS_API_BASE=http://127.0.0.1:3847/api \
OPS_SERVICE_TOKEN=... \
npm run smoke:dispatch

The smoke creates temporary tasks, verifies next -> queued -> dispatching -> review, verifies Human Gate blocking for a sensitive task, and deletes the temporary tasks.

To pilot with a real task:

  1. Ensure a pilot task exists in OPS with execution=interactive and an agent_id assigned.
  2. Enable dispatch:

curl -H "X-OPS-Service-Token: $OPS_SERVICE_TOKEN" \
  -H 'Content-Type: application/json' \
  -X PUT \
  -d '{"value":"true"}' \
  https://ops.solucionesabiertas.net/api/settings/dispatch_enabled

  3. Trigger task dispatch (UI or API /tasks/:id/dispatch).
  4. Verify task transitions: queued -> dispatching -> doing.
  5. Inspect dispatch_output and detached log files.
  6. Disable dispatch again after pilot:

curl -H "X-OPS-Service-Token: $OPS_SERVICE_TOKEN" \
  -H 'Content-Type: application/json' \
  -X PUT \
  -d '{"value":"false"}' \
  https://ops.solucionesabiertas.net/api/settings/dispatch_enabled
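Because the enable and disable calls differ only in the value sent, they can share one wrapper. The sketch below is illustrative: set_dispatch is a hypothetical helper, and OPS_API_BASE defaults to the API base listed under Scope.

```shell
OPS_API_BASE=${OPS_API_BASE:-https://ops.solucionesabiertas.net/api}

# set_dispatch true|false
# Sends the same PUT request shown in this section and rejects any other
# argument so a typo cannot write an unexpected setting value.
# set_dispatch is a hypothetical helper name.
set_dispatch() {
    case "${1:-}" in
        true|false) ;;
        *) echo "usage: set_dispatch true|false" >&2; return 2 ;;
    esac
    curl -fsS -X PUT \
        -H "X-OPS-Service-Token: $OPS_SERVICE_TOKEN" \
        -H 'Content-Type: application/json' \
        -d "{\"value\":\"$1\"}" \
        "$OPS_API_BASE/settings/dispatch_enabled"
}
```

In a pilot script, pairing set_dispatch true with trap 'set_dispatch false' EXIT is one way to make sure dispatch is switched off again even if a later step fails.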

3) Failure handling

Common failure signatures:

  • runtime no disponible en host: <cli> (Spanish for "runtime not available on host")
  • launcher process exited early (code=..., signal=...)
  • command not found in .err.log

Required actions:

  1. Keep dispatch_enabled=false.
  2. Set the task to blocked=true with a concrete blocked_reason.
  3. Add a task context entry (kind=blocker) with the evidence.
  4. Log AI time in the task.

3.1) Stale dispatch watchdog

/api/dispatch/next now reconciles stale dispatching tasks automatically before selecting a new task.

  • Criteria: task in dispatching for more than dispatch_stale_minutes.
  • Action: moves task to waiting, sets blocked=true, clears worker claim fields, writes blocker context and opens/reopens a system alert linked to the task.
  • Audit: manual reconciliation writes dispatch.reconciled_stale.
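The staleness criterion is a plain elapsed-time comparison. The sketch below restates it with epoch seconds; it is not the server's implementation, and is_stale with its arguments is a hypothetical helper.

```shell
# is_stale <dispatching_since_epoch> <now_epoch> [stale_minutes]
# Succeeds when the task has spent strictly more than the threshold in
# dispatching. The default of 20 mirrors the runtime fallback of
# dispatch_stale_minutes stated in this runbook.
is_stale() {
    since=$1
    now=$2
    minutes=${3:-20}
    [ $(( now - since )) -gt $(( minutes * 60 )) ]
}
```

For example, is_stale 0 1201 succeeds (1201 seconds exceeds 20 minutes), while is_stale 0 1200 does not, since the criterion is "more than" the threshold.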

Manual reconciliation endpoint (admin/service only):


curl -H "X-OPS-Service-Token: $OPS_SERVICE_TOKEN" \
  -H 'Content-Type: application/json' \
  -X POST \
  -d '{"minutes":20}' \
  https://ops.solucionesabiertas.net/api/dispatch/reconcile-stale

Returns:

  • stale_minutes
  • reconciled
  • task_ids

Controlled smoke:


OPS_API_BASE=http://127.0.0.1:3847/api \
OPS_SERVICE_TOKEN=$OPS_SERVICE_TOKEN \
OPS_WORKSPACE_ID=1 \
npm run smoke:watchdog

The smoke creates temporary dispatch and recurring failures, verifies watchdog alerts/tasks, then archives/resolves its temporary artifacts.

UI path:

  1. Open Operacion -> Automatizaciones.
  2. Check Dispatching and Watchdog.
  3. Click Reconciliar stale.
  4. Review the task blockers before redispatching.

Recommended default:

  • Start at 20 min
  • Tune with:

curl -H "X-OPS-Service-Token: $OPS_SERVICE_TOKEN" \
  -H 'Content-Type: application/json' \
  -X PUT \
  -d '{"value":"20"}' \
  https://ops.solucionesabiertas.net/api/settings/dispatch_stale_minutes

3.2) AI done guard (visible output required)

To avoid false positives (dispatch OK without deliverable), AI tasks cannot move to done unless there is visible output evidence:

  • A context entry (launcher checkpoint entries do not count), or
  • A user-visible comment (ai-dispatch log comments do not count)

Guard applies to:

  • PATCH /api/tasks/:id/status when status=done
  • PATCH /api/tasks/:id when status is changed to done

Bypass only when strictly needed:

  • ?force=true (admin/service/manual recovery)

Operational policy:

  1. Always add final deliverable to task context.
  2. Add short pointer comment for quick UI lookup.
  3. Then move to done.
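The guarded status endpoint and its bypass can be exercised from the CLI. The sketch below is hedged: the {"status":"done"} request body is an assumption (only the path, the method and the ?force=true parameter appear in this runbook), and mark_done is a hypothetical helper.

```shell
OPS_API_BASE=${OPS_API_BASE:-https://ops.solucionesabiertas.net/api}

# mark_done <task_id> [force]
# PATCHes /tasks/:id/status with status=done. Passing "force" appends
# ?force=true, which should be reserved for manual recovery.
# Assumption: the endpoint accepts a JSON body of {"status":"done"}.
mark_done() {
    url="$OPS_API_BASE/tasks/$1/status"
    if [ "${2:-}" = "force" ]; then
        url="$url?force=true"
    fi
    curl -fsS -X PATCH \
        -H "X-OPS-Service-Token: $OPS_SERVICE_TOKEN" \
        -H 'Content-Type: application/json' \
        -d '{"status":"done"}' \
        "$url"
}
```

Without visible output evidence, the unforced call is expected to be rejected by the guard; with curl -f that surfaces as a non-zero exit status.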

3.3) Model-family matching for workers

Worker model filters now match by family, not only exact label. This allows safe aliasing:

  • minimax* matches minimax-m2, minimax2.7
  • qwen* matches qwen3.6, local-qwen3.6-coder
  • gemma* matches gemma4, local-gemma4
  • gpt-* and other families keep existing behavior

This reduces routing misses during gradual model renaming/migration.
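The examples above imply that a family pattern matches any label containing the family stem, not only labels that start with it (qwen* matches local-qwen3.6-coder). The sketch below illustrates that reading; the real matcher may differ, and model_in_family is a hypothetical name.

```shell
# model_in_family <family_pattern> <model_label>
# Drops a trailing "*" from the pattern and succeeds when the label
# contains the remaining stem anywhere. This substring rule is an
# assumption inferred from the examples in this runbook.
model_in_family() {
    stem=${1%\*}
    case "$2" in
        *"$stem"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, model_in_family 'gemma*' local-gemma4 succeeds, while model_in_family 'qwen*' gemma4 fails.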

4) Log hygiene

Install logrotate rule:


install -m 0644 scripts/ops-dispatch-logrotate.conf /etc/logrotate.d/ops-dispatch

Manual test:


logrotate -d /etc/logrotate.d/ops-dispatch

5) Recovery checklist

  1. systemctl restart ops-dispatch-launcher.service
  2. npm run worker:doctor
  3. Verify dispatch_enabled is the intended value
  4. Run a controlled pilot task
  5. Register context and move task to review or done only after visible deliverable evidence is present.

6) Automation helpers

Two scripts are available for this workflow:

  1. Server-side deploy and auth hardening:

OPS_ADMIN_USER='<admin>' \
OPS_ADMIN_PASSWORD='<strong-pass>' \
OPS_SERVICE_TOKEN='<strong-token>' \
scripts/deploy-auth-hardening.sh \
  --repo-path /opt/sa-ops-center \
  --confirm-prod

  2. Controlled pilot dispatch for one task (auto-restore dispatch_enabled):

OPS_SERVICE_TOKEN='<token>' \
scripts/pilot-dispatch-task.sh \
  --task-id 411 \
  --confirm-pilot

Both scripts include hard stop flags (--confirm-prod, --confirm-pilot) to prevent accidental production execution.