OPS dispatch runbook (pre-phase 2)

Scope

This runbook covers the detached launcher worker for interactive tasks in OPS Center.

API base: https://ops.solucionesabiertas.net/api
Main worker service: ops-dispatch-launcher.service
Worker logs: /var/log/ops-dispatch-launcher.log
Detached process logs: /root/.ops-launcher-logs/*.log
Service token header: X-OPS-Service-Token (key service_api_token in ops_settings)
OPS Center env (recommended): TRUST_PROXY=true when running behind a reverse proxy
Optional launcher env overrides: AGENT_CMD_OPENCODE, AGENT_CMD_CLAUDECODE, AGENT_CMD_CODEX, AGENT_CMD_HERMES, SUPPORTED_AGENT_SLUGS, SUPPORTED_MODELS
Dispatch stale watchdog: setting dispatch_stale_minutes (default runtime fallback: 20)
Worker heartbeat: POST /api/dispatch/heartbeat, stored in worker_heartbeats
Worker stale threshold: setting dispatch_worker_stale_seconds (default 90)

Related guides:

docs/README.md
docs/ops-daily-mode.md
docs/templates-subagents-playbook.md
docs/ops-efficient-workflow.md
docs/sa-ops-system/README.md
docs/sa-ops-system/operating-modes.md

1) Quick health check

Run on sa-tools:


systemctl is-active ops-dispatch-launcher.service
systemctl is-active ops-center.service

Expected: both services return active.
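
The two checks above can be wrapped in one helper that fails if either unit is not active. This is a minimal sketch: the unit names come from this runbook, while the function name check_ops_services is hypothetical.

```shell
# Sketch: report every required unit that systemd does not consider active.
# Unit names are from this runbook; the function name is hypothetical.
check_ops_services() {
  failed=0
  for unit in ops-dispatch-launcher.service ops-center.service; do
    # --quiet suppresses the state text; exit status is 0 only when active.
    if systemctl is-active --quiet "$unit"; then
      echo "OK   $unit"
    else
      echo "FAIL $unit"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling check_ops_services prints one line per unit and returns non-zero when any unit is not active, so it can gate later steps in a script.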

Then run runtime doctor from project root:


npm run worker:doctor

Expected: [doctor] PASS all runtimes available.

If doctor fails with MISSING, install the missing CLI before enabling dispatch.

If a new agent uses fallback command resolution, ensure its slug is covered by the launcher env map or set explicit launch_command in OPS.

Current VPS note, 2026-05-05:

claudecode is intentionally inactive in the DB because the VPS has neither ANTHROPIC_API_KEY nor a claude auth login session.
Reactivate it only after one of those credentials is configured and npm run worker:doctor passes.
Until then, dispatch runs with the fresh worker capabilities reported by heartbeat, currently opencode, codex and hermes.

Health with worker capabilities:


curl -H "X-OPS-Service-Token: $OPS_SERVICE_TOKEN" \
  https://ops.solucionesabiertas.net/api/dispatch/health

Expected:

ok=true
dispatch_enabled matches the intended operating mode
worker_summary.fresh >= 1 when dispatch is enabled
queue.stale_dispatching = 0
At least one fresh worker has agent_slugs matching authenticated runtimes on that host

The launcher only requests tasks for usable agents reported by heartbeat.
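
The expected values above can be expressed as a small check. The sketch below assumes the fields have already been extracted from the health response (for example with a JSON tool) and are passed as plain arguments. The function name and argument order are hypothetical, and the agent_slugs comparison is left out because it depends on which runtimes are authenticated on the host.

```shell
# Sketch: apply the expected-health rules to already-extracted field values.
# Usage: health_looks_ok <ok> <dispatch_enabled> <intended_mode> \
#                        <worker_summary.fresh> <queue.stale_dispatching>
health_looks_ok() {
  ok=$1 enabled=$2 intended=$3 fresh=$4 stale=$5
  [ "$ok" = "true" ] || { echo "ok is not true"; return 1; }
  [ "$enabled" = "$intended" ] || {
    echo "dispatch_enabled=$enabled, expected $intended"; return 1; }
  # Fresh workers are only required while dispatch is enabled.
  if [ "$enabled" = "true" ] && [ "$fresh" -lt 1 ]; then
    echo "no fresh worker while dispatch is enabled"
    return 1
  fi
  [ "$stale" -eq 0 ] || { echo "stale_dispatching=$stale"; return 1; }
  echo "healthy"
}
```

The function prints the first rule that fails and returns non-zero, or prints healthy when all four rules hold.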

From the UI, open Operacion -> Automatizaciones to see the same diagnostics without SSH. Workspace admins can refresh health, toggle dispatch, inspect stale tasks and run the manual watchdog.

For CLI/API operations export the service token once:


export OPS_SERVICE_TOKEN='<token>'

If auth was just migrated, rotate defaults before enabling dispatch:

Set OPS_ADMIN_USER and OPS_ADMIN_PASSWORD in OPS Center service env.
Set a random OPS_SERVICE_TOKEN in launcher/doctor env.
Update ops_settings.service_api_token with that same token.
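
One possible way to produce a random token for the second step is sketched below. The 32-byte length and hex encoding are assumptions, not requirements stated by OPS Center, and the function name is hypothetical.

```shell
# Sketch: print 32 random bytes as 64 lowercase hex characters.
# Length and encoding are assumptions; adjust to local policy.
generate_service_token() {
  head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n'
  echo
}
```

The output can be exported as OPS_SERVICE_TOKEN and then stored in ops_settings.service_api_token so both sides use the same value.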

2) Safe pilot execution

For API-level dispatch smoke without launching an AI runtime:


OPS_API_BASE=http://127.0.0.1:3847/api \
OPS_SERVICE_TOKEN=... \
npm run smoke:dispatch

The smoke creates temporary tasks, verifies next -> queued -> dispatching -> review, verifies Human Gate blocking for a sensitive task, and deletes the temporary tasks.

Ensure the pilot task exists in OPS with execution=interactive and an agent_id assigned.
Enable dispatch:


curl -H "X-OPS-Service-Token: $OPS_SERVICE_TOKEN" \
  -H 'Content-Type: application/json' \
  -X PUT \
  -d '{"value":"true"}' \
  https://ops.solucionesabiertas.net/api/settings/dispatch_enabled

Trigger task dispatch (UI or API /tasks/:id/dispatch).
Verify task transitions: queued -> dispatching -> doing.
Inspect dispatch_output and detached log files.
Disable dispatch again after pilot:


curl -H "X-OPS-Service-Token: $OPS_SERVICE_TOKEN" \
  -H 'Content-Type: application/json' \
  -X PUT \
  -d '{"value":"false"}' \
  https://ops.solucionesabiertas.net/api/settings/dispatch_enabled
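
The enable and disable calls above can be paired so that dispatch is always switched off again, even if the pilot step fails. This is a sketch, not the logic of scripts/pilot-dispatch-task.sh. The function names are hypothetical, and it always sets dispatch_enabled to false afterwards rather than restoring a previous value.

```shell
# Base URL and header are from this runbook; function names are hypothetical.
OPS_API_BASE=${OPS_API_BASE:-https://ops.solucionesabiertas.net/api}

# Usage: set_dispatch_enabled true|false
set_dispatch_enabled() {
  case "$1" in
    true|false) ;;
    *) echo "usage: set_dispatch_enabled true|false" >&2; return 2 ;;
  esac
  # -f makes curl return non-zero on HTTP errors instead of printing the body.
  curl -fsS -X PUT \
    -H "X-OPS-Service-Token: $OPS_SERVICE_TOKEN" \
    -H 'Content-Type: application/json' \
    -d "{\"value\":\"$1\"}" \
    "$OPS_API_BASE/settings/dispatch_enabled"
}

# Enable dispatch, run the given command, then always disable dispatch again.
# Returns the exit status of the command.
with_dispatch_enabled() {
  set_dispatch_enabled true || return 1
  "$@" && status=0 || status=$?
  set_dispatch_enabled false ||
    echo "warning: could not disable dispatch" >&2
  return "$status"
}
```

The command passed to with_dispatch_enabled is whatever the operator uses to trigger and observe the pilot task. The wrapper only guarantees the disable call is attempted afterwards.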

3) Failure handling

Common failure signatures:

runtime no disponible en host: <cli> (Spanish for "runtime not available on host")
launcher process exited early (code=..., signal=...)
command not found in .err.log
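
When triaging logs, the signatures above can be matched mechanically. The sketch below classifies a single log line. The category labels it prints are hypothetical and are not identifiers used by OPS Center.

```shell
# Sketch: map one log line to a failure category based on the signatures
# listed in this runbook. Category labels are hypothetical.
classify_dispatch_failure() {
  case "$1" in
    *"runtime no disponible en host:"*) echo "runtime_missing" ;;
    *"launcher process exited early"*)  echo "launcher_exited_early" ;;
    *"command not found"*)              echo "command_not_found" ;;
    *)                                  echo "unknown" ;;
  esac
}
```

The resulting category can be reused as the starting point for a concrete blocked_reason on the task.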

Required actions:

Keep dispatch_enabled=false.
Set task blocked=true with concrete blocked_reason.
Add task context entry (kind=blocker) with evidence.
Log AI time in the task.

3.1) Stale dispatch watchdog

/api/dispatch/next now reconciles stale dispatching tasks automatically before selecting a new task.

Criteria: task in dispatching for more than dispatch_stale_minutes.
Action: moves task to waiting, sets blocked=true, clears worker claim fields, writes blocker context and opens/reopens a system alert linked to the task.
Audit: manual reconciliation writes dispatch.reconciled_stale.
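
The stale criterion above reduces to a simple time comparison. The sketch below assumes the moment a task entered dispatching is available as an epoch timestamp. Which field the server actually compares is not stated here, and the function name is hypothetical.

```shell
# Sketch of the stale criterion.
# Usage: is_dispatch_stale <dispatch_started_epoch> <now_epoch> [stale_minutes]
# stale_minutes defaults to 20, the runtime fallback named in this runbook.
is_dispatch_stale() {
  started=$1 now=$2 minutes=${3:-20}
  elapsed=$(( now - started ))
  # "More than dispatch_stale_minutes": strictly greater than the limit.
  [ "$elapsed" -gt $(( minutes * 60 )) ]
}
```

With the default of 20 minutes, a task that has been dispatching for exactly 1200 seconds is not yet stale, while 1201 seconds is.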

Manual reconciliation endpoint (admin/service only):


curl -H "X-OPS-Service-Token: $OPS_SERVICE_TOKEN" \
  -H 'Content-Type: application/json' \
  -X POST \
  -d '{"minutes":20}' \
  https://ops.solucionesabiertas.net/api/dispatch/reconcile-stale

Returns:

stale_minutes
reconciled
task_ids

Controlled smoke:


OPS_API_BASE=http://127.0.0.1:3847/api \
OPS_SERVICE_TOKEN=$OPS_SERVICE_TOKEN \
OPS_WORKSPACE_ID=1 \
npm run smoke:watchdog

The smoke creates temporary dispatch and recurring failures, verifies watchdog alerts/tasks, then archives/resolves its temporary artifacts.

UI path:

Open Operacion -> Automatizaciones.
Check Dispatching and Watchdog.
Click Reconciliar stale.
Review the task blockers before redispatching.

Recommended default:

Start at 20 min
Tune with:


curl -H "X-OPS-Service-Token: $OPS_SERVICE_TOKEN" \
  -H 'Content-Type: application/json' \
  -X PUT \
  -d '{"value":"20"}' \
  https://ops.solucionesabiertas.net/api/settings/dispatch_stale_minutes

3.2) AI done guard (visible output required)

To avoid false positives (dispatch reported OK without a deliverable), AI tasks cannot move to done unless there is visible output evidence, meaning at least one of:

A context entry (excluding launcher checkpoints)
A user-visible comment (excluding ai-dispatch log comments)

Guard applies to:

PATCH /api/tasks/:id/status when status=done
PATCH /api/tasks/:id when status is changed to done

Bypass only when strictly needed:

?force=true (admin/service/manual recovery)

Operational policy:

Always add the final deliverable to the task context.
Add a short pointer comment for quick UI lookup.
Then move the task to done.
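
The guard described in this section can be summarised as a small predicate. This is a sketch of the rule as stated in this runbook, not the server implementation. It assumes the counts passed in already exclude launcher checkpoints and ai-dispatch log comments, and the function name is hypothetical.

```shell
# Sketch of the done guard.
# Usage: may_move_to_done <visible_context_entries> <visible_comments> [force]
# Counts must already exclude launcher checkpoints and ai-dispatch comments.
may_move_to_done() {
  contexts=$1 comments=$2 force=${3:-false}
  # force=true mirrors the ?force=true bypass for manual recovery.
  [ "$force" = "true" ] && return 0
  [ "$contexts" -gt 0 ] || [ "$comments" -gt 0 ]
}
```

A task with no visible context entry and no visible comment is rejected unless force is set, which matches the policy of adding the deliverable before changing status.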

3.3) Model-family matching for workers

Worker model filters now match by family, not only exact label. This allows safe aliasing:

minimax* matches minimax-m2, minimax2.7
qwen* matches qwen3.6, local-qwen3.6-coder
gemma* matches gemma4, local-gemma4
gpt-* and other families keep existing behavior

This reduces routing misses during gradual model renaming/migration.
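
The examples above show that family matching is not a plain prefix test, since local-qwen3.6-coder matches qwen*. One way to get that behaviour is to reduce both the filter and the model label to a family name and compare the results. The sketch below does this for the three families listed. The function names are hypothetical, and treating every other value as an exact-match label is an assumption about the existing behaviour.

```shell
# Sketch: reduce a filter or model label to its family name.
# Only the three families named in this runbook are recognised; anything
# else is returned unchanged, so it only matches an identical label.
model_family() {
  case "$1" in
    *minimax*) echo minimax ;;
    *qwen*)    echo qwen ;;
    *gemma*)   echo gemma ;;
    *)         echo "$1" ;;
  esac
}

# Usage: model_matches <worker_filter> <task_model_label>
model_matches() {
  [ "$(model_family "$1")" = "$(model_family "$2")" ]
}
```

For example, model_matches 'qwen*' local-qwen3.6-coder succeeds, while model_matches 'qwen*' gemma4 fails.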

4) Log hygiene

Install logrotate rule:


install -m 0644 scripts/ops-dispatch-logrotate.conf /etc/logrotate.d/ops-dispatch

Manual test:


logrotate -d /etc/logrotate.d/ops-dispatch

5) Recovery checklist

systemctl restart ops-dispatch-launcher.service
npm run worker:doctor
Verify dispatch_enabled is the intended value
Run a controlled pilot task
Register context and move task to review or done only after visible deliverable evidence is present.

6) Automation helpers

Two scripts are available for this workflow:

Server-side deploy and auth hardening:


OPS_ADMIN_USER='<admin>' \
OPS_ADMIN_PASSWORD='<strong-pass>' \
OPS_SERVICE_TOKEN='<strong-token>' \
scripts/deploy-auth-hardening.sh \
  --repo-path /opt/sa-ops-center \
  --confirm-prod

Controlled pilot dispatch for one task (auto-restore dispatch_enabled):


OPS_SERVICE_TOKEN='<token>' \
scripts/pilot-dispatch-task.sh \
  --task-id 411 \
  --confirm-pilot

Both scripts include hard stop flags (--confirm-prod, --confirm-pilot) to prevent accidental production execution.
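
A hard stop flag of this kind is typically checked before any side effect happens. The sketch below shows one way such a check could look. It is not taken from either script, and the function name is hypothetical.

```shell
# Sketch: refuse to continue unless the given confirmation flag is present.
# Usage: require_confirm_flag <flag> "$@"
require_confirm_flag() {
  wanted=$1
  shift
  for arg in "$@"; do
    [ "$arg" = "$wanted" ] && return 0
  done
  echo "refusing to run without $wanted" >&2
  return 1
}
```

A script would call require_confirm_flag --confirm-pilot "$@" || exit 1 as its first step, so a missing flag stops execution before any API call is made.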