A three-tier agentic operations platform for incident triage, worker execution, monitor-driven self-healing, HITL approvals, and MLOps observability across edge and IoT fleets.
Nexus receives alerts from drones, IoT devices, logs, telemetry, and model pipelines. The Manager agent classifies severity, Workers execute the right tools, and the Monitor validates state or triggers recovery.
Battery drops, GPS anomalies, and motor faults are classified by severity and routed to the right Worker; low-risk events are handled without human involvement.
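The triage step above can be sketched in a few lines. This is only an illustration: the `Alert` fields, the numeric thresholds, the Worker names in `WORKER_FOR_KIND`, and the rule that only `LOW` severity runs autonomously are all assumptions, not values taken from Nexus.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Alert:
    device_id: str
    kind: str      # e.g. "battery_drop", "gps_anomaly", "motor_fault"
    value: float   # assumed normalised reading in [0, 1]; higher means worse


# Hypothetical routing table: alert kind -> Worker name.
WORKER_FOR_KIND = {
    "battery_drop": "telemetry_worker",
    "gps_anomaly": "telemetry_worker",
    "motor_fault": "device_worker",
}


def classify(alert: Alert) -> Severity:
    """Map a normalised reading to a severity level (illustrative thresholds)."""
    if alert.value >= 0.8:
        return Severity.HIGH
    if alert.value >= 0.4:
        return Severity.MEDIUM
    return Severity.LOW


def route(alert: Alert) -> dict:
    """Pick a Worker and flag whether the event may run without a human."""
    severity = classify(alert)
    return {
        "worker": WORKER_FOR_KIND.get(alert.kind, "fallback_worker"),
        "severity": severity.name,
        "autonomous": severity == Severity.LOW,
    }
```

For example, `route(Alert("drone-1", "battery_drop", 0.1))` selects `telemetry_worker` and marks the event as autonomous, while a `motor_fault` with a reading of 0.9 is flagged for review.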
The Monitor watches heartbeats and tool-call success rates, then restarts failing Workers and re-queues pending tasks.
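A minimal sketch of that Monitor check, assuming an in-memory status record per Worker. The `WorkerStatus` type, the 30-second timeout, and the 50% success-rate floor are hypothetical; the actual restart is left to the caller.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WorkerStatus:
    last_heartbeat: float                                   # seconds on a monotonic clock
    recent_calls: deque = field(default_factory=lambda: deque(maxlen=20))  # True = success
    pending_tasks: list = field(default_factory=list)

    def success_rate(self) -> float:
        """Share of recent tool calls that succeeded; 1.0 when nothing is recorded."""
        if not self.recent_calls:
            return 1.0
        return sum(self.recent_calls) / len(self.recent_calls)


def check_workers(workers, now, task_queue, heartbeat_timeout=30.0, min_success_rate=0.5):
    """Return names of Workers to restart and move their pending tasks back to the queue."""
    to_restart = []
    for name, status in workers.items():
        silent = now - status.last_heartbeat > heartbeat_timeout
        failing = status.success_rate() < min_success_rate
        if silent or failing:
            task_queue.extend(status.pending_tasks)   # re-queue before restarting
            status.pending_tasks.clear()
            to_restart.append(name)
    return to_restart
```

Re-queuing before the restart means a task is never lost if the restart itself fails, at the cost of possible duplicate execution, which is why the tool calls themselves need to be idempotent.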
Telemetry Workers retrieve similar historical incidents and proven resolutions through LlamaIndex before acting.
High-risk OTA actions pass through a human approval gate with device lists, firmware diffs, risk scores, and rollback plans.
Nexus monitors inference latency, drift signals, and rollback triggers across edge deployments.
When device heartbeats go silent, Network Workers test alternate routes and issue mesh reconfiguration where recovery is possible.
Long incident chains can create large graphs. Nexus uses state compaction, max-depth guards, and SQLite checkpoints for resumable workflows.
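The three safeguards can be illustrated with the standard-library `sqlite3` module. This is a sketch under stated assumptions, not the Nexus implementation: the table layout, `MAX_DEPTH`, `KEEP_RECENT`, and the shape of the state dictionary (an `events` list) are all hypothetical.

```python
import json
import sqlite3

MAX_DEPTH = 50     # assumed guard value
KEEP_RECENT = 5    # assumed number of raw events kept after compaction


def open_store(path=":memory:"):
    """Open (or create) the checkpoint store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints ("
        "incident_id TEXT PRIMARY KEY, depth INTEGER, state TEXT)"
    )
    return conn


def compact(state):
    """Fold older events into a running count, keeping only the latest few."""
    events = state["events"]
    if len(events) > KEEP_RECENT:
        dropped = len(events) - KEEP_RECENT
        state = {
            **state,
            "compacted_events": state.get("compacted_events", 0) + dropped,
            "events": events[-KEEP_RECENT:],
        }
    return state


def save_checkpoint(conn, incident_id, depth, state):
    """Persist compacted state; refuse to go past the max-depth guard."""
    if depth > MAX_DEPTH:
        raise RuntimeError(f"incident {incident_id} exceeded max depth {MAX_DEPTH}")
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
        (incident_id, depth, json.dumps(compact(state))),
    )
    conn.commit()


def load_checkpoint(conn, incident_id):
    """Return (depth, state) for a resumable incident, or None if unknown."""
    row = conn.execute(
        "SELECT depth, state FROM checkpoints WHERE incident_id = ?", (incident_id,)
    ).fetchone()
    return None if row is None else (row[0], json.loads(row[1]))
```

A workflow that crashes mid-incident can call `load_checkpoint` on restart and continue from the stored depth instead of replaying the whole chain.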
Approval gates must avoid flooding operators. Historical replay, per-action decision boundaries, and confidence intervals tune which actions run autonomously and which go to human review.
Device APIs and OTA endpoints reached over mesh links need idempotent wrappers, deduplication keys, and result caching, so that retried calls do not repeat side effects and Monitor agents verify actual device state.
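One way to combine a deduplication key with result caching is a small decorator. The sketch below keeps results in process memory, which is an assumption made for brevity; a real deployment would need a shared store. The `idempotent` decorator, the key format, and `push_firmware` are hypothetical names, and `push_log` merely stands in for the device side effect. Verifying actual device state is out of scope here.

```python
import functools


def idempotent(key_fn):
    """Cache results per deduplication key so retries do not repeat side effects."""
    def decorator(fn):
        results = {}

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            key = key_fn(*args, **kwargs)
            if key not in results:
                # Only successful calls are cached; an exception leaves the key
                # absent so a later retry runs the call again.
                results[key] = fn(*args, **kwargs)
            return results[key]

        wrapper.results = results   # exposed for inspection
        return wrapper
    return decorator


push_log = []  # stands in for the real device side effect


@idempotent(key_fn=lambda device_id, firmware: f"ota:{device_id}:{firmware}")
def push_firmware(device_id, firmware):
    """Pretend to push firmware; records each real invocation in push_log."""
    push_log.append((device_id, firmware))
    return {"device_id": device_id, "firmware": firmware, "status": "sent"}
```

Calling `push_firmware("drone-7", "1.4.2")` twice performs the push once and returns the cached result the second time.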
Progressive heartbeat deadlines tied to task type prevent large OTA transfers from being mistaken for stalled Workers.
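A sketch of one reading of "progressive" deadlines: each task type has its own base deadline, and the allowance grows with consecutive missed heartbeats up to a cap. The task-type names, base values, growth factor, and cap are all assumptions chosen for illustration.

```python
# Assumed base deadlines in seconds per task type.
BASE_DEADLINE = {
    "telemetry_query": 15.0,
    "mesh_reconfig": 60.0,
    "ota_transfer": 120.0,
}


def heartbeat_deadline(task_type, missed=0, growth=1.5, cap=900.0):
    """Deadline for the next heartbeat, growing with consecutive misses up to a cap."""
    base = BASE_DEADLINE.get(task_type, 30.0)   # 30 s default for unknown task types
    return min(base * growth ** missed, cap)


def is_stalled(task_type, seconds_since_heartbeat, missed=0):
    """True when the Worker has been silent longer than its task-specific deadline."""
    return seconds_since_heartbeat > heartbeat_deadline(task_type, missed)
```

With these values, 90 seconds of silence marks a `telemetry_query` Worker as stalled but leaves an `ota_transfer` Worker alone.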
Strict similarity thresholds, metadata filtering, and citation checks keep partially matching past incidents from polluting the agent's reasoning.
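The three checks are shown below in plain Python rather than through the LlamaIndex API, so the logic is visible without depending on a particular library version. The incident record shape, the 0.85 threshold, and the function names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, incidents, filters, min_similarity=0.85):
    """Return (incident_id, score) pairs that match every filter and clear the threshold."""
    hits = []
    for inc in incidents:
        # Metadata filter: every requested key must match exactly.
        if any(inc["metadata"].get(k) != v for k, v in filters.items()):
            continue
        score = cosine(query_vec, inc["embedding"])
        if score >= min_similarity:
            hits.append((inc["id"], score))
    return sorted(hits, key=lambda h: h[1], reverse=True)


def citations_valid(cited_ids, hits):
    """Every incident cited in an answer must be among the retrieved hits."""
    return set(cited_ids) <= {incident_id for incident_id, _ in hits}
```

Filtering on metadata before scoring means an incident from a different device type is never considered, however similar its text is.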
Per-worker topic partitioning, priority lanes, and backpressure signalling keep critical incidents moving under load.
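Priority lanes and backpressure can be sketched with an in-process heap. This stands in for the message broker and says nothing about how Nexus partitions topics; the `PriorityLanes` class, the convention that priority 0 is the critical lane, and the watermark value are hypothetical.

```python
import heapq
import itertools


class PriorityLanes:
    """Priority queue with a backpressure signal; lower number = more urgent lane."""

    def __init__(self, high_watermark=80):
        self._heap = []
        self._order = itertools.count()   # tie-breaker keeps FIFO order within a lane
        self.high_watermark = high_watermark

    def backpressure(self):
        """True when producers should hold back non-critical traffic."""
        return len(self._heap) >= self.high_watermark

    def put(self, priority, task):
        """Always accept critical (priority 0) tasks; reject others under backpressure."""
        if priority > 0 and self.backpressure():
            return False
        heapq.heappush(self._heap, (priority, next(self._order), task))
        return True

    def get(self):
        """Pop the most urgent task, oldest first within the same lane."""
        return heapq.heappop(self._heap)[2]
```

Because critical tasks bypass the watermark, the critical lane is unbounded in this sketch; a production queue would likely cap it separately.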
Directed state graphs, conditional triage routing, checkpointing, and streaming node outputs for dashboards.
Incident log retrieval with metadata filters by device type, firmware version, and confidence threshold.
Manager reasoning, structured task dispatch, HITL summaries, and deterministic tool invocation.
Per-worker topic partitioning, priority lanes, and consumer groups for remediation actions.
QoS heartbeats, Will messages, retained last-known state, and fleet/device topic hierarchies.
Idempotent wrappers for SSH remediation, REST APIs, OTA push, mesh reconfiguration, and model rollback.