🔧 Troubleshooting

Common errors, error detection patterns, and debugging strategies for Bifrost jobs.

Error Types

Error | Status Set | Recovery
Rate Limit | rate_limited | Auto-retry when quota resets. Can cancel or manually resume.
Model Error | failed | Check model name. Resubmit with correct model.
Auth Error | failed | Check API key / credentials. Restart Bifrost after fixing .env.
Timeout | failed (cancelled) | Reduce scope or use autopilot type (60 min). Check for stuck loops.
Inactivity | failed (cancelled) | Agent is stuck. Check prompt clarity. Gemini: 5 min, Claude: 10 min.
Spawn Error | failed | CLI binary not found. Check CLAUDE_PATH / GEMINI_PATH in .env.
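
The inactivity limits listed above can be expressed as a small check. This is a minimal sketch, not Bifrost's actual code: the function name is_inactive, the lowercase agent keys, and the use of seconds are assumptions made for illustration. Only the 5 and 10 minute limits come from the text above.

```python
# Hypothetical sketch of an inactivity check; names are illustrative only.
INACTIVITY_LIMIT_SECONDS = {
    "gemini": 5 * 60,   # 5 min, as stated above
    "claude": 10 * 60,  # 10 min, as stated above
}


def is_inactive(agent: str, last_output_at: float, now: float) -> bool:
    """Return True when the agent has been silent for longer than its limit.

    Both timestamps are seconds on the same clock (for example time.time()).
    """
    limit = INACTIVITY_LIMIT_SECONDS[agent.lower()]
    return (now - last_output_at) > limit
```

A job whose last output is older than the limit for its agent would be the one reported as failed (cancelled) with an Inactivity error.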

Gemini Error Detection

Bifrost actively monitors Gemini CLI stderr for known error patterns and handles them proactively.

Error Type | Detection Patterns
Rate Limit | 429, RESOURCE_EXHAUSTED, quota exceeded, too many requests, retryWithBackoff
Model Error | ModelNotFoundError, model not found, invalid model
Auth Error | PERMISSION_DENIED, API key not valid, UNAUTHENTICATED, authentication failed
Why stderr? Gemini CLI writes error messages to stderr, not NDJSON stdout. Bifrost parses both streams separately: NDJSON from stdout, error detection from stderr.
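
To make the pattern matching concrete, here is a minimal sketch of a stderr classifier. The pattern strings are copied from the list above; the function name classify_stderr, the returned labels, and the case-insensitive substring matching are assumptions, and Bifrost's real detection logic may differ.

```python
from typing import Optional

# Pattern strings come from the detection list above.
# The dictionary keys are hypothetical labels chosen for this sketch.
ERROR_PATTERNS = {
    "rate_limit": ["429", "RESOURCE_EXHAUSTED", "quota exceeded",
                   "too many requests", "retryWithBackoff"],
    "model_error": ["ModelNotFoundError", "model not found", "invalid model"],
    "auth_error": ["PERMISSION_DENIED", "API key not valid",
                   "UNAUTHENTICATED", "authentication failed"],
}


def classify_stderr(line: str) -> Optional[str]:
    """Return the first error type whose pattern appears in a stderr line."""
    lowered = line.lower()  # case-insensitive matching is an assumption
    for error_type, patterns in ERROR_PATTERNS.items():
        if any(pattern.lower() in lowered for pattern in patterns):
            return error_type
    return None
```

Plain substring matching is simple but loose: a bare "429" would also match unrelated numbers in a stderr line, so a stricter implementation might anchor that pattern to a status-code context.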

Common Issues

1. Bifrost shows offline (red indicator)

Check | Fix
PM2 status | pm2 status: is bifrost-api "online"?
Heartbeat | Heartbeat POSTs to /api/remote-jobs?heartbeat=1 every 30s. Check D1 for recent heartbeats.
Startup guard | PM2 sets PM2_HOME env var. The server checks isDirectRun; if it doesn't detect PM2, it won't start. Check server.js.
Port conflict | Default port 4003. Check if another process is using it: netstat -an | findstr 4003
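
The netstat command above is Windows-specific. As a cross-platform alternative, the following sketch tests whether anything accepts TCP connections on the port. The function name port_in_use and the choice of 127.0.0.1 as the host are assumptions; only the default port 4003 comes from the text above.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    print("port 4003 in use:", port_in_use(4003))
```

A True result only shows that some process is listening. If Bifrost itself is stopped and the port is still in use, another process holds it.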

2. Job stays "pending" forever

Check | Fix
Bifrost running? | Check Bifrost health indicator. If offline, start it.
Polling working? | Bifrost polls for pending jobs. Check Bifrost logs for poll activity.
Job in D1? | GET /api/remote-jobs?status=pending to verify the job exists.

3. Job fails immediately

Check | Fix
CLI binary exists? | Verify CLAUDE_PATH or GEMINI_PATH points to a valid executable.
API key set? | Gemini needs GEMINI_API_KEY. Claude needs its own auth setup.
Invalid model? | Check the model name matches exactly (e.g. gemini-3.1-pro-preview, not gemini-pro).
Job type valid? | Must be one of: session, research, workflow, clara, autopilot.
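
Several of these checks can be run before submitting a job. The sketch below is illustrative only: the function name preflight, its arguments, and the wording of the messages are assumptions. The job types and environment variable names are the ones listed above.

```python
import shutil
from typing import List, Mapping

# Job types as listed above.
VALID_JOB_TYPES = {"session", "research", "workflow", "clara", "autopilot"}


def preflight(job_type: str, agent: str, env: Mapping[str, str]) -> List[str]:
    """Return a list of problems found; an empty list means the checks passed.

    agent is "claude" or "gemini"; env is a mapping such as os.environ.
    """
    problems = []
    if job_type not in VALID_JOB_TYPES:
        problems.append(f"unknown job type: {job_type!r}")

    path_var = f"{agent.upper()}_PATH"  # CLAUDE_PATH or GEMINI_PATH
    cli_path = env.get(path_var)
    # shutil.which returns None when the path is missing or not executable.
    if not cli_path or shutil.which(cli_path) is None:
        problems.append(f"{path_var} does not point to an executable")

    if agent == "gemini" and not env.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY is not set")
    return problems
```

Calling preflight("session", "gemini", os.environ) after loading .env gives a quick list of what to fix. Model names are not checked here because the set of valid models is not stated in this document.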

4. Job output is empty

Check | Fix
NDJSON parsing | Output comes from assistant NDJSON events. If CLI isn't producing NDJSON, check the --output-format stream-json flag.
SSE connection | SSE viewer connects to /jobs/:id/stream. Check browser console for connection errors.
Prompt issue | The agent may have no actionable work. Check the prompt in the job's channel messages.
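
To check whether a CLI is producing NDJSON at all, captured stdout can be summarized line by line. This is a minimal sketch under stated assumptions: it assumes each event is a JSON object with a "type" field (the text above only says that output comes from assistant events), and the function name summarize_ndjson is hypothetical.

```python
import json
from collections import Counter
from typing import Tuple


def summarize_ndjson(stdout_text: str) -> Tuple[Counter, int]:
    """Count events by their "type" field and count lines that are not JSON objects."""
    counts: Counter = Counter()
    non_json = 0
    for raw in stdout_text.splitlines():
        raw = raw.strip()
        if not raw:
            continue  # blank lines are not events
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            non_json += 1
            continue
        if isinstance(event, dict):
            # The "type" field name is an assumption about the event schema.
            counts[str(event.get("type", "unknown"))] += 1
        else:
            non_json += 1
    return counts, non_json
```

If the assistant count is zero and most lines are counted as non-JSON, the CLI is probably not running with the stream-json output format.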

5. Rate limit but no auto-retry

Check | Fix
Session ID captured? | Auto-retry needs a session_id from the CLI's init event. If it fails before init, no session ID is available.
Reset time parsed? | The reset time parser looks for patterns like "resets 9pm". If the error message format changes, parsing may fail.
Retry queue running? | The scheduler checks every 60 seconds. Verify in Bifrost logs: [retry] Fired N retry(s).
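
The reset time check above is easier to reason about with an example parser. The following is a sketch, not Bifrost's parser: the regular expression, the function name parse_reset_time, support for an optional ":MM" part, and the use of naive local time are all assumptions. Only the "resets 9pm" pattern comes from the text above.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Matches phrases such as "resets 9pm" or "resets 12:30am" (assumed formats).
RESET_RE = re.compile(r"resets\s+(\d{1,2})(?::(\d{2}))?\s*(am|pm)", re.IGNORECASE)


def parse_reset_time(message: str, now: datetime) -> Optional[datetime]:
    """Return the next time the quota resets, or None if no pattern matches."""
    match = RESET_RE.search(message)
    if match is None:
        return None
    hour = int(match.group(1))
    minute = int(match.group(2) or 0)
    if not (1 <= hour <= 12 and minute < 60):
        return None
    # Convert the 12-hour clock to 24-hour: 12am -> 0, 12pm -> 12, 9pm -> 21.
    hour = hour % 12 + (12 if match.group(3).lower() == "pm" else 0)
    reset = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if reset <= now:
        reset += timedelta(days=1)  # the reset time has passed today, use tomorrow
    return reset
```

A message that does not contain the expected phrase returns None, which corresponds to the case above where a changed error format leaves no reset time to schedule against.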

Debugging Tips

Tool | How to Use
Bifrost logs | pm2 logs bifrost-api shows all console output, including spawn commands, NDJSON events, and errors.
Health endpoint | GET bifrost.mipos.io:4003/health returns uptime, active job count, and version.
Job channels | Check #job-{id} channel in Channels tab. It has start, progress, and completion messages.
D1 records | GET /api/remote-jobs?id={id} returns the full job record, including status, error, output, and timestamps.
SSE direct | curl bifrost.mipos.io:4003/jobs/{id}/stream returns the raw SSE stream for debugging output issues.
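
When the raw SSE stream is saved to a file, for example from the curl command above, a small parser helps separate events from keep-alive noise. This sketch follows the standard server-sent events framing (data: lines, blank line ends an event, lines starting with a colon are comments). The function name parse_sse is hypothetical, and the format of each payload inside Bifrost's stream is not specified here.

```python
from typing import List


def parse_sse(raw: str) -> List[str]:
    """Return the data payload of each complete event in a raw SSE capture."""
    events: List[str] = []
    data_lines: List[str] = []
    for line in raw.splitlines():
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                events.append("\n".join(data_lines))
                data_lines = []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the payload.
            data_lines.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.
    return events  # a trailing event without a closing blank line is dropped
```

An empty result from a non-empty capture suggests the stream contains only comments or keep-alives, which points back at the job producing no output instead of at the SSE connection.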

Recovery Playbook

Scenario | Action
Job stuck running | Click Stop button in Remote tab, or DELETE /jobs/{id}
Rate limited, want to retry now | Cancel the queued retry, then manually resume with the session ID
Bifrost crashed mid-job | Restart Bifrost. In-memory jobs are lost. PATCH the stuck D1 record to failed
Wrong model selected | Cancel the job. Resubmit with the correct model in the payload
Job produced partial output | Check the #job channel for progress messages. Resume from checkpoint (autopilot) or session ID (session)
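
Two of the actions above are plain HTTP calls and can be scripted. The sketch below only builds the requests and does not send them. Several details are assumptions and should be checked against the real API: the http scheme, the use of /api/remote-jobs?id={id} as the PATCH target, the JSON body {"status": "failed"}, and the function names. The DELETE /jobs/{id} path comes from the text above.

```python
import json
import urllib.request


def build_stop_request(bifrost_base: str, job_id: str) -> urllib.request.Request:
    """Build DELETE /jobs/{id} against the Bifrost server (not sent here)."""
    return urllib.request.Request(f"{bifrost_base}/jobs/{job_id}", method="DELETE")


def build_mark_failed_request(api_base: str, job_id: str) -> urllib.request.Request:
    """Build a PATCH that marks a stuck D1 record as failed (not sent here).

    The endpoint and body shape are assumptions, not a documented contract.
    """
    body = json.dumps({"status": "failed"}).encode("utf-8")
    return urllib.request.Request(
        f"{api_base}/api/remote-jobs?id={job_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )
```

Passing either request to urllib.request.urlopen would send it. Any authentication headers the API requires are not covered in this document and would need to be added.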