# Troubleshooting Fly.io Deployments

## Contents
- [Health Check Failures](#health-check-failures)
- [Build Failures](#build-failures)
- [App Crashes](#app-crashes)
- [Connection Issues](#connection-issues)
- [Secrets and Environment](#secrets-and-environment)
- [Release Command Failures](#release-command-failures)

## Health Check Failures

**Symptom:** Deploy hangs, then fails with health check timeout.

### Port Mismatch

**Error:** Health check never passes, app seems to start fine.

**Cause:** `internal_port` in fly.toml doesn't match the port your app listens on.

**Fix:**
```bash
# Check what port your app uses
grep -r "listen\|PORT\|port" --include="*.js" --include="*.ts" --include="*.py"

# Update fly.toml
[http_service]
  internal_port = 3000  # Match your app
```

### Grace Period Too Short

**Error:** Health check fails immediately after deploy.

**Cause:** App takes longer to start than `grace_period` allows.

**Fix:**
```toml
[[http_service.checks]]
  grace_period = "30s"  # Increase from default 10s
  interval = "15s"
  timeout = "5s"
  path = "/health"
```

### Missing Health Endpoint

**Error:** Health check returns 404.

**Cause:** App doesn't have a `/health` endpoint (or configured path).

**Fix:** Either add an endpoint or change the check path:
```toml
[[http_service.checks]]
  path = "/"  # Use root if no health endpoint
```

### App Binding to localhost

**Error:** Health check times out, but app works locally.

**Cause:** App binds to `127.0.0.1` instead of `0.0.0.0`.

**Fix:** Configure app to listen on all interfaces:
```bash
# Node.js
server.listen(PORT, '0.0.0.0')

# Python Flask
app.run(host='0.0.0.0', port=PORT)

# Python Uvicorn
uvicorn main:app --host 0.0.0.0 --port $PORT
```

## Build Failures

### Dockerfile Not Found

**Error:** `Error: Dockerfile not found`

**Fix:**
```toml
[build]
  dockerfile = "Dockerfile"  # Ensure path is correct
  # Or use absolute path from repo root
  dockerfile = "./deploy/Dockerfile"
```

### Build Timeout

**Error:** Build killed after timeout.

**Fix:**
```bash
fly deploy --wait-timeout 10m
```

Or in fly.toml:
```toml
[deploy]
  wait_timeout = "10m"
```

### Out of Memory During Build

**Error:** Build OOM killed.

**Fix:** Use remote builder (default) or increase local resources:
```bash
fly deploy --remote-only
```

### npm/yarn Install Fails

**Error:** Package installation fails in Dockerfile.

**Common fixes:**
```dockerfile
# Ensure package-lock.json is copied
COPY package*.json ./
RUN npm ci --only=production

# Or for yarn
COPY package.json yarn.lock ./
RUN yarn install --frozen-lockfile --production
```

## App Crashes

### Check Logs First

```bash
fly logs --app my-app

# More context
fly logs --app my-app | head -100
```

### Missing Environment Variables

**Symptom:** App crashes immediately with "undefined" or config errors.

**Fix:**
```bash
# List current secrets
fly secrets list

# Set missing ones
fly secrets set DATABASE_URL="..." API_KEY="..."
```

### Memory Issues

**Symptom:** App killed with OOM.

**Fix:**
```bash
fly scale memory 512  # Increase to 512MB
# Or
fly scale vm shared-cpu-2x
```

### Process Exits Immediately

**Symptom:** Machine starts then stops right away.

**Causes:**
1. Missing CMD/ENTRYPOINT in Dockerfile
2. App exits without error (no long-running process)
3. Crash before any logging

**Debug:**
```bash
# Check recent logs
fly logs

# SSH into a machine for debugging
fly ssh console
```

## Connection Issues

### App Not Accessible

**Symptom:** `fly open` shows error or timeout.

**Checklist:**
1. Check app is running: `fly status`
2. Check machines exist: `fly machines list`
3. Check IP allocation: `fly ips list`
4. If no IPs: `fly ips allocate-v4 --shared`

### HTTPS Redirect Loop

**Symptom:** Browser shows redirect loop.

**Cause:** App also redirects to HTTPS, double redirect.

**Fix:** Let Fly.io handle HTTPS, disable in app:
```toml
[http_service]
  force_https = true  # Fly handles this
```

And remove HTTPS redirect from app code.

### App Works Then Stops

**Symptom:** App responds, then becomes unavailable.

**Cause:** `auto_stop_machines` is enabled, machine stopped due to inactivity.

**Options:**
```toml
[http_service]
  auto_stop_machines = "off"    # Never stop (costs more)
  # Or
  min_machines_running = 1      # Keep 1 always running
```

## Secrets and Environment

### Secrets Not Available

**Symptom:** App can't read environment variable that was set.

**Checks:**
```bash
# Verify secret is set
fly secrets list

# Secrets require redeploy to take effect
fly deploy
```

### Secrets Visible in Logs

**Problem:** Sensitive values appearing in logs.

**Fix:** Never log environment variables directly. Sanitize logs.

## Release Command Failures

### Timeout

**Error:** Release command timed out.

**Fix:**
```toml
[deploy]
  release_command = "npm run migrate"
  release_command_timeout = "10m"  # Increase from 5m default
```

### Database Connection Fails

**Error:** Release command can't connect to database.

**Causes:**
1. DATABASE_URL secret not set
2. Database not accessible from Fly.io network
3. Database requires IP allowlist

**Fixes:**
```bash
# Check secret exists
fly secrets list | grep DATABASE

# For external databases, ensure Fly.io IPs are allowed
# Or use Fly Postgres which is on the same private network
```

### Command Not Found

**Error:** `release_command: command not found`

**Cause:** Command not available in Docker image.

**Fix:** Ensure command is installed in Dockerfile:
```dockerfile
RUN npm install  # Ensures npm scripts work
# Or
RUN pip install -r requirements.txt  # Ensures Python deps available
```

## Quick Diagnostic Commands

```bash
# Overall status
fly status

# Machine list and states
fly machines list

# Recent logs
fly logs

# SSH into running machine
fly ssh console

# Check allocated IPs
fly ips list

# Check secrets
fly secrets list

# Check current config
fly config show
```
