Zero downtime sounds like magic. It is mostly discipline: reload instead of restart, health checks before trusting the new version, and enough instances that one going down does not matter. Here is the pipeline I use.
PM2 reload vs restart
This is the single most important thing to know. pm2 restart kills and replaces the process. pm2 reload rolls instances one at a time. With a clustered Node app, reload means at least one worker is always accepting requests.
pm2 reload ecosystem.config.js --update-env
The --update-env flag is easy to miss but important. Without it, changes to the environment variables in your ecosystem file do not take effect until the next hard restart, and you will spend 20 minutes debugging why a new env var is not being read.
The ecosystem config
module.exports = {
  apps: [{
    name: 'api',
    script: './dist/main.js',
    instances: 'max',
    exec_mode: 'cluster',
    max_memory_restart: '500M',
    env: { NODE_ENV: 'production' },
  }],
};
instances: 'max' spawns one worker per CPU core. exec_mode: 'cluster' enables load balancing across workers and — critically — enables pm2 reload.
GitHub Actions workflow
name: Deploy

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.DEPLOY_HOST }}
          username: ${{ secrets.DEPLOY_USER }}
          key: ${{ secrets.DEPLOY_KEY }}
          script: |
            cd /var/www/api
            git pull origin main
            pnpm install --frozen-lockfile
            pnpm build
            pm2 reload ecosystem.config.js --update-env
Health checks, not hope
PM2 considers an instance healthy as soon as the Node process is up, which is not the same as serving traffic successfully. I added a post-reload curl loop that polls /health for 30 seconds before the workflow reports success. If the new code has a startup bug, the deploy fails loudly instead of reporting a broken release as a success.
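The loop itself is short. A sketch of the check, appended to the end of the SSH deploy script, assuming the app listens on port 3000 and the /health path from earlier (adjust both to your setup):

```shell
# Poll /health for up to 30 seconds after the reload.
# Exits non-zero (failing the deploy step) if the app never comes up healthy.
for i in $(seq 1 30); do
  if curl -fsS --max-time 2 http://localhost:3000/health > /dev/null; then
    echo "healthy after ${i}s"
    exit 0
  fi
  sleep 1
done
echo "app failed health check after 30s" >&2
exit 1
```

curl's -f flag makes non-2xx responses count as failures, so a worker that is up but returning 500s still fails the check.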
The part I still do not love
SSH-based deploys are fragile. If I were starting today, I would containerize, push images to a registry, and have the target pull and restart. The pm2 reload step stays either way.