I Spent 3 Months Self-Hosting Trigger.dev. Here Is What It Actually Taught Me.

You googled Trigger.dev because you need background jobs. Maybe you already installed it. Maybe it is already running in production. And maybe you are wondering why the server crashes while CPU and RAM are still mostly free.

That is not a configuration mistake. That is architecture. And the three months I spent learning that the hard way are now yours for free.

Who I am, and why three months matters

I did not read a blog about Trigger.dev and summarize it. I built these systems myself, starting with plain Python and plain JavaScript at university, writing automations for my own projects. Then Make.com, Zapier and n8n once the no-code tools matured. Then back to code when the limits of visual tools became obvious. Custom automation systems for clients, my own webhook layers, my own job queues.

Trigger.dev was the logical next step. A modern, code-first background job platform with durable execution, realtime monitoring and a clean TypeScript API. So I did what most people skip: I evaluated it properly, in production, at scale, on a 96-core Hetzner dedicated server with 251 GB of RAM. That is the only way to actually know whether a tool holds up. Reading the docs does not tell you. A weekend prototype does not tell you. Thirty thousand jobs in a single batch tells you.

What came out of those three months: a 56-page disaster recovery playbook, a custom warm-start broker service, a zombie killer running as a systemd daemon, and a clear answer to the only question that matters, which I will give you near the end.

This is what I wish someone had handed me on day one.

Why I needed this in the first place: cold email at scale

I did not build a 96-core job pipeline for fun. I built it because cold email that actually converts is an enrichment problem, and enrichment at scale is an execution problem.

A generic cold email gets ignored. A personalized one gets a reply. But real personalization is not a mail-merge field. For each prospect it means scraping their website, finding and verifying a deliverable email, pulling company data, scoring the fit, and generating an opening line specific enough to prove a human actually looked. That is five or more executions per prospect, before a single email goes out.

Run one campaign to 5,000 prospects and you are at 25,000+ executions. Run several a week across multiple clients and you are past 100,000 executions a day. Every one of them is I/O-bound, an API call, a scrape, a lookup, which is exactly the workload that separates a job runner that scales from one that falls over at 200 concurrent pods.

This is where the architecture choice stops being academic. If your enrichment pipeline stalls, your campaign ships late or half-personalized, and the gap between deep and shallow personalization is the gap between a 4 percent reply rate and a 0.4 percent one. And on managed per-execution pricing, 100,000 executions a day at $0.05 each is $5,000 a day, or about $150,000 a month, just to research prospects. The wrong infrastructure does not only cost you uptime. It caps how personalized, and how large, your outbound can ever be.
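The arithmetic behind those numbers, as a small Python sketch. The five steps per prospect and the $0.05 per execution come from the text above; the 30-day month is an assumption made only for this calculation.

```python
# Back-of-the-envelope volume and cost model for the enrichment pipeline.
STEPS_PER_PROSPECT = 5          # scrape, find/verify email, company data, score, opening line
PRICE_PER_EXECUTION_USD = 0.05  # per-execution price used in the text
DAYS_PER_MONTH = 30             # assumption: a 30-day month


def executions_per_campaign(prospects: int, steps: int = STEPS_PER_PROSPECT) -> int:
    """Total executions needed to enrich every prospect in one campaign."""
    return prospects * steps


def monthly_cost_usd(executions_per_day: int) -> float:
    """Monthly bill if every execution is charged at a flat price."""
    return executions_per_day * PRICE_PER_EXECUTION_USD * DAYS_PER_MONTH


print(executions_per_campaign(5_000))    # 25000
print(round(monthly_cost_usd(100_000)))  # 150000
```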

The stack I run for this is deliberately boring: Apify for scraping, Prospeo for finding and verifying emails, Instantly for sending. The job runner underneath all of it is what decides whether the whole thing survives contact with volume. Everything below is the story of finding that out.

The crash, and what is really behind it

Hetzner dedicated server. 96 cores, 251 GB RAM, Ubuntu 24.04. A server that should handle 500 concurrent processes without breaking a sweat.

At 200 concurrent Trigger.dev jobs: load average 1018. SSH unreachable. Server effectively dead.

CPU usage at the moment of the crash: 16 percent. RAM usage: 12 percent.

The server had massive capacity left and was still completely blocked.

What had happened: 29,000 jobs were triggered in one batch. The configuration allowed 300 concurrent executions with a burst factor of 2.0x, so effectively up to 600 pods. 354 of them were stuck in ContainerCreating at the same time. SSH and the dashboard both went down under the load, and the server was unreachable.

Not because of insufficient resources. Because of a single kernel mechanism.

The net_mutex bottleneck

Trigger.dev v4 in Kubernetes mode creates a new Kubernetes pod for every single task run. Creating a pod means creating a Linux network namespace. And in the kernel, that operation is guarded by a global lock: net_mutex.

One namespace operation at a time. On a single node, the number of CPU cores does not change this. A 96-core server and a 4-core server hit the same wall here.

When 300 pods try to come up at once, 299 processes wait on that one lock. They enter the kernel state called uninterruptible sleep, or D-state. Load average explodes. The server stops responding.

containerd, the container runtime under Kubernetes, created at most around 1.7 new pods per second on a single node in this setup. To be precise: that is what I measured in my configuration, with this kernel version and this CNI. It is not a number you tune away with a config flag, and more cores do not move it.
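To make that rate concrete, here is a rough Python model that treats pod creation on one node as a single serialized queue. The 1.7 pods per second is the measured figure from above; treating creation as perfectly serialized is a simplifying assumption, not a description of containerd internals.

```python
# Model pod creation on one node as a single serialized queue.
POD_CREATION_RATE_PER_SEC = 1.7  # measured in this setup, see text


def seconds_to_create(pods: int, rate_per_sec: float = POD_CREATION_RATE_PER_SEC) -> float:
    """Time until the last of `pods` pods exists, if creation is strictly serialized."""
    return pods / rate_per_sec


def hours(seconds: float) -> float:
    return seconds / 3600


burst = seconds_to_create(354)     # the pods stuck in ContainerCreating
batch = seconds_to_create(29_000)  # the whole batch, one pod per run

print(f"354 pods: {burst / 60:.1f} min")     # 354 pods: 3.5 min
print(f"29,000 pods: {hours(batch):.1f} h")  # 29,000 pods: 4.7 h
```

Note that the core count is not a parameter anywhere in this model. That is the point: under a global lock, the rate is the only input that matters.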

How Trigger.dev Cloud solves the same problem

Trigger.dev Cloud runs the same architecture, but they operate thousands of nodes per region. 300 pods spread across 300 different nodes means one pod per node and no lock contention. That is not better software. It is simply more hardware you do not see and do not pay for, until the invoice arrives.

Self-hosting on a single node is structurally different from their cloud. That is not stated anywhere in the documentation. It is the single most important thing to understand before you start, and you only discover it after you are already committed.

What I built to work around it

I did not give up on the system. I spent three months taming it. Everything below is shared in full, because the tactical knowledge is not where the value is. The value is the conclusion at the end.

Cilium instead of Flannel

With Flannel as the CNI plugin, mass pod creation caused severe D-state storms. Kernel network namespace operations block anyway, and Flannel makes it worse. I replaced Flannel with Cilium, which uses eBPF, and that cut the number of processes stuck in D-state from 2500+ to under 10 at 200 concurrent pods.

It does not solve the net_mutex problem. It makes it manageable.

The custom warm-start broker

The real issue is that every run creates a new pod, and pod boot takes 5 to 15 seconds. At thousands of runs, that is the bottleneck.

My solution was a custom Node.js service acting as a broker between the supervisor and the runner pods. After a run finishes, the runner pod waits instead of terminating and long-polls the warm-start service. When a new run comes in for the same deployment and the same machine size, the service hands the run directly to the waiting pod, so no new pod is needed.

The service uses a two-level index for O(1) lookup: a deployment index mapping deployment IDs to waiting workers, and a worker index mapping controller IDs to deployment buckets. Maximum of 750 workers waiting at once.
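A simplified Python sketch of such a two-level index. This is not the production Node.js service: the class, method names and the example IDs are hypothetical, and only the two indexes and the 750-worker cap are taken from the description above.

```python
from typing import Dict, Optional, Tuple

BucketKey = Tuple[str, str]  # (deployment_id, machine_size)


class WarmStartIndex:
    """Two-level index of idle runner pods. All names here are hypothetical."""

    def __init__(self, max_waiting: int = 750) -> None:
        self.max_waiting = max_waiting
        # Level 1: bucket -> waiting controller IDs (dicts keep insertion order).
        self._by_deployment: Dict[BucketKey, Dict[str, None]] = {}
        # Level 2: controller ID -> the bucket it is waiting in.
        self._by_worker: Dict[str, BucketKey] = {}

    def register(self, controller_id: str, deployment_id: str, machine: str) -> bool:
        """Park an idle worker. Returns False when the pool is full."""
        if controller_id in self._by_worker:
            self.remove(controller_id)  # re-registration replaces the old entry
        if len(self._by_worker) >= self.max_waiting:
            return False
        key = (deployment_id, machine)
        self._by_deployment.setdefault(key, {})[controller_id] = None
        self._by_worker[controller_id] = key
        return True

    def remove(self, controller_id: str) -> None:
        """Drop a worker, for example when its long-poll times out."""
        key = self._by_worker.pop(controller_id, None)
        if key is None:
            return
        bucket = self._by_deployment[key]
        del bucket[controller_id]
        if not bucket:
            del self._by_deployment[key]

    def match(self, deployment_id: str, machine: str) -> Optional[str]:
        """Hand a new run to the longest-waiting compatible worker, if any."""
        bucket = self._by_deployment.get((deployment_id, machine))
        if not bucket:
            return None  # miss: the caller falls back to creating a new pod
        controller_id = next(iter(bucket))
        self.remove(controller_id)
        return controller_id


index = WarmStartIndex()
index.register("ctrl-a", "deploy-1", "small-1x")
index.register("ctrl-b", "deploy-1", "small-1x")
print(index.match("deploy-1", "small-1x"))  # ctrl-a
print(index.match("deploy-2", "small-1x"))  # None
```

Every operation is a handful of dictionary lookups, so registering, matching and timing out a worker all stay constant-time no matter how many workers are parked.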

Result in production: a 91.4 percent match rate in early stress tests. 2103 matches, 198 misses. 131 workers waiting in the pool simultaneously. Warm-start latency: 1ms inside the service, 930ms end to end. Without warm-start: 5 to 15 seconds per run.

This is an entirely separate service, built from scratch, purely to make Trigger.dev work at volume.

The zombie killer

Some tasks stayed stuck in EXECUTING status permanently, not because they were running, but because a deadlock prevented them from ever finishing. The supervisor would not dequeue new runs for those slots. Concurrency was held hostage by zombies.

Fix: a Python script running as a systemd service that checks the PostgreSQL database every 60 seconds and sets tasks stuck in EXECUTING for more than 15 minutes to CANCELED.
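The core of that idea fits in one UPDATE statement. The sketch below is not the production script: it uses an in-memory SQLite database as a stand-in for PostgreSQL so it runs anywhere, and the table name task_runs and its columns are hypothetical, not Trigger.dev's actual schema.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

STUCK_AFTER = timedelta(minutes=15)  # threshold from the text


def cancel_zombies(conn: sqlite3.Connection, now: datetime) -> int:
    """Cancel runs that have been EXECUTING for longer than STUCK_AFTER."""
    cutoff = (now - STUCK_AFTER).isoformat()
    cur = conn.execute(
        "UPDATE task_runs SET status = 'CANCELED' "
        "WHERE status = 'EXECUTING' AND started_at < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount


# Demo with one zombie, one healthy run and one finished run.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_runs (id TEXT PRIMARY KEY, status TEXT, started_at TEXT)")
conn.executemany(
    "INSERT INTO task_runs VALUES (?, ?, ?)",
    [
        ("run_zombie", "EXECUTING", (now - timedelta(minutes=40)).isoformat()),
        ("run_healthy", "EXECUTING", (now - timedelta(minutes=2)).isoformat()),
        ("run_done", "COMPLETED", (now - timedelta(hours=3)).isoformat()),
    ],
)
print(cancel_zombies(conn, now))  # 1
```

In a daemon, the same function would sit inside a loop that sleeps 60 seconds between passes, with the connection pointing at the real database.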

This runs as zombie-killer.service in the background. Permanently. In production.

The recovery procedure after a D-state crash

When the server was down, recovery was not trivial. SSH was unreachable, so I went in through the Hetzner console. Then: stop k3s, manipulate the k3s SQLite database directly, restart k3s.

Result: zero runs lost. But this only worked because I had written the playbook beforehand. If you hit this without one, you are restoring from backup.

The operational traps nobody documents

These are the problems that do not end in a D-state storm. They build up slowly and eventually cause an outage in production. None of them are in the official documentation.

Trap 1: kubectl set env orphans runs permanently

You want to update an environment variable. You run kubectl set env deployment/trigger-webapp .... That automatically triggers a deployment rollout, so the webapp restarts.

What happens: Redis queue entries (engine:runqueue:*) are lost. Runs sitting in PostgreSQL as PENDING or EXECUTING no longer have Redis entries. They are never dequeued. Orphaned forever. You have to cancel and re-trigger them one by one.

Fix: set environment variables only through Helm values, never through kubectl set env while runs are active. This is documented nowhere.

Trap 2: Update the Caddyfile manually after every Helm deploy

Caddy runs as a reverse proxy and routes traffic to the webapp and registry through their Kubernetes ClusterIPs. Those ClusterIPs change after every helm install or helm upgrade.

After every deploy: kubectl get services -n trigger, note the new IPs, update the Caddyfile, systemctl reload caddy. Forget this and the dashboard is unreachable and CI/CD deploys fail.
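The manual step is easy to forget, so it is worth scripting. A minimal Python sketch, assuming you feed it the output of kubectl get services -n trigger -o json. The service names, IPs, ports and the hostname below are made up for illustration; check them against your own cluster.

```python
import json
from typing import Dict


def upstreams_from_kubectl(services_json: str) -> Dict[str, str]:
    """Map service name -> 'clusterIP:port' from `kubectl get services -o json`."""
    upstreams = {}
    for item in json.loads(services_json)["items"]:
        spec = item["spec"]
        ip = spec.get("clusterIP")
        if not ip or ip == "None" or not spec.get("ports"):
            continue  # headless services have no ClusterIP to proxy to
        upstreams[item["metadata"]["name"]] = f"{ip}:{spec['ports'][0]['port']}"
    return upstreams


def caddy_site(host: str, upstream: str) -> str:
    """Render one Caddyfile site block that proxies `host` to `upstream`."""
    return f"{host} {{\n\treverse_proxy {upstream}\n}}\n"


# Trimmed, made-up sample of the kubectl output.
sample = json.dumps({"items": [
    {"metadata": {"name": "trigger-webapp"},
     "spec": {"clusterIP": "10.43.12.7", "ports": [{"port": 3030}]}},
    {"metadata": {"name": "trigger-registry"},
     "spec": {"clusterIP": "10.43.88.1", "ports": [{"port": 5000}]}},
]})

ups = upstreams_from_kubectl(sample)
print(caddy_site("trigger.example.com", ups["trigger-webapp"]))
```

Writing the rendered blocks to the Caddyfile and running systemctl reload caddy afterwards turns the post-deploy ritual into one command.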

Trap 3: Docker registry 401 on CI/CD, a silent failure

The generateRegistryCredentials API used by the Trigger.dev CLI only supports AWS ECR. For self-hosted registries it returns a 409 error that is swallowed silently, with nothing in the CLI output. On top of that, the buildx docker-container driver does not inherit the host's Docker credentials. Result: every CI/CD deploy fails with 401 Unauthorized.

Debugging this takes time. The fix is an explicit docker login step in the GitHub Actions pipeline, but first you have to understand what the problem even is.

Trap 4: Electric SQL must bypass PgBouncer

Electric SQL needs wal_level = logical and direct logical replication. PgBouncer in transaction mode does not support that.

The Helm chart ignores the Electric extraEnvVars configuration and uses the PgBouncer URL anyway (port 6432). Electric must connect directly to port 5432. You have to manually set the env var on the Electric deployment after every deploy, and again after every helm upgrade, because the upgrade overwrites it.

Trap 5: The PostgreSQL resource trap

The Helm chart's default setup allocates 50 cores and 129 GB of RAM to PostgreSQL. That sounds like a solid database configuration.

Trigger.dev uses PostgreSQL only for small, indexed OLTP queries via Prisma ORM. The run queue lives in Redis. Logs live in ClickHouse. PostgreSQL is just the system of record for run state.

Actual requirement: 8 to 12 cores, 32 to 48 GB of RAM.

The effect: roughly 40 cores and 80 GB or more of RAM are reserved for a PostgreSQL that never needs them, and unavailable to the runner pods that actually do.

The part that actually matters: this is a tooling decision, and Claude Code changed it

Here is the thing I wish I had understood before month one. The hard part of background jobs is not the code. It is choosing an architecture that matches your actual workload, so you never end up writing a zombie killer at 2am.

And that decision changed recently, because of how these systems are now built. With Claude Code, the implementation effort for any of the tools below collapses. You describe what a job should do, Claude writes it, and for the right platform it deploys it and wires the webhook for you. The bottleneck is no longer "can I build this." It is "did I pick the right thing to build on." That is now the entire game.

Which is why the integration depth matters more than it used to. Windmill has an official windmill-claude-plugin and an MCP server. Claude Code knows Windmill-specific patterns, CLI conventions and script structures, and it interacts with the workspace directly: trigger jobs, query status, deploy scripts, all through chat. The feedback loop is shorter than with any other tool here. pg-boss and Celery also work very well with Claude Code, without a plugin but with deep training coverage. Trigger.dev has no official integration.

If you are building a service business or an internal automation layer and you are not a full-time infrastructure engineer, this is the lever. The tool you can operate and extend with Claude Code, on an architecture that does not fight you, beats the technically impressive tool that needs a specialist.

Tool comparison: what actually works

Trigger.dev Cloud

What it is: Managed background job platform. Pod-per-task isolation, durable execution, realtime monitoring.

When it makes sense: You need real durable execution. Tasks run for hours or days. You do not want to manage infrastructure.

Cost: At 100,000 executions per day with 5 seconds runtime: $150,000+ per month in invocation fees alone, before compute. Even at a fraction of that volume, Trigger.dev Cloud gets expensive fast.

Self-hosted: Technically possible, operationally heavy. The net_mutex problem, the five traps above, the warm-start broker, the zombie killer. Budget 4 to 8 weeks of infrastructure time before the system runs reliably.

Claude Code: No official integration. TypeScript SDK available.

Windmill

What it is: Open-source workflow engine. Python, TypeScript, Go, Bash, SQL. Every script gets a webhook endpoint automatically. Dashboard and logs built in. A direct n8n alternative for teams who prefer code over drag-and-drop.

Architecture: Workers are long-lived processes that pull jobs from a PostgreSQL queue. No pod per task. No net_mutex. No Kubernetes required.

Dedicated workers: For high-frequency scripts, a worker reserved for a single script runs as a while-loop with 0ms cold start.

Claude Code: Official windmill-claude-plugin. MCP server built in. The strongest Claude Code integration of any tool here.

Cost self-hosted: Community edition: free. Unlimited executions, AGPLv3. Enterprise features (SAML, audit logs, Git sync, S3 cache) cost extra and are not needed for most self-hosted setups.

Known drawback: 60ms cold start per job on default workers. On dedicated workers: 0ms. For many different rarely-run scripts, default workers are fine. For high-frequency scripts, use dedicated workers.

Celery + Redis (Python)

What it is: The de facto standard for Python background jobs since 2011. A direct n8n alternative for Python teams.

Architecture: Long-lived worker processes, Redis queue, no container per job.

Known production issues: Memory leaks in the worker process. Fix: --max-tasks-per-child=100. Without an explicit --concurrency, Celery defaults to one worker process per CPU core, which means 96 processes on a 96-core server whether you want them or not. Workers can stop consuming after a Redis reconnect, which requires an explicit error handler with auto-reconnect. No built-in dashboard. Flower (open source) works but is limited.
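The first two fixes can also live in configuration instead of CLI flags. A minimal sketch in the celeryconfig.py style, where settings are plain module-level names loaded with app.config_from_object("celeryconfig"). The concurrency value and the Redis URL are placeholder assumptions, not recommendations for your workload.

```python
# celeryconfig.py style settings: plain module-level names.
# Values are illustrative assumptions; tune them for your own workload.

# Recycle each worker process after 100 tasks to bound memory growth
# (the setting behind the --max-tasks-per-child flag).
worker_max_tasks_per_child = 100

# Pin concurrency instead of inheriting the CPU core count (96 on this server).
worker_concurrency = 16

# Redis as broker; the URL is a placeholder.
broker_url = "redis://localhost:6379/0"
```

This does not address the stalled-worker problem after a Redis reconnect. That still needs its own handling.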

Claude Code: Very good. No official integration, but established patterns and deep training coverage.

BullMQ (JavaScript/TypeScript)

What it is: Redis-based job queue for Node.js. MIT license. An n8n alternative for TypeScript teams.

Known production issues: FlowProducer.add() can fail silently with no error, leaving the job nowhere. Scheduled jobs stop randomly after a few hours. Redis maxmemory-policy must be noeviction; allkeys-lru deletes BullMQ keys.

Claude Code: Good. TypeScript-first makes collaboration smooth.

pg-boss (PostgreSQL-native)

What it is: A job queue library that uses PostgreSQL as its only backend. MIT license. Around since 2016, 268,000 weekly npm downloads. An n8n alternative for teams already running PostgreSQL.

Core advantage: Transactional enqueue. Job and business data in a single ACID transaction. Impossible with Redis-based solutions.
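What transactional enqueue buys you is easiest to see in a toy example. The sketch below does not use pg-boss or its API. It uses Python's built-in SQLite as a stand-in for PostgreSQL, with hypothetical prospects and jobs tables, purely to show the pattern: business row and job either both commit or both roll back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prospects (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT NOT NULL, prospect_id INTEGER NOT NULL)"
)
conn.commit()


def add_prospect(conn: sqlite3.Connection, email: str, fail_after_enqueue: bool = False) -> None:
    """Write the business row and its follow-up job in one transaction."""
    with conn:  # commits on success, rolls back if the block raises
        cur = conn.execute("INSERT INTO prospects (email) VALUES (?)", (email,))
        conn.execute(
            "INSERT INTO jobs (name, prospect_id) VALUES ('enrich-prospect', ?)",
            (cur.lastrowid,),
        )
        if fail_after_enqueue:
            raise RuntimeError("simulated crash before commit")


def counts(conn: sqlite3.Connection) -> tuple:
    prospects = conn.execute("SELECT COUNT(*) FROM prospects").fetchone()[0]
    jobs = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
    return prospects, jobs


add_prospect(conn, "a@example.com")
try:
    add_prospect(conn, "b@example.com", fail_after_enqueue=True)
except RuntimeError:
    pass

print(counts(conn))  # (1, 1)
```

With the queue in Redis and the data in PostgreSQL, the failed call could leave a prospect without a job or a job without a prospect. Here neither half survives.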

Known production issues: No auto-reconnect after a PostgreSQL restart; you implement it yourself. Performance degrades past 500,000 to 1 million rows without autovacuum tuning. Default settings delete jobs after 14 days; unprocessed jobs disappear. Bus factor of 1: a single main maintainer.

When it makes sense: PostgreSQL is already running. Zero new infrastructure. ACID guarantees for enqueue needed.

Claude Code: Good. Small codebase, clear TypeScript API.

Comparison table

Criterion	Trigger.dev Cloud	Windmill	Celery	pg-boss
Kubernetes required	No	No	No	No
Pod per task	Yes	No	No	No
net_mutex risk	No	No	No	No
Cold start	No	0-60ms	0ms	0ms
Webhooks	Automatic	Automatic	Build yourself	Build yourself
Dashboard	Excellent	Good	Flower	SQL only
Languages	TypeScript	Python, TS, Go, Bash, SQL	Python	JavaScript, TS
Claude Code plugin	No	Yes, official	No	No
Durable execution	Yes	No	No	No
Cost at scale	Very high	Server	Server	Server

Server costs and execution limits

All numbers below are estimates to give you a sense of scale, not quotes. Your actual needs depend on job duration and workload type.

I/O-bound vs compute-intensive

I/O-bound: The script waits on external services such as HTTP requests, API calls, scraping, database queries. The worker process is mostly idle. A server can handle far more concurrent I/O-bound jobs than it has cores.

Compute-intensive: The script actively computes, such as LLM inference on your own hardware, image processing, data transformation. Rule of thumb: one concurrent job per 2 CPU cores.
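The difference is easy to demonstrate. The Python sketch below uses asyncio.sleep as a stand-in for an API call or scrape, and it illustrates the principle only: Windmill and Celery workers are processes, not asyncio tasks. The compute_slots helper encodes the rule of thumb from the paragraph above.

```python
import asyncio
import time


async def io_bound_job(delay: float) -> None:
    """Stand-in for an API call or scrape: the job only waits."""
    await asyncio.sleep(delay)


async def run_batch(jobs: int, delay: float) -> float:
    """Run all jobs concurrently on one thread and return the elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(io_bound_job(delay) for _ in range(jobs)))
    return time.perf_counter() - start


def compute_slots(cores: int) -> int:
    """Rule of thumb from the text: one compute-heavy job per 2 cores."""
    return max(1, cores // 2)


elapsed = asyncio.run(run_batch(jobs=500, delay=0.1))
print(f"500 I/O-bound jobs on one thread: {elapsed:.2f}s (run serially: 50s)")
print(compute_slots(96))  # 48
```

Five hundred waiting jobs cost almost nothing, while the same server is full at 48 compute-heavy ones. That asymmetry is why the sizing table below is only valid for I/O-bound work.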

Rough orientation, self-hosted, I/O-bound (Windmill or Celery)

Executions/day	Concurrent	Server	Cost/month
up to 10,000	up to 20	VPS 4 cores / 8 GB	€5–15
up to 50,000	up to 100	VPS 8 cores / 16 GB	€20–35
up to 200,000	up to 500	VPS 16 cores / 32 GB	€50–80
up to 1,000,000	up to 2,000	Dedicated 32+ cores / 64+ GB	€90–160
over 1,000,000	over 2,000	Dedicated 96 cores / 256 GB	€200–250

Trigger.dev Cloud cost comparison (estimated)

Executions/day	Runtime/job	Cloud cost/month	Self-hosted/month
1,000	5s	hundreds	€5–15
10,000	5s	low thousands	€20–35
100,000	5s	high thousands+	€50–80
1,000,000	5s	tens of thousands+	€200–250

The break-even sits somewhere around 1,000 to 2,000 executions per day depending on job duration and machine type. Below that, cloud is cheaper. Above it, the gap grows fast.

The one question that matters

After three months, the most valuable thing I produced was not the warm-start broker or the zombie killer. It was the answer to a single question: does this workload actually need pod-per-task isolation and durable execution, or does it just need a reliable queue with retries?

For scraping, enrichment, AI scoring, webhook processing, and almost every automation a service business runs, the answer is the second one. And the moment the answer is the second one, Trigger.dev's entire architecture, the Kubernetes, the pods, the net_mutex wall, becomes cost you pay for a guarantee you do not need.

That single distinction would have saved me three months. It is worth more than every workaround in this article combined.

My recommendation

For most use cases: Windmill self-hosted. Webhooks automatic, dashboard built in, Python and TypeScript native, official Claude Code plugin, no Kubernetes. Community edition free with no execution limits. Start with a €20–35 VPS.

All scripts in Python, maximum control: Celery + Redis.

PostgreSQL already running, ACID guarantees needed: pg-boss.

You genuinely need durable execution and do not want to manage infrastructure: Trigger.dev Cloud, after you run the math. Above 2,000 executions per day, self-hosting with the right tool pays off.

If your problem is less "which job runner" and more "my workflows depend on a freelancer nobody else can read," that is a different and more expensive problem. I wrote about it separately.
