Scaling MongoDB to 200K Users: Lessons Learned

When I joined, the MongoDB cluster was a single replica set handling 40K daily users. Eighteen months later we were at 200K daily users, and P95 latency had halved. Here is what actually moved the needle.

Indexes, not instances

The first instinct when things slow down is to scale horizontally. Do not. I spent a week with MongoDB Compass's profiler and found that 70 percent of our slow queries were scanning entire collections, either because an index was missing or because the field order in a compound index was wrong.
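
If you want to check a query by hand, explain() output can be inspected for the same symptoms. Here is a minimal sketch, assuming the classic plan shape where each stage has a `stage` name and an optional `inputStage` or `inputStages`. Newer server versions may nest the plan differently. The sample plan is hand-written for illustration, not real output, and the helper names are my own.

```javascript
// Walk a winning plan and collect every stage name.
// Assumes the shape { stage, inputStage } or { stage, inputStages: [...] }.
function collectStages(plan, acc = []) {
  if (!plan || typeof plan !== 'object') return acc;
  if (plan.stage) acc.push(plan.stage);
  if (plan.inputStage) collectStages(plan.inputStage, acc);
  for (const child of plan.inputStages || []) collectStages(child, acc);
  return acc;
}

// Flag two warning signs: a full collection scan (COLLSCAN) and a
// blocking in-memory sort (SORT), which often points at index order.
function diagnose(plan) {
  const stages = collectStages(plan);
  return {
    collectionScan: stages.includes('COLLSCAN'),
    inMemorySort: stages.includes('SORT'),
  };
}

// Hypothetical sample: a sort sitting on top of a collection scan.
const samplePlan = {
  stage: 'SORT',
  inputStage: { stage: 'COLLSCAN', filter: { status: { $eq: 'active' } } },
};

console.log(diagnose(samplePlan));
```

In mongosh, the object to pass in would typically come from something like db.posts.find(...).explain().queryPlanner.winningPlan.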

Rule of thumb for compound indexes (the ESR rule): equality fields first, then sort fields, then range fields. A query like this:

db.posts.find({ status: 'active' }).sort({ createdAt: -1 }).limit(20)

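wants an index on { status: 1, createdAt: -1 }, in that exact order: status is the equality match, createdAt is the sort. If the query also has a range filter, put that field last. A range field placed ahead of the sort field stops the index from serving the sort, and MongoDB falls back to sorting in memory.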
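
To make the ordering concrete, here is a small sketch that assembles an index key document in equality, sort, range order. The helper esrIndexKeys and the `views` field in the usage note are hypothetical, made up for illustration. It relies on JavaScript objects preserving insertion order for ordinary string keys, which is what gives the index its field order.

```javascript
// Build a compound-index key document in equality, sort, range order.
// `equality` and `range` are arrays of field names; `sort` maps field -> 1 | -1.
function esrIndexKeys({ equality = [], sort = {}, range = [] }) {
  const keys = {};
  for (const field of equality) keys[field] = 1;
  for (const [field, dir] of Object.entries(sort)) keys[field] = dir;
  for (const field of range) {
    // Skip range fields that already appear as equality or sort keys.
    if (!(field in keys)) keys[field] = 1;
  }
  return keys;
}

const keys = esrIndexKeys({ equality: ['status'], sort: { createdAt: -1 } });
console.log(keys);
// In mongosh the result would be passed along as: db.posts.createIndex(keys)
```

Adding a range filter, say on a `views` field, would append it after createdAt: esrIndexKeys({ equality: ['status'], sort: { createdAt: -1 }, range: ['views'] }).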

Projection is free performance

Every find that returns the whole document when you only need three fields is leaving bandwidth on the floor. Adding a projection shrank one of our hottest endpoints by 60 percent:

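// mongosh takes the projection document directly as the second argument.
// (The Node.js driver takes it as an option instead: { projection: { ... } }.)
db.posts.find(
  { status: 'active' },
  { title: 1, slug: 1, _id: 0 }
);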
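
For a rough feel of what a projection saves, the sketch below compares serialized sizes of a full document and a projected one. Everything in it is an assumption for illustration: the document is made up, project is a stand-in for what the server does, and JSON string length is only a loose proxy for BSON size on the wire.

```javascript
// Keep only the listed fields, mimicking an inclusion projection with _id excluded.
function project(doc, fields) {
  const out = {};
  for (const f of fields) {
    if (f in doc) out[f] = doc[f];
  }
  return out;
}

// Made-up document: `body` and `tags` stand in for bulky fields a list view never renders.
const post = {
  _id: 'p1',
  title: 'Scaling notes',
  slug: 'scaling-notes',
  status: 'active',
  body: 'x'.repeat(4000),
  tags: ['mongodb', 'indexes'],
};

const full = JSON.stringify(post).length;
const slim = JSON.stringify(project(post, ['title', 'slug'])).length;
console.log({ full, slim });
```

The real saving depends entirely on how large the unprojected fields are, so measure on your own documents before expecting a particular number.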

Avoid $lookup in the hot path

$lookup is MongoDB's equivalent of a left outer JOIN, and it is as expensive as you would expect. For read-heavy hot paths, I denormalize the joined fields and accept the write-time cost. For cold reports and admin tools, $lookup is fine.
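
A sketch of what that trade looks like, assuming a hypothetical schema where posts reference an author and the read path only needs the author's name. The field names and both helpers are invented for illustration. The second helper shows the write-time cost: when the source record changes, every copy has to be updated.

```javascript
// Copy the author fields the read path needs onto the post at write time.
function embedAuthor(post, author) {
  return { ...post, author: { _id: author._id, name: author.name } };
}

// When an author is renamed, every post carrying the copy must change.
// Returns the filter and update documents one would pass to updateMany.
function authorFanOut(author) {
  return {
    filter: { 'author._id': author._id },
    update: { $set: { 'author.name': author.name } },
  };
}

const author = { _id: 'u1', name: 'Ada', email: 'ada@example.com' };
const post = embedAuthor({ _id: 'p1', title: 'Hello', authorId: 'u1' }, author);
console.log(post);

const { filter, update } = authorFanOut({ ...author, name: 'Ada L.' });
console.log(filter, update);
// In mongosh: db.posts.updateMany(filter, update)
```

Only the fields the hot path reads are copied (here the name, not the email), which keeps the fan-out update small.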

Connection pooling matters more than you think

The default pool size (maxPoolSize) is 100 connections per client in most drivers. With 20 app instances, that is up to 2,000 connections MongoDB has to juggle. I dropped it to 20 per instance, which caps the total at 400, and latency improved. Fewer connections meant MongoDB was not context-switching as much.
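
The arithmetic, plus one way to set the cap, sketched below. maxPoolSize is a standard connection-string option. The hostname, database name, and helper functions are placeholders of my own, and the string helper assumes the URI already has a path segment such as /app before any options.

```javascript
// Upper bound on connections: each app instance can open up to maxPoolSize.
function maxConnections(appInstances, maxPoolSize) {
  return appInstances * maxPoolSize;
}

// Append maxPoolSize to a connection string, respecting an existing query string.
function withMaxPoolSize(uri, size) {
  const sep = uri.includes('?') ? '&' : '?';
  return `${uri}${sep}maxPoolSize=${size}`;
}

console.log(maxConnections(20, 100)); // 2000 with the default pool size
console.log(maxConnections(20, 20));  // 400 after lowering it to 20

// Placeholder host and database name.
console.log(withMaxPoolSize('mongodb://db.example.internal:27017/app', 20));
```

Drivers generally keep a separate pool per server they connect to, so treat the product as a per-server ceiling rather than a cluster-wide total.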

Run a restore drill quarterly

Do not find out your backup strategy is broken during an incident. I am not kidding.