The hidden cost of microservice boundaries: a five-year retrospective
In 2021 we drew 47 boxes on a whiteboard. The number was generated by the Institute of Ass-Pulled Data: nobody measured anything; we just felt that 47 was about right for the size of the domain. By 2024, 22 of those boxes were back inside other boxes. This is not a story about microservices being wrong. It is a story about the boundary between two services being the most expensive thing in your system to change, and about us not knowing that yet.
The single question I now ask before drawing a line: "what is the smallest change that has to cross this boundary, and how often will we make it?" If the answer is "every sprint, in lockstep", the line should not be there. There is no architecture clever enough to make that line cheap.
What we thought we were buying
We had a monolith. It deployed slowly, tested slowly, was owned by one team, and was feared by all. The pitch for microservices was independence: teams could deploy their services on their own cadence, with their own language and their own database. The coordination overhead of the monolith would dissolve.
For the first 18 months, it worked. Teams moved faster. Services deployed without coordination. The on-call burden spread out. The incident radius shrank.
Then we started building features.
The boundary cost framework
Every boundary between services has a cost. That cost is not constant: it scales with how often you need to change both sides simultaneously.
# the boundary cost framework, on the back of a napkin
cost(boundary) =
      f_change * cost_per_change
    + f_failure * blast_radius
    - autonomy_gained
# if the first term dominates, the boundary is in the wrong place.
# if the second term dominates, the boundary is fine; invest in failure isolation.
# if the third term dominates, ship it.
f_change is the frequency of coordinated changes across the boundary. cost_per_change is the overhead: versioning, contract tests, deploy coordination, separate PR queues. f_failure is how often one service's failure affects the other. blast_radius is how bad that failure is. autonomy_gained is the actual team-level independence the boundary enables.
In our case, 22 of the 47 services had f_change > 1 per sprint and autonomy_gained ≈ 0, because both sides of each boundary were owned by the same team. The boundary added overhead with no benefit. Classic Bullshit-Driven Development: the architecture looked great on a conference slide and cost us a deploy coordination meeting every Thursday.
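To make the napkin formula concrete, here is a minimal Python sketch. It assumes all three terms are expressed in the same unit (engineer-hours per sprint here), which the napkin version never required. The `Boundary` class, its method names, and every number in the usage lines are illustrative assumptions, not measurements from our system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Boundary:
    """One service boundary, with every term in engineer-hours per sprint (assumed unit)."""
    name: str
    f_change: float         # coordinated cross-boundary changes per sprint
    cost_per_change: float  # hours of overhead per coordinated change
    f_failure: float        # failures per sprint that leak across the boundary
    blast_radius: float     # hours lost per leaked failure
    autonomy_gained: float  # hours per sprint saved by deploying independently

    def terms(self) -> dict:
        """Return the three terms of the napkin formula, unsigned."""
        return {
            "change": self.f_change * self.cost_per_change,
            "failure": self.f_failure * self.blast_radius,
            "autonomy": self.autonomy_gained,
        }

    def cost(self) -> float:
        """Net cost: change overhead plus failure exposure minus autonomy."""
        t = self.terms()
        return t["change"] + t["failure"] - t["autonomy"]

    def dominant_term(self) -> str:
        """Name of the largest term, which drives the napkin verdict."""
        t = self.terms()
        return max(t, key=t.get)


# Hypothetical boundaries with made-up numbers, for illustration only.
same_team = Boundary("same-team split", 2.0, 4.0, 0.1, 5.0, 0.0)
cadence_split = Boundary("different cadences", 0.2, 4.0, 0.1, 5.0, 10.0)

for b in (same_team, cadence_split):
    print(b.name, round(b.cost(), 2), b.dominant_term())
```

The absolute numbers matter less than which term dominates. A boundary whose "change" term is largest is a merge candidate no matter how the units are chosen, as long as they are chosen consistently.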
The 12 that survived intact
The services that retained independent value after five years shared one property: they were owned by teams with genuinely different deployment cadences.
The payments service deploys every day. The fraud model service deploys every six weeks (when a new model is validated). The boundary between them is valuable because it lets payments ship without waiting for fraud model validation.
The notification service deploys whenever email templates change. The customer profile service deploys with every product sprint. Different cadences, different failure modes. The boundary earns its cost.
Merging back
The re-merger process was unexpectedly cheap in most cases. For services with tightly coupled databases, the merge was a single ALTER TABLE ... INHERIT plus an application change. Services with separate databases required a data migration, but for the services that should have been merged, the data models were nearly identical, so the migration was mostly mechanical.
The expensive mergers were the ones where each service had accumulated its own dialect of the domain model. Four years of divergence compressed into one migration. We scheduled those for Q1 2025.
What I would do differently
Start with modules, not services. Extracting a service from a well-structured monolith with internal module boundaries is cheaper than merging two services back together. If a module proves it needs independent deployment, extract it. The cost of extraction is paid once. The cost of premature extraction is paid on every feature for five years.
Measure f_change before drawing the line. We had no tooling for this in 2021. Today there are cross-repository change coupling tools that can tell you, from git history, which files change together. Use them before the architecture review, not after, because by the time you are in the architecture review, everyone has already fallen in love with their box on the whiteboard.
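As a rough illustration of measuring change coupling from git history, here is a minimal Python sketch. It makes a simplifying assumption: both sides of the candidate boundary live in one repository under different directory prefixes. Real cross-repository tools have to correlate commits some other way, which this does not attempt. The function names, the `@@commit` marker, and the example paths are all hypothetical.

```python
import subprocess
from typing import Dict, List, Set

COMMIT_MARKER = "@@commit "  # arbitrary marker chosen for this sketch


def read_git_log(repo_path: str) -> str:
    """Run git log and return one marker line per commit, followed by its file paths."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only",
         "--pretty=format:" + COMMIT_MARKER + "%H"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_commits(log_text: str) -> List[Set[str]]:
    """Turn the log text into a list of file sets, one set per commit."""
    commits: List[Set[str]] = []
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith(COMMIT_MARKER):
            commits.append(set())
        elif line and commits:
            commits[-1].add(line)
    return commits


def coupling(commits: List[Set[str]], side_a: str, side_b: str) -> Dict[str, float]:
    """Count commits touching each prefix and the share that touch both."""
    a_count = b_count = both = 0
    for files in commits:
        in_a = any(f.startswith(side_a) for f in files)
        in_b = any(f.startswith(side_b) for f in files)
        a_count += int(in_a)
        b_count += int(in_b)
        both += int(in_a and in_b)
    either = a_count + b_count - both
    ratio = both / either if either else 0.0
    return {"a": a_count, "b": b_count, "both": both, "ratio": ratio}


# Usage against a real checkout (paths are placeholders):
# commits = parse_commits(read_git_log("."))
# print(coupling(commits, "services/orders/", "services/billing/"))
```

The ratio is the share of commits touching either side that touch both. It is a proxy for f_change, not f_change itself: converting it to "coordinated changes per sprint" would also need commit dates, which this sketch ignores.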
Own the platform layer first. The first six months of microservices should be spent on a shared build system, shared observability, and shared CI/CD. We spent them on business logic. The technical debt from that decision cost us more than the wrong service boundaries did.