After Agile.

We are uncovering better ways of developing
software by doing it and helping others do it.

The first lines of the Agile Manifesto

Since the adoption of the agile manifesto, we’ve spawned dozens of ‘agile’ methodologies[0], in pursuit of better ways of developing software. Given the staleness of software today, it’s evident we haven’t uncovered better ways of working.


Computer microprocessors (such a quaint word, given where we are today) are thousands of times faster than what we had available when the agile manifesto was written, and yet I write this on WordPress’s online block editor, on a 2019 MBA, in Firefox, and the cursor is very slow. It takes ten seconds to paint 12 characters. I’ve force-quit everything I could, and only one tab is open.


23 years ago, the major problem facing software teams was the planning process. Software Development methodologies looked a lot like manufacturing methodologies: capability research, capacity planning, design, development, testing, and maintenance. Each step was discrete and carried into the next. The folks that met at Snowbird saw this process as the enemy of good software; and for a time[1] they were right. For a time the ability to take an idea and put it into production in weeks instead of months and years was the greatest improvement in the delivery of software. Instead of conquering that mountain and moving on to the next, we’ve stayed there, built a city, and let it decay.

Software delivery times have decreased[citation needed], and software has gotten worse to use. We’ve spent so long optimizing for delivery that we’ve completely forgotten the purpose of software: To make our lives better.

Cory Doctorow calls the decay of software enshittification, and he explains that it’s due to platform lock-in[2]. Even without platform lock-in, we’d be in this mess. The incentives are the same in all cases (a desire for faster, cheaper software), but the problem is culture, not market position. It’s endemic to how we create software. We’ve drilled it into ourselves that we have to do things cheaply and faster than fast. Since most software organizations have adopted agile in its various forms (sigh: Scrum), I’m going to blame our current predicament on our over-reliance on agile as the answer to our problems in developing software[3].

It’s 22 years old, it can take it.


I accidentally touch my trackpad (also a quaint word) and the block editor dialog menu pops up. I move my cursor across the page and the editor menu pops up over the words I’m reading. Frustrated, I have to move the page and click further down to get the editor to stop popping up over the words I’m trying to proofread.

The company that makes wordpress dot com, and this block editor, is Automattic. Automattic is valued at over $7.5 billion. $7.5 billion! I hope they would spend a mere fraction of a fraction of that on fixing the problems with this editor. The wordpress dot com block editor is more downtrodden than a Dickensian protagonist in the first act.


There is one common theme that unites the software process of teams at tech unicorns, mom and pop shops, and startups: Agile. Popularized by the agile manifesto (yes, yes, I know, the Scrum folks will claim they were first, but the wheel by itself wasn’t popular until someone put it on a wheelbarrow), agile has come to mean any iterative software development process that ideally pays lip service to the principles of the manifesto[4].

Agile was an iterative software process codified in 2001, mostly by programmers and consultants, at a ski lodge. It focused on their areas of agreement, along with principles for software teams and businesses to follow in the development of software.

What emerged was the Agile ‘Software Development’ Manifesto. Representatives from Extreme Programming, SCRUM, DSDM, Adaptive Software Development, Crystal, Feature-Driven Development, Pragmatic Programming, and others sympathetic to the need for an alternative to documentation driven, heavyweight software development processes convened.

From “About the Manifesto”

Far better than the waterfall methods that preceded it, the agile manifesto focused on producing working software in weeks instead of months.[5]

Agile’s tenets are simple[6]:

1. Get working software in front of a customer on regular, short intervals.
2. Use the feedback from customers to iterate on it.
3. Inspect how you build software and adapt to change brought on by your customers, the market, stakeholders, and the changing skill-sets in your own team.
4. Make adjustments to your process, and rinse, lather, and repeat — indefinitely.

The process itself relies on a focus on the team building the software, with a bit of ignorance towards the processes that businesses need to survive. It all but ignores the cost of producing features in favor of just doing it. It ignores the cost of long term maintenance, of saying yes, in favor of a process that says yes and delivers in short intervals. It does not elevate the creation of software to the realities of the business that relies on that software, and instead posits that somehow if you inspect and adapt enough, you’ll be a better software feature widget maker and that is what your business needs.

Agile delivery pairs perfectly with a sales-driven feature factory, where new features drive new sales, until all of it collapses under its own weight and the business is forced to sell itself or start over from scratch. IT projects are either too simple to be useful, or too complex to be delivered (or replaced) successfully, all while the team inspects and adapts every two weeks, like clockwork.[7]

Of course, it’s not supposed to be this way. One of those cute little things we say but rarely practice is, “Agile is the art of maximizing the work not done”, where you can be agile by saying “no”, and where saying “yes” creates commitments, which wears down the team’s ability to move… well.. agily.

Is this agile’s fault? Yes. If the rules of the game are vague enough to drive a truck through, the rules are at fault. If the rules of the game regularly produce outcomes that are sub-optimal, then it’s up to the rules of the game to be fixed.[8]

Agile has several truck-sized vagaries that we ignore at our own peril:

1. There’s a cost to being agile. There’s a cost to delivering that widget that the CEO or the sales team says the customer wants: to delivering what someone asks for, and not looking deeper into their expensive problems and seeing if they should even be solved the way they’re asking. This has to be the cornerstone of any software process: uncovering the root of the symptoms the customer talks about, and digging deep into that, and its first order causes and second order effects. If you think you’ve dug deep enough, you probably haven’t. I like Jonathan Stark’s Why Conversation for this, because it tries to talk the person out of that thing they say they want and uses three questions: Why This?, Why Now?, and Why Are We the ones that need to solve it?

    A process that optimizes for quick-turnaround, for learning through iteration, encourages us to stop at the level of “produce what is asked for, not what is needed”.

2. Inspecting and Adapting is not nearly enough. You can’t solve the right problem without being in the room where the problems are discussed. By the time the problem has gotten to the ‘agile team’, it’s become a directive with a solution. The idea of a cross-functional team doesn’t take into account budgetary and business considerations. Those become inputs that the product manager or “Product Owner” uses to decide what to build that fits those parameters, without the team being fully involved in the business problem definition process and the budget process. If the team isn’t talking budget numbers, they aren’t close enough to the business processes that define success and failure.

How many times have we seen a matrix of decision making? The Product Manager decides what to build, the developers decide how to build it, and the business decides how to use that to solve its problems. Those divisions have caused more harm to our industry than any other singular mindset.

Businesses don’t make line teams part of the budget or business goal planning process. By the time the realities of the business get down to the teams, it’s purely an exercise in specific features or at best KPIs/OKRs that need moving. This is a fundamentally broken process that agile encourages, in part because of its origins as a development-team-focused approach.

3. Features are promises realized. Likewise, broken features break trust. The idea of deliver, then iterate presupposes two conditions that are not universally true:

  1. That the broken feature will be improved until it’s good (instead of “good enough”), and trust regained as the feature improves.
  2. That you can make up the loss of trust brought on by broken features with more features.

While this problem existed long before agile, agile normalized delivering broken features and iterating on them until they weren’t broken any more. It did this for good reason: it’s better to deliver something half-baked early than it is to deliver the perfect thing months too late. And the sentiment holds truth: if you want to know whether a feature will be used, ship the feature. But the realities of software development have made this mantra a punchline.


As I keep the browser window open that I’m typing this into, WordPress’s cursor and text being painted into the editor is taking longer and longer, with me waiting 10s of seconds for it to catch up to me, and futilely moving my cursor to correct a spelling mistake because WordPress was 10 seconds behind me by the time I stopped typing.


Move fast and break things

Facebook internal motto until 2014

The agile world has brought us a world where broken is normalized, where “perfect is the enemy of the good” has become a rallying cry for agile software delivery. This is where I align with Cory Doctorow. Customer lock-in is an easy path to repeatable profits, and VC money means there’s always a new market to chase and no need to keep the existing market segment happy, so the fundamental practice of making good software is shunned in favor of finding new ways to extract attention from folks. Our whole process for creating software is optimized for this world[9].

4. The gulf between the software team and the team making the business decisions grows wider each year.

Ben Affleck: “Wouldn’t it be easier for NASA to train astronauts how to drill rather than training drillers to be astronauts?”
Michael Bay: “Shut up.”

DVD Commentary, Ben Affleck on the premise of the movie Armageddon

There’s one key ingredient that’s missing from software teams, and that’s the budgetary authority and responsibility for the software that’s produced.

Teams don’t control their own Profit and Loss, they don’t control their own headcount, and they are often the last to know about any strategic changes in the software they’re charged with delivering. While this practice makes (some) sense in developed industries like construction and manufacturing, it falls flat when you try to apply it to a software team. Software teams don’t have a known cost for a feature. The budget for a block editor (like wordpress’s block editor) could range from a few hundred dollars a year and forty hours of developer time to several hundred thousand, or even millions, of dollars. There’s a whole ton of context around the budget that the developers have that most other folks won’t have. In some companies, getting that budgetary authority requires the CTO to step in, and often it becomes a negotiation (at best) with the CEO and CFO to allocate a pot of money. By the time it gets down to the team, there’s no discretion or authority to change from an obviously bad approach.

It’s sort of like 20 privates trying to tell Congress how to run the Army. Yeah, they’re in deep, and yeah, they understand the ‘on the ground’ problems, but the Army’s on the ground problems and their strategic problems are miles apart in theory, even if they aren’t in practice.

The second biggest issue — besides the wrong framing for trying to fix the problem — is the lack of institutional buy-in from the business side. Agile requires business folks to buy in to the idea of incremental delivery, frequent introspection, developers being as close to the customer as possible, and maximizing the amount of work not done. And I mean really buy in. Owners of businesses will often fashion themselves after Steve Jobs and believe they’re a good proxy for the customer. They aren’t.

CEOs will often claim incremental and iterative delivery is in their best interest, and then want a delivery plan for the next quarter, or 6 months, or a year.

It’s impossible to have an accurate, or even 50% accurate, delivery plan for a year in an industry where priorities change weekly or even monthly, and yet software folks are expected to stick to their estimates as gospel quotes.

Even if agile is a perfect software delivery process (it doesn’t claim to be), the problem lies not in agile, but in our reliance on agile as the be all and end all to creating software.

We’re doing things the same way we did them 23 years ago.

“Agile” is 20+ years old, SAFe is taking the enterprise world by storm, Scrum Masters are being laid off while the Scrum industry deals with fallout from the broken promises of Scrum. Experienced software folks have moved on from Scrum to kanban or “scrum, but” (we do scrum, but we don’t do X, Y, or Z from the Scrum guide — or my personal favorite — we do the parts of scrum we like but we haven’t actually read the scrum guide recently), folks are still calling “scrum events” ‘ceremonies’ or ‘rituals’, and we live in this bastardized world where the rules are made up and the story points don’t matter — except to the executives that take them as gospel truth.

This is not a great place to be.

We’ve known story points were bad for over a decade now; and the last four years it’s become mainstream enough you’d think we’d have dropped them, but we haven’t.

We’ve known developers can’t estimate unless you break down ‘stories’ so small that it becomes trivial to estimate each part, and we also know there’s very little value in spending the time it takes to break stories down that far, but we still have business folks pressuring software folks for estimates and then holding them to those estimates.

We’ve known that focusing on feature delivery severely undermines both the user experience and quality of the software being delivered; and yet the idea of ‘slack’ in capacity is seen as a cardinal sin. If reading this doesn’t make your skin crawl, you and I probably won’t agree on much. I wish you well.

And yet, in 2023, these very serious flaws in development are alive and well and present in just about every “agile” team out there, and even if they aren’t, it’s a fight against entropy and business trying to control the situation by bringing up these very flawed ideas.

We thought Scrum would save us, it did not. We thought Kanban would save us, it did not. The conventional wisdom now is that ShapeUp will save us. Spoiler alert: it will not.

The problem is none of these methodologies take into account the realities of how decisions get made. By the time you’ve gotten to improving software delivery, the decisions have been made and are (for all intents and purposes) cast in stone. These methodologies are pretty good when it comes to behaviors of the team making the software itself, but somehow ignore the entire business process before the act of creating and delivering software. They miss that there’s this pot of money that has to be allocated, and they treat that money allocation as a given, and as a limitless resource. These methodologies also miss that business priorities do not align neatly to features, they align neatly to problems, but at the moment the software team becomes involved, we’ve skipped past the problem and are solutioning.

It just so happens that the solution looks a lot like things the management team wanted anyway.

Another common problem is that teams don’t know how to talk about money with the business; or communicate the carrying cost and opportunity cost associated with certain courses of action. And when we do talk about these problems, we phrase them as ‘technical debt’, which not only obfuscates their impact, but uses a term from the financial world that misses the actual problems faced by teams when dealing with technical debt.

‘Debt’ is a two-dimensional term, financially speaking. I’m either paying to alleviate that debt through reduction of the principal, or I’m paying just interest on the debt. But at no point do I find out that I can’t even start my car because I’ve taken out this debt to buy the car. Debt doesn’t affect the current operation of the thing I bought with it. It affects what I buy, but it doesn’t affect my ability to go about my day-to-day business with what I’ve already bought.

That’s not true for technical debt. How many times have you had to deal with the problems arising from technical debt keeping you from doing a simple bug fix? Or adding a new feature? Or not even being able to adopt a new style of working because of the ‘technical debt’ the team took on? Or existing features becoming impossible to maintain because of the debt you took on? Sometimes it feels like building software is negotiating with a sentient and malevolent being hell-bent on your mental destruction, manifested as no more than text files on the ethereal plane.

So when we try to talk technical foundations in business terms as software developers, we get it wrong. When we focus on optimizing software delivery and ignore the alignment of business, finances, and economic realities before us, we get it wrong.

The point is this:

Agile is not (and never was) sufficient to produce software that folks love to use and that meets the goals for which it was created. We spend more time creating churn and problems for ourselves, our businesses, and our customers through our over-reliance on agile than the benefits we get from optimizing software delivery with agile methodologies.

It’s time to look past agile. It’s time to focus on a holistic software creation process that aligns with budget realities and the realities of business. A software creation process that aligns what software creation is with what we need it to be: A way to improve people’s lives and enrich our culture. A source of joy, not frustration.

We’ve optimized software delivery. Now let’s focus on the realities surrounding software creation. Let’s spend the next 20 years optimizing for the purpose of software: To make people’s lives better. We can’t do that until we fully align software development with the underlying business realities funding it.


Endnotes:

[0]: Scrum, Kanban, XP, Lean, LeSS, SAFe. KPIs. OKRs. Knolway. You may ask why KPIs and OKRs are in this list. They have all the same characteristics as other ‘agile’ methodologies. What chiefly distinguishes them from a waterfall/pre-planned approach to developing software is the ability to point to a goal outside of building the software itself, and iterate towards that goal, knowing the conditions under which that goal will be considered met (inspect and adapt). KPIs and OKRs provide an iterative framework to achieve a goal, and to inspect and adapt to reach that goal. At the most basic level, all agile methodologies seek to do that exact same thing, for that exact same reason. Just because some guys didn’t come up with it at a ski resort doesn’t make it any less agile.

[1]: So there’s data here; and I haven’t dug into it deeply (yet), but a cursory glance suggests that in 1995, 83.7% of software projects could be considered ‘failures’ (they fell short of “the project is completed on-time and on-budget, with all features and functions as initially specified.”), and the Project Management Institute survey suggests (edit: 8/20/2024, link updated) that around 69% are failures today (“Do not meet their goals”). Now, data-wise this is not a big jump, and not enough to tell businesses to “adopt this approach because you’re 14% less likely to fail”.

This ungodly long opinion piece is not about whether agile was ever right, just that it’s past time to look at “What’s next”, and treat agile the same way Waterfall was treated 30 years ago: as standard, but insufficient for creating good software that people love to use and that brings joy to their lives.

[2]: Cory Doctorow is a luminary in tech. But, I see a particular blind-spot with his way of thinking, and that is it reduces our problems to a specific economic model, whereas if this problem were only about the economics of VCs and startups and “Get big fast and corner the market”, then it wouldn’t be endemic to all software. But as I write this, it’s hard to find software that hasn’t allowed this entropy to set in (even as I hesitate to call it ‘entropy’ because that pre-supposes that it’s inevitable, and I don’t think it is, as it’s entirely in our power to maintain quality and joy of use. Nothing is stopping us, except us).

[3]: The problem was never the mechanics of software delivery. When that’s the easy thing to blame (and the easiest to fix politically!) we focus on that. I haven’t gathered the data to back this up (this is a wake up call, not a research paper), but I’m willing to bet we’ve been looking in the wrong places for answers all these years. Optimizing software delivery has turned into the goal, the game, where the game all along should have been alignment of capital, direction, and purpose, and delighting the folks using our software.

[4]: Even more ‘fun’ (if you’re into that sort of thing) is that folks will tell you how the slightest deviation from those principles means you aren’t agile, and that somehow all these processes that derive from agile aren’t actually agile (Except theirs, of course). I will concede the folks that hate on SAFe have a bit of a point, however.

[5]: There’s that word. working. Not great. Not amazing. Not “Software you’ll be proud of”. Working software. That’s a low bar, but when you compare it to what came before — software that was produced for months and years but never saw the light of day because it was obsolete before being shipped — it was a godsend.

[6]: Feels weird to reword the tenets of agile for the purposes of this post, but I’m doing so because if I use the same words they use, it would defeat the purpose of trying to improve our understanding of the problems at play.

[7]: We have traded one level of software purgatory for another. Either we spend months building a thing that ultimately is obsolete when it’s shipped, or we spend weeks, ship it, realize it’s broken, fix it just enough to not be apparently broken, and move on once the opportunity cost is greater than the value to us to fix it. We’re still left with a process that disconnects us from the desires of the person using our software, and substitutes our own process as a barometer instead of that desire. We’ve disconnected software creation from the mechanisms that determine how much runway we have, and wonder why the plane runs out of gas right after we take off.

[8]: Or find a different game to play. At some point you can’t trim around the edges any more and you have to fundamentally change the type of game you’re playing. If you just keep adding rules on top of rules to address the vagaries, you end up with American football, where you have to have 25-30 $100,000 cameras fixed at ungodly angles to determine whether someone caught a football.

[9]: I can’t tell whether it’s a good thing that we’ve optimized how we build software for the realities of the industry around us, but I find it depressing that we’ve normalized building crap, and that it’s comically rare that we build good software that brings joy to people’s lives.

Author’s note: This post is the culmination of a writing process that took place over 5 months in 2023 and up until the time of publishing. I’ve spent so long writing and re-writing and editing it that it may come across as rushed. Ironically, it’s better to get the ideas out there and let them ruminate in the world than to keep working on this one post until it’s polished enough to be what I hoped for it to be when I started writing it. This is the beginning of the conversation, not the end. As an aside, once I upgraded to a 2023 MBP M2, the block editor started behaving again. That’s one way to socialize the losses from ineffective software development: Make your customer pay for better hardware to get a good experience.

Support Layers in Microservices Topologies

One thing I mention frequently in the daily emails is the fact that microservices require a lot more operational support and development support than a monolith does.

In a monolith, once you’ve got your CI/CD set up; it’s set up. Production-wise, you only have to worry about your application server and your database server (and any reverse proxy), and the most complex your architecture gets is when it includes a load balancer and a web farm.

All in all, not very complex.

Now, microservices are not synonymous with containerization, but generally microservices end up being containerized for ease of deployment. You can also containerize your monolith (which I generally recommend as a forcing function to make it seamless for new developers to get started in your system), so containers aren’t just a microservices fad, though they work nicely with microservices.

If you think of microservices as self-contained applications that each need deployment, operational support, and scaffolding to make it easy to develop a new microservice, then you start to realize there are lots of repeated problems you have to solve when developing Microservices:

  1. How do we add new microservices in a uniform way so that we don’t have twelve different ways to do logging, or healthchecks, or monitoring, or authentication or authorization?
  2. How do we make scaffolding that makes it easy to have a generated container image for a new microservice that has all the business and industry specific stuff we need? For instance, if you work in the US government space, you’re going to hear two phrases a lot: “STIGged” and “FIPS-140 compliant”. Your industry may have its own terms, but it’s the non-functional part that every application needs to have that you’d rather bake in than worry about doing it each time.
  3. How do we make tooling that makes it easy to generate contracts when new microservices are made?
  4. How do we provision new hardware (when in a private data-center) or provision new instances (when in a public cloud setting)?

Susan Fowler talks about these four types of problems in her book Production-Ready Microservices, which covers how you get a microservice from development to production in a sustainable and scalable manner.

Susan mentions four ‘support’ layers you have with Microservices that you don’t have in a monolith (or if you have them, they were already solved long ago and you don’t need to worry about it now).

Layer 4: Microservices
Layer 3: Application Platform
Layer 2: Communication
Layer 1: Hardware/Host*

*I’ve modified layer 1 to be Hardware/Host (because like it or not, a docker image is a host, and has its own patch cycle).

For a development organization, here’s the sort of things you typically need to worry about in those layers:

Layer 1: OS updates; OS library updates; the actual hardware (if in a private datacenter); Virtual Instances staying up to date; local Docker registry
Layer 2: Message contracts; event queue infrastructure; scaffolding for generated types; OpenAPI tooling; Thrift/gRPC tooling
Layer 3: (if in .NET): private nuget (package) registry maintenance and tooling; keeping .NET up to date; CI/CD for each microservice; development tooling; internal tooling to make development easier; (above scaffolding for generated types can also be in this layer); logging and monitoring for microservices; making the application systemd compatible
Layer 4: tools to generate microservice-specific configurations; SSL certificates for each service (if needed); environment files; and any tooling that we’d need to apply to a specific type of microservice.
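
One way to keep track of these layers is a simple checklist. The sketch below is only an illustration: the layer names come from the list above, but the concern labels are shortened paraphrases, and SUPPORT_LAYERS and missing_concerns are hypothetical names, not anything from Susan Fowler's book or a real tool.

```python
# Hypothetical checklist. Layer names follow the list above; the concern
# labels are shortened paraphrases and only cover a few items per layer.
SUPPORT_LAYERS = {
    1: ("Hardware/Host", ["os updates", "os library updates", "docker registry"]),
    2: ("Communication", ["message contracts", "event queue infrastructure", "openapi tooling"]),
    3: ("Application Platform", ["package registry", "ci/cd per service", "logging and monitoring"]),
    4: ("Microservices", ["service configuration", "ssl certificates", "environment files"]),
}


def missing_concerns(in_place):
    """Return {layer name: [concerns not yet handled]} for layers with gaps."""
    handled = {item.lower() for item in in_place}
    gaps = {}
    for _, (name, concerns) in sorted(SUPPORT_LAYERS.items()):
        todo = [c for c in concerns if c not in handled]
        if todo:
            gaps[name] = todo
    return gaps


if __name__ == "__main__":
    have = ["OS updates", "CI/CD per service", "message contracts"]
    for layer, todo in missing_concerns(have).items():
        print(f"{layer}: still needs {', '.join(todo)}")
```

A list like this does not do any of the work, but it makes the cost visible before the first microservice ships.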

If you have these four support layers in place; then a developer simply has to create a new microservice and go; this tooling takes care of the rest.  This can look like all of the scaffolding being generated for them by internal CLIs.
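
To make the "internal CLI" idea concrete, here is a minimal sketch of what the core of such a tool could look like. Everything in it is an assumption for illustration: the file names, the contents of the templates, the /healthz path, and the scaffold_service function are placeholders for whatever your organization standardizes on.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical templates: file names and contents are placeholders for the
# logging, health check, and container conventions your organization picks.
TEMPLATES = {
    "service.json": lambda name: json.dumps(
        {"name": name, "logging": "structured", "healthcheck": "/healthz"}, indent=2
    ),
    "Dockerfile": lambda name: f"# base image and hardening steps for {name} go here\n",
    "README.md": lambda name: f"# {name}\n\nGenerated by the internal scaffolding tool.\n",
}


def scaffold_service(name: str, root: Path) -> list[Path]:
    """Create root/name with one file per template; refuse to overwrite."""
    target = root / name
    # exist_ok=False raises FileExistsError if the service already exists.
    target.mkdir(parents=True, exist_ok=False)
    created = []
    for filename, render in TEMPLATES.items():
        path = target / filename
        path.write_text(render(name), encoding="utf-8")
        created.append(path)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold_service("orders", Path(tmp)):
            print(path.name)
```

The point of the design is that every new service starts from the same templates, so there is one way to do logging or health checks instead of twelve.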

Sounds like a lot, doesn’t it?  It is. But whether you automate it or make your development team do it manually, it still all has to be done, and is a cost to adopting microservices.

Size is relative: Microservices Edition

One of the go-to wars around microservices is how small they should be (I have beef with this beef, but that’s another topic for another day).

Some people say they should be no bigger than the code that fits on your screen.

Some people say they should be no bigger than what’s necessary to encompass their reason for existing.

Some people say they should encompass a ‘bounded context’.

Some people say you should be able to rewrite a microservice as quickly as you could find and fix a bug in it.

They’re all right, and they’re all wrong.

How big or how small a service should be is relative to your comfort level.

If your monolith is 1MM lines of code, anything 10,000 lines or less is going to be small.

If your monolith is small (50,000 lines of code), then 10,000 lines of code isn’t very ‘micro’.

If you have five people on your team, having 10,000 lines of code per service (with 5 services) seems reasonable.

If you have fifty-five people on your team, then five services of 10,000 lines of code each gives you a lot of collaboration points and a higher potential for merge conflicts.

Micro is relative. How you size your services depends on a lot of context that ‘some people’ don’t have. Don’t worry about how others size their services; do what makes sense for your context, your business, and your team.
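
The numbers above can be put side by side with a few lines of arithmetic. This is only a sketch: relative_size and people_per_service are hypothetical helper names, the small monolith is taken to be 50,000 lines, and neither ratio is a recommended sizing rule.

```python
def relative_size(service_loc: int, monolith_loc: int) -> float:
    """Service size as a fraction of the monolith it is compared against."""
    return service_loc / monolith_loc


def people_per_service(team_size: int, service_count: int) -> float:
    """Average number of people sharing each service."""
    return team_size / service_count


if __name__ == "__main__":
    service_loc = 10_000
    for monolith_loc in (1_000_000, 50_000):
        share = relative_size(service_loc, monolith_loc)
        print(f"{service_loc:,} of {monolith_loc:,} lines = {share:.0%} of the monolith")
    for team_size in (5, 55):
        crowd = people_per_service(team_size, 5)
        print(f"{team_size} people across 5 services = {crowd:g} per service")
```

The same 10,000 lines is 1% of one codebase and 20% of another, and the same five services mean one person each on a small team or eleven people each on a large one.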

And because I can’t resist, here’s an image of all the sci-fi ships sized, relative to one another

Questions to ask before pursuing Kubernetes

Today’s post brought to you by an innocent sounding question in the Rands Leadership Slack (paraphrased):

I’m joining a company as their first Devops person and their infra needs a serious upgrade. The codebase is a monolith, but they want to pursue microservices, and if growth keeps on its path, that seems likely. I think I’m going to be allowed to dictate a better setup, but I’m currently having an existential crisis between going with ECS and K8s. I realize this is a huge question, but all things being equal: if you had the chance to start fresh, would you go towards K8s or ECS?

#devops channel, Rands Leadership Slack

I felt like that “step on the brakes” TikTok meme when I read this (you can google that, there’s no way I’m linking to it). This person is well meaning, but it’s a question without the necessary context, and a question I’ve seen asked by people who haven’t figured out the outcome and strategy they want in the first place!

As an aside, Kubernetes seems to be all the rage these days; and that’s OK. Beanie Babies were once cool too. But, much like Beanie Baby mania, some of the hype precedes the value, and let’s be clear: Kubernetes is not an outcome or strategy. It’s a tactic.

Let me explain.

When someone says, “Hey, you should adopt Kubernetes”, that’s skipping a few crucial steps. Like, “What would buying all these beanie babies get me?”

Or 

“Why do I even feel the need to buy beanie babies?”


So let’s start there.


What pain-points are you having with your current architecture and topology?  Is it legitimate scaling issues? Is the database server experiencing timeouts due to high load? Are you writing the same data to multiple places and finding that you’re clogging up the pipes, as it were? Are users experiencing latency or are you having operational issues?  Can your system handle the scale you need it to?

The answers to those questions will give you an idea of the outcomes you want to achieve:

  • Ensuring the database can handle both OLAP and OLTP workloads without requiring a complete system rewrite
  • Ensuring the database doesn’t time out
  • Improving resiliency so that data isn’t lost during high-load times
  • Ensuring the system can scale to 10x transactions per hour to meet expected future load

Keep in mind, none of those outcomes specifies beanie babies or Kubernetes. Those are desired outcomes.

Next is strategy. What strategy will we use to achieve those outcomes?

Well, each outcome can be achieved in different ways; only one of those requires buying beanie babies (using Kubernetes).

For the database, we could implement CQRS; we could add ETL processes or data warehousing operations; we could improve the performance of our OLTP transactions so they aren’t blocking OLAP; we could implement a queuing mechanism; and so on.

For scaling, we could break the system out into services and have those services communicate with each other through an event-driven architecture.

Those are strategies: they’re ways to solve the problem in an overarching fashion. Strategies don’t include implementations.

Next are tactics:

What tactics can we use to fulfill the chosen strategy?

That’s where Kubernetes comes in. It’s a tactic for a chosen strategy, where that strategy is: I want to implement microservices, but I don’t want to use a vendor-specific solution to orchestrate them.

Kinda feels like a self-licking ice cream cone when I put it like that, but alas.

So the next time you hear someone say, “We should adopt Kubernetes”, have them put on the brakes and talk you through the pain-points first, then the outcomes they want, then the strategy, and then if all of that works out, you can have the Kubernetes discussion.

Before deciding that you need Kubernetes, let’s talk about where you’re going, and what you want. What outcome do you want? What strategies are on the table? How will they achieve that outcome? How can they be measured?

Once you’ve done that work, then we can talk about which beanie babies you should buy.

The Realities of Microservices

At this point, I have two(+) years of experience with Microservices, and I’m not an expert, but I have some hard-earned knowledge distilled from working with them (and making lots of mistakes in the process). Here’s what I learned that I wish I had known going into it.

Microservices are not mini-monoliths

Jim Gaffigan has a rather funny skit about (American) Mexican food. Listen to it here before I butcher the punchline. The punchline of the skit is that all Mexican food basically consists of a tortilla with cheese, meat, or vegetables. We tend to think of deployable software in that same way. It’s all code, wrapped up with a deployment script, and sent to production. Monoliths are independent, complete applications that fulfill a business function. So what’s a Microservice? An independent, complete application that fulfills a business function. So why aren’t microservices just ‘mini-monoliths’?

The answer comes from the idea that microservices collaborate. A monolith does not rely on another monolith for its uptime, data, or resiliency. It is generally a self-contained view of the world, and by its nature it does not care if anyone else exists. Your company’s website is wholly independent of anything else. More critically, though, multiple teams may work on your company’s website. They share code, branches, and a single production pipeline. Microservices, on the other hand, are independent, complete applications that each fulfill one business function and no more. A monolith fulfills many.

A Microservice understands that while it is independent, there may be zero or more consumers out there interested in what it has to say, and so it is designed with that understanding in mind. A Monolith is not, and does not have to be. Businesses eventually find out that they wish their monolith had been designed to share its information in a de-coupled fashion, but often too late to do anything about it easily.

Microservices are not mini-monoliths; they’re collaborators that operate independently when they need to.

Microservices require a different way of thinking about problem solving

Developers love to write code. We’re so enamored with writing code that we’ll write code even when no one needs us to. We’ll write code to solve nagging problems on our own machines, or to automate silly things, or even write code to solve problems in our households. In fact, I have a new side project to set up a Raspberry Pi as a calendar viewer in my house. This is probably not unique to software development (though maybe it is? Do plumbers re-pipe their houses? Do electricians rewire theirs on a whim?) but the tenor of it is so overdone in software development that we exhort new developers to not write code first.

… And then we ask them to work on a monolith. Monoliths make writing more code easy. It gets to a point where the default state is “find problem”, “write code”, “ship”, without understanding whether or not the problem is best served by a bolt-on or add-on to the existing system. For small things this is not an immediate issue, but those small things add up and become a problem over time.

For instance, if you’ve ever tried to add a CSV import to any existing system, you’ve probably found out within days that the desired “CSV Import” feature is really a “CSV + Domain Specific Logic” import function. Almost as harmful is finding that a ‘bulk’ method of inserting wasn’t part of the original requirements, necessitating a change in the API. In a monolith, it’s really easy to write code to add this functionality that has baked-in assumptions that aren’t clear, and to potentially change the API your system exposes, or how it presents itself to the user. Because of the ease of ‘just’ writing code, it is easy to rush the implementation without regard to the design. Writing code quickly is not the job; solving problems without causing more problems is the job, and a monolith makes that hard to do.

A user wants to add a stock to their portfolio…

Microservices, on the other hand, require up-front planning before code is written, every time. Every new service, and any change to an existing service, may go hand in hand with completely replacing that service. Anything that has the potential to change the contract in a system (whether with the user or other services) requires more understanding and up-front design than the same change in a monolith. To go back to our CSV import example, a potential way of doing it with microservices is to stand up a new CSV importer service that takes in a CSV file, does any domain-specific formatting, and then either emits an event or sends an HTTP request to the correct service, using that service’s existing API for adding/importing information.
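
Here is a minimal sketch of what such an importer could look like. Everything in it is an assumption for illustration: the portfolio service’s existing single-item API is represented by a plain `add_holding` callable (in a real system that would be an HTTP request or an emitted event), and the `Symbol` and `Shares` column names are made up.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class AddHolding:
    """The request shape the (hypothetical) portfolio service already accepts."""
    symbol: str
    shares: int


def parse_rows(csv_text: str) -> Iterable[AddHolding]:
    """Turn raw CSV into requests. The domain-specific formatting lives here,
    so the portfolio service's contract does not have to change."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        yield AddHolding(
            symbol=row["Symbol"].strip().upper(),
            shares=int(row["Shares"]),
        )


def import_csv(csv_text: str, add_holding: Callable[[AddHolding], None]) -> int:
    """Send each row through the existing single-item API; return the row count."""
    count = 0
    for request in parse_rows(csv_text):
        add_holding(request)
        count += 1
    return count


# Usage: a list stands in for the portfolio service's API.
received = []
imported = import_csv("Symbol,Shares\n msft ,10\naapl,5\n", received.append)
```

The importer depends on the portfolio service, never the other way around, which is the direction of coupling you want here.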

And now they want to add multiple through CSV.

Now, these services are necessarily coupled to each other (though the coupling goes in the right direction), and since the contract has not been changed for the original service, the guarantees of the original service are kept intact. Microservices, done well, make it harder to break existing consumers. The trade-off is that more up-front planning is required when designing a solution in a microservices-based topology.

Domain boundaries are critical to Microservices success

There are three general flows to microservices (There may be more; but the types are escaping me right now):
1. Microservices that give new capabilities to an existing domain bounded context (the previous example of adding CSV import for a portfolio service as a separate microservice is an example of this — there are several trade-offs to doing that, and it depends on your constraints and desires)
2. Microservices that represent a stateless process (viz. validating a credit card)
3. Microservices that represent a stateful process or interaction (the portfolio service)

Notice that I said nothing about the size of these services; and depending on whom you speak to, the size of a microservice is a mystery. I have opinions on this, of course; but the one invariant I’ve seen is that good microservices topologies ensure the lines are drawn at the domain’s “bounded context”. This is a fancy Domain-Driven Design phrase that means to split up models and interactions by what they mean. To sales, a customer interaction is quite a different model and mode of interaction than a customer interaction for customer support. By splitting them up by their ‘context’ (and the boundaries being sales and customer support), the software can maintain independent ideas of how to interact with a customer depending on the context.

Martin Fowler’s Illustration of Bounded Contexts, source: https://martinfowler.com/bliki/BoundedContext.html


For microservices, this typically means that your customer support portal will be a different bounded context than your sales funnel; even if they share the same properties of a customer (at least demographically). There are three ways to handle the above problem:

Method 1: Set up a separate service for each context (sales, customer support), each with its own independent customer model. A customer created in one system is not necessarily referenced in the other (or it can be, via customer_id, customer_support_id, sales_id).

Method 1, illustrated.

Method 2: Set up a “Customer” service, a sales service, and a customer support service, and both sales and customer support get customer information from the “customer” service.

Method 3: Set up a customer service, a sales service, and a customer support service; and sales and customer support have duplicated data (received through events) of things that happen in the customer service, but they maintain their own disparate models for what a customer means to them. From a system perspective the internal identifier is the same; how it’s used varies from system to system. This means having a customer service that has demographic information; a sales service that may or may not have this same demographic information but adds on sales context, and a customer support service that maintains this duplicate information but adds on its customer support pieces.

Each method has its own trade-offs; but you can quickly see the maintenance issues with each:

  1. Method 1 has three different representations of a customer, potentially in a different state in each service (a sales person sees a customer before they’ve signed on the dotted line, and a customer support person always has a “post sale” view of the customer). This is OK until you want sales to have the customer support information; then you need to do a bit of juggling to ensure a customer from a sales context is indeed the same customer in a customer support context.
  2. Method 2 allows there to be one representation of a customer, and each service can “add on” to this representation. But each downstream service is still beholden to the customer service, and which context does that live in? Both. There is also a temporal coupling factor, as each service “gets” demographic information from the customer service.
  3. Method 3 allows each service to be de-coupled from the “customer” service. It allows each service to add its own data to what it means for there to be a customer, and it allows each service to change independently (since the customer service emits events, each service can listen for them and update its own model if it wants to). But this also means having a unified contract of what defines the demographics of a customer, ensuring each service is set up to listen to events pertaining to customers, and ensuring each service appropriately handles having been down when a customer event was emitted (event sourcing is a possible solution here).
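
To make Method 3 a little more concrete, here is a rough in-memory sketch. All class and field names are hypothetical, and events are delivered by direct function calls instead of a real bus, so it says nothing about the “service was down” problem. What it does show is the shape: one shared event contract, and two services that each keep their own customer model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CustomerUpdated:
    """Shared contract: the demographic fields every service agrees on."""
    customer_id: str
    name: str
    email: str


class CustomerService:
    """Owns demographics and announces changes; knows nothing about consumers."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[CustomerUpdated], None]] = []

    def subscribe(self, handler: Callable[[CustomerUpdated], None]) -> None:
        self._subscribers.append(handler)

    def upsert(self, customer_id: str, name: str, email: str) -> None:
        event = CustomerUpdated(customer_id, name, email)
        for handler in self._subscribers:
            handler(event)


@dataclass
class SalesCustomer:
    customer_id: str
    name: str
    pipeline_stage: str = "lead"  # sales-only context


@dataclass
class SupportCustomer:
    customer_id: str
    email: str
    open_tickets: List[str] = field(default_factory=list)  # support-only context


class SalesService:
    def __init__(self) -> None:
        self.customers: Dict[str, SalesCustomer] = {}

    def on_customer_updated(self, event: CustomerUpdated) -> None:
        existing = self.customers.get(event.customer_id)
        if existing is None:
            self.customers[event.customer_id] = SalesCustomer(event.customer_id, event.name)
        else:
            existing.name = event.name  # refresh demographics, keep sales context


class SupportService:
    def __init__(self) -> None:
        self.customers: Dict[str, SupportCustomer] = {}

    def on_customer_updated(self, event: CustomerUpdated) -> None:
        existing = self.customers.get(event.customer_id)
        if existing is None:
            self.customers[event.customer_id] = SupportCustomer(event.customer_id, event.email)
        else:
            existing.email = event.email  # refresh demographics, keep tickets


# Usage
customer_service = CustomerService()
sales, support = SalesService(), SupportService()
customer_service.subscribe(sales.on_customer_updated)
customer_service.subscribe(support.on_customer_updated)

customer_service.upsert("c-1", "Ada", "ada@example.com")
sales.customers["c-1"].pipeline_stage = "negotiating"
customer_service.upsert("c-1", "Ada Lovelace", "ada@example.com")
```

After the second update, sales sees the new name but keeps its own pipeline stage; neither downstream service ever asked the customer service for anything.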

None of these methods is “ideal” from an “easiest to develop” standpoint, and they have different levels of maintenance requirements. The crucial questions a team must answer are: what is the domain context, is this <thing> I’m dealing with talked about differently depending on who I talk to, and what is the maintenance cost of each approach?

If the team chooses method #1, then they have a lot of distributed systems problems that aren’t easily solved; they’ve made interacting with the system harder. If they choose #2, then two services depend on a third (not really ‘independent’ at that point), and they’ve added a request/response dependency between services that may not need to exist (and is harder to debug). If they choose approach #3, they have quite a bit of up-front work (defining contracts, defining patterns), but the maintenance work, reasoning about how a service interacts with another service, debugging, and future expansion are far easier.

Developer Tooling doesn’t support Microservices as well as Monoliths

We have about 25 years of experience as an industry creating tooling around building and deploying software, though it’s only really in the last 15-18 years that the tooling has accelerated. Even at 18 years of experience, we have pretty solid tooling around developing and debugging monoliths. Debuggers and IDEs take monoliths for granted, as they likely should. If you write microservices that depend on other microservices over REST, you’re going to have a bad time debugging services locally. Your choices are standing up the parts of the system that collaborate, mocking out external dependencies, or dockerizing the system’s services so that they can be stood up independently. Of course, once you do this you’re diving into mixed networking land for Docker, and there’s not a lot of tooling that can make that experience seamless. A service running outside of Docker that you’re debugging is hard to set up to work with services running inside of a Docker network, or vice versa. Front-end development is even worse, as Node.js is a requirement for building front-ends these days. Try live-debugging with Docker for your UI where the source is kept locally. Not fun. Teams handle this problem in different ways, but the point is that this problem exists, and the solutions are not as mature as debugging a monolith.

If you use microservices, you need to allocate a sizable chunk of time to building the tooling necessary to allow people to develop against those services.

Deployment requires better tooling with Microservices

Deployment considerations are key if you want a fast-moving organization. You can’t respond to change without being able to change your software quickly. Even if you can develop changes quickly, if you can’t deploy them quickly you aren’t a fast-moving organization. Continuous Integration (CI) and Continuous Delivery (CD) are essential to being able to respond to change. CI/CD products reflect a deployment view of the world that is monolithic in nature. Source control is built for it, CI/CD systems are built around it, and pretty much every commercial CD system is built with monoliths in mind. There are several deployment models where microservices are used, and none of them have good tooling for microservices.

  1. Deploy on-premises as a packaged solution
  2. Deploy to the cloud independently
  3. Deploy to the cloud as a packaged solution

If you sell your product to customers, and they run it in their own data center, deployment method #1 is what you often deal with. Your solution must be packaged up and deployed together as a single unit. Should this necessitate that you develop as a monolith? No. It shouldn’t. However, if you have microservices, you necessarily have multiple deployable artifacts, and your CD pipeline must take that into account. Whether those artifacts live in a mono-repository (all services in one source control repository) or in micro-repositories (each service in its own source control repository) is a separate matter. The trade-offs change depending on whether it’s a micro-repository or mono-repository, but they still exist as problems not solved by current tooling. For instance, tagging master or a release branch with what is in production, your promotion model to different internal environments, and even local deployments all need to be taken into account by the tooling. If you choose method #2 and combine it with continuous delivery, some of those trade-offs go away: you can make a rule that the latest in master is always pushed to internal promotion environments, and that the only tag happens after a particular commit has been pushed to production. But again, tooling is still lacking to make this a seamless experience.
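
As one small example of the glue a team ends up writing themselves, here is a sketch of deciding which services a mono-repository commit should redeploy. The `services/<name>/` and `shared/` directory layout and the service names are assumptions for illustration, not a convention any particular CD product enforces.

```python
from pathlib import PurePosixPath
from typing import Iterable, Set

# Hypothetical service names living under services/<name>/ in the repository.
ALL_SERVICES = {"portfolio", "csv-importer", "customer"}


def services_to_deploy(changed_paths: Iterable[str]) -> Set[str]:
    """Map the files changed in a commit to the services that need a new build."""
    affected: Set[str] = set()
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        if len(parts) >= 2 and parts[0] == "services" and parts[1] in ALL_SERVICES:
            affected.add(parts[1])
        elif parts and parts[0] == "shared":
            # Assumed rule: shared code can affect anything, so rebuild everything.
            return set(ALL_SERVICES)
    return affected
```

With micro-repositories this function disappears, but the bookkeeping moves elsewhere (for example, tracking which version of each repository is in which environment).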

Microservices deliver on the promises of Object-Oriented Programming

I didn’t understand the hype of object oriented programming. I understood the fundamentals of encapsulation, abstraction, inheritance, message dispatching, and polymorphism, but I didn’t understand why they were so useful (I started with Perl, and then moved to Java, so I had nothing to compare Java’s OO nature to. At the time it just seemed like more work to do the same things I could do in Perl. Ahh, youth). The SOLID principles helped later on, but I always felt like there was more hype to OO than actual benefit. After several jobs maintaining and creating Object-Oriented solutions, I was convinced that Object Oriented Programming was a pipe-dream. To the 80% of us who are not “expert” programmers, it is a fad we can never make full use of and it causes more harm than good.

That was until I started researching microservices. This was it! A fully independent object that had agency that could collaborate with others; but encapsulation was ensured! The Open/Closed principle was a requirement! Single responsibility was almost ensured just by the nature of the service! (It says “micro” on the tin) Inheritance was far simpler — consume what the service gives you and modify it to suit your needs (the CSV example above). You couldn’t share information unless you had a common contract and used some sort of message dispatching!

This was absolutely huge for me. All of those principles that I’d been trying to bring to reality for years in codebases I’ve worked on were here — and best of all they didn’t have the downsides of OOP in practice! It’s really easy when modifying code to do something that breaks encapsulation, and business pressures make it even easier. With Microservices, that was no longer possible. Sure, other business induced pressures might cause problems, but they couldn’t alter the contract of a service; and that allowed the system to be reasoned about in ways OOP promised. Perhaps best of all, microservices put up guard rails that keep the mistakes of OOP from happening, and we’re all better for it.

Contracts, Patterns, and Practices should be Code generated

If you do something once, do it manually. If you do it twice, write down the steps, and do it manually. By the third time, automate it. Producing even a dozen services means either manually enforcing the structure of contracts (the format by which services communicate with each other or to the user), patterns (how you structure common infrastructural concerns), and practices (how you write software), or code generating it for commonality. If you don’t code generate it, entropy wins. Even across features, services start to do the same thing in different ways; or you find new patterns for structuring your events, and depending on which service you’re in, you could see a different pattern. It’s untenable from a development and maintenance perspective.

Method #3 above shows a world where the Customer Service emits events when a customer is added or updated, allowing interested services to listen for changes and update their own data stores as necessary. Without code generation this would be a tedious, error-prone process. With code generation and schema-defined models, this is a viable development model.
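
A toy version of the idea, to show the moving parts: a schema is the single source of truth, and a generator renders model code from it. The schema format and the `generate_dataclass` function are invented for this sketch; real tooling would read a proper schema language and emit code for several target languages.

```python
from typing import Any, Dict

# Hypothetical schema for the customer event from Method #3.
CUSTOMER_UPDATED_SCHEMA: Dict[str, Any] = {
    "name": "CustomerUpdated",
    "fields": {"customer_id": "str", "name": "str", "email": "str"},
}


def generate_dataclass(schema: Dict[str, Any]) -> str:
    """Render Python source for a frozen dataclass from a schema definition."""
    lines = [
        "from dataclasses import dataclass",
        "",
        "",
        "@dataclass(frozen=True)",
        f"class {schema['name']}:",
    ]
    for field_name, type_name in schema["fields"].items():
        lines.append(f"    {field_name}: {type_name}")
    return "\n".join(lines) + "\n"


source = generate_dataclass(CUSTOMER_UPDATED_SCHEMA)

# A real generator would write `source` to a file in each service's repository.
# exec() is used here only to show that the generated code is valid Python.
namespace: Dict[str, Any] = {"__name__": "generated_contracts"}
exec(source, namespace)
CustomerUpdated = namespace["CustomerUpdated"]
event = CustomerUpdated(customer_id="c-1", name="Ada", email="ada@example.com")
```

Adding a field then means changing the schema and regenerating, not hand-editing a model in every service.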

Can you imagine trying to update any model/contract without code generation?


There are only two sane paths: package the commonalities (which can really only be done for dependencies) into utility functions, or code-generate everything.

Packaging utility classes/models (like the customer model and the events above) is a valid approach. The concerns with using it are taking on dependencies (even internal ones), the overhead of internal infrastructure, and the fact that every service would be required to be in the same programming language.

The latter path (code generation) is exactly what Michael Bryzek advocated in his talk Design Microservice Architectures the Right Way, and coming from trying the other paths (packaging common functionality, and doing it manually), I can see its utility. The trade-off, of course, is that developing the code generation tooling is a heavy investment of time. It requires discipline to develop this tooling first without trying to develop features, and it would likely result in no visible movement on things the business cares about (features, revenue, etc). Code generation also means that, as long as you have tooling to support a given language, you can implement those models in any language you’d like.

You can’t punt on non-functional requirements

There are lots of non-functional requirements in a system that never appear on the roadmap, are never spoken about at the sales meetings, and are only tolerated by the product manager. Things like a user should be signed out after fifteen minutes; or the authorization system should incorporate roles and location; or some data is transient and not part of the backup strategy, and other data needs to be backed up every minute. Or, the system must allow 5000 concurrent users at a time. Those are non-functional requirements; they’re qualities of the system that aren’t part of the user-facing features being developed.

In a Monolith, there are typically very few places to go to implement a non-functional requirement. As we’ve discussed previously, IDE tooling is built for the refactoring necessary to ensure a change takes place everywhere it’s needed (only for the statically typed languages; the dynamic folks have their own problems to contend with), and even if you have to implement a new feature, there’s generally one place to do it.

Not so with microservices. If you implement authorization, you must implement it across all services. If you implement a timeout, you must implement it across all services. Unless your microservices are spread across hosts, any performance improvements must take into account that each service may share host resources with one or more other services. If each service is using the same server instance (i.e., every service that uses Postgres shares a Postgres server instance, even if they’re separate databases in that instance), then performance tuning and backups must take that into account. This greatly complicates performance tuning and dealing with non-functional requirements, and for the system to be easily built, those non-functional requirements need to be known at the beginning! Every delay in implementing a non-functional requirement makes it more likely that disparate changes will need to be made across several services, and that will take much longer once the services are built.

Event Driven Programming makes microservices work

In firmware programming, the finite state machine and events got me through the day. Each peripheral has separate states, and those are triggered by events that may happen from user input or other peripherals (for instance, seeing a Bluetooth advertisement from a whitelisted address may trigger a connection). Since firmware by and large sits on a single-core System-on-Chip with limited or no use of threads, an event loop combined with finite state machines is one of the best ways to make firmware work.

Finite State Machines coupled with Event Driven programming also have other nice properties that carry over well to microservices: events ensure each service is de-coupled from the others (there are no direct request/responses between services), and Finite State Machines dictate what happens based on the current state of the service plus its input. This makes debugging a matter of knowing which state the service is in, and what input was received. That’s it. This greatly reduces the complexity in standing up and debugging services, and allows problems to be de-composed into events and states. If you add event sourcing into the mix, you have an event stream that records the events that occurred, so playing back issues is as simple as replaying events.
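
A stripped-down sketch of that combination, borrowing the Bluetooth connection example from the firmware paragraph. The state and event names are made up for illustration. The whole machine is a lookup table keyed on (current state, event), and “replaying an issue” is just feeding a recorded list of events back through it.

```python
from typing import Dict, Iterable, Tuple

# (current state, event) -> next state. Hypothetical states and events.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("idle", "advertisement_seen"): "connecting",
    ("connecting", "link_established"): "connected",
    ("connecting", "timeout"): "idle",
    ("connected", "link_lost"): "idle",
}


def step(state: str, event: str) -> str:
    """Return the next state; events that are not valid in this state are ignored."""
    return TRANSITIONS.get((state, event), state)


def replay(events: Iterable[str], initial: str = "idle") -> str:
    """Rebuild the current state by replaying a recorded event stream."""
    state = initial
    for event in events:
        state = step(state, event)
    return state


# Usage: a recorded stream with one failed connection attempt, then a success.
recorded = ["advertisement_seen", "timeout", "advertisement_seen", "link_established"]
final_state = replay(recorded)  # "connected"
```

Because the next state depends only on the current state and the incoming event, a bug report reduces to “which state, which event”.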

This is possible because microservices operate on network boundaries. In a monolith you’re forced to debug the entire monolith at once, and hope someone didn’t write code with disastrous side-effects that are impossible to find through normal means. It’s easier to find a needle in a small jar of needles than a giant haystack, and that’s possible because of the observable boundaries of microservices and using patterns that limit the amount of complexity involved in arriving at a certain state.

If you’re going to start writing microservices; I highly recommend going down the path of event-driven programming, state machines, and some sort of event stream (even if you decide against event sourcing).

Choosing between REST and Events for supporting Microservices is tougher than you may think

If you’ve read the fallacies of distributed systems, then this section almost writes itself. Microservices are distributed systems, no matter how you shake it. One of the major problems when communicating across a network boundary is “is that service down, or am I just having a network timeout?” If you’re using REST, this means implementing the circuit-breaker pattern with some sort of timeout. It also means that if your services communicate with services that communicate with services through REST, then the availability of that chain shrinks with every hop, and for a long enough chain it will eventually hover just above zero (00:00-12:31). As the video rightfully says, don’t do that. I’d go so far as to say that, if at all possible, don’t make calls to other services through REST.
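
Two quick sketches to put numbers and code to that. The first assumes failures are independent, in which case the availability of a synchronous chain is the product of the availabilities of its links: ten hops at 99% each comes to roughly 90.4%. The second is a bare-bones circuit breaker. The class and parameter names are my own, not from any library, and the remote call is expected to raise an exception when it fails or times out.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def chain_availability(per_service: float, hops: int) -> float:
    """Availability of a synchronous call chain, assuming independent failures."""
    return per_service ** hops


class CircuitOpenError(Exception):
    """Raised when calls are rejected without contacting the remote service."""


class CircuitBreaker:
    def __init__(self, max_failures: int, reset_after: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, remote: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = remote()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit again
        return result
```

The breaker doesn’t make the downstream service any more available; it only stops the caller from piling up slow, doomed requests while it is down.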

If you need data, have the service publish an event, and consume that event. This sounds great; it’s de-coupled, and it’s resilient to failure. However, each service must now have means to publish to a bus, consume an event off of a bus, and support whatever serialization scheme you want to use. Oh, and now you need to be able to debug all of the above. If you want runtime resiliency, you must sacrifice development simplicity to get there.
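
To give a feel for that extra surface area, here is a deliberately tiny stand-in for a bus. The envelope fields (`type`, `version`, `payload`), the JSON encoding, and the `InMemoryBus` class are all assumptions for illustration; a real broker adds delivery guarantees, ordering, and retries on top.

```python
import json
from typing import Any, Callable, Dict, List

Handler = Callable[[Dict[str, Any]], None]


def encode(event_type: str, payload: Dict[str, Any]) -> bytes:
    """Wrap a payload in a small envelope and serialize it for the bus."""
    envelope = {"type": event_type, "version": 1, "payload": payload}
    return json.dumps(envelope).encode("utf-8")


class InMemoryBus:
    """Stand-in for a real message bus; it only ever sees raw bytes."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}
        self.unhandled: List[bytes] = []

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    def publish(self, message: bytes) -> None:
        envelope = json.loads(message.decode("utf-8"))
        handlers = self._handlers.get(envelope["type"], [])
        if not handlers:
            self.unhandled.append(message)  # keep it around for debugging
        for handler in handlers:
            handler(envelope["payload"])


# Usage
bus = InMemoryBus()
seen: List[Dict[str, Any]] = []
bus.subscribe("customer.updated", seen.append)
bus.publish(encode("customer.updated", {"customer_id": "c-1"}))
bus.publish(encode("customer.deleted", {"customer_id": "c-2"}))  # nobody listens
```

Even in this toy, every service needs an encoder, a decoder, a subscription, and an answer for “what happens to events nobody handled”.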

Maintaining Microservices requires strong organizational and technical leadership

“The business” does not care what the topology of your system is. They don’t care about its architecture, and they don’t care about how easy to maintain it is, any more than you care whether they use Excel or Quickbooks for forecasting. The business wants two things (really it’s n of 9 things) but work with me here:

  1. Increase Revenue
  2. Reduce costs

They believe more features will increase revenue. It’s a fair belief (though correlation does not imply causation), but more features also increase development costs. To “the business”, the way to solve this problem is not by reducing the costs, but by increasing revenues. Again, this is also fair, and in a good number of cases is the right path.

Earlier, I mentioned that microservices keep those nasty shortcuts that cripple development teams from happening, and that’s a good thing, but, to the business, it can also be a bad thing. See, that crippling shortcut may never happen; but adding that feature (to their way of thinking) will increase revenue. If they have to choose between helping revenue but possibly hurting future maintenance, or delaying that feature by several weeks but helping future maintenance, they’ll pick the path to fastest revenue, every time.

The person or people that keep this from happening are hopefully the organization’s CTO and engineering leadership (VP or Director of Engineering, the Architect, and senior leaders of the team). They’re the people with the cachet and experience to know when this is going to hurt future maintenance, and they hopefully know enough to know it’s probably not a sure revenue bet either. But this requires discipline and trust on the part of the engineering leadership team. They must have gained the trust of the business by delivering what the business wants in the timeframe they want it; and they must be disciplined enough to stick to their guns. If someone says, “Well, we could do this in a week if we just hooked Service A up to Service B’s database”, you have now failed with microservices and are maintaining a future monolith. You’ve also lost the advantages of working with microservices.

Shortcuts are easy to say yes to, and shortcuts can greatly endanger the maintainability and health of a development team and the system.

Microservices are a technical solution to an organizational problem

While developers and consultants tend to espouse microservices in a cloud scenario, they tend to ignore that microservices are orthogonal to their deployment scenario, and they’re orthogonal to technology stacks. Take away all these advantages of microservices, and you’re still left with a topology that allows you to segment teams along domain boundaries, and have those teams operate independently of one another. At a small enough scale, you could even have individuals own services and scale out your feature creation to the number of people in your development organization. The Mythical Man-Month states that adding people to a late project makes it later, and it says that because those people have to communicate with each other. What if they didn’t? Or what if you could reduce the amount of communication needed to ship a feature? Microservices let you do that. (I fall firmly in the micro-repository camp as well, so I’m about to conflate the two on purpose.) Microservices development means independent repositories, and fewer issues with merge conflicts, branching, or collaboration needing to happen to push out a particular feature. It also means fewer avenues for the feature to clash with existing features, since by definition the service is independent and autonomous. It means fewer parts to reason about, and that results in faster development time.

Microservices (when architected well) let you go faster and further than you otherwise could, with less need to put organizational guardrails on the development team (code reviews, gated check-ins, code freezes) to resolve team performance issues. They minimize the effect a single developer can have on the whole system. This is a great benefit if the organization does not hire well or pay well (and if every organization did, we’d have a low turnover rate in software development), as it substitutes technology for some of the human training and improvements that organizations should do but don’t do.

If you have all top-notch performers in a high-performing engineering organization with a high performing business with no turnover, you don’t need microservices because you’re not going to make the mistakes that microservices would fix. If, however, you’re in an organization that consists of humans that are fallible, microservices provide a benefit to development that monoliths cannot.

Closing

Microservices are another tool to help make software development better and to make systems easier to maintain. They provide many benefits and have many trade-offs with traditional monoliths, and it’s rarely clear whether or not a system should be developed as a monolith or as microservices. There are several factors that can steer the choice towards one or the other; but those factors depend greatly on the individuals, organizational leadership, business model, constraints, and politics of the organization implementing those services.

These are the things I wish I had known when I started with microservices. What do you wish you had known about Microservices before working with them?

Note: Special thanks to Adam Maras for spending part of his weekend giving me feedback on this post.