Pillars and Pods

One of the fundamental tasks of an engineering leader is to define an organizational structure that consistently delivers high-quality results. It is well established that durable teams with high psychological safety outperform other teams, even those with extraordinary individual talent (see Google’s Project Aristotle). For most organizations that means durable feature teams or Scrum teams that include the multiple disciplines required to deliver software from concept to customers. I was recently involved in a reorganization that defined an organization in a Pillar + Pod model. In this model the feature delivery teams are temporary and exist only for the lifecycle of a single project. The durable teams are aligned around horizontal technology areas.

Situation

Automox is a cloud-native patching SaaS that allows IT Administrators to manage patching across a distributed fleet of devices. The tech stack is made up of an Agent that runs on the managed devices, a cloud platform that sends commands to the agent to execute, a public API used to manage devices and patching, and a Vue.js frontend to provide a high-quality UX without writing a custom API client. Before the reorganization engineering was structured as a Platform organization and an Applications organization. The Platform organization was tasked with building out an Agent-Based platform that made it easy to execute actions on a distributed set of devices, at a scheduled time, and emit responses back so Application teams could make business decisions on reporting and next steps. The Application teams could then ignore the complexity of coordinating and scheduling work across millions of devices that are connecting and disconnecting across the internet and focus on delivering customer-facing features.

Complications

In order to maximize velocity we wanted each feature team to be able to deliver customer value without depending on other feature teams. This worked great for changes with relatively contained impact. For instance, if the App Framework team wanted to deliver an improved Authorization system for app teams to use they were empowered to deliver it without a dependency on the Platform or other Application teams. However, when we wanted to make a change in the Patch system we ended up needing to change the command executed by the agent, the communication protocol, scheduling, workflow, etc.

The reason was that our organization was disconnected from the technical reality. We didn’t have a Platform that provided a generic feature set to our patch product. We had a patch product that had core functionality spread throughout the feature set. A platform/application architecture was the long-term goal, not the reality today. We were attempting an Inverse Conway Maneuver (see Team Topologies). We were trying to structure our teams for the architecture that we wanted to see emerge. However, it meant that the Patch and Reporting teams, which are core to our business, were not able to quickly deliver features to customers because they were always waiting on changes to be made by lower-level teams, or coordinating changes with those teams.

We had over-optimized our organization for the long-term end state. But in the short term we weren’t delivering the features that our customers needed today. We decided as a business that we couldn’t make our customer base wait another year for the expected benefits.

New Organization

We had identified two main organizational issues.

1. We were diluting focus between new features, new architecture, and maintenance work. Some features were delayed for customers because we pinned their delivery to new architecture. New architecture changes were delayed due to customer escalations, bugs, and feature deadlines.
2. Key changes that needed to be made to move forward involved coordinating work between multiple teams. These scrum-of-scrums projects combined with engineers swapping between priorities (point #1 above) meant that deadlines were routinely missed.

Improving focus

We improved focus with two moves. First, we split out a new organization to focus 100% on building out a new platform that allowed improved scalability and feature velocity. This is a small team focused on the long-term architecture and migrating our current feature set to it.

For the existing organization, we still needed to add new features and do maintenance work on the existing code base. Previously we had feature teams own both of those responsibilities, but what we found was that bugs and production incidents impacted our ability to hit our delivery dates. In the new organization we established that some people would be 100% focused on features and some 100% on bugs, escalations, tech debt, scalability, and developer experience issues.

Enter Pillars and Pods

We knew that we wanted to carve out sustainment work into a focused role. We could have said that one feature team owned sustainment every quarter while the rest focused on feature work. We could have defined a set support team that handled sustainment long-term. But we also needed to solve the issue that we couldn’t add the kinds of features that our customers wanted with our current teams.

Take our Patch Application team. Usually they could operate with just UI, API, and Patch Platform engineers. But sometimes they needed to change the way that the agent acts to be aware of a new configuration. Or to pass additional information for a reporting need. And where exactly did you draw the line between patching and reporting anyway? The reporting team could only be as good as the data that they received. If they needed additional data points or fields they would have to either coordinate the change with the patch team (or agent team) or make the change themselves.

What we realized was that these seemed like corner cases initially, but as we talked about the features that we wanted to take on over the next year most projects needed a unique skillset to make all the changes required.

And that’s when we realized that static team definitions assumed that the shape of work we were doing was relatively stable over time. But our work wasn’t stable. We were making large cross-cutting changes as part of strategic investment into the future. And we still needed to solve the sustainment issue. Because realistically a single team couldn’t effectively handle bugs and escalations coming in across the entire product. We had outgrown that years ago.

So what if we didn’t have static feature teams? What if we assembled the teams that we needed for the projects?

So that’s what we did. Instead of having our reporting structure match our execution structure, we went full matrix. We defined pillars based on skillset and domain knowledge. Each engineer reports to a manager that leads a pillar. When we want to build a new feature we grab the right number of people from each pillar and create a pod. When the project ends then the pod disbands and folks either move into a new pod or work on sustaining work for their pillar.
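The mechanics above can be sketched as a small data model. This is only an illustration under assumptions, not a tool we actually ran: the class names, the pillar names ("frontend", "api", "agent"), the pod name, and the engineers are all hypothetical. The point it shows is that the pillar is a permanent attribute of an engineer while the pod is a temporary one.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Engineer:
    name: str
    pillar: str                  # durable reporting line (skillset / domain)
    pod: Optional[str] = None    # temporary project assignment, if any


@dataclass
class Org:
    engineers: list = field(default_factory=list)

    def available(self, pillar):
        """Engineers in a pillar who are not staffed on a pod (doing sustaining work)."""
        return [e for e in self.engineers if e.pillar == pillar and e.pod is None]

    def form_pod(self, pod, needs):
        """Staff a pod by taking the requested head count from each pillar."""
        picked = []
        for pillar, count in needs.items():
            free = self.available(pillar)
            if len(free) < count:
                raise ValueError(f"{pillar} pillar cannot spare {count} engineer(s)")
            picked.extend(free[:count])
        # Assign only after every pillar's request is known to be satisfiable.
        for engineer in picked:
            engineer.pod = pod
        return picked

    def disband(self, pod):
        """End of project: everyone returns to their pillar's sustaining work."""
        for engineer in self.engineers:
            if engineer.pod == pod:
                engineer.pod = None


# Hypothetical example data.
org = Org([
    Engineer("Ana", "frontend"), Engineer("Bo", "frontend"),
    Engineer("Cy", "api"), Engineer("Di", "agent"),
])
pod = org.form_pod("patch-scheduling", {"frontend": 1, "api": 1, "agent": 1})
```

In this sketch Bo is the frontend engineer left in the pillar for sustaining work. Calling `org.disband("patch-scheduling")` frees everyone up for the next pod. A real pillar manager would pick people on judgement rather than list order, which is the subject of the Dynamic Staffing section below.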

Pros of Pillars and Pods

Focus

Each pod is focused on solving a single problem or moving a single KPI. Individual contributors on the pods aren’t balancing escalations, maintenance work, and bugs. That work is handled by people in the pillar who are not staffed in a pod. It works the other way too. Folks who are not assigned to a pod can focus 100% on sustainability work without having project deadlines to hit.

On-Call rotations

One complexity with full-stack feature teams that own areas of the product is how to handle on-call. A sustainable on-call rotation typically needs at least 5 engineers. But you also want everyone on-call to be responsible for code that they have direct impact on. If you have a full-stack JavaScript codebase where everyone works on frontend and backend then everyone can be on-call and respond to alerts across the stack. However, if you have engineers that only work on frontend they should not be on-call for backend systems. Now you need a larger team to staff an on-call rotation. And at that point, is the team really a single team? Or are you running multiple projects simultaneously?

Centralized areas of excellence

If you have a single team that is composed 100% of frontend engineers you can have an engineering manager developing technical skills across the team. You can centralize decisions. You can share knowledge and best practices.

Dynamic Staffing

In a lot of ways, the Pillar and Pod model is internal contracting. The engineering manager that runs a pillar gets a request to staff a project and makes a decision about who from the team is best fit to tackle the project. This means that you can choose to put your strongest engineer on the hardest project. Or give a junior engineer a chance to stretch their wings. Or if someone is going on vacation you can make sure they are not on a critical project. In the old model if you had a feature team with 2 frontend engineers it was much harder to give them challenges or career opportunities. They worked on the projects assigned to the feature team or transferred to a new team.

Cons of Pillars and Pods

The delivery team has to re-gel every quarter

This is the main downside. You can’t avoid the cost of having new teams forming and norming if you want to assemble them dynamically.

The pillar has a high chance of cognitive overload

This is one risk that needs to be mitigated. If you have a single team that owns the API, for instance, that team needs to have the context for every feature that exists in the product. If you go into a rapid growth period and triple the number of features in the product, the context the team has to hold will grow beyond their maximum cognitive load. Long term, pillars will have to be split to constrain cognitive load to a reasonable amount. But the further you subdivide your pillars, the less ability you have to staff pods. If you have 3 pillars with 10 engineers each you can staff 5 pods with 1-2 engineers from each pillar and still have capacity for sustaining work. If you have 10 pillars with 3 engineers each you can staff 1 pod.
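The staffing arithmetic can be made concrete with a rough sketch. It assumes the same 30 engineers split either into a few large pillars or many small ones, that every pod needs 1-2 engineers from every pillar, and that each pillar holds a few people back for sustaining work. The function name and the reserve sizes (2 for a pillar of 10, 1 for a pillar of 3) are my own assumptions for illustration.

```python
def pods_staffable(pillar_size, reserve, requests):
    """Count how many pod requests one pillar can fill, in priority order.

    pillar_size: engineers in the pillar
    reserve:     engineers held back for sustaining work (assumed value)
    requests:    head count each pod wants from this pillar
    """
    free = pillar_size - reserve
    staffed = 0
    for need in requests:
        if need > free:
            break  # this pod and every lower-priority pod goes unstaffed
        free -= need
        staffed += 1
    return staffed


# Five candidate pods, each asking for 1-2 engineers from the pillar.
requests = [2, 2, 2, 1, 1]

large_pillar = pods_staffable(pillar_size=10, reserve=2, requests=requests)
small_pillar = pods_staffable(pillar_size=3, reserve=1, requests=requests)
```

With these assumed numbers the pillar of 10 fills all five requests and the pillar of 3 fills only the first. Since a pod needs someone from every pillar, the smallest pillar sets the ceiling on how many pods the whole organization can run.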

Teams are no longer able to own a feature long-term

One compelling benefit of having ownership of a feature area is that your domain expertise will increase over time. You will develop hypotheses about ways to improve the functionality for customers. If the product manager is the only permanent owner of a feature area then they are the only person consistently thinking about hypotheses on how to improve KPIs.

Building a successful pod culture

Individual Contributors

In a feature team model there was a single engineering manager who was accountable for delivery and execution in the feature team. In the pod model there are engineers representing multiple pillars and no EM associated with the pod. The engineers from each pillar need to be able to make decisions day-to-day on how to implement their part of the feature in a way that fits the best practices and long-term sustainability of their pillar. They need to know when and how to escalate issues that are slowing down the pod. They need to take on a leadership role. As such, it’s important to have self-starting, highly communicative engineers that have good judgement.

Engineering Managers

Engineering managers are disconnected from individual projects with the pillar + pod model. Instead they are influencing every pod that has a representative from their pillar. Their main mechanism for improving pods is through training and coaching their ICs. This is a really powerful forcing function for having people-oriented managers. You can’t be in every standup and decision meeting.

Would I advocate for Pillars + Pods?

Yes. Sort of. In hindsight I believe that most of the effective teams I have been on in the past have used similar principles either across teams or inside of a team. You need a large number of people to staff an on-call rotation, but features need 2-3 engineers. So you end up defining a pillar like a “Payments” team and then have sub-teams tackling project work. And some engineers are doing “Blocking and tackling” so the rest of the engineers can focus on features.

We need to separate out maintenance from feature work. We need to have big on-call groups and small execution groups. It is unclear if you need to do this across teams. Don’t be beholden to aligning your reporting structure and your execution structure. Have big pools of folks that are cross-trained and easily swapped into smaller execution teams. Let your team push hard on a feature and then take a tour of duty working on bugs and sustainability for a bit to avoid deadline burnout.

Don’t be afraid to be a little more dynamic in your staffing. But try to get to a point where there is an element of trust between people who end up on a pod. Build it into your culture and processes. Continue to drive for an architecture where a small group of individuals can make radical changes.

Further thinking

How does this work with Dunbar’s number? If we keep the set of pillars small enough and pull engineers from that set into pods do we avoid some of the cost of forming and norming teams?