Applying different architecture lenses in software systems
When building a product as complex and multi-faceted as an Everything App, powered by a large-scale backend system, our team of engineers needs to evolve our systems in a fast-paced, changing domain. Design patterns help us achieve this, and in this article we look at how applying them guides the ongoing evolution of our distributed systems.
Introduction
Software design patterns are general, reusable solutions to common problems that arise during the design and development of software, and are typically applied to organise the codebase of a single application.
Unlike physical systems, there are unlimited choices a software engineer can make while building software systems. The concept of a design pattern describes common approaches to follow in building software, what value they add, and when they can be useful. Of course, not every pattern is valuable in every context, and none should be applied indiscriminately. Design patterns help software engineers make use of higher-level constructs that are well known in the software industry.
Similarly, in the field of distributed systems, different architecture patterns have their characteristic advantages as well as downsides. There are huge opportunities for learning when assessing those patterns and their common trade-offs across different architectural lenses. Two lenses in particular present such opportunities:
- Application architecture: How you organise classes and objects in an application codebase as a single deployable unit.
- Distributed systems architecture: How you organise many applications, microservices, and other system components, and how they interact with one another to achieve the overall desired outcome.
Of course, the dynamics of these two dimensions are different. We therefore cannot generalise and claim that a pattern that works well in one dimension will work just as well in the other. For example, in a single codebase one would avoid duplication by extracting shared logic into common packages and using it across the codebase. But the same principle should not always be applied in a distributed system architecture, since it introduces undesired coupling between otherwise independent microservices. In distributed systems, a little duplication is better than a little coupling.
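As a toy illustration of this trade-off, consider two hypothetical services that each keep their own narrow model of an order instead of importing one from a shared package (the service names and fields here are illustrative, not Careem's actual schema):

```python
from dataclasses import dataclass

@dataclass
class CheckoutOrder:
    """Owned by a hypothetical checkout service; only the fields it needs."""
    order_id: str
    amount: int            # minor currency units
    payment_method: str

@dataclass
class LogisticsOrder:
    """Owned by a hypothetical logistics service; its own view of the order."""
    order_id: str
    drop_off_address: str

# Duplicating `order_id` across both models is harmless. A shared
# "common-models" package would couple both services to one schema and
# force coordinated releases whenever either side needs a new field.
checkout_view = CheckoutOrder("o-1", 4200, "card")
logistics_view = LogisticsOrder("o-1", "1 Main St")
```

Each service can now add, rename, or drop fields in its own model without a lock-step deployment of the other.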
Being conscious of such differences in dynamics, the thought process can be quite useful nonetheless. Let’s take a look at some examples:
The big ball of mud
The Big Ball of Mud is a term for a state in which a software system lacks any perceivable architecture. The system is full of spaghetti code, where every piece of code can talk directly to any other piece. If you graph these interactions, the nodes form what we call “the big ball of mud.” This is a clear sign of unregulated growth and a lack of refactoring in the application's codebase.
However, this can also apply to the interactions of microservices. Imagine hundreds of services where each service can talk directly to any other service. Guess what you see when you draw such a dependency graph? Yup, a big nasty ball of mud.
With engineering teams obsessed with the physical boundaries of their services, it can be quite easy to grow into such a situation. After all, the path there is just unregulated growth, a lack of refactoring, and patching in quick fixes and capabilities for a long enough period of time. But unlike application code, there is no static code analysis that would catch the increased complexity or cyclic dependencies.
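A minimal sketch of the missing "static analysis" for service dependencies: if teams declare which services each service calls, a simple depth-first search can flag cycles in that graph (the service names below are hypothetical):

```python
def has_cycle(deps: dict[str, list[str]]) -> bool:
    """Return True if the directed service dependency graph contains a cycle."""
    visiting: set[str] = set()   # nodes on the current DFS path
    done: set[str] = set()       # nodes fully explored, known cycle-free

    def visit(node: str) -> bool:
        if node in done:
            return False
        if node in visiting:
            return True          # back edge: node is on its own call path
        visiting.add(node)
        for dep in deps.get(node, []):
            if visit(dep):
                return True
        visiting.remove(node)
        done.add(node)
        return False

    return any(visit(n) for n in deps)

# service-a -> service-b -> service-c -> service-a is a mud-ball smell:
mud = {"service-a": ["service-b"], "service-b": ["service-c"], "service-c": ["service-a"]}
layered = {"service-a": ["service-b"], "service-b": ["service-c"], "service-c": []}
```

Running such a check in CI against declared dependencies is one way to regain the guardrail that compilers and linters give you inside a single codebase.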
On the other hand, a modular monolith gives more structure to the code. Despite still being a monolith, and doing too much when compared to the world of microservices, it is quite maintainable and possible to evolve; it is not an anti-pattern in and of itself, unless it grows into a ball of mud.

A similar concept in distributed systems is to encapsulate a group of services that act together to achieve a high-level capability, and to recognise them as one system responsible for a certain business domain. In such systems you do not deploy all components at the same time; re-architecting happens by gradually adjusting the scope and touch points of internal system components while maintaining the overall outcome of the system, not of a single service or component. This way of thinking encourages refactoring, or re-architecting in this context.
An example is our payments system, which allows merchants to request payments for the services they provide and allows customers to pay for those services. The payments system has a facade that simplifies the touch points with both merchant and consumer apps while hiding the internal details of how this is implemented. It is perceived as a standalone “module” plugged into the overall Careem systems, and it can evolve independently.
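The facade idea can be sketched as follows. The class and method names here are hypothetical, not Careem's actual payments API: callers only ever see `PaymentsFacade`, so everything behind it can be re-architected freely.

```python
class _FraudChecks:
    """Internal component; not part of the public contract."""
    def approve(self, amount: int) -> bool:
        return amount < 1_000_000      # toy rule for illustration

class _LedgerService:
    """Internal component; not part of the public contract."""
    def __init__(self) -> None:
        self.entries: list[tuple[str, int]] = []

    def record(self, merchant: str, amount: int) -> str:
        self.entries.append((merchant, amount))
        return f"pay-{len(self.entries)}"

class PaymentsFacade:
    """The only touch point exposed to merchant and consumer apps."""
    def __init__(self) -> None:
        self._fraud = _FraudChecks()
        self._ledger = _LedgerService()

    def request_payment(self, merchant: str, amount: int) -> str:
        if not self._fraud.approve(amount):
            raise ValueError("payment rejected")
        return self._ledger.record(merchant, amount)
```

Splitting `_LedgerService` in two, or merging it with `_FraudChecks`, changes nothing for callers as long as `request_payment` keeps its promise.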
Thinking of systems as first-class citizens
At Careem, we acknowledge the importance of going beyond services and thinking in systems, so that we can measure overall reliability across the app and provide a reliable end-to-end experience to our customers. This is one of our core design principles, and it is why the building blocks of our infrastructure revolve around systems. But what does that mean? Well, a few things:
- Continuous refactoring of a codebase for readability and maintainability is encouraged. Similarly, re-architecting a system for better cohesion, efficiency, and maintainability is encouraged and has a great ROI.
- It is typical to fall into the trap of setting SLIs and SLOs only on individual services, but there is great value in setting business-meaningful SLOs on the overall system. This is where all the meaningful metrics need to be, not on a specific internal implementation point. For example: a system is composed of four services, and we track the error rate of each of them, which is a good thing to do. But if those internal error rates are our first-level metrics, this discourages re-architecting, since any re-architecture changes them. An SLO on each service is still important as an internal metric and objective, but overall system reliability should be measured at the system level.
- A single application should have a clear purpose and value proposition. Similarly, a system must have a high-level purpose, a promise, and ways to confirm whether it delivers on that promise. It should also show a high level of cohesion.
- While the rate of change in the structure of a system is much slower than the rate of change in its codebase, it should be much higher than the rate of change of the overall system boundary and its external contracts.
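The SLO point above can be made concrete with a small numeric sketch (the numbers and service names are made up): per-service error rates do not compose into a system SLI, which is why the system-level rate should be measured at the system boundary.

```python
# Two internal services with very different traffic volumes:
services = {
    "svc-a": {"requests": 1_000_000, "errors": 1_000},  # 0.1% error rate
    "svc-b": {"requests": 10_000,    "errors": 1_000},  # 10% error rate
}

# Naively averaging per-service rates massively over-weights the tiny
# service, and the result changes whenever services are split or merged:
naive_avg = sum(s["errors"] / s["requests"] for s in services.values()) / len(services)

# A system-level SLI is measured where customers are: the fraction of
# end-to-end requests that failed at the system boundary. It survives
# any internal re-architecture untouched.
end_to_end = {"requests": 1_000_000, "errors": 1_500}
system_error_rate = end_to_end["errors"] / end_to_end["requests"]
```

Here the naive average suggests roughly a 5% error rate while customers actually experience 0.15%, and only the latter number stays meaningful when the four services behind it become three.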
Hiding internal details
When writing a library or a module in a codebase, software engineers are encouraged to hide internal details through the use of private and package-private visibility. This ensures that internal classes are not used unexpectedly by other parts of the application. It also regulates access to the library through predefined touch points, hiding the internal implementation details. Hiding complexity and preventing undesired access is crucial, but many engineering teams do not do a similarly good job when it comes to the architecture of distributed systems.
Even when backend systems are encapsulated behind a gateway, many engineers still use a domain or URI prefix to decide which service a request is routed to, for example /service-a/uri-1 and /service-b/uri-2. The gateway is indeed used to regulate and restrict traffic to the internal services service-a and service-b, but it does not really hide internal details from external clients. Details of the system's internal components are exposed to the public. A refactoring (where service-a and service-b are deprecated and their scope is moved to a new service-c) cannot be done internally without clients knowing, which breaks encapsulation.
This makes the case for creating a public API layer, a facade, that does not expose which services or actual endpoints are hit under the hood. All APIs exposed by the system's facade (the experience layer) should be coarse-grained APIs that implement the exact use cases of the system, i.e. the use-case coordination is implemented there.
Such a construct gets missed if you just hide hundreds of microservices behind a gateway as a functional layer of the stack, without taking the semantics and business-meaningful APIs, i.e. the public interface, into consideration.
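The difference between the two routing styles can be sketched as follows (paths, service names, and the `request_payment` handler are hypothetical):

```python
# Prefix-based routing leaks the internal topology into the public contract;
# deprecating service-a breaks every client that learned these paths:
prefix_routes = {
    "/service-a/uri-1": "service-a",
    "/service-b/uri-2": "service-b",
}

# A facade exposes coarse, business-meaningful use cases instead. The
# coordination behind each use case can change freely: internally this
# handler might call service-a then service-b today, and a consolidated
# service-c tomorrow, while the public path stays stable.
def request_payment(order_id: str) -> dict:
    return {"path": "/payments/request", "order": order_id, "status": "accepted"}

use_case_routes = {"/payments/request": request_payment}
```

Note that nothing in the public route table names an internal service: the contract is the use case, not the topology.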
Restricting access to internal system modules can be achieved through network restrictions or service-to-service authentication. Either way, the main point to consider is the emphasis on a system's acceptable touch points versus its internal details.
At Careem, as at other tech companies in the industry, such considerations are a matter of continuous learning and discipline as our systems evolve.
Conclusion
The ability to shift gears and see problems through different lenses is a great opportunity for software engineers. It challenges some existing norms and pushes engineering teams to draw on learnings from different domains and different known patterns, as well as common pitfalls. More importantly, it helps engineering teams apply learnings from a fast-paced, changing domain, the codebase, to the world of distributed systems, speeding up the evolution of those systems.
About the author
Mahmoud Salem and Matija Capan work in Careem’s Enterprise Architecture and Platform Engineering teams. In their free time they play padel, cycle, and explore the world.
Check out another article from our Careem Engineering team here.