Welcome to Netsil. We’re building a new type of analytics and observability tool for companies that embrace the cloud, DevOps and microservices.
Before diving into more details of what we’re doing, we thought we’d take a moment to look at the shift from monolithic architectures to microservices and the implications of that change from an operational perspective for both DevOps and SREs.
The Growth of Microservices
Microservices are more than just a buzzword or fad. Over the last decade, microservices have emerged as the preferred developer choice for building webscale architectures at hyper-growth companies. Just look at Netflix, Spotify and Uber. Why? Here are a few reasons:
1. Agility: Componentization and distributed functionality enable applications to iterate and deploy continuously, independent of other business units and developer teams.
2. Freedom of choice: Developers can autonomously choose their preferred frameworks, leading to faster build and deployment cycles.
3. Resiliency: Microservices are designed for failure, which in turn makes applications more robust.
4. Efficiency: Thoughtful decoupling of generic functionality has led to increased reuse and significant cost savings.
Fig 1: Application architectures have moved from monolith to microservices
Companies using microservices-based architectures are not only more agile, flexible, resilient and efficient, they are also more competitive and more profitable. However, the shift does come with costs, including distribution, eventual consistency and operational complexity. In this post we'll focus on operational complexity.
The Operational Challenges of Microservices
Modern software architectures based on microservices, DevOps, elastic cloud and container technologies shift the cost from building apps to operationalizing and stabilizing microservices in production. According to Google, these operational costs can account for anywhere between 40% and 90% of the total. How can that be? Following are the core issues and our thoughts on how these challenges can be addressed.
Complexity has shifted outside the code
The breakdown of monolithic applications into hundreds or even thousands of smaller, cohesive, functional microservices has resulted in a significantly reduced code footprint inside each service. However, these microservices now need to interact extensively with each other. Function calls within a monolith's code have been replaced by calls going over the network between microservices. The state of every request has to be transferred from one service to another to build a response. The result is an explosion of chatter: API calls, RPCs, database calls, memory caching calls and so on.
In production, the critical piece that DevOps need to monitor is no longer the code inside a microservice, but rather the interactions between the various microservices. Most issues that arise in production, such as hotspots, chokepoints or cascading failures, are due to the complex interplay between services. Continuous deployments are the norm, and new dependencies between services may emerge after each deploy. Whenever there is an issue, precious time is spent chasing service dependencies, either by looking up (often outdated) documentation or by consulting other developers. Without a real-time map showing how microservices interact with each other, with databases and with external services, DevOps are flying blind.
DevOps, the cloud, containers and microservices often go together, with DevOps as the cultural and operational foundation. The success of DevOps and modern applications depends on cross-team collaboration as the organization becomes less hierarchical and groups move out of their silos. With microservices, organizational boundaries fade: services consume real-time APIs from teams both inside and outside the organization. Managing this complexity and bringing order to the chaos requires teamwork. However, it's often hard to change corporate culture and how people work. Tools are needed that help foster this collaboration.
Today, many organizations have outdated or missing documentation for tracking service owners, functionality and dependencies. Troubleshooting tools often focus on the infrastructure layer instead of the logical application layer, making it difficult to trace the source of an issue to a particular service. And observability tools present dashboards that do little to help troubleshoot across the company and into the cloud.
Fig 2: Service topology mirrors organizational graph
Our approach overcomes these issues and makes DevOps collaboration possible. By treating the network as the source of truth and observing application chatter in real time, it's possible to build a live service directory that obviates much of the need for manual documentation. New employees can come up to speed on services quickly, and teams troubleshooting issues can quickly identify key stakeholders and services. This approach also allows building a real-time topological map of services that mirrors the organizational graph, a view that few have seen to date. By overlaying tracing on top of the map, DevOps and SREs can quickly work together to resolve issues.
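As a minimal sketch of the idea, observed call edges between services can be folded into a live dependency map. All service names, and the shape of the call records, are hypothetical placeholders here, not our actual implementation:

```python
from collections import defaultdict

# Hypothetical observed service-to-service calls, e.g. gleaned from
# network traffic. Each record is (caller, callee).
observed_calls = [
    ("frontend", "cart"),
    ("frontend", "catalog"),
    ("cart", "pricing"),
    ("cart", "inventory-db"),
    ("frontend", "cart"),
]

def build_service_map(calls):
    """Fold observed call edges into a dependency map with call counts."""
    topology = defaultdict(lambda: defaultdict(int))
    for caller, callee in calls:
        topology[caller][callee] += 1
    return {svc: dict(deps) for svc, deps in topology.items()}

topo = build_service_map(observed_calls)
# topo["frontend"]["cart"] == 2: the edge was observed twice
```

Because the map is derived from live traffic rather than documentation, it stays current through every deploy with no developer effort.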
The mid-2000s strategy of “let’s standardize on one language” has been replaced by “bring on all the languages.” Teams follow the famous two-pizza rule (no team should be larger than two pizzas can feed), which fits the microservices world naturally. Service developers and architects can now pick and choose their favorite languages and frameworks as they like. Moreover, developers are increasingly adopting a variety of open-source projects and third-party cloud services to quickly build their applications. For large tech companies that leads to a huge diversity in developer preferences (and a lot of pizzas) and, for the most part, that has worked quite well for the developers.
The lives of DevOps/SREs, however, have become vastly more complicated. With increasing diversity, it becomes challenging to standardize on an observability solution. Traditional APM tools require code instrumentation and thus prove ineffective, as the surface area each of them can cover keeps shrinking. They do not provide the necessary level of coverage for external services, where teams have little to no control. Moreover, every time a new language or framework comes out, it takes months, if not years, for existing code-level tools to add support (Go is a case in point).
As a result, businesses are allocating ridiculous amounts of resources to building custom instrumentation frameworks that in turn rely on developers to add and maintain instrumentation deep inside their applications. Developers also need to devote resources to maintaining and supporting these frameworks, with all their language dependencies. This strategy comes at a huge cost to developer productivity, bloats the code and is prone to failure. Custom observability kits do not address the root cause: in diverse stacks, the observability solution should work across different frameworks, services and systems, regardless of the language they are written in and independent of how frequently APIs change.
Fig 3: Modern distributed systems use a plethora of languages and frameworks
In the microservices world, listening to application chatter and using it as the “source of truth,” as opposed to relying on call-stack analysis or log analysis, is the optimal strategy for wider coverage, efficient analytics and faster troubleshooting. An elegant way to achieve this is to tap into a fundamental layer, such as the operating system or the network, and analyze operational behavior bottom-up.
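To make the bottom-up idea concrete, here is a toy sketch of classifying connections observed at the network layer by their destination port, which works the same whatever language each service is written in. The port table and service names are illustrative assumptions, not an exhaustive or authoritative mapping:

```python
# Hypothetical sketch: label observed connections by well-known port,
# independent of the language each service is written in.
WELL_KNOWN_PORTS = {
    80: "http",
    443: "https",
    3306: "mysql",
    5432: "postgres",
    6379: "redis",
}

def classify_connection(src, dst, dst_port):
    """Label an observed (src -> dst:port) connection with a protocol guess."""
    protocol = WELL_KNOWN_PORTS.get(dst_port, "tcp")
    return {"src": src, "dst": dst, "protocol": protocol}

conn = classify_connection("checkout", "10.0.0.7", 6379)
# conn["protocol"] == "redis"
```

A real system would go further and parse the wire protocol itself, but even this crude view requires zero code changes in the services being observed.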
Scale and Lifetime
Legacy observability and analytics tools were built for hosts and VMs with lifetimes on the order of days or months, where engineers only had to worry about a handful of performance metrics. With the shift to containers and microservices, the average lifetime of an instance has dropped to minutes or seconds, and that's before we even mention Lambdas, which live for milliseconds at best. Ephemeral instances have caused an explosion in the number of metrics that need to be collected and made available for analysis in near real time. It has often been joked that the Netflix microservices architecture is “a metrics generator that occasionally streams movies.”
Cutting through the jargon: engineers are now trying to decipher a huge amount of data in a very short span of time. That's not the ideal playground for tools that poll periodically for metrics or roll up metrics locally before shipping them off for analysis. If the time required to ingest the data exceeds the data's lifetime, it's simply not going to work. Any approach that can track these fast-changing, chaotic environments has to move at the speed of the environment being observed.
This can be enabled through stream processing of live service interactions and aggregation of the analytics results across the environment. Additionally, the metrics have to be made available for querying within seconds of ingestion and visualized at the right level of abstraction. Finally, some tools structure their costs around log volumes or the rate at which metrics are shipped, which penalizes operational growth and scale.
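The windowed-aggregation piece of that pipeline can be sketched in a few lines. This is a simplified tumbling-window roll-up over in-memory samples, with made-up metric names and window size; a production stream processor would do the same thing incrementally over a live feed:

```python
from collections import defaultdict

def aggregate_windows(samples, window_s=10):
    """Tumbling-window aggregation of (timestamp, metric, value) samples.

    Rolls each sample into a (window_start, metric) bucket so results are
    queryable seconds after ingestion rather than after a long batch roll-up.
    """
    buckets = defaultdict(list)
    for ts, metric, value in samples:
        window_start = ts - (ts % window_s)  # align to window boundary
        buckets[(window_start, metric)].append(value)
    return {
        key: {"count": len(vals), "avg": sum(vals) / len(vals)}
        for key, vals in buckets.items()
    }

stats = aggregate_windows([
    (100, "latency_ms", 20.0),
    (104, "latency_ms", 40.0),
    (112, "latency_ms", 30.0),
])
# (100, "latency_ms") holds two samples averaging 30.0;
# (110, "latency_ms") holds one sample
```

Because each sample is assigned to its bucket the moment it arrives, a window's summary is ready as soon as the window closes, matching the speed of the environment being observed.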
These are all very real issues faced by web-scale businesses and addressing these technology gaps to help people build better software is why we do what we do.
We’re excited to show you what we’ve got, so stay tuned! We’ll be back with a sneak peek very shortly.