5 Problems and Their Solutions With Creating a High-Load Service Using .NET and Kafka


8 min to read

Serhii Kalinets

In 1697, Melvin Conway proposed a famous thesis, which resonates in nearly every guide on crafting a microservices architecture. It's renowned for a good reason; more than one generation of developers has witnessed this thesis's validation.

However, when a company's communication structure evolves due to expansion into new international markets, the product must adapt to meet the needs of these new markets.

Defining the Problem

In the summer of 2020, our development team faced the challenge of preparing a product for international markets. At that time, we had a complex array of microservices designed and maintained in alignment with the company's outdated organizational structure. Managing this assortment of services was both cumbersome and costly. Moreover, many of these services lacked relevance to current business requirements and technological trends.

Confronting these new realities, we recognized the imperative to:

  • Simplify the existing architecture by eliminating redundant data transfers between services.
  • Enhance the speed of data delivery to end-users.
  • Reconfigure our teams to develop and maintain a new solution.

Our Approach

Over the years, .NET framework became our company's primary back-end stack for web applications. We had amassed a substantial code base, libraries for implementing additional functionalities, and a wealth of expertise in C# development. Thus, we aimed to leverage our prior experience during the development phase without making unnecessary switches to a new stack.

We initiated the development of a service using .NET Core 3.1 and smoothly transitioned to .NET 5, achieving remarkable .NET performance improvements. All our services are hosted using Kubernetes, enhancing resource utilization and scalability. The responsibility for CI/CD configuration partly rests with the development team. Still, a dedicated DevOps team develops and maintains the primary approaches and templates for efficient development.

Utilizing Kafka, we facilitate data exchange between our services in messages, each containing a comprehensive snapshot of a specific business entity. These snapshots, however, vary in their update frequency. For instance, taxonomy data (pertaining to championships, sports events, and players) changes every few seconds, while operational data (market coefficients) experiences updates several times per second. We guarantee Kafka delivery and offer robust Kafka observability.

With this in mind, a fundamental requirement for the new service was the ability to handle high loads when delivering data to clients, which is vital for high-demand web services. It was originally designed to circumvent the need for external caches or databases. Therefore, the service initializes by reading data from Kafka topics into RAM, ensuring efficient Kafka queuing. It performs aggregation and distributes updates reactively using .NET standard, leveraging Rx.NET functionality via sockets on the front end, pre-serializing the data into a binary format using Messagepack. For managing sockets, SignalR provides a comprehensive framework for web applications.

Resolving Challenges in the Process

Issue #1: Memory

Dealing with outdated data in Kafka requires a cleanup, ensuring that only necessary data is read at the service's startup (current taxonomy and trading data). Otherwise, the service would need to scan the entire topic from the earliest offsets to determine the data's relevance, resulting in unnecessary overhead.

However, because Kafka is essentially a commit log and not a random-access database, a simple "Delete * From" approach won't suffice (without additional services from Confluent). Hence, a service that handles Kafka data delivery should send null values using the value=null key, effectively marking the data as irrelevant and suitable for removal when retention policies are executed. Moreover, it's essential to keep actual data compact since, when working with snapshots, only the entity's latest state is relevant. To achieve this, the "compact" parameter needs to be set in the topic settings as the retention policy. Efficient memory management is crucial for .NET application performance and Kafka data integrity.

Another memory concern is the suboptimal utilization of data aggregations, impacting .NET problems. For example, in Rx.NET, data on the left and right must be pre-filtered to avoid unnecessary resource consumption.

It's also worth noting the advantages of keeping aggregates in memory rather than raw data. This approach optimizes Large Object Heap (LOH) usage, reducing the likelihood of excessive heap fragmentation and OutOfMemory exceptions when allocating large arrays. Fragmentation can occur even when there appears to be sufficient memory due to LOH fragmentation. To check for heap fragmentation since the last garbage collection, initiate GC.GetMemoryInfo and inspect the FragmentedBytes field, which indicates the number of fragmented bytes in the heap.

In practice, the golden rule is to minimize unnecessary allocations, a best practice for .NET application performance management. Passing an enumerator of a thread-safe collection is more efficient than creating copies of the collection for multi-threaded processing.

Issue #2: Performance

As previously mentioned, performance and the ability to deliver data to clients with an average latency of less than 200 milliseconds are crucial for the service. Achieving this requirement entails sending only updates, which represent the difference between the previous and current data state. To achieve this, we utilize a structure of the Diff<T> type, where T represents the ViewModel, with each field containing only the changed data. Updates are also grouped into batches, ensuring minor updates aren't sent individually but rather buffered by quantity or time, ensuring high-quality .NET application performance.

Garbage Collector settings have a substantial impact on performance. Reducing the use of LOH memory is essential for optimizing memory usage. Experimenting with the System.GC.LOHThreshold parameter can help find the right balance.

Server garbage collection with concurrent mode provides a significant advantage in optimizing data latency. Enabling parameters like <ConcurrentGarbageCollection> and <ServerGarbageCollection> can enhance latency, albeit with a slight increase in CPU usage.

The transition from .NET Core 3.1 to .NET 5.0, facilitated by assembly updates and serialization adjustments, improved service performance by approximately 15% without altering the code. Regularly updating frameworks and libraries can offer performance improvements with minimal developer effort, although some exceptions may apply.

Issue #3: Compatibility

Front-ends (Android/iOS/Web) subscribe to changes in operational data via web sockets for web services .NET. We use SignalR on the back end as a wrapper over web sockets. However, optimized and efficient libraries for supporting SignalR on Android and iOS were lacking for our requirements at the time.

Consequently, front-ends needed to fully support the SignalR binary protocol's features, including hardbits and encoding multiple entities within a single message by adding information about data subset lengths (payloads) in VarInt format to the message. This added some work, but thanks to the well-documented SignalR framework, it enabled the implementation of a custom transport library on the front end, fully supported by our teams.

To define contracts and passed types, we generated a custom JSON schema containing information about field optionality, valid enum values sent to the front end as integers, and other descriptors aiding front-ends in creating subscriptions for parsing received data. When contracts need changes, a new subscription version is created to support these changes, requiring the maintenance of two subscription options simultaneously until all client applications migrate to the new subscriptions.

Issue #4: Monitoring

Effective service monitoring necessitates logs and metrics, which are critical for web applications .NET. The ELK stack is used for logs, while Prometheus/Grafana is employed for collecting and nearly real-time metrics monitoring. The more metrics measured, the better, as even metrics indicating the absence of new data for a period can signal underlying issues, such as problems with data mirroring between Kafka clusters or service lag due to unhandled errors in Rx.NET subscriptions.

Additionally, it is crucial to measure the time it takes for a message to reach our service, bypassing all previous transfers. This parameter allows us to assess message latency within our service and throughout the entire system. The metric revealing the difference between generated batches intended for client delivery and disposed of batches helps evaluate memory leaks and the efficiency of data distribution mechanisms to various subscribers.

Issue #5: Scaling and Hosting

Efficiently utilizing hosting resources while maintaining performance during peak connection loads is essential for .NET applications on Windows. Flexible autoscaling that responds to metrics balancing resource utilization and optimal data latency for end-users is key. Scaling metrics should consider CPU load, the signal flow rate from Kafka at the input to the sockets at the output, and the amount of memory consumed. In cases of metric degradation, the orchestrator can respond by launching additional service instances.

The memory metric is less flexible due to the non-deterministic nature of GC memory cleaning. You can experiment with the System.GC.HighMemoryPercent parameter, aiming for more aggressive GC work, not at 90% of the occupied memory (by default), but 80%. It will allow autoscaling to consume resources more efficiently.

Our Conclusions

In conclusion, optimizing the performance and memory consumption of a highly loaded service is an ongoing necessity for .NET applications. It's essentially a race between adding new features and minimizing their impact on system performance. This requires significant effort on the part of the development team, on both the development and testing sides. Moreover, the presence of Load tests is necessary.

It's worth updating the main libraries and frameworks regularly because they improve performance and stabilize the service. This is particularly important for Kafka experts.

Measure everything measurable. The more metrics you have, the more predictable the system is, and the more data available to improve it.

It can be helpful to reconsider the default settings of the frameworks used, whether it's Kafka or .NET, ensuring Kafka order guarantees and high .NET performance.

We hope our experience will help you prepare for the challenges of building high-load services. Good luck!

Please rate the post
1 2 3 4 5
Thanks for your rating!
Subscribe to our newsletter