Dynamics 365 CSU – Cache is king

In a Dynamics 365 eCommerce project we were implementing a pricing logic that was quite complex and required many steps to calculate a price. Prices are unique per customer, product, and other dimensions, and they had to be calculated each time, in real time. But calculating prices is a time- and compute-intensive operation, and that is not always aligned with how today's customers expect eCommerce sites to work. The expectation is that when browsing a site, the prices are there instantly, and that it is easy to search, filter, refine and 'order by'. Traditional eCommerce sites most often use a flat table of precalculated prices, often in the millions of records. But we did not want this. We wanted a pricing logic that was dynamic and contextualized in real time, but still fast.

The way we first designed it was to build a price caching logic in the CSU (Commerce Scale Unit), where the prices were stored in a memory cache for up to 24 hours. We tested this approach in a Tier-2 environment and were amazed by the performance we achieved. Then we deployed to production, and quite soon we realized that we did not get the expected performance. Somehow the cache we built was sometimes hit and sometimes not. The production system behaved differently from what we had tested in the test/UAT systems. In essence, it was sometimes fast and sometimes slow.
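The first design can be sketched as a simple per-instance cache with a time-to-live. The class and key names below are illustrative (the actual CSU extension code is not shown here), but the idea is the same: store a calculated price in memory and reuse it for up to 24 hours.

```python
import time

class TtlMemoryCache:
    """Minimal in-memory cache where each entry expires after a fixed
    time-to-live. A sketch of the idea, not the actual CSU extension code."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # entry is older than the TTL: drop it
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)

# Hypothetical usage: cache a customer/product price for up to 24 hours.
price_cache = TtlMemoryCache(ttl_seconds=24 * 60 * 60)
price_cache.set(("customer-42", "product-1001"), 129.50)
print(price_cache.get(("customer-42", "product-1001")))  # 129.5
```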

I then went deeper into the telemetry of the CSU, which has become very good in later releases, and where Samuel Ardila Carreño has also provided some excellent CSU/Azure Data Explorer dashboards, as exemplified below. Here we can see the exact timing of each API call, its frequency, and the average. A very nice way to understand the performance of APIs.

But back to the story of where the caching failed. In this case we had one CSU for eCommerce, and the caching results were not what we expected. By digging deeper with Azure Data Explorer, we realized that one CSU is NOT one machine. It is multiple stateless services. In our environment we could trace it down to 15 CSU microservices that appear to be load balanced. This is why our new memory cache was failing: under the hood, the load was shifting from one microservice to another, each with its own separate memory. The probable reason for this design is scalability. We have also noticed that we can update CSUs and eCommerce packages in production without seeing any downtime, probably because traffic is simply switched from one stateless microservice to another during the update. And this was the main reason why our memory caching was not working.
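The effect is easy to reproduce with a small simulation. Assume a hypothetical load balancer that routes each request to a random instance, and that each instance keeps its own private cache: the more instances there are, the more often a request lands on an instance that has never seen the key, even though another instance has already calculated the price.

```python
import random

def simulate_hit_rate(num_instances: int, num_keys: int,
                      requests_per_key: int, seed: int = 1) -> float:
    """Each key is requested several times; every request is routed to a
    random instance, and each instance only remembers what it has served."""
    random.seed(seed)
    caches = [set() for _ in range(num_instances)]
    hits = total = 0
    for key in range(num_keys):
        for _ in range(requests_per_key):
            total += 1
            instance = random.randrange(num_instances)
            if key in caches[instance]:
                hits += 1
            else:
                caches[instance].add(key)  # cold instance: full price calculation
    return hits / total

# One instance: only the first request per key misses.
print(simulate_hit_rate(num_instances=1, num_keys=100, requests_per_key=5))   # 0.8
# Fifteen instances: most requests land on an instance that has not cached the key.
print(simulate_hit_rate(num_instances=15, num_keys=100, requests_per_key=5))
```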

When we learned this, we decided to not only cache in memory, but also to create a shared cache table in the CSU. So when a price was calculated, we stored it both in the memory cache and in a cache table that all 15 microservices read from. This gave us much better performance, much closer to eCommerce customers' expectations.
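A minimal sketch of that two-level scheme, with a plain dict standing in for the shared cache table (the real CSU table access and price calculation are of course more involved, and all names here are illustrative):

```python
class TwoLevelPriceCache:
    """Read-through cache: check this instance's memory first, then the
    shared table that every instance can read, and only calculate the
    price when both levels miss. A sketch, not the actual CSU code."""

    def __init__(self, shared_table: dict, calculate_price):
        self.local = {}                  # per-instance memory cache
        self.shared = shared_table       # stands in for the shared cache table
        self.calculate_price = calculate_price

    def get_price(self, customer_id: str, product_id: str) -> float:
        key = (customer_id, product_id)
        if key in self.local:            # 1. fastest: this instance's memory
            return self.local[key]
        if key in self.shared:           # 2. shared table, warmed by any instance
            price = self.shared[key]
            self.local[key] = price
            return price
        price = self.calculate_price(customer_id, product_id)  # 3. full calculation
        self.shared[key] = price         # publish for the other instances
        self.local[key] = price
        return price
```

With this, a price calculated on one microservice is found in the shared table by the other fourteen, so the expensive calculation runs only once per price regardless of how the load balancer routes the requests.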

I have always wondered why the eCommerce CSUs come in Tier-1, Tier-2 and Tier-3 levels, and the technical differences have never been clearly explained to me. But I think I see it now: it is probably the number of stateless microservices under each CSU.

So the lessons are:

1. Understanding the underlying architecture is essential for achieving the expected outcome.
2. Cache is king, when done correctly.

