D365 Performance Checklist: Troubleshooting and Resolving Performance Issues

You often hear phrases like:

  • “The system is slower today.”
  • “Nothing is working!”

Performance in Dynamics 365 is a critical aspect for both users and system owners. For end users, poor performance can lead to frustration, while system owners face challenges in identifying and resolving performance issues. To help streamline this process, I’ve created a practical checklist that outlines steps to quickly diagnose and discuss sudden performance issues.

Below is a step-by-step guide to find the root cause of performance issues and take appropriate action.

Step 1: Check Azure Status

Even if you don’t fully understand what’s causing the slowdown, there are some quick first checks you can do:

  1. Microsoft health status: https://status.cloud.microsoft/ gives a short summary of the Microsoft Cloud status.
  2. Azure speed test: https://azurespeedtest.azurewebsites.net/ lets you see the current ping times to Microsoft data centers.
  3. Azure Status: Visit the Azure Status page to see if there are any global issues that affect performance.
  4. DownDetector: Use DownDetector for real-time reports of problems from users globally.
  5. Twitter: Check Azure Support on Twitter for any recent announcements of issues.
  6. Power Platform: Visit Power Platform Help and Support for service health reports and known issues.

If these resources show global issues, it’s likely that the problem is already being addressed. Inform your users and take a short break, knowing Microsoft engineers are working to resolve the issue.
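
If you want a repeatable number to compare against "normal" days, the ping check can be approximated with a small script. This is a minimal sketch, not a Microsoft tool: it times TCP connections to a host you choose, and the function names, the attempt count and the injectable `connect` parameter are assumptions made for illustration.

```python
import socket
import statistics
import time


def measure_connect_ms(host, port=443, attempts=5, connect=socket.create_connection):
    """Return the TCP connect time in milliseconds for each attempt.

    `connect` is injectable so the function can be exercised without a network.
    """
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        conn = connect((host, port), timeout=5)
        elapsed_ms = (time.perf_counter() - start) * 1000
        conn.close()
        samples.append(elapsed_ms)
    return samples


def summarize(samples):
    """Summarize samples as min / median / max, rounded to one decimal."""
    return {
        "min_ms": round(min(samples), 1),
        "median_ms": round(statistics.median(samples), 1),
        "max_ms": round(max(samples), 1),
    }
```

Call `summarize(measure_connect_ms(host))` with the host name of your own environment and keep the result from a good day as a baseline. Connect time is only a rough proxy for what users experience, so treat it as a first hint rather than a diagnosis.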

Step 2: Use Lifecycle Services (LCS) for Telemetry Data

Once you’ve ruled out Azure-wide issues, the next step is to analyze your environment’s performance data.

  1. Environment Monitoring: In LCS, navigate to environment monitoring to get an overview of your system’s status.
  2. Check SQL Utilization: Look for any long-lasting peaks in SQL utilization that show performance bottlenecks.
  3. Analyze AOS Performance: Check if any of the Application Object Servers (AOS) are struggling.

If no clear issues are identified, proceed to:

  1. SQL Insights: Look for any heavy queries or blocks in the system. This will help you spot any ongoing processes causing delays.
  2. Review Activity Logs: Check for long-running queries or errors in the telemetry data.

At this stage, you’re looking for general indicators of system stress or inefficiency.
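
To make "long-lasting peaks" less subjective, the utilization samples can be scanned for runs that stay above a threshold. The sketch below assumes you have copied the samples out of the monitoring view as (timestamp, percent) pairs; the 80% threshold and the 10-minute minimum are illustrative assumptions, not LCS defaults.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def find_sustained_peaks(
    samples: List[Tuple[datetime, float]],
    threshold_pct: float = 80.0,
    min_duration: timedelta = timedelta(minutes=10),
) -> List[Tuple[datetime, datetime]]:
    """Return (start, end) pairs where utilization stayed at or above the threshold.

    `samples` must be sorted by time. A run counts only if it spans at least
    `min_duration` from its first to its last sample.
    """
    peaks = []
    run_start = None
    run_end = None
    for ts, pct in samples:
        if pct >= threshold_pct:
            if run_start is None:
                run_start = ts
            run_end = ts
        else:
            if run_start is not None and run_end - run_start >= min_duration:
                peaks.append((run_start, run_end))
            run_start = None
            run_end = None
    # Close a run that is still open at the end of the data.
    if run_start is not None and run_end - run_start >= min_duration:
        peaks.append((run_start, run_end))
    return peaks
```

A short spike is usually harmless. The time windows this returns are the ones worth cross-checking against user reports.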

Step 3: Collect Detailed Information from Users

If telemetry doesn’t reveal the problem, gather more specific details from the user experiencing the issue.

Key questions to ask:

  • What were you doing when the performance issue occurred?
  • Are other users facing the same issue?
  • Is it related to a specific form or process?
  • Is the problem consistent or does it happen randomly?
  • Can you record a short video to demonstrate the issue?
  • Can you provide a copy/paste of session information (Activity ID, Session ID, AOS name)?

This information will allow you to zero in on the problem and cross-check it with LCS environment monitoring.
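
One way to make sure nothing is forgotten is to capture the answers in a fixed structure. The sketch below simply mirrors the questions above; the class and field names are assumptions for illustration and do not correspond to any D365 API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PerformanceReport:
    """One user's report of a slow operation. Field names are illustrative."""
    reported_by: str
    what_user_was_doing: str
    form_or_process: Optional[str] = None
    other_users_affected: Optional[bool] = None
    consistent: Optional[bool] = None
    video_link: Optional[str] = None
    activity_id: Optional[str] = None
    session_id: Optional[str] = None
    aos_name: Optional[str] = None

    def missing_details(self):
        """List the fields that are still unanswered."""
        return [name for name, value in vars(self).items() if value is None]
```

Calling `missing_details()` on a report gives you a list of follow-up questions for the user.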

Step 4: Reproduce the Issue

Once you’ve gathered preliminary information and checked system telemetry, it’s essential to try to reproduce the issue yourself. This is a critical step, as it lets you confirm the problem and analyze it in a controlled environment.

Why Reproducing the Issue is Important:

  • Validation: Verifying that the issue can be recreated helps make sure that it’s not an isolated user-specific problem, but rather something systematic that can be investigated further.
  • Visibility: Being able to see the performance issue firsthand will give you insight into how the system behaves under the problematic conditions, enabling a deeper analysis.
  • Communication: If you plan to escalate the issue to developers or support, showing that the issue can be replicated provides a solid starting point for others to troubleshoot.

Steps to Reproduce the Issue:

  1. Request Access to the User’s System or Environment: If you don’t already have access, ask permission to log into the affected user’s environment. Make sure you are using the same permissions and roles as the user to avoid discrepancies.
  2. Follow the User’s Steps: Once logged in, replicate the exact actions the user took when the issue occurred. This includes:
      • Navigating through specific forms.
      • Running reports or transactions.
      • Performing specific searches or filtering data.
  3. Use Telemetry for Assistance: If the user provided session information (e.g., Activity ID, Session ID, AOS name), use that to find the timeframe of the issue and see if any telemetry logs or queries can help in reproducing it.
  4. Consider Different Scenarios: Sometimes, performance issues only occur under certain conditions. Test a variety of scenarios to see if the problem persists:
      • Load Variation: Does the issue only occur when multiple users are logged in and performing heavy tasks at the same time?
      • Data-Specific Issues: Does the issue happen when working with certain records or larger datasets?
      • Time-Sensitive Issues: Are there specific times of day when the issue occurs (e.g., during peak hours)?
  5. Simulate a Clean Environment: If you can’t reproduce the issue directly, try testing the same functionality in a non-production (e.g., test or sandbox) environment to see if it still occurs. Differences in performance between production and non-production environments can often point to configuration or data issues.
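
For the time-sensitive scenario, it helps to tally when the reports actually came in. A minimal sketch, assuming you have noted the timestamp of each reported slowdown yourself (the function name is illustrative):

```python
from collections import Counter
from datetime import datetime


def incidents_by_hour(timestamps):
    """Count reported slowdowns per hour of day (0-23), busiest hour first."""
    counts = Counter(ts.hour for ts in timestamps)
    return counts.most_common()
```

If most reports cluster in one or two hours, try reproducing the issue in that window and look at what else runs at that time, such as batch jobs or integrations.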

    What to Do If You Can’t Reproduce the Issue:

    • Ask for More Details: If you still can’t replicate the problem, circle back with the user and ask for more context or detailed steps. They may have missed providing key details that could help pinpoint the problem.
    • Collaborate with Other Users: Ask other users if they are experiencing the same issue. If the problem is user-specific, it could be related to personalization, permissions, network connections, or local device configurations.

    Document Your Findings:

    Whether you successfully reproduce the issue or not, it’s essential to document your findings. This documentation will be useful if you escalate the issue to another team, like:

    • A support ticket with Microsoft.
    • An internal report to the development or IT teams.
    • Communication with the affected user to manage expectations.

    By attempting to reproduce the issue yourself, you not only confirm the problem but also narrow down potential causes, making it easier to continue troubleshooting or to escalate the issue with confidence.

    Step 5: Do a simple F12 Network analysis

    Using Chrome or Edge’s developer tools (F12), do a network analysis:

    1. Track Load Times: Analyze how long various UI elements take to load.
    2. Find Cryptic Delays: Look for traces that are taking an unusually long time.

    You can then record and save the network data (such as the header, payload, and response times). It gives hints and deeper insights, and it is useful for support cases with Microsoft.

    If you can pinpoint the exact menu item where the performance issue occurs, also save a HAR file, as this may be needed later if you have to create a support case with Microsoft.
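
A saved HAR file is plain JSON, so the slowest requests can be listed without opening the browser again. The sketch below relies only on the standard HAR 1.2 layout (`log.entries[]` with `time`, `request`, `response` and `timings`); the function names and the output shape are my own assumptions.

```python
import json


def slowest_requests(har, top_n=10):
    """Return the top_n slowest entries from a parsed HAR document.

    In HAR 1.2, entry["time"] is the total elapsed time of the request in
    milliseconds and timings["wait"] is the time spent waiting for the
    server to start responding.
    """
    rows = []
    for entry in har["log"]["entries"]:
        rows.append({
            "method": entry["request"]["method"],
            "url": entry["request"]["url"],
            "status": entry["response"]["status"],
            "total_ms": entry["time"],
            "wait_ms": entry.get("timings", {}).get("wait"),
        })
    rows.sort(key=lambda row: row["total_ms"], reverse=True)
    return rows[:top_n]


def load_har(path):
    """Read a HAR file exported from the browser's Network tab."""
    with open(path, encoding="utf-8-sig") as handle:
        return json.load(handle)
```

A request with a high `wait_ms` points at the server side, while a high `total_ms` with a low `wait_ms` suggests the time went into the network or into downloading a large payload. Keep in mind that a HAR file can contain cookies and tokens, so review it before sharing.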

    Step 6: Run a quick Performance Timer check in D365 F&O

    After attempting to reproduce the issue and gathering more information, the next step is to utilize the Performance Timer in D365 F&O. This will help you pinpoint where performance bottlenecks may be occurring in the user interface.

    What is the Performance Timer?

    The Performance Timer tool lets you monitor the duration of specific actions within the system. This is a valuable tool for isolating whether performance issues are related to particular tasks, forms, or processes.

    Steps to Run the Performance Timer:

    1. Enable the Performance Timer: In D365 F&O, simply append &debug=develop to the URL you’re using. This will activate developer mode, which includes the Performance Timer. There should be an icon you can click on to see the timers.
    2. Run the Process: Now that the Performance Timer is enabled, carry out the process that is causing performance issues (e.g., navigating through forms, running reports, or completing a transaction).
    3. Analyze the Results: The Performance Timer will display detailed information about the time each step takes. Focus on processes that show unusually high times, as these may indicate where the issue lies.
    4. Save the Data: Save the timer output, which may include SQL query timings, network delays, and any computational lags. This data will be valuable if the issue needs to be escalated to developers or support teams.
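
If you send users or colleagues a link with the timer already enabled, the parameter can be added programmatically so that an existing query string is handled correctly. A minimal sketch using only the Python standard library; the function name and the host used in the example are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_debug_develop(url):
    """Return url with debug=develop set in the query string."""
    parts = urlsplit(url)
    # Drop any existing debug parameter so it is not duplicated.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "debug"
    ]
    query.append(("debug", "develop"))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, `with_debug_develop("https://contoso.example/?cmp=USMF&mi=SalesTableListPage")` returns the same URL with `&debug=develop` appended.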

    Why Use the Performance Timer?

    By using the Performance Timer, you can gain a clear understanding of the system’s behavior and identify specific steps or components that are underperforming. This allows you to target your troubleshooting efforts and avoid a “needle in a haystack” approach.

    Step 7: Trace Parsing for Deeper Analysis

    Once you have exhausted surface-level checks, it’s time to delve deeper using trace parsing. This step will give you a granular view of what is happening behind the scenes, such as which SQL queries or X++ methods are contributing to performance degradation.

    What is Trace Parsing?

    Trace parsing involves generating and analyzing detailed logs of system activity, focusing on SQL execution times, compute times, and overall process flows. This is typically a task for experienced developers familiar with X++ and SQL.

    Steps for Effective Trace Parsing:

    1. Activate Tracing: Enable tracing within the specific environment where the performance issue occurs. Be mindful to limit the tracing to just a few seconds around the time when the issue happens, as traces can quickly become very large and difficult to manage.
    2. Analyze the Trace File: Once tracing is complete, you’ll receive a detailed log of system events, including:
      • SQL Execution Times: Pinpoint how long individual SQL calls are taking.
      • Compute Times: Understand how much CPU time is being consumed during specific processes.
      • Call Stack: See the entire process flow, showing which methods and queries are running.

    Example: A healthy SQL query might take less than 25 milliseconds, but a problematic query could take several seconds, indicating where optimization is needed.

    3. Identify Bottlenecks: Look at the Top 5 X++ Calls and Queries, as these are often the main contributors to performance issues. Look for patterns such as repeated heavy queries or inefficient code paths that might be slowing the system down.
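
If you export the calls from a trace into a simple list, the same ranking can be reproduced outside the tool. The sketch below assumes each call has been reduced to a (statement, duration in ms) pair; that input shape and the function name are assumptions, and the 25 ms constant is only the rule of thumb from the example above.

```python
from collections import defaultdict

HEALTHY_QUERY_MS = 25  # rule of thumb from the example above, not a hard limit


def top_statements(calls, top_n=5):
    """Aggregate (statement, duration_ms) pairs and rank them by total time.

    Ranking by total time surfaces both a single slow query and a fast
    query that is executed thousands of times.
    """
    totals = defaultdict(lambda: {"count": 0, "total_ms": 0.0, "max_ms": 0.0})
    for statement, duration_ms in calls:
        stats = totals[statement]
        stats["count"] += 1
        stats["total_ms"] += duration_ms
        stats["max_ms"] = max(stats["max_ms"], duration_ms)
    ranked = sorted(totals.items(), key=lambda item: item[1]["total_ms"], reverse=True)
    return [
        {
            "statement": statement,
            **stats,
            # True if at least one execution exceeded the rule-of-thumb limit.
            "over_threshold": stats["max_ms"] > HEALTHY_QUERY_MS,
        }
        for statement, stats in ranked[:top_n]
    ]
```

A statement with a high `count` and a low `max_ms` is usually a sign of a loop issuing the same query repeatedly, which calls for a different fix than a single slow query.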

    What to Do if the Trace is Inconclusive:

    If no obvious bottleneck is detected, you may be dealing with an aggregated performance issue caused by the cumulative effect of many small, efficient processes. This type of issue can be particularly challenging to resolve, as it requires rethinking broader architectural elements.

    For example, if performance issues arise from standard code, extensions or customizations, it could take weeks or even months to resolve fully, especially if multiple layers of custom code are involved.

    Step 8: Fixing the Issue

    Once you have identified the root cause of the performance issue through telemetry, reproduction, and trace parsing, the next step is to involve the appropriate resources to fix it.

    1. Determine Responsibility:
      • If the issue stems from Microsoft code, open a support case with Microsoft.
      • If the problem lies within an ISV solution, contact the ISV for support.
      • If custom partner extensions are responsible, reach out to the team that developed those extensions.
    2. Check Version and Patches:
      Ensure that your environment is running the latest version of the software, as many performance issues are resolved in later patches or hotfixes. Check LCS Hotfixes and the Release Planner for any upcoming features or fixes that could address the issue.
    3. Evaluate Parameters and Configurations:
      Performance can often be tied to configurations. Check the system’s parameters, as enabling too many features or processes can bog down performance. Disable unnecessary options to streamline the system’s operation.
    4. Optimize Data Management:
      Check if transactional or outdated data (such as completed sales orders or old inventory/WMS transactions) is accumulating in the system. Regularly archiving old data and keeping the system clean can significantly improve performance. Also take a look at the F&O capacity usage to see if some data is “exploding”.
    5. Community Support:
      Don’t hesitate to reach out to the broader Dynamics 365 community. Platforms such as Yammer, community forums, and even tools like Copilot or ChatGPT often provide valuable insights and workarounds shared by other users who have encountered similar issues.
    6. Hire a 10X developer to fix it?  Sorry, but they are myths.  Just like unicorns and Bigfoot.

    Step 9: The blame game

    Hopefully you now have deep insight into the root cause of the performance issue, and now there is a feeling that someone must “pay” for the pain inflicted on the end users. A customer may start with their implementation partner, ISVs and Microsoft. But I’m not sure how advisable this would be, as the fundamental issue is not the bug/code/data that caused the performance problems. No developer, partner or Microsoft can deliver flawless code that handles every combination of complexity. I’m arguing that projects should invest more in quality procedures and in customer testing of the combinations of system, people and data they actually use. Chapter 14 of the implementation guide gives a very good overview of how to run the testing strategy. My recommendation is to allocate +50% of the implementation effort to testing and training:

    • Unit testing and integration testing: 10-15%
    • User acceptance testing (UAT): 5-10%
    • Performance and security testing: 5%

    and 15% to training:

    • End-user training: 5-10%
    • Administrator and power-user training: 3-5%
    • Training materials development: 2-3%

    Step 10: Reflections and adjusted expectations

    As I hope this blog post reflects, the path to resolving performance issues is complex and involves a lot of knowledge from many parties. But the good thing is that we do have access to a lot of tools, processes and telemetry to pinpoint the root cause. It may take time, but the more exact and detailed we are, the faster an issue can be resolved.

    Also keep in mind that ERP systems are complex. As Michael P posted on LinkedIn, it is estimated that the Dynamics 365 X++ codebase (Application Suite) consists of 27.7 M code lines and 430K methods. But if you take the entire codebase of all customers, we have crossed more than 1 billion lines of code. That is a lot of complexity, and the result is that performance issues will pop up now and then.

    Last comments:

    1. Sales Order processes and Price calculations are slow.  Get used to it!
    2. Dual Write can be a pain. Do you really need it?
    3. Customizations are most likely the reason. Did you push the developers to be done by Monday?
    4. Scaling of the PROD environment is tightly related to licensing. Lots of heavy transactional integrations and automations combined with a small 20-user license is a recipe for performance issues.
    5. Dynamics 365 is just making sure you enjoy some well-deserved coffee breaks. It’s not a bug, it’s a feature—designed to give you more time to reflect.
    6. Dynamics 365 is also teaching you the virtues of mindfulness and patience. It’s not slow, it’s just showing its respect for all the data you’re processing.

    One thought on “D365 Performance Checklist: Troubleshooting and Resolving Performance Issues”

    1. Hi Kurt. Very nice and useful write-up. It reminds me a bit of the now long-gone MS performance team blog entries. Could you please verify that the Performance Timer is still working? I can’t see it on my site (using 10.0.41 currently). But I’ve seen some Yammer discussions from 2020 stating that it wasn’t working at that time.

      BR, Atis Dimants
