Git Product home page Git Product logo

windowsperformance's Introduction

Anurag's GitHub stats

windowsperformance's People

Contributors

itoleck avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar

windowsperformance's Issues

Answer key for trace; MemoryUsage1.etl

Scenario:

A Windows Server is showing memory usage fluctuations in Task Manager.

Analysis:

The first thing to check when dealing with memory usage situations and something I like to check in all traces where the overall memory usage is captured is to look at the amount of RAM .vs the overall memory commit usage. If the commit usage is more than RAM, then the machine is utilizing the paging file, which is slower than RAM. If this condition persists, the RAM should be upgraded on the machine if possible.

Open the Memory panel on the Graph Explorer and open a new Memory Utilization analysis window.

Memory

Since the Total Physical Memory (RAM) is larger than the Commit (Total), the server's overall memory usage is within an acceptable amount. If the Total Physical Memory (RAM) was less than the Commit (Total) that would indicate that the server needs more RAM installed for best performance.

Memory

There are some other memory related graphs available in the Graph Explorer. Add the VirtualAlloc Commit LifeTimes graph to an analysis window.

Memory

In the analysis window, sort by the Size (MB) column if it is not already sorted. The carat above the column name indicates that the column is sorted. View the processes responsible for the memory usage. The memory usage is also graphed, and you should see the memory spiking throughout the trace period.

Memory

The processes responsible for the memory spikes are powershell.exe. The process IDs (PID) are different for each, so these are all separate processes. It may help to know what command was issued when the powershell.exe was called so that we know if there was any script ran in the command arguments. This information is available in the System Activity panel in the Graph Explorer.

Memory

Open the Processes graph in an analysis window. The column grouping by default is by Lifetime, then Process. The Lifetime column indicates when the process ran. If it is Transient that means that the process started and stopped within the trace period. A permanent process lifetime indicates that the process started before the trace period. All of the powershell.exe processes that have memory usage spikes are transient.

A good analysis view is shown below; with the memory graphs and process graphs in the same analysis tab, filtered and sorted by the powershell.exe process. When you select each powershell.exe process, you will also select the memory spike that correlates to the process.

In the VirtualAlloc Commit LifeTimes graph there are also columns available for Type and Impacting Stack (Frame Tags). The Type will show that the memory allocations were Allocated Inside Freed Inside (AIFI), this means that the memory was used and cleaned up during the trace period. This is usually indicative of normal memory usage. Look for processes that are (AOFO) or (AIFO) when looking for memory leaks.

Memory

The Impacting Stack (Frame Tags) column is n/a which indicates that no stack information for the memory allocations is available. This can be verified by looking in the System Configuration -> Trace Statistics window. In this case there are StackWalk events captured in the trace. There are other reasons that the stack may not be available, in this case it is because the trace is missing the ImageId Provider trace events. A new trace would have to be taken with the correct providers to understand more about the functions called in the powershell.exe processes.

Memory

Below is a good example of the trace events with ImageId and StackTrace events.

Memory

Remediation:

No remediation needed, but if the PowerShell.exe processes that capture performance information become an issue; stop the collections.

Answer key for trace; HighCPUUsageEasy.etl

Scenario:

A user sees in Task Manager that the CPUs are busy even though not many applications are installed on the system.

Analysis:

For CPU analysis, the number of processors available in Windows needs to be known as the CPU Usage, % Weight column can only be deciphered when the number of CPUs are known.
HighCPUUsageEasy1

There are 8 CPUs available. *These may be Hyperthreading cores, Normal cores, P cores, E cores; aka logical CPUs.

In the Trace Statistics table you can view all of the trace providers that were captured in the trace file (.etl). Notice that the PerfInfo provider was captured with the Sampled Profile events. This indicates that the CPU usage was sampled while the trace was running (Default 1ms sample time. Can be set before trace with wpr.exe -setprofint <#>) Example. Basic CPU usage information is available in the trace.

Also notice that the StackWalk provider is not available in the trace; which means that more detailed information about the stack functions for the process's threads are not available.
HighCPUUsageEasy2

Add the Graph Explorer->Computation->CPU Usage (Sampled) graph to an analysis window. Make sure the columns are sorted by the % Weight column in decsending order. The Idle (0) process should be first in the list. The Idle process is an idle loop in the Windows kernel that just waits while there are no other threads to run. This can be ignored, or you can right-click->Filter Out Selection on the row to remove it from the view.

The next process is cpustres.exe (7072). The number in parenthises is the process ID(PID). This process is consuming 24.94 % Weight. Since the machine has 8 CPUs, the weight of 1 CPU at full speed processing is 12.5%, so cpustres.exe is utilizing about 2 CPUs worth of processing. The threads that are processing may be running on 1 or more CPUs. This can be viewed by adding the CPU column to the view.
HighCPUUsageEasy3

You can see that the cpustres.exe process is utilizing all of the CPUs to do work. This is good behavior for multi-threaded applications.

Remediation:

Shut down the cpustres.exe process to save CPU resources.

Answer key for trace LongShutdown60.etl

Scenario:

The system takes an actual long time to shutdown; >75 seconds.

Analysis:

Shutdown traces are quite a bit different then boot up and logon performance traces. In a shutdown sceario you want processes and drivers to stop and unload as fast as possible. There are a few issues that may stop this from happening, like A GPO logoff or shutdown script, clearing the pagefile on shutdown, slow service shutdown, open applications and others.

There are WPA.exe profiles for shutdown analysis in the WPA Profiles folder.

Overall CPU and Disk Usage

First thing after opening the trace to look for is a high level of resource usage in the left side graphs. Look at the CPU usage and disk usage graphs to quickly verify if the resources were heavily in use during the trace period. In this case, the CPU usage graph spikes pretty high so it warrants more investigation.

Alt Text

Add the CPU Usage (Sampled) graph to the analysis view. See the highest consumers of CPU in weight during the trace period. There were 6 processes > 1% CPU usage in weight. There is only 1 CPU available in the machine. This can be verified by opening the Menu->Trace->'System Configuration'->General tab. The TiWorker.exe (3540) process was the highest user of CPU during the trace period. Let's see if we can investigate what the process was doing during the trace period. This trace includes the stackwalk events, which can be verified by going to Menu->Trace->'System Configuration'->'Trace Statistics' tab.

Alt Text

Since we are looking at CPU usage and not looking for processes that are waiting for CPU at this point you can use the same CPU Usage (Sampled) graph to view the stack of each process. Before continuing, load the symbols by going to Menu->Trace->'Load Symbols'. Your symbols should be set to at least the public Microsoft symbol server. (SRV*https://msdl.microsoft.com/download/symbols). You can configure your symbols by going to Menu->Trace->'Configure Symbol Paths'. A screenshot of my paths are below.

Alt Text

After symbols are loaded you can open the stack for the highest CPU consumer, process TiWorker.exe (3540). Open each of the highest CPU weight rows under the process and look for public functions that will give clues to the processing of the threads in the process.

Alt Text

You should see the many functions are in the module CbsCore.dll. You can look at the details of that dll file in the Images graph or look in the details of the machine's file or search online but you will find that this dll supports servicing in Windows; which is Windows updates and features. You will also see that there were some hard page faults as well, so you should be able to see that in the Memory->Hard Faults graph.

Alt Text

You can see a correlation between the hard page faults and disk usage in the graphs.

Alt Text

So it looks like there was some update servicing when this shutdown occurred.

Continuing Investigation

You may stop the analysis at this point and put the blame on just the Windows servicing but since the CPU and disk was not 100% utilized during the whole trace period there can be other factors at play as well.

Always look at the Services graph and Processes graph when investigating shutdown WPA traces.

Open the Service graph and sort/group by 'Ended At(s)' descending, then End Thread ID, then Service Init Time (s). You will see that there are 3 services that took longer than 1 second to stop. WinDefend, wuauserv and MSSQLSERVER. Also notice that these 3 services did not try to shutdown until after the Tiworker.exe and cleanmgr.exe processes were done processing. In this case, it looks like these services could not being shutdown until the servicing was complete.

Alt Text

To view the process that were shutting down in the trace, use the Process graph and sort by Start Time9s) descending. View what processes started from the end of the shutdown towards the beginning. You should see LogonUI.exe as one of the last processes, this is normal. The LogonUI.exe process shows what the machine is doing in text after the desktop is shutdown.

Alt Text

The other processes give clues to what was occurring at the time as well. There are a few wlrmdr.exe process, some shutdown.exe processes and a MicrosoftEdgeUpdate.exe (5044) process.

The Windows servicing was kicked off by cleanmgr.exe using the following command; C:\Windows\system32\cleanmgr.exe /autoclean /d C:

This caused the TiWorker process to read a bunch of Cbs files on the drive. Microsoft Edge browser was also serviced at this time.

The main cause of the long shutdown though was the command issued to wait to reboot; shutdown /r /t 60 /d p:0:0

It just happened that while the system was waiting to shutdown for 60 seconds the Windows serving stack kicked in.

Remediation:

Do not give a command to shutdown in 60 seconds if you do not want to wait, or force the shutdown/reboot. Forcing shutdown or reboot should have not allowed the system to service Windows while shutting down.

You really should let Windows do it's updates though, when needed, so some of this could not be helped.

Answer key for trace; DiskIOHigh.etl

Scenario:

Windows Server Backup was running, backing up spanned volume SSD disks 2, 3, 4, 5 to HD disk 6.

Analysis:

  1. Find current disk usage. Open trace in WPA and add Disk Usage graph.

  2. Order columns similar to the following: Disk | Priority | IO Type | Process | IO Init Stack |GOLDBAR| Disk Service Time(us) Avg | Size | Count |BLUEBAR| Disk Service Time(us) Sum

  3. Sort by Disk Service Time(us) Avg column. This will show the average latency for each disk. Disks 1-5 are within normal latency levels (1.294ms and less), but disk 6 shows an average of 108.9ms which is above the normal 15-25ms for a 7200RPM hard drive.

DiskIOHigh1

  1. Sometimes long Disk Service Time/High latency is caused by the IO being of a low priority. In this trace, only disk 0 has other priority IO than Normal. You can move the Priority column to the first slot and select the Low and Very Low IO priorities and right-click and select Filter Out Selection to remove them from the view. Move the Disk column back to the first after filtering.

DiskIOHigh2

  1. Open Disk 6, priority Normal and view the type of IO. In this case most of the IO are writes. To find out what process is writing open the Process column. You should find that wbengine.exe (10856) process is responsible for most of the writes.

  2. Open the IO Init Stack columns until the end for the wbengine.exe process and find that there are 69 writes of exactly 32MB each. Most writes have high latency (> 100ms).

DiskIOHigh3

Remediation:

Backup to faster backup storage or split up backups to different hard drives to spread the load.

Answer key for trace; SlowBoot1.etl

This is a long one, get some hot cocoa.

Scenario:

A domain joined Windows 10 virtual machine was slow to boot to the user's desktop. Initial investigation did not show very high CPU usage. Storage IO may be high. There is no information on the type of storage underlying the virtual hard disk (SD or HDD).

Analysis:

Winlogon Issue #1

  1. Open trace and add the 'Other' - 'Boot Phases' graph to the analysis view. If your Windows Performance Analyzer does not have the Other-Boot Phases graph available download the newest Windows ADK installer for your Windows version and uninstall the current version and install the newer version.

  2. The 'Boot Phases' graph shows the 5 different phases of boot, (Pre Session Init, Session Init, Winlogon, Explorer Init and Post Boot). Each of these phases should take as little time as possible to complete. Post boot is not relevant in most cases as it is after the user logon unless there is a performance issue after the user logon launching applications. Focus on the Winlogon and Explorer Init in this trace as they are both fairly long. In this trace the 'Session Init' is a bit long as well but focusing time on the possibility of gaining back 5 seconds .vs the other 2 phases is not a good use of time.

  3. Select the Winlogon phase bar in the graph and right-click the graph and Zoom to the Winlogon phase. Each graph added to the same analysis view will have the same zoom level. This way you can focus on 1 phase at a time.

  4. Some Windows internals knowledge will go a long way in finding root causes of slow performance in ETW traces. Winlogon phase comprises of many different sub phases. View the sub phases by adding in the 'System Activity' - 'Windows Logon' graph to the analysis view. There are 3 Notifications in the table (CreateSession, Logon and StartShell). There is also a time before the CreateSession activity that is empty whitespace where it looks like nothing was happening. This is where the internals knowledge comes to play. Auto-Start services start at the beginning of the Winlogon phase. All Auto-Start services up to the 'ProfSvc_Group' service group must have started before Windows will create the first session.

  5. Investigate the services by adding the 'System Activity' - 'Services graph' in the analysis view. You can also left-click and drag in the graph to select the empty space between 0 seconds and ~39 seconds if you would like to increase your focus.

  6. In the Services graph add the 'Started At (s)' column to the very beginning of the column list and sort Ascending. Sorting descending on the 'Service Init Time (s)' is also a good way to find service start issues. You should see that service, System Optimizer' is taking a long time to start at ~16.5 seconds until ~37.8 seconds. At this popint you can remove or update the service or change the start type to manual but let's see if there is a way to get more information on the service first.

  7. You may want to view more information (Publisher, Version, etc.) on the process that the service runs. That can be done in the 'System Activity' - 'Images' graph; add it to the analysis view. Sort by Process and add the following columns to the view: File Description, File Version, Binary File Version, Productname, Company Name and Product version.

  8. Open the Process System Optimizer.exe (740) and you will see all of the images/modules that the service loaded when started. Find the 'System Optimizer.exe image name and open. You should see the 'Product name', 'Binary File Version' and 'Product Version' have values, while the other columns added do not. At this point there is no way to tell what the service is or where it came from.

  9. At this point the analysis view screen should be filling up with graphs. Minimize the graphs on the left side caret/arrow. Add the 'Computation' - 'CPU Usage (Precise)' graph to the analysis view. You can check if the trace has stack information available by going to the menu and selecting 'Trace' - 'System Configuration' and viewing Trace Statistics. If there are StackWalk events listed there should be stack information in the trace. Usually StackWalk events are the highest count of events in a verbose trace.

  10. There are a few reasons that the System Optimizer process may be taking a long time to start. One is that the service is doing some processing and needs to finish the work and the other is that the processes threads are waiting of something else to finish. Check the CPU usage by sorting by % CPU Usage descending. See that the process is only 0.11% weight. There are 2 CPUs available so if a thread was fully utilizing a CPU the weight would be 50 maximum for 1 thread/CPU. It does not look like the issue is too much processing.

  11. Setup the columns for the 'CPU Usage (Precise)' graph as follows.
    New Process | New Thread | New Thread Stack | Ready Thread Stack | Readying Process | Readying Thread |GOLDBAR| Waits (us) Max | Wait (us) Sum | Count |BLUEBAR| % CPU Usage Sum

  12. Before you can view detailed stack information you must verify your symbols configuration and 'Load Symbols'.

  13. Find the System Optimizer process and open the caret. See the threads. All processes have at least 1 thread, but many processes have many threads which can make finding any threads that are waiting difficult. Sorting by either Waits (us) Max and Waits (us) Sum can help find issues in the threads. Sometimes you may have to open each thread and look at the stack. In this case we know that the time the service takes to start is almost exactly 20 seconds. There is 1 thread with a Waits(us) Max of 20,074,833.700us, thread 768. Open thread 768.

  14. Open each stack entry that has close to the 20 seconds of wait time. The first stack entry is 'ntdll.dll!_RtlUserThreadStart' then 'sechost.dll!ScSvcctrlThreadW' then 'System.ServiceProcess.ni.dll!System.ServiceProcess.ServiceBase.ServiceMainCallback(Int32, IntPtr)' and 'mscorlib.ni.dll!System.Threading.WaitHandle.WaitOne()'. You should get to the first 'Ready Thread Stack' or [Root], open this and do the same opening each entry. At the end you should see the process and thread that the stack was waiting for, System Optimizer.exe (740) Thread 772. You must now repeat the process with thread 772.

  15. Open thread 772 in the System Optimizer.exe (740) process. You should see that the Wait (us) max is still about 20 seconds. This means that the previous thread was not to blame. It was waiting for this thread or another down the chain to finish. Open each entry. You can use the left keyboard key to open stack entries easily. At the end of the Ready Thread Stack entries you should see the process and thread that this thread was waiting for which is Idle (0) thread 0. This means that the waiting was in the 772 thread. Somewhere in the stack of the 772 thread is the key to the issue. look at the functions and find 'mscorlib.ni.dll!System.Threading.Thread.Sleep(Int32)' and the other sleep functions. Notice the count of events is 1. This means that the service has a sleep timer of 20 seconds built in. That does not seem like an optimization to me.

Winlogon Issue #2

  1. After fixing the service issue the Winlogon boot phase should take 20 less seconds, that's pretty good. Let's see if there is anything else that can be optimized. Go back to the Winlogon graph and look for long running notifiction steps in Winlogon. You should see that Logon notification is long at 34.274647200 seconds. Open the Logon item to find SessionID 1, open session 1. You should see 5 Subscribers, GPClient, Profiles, Sens, SessionEnv and TermSrv. The Profiles Subscriber is long at 32.723633100 seconds.

  2. The Profiles Subscriber is when the user's profile is loaded after logon. Select the Profiles row and zoom into the graph. The user's profile consists of registry settings, a local or remote drive location and roaming information if enabled. A good next place to look would be the storage.

  3. 2 storage related graphs are available, Disk Usage and File I/O. Start with Disk Usage and add it to the analysis view. Change the columns to the following view.

Disk | Priority | Process | IO Type | Path Tree | IO Init Stack |GOLDBAR| Size (B) Sum | Count |BLUEBAR| Disk Service Time (us) Sum

  1. Sort by 'Disk Service Time (us) Sum descending. Open disk 0, priority normal and find Process svchost.exe (364). This is the process with the highest latency. Open the process and view the reads, writes and flush IO Types. Read should be the highest. Open the 'Path Tree' and find that $Mft is the highest latency. $Mft is the file table for the formatted partition. In this case C:. Open the 'IO Init Stack' and see that the 'userenv.dll!CopyFileFunc' entry is the highest latency. userenv.dll is a dll for the user profiles. Note that the actual user profile location c:\users is very low latency(118ms total) and size(~1.5MB total).

  2. At this point we know there is a lot of activity in the file table but not in actual files. Let's look at FileIO graph. Add it to the analysis view.

  3. Change the columns to the following view. Sort by 'Duration (us) Sum' descending.

Process | Event Type |GOLDBAR| File Path | Thread | Duration (us) Sum | Size (B) Sum | Result |BLUEBAR| Count

  1. Notice process svchost.exe (364) is the highest duration. Open the process and view the event types. Create is the highest duration. Open the create type. All of the file creates for the process should be in the 'File Path' column. There are thousands of 0 byte creates for files in the user's documents folder. Since the files are 0 bytes we did not see them in the Disk Usage graph previously since no actual data was read, just the file attributes like the name, date, flags and so on.

Explorer Init Issue

  1. The Explorer Init boot phase is long at 37.53 seconds. Un-zoom the Boot Phases graph and select the Explorer Init phase and right-click and zoom.

  2. In this boot phase auto start applications are being started from multiple locations such as the 'Run' and 'RunOnce' registry locations, Startup start menu folder and scheduled tasks. Look for new processes in this phase by adding the 'System Activity' - 'Processes' graph to the analysis view.

  3. Change the columns to the following view.

Process |GOLDBAR| Parent Process ID | Command Line |BLUEBAR| Start Time (s) Min | End Time (s) Max

  1. Sort by Start Time (s) Min ascending. Since the Explorer phase starts at ~94 seconds shift-select all rows that have higher start times than 93 seconds. Right-click the rows and select 'Filter To Selection'.

  2. Find that there are some processes starting at this time and they all take some time before starting the next. Two stand outs are the 2 processes wermgr.exe. This is Windows Error Reporting. Either an application crashed or the process is packaging and sending data of previous crashes to Microsoft for analysis. In 1 of the wermgr.exe command lines you can see the -upload switch. So this looks like an upload.

  3. By looking at the 'Parent Process ID' column for wermgr.exe you can see that the parent process is 364. You can find what service this is by going to the Menu -> Trace -> System Configuration -> Services. You should find that one of the services in process 364 is Scheduler, the Task Scheduler. The Windows Error Report was sent when the task was started after logon.

Remediation:

Remove the sleep function in the System Optimizer.exe service. The service had a 20 second delay which can be removed with a code update.

Remove the empty files from the user's documents folder. There was 17.23 seconds of IO to the user's documents folder so some of this can be saved.

Change Scheduled task for Windows Error Reporting to start after logon only after idle.

Answer key for trace; HighCPUPlusOtherIssues.etl

Scenario:

Python was running a long command to open 2 million record .csv file and write a small file on c:\temp\2mfiles for each row.

Explorer.exe had an open window to the save location c:\temp\2mfiles and was refreshing the file count and list.

Disk access was through Hyper-V VMBus which was causing a bit high kernel CPU usage.

Analysis:

Python

  1. Python.exe(10844) and Explorer.exe(9224) processes high CPU usage. There are 3 logical CPU cores available on machine. CPU %Weight in CPU Usage (Sampled) is close to 1 CPU max weight of 33.3% (100 / 3). Explorer.exe 29.73% weight and Python.exe 29.28% weight.

Alt text

  1. Python symbols are available as part of the Windows Python package. Add symbol path to Python install location, i.e. c:\programdata\anaconda3

Alt text

Alt text

  1. Load symbols

Alt text

  1. Use CPU usage (Precise) graph to view what Python may be doing to utilize CPU. Use column view 'Utilization by Process, Thread, Stack.

Alt text

  1. Open stack for python.exe and find following stacks:
    python39.dll!builtin_print 15.42% weight - Python print function(printing to screen)
    python39.dll!_io_open 8.36% weight - Python file open/create function(opening or creating file)

Alt text

  1. When Python command was run output was being written to stdout/screen. Running without print statement will help CPU usage.

  2. Check file usage for python.exe. Add DiskIO graph to analysis view.

  3. Select column view 'Service Time by Process, Path Name, Stack'

  4. View python.exe (10844) process. Each value in 'Disk Service Time' is within good SSD performance limits. (i.e. 781.900 microseconds). If the storage was slower hardware this could have been an issue.

  5. View the stack functions and find that the files were being written. (i.e. python39.dll!buffered_close, partmgr.sys!PartitionWrite, partmgr.sys!PmWrrite)

Alt text

Explorer

  1. Use CPU Usage (Precise) graph to view utilization of threads in Explorer process. Use column view 'Utilization by Process, Thread, Stack'

  2. Find threads 6,100 and 2,716. 6,100 is highest CPU usage at 20.68% weight.

Alt text

  1. Open thread 6,100 and view stack, find ntskrnl.exe!KiDpcInterrupt function with 11.75% weight. This is high kernel CPU usage.

Alt text

  1. Open DPC/ISR graph. Use column view 'DPC/ISR Duration by Module, Function'. Sort columns descending by 'Duration (Fragmented) (ms).

  2. See vmbus.sys driver has high max duration at 0.302100ms. Any durations > 0.1 is considered high.

Alt text

Remediation
If process really needed to write 2 million files then the process should not output each line to screen and the Explorer window should not be opened to the location were the files are being written for best performance of application.

Answer key for trace; SlowBoot2.etl

Put your charcoal face mask on, sit back and enjoy, this is an ugly Windows system boot up trace.

Scenario:

In this case we see what I hope you never have to experience, a 26+ minute boot up!

The boot process is slow during the entire trace period. The only boot phase that is within an acceptable time is the Pre Session Init at 4.48s. Open the Other panel -> Boot Phases graph and see that Session Init, Winlogon Init and Explorer Init are all slow.

Boot Phases

Whenever you see slowness during the entire boot process it is a good hypothesis that it is hardware related. A few hardware resources may be to blame; CPU, storage and memory. Let's dive into each resource to understand if it is the cause of the slowness.

Analysis:

CPU

The first thing to verify in a trace is the number of logical CPUs that the machine had when the trace was taken. Open the Menu -> Trace -> System Configuration. Then open the General tab. The Number of Processors is 8 and the Processor Speed is 3312MHz. This should be fine for most cases and should not cause a Windows machine to boot in 26+ minutes.

The CPU usage should be verified next to make sure that there was not a process/es using all of the CPU during the trace. You should notice right away that the CPU usage during the trace is low, very low in fact that the CPU is idling for > 99% of the time. This quickly rules out CPU resources as a bottleneck.

Memory

Verify the amount of RAM installed by Opening the Menu -> Trace -> System Configuration. Then open the General tab. The machine had 1004MB of RAM installed. This seems low for Windows 10 and if you check the system requirements for Windows 10 you will find that for 64 bit the requirement is 2GB of RAM. Could it be that simple? Let's verify how much memory the system was using during the boot.

Open the Memory panel - Memory utilization graph. Sort by the column Size (MB) Max descending. You will see that the Total Physical Memory was 1004MB and the Commit (Total) was 856.762MB which is less than the RAM. Even though the requirements were not met, the system did have a small amount of memory left and should not have caused such a long boot process.

Storage

Understanding the storage that the system had when it booted is the next step. Open the Menu -> Trace -> System Configuration. Then open the Storage tab. See that there were 2 disks with 1 partition each, formatted as C: and E:. These disks were virtualized disks and there is no way in the trace to understand what storage subsystem was used to store the virtual hard disks.

We can however check the performance of the storage. Open the Storage panel -> Disk Usage graph. Use the View Editor(gear icon) to add the following columns in order.

Priority | Disk | IO Type | Path Tree | GOLDENBAR | Process | Size (B) Sum | Disk Service Time (us) Avg | Global IO Time (us) Avg | IO Time (us) Avg | Global IO Time (us) Sum | IO Time (us) Sum | QD/I | Complete Time (s) | BLUEBAR | Disk Service Time (us) Sum

Disk Usage Columns

I like to select the Normal priority and right-click Filter To Selection to remove any Low and Very Low IO. These IOs which were introduced in Windows Vista are low priority and only serviced when there is spare time, only normal and high priority IO are important usually in troubleshooting disk performance.

If you look at the Disk Service Time (us) Avg for all Normal IO it shows that the average latency of the disks were 63.617284ms which is pretty bad. That is about the same as a very bad performing 5,400RPM or less HDD. If you then open the disk 0 and 1 to the IO Types you see that for disk 0 the writes are much worse than the reads; write average is 450.907036ms and reads 4.377482ms. The write latency is almost as bad as a floppy disk while reads are like SSD speed.

Disk 1 averages are super good, NVME SSD speeds, < 1ms average. Open the disk 1 and view the Path Tree information. You should see that E:\pagefile.sys is the highest used file.

Open the disk 0 Path Tree and see that all locations on the disk are very bad performing; this indicates there is an issue with the disk. You can calculate the average bandwidth by dividing the Size (B) by the (Disk Service Time (us) /1,000,000). In this case for the disk 0 Normal IO it would be 1,725,725,184 / 1,212.892997595 = 1,422,817.336254621 MBs. About 1.5 MB per second. That is almost 80x slower than a normal 7,200RPM HDD.

Remediation:

Do not use a cheap flash drive as the storage for your virtual machines.

Answer key for trace; SlowBoot4.etl

Scenario:

An old Windows 7 machine, not on the domain takes ~60 seconds to boot to the desktop. Is this normal for an old machine and OS, or is there something else mysterious going on here?

Analysis:

The old Windows 7 machine has 2 CPU cores and 4GB RAM and a 64GB SATA SSD! Windows 7 on an SSD should boot very quickly so there may be some improvement that can be made in the 60 second boot/logon time. All of this information can be verified in the Menu->Trace->System Configuration tables.
SlowBoot4

The first information to check in a boot trace is the boot phases time chart.
SlowBoot4-phases
Each phase should be within certain thresholds available to view here

Based on the thresholds, the Pre Session Init phase is too long. Click on the pink graph bar to select the phase and right-click->Zoom. Now the chart and table data should only show the Pre Session Init data. This phase usually does not have much CPU or storage usage. If there was high usage, then the CPU and Disk graph should be checked. In this case the overall usage is low. In the cases of long Pre Session Init phases; I will look to the System Activity->Generic Events graph to see what providers were sending events at this time.

A trick is to add the Time (s) column to the very first column and sort by Time (s) ascending.
SlowBoot4-time

Look for long gaps in the times to find devices or drivers that are taking a long time to start.
SlowBoot4-presessinit

The DriverName column is the hardware Plug and Play ID of the device. In our case, PCI\VEN_8086&DEV_444E&SUBSYS_444E8086&REV_01\4&2d6ce421&0&00E1 is taking 24.56 seconds to start!

We can lookup what device this ID belongs to in the Menu->Trace->System Configuration->PnP table.
SlowBoot4-PnP

Remediation:

The Intel(R) Turbo Memory Controller driver should be updated to the latest version, or possibly some BIOS settings can be changed to alleviate the issue.

Answer key for trace; CaseOfThe30SecondFileExplorer.etl

Taken from my post https://www.chadschultz.com/2015/07/27/case-of-the-30-second-file-explorer/

Scenario:

I have been having issues with my new Windows 10 gaming/research rig. I upgraded to a 4K TV for a monitor and was running Windows 8.1. This was working well, but I thought that there might be some better DPI scaling in Windows 10, so I decided to give it a try. I downloaded and installed the Windows 10 Insider Preview; found the correct Nvidia driver for the GeForce 980Ti.

There are surely issues with the video driver and TV combination, but mostly it is working. One thing though that was not working though is File Explorer. Every time I clicked on File Explorer or This PC or programs opened an Explorer window there was a huge delay.

Analysis:

At first, I thought it might be just the new Quick Access feature in Windows 10, so I disabled opening to Quick Access in the Options of a File Explorer window.

Still the issue occurred.

I ejected the DVD from my BluRay player, sometimes if there is a scratched disk in the drive File Explorer can take more time to open.

I tracked the time it took to open the window, about 30 seconds or so. Anytime you track a timeout of a nice round number it indicative of a hard-coded timeout in the OS.

Time to take a trace.

I started with a WPRUI trace of the issue. This is part of the Windows 10 ADK and SDK; https://docs.microsoft.com/en-us/windows-hardware/test/wpt/.

I Installed the Windows Performance Toolkit, plus all of the other tools, they were not needed though.

I ran c:\program files\Windows kits\10\Windows Performance Toolkit\wprui.exe to take a Windows Performance Recorder trace.

I selected the following:

Alt Text

I closed all of the programs and opened File Explorer windows and clicked Start on Windows Performance Recorder.

I opened File Explorer from the Taskbar, and it took a long time to open.

I stopped the performance recorder and saved the file. After saving, Windows Performance Recorder will ask if you want to open the trace in Windows Performance Analyzer (WPA), yes please.

To troubleshoot a hang, we need to open the Computation->CPU Usage (Precise) graph.

Now time to configure symbols.

SymCache is also important as cached symbol files will be copied there and I do not want my system drive to fill up with symbols. There seems to be an issue with the version of WPA I am using 10.0.10075 that it does not save the symcache location. I kept on defaulting back to c:symcache, so I created a symbolic link to the path I want.

mklink /D c:symcache z:symcache

WPA

Drag the CPU Usage (Precise) to an analysis tab on the right-hand side of WPA.exe and setup columns like the following:

The Waits (µs)Max is sorted descending, so I see the highest on top, not that it matters, I know the File Explorer is part of the Explorer process, so I just filtered on Explorer process. Knowing your Windows Internals is useful sometimes.

Under the Explorer process I expanded the thread with the highest Waits (µs)Max until I started to find some interesting. Looks like when I open File Explorer windows.storage.dll is called and that calls linkinfo.dll, possibly enumerating some links, shortcuts or favorites. Then on the mpr.dll and ntlanman.dll, so File Explorer is going to the network. Davclnt.dll also gets called as a part of this which is WebDAV, so again calling the network looking for a resource from a shortcut.

Alt Text

Procmon

I couldn’t get the location of what resource was getting called from the .etl trace, so I decided to try a Process Monitor trace. I filtered by PID 3032 and excluded the SUCCESS results. There were some 'NAME NOT FOUND' results pointing to a machine I just decommissioned called HTPC. Offline Files (CSC) was trying to read the location.

Alt Text

At this point I had enough information to verify my hypothesis, that a pin, or favorite in File Explorer was present from my old machine, there was, and I removed it by right-clicking and unpinning from Quick Access.

Alt Text

Remediation:

After this File Explorer windows and Save and Open file dialog windows opened in a flash.

Case closed!

Answer key for trace; LongShutdown.etl

Windows shutdown should be quick but sometimes it's not.

Scenario:

The system takes a 'long' time to shutdown. Not too long in this case, only 12.78 seconds, but who has that kind of time?

Analysis:

Shutdown traces are quite a bit different then boot up and logon performance traces. In a shutdown sceario you want processes and drivers to stop and unload as fast as possible. There are a few issues that may stop this from happening, like A GPO logoff or shutdown script, clearing the pagefile on shutdown, slow service shutdown, open applications and others.

There are WPA.exe profiles for shutdown analysis in the WPA Profiles folder.

Services

I like to look at any services that are not shutting down in a timely manner first. There is a maximum wait time for services to shutdown on Windows shutdown, by default the time is 5 seconds. HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\WaitToKillServiceTimeout

Order the End Thread ID column ascending and you will find some services that have a thread ID of 0. These are the last services that were running on shutdown. They are services that were not able to shutdown before the operating system stopped. Look for any service that could not shutdown and that has a high Service Init Time (s) value. In this case the Rasman service did not shutdown and this Init time is 6.039970997 seconds.

Processes

Processes that are slow to exit can also be an issue with a slow shutdown. There are a few graphs that are important in investigating processes. 'Window in Focus', 'Processes' and 'Images' are some of the important graphs.

Opening the 'Window in Focus' graph will show any applications that had a window open on the desktop at the time of the trace. To find processes that could be slowing down the shutdown sort descending by Duration (s) Sum column. See that there are 3 major processes taking up the time, LogonUI.exe (19492), Skype.exe (1892) and wlrmdr.exe (20636). LogonUI.exe will run after the desktop is shutdown and while Windows is showing 'Shutting Down' screen so it can be ignored. Skype.exe is taking 3.494 seconds to exit which is a bit long. wlrmdr.exe process is also taking 1.708 seconds to exit but this is the 'Windows logon reminder' process which shuts down the machine when taking a shutdown trace so it can be ignored as well.

Alt Text

Drivers

Slow driver shutdown can cause the shutdown to be slow as well.

**Overall CPU and Disk Usage

Use the 'CPU Usage (Sampled)' graph to view the CPU usage and look for high CPU usage during shutdown. In this trace the CPU is 96.19% idle weight so there is plenty of CPU. When taking a shutdown trace a verbose trace with callstacks is usually needed so any wait chains within processes can be investigated.

Always check the CPU frequency as well to make sure the CPUs were not throttled down in speed during the shutdown. Oscillating GHz values are normal, look for an extended time of low GHz. In this trace at the time of shutting down the Rasman service the CPU frequency was at maximum, so no issues with CPU speed throttling.

Remediation:

Shutting down Skype.exe process before shutting down first could help the slow shutdown. The Rasman service can also be investigated to see why it is taking 6 seconds to shutdown. If these 2 issues can be fixed the shutdown time could be reduced by ~9 seconds.

Answer key for trace; Create_100files.etl

Scenario:

100 files were being created from process cmd.exe (2024) with the command 'fsutil file createnew c:\temp\files\A###.tmp'. There was 1 CPU available. Process MsMpEng.exe (2372) was utilizing just over half of the CPU at 58.89% weight. Overall CPU usage was high at 90.92% weight.

There is slow disk access, but overall disk utilization looks low and average Disk Service Time(Latency) is low at 5.759ms. At this point it looks like the issue may just be not enough CPU.

Storage Mini-Filter Delays are very high. There is a mini-filter installed (ChadsBadFilter.sys) that is 10x slower (23.756ms avg) than the next highest driver (Wdfilter.sys) (2.255ms).

Analysis:

CPU Usage

  1. Start with the CPU Usage (Sampled) graph and use the 'Utilization By Process' filter. There is no need for adding stack information because the trace does not contain stack information.

  2. Sort by % weight column and view process MsMpEng.exe (2372), Anti-Virus, utilizing over half of the 1 CPU at 58.89 % weight.

Create_100files-1

  1. Change view to 'Utilization By CPU' to view the overall CPU utilization during the trace at 90.92 % weight. This is nearly all of the CPU.

Create_100files-2

Disk Usage

  1. Add the Disk Usage graph to the analysis view and select the following columns.

Disk | Priority | IO Type | Process | IO Init Stack |GOLDBAR| Disk Service Time(us) Avg | Size | Count |BLUEBAR| Disk Service Time(us) Sum


  1. Sort by Disk Service Time(us) Avg descending. Notice that the overall average latency is low at 5.759ms.

Create_100files-3

  1. Sometimes slow storage performance does not show in the DiskIO chart. Mini-filter drivers are drivers that load in the storage stack to perform functions on I/O Request Packets. To check for delays in the storage mini-filters, add the Graph Explorer->Storage->Mini-Filter Delays graph to the analysis view and change the columns to the following.

Mini-Filter Driver | ProcessID | File Name |GOLDBAR| Duration (us) Sum | Duration (us) Max | Duration (us) Avg | Count |BLUEBAR| Start (s) | End (s)


  1. Sort by Duration (us) Avg descending and notice that the ChadsBadFilter.sys driver is causing an average 23.756ms delay for each IO. This is comparable to a hard disk 🔥, but if the underlying storage was a slow hard disk, the DiskIO graph would have showed higher overall latency instead of 5.759ms average.

Create_100files-4

Other mini-filter drivers to watch for include anti-virus and data loss protection software.

Remediation:

Add CPU core/s and remove/debug the ChadsBadFilter.sys storage mini-filter driver.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.