bklockwood / psperf
Simple computer health monitoring with PowerShell
License: MIT License
How to handle the situation where Get-PerfData fails?
Possible reasons:
Right now, testing against 6 systems that are up and running, one full 'cycle' (gather info from 6 targets, write page) takes 3-6 minutes. So each system is taking 30-60 secs. I would like to get that down to 10 secs or less per monitored target.
The big culprits are Get-PendingWU (6-55 secs) and Get-EventCount (0.5-74 secs). Get-PendingWU is the worse of the two because it is consistently slow; Get-EventCount rarely takes more than 16 secs.
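To see where the time goes on each target, every check could be wrapped in Measure-Command. This is only a rough sketch: the two script blocks below are hypothetical stand-ins (simulated with Start-Sleep) for the real Get-PendingWU and Get-EventCount calls.

```powershell
# Hypothetical stand-ins for the real checks; swap in calls such as
# Get-PendingWU or Get-EventCount against a monitored target.
$checks = [ordered]@{
    'PendingWU'  = { Start-Sleep -Milliseconds 200 }
    'EventCount' = { Start-Sleep -Milliseconds 50 }
}

# Time each check and record the elapsed seconds.
$timings = foreach ($name in $checks.Keys) {
    $elapsed = Measure-Command -Expression $checks[$name]
    [pscustomobject]@{
        Check   = $name
        Seconds = [math]::Round($elapsed.TotalSeconds, 2)
    }
}

# Slowest check first.
$timings | Sort-Object -Property Seconds -Descending
```

Logging these numbers per cycle would show whether a change actually moves a target toward the 10 second goal.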
Off the top of my head I see these basic mitigation strategies:
when a server goes down, note the time.
Think this through a bit better. A clean way to note history of downtime and up times.
indicate number of non-hidden security/recommended/optional patches outstanding
When an item (monitored server, or disk) is removed from the config file, it should stop being displayed on the web page.
Currently psperf stores CpuQueue and PagesPerSec. These are fine, same-same on every system.
But it also stores multiple DiskQueue values. These are different from machine to machine. On one machine we may have two disks seen by perfmon as "0 c:" and "1 d: f:" while another machine may have disks seen as "0 d: e:" and "1 c: x:"
This makes parsing difficult and annoying. Better to specify this in a config so that the stored values look like they'll look on the web page - currently just "disk1" and "disk2" (better 'disk0' and 'disk1')
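A possible shape for that config, as a sketch only: a per-machine map from the perfmon instance name to a stable label. The server names and the Get-DiskLabel helper are made up for illustration; the instance names are the examples from above.

```powershell
# Hypothetical config: perfmon disk instance name -> stable label, per machine.
$diskMap = @{
    'server1' = @{ '0 c:' = 'disk0'; '1 d: f:' = 'disk1' }
    'server2' = @{ '1 c: x:' = 'disk0'; '0 d: e:' = 'disk1' }
}

function Get-DiskLabel {
    param(
        [Parameter(Mandatory)] [string] $ComputerName,
        [Parameter(Mandatory)] [string] $InstanceName
    )
    # Returns $null when the machine or the instance is not in the config.
    $map = $diskMap[$ComputerName]
    if ($null -eq $map) { return $null }
    return $map[$InstanceName]
}

Get-DiskLabel -ComputerName 'server2' -InstanceName '1 c: x:'
```

With a map like this, the stored values are always named disk0 and disk1 no matter how perfmon numbers the disks on a given machine.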
I want to deliver this as a single script to ease installation issues.
So it would be nice to refer to jsdelivr versions of added assets such as jquery and jquery.sparkline rather than having to include them.
I will need to submit the Fortes version of jquery.sparkline to the jsdelivr folks.
I'd like to compute a running average of how long each test cycle takes and display something like this on the web page:
Monday 9/5/2015 6:47:31 AM <--this clock runs constantly
Last refresh 22 seconds ago, next refresh in 34 seconds. <--these do too.
doh
Something in the last round of changes has caused server lines to jump around. Not sure what.
By 'jump around' I mean that servers will be listed in this order:
s2
s3
hyper1
hyper2
and on next data refresh they will change order to something like:
hyper2
s3
hyper1
s2
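One possible cause, and this is a guess rather than a diagnosis: a plain PowerShell hashtable does not guarantee enumeration order, so looping over its keys can list servers differently from run to run. A sketch of a workaround is to drive the loop from an explicit list (for example, the order in the config file) instead of from the hashtable. The variable names here are illustrative.

```powershell
# Stored data, keyed by server name (plain hashtable: order not guaranteed).
$storageHash = @{ hyper2 = @{}; s3 = @{}; hyper1 = @{}; s2 = @{} }

# Explicit display order, assumed here to come from the config file.
$serverOrder = @('s2', 's3', 'hyper1', 'hyper2')

# Build rows in the configured order, skipping servers with no data.
$rows = foreach ($name in $serverOrder) {
    if ($storageHash.ContainsKey($name)) { "row for $name" }
}
$rows
```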
I saw an actual hang occur when running the script back to back via a looping statement, and I rebooted ad6. ad6 itself didn't come back fully, and the script hung at "stopping" when I tried to stop it.
Not positive winrm was the cause but it was the most recently added code. So I'll add a try/catch.
I'd like to put some sort of timeout limiter in the invoke-command arguments, but a quick check shows nothing like that available.
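One workaround that might do, sketched under assumptions: run the work as a background job and give up waiting after a fixed number of seconds with Wait-Job -Timeout. Invoke-WithTimeout is a hypothetical helper name. The example uses Start-Job so it runs locally; for a remote call, Invoke-Command -AsJob also returns a job that can be waited on the same way.

```powershell
function Invoke-WithTimeout {
    param(
        [Parameter(Mandatory)] [scriptblock] $ScriptBlock,
        [int] $TimeoutSeconds = 10
    )
    $job = Start-Job -ScriptBlock $ScriptBlock
    try {
        # Wait-Job returns the job if it finished in time, nothing otherwise.
        if (Wait-Job -Job $job -Timeout $TimeoutSeconds) {
            Receive-Job -Job $job
        }
        else {
            Write-Warning "Timed out after $TimeoutSeconds seconds"
            Stop-Job -Job $job
        }
    }
    finally {
        Remove-Job -Job $job -Force
    }
}

# Finishes in time and returns its output.
Invoke-WithTimeout -ScriptBlock { 'ok' } -TimeoutSeconds 30
```

A timed-out call returns nothing, so the caller can treat an empty result as "target did not answer" and move on to the next server.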
For the moment, "alert" just means "change server's cell to a different color". Later it could mean sending an email or SMS (Skype?) message.
I'm thinking yellow-orange on first high level, deepening to red on subsequent, consecutive high levels.
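A small sketch of that escalation, assuming the script keeps a count of consecutive high readings per server. The function name and the specific hex colors are placeholders, not a decided palette.

```powershell
function Get-AlertColor {
    param([int] $ConsecutiveHighCount)

    # Index 0 = no alert; later entries deepen from yellow-orange to red.
    # These color values are illustrative assumptions.
    $palette = @('#ffffff', '#ffcc66', '#ff9933', '#ff3300')

    # Clamp the count into the palette range so long streaks stay red.
    $index = [math]::Min([math]::Max($ConsecutiveHighCount, 0), $palette.Count - 1)
    return $palette[$index]
}

Get-AlertColor -ConsecutiveHighCount 1
```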
currently the psperf.ps1 script rewrites index.html every time. There is no need. It should only write the file if it is nonexistent at the save location.
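A minimal sketch of that check using Test-Path. Write-IndexIfMissing is a hypothetical helper name, and the HTML passed in stands for whatever psperf.ps1 currently generates.

```powershell
function Write-IndexIfMissing {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [string] $Html
    )
    # Leave an existing file alone; report whether anything was written.
    if (Test-Path -LiteralPath $Path) { return $false }
    Set-Content -LiteralPath $Path -Value $Html -Encoding UTF8
    return $true
}
```

Calling it on every cycle is harmless: only the first call for a given save location writes the file.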
Web output should give time of latest data collection.
I've started documenting here.
I'm thinking a graph indicating number of new critical, warning, or error events in System or Application since last check. Maybe a total number of such events over last 24 hours?
Getting a bit spaghetti.
Functions should be made 'purer' - no more writing directly to storagehash. Each function should give its data back in an object format which is then stored (perhaps to json natively?) by a function written for that purpose.
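Roughly what that split could look like, as a sketch: a collector that only returns an object, and one separate function that owns the storage. Both function names and the placeholder counter values are hypothetical.

```powershell
# Collector: returns data, touches no shared state.
function Get-SampleReading {
    param([Parameter(Mandatory)] [string] $ComputerName)
    [pscustomobject]@{
        ComputerName = $ComputerName
        Timestamp    = (Get-Date).ToString('o')
        CpuQueue     = 0   # placeholder; a real collector would query the target
        PagesPerSec  = 0   # placeholder
    }
}

# Storage: the only place that writes to the store.
function Add-Reading {
    param(
        [Parameter(Mandatory)] [hashtable] $Store,
        [Parameter(Mandatory)] $Reading
    )
    $name = $Reading.ComputerName
    if (-not $Store.ContainsKey($name)) { $Store[$name] = @() }
    $Store[$name] += $Reading
}

$store = @{}
Add-Reading -Store $store -Reading (Get-SampleReading -ComputerName 's2')
```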
I think there needs to be a function for constructing the PSsession.
this should flow seamlessly
If I could store data in a format readable by jquery.sparkline directly, the page would not need to auto-refresh.
This could perhaps be done by storing data as JSON rather than the current clixml hashtable.
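A sketch of what the JSON form might look like, assuming each metric is kept as a plain array of numbers (the shape jquery.sparkline takes as input). The property names and sample values are invented for illustration.

```powershell
# Hypothetical layout: one array of recent values per metric per server.
$data = [ordered]@{
    generated = (Get-Date).ToString('o')
    servers   = [ordered]@{
        s2 = [ordered]@{ cpu = @(1, 0, 2); mem = @(300, 280, 310) }
        s3 = [ordered]@{ cpu = @(0, 0, 1); mem = @(512, 500, 498) }
    }
}

# -Depth must cover the nesting; the default depth would cut off the inner arrays.
$json = $data | ConvertTo-Json -Depth 5
$json
```

The page could then fetch this file on a timer and redraw the sparklines without a full reload.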
However, I will defer this work until I have fully fleshed out my current vision. So, will continue using hashtable/clixml until I've completed issues #10, #15, #16 (and perhaps #7).
indicate a system that's awaiting reboot
CAS Servers:
\MSExchange RPCClientAccess\RPC Requests - RPC Requests being processed, anything above 40 indicates bottleneck
\MSExchange RPCClientAccess\RPC Averaged Latency - RPC Average Latency (CAS) - Anything above 250 indicates bottleneck
Hub:
\MSExchangeTransport Queues(_total)\Aggregate Delivery Queue Length (All Queues) - Anything above 200 means your queues are backing up
Mailbox:
\MSExchange Replication(_total)\ActivationSuspended
\MSExchange Replication(_total)\Failed
\MSExchange Replication(_total)\FailedSuspended
\MSExchange Replication(_total)\Suspended
Anything above 0 is bad. BADDD! (These are DAG counters indicating replication) They can be applied to standalone MBX servers not in a DAG, they will return 0 always
\MSExchangeIS Client(_Total)\RPC Average Latency - The best counter for MBX servers, anything above 50 indicates a bottleneck, almost always disk. This should be averaged since spikes aren't unusual; it's when it is sustained that it's an issue
\MSExchange Database(Information Store)\Log Record Stalls/sec - Means Logs are sitting in memory waiting to get written (above 10 is bad)
\MSExchange Database(Information Store)\Log Threads Waiting - Different counter, same thing, logs waiting time on disk, above 10 is bad
\MSExchange Database(Information Store)\I/O Database Reads Average Latency - Overall database read latency, Microsoft says anything above 20 is bad, I'm generally cool with anything up to 100
\MSExchange Database(Information Store)\I/O Database Writes Average Latency - Overall database write latency, anything past 200 is bad, writes are lower priority than reads
The last 4 counters will generally show up as Disk Average Read/Write Latency issues as well, but I love knowing application health, because on application servers that's what you care about
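The thresholds above could be kept in one table and checked with a single function. This is a sketch: the counter paths and limits are copied from the list above, Test-CounterHigh is a hypothetical name, and the values would come from Get-Counter on a real Exchange server.

```powershell
# Counter path -> "anything above this is bad", taken from the list above.
$thresholds = [ordered]@{
    '\MSExchange RPCClientAccess\RPC Requests'                                         = 40
    '\MSExchange RPCClientAccess\RPC Averaged Latency'                                 = 250
    '\MSExchangeTransport Queues(_total)\Aggregate Delivery Queue Length (All Queues)' = 200
    '\MSExchange Replication(_total)\ActivationSuspended'                              = 0
    '\MSExchange Replication(_total)\Failed'                                           = 0
    '\MSExchange Replication(_total)\FailedSuspended'                                  = 0
    '\MSExchange Replication(_total)\Suspended'                                        = 0
    '\MSExchangeIS Client(_Total)\RPC Average Latency'                                 = 50
    '\MSExchange Database(Information Store)\Log Record Stalls/sec'                    = 10
    '\MSExchange Database(Information Store)\Log Threads Waiting'                      = 10
    '\MSExchange Database(Information Store)\I/O Database Reads Average Latency'       = 20
    '\MSExchange Database(Information Store)\I/O Database Writes Average Latency'      = 200
}

function Test-CounterHigh {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [double] $Value
    )
    # Unknown counters are never flagged.
    if (-not $thresholds.Contains($Path)) { return $false }
    return ($Value -gt $thresholds[$Path])
}

Test-CounterHigh -Path '\MSExchange Replication(_total)\Failed' -Value 1
```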
get-counter uses some funky method of connection (not wsman) that I have not fully grokked. Find out more, and document:
The storagehash.computername.downsince value is not being removed. I can't see why.
Connecting to machines outside one's domain will also require adding them to the WSMan trusted hosts: http://blogs.technet.com/b/heyscriptingguy/archive/2013/11/29/remoting-week-non-domain-remoting.aspx
Right now PSperf only maintains data for the past 24 hours (at 5 minute increments).
Save old data in such a way that for each computer, it's possible to open a page with a table of old data. The page would look like the current one, except the page would represent one computer, and each table line would represent one calendar day for that computer.
Computername
date cpugraph memgraph disk1graph disk2graph
date cpugraph memgraph disk1graph disk2graph
...
Values not shown correctly. Better documented here
preferably by reading a config file, then running get-perfdata for each computer found there.
I'm still writing get-rebootstatus, but it should work like this when I am done:
If the system is down, then write a timestamp when this was first detected
else write $false
the written value is stored at $storageHash.$compname.down
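The rule above, sketched as a pure function so it can be tested without a real host. Get-DownValue is a hypothetical name. Keeping an existing timestamp while the host stays down is my reading of "when this was first detected".

```powershell
function Get-DownValue {
    param(
        [Parameter(Mandatory)] [bool] $IsUp,
        $PreviousValue,
        [datetime] $Now = (Get-Date)
    )
    # Host is up: clear the marker.
    if ($IsUp) { return $false }
    # Host was already down: keep the time it was first detected.
    if ($PreviousValue -is [datetime]) { return $PreviousValue }
    # Host just went down: record the time.
    return $Now
}

# The result would be stored at $storageHash.$compname.down
Get-DownValue -IsUp $false -PreviousValue $false
```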
Should be able to check multiple hosts simultaneously. The checks for each host should run in linear fashion, though.
Or at least, do check-uptime first, and if the host is not down, parallelize any other checks for that host.
Need a config file something like:
servername: cpu hi alert, mem hi alert, disk0 hi alert, disk1 hi alert, authentication strings (optional)
defaults: cpu 50, mem 200, disk0 20, disk1 20, bobdole:password
server1: cpu 50, mem 200, disk0 20, disk1 30, bobjones:password2
server2: defaults
output webfile: c:\path\to\file.html
datafile: c:\path\to\data.clixml
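A rough parser for the threshold lines of a file like that, as a sketch. ConvertFrom-PsperfConfig is a hypothetical name, the format is only the draft shown above, and the authentication strings and the two path lines are deliberately left out here.

```powershell
$configText = @'
defaults: cpu 50, mem 200, disk0 20, disk1 20
server1: cpu 50, mem 200, disk0 20, disk1 30
server2: defaults
'@

function ConvertFrom-PsperfConfig {
    param([Parameter(Mandatory)] [string] $Text)

    $defaults = @{}
    $servers  = [ordered]@{}
    foreach ($line in ($Text -split '\r?\n')) {
        $line = $line.Trim()
        if (-not $line -or $line -notmatch ':') { continue }

        # Split "name: rest" on the first colon only.
        $name, $rest = $line -split ':', 2
        $name = $name.Trim()
        $rest = $rest.Trim()

        # "serverN: defaults" copies the defaults line.
        if ($rest -eq 'defaults') { $servers[$name] = $defaults.Clone(); continue }

        # Parse "cpu 50, mem 200, ..." into a hashtable of integer limits.
        $limits = @{}
        foreach ($pair in ($rest -split ',')) {
            $key, $value = $pair.Trim() -split '\s+', 2
            $limits[$key] = [int]$value
        }
        if ($name -eq 'defaults') { $defaults = $limits } else { $servers[$name] = $limits }
    }
    return $servers
}

$config = ConvertFrom-PsperfConfig -Text $configText
$config['server2']['cpu']
```

This assumes the defaults line appears before any server that refers to it.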
Maybe make it editable via web page. http://commonmark.org/