bklockwood / psperf

Simple computer health monitoring with PowerShell

License: MIT License

PowerShell 80.49% HTML 19.51%

psperf's Issues

When perfdata unavailable.

How to handle the situation where Get-PerfData fails?

Possible reasons:

name lookup fail
no network connectivity
account lacks perms to get remote perfdata
other?

Right now, testing against 6 systems that are up and running, one full 'cycle' (gather info from 6 targets, write page) takes 3-6 minutes. So each system is taking 30-60 secs. I would like to get that down to 10 secs or less per monitored target.

The big culprits are Get-PendingWU (6-55 secs) and Get-EventCount (0.5-74 secs). Get-PendingWU is the worse of the two because it is consistently slow; Get-EventCount rarely takes more than 16 secs.

Off the top of my head I see these basic mitigation strategies:

run get-pendingwu less frequently. The numbers will only change a few times a month. I do not want to do this with get-eventcount, because I want perf graphs and eventcount graphs to run in lockstep.
run tests in parallel.
use a single PSSession for each computer, to reduce session setup/takedown time.

down/up times

when a server goes down, note the time.

Think this through a bit better. A clean way to note history of downtime and up times.

Uptime reporting broken

ad1, Lenny, and Dell-TV were all rebooted for updates in the last hour, yet report uptime of 15+ hours. Note all systems report same uptime.

Add-PerfData should ensure exactly 144 elements in each array
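
A minimal sketch of how a fixed array length could be enforced. `Set-FixedLength` is a hypothetical helper, not an existing psperf function, and it assumes the newest sample sits at the end of the array and that missing history is padded at the front with zeros.

```powershell
# Hypothetical helper, not part of psperf: force an array to a fixed length.
function Set-FixedLength {
    param(
        [object[]]$Values = @(),
        [int]$Length = 144,
        $Fill = 0
    )
    $items = @($Values)
    if ($items.Count -gt $Length) {
        # Too long: keep only the newest samples (assumed to be at the end).
        $items = $items[($items.Count - $Length)..($items.Count - 1)]
    }
    elseif ($items.Count -lt $Length) {
        # Too short: pad at the front so the newest sample stays last.
        $pad = @($Fill) * ($Length - $items.Count)
        $items = $pad + $items
    }
    # The leading comma stops PowerShell from unrolling the array on output.
    return ,$items
}
```

Add-PerfData could pass each stored array through a helper like this after appending a sample, so every array has the same length regardless of gaps or overruns.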

patches outstanding

indicate number of non-hidden security/recommended/optional patches outstanding

Disk space used

In addition to disk queue lengths, provide disk space used as a bar graph, like so:

Removal from psperf.ini should remove from web-page

When an item (monitored server, or disk) is removed from the config file, it should stop being displayed on the web page.

the datahash/datafile should store as the web page will read

Currently psperf stores CpuQueue and PagesPerSec. These are fine, same-same on every system.

But it also stores multiple DiskQueue values. These are different from machine to machine. On one machine we may have two disks seen by perfmon as "0 c:" and "1 d: f:" while another machine may have disks seen as "0 d: e:" and "1 c: x:"

This makes parsing difficult and annoying. Better to specify this in a config so that the stored values look like they'll look on the web page - currently just "disk1" and "disk2" (better 'disk0' and 'disk1')
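
As an illustration only, here is one way to turn perfmon instance names into stable keys. It derives the key from the leading disk number instead of reading it from a config file, so it is a simpler alternative to the config-driven approach proposed above; `ConvertTo-DiskKey` is a hypothetical name.

```powershell
# Hypothetical helper: turn a perfmon disk instance name such as "1 d: f:"
# into a stable key such as "disk1", using only the leading disk number.
function ConvertTo-DiskKey {
    param([Parameter(Mandatory)][string]$InstanceName)
    if ($InstanceName -match '^\s*(\d+)\b') {
        return "disk$($Matches[1])"
    }
    throw "Instance name '$InstanceName' does not start with a disk number."
}

# The drive letters differ per machine, but the keys come out the same.
$keys = '0 c:', '1 d: f:' | ForEach-Object { ConvertTo-DiskKey -InstanceName $_ }
```

With keys like these, the stored data would look the same on every machine no matter which drive letters perfmon reports.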

jsdelivr versions of jquery and jquery.sparkline

I want to deliver this as a single script to ease installation issues.

So it would be nice to refer to jsdelivr versions of added assets such as jquery and jquery.sparkline rather than having to include them.

I will need to submit the Fortes version of jquery.sparkline to the jsdelivr folks.

Timers on web page

I'd like to compute a running average of how long each test cycle takes and display something like this on the web page:

Monday 9/5/2015 6:47:31 AM <--this clock runs constantly
Last refresh 22 seconds ago, next refresh in 34 seconds. <--these do too.

Disk free/used barcharts are backwards

doh

server lines are jumping around

Something in the last round of changes has caused server lines to jump around. Not sure what.

By 'jump around' I mean that servers will be listed in this order:

s2
s3
hyper1
hyper2

and on next data refresh they will change order to something like:

hyper2
s3
hyper1
s2
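
One plausible cause, offered as an assumption and not a diagnosis: a plain PowerShell hashtable does not guarantee enumeration order, so rows built by walking it can change order between runs. The sketch below assumes the per-server data lives in a hashtable keyed by server name and shows how the keys could be sorted before output.

```powershell
# Assumption: per-server data lives in a hashtable keyed by server name.
$storageHash = @{ s2 = 1; s3 = 2; hyper1 = 3; hyper2 = 4 }

# A plain hashtable gives no guaranteed order, so sort the keys explicitly.
$orderedNames = $storageHash.Keys | Sort-Object

# Copy into an ordered dictionary so later enumeration keeps that order.
$sorted = [ordered]@{}
foreach ($name in $orderedNames) {
    $sorted[$name] = $storageHash[$name]
}
```

Writing the page or data file from `$sorted` should keep the rows in alphabetical order on every refresh.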

Page stops writing when target has no diskfree data

If we see this in psperf.json:

"lenny":  {
                  "DiskQueue":  {

                                },
                  "DiskFree":  {

                               },

The page will end up doing this:

No further data will be loaded (in the above example, there were more targets after 'lenny').

Consider d3

https://github.com/mbostock/d3/wiki/Gallery

reload only json, not redraw whole page

http://stackoverflow.com/questions/17853845/getjson-with-reload-functionality

Script slows when system unreachable

I saw an actual hang occur when running the script back to back via a looping statement, and I rebooted ad6. ad6 itself didn't come back fully, and the script hung at "stopping" when I tried to stop it.

Not positive winrm was the cause but it was the most recently added code. So I'll add a try/catch.

I'd like to put some sort of timeout limiter in the invoke-command arguments, but a quick check shows nothing like that available.
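
One workaround is to run the call as a background job and stop waiting after a fixed time. The sketch below uses `Start-Job` with `Wait-Job -Timeout`; `Invoke-WithTimeout` is a hypothetical wrapper name and the 10-second default is an arbitrary assumption.

```powershell
# Hypothetical wrapper: run a script block as a background job and give up
# after a fixed number of seconds instead of blocking forever.
function Invoke-WithTimeout {
    param(
        [Parameter(Mandatory)][scriptblock]$ScriptBlock,
        [int]$TimeoutSeconds = 10
    )
    $job = Start-Job -ScriptBlock $ScriptBlock
    try {
        # Wait-Job returns nothing if the job is still running at the timeout.
        $finished = Wait-Job -Job $job -Timeout $TimeoutSeconds
        if ($null -eq $finished) {
            Stop-Job -Job $job
            return [pscustomobject]@{ TimedOut = $true; Output = $null }
        }
        $output = Receive-Job -Job $job
        return [pscustomobject]@{ TimedOut = $false; Output = $output }
    }
    finally {
        # Always clean up the job object, whether it finished or not.
        Remove-Job -Job $job -Force
    }
}
```

The same pattern should apply to remote calls, since `Invoke-Command -AsJob` also returns a job that `Wait-Job -Timeout` can watch. A target that times out could then be treated as unreachable for that cycle.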

alert on high levels

For the moment, "alert" just means "change server's cell to a different color". Later it could mean sending an email or SMS (Skype?) message.

I'm thinking yellow-orange on first high level, deepening to red on subsequent, consecutive high levels.
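
The escalation could be sketched as a counter of consecutive high readings mapped onto a color ramp. The function names, the specific hex colors, and the number of steps below are all illustrative assumptions.

```powershell
# Hypothetical helper: count consecutive readings above the threshold.
function Get-NextHighCount {
    param([double]$Value, [double]$Threshold, [int]$PreviousCount = 0)
    if ($Value -gt $Threshold) {
        return $PreviousCount + 1
    }
    # Any normal reading resets the streak.
    return 0
}

# Hypothetical helper: map the streak length to a cell color.
function Get-AlertColor {
    param([int]$ConsecutiveHighs = 0)
    # Index 0 means no alert; later entries deepen from yellow-orange to red.
    $ramp = @('transparent', '#ffcc33', '#ff9933', '#ff6633', '#cc0000')
    $index = [math]::Max(0, [math]::Min($ConsecutiveHighs, $ramp.Count - 1))
    return $ramp[$index]
}
```

Storing the streak count per server alongside the other data would let each cycle pick the next color without re-reading history.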

Don't rewrite the index.html page every time.

Currently the psperf.ps1 script rewrites index.html on every run. There is no need for that: it should only write the file if it does not already exist at the save location.
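
A minimal sketch of the check, assuming the page content is available as a string. `Write-IndexPageIfMissing` is a hypothetical name, not a function in psperf.ps1.

```powershell
# Hypothetical helper: write the page only when no file exists at the path.
function Write-IndexPageIfMissing {
    param(
        [Parameter(Mandatory)][string]$Path,
        [Parameter(Mandatory)][string]$Html
    )
    if (Test-Path -LiteralPath $Path -PathType Leaf) {
        # The page is already there; leave it untouched.
        return $false
    }
    Set-Content -LiteralPath $Path -Value $Html -Encoding UTF8
    return $true
}
```

The return value tells the caller whether a file was written, which may be useful for logging.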

Timestamp on page

Web output should give time of latest data collection.

server/target names should alpha sort

Use ini file for all config, document in wiki

I've started documenting here.

Show number of errors in eventlogs

I'm thinking a graph indicating number of new critical, warning, or error events in System or Application since last check. Maybe a total number of such events over last 24 hours?

Need an installer

Disk usage graphs are backwards

The C: drive in the above picture should be 97% free, not 97% used

Refactor

Getting a bit spaghetti.

Functions should be made 'purer' - no more writing directly to storagehash. Each function should give its data back in an object format which is then stored (perhaps to json natively?) by a function written for that purpose.

I think there needs to be a function for constructing the PSsession.

pipeline: get-perfdata | add-perfdata

this should flow seamlessly

Have jquery.sparkline read data directly

If I could store data in a format readable by jquery.sparkline directly, the page would not need to auto-refresh.

This could perhaps be done by storing data as JSON rather than the current clixml hashtable.

However, I will defer this work until I have fully fleshed out my current vision. So, will continue using hashtable/clixml until I've completed issues #10, #15, #16 (and perhaps #7).

get-perfdata should return nearest integer values
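
A small sketch of the rounding step. `ConvertTo-NearestInteger` is a hypothetical helper; it rounds midpoints away from zero, which is an assumption about what "nearest" should mean here.

```powershell
# Hypothetical helper: round a counter value to the nearest whole number.
function ConvertTo-NearestInteger {
    param([Parameter(Mandatory)][double]$Value)
    # [math]::Round defaults to banker's rounding (2.5 becomes 2), so the
    # midpoint mode is set explicitly to round 2.5 up to 3.
    return [int][math]::Round($Value, [System.MidpointRounding]::AwayFromZero)
}
```

Get-PerfData could apply this to each sample before returning it, which also keeps the stored data compact.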

needs reboot

indicate a system that's awaiting reboot

Exchange 2010 Performance Counters

CAS Servers:
\MSExchange RPCClientAccess\RPC Requests - RPC Requests being processed, anything above 40 indicates bottleneck
\MSExchange RPCClientAccess\RPC Averaged Latency - RPC Average Latency (CAS) - Anything above 250 indicates bottleneck

Hub:
\MSExchangeTransport Queues(_total)\Aggregate Delivery Queue Length (All Queues) - Anything above 200 means your queues are backing up

Mailbox:
\MSExchange Replication(_total)\ActivationSuspended
\MSExchange Replication(_total)\Failed
\MSExchange Replication(_total)\FailedSuspended
\MSExchange Replication(_total)\Suspended
Anything above 0 is bad. BADDD! (These are DAG replication counters.) They can also be applied to standalone MBX servers that are not in a DAG, where they will always return 0.

\MSExchangeIS Client(_Total)\RPC Average Latency - The best counter for MBX servers. Anything above 50 indicates a bottleneck, almost always disk. This should be averaged, since spikes aren't unusual; it's an issue only when the latency is sustained.

\MSExchange Database(Information Store)\Log Record Stalls/sec - Means Logs are sitting in memory waiting to get written (above 10 is bad)
\MSExchange Database(Information Store)\Log Threads Waiting - Different counter, same thing, logs waiting time on disk, above 10 is bad

\MSExchange Database(Information Store)\I/O Database Reads Average Latency - Overall database read latency, Microsoft says anything above 20 is bad, I'm generally cool with anything up to 100
\MSExchange Database(Information Store)\I/O Database Writes Average Latency - Overall database write latency; anything past 200 is bad. Writes are lower priority than reads.

Last 4 counters will generally show up in Disk Average Read/Write Latency issues as well but I love knowing application health because on Application servers, that's what you care about, their health
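
The thresholds above could be kept in a lookup table and checked with a small function. This is a sketch only: the counter paths and limits are copied from the notes above (using the Microsoft figure of 20 for read latency), while `$exchangeThresholds` and `Test-CounterThreshold` are hypothetical names.

```powershell
# Counter paths and limits as listed in the notes above.
$exchangeThresholds = [ordered]@{
    '\MSExchange RPCClientAccess\RPC Requests'                                         = 40
    '\MSExchange RPCClientAccess\RPC Averaged Latency'                                 = 250
    '\MSExchangeTransport Queues(_total)\Aggregate Delivery Queue Length (All Queues)' = 200
    '\MSExchange Replication(_total)\ActivationSuspended'                              = 0
    '\MSExchange Replication(_total)\Failed'                                           = 0
    '\MSExchange Replication(_total)\FailedSuspended'                                  = 0
    '\MSExchange Replication(_total)\Suspended'                                        = 0
    '\MSExchangeIS Client(_Total)\RPC Average Latency'                                 = 50
    '\MSExchange Database(Information Store)\Log Record Stalls/sec'                    = 10
    '\MSExchange Database(Information Store)\Log Threads Waiting'                      = 10
    '\MSExchange Database(Information Store)\I/O Database Reads Average Latency'       = 20
    '\MSExchange Database(Information Store)\I/O Database Writes Average Latency'      = 200
}

# Hypothetical helper: returns $true when a value is above its limit.
function Test-CounterThreshold {
    param(
        [string]$Path,
        [double]$Value,
        $Thresholds = $exchangeThresholds
    )
    if (-not $Thresholds.Contains($Path)) {
        throw "No threshold defined for '$Path'."
    }
    return $Value -gt $Thresholds[$Path]
}
```

The values themselves would presumably come from `Get-Counter`, reading `CookedValue` from each returned sample. Averaging several samples first would follow the advice above about ignoring short spikes.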

Document perfmon connection issues

get-counter uses some funky method of connection (not wsman) that I have not fully grokked. Find out more, and document:

permissions required
how to manually authenticate connection
troubleshooting when it doesn't work

Hosts do not return to UP status after downtime

The storagehash.computername.downsince value is not being removed. I can't see why.

Store credentials securely

http://social.technet.microsoft.com/wiki/contents/articles/4546.working-with-passwords-secure-strings-and-credentials-in-windows-powershell.aspx

http://www.techrepublic.com/blog/data-center/powershell-code-to-store-user-credentials-encrypted-for-re-use/

Connecting to machines outside one's domain will also require adding them to the WSMan trusted hosts: http://blogs.technet.com/b/heyscriptingguy/archive/2013/11/29/remoting-week-non-domain-remoting.aspx

See PSPerf historical data

Right now PSperf only maintains data for the past 24 hours (at 5 minute increments).

Save old data in such a way that for each computer, it's possible to open a page with a table of old data. The page would look like the current one, except the page would represent one computer, and each table line would represent one calendar day for that computer.

Computername
date cpugraph memgraph disk1graph disk2graph
date cpugraph memgraph disk1graph disk2graph
...

If no PSsession is returned, mark system as down.

jquery.sparkline stacked bar charts act strangely

Values not shown correctly. Better documented here

Eventlog entries not being counted.

Check multiple computers

preferably by reading a config file, then running get-perfdata for each computer found there.

get-uptime should work differently

I'm still writing get-rebootstatus but it should work like that does when I am done:

If the system is down, then write a timestamp when this was first detected
else write $false

the written value is stored at $storageHash.$compname.down
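
The rule above can be sketched as a small state update. `Update-DownState` is a hypothetical name, and the nested hashtable layout (`$storageHash[$compname]['down']`) is an assumption based on the path given above.

```powershell
# Hypothetical helper: record when a host was first seen down, or $false.
function Update-DownState {
    param(
        [Parameter(Mandatory)][AllowEmptyCollection()][hashtable]$StorageHash,
        [Parameter(Mandatory)][string]$ComputerName,
        [Parameter(Mandatory)][bool]$IsDown,
        [datetime]$Now = (Get-Date)
    )
    if (-not $StorageHash.ContainsKey($ComputerName)) {
        $StorageHash[$ComputerName] = @{}
    }
    $entry = $StorageHash[$ComputerName]
    if ($IsDown) {
        # Keep the first-detected timestamp across consecutive failed checks.
        if ($entry['down'] -isnot [datetime]) {
            $entry['down'] = $Now
        }
    }
    else {
        # The host answered, so clear any earlier timestamp.
        $entry['down'] = $false
    }
    return $entry['down']
}
```

Because the value is overwritten with `$false` on every successful check, a host that comes back should return to UP status on the next cycle.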

Parallelize system checks.

Should be able to check multiple hosts simultaneously. The checks for each host should run in linear fashion, though.

Or at least, do check-uptime first, and if the host is not down, parallelize any other checks for that host.

Config file

Need a config file something like:

servername: cpu hi alert, mem hi alert, disk0 hi alert, disk1 hi alert, authentication strings (optional)
defaults: cpu 50, mem 200, disk0 20, disk1 20, bobdole:password
server1: cpu 50, mem 200, disk0 20, disk1 30, bobjones:password2
server2: defaults

output webfile: c:\path\to\file.html
datafile: c:\path\to\data.clixml
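
A rough sketch of a parser for the per-server lines in the format proposed above. `ConvertFrom-PsperfConfig` is a hypothetical name. It assumes the `defaults` line comes before any server that refers to it, keeps the authentication string as an opaque value under an assumed `auth` key, and does not handle the `output webfile` and `datafile` lines.

```powershell
# Hypothetical parser for lines such as "server1: cpu 50, mem 200, user:pw".
function ConvertFrom-PsperfConfig {
    param([string[]]$Lines)
    $result = [ordered]@{}
    foreach ($line in $Lines) {
        # Skip blank lines and anything without a "name:" prefix.
        if ([string]::IsNullOrWhiteSpace($line) -or $line -notmatch ':') { continue }
        # Split "name: settings" at the first colon only.
        $name, $rest = $line -split ':', 2
        $name = $name.Trim()
        $rest = $rest.Trim()
        if ($rest -eq 'defaults') {
            # Copy the defaults so servers do not share one hashtable.
            $result[$name] = $result['defaults'].Clone()
            continue
        }
        $settings = @{}
        foreach ($item in ($rest -split ',')) {
            $item = $item.Trim()
            if ($item -match '^(\w+)\s+(\d+)$') {
                # "cpu 50" becomes key "cpu" with integer value 50.
                $settings[$Matches[1]] = [int]$Matches[2]
            }
            elseif ($item) {
                # Anything else is treated as the optional auth string.
                $settings['auth'] = $item
            }
        }
        $result[$name] = $settings
    }
    return $result
}
```

Plain-text passwords appear here only because the proposed format shows them; a real config would probably point to a securely stored credential.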

Maybe make it editable via web page. http://commonmark.org/