psperf's People

Contributors

bklockwood, gitter-badger

psperf's Issues

When perfdata unavailable.

How to handle the situation where Get-PerfData fails?

Possible reasons:

  • name lookup fails
  • no network connectivity
  • account lacks perms to get remote perfdata
  • other?

Performance issues.

Right now, testing against 6 systems that are up and running, one full 'cycle' (gather info from 6 targets, write page) takes 3-6 minutes, so each system takes 30-60 secs. I would like to get that down to 10 secs or less per monitored target.

The big culprits are Get-PendingWU (6-55 secs) and Get-EventCount (0.5-74 secs). Get-PendingWU is the worse of the two because it is consistently slow; Get-EventCount rarely takes more than 16 secs.

Off the top of my head I see these basic mitigation strategies:

  1. Run Get-PendingWU less frequently. The numbers will only change a few times a month. I do not want to do this with Get-EventCount, because I want perf graphs and eventcount graphs to run in lockstep.
  2. Run tests in parallel.
  3. Use a single PSSession for each computer, to reduce session setup/takedown time.
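
The first strategy could be sketched roughly as follows. This is only an illustration: Get-CachedResult, the cache layout, and the 6-hour maximum age are all assumptions, and the script block stands in for a real call such as Get-PendingWU.

```powershell
# Hypothetical helper: re-run a slow check only when the cached result is stale.
# $Cache is a hashtable keyed by computer name; each entry holds Value and Stamp.
function Get-CachedResult {
    param(
        [hashtable]$Cache,
        [string]$Key,
        [timespan]$MaxAge,
        [scriptblock]$Check
    )
    $entry = $Cache[$Key]
    if ($null -eq $entry -or ((Get-Date) - $entry.Stamp) -gt $MaxAge) {
        # Cache miss or stale entry: run the expensive check and remember when.
        $entry = @{ Value = (& $Check); Stamp = Get-Date }
        $Cache[$Key] = $entry
    }
    return $entry.Value
}

# Usage: the script block is a stand-in for the slow check and counts its own runs.
$cache = @{}
$counter = @{ Runs = 0 }
$slowCheck = { $counter.Runs++; 3 }
$maxAge = [timespan]::FromHours(6)
$first  = Get-CachedResult -Cache $cache -Key 'ad1' -MaxAge $maxAge -Check $slowCheck
$second = Get-CachedResult -Cache $cache -Key 'ad1' -MaxAge $maxAge -Check $slowCheck
# The second call is served from the cache, so the slow check ran only once.
```

For the cache to survive between cycles it would have to be saved alongside the other stored data; that part is not shown here.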

down/up times

When a server goes down, note the time.

Think this through a bit better: there should be a clean way to record the history of down and up times.

Uptime reporting broken

ad1, Lenny, and Dell-TV were all rebooted for updates in the last hour, yet report uptime of 15+ hours. Note that all systems report the same uptime.

patches outstanding

Indicate the number of non-hidden security, recommended, and optional patches outstanding.

Disk space used

In addition to disk queue lengths, provide disk space used as a bar graph, like so:

[image: example bar graph]

the datahash/datafile should store as the web page will read

Currently psperf stores CpuQueue and PagesPerSec. These are fine: they are the same on every system.

But it also stores multiple DiskQueue values. These are different from machine to machine. On one machine we may have two disks seen by perfmon as "0 c:" and "1 d: f:", while another machine may have disks seen as "0 d: e:" and "1 c: x:".

This makes parsing difficult and annoying. Better to specify this in a config so that the stored values look the way they will on the web page: currently just "disk1" and "disk2" (better: "disk0" and "disk1").
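
As one possible alternative to a config entry, the label could be derived from the leading disk index in the instance name. The sketch below assumes that every per-disk instance name starts with that index, as in the examples above; ConvertTo-DiskLabel is a hypothetical helper name.

```powershell
# Hypothetical helper: turn a disk instance name such as "1 d: f:"
# into a stable label such as "disk1", using the leading disk index.
function ConvertTo-DiskLabel {
    param([string]$InstanceName)
    if ($InstanceName -match '^\s*(\d+)\b') {
        return "disk$($Matches[1])"
    }
    # Names without a leading index pass through unchanged.
    return $InstanceName
}

$samples = '0 c:', '1 d: f:', '0 d: e:', '1 c: x:'
$labels = $samples | ForEach-Object { ConvertTo-DiskLabel $_ }
# $labels is now: disk0, disk1, disk0, disk1
```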

jsdelivr versions of jquery and jquery.sparkline

I want to deliver this as a single script to ease installation issues.

So it would be nice to refer to jsdelivr versions of added assets such as jquery and jquery.sparkline rather than having to include them.

I will need to submit the Fortes version of jquery.sparkline to the jsdelivr folks.

Timers on web page

I'd like to compute a running average of how long each test cycle takes and display something like this on the web page:

Monday 9/5/2015 6:47:31 AM <--this clock runs constantly
Last refresh 22 seconds ago, next refresh in 34 seconds. <--these do too.

server lines are jumping around

Something in the last round of changes has caused server lines to jump around. Not sure what.

By 'jump around' I mean that servers will be listed in this order:

s2
s3
hyper1
hyper2

and on next data refresh they will change order to something like:

hyper2
s3
hyper1
s2
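
The cause is not identified here. One possibility worth ruling out, assuming the per-server data lives in a plain hashtable, is that enumerating a plain hashtable gives no ordering guarantee. A minimal sketch of two ways to get a stable order, using stand-in data:

```powershell
# Stand-in data; the real structure is assumed to be a hashtable keyed by server name.
$storageHash = @{ s2 = 1; s3 = 2; hyper1 = 3; hyper2 = 4 }

# Option 1: sort the keys explicitly before writing table rows.
$sortedNames = $storageHash.Keys | Sort-Object
# $sortedNames is now: hyper1, hyper2, s2, s3

# Option 2: keep insertion order (for example, the order of a target list)
# by copying into an ordered dictionary.
$orderedHash = [ordered]@{}
foreach ($name in 's2', 's3', 'hyper1', 'hyper2') {
    $orderedHash[$name] = $storageHash[$name]
}
$orderedNames = @($orderedHash.Keys)
# $orderedNames is now: s2, s3, hyper1, hyper2
```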

Page stops writing when target has no diskfree data

If we see this in psperf.json:

"lenny":  {
                  "DiskQueue":  {

                                },
                  "DiskFree":  {

                               },

The page will end up doing this:

[image: screenshot of the resulting page]

No further data will be loaded (in the above example, there were more targets after 'lenny').

Script slows when system unreachable

I saw an actual hang when I rebooted ad6 while the script was running back to back in a loop. ad6 itself didn't come back fully, and the script hung at "stopping" when I tried to stop it.

Not positive winrm was the cause but it was the most recently added code. So I'll add a try/catch.

I'd like to put some sort of timeout limiter in the invoke-command arguments, but a quick check shows nothing like that available.
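
One workaround that might be worth trying is to run the remote call as a background job and bound the wait with Wait-Job -Timeout. Invoke-Command accepts -AsJob for this. The sketch below uses a local Start-Job as a stand-in so it can run anywhere; Receive-JobWithTimeout is a hypothetical helper, and whether this avoids the hang described above is untested. The timeout settings on New-PSSessionOption (such as -OpenTimeout and -OperationTimeout) may also be worth a look.

```powershell
# Hypothetical helper: wait a bounded time for a job, then give up.
function Receive-JobWithTimeout {
    param(
        [System.Management.Automation.Job]$Job,
        [int]$TimeoutSeconds = 10
    )
    # Wait-Job returns once the job finishes or the timeout (in seconds) elapses.
    Wait-Job -Job $Job -Timeout $TimeoutSeconds | Out-Null
    if ($Job.State -eq 'Completed') {
        $result = Receive-Job -Job $Job
    } else {
        # Still running (or failed): stop it and report no data.
        Stop-Job -Job $Job
        $result = $null
    }
    Remove-Job -Job $Job -Force
    return $result
}

# Local stand-in for a slow target. Against a real target this might instead be
#   Invoke-Command -ComputerName ad6 -ScriptBlock { ... } -AsJob
$job = Start-Job -ScriptBlock { Start-Sleep -Seconds 30; 'done' }
$value = Receive-JobWithTimeout -Job $job -TimeoutSeconds 2
# $value is $null here because the job did not finish within 2 seconds.
```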

alert on high levels

For the moment, "alert" just means "change server's cell to a different color". Later it could mean sending an email or SMS (Skype?) message.

I'm thinking yellow-orange on first high level, deepening to red on subsequent, consecutive high levels.
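
The escalation idea could look something like this. The function names, the color names, and the choice of three steps are assumptions for illustration only.

```powershell
# Hypothetical helper: count how many readings at the end of the series
# exceed the threshold without interruption.
function Get-ConsecutiveHighCount {
    param([double[]]$Readings, [double]$Threshold)
    $count = 0
    for ($i = $Readings.Count - 1; $i -ge 0; $i--) {
        if ($Readings[$i] -gt $Threshold) { $count++ } else { break }
    }
    return $count
}

# Hypothetical helper: map that count to a cell color, deepening toward red.
function Get-AlertColor {
    param([int]$ConsecutiveHighs)
    switch ($ConsecutiveHighs) {
        0       { return 'none' }        # no highlight
        1       { return 'orange' }
        2       { return 'orangered' }
        default { return 'red' }
    }
}

$highs = Get-ConsecutiveHighCount -Readings 10, 60, 70 -Threshold 50
$color = Get-AlertColor -ConsecutiveHighs $highs
# $highs is 2 and $color is 'orangered'
```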

Show number of errors in eventlogs

I'm thinking a graph indicating number of new critical, warning, or error events in System or Application since last check. Maybe a total number of such events over last 24 hours?

Refactor

Getting a bit spaghetti.

Functions should be made 'purer' - no more writing directly to storagehash. Each function should give its data back in an object format which is then stored (perhaps to json natively?) by a function written for that purpose.
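
A rough sketch of that shape, with hypothetical function names and sample values: collectors return plain objects, and one dedicated function does the storing.

```powershell
# Hypothetical collector: returns data instead of writing to shared state.
function New-PerfSample {
    param([string]$ComputerName, [double]$CpuQueue, [double]$PagesPerSec)
    [pscustomobject]@{
        ComputerName = $ComputerName
        CpuQueue     = $CpuQueue
        PagesPerSec  = $PagesPerSec
        Stamp        = (Get-Date).ToString('o')   # ISO 8601 text timestamp
    }
}

# Hypothetical single writer: the only place that touches the data file.
function Save-PerfData {
    param([object[]]$Samples, [string]$Path)
    ConvertTo-Json -InputObject $Samples -Depth 5 |
        Set-Content -Path $Path -Encoding UTF8
}

$samples = @(
    New-PerfSample -ComputerName 'ad1'   -CpuQueue 0 -PagesPerSec 12.5
    New-PerfSample -ComputerName 'lenny' -CpuQueue 2 -PagesPerSec 0
)
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'psperf-sketch.json'
Save-PerfData -Samples $samples -Path $path
```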

I think there needs to be a function for constructing the PSsession.

Have jquery.sparkline read data directly

If I could store data in a format readable by jquery.sparkline directly, the page would not need to auto-refresh.

This could perhaps be done by storing data as JSON rather than the current clixml hashtable.

However, I will defer this work until I have fully fleshed out my current vision. So, will continue using hashtable/clixml until I've completed issues #10, #15, #16 (and perhaps #7).

Exchange 2010 Performance Counters

CAS Servers:
\MSExchange RPCClientAccess\RPC Requests - RPC requests being processed; anything above 40 indicates a bottleneck
\MSExchange RPCClientAccess\RPC Averaged Latency - RPC average latency (CAS); anything above 250 indicates a bottleneck

Hub:
\MSExchangeTransport Queues(_total)\Aggregate Delivery Queue Length (All Queues) - Anything above 200 means your queues are backing up

Mailbox:
\MSExchange Replication(_total)\ActivationSuspended
\MSExchange Replication(_total)\Failed
\MSExchange Replication(_total)\FailedSuspended
\MSExchange Replication(_total)\Suspended
Anything above 0 is bad. BADDD! (These are DAG counters indicating replication.) They can also be applied to standalone MBX servers that are not in a DAG, where they will always return 0.

\MSExchangeIS Client(_Total)\RPC Average Latency - The best counter for MBX servers; anything above 50 indicates a bottleneck, almost always disk. This should be averaged, since spikes aren't unusual; it's an issue when it is sustained.

\MSExchange Database(Information Store)\Log Record Stalls/sec - Means logs are sitting in memory waiting to get written (above 10 is bad)
\MSExchange Database(Information Store)\Log Threads Waiting - Different counter, same thing: logs waiting on disk; above 10 is bad

\MSExchange Database(Information Store)\I/O Database Reads Average Latency - Overall database read latency; Microsoft says anything above 20 is bad, I'm generally cool with anything up to 100
\MSExchange Database(Information Store)\I/O Database Writes Average Latency - Overall database write latency; anything past 200 is bad, since writes are lower priority than reads

The last 4 counters will generally show up as Disk Average Read/Write Latency issues as well, but I love knowing application health, because on application servers that's what you care about: their health.

Document perfmon connection issues

get-counter uses some funky method of connection (not wsman) that I have not fully grokked. Find out more, and document:

  • permissions required
  • how to manually authenticate connection
  • troubleshooting when it doesn't work

See PSPerf historical data

Right now PSperf only maintains data for the past 24 hours (at 5 minute increments).

Save old data in such a way that for each computer, it's possible to open a page with a table of old data. The page would look like the current one, except the page would represent one computer, and each table line would represent one calendar day for that computer.

Computername
date cpugraph memgraph disk1graph disk2graph
date cpugraph memgraph disk1graph disk2graph
...

Check multiple computers

Preferably by reading a config file, then running Get-PerfData for each computer found there.

get-uptime should work differently

I'm still writing Get-RebootStatus, but Get-Uptime should work the way that function will when it is done:

If the system is down, then write a timestamp when this was first detected
else write $false

the written value is stored at $storageHash.$compname.down
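
The rule above might be sketched like this. Get-DownValue is a hypothetical helper, and the reachability result is simulated here; in practice it could come from something like Test-Connection -Quiet.

```powershell
# Hypothetical helper implementing the rule above.
# $Previous is the value stored by the last cycle ($false or a timestamp).
function Get-DownValue {
    param($Previous, [bool]$IsDown)
    if (-not $IsDown) { return $false }
    # Keep the earlier timestamp so it records when the outage was first seen.
    if ($Previous -is [datetime]) { return $Previous }
    return Get-Date
}

$storageHash = @{ ad6 = @{ down = $false } }
$compname = 'ad6'

# First cycle: the host is found down, so a timestamp is written.
$storageHash.$compname.down = Get-DownValue -Previous $storageHash.$compname.down -IsDown $true
$firstSeen = $storageHash.$compname.down

# Second cycle: still down, so the original timestamp is kept.
$storageHash.$compname.down = Get-DownValue -Previous $storageHash.$compname.down -IsDown $true
$stillDown = $storageHash.$compname.down

# Third cycle: the host is back, so $false is written.
$storageHash.$compname.down = Get-DownValue -Previous $storageHash.$compname.down -IsDown $false
```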

Parallelize system checks.

Should be able to check multiple hosts simultaneously. The checks for each host should run in linear fashion, though.

Or at least, do check-uptime first, and if the host is not down, parallelize any other checks for that host.

Config file

Need a config file something like:

servername: cpu hi alert, mem hi alert, disk0 hi alert, disk1 hi alert, authentication strings (optional)
defaults: cpu 50, mem 200, disk0 20, disk1 20, bobdole:password
server1: cpu 50, mem 200, disk0 20, disk1 30, bobjones:password2
server2: defaults

output webfile: c:\path\to\file.html
datafile: c:\path\to\data.clixml
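
Since the format above is only a proposal, the following is just a sketch of how the threshold lines might be read. ConvertFrom-ThresholdLine is a hypothetical helper; it ignores the authentication strings and does not handle the webfile/datafile lines.

```powershell
# Hypothetical parser for lines such as "server1: cpu 50, mem 200, disk0 20, disk1 30".
function ConvertFrom-ThresholdLine {
    param([string]$Line, [hashtable]$Defaults = @{})
    # Split on the first colon only, so values containing ':' stay intact.
    $name, $rest = $Line -split ':', 2
    # Start from a copy of the defaults, then apply this line's overrides.
    $settings = @{}
    foreach ($key in $Defaults.Keys) { $settings[$key] = $Defaults[$key] }
    foreach ($item in ($rest -split ',')) {
        # Accept "name number" pairs; anything else (such as the word
        # "defaults" or an authentication string) is ignored in this sketch.
        if ($item.Trim() -match '^(\w+)\s+(\d+)$') {
            $settings[$Matches[1]] = [int]$Matches[2]
        }
    }
    return @{ Name = $name.Trim(); Settings = $settings }
}

$defaults = (ConvertFrom-ThresholdLine 'defaults: cpu 50, mem 200, disk0 20, disk1 20').Settings
$server1  = ConvertFrom-ThresholdLine 'server1: cpu 50, mem 200, disk0 20, disk1 30' -Defaults $defaults
$server2  = ConvertFrom-ThresholdLine 'server2: defaults' -Defaults $defaults
# $server1.Settings.disk1 is 30; $server2.Settings.disk1 falls back to the default of 20.
```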

Maybe make it editable via web page. http://commonmark.org/
