Git Product home page Git Product logo

haskell-perf's Introduction

haskell-perf

GHC Patch: https://github.com/composewell/ghc/tree/ghc-8.10.7-eventlog-enhancements

Enable Linux perf counters

Enable unrestricted use of perf counters:

# echo -1 > /proc/sys/kernel/perf_event_paranoid

Disable CPU scaling

Set the scaling governer of all your cpus to performance:

echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
...
...
echo performance > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor

Generating the eventlog

To generate the event log, we need to compile the program with the eventlog enabled and run the program setting the -l rts option.

There are multiple ways of doing this.

Using plain GHC:

ghc Main.hs -rtsopts -eventlog
./Main +RTS -l -RTS

Using Cabal:

The .cabal file should contain the following ghc options

ghc-options: -eventlog "-with-rtsopts=-l"

If the -threaded option is used while compiling. You may want to use the -N1 rts option.

Creating windows

Helper function to create windows:

{-# LANGUAGE BangPatterns #-}

import Control.Monad.IO.Class (MonadIO(..))
import Debug.Trace (traceEventIO)

{-# INLINE withTracingFlow #-}
withTracingFlow :: MonadIO m => String -> m a -> m a
withTracingFlow tag action = do
    liftIO $ traceEventIO ("START:" ++ tag)
    !res <- action
    liftIO $ traceEventIO ("END:" ++ tag)
    pure res

We can wrap parts of the flow we want to analyze with withTracingFlow using a tag to help us identify it.

End of Window

You can put the END of the window in different paths but ensure that all paths are covered:

  r <- f x
  case r of
    Just val -> do
      -- _ <- L.runIO $ traceEventIO $ "END:" ++ "window"
      -- Some processing
    Nothing -> do
      -- _ <- L.runIO $ traceEventIO $ "END:" ++ "window"
      -- Some processing

Measurement Overhead

Even when you are measuring an empty block of code there will be some minimum timing and allocations reported because of the measurement overhead.

    _ <- traceEventIO $ "START:emptyWindow"
    _ <- traceEventIO $ "END:emptyWindow"

The timing is due to the time measurement system call itself. The allocations are due to the traceEventIO haskell code execution. TODO: fix the allocations.

Measurement with Lazy Evaluation

If we want to measure the cost of the lookup in the code below we need to evaluate it right there:

    m <- readIORef _configCache
    return . snd $ SimpleLRU.lookup k m

For correct measurement use the following code:

    m <- readIORef _configCache
    _ <- traceEventIO $ "START:" ++ "mapLookup"
    let !v = HM.lookup k m
    _ <- traceEventIO $ "END:" ++ "mapLookup"
    return v

Labelling Threads

We should label our threads to identify the thread to scrutinize while reading the stats.

For example,

To scrutinize the main thread:

import GHC.Conc (myThreadId, labelThread)

main :: IO ()
main = do
    tid <- myThreadId
    labelThread tid "main-thread"
    withTracingFlow "main" $ do
       ...

To scrutinize the server thread in warp we can use the following middleware:

eventlogMiddleware :: Application -> Application
eventlogMiddleware app request respond = do
    tid <- myThreadId
    labelThread tid "server"
    traceEventIO ("START:server")
    app request respond1

    where

    respond1 r = do
        res <- respond r
        traceEventIO ("END:server")
        return res

We can use eventlogMiddleware as the outermost layer.

Reading the results

We get a lot of output currently. We are in the process of simplifying the statistics and making the details controllable via options.

Currently, the program prints a lot of information. It's essential to understand what to ignore given the use case.

The use-case we assume is: Understand the window CPU time and Thread allocated.

Consider the following program:

{-# LANGUAGE BangPatterns #-}

import Control.Monad (unless)
import Control.Monad.IO.Class (MonadIO(..))
import Debug.Trace (traceEventIO)
import GHC.Conc (myThreadId, labelThread)

{-# INLINE withTracingFlow #-}
withTracingFlow :: MonadIO m => String -> m a -> m a
withTracingFlow tag action = do
    liftIO $ traceEventIO ("START:" ++ tag)
    !res <- action
    liftIO $ traceEventIO ("END:" ++ tag)
    pure res

{-# INLINE printSumLoop #-}
printSumLoop :: Int -> Int -> Int -> IO ()
printSumLoop _ _ 0 = print "All Done!"
printSumLoop chunksOf from times = do
    withTracingFlow "sum" $ print $ sum [from..(from + chunksOf)]
    printSumLoop chunksOf (from + chunksOf) (times - 1)

main :: IO ()
main = do
    tid <- myThreadId
    labelThread tid "main-thread"
    withTracingFlow "main" $ do
         printSumLoop 10000 1 100

The statics gleaned from the eventlog of the above program will look like the following:

--------------------------------------------------
Summary Stats
--------------------------------------------------

Global thread wise stat summary
tid       label samples ThreadCPUTime ThreadAllocated
--- ----------- ------- ------------- ---------------
  1 main-thread       2       967,479         434,384
  2           -       1         5,854          17,664

  -           -       3       973,333         452,048


Window [1:main] thread wise stat summary
ProcessCPUTime: 1,174,455
ProcessUserCPUTime: 0
ProcessSystemCPUTime: 1,175,000

ThreadCPUTime:934,898
GcCPUTime:0
RtsCPUTime:239,557
tid       label samples ThreadCPUTime ThreadAllocated
--- ----------- ------- ------------- ---------------
  1 main-thread       1       934,898         429,952

  -           -       1       934,898         429,952


Window [1:sum] thread wise stat summary
ProcessCPUTime: 953,862
ProcessUserCPUTime: 0
ProcessSystemCPUTime: 949,000

ThreadCPUTime:833,991
GcCPUTime:0
RtsCPUTime:119,871
tid       label samples ThreadCPUTime ThreadAllocated
--- ----------- ------- ------------- ---------------
  1 main-thread     100       833,991         328,224

  -           -     100       833,991         328,224


--------------------------------------------------
Detailed Stats
--------------------------------------------------

Window [1:main] thread wise stats for [ThreadCPUTime]
tid       label   total count     avg minimum maximum stddev
--- ----------- ------- ----- ------- ------- ------- ------
  1 main-thread 934,898     1 934,898 934,898 934,898      0


Grand total: 934,898

Window [1:main] thread wise stats for [ThreadAllocated]
tid       label   total count     avg minimum maximum stddev
--- ----------- ------- ----- ------- ------- ------- ------
  1 main-thread 429,952     1 429,952 429,952 429,952      0


Grand total: 429,952

Window [1:sum] thread wise stats for [ThreadCPUTime]
tid       label   total count   avg minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
  1 main-thread 833,991   100 8,340   5,533  63,493  5,714


Grand total: 833,991

Window [1:sum] thread wise stats for [ThreadAllocated]
tid       label   total count   avg minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
  1 main-thread 328,224   100 3,282   2,960  31,584  2,844


Grand total: 328,224

Global thread wise stats for [ThreadCPUTime]
tid       label   total count     avg minimum maximum  stddev
--- ----------- ------- ----- ------- ------- ------- -------
  1 main-thread 967,479     2 483,740  33,519 933,960 450,220
  2           -   5,854     1   5,854   5,854   5,854       0


Grand total: 973,333

Global thread wise stats for [ThreadAllocated]
tid       label   total count     avg minimum maximum  stddev
--- ----------- ------- ----- ------- ------- ------- -------
  1 main-thread 434,384     2 217,192   4,920 429,464 212,272
  2           -  17,664     1  17,664  17,664  17,664       0


Grand total: 452,048

From the Global thread wise stat summary under Summary Stats figure out the thread id we want to scrutinize. In this case, we care about the main-thread. The thread id is 1.

We can skip to the Detailed Stats section.

We want to look at all the windows we want to scrutinize that run in the main-thread. The windows in the above program are main and sum. The thread id is prepended to the windows. So we want to look at sections corresponding to [1:main] and [1:sum].

That is,

Window [1:main] thread wise stats for [ThreadCPUTime]
tid       label   total count     avg minimum maximum stddev
--- ----------- ------- ----- ------- ------- ------- ------
  1 main-thread 934,898     1 934,898 934,898 934,898      0


Grand total: 934,898

Window [1:main] thread wise stats for [ThreadAllocated]
tid       label   total count     avg minimum maximum stddev
--- ----------- ------- ----- ------- ------- ------- ------
  1 main-thread 429,952     1 429,952 429,952 429,952      0


Grand total: 429,952

Window [1:sum] thread wise stats for [ThreadCPUTime]
tid       label   total count   avg minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
  1 main-thread 833,991   100 8,340   5,533  63,493  5,714


Grand total: 833,991

Window [1:sum] thread wise stats for [ThreadAllocated]
tid       label   total count   avg minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
  1 main-thread 328,224   100 3,282   2,960  31,584  2,844

Consider one specific section,

Window [1:sum] thread wise stats for [ThreadCPUTime]
tid       label   total count   avg minimum maximum stddev
--- ----------- ------- ----- ----- ------- ------- ------
  1 main-thread 833,991   100 8,340   5,533  63,493  5,714

This section is a table. It has 8 columns. It can have multiple rows. We should only scrutinize the row where the tid matches main-thread. ie. tid == 1.

The granularity of ThreadCPUTime is in nanoseconds and ThreadAllocated is in bytes.

Columns:

  • tid: The thread id
  • label: The thread label
  • total: The total accumulated sum of all the samples
  • count: Number of samples or the times this window is seen
  • avg: The average size of the samples
  • minimum: The minimum of all the samples
  • maximum: The maximum of all the samples
  • stddev: The standard deviation of the samples

NOTE: It is important to look at stddev. If stddev is more than 30% of the average and if the difference between the minimum and maximum is too much, the average might have unecessary outliers. In the future we would like to remove outliers automatically.

haskell-perf's People

Contributors

adithyaov avatar harendra-kumar avatar imviv3kshukla avatar rnjtranjan avatar

Stargazers

 avatar  avatar

Watchers

 avatar  avatar  avatar

haskell-perf's Issues

Do not reuse the timestamp eventlog field for counters

Instead, if we are not collecting the timestamp we can fill it with 0 value and use a separate field for the counter value. That way we can keep the timestamp optional but still be compatible with standard eventlog format. Also, when needed we can collect the timestamp as well e.g. for debugging purposes.

Analyze between two arbitrary points irrespective of thread-id

Currently we treat a window specific to a thread, the same thread starts and closes a window. In general, we can allow any thread to open a window and any other thread to close it. As long as we know that the window is not concurrently used by multiple threads we can use successive START and END markers as start and end of the window irrespective of which thread logged it.

Accumulating events instead of logging

  • Logging events too often for a tiny time window may add much overhead.
  • We can instead accumulate the events in memory using the profiling infra
  • Control the event logging via SCC annotations - log periodically instead of every time

We can use event logging for micro-benchmarking as well.

Report the number of threads in different states at any given time

In a concurrent program we want to drive the CPU utilization as high as possible and task latencies as low as possible. If there are tasks that are on the CPU for long time, they can cause latency issues. Usually Haskell threads yield on allocations, but if there are tight loops without allocation or FFI calls, threads can run for longer times.

To study latency behavior in concurrent programs we want to monitor the thread runqueue length. If all the threads are blocked for resources for long durations keeping runq empty and CPU idle, that may indicate an issue. If the runq is idle and busy in bursts that may indicate a problem.

To drive latencies lower, IO bound tasks that may run for short durations on CPU and then block for IO should be given priority compared to longer CPU bound tasks that could run while the IO bound tasks are blocked. But not sure if there is a simple way to achieve roughly something like this. Keeping the time slice lower is the only simple way. But we can at least measure if this is a problem.

GHC eventlog based stats todo list

Changes to GHC, higher pri:

  • Use perf counters in the user defined window as well
  • Add a precise allocation counter
  • Add a OS process level CPU counter as well to account for entire program time, this counter should not be stopped during the entire window. This would be a window only counter.
  • flush the event log buffer before and after the thread run if beyond halfway mark

Changes to GHC, Lower pri:

  • Add more relevant perf counters (esp. HW events)
  • Add pinned allocations, unpinned allocations, and large allocations counters
  • Enable the counter measurement code only when +RTS -l is specified (put code under eventlog_enabled)
  • Put the code under ifdef LINUX_PERF_EVENT
  • traceEventIO adds a lot of allocations, see if we can reduce that
  • Use a unique id for each start/stop event pair for better correlation in case an event is missed
  • Select the enabled counters via RTS options
  • Support multiple counters at the same time
  • Select a particular thread label to trace, to reduce eventlog traffic
  • Select to log only the perf events

Changes to the eventlog parser:

  • Print periodically e.g. every second instead of on each event
  • Print after collecting n events
  • activate a window across all threads, account the time of all haskell threads during the window

Use config/CLI options to control behavior

  • Maximum number of threads to show in a report
  • select counters to show
  • select the windows to show
  • select columns to show
  • select thread-ids to show
  • select threads whose total is above a certain threshold
  • select whether to show detailed reports or not
  • choose a sorting field
  • Collapsing stats of the same window name from different threads
  • show foreign windows or not

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.