Git Product home page Git Product logo

Comments (14)

cheatfate avatar cheatfate commented on May 25, 2024

Windows GetQueuedCompletionStatus() which used to poll completion events and pause supports only milliseconds resolution, so how you propose to use microseconds and nanoseconds duration in such case?

I understand problem with monotonic clock, but currently library uses most performant primitives which are 5x times faster then its monotonic alternatives.

from nim-chronos.

arnetheduck avatar arnetheduck commented on May 25, 2024

The point is to allow the user to specify the time in an unambiguous way - the precision you get is whatever the platform offers you, at that point. by fixing ad2, we can also offer the same usability advantage in downstream libraries.

When you use these functions, you necessarily have to take into account that these are estimates. For example, if you just blindly repeat a call with a 100ms delay, you'll have less than 10 calls per second regardless of the precision the underlying clock - for example because other events are running at the same time.

I'm curious about that 5x claim also, and how much of the total time of calling for example GetQueuedCompletionStatus the extra 4x make up.

from nim-chronos.

cheatfate avatar cheatfate commented on May 25, 2024

If you check source code of poll() function https://github.com/status-im/nim-asyncdispatch2/blob/master/asyncdispatch2/asyncloop.nim#L316, you will find that call to fastEpochTime() performed 2 times.

First call is used to handle already expired callbacks and calculate timeout for system queue waiter, second call is used also to process expired callbacks after system queue waiter.

Second call is also required because you don't know how much time you spend in system queue waiter (it can be equal to timeout value, but if timeout was not set, it will wait infinite time for FDs events).

from nim-chronos.

arnetheduck avatar arnetheduck commented on May 25, 2024

I understand that the function is called - I'm asking two things:

  • compared to other calls in the loop, what % of time is taken by the clock call? for example, if the clock call takes 0.01% of the total loop time and 99.99% is spent in GetQueuedCompletionStatus, it really doesn't matter if one clock takes 5x the other
  • for the above question, also take into account that this is mostly relevant when there is actually work to do, so you would need to measure when there is work to do.. otherwise, if the loop is idling, again, it doesn't matter
  • where did you get the 5x performance from? on linux, CLOCK_REALTIME and CLOCK_MONOTONIC are near identical, measured in cycles

once you have the answers to those questions, you have to weigh that against the massive improvement in ergonomics:

  • you cannot make a mistake with the unit of time
  • on platforms with finer grained clocks you get access to shorter loop times if you want to (no artificial timing barrier)
  • applications that actually make use of the std lib are not penalized by ugly casts - we're having this problem downstream in nim-eth-p2p and nim-beacon-chain where we need to be careful with clocks or the system breaks down. imagine the embarrassment and massive cost of nimbus freezing on a leap second because we wanted to save a few cycles in an idle loop.
#include <stdint.h>
#include <time.h>

uint64_t rdtsc(){
    unsigned int lo,hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}
#include <stdio.h>


const int iters = 1024*1024;

int main(int a, char** b) {
	while(1) {
	uint64_t start, mid, end;
	uint64_t t = 0;
	struct timespec tt;

	start = rdtsc();

	for (int i = 0; i < iters; ++i) {
		t+=clock_gettime(CLOCK_REALTIME, &tt);
	}

	mid = rdtsc();

	for (int i = 0; i < iters; ++i) {
		t += clock_gettime(CLOCK_MONOTONIC, &tt);
	}

	end = rdtsc();

        printf("%ld %ld\n", (mid-start)/iters, (end-mid)/iters);
	}
}

on my crappy laptop, I see more variation due to thermal throttling than between the two calls..

from nim-chronos.

cheatfate avatar cheatfate commented on May 25, 2024

@arnetheduck first of all i'm making asyncdispatch not for Linux only, so if Linux has advantage on some behavior it doesn't matter for library, because it must produce equal behavior on all OSes.

compared to other calls in the loop, what % of time is taken by the clock call? for example, if the clock call takes 0.01% of the total loop time and 99.99% is spent in GetQueuedCompletionStatus, it really doesn't matter if one clock takes 5x the other

nim-lang/Nim#3909

This is my old benchmarks but data still can be used to understand timer's impact on performance.

from nim-chronos.

arnetheduck avatar arnetheduck commented on May 25, 2024

The benefit is in the hardware, not the operating system, and the point is that you need to use the correct clock.

On windows, that's QueryPerformanceCounters for monotonic and fast relative time - windows clocks are a bit of a mystery because they tend to store hardware time in local time and use 100ns increments generally in their time structures. No idea what kind of conversions happen in the background when you call that filetime function (which is UTC) - but notably, your fastEpochTime and epochTime in nim call the same underlying windows function - clearly the performance difference on windows is not in the clock itself.

However, the benchmarks I see in that post for timers look mostly irrelevant - they measure timing without optimizations (-d:release). The core difference on windows between your timer and the upstream timer is the float conversion - that conversion is unnecessary any way you look at it, and you'll notice a Duration doesn't use it either - and it's the kind of thing made worse by the lack of optimizations.

The second thing is that the incoming value in the API can be converted to whatever underlying clock you want - in fact, it should be. The key is to have a clear, powerful and unambiguous ABI followed up by an efficient implementation that makes the best use of the hardware available. Using milliseconds is strange (not an SI unit for time) and suboptimal any way you look at it.

from nim-chronos.

cheatfate avatar cheatfate commented on May 25, 2024

The actual benefit is not in hardware, i can't use rdtsc because it is not cross-platform. It depends on how OS handles time, is it switches to kernel mode to get counters or not, is it uses some heavy calculations or not.

from nim-chronos.

arnetheduck avatar arnetheduck commented on May 25, 2024

I'm not saying we should use RDTSC - the proposal is to use Duration in the API (for developer friendliness) and a monotonic clock in the implementation (for correctness).. clock_gettime(CLOCK_MONOTONIC, &tt) or QueryPerformanceCounters - there's no perf difference between these and whatever you're using - probably, there's an improvement on windows.

I'm also claiming that the we could use just a normal slow clock and probably wouldn't notice the difference, because the benchmark is kind of irrelevant - when there's no load, it doesn't really matter how many loops per second you can do because you're sleeping most of the time and when when there is load, the timing part is usually dwarfed by actual work (ie a beacon node packet arriving, in our case).. thus I'd focus on correctness and the use of good data structures (I see for example we're using a heap for the timers which seems much more sane than a seq or a linked list - in general) before worrying too much about micro-benchmarks

from nim-chronos.

cheatfate avatar cheatfate commented on May 25, 2024

Now i will show my measurements, to benchmark i have used this code https://gist.github.com/cheatfate/d184c9f11fd49e9d0c75d166bd2d2b05.

So Linux benchmark is (compiled by nim c -d:release:

100_000_000 calls to fastEpochTimeMono() takes 1545765406ns
100_000_000 calls to fastEpochTime() takes     1537696307ns
100_000_000 calls to fastEpochTimeRaw() takes  23333947291ns
100_000_000 calls to fastEpochTimeMono() takes 1538502155ns
100_000_000 calls to fastEpochTime() takes     1537760239ns
100_000_000 calls to fastEpochTimeRaw() takes  23364704565ns
100_000_000 calls to fastEpochTimeMono() takes 1537733110ns
100_000_000 calls to fastEpochTime() takes     1538136763ns
100_000_000 calls to fastEpochTimeRaw() takes  23364710793ns
100_000_000 calls to fastEpochTimeMono() takes 1540042299ns
100_000_000 calls to fastEpochTime() takes     1538359638ns
100_000_000 calls to fastEpochTimeRaw() takes  23374689710ns
100_000_000 calls to fastEpochTimeMono() takes 1539592896ns
100_000_000 calls to fastEpochTime() takes     1538834740ns
100_000_000 calls to fastEpochTimeRaw() takes  23360426822ns
100_000_000 calls to fastEpochTimeMono() takes 1538338091ns
100_000_000 calls to fastEpochTime() takes     1537669102ns
100_000_000 calls to fastEpochTimeRaw() takes  23369929946ns
100_000_000 calls to fastEpochTimeMono() takes 1538175000ns
100_000_000 calls to fastEpochTime() takes     1538563607ns
100_000_000 calls to fastEpochTimeRaw() takes  23368758867ns
100_000_000 calls to fastEpochTimeMono() takes 1540537125ns
100_000_000 calls to fastEpochTime() takes     1546923064ns
100_000_000 calls to fastEpochTimeRaw() takes  23364539543ns
100_000_000 calls to fastEpochTimeMono() takes 1538087034ns
100_000_000 calls to fastEpochTime() takes     1537477042ns
100_000_000 calls to fastEpochTimeRaw() takes  23364063367ns
100_000_000 calls to fastEpochTimeMono() takes 1539868275ns
100_000_000 calls to fastEpochTime() takes     1539015094ns
100_000_000 calls to fastEpochTimeRaw() takes  23364882319ns

Windows benchmark is:

100_000_000 calls to fastEpochTimeMono() takes 1799527000ns
100_000_000 calls to fastEpochTime() takes     346997100ns
100_000_000 calls to fastEpochTimeMono() takes 1766980600ns
100_000_000 calls to fastEpochTime() takes     344437400ns
100_000_000 calls to fastEpochTimeMono() takes 1791302400ns
100_000_000 calls to fastEpochTime() takes     342270300ns
100_000_000 calls to fastEpochTimeMono() takes 1784696100ns
100_000_000 calls to fastEpochTime() takes     337763600ns
100_000_000 calls to fastEpochTimeMono() takes 1805359700ns
100_000_000 calls to fastEpochTime() takes     337840100ns
100_000_000 calls to fastEpochTimeMono() takes 1755878000ns
100_000_000 calls to fastEpochTime() takes     343742100ns
100_000_000 calls to fastEpochTimeMono() takes 1756078800ns
100_000_000 calls to fastEpochTime() takes     341183100ns
100_000_000 calls to fastEpochTimeMono() takes 1763183100ns
100_000_000 calls to fastEpochTime() takes     336385800ns
100_000_000 calls to fastEpochTimeMono() takes 1773583000ns
100_000_000 calls to fastEpochTime() takes     336432200ns
100_000_000 calls to fastEpochTimeMono() takes 1769893700ns
100_000_000 calls to fastEpochTime() takes     333117400ns

from nim-chronos.

cheatfate avatar cheatfate commented on May 25, 2024

As you can see on Windows using fastEpochTimeMono() is 500% slower then fastEpochTime().

from nim-chronos.

cheatfate avatar cheatfate commented on May 25, 2024

Now about Linux results, this is quote from man clock_gettime.

CLOCK_MONOTONIC
              Clock  that  cannot be set and represents monotonic time since some unspecified starting point.  This clock is not affected by discontinuous jumps in the system time (e.g., if the system administrator manually changes
              the clock), but is affected by the incremental adjustments performed by adjtime(3) and NTP.
CLOCK_MONOTONIC_RAW (since Linux 2.6.28; Linux-specific)
              Similar to CLOCK_MONOTONIC, but provides access to a raw hardware-based time that is not subject to NTP adjustments or the incremental adjustments performed by adjtime(3).

As you can see CLOCK_MONOTONIC is also affected to adjustments which can be made by ntpd daemon silently. While CLOCK_MONOTONIC_RAW is not affected by all this changes, but its speed for some reason is 1500+% slower.

from nim-chronos.

cheatfate avatar cheatfate commented on May 25, 2024

So usage of MONOTONIC clock you are proposing is also not totally correct.

from nim-chronos.

arnetheduck avatar arnetheduck commented on May 25, 2024

so from adjtime:

NOTES
       The  adjustment  that  adjtime()  makes  to  the clock is carried out in such a manner that the clock is always monotonically increasing.  Using adjtime() to adjust the time prevents the problems that can be caused for certain applications (e.g., make(1)) by
       abrupt positive or negative jumps in the system time.

the two work together to do (mostly) the right thing. specifically, they don't jump back in time, and don't cause large disruptions in timing, beyond what normally would happen anyway with non-realtime OS's.

for windows, you can try GetTickCount64, or you can try moving the frequency multiplication elsewhere.

all that said, this is not an argument about performance, primarily. you will be calling the clock function only when it doesn't matter, relatively speaking. it's possible to construct benchmarks that focus on the timer, but these will be far removed from reality.

from nim-chronos.

cheatfate avatar cheatfate commented on May 25, 2024

Fixed in #24

from nim-chronos.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.