Comments (14)
It is only meaningful for `Dot` and `Norm`, and for those benchmarks it reduces overhead in the smallest case. In the tables below, columns are the number of threads and rows give the vector size and allocation strategy:
| Dot | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| 128, std::vector | 1.973E+00 | 1.973E+00 | 1.975E+00 | 2.094E+00 | 2.410E+00 |
| 128, alloca | 1.559E+00 | 1.525E+00 | 1.590E+00 | 1.719E+00 | 2.243E+00 |

| Norm | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| 128, std::vector | 1.659E+00 | 1.799E+00 | 1.867E+00 | 1.994E+00 | 2.291E+00 |
| 128, alloca | 1.328E+00 | 1.356E+00 | 1.407E+00 | 1.543E+00 | 1.829E+00 |
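For reference, here is a minimal sketch of the two allocation strategies being compared, assuming the benchmark needs a small per-call scratch buffer of per-thread partial sums (the function names and the accumulation pattern are illustrative, not the actual benchmark code):

```cpp
#include <alloca.h>  // non-standard, but available on most POSIX systems
#include <algorithm>
#include <numeric>
#include <vector>

// Heap-allocated scratch: std::vector pays for an allocation on every
// call, which dominates when the reduced vector is tiny (e.g. 128).
double DotWithVector(const double* x, const double* y, int n,
                     int num_threads) {
  std::vector<double> partial(num_threads, 0.0);
  for (int i = 0; i < n; ++i) partial[i % num_threads] += x[i] * y[i];
  return std::accumulate(partial.begin(), partial.end(), 0.0);
}

// Stack-allocated scratch: alloca avoids the heap entirely; safe only
// because num_threads is small and bounded.
double DotWithAlloca(const double* x, const double* y, int n,
                     int num_threads) {
  double* partial =
      static_cast<double*>(alloca(num_threads * sizeof(double)));
  std::fill(partial, partial + num_threads, 0.0);
  for (int i = 0; i < n; ++i) partial[i % num_threads] += x[i] * y[i];
  return std::accumulate(partial, partial + num_threads, 0.0);
}
```

Consistent with the tables above, the stack buffer mainly helps at the smallest sizes, where the allocation is a large fraction of the total work.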

---

I checked out HEAD; the numbers are now very competitive, with 2.2 better than 2.1 until you overspecify the number of threads, and then worse only by a modest amount:
1 thread performance
Ceres 2.1 mean: 1.15 msec std: 552.40 usec (200 points)
Ceres 2.2 mean: 1.07 msec std: 481.22 usec (200 points)
Ceres 2.1 mean: 3.56 msec std: 1.69 msec (1000 points)
Ceres 2.2 mean: 3.54 msec std: 1.61 msec (1000 points)
2 thread performance
Ceres 2.1 mean: 921.03 usec std: 359.30 usec (200 points)
Ceres 2.2 mean: 913.78 usec std: 367.30 usec (200 points)
Ceres 2.1 mean: 2.58 msec std: 1.16 msec (1000 points)
Ceres 2.2 mean: 2.45 msec std: 1.07 msec (1000 points)
4 thread performance
Ceres 2.1 mean: 845.64 usec std: 416.76 usec (200 points)
Ceres 2.2 mean: 770.07 usec std: 354.09 usec (200 points)
Ceres 2.1 mean: 1.96 msec std: 998.91 usec (1000 points)
Ceres 2.2 mean: 1.73 msec std: 725.61 usec (1000 points)
8 thread performance
Ceres 2.1 mean: 801.78 usec std: 407.24 usec (200 points)
Ceres 2.2 mean: 944.09 usec std: 413.59 usec (200 points)
Ceres 2.1 mean: 1.85 msec std: 919.09 usec (1000 points)
Ceres 2.2 mean: 1.96 msec std: 858.14 usec (1000 points)
12 thread performance
Ceres 2.1 mean: 814.86 usec std: 337.59 usec (200 points)
Ceres 2.2 mean: 1.12 msec std: 554.43 usec (200 points)
Ceres 2.1 mean: 1.65 msec std: 564.98 usec (1000 points)
Ceres 2.2 mean: 1.80 msec std: 604.28 usec (1000 points)
16 thread performance
Ceres 2.1 mean: 946.84 usec std: 486.15 usec (200 points)
Ceres 2.2 mean: 1.30 msec std: 527.63 usec (200 points)
Ceres 2.1 mean: 1.63 msec std: 605.31 usec (1000 points)
Ceres 2.2 mean: 1.89 msec std: 590.79 usec (1000 points)
23 thread performance (don't ask why 23 vs 24, nothing to do with ceres)
Ceres 2.1 mean: 1.10 msec std: 428.13 usec (200 points)
Ceres 2.2 mean: 1.60 msec std: 544.80 usec (200 points)
Ceres 2.1 mean: 1.71 msec std: 521.12 usec (1000 points)
Ceres 2.2 mean: 2.25 msec std: 698.23 usec (1000 points)

---

Parallel operations on small vectors are indeed very slow:
```
dkorchemkin@home:~/oss/ceres-solver/build$ numactl -m0 -N0 ./bin/parallel_vector_operations_benchmark
2023-10-06T16:41:41+00:00
Running ./bin/parallel_vector_operations_benchmark
Run on (112 X 3200 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x56)
L1 Instruction 32 KiB (x56)
L2 Unified 1024 KiB (x56)
L3 Unified 39424 KiB (x2)
Load Average: 1.36, 2.46, 2.52
------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------
SetZero/128 12.8 ns 12.8 ns 54510180
SetZero/256 22.8 ns 22.8 ns 30651968
SetZero/1024 51.3 ns 51.3 ns 13566727
SetZero/4096 187 ns 187 ns 3717838
SetZero/16384 2118 ns 2118 ns 330561
SetZero/65536 8445 ns 8445 ns 83189
SetZero/262144 65333 ns 65325 ns 10897
SetZero/1048576 339241 ns 339234 ns 2056
SetZero/4194304 2012458 ns 2012283 ns 398
SetZero/8388608 9881642 ns 9879515 ns 73
SetZeroParallel/128/1 13.8 ns 13.8 ns 50736474
SetZeroParallel/128/2 3673 ns 2876 ns 245485
SetZeroParallel/128/4 10042 ns 8163 ns 83980
SetZeroParallel/128/8 25209 ns 20324 ns 33505
SetZeroParallel/128/16 52716 ns 43483 ns 15548
SetZeroParallel/256/1 26.4 ns 26.4 ns 26584992
SetZeroParallel/256/2 3675 ns 2892 ns 235985
SetZeroParallel/256/4 10160 ns 8234 ns 85836
SetZeroParallel/256/8 25094 ns 20286 ns 33834
SetZeroParallel/256/16 54485 ns 44815 ns 16205
SetZeroParallel/1024/1 54.1 ns 54.1 ns 12529229
SetZeroParallel/1024/2 4131 ns 3314 ns 209834
SetZeroParallel/1024/4 10590 ns 8644 ns 80252
SetZeroParallel/1024/8 26006 ns 20648 ns 33643
SetZeroParallel/1024/16 53349 ns 43928 ns 15431
SetZeroParallel/4096/1 189 ns 189 ns 3673344
SetZeroParallel/4096/2 6014 ns 5135 ns 135995
SetZeroParallel/4096/4 11459 ns 9486 ns 71060
SetZeroParallel/4096/8 25476 ns 20449 ns 32472
SetZeroParallel/4096/16 53848 ns 44105 ns 15512
SetZeroParallel/16384/1 2127 ns 2127 ns 327985
SetZeroParallel/16384/2 10320 ns 9189 ns 75872
SetZeroParallel/16384/4 13274 ns 11633 ns 58642
SetZeroParallel/16384/8 26719 ns 21349 ns 34300
SetZeroParallel/16384/16 55862 ns 46483 ns 15580
SetZeroParallel/65536/1 8474 ns 8474 ns 81212
SetZeroParallel/65536/2 24117 ns 19293 ns 35435
SetZeroParallel/65536/4 20482 ns 18834 ns 36977
SetZeroParallel/65536/8 26524 ns 21256 ns 32848
SetZeroParallel/65536/16 54782 ns 44951 ns 15322
SetZeroParallel/262144/1 64730 ns 64728 ns 10774
SetZeroParallel/262144/2 73088 ns 64160 ns 9716
SetZeroParallel/262144/4 57310 ns 51761 ns 10000
SetZeroParallel/262144/8 40152 ns 35538 ns 18878
SetZeroParallel/262144/16 59208 ns 47971 ns 14780
SetZeroParallel/1048576/1 340296 ns 340293 ns 2059
SetZeroParallel/1048576/2 271003 ns 250716 ns 2538
SetZeroParallel/1048576/4 183528 ns 168460 ns 4151
SetZeroParallel/1048576/8 112738 ns 102182 ns 6854
SetZeroParallel/1048576/16 83996 ns 73060 ns 9545
SetZeroParallel/4194304/1 1693640 ns 1693623 ns 439
SetZeroParallel/4194304/2 921791 ns 909744 ns 814
SetZeroParallel/4194304/4 772264 ns 701767 ns 998
SetZeroParallel/4194304/8 427478 ns 394644 ns 1785
SetZeroParallel/4194304/16 263210 ns 239946 ns 2869
SetZeroParallel/8388608/1 9869803 ns 9867998 ns 71
SetZeroParallel/8388608/2 5065903 ns 5059889 ns 140
SetZeroParallel/8388608/4 2617067 ns 2541684 ns 272
SetZeroParallel/8388608/8 1348521 ns 1329147 ns 535
SetZeroParallel/8388608/16 877319 ns 852327 ns 816
```
Note that (for small vector sizes, say 128) the time scales roughly linearly with the number of threads (2, 4, 8, ...), which suggests that this is the overhead of launching tasks via the thread pool. (The parallel benchmark with 1 thread measures the cost of going through the `ParallelFor` invocation to the `InvokeSerial` call.)
Adding a threshold on the number of vector elements (https://ceres-solver-review.googlesource.com/c/ceres-solver/+/25280) seems to solve this problem (roughly the idea sketched after the list below); here are the benchmark results: https://docs.google.com/spreadsheets/d/1qfxrJ1MhzEFmyhWYWN4RY3EJcoJF6d6nv4oJkdK_EaA/edit#gid=1902184755
For each sub-table:
- Horizontal axis: number of threads
- Vertical axis: vector size
- Value: ratio of time consumed by the parallel and serial versions of the benchmark (less = better)
- Cell color: green = 0.5, red = 2.0 (green = good)
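The idea, roughly (a sketch, not the actual change; `kMinVectorSizeForParallelism` and `ParallelForBackend` are made-up names, and the real code goes through Ceres's `ParallelFor`):

```cpp
#include <functional>

// Placeholder backend so the sketch is self-contained; imagine this is
// Ceres's ParallelFor splitting [start, end) across the thread pool.
static void ParallelForBackend(int start, int end, int /*num_threads*/,
                               const std::function<void(int)>& f) {
  for (int i = start; i < end; ++i) f(i);
}

// Below a size threshold, the cost of launching num_threads tasks dwarfs
// the work itself, so take the serial path and never touch the pool.
constexpr int kMinVectorSizeForParallelism = 1 << 16;  // assumed value

void SetZeroWithThreshold(double* x, int size, int num_threads) {
  if (num_threads == 1 || size < kMinVectorSizeForParallelism) {
    for (int i = 0; i < size; ++i) x[i] = 0.0;  // serial fast path
  } else {
    ParallelForBackend(0, size, num_threads, [x](int i) { x[i] = 0.0; });
  }
}
```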

---

It seems to me that alloca doesn't really help, or am I missing something?

---

Another problem is lock contention in `ParallelFor`, where `num_threads` tasks are pushed into the queue of `ThreadPool` one after another by the main thread (while `ThreadPool` starts executing tasks, popping them from the queue).
I think we should put the next task into the queue from the task itself (see the sketch after the pros/cons list below). This way there should almost always be only one thread trying to do something with the `ConcurrentQueue` inside `ThreadPool`.
Pros:
- Main thread starts doing some work almost immediately
- We can create fewer tasks if the synchronization overhead is significantly larger than the amount of work
Cons:
- Does not solve the issue of 2 threads being slower than one
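Here is a self-contained sketch of the proposed chaining, using a deliberately tiny stand-in thread pool rather than Ceres's actual `ThreadPool`/`ConcurrentQueue` (all names and details below are illustrative):

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Minimal stand-in for a thread pool with a single shared task queue.
class ThreadPool {
 public:
  explicit ThreadPool(int n) {
    for (int i = 0; i < n; ++i) workers_.emplace_back([this] { Worker(); });
  }
  ~ThreadPool() {
    {
      std::lock_guard<std::mutex> lock(m_);
      stop_ = true;
    }
    cv_.notify_all();
    for (auto& t : workers_) t.join();
  }
  void AddTask(std::function<void()> f) {
    {
      std::lock_guard<std::mutex> lock(m_);
      queue_.push_back(std::move(f));
    }
    cv_.notify_one();
  }

 private:
  void Worker() {
    for (;;) {
      std::function<void()> f;
      {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
        if (stop_ && queue_.empty()) return;
        f = std::move(queue_.front());
        queue_.pop_front();
      }
      f();
    }
  }

  std::mutex m_;
  std::condition_variable cv_;
  std::deque<std::function<void()>> queue_;
  std::vector<std::thread> workers_;
  bool stop_ = false;
};

// Instead of the main thread pushing num_threads tasks up front (and
// contending on the queue with the workers popping them), each task
// enqueues exactly one successor before doing its share of the work.
void ParallelInvoke(ThreadPool* pool, int num_threads,
                    const std::function<void(int)>& work) {
  std::atomic<int> next_id{1};
  std::mutex m;
  std::condition_variable done;
  int num_finished = 0;  // guarded by m
  std::function<void()> task = [&] {
    const int id = next_id.fetch_add(1);
    if (id + 1 < num_threads) pool->AddTask(task);  // chain the next task
    work(id);
    std::lock_guard<std::mutex> lock(m);
    ++num_finished;
    done.notify_one();
  };
  if (num_threads > 1) pool->AddTask(task);
  work(0);  // the main thread starts doing work almost immediately
  std::unique_lock<std::mutex> lock(m);
  done.wait(lock, [&] { return num_finished == num_threads - 1; });
}

int main() {
  ThreadPool pool(4);
  std::atomic<long> sum{0};
  ParallelInvoke(&pool, 4, [&](int id) { sum += id; });
  std::printf("sum of task ids = %ld\n", sum.load());  // 0 + 1 + 2 + 3 = 6
}
```

Because each task enqueues at most one successor, the queue mutex is almost never contended, and the main thread spends its time in `work(0)` instead of pushing `num_threads` tasks.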
I have added a `SchedulerBenchmark` that sets one value to zero per iteration.
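Something along these lines (a hypothetical reconstruction rather than the actual benchmark source; `ContextImpl::EnsureMinimumThreads`, the internal `ParallelFor` entry point, and the argument grid are assumptions):

```cpp
#include <benchmark/benchmark.h>

#include "ceres/context_impl.h"
#include "ceres/internal/eigen.h"
#include "ceres/parallel_for.h"

// The work per index is a single store, so the benchmark isolates the
// scheduling overhead of ParallelFor itself.
static void SchedulerBenchmark(benchmark::State& state) {
  const int size = static_cast<int>(state.range(0));
  const int num_threads = static_cast<int>(state.range(1));
  ceres::internal::ContextImpl context;
  context.EnsureMinimumThreads(num_threads);
  ceres::internal::Vector x = ceres::internal::Vector::Random(size);
  for (auto _ : state) {
    ceres::internal::ParallelFor(&context, 0, size, num_threads,
                                 [&x](int i) { x[i] = 0.; });
  }
}
BENCHMARK(SchedulerBenchmark)
    ->ArgsProduct({{128, 256, 1024, 4096}, {1, 2, 4, 8, 16}});
BENCHMARK_MAIN();
```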
With the old scheme of launching all tasks from the main thread:
```
---------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------
SchedulerBenchmark/128/1 14.1 ns 14.1 ns 49496153
SchedulerBenchmark/128/2 3965 ns 3093 ns 240173
SchedulerBenchmark/128/4 13162 ns 9754 ns 71478
SchedulerBenchmark/128/8 30643 ns 23366 ns 29614
SchedulerBenchmark/128/16 63694 ns 50700 ns 10000
SchedulerBenchmark/256/1 24.1 ns 24.1 ns 28943598
SchedulerBenchmark/256/2 3878 ns 3069 ns 227498
SchedulerBenchmark/256/4 13293 ns 9939 ns 69817
SchedulerBenchmark/256/8 31117 ns 23662 ns 32640
SchedulerBenchmark/256/16 59503 ns 47098 ns 14910
SchedulerBenchmark/1024/1 56.7 ns 56.7 ns 12048398
SchedulerBenchmark/1024/2 4346 ns 3492 ns 203140
SchedulerBenchmark/1024/4 13487 ns 10324 ns 66736
SchedulerBenchmark/1024/8 30982 ns 23797 ns 33090
SchedulerBenchmark/1024/16 63199 ns 50200 ns 14762
SchedulerBenchmark/4096/1 189 ns 189 ns 3633540
SchedulerBenchmark/4096/2 5932 ns 5141 ns 131884
SchedulerBenchmark/4096/4 14784 ns 11539 ns 61236
SchedulerBenchmark/4096/8 35857 ns 27114 ns 29276
SchedulerBenchmark/4096/16 63934 ns 50643 ns 10000
```
With the proposed scheme:
```
---------------------------------------------------------------------
Benchmark Time CPU Iterations
---------------------------------------------------------------------
SchedulerBenchmark/128/1 14.1 ns 14.1 ns 49323498
SchedulerBenchmark/128/2 2411 ns 1907 ns 362916
SchedulerBenchmark/128/4 3556 ns 2809 ns 243026
SchedulerBenchmark/128/8 4346 ns 3487 ns 200626
SchedulerBenchmark/128/16 5066 ns 4245 ns 169698
SchedulerBenchmark/256/1 24.2 ns 24.2 ns 28960018
SchedulerBenchmark/256/2 2330 ns 1906 ns 388470
SchedulerBenchmark/256/4 3864 ns 3029 ns 219233
SchedulerBenchmark/256/8 4399 ns 3528 ns 195225
SchedulerBenchmark/256/16 5111 ns 4255 ns 161858
SchedulerBenchmark/1024/1 55.9 ns 55.9 ns 12204777
SchedulerBenchmark/1024/2 2541 ns 2083 ns 329807
SchedulerBenchmark/1024/4 3977 ns 3259 ns 222628
SchedulerBenchmark/1024/8 4607 ns 3726 ns 193548
SchedulerBenchmark/1024/16 5031 ns 4175 ns 160285
SchedulerBenchmark/4096/1 188 ns 188 ns 3714433
SchedulerBenchmark/4096/2 4203 ns 3641 ns 188284
SchedulerBenchmark/4096/4 4832 ns 4057 ns 171811
SchedulerBenchmark/4096/8 5605 ns 4469 ns 159093
SchedulerBenchmark/4096/16 6425 ns 5505 ns 126861
```
This makes too-short parallel tasks and too many threads less painful.
Submitted as a separate CL: https://ceres-solver-review.googlesource.com/c/ceres-solver/+/25300

---

@DmitriyKorchemkin shall we get this latter patch in, since it seems like a pure win, and then re-evaluate your first patch?

---

@sandwichmaker yes, I think that we should merge the last one first.
It will at least make the growth with the number of threads less pronounced (but the gap between 1 and 2 threads is still huge).

---

Okay, so the scheduler change is in; let's look at the performance of https://ceres-solver-review.git.corp.google.com/c/ceres-solver/+/25280 on top of HEAD now.

---

@sandwichmaker here are benchmark results with old and new task enqueuing: https://docs.google.com/spreadsheets/d/1Mdg6IqvqZd8cRLw69KAf8UM9WcscQmENBay6QpWflJk/edit?usp=sharing
It significantly improves worst-case performance.
- Sheet `vector-ops.limit.old.json`: old enqueuing, no limit on minimal block size
- Sheets `vector-ops.limit.X.json`: new enqueuing, minimal block size is 2^X (minimal block size = 1 is equivalent to no limit)
- On all sheets, green = 0.5, red = 2.0 (green = faster than single-threaded)
- The rightmost group of columns corresponds to stats aggregated by row
- The bottom-most group of rows corresponds to stats aggregated by column
I think it makes sense to set 2^14..2^20 as a limit; I've chosen 2^16 as a "round" number.
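(For scale, assuming 8-byte doubles: a 2^16-element vector occupies 2^16 × 8 B = 512 KiB, on the order of the 1 MiB per-core L2 of the benchmark machine above, and below the few-hundred-thousand-element sizes at which SetZeroParallel started to beat the serial version.)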

---

@DmitriyKorchemkin I am only seeing one sheet.

---

@sandwichmaker Strange, there should be 8 sheets.
Please find the exported file attached.
2023-10-07 ceres-solver parallel vector ops.ods

---

Ah, it was a Google account issue. From my official Google account I only see one sheet, and on [email protected] I can see all the sheets.

---

Both Roger and Frank have confirmed that @DmitriyKorchemkin's two changes have fixed this issue.