Comments (5)
Most uops do not change the frame, so the form

```c
stack_pointer = _UOP_func(tstate, frame, stack_pointer, [oparg]);
```

works. If the frame is changed, then we can change the form to

```c
frame = _UOP_func(tstate, frame, stack_pointer, [oparg]);
stack_pointer += ...;
```
from cpython.
Some experimentation notes:
Moving just _INIT_CALL_PY_EXACT_ARGS gives a 1% overall speedup. When I try to move more functions than that, I'm struggling to find any further improvement.
- Moving all uops greater than 200 bytes (on x86) has a net-zero effect. The 39 uops greater than 200 bytes:
['_LIST_EXTEND', '_CONTAINS_OP_DICT', '_CALL_METHOD_DESCRIPTOR_NOARGS', '_INIT_CALL_PY_EXACT_ARGS_0', '_INIT_CALL_PY_EXACT_ARGS_1', '_LOAD_ATTR', '_STORE_SLICE', '_BUILD_SLICE', '_BINARY_SUBSCR_LIST_INT', '_STORE_SUBSCR_LIST_INT', '_CALL_BUILTIN_FAST', '_COMPARE_OP_INT', '_BINARY_SUBSCR_DICT', '_BINARY_SUBSCR_STR_INT', '_BINARY_SUBSCR_TUPLE_INT', '_CALL_ISINSTANCE', '_CALL_LEN', '_CONTAINS_OP_SET', '_CALL_BUILTIN_CLASS', '_CALL_BUILTIN_O', '_CALL_METHOD_DESCRIPTOR_FAST', '_CALL_METHOD_DESCRIPTOR_O', '_COMPARE_OP', '_FOR_ITER_TIER_TWO', '_INIT_CALL_PY_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS_2', '_INIT_CALL_PY_EXACT_ARGS_3', '_INIT_CALL_PY_EXACT_ARGS_4', '_LOAD_GLOBAL', '_STORE_SUBSCR', '_BUILD_MAP', '_LOAD_SUPER_ATTR_METHOD', '_BUILD_STRING', '_CALL_BUILTIN_FAST_WITH_KEYWORDS', '_BUILD_CONST_KEY_MAP', '_CALL_METHOD_DESCRIPTOR_FAST_WITH_KEYWORDS', '_GET_ANEXT', '_STORE_NAME', '_COMPARE_OP_FLOAT']
- Moving all uops greater than 300 bytes (on x86) has a net-zero effect. The 16 uops greater than 300 bytes:
['_CALL_METHOD_DESCRIPTOR_NOARGS', '_BUILD_SLICE', '_CALL_BUILTIN_FAST', '_BINARY_SUBSCR_STR_INT', '_CALL_ISINSTANCE', '_CALL_LEN', '_CALL_BUILTIN_CLASS', '_CALL_BUILTIN_O', '_CALL_METHOD_DESCRIPTOR_FAST', '_CALL_METHOD_DESCRIPTOR_O', '_INIT_CALL_PY_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS_4', '_LOAD_GLOBAL', '_LOAD_SUPER_ATTR_METHOD', '_CALL_BUILTIN_FAST_WITH_KEYWORDS', '_CALL_METHOD_DESCRIPTOR_FAST_WITH_KEYWORDS']
Moving all uops greater than 350 bytes also has a net-zero effect. (A threshold of 400 bytes would be equivalent to _INIT_CALL_PY_EXACT_ARGS alone, which we've already seen improves speed by 1%.)
Looking at both code size and execution counts together may be a better metric:
The dot on the far right is _INIT_CALL_PY_EXACT_ARGS.
As a first rough cut, I tried taking the uops that are in the top 25%-ile in code size and the bottom 25%-ile in frequency, which gives us a set of 10 opcodes. This also has a net-zero effect on runtime.
The 10 uops in the 25%-ile cut on both size and execution counts:
['_CALL_METHOD_DESCRIPTOR_O', '_INIT_CALL_PY_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS_3', '_BUILD_MAP', '_LOAD_SUPER_ATTR_METHOD', '_BUILD_STRING', '_CALL_BUILTIN_FAST_WITH_KEYWORDS', '_LOAD_ATTR_MODULE', '_BUILD_CONST_KEY_MAP', '_STORE_NAME']
I have a couple of theories about why none of this is working. Perhaps the code size has to be pretty large before the overhead of the function call becomes worth it (and _INIT_CALL_PY_EXACT_ARGS is a real outlier in code size).
Also, I wonder if part of what is "hurting" with these additional functions is that they have exits (they contain DEOPT_IF and/or EXIT_IF), unlike _INIT_CALL_PY_EXACT_ARGS. Since the function cannot just jump to a label in the caller (as would normally happen in a uop), I was handling this by returning 0, 1, or 2 from the function and handling the result in the caller:
```c
switch (result) {
    case 0:
        break;
    case 1:
        JUMP_TO_JUMP_TARGET();
    case 2:
        JUMP_TO_ERROR();
}
```
(and only including the cases that were actually used in each uop). This means that each exit costs an additional branch ... I don't know how much that matters. There may be a better "trick" to make this work -- I have to admit that low-level C tricks aren't my forte. If there's a better way to "jump directly" somehow, let me know.
The "simple" (non-exiting) uops are marked in yellow here:
I thought I would try only including uops that don't have exits, while not worrying as much about code size. If I take non-exiting uops with execution counts of less than 10^8 and code sizes larger than 100 bytes, I get: ['_INIT_CALL_BOUND_METHOD_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS', '_COPY_FREE_VARS', '_SET_FUNCTION_ATTRIBUTE', '_COMPARE_OP_FLOAT']. That is 1% faster, and notably is faster on every interpreter-heavy benchmark.
@gvanrossum observed that the main thing that bulks out _INIT_CALL_PY_EXACT_ARGS is the static inlining of _PyFrame_PushUnchecked. If all of this experimentation still points to "only externalizing _INIT_CALL_PY_EXACT_ARGS helps", it probably makes sense to just not inline _PyFrame_PushUnchecked in the context of compiling the JIT templates, which could probably be done with a couple of lines of C-preprocessor hackery. That would be much simpler than modifying the generator to generate these "external" uops. UPDATE: This is not noticeably faster.
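For concreteness, the "couple of lines of C-preprocessor hackery" could look something like the fragment below. This is purely a hypothetical sketch: the `_JIT_TEMPLATE_BUILD` guard macro does not exist in CPython, and the rename trick is only one of several ways to shadow a static-inline helper with an out-of-line definition.

```c
/* Hypothetical build-time switch for the JIT template translation unit.
   _JIT_TEMPLATE_BUILD is an invented macro name, not a real CPython one. */
#ifdef _JIT_TEMPLATE_BUILD
/* Rename the static-inline definition out of the way before including
   the header, so calls below don't bind to it... */
#define _PyFrame_PushUnchecked _PyFrame_PushUnchecked_inline_unused
#include "pycore_frame.h"
#undef _PyFrame_PushUnchecked
/* ...and declare an out-of-line copy, compiled once into the runtime. */
extern _PyInterpreterFrame *_PyFrame_PushUnchecked(
    PyThreadState *tstate, PyFunctionObject *func, int null_locals_from);
#else
#include "pycore_frame.h"
#endif
```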
Conclusion: It seems like externalizing these uops makes things the fastest of anything I tried (this is the set of non-exiting uops that are not too frequently used and are bigger than double the size of the code for making the function call): ['_INIT_CALL_BOUND_METHOD_EXACT_ARGS', '_INIT_CALL_PY_EXACT_ARGS', '_COPY_FREE_VARS', '_SET_FUNCTION_ATTRIBUTE', '_COMPARE_OP_FLOAT'].
There may be a trick to making uops-with-exits fast enough to externalize as well (@brandtbucher, thoughts?). That could also be left as follow-on work.
> This means that each exit has an additional branch ... I don't know how much that matters. There may be a better "trick" to make this work -- I have to admit that low-level C tricks aren't my forte. If there's a better way to "jump directly" somehow let me know.
Haven't thought about it too much, but one option could be to pass _JIT_CONTINUE and _JIT_JUMP_TARGET/_JIT_ERROR_TARGET into the function, and it could return one, which is jumped to:

```c
PyAPI_DATA(void) _JIT_CONTINUE;
PyAPI_DATA(void) _JIT_EXIT_TARGET;

jit_func next = _Py_FOO_func(tstate, frame, stack_pointer, CURRENT_OPARG(),
                             &_JIT_CONTINUE, &_JIT_EXIT_TARGET);
stack_pointer += ...;
__attribute__((musttail))
return ((jit_func)next)(frame, stack_pointer, tstate);
```
At @brandtbucher's suggestion, I instead tried large uops that are more frequent, giving the set _INIT_CALL_PY_EXACT_ARGS, _COMPARE_OP_FLOAT (there just aren't that many large ones without exits). This, unfortunately, made no net change.
I think the next thing to tackle will be a faster way to handle exits, which should make the set of uops that we can externalize larger.