Comments (23)
Thanks @MatthieuDartiailh for pointing me to this issue. I'm also interested in having a way to get offset easily, cause right now I have to iterate over Bytecode to calculate the offset, which is inefficient.
I read the above comments, from my understanding, the concern is mostly about whether it is appropriate to coupe [Concrete]Bytecode
with the original bytecode. I don't have enough understanding of the library to give a perfect answer, but perhaps we can consider a different approach.
So, we are concerned about having things coupled, how about create a separete class specifically for handling mapping, and leave [Concrete]Bytecode
unchanged.
For example, a class called BytecodeMapper
mapper = BytecodeMapper(my_func.f_code)
bc: Bytecode = mapper.get_bytecode()
cbc: ConcreteBytecode = mapper.get_concrete_bytecode()
# For offset
mapper.find_instr_by_offset(int) # returns a Instr
mapper.find_concrete_instr_by_offset(int) # returns a ConcreteInstr
mapper.get_offset(Union[Instr, ConcreteInstr])
# For lineno
mapper.find_instrs_by_lineno(int) # returns a list of Instr
mapper.find_concrete_instrs_by_lineno(int) # returns a list of ConcreteInstr
mapper.get_lineno(Union[Instr, ConcreteInstr])
You can do whatever you want with the returned Bytecode
object. When get_offset
or get_lineno
is called, it checks whether the passed in instr actually comes from this mapper(aka is
, not ==
). If not, just return None
.
from bytecode.
Yes but we have to make it very clear in the documentation. Furthermore, we could provide methods on the bytecode to update the offset (and lineno) to the value they will take in the synthesized code object.
from bytecode.
I have mixed feelings on this. When I was trying to address the problem is branch offsets expanding to need an additional byte, I found myself wishing many times for a code offset on the instructions.
But here's the thing: Those instructions are logical concepts and the concept does not have an intrinsic size from which you can deduce the offsets. I would use Unicode as an example. There are codepoints with values, but the size of those values depends on whether you encode as 32-bit wchar_t, or Windows UNICODE (ie UCS-2) or UTF-8, or HTML " " (6 characters); yet the codepoint itself is size-less.
I think it's completely reasonable for ConcreteInstr to have a size since it's concrete. But Instr is a logical concept: it has no size, and from there I'm not sure how it can have an offset given that it has no size.
Maybe some helper functions that would make it easier to get this information, because it can be tedious to get back to a ConcreteInstr once you've been manipulating a CFG. I would be very concerned about the information becoming wrong, though: once any instruction is changed then the offsets of anything can change, too. I would think ConcreteInstr.offset() ought to be a method so that it can embed whatever complexity is necessary to deal with that.
from bytecode.
The problem I see with your suggested approach @laike9m is that there is now way to get the new offsets. I still fill that freezing ConcreteBytecode would provide some nice insurance regarding lineno and offsets. Does any of you perform manipulation on ConcreteBytecode instead of Bytecode or ControlFlowGraph ?
from bytecode.
@MatthieuDartiailh By "new offsets" you mean the offset after modifying Bytecode or ConcreteBytecode right? I would suggest not bother doing it, it feels hacky and may have problems? I can not say for other people, maybe someone need the offset to be updated as well, but I just don't feel this to be a legitimate use.
from bytecode.
No currently the instruction do not store their offset so it is lost. What is your use case for the offset ?
from bytecode.
No currently the instructions do not store their offset so it is lost. What is your use case for the offset?
I am instrumenting python programs and offsets will allow me to locate exactly which bytecode my instrumentations correspond too.
from bytecode.
What concerns me is that as soon as you are going to modify the bytecode the offset won't be valid anymore, which means that the offsets would have to be dynamically computed. However the instruction which is the logical level to access this information does not hold a a reference to the surrounding bytecode that can compute the offset.
from bytecode.
You are completely right. However, for my use case, I do need the Actual offsets, i.e., the offsets before I did any instrumentation.
from bytecode.
In that case could you use dis to get the offset ? and do something like:
frozen_b = dis.Bytecode(f)
editable_b = bytecode.bytecode(f)
for frozen_i, editable_i in zip(frozen_b, editable_b):
# do whatever that requires the offset
from bytecode.
Yes, that can be done, of course!
However, what I am trying to understand is, is there any other downsides of having the original offsets as fields along with the Instr objects itself, except that if someone modifies the Instr list of Bytecode objects, the offsets for the modified code will have to be dynamically calculated and may differ from the original offsets.
From my point of view, having some way to have the original offsets within the Instr class or, in an auxiliary data structure (maybe only readable) will enhance the library and will be equivalent to the dis module, information wise.
from bytecode.
If we expose such an attribute I am mostly worried that people may complain about the fact that this attribute becomes out of sync. But I may be over thinking it. I need to look again at how we handle lineno and document it (which suffers from the same kind of issues), but it may be fine. If you want to start a PR feel free to do so. I may be a bit slow to answer but I will be sure to take a look.
from bytecode.
Thank you so much, for your prompt feedback. I get your point, but as this module is written for modifying bytecodes and it is expected that once you modify the bytecodes the original offsets might not hold.
from bytecode.
I will try to come up with the PR soon.
from bytecode.
Good points. Can you/ @MatthieuDartiailh kindly provide your thoughts on the following points:
-
returning an auxiliary data structure when
Bytecode
is generated using theBytecode.from_code
method, mapping Instrs to the "original" offsets ( i.e., the offsets in the code object it is generated from)? -
having a method in ConcreteBytecode, which returns the Current mapping from Instrs to offsets?
from bytecode.
I agree with @SnoopyLane that offset makes sense only on ConcreteBytecode and as already mentioned I share his concern about this getting very easily out of scope. (As I mentioned I feel that lineno suffers from mostly the same problem, Instr has a lineno but lineno is actually determined by the SetLineno meta instruction and I am wondering if lineno should not be limited to ConcreteInstr).
To me it seems that having ConcreteBytecode (and ConcreteInstr) being un-mutable would solve most of those concerns. We could then put as many details as we want in the structure without worrying about it getting out of sync and people would simply have to convert to Bytecode to perform edition. However this would be an API breaking change. Any opinion on this ?
@vstinner @serhiy-storchaka do you have an opinion on this question ?
from bytecode.
Thinking out loud here. This is not a proposal.
I'm trying to imagine how I could use this, and just re-read some of @azadsalam's comments.
If I was trying to instrument bytecodes, I can imagine having an Instr and wanting to know "Where did this come from?". Maybe if an Instr contained a reference to the ConcreteInstr that it was disassembled from? And that ConcreteInstr had an offset?
That would open it's own can of worms though. Once that Instr was assembled back to a ConcreteInstr would you want to know the original ConcreteInstr that it came from? Or the new ConcreteInstr that it was assembled into? Or both?
Somebody slap me if I'm over-thinking this.
I'm not sure what the above could do for @MatthieuDartiailh's point about lineno? I kind of like the idea of Instr not having a lineno and you get it from "Instr.original_concreteinstr.lineno". But lineno is something that we want to carry along when we re-assemble. I feel like there's probably a clean answer for this and yet I'm nowhere close.
As for my own use case for offset in the past: Only as a python dev writing optimization code that uses the Bytecode API (ala Hettinger's make_constants). At a PDB prompt I've stepped through code that removes/adds/changes an Instr, and I want to look at something like a "dis" dump to get a warm fuzzy feeling that my code does what I think. And step through more changes, dump it again. And again. And again. In the past I've done that by writing a little helper method that just compiles to Concrete and dumps to stdout, and I just call that method from the PDB prompt. Really low tech and old school, and I don't even have an example to post because it looks like the helper never made it to a commit.
from bytecode.
That does seem extra complicate to me. If a user do need to keep that kind of information I would say he should convert the code object in whatever we design that hold that information. Then convert it to Bytecode. He can later convert the Bytecode to a new low-level representation if he wants to look at the offset. It feels more and more to me that we just need to expose a way to convert to/from the representation existing in the dis
module to avoid duplicating code.
Regarding lineno, what bothers me currently is that the information is duplicated between the Instr
, and SetLineno
. Note that SetLineno
is also permitted in ConcreteBytecode
. I feel it would be cleaner to:
- enforce the imutability of
ConcreteBytecode
andConcreteInstr
- eliminate
SetLineno
fromConcreteBytecode
- remove
lineno
fromInstr
and only useSetLineno
inBytecode
andControlFlowGraph
from bytecode.
ping @SnoopyLane @azadsalam
Any comment ?
from bytecode.
I do not use ConcereteBytecode and I think freezing the "source"/"original" offsets (the original offsets in the input code object) is a good idea.
For the offsets of a Bytecode after some manipulation, the interface should only allow getting "new" offsets via ConcereteBytecode.
Sorry for replying late.
from bytecode.
After reading the documentation again, I realized that the presence of lineno on Instr
was by design with the SetLineno meta instruction being just a convenience so I am coming back on my suggestion of removing lineno from Instr
. However I may consider a helper function to remove those meta instructions by updating the lineno.
This makes me actually more inclined towards @laike9m idea of having a mapper. To handle the case of looking at offset after a manipulation we could then simply allow to create a mapper from an existing Bytecode (basically through assembly and disassembly). Is anybody interested in trying to implement those ideas ?
from bytecode.
Sorry for the late response on this. I wouldn't care if ConcreteBytecode were frozen. If anything it seems like a good thing to me.
Seems to me that for people that want a greater amount of detail than what dis provides, that ConcreteBytecode is a solution for that. But if you're trying to manipulate bytecodes then ConcreteBytecode doesn't seem that useful: hide it as an internal intermediate representation while doing from_code/to_code.
from bytecode.
FYI: I've just been looking at https://github.com/fabioz/PyDev.Debugger, and see that they have a vendored version of bytecode
with this somewhat addressed: see fabioz/PyDev.Debugger@bcb8a28 (though I confess that I do not fully understand the details).
from bytecode.
Related Issues (20)
- The `Compare` enum is broken starting from Python 3.9
- Python 3.11 support HOT 5
- EXTENDED_ARG + NOP Error HOT 3
- except Exception as e fails on Python 3.11 HOT 1
- Flag inference is too agressive in determining generator HOT 1
- Using special optimization to get rid of recursion limitations when compiling really huge code HOT 1
- stack size calculation issue HOT 5
- Stack size with EXTENDED_ARG HOT 10
- Question: how stable is it to round-trip code? HOT 3
- Why are extended line offsets -127 and 126, and not -128 and 127? HOT 2
- Stack size computation issue HOT 2
- Creating functions through bytecode HOT 3
- Missing tag on 0.12.0 release HOT 1
- Update jump address handling for 3.10 HOT 1
- Bytecode doesn't properly set linenumbers (in corner case in Python 3.10rc1) HOT 2
- Errors in compute stacksize on 3.10 HOT 7
- Please clarify the copyright HOT 4
- Remove `Compare.EXC_MATCH` on Python>=3.9 HOT 1
- Update pre_and_post_stack_effect HOT 1
- Labels don't seem to be handled correctly with Python<3.9 HOT 5
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from bytecode.