windelbouwman / ppci

A compiler for ARM, X86, MSP430, xtensa and more implemented in pure Python

Home Page: https://ppci.readthedocs.io/en/latest/

License: BSD 2-Clause "Simplified" License

Python 89.56% Brainfuck 0.16% LLVM 0.31% C 6.79% HTML 0.01% JavaScript 0.06% Verilog 2.99% Assembly 0.01% Shell 0.03% C++ 0.03% Pascal 0.05%

compiler msp430 xtensa riscv assembler c-preprocessor c-compiler python webassembly arm

Introduction

The PPCI (Pure Python Compiler Infrastructure) project is a compiler written entirely in the Python programming language. It contains front-ends for various programming languages as well as machine code generation functionality. With this library you can generate (working!) machine code using Python, which makes the compiler very easy to explore, extend, etc.

The project contains:

Language frontends for C, Python, Pascal, Basic and Brainfuck
Code generation for several architectures: 6500, arm, avr, m68k, microblaze, msp430, openrisc, risc-v, stm8, x86_64, xtensa
Command line utilities, such as ppci-cc, ppci-ld and ppci-opt
WebAssembly, JVM, OCaml support
Support for ELF, EXE, S-record and hexfile formats
An intermediate representation (IR) which can be serialized in json
The project can be used as a library so you can script the compilation process

Installation

Since the compiler is a python package, you can install it with pip:

$ pip install ppci

Usage

An example of commandline usage:

$ cd examples/linux64/hello-make
$ ppci-cc -c -O1 -o hello.o hello.c
...
$ ppci-ld --entry main --layout linux64.ld hello.o -o hello
...
$ ./hello
Hello, World!

API example to compile C code:

>>> import io
>>> from ppci.api import cc, link
>>> source_file = io.StringIO("""
...  int printf(char* fmt) { }
...  
...  void main() {
...     printf("Hello world!\n");
...  }
... """)
>>> obj = cc(source_file, 'arm')
>>> obj = link([obj])

Example of how to assemble some assembly code:

>>> import io
>>> from ppci.api import asm
>>> source_file = io.StringIO("""section code
... pop rbx
... push r10
... mov rdi, 42""")
>>> obj = asm(source_file, 'x86_64')
>>> obj.get_section('code').data
bytearray(b'[ARH\xbf*\x00\x00\x00\x00\x00\x00\x00')

Example of low-level API usage:

>>> from ppci.arch.x86_64 import instructions, registers
>>> i = instructions.Pop(registers.rbx)
>>> i.encode()
b'['

Functionality

Command line utilities:
- ppci-cc
- ppci-ld
- and many more.
Can be used with tools like make or other build tools.
Language support:
- C
- Pascal
- Python
- Basic
- Brainfuck
- C3 (PPCI's own systems language, intended to address some pitfalls of C)
CPU support:
- 6500, arm, avr, m68k, microblaze, msp430, openrisc, risc-v, stm8, x86_64, xtensa
Support for:
- WebAssembly
- JVM
- OCaml bytecode
- LLVM IR
- DWARF debugging format
File formats:
- ELF files
- COFF PE (EXE) files
- hex files
- S-record files
Uses well known human-readable and machine-processable formats like JSON and XML as its tools' formats.

Documentation

Documentation can be found here:

https://ppci.readthedocs.io/

Warning

This project is in alpha state and not ready for production use!

You can try out PPCI at godbolt.org, a site which offers Web access to various compilers: https://godbolt.org/g/eooaPP


ppci's Issues

docs: Clarify/decide on what goes into docs/*.rst and what - into module docstrings

When reading https://ppci.readthedocs.io/en/latest/reference/codegen/registerallocator.html , I was surprised that the corresponding .rst file is almost empty: https://github.com/windelbouwman/ppci-mirror/blob/master/docs/reference/codegen/registerallocator.rst , and the content comes entirely from the corresponding .py module, https://github.com/windelbouwman/ppci-mirror/blob/master/ppci/codegen/registerallocator.py

It doesn't seem that this pattern is followed consistently, e.g. Peephole doc has quite a bunch of content in .rst: https://github.com/windelbouwman/ppci-mirror/blob/master/docs/reference/codegen/peephole.rst

So, it would be nice to decide which way is best, and follow it.

RFC: Using rebase model for git commits/pull requests

When using "Merge" button for pull requests in github web ui, the default setup is to always perform git merge operation. That means that even small single commit in a PR leads to 2 commits in the repo: the original commit and ~~ugly~~ useless "merge" commit.

Many projects instead use a rebase-based model, where merge commits are never used, and new changes are instead integrated into the main codebase in a seamless (merge-less) manner. The Github UI supports that way as well, it just should be set as the default in project settings, as described here: https://help.github.com/en/articles/configuring-commit-rebasing-for-pull-requests (The recommended setup is that among 3 checkboxes, only "Allow rebase merging" is ticked).

@windelbouwman, I hope you would agree this is useful setup and can make this change.

cc: Tighten up error messages

2019-09-02 15:32:35,652 |    ERROR |       root | Who is this "printf"?
2019-09-02 15:32:35,652 |    ERROR |       root | (hello.c, 3, 5)
File : "hello.c"
    1:int main()
    2:{
    3:    printf("hello world\n");
          ^ Who is this "printf"?

Love the irony, but suggest to make error messages more formal.

ppci-cc: predefined macros __FILE__ and __LINE__ give wrong source locations

A typical use for these macros is the assert mechanism:

If NDEBUG is not defined, assert() typically expands to something like:
#define _assert(exp, __FILE__, __LINE__)

In version 0.5.7, if _assert detects that exp is 0, __FILE__ and __LINE__ do not expand to the location of the expansion (where the macro assert was invoked) but to the location where the macro assert is defined (which is of course useless).

docs: Clarify "python" vs "python3"

It looks like PPCI requires Python3, which is great. It looks like "python -m ppci" syntax is used somewhere in the docs (as suggested in #11), which is also great: https://ppci.readthedocs.io/en/latest/reference/lang/java.html#compile-java-ahead-of-time

However, that uses exactly python, and for most users nowadays that will lead to stack trace ending with:

AssertionError: Needs to be run in python version 3.x

Not exactly user friendly (especially for people not familiar with Python, and #11 suggests to keep those in mind).

So, suggested to use python3 consistently in the commands everywhere.

No way to specify program entry point, ELF hardcodes to 0x40000

D'oh: https://github.com/windelbouwman/ppci-mirror/blob/master/ppci/format/elf/file.py#L219

This in turn leads to crazy workarounds like:

// No idea how to specify entrypoint to ppci linker, so far, entry point is
// at the beginning of the segment, and functions appear in the segment in
// the order of appearance in the source file, even if "apperance" is merely
// a prototype.

If not for inline asm, added just yesterday, it wouldn't be possible to work around that at all (in "single source file" model, I mean).

Update compiler explorer to latest version

This site: https://godbolt.org/ has support for ppci, but an older version.

As a side note, we might add support there to output the AST, which is possible with ppci.

Clarification on the license choice

I apologize in advance if this matter was well discussed already. It might make sense to link/summarize such a discussion from an accessible place then. In the meantime, license doesn't seem to be mentioned in the docs at all: https://ppci.readthedocs.io/en/latest/search.html?q=license

So, what are the ideas behind choosing the BSD license for the project? Who are the expected contributors to the project who would enjoy such a license? Who are the users?

Thanks.

Does this project have any commit guidelines/rules?

The project is absolutely great, I don't know how I missed it previously (I'm aware of https://github.com/ShivamSarodia/ShivyC for example).

But looking at the commit history to get an idea of what's happening, I'm somewhat concerned whether there's good enough change control. E.g., the latest commit as of now, cd69292, is described as "Change C string literal type to array of char.", but contains "469 additions and 267 deletions", with some changes that clearly have nothing to do with "Change C string literal type to array of char", e.g., changes to .hgignore, docs, etc.

So, are there guidelines/rules being followed by the project in that regard? While there's pretty detailed https://ppci.readthedocs.io/en/latest/contributing.html section in the docs, I don't see that matter covered there.

(I apologize in advance if that's not the kind of question the maintainers are interested in discussing. OTOH, I may assure it would be of the utmost importance to (some) contributors ;-) ).

C frontend: Support (basic) inline asm

My understanding is that the C frontend currently doesn't have inline asm support. Which means that any asm must be in a separate source, which means that currently it's not possible to compile a single C source and get an executable with one command. Rounded up to "integer digits" that in turn means "PPCI doesn't really work".

I'd humbly suggest that a race to get following scenario work (ppci-cc hello.c -o hello; ./hello) should be the first task for the project.

And for that, apparently inline asm support is needed. Per #23, it should be implemented in GCC-compatible manner, which of course likely will take time and effort.

As a stop-gap measure, might just introduce adhoc __ppci_asm("mov rax, rdi") to get that syscall() func up and running. (With a social contract that __ppci_asm will be removed once normal asm is implemented, to keep the codebase clean).

RFC: Following principle of drop-in compatibility

The only way Clang got off the ground is by starting to implement various GCC features and options, so people could actually try to use it in real-world projects, find issues, report back, rinse and repeat a thousand times.

The only way PPCI may catch more attention than "oh cool thing, tried once, forgot" is it'll do the same - behave consistently with gcc/clang.

Doing that would not require too large an effort (most of the things are already there), but it would have a ripple effect through the codebase.

For example, it's cool that PPCI has a human-readable .oj object file format. But $CC compiler is expected to produce a platform-standard object file format, interoperable with other tools.

Or ppci-ld is described as taking --layout <layout-file>, -L <layout-file> option. But for linker, -L option means library search path, so should be used (even if reserved) as such.

And speaking of memory layout files, there's a standard GNU ld "linker script" format, so it's cool that PPCI has its own simple format, but switching to (subset) of GNU ld script syntax now will remove unneeded obstacle to adoption later.

Etc, etc.

README: Mention projects/tools PPCI interoperates with

It's pretty good point that PPCI tries to interoperate with other projects, like JVM, LLVM, CastXML, etc.

That should be a pretty good selling point - that PPCI is not just a NIH thing in itself, but actually interoperability platform. Should draw attention to that by explicitly listing projects in README.

ppci-cc: cpp stringify operator # expansions too simplistic

The preprocessor stringify operator # is supported but the resulting expansions are not always correct.
(behaviour for 0.5.7)

#define S(a) #a
S(word) => "word" // correct
S("string") => ""string"" // instead of "\"string\"" (double quotes should be escaped)
S('a') => "'a'" // correct
S("aa\tbb") => ""aa\tbb" // instead of "\"aa\\tbb\"" (backslashes should be escaped)
S( 1) => " 1" instead of "1" // leading and trailing blanks should be ignored
S( 1 2 3 ) => " 1 2 3" instead of "1 2 3" // only one separator is kept between tokens
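
The expected results listed above can be sketched as a small Python function. This is only an illustration of the stringification rules, not the ppci preprocessor code: stringify and _TOKEN are hypothetical names, the argument is taken as raw text rather than a token list, and the input is assumed to be well formed (no unterminated literals).

```python
import re

# Hypothetical helper, not part of ppci: splits raw argument text into
# string literals, character constants and other runs of characters.
_TOKEN = re.compile(r'''
    "(?:\\.|[^"\\])*"      # string literal
  | '(?:\\.|[^'\\])*'      # character constant
  | [^\s"']+               # any other run of non-space characters
''', re.VERBOSE)


def stringify(arg_text):
    """Return the C string literal produced by applying # to arg_text."""
    out = []
    prev_end = None
    for match in _TOKEN.finditer(arg_text):
        # Any amount of whitespace between two tokens becomes one space;
        # leading and trailing whitespace is dropped entirely.
        if prev_end is not None and match.start() > prev_end:
            out.append(" ")
        tok = match.group(0)
        if tok[0] in "\"'":
            # Inside string literals and character constants, escape
            # backslashes first, then double quotes.
            tok = tok.replace("\\", "\\\\").replace('"', '\\"')
        out.append(tok)
        prev_end = match.end()
    return '"' + "".join(out) + '"'


print(stringify('"string"'))
print(stringify(" 1   2   3 "))
```

Escaping is limited to literals on purpose: a backslash outside a string or character constant is copied unchanged.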

ppci-cc: test expressions should not be implicitly cast to int

Test expressions, also called "conditions", are cast to "int" in the C front-end.
This leads to wrong IR code when testing types larger than int.
On an architecture where ints are 32bit, and pointers and longs are 64bit, only the lower 32 bits of pointers and longs are tested.
The problem affects expressions in if, while, do/while, and the operand expressions of ! && ||.
Examples:
FILE *f;
if (f) ...
if (f && f->fd...)
In both tests, f is first cast to "int" and then tested against 0.

A work-around is to explicitly test against NULL/0:
Examples:
FILE *f;
if (f != NULL) ...
if (f != NULL && f->fd)

In semantics.py, removing the int-coercion in all these cases solves most of the problem.
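
To illustrate the effect, here is a minimal Python model of the two behaviours. The function names and the pointer value are made up for the example; the value is chosen so that its lower 32 bits are all zero.

```python
def truthy_after_int_cast(value, int_bits=32):
    """Model of the reported behaviour: truncate to int, then compare with 0."""
    mask = (1 << int_bits) - 1
    return (value & mask) != 0


def truthy_full_width(value):
    """Model of the expected behaviour: compare the full-width value with 0."""
    return value != 0


# Hypothetical 64-bit pointer value whose lower 32 bits are zero.
ptr = 0x7F00_0000_0000

print(truthy_after_int_cast(ptr))  # False: "if (f)" would wrongly skip the branch
print(truthy_full_width(ptr))      # True: what "if (f != NULL)" tests
```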

Typos in code identifiers

Docs for registerallocator module https://ppci.readthedocs.io/en/latest/reference/codegen/registerallocator.html have a typo in a few places, "coalesc". Worse, such a typo is also in a method name: https://ppci.readthedocs.io/en/latest/reference/codegen/registerallocator.html#ppci.codegen.registerallocator.GraphColoringRegisterAllocator.coalesc

Should be fixed nonetheless ;-).

New release (0.5.8)?

It looks like enough functionality landed in PPCI to warrant a new release. (Which is always a chance to announce it and spread the word about the project).

And IMHO, relocatable ELF is pretty big feature to warrant 0.6 version number.

ppci-cc : incorrect IR module generated with -O 2

When compiling the following C code with "ppci-cc -O 2 xxx.c", an unexpected error is reported:

void g(int n);

void f() {
    int i;
    g(i);
}
Here is the progress report :
2020-01-15 22:59:12,590 | INFO | root | ppci 0.5.7 on CPython 3.6.9
2020-01-15 22:59:12,601 | INFO | cbuilder | Starting C compilation (c99)
2020-01-15 22:59:12,607 | INFO | cparser | Parsing finished
2020-01-15 22:59:12,610 | INFO | ccodegen | Finished IR-code generation
2020-01-15 22:59:12,612 | INFO | optimize | Optimizing module main level 2
2020-01-15 22:59:12,614 | ERROR | root | und_i_alloc = undefined is used
2020-01-15 22:59:12,614 | ERROR | root | None
und_i_alloc = undefined is used
I dug a bit and saw that the problem is reported by the IR verifier in irutils.py.
It reports that the instruction "undefined" has been found in the module.
The instruction is apparently inserted by the "mem2reg promotor" during the optimizer pass.
This is why the problem does not occur when compiling with -O 0.
Note that if variable "i" is assigned a value before calling "g", the problem does not happen. If the error is caused by the use of an uninitialized variable, it should be reported in a clearer way.

Promoting the project

As I mentioned in some other tickets, I was totally astonished to have missed the PPCI project for so long, despite actively watching scenes of both hmm, "hobby" compilers, and compilers-in-Python. Poor souls like me should be helped, there should be effort to spread the word about PPCI. This ticket is created to hopefully help to coordinate the effort.

Redesign project logo

At this point the logo is pretty minimalistic. We may require a new project logo.

README: Should start with showing how to use project command line tools instead of API

PPCI is a vast project, containing probably a hundred of APIs. Showing them (even such "basic" as compiling/assembling source) is arguably less interesting than showing that PPCI indeed provides a full cycle of standard compiler workflow, using standard command line invocations.

So, giving instructions how to compile, then ideally, run, a simple C "hello world" would be more useful. I myself however haven't yet figured whether it's possible to link an executable referring to printf ;-). But even if giving instructions how to get IR/asm output, it would be still useful.

ppci-ld: library parameter does not work

The linker ppci-ld has a --library parameter to specify library files to fulfill potential unresolved references.
Although the parameter is checked, it is not passed on to the linker.

Improve/make consistent "python3 -m ppci" behavior

I personally would prefer to not rely on entrypoints created by the installer script (and ideally/eventually, not rely on the need for ppci to be "installed" at all).

So, the first thing I'm greeted is:

$ python3 -m ppci
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/hdd/projects-3rdparty/Python/Python-compilers-for-non-python/ppci-mirror/ppci/__main__.py", line 18, in <module>
    main()
  File "/mnt/hdd/projects-3rdparty/Python/Python-compilers-for-non-python/ppci-mirror/ppci/__main__.py", line 10, in main
    subcommand = sys.argv[1]
IndexError: list index out of range

Now, there's ppci-ld, but:

$ python3 -m ppci ld
...
ModuleNotFoundError: No module named 'ppci.cli.ld'

Suddenly, the linker is under python3 -m ppci link. Let's fix it up and make it consistent, which is of course "ld", per #23 . If there's a strong desire, "link" can stay as an alias, but I'd recommend against it. Let's just clean up such small cases; it's just not worth maintaining "diversification" in such small things, it will only lead to confusion and maintenance overhead.
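
A rough sketch of what a more forgiving entry point could look like. Everything here is hypothetical: run_cc, run_ld, SUBCOMMANDS and ALIASES are placeholders and do not reflect how ppci/__main__.py or ppci.cli is actually organized. It only shows the two behaviours asked for above: a usage message instead of an IndexError, and "ld" as the canonical name with "link" as an optional alias.

```python
import sys


# Hypothetical handlers standing in for the real command implementations.
def run_cc(args):
    print("cc", args)
    return 0


def run_ld(args):
    print("ld", args)
    return 0


SUBCOMMANDS = {"cc": run_cc, "ld": run_ld}
ALIASES = {"link": "ld"}  # optional backwards-compatible alias


def main(argv=None):
    argv = sys.argv[1:] if argv is None else list(argv)
    if not argv:
        # No subcommand given: print usage instead of raising IndexError.
        names = ",".join(sorted(SUBCOMMANDS))
        print("usage: python3 -m ppci {%s} ..." % names, file=sys.stderr)
        return 2
    name = ALIASES.get(argv[0], argv[0])
    handler = SUBCOMMANDS.get(name)
    if handler is None:
        print("unknown subcommand: %s" % argv[0], file=sys.stderr)
        return 2
    return handler(argv[1:])
```

Calling main([]) returns 2 after printing the usage line, and main(["link", "hello.o"]) ends up in the same handler as main(["ld", "hello.o"]).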

8bit addition should not be generated when expecting int result

Hello,
When compiling the following simple function with ppci-cc 0.5.8
int f(char *s) {
    return *s - '0';
}
The generated IR code is
global function i64 f(ptr s) {
  f_block0: {
    i8 load_0 = load s;
    i8 constant = 48;
    i8 op = load_0 - constant;
    i64 typecast = cast op;
    return typecast;
  }
}
The C standard specifies that operands narrower than int are extended to int width before an operation is performed. Optimisations are possible as long as the result is consistent with this rule.
In our case, doing an 8bit subtraction can wrap around and lead to an incorrect result.
For example, if *s is -120 (0x88), an 8bit subtraction of 48 (0x30) yields 0x58, which is +88. After the extension to i64, the function returns +88.
If the 2 operands are first extended to int before the subtraction, the result is correct: -168.
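
The arithmetic above can be checked with a few lines of Python. This is just a model of two's complement wrap-around with made-up helper names, not ppci code.

```python
def wrap_signed(value, bits):
    """Interpret value as a two's complement integer of the given width."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value


def sub_i8_then_extend(a, b):
    """Subtract in 8 bits, then sign-extend (what the generated IR does)."""
    return wrap_signed(a - b, 8)


def extend_then_sub(a, b):
    """Promote both operands to a 32-bit int first, then subtract."""
    return wrap_signed(wrap_signed(a, 32) - wrap_signed(b, 32), 32)


print(sub_i8_then_extend(-120, 48))  # 88
print(extend_then_sub(-120, 48))     # -168
```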

Inconsistent test results when using different runners

Docs at https://ppci.readthedocs.io/en/latest/development.html#running-the-testsuite present three ways to run the testsuite, and the way it's worded there, one can only imagine that they are all equivalent. However, trying them results in different numbers of tests run:

python -m unittest discover -s test . This would be a default way, as it uses builtin Python module. But:

Ran 165 tests in 0.483s

OK (skipped=48)

python -m pytest -v test/ . Quite different result:

====== 1338 passed, 672 skipped, 2 warnings in 16.79s ======

tox -e py3. This gives the biggest coverage:

1473 passed, 537 skipped, 2 warnings in 36.43s

Would be nice to know the reason for discrepancies and do something about them (ideally, make them all run the same amount of tests, vs leaving only 1 way to run them ;-) ).

README: Should mention IR and link to its description

IR is the central part of all this stuff. README should say a few words about IR, and link to docs: https://ppci.readthedocs.io/en/latest/reference/ir.html . But that doc is itself very bare; it should describe IR in more detail and give an example (I mean textual representation first of all).

List of C compilers written in Python

Some of these were mentioned throughout comments, and I guess it makes sense to have a dedicated, visible list. My motivation behind this list is to avoid situation when some party involved is not aware that other similar projects exist, so any work can be a potential duplication of effort. To that end, I @-mention users involved. My apologies if this information is of no interest to you - there's an "Unsubscribed" button on the right in Github UI to stop receiving further notifications.

Decide on who's the target audience of the project

I post this as a first issue to decide in regard to #10 . What is an "improvement" (and what's not) depends largely on who the target audience is.

For example, it's possible to decide that end users are the target audience. But then the README should not contain "frightening" stuff like:

Warning

This project is in alpha state and not ready for production use!

And instead it should contain stuff along the lines of "Drop GCC, drop LLVM, start using PPCI now, we have cookies!"

Or it can be decided that WebAssembly can be a selling point piggybacking on which PPCI could launch into masses. And then README should provide instructions how to build some cute display hack and open it in a popular browser.

The examples can continue, so let me come up with a specific proposal: the target audience should be Python developers. "Developers" definitely, as README itself says that the project is in alpha stage. "Python", because I doubt that many non-Python developers would jump to see another compiler out there. The project should be of most interest to folks who know Python and are curious what can be done using their favorite language.

However, any instruction in the README should also be friendly to folks who aren't (much) familiar with Python. This goal does not contradict the previous paragraph. For example, I'm familiar enough with Python, but I still don't like to "install" something right away (indeed, that's not related to Python and applies to any software). Fortunately, it seems that PPCI is usable from just a git clone. But instructions should e.g. refer to python3 -m ppci cc instead of ppci-cc, because the latter appears only after "installing" it.

Criticism/other ideas are welcome. I anyway wanted to post this, to give a context to other sub-tickets I may post for #10. (Again, based on own experience - I get various suggestions for improvements in my projects READMEs, and half of the time I wonder why they think it would be an improvement.)

ppci-cc: preprocessor does not accept empty macro argument values

#define M(arg) arg
#define CALL(f, ...) (f)(__VA_ARGS__)
void f() {
    M(;) // accepted
    M() // rejected
    CALL(g0); // rejected
    CALL(g1, 1); // accepted
    CALL(g2, 1, 2); // accepted
    CALL(g3, 1, 2, 3); // accepted
}
However, it is legal to invoke a macro with empty arguments (i.e. with no tokens) as long as the number of commas is correct for the macro.
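
A minimal sketch of argument binding that accepts empty arguments. It is not taken from the ppci preprocessor: split_macro_args and bind_args are hypothetical names, the input is the raw text between the parentheses, and commas inside string literals are not handled. Accepting a completely missing variadic part, as in CALL(g0), is modelled on what GCC and Clang accept.

```python
def split_macro_args(text):
    """Split the text between a macro call's parentheses on top-level commas."""
    args, current, depth = [], [], 0
    for ch in text:
        if ch == "," and depth == 0:
            args.append("".join(current).strip())
            current = []
            continue
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        current.append(ch)
    args.append("".join(current).strip())
    return args


def bind_args(params, variadic, text):
    """Map parameter names to argument text; empty arguments are allowed."""
    args = split_macro_args(text)
    if not params and not variadic and args == [""]:
        args = []  # "M()" for a macro without parameters has no arguments
    if variadic:
        if len(args) < len(params):
            raise ValueError("too few macro arguments")
        binding = dict(zip(params, args[:len(params)]))
        binding["__VA_ARGS__"] = ", ".join(args[len(params):])
    else:
        if len(args) != len(params):
            raise ValueError("wrong number of macro arguments")
        binding = dict(zip(params, args))
    return binding


print(bind_args(["arg"], False, ""))          # M()
print(bind_args(["f"], True, "g0"))           # CALL(g0)
print(bind_args(["f"], True, "g3, 1, 2, 3"))  # CALL(g3, 1, 2, 3)
```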

ppci-build: No munging with terminal colors, please

It's hard to believe, but the attached is the result of running ppci-build.

Apparently, ppci-build does something with terminal colors, on the assumption that the terminal has a black background. Of course, there are no grounds for such an assumption. Generally, there's absolutely no reason why a build tool would munge terminal colors. Though if it really itches, using green and red foreground should be fine for both black and white backgrounds. (Of course, users of green backgrounds will loudly disagree.)

ppci-cc : using [] in a function argument makes the compiler crash

When compiling the C code below :

void print_board(register int board[][8]) {
}

ppci-cc raises a Python exception AttributeError: 'ArrayType' object has no attribute 'location'
Here is the progress report :

2020-01-15 23:14:21,486 | INFO | root | ppci 0.5.7 on CPython 3.6.9
2020-01-15 23:14:21,492 | INFO | cbuilder | Starting C compilation (c99)
2020-01-15 23:14:21,493 | INFO | cparser | Parsing finished
AttributeError

This happens during IR code generation when the compiler tries to evaluate the size of the "board" argument. It fails because 1st size is not defined (which is legal for a function argument). It tries to report the error but type.location is apparently not defined.

2 problems here:

[] is apparently not supported in function arguments (the C standard states that an "array of type" argument is treated as a "pointer to type", and therefore int board[][8] is the same as int (*board)[8])
the error reporting itself raises an exception

Outdated package on PyPI: 0.5.6 (latest is 0.5.7)

https://pypi.org/project/ppci/ :

ppci 0.5.6

Would be nice to upload the latest version there.

Testing strategies

This post lists some interesting ideas about compiler testing: https://old.reddit.com/r/Python/comments/eieuld/c_compiler_written_in_python/

Csmith is an example: https://embed.cs.utah.edu/csmith/using.html

The other idea is hypothesis testing: https://hypothesis.works/

Work out some ideas about testing the compiler and document the different options at fuzzing / stress testing.

Issues with ppci-cc backend tree generation

Hello,
I am trying to figure out the effort for writing a backend for a simple CPU. For this I compile simple functions and display the corresponding back-end tree.
For the C function
char *strcpy(char *d, char *s) {
char *save = d;
while (*d++ = *s++);
return save;
}
This produces the following trees (to be matched by the BURG patterns) :
Generation tree : MOVU64vreg5
Generation tree : MOVU64vreg6
Generation tree : MOVU64vreg0phi_s_alloc_0
Generation tree : MOVU64vreg1phi_d_alloc_0
Generation tree : JMP[strcpy_block1:]
Generation tree : MOVU64vreg0phi_s_alloc_0
Generation tree : MOVI8vreg9load_4
Generation tree : MOVU64vreg1phi_d_alloc_0
Generation tree : STRI8(REGU64[vreg1phi_d_alloc_0], REGI8[vreg9load_4])
Generation tree : MOVU64[vreg7](ADDU64(REGU64[vreg0phi_s_alloc_0], CONSTU64[1]))
Generation tree : MOVU64[vreg8](ADDU64(REGU64[vreg1phi_d_alloc_0], CONSTU64[1]))
Generation tree : MOVU64vreg0phi_s_alloc_0
Generation tree : MOVU64vreg1phi_d_alloc_0
Generation tree : CJMPI64[('==', strcpy_block3:, strcpy_block1:)](I8TOI64(REGI8[vreg9load_4]), CONSTI64[0])
Generation tree : MOVU64vreg4retval
Generation tree : JMP[strcpy_epilog:]

This raises some issues (assuming I have correctly understood how all this works) :

Two trees are a copy of a vreg to itself (trees 6 and 8). The backend can check for such cases, but it would be better if these MOVEs were not generated.
This makes a lot of copies... Even if, after code selection, register allocation tries to reuse the real registers, each MOVE will generate instructions whatever the final real registers are (this is reflected in the corresponding x86-64 code).
There is no way in the backend to know when a vreg is no longer needed (last use). This means that the corresponding real register will not be deallocated until the end of the function.
Precise def-use information is not necessary, but the number of times a vreg is used would be nice.
When playing with the -O option, I have seen that with -O 0 the compiler assumes that function arguments come in registers and first saves them on the stack. This is OK, especially for debugging.
But if the target architecture passes the arguments on the stack, there will be a 2nd copy of the arguments. The code generator should use the results of "determine_arg_locations" and, if all args are already on the stack, should not generate another memory copy.
The CJMP instruction has 2 labels (yes/no). This is perfect at IR level. However, when generating code, this often produces useless jumps to the following instruction:
jmp label3
label3: ...
The x86_64 backend does this all the time. The problem is that the backend does not know which label will come just after the CJMP (blocks are not always generated from 0 to n, for example if we reverse loops). A peephole optimizer could remove this, but it would be better not to produce it during code generation.
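
For the last point, a peephole pass of the kind mentioned could be sketched as follows. This is a toy model, not the ppci peephole optimizer: instructions are represented as made-up (kind, operand) tuples, and only unconditional jumps are considered.

```python
def drop_fallthrough_jumps(instructions):
    """Remove unconditional jumps whose target label follows immediately."""
    result = []
    for index, ins in enumerate(instructions):
        if ins[0] == "jmp":
            # Collect every label placed directly after this jump.
            following = set()
            nxt = index + 1
            while nxt < len(instructions) and instructions[nxt][0] == "label":
                following.add(instructions[nxt][1])
                nxt += 1
            if ins[1] in following:
                continue  # control falls through to the target anyway
        result.append(ins)
    return result


code = [
    ("op", "cmp rax, 0"),
    ("jmp", "label3"),
    ("label", "label3"),
    ("op", "ret"),
]
print(drop_fallthrough_jumps(code))
```

A single pass over the input is enough, because removing a jump never changes which labels follow another jump.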

These are several points in a single issue report, but as there is not much documentation I could be wrong in my interpretation.

Anyway I am impressed by the job done! that's Python productivity at work...

WebAssembly as backend

It's interesting WebAssembly is used as frontend, but could it be added support to use it as backend too?

ppci-cc: wrong handling of storage-class when a symbol redeclared

Problem 1 (minor):

static int i;
...
int i;

=> no error is reported, whereas this is a change of linkage (a storage-class transition from "static" to None is forbidden) (C standard section 6.2.2)

Problem 2 (major): classical way of declaring a variable in many include files, and to define it in only one .c file

extern int errno; /* typically in include files */
...
int errno; /* in only one source file */

=> no error is reported, but the definition of "errno" is ignored and not put in the object file.
This means there is no way to define in file 1 a variable which is declared in a file 2 included by file 1. (C standard section 6.2.2)

I need to reconfigure my environment to be able to make pull requests. The fix being local to only one function, I include it below:

lang/c/semantics.py, function check_redeclaration_storage_class

def check_redeclaration_storage_class(self, sym, declaration):
    """ Test if we specified an invalid combo of storage class. """
    old_storage_class = sym.declaration.storage_class
    new_storage_class = declaration.storage_class
    # None == automatic storage class.
    # TSF 49: add (static, None) because linkage cannot change from internal to external
    invalid_combos = [(None, "static"), ("extern", "static"), ("static", None)]
    combo = (old_storage_class, new_storage_class)
    if combo in invalid_combos:
        message = "Invalid redefine of storage class. Was {}, but now {}".format(
            old_storage_class, new_storage_class
        )
        self.invalid_redeclaration(sym, declaration, message)

    # TSF 49: if new storage-class is "extern", keep existing storage-class
    # if not declaration.storage_class:
    #    if sym.declaration.storage_class:
    #        declaration.storage_class = sym.declaration.storage_class
    if new_storage_class == "extern":
        declaration.storage_class = old_storage_class

ppci-cc: cpp #if expression mis-calculated in some cases

In some cases, the result of #if expressions is wrong. I have noticed 2 cases :

&& and || do not return 0 or 1
The following expressions are always false:

#if 7 && 3 == 1 (7 && 3 is evaluated as 3)
#if 7 || 3 == 1 (7 || 3 is evaluated as 7)
#if 0 || 3 == 1 (0 || 3 is evaluated as 3)

Binary operators with same priority are evaluated from right to left

#if 2 * 3 / 2 == 3 is false because 2 * 3 / 2 is evaluated as 2 * (3 / 2) yielding 2

I have fixed the 2 problems with the 2 following (small) changes:

in lang/c/preprocessor.py in _eval_tree(), at the end of evaluation of both || and &&, add the following lines:

    if value:
        value = 1

in lang/c/preprocessor.py in parse_expression(), when parsing a new operator, the following code tests whether it has to be evaluated after or before the previous one:

    if left_associative and (op_prio >= priority):
        pass

It should be:

    if left_associative and (op_prio > priority):
        pass
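
Both rules can be illustrated with a tiny stand-alone evaluator. This is a sketch using precedence climbing and is structured differently from ppci's parse_expression(); all names are hypothetical. It handles only non-negative integer literals, parentheses and a handful of binary operators, and it does not short-circuit.

```python
import re

# Higher number binds tighter; all operators here are left-associative.
PRIORITY = {"||": 1, "&&": 2, "==": 3, "!=": 3, "+": 4, "-": 4, "*": 5, "/": 5}


def tokenize(text):
    return re.findall(r"\d+|\|\||&&|==|!=|[-+*/()]", text)


def apply_op(op, lhs, rhs):
    if op == "||":
        return int(bool(lhs) or bool(rhs))   # always 0 or 1
    if op == "&&":
        return int(bool(lhs) and bool(rhs))  # always 0 or 1
    if op == "==":
        return int(lhs == rhs)
    if op == "!=":
        return int(lhs != rhs)
    if op == "+":
        return lhs + rhs
    if op == "-":
        return lhs - rhs
    if op == "*":
        return lhs * rhs
    return lhs // rhs  # "/" on non-negative operands


def parse_primary(tokens):
    tok = tokens.pop(0)
    if tok == "(":
        value = parse(tokens, 1)
        tokens.pop(0)  # drop the closing ")"
        return value
    return int(tok)


def parse(tokens, min_prio=1):
    lhs = parse_primary(tokens)
    while tokens and tokens[0] in PRIORITY and PRIORITY[tokens[0]] >= min_prio:
        op = tokens.pop(0)
        # Requiring a strictly higher priority on the right-hand side is
        # what makes operators of equal priority group from left to right.
        rhs = parse(tokens, PRIORITY[op] + 1)
        lhs = apply_op(op, lhs, rhs)
    return lhs


def evaluate(text):
    return parse(tokenize(text))


print(evaluate("2 * 3 / 2"))      # 3, i.e. (2 * 3) / 2
print(evaluate("(7 && 3) == 1"))  # 1
```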

RFC: Codebase structure

Just cross-linking to Bitbucket issue: https://bitbucket.org/windel/ppci/issues/9/library-structure

(Following my idea that having stuff on Github will allow for better exposure.)

IR questions/criticism

My attempt to start looking inside the project, not just at the surface of it.

    blob<8:8> alloc_a = alloc 8 bytes aligned at 8;
    ptr addr_a = &alloc_a;
    store i_phi, addr_a;

Why does the "alloc" operation return "an object" instead of a pointer to an object? What are the semantics of this operation, how to reason about it? Read literally, it's return by value - "allocate object somewhere, then return its value". That doesn't make any sense. At best, this requires 3 lines to do a common operation like above, instead of 2 (taking an address is superfluous).

That "allocate somewhere" is also confusing. If it allocates on the stack, why not call it classically, "alloca". Then there could be separate "heapalloc" instruction, if it ever comes to that (you'd definitely need that to high-level'ly compile a language with automatic memory management, like Python).

[umbrella] Improve README

Thanks for merging #8 .

There's still a lot that can be done to improve README. For example, even if I'm a programmer, I find README showing me how to use API a bit boring. If this claims to be a compiler, I want to run it as a compiler (on a command line, passing in my hello-world). So, show me how!

It is, however, not very easy to decide what would be an "improvement"; it depends on many factors. So, let me open this ticket as an umbrella issue to possibly discuss generic things, but otherwise just to link more specific issues from it, with specific decisions made in the focused issues.

#11: Decide on who's the target audience of the project
#12: README: Should start with showing how to use project command line tools instead of API
#13: Elaborate on what's "c3"
#14: docs: Clarify "python" vs "python3"
#15: README: Should mention IR and link to its description
#21: README: Mention projects PPCI interoperates with

Patches using git/github are accepted (vs bitbucket/mercurial)?

While reading https://ppci.readthedocs.io/en/latest/support.html#how-to-submit-a-patch , I noticed that the project requires submitting patches using Mercurial, and following special conventions, like:

Create a hg bookmark, not a branch, and make your change.

(In the majority of VCSes, work would be done on a branch.)

Are contributions via this Github project accepted? It would be nice to clarify this matter.

ppci-cc: impossible to redefine a struct/union/enum tag in inner scopes

C struct/union/enum tags follow the same scope rules as other identifiers. They may be redeclared in an inner scope, and the inner declaration hides the outer ones.
The following legal code is reported as erroneous:

    void f(int n) {
        struct S { int a; } s;
        if (n == 10) {
            struct S { int b; } s; // error "multiple definitions"
            s.b = 1;
        }
        s.a = 2;
    }
4: struct S { int b; } s;
^ Multiple definitions
2020-01-22 00:37:59,248 | INFO | root | ppci 0.5.7 on CPython 3.6.9 on Linux
2020-01-22 00:37:59,254 | INFO | cbuilder | Starting C compilation (c99)
2020-01-22 00:37:59,256 | ERROR | root | Multiple definitions
2020-01-22 00:37:59,257 | ERROR | root | (bug20.c, 5, 5)

Same behaviour for union/enum (which share the tag namespace)
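
As a rough model of the expected rule (not PPCI's actual implementation), tags can be kept in a chain of scopes where a definition only clashes with another definition in the same scope. `TagScope` and its methods are hypothetical names used for this sketch only.

```python
class TagScope:
    """One block scope holding struct/union/enum tags (a shared namespace)."""

    def __init__(self, parent=None):
        self.parent = parent
        self.tags = {}

    def define(self, name, definition):
        # Only a clash inside the *same* scope is a redefinition;
        # a tag in an enclosing scope is simply hidden.
        if name in self.tags:
            raise ValueError("Multiple definitions of tag " + name)
        self.tags[name] = definition

    def lookup(self, name):
        # Walk outwards until the innermost visible definition is found.
        scope = self
        while scope is not None:
            if name in scope.tags:
                return scope.tags[name]
            scope = scope.parent
        raise KeyError(name)


outer = TagScope()
outer.define("S", ("a",))    # struct S { int a; };
inner = TagScope(outer)
inner.define("S", ("b",))    # legal: hides the outer struct S
```

After this, `inner.lookup("S")` sees the inner definition while `outer.lookup("S")` still sees the outer one; a second `outer.define("S", ...)` would be the only case that deserves the "Multiple definitions" error.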

ppci-cc: in x86_64, "long" types are smaller than "int" types

I am trying to compile my own small libc with target x86_64.
I have just realized something weird: sizeof(int) > sizeof(long).
ints are 8 bytes long on x86_64, and I expected longs to be the same, but they are only 4 bytes long.
The C standard requires that sizeof(int) <= sizeof(long), but this is not the case here.
According to the code I have read, it seems that longs are 32-bit on all targets, while the size of ints is defined in the arch.py file of the target.
Fixing this requires that the size of longs be taken as max(4, sizeof(int)) bytes.
I tested this by changing context.py as follows and it works (longs are at least 32 bits, as required by the C standard, but cannot be smaller than ints):

    self.type_size_map = {
        BasicType.CHAR: (1, 1),
        BasicType.UCHAR: (1, 1),
        BasicType.SHORT: (2, 2),
        BasicType.USHORT: (2, 2),
        BasicType.INT: (int_size, int_alignment),
        BasicType.UINT: (int_size, int_alignment),
        BasicType.LONG: (max(int_size, 4), max(int_alignment, 4)),  # changed
        BasicType.ULONG: (max(int_size, 4), max(int_alignment, 4)),  # changed
        BasicType.LONGLONG: (8, 8),
        BasicType.ULONGLONG: (8, 8),
        BasicType.FLOAT: (4, 4),
        BasicType.DOUBLE: (double_size, double_alignment),
        BasicType.LONGDOUBLE: (10, 10),
    }
As a safeguard, the same change should be made for "long long", for future CPUs with 128-bit ints...
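
The ordering rule behind this proposal can be written down in a few lines. This is only a sketch of the idea, with sizes in bytes; `integer_sizes()` is a hypothetical helper and not part of context.py.

```python
def integer_sizes(int_size):
    """Derive long and long long sizes (in bytes) from the target's int size."""
    long_size = max(int_size, 4)        # at least 32 bits, never below int
    long_long_size = max(long_size, 8)  # at least 64 bits, never below long
    return {
        "short": 2,
        "int": int_size,
        "long": long_size,
        "long long": long_long_size,
    }
```

For an 8-byte int this gives an 8-byte long, and for a 2-byte int (as on small 16-bit targets) it still gives a 4-byte long, so short <= int <= long <= long long holds in both cases.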

RFC: improvements for developer debugging

I want the debugging reporting to be as good as possible, to save time while debugging.

Current methods which help developers (e.g. myself):

HTML report with compilation process
debug information in binaries.

This issue requests ideas on how to improve those outputs.

My current list of improvements:

HTML report could include hyperlinks to jump to variable definitions
Table with symbols at the end of the report
More HTML tables with detailed information

Other ideas?

`ppci-cc hello.c -o hello; ./hello` should "just work"

The scenario:

ppci-cc hello.c -o hello; ./hello should "just work" (tm) to give a better out of box experience.

See #24 for an initial discussion about this.

Questions here:

Use a C library?
What C runtime to use?
How to deal with windows, linux, mac support?

ppci-cc: missing pointer arithmetic for operators -, +=, -=

Pointer arithmetic for + is implemented (pointer + integral or integral + pointer) but other cases are missing.

This should be as follows:
pointer to T - integral: as for +, except that the pointer address is decremented by "integral" times sizeof(T)
pointer1 to T - pointer2 to T: this yields an integral expression equal to (address1-address2)/sizeof(T)
(both pointers shall point to the same type)

+= and -= should behave similarly, since exp1 .= exp2 "means" exp1 = exp1 . exp2.
pointer1 to T += integral => address1 += integral * sizeof(T)
pointer1 to T -= integral => address1 -= integral * sizeof(T)

Care should be taken if the pointer types are (void*), because sizeof(void) is not defined. This should be flagged as an error.
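
The scaling rules above can be modelled on plain integer addresses. This is a sketch of the arithmetic only, not of how PPCI's code generator would emit it; all function names are hypothetical, and an element size of `None` stands in for `void`.

```python
def _checked_size(elem_size):
    # sizeof(void) is undefined, so arithmetic on void* is rejected.
    if elem_size is None:
        raise TypeError("pointer arithmetic on void* is not allowed")
    return elem_size


def pointer_add(address, n, elem_size):
    """pointer to T + integral (also models +=)."""
    return address + n * _checked_size(elem_size)


def pointer_sub(address, n, elem_size):
    """pointer to T - integral (also models -=)."""
    return address - n * _checked_size(elem_size)


def pointer_diff(address1, address2, elem_size):
    """pointer1 to T - pointer2 to T, counted in elements of T."""
    # Both addresses are assumed to lie in the same array of T,
    # so the byte distance is an exact multiple of the element size.
    return (address1 - address2) // _checked_size(elem_size)
```

For example, with 4-byte elements, subtracting 3 from a pointer at address 100 gives address 88, and the difference between pointers at addresses 116 and 100 is 4 elements.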

"objcopy" seems to mandate --segment switch ungroundly

For starters:

python -m ppci objcopy --output-format elf hello.all.o hello
usage: __main__.py [-h] [--log log-level] [--report report-file]
                   [--html-report html-report-file]
                   [--text-report text-report-file] [--verbose] [--version]
                   --segment SEGMENT [--output-format OUTPUT_FORMAT]
                   input output
__main__.py: error: the following arguments are required: --segment/-S

(Note the cosmetic issue of __main__.py appearing in the output; that needs to be solved too.)

While making my make-based example, I wondered what the heck "flash" is: https://github.com/windelbouwman/ppci-mirror/blob/master/examples/linux64/hello/build.xml#L26 . That "imagename" is what translates to --segment (that's !! on its own, and explains why I would like to stay away from ppci-build as far as possible, and think that everyone else should have a clear option to do the same - life is just too short to repeatedly solve quizzes like that at almost every step).

So, now brushing up the make-based example, I see that you can actually pass anything to --segment, in other words, for the case --output-format=elf it's just ignored. So, let's make it like that, because again, quizzes to solve on every step ;-).

Rustpython or PPCI

Hi
Which is better for WebAssembly?
RustPython or PPCI?

Plans/future for PPCI?

It's that time of the year which calls for summing up what was done over the year and laying further plans. I would like to raise the question of what the future may hold for PPCI. I discovered this project only in August 2019, and was really astonished by its topic and scope - that's something I wish I had started and done myself. That said, it's still a project in pretty early stages regarding any usability for "real world" usage. I quickly submitted a few patches, then started to watch the project to see where it goes. Well, over one month it didn't go anywhere, which is quite understandable given summer/vacation time. I at least tried to dump whatever ideas for improvements I had in mind as tickets, so they aren't forgotten, and prepared for more patience.

Well, now by the end of the year, it's fair to say that beyond some activity in May-July, and my humble patches in Aug, totalling ~25 commits: https://github.com/windelbouwman/ppci-mirror/graphs/contributors?from=2019-01-01&to=2019-12-31&type=c , there was not much activity on the project. That's unlike the pretty steady progress over the previous 2.5 years.

So, @windelbouwman, as the author and owner of the project, I would like to ask you what plans/feelings you have regarding PPCI, and what people interested in it can expect. Thanks for your time and effort developing PPCI, and hopefully, for answering this question too.

ppci-cc : problems handling static string initializers

Compilation of the following C source code went well but the generated bytes are wrong (comments mention expected results).

char c0[] = "012"; // size is 4
char c1[9] = "012"; // 3 digits + 6 trailing zero bytes
char c2[3] = "012"; // 3 digits and no trailing zero bytes
char c3[4] = "012"; // 3 digits + 1 trailing zero byte
char c4[3] = "012345"; // should report an error (initializer too long)
char c5[3] = { '0', '1', '2', '4', '5' }; // should report an error (initializer too long)

General serious issue: the " string delimiters (ASCII code 34) are included in the generated bytes.
E.g. c0 generates bytes 34, 48, 49, 50, 34, 0 instead of 48, 49, 50, 0
The number of bytes generated for c2, c3, c4 and c5 is wrong: when a size is specified for an array, this is the actual size of the variable, and the initializer cannot change that.
There are warnings on the size in the progress report but rather well hidden:

2020-01-16 00:22:18,296 | INFO | root | ppci 0.5.7 on CPython 3.6.9
2020-01-16 00:22:18,303 | INFO | cbuilder | Starting C compilation (c99)
2020-01-16 00:22:18,310 | WARNING | ccontext | Excess elements!
2020-01-16 00:22:18,310 | INFO | ccontext | At: (bug8.c, 8, 31), hints: None
2020-01-16 00:22:18,311 | WARNING | ccontext | Excess elements!
2020-01-16 00:22:18,311 | INFO | ccontext | At: (bug8.c, 8, 36), hints: None
2020-01-16 00:22:18,312 | WARNING | ccontext | Excess elements!
2020-01-16 00:22:18,313 | INFO | ccontext | At: (bug8.c, 8, 41), hints: None
2020-01-16 00:22:18,314 | INFO | cparser | Parsing finished
2020-01-16 00:22:18,315 | INFO | ccodegen | Finished IR-code generation
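
For reference, the expected byte layout for the string cases (c0 to c4) can be sketched as below. This only models what the C rules ask for; `char_array_bytes()` is a hypothetical helper, not a PPCI function, and it takes the literal's contents without the surrounding quotes.

```python
def char_array_bytes(contents, declared_size=None):
    """Bytes for `char x[N] = "contents";` (declared_size=None means `char x[]`)."""
    data = contents.encode("ascii")  # the quotes are never part of the data
    if declared_size is None:
        # Unsized array: the size is the string length plus the terminator.
        return data + b"\0"
    if len(data) > declared_size:
        raise ValueError("initializer too long")
    # Sized array: exactly declared_size bytes, zero-padded on the right.
    # The terminator is dropped when the string fills the array exactly.
    return data.ljust(declared_size, b"\0")
```

So c0 becomes the 4 bytes 48, 49, 50, 0; c1 becomes 9 bytes with 6 trailing zeros; c2 is exactly 48, 49, 50; and c4 is rejected.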

Elaborate on what's "c3"

I'm still not 100% sure, but it seems that "c3", referred to in the README, is an ad hoc language developed specifically for/with PPCI. This fact should be made clear in the README, and elaborated in the docs, where it's still not exactly clear: https://ppci.readthedocs.io/en/latest/reference/lang/c3.html

And for reference, see what google finds for "c3 language":

What is the probability that any of those "c3" results is the same "c3" as used in PPCI?