Git Product home page Git Product logo

chibicc's Introduction

chibicc: A Small C Compiler

(The old master has moved to historical/old branch. This is a new one uploaded in September 2020.)

chibicc is yet another small C compiler that implements most C11 features. Even though it still probably falls into the "toy compilers" category just like other small compilers do, chibicc can compile several real-world programs, including Git, SQLite, libpng and chibicc itself, without making modifications to the compiled programs. Generated executables of these programs pass their corresponding test suites. So, chibicc actually supports a wide variety of C11 features and is able to compile hundreds of thousands of lines of real-world C code correctly.

chibicc is developed as the reference implementation for a book I'm currently writing about the C compiler and the low-level programming. The book covers the vast topic with an incremental approach; in the first chapter, readers will implement a "compiler" that accepts just a single number as a "language", which will then gain one feature at a time in each section of the book until the language that the compiler accepts matches what the C11 spec specifies. I took this incremental approach from the paper by Abdulaziz Ghuloum.

Each commit of this project corresponds to a section of the book. For this purpose, not only the final state of the project but each commit was carefully written with readability in mind. Readers should be able to learn how a C language feature can be implemented just by reading one or a few commits of this project. For example, this is how while, [], ?:, and thread-local variable are implemented. If you have plenty of spare time, it might be fun to read it from the first commit.

If you like this project, please consider purchasing a copy of the book when it becomes available! 😀 I publish the source code here to give people early access to it, because I was planing to do that anyway with a permissive open-source license after publishing the book. If I don't charge for the source code, it doesn't make much sense to me to keep it private. I hope to publish the book in 2021. You can sign up here to receive a notification when a free chapter is available online or the book is published.

I pronounce chibicc as chee bee cee cee. "chibi" means "mini" or "small" in Japanese. "cc" stands for C compiler.

Status

chibicc supports almost all mandatory features and most optional features of C11 as well as a few GCC language extensions.

Features that are often missing in a small compiler but supported by chibicc include (but not limited to):

  • Preprocessor
  • float, double and long double (x87 80-bit floating point numbers)
  • Bit-fields
  • alloca()
  • Variable-length arrays
  • Compound literals
  • Thread-local variables
  • Atomic variables
  • Common symbols
  • Designated initializers
  • L, u, U and u8 string literals
  • Functions that take or return structs as values, as specified by the x86-64 SystemV ABI

chibicc does not support complex numbers, K&R-style function prototypes and GCC-style inline assembly. Digraphs and trigraphs are intentionally left out.

chibicc outputs a simple but nice error message when it finds an error in source code.

There's no optimization pass. chibicc emits terrible code which is probably twice or more slower than GCC's output. I have a plan to add an optimization pass once the frontend is done.

I'm using Ubuntu 20.04 for x86-64 as a development platform. I made a few small changes so that chibicc works on Ubuntu 18.04, Fedora 32 and Gentoo 2.6, but portability is not my goal at this moment. It may or may not work on systems other than Ubuntu 20.04.

Internals

chibicc consists of the following stages:

  • Tokenize: A tokenizer takes a string as an input, breaks it into a list of tokens and returns them.

  • Preprocess: A preprocessor takes as an input a list of tokens and output a new list of macro-expanded tokens. It interprets preprocessor directives while expanding macros.

  • Parse: A recursive descendent parser constructs abstract syntax trees from the output of the preprocessor. It also adds a type to each AST node.

  • Codegen: A code generator emits an assembly text for given AST nodes.

Contributing

When I find a bug in this compiler, I go back to the original commit that introduced the bug and rewrite the commit history as if there were no such bug from the beginning. This is an unusual way of fixing bugs, but as a part of a book, it is important to keep every commit bug-free.

Thus, I do not take pull requests in this repo. You can send me a pull request if you find a bug, but it is very likely that I will read your patch and then apply that to my previous commits by rewriting history. I'll credit your name somewhere, but your changes will be rewritten by me before submitted to this repository.

Also, please assume that I will occasionally force-push my local repository to this public one to rewrite history. If you clone this project and make local commits on top of it, your changes will have to be rebased by hand when I force-push new commits.

Design principles

chibicc's core value is its simplicity and the reability of its source code. To achieve this goal, I was careful not to be too clever when writing code. Let me explain what that means.

Oftentimes, as you get used to the code base, you are tempted to improve the code using more abstractions and clever tricks. But that kind of improvements don't always improve readability for first-time readers and can actually hurts it. I tried to avoid the pitfall as much as possible. I wrote this code not for me but for first-time readers.

If you take a look at the source code, you'll find a couple of dumb-looking pieces of code. These are written intentionally that way (but at some places I might be actually missing something, though). Here is a few notable examples:

  • The recursive descendent parser contains many similar-looking functions for similar-looking generative grammar rules. You might be tempted to improve it to reduce the duplication using higher-order functions or macros, but I thought that that's too complicated. It's better to allow small duplications instead.

  • chibicc doesn't try too hard to save memory. An entire input source file is read to memory first before the tokenizer kicks in, for example.

  • Slow algorithms are fine if we know that n isn't too big. For example, we use a linked list as a set in the preprocessor, so the membership check takes O(n) where n is the size of the set. But that's fine because we know n is usually very small. And even if n can be very big, I stick with a simple slow algorithm until it is proved by benchmarks that that's a bottleneck.

  • Each AST node type uses only a few members of the Node struct members. Other unused Node members are just a waste of memory at runtime. We could save memory using unions, but I decided to simply put everything in the same struct instead. I believe the inefficiency is negligible. Even if it matters, we can always change the code to use unions at any time. I wanted to avoid premature optimization.

  • chibicc always allocates heap memory using calloc, which is a variant of malloc that clears memory with zero. calloc is slightly slower than malloc, but that should be neligible.

  • Last but not least, chibicc allocates memory using calloc but never calls free. Allocated heap memory is not freed until the process exits. I'm sure that this memory management policy (or lack thereof) looks very odd, but it makes sense for short-lived programs such as compilers. DMD, a compiler for the D programming language, uses the same memory management scheme for the same reason, for example [1].

About the Author

I'm Rui Ueyama. I'm the creator of 8cc, which is a hobby C compiler, and also the original creator of the current version of LLVM lld linker, which is a production-quality linker used by various operating systems and large-scale build systems.

References

[1] https://www.drdobbs.com/cpp/increasing-compiler-speed-by-over-75/240158941

DMD does memory allocation in a bit of a sneaky way. Since compilers are short-lived programs, and speed is of the essence, DMD just mallocs away, and never frees.

chibicc's People

Contributors

rui314 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

chibicc's Issues

Thanks and roadmap

I like very much your design principles, I tend to work the same way, read some source files and code is clean and easy to follow.
Is the compiler complete?, I mean, can I start reading right now or do you plan to introduce big changes in the future?.
Or maybe do you have a roadmap worth including in the readme?
I've been toying with a small compiler that produces code for a small 8bit VM, not C, but still have lots of things to finish.
Thanks for inspiration!

Bug with array initializer

The following code

#include <stdio.h>

#define SIZE 10

extern int myarr[SIZE];
int myarr[] = { 11, 34, };

int main(void)
{
    for (int i = 0; i < SIZE; ++i)
        printf("%d: %d\n", i, myarr[i]);

    return 0;
}

should produce the following output:

0: 11
1: 34
2: 0
3: 0
4: 0
5: 0
6: 0
7: 0
8: 0
9: 0

Instead of the above, I get the following with chibicc:

0: 11
1: 34
2: 1635147615
3: 1735549279
4: 7366239
5: 1635147615
6: 1735549279
7: 7366239
8: 1635147615
9: 1735549279

I guess this feature has not being implemented yet?

Failed to Compile Against Musl Libc Headers

main.c

#include <stdio.h>

int
main(void)
{
    puts("Hello world");
    return 0;
}
$ chibicc main.c
/usr/include/bits/alltypes.h:6: typedef __builtin_va_list va_list;
                                                          ^ expected ','

OS: Void Linux
libc: musl
I didn't see anywhere (or maybe I didn't look hard enough) about the portability of this C compiler and if different libc would work or if only glibc is intended to work.

`(struct S){expr}.field` cannot get parsed

Reproduction:

// a.c
struct S { int n; };

int main(void) {
    return (struct S){0}.n;
}
chibi a.c

This exits with code 1 and error:

a.c:4:     return (struct S){0}.n;
                               ^ expected ';'

According to N1570 (final draft of C11), (type-name){ initializer-list } is a postfix-expression, I think this should compile. Note GCC(10) and Clang(10) compiles this.

is_variadic is being set on 0-arity functions

chibicc/parse.c

Line 625 in 90d1f7f

if (cur == &head)

I'm going to assume you meant to set the flag to false on nil arg va func to supress register preservation? At any rate I want to show my appreciation for this project, it really helped clarify a few aspects of the process, regards! ❤️

Rationale for skipping extraneous tokens before newline in some preprocessor directives

preprocess.c#L80

// Some preprocessor directives such as #include allow extraneous
// tokens before newline. This function skips such tokens.

Taking #include as an example, [ISO/IEC 9899:2017, 6.10.2, §4]

The preprocessing tokens after include in the directive are processed just as in normal text. (Each identifier currently defined as a macro name is replaced by its replacement list of preprocessing tokens.) The directive resulting after all replacements shall match one of the two previous forms.

Neither of the two previous forms of #include allow extraneous tokens before newline. Is the rationale for skipping such tokens to be lenient and provide a warning instead of halting with an error?

Can't find system headers

On MacOS the system headers are in /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include

Clang offers the -isysroot option to allow specifying this location on the command line.

Can you add a similar option so your compiler has even a chance of working on MacOS?

Initializer list warning

struct __pthread_rwlock_arch_t {
	unsigned int __readers;
	unsigned int __writers;
	unsigned int __wrphase_futex;
	unsigned int __writers_futex;
	unsigned int __pad3;
	unsigned int __pad4;
	int __cur_writer;
	int __shared;
	signed char __rwelision;
	unsigned char __pad1[7];
	unsigned long int __pad2;
	unsigned int __flags;
};

typedef union
{
	struct __pthread_rwlock_arch_t __data;
	char __size[56];
	long int __align;
} pthread_rwlock_t;

typedef union
{
	char __size[8];
	long int __align;
} pthread_rwlockattr_t;

int main()
{
	pthread_rwlock_t lock = {0, };

	return 0;
}

Gives the warning

$ chibicc -c test.c 
test3.c:37: 	pthread_rwlock_t lock = {0, };
                                      ^ expected '}'

Memory leak

In a1ab0ff, a memory leak is introduced. I added one possible way to fix it in 427fb99. I just don't know how to create a PR targeting a given commit instead of a branch.

I am currently trying to re-write the whole compiler in Rust, one commit at a time (I will see if it was wise later!). I notice that (at least in the first commits), you C code isn't really robust against bad input, and would expose undefined behavior. Do you want me to create issues/PR against those?

-MD places .d file in wrong folder

Steps to Reproduce

mkdir src build
echo 'int main(void){return 0;}' > src/main.c
chibicc -MD src/main.c -o build/main.o

Expected Behaviour

The dependency file main.d should be under build/.

Actual Behaviour

The dependency file main.d is in the current folder.

Provide a way to get a notification when the book launches

Hi there! I recently came across this repository, and would like to say I'm really eager to get the book once it's ready.
Could you consider creating a mailing list, or something like that so we can get a straightforward way to know when the book is available for purchase? I'd appreciate that.

Thank you!

Anonymously named bitfield member segfaults compiler

Steps to reproduce

  1. Create test.c with the following code:
struct info {
  int key;
  int :5;
  int data;
};
int Get(struct info* m) { return m->data; }
  1. Execute chibicc test.c -cc1 -cc1-input test.c -cc1-output test.S
  2. Observe Segmentation fault (core dumped) message appears

Notes

If you change the int :5 to say int flags:5 instead, the compiler works correctly.

According to gdb, the stacktrace (compiled from latest master) is

get_struct_member (ty=0x555555586d10, tok=0x5555555851c0) at parse.c:2749
2749        if (mem->name->len == tok->len &&
(gdb) print mem->name
$1 = (Token *) 0x0
(gdb) bt
#0  get_struct_member (ty=0x555555586d10, tok=0x5555555851c0) at parse.c:2749
#1  0x0000555555567629 in struct_ref (node=0x5555555880c0, tok=0x5555555851c0) at parse.c:2777
#2  0x00005555555679d3 in postfix (rest=0x7fffffffdcc0, tok=0x555555585130) at parse.c:2852
#3  0x0000555555566b21 in unary (rest=0x7fffffffdcc0, tok=0x5555555850a0) at parse.c:2540
#4  0x00005555555667a6 in cast (rest=0x7fffffffdcc0, tok=0x5555555850a0) at parse.c:2482
#5  0x000055555556658e in mul (rest=0x7fffffffdcf0, tok=0x5555555850a0) at parse.c:2440
#6  0x00005555555664af in add (rest=0x7fffffffdd20, tok=0x5555555850a0) at parse.c:2418
#7  0x0000555555565fa1 in shift (rest=0x7fffffffdd50, tok=0x5555555850a0) at parse.c:2325
#8  0x0000555555565e0e in relational (rest=0x7fffffffdd80, tok=0x5555555850a0) at parse.c:2293
#9  0x0000555555565d1f in equality (rest=0x7fffffffddb0, tok=0x5555555850a0) at parse.c:2271
#10 0x0000555555565c89 in bitand (rest=0x7fffffffdde0, tok=0x5555555850a0) at parse.c:2260
#11 0x0000555555565bf3 in bitxor (rest=0x7fffffffde10, tok=0x5555555850a0) at parse.c:2249
#12 0x0000555555565b5d in bitor (rest=0x7fffffffde40, tok=0x5555555850a0) at parse.c:2238
#13 0x0000555555565ac7 in logand (rest=0x7fffffffde70, tok=0x5555555850a0) at parse.c:2227
#14 0x0000555555565a31 in logor (rest=0x7fffffffdea0, tok=0x5555555850a0) at parse.c:2216
#15 0x000055555556585a in conditional (rest=0x7fffffffdf00, tok=0x5555555850a0) at parse.c:2187
#16 0x0000555555565478 in assign (rest=0x7fffffffdf40, tok=0x5555555850a0) at parse.c:2146
#17 0x0000555555563e60 in expr (rest=0x7fffffffdf80, tok=0x5555555850a0) at parse.c:1819
#18 0x0000555555562d25 in stmt (rest=0x7fffffffe070, tok=0x555555585010) at parse.c:1553
#19 0x0000555555563d05 in compound_stmt (rest=0x7fffffffe208, tok=0x5555555850a0) at parse.c:1792
#20 0x00005555555691cf in function (tok=0x555555585010, basety=0x55555557b460 <__compound_literal.4>, attr=0x7fffffffe28c) at parse.c:3257
#21 0x00005555555696a3 in parse (tok=0x555555583e50) at parse.c:3353
#22 0x000055555555d7fb in cc1 () at main.c:553
#23 0x000055555555e0e9 in main (argc=7, argv=0x7fffffffe488) at main.c:707

VLA argument does not compile

This code does not compile:

int fun(int n, int a[n]);

Cmd: chibicc x.c -c
Error:

x.c:1: int fun(int n, int a[n]);
                            ^ undefined variable

This is compliant C99 code using VLA.

Internal error on long double initializer

Trying to compile this code:

long double x = 0;

results in this error:

internal error at parse.c:1413

This seems to caused by write_buf not supporting buffers of a size above 8

asan issues with `memcp`, use `strncmp` instead?

I was just testing out chibicc on a mac with

clang -g \
  -fsanitize=address \
  -fsanitize=undefined \
  -Wno-switch -Wno-format *.c && \
./a.out -I<dummy-headers-include-path> *.c -S

and the address sanitizer complains about this line:

return memcmp(tok->loc, op, tok->len) == 0 && op[tok->len] == '\0';

I'm guessing that this is probably because tok->len can sometimes be longer than the string stored in op, so memcmp can sometimes access memory out of bounds.

Using strncmp here instead of memcmp seems to fix the issue.

I would open a pull request, but I read your README, and figured you probably want to make the change yourself :p

On the other hand, I don't think this would ever actually be an issue at runtime, so if you don't think this issue is worth the hassle of rebasing, please feel free to just close and ignore this issue :)

best,

Having a large amount of global variables makes compilation very slow

Compiling this code:

#define LIM1(x) x##0; x##1; x##2; x##3; x##4; x##5; x##6; x##7; x##8; x##9;
#define LIM2(x) LIM1(x##0) LIM1(x##1) LIM1(x##2) LIM1(x##3) LIM1(x##4) \
                LIM1(x##5) LIM1(x##6) LIM1(x##7) LIM1(x##8) LIM1(x##9)
#define LIM3(x) LIM2(x##0) LIM2(x##1) LIM2(x##2) LIM2(x##3) LIM2(x##4) \
                LIM2(x##5) LIM2(x##6) LIM2(x##7) LIM2(x##8) LIM2(x##9)
#define LIM4(x) LIM3(x##0) LIM3(x##1) LIM3(x##2) LIM3(x##3) LIM3(x##4) \
                LIM3(x##5) LIM3(x##6) LIM3(x##7) LIM3(x##8) LIM3(x##9)
#define LIM5(x) LIM4(x##0) LIM4(x##1)

LIM5(char t)

takes ~7.5 seconds. Making this just a little bigger makes it much longer. It appears the algorithm used to remove tentative definitions is an O(n²) algorithm, which is why adding more variables makes it exponentially slower.

Assertion when using empty structs

I am trying to use chibicc with some tool that generates C code with empty structs and I get some assertions when compiling, here is a small test case:

typedef struct empty_struct {} empty_struct;

int f(empty_struct a) {
  return 0;
}

int main() {
  f((empty_struct){});
  return 0;
}

Output:

./chibicc a.c -o a
chibicc: codegen.c:1568: emit_text: Assertion `depth == 0' failed.

Support for other *nix-es

Is there any interest in supporting other platforms? To avoid running into #21, I've tried to build int main(void) { return 10; } on FreeBSD, but the outcome of that is not a complete solution - see #24. It looks like going deeper into FreeBSD support would require more robust platform detection. I think it is still doable, but I thought I'd ask if there interest in this, before continuing.

Using a large amount of parentheses in a declaration eats a humonguous amount of memory

Code such as this:

#define PTR1 (* (* (* (* (* (* (* (* (* (*
#define PTR2 PTR1 PTR1

#define RBR1 ) ) ) ) ) ) ) ) ) )
#define RBR2 RBR1 RBR1

int PTR2 (*(* q4_var )) RBR2 = 0;

will compile, but at the cost of eating ~6 GB of RAM during compilation (and adding a few more will eat all of memory). It appears as though a lot of new types are created, and recursively at that (also making the compilation of this take a rather long time, btw).

Stale comment describing parse rule for `typespec()`.

Consider the following comment:

// typespec = typename typename*
// typename = "void" | "_Bool" | "char" | "short" | "int" | "long"
//            | struct-decl | union-decl | typedef-name

This rule as-is describes (more or less) the type-specifier rule in the C11 standard:

type-specifier:
void
char
short
int
long
float
double
signed
unsigned
_Bool
_Complex
atomic-type-specifier
struct-or-union-specifier
enum-specifier
typedef-name

However, since the typespec function of chibicc also parses storage classes, alignments, etc, I think the comment should mention this. Parsing storage classes, alignments, and the link also makes the "typespec" function more like the declaration_specifiers rule in the C standard:

declaration_specifiers
	: storage_class_specifier declaration_specifiers
	| storage_class_specifier
	| type_specifier declaration_specifiers
	| type_specifier
	| type_qualifier declaration_specifiers
	| type_qualifier
	| function_specifier declaration_specifiers
	| function_specifier
	| alignment_specifier declaration_specifiers
	| alignment_specifier
	;

So maybe "declspec" might be a better name? I'm not sure.

Nested designators error

struct s {
  struct {
    int i;
    int j;
  } t;
  int k;
};

int main() {
  struct s x = {.t.i = 1, .k = 2};
}

gives an error:

test.c:10:   struct s x = {.t.i = 1, .k = 2};
                                   ^ expected an expression

memory sanitizer

Failed run tests by (make test-all) if chibicc compiled under memory sanitizer:

./chibicc -Iinclude -Itest -c -o test/variable.o test/variable.c
==3636==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x4d531b in array_dimensions /home/mpech/chibicc/parse.c:667:26
    #1 0x4d3b8a in type_suffix /home/mpech/chibicc/parse.c:680:12
    #2 0x4d5c3e in declarator /home/mpech/chibicc/parse.c:719:8
    #3 0x4d5824 in declarator /home/mpech/chibicc/parse.c:705:5
    #4 0x4c1d3e in is_function /home/mpech/chibicc/parse.c:3306:14
    #5 0x4f3aae in compound_stmt /home/mpech/chibicc/parse.c:1793:11
    #6 0x4ed476 in primary /home/mpech/chibicc/parse.c:2998:18
    #7 0x4ec4fb in postfix /home/mpech/chibicc/parse.c:2835:16
    #8 0x4d251c in unary /home/mpech/chibicc/parse.c:2568:10
    #9 0x4d0495 in cast /home/mpech/chibicc/parse.c:2511:10
    #10 0x4cd9f5 in mul /home/mpech/chibicc/parse.c:2453:16
    #11 0x4cd425 in add /home/mpech/chibicc/parse.c:2431:16
    #12 0x4cce35 in shift /home/mpech/chibicc/parse.c:2338:16
    #13 0x4cc495 in relational /home/mpech/chibicc/parse.c:2306:16
    #14 0x4cbea5 in equality /home/mpech/chibicc/parse.c:2284:16
    #15 0x4cbb27 in bitand /home/mpech/chibicc/parse.c:2273:16
    #16 0x4cb7c7 in bitxor /home/mpech/chibicc/parse.c:2262:16
    #17 0x4cb467 in bitor /home/mpech/chibicc/parse.c:2251:16
    #18 0x4cb107 in logand /home/mpech/chibicc/parse.c:2240:16
    #19 0x4ca3a7 in logor /home/mpech/chibicc/parse.c:2229:16
    #20 0x4bcf7a in conditional /home/mpech/chibicc/parse.c:2200:16
    #21 0x4dfe68 in assign /home/mpech/chibicc/parse.c:2159:16
    #22 0x4f1f88 in funcall /home/mpech/chibicc/parse.c:2901:17
    #23 0x4ec62b in postfix /home/mpech/chibicc/parse.c:2839:14
    #24 0x4d251c in unary /home/mpech/chibicc/parse.c:2568:10
    #25 0x4d0495 in cast /home/mpech/chibicc/parse.c:2511:10
    #26 0x4cd9f5 in mul /home/mpech/chibicc/parse.c:2453:16
    #27 0x4cd425 in add /home/mpech/chibicc/parse.c:2431:16
    #28 0x4cce35 in shift /home/mpech/chibicc/parse.c:2338:16
    #29 0x4cc495 in relational /home/mpech/chibicc/parse.c:2306:16
    #30 0x4cbea5 in equality /home/mpech/chibicc/parse.c:2284:16
    #31 0x4cbb27 in bitand /home/mpech/chibicc/parse.c:2273:16
    #32 0x4cb7c7 in bitxor /home/mpech/chibicc/parse.c:2262:16
    #33 0x4cb467 in bitor /home/mpech/chibicc/parse.c:2251:16
    #34 0x4cb107 in logand /home/mpech/chibicc/parse.c:2240:16
    #35 0x4ca3a7 in logor /home/mpech/chibicc/parse.c:2229:16
    #36 0x4bcf7a in conditional /home/mpech/chibicc/parse.c:2200:16
    #37 0x4dfe68 in assign /home/mpech/chibicc/parse.c:2159:16
    #38 0x4cada8 in expr /home/mpech/chibicc/parse.c:1832:16
    #39 0x4fd15b in expr_stmt /home/mpech/chibicc/parse.c:1825:15
    #40 0x4fc7f5 in stmt /home/mpech/chibicc/parse.c:1772:10
    #41 0x4f3ece in compound_stmt /home/mpech/chibicc/parse.c:1805:25
    #42 0x4c3829 in function /home/mpech/chibicc/parse.c:3265:14
    #43 0x4bdf9c in parse /home/mpech/chibicc/parse.c:3361:13
    #44 0x4b6da0 in cc1 /home/mpech/chibicc/main.c:570:15
    #45 0x4b0164 in main /home/mpech/chibicc/main.c:680:5
    #46 0x7f83b7e490b2 in __libc_start_main /build/glibc-ZN95T4/glibc-2.31/csu/../csu/libc-start.c:308:16
    #47 0x41c2fd in _start (/home/mpech/chibicc/chibicc+0x41c2fd)

SUMMARY: MemorySanitizer: use-of-uninitialized-value /home/mpech/chibicc/parse.c:667:26 in array_dimensions
Exiting

Unnatural associativity of arithmetic operators in parser

The following program

int printf(char *fmt, ...);

int println_id(int x) {
  printf("%d\n", x);
  return x;
}

int main() {
  return println_id(1) + println_id(2) + println_id(3);
}

compiles to something that contains

  mov $3, %rax
  push %rax
  .loc 1 9
  lea println_id(%rip), %rax
  pop %rdi
  mov %rax, %r10
  mov $0, %rax
  call *%r10
  add $0, %rsp
  push %rax
  .loc 1 9
  .loc 1 9
  .loc 1 9
  .loc 1 9
  sub $8, %rsp
  .loc 1 9
  .loc 1 9
  mov $2, %rax
  push %rax
  .loc 1 9
  lea println_id(%rip), %rax
  pop %rdi
  mov %rax, %r10
  mov $0, %rax
  call *%r10
  add $8, %rsp
  push %rax
  .loc 1 9
  .loc 1 9
  .loc 1 9
  .loc 1 9
  mov $1, %rax
  push %rax
  .loc 1 9
  lea println_id(%rip), %rax
  pop %rdi
  mov %rax, %r10
  mov $0, %rax
  call *%r10

I can't run the code since I have MacOS, but from the assembly above the call sequence seems to be println_id(3), println_id(2), and println_id(1).

I was wondering how your recursive descent parser handles operator associativity, and the following part of the parser led me to discover this "issue":

...
// add = mul ("+" mul | "-" mul)*
static Node *add(Token **rest, Token *tok) {
...

"+" and "-" seems to be right-associative, instead of the more natural "left-associative".

"typedef name omitted" warning

#include <linux/types.h>

#define barrier __sync_synchronize

typedef __u8  __attribute__((__may_alias__))  __u8_alias_t;
typedef __u16 __attribute__((__may_alias__)) __u16_alias_t;
typedef __u32 __attribute__((__may_alias__)) __u32_alias_t;
typedef __u64 __attribute__((__may_alias__)) __u64_alias_t;

static __always_inline void __read_once_size(const volatile void *p, void *res, int size)
{
        switch (size) {
        case 1: *(__u8_alias_t  *) res = *(volatile __u8_alias_t  *) p; break;
        case 2: *(__u16_alias_t *) res = *(volatile __u16_alias_t *) p; break;
        case 4: *(__u32_alias_t *) res = *(volatile __u32_alias_t *) p; break;
        case 8: *(__u64_alias_t *) res = *(volatile __u64_alias_t *) p; break;
        default:
                barrier();
                __builtin_memcpy((void *)res, (const void *)p, size);
                barrier();
        }
}

static __always_inline void __write_once_size(volatile void *p, void *res, int size)
{
        switch (size) {
        case 1: *(volatile  __u8_alias_t *) p = *(__u8_alias_t  *) res; break;
        case 2: *(volatile __u16_alias_t *) p = *(__u16_alias_t *) res; break;
        case 4: *(volatile __u32_alias_t *) p = *(__u32_alias_t *) res; break;
        case 8: *(volatile __u64_alias_t *) p = *(__u64_alias_t *) res; break;
        default:
                barrier();
                __builtin_memcpy((void *)p, (const void *)res, size);
                barrier();
        }
}

#define READ_ONCE(x)                                    \
({                                                      \
        union { typeof(x) __val; char __c[1]; } __u =   \
                { .__c = { 0 } };                       \
        __read_once_size(&(x), __u.__c, sizeof(x));     \
        __u.__val;                                      \
})

#define WRITE_ONCE(x, val)                              \
({                                                      \
        union { typeof(x) __val; char __c[1]; } __u =   \
                { .__val = (val) };                     \
        __write_once_size(&(x), __u.__c, sizeof(x));    \
        __u.__val;                                      \
})

int main()
{
        int test_wr = 1;

        int val = READ_ONCE(test_wr);
        val++;
        WRITE_ONCE(test_wr, 1);
        return 0;
}

Gives

test.c:5: typedef __u8  __attribute__((__may_alias__))  __u8_alias_t;
                         ^ typedef name omitted

Token concatenation not done in object-like macros

Hello,

chibicc does not evaluate the ## operator in the replacement list of an object-like macros.

The C standard says, to quote ISO/IEC 9899:1999:

For both object-like and function-like macro invocations, before the replacement list is
reexamined for more macro names to replace, each instance of a ## preprocessing token
in the replacement list (not from an argument) is deleted and the preceding preprocessing
token is concatenated with the following preprocessing token.

For example the following:

#define obj a ## b
obj

expands to:

a ## b

even though the expected expansion by the standard (which gcc and clang conforms to) is:

ab

EDIT: typo

Assembler error on larger than 32-bit bitfields

struct S { unsigned long long int b:32; } s;

void f()
{
    s.b = 0;
}

Trying to compile this results in this error:

/tmp/chibicc-0G8AuI: Assembler messages:
/tmp/chibicc-0G8AuI:60: Error: operand type mismatch for `and'

Non-directives incorrectly recognized as directives

The following code:

#define x
x#define y z
y

Should expand to:

#define y z
y

But instead chibicc expands it to:

z

Rationale: ISO/IEC 9899:1999 6.10:

A preprocessing directive consists of a sequence of preprocessing tokens that satisfies the
following constraints: The first token in the sequence is a # preprocessing token that (at
the start of translation phase 4) is either the first character in the source file (optionally
after white space containing no new-line characters) or that follows white space
containing at least one new-line character.

semantic errors uncatched

It seems that many semantic errors are not catched.For example,the type of a bit field has to be int or unsigned,but the compiler seems not to check this. I may wonder if the front end is not completed.

struct a {
    int;
    double d:10;
}A;
                                                                                                                                                                                         thank you                                                                                                                                                             

FreeBSD stdarg support

FreeBSD gives out this rather excluding message when including stdio.h:

/usr/include/sys/_types.h:135: #error "No support for your compiler for stdargs"

It is triggered by is this condition in cdefs.h not being met (and the guarded macros not getting set):

#if (__GNUC_MINOR__ > 95 || __GNUC__ >= 3)
#define	__GNUCLIKE_BUILTIN_VARARGS 1
#define	__GNUCLIKE_BUILTIN_STDARG 1
#define	__GNUCLIKE_BUILTIN_VAALIST 1
#endif

Unicode security improvements over C11

n1570 Annex D allows a wild mix of GREEK, CYRILLIC and Mathematical symbols as identifiers, which are insecure according to the TR39 Unicode Security mechanisms.

Esp. GREEK must not be mixed with CYRILLIC, as they are not identifiable. no identifiers anymore. I am seeing more and more of such examples, which look cute, but are not.
e.g. П 041F (Cyrillic) vs Π 03A0 (Greek)
or the various lamdas λ 03BB vs 13 others.
just try uninames LAMDA

In my language and in Java they check such script ranges and fail if a document (program) contains identifiers in disallowed mixed scripts. certain combinations are allowed, but many not.

With C11 maybe start with checking the GREEK vs CYRILLIC vs MATH ranges, and set a global, like id_greek, id_cyrillic and id_math and fail if two if them become true. => "Cannot mix greek with cyrillic"

http://unicode.org/reports/tr39/#Confusable_Detection
https://www.sigbus.info/n1570#D

linker fails

ld: cannot find /usr/lib/x86_64-linux-gnu/crt1.o: No such file or directory
ld: cannot find /usr/lib/x86_64-linux-gnu/crti.o: No such file or directory
ld: cannot find /usr/lib/gcc/x86_64-linux-gnu/9/crtbegin.o: No such file or directory

Duplicate keys in hash table

When we are inserting a new key-value pair to a hash table, we always overwrite a tombstone if found. However, it can cause a duplicate key to be inserted to a hash table, since another key might be after the first tombstone. So we need to continue searching a given key until an empty slot is found, to make sure that there's no existing key.

#include errors are reported on the wrong line

The following file:

#include "fake.h"

int main() {}

where fake.h does not exist
produces the output

fail_include.c:3: int main() {}
                  ^ fake.h: cannot open file: No such file or directory

whereas the error should instead be reported on the #include line.

LLVM codegen

Hello! 👋 Thank you for making this, I feel like it’s an incredibly helpful learning resource, and overall a really neat project!

In case anyone finds it useful, I wanted to mention that I decided to start writing an LLVM backend for chibicc! It’s still nowhere nearly finished, but it’s enough to compile some simple examples.

codegen.c
#include <inttypes.h>
#include "chibicc.h"

int align_to(int n, int align) {
  return (n + align - 1) / align * align;
}

static FILE *output_file;

__attribute__((format(printf, 1, 2)))
static void println(char *fmt, ...) {
  va_list ap;
  va_start(ap, fmt);
  vfprintf(output_file, fmt, ap);
  va_end(ap);
  fprintf(output_file, "\n");
}

static char *gen_addr(Node *node);
static char *gen_expr(Node *node);
static char *gen_stmt(Node *node);

static int counter = 0;
static int counter2 = 0;

static char *to_val(char *str) {
  counter++;
  println("  %%%d = %s", counter, str);
  return format("%%%d", counter);
}

static int count() {
  counter2++;
  return counter2;
}

static char *gen_type(Type *ty) {
  switch (ty->kind) {
  case TY_VOID: return "void";
  case TY_BOOL: return "i32";
  case TY_CHAR: return "i8";
  case TY_SHORT: return "i16";
  case TY_INT: return "i32";
  case TY_ENUM: return "i32";
  case TY_LONG: return "i64";
  case TY_FLOAT: return "f32";
  case TY_DOUBLE: return "f64";
  case TY_LDOUBLE: return "x86_f80";
  case TY_ARRAY:
    return format("[%d x %s]", ty->array_len, gen_type(ty->base));
  case TY_PTR:
    if (ty->base->kind == TY_VOID) return "i8*";
    return format("%s*", gen_type(ty->base));
  // TODO
  // case TY_STRUCT:
  // case TY_UNION:
  // ...
  }
  unreachable();
}

char *to_bool(Node *node) {
  if (is_flonum(node->ty)) {
    return to_val(format("fcmp ne %s %s, 0.0", gen_type(node->ty), gen_expr(node)));
  } else {
    return to_val(format("icmp ne %s %s, 0", gen_type(node->ty), gen_expr(node)));
  }
}

char *fn_args(Node *arg) {
  if (!arg) return NULL;
  char *next = fn_args(arg->next);
  char *v = gen_expr(arg);
  if (next) return format("%s %s, %s", gen_type(arg->ty), v, next);
  else return format("%s %s", gen_type(arg->ty), v);
}

static char *gen_addr(Node *node) {
  switch (node->kind) {
  case ND_VAR: {
    if (node->var->is_local)
      return format("%%%s.%d", node->var->name, node->var->offset);
    char *v = format("@%s", node->var->name);
    if (node->ty->kind == TY_ARRAY) {
      int size = node->ty->size / node->ty->base->size;
      char *ty = gen_type(node->ty->base);
      v = to_val(format("getelementptr [%d x %s], [%d x %s]* %s, i8 0, i8 0", size, ty, size, ty, v));
    }
    return v;
  }
  case ND_DEREF:
    return gen_expr(node->lhs);
  case ND_COMMA:
    println("  %s", gen_expr(node->lhs));
    return gen_addr(node->rhs);
  case ND_ASSIGN:
  case ND_COND:
    if (node->ty->kind == TY_STRUCT || node->ty->kind == TY_UNION) {
      // TODO
      error_tok(node->tok, "not implemented");
    }
    break;
  case ND_FUNCALL:
  case ND_MEMBER:
  case ND_VLA_PTR:
    // TODO
    error_tok(node->tok, "not implemented");
  }

  error_tok(node->tok, "not an lvalue");
}

static char *gen_expr(Node *node) {
  switch (node->kind) {
  case ND_NULL_EXPR:
    return NULL;
  case ND_NUM:
    switch (node->ty->kind) {
    case TY_FLOAT:
    case TY_DOUBLE:
    case TY_LDOUBLE:
      return format("%Lf", node->fval);
    default:
      return format("%ld", node->val);
    }
  case ND_NEG:
    if (is_flonum(node->ty))
      return to_val(format("fneg %s %s", gen_type(node->ty), gen_expr(node->lhs)));
    else
      return to_val(format("sub %s %s, 0", gen_type(node->ty), gen_expr(node->lhs)));
  case ND_VAR:
    if (node->ty->kind == TY_ARRAY)
      return gen_addr(node);
    return to_val(format("load %s, %s* %s", gen_type(node->ty), gen_type(node->ty), gen_addr(node)));
  case ND_MEMBER:
    return to_val(format("load %s, %s* %s", gen_type(node->member->ty), gen_type(node->member->ty), gen_addr(node)));
  case ND_DEREF:
    return to_val(format("load %s, %s %s", gen_type(node->ty), gen_type(node->lhs->ty), gen_expr(node->lhs)));
  case ND_ADDR:
    return gen_addr(node->lhs);
  case ND_ASSIGN: {
    if (node->lhs->kind == ND_MEMBER && node->lhs->member->is_bitfield) {
      // TODO
      error_tok(node->tok, "not implemented");
    }
    char *v1 = gen_addr(node->lhs);
    char *v2 = gen_expr(node->rhs);
    println("  store %s %s, %s* %s", gen_type(node->rhs->ty), v2, gen_type(node->lhs->ty), v1);
    return v2;
  }
  case ND_STMT_EXPR: {
    char *res;
    for (Node *n = node->body; n; n = n->next)
      res = gen_stmt(n);
    if (res == NULL) error_tok(node->tok, "invalid last statement kind");
    return res;
  }
  case ND_COMMA:
    gen_expr(node->lhs);
    return gen_expr(node->rhs);
  case ND_CAST: {
    char *kw;
    if (is_integer(node->ty) && is_integer(node->lhs->ty)) {
      if (node->ty->size < node->lhs->ty->size)
        kw = "trunc";
      else if (node->ty->size > node->lhs->ty->size)
        kw = "zext";
      else
        return gen_expr(node->lhs);
    }
    else if (is_flonum(node->ty) && is_flonum(node->lhs->ty)) {
      if (node->ty->size < node->lhs->ty->size)
        kw = "fptrunc";
      else if (node->ty->size > node->lhs->ty->size)
        kw = "fpext";
      else
        return gen_expr(node->lhs);
    }
    else if (is_flonum(node->ty) && is_integer(node->lhs->ty)) {
      if (node->lhs->ty->is_unsigned)
        kw = "uitofp";
      else
        kw = "sitofp";
    }
    else if (is_integer(node->ty) && is_flonum(node->lhs->ty)) {
      if (node->lhs->ty->is_unsigned)
        kw = "fptoui";
      else
        kw = "fptosi";
    }
    else if (node->ty->kind == TY_PTR && is_integer(node->lhs->ty)) {
      kw = "inttoptr";
    }
    else if (is_integer(node->ty) && node->lhs->ty->kind == TY_PTR) {
      kw = "ptrtoint";
    }
    else if (node->ty->kind == TY_PTR && node->lhs->ty->kind == TY_ARRAY) {
      return gen_addr(node->lhs);
    }
    else {
      return gen_expr(node->lhs);
    }
    return to_val(format("%s %s %s to %s", kw, gen_type(node->lhs->ty), gen_expr(node->lhs), gen_type(node->ty)));
  }
  case ND_MEMZERO:
    return "zeroinitializer";
  case ND_COND: {
    int c = count();
    println("  br i1 %s, label %%L.then.%d, label %%L.else.%d", to_bool(node->cond), c, c);
    println("L.then.%d:", c);
    char *v1 = gen_expr(node->then);
    println("  br label %%L.end.%d", c);
    println("L.else.%d:", c);
    char *v2 = gen_expr(node->els);
    println("  br label %%L.end.%d", c);
    println("L.end.%d:", c);
    return to_val(format("phi %s [ %s %%L.then.%d, %s %%L.else.%d ]", gen_type(node->ty), v1, c, v2, c));
  }
  case ND_NOT: {
    char *v;
    if (is_flonum(node->lhs->ty)) {
      v = to_val(format("fcmp eq %s %s, 0.0", gen_type(node->lhs->ty), gen_expr(node->lhs)));
    } else {
      v = to_val(format("icmp eq %s %s, 0", gen_type(node->lhs->ty), gen_expr(node->lhs)));
    }
    return to_val(format("zext i1 %s to %s", v, gen_type(node->lhs->ty)));
  }
  case ND_BITNOT:
    return to_val(format("xor %s %s, -1", gen_type(node->lhs->ty), gen_expr(node->lhs)));
  case ND_LOGAND: {
    int c = count();
    char *v1 = to_bool(node->lhs);
    println("  br label %%L.and.%d", c);
    println("L.and.%d:", c);
    println("  br i1 %s, label %%L.true.%d, label %%L.end.%d", v1, c, c);
    println("L.true.%d:", c);
    char *v2 = to_bool(node->rhs);
    println("  br label %%L.end.%d", c);
    println("L.end.%d:", c);
    return to_val(format("phi %s [ 0 %%L.and.%d, %s %%L.true.%d ]", gen_type(node->ty), c, v2, c));
  }
  case ND_LOGOR: {
    int c = count();
    char *v1 = to_bool(node->lhs);
    println("  br label %%L.or.%d", c);
    println("L.or.%d:", c);
    println("  br i1 %s, label %%L.end.%d, label %%L.false.%d", v1, c, c);
    println("L.false.%d:", c);
    char *v2 = to_bool(node->rhs);
    println("  br label %%L.end.%d", c);
    println("L.end.%d:", c);
    return to_val(format("phi %s [ 1 %%L.or.%d, %s %%L.false.%d ]", gen_type(node->ty), c, v2, c));
  }
  case ND_FUNCALL: {
    if (node->lhs->kind == ND_VAR && !strcmp(node->lhs->var->name, "alloca")) {
      // TODO
      error_tok(node->tok, "not implemented");
    }
    char *args = fn_args(node->args);
    if (node->ty->kind == TY_VOID) {
      println("  call void %s(%s)", gen_addr(node->lhs), args);
      return NULL;
    }
    return to_val(format("call %s %s(%s)", gen_type(node->ty), gen_addr(node->lhs), args));
  }
  case ND_ADD:
  case ND_SUB:
  case ND_MUL:
  case ND_DIV:
  case ND_MOD:
  case ND_BITAND:
  case ND_BITOR:
  case ND_BITXOR:
  case ND_EQ:
  case ND_NE:
  case ND_LT:
  case ND_LE: {
    char *v1 = gen_expr(node->lhs);
    char *v2 = gen_expr(node->rhs);

    char *kw;
    switch (node->kind) {
    case ND_ADD: kw = is_flonum(node->lhs->ty) ? "fadd" : "add"; break;
    case ND_SUB: kw = is_flonum(node->lhs->ty) ? "fsub" : "sub"; break;
    case ND_MUL: kw = is_flonum(node->lhs->ty) ? "fmul" : "mul"; break;
    case ND_DIV: kw = is_flonum(node->lhs->ty) ? "fdiv" : node->lhs->ty->is_unsigned ? "udiv" : "sdiv"; break;
    case ND_MOD: kw = is_flonum(node->lhs->ty) ? "frem" : node->lhs->ty->is_unsigned ? "urem" : "srem"; break;
    case ND_BITAND: kw = "and"; break;
    case ND_BITOR: kw = "or"; break;
    case ND_BITXOR: kw = "xor"; break;
    case ND_EQ: kw = is_flonum(node->lhs->ty) ? "fcmp oeq" : "icmp eq"; break;
    case ND_NE: kw = is_flonum(node->lhs->ty) ? "fcmp une" : "icmp ne"; break;
    case ND_LT: kw = is_flonum(node->lhs->ty) ? "fcmp ult" : node->lhs->ty->is_unsigned ? "icmp ult" : "icmp slt"; break;
    case ND_LE: kw = is_flonum(node->lhs->ty) ? "fcmp ule" : node->lhs->ty->is_unsigned ? "icmp ule" : "icmp sle"; break;
    }

    char *ty = gen_type(node->lhs->ty);
    if (node->ty->kind == TY_PTR) {
      ty = "i64";
      if (node->lhs->ty-> kind == TY_PTR) {
        v1 = to_val(format("ptrtoint %s %s to i64", gen_type(node->lhs->ty), v1));
        if (node->rhs->ty-> kind != TY_PTR)
          v2 = to_val(format("mul i64 %s, 8", v2));
      }
      if (node->rhs->ty-> kind == TY_PTR) {
        v2 = to_val(format("ptrtoint %s %s to i64", gen_type(node->rhs->ty), v2));
        if (node->lhs->ty-> kind != TY_PTR)
          v1 = to_val(format("mul i64 %s, 8", v1));
      }
    }

    char *v = to_val(format("%s %s %s, %s", kw, ty, v1, v2));

    if (node->ty->kind == TY_PTR)
      v = to_val(format("inttoptr i64 %s to %s", v, gen_type(node->ty)));

    switch (node->kind) {
    case ND_EQ:
    case ND_NE:
    case ND_LT:
    case ND_LE:
      return to_val(format("zext i1 %s to %s", v, gen_type(node->ty)));
    }

    return v;
  }
  case ND_LABEL_VAL:
  case ND_CAS:
  case ND_EXCH:
    // TODO
    error_tok(node->tok, "not implemented");
  }

  error_tok(node->tok, "invalid expression");
}

static char *gen_stmt(Node *node) {
  switch (node->kind) {
  case ND_IF: {
    int c = count();
    println("  br i1 %s, label %%L.then.%d, label %%L.else.%d", to_bool(node->cond), c, c);
    println("L.then.%d:", c);
    gen_stmt(node->then);
    println("  br label %%L.end.%d", c);
    println("L.else.%d:", c);
    gen_stmt(node->els);
    println("  br label %%L.end.%d", c);
    println("L.end.%d:", c);
    return NULL;
  }
  case ND_FOR: {
    int c = count();
    if (node->init) gen_stmt(node->init);
    println("  br label %%L.continue.%d", c);
    println("L.continue.%d:", c);
    if (node->cond) {
      char *v1 = to_bool(node->cond);
      println("  br i1 %s, label %%L.then.%d, label %%L.break.%d", v1, c, c);
    }
    println("L.then.%d:", c);
    gen_stmt(node->then);
    if (node->inc) gen_expr(node->inc);
    println("  br label %%L.continue.%d", c);
    println("L.break.%d:", c);
    return NULL;
  }
  case ND_BLOCK:
    for (Node *n = node->body; n; n = n->next)
      gen_stmt(n);
    return NULL;
  case ND_LABEL:
    println("%s:", node->unique_label);
    gen_stmt(node->lhs);
    return NULL;
  case ND_RETURN:
    if (node->lhs)
      println("  ret %s %s", gen_type(node->lhs->ty), gen_expr(node->lhs));
    else
      println("  ret void");
    counter++;
    return NULL;
  case ND_EXPR_STMT:
    gen_expr(node->lhs);
    return NULL;
  case ND_ASM:
    error_tok(node->tok, "unavailable statement type");
    return NULL;
  case ND_DO:
  case ND_SWITCH:
  case ND_CASE:
  case ND_GOTO:
  case ND_GOTO_EXPR:
    // TODO
    error_tok(node->tok, "not implemented");
  }

  error_tok(node->tok, "invalid statement");
}

static char *fn_params(Obj *params) {
  char *res = "";
  for (Obj *var = params; var; var = var->next)
    res = format("%s, %s %%arg.%s.%d", res, gen_type(var->ty), var->name, var->offset);
  if (res[0]) res += 2;
  return res;
}

static char *fn_param_types(Type *params) {
  int i = 0;
  char *res = "";
  for (Type *ty = params; ty; ty = ty->next) {
    i++;
    res = format("%s, %s", res, gen_type(ty));
  }
  if (res[0]) res += 2;
  return res;
}

static char *global_value(char *data, Type *ty, int i) {
  uint64_t value;
  switch (ty->size) {
  case 1:
    value = data[i];
    break;
  case 2:
    value = ((uint16_t *)data)[i];
    break;
  case 4:
    value = ((uint32_t *)data)[i];
    break;
  case 8:
    value = ((uint64_t *)data)[i];
    break;
  }
  if (is_flonum(ty)) return format("0x%" PRIx64, value);
  else return format("%" PRIu64, value);
}

static char *global_array(char *data, Type *ty) {
  char *res = "";
  for (int i = 0 ; i < ty->array_len ; i++)
    res = format("%s, %s %s", res, gen_type(ty->base), global_value(data, ty->base, i));
  if (res[0]) res += 2;
  return format("[%s]", res);
}

void codegen(Obj *prog, FILE *out) {
  output_file = out;

  for (Obj *fn = prog; fn; fn = fn->next) {
    if (!fn->is_function) {
      if (fn->ty->kind == TY_ARRAY) {
        println("@%s = global [%d x %s] %s", fn->name, fn->ty->size / fn->ty->base->size, gen_type(fn->ty->base), global_array(fn->init_data, fn->ty));
      } else {
        println("@%s = global %s %s", fn->name, gen_type(fn->ty), global_value(fn->init_data, fn->ty, 0));
      }
      continue;
    }

    int i = 1;
    for (Obj *var = fn->locals; var; var = var->next)
      var->offset = i++;

    if (!fn->is_definition) {
      char *params = fn_param_types(fn->ty->params);
      println("declare %s @%s(%s)", gen_type(fn->ty->return_ty), fn->name, params);
      continue;
    }

    counter = 0;
    counter2 = 0;

    char *params = fn_params(fn->params);
    println("define %s @%s(%s) {", gen_type(fn->ty->return_ty), fn->name, params);

    for (Obj *var = fn->locals; var; var = var->next)
      println("  %%%s.%d = alloca %s", var->name, var->offset, gen_type(var->ty));
    for (Obj *var = fn->params; var; var = var->next)
      println("  store %s %%arg.%s.%d, %s* %%%s.%d", gen_type(var->ty), var->name, var->offset, gen_type(var->ty), var->name, var->offset);

    gen_stmt(fn->body);

    if (!strcmp(fn->name, "main"))
      println("  ret i32 0");
    else if (fn->ty->return_ty->kind == TY_VOID)
      println("  ret void");
    else
      println("  unreachable");

    println("}");
  }
}

This is the file I have been using to test it, and it seems to be compiling and working fine insofar!

test.c
int write(int fd, void *buffer, int len);

void print(char *str)
{
   int l = 0, x;
   while (str[l]) l++;
   while (l) x = write(1, str, l), l -= x, str += x;
}

void put(char ch)
{
   write(1, &ch, 1);
}

char hex(int i)
{
   if (i < 10) return '0' + i;
   else return 'A' + i - 10;
}

int main()
{
   print(" = hex table =\n");
   for (int i = 0 ; i < 0x40 ; i++)
      print("dec "), put(hex(i / 10)), put(hex(i % 10)),
      print(" = hex "), put(hex(i / 0x10)), put(hex(i % 0x10)),
      put('\n');
   return 0;
}
./chibicc -S -o ../test.ll ../test.c
clang -O3 -o ../test ../test.ll
../test

I’m hoping to continue working on this, and I’m going to keep posting updates on this issue, if that’s fine! (I’m going to use edits to avoid pinging watchers, though.)

changelog

2021

  • August 21: Started working on LLVM codegen.
  • October 25: Improved support for global values.

Compile warnings on GCC/Clang

When I compile with O2 I have this warning on GCC 10.2.0:

In function ‘rehash’,
    inlined from ‘get_or_insert_entry’ at hashmap.c:82:5:
hashmap.c:41:18: warning: argument 1 range [18446744071562067968, 18446744073709551615] exceeds maximum object size 9223372036854775807 [-Walloc-size-larger-than=]
   41 |   map2.buckets = calloc(cap, sizeof(HashEntry));
      |                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from chibicc.h:10,
                 from hashmap.c:3:
hashmap.c: In function ‘get_or_insert_entry’:
/usr/include/stdlib.h:542:14: note: in a call to allocation function ‘calloc’ declared here
  542 | extern void *calloc (size_t __nmemb, size_t __size)

And with Clang many warnings about switches, like:

codegen.c:81:11: warning: 42 enumeration values not handled in switch: 'ND_NULL_EXPR', 'ND_ADD', 'ND_SUB'... [-Wswitch]
  switch (node->kind) {
codegen.c:179:11: warning: 8 enumeration values not handled in switch: 'TY_VOID', 'TY_BOOL', 'TY_CHAR'... [-Wswitch]
  switch (ty->kind) {
...

gcc library path is not found

main.c

int main(int argc, char *argv[]){
        return 0;
}

Compile:

chibicc ./main.c -o ./n

Error:

gcc library path is not found

Re-use notification

Thanks for this great small C compiler; it's lean and powerful.

In case you're interested: I reused your parser to implement a C header file to Oberon+ definition module transpiler, see https://github.com/rochus-keller/C2OBX. It already works sufficiently well with the SDL2 headers. The modifications to your code were minor (see commit log); I intend to also process comments and transpile a subset of defines, so some more modifications are likely needed.

security of the executables properties

maybe you can consider make it more secure by adding security flags like PIE, RELRO, PaX, Canaries, ASLR, Fortify.
i dont know if this will make any difference on the performance but in my opinion, it's better than nothing tho

image link

'typedef enum MyEnum MyEnum' cannot get parsed before define

#include "test.h"

typedef int MyInt, MyInt2[4];
typedef int;
typedef enum MyEnum MyEnum;

enum MyEnum
{
ENUM_TEST
};

int main() {
ASSERT(1, ({ typedef int t; t x=1; x; }));
ASSERT(1, ({ typedef struct {int a;} t; t x; x.a=1; x.a; }));
ASSERT(1, ({ typedef int t; t t=1; t; }));
ASSERT(2, ({ typedef struct {int a;} t; { typedef int t; } t x; x.a=2; x.a; }));
ASSERT(4, ({ typedef t; t x; sizeof(x); }));
ASSERT(3, ({ MyInt x=3; x; }));
ASSERT(16, ({ MyInt2 x; sizeof(x); }));

printf("OK\n");
return 0;
}

Missing header and expected ','

bash-4.3$ ./chibicc -o hello-world hello-world.c
/usr/include/stdio.h:33: # include <stddef.h>
^ stddef.h: cannot open file: No such file or directory

bash-4.3$ gcc --print-file-name=include
/usr/lib64/gcc/x86_64-slackware-linux/5.5.0/include

When I copied stddef.h from /usr/lib64/gcc/x86_64-slackware-linux/5.5.0/include to /usr/include/ I get this message:
bash-4.3$ ./chibicc -o hello-world hello-world.c
/usr/include/libio.h:49: #include <stdarg.h>
^ stdarg.h: cannot open file: No such file or directory

When I copied stdarg.h from /usr/lib64/gcc/x86_64-slackware-linux/5.5.0/include to /usr/include/ I get this message:
bash-4.3$ ./chibicc -o hello-world hello-world.c
/usr/include/stdarg.h:40: typedef __builtin_va_list __gnuc_va_list;
^ expected ','

Maybe it might help, here is my problem with TCC too:
bash-4.3$ ./tcc -o hello-world hello-world.c
In file included from hello-world.c:1:
In file included from /usr/include/stdio.h:74:
In file included from /usr/include/libio.h:49:
/usr/include/stdarg.h:40: error: ';' expected (got "__gnuc_va_list")

Using goto inside statement expressions gives an error

Mixing statement expressions and goto gives an error, usually this works in GCC/Clang. Small test case:

int main() {
  int r = ({
    int x;
    goto label;
label:
    x;
  });
  return 0;
}

Output:

/chibicc a.c -o a
a.c:2:   int r = ({
                 ^ statement expression returning void is not supported

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.