singhcoder / compiler-construction Goto Github PK

Compiler in C for a custom language

C 90.78% Makefile 0.06% Assembly 9.02% C++ 0.14%

compiler-construction's Introduction

Harpinder Jot Singh

I am Harpinder, working at DevRev almost since the start of the company. I have experience in building products 0->1, am a generalist by nature, and have worked on multiple projects here at DevRev including the Snap-in platform, and ML/LLMOps solution for our AI needs.

I did my B.tech in Computer Science from BITS Pilani.

Recently, I have started diving into LLMs era, and tinker around with them in my free time.

Let's talk!

compiler-construction's People

Contributors

Stargazers

Watchers

Forkers

rbz-99 guptasamarth61 aaks-ctrl vismit2000 lakshmiteja17 pbisen akash-sr hemantsisodiya11 siddharth-u todi-sid superbhanu22 ypmundhada rachit-0107 olinreedster

compiler-construction's Issues

Testing required

Please write test cases and verify correctness for following

Symbol table scoping works correctly
Selecting 7 (and 8 also) from menu while running program generates symbol table
Interpret output as :

When a new scope is opened
- It prints New Scope opened at
  - (non-terminal is the one involved in AST which opens up a scope)
Whenever it encounters a declaration of a variable,
- Declaration statement
- For loop opening {right now it is assumed that whenever a FORLOOP is encountered, the index variable involved is declared on the fly so it's added to symbol table as type int}
- Input/output parameters of a function definition
It prints "Inserting in hashtable '' "
When the id is accessed and it exists in symbol table, it prints "Received entry for '' , type_name = ARRAY|BOOLEAN|REAL|INTEGER "
- If array, it also prints range and primitive type
  - if the range contains an identifier, 0x3f3f3f3f is printed at that place
When a function definition is encountered
- Inserting in hashtable <fun_id>
- Received entry for , type_name = MODULE
- is_declared : 1(true)
- Input_types_list : [<num_of_inp_args>] - arg1 type | arg2 type | ... | argn type
- Output_types_list : same as above

Type checker module implementation

Whenever it encounters an assignment statement, it prints :

"id" = .. type checking
LHS_TYPE : REAL | BOOLEAN | ARRAY | INTEGER | TYPE_ERROR
RHS_TYPE : same as above
If types match
- =======Type matches======== is printed
Else
- SEMANTIC ERROR: lhs and rhs types don't match
If an identifier is not defined,
- SEMANTIC ERROR: ID not defined is printed
If array is not dynamic,
- and index is out of bounds, it prints,
  - SEMANTIC ERROR: INDEX OUT OF BOUNDS
- if index is not int type
  - SEMANTIC ERROR: INVALID ARRAY ACCESS
Description of type checking rules here

Type Extractor and Checker:
Type of an identifier is extracted from the declaration statement that declares the identifier. The data types supported in the language you are implementing are: integer, real, boolean and array. The type of an element of an array is the type of the data the array refers to. For instance, consider the statement declare a: array[1..15] of integer; the type of an element a[5], say is integer. The type checker verifies the type of an expression appearing at the right hand side of the assignment statement such as value := (a+b‐c ) and checks if it matches with that of the identifier on the left hand side. An arithmetic operator can have two operands of the similar type, where types can be integer and real data types. Example : let a,b, c and d be declared as integer then

An expression in a statement a := (b2‐4c)+5*d; its RHS has a type integer and since the identifier a is also of the same type, the type checker approves the expression assignment

An expression in a statement a:= (b2 +c)/d <= c10; has a boolean valued expression which is assigned to the real valued identifier, hence the expression value assignment is wrong.

The Static type checking rules are:

The type of an identifier is the type appearing while declaring the variable.
The type of NUM is integer.
The type of RNUM is real.
The type of true or false value is boolean.
The type of an array variable A of type array [12..20] of real (say) is defined as an expression <real, 12, 20>. The type of an array element A[13] (say) is real if 13 is within the bounds [12, 20]
The type of a simple expression (say E) of the form expression(say E1) Expression(say E2)

is integer, if both expressions are of type integer and the operator is arithmetic operator.
is real, if both the expressions are of type real and the operator is arithmetic operator.
is boolean, if both expressions are of type integer and the operator is a relational operator.
is boolean, if both expressions are of type real and the operator is relational.
is boolean, if both expressions are of type boolean and the operator is logical.

The type of the expression is ERROR, if the above rules do not derive the type of E appropriately.

Type checking rules for array construct are as follows:

The operations +, -, *, / and all relational and logic operators cannot be applied on array variables of type <type, 12, 20> (say)
The assignment operator applied to two array variables of the same type is valid. For example, if A and B are the array variables of type array[12..20] of real, then A:= B; is a valid statement. This applies to dynamic arrays of type array[a..b] of real as well.
Consider array elements with index represented by integer identifier say A[k]. Here type checking of variable k is done at compile time. If the type of k is integer, then it is valid else it is reported as an error. Also, the type checking of A[13], for type of index (NUM), is done at compile time.
The bound checking of A[13] where A is a static array is done at compile time. If A is a dynamic array, then bound checking of A[13] is done at run time [see below]
The type of an identifier or an expression is computed by traversing the abstract syntax tree.

Remove limits.h from semanticAnalyzerDef

@happy please remove it. It's no longer required.

Also #define MAX_CARSNUM_IN_EXPR

New changes done for semantic analyzer

Grammar is changed at rule numbers 61-63, 75-77, 79 , 90 to do left factoring properly

Things left in semantic analyzer module

Currently, Single AST pass has been implemented for semantic rules checking

Second pass is required to
- check function calls validity where declaration existed but definition could not be verified due to inexistence

Storing of lexeme in heap

We are storing lexeme in heap here, which could create a problem when function and other invocation are done. String local variable has value in the heap which is not "reset" until the program terminates. Also when we change it to storing lexeme within token then we have to pass the information of lexeme appropriately.

Minor Implementation problems/bugs

Array bounds, if of type a..b were kept as a constant OBTAIN_DYNAMICALLY while installing in symbol table to tell that it will be obtained dynamically.
But it isn't working as expected, and gives some other integral values check it out
Maybe keep a label is_dynamic to ensure it?
While making hash_table to static instead of dynamic added a #define NUM_TABLE_ENTRIES but it's not needed due to presence of HASH_SIZE already, so remove it
Currently each error printing again and again types code for printing colors, make a separate module print_error with following declaration

void print_error( char *error_type /*LEXICAL / SYNTAX / SEMANTIC */, char *error_msg);

Right now, to get second/third child, it's hardcoded to as leftmost_child->sibling->sibling->...
Define a function giving nth child
Store number of children of a tree_node also in a node. Will help in generating codes later
There is an unnecessary error_present variable present in parser.h/c which is not used anywhere later. Remove it

Implement Type checking rules

Type checker's description - here

Some Final changes for creating AST

Added a non-terminal NT_ARRAY and a rule DATATYPE->NT_ARRAY to be able to distinguish that it's an array type
Removed OTHERCASE stmt and made CASESTMT rule right recursive so that we get a linked list of form value -> stmts -> value -> stmts.. rather than again a tree like structure
Changes MODULEREUSESTMT name to FUNCTIONCALLSTMT for sake of clearity

Offset and width calculations

Try to incorporate changes to the code base to implement offset and width calculations and store them in the symbol table
Offset calculations is required everytime an id is installed in symbol table
For width you can directly go inside function inserting an id in symbol table and check the token and store width
Give some #defines defining widths of types in bytes so that we can later change them if needed accordingly

You can find the changes done to incorporate symbol table in these two commits 9aef702 and 59f8140

Hashtable recursive search

Checkout functionality for searching for a lexeme in hash table(key_present_in_table),

Right now it searches only in current table and returns false if not found
Implement a recursive version so as to search in parents (outer scopes) if not found in current scope and return false only if not found anywhere.

Semantic Rules implementation

Implement these semantic rules for ERPLAG

An identifier cannot be declared multiple times in the same scope.
While installing in symbol table, make a check if already present, flag an error (make sure that while checking you only search current symbol table and not parent too)
An identifier must be declared before its use.
Whenever there is a usage and symbol table says no entry, flag an error
The types and the number of parameters returned by a function must be the same as that of the parameters used in invoking the function.
Check semantics of FUNCTIONCALLSTMT
The parameters being returned by a function must be assigned a value. If a parameter does not get a value assigned within the function definition, it should be reported as an error.
There must be a assignmentstmt for each output parameter type within function
The function that does not return any value, must be invoked appropriately.
Same as 3
Function input parameters passed while invoking it should be of the same type as those used in the function definition.
Semantics of FUNCTIONCALLSTMT
A switch statement with an integer typed identifier associated with it, can have case statement with case keyword followed by an integer only and the case statements must be followed by a default statement.
Semantics of SWITCH stmt
A switch statement with an identifier of type real is not valid and an error should be reported.
same as 7
A switch statement with a boolean type identifier can have the case statements with labels true and false only. The switch statement then should not have a default statement.
same as 7
Function overloading is not allowed.
Same as identifier multiple installations
A function declaration for a function being used (say F1) by another (say F2) must precede the definition of the function using it(i.e. F2), only if the function definition of F1 does not precede the definition of F2.
might require 2 passes of AST
If the function definition of F1 precedes function definition of F2(the one which uses F1), then the function declaration of F1 is redundant and is not valid.
A for statement must not redefine the variable that participates in the iterating over the range.
The function cannot be invoked recursively.
An identifier used beyond its scope must be viewed as undefined