- Maven v.10.0
- Oracle JDK 1.8
-
Build
mvn package
The output .jar will be built in /Target
Bug: The jar generated by maven cannot be run due to the system scope dependency. Please adjust it to compile or usemvn install
to install the package in maven's dependencies management folder. -
IDE
If you want to use Eclipse, try the command below:
mvn eclipse:eclipse
-
Top-level class is
com.sql.interpreter.App
-
The runnable jar could be run in cmd
java -jar cs4321_p3.jar $<interpreter_config_file>
Hash join has the same I/O cost as refined sort merge join. So theoretically, hash join will perform better than the sort merge join operator we already have. Assuming one bucket is able to fit in memory, the hash join operator will cost around 3(M + N), where M is the number of outer tuple and N is the number of inner tuple.
We first extract the join condition for later use of hash function. For hashing, there are two phases: in the first phase, we use the mod hash function and divide the tuple into different buckets according to the hash code of the. Then we write the tuples into files. Each bucket is a file. For example, if the mod value is 5, there will be 10 files, 5 for left tuples and 5 for right tuples. In the second phase, take left and right buckets with the same hash code, and join the tuples in the two buckets according to the join condition. During this phase, we use another hash function, also a mod function. We use a HashMap<Integer, List> as a hash table in store the left bucket as inner tuple.
To make the most of CPU usage, we can use parallel join with multi-threaded program. Specifically, during the second phase of hash join, for each pair of bucket with the same hash code, we start a new thread to do the join operation of the two buckets. There are two shared variables: tuple queue and finish count. Tuple queue stores joint tuple. All the threads put the result tuple into the queue. The main thread getNextTuple takes tuple from the queue. Finish count is to record how many threads has finished. The getNextTuple will return null only if all the threads have finished and the queue is empty. Otherwise, the main thread will wait if the queue is empty.
Pushing selection is implemented using the info generated by UnionFind. Selections are pushed in the LogicalPlanBuilder class. After the creation of a logical ScanOperator, all the attributes in the UnionFind list would be traversed. For each attribute that both belongs to the schema of the scanOperator and also have its constraints, the attribute would be added to a Map. Finally, if the Map is not null, a SelectOperator would be created using all the attributes and constraints in the Map.
The UnionFind class contains two important methods: find
and union
. It unions elements together creating separate sets. If two elements in two different sets union, these two sets will merge together. All the elements in a set have the same attributes.
find
takes in a column and returns the corresponding Constraints object. It finds the root constraints according to a father map of union find class. Once the root is found, update the roots of all the visited nodes to the same Constraints object.union
takes two columns and merge them together. The corresponding two sets will be merged after this when thefind
method is used on these columns.
PlanBuilder.JoinOrder
- Considering our schema stradegy, we have to calculate the join order during the logcial plan builder, in logical operator exactly.
- The optimal order is obtained via dynamic programming. We recursively traverse the subset and record the optimal solution locally to avoid repeated computing.
In my Project 2 benchmarking, I noticed that
SMJ runs much faster than BNLJ, so I implement all joins as SMJ where possible. However, SMJ does not
apply to joins that have other-than-equality comparisons or to pure cross-products, so those are
implemented using BNLJ.
When implementing samples in the server, although BNLJ takes the lower I/Os, SMJ runs 120 times faster than BNLJ under the same block size.
The logical query plan is printed by traversing all the logical operators in the plan tree in preorder using visitor pattern implemented as PhysicalOperatorVisitor class. In each visit function in the PhysicalOperatorVisitor class, information of the input argument operator is first printed, then the children/child of the operator will accept this visitor in order. The union find information will be printed with the help of union find getOutput() method. Possible differences from expected output:
- No project operator if the query is "SELECT * From ....."
- For Join operator, conditions like S.A < R.E will not create new elements in union find.
- Order of leaves and union elements.
The physical query plan is printed by traversing all the physical operators in the plan tree in preorder using visitor pattern implemented as PhysicalOperatorVisitor class. In each visit function in the PhysicalOperatorVisitor class, information of the input argument operator is first printed, then the children/child of the operator will accept this visitor in order. There might be some difference between the printed expected physical query plan and ours and the following is our explanation to the difference:
- The difference of the join order. According to our calculation, we think that our join order should be more time efficient than or same time efficient as the expect join order.
- The difference between the chosen join methods. According to our test, we find our physical plan is much more time efficient than the expected physical plan.
- Difference between Join condition. We have some redundant join conditions printed that inherited from its child join operator. This does not affect the behavior of our query plan but we did not fix the bug in the print process.