Git Product home page Git Product logo

sql_interpreter's Introduction

cs4321_p1

Prerequisite

  • Maven v.10.0
  • Oracle JDK 1.8

Installation

  • Build

    mvn package
    

    The output .jar will be built in /Target
    Bug: The jar generated by maven cannot be run due to the system scope dependency. Please adjust it to compile or use mvn install to install the package in maven's dependencies management folder.

  • IDE

    If you want to use Eclipse, try the command below:

    mvn eclipse:eclipse
    

Implement

  • Top-level class is com.sql.interpreter.App

  • The runnable jar could be run in cmd

    java -jar cs4321_p3.jar $<interpreter_config_file>

Details

Hash Join

Hash join has the same I/O cost as refined sort merge join. So theoretically, hash join will perform better than the sort merge join operator we already have. Assuming one bucket is able to fit in memory, the hash join operator will cost around 3(M + N), where M is the number of outer tuple and N is the number of inner tuple.
We first extract the join condition for later use of hash function. For hashing, there are two phases: in the first phase, we use the mod hash function and divide the tuple into different buckets according to the hash code of the. Then we write the tuples into files. Each bucket is a file. For example, if the mod value is 5, there will be 10 files, 5 for left tuples and 5 for right tuples. In the second phase, take left and right buckets with the same hash code, and join the tuples in the two buckets according to the join condition. During this phase, we use another hash function, also a mod function. We use a HashMap<Integer, List> as a hash table in store the left bucket as inner tuple.

Parallel Join

To make the most of CPU usage, we can use parallel join with multi-threaded program. Specifically, during the second phase of hash join, for each pair of bucket with the same hash code, we start a new thread to do the join operation of the two buckets. There are two shared variables: tuple queue and finish count. Tuple queue stores joint tuple. All the threads put the result tuple into the queue. The main thread getNextTuple takes tuple from the queue. Finish count is to record how many threads has finished. The getNextTuple will return null only if all the threads have finished and the queue is empty. Otherwise, the main thread will wait if the queue is empty.

Selection Pushing

Pushing selection is implemented using the info generated by UnionFind. Selections are pushed in the LogicalPlanBuilder class. After the creation of a logical ScanOperator, all the attributes in the UnionFind list would be traversed. For each attribute that both belongs to the schema of the scanOperator and also have its constraints, the attribute would be added to a Map. Finally, if the Map is not null, a SelectOperator would be created using all the attributes and constraints in the Map.

Union Find

The UnionFind class contains two important methods: find and union. It unions elements together creating separate sets. If two elements in two different sets union, these two sets will merge together. All the elements in a set have the same attributes.

  1. find takes in a column and returns the corresponding Constraints object. It finds the root constraints according to a father map of union find class. Once the root is found, update the roots of all the visited nodes to the same Constraints object.
  2. union takes two columns and merge them together. The corresponding two sets will be merged after this when the find method is used on these columns.

Join Order

PlanBuilder.JoinOrder

  • Considering our schema stradegy, we have to calculate the join order during the logcial plan builder, in logical operator exactly.
  • The optimal order is obtained via dynamic programming. We recursively traverse the subset and record the optimal solution locally to avoid repeated computing.

Join Implement

In my Project 2 benchmarking, I noticed that SMJ runs much faster than BNLJ, so I implement all joins as SMJ where possible. However, SMJ does not apply to joins that have other-than-equality comparisons or to pure cross-products, so those are implemented using BNLJ.
When implementing samples in the server, although BNLJ takes the lower I/Os, SMJ runs 120 times faster than BNLJ under the same block size.

Query Plan Print

Logical Query Plan Print and Potential Issues

The logical query plan is printed by traversing all the logical operators in the plan tree in preorder using visitor pattern implemented as PhysicalOperatorVisitor class. In each visit function in the PhysicalOperatorVisitor class, information of the input argument operator is first printed, then the children/child of the operator will accept this visitor in order. The union find information will be printed with the help of union find getOutput() method. Possible differences from expected output:

  1. No project operator if the query is "SELECT * From ....."
  2. For Join operator, conditions like S.A < R.E will not create new elements in union find.
  3. Order of leaves and union elements.

Physical Query Plan Print and Potential Issues

The physical query plan is printed by traversing all the physical operators in the plan tree in preorder using visitor pattern implemented as PhysicalOperatorVisitor class. In each visit function in the PhysicalOperatorVisitor class, information of the input argument operator is first printed, then the children/child of the operator will accept this visitor in order. There might be some difference between the printed expected physical query plan and ours and the following is our explanation to the difference:

  1. The difference of the join order. According to our calculation, we think that our join order should be more time efficient than or same time efficient as the expect join order.
  2. The difference between the chosen join methods. According to our test, we find our physical plan is much more time efficient than the expected physical plan.
  3. Difference between Join condition. We have some redundant join conditions printed that inherited from its child join operator. This does not affect the behavior of our query plan but we did not fix the bug in the print process.

sql_interpreter's People

Contributors

fseldow avatar gjingdx avatar waltmo147 avatar

Watchers

 avatar

Forkers

siyaohuang

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.