eholk / harlan

A language for GPU computing.

License: Other


harlan's Introduction

Harlan

Harlan is a domain specific language for programming GPUs. This project serves primarily as a testbed for implementation and optimization techniques. The language is intentionally small, in order to simplify the process of exploring new analyses and optimizations.

Getting Started

Harlan is known to build and run on the following operating systems.

Mac OS X 10.6 (Snow Leopard)
Mac OS X 10.7 (Lion)
Mac OS X 10.8 (Mountain Lion)
Mac OS X 10.9 (Mavericks)
Various flavors of Linux

Others will probably work as well. The OpenCL included with Mac OS X has several bugs that lead to failures in some of the Harlan test cases. For the most part, these failures can be ignored.

Harlan requires an OpenCL implementation as well as a compatible Scheme. Below are several OpenCL implementations that should work.

Harlan is known to work with (Petite) Chez Scheme and Vicare. Petite Chez Scheme can be downloaded from http://www.scheme.com/download.

Once all the prerequisites are installed, you can compile and run the test suite as follows.

make check

If the tests are successful, you will see the following at the end of all the output:

All tests succeeded.

The test programs are available in the test directory. End-to-end test programs have the .kfc extension. Other extensions represent code that is valid at various intermediate passes in the compiler.

Make puts test binaries in the test.bin directory, and also saves output from test programs there. Programs may be run directly from this directory for easier debugging.

Harlan programs can be compiled manually as follows.

./harlanc hello.kfc

For debugging purposes, the -v flag can be used.

./harlanc -v hello.kfc

This causes the compiler to write out the intermediate results from each compiler pass.

If compilation succeeds, the compiler produces an executable named after the program's file. For example, the command above produces the file hello, which can be executed directly.

Next Steps

The official Harlan documentation is currently held in the Harlan Wiki.

The Harlan mailing list is harlan-dev. Feel free to join this group and ask questions of the Harlan developers and other Harlan users.

Directory Structure

Here is a quick overview of the various directories included in this repository.

HSBencher - a utility to automatically run and report the results of Harlan benchmarks.
etc - miscellaneous tools, including an Emacs mode for Harlan files.
external - the location of several git submodules, including the Nanopass Framework and Elegant Weapons.
harlan - the source code for the Harlan compiler.
lib - the standard library for Harlan.
rt - the runtime library for Harlan.
test - Harlan test cases and benchmarks.
test.bin - Harlan's test runner (run-tests.scm) stores the compiled binaries and program output for the test cases here.
travis - support files for running Harlan's test suite under Travis CI.
util - miscellaneous Scheme files, including compatibility shims for different Scheme implementations.

harlan's Issues

Crazy things inside of kernels should be dealt with appropriately

Whether this means throwing a compile time error (easier), or running on the CPU at runtime (harder), we have to do something.

Run tests on other operating systems

Specifically, the benchmarks. We need to see if we get the same performance characteristics across different operating systems, OpenCL versions and hardware configurations.

Windows port

This would be good if we start getting more users. We could probably do it under MinGW32.

Add a parser for C-like syntax

We have enough posters and publications out there using a C-like syntax for Harlan that it's starting to feel a little dishonest when we actually write all our programs in S-expressions. We should come up with a syntax and write a lexer and parser that translates it into current Harlan S-expressions.

global-id primitive

During optimization/fusing, we have to walk the kernel body to replace references to get-global-id when the dimension of the argument changes. It would be helpful if we could just refer to (global-id var), a primitive that could be expanded during make-vector-refs-explicit or something.

Generate multiple kernels in some situations

For example, it is possible that x in this example is not perfectly rectangular,

(kernel ((row x))
  (kernel ((i row))
    ...))

but if it is rectangular (or, close enough to being rectangular), a 2D kernel should be executed. Otherwise, the "safe" route should be taken, where the kernels are still nested.
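
One way to make this choice is a runtime check on the shape of x. The following Python sketch is not Harlan code: `is_rectangular` and `choose_kernel_shape` are hypothetical names, and only the exactly rectangular case is handled, leaving the "close enough" heuristic open.

```python
def is_rectangular(rows):
    """Return True when every inner vector has the same length."""
    lengths = {len(row) for row in rows}
    return len(lengths) <= 1


def choose_kernel_shape(rows):
    """Pick a launch strategy: one 2D kernel, or the safe nested form.

    Hypothetical helper; the tuple layout is an assumption of this sketch.
    """
    if rows and is_rectangular(rows):
        return ("2d", len(rows), len(rows[0]))
    return ("nested", len(rows))
```

A ragged input such as `[[1], [1, 2, 3]]` falls back to the nested form, which matches the "safe" route described above.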

Resizable regions

When allocating in regions outside of a kernel, we should allow them to be resized. Mostly this is just a call to realloc, but it means we also have to keep an extra level of indirection in the CPU code because realloc can change the pointer value.
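
The indirection can be illustrated with a toy model. The Python sketch below assumes that pointers into a region are stored as offsets from the region's base, so growing the backing store (the analogue of realloc) does not invalidate them. The `Region` class and its methods are hypothetical and are not taken from the Harlan runtime.

```python
class Region:
    """Toy model of a growable region; pointers are offsets, not addresses."""

    def __init__(self, size):
        self.data = bytearray(size)  # backing store; replaced on growth
        self.next_free = 0

    def alloc(self, nbytes):
        """Reserve nbytes and return the offset of the new block."""
        while self.next_free + nbytes > len(self.data):
            self._grow()
        offset = self.next_free
        self.next_free += nbytes
        return offset

    def _grow(self):
        # Analogue of realloc: the backing store may move, so callers must
        # reach it through the Region object rather than caching it.
        bigger = bytearray(max(1, 2 * len(self.data)))
        bigger[:len(self.data)] = self.data
        self.data = bigger
```

Code that holds only the `Region` object and an offset keeps working after a resize, which is the role the extra level of indirection plays in the CPU code.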

Region size should be controlled by an environment variable

This will make it easier to run tests that make lots of data.

This won't be needed once regions can resize, but it still may help to set the minimum region size for some cases.
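
A minimal sketch of the lookup, in Python for brevity. The variable name `HARLAN_MIN_REGION_SIZE` and the default value are assumptions made for this example, not settings confirmed by the issue.

```python
import os

DEFAULT_REGION_SIZE = 16 * 1024 * 1024  # assumed default, in bytes


def region_size(environ=os.environ, name="HARLAN_MIN_REGION_SIZE"):
    """Read the region size in bytes from the environment, with a fallback.

    The variable name is hypothetical; unset, malformed, or non-positive
    values fall back to the default.
    """
    raw = environ.get(name)
    if raw is None:
        return DEFAULT_REGION_SIZE
    try:
        size = int(raw)
    except ValueError:
        return DEFAULT_REGION_SIZE
    return size if size > 0 else DEFAULT_REGION_SIZE
```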

Error reporting in kernels

It'd be good to have kernels be able to report errors for when something goes wrong (for example, a failed bounds check). This probably means adding a status vector that kernels set to a value depending on which error occurred.
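
The status-vector idea can be simulated on the CPU. In this Python sketch each loop iteration stands in for one GPU work item; the error codes and function names are hypothetical.

```python
OK, OUT_OF_BOUNDS = 0, 1  # hypothetical error codes


def run_kernel(indices, source):
    """Simulate one work item per index; record errors instead of aborting."""
    out = [0] * len(indices)
    status = [OK] * len(indices)  # one slot per work item
    for tid, i in enumerate(indices):
        if 0 <= i < len(source):
            out[tid] = source[i]
        else:
            status[tid] = OUT_OF_BOUNDS  # failed bounds check
    return out, status


def first_error(status):
    """Host-side check after the kernel finishes: (work item, code) or None."""
    for tid, code in enumerate(status):
        if code != OK:
            return tid, code
    return None
```

The host would inspect the status vector once the kernel completes and raise an error for the first non-zero entry.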

Better typechecker errors

Will has a cool idea based on failing when backtracking that would let us log something to help point out where the typechecking went wrong.

This is partially implemented (prints out a backtrace in verbose-mode), but it cannot identify exactly what does not typecheck.

Functions must be defined before they can be used

This fails in g++:

(module 
  (define (bar x)
    (foo x))
  (define (foo x)
    x)
  (define (main) (bar 5) (return 0)))

We need to emit prototypes for all the functions in a module so they can mutually refer to each other.
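
A sketch of the two-pass emission, written in Python as a stand-in for the code generator. The tuple layout used for functions is an assumption of this example and not the compiler's actual intermediate form.

```python
def emit_module(functions):
    """Emit C++ text for a list of functions.

    functions: list of (return_type, name, [(arg_type, arg_name)], body).
    This layout is hypothetical, chosen only for the sketch.
    """
    def signature(ret, name, args):
        params = ", ".join(f"{t} {n}" for t, n in args)
        return f"{ret} {name}({params})"

    lines = []
    # First pass: a prototype for every function, so order no longer matters.
    for ret, name, args, _ in functions:
        lines.append(signature(ret, name, args) + ";")
    # Second pass: the definitions themselves, in source order.
    for ret, name, args, body in functions:
        lines.append(signature(ret, name, args) + " { " + body + " }")
    return "\n".join(lines)
```

With bar and foo from the example above, the prototype for foo is emitted before the body of bar, so the call to foo resolves.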

Add another typechecking pass

We do a lot of code transformations, and it may be that we end up with something that isn't well-typed. It might be worth adding another typechecking pass towards the end of the compiler to make sure all the code we generated makes sense.

Add symbols

We should add Scheme-style symbols.

As long as we don't include things like (string->symbol ...), it shouldn't cause too many problems. Basically, we'd create symbols by doing 'x. Then we'd have a pass that finds all the symbols and maps them onto integers, which we'd use instead. When we go to print or otherwise observe symbols, we could look them up in the symbol table.
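
A possible shape for such a pass, sketched in Python. It assumes S-expressions are modelled as nested lists and that a quoted symbol 'x appears as the tuple ("quote", "x"); both conventions are specific to this example.

```python
def intern_symbols(expr, table=None):
    """Replace each ('quote', name) node with a small integer.

    Returns the rewritten expression and the name -> integer table.
    The node representation is an assumption of this sketch.
    """
    if table is None:
        table = {}
    if isinstance(expr, tuple) and len(expr) == 2 and expr[0] == "quote":
        # First occurrence gets the next free integer; later ones reuse it.
        return table.setdefault(expr[1], len(table)), table
    if isinstance(expr, list):
        return [intern_symbols(e, table)[0] for e in expr], table
    return expr, table


def symbol_name(table, number):
    """Reverse lookup, for printing a symbol at run time."""
    return next(name for name, n in table.items() if n == number)
```

Because symbols become plain integers, comparing two symbols reduces to an integer comparison, and only printing needs the table.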

Kernels should be able to make some function calls

Or should they be able to make all function calls, and be forced to run on the CPU if they do anything too crazy?

Remove unnecessary CPU/GPU transfers

We're currently moving way too much data back and forth. We should add a pass to identify and remove unnecessary CPU/GPU transfers.

Ensure Harlan can run on the mine machines

The machines in the mine have CUDA capable GPUs now. We should make sure Harlan works out of the box on them.

The gensyms we create in our compiler should be more meaningful

... so that our output code is more readable.

Identity kernel

We need a simple test that basically just tests memory bandwidth.

Add bounds checking

Harlan should be memory safe, but it's not if we don't check to make sure array accesses are in bounds.

We currently check (some) vector-refs, but we do not check the lengths of kernel arguments.

Kernel arguments are ignored in region inference

We take no advantage of the knowledge of which variables appear in which kernels. We should be using this information to place variables that are used within the same kernels into the same regions.

Our method doesn't have to be optimal, but we need to do something.

Add time-passes mode to harlanc

It'd be nice to profile which tasks are taking the longest.

Our compiler is starting to get a bit slow. This could just be because we use g++, but we should maybe be optimizing some of our own passes now too.

We have a timing function now, -t, but it would be nice to print out a succinct table at the end as well.
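
A sketch of what the summary table could look like, in Python. The pass names and the column layout are illustrative only; this does not describe the existing -t output.

```python
import time


def run_passes(passes, program):
    """Run (name, function) passes in order and record seconds per pass."""
    timings = []
    for name, fn in passes:
        start = time.perf_counter()
        program = fn(program)
        timings.append((name, time.perf_counter() - start))
    return program, timings


def format_table(timings):
    """One row per pass, slowest first, with a share-of-total column."""
    total = sum(t for _, t in timings) or 1.0  # avoid dividing by zero
    rows = sorted(timings, key=lambda row: row[1], reverse=True)
    return "\n".join(
        f"{name:<30} {secs:10.6f}s {100 * secs / total:6.1f}%"
        for name, secs in rows
    )
```

Sorting by elapsed time puts the passes most worth optimizing at the top of the table.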

rt/gpu_common.h path problem

Harlanc is not quite path-independent because it generates code that tries to include rt/gpu_common.h. At the very least, we need to somehow include path information to find everything, but we should maybe just include the text of gpu_common.h in the resulting binary.

Per-function profiling within kernels

CUDA and OpenCL don't seem to provide this already. Since we are generating our own code, there's a chance we could do this. This would be really handy for people tuning their applications, and could serve as a good reason to use Harlan over straight CUDA.

Code Explosion

The s-exprs towards the end of the compiler are insanely large and hard to read. With primitives like (m)alloc and get-region-ptr, we could eliminate some unnecessary calls and casts.

The primitives above are implemented, but there is more room for improvement.

Stencil syntax for kernels

It'd be nice to be able to specify stencils with special syntax in a kernel. For example, we could do something like this:

(kernel ([(stencil (a b c)
                   (d e f)
                   (g h i))
          M])
   (/ (+ a b c d e f g h i) 9))

This kernel would compute the average of a 3x3 window in a matrix and use each of these as a new value in a result matrix. Obviously we'd need to specify boundary conditions, and the stencil keyword might be too verbose, but this is the basic idea.

Is this something we could express with a macro system for Harlan?

We could also do a bunch of stencil-specific optimizations really easily with this syntax.
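
For reference, the 3x3 averaging stencil described above computes the following, written here as plain Python rather than Harlan. The boundary condition (border cells copied through unchanged) is an assumption of this sketch, since the issue leaves it open, and the input is assumed to be a non-empty rectangular matrix.

```python
def stencil_average(matrix):
    """Mean of each 3x3 window; border cells are copied through unchanged.

    The border handling is an assumed boundary condition.
    """
    rows, cols = len(matrix), len(matrix[0])
    result = [row[:] for row in matrix]  # copy so the input stays intact
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = [matrix[r + dr][c + dc]
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            result[r][c] = sum(window) / 9
    return result
```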

Add support for tuples

Ryan needs this for his Accelerate to Harlan compiler.

Translate benchmarks from MATLAB

There are a few xfailed tests that are progress towards this issue, but none of them are complete.

Different types for mapped/unmapped regions

There should be two different types for regions: one for regions on the CPU, and another for regions on the GPU. That way, there can be functions that only deal with regions on the GPU. This will eliminate unnecessary maps/unmaps, and allow the caller to control when data is mapped back and forth.

The idea here is to do some kind of flow-sensitive typestate analysis.

This is useful for separate compilation (probably a long ways off) or for cross-language linking, which will probably happen pretty soon with Accelerate.

Joseph pointed out that without separate compilation or cross-language boundaries, this feature probably doesn’t give us much that we can’t do without whole-program analysis.

Deal with zero-length vectors in kernels.

You can put a zero length vector as a kernel argument, which causes errors in OpenCL. We should either make this an error, or make our kernels evaluate to zero-length vectors when they have zero-length inputs.

The second one seems like the better solution because it is more general, and gives fewer opportunities for run-time errors.

Uneven workload

Running a kernel over a vector that looks like this:

(vector (vector 1) (iota 1000000) (vector 1 2 3))

basically runs sequentially, since each dimension of the kernel has one thread on the GPU. What can we do about this?

"I think it depends on what the body of the kernel is. If it never inspects the inner vectors, then there’s not much we can do, but not much we probably need to do either. If it’s doing another kernel inside, we might do okay by just sequentializing the outer kernel and doing the inner kernel in parallel.

A lot of data parallel languages do flattening, which I think means you start out by appending all the vectors (maybe not literally…), and then making a single kernel that does something like “if my id is between 0 and 1, use the first vector, between 1 and 1000001, use the second vector, between 1000001 and 1000004, use the third vector.” There’s some question of whether this approach wins in practice on modern hardware."
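
The flattening idea in the quoted comment can be sketched as follows. This Python version is a simplified illustration and not how any particular data-parallel language implements it: it concatenates the vectors, records the start offset of each, and uses a binary search to map a flat work-item id back to its vector. The function names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def flatten(vectors):
    """Concatenate the vectors and record where each one starts."""
    flat = [x for v in vectors for x in v]
    starts = [0] + list(accumulate(len(v) for v in vectors))
    return flat, starts


def locate(starts, gid):
    """Map a flat work-item id to (vector number, index within it).

    Assumes 0 <= gid < starts[-1].
    """
    seg = bisect_right(starts, gid) - 1
    return seg, gid - starts[seg]
```

With this mapping, every element gets its own work item regardless of which inner vector it came from, at the cost of a search per work item.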

Optimize-level compiler option

We need an option to turn on/off compiler optimizations, so we can easily compare performance with and without them.

Harlan doesn't build well on Linux / with AMD OpenCL 1.2

I experienced this problem trying to get Harlan installed on my Arch desktop, and it looks like Will is experiencing the same problems with an Ubuntu install. Our compilation of the runtime and of a Harlan program does not use proper flags, meaning we need to update our makefile and harlan/driver.scm. Even after properly locating the OpenCL headers, it looks like all tests still fail with a CL_INVALID_VALUE (-30) error.

A side note -- clEnqueueBarrier is deprecated as of 1.2, and although it looks like replacing it does not fix the problem, we really should update it to use clEnqueueBarrierWithWaitList.

Remove Chez Scheme dependencies

If possible, all our code should run in any R6RS-compliant implementation.

Automatically install an OpenCL if not available

This is related to Harlan being too hard to build. Not all systems will have OpenCL installed to start with. In those cases, we should fetch one that will work for the current system and install it in a local path so we can keep a super simple install system.

Harlan doesn't work on Hivequeen

Harlan is broken on Hivequeen with both the CPU and GPU version. With the CPU, we get 62 failures, while they all fail on the GPU.

This is very likely the result of a misconfiguration on Hivequeen, but it might also be a bug in Harlan.

Update Harlan Manual

The Harlan language has changed a lot over the last semester, and the manual is now very out of date. We need to make sure things are up to date.

Use precise read/write memory flags

We're planning on enforcing that kernels may not write to free variables, only locals. This means we can get away with using read-only memory for arguments and free variables, and write-only memory for the output buffer. This could potentially let OpenCL generate smarter code, so we should do it. It mostly just means we need some more analysis and we need to be careful which allocation paths we use.

Harlan JIT Interface

Ryan is interested in using Harlan as a backend for GPU computing in MonadPar (I think). To make this easier and faster, it'd be nice if Harlan had a JIT interface. We're part of the way there with the .so output support, but we should make it more streamlined.

The approach Ryan and I talked about today was using full Chez to compile a libharlan.so, which exports a jit-harlan function that takes a string representing a Harlan S-expression. JIT Harlan would compile the string, call g++, use Chez's FFI to dlopen the resulting library, and then return some kind of handle representing a set of entry points. Finally, JIT Harlan would provide some mechanism for invoking the generated entry points and providing input and output buffers.

Later we might modify Harlan to generate Scheme code that we can call eval on to avoid the overhead of running g++.

Recursive types

So we can do trees and linked lists.

Make optimizations more robust

It's very easy to defeat kernel fusion and other optimizations now. We're basically doing very local pattern matching. To do this right, we'll probably need some more global control or data flow analysis to decide when we can do the optimizations.

Test runner should print a list of failures at the end

This will make it easier for us to see what went wrong, instead of having to scroll back through all the test output.

Region variable redefinition

Compiling https://github.com/eholk/harlan/blob/master/test/transpose.kfc fails with these C++ compilation errors:

g++   -x c++ - -x none ./rt/libharlanrt.a -I./rt -o transpose -framework OpenCL
Exception in g++-compile-stdin: <stdin>: In function ‘int main()’:
<stdin>:36: error: redeclaration of ‘region_ptr vec_63’
<stdin>:30: error: ‘region_ptr vec_63’ previously declared here
<stdin>:37: error: redeclaration of ‘int refindex_64’
<stdin>:17: error: ‘int refindex_64’ previously declared here
<stdin>:47: error: redeclaration of ‘region_ptr vec_63’
<stdin>:30: error: ‘region_ptr vec_63’ previously declared here
<stdin>:48: error: redeclaration of ‘int refindex_64’
<stdin>:17: error: ‘int refindex_64’ previously declared here

The generated C++ is at https://gist.github.com/3744663.

It looks like we end up declaring region variables multiple times.

Harlan libraries

Harlan should have some sort of external-module-import-feature!

Ryan's asking for zipwith, and we already have the primitive write-pgm that shouldn't actually be a primitive. We should have a module/file named cool-function-things with the definitions of zipwith and write-pgm, so a user could (import cool-function-things) and everything would work.

This could be done simply by collecting all the import statements in the user's file, reading in the imported files, grabbing all of their definitions, and compiling with everything thrown together.

This could be done more intelligently by implementing separate compilation units and real imports/exports, but we're not to that point yet.
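
The simpler approach (collect the imports, then compile everything together) might look like the following Python sketch. Modules are modelled as nested lists, and a dictionary stands in for reading the imported files from disk; both are assumptions of this example, as is the function name.

```python
def expand_imports(module, libraries):
    """Splice definitions from imported libraries into a module.

    module:    ['module', form, ...] where a form may be ['import', name]
    libraries: dict mapping a library name to its list of definitions
               (a stand-in for reading the imported files)
    """
    imported, body, seen = [], [], set()
    for form in module[1:]:
        if isinstance(form, list) and form and form[0] == "import":
            name = form[1]
            if name not in seen:  # import each library at most once
                seen.add(name)
                imported.extend(libraries[name])
        else:
            body.append(form)
    return ["module"] + imported + body
```

The result is a single module with no import forms left, which the rest of the compiler can process unchanged.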

Support asynchronous kernels

In particular, this means adding an async-kernel form. This may not be necessary. However, the language has enough information that we can probably generate asynchronous code for kernels by default, which means we can automatically overlap CPU and GPU computation.

Multidimensional Array Type

We can't do a 2D kernel in the most obvious case, for the following test case, where x is a 2D vector.

(kernel ((row x))
  (kernel ((i row))
     ...))

We have no guarantee that x is a rectangle. If we had another data type, for 2D rectangular arrays, we could optimize this case into a 2D kernel as well.

A few of our tests fail without optimizations turned on

Support run-fail testspec

We want to mark tests that are expected to fail at runtime, by design.

While we're at it, we should have a compile-fail testspec to make sure we reject all the programs we should.

Harlan doesn't work with Intel OpenCL

Here's an example error:

eric@hivequeen ~/class/osl/harlan $ HARLAN_DEVICE=cpu ./test.bin/simple-kernel.kfc.bin
found 1 devices
Creating queue for        Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
clEnqueueNDRangeKernel(queue, k.k, dimensions, NULL, global_size, local_size, 0, 0, &e) failed with error CL_INVALID_KERNEL_ARGS (-52)
Aborted

We're probably doing something illegal somewhere, and we should fix that. Intel is stricter than the other OpenCLs, but following the strictest implementation will probably save us heartache in the future.

Make Harlan work with Vicare

Vicare is a fork of Ikarus that seems to be actively maintained. If we used Vicare, it could be a submodule of Harlan, and we could remove the "Install Petite Chez Scheme" step from the install instructions.

There is currently a WIP branch for this.

Use 2D OpenCL kernels for adjacent kernels

OpenCL has 2D kernel support. We should use this when we see (kernel (kernel ...)).