hikettei / cl-waffe2
[Experimental] Graph and Tensor Abstraction for Deep Learning all in Common Lisp
Home Page: https://hikettei.github.io/cl-waffe2/
License: MIT License
A TODO list of what I'm currently working on and remaining issues
- build receiving multiple inputs (LazyCons)
- Optimize in two separate phases: 1. constructing the AbstractNode network, 2. compiling the network.
- define-impl with :cache-when-compiled=nil causes (compile nil body) to be executed at build time, so emit a warning for it.
- Shape checking of forward methods, in the style of apply-ranked-op.
- Memory Pool
- Add a retain-grad option by setting tensor-grad-n=1.
- During backpropagation, don't do the "inline toplevel Moves + in-place mutation" optimization, because the side effects on the optimizer would become untraceable.
- call-with-view: evaluate the performance of lparallel parallelization on large matrices, and enable it.
Data type (Dtype) related
- !sum is based on Broadcasting and AddNode, but writing a dedicated summation kernel would give better speed and precision.
- A FusionOp for the derivative of (log (1+ x)) is a MUST for numerical stability; implement it as a FusionOp (see the sketch after this list).
- (EXP X) -> A, B: there are many places where detecting patterns like this (one result consumed by several nodes) enables optimization. Sorting by AbstractNode ID instead of by Tensor should make this feasible.
- RAdam
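On the (log (1+ x)) point, here is a minimal sketch in plain scalar Common Lisp (independent of cl-waffe2's kernel API; the function names are mine) of why fusing the derivative matters: the backward pass is just 1/(1+x), so one fused kernel avoids materializing the intermediate (1+ x) tensor and re-running the logarithm.
;; Illustration only; scalar code, not the cl-waffe2 kernel API.
;; forward:  y = log(1 + x)
;; backward: dy/dx = 1/(1 + x), so dL/dx = dL/dy * 1/(1 + x)
(defun log1p-forward (x)
  (log (+ 1d0 x)))

(defun log1p-backward (x dy)
  ;; One fused expression: no intermediate (1+ x) tensor, no second log.
  (/ dy (+ 1d0 x)))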
https://arxiv.org/pdf/1810.07951.pdf
https://arxiv.org/abs/1809.00738
https://github.com/JuliaDiff/Diffractor.jl
https://arxiv.org/pdf/2002.03794.pdf
https://tvm.apache.org/docs/arch/index.html
https://web.ist.utl.pt/nuno.lopes/pubs/torchy-cc23-extended.pdf
Also: (!matmul (!t a) b) isn't working.
(This article is WIP)
https://github.com/marcoheisig/Petalisp/tree/master
As far as I know, Petalisp is a DSL implemented in Common Lisp for generating parallelized array-processing code.
Deep Learning models are everywhere, but what about the technology behind them? Many deep learning frameworks are in development today, and there are DL compilers focused on efficient inference (or training). TVM could be one of the good options, but when you want to make a model specific to an arbitrary environment, there are always compatibility issues (e.g.: pytorch/pytorch#49890, though this is a PyTorch case).
Concretely speaking, it is possible to implement gemm for many devices (e.g.: CPU, GPU, NEON, AVX, Metal, etc.) and many data types (e.g.: uint8, int8, int16, ..., float16, bfloat16, float32, ...). But can it be made easier?
With Petalisp, code written once at a higher layer can be run on various backends, instead of implementing gemm per backend (like a template).
;; Petalisp
(defun matrix-multiplication (A B)
  (lazy-reduce #'+
    (lazy #'*
      (lazy-reshape A (transform m n to n m 1))
      (lazy-reshape B (transform n k to n 1 k)))))
as well as tinygrad:
# Tinygrad
c = (a.reshape(N, 1, N) * b.permute(1, 0).reshape(1, N, N)).sum(axis=2)
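For reference, here is the eager computation both snippets express, written as a naive plain Common Lisp sketch (the function name and loop order are mine, purely for illustration):
;; Naive eager equivalent (illustration only): C[i j] = sum_l A[i l] * B[l j]
(defun naive-matmul (a b)
  (let* ((m (array-dimension a 0))
         (n (array-dimension a 1))
         (k (array-dimension b 1))
         (c (make-array (list m k) :initial-element 0)))
    (dotimes (i m c)
      (dotimes (j k)
        (dotimes (l n)
          (incf (aref c i j)
                (* (aref a i l) (aref b l j))))))))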
Users no longer need to worry about parallelization; they can just rely on the compiler.
If TVM were CISC, tinygrad would be a RISC.
In terms of training time and memory usage, cl-waffe2 still has a lot of challenges. In fact, even when training a simple MLP, cl-waffe2 is about 1.5 times slower than the same operations in PyTorch. However, this is because cl-waffe2 is a JIT-compilation-based framework and I only started this project a few months ago; it still has a large number of potential optimizations. The next goal is to optimize training time, so here's a list of things to be optimized:
- Graph-level optimization is still not enough. In particular, the number of MoveTensorNodes should be reduced.
- Support for FuseOps is still poor. In the future, I want to create search-based instruction fusion: for example, users define the sequence of IR to be replaced with a (defpath ...) macro, and the compiler reads it.
- Use SLEEF
- lparallel: maximum speed-up can be achieved by putting all data on SIMD registers and then parallelizing with lparallel (see the sketch below).
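As a minimal sketch of the lparallel side only (vanilla lparallel usage, not cl-waffe2's scheduler; the kernel size of 8 threads is an arbitrary example):
;; Vanilla lparallel illustration, independent of cl-waffe2's internals.
(ql:quickload :lparallel)
(setf lparallel:*kernel* (lparallel:make-kernel 8)) ; 8 worker threads

(let ((x (make-array 1000000 :element-type 'single-float
                             :initial-element 1.0)))
  ;; Elementwise sin, split across the worker threads by lparallel:
  (lparallel:pmap '(vector single-float) #'sin x))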
(TODO)
[AbstractNode] The fundamental unit that binds forward/backward propagations.
defnode - Declares a general definition of an AbstractNode.
L define-impl - Implements an AbstractNode. Its forward definition is given as a macro (so that e.g. call-with-view can be inlined); later, (compile nil body) is called on it and the result is cached.
L define-impl-op - Implements an AbstractNode as a lambda function.
define-op = defnode + define-impl-op
[Composite] Bundles several AbstractNodes; defined by the defmodel macro.
defmodel - Defines a new Composite.
L defmodel-as - Redefines an existing Composite as a function or an AbstractNode, to reduce compiling time and to use cl-waffe2 as a define-by-run library.
Accordingly, these macros will be deleted in a future release: define-static-node, define-composite-function.
Since cl-waffe2 is all about graph processing, I feel that building the entire backend from scratch is reinventing the wheel. I don't know which is the best, but the following are the choices:
When loading the cl-waffe2 tests at the top level, a ton of undefined-type warnings occur. The classes are located in the cl-waffe2/vm.facets-tmp package, which means nodes defined by define-node cause the problem. More precisely, the nodes producing warnings are those defined with :device t: they refer to NODENAME-CPUTENSOR, but there is no such implementation because all backends use the NODENAME-T implementations. I haven't yet reached the place where NODENAME-CPUTENSOR is used.
Translate tutorial_jp.lisp into English.
deftrainer usage should be much better documented, because it is unique and complicated.
For four reasons (the code is a mess / too many bugs / compiling is too slow / performance is garbage), I'm considering rewriting the code directly under source/vm: reimplement the VM without making breaking changes to ./source/vm/generic-tensor and ./source/vm/nodes.
Known issues so far:
- Computation nodes like (!sin x (!copy x)) are not topologically sorted (dumb).
- For instructions like If and Map, it would be better to flatten the graph into a one-dimensional data structure once, at (build out).
I'll work on the vm-refactoring branch while testing compatibility with the VM under ./source/vm/generic-tensor.
When running the examples, there are always errors like the following:
Couldn't find any implementation of MATMULNODE for (LISPTENSOR).
[Condition of type CL-WAFFE2/VM.NODES::NODE-NOT-FOUND]
How can I address this?
btw, your cl-waffe2 system looks great, please keep it up!
We aim to generalize the APIs and optimization techniques across different computer architectures; that is, we also have to add GPU support by removing CPU dependencies, since cl-waffe2 was originally designed for this (easy to extend, easy to fuse multiple kernels; cl-waffe2 is nothing but tensor abstraction APIs and more, including the fastest autodiff in Common Lisp).
As of now, I'm working on implementing a deep learning compiler for multiple targets including AVX, NEON, NVIDIA, AMD, and more! (It also extends the easy-to-extend concept.)
https://github.com/hikettei/AbstractTensor.lisp
The approach is similar to tinygrad; even a beautiful tinygrad port to Common Lisp may be good.
This might involve some breaking changes and belongs to my future work (that's why I created a new issue), but I believe this modification will make it possible to get an Int8-quantized Llama3 model running on Common Lisp with the smallest dependencies. This could be one of the reasons to use Common Lisp, because it is impossible to reproduce this in the Python or other language communities.
Workload to implement LLAMA3
(proceed (proceed tensor))
↑ add detach!
Since there are so many packages, cl-waffe2's exports aren't friendly to users. I'm considering putting them together into a single cl-waffe2 package...
deftrainer is a macro to describe:
A template macro for users to implement a new backend without being familiar with complicated APIs (e.g.: call-with-view).
;; Code
(call (Conv2D 3 6 `(5 5)) (make-input `(N C H W) :X))
won't work, since the implementation includes:
...
(asnode #'!reshape (* N C-out H-out) t)
Generally speaking, when a mathematical (arithmetic) function needs to be called in the shape calculation, it also needs to be evaluated lazily, because the value of N isn't determined at that time.
(TODO: More details)
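To make the lazy-shape requirement concrete, here is a minimal sketch in plain Common Lisp (lazy-dim/force-dim are hypothetical helpers of mine, not cl-waffe2's actual representation): an undetermined dimension is stored as a thunk and forced once the input shape is known.
;; Illustration only; lazy-dim/force-dim are hypothetical helpers.
(defun lazy-dim (thunk) thunk)

(defun force-dim (dim)
  (if (functionp dim) (funcall dim) dim))

;; H_out depends on H_in, which is only known when the InputTensor is bound:
(let* ((h-in (list nil)) ; a box, filled in later with the actual size
       (h-out (lazy-dim
               (lambda ()
                 ;; e.g. a 5x5 kernel, stride 1, no padding/dilation
                 (floor (+ 1 (/ (- (first h-in) 4) 1)))))))
  (setf (first h-in) 28)
  (force-dim h-out)) ; => 25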
Some of the displayed logs/errors are ugly, I guess (e.g.: one is displayed as 1).
Here's a list of ugly points:
(TODO)
cl-waffe2 instantly generates/compiles forward kernels depending on the given tensors' dimensions and views. This approach allows me to reduce the computing time of multidimensional offsets and to schedule multithreading in advance. However, this compiling is never done at the top level, but via the (compile nil ...) function. 80% of the compiling time consists of this kernel compiling time (e.g.: expansions of SinNode).
For example, (!sin (!sin (!sin x))) uses completely identical code each time, yet it currently needs to be compiled three times. Therefore, one primary strategy to reduce compiling time is to reuse compiled kernels.
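One way to picture that reuse is a cache keyed on whatever determines the generated code. A minimal sketch in plain Common Lisp follows; the key layout (node-name dtype shape) is an assumption for illustration, not cl-waffe2's actual cache key.
;; Illustration only; not cl-waffe2's actual cache.
(defvar *kernel-cache* (make-hash-table :test #'equal))

(defun compiled-kernel (key body-thunk)
  "Return a compiled kernel for KEY, calling COMPILE only on a cache miss."
  (or (gethash key *kernel-cache*)
      (setf (gethash key *kernel-cache*)
            (compile nil (funcall body-thunk)))))

;; Three SinNodes with the same dtype/shape now share a single COMPILE:
;; (compiled-kernel '(SinNode :float32 (100 100))
;;                  (lambda () '(lambda (x out) #|kernel body|# out)))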
Using the (current version) of mlp.lisp, for the first call of, e.g.,
(train-and-valid-mlp :epoch-num 11 :benchmark-p nil)
the training loss in the first epoch is around 0.26 usually.
For further runs (when evaluating (train-and-valid-mlp :epoch-num 11 :benchmark-p nil)), the loss is larger (around 0.76 in the first epoch). I suspect that this is caused by some caching in the compiler and different initializations of the compiled structures, since if I evaluate
(cl-waffe2/vm.generic-tensor::reset-compiled-function-cache!)
before evaluating
(train-and-valid-mlp :epoch-num 11 :benchmark-p nil),
then the loss is in the same range as for the very first run.
Is this the intended behavior, or should the reset be applied somewhere when the model is built/compiled?
The current problems (as far as I know) in the Subscript DSL are the following:
:where (Input[N C_in H_in W_in] -> Output[N C_out H_out W_out]
         where
           C_in  = in-channels
           C_out = out-channels
           ;; H_out = floor(((H_in + 2 * padding[0] - dilation[0] * (kernel_size[0] - 1) - 1) / stride[0]) + 1)
           H_out = (if (numberp H_in) ;; If H_in is a symbol, return -1 (= undetermined; determined later).
                       (floor (+ 1 (/ (+ H_in (* 2 (car padding)) (* (- (car dilation)) (- (car kernel-size) 1)) -1)
                                      (car stride))))
                       -1)
           ;; W_out = floor(((W_in + 2 * padding[1] - dilation[1] * (kernel_size[1] - 1) - 1) / stride[1]) + 1)
           W_out = (if (numberp W_in)
                       (floor (+ 1 (/ (+ W_in (* 2 (second padding)) (* (- (second dilation)) (- (second kernel-size) 1)) -1)
                                      (second stride))))
                       -1))
(at https://github.com/hikettei/cl-waffe2/blob/master/source/nn/conv.lisp#L76)
Since both Convolution and Pooling have transmission states too complicated to express lazily, the :where form returns -1 (can't predict) when the result won't become an integer. This ugly behaviour should be fixed in a future release, but I don't have any ideas yet.
Should be refactored: https://github.com/hikettei/cl-waffe2/blob/master/source/vm/nodes/shape.lisp
(At https://github.com/hikettei/cl-waffe2/blob/master/source/vm/nodes/shape-error.lisp)
Shaping errors should be more pinpointed. In fact, it is possible to make it easy to know what should be fixed and at which node the error occurred.