My TODO List

今取り組んでることとか課題とかのTODO List

Docgen周りのリファクタリング (気が向いたら)
metal backend
cl-waffe2/nn -> generic
Stable Diffusion Inference

Environments / Backends

Testing on CCL/SBCL/LispWorks/Allegro (Modern Mode Support?)
SIMD Extension [Add] SSE/AVX512/Neon Support.
[Add] CUDATensor Backend
[Add] MetalBackend
最適化レベルの設定とVMでのRuntime Error Handling
build receiving multiple inputs (LazyCons)
defmodelの多重ディスパッチ (バックエンドの名前で)

cl-waffe2/base-impl

UnfoldをC++で自作してvectorize or oneDNN interop
Row-majorのGemm
Sparse Gemm, Sparse Matrix Support
LazyCons/Permute/Reshape/View/Setqとかのbackend=tのノード: Loadで置き換える (A* = B*;)

cl-waffe2/vm.nodes

ネットワーク構築のAPI(defnode/define-impl/define-impl-op/defmodel/defmodel-as)の構築/テスト (Implemented Fairly well)
ただのNumpy-likeな行列演算ライブラリとして、cl-waffe2から分離したプロジェクトとして、コンパイルされたcl-waffe2のプログラムの集合をライブラリとして提供することができる (e.g.: topi in TVM)
RNNの実装に関して
- Control FlowをVMに実装するか defmodel-asで部分的にコンパイルしたネットワークをdefine-by-runっぽく繋げて動作する二つの方法のどっちかがある多分後者でRNN実装
defmodel-asの最適化:
-AOT Compiler, AbstractNodeにコンパイルするときはメモリの割り当てだけ後から変更して再利用できるように！
define-by-runモードの実装で一番の課題はコンパイル時間である。
- (メソッド割り当てが重い) 1. AbstractNodeネットワークの構築 2. ネットワークのコンパイル この二つに分けて最適化。
Conv2DのSubscript DSLのバグ
forward -> compiler-macro使ってインライン化できない？

cl-waffe2/vm.generic-tensor

cl-waffe2/vm

defpathマクロによる検索ベースのFusionOpでデバイス特化の最適化を追加する
- 例えば!sumはBroadcastingとAddNodeベースだが、総和専用のKernelを書いた方が速度精度高い。
(log (1+ x))の微分のFusionOPは数値的安定性からMUSTである。これからFusionOPで実装する
最適化 -> (EXP X) -> A, B これを検出して最適化できる箇所がたくさんある。ソートをTensorじゃなくてAbstractNodeのIDベースでやればできそう
最適化: sin(x, out) <- outでcopy(x)するの無駄 allocする計算ノードにする

cl-waffe2

deftrainerを廃止する。defoptimizerの使い方と合わせてもっとelegantなAPIにする。
ちゃんと頭が回っているときに書きたい

その他

Reading List📕

Lens

https://arxiv.org/pdf/1810.07951.pdf

https://arxiv.org/abs/1809.00738

https://github.com/JuliaDiff/Diffractor.jl

Existing approaches of Deep Learning Compilers

https://arxiv.org/pdf/2002.03794.pdf

https://tvm.apache.org/docs/arch/index.html

https://web.ist.utl.pt/nuno.lopes/pubs/torchy-cc23-extended.pdf

https://towardsdatascience.com/how-pytorch-2-0-accelerates-deep-learning-with-operator-fusion-and-cpu-gpu-code-generation-35132a85bd26

Etc.

https://arxiv.org/pdf/2307.12187.pdf

https://arxiv.org/pdf/1611.06945.pdf

[BugFix] Batched Matmul won't work

Also: !matmul (!t a) b isn't working

[WIP] Petalisp as a high-level IR?

(This article is WIP)

An overview of Petalisp

https://github.com/marcoheisig/Petalisp/tree/master

(As far as I know), Petalisp is a DSL implemented in Common Lisp for generating parallelized array processing codes, providing:

Petalisp works on a fundamental level:
- Abstract Array Processings, Polyhedral Compiler et al.
- The IRs are sophisticated; lazy-reshape, transform, ranges and many optimization techniques specialized on them can provide by far the fastest and most systematic JIT Compiler.
(Could be, or with more effort) applied to multiple backends likewise tinygrad. (i.e.: relatively easy to implement new backends like Metal, x86, gcc, neon et al)

RISC or CISC?

Deep Learning models are everywhere, but what about the technology behind them? Many deep learning frameworks are in development today, and there are DL compilers with a focus on efficient inference (or training). TVM could be one of the good options, but when you want to make a model specific to an arbitrary environment, there are always compatibility issues (e.g.: pytorch/pytorch#49890, this is a case of PyTorch though).

Concretely speaking, It is possible to implement gemm for many devices (e.g.: CPU, GPU, NEON, AVX, Metal etc...) and many data types (e.g.: uint8, int8, int16, ..., float16, FBloat16, float32, ...). But can it be easier?

With Petalisp, once written at a higher layer, it can be run on various backends instead of implementing gemm (like a template).

;; Petalisp
(defun matrix-multiplication (A B)
  (lazy-reduce #'+
   (lazy #'*
    (lazy-reshape A (transform m n to n m 1))
    (lazy-reshape B (transform n k to n 1 k)))))

as well as tinygrad:

# Tinygrad
c = (a.reshape(N, 1, N) * b.permute(1,0).reshape(1, N, N)).sum(axis=2);

Users don't need anymore to worry about parallelization; just rely on the compiler.

If TVM were CISC, tinygrad would be a RISC.

Why Petalisp is a good choice for replacing the cl-waffe2 compiler

A survey of improving performance

In terms of training time and memory usage, cl-waffe2 has a lot of challenges. In fact, even in the case of training simple MLP, cl-waffe2 is even 1.5 times slower than the same operations in PyTorch. However, this is because cl-waffe2 is a JIT compilation-based framework and I've only started this project a few months ago. It still has a large number of potential optimization. The next term goals are to optimize training time, So here's a list of things to be optimized:

cl-waffe2 IR

Graph-level optimization is still not enough. Especially, the number of MoveTensorNode should be reduced.

FuseOps

FuseOps Supporting is still poor. In the future, I want to create search-based instruction fusion. For example, users define the sequence of IR to be replaced with a (defpath ...) macro, and the compiler reads it.

The full use of SIMD Ops

・Use SLEEF

The full use of `lparallel`

Maximum speed-up can be achieved by putting all data on SIMD registers and then parallelising by lparallel.

[BugFix] In-place backward with required-grad=t

[BugFix] Lambda wrapped backward definition can't return more than one gradient. (i.e.: define-static-node)

(TODO)

[Refactor] APIs for Network Construction

Overview of Computation Node in cl-waffe2

[AbstractNode] The fundamental unit which binds forward/backward propagations.
 defnode - Declares a general definition of AbstractNode
      L define-impl        implements AbstractNode. Its forward definition is given as a macro (to inline/call-with-view), later (compile nil body) is called, and cached.
      L define-impl-op  Implements as a lambda function.

define-op = defnode + define-impl-op

[Composite] Bundles several AbstractNodes, defined by defmodel macro.
  defmodel - Defines a new Composite
     L defmodel-as Redefining the existing Composite as a function or AbstractNode to reduce compiling time, to use cl-waffe2 as a define-by-run library.

Accordingly, these macros will be deleted in the future release: define-static-node, define-composite-function.

Wrapping/Adapting another frameworks?

Since the graph processing library is all about cl-waffe2, I feel that building the entire backend from scratch is reinventing the wheel. I don't know which is the best, but the following list is the choice;

GGML Wrapper and Interop
oneDNN (Most Promising for CPU)
- with the power of oneDNN, we can provide bfloat16 training and uint8 inference

[Fix] a ton of undefined-type

Loading cl-waffe2 test top-level, a ton of undefined-type occurs.

The class is located in cl-waffe2/vm.facets-tmp which means nodes defined by define-node cause the problem. More precisely, nodes put warning is defined as :device t. Referring to NODENAME-CPUTENSOR, but there's no implementation because all backends use NODENAME-T impls.

I've not reached the place of using NODENAME-CPUTENSOR.

[API Enhancement] Project Simplification for those who is new to Common Lisp

TODO
- Step-by-step tutorial for those who aren't familiar with Deep Learning.
- Tutorial written in Jupyter Notebook
- Translate tutorial_jp.lisp into English
- Visualized Explanation of cl-waffe2 (should i make a slide?)
- I think deftrainer usage should be much more documented, because it is unique and complicated.

[FixME] features on cl-waffe2 is perfectly working on REPL, but the same operation in the test isn't working...

[Refactoring] I'm considering rewriting the entire code of VM

コードがぐちゃぐちゃ/バグ多すぎ/compile遅すぎ/パフォーマンスがカス/の四つの理由からsource/vm直下のコードを書き直そうか検討している

追加: cl-waffe2/vmパッケージ

./source/vm/generic-tensor ./source/vm/nodesに破壊的変更を加えないままでVMを再実装

判明している現在の課題とか：

cl-waffe2/vm直下に新たに仕様を真面目に策定したVMを構築する
- Lispのコード全体を再度compileするのは完全に時間の無駄だと判明
- インライン化したコードを読み込んだり合成するのはVMの仕事にする。（ここの部分はコンパイル走らせない）
- (!sin x (!copy x))みたいな計算ノードトポロジカルソートをしてない（あほ）
- RNN実装するときにIRの仕様がもうちょっとちゃんとしてると良い If Mapなどの命令
- call-with-viewがlparallelで並列化できるように書き直す
- reverse mode In-placeになってるか？？
- (build out)で一回一次元のデータ構造に直したほうがいい

./source/vm/generic-tensorのVMと互換性があるかをテストしながらvm-refactoringブランチで作業します。

defnode関連の機能
- 不安定すぎる
- shape.lispがクソコードなので書き直したい

Various discussions arising from examples and tutorials

when running examples, there are always erros following:

Couldn't find any implementation of MATMULNODE for (LISPTENSOR).
[Condition of type CL-WAFFE2/VM.NODES::NODE-NOT-FOUND]

how to address?

btw, your waffe2 system looks great, pls keep it on!

[WIP] Remove obsolete SIMD and BLAS dependencies (for cpu)

we aim to generalize APIs and optimization techniques around different architecture computers, that is, we also have to make support for GPU removing CPU dependencies because cl-waffe2 was originally designed as so (easy to extend, easy to fuse multiple kernels, cl-waffe2 is nothing but tensor abstraction APIs and more including the fastest auto diff in Common Lisp)

As of now, I'm working on implementing a deep learning compiler for multiple targets including AVX, Neon, NVIDIA, AMD, and more! (it also extends eazy to extend concepts)

https://github.com/hikettei/AbstractTensor.lisp

The approaches are similar to tibygrad, even a beautiful tinygrad port to Common Lisp may be good.

This might be some kind of destructive changes and included in my future works(thats why i have created a new issue); but I believe this modification will enable get Int8 Quantized Llama3 model running on Common Lisp, with the smallest dependencies. This could be one of the reason using Common Lisp because it is impossible to reproduce it for Python, or other languages communities.

Workload to implement LLAMA3

(nearly) complete tinygrad port to Common Lisp
Fast Conv2D kernel implementation (and winograd)
Support more fuse patterns
GPU(all of NVIDIA, METAL, and AMD is not as difficult with our approaches) Supports
Improve data type interface, esp, cast ops, and quantization op support with JIT.

[BugFix] Proceed -> Proceed

(proceed (proceed tensor))
↑add detach!

[BugFix] !matmul with broadcasting, with different number of axes may cause StackOverFlow

[Enhancement] package naming could be confusing

Since there are many packages, cl-waffe2 exports aren't friendly to users. I'm considering putting them into the cl-waffe2 package together...

[Enhancement] macros to add

deftrainer

deftrainer is a macro to describe:

criterion/optimizer
Both of nodes of training phase, predicting phase

define-impl-1d-kernel

A template macro for users to implement a new backend without being familiar with complicated APIs (e.g.: call-with-view)

[FixME] Dynamically Shaped Conv2D isn't working due to !reshape

;; Code
(call (Conv2D 3 6 `(5 5)) (make-input `(N C H W) :X))

won't work since the implementation include:

...
(asnode #'!reshape (* N C-out H-out) t)

Generally speaking, when a mathematical(arithmetic) function needs to be called in the shape calculation, it also needs to be lazily evaluated because the value of N isn't determined at that time.

(TODO: More details)

[Enhancement] Better Printing on terminal

Some of the displayed logs/errors are ugly I guess. (e.g.: one is displayed as 1)

Here's a list of ugly points:
(TODO)

[TODO] AVX512/SSE Extension, CUDA Backend, and more, other backends...

[Enhancement] Compiling time is remained to be optimized.

cl-waffe2 instantly generates/compiles forward kernel depending on given tensors' dimensions, and views. This approach allows me to reduce the computing time of multidimensional offsets, and schedule multithreading in advance. However, this compiling is never done at the top level, but the (compile nil ...) function. 80% of compiling time consists of this kernel compiling time (e.g.: expands of SinNode).

For example, (!sin (!sin (!sin x))) uses the completely same code at each time, albeit we need three times compiling. Therefore, one primary strategy to reduce compiling time is to reuse the compiled kernels.

examples/mnist/mlp.lisp - reset-compiled-function-cache! question

Using the (current version) of mlp.lisp, for the first call of, e.g.,
(train-and-valid-mlp :epoch-num 11 :benchmark-p nil)
the training loss in the first epoch is around 0.26 usually.

For further runs (when evaluating (train-and-valid-mlp :epoch-num 11 :benchmark-p nil)), the loss is larger (around 0.76 in the first epoch). I suspect that this is caused by some caching in the compiler and different initializations of the compiled structures, since if I evaluate
(cl-waffe2/vm.generic-tensor::reset-compiled-function-cache!)
before evaluating
(train-and-valid-mlp :epoch-num 11 :benchmark-p nil),
then the loss is in the same range as for the very first run.

Is this the intended behavior, or should the reset be applied somewhere when the model is built/compiled?

[Fix] Some specifications on Subscript DSL should be changed in the future release

Current problems (as far as I know) in Subscript DSL is the following:

Difficulty in expressing complicated transmission states

	   :where (Input[N C_in H_in W_in] -> Output[N C_out H_out W_out]
			   where
			   C_in  = in-channels
			   C_out = out-channels
			   ;; H_out = floor(((H_in + 2 * padding[0] - dilation[0] * (kernel_size[0] - 1) - 1) / stride[0]) + 1)
			   H_out = (if (numberp H_in) ;; If H_in is a symbol, return -1 (=undetermined, later determined.)
				       (floor (+ 1 (/ (+ H_in (* 2 (car padding)) (* (- (car dilation)) (- (car kernel-size) 1)) -1)
						      (car stride))))
				       -1)
			   ;; W_out = floor(((W_in + 2 * padding[1] - dilation[1] * (kernel_size[1] - 1) - 1) / stride[1]) + 1)
			   W_out = (if (numberp W_in)
				       (floor (+ 1 (/ (+ W_in (* 2 (second padding)) (* (- (second dilation)) (- (second kernel-size) 1)) -1)
						      (second stride))))
				       -1))

(at https://github.com/hikettei/cl-waffe2/blob/master/source/nn/conv.lisp#L76)

Since both Convolution and Pooling have too complicated transmission states to express lazily, the :where form returns -1 (can't predict) when the result won't become an integer. This ugly behaviour should be fixed in the future release, but I don't have any idea.

The implementation is ugly

Should be refactored: https://github.com/hikettei/cl-waffe2/blob/master/source/vm/nodes/shape.lisp

Polish generated Shape-Error

(At https://github.com/hikettei/cl-waffe2/blob/master/source/vm/nodes/shape-error.lisp)

Shaping-Error should be more pinpoint, In fact, it is possible to make it easy to know where should be fixed, and at which node the error occurred.

hikettei / cl-waffe2 Goto Github PK

cl-waffe2's Issues