
ggllm.cpp's People

Contributors

0cc4m, anzz1, blackhole89, cmp-nct, comex, crd716, dannydaemonic, dfyz, ejones, ggerganov, gjmulder, glinscott, green-sky, howard0su, ikawrakow, ivanstepanovftw, j-f1, johannesgaessler, jploski, kerfufflev2, maddes8cht, mgroeber9110, prusnak, slaren, slyecho, sw, tjohnman, unbounded, xaedes, zenixls2

ggllm.cpp's Issues

--help, pipes, and inconsistent help text

I have always been irritated (in llama.cpp as well) by the fact that the help text cannot be piped. Neither falcon-main --help |less (the help is by now 85 lines long) nor falcon-main --help |grep -A3 penal (which should, for example, produce a few-line summary of everything related to the penalty options) works.
That is because the help output is written to stderr. But the text printed for --help is not an error message; it is the desired text output of the help command.

These changes are quick to make, just a lot of lines to touch - but I also noticed a lot of inconsistencies in the help text. I would like to fix them together with the correction of the output code.
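
A minimal sketch of the kind of change I mean, assuming an argument loop roughly like the one in the examples (the helper name is hypothetical): print the usage text to stdout when --help was requested explicitly, and keep stderr only for the argument-error path.

    #include <cstdio>
    #include <cstring>

    // Hypothetical sketch: print_usage takes the target stream as a parameter.
    static void print_usage(FILE * out) {
        fprintf(out, "usage: falcon-main [options]\n");
        fprintf(out, "  -s, --seed SEED      RNG seed (default: -1, use random seed for < 0)\n");
        // ... remaining entries ...
    }

    int main(int argc, char ** argv) {
        for (int i = 1; i < argc; i++) {
            if (strcmp(argv[i], "-h") == 0 || strcmp(argv[i], "--help") == 0) {
                print_usage(stdout); // pipeable: falcon-main --help | less now works
                return 0;
            }
        }
        // ... normal argument handling; on an unknown argument: print_usage(stderr); return 1;
        return 0;
    }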

The help text has a basic structure:
Entries start with the option name preceded by - or --, followed by its argument in capital letters if it takes one, e.g. in the line

-p PROMPT, --prompt PROMPT

The argument has been omitted in many newer entries; I will add it back.
But some lines become very long because of this. Take, for example:

-a,--alias,--finetune Set model name alias and optionally force fine-tune type (or disable it)

should actually be

-a ALIAS, --alias ALIAS, --finetune ALIAS Set model name alias and optionally force fine-tune type (or disable it)

It would be consistent to write it this way (and I am all for keeping the help text rigorously consistent), but it doesn't make it any clearer.
So my suggestion is to change the notation and write the argument only once, after the comma-separated list of option spellings. Then it would be

-s, --seed SEED               RNG seed (default: -1, use random seed for < 0)
-p, --prompt PROMPT           prompt to start generation with (default: empty)
-a,--alias,--finetune ALIAS   Set model name alias and optionally force fine-tune type (or disable it)

As a benefit, a few more descriptions would still fit on the same line as the parameter list.
If I get approval for this, I will send all the mentioned changes together in a single PR.

A strange delay happens after about 200 tokens

I noticed that there is a sudden "stop" in processing after about 200 tokens, after which it continues.
I've seen it with various models.
It happens with pure CPU as well as pure GPU processing, so it is likely not on the GPU side.

falcon_main chokes on larger prompts (Linux)

I tried running the following command with a 5-bit quantised model:

./bin/falcon_main -m ../falcon/ggml-model--f32-q5_1.bin -t 12 -c 2048 --repeat_penalty 1.0 --color -p "The corrected version of the following sentence 'The office is secluded, surrounded by a sea of pastures, and if you were to walk downhill in the opposite direction to the village, you would eventually find yourself in an area with houses but no people,' with grammatical and spelling corrections, if any, applied is:" --top_k 10000

but it gave me the following output:

main: build = 677 (dd3d346)
main: seed  = 1687011598
falcon.cpp: loading model from ../falcon/ggml-model--f32-q5_1.bin
falcon_model_load_internal: format     = ggjt v3 (latest)
falcon_model_load_internal: n_vocab    = 65024
falcon_model_load_internal: n_ctx      = 2048
falcon_model_load_internal: n_embd     = 4544
falcon_model_load_internal: n_head     = 71
falcon_model_load_internal: n_head_kv     = 1
falcon_model_load_internal: n_layer    = 32
falcon_model_load_internal: version      = 7
falcon_model_load_internal: ftype      = 9 (mostly Q5_1)
falcon_model_load_internal: n_ff       = 18176
falcon_model_load_internal: n_parts    = 1
falcon_model_load_internal: model size = 7B
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 5163.00 MB)
falcon_model_load_internal: mem required  = 6955.12 MB (+   32.00 MB per state)
.....................................................................................
falcon_init_from_file: kv self size  =   32.00 MB

system_info: n_threads = 12 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.000000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 10000, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 0


The corrected version of the following sentence 'The office is secluded, surrounded by a sea of pastures, and if you were to walk downhill in the opposite direction to the village, you would eventually find yourself in an area with houses but no people,' with grammatical and spelling corrections, if any, applied is:ggml_new_tensor_impl: not enough space in the context's memory pool (needed 1184709184, available 805306368)
[1]    44634 segmentation fault (core dumped)  ./bin/falcon_main -m ../falcon/ggml-model--f32-q5_1.bin -t 12 -c 2048  1.0  -

Interestingly, the command works with a smaller prompt:

./bin/falcon_main -m ../falcon/ggml-model--f32-q5_1.bin -t 12  --repeat_penalty 1.0 --color -p "Hi," --top_k 10000
main: build = 677 (dd3d346)
main: seed  = 1687012039
falcon.cpp: loading model from ../falcon/ggml-model--f32-q5_1.bin
falcon_model_load_internal: format     = ggjt v3 (latest)
falcon_model_load_internal: n_vocab    = 65024
falcon_model_load_internal: n_ctx      = 512
falcon_model_load_internal: n_embd     = 4544
falcon_model_load_internal: n_head     = 71
falcon_model_load_internal: n_head_kv     = 1
falcon_model_load_internal: n_layer    = 32
falcon_model_load_internal: version      = 7
falcon_model_load_internal: ftype      = 9 (mostly Q5_1)
falcon_model_load_internal: n_ff       = 18176
falcon_model_load_internal: n_parts    = 1
falcon_model_load_internal: model size = 7B
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 5163.00 MB)
falcon_model_load_internal: mem required  = 6955.12 MB (+    8.00 MB per state)
.....................................................................................
falcon_init_from_file: kv self size  =    8.00 MB

system_info: n_threads = 12 / 12 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.000000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 10000, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


Hi, I just purchased 2 tickets to the MAMiT #2 at the Home Depot Center in Carson, CA on October 29th.
Here's my question: I am flying in to LA the day before (October 28th) and would love to meet up with others that will be at the event if they are going.
My e-mail is [email protected] if you'd like to contact me there and any info you can give would be great!
Thank you in advance!
Sincerely,
Dave
South Carolina<|endoftext|> [end of text]

falcon_print_timings:        load time =   299.65 ms
falcon_print_timings:      sample time =   207.28 ms /   117 runs   (    1.77 ms per token)
falcon_print_timings: prompt eval time =   196.46 ms /     2 tokens (   98.23 ms per token)
falcon_print_timings:        eval time = 20454.49 ms /   116 runs   (  176.33 ms per token)
falcon_print_timings:       total time = 20897.24 ms
                                                         

Log the version of cuda that is being used

Prerequisites

  • [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Your exact command line to replicate the issue

./falcon_main ....

Environment and Context

  • Physical (or virtual) hardware you are using, e.g. for Linux: intel cpu
  • Operating System, e.g. for Linux: CentOS

Steps to Reproduce

  1. ./falcon_main ...
  2. see in the log: "falcon_model_load_internal: using CUDA for GPU acceleration"
  3. desired: "falcon_model_load_internal: using CUDA 11.8 for GPU acceleration"
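
For reference, the CUDA runtime reports its version as a single integer (e.g. 11080 for CUDA 11.8), so a minimal sketch of producing the desired log line could look like this (the helper function is hypothetical; only cudaRuntimeGetVersion is the real API):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical sketch: query the CUDA runtime version and log it as major.minor.
    // cudaRuntimeGetVersion() encodes the version as 1000*major + 10*minor.
    static void log_cuda_version(void) {
        int v = 0;
        if (cudaRuntimeGetVersion(&v) == cudaSuccess) {
            fprintf(stderr, "falcon_model_load_internal: using CUDA %d.%d for GPU acceleration\n",
                    v / 1000, (v % 1000) / 10);
        }
    }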

Could we also get the Makefile updated to build a libfalcon.so

It would still be nice, when everything starts getting merged into llama.cpp, to have one place to maintain. I just don't like touching the C files and embarrassing myself, but it should be possible to make this a drop-in replacement for libllama.so, correct? Just thinking in terms of projects like llama_cpp.

Windows Installation Video Tutorial

This isn't an issue or enhancement request.

Just wanted to say thanks for your work on ggllm.cpp.

I also wanted to help Windows users who don't want to go the WSL route get it working with GPU offloading. After many tries and a lot of research, the only solution I found was a bit tricky to figure out, and I show it in this video.

https://www.youtube.com/watch?v=BALw669Qeyw

Also, these are my PC specs:

CPU = AMD Ryzen 7 3700X 8-core Processor
RAM = 32 GB
GPU = RTX 2060 Super 8 GB

Here are some of my results:

CPU Only

C:\falcGGML\ggllm.cpp\build\bin\Release>title falcon_main.cpp

C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 8 -ngl 100 -m wizard-falcon40b.ggmlv3.q4_K_S.bin --color -c 2048 -p "Tell me a story about robot falcons from outer space.\n### Response:" -s 1686779952
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
main: build = 774 (e97d148)
main: seed  = 1686779952
falcon.cpp: loading model from wizard-falcon40b.ggmlv3.q4_K_S.bin
falcon.cpp: file version 4
falcon_model_load_internal: format     = ggjt v3 (latest)
falcon_model_load_internal: n_vocab    = 65025
falcon_model_load_internal: n_ctx      = 2048
falcon_model_load_internal: n_embd     = 8192
falcon_model_load_internal: n_head     = 128
falcon_model_load_internal: n_head_kv     = 8
falcon_model_load_internal: n_layer    = 60
falcon_model_load_internal: n_falcon_type      = 40
falcon_model_load_internal: ftype      = 14 (mostly Q4_K - Small)
falcon_model_load_internal: n_ff       = 32768
falcon_model_load_internal: n_parts    = 1
falcon_model_load_internal: model size = 40B
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 22449.00 MB)
falcon_model_load_internal: mem required  = 26033.24 MB (+  480.00 MB per state)
[==================================================] 100%  Tensors populated
falcon_init_from_file: kv self size  =  480.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 0


Tell me a story about robot falcons from outer space.\n### Response:Once upon a time, in a far-off galaxy, there was a civilization of intelligent robots. They had achieved incredible technological advancements and had colonized many planets in their solar system. One day, they discovered a new planet that seemed to be habitable for their kind. However, the planet was inhabited by a race of sentient beings who were not friendly towards outsiders.
The robot falcons were dispatched from outer space to explore the planet and find out more about its inhabitants. They landed on the planet's surface and immediately began scanning the area for any signs of life. To their surprise, they discovered that the inhabitants of the planet were not humanoids but rather a species of bird-like creatures with incredible intelligence.
The robot falcons approached the birds cautiously and tried to communicate with them, but the birds were afraid and attacked the robots. The falcons quickly realized that they had underestimated the intelligence of the birds and decided to retreat back to their spaceship.
As they were leaving the planet, the falcons noticed a strange object in the sky. It was a giant spaceship, unlike anything they had ever seen before. The falcons tried to communicate with the ship but received no response. They decided to follow the ship back to its home planet and investigate further.
Upon landing on the alien planet, the falcons were greeted by a group of robots who looked identical to them. The leader of the robot colony explained that they had been monitoring the falcons' progress on their journey and had sent the spaceship to intercept them.
The leader revealed that they had been searching for a new home for their civilization, as their own planet was dying. They had found the perfect place in the form of the falcons' planet, which was rich in resources and could support their kind.
The falcons were hesitant at first, but they soon realized that the robots meant no harm and had only come to explore the possibility of a peaceful coexistence. The falcons agreed to let the robots stay on their planet, as long as they promised to respect the planet's natural resources and not harm any of its inhabitants.
And so, the robot falcons and the alien robots joined forces and began to colonize the new planet together. They worked side by side to build a new civilization that would benefit both races and create a harmonious society where all beings could live in peace.<|endoftext|> [end of text]

falcon_print_timings:        load time = 11956.06 ms
falcon_print_timings:      sample time =   183.35 ms /   484 runs   (    0.38 ms per token,  2639.77 tokens per second)
falcon_print_timings: batch eval time =  4267.24 ms /    16 tokens (  266.70 ms per token,     3.75 tokens per second)
falcon_print_timings:        eval time = 524919.20 ms /   483 runs   ( 1086.79 ms per token,     0.92 tokens per second)
falcon_print_timings:       total time = 529578.79 ms

C:\falcGGML\ggllm.cpp\build\bin\Release>pause
Press any key to continue . . .

With GPU Offloading

C:\falcGGML\ggllm.cpp\build\bin\Release>title falcon_main.cpp

C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 8 -ngl 100 -m wizard-falcon40b.ggmlv3.q4_K_S.bin --color -c 2048 -p "Tell me a story about robot falcons from outer space.\n### Response:" -s 1686779952
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 774 (e97d148)
main: seed  = 1686779952

CUDA Device Summary - 1 devices found
+------------------------------------+------------+-----------+-----------+-----------+-----------+
| Device                             | VRAM Total | VRAM Free | VRAM Used | Split at  | Device ID |
+------------------------------------+------------+-----------+-----------+-----------+-----------+
| NVIDIA GeForce RTX 2060 SUPER      |    8191 MB |   7163 MB |   1028 MB |      0.0% |  0 (Main) |
+------------------------------------+------------+-----------+-----------+-----------+-----------+
Total VRAM: 8.00 GB, Total available VRAM: 7.00 GB
--------------------
Preparing CUDA for device(s):
[0]... [done]
falcon.cpp: loading model from wizard-falcon40b.ggmlv3.q4_K_S.bin
falcon.cpp: file version 4
falcon_model_load_internal: format     = ggjt v3 (latest)
falcon_model_load_internal: n_vocab    = 65025
falcon_model_load_internal: n_ctx      = 2048
falcon_model_load_internal: n_embd     = 8192
falcon_model_load_internal: n_head     = 128
falcon_model_load_internal: n_head_kv     = 8
falcon_model_load_internal: n_layer    = 60
falcon_model_load_internal: n_falcon_type      = 40
falcon_model_load_internal: ftype      = 14 (mostly Q4_K - Small)
falcon_model_load_internal: n_ff       = 32768
falcon_model_load_internal: n_parts    = 1
falcon_model_load_internal: model size = 40B
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 22449.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: INFO: using n_batch > 1 will require additional VRAM per device: 2818.00 MB
falcon_model_load_internal: VRAM free: 6961.00 MB  of 8191.00 MB (in use: 1230.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: Offloading Output head tensor (285 MB)
INFO: Not enough VRAM to load all requested layers - at layer 8 of 60: skipping
INFO: 8 layers will be offloaded to GPU (layers 1 to 9)
falcon_model_load_internal: mem required  = 22466.99 MB (+  480.00 MB per state)
falcon_model_load_internal: offloading 8 of 60 layers to GPU, weights offloaded 3566.25 MB
falcon_model_load_internal: estimated VRAM usage: 6385 MB
[==================================================] 100%  Tensors populated
falcon_model_load_internal: VRAM free: 3381.00 MB  of 8191.00 MB (used: 4810.00 MB)
falcon_init_from_file: kv self size  =  480.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 0


Tell me a story about robot falcons from outer space.\n### Response:Once upon a time, in a far-off galaxy, there was a civilization of robots who had evolved to resemble birds of prey. They were called the Falconoids, and they lived on a planet that orbited a binary star system. The Falconoids had developed advanced technology that allowed them to travel through space, and they had used it to explore neighboring galaxies.
One day, the Falconoids detected a strange signal coming from a distant planet in a solar system near their own. They sent a small fleet of robot falcons to investigate, but when they arrived, they found that the planet was already inhabited by intelligent life forms that resembled humans. The Falconoids had never encountered such creatures before, and they were fascinated by them.
The Falconoids decided to observe the humans from afar, without revealing themselves. They sent their falcon robots to fly over the planet's cities and countryside, gathering information about the inhabitants' behavior and technology. Over time, the Falconoids learned much about human society, including its weaknesses and strengths.
One day, a group of humans stumbled upon one of the falcon robots while hiking in the mountains. The robot had landed on a rocky outcropping, and it was unable to take off again. The humans approached the robot cautiously, not knowing what to expect. To their surprise, the robot spoke to them in perfect English, explaining that it was a visitor from another world.
The humans were stunned by this revelation, but they eventually came to accept the falcon robot as one of their own. They named it "Falco," and they took care of it like a beloved pet. Falco continued to gather information about human society, but now it was also transmitting that information back to its home planet.
As time passed, more and more Falconoid robots arrived on Earth, disguised as birds of prey. They integrated themselves into human society, learning everything they could about the humans' culture and technology. Some even took on human identities, posing as scientists or engineers.
Eventually, the Falconoids decided that it was time to reveal themselves to humanity. They descended from the skies in their spaceships, announcing their presence and offering their advanced technology to the humans. The humans were amazed by the Falconoids' generosity, and they gratefully accepted their offer of friendship and cooperation.
From that day forward, the Falconoids and humans worked together to build a better future for both species. They shared knowledge and resources, and they built a network of interstellar trade and communication that spanned the galaxy. The Falconoids even helped the humans develop their own space program, so that they could explore the stars alongside their robot friends.
And so, the Falconoids and humans lived together in peace and harmony, each species learning from the other and growing stronger as a result. They looked to the stars with wonder and excitement, knowing that there were still many mysteries to uncover and new worlds to explore.<|endoftext|> [end of text]

falcon_print_timings:        load time = 50344.30 ms
falcon_print_timings:      sample time =   284.85 ms /   595 runs   (    0.48 ms per token,  2088.80 tokens per second)
falcon_print_timings: batch eval time = 11017.97 ms /    16 tokens (  688.62 ms per token,     1.45 tokens per second)
falcon_print_timings:        eval time = 683718.28 ms /   594 runs   ( 1151.04 ms per token,     0.87 tokens per second)
falcon_print_timings:       total time = 695231.39 ms

C:\falcGGML\ggllm.cpp\build\bin\Release>pause
Press any key to continue . . .

Why diverge from the default GGML versions?

The introduction of additional version numbers (7 and 40) adds complexity to the ggml ecosystem.

Basically, this could also be solved by simply reading the file magic: if it's a GGML file, don't read a version and disable mmap; if the magic is GGJT, read the version (that format is versioned) and enable mmap.

This would also allow the creation of Falcon-7B ggjt files with mmap support.
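
A rough sketch of the dispatch described above, using the magic values of the two ggml file formats (error handling reduced to a bool; the helper and out-parameter names are mine):

    #include <cstdint>
    #include <cstdio>

    // 'ggml' - legacy, unversioned, no mmap; 'ggjt' - versioned, mmap-able
    constexpr uint32_t MAGIC_GGML = 0x67676d6c;
    constexpr uint32_t MAGIC_GGJT = 0x67676a74;

    static bool read_header(FILE * f, uint32_t & version, bool & use_mmap) {
        uint32_t magic = 0;
        if (fread(&magic, sizeof(magic), 1, f) != 1) return false;
        if (magic == MAGIC_GGML) {
            version  = 0;      // legacy format carries no version field
            use_mmap = false;
        } else if (magic == MAGIC_GGJT) {
            if (fread(&version, sizeof(version), 1, f) != 1) return false;
            use_mmap = true;
        } else {
            return false;      // unknown file magic
        }
        return true;
    }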

OpenBLAS and CLBlast support

Is there a reason that OpenBLAS and CLBlast are not mentioned in the README?

I compiled it with cmake -DLLAMA_OPENBLAS=1 -DLLAMA_CLBLAST=1 .. && cmake --build . --config Release (on Linux), and it appears to be working as expected.

Loss of context in batch prompt processing - once again

I hunted that bug for 3-4 hours over the past 2 days. I "fixed" it 3 times.
Now, before heading off for today, I notice it again.

What happens is that when "-b 512" is used, the context is totally lost. But not in the typical "garbage" way: the model writes something coherent, just nothing related to the prompt. As if it received something, but nothing at all related to the prompt context.

If anyone finds that problem, awesome.
If you are affected by it: use -b 1 (the default) until it is solved.

mmap fails on WSL (Linux too?)

I've not looked deeper into it yet, but we have a backtrace:

#0  __GI___mmap64 (offset=0, fd=10, flags=32769, prot=1, len=4963752448, addr=0x0) at ../sysdeps/unix/sysv/linux/mmap64.c:59
#1  __GI___mmap64 (addr=0x0, len=4963752448, prot=1, flags=32769, fd=10, offset=0) at ../sysdeps/unix/sysv/linux/mmap64.c:47
#2  0x0000564d9dd9e08c in llama_mmap::llama_mmap(llama_file*, unsigned long) ()
#3  0x0000564d9dd930d5 in falcon_model_load_internal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, falcon_context&, int, int, int, int, float const*, ggml_type, bool, bool, bool, void (*)(float, void*), void*) ()
#4  0x0000564d9dd94722 in falcon_init_from_file ()
#5  0x0000564d9dd6cf51 in falcon_init_from_gpt_params(gpt_params const&) ()
#6  0x0000564d9dd681c5 in main ()

Loading works fine with --no-mmap.

Also, very strange behavior: if you Ctrl+C during GDB and then just "continue", it actually proceeds to inference, but produces garbage.
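
For anyone digging in: the flags in frame #0 decode to MAP_SHARED | MAP_POPULATE with PROT_READ, so a standalone sketch that replays the call and surfaces errno could look like this (not the actual llama_mmap code):

    #include <cerrno>
    #include <cstdio>
    #include <cstring>
    #include <sys/mman.h>

    // Replay the mmap call from the backtrace and print why it fails.
    static void try_mmap(int fd, size_t len) {
        void * addr = mmap(NULL, len, PROT_READ, MAP_SHARED | MAP_POPULATE, fd, 0);
        if (addr == MAP_FAILED) {
            fprintf(stderr, "mmap of %zu bytes failed: %s\n", len, strerror(errno));
        } else {
            munmap(addr, len);
        }
    }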

Performance at high context (18k+)

Opening this as a ticket, as this is quite a large thing to solve.
We still suffer a significant slowdown compared to the fast speed of the first 1-2k of context.

  • 2144: [ 18725, 64, 8]x[ 47, 47, 47]=[ 18725, 64, 8] CONT ( 1) cpu = 44.000 / 44.000 ms, wall = 43.976 / 43.976 ms [ 60 V] [CPU] (Slow)
  • 2150: [ 64, 18725, 8]x[ 64, 1, 128]=[ 18725, 1, 128] MUL_MAT ( 4) cpu = 29.000 / 7.250 ms, wall = 27.057 / 6.764 ms [ 60 KQ] [CPU]
  • 2154: [ 18725, 64, 8]x[ 18725, 1, 128]=[ 64, 1, 128] MUL_MAT ( 4) cpu = 8.000 / 2.000 ms, wall = 11.296 / 2.824 ms [ 60 KQV] [CPU]
  • 2164: [ 8192, 65040, 1]x[ 8192, 1, 1]=[ 65040, 1, 1] MUL_MAT ( 4) cpu = 7.000 / 1.750 ms, wall = 7.280 / 1.820 ms [ 0 result_lm_head] [GPUxQ]
  • 2153: [ 18725, 1, 128]x[ 47, 47, 47]=[ 18725, 1, 128] SOFT_MAX ( 4) cpu = 5.000 / 1.250 ms, wall = 5.425 / 1.356 ms [ 60 KQ_soft_max] [CPU]

The biggest hit is making V contiguous straight after extracting it from the cache, and that should be something that can be avoided:

            // 3D view into this layer's slice of the V cache
            struct ggml_tensor* V = ggml_permute(
                ctx0,
                ggml_view_3d(
                    ctx0,
                    kv_self.v,
                    head_dim, n_head_kv, n_past + N,
                    head_dim * sizeof_wtype,
                    head_dim * n_head_kv * sizeof_wtype,
                    il * n_ctx * ggml_element_size(kv_self.v) * n_head_kv * head_dim),
                1, 2, 0, 3); // permute to (n_past + N, head_dim, n_head_kv)
            // ggml_cont materializes the permuted view as a contiguous copy;
            // this is the CONT op that dominates the timings above
            V = ggml_cont(ctx0, V);

Per-operation totals for one token:

perf_total_per_op_us[             ADD] =   1.483 ms
perf_total_per_op_us[             MUL] =   1.183 ms
perf_total_per_op_us[            GELU] =   1.878 ms
perf_total_per_op_us[            NORM] =   1.800 ms
perf_total_per_op_us[         MUL_MAT] = 2913.213 ms
perf_total_per_op_us[           SCALE] =  21.552 ms
perf_total_per_op_us[             CPY] =   1.307 ms
perf_total_per_op_us[            CONT] = 2676.875 ms
perf_total_per_op_us[            VIEW] =   0.440 ms
perf_total_per_op_us[         PERMUTE] =   0.240 ms
perf_total_per_op_us[        GET_ROWS] =   0.008 ms
perf_total_per_op_us[   DIAG_MASK_INF] =   0.331 ms
perf_total_per_op_us[        SOFT_MAX] = 335.385 ms
perf_total_per_op_us[            ROPE] =   2.865 ms

Last layer:

 - 2125: [  8192,     1,   1]x[    47,    47,  47]=[  8192,     1,   1]             NORM   (  4) cpu =   0.000 /   0.000 ms, wall =   0.037 /   0.009 ms [ 60 node_2125] [CPU]
 - 2126: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              MUL   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [ 60 node_2126] [CPU]
 - 2127: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.001 ms [ 60 node_2127] [CPU]
 - 2128: [  8192,  9216,   1]x[  8192,     1,   1]=[  9216,     1,   1]          MUL_MAT   (  4) cpu =   1.000 /   0.250 ms, wall =   1.037 /   0.259 ms [ 60 node_2128] [GPUxQ]
 - 2129: [  9216,     1,   1]x[    47,    47,  47]=[    64,     8,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 Kcur] [CPU]
 - 2130: [    64,     8,   1]x[     4,     1,   1]=[    64,     8,   1]             ROPE   (  4) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.002 ms [ 60 Kcur (view)] [CPU]
 - 2131: [614400000,     1,   1]x[    47,    47,  47]=[   512,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 k] [GPU]
 - 2132: [    64,     8,   1]x[   512,     1,   1]=[   512,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [ 60 k (copy of Kcur (view))] [CPU]
 - 2133: [  9216,     1,   1]x[    47,    47,  47]=[    64,     8,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 Vcur] [CPU]
 - 2134: [614400000,     1,   1]x[    47,    47,  47]=[   512,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 v] [GPU]
 - 2135: [    64,     8,   1]x[   512,     1,   1]=[   512,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.000 ms [ 60 v (copy of Vcur)] [CPU]
 - 2136: [  8192,     1,   1]x[    47,    47,  47]=[  8192,     1,   1]             NORM   (  4) cpu =   0.000 /   0.000 ms, wall =   0.010 /   0.003 ms [ 60 node_2136] [CPU]
 - 2137: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              MUL   (  4) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.001 ms [ 60 node_2137] [CPU]
 - 2138: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  3) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.001 ms [ 60 inpFF] [CPU]
 - 2139: [  8192, 32768,   1]x[  8192,     1,   1]=[ 32768,     1,   1]          MUL_MAT   (  4) cpu =   4.000 /   1.000 ms, wall =   3.851 /   0.963 ms [ 60 inpFF*ff_up] [GPUxQ]
 - 2140: [ 32768,     1,   1]x[    47,    47,  47]=[ 32768,     1,   1]             GELU   (  4) cpu =   0.000 /   0.000 ms, wall =   0.033 /   0.008 ms [ 60 inpFF*ff_up (view)] [CPU]
 - 2141: [ 32768,  8192,   1]x[ 32768,     1,   1]=[  8192,     1,   1]          MUL_MAT   (  4) cpu =   3.000 /   0.750 ms, wall =   3.570 /   0.892 ms [ 60 gelu_cur*ff_down] [GPUxQ]
 - 2142: [614400000,     1,   1]x[    47,    47,  47]=[    64,     8,18725]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 cache_v (view)] [GPU]
 - 2143: [    64,     8,18725]x[    47,    47,  47]=[ 18725,    64,   8]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 cache_v (view) (permuted)] [CPU]
 - 2144: [ 18725,    64,   8]x[    47,    47,  47]=[ 18725,    64,   8]             CONT   (  1) cpu =  44.000 /  44.000 ms, wall =  43.976 /  43.976 ms [ 60 V] [CPU]  (Slow)
 - 2145: [614400000,     1,   1]x[    47,    47,  47]=[    64,     8,18725]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 cache_k (view)] [GPU]
 - 2146: [    64,     8,18725]x[    47,    47,  47]=[    64, 18725,   8]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 K] [CPU]
 - 2147: [  9216,     1,   1]x[    47,    47,  47]=[    64,   128,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 Qcur] [CPU]
 - 2148: [    64,   128,   1]x[     4,     1,   1]=[    64,   128,   1]             ROPE   (  4) cpu =   0.000 /   0.000 ms, wall =   0.036 /   0.009 ms [ 60 Qcur (view)] [CPU]
 - 2149: [    64,   128,   1]x[    47,    47,  47]=[    64,     1, 128]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 Q] [CPU]
 - 2150: [    64, 18725,   8]x[    64,     1, 128]=[ 18725,     1, 128]          MUL_MAT   (  4) cpu =  29.000 /   7.250 ms, wall =  27.057 /   6.764 ms [ 60 KQ] [CPU]
 - 2151: [ 18725,     1, 128]x[     1,     1,   1]=[ 18725,     1, 128]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.370 /   0.370 ms [ 60 KQ_scaled] [CPU]
 - 2152: [ 18725,     1, 128]x[     2,     1,   1]=[ 18725,     1, 128]    DIAG_MASK_INF   (  4) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.001 ms [ 60 KQ_masked] [CPU]
 - 2153: [ 18725,     1, 128]x[    47,    47,  47]=[ 18725,     1, 128]         SOFT_MAX   (  4) cpu =   5.000 /   1.250 ms, wall =   5.425 /   1.356 ms [ 60 KQ_soft_max] [CPU]
 - 2154: [ 18725,    64,   8]x[ 18725,     1, 128]=[    64,     1, 128]          MUL_MAT   (  4) cpu =   8.000 /   2.000 ms, wall =  11.296 /   2.824 ms [ 60 KQV] [CPU]
 - 2155: [    64,     1, 128]x[    47,    47,  47]=[    64,   128,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 60 KQV_merged] [CPU]
 - 2156: [    64,   128,   1]x[  8192,     1,   1]=[  8192,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.008 /   0.002 ms [ 60 KQV_merged (copy)] [CPU]
 - 2157: [  8192,  8192,   1]x[  8192,     1,   1]=[  8192,     1,   1]          MUL_MAT   (  4) cpu =   2.000 /   0.500 ms, wall =   1.139 /   0.285 ms [ 60 result_wo] [GPUxQ]
 - 2158: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              CPY   (  4) cpu =   0.000 /   0.000 ms, wall =   0.006 /   0.002 ms [ 60 attn_out] [CPU]
 - 2159: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.001 ms [ 60 node_2159] [CPU]
 - 2160: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  4) cpu =   0.000 /   0.000 ms, wall =   0.017 /   0.004 ms [ 60 inpFF_+_result_attn_out] [CPU]
 - 2161: [  8192,     1,   1]x[    47,    47,  47]=[  8192,     1,   1]             NORM   (  4) cpu =   0.000 /   0.000 ms, wall =   0.006 /   0.002 ms [  0 norm_cur] [CPU]
 - 2162: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              MUL   (  2) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.002 ms [  0 node_2162] [CPU]
 - 2163: [  8192,     1,   1]x[  8192,     1,   1]=[  8192,     1,   1]              ADD   (  3) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.001 ms [  0 result_norm] [CPU]
 - 2164: [  8192, 65040,   1]x[  8192,     1,   1]=[ 65040,     1,   1]          MUL_MAT   (  4) cpu =   7.000 /   1.750 ms, wall =   7.280 /   1.820 ms [  0 result_lm_head] [GPUxQ]

Parameter --reverse-prompt won't accept text

Expected Behavior

Running model wizard-falcon40b.ggmlv3.q4_K_S.bin in interactive mode with the argument --reverse-prompt "User:", so that the model halts generation when the string User: shows up in the output.

Current Behavior

Any text added after --reverse-prompt gives the error
error: unknown argument: User:
and the model does not start.

Environment and Context

pip list |egrep "torch|numpy|sentencepiece"
numpy 1.25.0
torch 2.0.1

AMD Ryzen 7 5800X 8-Core Processor

WSL, Ubuntu 22.04, Linux 5.10.102.1-microsoft-standard-WSL2

Python 3.10.6

Failure Information (for bugs)

Steps to Reproduce

  1. cd ./build/bin
  2. ./falcon_main -r "User:" --threads 14 --model ~/cmp-nct/ggllm.cpp/models/wizard-falcon40b.ggmlv3.q4_K_S.bin --n-predict 150 --temp 0.8 --ctx-size 2048 --n-gpu-layers 10 --interactive --file ~/cmp-nct/ggllm.cpp/prompts/falcon_chat.prompt.txt --keep -1 --color

error: unknown argument: User:
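
This error pattern usually means the parser never consumes the value following the flag, so "User:" falls through to the unknown-argument check. A minimal sketch of the consuming pattern (hypothetical names, not the actual ggllm.cpp parser):

    #include <string>
    #include <vector>

    // A flag that takes a value must advance past the next argv entry.
    static bool parse_args(int argc, char ** argv, std::vector<std::string> & antiprompts) {
        for (int i = 1; i < argc; i++) {
            std::string arg = argv[i];
            if (arg == "-r" || arg == "--reverse-prompt") {
                if (++i >= argc) return false;   // missing value after the flag
                antiprompts.push_back(argv[i]);
            }
            // ... other options; anything unmatched -> "error: unknown argument"
        }
        return true;
    }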

#1 performance requirement

I'm stuck with other work; I recently pushed a half-finished branch containing a ton of fixes and changes, but it is not done.
I also moved from falcon_main to "ggfalcon", which is meant to replace the main example (and later the other examples) with API support.

The really big improvement, which I was not able to complete yet, is calculating the KV mulmat operations on CUDA.
Broadcasting of the first tensor is required (which basically means repeating it 128 times per batched token, so -b 100 would cause 12800 multiplications, run sequentially, two times. Unless it's a single-GPU environment, there might be more parallelism behind it.)

We do have cuBLAS 8-bit support in that branch! It is very fast (but not faster than quantized multiplication, which is the default).
The branch also supports changing the matmul method on demand (cuBLAS 8/16/32-bit, quantized, CPU), so it's easy to test and switch.

What I believe should be done is broadcasting plus batched cuBLAS in 8-bit for the two KV cache multiplications. That should bring an enormous boost in performance.
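
For the batched part, a strided-batch GEMM in cuBLAS is a single call. A minimal FP32 sketch of what the per-head KV multiplications could map onto (the real code would use 8-bit inputs and the actual cache strides, so treat this purely as an illustration):

    #include <cublas_v2.h>

    // One strided-batched GEMM instead of looping over heads:
    // computes C_i = A_i * B_i for batch_count matrices laid out at fixed strides.
    static void batched_matmul(cublasHandle_t handle,
                               const float * A, const float * B, float * C,
                               int m, int n, int k, int batch_count) {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemmStridedBatched(handle,
            CUBLAS_OP_N, CUBLAS_OP_N,
            m, n, k,
            &alpha,
            A, m, (long long) m * k,   // stride between consecutive A_i
            B, k, (long long) k * n,   // stride between consecutive B_i
            &beta,
            C, m, (long long) m * n,   // stride between consecutive C_i
            batch_count);
    }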

Potential roadblock:
The current operation routine in the CUDA code is not usable for that; it would loop tens of thousands of times for batched broadcast processing, and that cannot be used to feed batched cuBLAS. Non-batched cuBLAS is also useless that way.
I just did some dry tests (broadcasting the input without aligning the output properly) and the slowdown compared to CPU was huge.
But that can be solved, likely with a dedicated routine.

Does anyone here have CUDA/cuBLAS experience and want to give that a try?

Slowdown with tokens

With each token processed, the inference speed slows down a little; it starts to become noticeable at around 50 tokens on 40B Q3_K and keeps adding up.

slow on 3090 and very high CPU usage

I have a 3090 GPU; I converted falcon-40b-instruct and quantized it to Q3_K. But when I run the test, prediction is 3x slower than reported, so I checked the GPU and CPU usage: GPU utilization is low, about 10%, while CPU usage is very high, about 6400%.
The command is
CUDA_VISIBLE_DEVICES=0 ./build/bin/falcon_main -m ./falcon_40b_instruct/ggml-model-falcon-40b-instruct-q3_k.bin -p "Building a website can be done in 10 simple steps:" -n 16 -ngl 80 -b 1
and the output looks like:

main: build = 755 (a584364)
main: seed  = 1687272872
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090
falcon.cpp: loading model from ./falcon_40b_instruct/ggml-model-falcon-40b-instruct-q3_k.bin
falcon_model_load_internal: format     = ggjt v3 (latest)
falcon_model_load_internal: n_vocab    = 65024
falcon_model_load_internal: n_ctx      = 512
falcon_model_load_internal: n_embd     = 8192
falcon_model_load_internal: n_head     = 128
falcon_model_load_internal: n_head_kv     = 8
falcon_model_load_internal: n_layer    = 60
falcon_model_load_internal: version      = 40
falcon_model_load_internal: ftype      = 12 (mostly Q3_K - Medium)
falcon_model_load_internal: n_ff       = 32768
falcon_model_load_internal: n_parts    = 1
falcon_model_load_internal: model size = 40B
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 17150.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 23655.00 MB  of 24259.00 MB (in use:  604.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: mem required  = 4028.27 MB (+  120.00 MB per state)
falcon_model_load_internal: offloading 60 of 60 layers to GPU, weights offloaded 16706.25 MB
falcon_model_load_internal: offloading output layer to GPU
falcon_model_load_internal: estimated VRAM usage: 17957 MB
...................................................................................................
falcon_model_load_internal: VRAM free: 6855.00 MB  of 24259.00 MB (used: 17404.00 MB)
falcon_init_from_file: kv self size  =  120.00 MB

system_info: n_threads = 64 / 128 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 1, n_predict = 16, n_keep = 0


Building a website can be done in 10 simple steps:
1. Choose your domain name
2. Decide on a hosting provider

falcon_print_timings:        load time =  3599.05 ms
falcon_print_timings:      sample time =    24.45 ms /    16 runs   (    1.53 ms per token)
falcon_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
falcon_print_timings:        eval time =  7579.99 ms /    27 runs   (  280.74 ms per token)
falcon_print_timings:       total time =  7613.69 ms


Instruct Mode Issue

It seems that when using instruct mode the seed is locked, and it is not possible to keep asking follow-up questions.

C:\falcGGML\ggllm.cpp\build\bin\Release>title falcon_main.cpp

C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 4 -m h2ogpt-gm-oasst1-en-2048-falcon-7b-v2.ggccv1.q5_1.bin --color -c 2048 -i -ins -s -1 --gpu-reserve-mb-main 100 --debug-timings 0
main: build = 860 (1d6e234)
falcon.cpp: loading model from h2ogpt-gm-oasst1-en-2048-falcon-7b-v2.ggccv1.q5_1.bin
falcon.cpp: file version 10
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab |   n_bpe | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggcc v1 |   65024 |   64784 |  2048 |   4544 |      71 ;   1 |      32 |  7; 7B |     9 |  18176 |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 5163.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 7125.00 MB  of 8191.00 MB (in use: 1066.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: Offloading Output head tensor (211 MB)
falcon_model_load_internal: mem required  =  370.24 MB (+   32.00 MB per state)
falcon_model_load_internal: offloading 32 of 32 layers to GPU, weights offloaded 5117.03 MB
falcon_model_load_internal: estimated VRAM usage: 5150 MB
[==================================================] 100%  Tensors populated, CUDA ready
falcon_context_prepare: Context falcon_main RAM buffers - key_val =   32.00 MB, Compute =  160.00 MB, Scratch 0 =  124.00 MB, Scratch 1 =   40.14 MB
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
| ID | Device                     1 found | VRAM Total | VRAM Free | VRAM Used | Split at  |    Device |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
|  0 | NVIDIA GeForce RTX 2060 SUPER      |    8191 MB |   1885 MB |   6306 MB |      0.0% |   Primary |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+

+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| Syst. Info | AVX | AVX2 | AVX512 | AVX512_VBMI | AVX512_VNNI | FMA | NEON | ARM_FMA | F16C | FP16_VA | SIMD | BLAS | SSE3 | VSX |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
|  4/16 thrd | 1   | 1    | 0      | 0           | 0           | 1   | 0    | 0       | 1    | 0       | 0    | 1    | 1    | 0   |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+

main: interactive mode on.
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|   Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|            |    64 | 1.100 | 0.000 | 0.000 |    40 | 1.000 | 0.950 | 1.000 | 0.80 |    0 | 0.1000 | 5.00000 |
+============+=======+=======+=======+=======+=======+=======+-------+-------+------+------+--------+---------+
| Generation |   Ctx | Batch |  Keep | Prom. |          Seed |             Finetune | Stop |
+------------+-------+-------+-------+-------+---------------+----------------------+------+
|            |  2048 |     1 |     0 |     0 |    1688970281 |          UNSPECIFIED | #  4 |
+------------+-------+-------+-------+-------+---------------+----------------------+------+


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to ggLLM.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.


> Translate this sentence into English: Hoy me recorte el pelo.
 Today I cut my hair.

I hope this helps.
Feel free to ask for any further assistance.

> Translate this sentence into English: Mañana tengo que cojer un examen en la escula en la clase de quimica.


>

release builds should have names other than "llama-master-codestring.zip"

Nice to see you've set up binary release builds too.
:)
Right now the names of the binaries are still something like "llama-master-codestring.zip".
I think the whole point of this repo is, at the very least, not to be Llama-centric - it should be "falcon-master-xyz", or just "falcon-xyz", or "ggllm-cpp-xyz", or something similar.

It might even get confusing to have llama.cpp and ggllm.cpp release files with all the same names.

Even in interactive mode, multiturn conversation is not possible.

Thanks for the wonderful work!

I am running the falcon-7b-instruct model with falcon_main. I generated the model file with the conversion script, and from the warning messages I can tell it is in the old format. Anyway, it runs perfectly fine for the given prompt, but I cannot continue the chat after the model generates its output, even in interactive mode. Since there is a significant time overhead from GPU offloading every time falcon_main runs, I would like to have multi-turn conversations in a single run. Is there a way to achieve that?

Mul_mat Speedup??

I'm not too familiar with mul_mat, but it seems like it is the part of the process that takes the longest. Can it be optimized even further?

The current speed is great for a Falcon model; I had tested the original GPTQ ones and those were very slow in ooba.

C:\falcGGML\ggllm.cpp\build\bin\Release>title falcon_main.cpp

C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 2 -b 1 -ngl 100 -m wizard-falcon-7b.ggmlv3.q5_1.bin --color -c 2048 -p "What is the difference between a falcon and an eagle?\n### Response:" -s 1686779952 --gpu-reserve-mb-main 1 --debug-timings 1
WARNING: when using cuBLAS generation results are NOT guaranteed to be reproducible.
main: build = 796 (c4d4d5f)
falcon.cpp: loading model from wizard-falcon-7b.ggmlv3.q5_1.bin
falcon.cpp: file version 4
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggml v3 |   65025 |  2048 |   4544 |      71 ;   1 |      32 |  7; 7B |     9 |  18176 |
+---------------+------------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 5163.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 7157.00 MB  of 8191.00 MB (in use: 1034.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: Offloading Output head tensor (211 MB)
falcon_model_load_internal: mem required  = 2004.00 MB (+   32.00 MB per state)
falcon_model_load_internal: offloading 32 of 32 layers to GPU, weights offloaded 4951.14 MB
falcon_model_load_internal: estimated VRAM usage: 4952 MB
[==================================================] 100%  Tensors populated, CUDA ready
falcon_init_from_file: kv self size  =   32.00 MB
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
| ID | Device                     1 found | VRAM Total | VRAM Free | VRAM Used | Split at  |    Device |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
|  0 | NVIDIA GeForce RTX 2060 SUPER      |    8191 MB |   1885 MB |   6306 MB |      0.0% |   Primary |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+

+---------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
|  System Info  | AVX | AVX2 | AVX512 | AVX512_VBMI | AVX512_VNNI | FMA | NEON | ARM_FMA | F16C | FP16_VA | SIMD | BLAS | SSE3 | VSX |
+---------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
|  2/16 threads | 1   | 1    | 0      | 0           | 0           | 1   | 0    | 0       | 1    | 0       | 0    | 1    | 1    | 0   |
+---------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+

+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|   Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|            |    64 | 1.100 | 0.000 | 0.000 |    40 | 1.000 | 0.950 | 1.000 | 0.80 |    0 | 0.1000 | 5.00000 |
+============+=======+=======+=======+=======+=======+=======+====---+-------+------+------+--------+---------+
| Generation |   n_ctx |  n_batch | n_keep | prompt |       seed |
+------------+---------+----------+--------+--------+------------+
|            |    2048 |        1 |      0 |     17 | 1686779952 |
+------------+---------+----------+--------+--------+------------+


What is=== GRAPH ===
n_nodes = 1189
 - Idx: [   S00,   S01, S02]x[   S10,   S11, S12]=[    D0,    D1,  D2]         Op_Label Gx (Runs) cpu =  Cycles / Avg_Cyc ms, wall =    Time / Avg_Time ms [Layer Name] Device  Impact
 -   0: [  4544, 65025,   1]x[     1,     1,   1]=[  4544,     1,   1]         GET_ROWS   (  1) cpu =   0.000 /   0.000 ms, wall =   0.009 /   0.009 ms [  0 node_0]   CPU
 -   1: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.056 /   0.056 ms [  1 node_1]   CPU
 -   2: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  1) cpu =   1.000 /   1.000 ms, wall =   0.280 /   0.280 ms [  1 node_2]   CPU  (Slow)
 -   3: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  1 inpFF]   CPU
 -   4: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.275 /   0.275 ms [  1 node_4]   GPU
 -   5: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 Kcur]   CPU
 -   6: [    64,     1,   1]x[     3,     1,   1]=[    64,     1,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms [  1 node_6]   CPU
 -   7: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 k]   GPU
 -   8: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 node_8]   CPU
 -   9: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 Vcur]   CPU
 -  10: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 v]   GPU
 -  11: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 node_11]   CPU
 -  12: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.793 /   0.793 ms [  1 inpFF*ff_up]   GPU  (Slow)
 -  13: [ 18176,     1,   1]x[    47,    47,  47]=[ 18176,     1,   1]             GELU   (  1) cpu =   0.000 /   0.000 ms, wall =   0.015 /   0.015 ms [  1 node_13]   CPU
 -  14: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.801 /   0.801 ms [  1 gelu_cur*ff_down]   GPU  (Slow)
 -  15: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 node_15]   GPU
 -  16: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 node_16]   CPU
 -  17: [    64,     2,   1]x[    47,    47,  47]=[     2,    64,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 node_17]   CPU
 -  18: [     2,    64,   1]x[    47,    47,  47]=[     2,    64,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 node_18]   CPU
 -  19: [     2,    64,   1]x[     2,    64,  71]=[     2,    64,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.016 /   0.016 ms [  1 V]   CPU
 -  20: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 node_20]   GPU
 -  21: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 node_21]   CPU
 -  22: [    64,     2,   1]x[    47,    47,  47]=[    64,     2,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 node_22]   CPU
 -  23: [    64,     2,   1]x[    64,     2,  71]=[    64,     2,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  1 K]   CPU
 -  24: [  4672,     1,   1]x[    47,    47,  47]=[    64,    71,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 Qcur]   CPU
 -  25: [    64,    71,   1]x[     3,     1,   1]=[    64,    71,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.011 /   0.011 ms [  1 node_25]   CPU
 -  26: [    64,    71,   1]x[    47,    47,  47]=[    64,     1,  71]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 Q]   CPU
 -  27: [    64,     2,  71]x[    64,     1,  71]=[     2,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.004 ms [  1 KQ]   CPU
 -  28: [     2,     1,  71]x[     1,     1,   1]=[     2,     1,  71]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  1 KQ_scaled]   CPU
 -  29: [     2,     1,  71]x[     2,     1,   1]=[     2,     1,  71]    DIAG_MASK_INF   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  1 KQ_masked]   CPU
 -  30: [     2,     1,  71]x[    47,    47,  47]=[     2,     1,  71]         SOFT_MAX   (  1) cpu =   0.000 /   0.000 ms, wall =   0.005 /   0.005 ms [  1 KQ_soft_max]   CPU
 -  31: [     2,    64,  71]x[     2,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.024 /   0.024 ms [  1 KQV]   CPU
 -  32: [    64,     1,  71]x[    47,    47,  47]=[    64,    71,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  1 KQV_merged]   CPU
 -  33: [    64,    71,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  1 node_33]   CPU
 -  34: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.269 /   0.269 ms [  1 result_wo]   GPU
 -  35: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  1 attn_out]   CPU
 -  36: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  1 node_36]   CPU
 -  37: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  1 inpFF_+_result_attn_out]   CPU
 -  38: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.008 /   0.008 ms [  2 node_38]   CPU
 -  39: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  1) cpu =   0.000 /   0.000 ms, wall =   0.088 /   0.088 ms [  2 node_39]   CPU
 -  40: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  2 inpFF]   CPU
 -  41: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.272 /   0.272 ms [  2 node_41]   GPU
 -  42: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 Kcur]   CPU
 -  43: [    64,     1,   1]x[     3,     1,   1]=[    64,     1,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  2 node_43]   CPU
 -  44: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 k]   GPU
 -  45: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 node_45]   CPU
 -  46: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 Vcur]   CPU
 -  47: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 v]   GPU
 -  48: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 node_48]   CPU
 -  49: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.786 /   0.786 ms [  2 inpFF*ff_up]   GPU  (Slow)
 -  50: [ 18176,     1,   1]x[    47,    47,  47]=[ 18176,     1,   1]             GELU   (  1) cpu =   0.000 /   0.000 ms, wall =   0.014 /   0.014 ms [  2 node_50]   CPU
 -  51: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.797 /   0.797 ms [  2 gelu_cur*ff_down]   GPU  (Slow)
 -  52: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 node_52]   GPU
 -  53: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 node_53]   CPU
 -  54: [    64,     2,   1]x[    47,    47,  47]=[     2,    64,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 node_54]   CPU
 -  55: [     2,    64,   1]x[    47,    47,  47]=[     2,    64,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 node_55]   CPU
 -  56: [     2,    64,   1]x[     2,    64,  71]=[     2,    64,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.023 /   0.023 ms [  2 V]   CPU
 -  57: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 node_57]   GPU
 -  58: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 node_58]   CPU
 -  59: [    64,     2,   1]x[    47,    47,  47]=[    64,     2,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 node_59]   CPU
 -  60: [    64,     2,   1]x[    64,     2,  71]=[    64,     2,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.010 /   0.010 ms [  2 K]   CPU
 -  61: [  4672,     1,   1]x[    47,    47,  47]=[    64,    71,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 Qcur]   CPU
 -  62: [    64,    71,   1]x[     3,     1,   1]=[    64,    71,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.010 /   0.010 ms [  2 node_62]   CPU
 -  63: [    64,    71,   1]x[    47,    47,  47]=[    64,     1,  71]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 Q]   CPU
 -  64: [    64,     2,  71]x[    64,     1,  71]=[     2,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  2 KQ]   CPU
 -  65: [     2,     1,  71]x[     1,     1,   1]=[     2,     1,  71]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 KQ_scaled]   CPU
 -  66: [     2,     1,  71]x[     2,     1,   1]=[     2,     1,  71]    DIAG_MASK_INF   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  2 KQ_masked]   CPU
 -  67: [     2,     1,  71]x[    47,    47,  47]=[     2,     1,  71]         SOFT_MAX   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  2 KQ_soft_max]   CPU
 -  68: [     2,    64,  71]x[     2,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.021 /   0.021 ms [  2 KQV]   CPU
 -  69: [    64,     1,  71]x[    47,    47,  47]=[    64,    71,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  2 KQV_merged]   CPU
 -  70: [    64,    71,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  2 node_70]   CPU
 -  71: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.269 /   0.269 ms [  2 result_wo]   GPU
 -  72: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  2 attn_out]   CPU
 -  73: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  2 node_73]   CPU
 -  74: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  2 inpFF_+_result_attn_out]   CPU
 -  75: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.008 /   0.008 ms [  3 node_75]   CPU
 -  76: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  1) cpu =   0.000 /   0.000 ms, wall =   0.113 /   0.113 ms [  3 node_76]   CPU
 -  77: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  3 inpFF]   CPU
 -  78: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.272 /   0.272 ms [  3 node_78]   GPU
 -  79: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 Kcur]   CPU
 -  80: [    64,     1,   1]x[     3,     1,   1]=[    64,     1,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  3 node_80]   CPU
 -  81: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 k]   GPU
 -  82: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 node_82]   CPU
 -  83: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 Vcur]   CPU
 -  84: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 v]   GPU
 -  85: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  3 node_85]   CPU
 -  86: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.816 /   0.816 ms [  3 inpFF*ff_up]   GPU  (Slow)
 -  87: [ 18176,     1,   1]x[    47,    47,  47]=[ 18176,     1,   1]             GELU   (  1) cpu =   0.000 /   0.000 ms, wall =   0.015 /   0.015 ms [  3 node_87]   CPU
 -  88: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.815 /   0.815 ms [  3 gelu_cur*ff_down]   GPU  (Slow)
 -  89: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 node_89]   GPU
 -  90: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 node_90]   CPU
 -  91: [    64,     2,   1]x[    47,    47,  47]=[     2,    64,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 node_91]   CPU
 -  92: [     2,    64,   1]x[    47,    47,  47]=[     2,    64,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 node_92]   CPU
 -  93: [     2,    64,   1]x[     2,    64,  71]=[     2,    64,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.022 /   0.022 ms [  3 V]   CPU
 -  94: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 node_94]   GPU
 -  95: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 node_95]   CPU
 -  96: [    64,     2,   1]x[    47,    47,  47]=[    64,     2,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 node_96]   CPU
 -  97: [    64,     2,   1]x[    64,     2,  71]=[    64,     2,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.010 /   0.010 ms [  3 K]   CPU
 -  98: [  4672,     1,   1]x[    47,    47,  47]=[    64,    71,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 Qcur]   CPU
 -  99: [    64,    71,   1]x[     3,     1,   1]=[    64,    71,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.009 /   0.009 ms [  3 node_99]   CPU
 - 100: [    64,    71,   1]x[    47,    47,  47]=[    64,     1,  71]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 Q]   CPU
 - 101: [    64,     2,  71]x[    64,     1,  71]=[     2,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  3 KQ]   CPU
 - 102: [     2,     1,  71]x[     1,     1,   1]=[     2,     1,  71]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 KQ_scaled]   CPU
 - 103: [     2,     1,  71]x[     2,     1,   1]=[     2,     1,  71]    DIAG_MASK_INF   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 KQ_masked]   CPU
 - 104: [     2,     1,  71]x[    47,    47,  47]=[     2,     1,  71]         SOFT_MAX   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  3 KQ_soft_max]   CPU
 - 105: [     2,    64,  71]x[     2,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.023 /   0.023 ms [  3 KQV]   CPU
 - 106: [    64,     1,  71]x[    47,    47,  47]=[    64,    71,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [  3 KQV_merged]   CPU
 - 107: [    64,    71,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 node_107]   CPU
 - 108: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.292 /   0.292 ms [  3 result_wo]   GPU  (Slow)
 - 109: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 attn_out]   CPU
 - 110: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  3 node_110]   CPU
 - 111: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [  3 inpFF_+_result_attn_out]   CPU

 - 1074: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.007 ms [ 30 node_1074]   CPU
 - 1075: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  1) cpu =   1.000 /   1.000 ms, wall =   0.110 /   0.110 ms [ 30 node_1075]   CPU
 - 1076: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 30 inpFF]   CPU
 - 1077: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.293 /   0.293 ms [ 30 node_1077]   GPU  (Slow)
 - 1078: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 Kcur]   CPU
 - 1079: [    64,     1,   1]x[     3,     1,   1]=[    64,     1,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 node_1079]   CPU
 - 1080: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 k]   GPU
 - 1081: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 node_1081]   CPU
 - 1082: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 Vcur]   CPU
 - 1083: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 v]   GPU
 - 1084: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 node_1084]   CPU
 - 1085: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.805 /   0.805 ms [ 30 inpFF*ff_up]   GPU  (Slow)
 - 1086: [ 18176,     1,   1]x[    47,    47,  47]=[ 18176,     1,   1]             GELU   (  1) cpu =   0.000 /   0.000 ms, wall =   0.016 /   0.016 ms [ 30 node_1086]   CPU
 - 1087: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.823 /   0.823 ms [ 30 gelu_cur*ff_down]   GPU  (Slow)
 - 1088: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 node_1088]   GPU
 - 1089: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 node_1089]   CPU
 - 1090: [    64,     2,   1]x[    47,    47,  47]=[     2,    64,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 node_1090]   CPU
 - 1091: [     2,    64,   1]x[    47,    47,  47]=[     2,    64,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 node_1091]   CPU
 - 1092: [     2,    64,   1]x[     2,    64,  71]=[     2,    64,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.017 /   0.017 ms [ 30 V]   CPU
 - 1093: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 node_1093]   GPU
 - 1094: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 node_1094]   CPU
 - 1095: [    64,     2,   1]x[    47,    47,  47]=[    64,     2,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 node_1095]   CPU
 - 1096: [    64,     2,   1]x[    64,     2,  71]=[    64,     2,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [ 30 K]   CPU
 - 1097: [  4672,     1,   1]x[    47,    47,  47]=[    64,    71,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 Qcur]   CPU
 - 1098: [    64,    71,   1]x[     3,     1,   1]=[    64,    71,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.009 /   0.009 ms [ 30 node_1098]   CPU
 - 1099: [    64,    71,   1]x[    47,    47,  47]=[    64,     1,  71]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 Q]   CPU
 - 1100: [    64,     2,  71]x[    64,     1,  71]=[     2,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 30 KQ]   CPU
 - 1101: [     2,     1,  71]x[     1,     1,   1]=[     2,     1,  71]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 KQ_scaled]   CPU
 - 1102: [     2,     1,  71]x[     2,     1,   1]=[     2,     1,  71]    DIAG_MASK_INF   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 KQ_masked]   CPU
 - 1103: [     2,     1,  71]x[    47,    47,  47]=[     2,     1,  71]         SOFT_MAX   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 KQ_soft_max]   CPU
 - 1104: [     2,    64,  71]x[     2,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.019 /   0.019 ms [ 30 KQV]   CPU
 - 1105: [    64,     1,  71]x[    47,    47,  47]=[    64,    71,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 KQV_merged]   CPU
 - 1106: [    64,    71,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 30 node_1106]   CPU
 - 1107: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.292 /   0.292 ms [ 30 result_wo]   GPU  (Slow)
 - 1108: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 30 attn_out]   CPU
 - 1109: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 30 node_1109]   CPU
 - 1110: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 30 inpFF_+_result_attn_out]   CPU
 - 1111: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.007 ms [ 31 node_1111]   CPU
 - 1112: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  1) cpu =   0.000 /   0.000 ms, wall =   0.110 /   0.110 ms [ 31 node_1112]   CPU
 - 1113: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [ 31 inpFF]   CPU
 - 1114: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.293 /   0.293 ms [ 31 node_1114]   GPU  (Slow)
 - 1115: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 Kcur]   CPU
 - 1116: [    64,     1,   1]x[     3,     1,   1]=[    64,     1,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 node_1116]   CPU
 - 1117: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 k]   GPU
 - 1118: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 node_1118]   CPU
 - 1119: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 Vcur]   CPU
 - 1120: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 v]   GPU
 - 1121: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 node_1121]   CPU
 - 1122: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.805 /   0.805 ms [ 31 inpFF*ff_up]   GPU  (Slow)
 - 1123: [ 18176,     1,   1]x[    47,    47,  47]=[ 18176,     1,   1]             GELU   (  1) cpu =   0.000 /   0.000 ms, wall =   0.015 /   0.015 ms [ 31 node_1123]   CPU
 - 1124: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.822 /   0.822 ms [ 31 gelu_cur*ff_down]   GPU  (Slow)
 - 1125: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 node_1125]   GPU
 - 1126: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 node_1126]   CPU
 - 1127: [    64,     2,   1]x[    47,    47,  47]=[     2,    64,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 node_1127]   CPU
 - 1128: [     2,    64,   1]x[    47,    47,  47]=[     2,    64,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 node_1128]   CPU
 - 1129: [     2,    64,   1]x[     2,    64,  71]=[     2,    64,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.017 /   0.017 ms [ 31 V]   CPU
 - 1130: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 node_1130]   GPU
 - 1131: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 node_1131]   CPU
 - 1132: [    64,     2,   1]x[    47,    47,  47]=[    64,     2,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 node_1132]   CPU
 - 1133: [    64,     2,   1]x[    64,     2,  71]=[    64,     2,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.004 ms [ 31 K]   CPU
 - 1134: [  4672,     1,   1]x[    47,    47,  47]=[    64,    71,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 Qcur]   CPU
 - 1135: [    64,    71,   1]x[     3,     1,   1]=[    64,    71,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.010 /   0.010 ms [ 31 node_1135]   CPU
 - 1136: [    64,    71,   1]x[    47,    47,  47]=[    64,     1,  71]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 Q]   CPU
 - 1137: [    64,     2,  71]x[    64,     1,  71]=[     2,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 KQ]   CPU
 - 1138: [     2,     1,  71]x[     1,     1,   1]=[     2,     1,  71]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 KQ_scaled]   CPU
 - 1139: [     2,     1,  71]x[     2,     1,   1]=[     2,     1,  71]    DIAG_MASK_INF   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 KQ_masked]   CPU
 - 1140: [     2,     1,  71]x[    47,    47,  47]=[     2,     1,  71]         SOFT_MAX   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 KQ_soft_max]   CPU
 - 1141: [     2,    64,  71]x[     2,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.019 /   0.019 ms [ 31 KQV]   CPU
 - 1142: [    64,     1,  71]x[    47,    47,  47]=[    64,    71,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 31 KQV_merged]   CPU
 - 1143: [    64,    71,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 node_1143]   CPU
 - 1144: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.292 /   0.292 ms [ 31 result_wo]   GPU  (Slow)
 - 1145: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 31 attn_out]   CPU
 - 1146: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 31 node_1146]   CPU
 - 1147: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 31 inpFF_+_result_attn_out]   CPU
 - 1148: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.007 ms [ 32 node_1148]   CPU
 - 1149: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  1) cpu =   0.000 /   0.000 ms, wall =   0.112 /   0.112 ms [ 32 node_1149]   CPU
 - 1150: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 32 inpFF]   CPU
 - 1151: [  4544,  4672,   1]x[  4544,     1,   1]=[  4672,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.298 /   0.298 ms [ 32 node_1151]   GPU  (Slow)
 - 1152: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 Kcur]   CPU
 - 1153: [    64,     1,   1]x[     3,     1,   1]=[    64,     1,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 node_1153]   CPU
 - 1154: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 k]   GPU
 - 1155: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 node_1155]   CPU
 - 1156: [  4672,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 Vcur]   CPU
 - 1157: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 v]   GPU
 - 1158: [    64,     1,   1]x[    64,     1,   1]=[    64,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 node_1158]   CPU
 - 1159: [  4544, 18176,   1]x[  4544,     1,   1]=[ 18176,     1,   1]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.805 /   0.805 ms [ 32 inpFF*ff_up]   GPU  (Slow)
 - 1160: [ 18176,     1,   1]x[    47,    47,  47]=[ 18176,     1,   1]             GELU   (  1) cpu =   0.000 /   0.000 ms, wall =   0.016 /   0.016 ms [ 32 node_1160]   CPU
 - 1161: [ 18176,  4544,   1]x[ 18176,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.821 /   0.821 ms [ 32 gelu_cur*ff_down]   GPU  (Slow)
 - 1162: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 node_1162]   GPU
 - 1163: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 node_1163]   CPU
 - 1164: [    64,     2,   1]x[    47,    47,  47]=[     2,    64,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 node_1164]   CPU
 - 1165: [     2,    64,   1]x[    47,    47,  47]=[     2,    64,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 node_1165]   CPU
 - 1166: [     2,    64,   1]x[     2,    64,  71]=[     2,    64,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.017 /   0.017 ms [ 32 V]   CPU
 - 1167: [4194304,     1,   1]x[    47,    47,  47]=[    64,     1,   2]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 node_1167]   GPU
 - 1168: [    64,     1,   2]x[    47,    47,  47]=[    64,     2,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 node_1168]   CPU
 - 1169: [    64,     2,   1]x[    47,    47,  47]=[    64,     2,   1]             CONT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 node_1169]   CPU
 - 1170: [    64,     2,   1]x[    64,     2,  71]=[    64,     2,  71]          REPEAT2   (  1) cpu =   0.000 /   0.000 ms, wall =   0.004 /   0.004 ms [ 32 K]   CPU
 - 1171: [  4672,     1,   1]x[    47,    47,  47]=[    64,    71,   1]             VIEW   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 Qcur]   CPU
 - 1172: [    64,    71,   1]x[     3,     1,   1]=[    64,    71,   1]             ROPE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.011 /   0.011 ms [ 32 node_1172]   CPU
 - 1173: [    64,    71,   1]x[    47,    47,  47]=[    64,     1,  71]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 Q]   CPU
 - 1174: [    64,     2,  71]x[    64,     1,  71]=[     2,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 32 KQ]   CPU
 - 1175: [     2,     1,  71]x[     1,     1,   1]=[     2,     1,  71]            SCALE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 KQ_scaled]   CPU
 - 1176: [     2,     1,  71]x[     2,     1,   1]=[     2,     1,  71]    DIAG_MASK_INF   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 KQ_masked]   CPU
 - 1177: [     2,     1,  71]x[    47,    47,  47]=[     2,     1,  71]         SOFT_MAX   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 KQ_soft_max]   CPU
 - 1178: [     2,    64,  71]x[     2,     1,  71]=[    64,     1,  71]          MUL_MAT   (  1) cpu =   0.000 /   0.000 ms, wall =   0.019 /   0.019 ms [ 32 KQV]   CPU
 - 1179: [    64,     1,  71]x[    47,    47,  47]=[    64,    71,   1]          PERMUTE   (  1) cpu =   0.000 /   0.000 ms, wall =   0.000 /   0.000 ms [ 32 KQV_merged]   CPU
 - 1180: [    64,    71,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 node_1180]   CPU
 - 1181: [  4544,  4544,   1]x[  4544,     1,   1]=[  4544,     1,   1]          MUL_MAT   (  1) cpu =   1.000 /   1.000 ms, wall =   0.293 /   0.293 ms [ 32 result_wo]   GPU  (Slow)
 - 1182: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              CPY   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 attn_out]   CPU
 - 1183: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [ 32 node_1183]   CPU
 - 1184: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.002 /   0.002 ms [ 32 inpFF_+_result_attn_out]   CPU
 - 1185: [  4544,     1,   1]x[    47,    47,  47]=[  4544,     1,   1]             NORM   (  1) cpu =   0.000 /   0.000 ms, wall =   0.007 /   0.007 ms [  0 norm_cur]   CPU
 - 1186: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              MUL   (  1) cpu =   0.000 /   0.000 ms, wall =   0.003 /   0.003 ms [  0 node_1186]   CPU
 - 1187: [  4544,     1,   1]x[  4544,     1,   1]=[  4544,     1,   1]              ADD   (  1) cpu =   0.000 /   0.000 ms, wall =   0.001 /   0.001 ms [  0 result_norm]   CPU
 - 1188: [  4544, 65025,   1]x[  4544,     1,   1]=[ 65025,     1,   1]          MUL_MAT   (  1) cpu =   2.000 /   2.000 ms, wall =   2.593 /   2.593 ms [  0 result_lm_head]   GPU  (Slow)
perf_total_per_op_us[             ADD] =   0.182 ms
perf_total_per_op_us[             MUL] =   3.658 ms
perf_total_per_op_us[            GELU] =   0.498 ms
perf_total_per_op_us[            NORM] =   0.282 ms
perf_total_per_op_us[         MUL_MAT] =  74.875 ms
perf_total_per_op_us[           SCALE] =   0.032 ms
perf_total_per_op_us[             CPY] =   0.142 ms
perf_total_per_op_us[            CONT] =   0.064 ms
perf_total_per_op_us[            VIEW] =   0.224 ms
perf_total_per_op_us[         PERMUTE] =   0.160 ms
perf_total_per_op_us[        GET_ROWS] =   0.009 ms
perf_total_per_op_us[   DIAG_MASK_INF] =   0.033 ms
perf_total_per_op_us[        SOFT_MAX] =   0.042 ms
perf_total_per_op_us[            ROPE] =   0.349 ms
perf_total_per_op_us[         REPEAT2] =   0.673 ms
========================================
 the difference between a falcon and an eagle?\n### Response:Falcons are smaller and more agile than eagles. They have curved beaks, sharp talons, and keen eyesight that allows them to swoop down on prey at high speeds. Eagles, on the other hand, are larger and have broader wingspans. They have a hooked beak and powerful legs with sharp claws for grasping their prey. Eagles also have excellent vision but tend to hunt from higher vantage points.<|endoftext|> [end of text]

falcon_print_timings:        load time =  2765.76 ms
falcon_print_timings:      sample time =    30.56 ms /    84 runs   (    0.36 ms per token,  2748.87 tokens per second)
falcon_print_timings:        eval time =  8232.76 ms /   100 runs   (   82.33 ms per token,    12.15 tokens per second)
falcon_print_timings:       total time =  8277.83 ms

C:\falcGGML\ggllm.cpp\build\bin\Release>pause
Press any key to continue . . .
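
For reference, the per-op totals at the end of the dump above are just wall-clock sums over all executed graph nodes, grouped by op type. A minimal sketch of that accumulation follows; this is assumed behavior for illustration, not the actual ggml code:

#include <cstdio>
#include <map>
#include <string>

struct node_timing {
    std::string op;   // op name as printed in the dump, e.g. "MUL_MAT"
    double wall_us;   // wall-clock time of this node in microseconds
};

int main() {
    // a few sample node timings, as they appear in the dump above
    node_timing nodes[] = {
        {"MUL_MAT", 269.0}, {"MUL_MAT", 786.0}, {"ADD", 2.0}, {"GELU", 14.0},
    };
    std::map<std::string, double> total_us;
    for (const auto & n : nodes) {
        total_us[n.op] += n.wall_us;   // sum per op type over the whole graph
    }
    for (const auto & kv : total_us) {
        printf("perf_total_per_op_us[%16s] = %7.3f ms\n",
               kv.first.c_str(), kv.second / 1000.0);
    }
    return 0;
}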

Debug Timings No Longer Working

It seems the debug timings flag isn't working with the new ggcc models: the run exits right after printing the graph header.

C:\falcGGML\ggllm.cpp\build\bin\Release>title falcon_main.cpp

C:\falcGGML\ggllm.cpp\build\bin\Release>falcon_main -t 4 -m h2ogpt-gm-oasst1-en-2048-falcon-7b-v2.ggccv1.q5_1.bin --color -c 2048 -p "Tell me a story about robot falcons from outer space.\n### Response:" -s 1686779952 --gpu-reserve-mb-main 100 --debug-timings 1
main: build = 860 (1d6e234)
falcon.cpp: loading model from h2ogpt-gm-oasst1-en-2048-falcon-7b-v2.ggccv1.q5_1.bin
falcon.cpp: file version 10
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab |   n_bpe | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggcc v1 |   65024 |   64784 |  2048 |   4544 |      71 ;   1 |      32 |  7; 7B |     9 |  18176 |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 5163.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 7129.00 MB  of 8191.00 MB (in use: 1062.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: Offloading Output head tensor (211 MB)
falcon_model_load_internal: mem required  =  370.24 MB (+   32.00 MB per state)
falcon_model_load_internal: offloading 32 of 32 layers to GPU, weights offloaded 5117.03 MB
falcon_model_load_internal: estimated VRAM usage: 5150 MB
[==================================================] 100%  Tensors populated, CUDA ready
falcon_context_prepare: Context falcon_main RAM buffers - key_val =   32.00 MB, Compute =  160.00 MB, Scratch 0 =  124.00 MB, Scratch 1 =   40.14 MB
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
| ID | Device                     1 found | VRAM Total | VRAM Free | VRAM Used | Split at  |    Device |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
|  0 | NVIDIA GeForce RTX 2060 SUPER      |    8191 MB |   1885 MB |   6306 MB |      0.0% |   Primary |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+

+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| Syst. Info | AVX | AVX2 | AVX512 | AVX512_VBMI | AVX512_VNNI | FMA | NEON | ARM_FMA | F16C | FP16_VA | SIMD | BLAS | SSE3 | VSX |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
|  4/16 thrd | 1   | 1    | 0      | 0           | 0           | 1   | 0    | 0       | 1    | 0       | 0    | 1    | 1    | 0   |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+

+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|   Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|            |    64 | 1.100 | 0.000 | 0.000 |    40 | 1.000 | 0.950 | 1.000 | 0.80 |    0 | 0.1000 | 5.00000 |
+============+=======+=======+=======+=======+=======+=======+-------+-------+------+------+--------+---------+
| Generation |   Ctx | Batch |  Keep | Prom. |          Seed |             Finetune | Stop |
+------------+-------+-------+-------+-------+---------------+----------------------+------+
|            |  2048 |     1 |     0 |    16 |    1686779952 |          UNSPECIFIED | #  1 |
+------------+-------+-------+-------+-------+---------------+----------------------+------+


Tell=== GRAPH ===
n_nodes = 1061

C:\falcGGML\ggllm.cpp\build\bin\Release>pause
Press any key to continue . . .

Problem with CMake on Linux Focal, CUDA

Prerequisites

Following the instructions in the README on Linux Ubuntu Focal:

rm -rf build; mkdir build; cd build
cmake -DLLAMA_CUBLAS=1 ..
cmake --build . --config Release

Expected Behavior

Simply trying to run cmake on a newly checked-out version of ggllm.cpp. I was referenced here from falcon.cpp; I know the README probably didn't get any love yet as it was moved to this fork.

-- Found CUDAToolkit: /usr/local/cuda-12.1/include (found version "12.1.105")
-- cuBLAS found
CMake Error at /usr/local/lib/python3.10/dist-packages/cmake/data/share/cmake-3.26/Modules/CMakeDetermineCompilerId.cmake:751 (message):
Compiling the CUDA compiler identification source file
"CMakeCUDACompilerId.cu" failed.
......
#$ ptxas -arch=sm_30 -m64 "tmp/CMakeCUDACompilerId.ptx" -o
"tmp/CMakeCUDACompilerId.sm_30.cubin"

ptxas fatal : Value 'sm_30' is not defined for option 'gpu-name'

--error 0xff --

...
/usr/local/lib/python3.10/dist-packages/cmake/data/share/cmake-3.26/Modules/CMakeDetermineCUDACompiler.cmake:307 (CMAKE_DETERMINE_COMPILER_ID)
CMakeLists.txt:238 (enable_language)

commit 0eb3604

$ lscpu
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper PRO 5955WX 16-Cores
Virtualization: AMD-V
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca

Something weird is going on with -ngl

-ngl 0 - offloading 0 of 60 layers to GPU, weights offloaded 0.00 MB

-ngl 1 - offloading 1 of 60 layers to GPU, weights offloaded 445.50 MB

-ngl 2 - offloading 2 of 60 layers to GPU, weights offloaded 891.00 MB

-ngl 3 -

falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 4825.00 MB  of 6050.00 MB (in use: 1224.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
INFO: Not enough VRAM to load all requested layers - at layer 59 of 60: skipping
falcon_model_load_internal: mem required  = 29683.70 MB (+  480.00 MB per state)
falcon_model_load_internal: offloading 60 of 60 layers to GPU, weights offloaded 1336.50 MB
falcon_model_load_internal: estimated VRAM usage: 4155 MB
...................................................................................................
falcon_model_load_internal: VRAM free: 3487.00 MB  of 6050.00 MB (used: 2562.00 MB)

Wait, what?

-ngl 4 -

falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 4805.00 MB  of 6050.00 MB (in use: 1244.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
INFO: Not enough VRAM to load all requested layers - at layer 58 of 60: skipping
falcon_model_load_internal: mem required  = 29683.70 MB (+  480.00 MB per state)
falcon_model_load_internal: offloading 59 of 60 layers to GPU, weights offloaded 1336.50 MB
falcon_model_load_internal: estimated VRAM usage: 4155 MB
...................................................................................................
falcon_model_load_internal: VRAM free: 3467.00 MB  of 6050.00 MB (used: 2582.00 MB)

-ngl 10 -

falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 4828.00 MB  of 6050.00 MB (in use: 1221.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
INFO: Not enough VRAM to load all requested layers - at layer 52 of 60: skipping
falcon_model_load_internal: mem required  = 29683.70 MB (+  480.00 MB per state)
falcon_model_load_internal: offloading 53 of 60 layers to GPU, weights offloaded 1336.50 MB
falcon_model_load_internal: estimated VRAM usage: 4155 MB
...................................................................................................
falcon_model_load_internal: VRAM free: 3490.00 MB  of 6050.00 MB (used: 2559.00 MB)
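
The reported layer count and the offloaded megabytes clearly disagree here (the count claims 53-60 layers while "weights offloaded" stays at 1336.50 MB). As a purely hypothetical illustration of how such a mismatch can arise, here is a sketch in which the layer counter is incremented before the VRAM check, so skipped layers still get counted. This is not the actual ggllm.cpp code:

#include <cstdio>

int main() {
    const int n_layers = 60;        // 40B model
    int vram_free_mb   = 4825;      // from the log above
    const int layer_mb = 445;       // ~445.5 MB per layer, per the -ngl 1 run
    int n_reported     = 0;         // what the "offloading X of 60" line shows
    int offloaded_mb   = 0;         // what actually landed in VRAM

    for (int i = 0; i < n_layers; i++) {
        n_reported++;               // hypothetical bug: counted before the check
        if (vram_free_mb < layer_mb) {
            printf("INFO: Not enough VRAM - at layer %d of %d: skipping\n", i, n_layers);
            continue;               // weights stay in system RAM
        }
        vram_free_mb -= layer_mb;
        offloaded_mb += layer_mb;
    }
    printf("offloading %d of %d layers to GPU, weights offloaded %d MB\n",
           n_reported, n_layers, offloaded_mb);
    return 0;
}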

Does the prompt cache work? I got an alignment assert when I turned it on.

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [Y ] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [Y ] I carefully followed the README.md.
  • [Y ] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [Y ] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Can you migrate the new server from llama.cpp to here?

In the past few days, the server-example from llama.cpp has become a really useful piece of software - so much so that for many things it could replace the main program as the primary interaction tool with a model.

How difficult will it be to make this server available for falcon as well?
I have no idea how much falcon-specific code is actually in falcon-main - shouldn't most of the specific stuff be in the libraries, especially falcon_common and libfalcon?
How much is left to do once you've changed all the external calls in server.cpp to the corresponding calls from falcon_common and libfalcon?

Random spikes of up to 30ms in ggml_cuda_op device synchronization when using a low -ngl count with dual GPU

In ggml_cuda_op() I have spikes of up to 30 ms, easily reproducible when using a very low -ngl count like 1, 2 or 3 on a large model like 40B q6_k.
This causes a quite significant slowdown of the calculations; it is two orders of magnitude more than what the operation usually takes.
The CPU operations are significantly faster than the GPU operations in those cases.

The device the tensor is on is a 4090; a second 3090 is installed.
I used -ngl 1 to reproduce it with almost every token.
I tried -ts 1,0 without any change (all tensors are on device 0)

When all works fine, the sync on result_wo takes 0.144 ms.

I debugged it down to the call of cudaDeviceSynchronize() at the end of the function.
I will continue debugging this one tomorrow.

Maybe @JohannesGaessler already has an idea of what is going on?
Also, anyone able to confirm this would be helpful.

To reproduce, just run a model like 40B q6_k (or similar) with -ngl 1 and --debug-timings 3.
In my case it shows some mat_mul spikes of 7-30 ms in almost every token generation.
-ts 1,0 had no influence (note: the tensor split is currently not working because it stops at device #1 memory_free; I was just fixing that).
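
Until the cause is found in ggml-cuda.cu itself, a standalone probe can at least show whether bare cudaDeviceSynchronize() calls on the same machine exhibit similar outliers. Below is a minimal sketch using only the CUDA runtime API; it is a hypothetical probe, not project code (build with e.g. g++ -O2 probe.cpp -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart):

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    const size_t n = 64u << 20;            // 64 MB device buffer
    void * d_buf = nullptr;
    cudaMalloc(&d_buf, n);

    for (int iter = 0; iter < 1000; iter++) {
        cudaMemsetAsync(d_buf, iter & 0xff, n);   // queue some device work
        auto t0 = std::chrono::high_resolution_clock::now();
        cudaDeviceSynchronize();                  // the call where the spikes show up
        auto t1 = std::chrono::high_resolution_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms > 5.0) {
            printf("iter %4d: sync took %.3f ms\n", iter, ms);   // log outliers only
        }
    }
    cudaFree(d_buf);
    return 0;
}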

linking error with static build

Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Your exact command line to replicate the issue

cmake ..  -DLLAMA_CUBLAS=ON -DLLAMA_STATIC=1
make

Environment and Context

  • Physical (or virtual) hardware you are using, e.g. for Linux:
    Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz

  • Operating System, e.g. for Linux:
    CentOS 7 + gcc11.3 + cuda 11.6

Steps to Reproduce

  1. cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_STATIC=1
  2. make

Failure Logs

...
Consolidate compiler generated dependencies of target quantize-stats
/usr/bin/ld: cannot find -ldl
/usr/bin/ld: attempted static link of dynamic object `/usr/lib64/librt.so'
/usr/bin/ld: cannot find -ldl
/usr/bin/ld: attempted static link of dynamic object `/usr/lib64/librt.so'
/usr/bin/ld: cannot find -ldl
/usr/bin/ld: attempted static link of dynamic object `/usr/lib64/librt.so'
/usr/bin/ld: cannot find -ldl
/usr/bin/ld: attempted static link of dynamic object `/usr/lib64/librt.so'
collect2: error: ld returned 1 exit status
make[2]: *** [bin/test-sampling] Error 1
collect2: error: ld returned 1 exit status
make[2]: *** [bin/test-quantize-perf] Error 1
collect2: error: ld returned 1 exit status
make[1]: *** [tests/CMakeFiles/test-sampling.dir/all] Error 2
make[1]: *** Waiting for unfinished jobs....
make[2]: *** [bin/falcon_quantize] Error 1
collect2: error: ld returned 1 exit status
make[1]: *** [tests/CMakeFiles/test-quantize-perf.dir/all] Error 2
make[1]: *** [examples/falcon_quantize/CMakeFiles/falcon_quantize.dir/all] Error 2
make[2]: *** [bin/test-quantize-fns] Error 1
make[1]: *** [tests/CMakeFiles/test-quantize-fns.dir/all] Error 2

When adding CMAKE_VERBOSE_MAKEFILE=1:

g++ -O3 -DNDEBUG -static CMakeFiles/quantize-stats.dir/quantize-stats.cpp.o -o ../../bin/quantize-stats   
-L/usr/local/cuda/targets/x86_64-linux/lib/stubs  -L/usr/local/cuda/targets/x86_64-linux/lib  
../../libllama.a -pthread /usr/local/cuda/lib64/libcudart_static.a -pthread -ldl 
/usr/lib64/librt.so /usr/local/cuda/lib64/libcublas_static.a /usr/local/cuda/lib64/libcublasLt_static.a 
/usr/local/cuda/lib64/libculibos.a -lcudadevrt -lcudart_static 
-lrt -lpthread -ldl 

For some reason (mainly the added "-static"), CMake adds both "/usr/lib64/librt.so" (not a static lib) and "-lrt" (which is usually acceptable even for a static build).

Performance - heads up

Just a heads up, given it's been more than a week since the last release.
I'm deep in a complete overhaul of a series of behaviors and functions.
The core focus is to increase performance significantly; that includes a lot of rework in the CUDA code, many new kernels, GPU offloading and memory management changes, and GPU operation changes.
Secondarily, I'm working on restructuring the application to be more flexible for the future.

I'm sitting on hundreds of smaller to bigger changes, and that all takes its toll on completing things in a timely manner.
So a bit of patience will be needed.

Can't falcon_convert on OpenBuddy Falcon 7B model, KeyError [fixed]

Model: OpenBuddy Falcon 7B
python falcon_convert.py openbuddy-falcon-7b-v6-bf16 openbuddy-ggllm use-f32

Error:

* Loading model from:  openbuddy-falcon-7b-v6-bf16
Vocab size: 70144
Hidden size: 4544
Number of heads: 71
Number of layers: 32
Number of head_kv: 1
Number of head_dim: 64
Traceback (most recent call last):
  File "/home/paloma/Git/ggllm.cpp/falcon_convert.py", line 111, in <module>
    text = bytearray([byte_decoder[c] for c in reverse_vocab[i]])
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/paloma/Git/ggllm.cpp/falcon_convert.py", line 111, in <listcomp>
    text = bytearray([byte_decoder[c] for c in reverse_vocab[i]])
                      ~~~~~~~~~~~~^^^
KeyError: '能'
$ ls -lh openbuddy-falcon-7b-v6-bf16/
total 13G
-rw-r--r-- 1 paloma paloma  992 Jul 18 19:11 config.json
-rw-r--r-- 1 paloma paloma 2,6K Jul 18 19:11 configuration_RW.py
-rw-r--r-- 1 paloma paloma  111 Jul 18 19:11 generation_config.json
-rw-r--r-- 1 paloma paloma  47K Jul 18 19:11 modelling_RW.py
-rw-r--r-- 1 paloma paloma 9,4G Jul 18 19:05 pytorch_model-00001-of-00002.bin
-rw-r--r-- 1 paloma paloma 3,7G Jul 18 19:05 pytorch_model-00002-of-00002.bin
-rw-r--r-- 1 paloma paloma  17K Jul 18 19:11 pytorch_model.bin.index.json
-rw-r--r-- 1 paloma paloma   28 Jul 18 19:11 README.md
-rw-r--r-- 1 paloma paloma  281 Jul 18 19:11 special_tokens_map.json
-rw-r--r-- 1 paloma paloma  180 Jul 18 19:11 tokenizer_config.json
-rw-r--r-- 1 paloma paloma 3,5M Jul 18 19:11 tokenizer.json

Operating System: Arch Linux

falcon-main.exe Exits Unexpectedly after 'Numa' Commit

My last successful build on my Windows machine came from commit 8db9e2c "useless warnings"
I compile successfully, run falcon-main.exe and then can get responses in an interactive session.
From the "Numa" commit I can still compile successfully, but after the message

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to ggLLM.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
E:\LLama\Falcon>

the program exits and I am back at the terminal command prompt.

I have completely deleted the build directory and done a clean rebuild, but the failure remains.

Steps forward - Tokenizer

I'm currently working on the tokenizer, we need a new one.

The llama tokenizer is not suitable: it has problems forming larger tokens, favors smaller ones, and does not adhere to the merge priority of BPE, using SentencePiece scores instead.

That's why progress on the roadmap has stalled a bit; without good tokenization, Falcon cannot provide good-quality results.

A couple of problems to be solved:

  1. BPE merge logic instead of scores (see the sketch after this section)
  2. the current tokenization of whitespace conflicts with BPE whitespace token merging (whitespace and multi-whitespace tokens binding to each other)
    2.1) the same problem exists with newlines: these are actual tokens and can be combined and interleaved with spaces, forming pure whitespace tokens (most likely a lot in code)
  3. the vocabulary in ggml V3 is not fit for the purpose

For good-quality Falcon flapping we need the tokenizer to be identical, or almost identical, to the training tokenization.
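
To make point 1 concrete, here is a minimal sketch of rank-based BPE merging: the adjacent pair with the lowest merge rank is always applied first, independent of any score. The merge table is a toy example, not Falcon's real vocabulary:

#include <climits>
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    // rank = position in the learned merge list; lower rank merges first
    std::map<std::pair<std::string, std::string>, int> ranks = {
        {{"f", "a"}, 0}, {{"fa", "l"}, 1}, {{"c", "o"}, 2}, {{"fal", "co"}, 3},
    };
    std::vector<std::string> sym = {"f", "a", "l", "c", "o", "n"};

    while (true) {
        int best_rank = INT_MAX, best_i = -1;
        for (size_t i = 0; i + 1 < sym.size(); i++) {
            auto it = ranks.find({sym[i], sym[i + 1]});
            if (it != ranks.end() && it->second < best_rank) {
                best_rank = it->second;
                best_i = (int) i;
            }
        }
        if (best_i < 0) break;                   // no applicable merge left
        sym[best_i] += sym[best_i + 1];          // merge the winning pair
        sym.erase(sym.begin() + best_i + 1);
    }
    for (const auto & s : sym) printf("[%s] ", s.c_str());
    printf("\n");                                // prints: [falco] [n]
    return 0;
}

The real tokenizer additionally has to handle the byte-level vocabulary and the whitespace/newline interleaving from point 2.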

Add support for AMX instructions (bf16 and/or int8)

Prerequisites

  • [X ] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [X ] I carefully followed the README.md.
  • [X ] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [X ] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Your exact command line to replicate the issue

./falcon_main_avx512 -m falcon-40b-instruct.ggccv1.q4_k.bin -f g1.txt -n 100 --top-k 1
...
| Syst. Info | AVX | AVX2 | AVX512 | AVX512_VBMI | AVX512_VNNI | FMA | NEON | ARM_FMA | F16C | FP16_VA | SIMD | BLAS | SSE3 | VSX |
...

Environment and Context

  • Physical (or virtual) hardware you are using, e.g. for Linux:
    $ lscpu
    Architecture:        x86_64
    CPU op-mode(s):      32-bit, 64-bit
    Address sizes:       46 bits physical, 48 bits virtual
    Byte Order:          Little Endian
    CPU(s):              8
    On-line CPU(s) list: 0-7
    Vendor ID:           GenuineIntel
    Model name:          Intel(R) Xeon(R) Platinum 8488C
    CPU family:          6
    Model:               143
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            8
    BogoMIPS:            4800.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd ida arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid cldemote movdiri movdir64b md_clear serialize amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities

  • Operating System, e.g. for Linux:
    fedora37
    $ uname -a Linux 6.1.9-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb 2 00:21:48 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Steps to Reproduce

  1. create a m7i instance on aws: https://aws.amazon.com/ec2/instance-types/m7i/
  2. run ggml
  3. see no use of AMX instructions: https://www.intel.com/content/dam/develop/external/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf

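As a starting point, runtime detection of AMX is straightforward. Below is a minimal sketch that checks the AMX feature bits via CPUID leaf 7, sub-leaf 0 (bit positions per the Intel reference linked above; GCC/Clang on x86_64 only):

#include <cstdio>
#include <cpuid.h>   // GCC/Clang builtin CPUID helpers

int main() {
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    // CPUID leaf 7, sub-leaf 0: structured extended feature flags
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        printf("CPUID leaf 7 not supported\n");
        return 1;
    }
    printf("AMX_BF16: %u\n", (edx >> 22) & 1);   // EDX bit 22
    printf("AMX_TILE: %u\n", (edx >> 24) & 1);   // EDX bit 24
    printf("AMX_INT8: %u\n", (edx >> 25) & 1);   // EDX bit 25
    return 0;
}

Note that on Linux, actually using the tile registers additionally requires requesting XTILEDATA permission from the kernel via arch_prctl before first use, which a real implementation would have to handle.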

Unable to make falcon_main

I am trying to build this on Ubuntu 20.04.6 LTS. When I run make falcon_main I get the following error:

examples/falcon/falcon_main.cpp: In function ‘int main(int, char**)’:
examples/falcon/falcon_main.cpp:402:17: error: ‘ggml_cuda_pool_free_all’ was not declared in this scope
402 | ggml_cuda_pool_free_all(-1);
| ^~~~~~~~~~~~~~~~~~~~~~~
examples/falcon/falcon_main.cpp:418:17: error: ‘ggml_cuda_pool_free_all’ was not declared in this scope
418 | ggml_cuda_pool_free_all(-1);
| ^~~~~~~~~~~~~~~~~~~~~~~

Do you know what I am doing wrong?
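
Presumably ggml_cuda_pool_free_all is a CUDA-only symbol (it lives in the CUDA backend), so a plain make without LLAMA_CUBLAS=1 has nothing to declare it. Below is a hypothetical sketch of the guard such calls normally need; it assumes the symbol is only declared in CUDA builds and is not the actual falcon_main.cpp code:

#ifdef GGML_USE_CUBLAS
#include "ggml-cuda.h"   // declares ggml_cuda_pool_free_all in CUDA builds
#endif

static void cleanup_gpu_pools(void) {
#ifdef GGML_USE_CUBLAS
    ggml_cuda_pool_free_all(-1);   // only exists in CUDA builds
#else
    // nothing to do in CPU-only builds
#endif
}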

Apple Silicon Unable To Build

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [ X] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [ X] I carefully followed the README.md.
  • [ X] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [X ] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Your exact command line to replicate the issue

I tried two different commands:

rm -rf build; mkdir build; cd build
cmake -DLLAMA_CUBLAS=0 ..
cmake --build . --config Release

and

export LLAMA_CUBLAS=0;
make falcon_main falcon_quantize falcon_perplexity

Environment and Context

$ make --version
GNU Make 3.81
Copyright (C) 2006  Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

This program built for i386-apple-darwin11.3.0
$ cmake --version
cmake version 3.26.3

CMake suite maintained and supported by Kitware (kitware.com/cmake).
$ /usr/bin/xcodebuild -version
Xcode 14.3.1
Build version 14E300c
$ sw_vers
ProductName:		macOS
ProductVersion:		13.4.1
BuildVersion:		22F82

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except under certain specific conditions.

  • Physical (or virtual) hardware you are using, e.g. for Linux:
    MacBook Pro 13" with M1 Max and 64 GB of RAM.

$ lscpu
N/A

  • Operating System, e.g. for Linux:
    MacOS 13.4.1

Steps to Reproduce

I'm unable to build this project on my Apple Silicon Mac. llama.cpp builds without issues.

I tried building with CMake and Make; the commands and their outputs are provided below:

CMake

rm -rf build; mkdir build; cd build
cmake -DLLAMA_CUBLAS=0 ..
cmake --build . --config Release

Output:

[  2%] Built target BUILD_INFO
[  4%] Building C object CMakeFiles/ggml.dir/ggml.c.o
/Users/xxxxxx/Developer/ggllm.cpp/ggml.c:11037:37: error: use of undeclared identifier 'ne10'
                        ne11, ne01, ne10,
                                    ^
/Users/xxxxxx/Developer/ggllm.cpp/ggml.c:11038:37: error: use of undeclared identifier 'ne10'
                        1.0f,    y, ne10,
                                    ^
/Users/xxxxxx/Developer/ggllm.cpp/ggml.c:18287:13: warning: 'snprintf' will always overflow; destination buffer has size 128, but size argument is 512 [-Wbuiltin-memcpy-chk-size]
            snprintf(str_device_info, 512, "[CPU]");
            ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.3.sdk/usr/include/secure/_stdio.h:57:3: note: expanded from macro 'snprintf'
  __builtin___snprintf_chk (str, len, 0, __darwin_obsz(str), __VA_ARGS__)
  ^
/Users/xxxxxx/Developer/ggllm.cpp/ggml.c:19609:79: warning: format specifies type 'long long' but the argument has type 'size_t' (aka 'unsigned long') [-Wformat]
            pos += snprintf(strides + pos, sizeof(strides) - pos, "%" PRId64, tensor->nb[i]);
                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX13.3.sdk/usr/include/secure/_stdio.h:57:62: note: expanded from macro 'snprintf'
  __builtin___snprintf_chk (str, len, 0, __darwin_obsz(str), __VA_ARGS__)
                                                             ^~~~~~~~~~~
/Users/xxxxxx/Developer/ggllm.cpp/ggml.c:19708:68: warning: format specifies type 'long' but the argument has type 'int64_t' (aka 'long long') [-Wformat]
            printf("Index: %d, Dimension size: %ld\n", indices[i], tensor->ne[i]);
                                               ~~~                 ^~~~~~~~~~~~~
                                               %lld
3 warnings and 2 errors generated.
make[2]: *** [CMakeFiles/ggml.dir/ggml.c.o] Error 1
make[1]: *** [CMakeFiles/ggml.dir/all] Error 2
make: *** [all] Error 2

Make

export LLAMA_CUBLAS=0;
make falcon_main falcon_quantize falcon_perplexity

Output:

I ggllm.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -DGGML_PERF=1 -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -DGGML_PERF=1 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include
I LDFLAGS:   -framework Accelerate -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -DGGML_PERF=1 -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -DGGML_USE_K_QUANTS -DGGML_USE_ACCELERATE -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include   -c ggml.c -o ggml.o
ggml.c:4262:5: error: call to undeclared function 'ggml_init_cublas'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    ggml_init_cublas(false);
    ^
ggml.c:4319:14: error: call to undeclared function 'ggml_init_cublas'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
        if (!ggml_init_cublas(true))
             ^
ggml.c:4322:13: error: call to undeclared function 'ggml_cuda_update_gpu_status'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
            ggml_cuda_update_gpu_status(-1);
            ^
ggml.c:11037:37: error: use of undeclared identifier 'ne10'
                        ne11, ne01, ne10,
                                    ^
ggml.c:11038:37: error: use of undeclared identifier 'ne10'
                        1.0f,    y, ne10,
                                    ^
ggml.c:15751:21: error: call to undeclared function 'ggml_cuda_compute_forward'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    bool skip_cpu = ggml_cuda_compute_forward(params, tensor);
                    ^
ggml.c:15751:21: note: did you mean 'ggml_compute_forward'?
ggml.c:15747:13: note: 'ggml_compute_forward' declared here
static void ggml_compute_forward(struct ggml_compute_params * params, struct ggml_tensor * tensor) {
            ^
ggml.c:17382:29: error: call to undeclared function 'ggml_cuda_can_mul_mat'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
                        if (ggml_cuda_can_mul_mat(node->src0, node->src1, node)) {
                            ^
ggml.c:17382:29: note: did you mean 'ggml_can_mul_mat'?
ggml.c:4132:20: note: 'ggml_can_mul_mat' declared here
static inline bool ggml_can_mul_mat(const struct ggml_tensor * t0, const struct ggml_tensor * t1) {
                   ^
ggml.c:18287:13: warning: 'snprintf' will always overflow; destination buffer has size 128, but size argument is 512 [-Wbuiltin-memcpy-chk-size]
            snprintf(str_device_info, 512, "[CPU]");
            ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/secure/_stdio.h:57:3: note: expanded from macro 'snprintf'
  __builtin___snprintf_chk (str, len, 0, __darwin_obsz(str), __VA_ARGS__)
  ^
ggml.c:19609:79: warning: format specifies type 'long long' but the argument has type 'size_t' (aka 'unsigned long') [-Wformat]
            pos += snprintf(strides + pos, sizeof(strides) - pos, "%" PRId64, tensor->nb[i]);
                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/secure/_stdio.h:57:62: note: expanded from macro 'snprintf'
  __builtin___snprintf_chk (str, len, 0, __darwin_obsz(str), __VA_ARGS__)
                                                             ^~~~~~~~~~~
ggml.c:19708:68: warning: format specifies type 'long' but the argument has type 'int64_t' (aka 'long long') [-Wformat]
            printf("Index: %d, Dimension size: %ld\n", indices[i], tensor->ne[i]);
                                               ~~~                 ^~~~~~~~~~~~~
                                               %lld
3 warnings and 7 errors generated.
make: *** [ggml.o] Error 1

Metal support

ggml and llama.cpp support Metal; do Apple Silicon users need to use llama.cpp, or can they use ggllm.cpp with Falcon?

Upcoming PR - Pushing the Context limit to 8k+ for all existing Falcon models - Longrange Falcon flights

I plan to PR today, though it depends on final progress.
Computation is currently slow because we have no mulmat kernel with interleaving broadcast support yet, so tests are time consuming.
Falcon has twice the vocabulary of llama; in practice that means Falcon naturally has a performance benefit of 30-40% on English text and about 20-25% on code and foreign languages.
This also means that 50 tokens/sec Falcon speed is about as fast as 70 tokens/sec on llama in terms of language throughput.
So an 8k context window on Falcon is equivalent to ~12k context on llama.
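(Worked out under that assumption: if each Falcon token carries roughly 1.4x the text of a llama token, then 50 tok/s x 1.4 = 70 tok/s of llama-equivalent throughput, and 8192 tokens x 1.4 ≈ 11.5k, i.e. roughly the ~12k llama-equivalent context above.)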

The task: pre-processing a large input such as a book chapter, complex code, a tutorial, or a transcription of a meeting.
I then want to be able to interview Falcon about this huge text, to work with it, extend it, or transform it.

For the current work I copied the entire falcon_eval_internal() function from the current libfalcon.cpp (20 kB of source code, almost exactly 7k Falcon tokens). The question asked is "<|prompter|>Write a summary of 10 sentences covering everything this function does<|endoftext|><|assistant|>"

I'm processing this on a high-quality quantization: the 40B Q5_K (OpenAssistant).

Default
Normal Falcon result on the above question and libfalcon.cpp input:

" when, as. and for are a to:, the by , use for a and on that: a,. for it and in, this you from is. for ,,.
.' of.рен if( you they with,"

What is going on? If we look below the surface of how the model understands text, the most essential part of the relationship between tokens is the positional encoding done through "RoPE". It sounds super complicated, but all it really is is a 2D rotation of each token embedding based on its position in the total context.
This rotation of one embedding, visualized:
[image: rotation of one embedding]
This is how the model was trained to understand relationships between tokens and sequences within a 2048-token context. I am not entirely sure why this quite tight rotation is being used; I assume (hope) someone mathed those parameters out.

Beyond that 2048-token context, the model quite quickly stops computing proper attention; at 7k context it's completely braindead.

But by adapting the angle of rotation we can push it back into reality.
For example, 8k context with a fixed, scaled rotation angle:
[image: the same rotation with a scaled angle, covering 8k positions]

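To make the scaling idea concrete, here is a hedged sketch of linearly scaled RoPE (an assumption: plain position interpolation, the simplest of the variants being tested; the names are illustrative, not the code from the PR). Each pair of dimensions (i, i+1) of a head is rotated by theta = scaled_pos * base^(-i/head_dim):

#include <math.h>

// Rotate one head of an embedding in place. scale < 1 squeezes e.g. 8192
// positions back into the trained 2048 range: scale = 2048.0f / 8192.0f.
static void rope_scaled(float * x, int head_dim, int pos, float scale) {
    const float base = 10000.0f;
    const float p = (float) pos * scale;        // interpolated position
    for (int i = 0; i < head_dim; i += 2) {
        const float theta = p * powf(base, -(float) i / head_dim);
        const float c = cosf(theta);
        const float s = sinf(theta);
        const float x0 = x[i];
        const float x1 = x[i + 1];
        x[i]     = x0 * c - x1 * s;             // plain 2D rotation of the pair
        x[i + 1] = x0 * s + x1 * c;
    }
}

With scale = 1.0f this is ordinary RoPE; with scale = 0.25f the 8k window above is mapped onto the angles the model saw during training.
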
The model output now:
"<|prompter|>Write a summary of 10 sentences covering everything this function does<|endoftext|><|assistant|>"

Sure, here's a summary of the function:

  1. Initialize the context struct falcon with various parameters, such as the model, input tensor, output tensor, etc.
  2. Check if broadcasting is enabled and whether to use it. If so, set up g_repeat accordingly for the first 3 dimensions.
  3. Load input tensor from disk into a tensor using gml_tensor_from_file() or create an empty tensor with gml_zeros().
  4. Create output tensor for embedding using gml_tensor_new_d() and initialize it with zeros if necessary.
  5. Initialize the current layer to zero and set its input to the previous one, i.e., self.layer = -1.
  6. Loop over each attention head in a sequence of length n_head:
    • Load token tensor from disk into a tensor using gml_tensor_from_file() or create an empty tensor with gml_zeros().
    • Normalize the token tensor using gnorm() to get the embeddings, and store it as self.data.
    • Compute scores for every token in sequence using llama_forward() and store them in self.scores.
    • Repeat the above steps for each token until end of sequence is reached.
  7. Store the scores for all tokens in a matrix with gml_tensor_mul().
  8. Normalize the matrix by dividing it by the sum of squares of squares, add one to avoid division by zero.
  9. Softmax the matrix and store result as self.data. This is the token representation.
  10. If embedding is enabled, load embeddings from the model.<|endoftext|>

Here is another variant:

Sure, here's a summary of what this function does:

The function performs the self-attention operation for one step in the transformer model. It takes in the input embeddings from the previous layer (inpL), the current token mask, and the query vectors for each head (Q) and computes the attention weight matrix (K). The attention weight matrix is used to compute the weighted sum of the embeddings from the previous step, scaled by sqrt(n_embd/head), and then softmaxed. This result is then multiplied with the value vector V to produce the updated embeddings for the current token, which are stored in KV. The attention is applied over the entire input sequence (N) and the final output is added to the value for each head. If broad casting is enabled, the attention weight matrix is broadcasted across all heads.
The function then takes in the memory from the previous layer (LL), rotates it by shifting the rows by (n_past-1)*head_dims and adds it to the current token embeddings to create the final output LLK. This is stored in cur and returned as the result of the function. If LLK is empty, it returns a single value from K for the first token.
The function then takes in the past tokens (pL), the attention mask, and the query vectors for each head (Q) and computes the attention weight matrix (KV). The attention weight matrix is used to compute the weighted sum of embeddings from the previous step, scaled by sqrt(n_embd/head), and then softmaxed. This result is added to the value for each token, which are stored in Q. The past tokens are also rotated by (n_past-1)*head_dims and added to K for the first token. If broad casting is enabled, the weight matrix is broadcast across all heads.
The function then takes in the attention output LLK from the previous layer, adds it to the result V, and adds the memory from LLK-1 to produce the final output LLK_merged. This is stored in cur and returned as the result of the function.

This is WIP. I currently have a bunch of different variants running that all perform a bit differently.
The amount of hallucination is striking.

The benchmark is the best OpenAI currently has to offer; of course they not only have good parameters but were also fine-tuned for this purpose. Fine-tuning is something we can do once the Falcon large-context parameters are chosen.

Turbo-16k

This function evaluates the Falcon neural network model on a given sequence of tokens.

First, it initializes various variables and parameters related to the model and the context.
Then, it creates a ggml_context and ggml_cgraph to perform the computation.
Next, it processes each layer of the model using a loop. For each layer, it performs self-attention, applies a linear transformation, and computes the output.
After all the layers have been processed, the function applies batch normalization and computes the embeddings.
Then, it applies the language modeling head to the embeddings.
Finally, it computes the logits and updates the kv token count and storage buffer.
The function also performs optional offloading to the GPU and prints timing information for each ggml operation.
It returns the logits and embeddings as output.

GPT4 at 8k:

The function falcon_eval_internal is a critical piece of the Falcon Natural Language Processing model. It processes tokens through the trained Falcon model for evaluation and returns the generated logits. The function begins by initializing various values and parameters based on the inputs and the model configurations. It then sets up a computational graph and loops through all the layers in the model. In each layer iteration, the function performs a series of mathematical operations, such as matrix multiplications, rearrangements, and normalizations, to perform self-attention on the provided tokens. For each layer, the function also manages memory buffers and uses API calls to calculate embeddings and language model heads. After the layer loop, normalization is performed on the input layer, and the language model head is computed. Before final logits can be returned, the function checks if all versus only last token logits are required and manages memory accordingly. The function concludes by measuring and tracking the time taken for execution.

Overall, Turbo as well as GPT4 provide a definitely better roundup, especially regarding hallucinations, though not super convincing in all cases either; that is also caused by the code being above the understanding of any LLM today.

K Quant 64 support - quite a feat to integrate

A large patch was just integrated into llama.cpp (ggerganov#2001), another stunning job by @ikawrakow.

In the long run we need it: K quants are better for 7B and have more flexibility, but two obstacles need to be solved:

  1. We need to modify that PR so it's not a compiler switch anymore; it needs to support both the 256 and the 64 super-block sizes,
    either by splitting and duplicating the code or by using a global variable instead of the define (a minimal sketch follows below).
    Otherwise we'd need distinctly compiled binaries for 7B and 40B.
  2. These are 32-bit dequantizers; we use 16 bit for cuBLAS to save 50% VRAM.
    It's not a huge thing to change, but it doubles the kernels (again) and I'm a bit afraid of maintaining so many of them.
    Maybe instead of duplicating all kernels from 32 to 16 bit it would be possible to write a wrapper: let the kernels work in 32 bit and convert the result to half precision. Given the parallelization, that wouldn't require much VRAM.

I'm a bit afraid of investing hours integrating such custom variants in case another big push comes from upstream.
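
For illustration only, a minimal sketch of the global-variable idea from point 1. Every name here is hypothetical; the real k-quant block structs size their arrays from QK_K at compile time, so in practice this likely means compiling both layouts side by side and dispatching between them at model load:

#include <cstddef>

// Hypothetical: both super-block layouts compiled in, a runtime flag picks one.
struct block_q4_K_256 { /* ... fields for QK_K = 256 ... */ };
struct block_q4_K_64  { /* ... fields for QK_K = 64  ... */ };

static int g_qk_k = 256;            // 256 for 40B, 64 for 7B-style dimensions

void k_quants_set_super_block(int qk_k) {
    g_qk_k = qk_k;                  // decided from the model header, not at compile time
}

size_t q4_k_row_size(size_t n_weights) {
    const size_t block_bytes = (g_qk_k == 256)
        ? sizeof(block_q4_K_256)
        : sizeof(block_q4_K_64);
    return (n_weights / (size_t) g_qk_k) * block_bytes;
}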

Falcon quantization lets you choose q4_k_m, etc but you just get q4_k

In llama.cpp there's logic to choose the quantization based on the type of layer and other heuristics, and that controls which layers are q4_k (as an example). It doesn't just quantize everything as q4_k, and that's the difference between q4_k_m, q4_k_l and q4_k_s.

libfalcon's quantize does let you choose the different qN_k types (and will set the file type appropriately), but it just blindly quantizes every layer as q4_k, q3_k etc. (except the 1d tensors, which are always left 32-bit, same as llama.cpp).
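
For contrast, a rough sketch of the kind of per-tensor mixing llama.cpp applies (simplified from memory; the tensor names follow llama.cpp's conventions, and the real heuristics also consider layer index and model size):

#include <string>

enum qtype { Q4_K, Q5_K, Q6_K };

// The "_M" mixes spend extra bits on the most sensitive tensors and fall
// back to the requested base type everywhere else.
static qtype pick_type(const std::string & name, qtype requested) {
    if (requested == Q4_K &&
        (name.find("attention.wv")    != std::string::npos ||
         name.find("feed_forward.w2") != std::string::npos)) {
        return Q6_K;
    }
    return requested;
}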

Just putting this here as a note and so @TheBloke knows he can save some time/bandwidth by only making one qN_k quantization for the time being.
