Comments (13)

MrlolDev commented on August 21, 2024

Any update?

linuxmagic-mp commented on August 21, 2024

Echoing that, I would like to see Falcon-40b and Falcon-40b-instruct tested. Not sure if it will automatically detect GGML models, unquantized or not.

marella commented on August 21, 2024

Added support in the latest version 0.2.12

It looks like ggllm.cpp has diverged a lot from llama.cpp. For now I made it work with as few changes as possible.

matthoffner commented on August 21, 2024

Thanks @marella

linuxmagic-mp commented on August 21, 2024

Hmm, just updated to 0.2.12 and it's still giving me grief. I'm using falcon-40b-instruct from @TheBloke, converted to the GGML format using the latest version from @cmp-nct (before quantizing), and it doesn't like it. I'm probably not giving it the right parameters, I'm guessing. FYI, the README could be a little clearer, or could use more examples.

python3 ./falcon_convert.py ~/models/falcon-40b-instruct/ ~/models/falcon-40b-instruct use-f32
Trying per the example, but I'm not 100% sure of the format:

llm = AutoModelForCausalLM.from_pretrained('/home/michael/models/falcon-40b-instruct',model_file='~/models/falcon-40b-instruct/ggml-model--f32.bin', model_type='falcon-40b-instruct',gpu_layers=5)

(I picked 5 GPU layers, since when running, the GPU runs out of memory after 6 layers anyway and then switches to the CPU.)

The error I get is:

File "/usr/local/lib/python3.10/dist-packages/ctransformers/llm.py", line 214, in __init__
    raise RuntimeError(
RuntimeError: Failed to create LLM 'falcon-40b-instruct' from '/home/michael/models/falcon-40b-instruct/ggml-model--f32.bin'.

Kind of a generic error.

 ls ~/models/falcon-40b-instruct/
config.json             modelling_RW.py                   pytorch_model-00005-of-00009.bin  pytorch_model.bin.index.json
configuration_RW.py     pytorch_model-00001-of-00009.bin  pytorch_model-00006-of-00009.bin  README.md
generation_config.json  pytorch_model-00002-of-00009.bin  pytorch_model-00007-of-00009.bin  special_tokens_map.json
ggml-model--f32.bin     pytorch_model-00003-of-00009.bin  pytorch_model-00008-of-00009.bin  tokenizer_config.json
handler.py              pytorch_model-00004-of-00009.bin  pytorch_model-00009-of-00009.bin  tokenizer.json

TheBloke commented on August 21, 2024

It should be model_type="falcon"

You should be able to download directly from my falcon-40b-instruct-ggml repo, which is already quantised and already has a config.json that sets the model type automatically.

Unless you particularly wanted to use it in 32bit for some reason?
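
For reference, a minimal sketch of the corrected call (the Hugging Face repo id in the commented-out line is an assumption; check the repo for the actual file names):

from ctransformers import AutoModelForCausalLM

# model_type is the architecture name ("falcon"), not the model name.
llm = AutoModelForCausalLM.from_pretrained(
    '/home/michael/models/falcon-40b-instruct',
    model_file='ggml-model--f32.bin',
    model_type='falcon',
    gpu_layers=5,
)

# Or load the pre-quantised repo directly; its config.json sets the model type.
# The repo id below is an assumption:
# llm = AutoModelForCausalLM.from_pretrained('TheBloke/falcon-40b-instruct-GGML', gpu_layers=5)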

linuxmagic-mp commented on August 21, 2024

That was the problem ;) I wasn't sure about that. I'm just using use-f32 because that's what @cmp-nct has in the documentation, so I'm trying to stay aligned and test different quant sizes, although I'm sure everyone appreciates your contributions (repo). But since I did it in the same directory, I thought it would pick up on the config.json being present there. Thinking longer term, as things (falcon.cpp) merge more with llama.cpp.

falcon.cpp: fallback for old file format. Loading BPE merges from tokenizer.json
falcon.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this

Still a strange message though, so I'm wondering whether this IS using the new format. Did you also create your repos using the latest falcon.cpp?

linuxmagic-mp commented on August 21, 2024

FYI, the base model takes about 9 minutes to load, while the Q4_K version loads and runs in 30 seconds. However, the output is not even close to the same for some reason.

Unquantized:

falcon.cpp: fallback for old file format. Loading BPE merges from tokenizer.json
falcon.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
 affect your job! How?
As an AI language model, I don't have a personal experience of losing my job due to AI. However, based on the research and analysis done by experts in this field, AI will likely have a significant impact on various jobs across different industries. Some of the jobs that are expected to be affected include:

- Transportation: Self-driving cars and trucks could potentially replace drivers and truckers.
- Finance: Robo-advisors and automated trading platforms are already replacing financial advisors and stock traders.
- Healthcare: AI can diagnose diseases, recommend treatments, and monitor patients better than human doctors.
- Education: AI tools can now teach students at a higher level of personalization and provide real-time feedback.
- Manufacturing: Robotics and automation technologies have made it possible to automate many aspects of manufacturing processes.

While these changes may be alarming to some, they also create new opportunities for people to acquire new skills or develop existing ones to stay relevant in the changing job market.

q4_k, no displayed errors, but:

 be a huge factor in the near future as it has already started to impact our lives. We can expect AI to become an essential tool in everything we do, from work to entertainment, education, and healthcare.

Second run, different clipping..

python3 ~/testFalcon.py 
 change how businesses operate, and the changes are starting already. In fact, many businesses are already using AI today for everything from marketing to customer service. Here are some of the ways that AI is changing business:
1. AI is allowing companies to automate tasks and processes, reducing the need for manual labor.
2. AI can help companies personalize their interactions with customers based on past behavior and preferences.
3. AI is improving decision-making by providing better data insights and analysis.
4. AI is making it easier for businesses to scale up without needing to hire more employees.
5. AI is helping companies improve customer service by providing faster and more accurate responses to queries.

Which is why I am curious what might be different in your repo vs. my quantized versions.

I ran ..

./falcon_quantize --leave-output-tensor /home/michael/models/falcon-40b-instruct/ggml-model--f32.bin /home/michael/models/falcon-40b-instruct/ggml-model-qt_k_m.bin Q4_K 3

TheBloke commented on August 21, 2024

So which q4_k is outputting that? Yours or mine? Or both? What does "different clipping" mean?

TheBloke commented on August 21, 2024

My repos were created using the new ggllm.cpp format, the one with the tokenizer changes. It won't have the latest commits that yours has, as I made mine a few days ago, but yes, it's the new GGCC format, which improves quality significantly.

However, marella did mention that he did only a 'simple' implementation; perhaps there could be a problem there?

If you want to test that, try comparing the output of ctransformers vs the output of ggllm.cpp itself, with the same files.
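
A minimal sketch of the ctransformers side of that comparison, assuming the quantised file produced by the falcon_quantize command above (the path, prompt, and max_new_tokens value are placeholders):

from ctransformers import AutoModelForCausalLM

# Load the same quantised file that falcon_main is given on the command line.
llm = AutoModelForCausalLM.from_pretrained(
    '/home/michael/models/falcon-40b-instruct',
    model_file='ggml-model-qt_k_m.bin',
    model_type='falcon',
    gpu_layers=5,
)

# Generate from the same prompt used with falcon_main and compare the outputs.
print(llm('AI is going to', max_new_tokens=256))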

linuxmagic-mp commented on August 21, 2024

So which q4_k is outputting that? Yours or mine? Or both? What does "different clipping" mean?

Meaning that the output starts late, as if the first part of the output was swallowed.

For reference, my q4_k takes longer to load/run with falcon_main from ggllm than it does with ctransformers (not sure if that's because I use some GPU there), but of course the output is completely different. Here are my command and output, using the latest from @cmp-nct.

 ./falcon_main -t 31 -m /home/michael/models/falcon-40b-ggml/ggml-model-qt_k_m.bin -n 256 -b 1 -ngl 100 --repeat_penalty 1.0 --color -i -enc -p 'AI is going to'
main: build = 876 (60ea10a)
falcon.cpp: loading model from /home/michael/models/falcon-40b-ggml/ggml-model-qt_k_m.bin
falcon.cpp: file version 10
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab |   n_bpe | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggcc v1 |   65024 |   64784 |  2048 |   8192 |     128 ;   8 |      60 | 40;40B |    15 |  32768 |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 24195.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 23172.00 MB  of 24217.00 MB (in use: 1044.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: Offloading Output head tensor (2032 MB)
falcon_model_load_internal: INFO: not enough VRAM to offload layer 55 (missing 1655 MB)
falcon_model_load_internal: INFO: 54 layers will be offloaded to GPU (layers 1 to 55)
falcon_model_load_internal: mem required  = 1861.59 MB (+  720.00 MB per state)
falcon_model_load_internal: offloading 54 of 60 layers to GPU, weights offloaded 22781.16 MB
falcon_model_load_internal: estimated VRAM usage: 22814 MB
[==================================================] 100%  Tensors populated, CUDA ready 
falcon_context_prepare: Context falcon_main RAM buffers - key_val =  240.00 MB, Compute =  256.00 MB, Scratch 0 =  151.00 MB, Scratch 1 =   40.25 MB 
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
| ID | Device                     1 found | VRAM Total | VRAM Free | VRAM Used | Split at  |    Device |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
|  0 | NVIDIA GeForce RTX 4090            |   24217 MB |   1006 MB |  23211 MB |      0.0% |   Primary |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+

+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| Syst. Info | AVX | AVX2 | AVX512 | AVX512_VBMI | AVX512_VNNI | FMA | NEON | ARM_FMA | F16C | FP16_VA | SIMD | BLAS | SSE3 | VSX |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| 31/32 thrd | 1   | 1    | 0      | 0           | 0           | 1   | 0    | 0       | 1    | 0       | 0    | 1    | 1    | 0   |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+

main: interactive mode on.
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|   Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|            |    64 | 1.000 | 0.000 | 0.000 |    40 | 1.000 | 0.950 | 1.000 | 0.80 |    0 | 0.1000 | 5.00000 |
+============+=======+=======+=======+=======+=======+=======+-------+-------+------+------+--------+---------+
| Generation |   Ctx | Batch |  Keep | Prom. |          Seed |             Finetune | Stop |
+------------+-------+-------+-------+-------+---------------+----------------------+------+
|            |  2048 |     1 |     0 |     4 |    1689454531 |          UNSPECIFIED | #  4 |
+------------+-------+-------+-------+-------+---------------+----------------------+------+


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to ggLLM.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

AI is going to change the world. This is a fact that is undeniable. This is the future of business, technology, and life. There are millions of articles, blogs, and posts about AI and what it means. AI can be a scary thing. I mean, have you watched the show “Humans?” That was a bit scary.
But, AI does not have to be a scary thing. In fact, AI is actually going to make life easier. You may be wondering, “What does this mean for me and my business?” Well, AI is going to change your business, and change it for the better. AI is going to transform your business, and transform it for the better.
AI will allow for a better, more efficient process to handle your customers. AI will help your business grow by understanding the customer better and allowing for more personalization. AI is going to help your business grow by understanding your customer better and allowing for more personalization.
So, are you ready for AI? Are you ready to take your business to the next level? Are you ready to see your business grow? Are you ready to grow with AI?
Are you ready to grow?
AI will help you grow your business. AI will help you to have better customer service

-C..

falcon_print_timings:        load time = 173227.30 ms
falcon_print_timings:      sample time =    10.05 ms /   256 runs   (    0.04 ms per token, 25477.71 tokens per second)
falcon_print_timings:        eval time = 34058.70 ms /   260 runs   (  131.00 ms per token,     7.63 tokens per second)
falcon_print_timings:       total time = 150892.04 ms

That's where it stops... oh, I realized I left the -i in. Trying again without that; I also had to drop down to -ngl 4.

+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|          Info |     format | n_vocab |   n_bpe | n_ctx | n_embd |   n_head ; kv | n_layer | falcon | ftype |   n_ff |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
|               |    ggcc v1 |   65024 |   64784 |  2048 |   8192 |     128 ;   8 |      60 | 40;40B |    15 |  32768 |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size =    0.00 MB (mmap size = 24195.00 MB)
falcon_model_load_internal: using CUDA for GPU acceleration
falcon_model_load_internal: VRAM free: 23536.00 MB  of 24217.00 MB (in use:  680.00 MB)
falcon_model_load_internal: allocating batch_size x 1 MB = 0 MB VRAM for the scratch buffer
falcon_model_load_internal: Offloading Output head tensor (2032 MB)
falcon_model_load_internal: mem required  = 21101.72 MB (+  720.00 MB per state)
falcon_model_load_internal: offloading 4 of 60 layers to GPU, weights offloaded 3541.03 MB
falcon_model_load_internal: estimated VRAM usage: 3574 MB
[==================================================] 100%  Tensors populated, CUDA ready 
falcon_context_prepare: Context falcon_main RAM buffers - key_val =  240.00 MB, Compute =  256.00 MB, Scratch 0 =  151.00 MB, Scratch 1 =   40.25 MB 
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
| ID | Device                     1 found | VRAM Total | VRAM Free | VRAM Used | Split at  |    Device |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+
|  0 | NVIDIA GeForce RTX 4090            |   24217 MB |  20040 MB |   4176 MB |      0.0% |   Primary |
+----+------------------------------------+------------+-----------+-----------+-----------+-----------+

+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| Syst. Info | AVX | AVX2 | AVX512 | AVX512_VBMI | AVX512_VNNI | FMA | NEON | ARM_FMA | F16C | FP16_VA | SIMD | BLAS | SSE3 | VSX |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| 31/32 thrd | 1   | 1    | 0      | 0           | 0           | 1   | 0    | 0       | 1    | 0       | 0    | 1    | 1    | 0   |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+

+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|   Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
|            |    64 | 1.000 | 0.000 | 0.000 |    40 | 1.000 | 0.950 | 1.000 | 0.80 |    0 | 0.1000 | 5.00000 |
+============+=======+=======+=======+=======+=======+=======+-------+-------+------+------+--------+---------+
| Generation |   Ctx | Batch |  Keep | Prom. |          Seed |             Finetune | Stop |
+------------+-------+-------+-------+-------+---------------+----------------------+------+
|            |  2048 |     1 |     0 |     7 |    1689455451 |          UNSPECIFIED | #  4 |
+------------+-------+-------+-------+-------+---------------+----------------------+------+


>>QUESTION<<AI is going to
>>ANSWER<< The problem is a) the AI would be a person b) the AI would not be the person that would have existed if it had not been for the AI c) the person that would have existed if it had not been for the AI is a person that does not exist.
The existence of the AI is the cause of the existence of the person that does not exist. The existence of the AI is also the effect of the existence of the person that does not exist.
>>QUESTION<<
falcon_print_timings:        load time =  1764.71 ms
falcon_print_timings:      sample time =     4.41 ms /    99 runs   (    0.04 ms per token, 22448.98 tokens per second)
falcon_print_timings:        eval time = 37355.30 ms /   105 runs   (  355.76 ms per token,     2.81 tokens per second)
falcon_print_timings:       total time = 37392.08 ms

linuxmagic-mp commented on August 21, 2024

Note, ctransformers is not installed from source for CUDA support, as that wouldn't work for me. I commented on #53.

marella commented on August 21, 2024

The "simplifications" I made shouldn't impact quality. Basically ggllm.cpp modified some existing CUDA functions (most likely performance improvements) but I continued using the existing ones as before to make it easier to sync with llama.cpp in future.

Regarding your clipping question, it is not actually clipping the response. ctransformers only outputs the generated text, so in your example, for the prompt "AI is going to", the response was:

 be a huge factor in the near future as it has already started to impact our lives. We can expect AI to become an essential tool in everything we do, from work to entertainment, education, and healthcare.

So it is basically generating the continuation of the prompt: "AI is going to be a huge factor in the near future ..."

Due to random sampling, you will get a different output every time you run it unless you use a fixed seed. So your second example was "AI is going to change how businesses operate ..."

Note that ctransformers doesn't provide any prompt templates, so you should use the appropriate prompt template for the model. If you are using LangChain, you can refer to this to use LangChain's PromptTemplate class.
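
For example, a minimal sketch of fixing the seed and applying an instruct-style template (the seed parameter and the >>QUESTION<< / >>ANSWER<< template are assumptions, the latter based on the falcon_main -enc output above):

from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    '/home/michael/models/falcon-40b-instruct',
    model_file='ggml-model-qt_k_m.bin',
    model_type='falcon',
    gpu_layers=5,
)

# Wrap the user text in an instruct-style template and fix the seed so that
# repeated runs produce the same continuation.
prompt = '>>QUESTION<<AI is going to\n>>ANSWER<<'
print(llm(prompt, max_new_tokens=256, seed=42))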
