Llama3.java

Practical Llama 3 inference implemented in a single Java file.

This project is the successor of llama2.java, which is based on llama2.c by Andrej Karpathy and his excellent educational videos.

Besides the educational value, this project will be used to test and tune compiler optimizations and features on the JVM, particularly for the Graal compiler.

Features

  • Single file, no dependencies
  • GGUF format parser
  • Llama 3 tokenizer based on minbpe
  • Llama 3 inference with Grouped-Query Attention
  • Support for Q8_0 and Q4_0 quantizations
  • Fast matrix-vector multiplication routines for quantized tensors using Java's Vector API
  • Simple CLI with --chat and --instruct modes.
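To make the Grouped-Query Attention item concrete, here is a minimal sketch of attention for a single new token where several query heads share one key/value head. It is not taken from Llama3.java: the class name, method names and array layouts are hypothetical and chosen for readability over speed. Llama 3 8B uses 32 query heads and 8 key/value heads, so each key/value head serves a group of 4 query heads.

```java
final class GqaSketch {
    /** Maps a query head to the key/value head shared by its group. */
    static int kvHead(int queryHead, int nHeads, int nKvHeads) {
        int groupSize = nHeads / nKvHeads; // e.g. 32 / 8 = 4 for Llama 3 8B
        return queryHead / groupSize;
    }

    /**
     * Attention output for one token (hypothetical layouts).
     * q:      [nHeads][headSize]
     * keys:   [seqLen][nKvHeads][headSize]
     * values: [seqLen][nKvHeads][headSize]
     */
    static float[][] attend(float[][] q, float[][][] keys, float[][][] values) {
        int nHeads = q.length, headSize = q[0].length;
        int seqLen = keys.length, nKvHeads = keys[0].length;
        float[][] out = new float[nHeads][headSize];
        for (int h = 0; h < nHeads; h++) {
            int kv = kvHead(h, nHeads, nKvHeads);
            // Scaled dot-product scores against every cached key of the shared head.
            float[] scores = new float[seqLen];
            float max = Float.NEGATIVE_INFINITY;
            for (int t = 0; t < seqLen; t++) {
                float dot = 0f;
                for (int i = 0; i < headSize; i++) {
                    dot += q[h][i] * keys[t][kv][i];
                }
                scores[t] = dot / (float) Math.sqrt(headSize);
                max = Math.max(max, scores[t]);
            }
            // Numerically stable softmax over the scores.
            float sum = 0f;
            for (int t = 0; t < seqLen; t++) {
                scores[t] = (float) Math.exp(scores[t] - max);
                sum += scores[t];
            }
            // Weighted sum of the values of the shared head.
            for (int t = 0; t < seqLen; t++) {
                float w = scores[t] / sum;
                for (int i = 0; i < headSize; i++) {
                    out[h][i] += w * values[t][kv][i];
                }
            }
        }
        return out;
    }
}
```

Because the key/value cache only stores nKvHeads heads per position, it is nHeads / nKvHeads times smaller than with full multi-head attention, which matters when inference is limited by memory traffic.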

Setup

Download pure Q4_0 and (optionally) Q8_0 quantized .gguf files from:
https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF

The ~4.3GB pure Q4_0 quantized model is recommended. Please be gentle with the huggingface.co servers:

curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q4_0.gguf

# Optionally download the Q8_0 quantized model ~8GB
# curl -L -O https://huggingface.co/mukel/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct-Q8_0.gguf
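After downloading, a quick sanity check is to read the fixed-size header at the start of the file. The sketch below is an illustration only, not the parser from Llama3.java, and the class and method names are hypothetical. It assumes GGUF version 2 or later, where the header is the 4-byte magic "GGUF", a little-endian 32-bit version, then 64-bit tensor and metadata key/value counts.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class GgufHeaderSketch {
    /** The fixed-size fields at the start of a GGUF v2+ file. */
    record Header(int version, long tensorCount, long metadataKvCount) {}

    /** Parses the header from the buffer's current position without moving it. */
    static Header parse(ByteBuffer buf) {
        ByteBuffer le = buf.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        byte[] magic = new byte[4];
        le.get(magic);
        if (!"GGUF".equals(new String(magic, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("not a GGUF file");
        }
        int version = le.getInt();
        long tensorCount = le.getLong();
        long metadataKvCount = le.getLong();
        return new Header(version, tensorCount, metadataKvCount);
    }

    /** Reads only the first 24 bytes of the file, so it is cheap even for multi-GB models. */
    static Header readHeader(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(24);
            while (buf.hasRemaining() && channel.read(buf) != -1) {
                // keep reading until the header is complete or the file ends
            }
            buf.flip();
            return parse(buf);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readHeader(Path.of(args[0])));
    }
}
```

A truncated download or an HTML error page saved under the .gguf name fails the magic check immediately instead of producing a confusing error later.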

Optional: quantize to pure Q4_0 manually

In the wild, Q8_0 quantizations are fine, but Q4_0 quantizations are rarely pure, e.g. the output.weights tensor is quantized with Q6_K instead of Q4_0.
A pure Q4_0 quantization can be generated from a high precision (F32, F16, BFLOAT16) .gguf source with the quantize utility from llama.cpp as follows:

./quantize --pure ./Meta-Llama-3-8B-Instruct-F32.gguf ./Meta-Llama-3-8B-Instruct-Q4_0.gguf Q4_0
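For context on what a Q4_0 tensor looks like at the byte level: in ggml's Q4_0 format, weights are grouped into blocks of 32 that share one half-precision scale, followed by 16 bytes that each pack two 4-bit values. The sketch below dequantizes one such block in plain scalar Java. It illustrates the format only and is not the code used by Llama3.java; the class and method names are hypothetical, and the half-precision decoder is written by hand to keep the example free of version-specific APIs.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class Q4_0Sketch {
    static final int BLOCK_SIZE = 32;          // weights per block
    static final int BYTES_PER_BLOCK = 2 + 16; // fp16 scale + 32 packed 4-bit values

    /** Decodes an IEEE 754 half-precision value. */
    static float halfToFloat(short half) {
        int bits = half & 0xFFFF;
        int exponent = (bits >>> 10) & 0x1F;
        int mantissa = bits & 0x3FF;
        float magnitude;
        if (exponent == 0) {
            magnitude = Math.scalb((float) mantissa, -24);                 // zero or subnormal
        } else if (exponent == 31) {
            magnitude = mantissa == 0 ? Float.POSITIVE_INFINITY : Float.NaN;
        } else {
            magnitude = Math.scalb((float) (mantissa | 0x400), exponent - 25);
        }
        return (bits & 0x8000) == 0 ? magnitude : -magnitude;
    }

    /** Dequantizes the block starting at the given byte offset of a little-endian buffer. */
    static float[] dequantizeBlock(ByteBuffer buf, int offset) {
        float scale = halfToFloat(buf.getShort(offset));
        float[] out = new float[BLOCK_SIZE];
        for (int i = 0; i < BLOCK_SIZE / 2; i++) {
            int packed = buf.get(offset + 2 + i) & 0xFF;
            out[i] = ((packed & 0x0F) - 8) * scale;                  // low nibble: element i
            out[i + BLOCK_SIZE / 2] = ((packed >>> 4) - 8) * scale;  // high nibble: element i + 16
        }
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer block = ByteBuffer.allocate(BYTES_PER_BLOCK).order(ByteOrder.LITTLE_ENDIAN);
        block.putShort(0, (short) 0x3C00); // scale = 1.0 in fp16
        block.put(2, (byte) 0x98);         // low nibble 8, high nibble 9
        float[] weights = dequantizeBlock(block, 0);
        System.out.println(weights[0] + " " + weights[16]);
    }
}
```

At 18 bytes per 32 weights, Q4_0 costs 4.5 bits per weight. A tensor stored with another scheme such as Q6_K uses a different block layout, which is why a reader that only implements Q4_0 and Q8_0 needs a pure quantization.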

Build and run

Java 21+ is required, in particular for the MemorySegment mmap-ing feature.

jbang is a perfect fit for this use case; just run:

jbang Llama3.java --help

Or execute directly, also via jbang:

chmod +x Llama3.java
./Llama3.java --help

Optional: Makefile + manually build and run

A simple Makefile is provided. Run make to produce llama3.jar, or build manually:

javac -g --enable-preview -source 21 --add-modules jdk.incubator.vector -d target/classes Llama3.java
jar -cvfe llama3.jar com.llama4j.Llama3 LICENSE -C target/classes .

Run the resulting llama3.jar as follows:

java --enable-preview --add-modules jdk.incubator.vector -jar llama3.jar --help

Performance

Important Note
The Graal compiler doesn't support the Vector API yet. On GraalVM, run with -Dllama.VectorAPI=false, but expect sub-optimal performance.
Vanilla OpenJDK 21+, which supports the Vector API, is recommended for now.

llama.cpp

Vanilla llama.cpp built with make -j 20.

./main --version
version: 2879 (4f026363)
built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu

Executed as follows:

./main -m ../Meta-Llama-3-8B-Instruct-Q4_0.gguf \
  -n 512 \
  -s 42 \
  -p "<|start_of_header_id|>user<|end_of_header_id|>Why is the sky blue?<|eot_id|><|start_of_header_id|>assistant<|end_of_header_id|>\n\n" \
  --interactive-specials

Collected the "eval time" metric in tokens/s.

Llama3.java

Running on OpenJDK 21.0.2.

jbang Llama3.java \
  --model ./Meta-Llama-3-8B-Instruct-Q4_0.gguf \
  --max-tokens 512 \
  --seed 42 \
  --stream false \
  --prompt "Why is the sky blue?"

Results

Notebook Intel 13900H 6pC+8eC/20T 64GB (5200) Linux 6.6.26

Model                          tokens/s  Implementation
Llama-3-8B-Instruct-Q4_0.gguf  7.53      llama.cpp
Llama-3-8B-Instruct-Q4_0.gguf  6.95      llama3.java
Llama-3-8B-Instruct-Q8_0.gguf  5.16      llama.cpp
Llama-3-8B-Instruct-Q8_0.gguf  4.02      llama3.java

Workstation AMD 3950X 16C/32T 64GB (3200) Linux 6.6.25

Notes
The benchmarks ran on a single CCD, e.g. taskset -c 0-15 jbang Llama3.java ..., since inference is constrained by memory bandwidth.

Model                          tokens/s  Implementation
Llama-3-8B-Instruct-Q4_0.gguf  9.26      llama.cpp
Llama-3-8B-Instruct-Q4_0.gguf  8.03      llama3.java
Llama-3-8B-Instruct-Q8_0.gguf  5.79      llama.cpp
Llama-3-8B-Instruct-Q8_0.gguf  4.92      llama3.java
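The memory-bandwidth remark in the notes can be turned into a rough back-of-the-envelope bound: generating one token reads essentially every weight once, so tokens/s cannot exceed memory bandwidth divided by model size. The sketch below is only an estimate under stated assumptions. It assumes that "(3200)" means DDR4-3200 in a dual-channel configuration with a 64-bit bus per channel, and it uses the ~4.3GB size of the Q4_0 file mentioned in the Setup section. The class and method names are hypothetical.

```java
final class BandwidthBoundSketch {
    /** Theoretical peak DDR bandwidth in bytes/s: channels * 8 bytes per transfer * transfers/s. */
    static double ddrPeakBandwidth(int channels, double megaTransfersPerSec) {
        return channels * 8.0 * megaTransfersPerSec * 1e6;
    }

    /** Upper bound on tokens/s if every weight byte is read exactly once per token. */
    static double maxTokensPerSecond(double bandwidthBytesPerSec, double modelBytes) {
        return bandwidthBytesPerSec / modelBytes;
    }

    public static void main(String[] args) {
        double bandwidth = ddrPeakBandwidth(2, 3200); // assumption: dual-channel DDR4-3200
        double modelBytes = 4.3e9;                    // ~4.3GB Q4_0 model file
        System.out.printf("peak bandwidth: %.1f GB/s%n", bandwidth / 1e9);
        System.out.printf("upper bound:    %.1f tokens/s%n", maxTokensPerSecond(bandwidth, modelBytes));
    }
}
```

Under these assumptions the theoretical peak is 51.2 GB/s, which gives a ceiling of roughly 12 tokens/s for the Q4_0 model. The measured 9.26 and 8.03 tokens/s sit reasonably close to that ceiling, which is consistent with the workload being bandwidth-bound rather than compute-bound.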

License

MIT

Contributors

mukel
