If I change the second parameter of nn.Embedding (the embedding dimension) to something other than vocab_size (e.g. 128), I get an "index out of range" error in generate().
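For context, this is roughly what the model looks like after that change. It's a minimal sketch based on the bigram model from the lecture and the frames in the traceback below; the constructor taking vocab_size and n_embd as arguments, and vocab_size = 65, are my simplifications.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size, n_embd):
        super().__init__()
        # the change in question: second argument is n_embd (128) instead of vocab_size
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)

    def forward(self, idx, targets=None):
        # idx and targets are both (B,T) tensors of integers
        logits = self.token_embedding_table(idx)  # (B,T,n_embd) instead of (B,T,vocab_size)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :]                           # (B, n_embd)
            probs = F.softmax(logits, dim=-1)                   # (B, n_embd)
            # sample from the distribution over n_embd channels
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            idx = torch.cat((idx, idx_next), dim=1)             # (B, T+1)
        return idx

vocab_size = 65  # assumed: the tinyshakespeare character-level vocabulary
m = BigramLanguageModel(vocab_size, n_embd=128)
```

The forward pass on a training batch still runs fine (shape and loss printed below); the IndexError only shows up when sampling: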
torch.Size([32, 128])
tensor(5.2106, grad_fn=<NllLossBackward0>)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-14-58747f1080e0> in <cell line: 48>()
46 print(loss)
47
---> 48 print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))
<ipython-input-14-58747f1080e0> in generate(self, idx, max_new_tokens)
30 for _ in range(max_new_tokens):
31 # get the predictions
---> 32 logits, loss = self(idx)
33 # focus only on the last time step
34 logits = logits[:, -1, :] # becomes (B, C)
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
<ipython-input-14-58747f1080e0> in forward(self, idx, targets)
14
15 # idx and targets are both (B,T) tensor of integers
---> 16 logits = self.token_embedding_table(idx) # (B,T,C)
17
18 if targets is None:
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py in _call_impl(self, *args, **kwargs)
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py in forward(self, input)
160
161 def forward(self, input: Tensor) -> Tensor:
--> 162 return F.embedding(
163 input, self.weight, self.padding_idx, self.max_norm,
164 self.norm_type, self.scale_grad_by_freq, self.sparse)
/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
2208 # remove once script supports set_grad_enabled
2209 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2210 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
2211
2212
IndexError: index out of range in self
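As far as I can tell, the lookup itself only fails when a token index is greater than or equal to the first argument of nn.Embedding, i.e. the number of rows in the table. Since generate() now samples from a distribution over 128 channels, torch.multinomial can return indices up to 127, which then go out of range on the next embedding lookup. A standalone repro of the same error (again assuming vocab_size = 65 and n_embd = 128):

```python
import torch
import torch.nn as nn

# 65 rows (one per token), 128-dimensional embedding vectors
emb = nn.Embedding(65, 128)

print(emb(torch.tensor([[0, 64]])).shape)  # torch.Size([1, 2, 128]) -- indices < 65 are fine
emb(torch.tensor([[100]]))                 # IndexError: index out of range in self
```

Is the only way to use a different embedding dimension to project it back to vocab_size before sampling, or am I missing something simpler?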