
copyisallyouneed's Introduction

About Me

Welcome!

I am currently a Ph.D. student at Beijing Institute of Technology. My research interests lie in dialogue systems, large-scale language models, natural language evaluation, multimodal large-scale language models, and LLMs as agents.

📫 Contact me via lantiangmftby[AT]gmail[DOT]com.


copyisallyouneed's People

Contributors: gmftby, jcyk

copyisallyouneed's Issues

Question about the data processing in "encode_doc"?

The code in data/dpr_wikitext103_1024/encode_doc.py:

def inference(**args):
    data = DPRDataset(args['data_path'])
    sampler = torch.utils.data.distributed.DistributedSampler(data)
    data_iter = DataLoader(data, batch_size=args['batch_size'], collate_fn=data.collate, sampler=sampler)
    sampler.set_epoch(0)

    text_lists, embeddings, size, counter = [], [], 0, 0
    for documents, labels in tqdm(data_iter):
        embed = inference_one_batch(documents)
        text_lists.extend(labels)
        embeddings.append(embed)
        size += len(embed)
        if len(embeddings) > args['cut_size']:
            embed = torch.cat(embeddings)
            torch.save((text_lists, embed), f'dpr_chunk_{args["local_rank"]}_{counter}.pt')
            counter += 1
            embeddings = []
    if len(embed) > 0:
        embed = torch.cat(embeddings)
        torch.save((text_lists, embed), f'dpr_chunk_{args["local_rank"]}_{counter}.pt')

Is this part of the code right? I think text_lists should also be 'cleaned' when embeddings is reset to []; otherwise each saved chunk carries all of the labels collected so far.
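For illustration, here is a minimal sketch of the fix being suggested (my assumption about the intended behavior, not code from the repo): clear both buffers together whenever a chunk is flushed, and guard the final save on the pending embeddings list rather than on the last batch.

text_lists, embeddings, size, counter = [], [], 0, 0
for documents, labels in tqdm(data_iter):
    embed = inference_one_batch(documents)
    text_lists.extend(labels)
    embeddings.append(embed)
    size += len(embed)
    if len(embeddings) > args['cut_size']:
        torch.save((text_lists, torch.cat(embeddings)), f'dpr_chunk_{args["local_rank"]}_{counter}.pt')
        counter += 1
        embeddings, text_lists = [], []    # reset the labels together with the embeddings
if len(embeddings) > 0:                    # check the pending list, not the last batch
    torch.save((text_lists, torch.cat(embeddings)), f'dpr_chunk_{args["local_rank"]}_{counter}.pt')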

Question about the test set "retrieve"?

The code in data/dpr_wikitext103_1024/test_retrieve.py:

def search_one_job(worker_id):

    # encode the test prefix
    # with open(f'../{args["dataset"]}/new_test.txt') as f:
    with open(f'../{args["dataset"]}/test.txt') as f:
        datasets = [line.strip() for line in tqdm(f.readlines())]
        test_set = []
        for line in datasets:
            words = nltk.word_tokenize(line)
            if len(words) >= 32:
                # prefix = clean_data(words[:32])
                prefix = clean_data(words)     # <-- here!
                # reference = clean_data(words[32:32+128])
                reference = clean_data(words)
                test_set.append((prefix, reference))
    print(f'[!] collect {len(test_set)} samples from the test set')

I think the prefix should not be the whole line, because in actual generation the text after the prefix is unknown.
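For reference, a minimal sketch of what the commented-out lines suggest was intended (my assumption): keep only the first 32 words as the prefix and the following 128 words as the reference.

for line in datasets:
    words = nltk.word_tokenize(line)
    if len(words) >= 32:
        prefix = clean_data(words[:32])           # prefix: the first 32 words only
        reference = clean_data(words[32:32+128])  # reference: the next 128 words
        test_set.append((prefix, reference))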

Upload models to the Hugging Face Hub

Hi!

Very cool work! It would be nice to have the model checkpoints on the Hugging Face Hub rather than behind a Dropbox link.

Some of the benefits of sharing your models through the Hub would be:

  • versioning, commit history and diffs
  • repos provide useful metadata about their tasks, languages, metrics, etc., which makes them discoverable
  • multiple features from TensorBoard visualizations, PapersWithCode integration, and more
  • wider reach of your work to the ecosystem

Creating the repos and adding new models should be a relatively straightforward process if you've used Git before. There is a step-by-step guide explaining the process in case you're interested.

Please let us know if you would be interested and if you have any questions. 😊
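If it helps, here is a minimal sketch of one way to do the upload with the huggingface_hub library (the repo id and local path below are hypothetical placeholders; it assumes you have authenticated with huggingface-cli login):

from huggingface_hub import HfApi

api = HfApi()
# Create the model repo if it does not exist yet (repo id is a placeholder).
api.create_repo(repo_id="your-username/copyisallyouneed", exist_ok=True)
# Upload the local checkpoint folder (path is a placeholder).
api.upload_folder(
    repo_id="your-username/copyisallyouneed",
    folder_path="./ckpt/copyisallyouneed",
)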

Question about preprocessing data

Hello, I am trying to preprocess a private Chinese dataset, and the following error occurs when I process the data according to data/readme.md.

[Errno 2] No such file or directory: 'dpr_chunk_0_0.pt'

Questions about the phrase encoder

Hi everyone, nice job!
After reading the paper, I have a few small questions about the phrase encoder:
In "a document of length m", is m the number of tokens in the document?
Are s and e the tokens at the corresponding start and end positions?
And a phrase embedding is represented by these two tokens' embeddings, right?

Thank you for your reply!
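To make the question concrete, here is a minimal sketch of my reading (an assumption on my part, not code from this repo): a phrase spanning token positions s..e in a document is represented by the hidden states of its start and end tokens.

import torch

# h: token-level hidden states of one document, shape (m, d) -- m tokens, d dimensions
# s, e: the start and end token positions of a candidate phrase
def phrase_embedding(h: torch.Tensor, s: int, e: int) -> torch.Tensor:
    # Concatenate the start- and end-token representations into a 2d-dim phrase vector.
    return torch.cat([h[s], h[e]], dim=-1)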

Problem with the cython dependency

Preparing metadata (pyproject.toml) ... error
error: subprocess-exited-with-error

× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [55 lines of output]
Running from numpy source directory.
setup.py:470: UserWarning: Unrecognized setuptools command, proceeding with generating Cython sources and expanding templates
run_build = parse_setuppy_commands()

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
      def __init__(self, seed=None):
          BitGenerator.__init__(self, seed)
          self.rng_state.pcg_state = &self.pcg64_random_state
  
          self._bitgen.state = <void *>&self.rng_state
          self._bitgen.next_uint64 = &pcg64_uint64
                                     ^
  ------------------------------------------------------------
  
  _pcg64.pyx:113:35: Cannot assign type 'uint64_t (*)(void *) except? -1 nogil' to 'uint64_t (*)(void *) noexcept nogil'. Exception values are incompatible. Suggest adding 'noexcept' to type 'uint64_t (void *) except? -1 nogil'.
  Processing numpy/random/_bounded_integers.pxd.in
  Processing numpy/random/mtrand.pyx
  Processing numpy/random/_pcg64.pyx
  Traceback (most recent call last):
    File "/tmp/pip-install-x3hc01nq/numpy_42705059889545b8b2f5cc566c30ba9a/tools/cythonize.py", line 235, in <module>
      main()
    File "/tmp/pip-install-x3hc01nq/numpy_42705059889545b8b2f5cc566c30ba9a/tools/cythonize.py", line 231, in main
      find_process_files(root_dir)
    File "/tmp/pip-install-x3hc01nq/numpy_42705059889545b8b2f5cc566c30ba9a/tools/cythonize.py", line 222, in find_process_files
      process(root_dir, fromfile, tofile, function, hash_db)
    File "/tmp/pip-install-x3hc01nq/numpy_42705059889545b8b2f5cc566c30ba9a/tools/cythonize.py", line 188, in process
      processor_function(fromfile, tofile)
    File "/tmp/pip-install-x3hc01nq/numpy_42705059889545b8b2f5cc566c30ba9a/tools/cythonize.py", line 77, in process_pyx
      subprocess.check_call(
    File "/usr/lib64/python3.9/subprocess.py", line 373, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['/usr/bin/python3', '-m', 'cython', '-3', '--fast-fail', '-o', '_pcg64.c', '_pcg64.pyx']' returned non-zero exit status 1.
  Cythonizing sources
  Traceback (most recent call last):
    File "/root/.local/lib/python3.9/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
      main()
    File "/root/.local/lib/python3.9/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/root/.local/lib/python3.9/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 149, in prepare_metadata_for_build_wheel
      return hook(metadata_directory, config_settings)
    File "/tmp/pip-build-env-s911lhr6/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 157, in prepare_metadata_for_build_wheel
      self.run_setup()
    File "/tmp/pip-build-env-s911lhr6/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 248, in run_setup
      super(_BuildMetaLegacyBackend,
    File "/tmp/pip-build-env-s911lhr6/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 142, in run_setup
      exec(compile(code, __file__, 'exec'), locals())
    File "setup.py", line 499, in <module>
      setup_package()
    File "setup.py", line 479, in setup_package
      generate_cython()
    File "setup.py", line 274, in generate_cython
      raise RuntimeError("Running cythonize failed!")
  RuntimeError: Running cythonize failed!
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
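A possible workaround (my guess from the traceback, not something documented in this repo): pip is building an old numpy source distribution with a modern Cython 3, whose stricter noexcept rules the old sources predate. Forcing pip to use a prebuilt numpy wheel instead of building from source may avoid the failure:

# Prefer a prebuilt wheel over a source build (this workaround is an assumption).
pip install --only-binary=:all: numpy
pip install -r requirments.txt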

Installation problem

Preparing metadata (pyproject.toml) ... error
error: subprocess-exited-with-error

× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [55 lines of output]

The package metadata cannot be installed.

Question about retrieval pool size

Hi, thank you for your great work!

I'm trying to decipher some of your experiment notes, specifically this one (data/wikitext103_1024/README_find_mbest_chunk.md).

The second section of this file refers to:

With the best chunk size fixed at 128, find the best number of tokens:
1. chunk_size: 128, 64: 24.9%
2. chunk_size: 128, 128: 29.18%
3. chunk_size: 128, 256: 33.53%
4. chunk_size: 128, 512: 37.84%
5. chunk_size: 128, 1024: 41.53%

The mismatch is caused by the chunking.

Could you explain what those numbers mean? I'm particularly interested in how the copy ratio scales with the retrieval pool size. I wonder if the second number in each line (64, 128, 256, 512, 1024) refers to the pool size in each individual experiment.

Question about the equation?

Nice work!
After reading the paper, I am confused about an equation in Section 3.1 (MODEL ARCHITECTURE):
$H_i \in \mathbb{R}^{i \times dL}$
Does multiplying by the number of Transformer layers $L$ have any special meaning? For a causal LM, a prefix of length $i$ yields a hidden-state matrix of size $i \times d$.
Have I misunderstood it? Thank you for your reply! :)
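One possible source of the confusion (purely my guess, not something stated in the issue): the superscript can be read two ways, $H_i \in \mathbb{R}^{i \times (d \cdot L)}$, i.e. the hidden states of all $L$ Transformer layers concatenated along the feature dimension, versus $H_i \in \mathbb{R}^{i \times d_L}$, where $d_L$ simply denotes the hidden size of the language model. Under the first reading the extra factor of $L$ is meaningful; under the second it is only notation.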

Missing an 'e'

pip install -r requirments.txt is missing an 'e' (it should be requirements.txt), lol

Question about the phrase collection

Hi, thanks for sharing the source code.

What is the average length of the collected phrases? Could you please provide statistics on this, such as how many length-1 phrases, how many length-2 phrases, and so on?
Also, I didn't find the code for your Algorithm 1 (page 14) in this repo; could you please help me out?

Thank you so much~
