
copyisallyouneed's Introduction

About Me

Welcome!

I am currently a Ph.D. student at Beijing Institute of Technology. My research interests lie in dialogue systems, large-scale language models, natural language evaluation, multimodal large-scale language models, and LLMs as agents.

📫 Contact me via lantiangmftby[AT]gmail[DOT]com.


copyisallyouneed's People

Contributors: gmftby, jcyk

copyisallyouneed's Issues

Question about the data processing in "encode_doc"?

The code in data/dpr_wikitext103_1024/encode_doc.py:

def inference(**args):
    data = DPRDataset(args['data_path'])
    sampler = torch.utils.data.distributed.DistributedSampler(data)
    data_iter = DataLoader(data, batch_size=args['batch_size'], collate_fn=data.collate, sampler=sampler)
    sampler.set_epoch(0)

    text_lists, embeddings, size, counter = [], [], 0, 0
    for documents, labels in tqdm(data_iter):
        embed = inference_one_batch(documents)
        text_lists.extend(labels)
        embeddings.append(embed)
        size += len(embed)
        if len(embeddings) > args['cut_size']:
            embed = torch.cat(embeddings)
            torch.save((text_lists, embed), f'dpr_chunk_{args["local_rank"]}_{counter}.pt')
            counter += 1
            embeddings = []
    if len(embed) > 0:
        embed = torch.cat(embeddings)
        torch.save((text_lists, embed), f'dpr_chunk_{args["local_rank"]}_{counter}.pt')

Is this part of the code right? I think text_lists should also be 'cleaned' when embeddings is reset to []; otherwise each saved chunk carries all of the labels collected so far.
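For illustration, here is a minimal sketch of the fix being suggested (my assumption about the intended behavior, not code from the repo): clear both buffers together whenever a chunk is flushed, and guard the final save on the pending embeddings list rather than on the last batch.

text_lists, embeddings, size, counter = [], [], 0, 0
for documents, labels in tqdm(data_iter):
    embed = inference_one_batch(documents)
    text_lists.extend(labels)
    embeddings.append(embed)
    size += len(embed)
    if len(embeddings) > args['cut_size']:
        torch.save((text_lists, torch.cat(embeddings)), f'dpr_chunk_{args["local_rank"]}_{counter}.pt')
        counter += 1
        embeddings, text_lists = [], []    # reset the labels together with the embeddings
if len(embeddings) > 0:                    # check the pending list, not the last batch
    torch.save((text_lists, torch.cat(embeddings)), f'dpr_chunk_{args["local_rank"]}_{counter}.pt')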

Question about the test set "retrieve"?

The code in data/dpr_wikitext103_1024/test_retrieve.py:

def search_one_job(worker_id):

    # encode the test prefix
    # with open(f'../{args["dataset"]}/new_test.txt') as f:
    with open(f'../{args["dataset"]}/test.txt') as f:
        datasets = [line.strip() for line in tqdm(f.readlines())]
        test_set = []
        for line in datasets:
            words = nltk.word_tokenize(line)
            if len(words) >= 32:
                # prefix = clean_data(words[:32])
                prefix = clean_data(words)     # <-- here!
                # reference = clean_data(words[32:32+128])
                reference = clean_data(words)
                test_set.append((prefix, reference))
    print(f'[!] collect {len(test_set)} samples from the test set')

I think the prefix should not be the whole line, because in actual generation the text after the prefix is unknown.
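For reference, a minimal sketch of what the commented-out lines suggest was intended (my assumption): keep only the first 32 words as the prefix and the following 128 words as the reference.

for line in datasets:
    words = nltk.word_tokenize(line)
    if len(words) >= 32:
        prefix = clean_data(words[:32])           # prefix: the first 32 words only
        reference = clean_data(words[32:32+128])  # reference: the next 128 words
        test_set.append((prefix, reference))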

Upload models to the Hugging Face Hub

Hi!

Very cool work! It would be nice to have the model checkpoints on the Hugging Face Hub rather than behind a Dropbox link.

Some of the benefits of sharing your models through the Hub would be:

  • versioning, commit history and diffs
  • repos provide useful metadata about their tasks, languages, metrics, etc., which makes them discoverable
  • multiple features from TensorBoard visualizations, PapersWithCode integration, and more
  • wider reach of your work to the ecosystem

Creating the repos and adding new models should be a relatively straightforward process if you've used Git before. There is a step-by-step guide explaining the process in case you're interested.

Please let us know if you would be interested and if you have any questions. 😊
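If it helps, here is a minimal sketch of one way to do the upload with the huggingface_hub library (the repo id and local path below are hypothetical placeholders; it assumes you have authenticated with huggingface-cli login):

from huggingface_hub import HfApi

api = HfApi()
# Create the model repo if it does not exist yet (repo id is a placeholder).
api.create_repo(repo_id="your-username/copyisallyouneed", exist_ok=True)
# Upload the local checkpoint folder (path is a placeholder).
api.upload_folder(
    repo_id="your-username/copyisallyouneed",
    folder_path="./ckpt/copyisallyouneed",
)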

Question about preprocessing data

Hello, I am trying to preprocess a private Chinese dataset, and the following error occurs when I process the data according to data/readme.md.

[Errno 2] No such file or directory: 'dpr_chunk_0_0.pt'

Questions about the phrase encoder

Hi everyone, nice job!
After reading the paper, I have a few small questions about the phrase encoder:
In "a document of length m", is m the number of tokens in the document?
Are s and e the tokens at the corresponding start and end positions?
And a phrase embedding is represented by these two tokens' embeddings, right?

Thank you for your reply!
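To make the question concrete, here is a minimal sketch of my reading (an assumption on my part, not code from this repo): a phrase spanning token positions s..e in a document is represented by the hidden states of its start and end tokens.

import torch

# h: token-level hidden states of one document, shape (m, d) -- m tokens, d dimensions
# s, e: the start and end token positions of a candidate phrase
def phrase_embedding(h: torch.Tensor, s: int, e: int) -> torch.Tensor:
    # Concatenate the start- and end-token representations into a 2d-dim phrase vector.
    return torch.cat([h[s], h[e]], dim=-1)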

Problem with the cython dependency

Preparing metadata (pyproject.toml) ... error
error: subprocess-exited-with-error

× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [55 lines of output]
Running from numpy source directory.
setup.py:470: UserWarning: Unrecognized setuptools command, proceeding with generating Cython sources and expanding templates
run_build = parse_setuppy_commands()

  Error compiling Cython file:
  ------------------------------------------------------------
  ...
      def __init__(self, seed=None):
          BitGenerator.__init__(self, seed)
          self.rng_state.pcg_state = &self.pcg64_random_state
  
          self._bitgen.state = <void *>&self.rng_state
          self._bitgen.next_uint64 = &pcg64_uint64
                                     ^
  ------------------------------------------------------------
  
  _pcg64.pyx:113:35: Cannot assign type 'uint64_t (*)(void *) except? -1 nogil' to 'uint64_t (*)(void *) noexcept nogil'. Exception values are incompatible. Suggest adding 'noexcept' to type 'uint64_t (void *) except? -1 nogil'.
  Processing numpy/random/_bounded_integers.pxd.in
  Processing numpy/random/mtrand.pyx
  Processing numpy/random/_pcg64.pyx
  Traceback (most recent call last):
    File "/tmp/pip-install-x3hc01nq/numpy_42705059889545b8b2f5cc566c30ba9a/tools/cythonize.py", line 235, in <module>
      main()
    File "/tmp/pip-install-x3hc01nq/numpy_42705059889545b8b2f5cc566c30ba9a/tools/cythonize.py", line 231, in main
      find_process_files(root_dir)
    File "/tmp/pip-install-x3hc01nq/numpy_42705059889545b8b2f5cc566c30ba9a/tools/cythonize.py", line 222, in find_process_files
      process(root_dir, fromfile, tofile, function, hash_db)
    File "/tmp/pip-install-x3hc01nq/numpy_42705059889545b8b2f5cc566c30ba9a/tools/cythonize.py", line 188, in process
      processor_function(fromfile, tofile)
    File "/tmp/pip-install-x3hc01nq/numpy_42705059889545b8b2f5cc566c30ba9a/tools/cythonize.py", line 77, in process_pyx
      subprocess.check_call(
    File "/usr/lib64/python3.9/subprocess.py", line 373, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['/usr/bin/python3', '-m', 'cython', '-3', '--fast-fail', '-o', '_pcg64.c', '_pcg64.pyx']' returned non-zero exit status 1.
  Cythonizing sources
  Traceback (most recent call last):
    File "/root/.local/lib/python3.9/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
      main()
    File "/root/.local/lib/python3.9/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "/root/.local/lib/python3.9/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 149, in prepare_metadata_for_build_wheel
      return hook(metadata_directory, config_settings)
    File "/tmp/pip-build-env-s911lhr6/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 157, in prepare_metadata_for_build_wheel
      self.run_setup()
    File "/tmp/pip-build-env-s911lhr6/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 248, in run_setup
      super(_BuildMetaLegacyBackend,
    File "/tmp/pip-build-env-s911lhr6/overlay/lib/python3.9/site-packages/setuptools/build_meta.py", line 142, in run_setup
      exec(compile(code, __file__, 'exec'), locals())
    File "setup.py", line 499, in <module>
      setup_package()
    File "setup.py", line 479, in setup_package
      generate_cython()
    File "setup.py", line 274, in generate_cython
      raise RuntimeError("Running cythonize failed!")
  RuntimeError: Running cythonize failed!
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
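A possible workaround (my guess from the traceback, not something documented in this repo): pip is building an old numpy source distribution with a modern Cython 3, whose stricter noexcept rules the old sources predate. Forcing pip to use a prebuilt numpy wheel instead of building from source may avoid the failure:

# Prefer a prebuilt wheel over a source build (this workaround is an assumption).
pip install --only-binary=:all: numpy
pip install -r requirments.txt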

Installation problem

Preparing metadata (pyproject.toml) ... error
error: subprocess-exited-with-error

× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [55 lines of output]

The package metadata cannot be installed.

Question about retrieval pool size

Hi, thank you for your great work!

I'm trying to decipher some of your experiment notes, specifically this one (data/wikitext103_1024/README_find_mbest_chunk.md).

The second section of this file refers to:

With the best chunk size fixed at 128, find the best number of tokens:
1. chunk_size: 128, 64: 24.9%
2. chunk_size: 128, 128: 29.18%
3. chunk_size: 128, 256: 33.53%
4. chunk_size: 128, 512: 37.84%
5. chunk_size: 128, 1024: 41.53%

The mismatch is caused by the chunking.

Could you explain what those numbers mean? I'm particularly interested in how the copy ratio scales with the retrieval pool size. I wonder if the second number in each line (64, 128, 256, 512, 1024) refers to the pool size in each individual experiment.

Question about the equation?

Nice work!
After reading the paper, I am confused about an equation in Section 3.1 (MODEL ARCHITECTURE):
$H_i \in \mathbb{R}^{i \times dL}$
Does multiplying by the number of Transformer layers $L$ have any special meaning? For a causal LM, a prefix of length $i$ yields a hidden-state matrix of size $i \times d$.
Have I misunderstood it? Thank you for your reply! :)
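One possible source of the confusion (purely my guess, not something stated in the issue): the superscript can be read two ways, $H_i \in \mathbb{R}^{i \times (d \cdot L)}$, i.e. the hidden states of all $L$ Transformer layers concatenated along the feature dimension, versus $H_i \in \mathbb{R}^{i \times d_L}$, where $d_L$ simply denotes the hidden size of the language model. Under the first reading the extra factor of $L$ is meaningful; under the second it is only notation.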

Missing an 'e'

pip install -r requirments.txt is missing an 'e' (it should be requirements.txt), lol

Question about the phrase collection

Hi, thanks for sharing the source code.

What is the average length of the collected phrases? Could you please provide statistics on this, such as how many length-1 phrases, how many length-2 phrases, and so on?
Also, I didn't find the code for your Algorithm 1 (page 14) in this repo; could you please help me out?

Thank you so much~
