
Comments (7)

rainjacket avatar rainjacket commented on August 30, 2024

Side clarification question: I run out of memory during the data-loading step when using my full dataset with multiproc. Is --lazy supposed to fix this? Could I have it start fewer threads for the data-processing step?

My dataset is around 80GB and my machine has 480GB of RAM.

from sentiment-discovery.

rainjacket avatar rainjacket commented on August 30, 2024

Actually, it does look like this might be a problem with my dataset. When I created it, I checked all(c in string.printable for c in text), but I think I may also have texts where a sequence like "\xe3" appears literally in the plaintext, and the conversion this repo does somehow collapses it back into a single byte.

Will investigate. Apparently this breaks --lazy but not the normal training code, and it could also explain why I was sometimes getting non-ASCII predictions.
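To illustrate the suspected failure mode above (a sketch, not the repo's code): a literal escape sequence in plaintext passes the string.printable check, yet collapses to one non-ASCII character if any later step interprets the escape.

```python
import string

# The check used when building the dataset: every character is printable ASCII.
text_ok = "hello world"
assert all(c in string.printable for c in text_ok)

# A literal escape sequence in plaintext passes the same check,
# because the four characters '\', 'x', 'e', '3' are all printable:
literal = r"\xe3"
assert len(literal) == 4
assert all(c in string.printable for c in literal)

# ...but if some later conversion interprets the escape, it collapses
# back into a single non-ASCII character:
decoded = literal.encode("ascii").decode("unicode_escape")
assert len(decoded) == 1
assert decoded == "\xe3"
```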


raulpuric avatar raulpuric commented on August 30, 2024

Yes; for clarity, --lazy forces lazy evaluation of text from disk instead of loading it all into memory.
PyTorch instantiates the dataset in each distributed worker, so to avoid duplicating the dataset in every process, --lazy creates a file structured like an array and loads the data directly from disk.

UnicodeEncodeError: 'ascii' codec can't encode character '\uf04a' in position 13051820: ordinal not in range(128)

This error definitely sounds like a python3 encoding problem. Keep me posted as to any encoding logic you have to change in your/our code, as I've encountered this problem occasionally in the past.
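The array-like lazy file described above can be sketched roughly as follows. This is a minimal illustration under assumed conventions (the class name, offset layout, and file format are hypothetical, not the repo's actual implementation): texts are concatenated into one flat file, an offset array records where each entry ends, and each access reads only the requested slice from disk.

```python
class LazyTextDataset:
    """Sketch of a disk-backed dataset: one flat file of concatenated
    texts plus a list of byte offsets marking entry boundaries."""

    def __init__(self, data_path, offsets):
        self.data_path = data_path  # flat file of concatenated UTF-8 texts
        self.offsets = offsets      # [0, end_of_entry_0, end_of_entry_1, ...]

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        # Only the requested entry is read into memory, so each
        # distributed worker avoids holding the whole corpus.
        start, end = self.offsets[i], self.offsets[i + 1]
        with open(self.data_path, "rb") as f:
            f.seek(start)
            return f.read(end - start).decode("utf-8")
```

Because every worker opens the same file and seeks, the dataset object itself stays tiny regardless of corpus size.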


rainjacket avatar rainjacket commented on August 30, 2024

@raulpuric thanks, I'll look into the encoding thing when I have some free time.

On the memory issue, I currently run out of memory while generating the lazy files. Can just that part be done with one thread, and then multiprocess model training run on the lazy files?


raulpuric avatar raulpuric commented on August 30, 2024

Actually the processes are more likely running out of memory while preprocessing the dataset.

You're right that one process should preprocess the data. To achieve this, we normally run a single-process training job with --lazy and kill it once it has finished preprocessing the data and caching the lazy files to disk.

You only have to pay the preprocessing penalty once. The cached lazy files will then be loaded instantly by the multiprocess script, provided you give it the same data flags you gave the single-process script.

I realize now that this should be made clearer in the README.


rainjacket avatar rainjacket commented on August 30, 2024

I think the encoding issue mainly comes down to the fact that json.loads(json.dumps("\xe2")) = 'â'.
The only way I could fix it was by adding a bunch more filters on my input to prevent this case from happening (representations of Unicode characters showing up as plaintext in the raw data).
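A small demonstration of the two cases above, plus the kind of input filter described (a sketch, assuming a regex-based check; the actual filters used are not in the thread): JSON round-trips the real character unchanged, while the problem is raw data containing the escape sequence as literal plaintext.

```python
import json
import re

# The actual character round-trips through JSON unchanged:
assert json.loads(json.dumps("\xe2")) == "\xe2"  # i.e. 'â'

# The problem case is raw data containing the four characters
# '\', 'x', 'e', '2' as plaintext. A filter along these lines
# can flag such records before they enter the dataset:
LITERAL_ESCAPE = re.compile(r"\\x[0-9a-fA-F]{2}|\\u[0-9a-fA-F]{4}")

def has_literal_escape(text):
    """Return True if text contains a literal \\xNN or \\uNNNN sequence."""
    return bool(LITERAL_ESCAPE.search(text))

assert has_literal_escape(r"bad \xe2 record")
assert not has_literal_escape("clean â record")
```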


raulpuric avatar raulpuric commented on August 30, 2024

Could I get you to respond to #27 with the characters that did not work for you and the filters you used to get it working?

