
Comments (7)

rainjacket avatar rainjacket commented on August 30, 2024

Side clarification question: I run out of memory during the data-loading step when using my full dataset with multiproc. Is --lazy supposed to fix this? Could I have it start fewer threads for the data-processing step?

My dataset is around 80GB and my machine has 480GB of RAM.

from sentiment-discovery.

rainjacket avatar rainjacket commented on August 30, 2024

Actually, it does look like this might be a problem with my dataset. When I created it, I checked all(c in string.printable for c in text), but I think I may also have texts where a sequence like "\xe3" appears literally in the plaintext, and the conversion this repo does somehow collapses it back into a single byte.

Will investigate. Apparently this breaks --lazy but not the normal training code, and it could also explain why I was sometimes getting non-ASCII predictions.
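To illustrate the suspected failure mode above (a sketch, not the repo's code): a literal escape sequence in plaintext passes the string.printable check, yet collapses to one non-ASCII character if any later step interprets the escape.

```python
import string

# The check used when building the dataset: every character is printable ASCII.
text_ok = "hello world"
assert all(c in string.printable for c in text_ok)

# A literal escape sequence in plaintext passes the same check,
# because the four characters '\', 'x', 'e', '3' are all printable:
literal = r"\xe3"
assert len(literal) == 4
assert all(c in string.printable for c in literal)

# ...but if some later conversion interprets the escape, it collapses
# back into a single non-ASCII character:
decoded = literal.encode("ascii").decode("unicode_escape")
assert len(decoded) == 1
assert decoded == "\xe3"
```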


raulpuric avatar raulpuric commented on August 30, 2024

Yes; for clarity, --lazy forces lazy evaluation of text from disk instead of loading it all into memory.
PyTorch instantiates the dataset in each distributed worker, so to avoid duplicating the dataset in every process, --lazy creates a file structured like an array and loads the data directly from disk.

UnicodeEncodeError: 'ascii' codec can't encode character '\uf04a' in position 13051820: ordinal not in range(128)

This error definitely sounds like a python3 encoding problem. Keep me posted as to any encoding logic you have to change in your/our code, as I've encountered this problem occasionally in the past.
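The array-like lazy file described above can be sketched roughly as follows. This is a minimal illustration under assumed conventions (the class name, offset layout, and file format are hypothetical, not the repo's actual implementation): texts are concatenated into one flat file, an offset array records where each entry ends, and each access reads only the requested slice from disk.

```python
class LazyTextDataset:
    """Sketch of a disk-backed dataset: one flat file of concatenated
    texts plus a list of byte offsets marking entry boundaries."""

    def __init__(self, data_path, offsets):
        self.data_path = data_path  # flat file of concatenated UTF-8 texts
        self.offsets = offsets      # [0, end_of_entry_0, end_of_entry_1, ...]

    def __len__(self):
        return len(self.offsets) - 1

    def __getitem__(self, i):
        # Only the requested entry is read into memory, so each
        # distributed worker avoids holding the whole corpus.
        start, end = self.offsets[i], self.offsets[i + 1]
        with open(self.data_path, "rb") as f:
            f.seek(start)
            return f.read(end - start).decode("utf-8")
```

Because every worker opens the same file and seeks, the dataset object itself stays tiny regardless of corpus size.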


rainjacket avatar rainjacket commented on August 30, 2024

@raulpuric thanks, I'll look into the encoding thing when I have some free time.

On the memory issue, I currently run out of memory while generating the lazy files. Can just that part be done with one thread, and then multiprocess model training run on the lazy files?


raulpuric avatar raulpuric commented on August 30, 2024

Actually the processes are more likely running out of memory while preprocessing the dataset.

You're right that one process should preprocess the data. To achieve this, we normally run a single-process training job with --lazy and kill it once it has finished preprocessing the data and caching the lazy files to disk.

You only have to pay the preprocessing penalty once. The cached lazy files will then be loaded instantly by the multiprocess script, provided you give it the same data flags you gave the single-process script.

I realize now that this should be made clearer in the README.


rainjacket avatar rainjacket commented on August 30, 2024

I think the encoding issue mainly comes down to the fact that json.loads(json.dumps("\xe2")) = 'â'.
The only way I could fix it was by adding a bunch more filters on my input to prevent this case from happening (representations of Unicode characters showing up as plaintext in the raw data).
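A small demonstration of the two cases above, plus the kind of input filter described (a sketch, assuming a regex-based check; the actual filters used are not in the thread): JSON round-trips the real character unchanged, while the problem is raw data containing the escape sequence as literal plaintext.

```python
import json
import re

# The actual character round-trips through JSON unchanged:
assert json.loads(json.dumps("\xe2")) == "\xe2"  # i.e. 'â'

# The problem case is raw data containing the four characters
# '\', 'x', 'e', '2' as plaintext. A filter along these lines
# can flag such records before they enter the dataset:
LITERAL_ESCAPE = re.compile(r"\\x[0-9a-fA-F]{2}|\\u[0-9a-fA-F]{4}")

def has_literal_escape(text):
    """Return True if text contains a literal \\xNN or \\uNNNN sequence."""
    return bool(LITERAL_ESCAPE.search(text))

assert has_literal_escape(r"bad \xe2 record")
assert not has_literal_escape("clean â record")
```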


raulpuric avatar raulpuric commented on August 30, 2024

Could I get you to respond to #27 with the characters that did not work for you and the filters you used to get it working?

