Comments (3)
Thank you for reporting this! I’ll just summarize our off-GitHub discussion here for future reference.
The root of the problem is how worker processes are created on different platforms. On Linux, the default start method is 'fork', which is very fast and lets the workers reuse the memory of the parent process. On macOS (starting with Python 3.8) and Windows, the default is 'spawn', which requires all objects loaded in the parent process to be pickled and then unpickled in each worker process (so every worker holds its own copy in memory). For the larger model files, e.g. the German web and social media model, this can take a substantial amount of time. The fix in the commit above is to prefer 'fork' over the default method whenever the operating system supports it. That covers Linux and macOS, but not Windows. Note that according to this issue, 'fork' might lead to crashes of the worker processes on macOS (which is why 'spawn' is now the default there). If this turns out to be the case for SoMeWeTa, please comment again or open a new issue.
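For future readers, here is a minimal sketch of the "prefer 'fork' if available" idea using only the standard `multiprocessing` module. It is not the code from the commit; `preferred_context` and `token_count` are hypothetical names chosen for illustration, and `token_count` merely stands in for real tagging work.

```python
import multiprocessing


def preferred_context():
    """Return a 'fork' context where the OS supports it, else the default one."""
    if "fork" in multiprocessing.get_all_start_methods():
        return multiprocessing.get_context("fork")
    # Fall back to the platform default, e.g. 'spawn' on Windows.
    return multiprocessing.get_context()


def token_count(sentence):
    # Hypothetical stand-in for real per-sentence work such as tagging.
    return len(sentence.split())


if __name__ == "__main__":
    ctx = preferred_context()
    print("start method:", ctx.get_start_method())
    # The pool is created from the chosen context, so its workers are
    # started with that context's start method.
    with ctx.Pool(processes=2) as pool:
        print(pool.map(token_count, ["Das ist ein Test .", "Noch ein Satz ."]))
```

Using a context object instead of `multiprocessing.set_start_method` keeps the choice local to this pool and does not change the global default for other code in the same process.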
from someweta.
Pickling/unpickling the model file creates a large startup delay, but that can't be the only issue. Processing is also extremely slow in the long run. For pickling alone to explain that, pickling/unpickling a sentence would have to take much longer than actually tagging it.
The workaround should be fine, though. AFAIK, the problems with 'fork' are related to the use of UI code and other macOS frameworks and shouldn't affect standalone Python scripts.
from someweta.
I did a few timing experiments and this is what I found:
- Start-up overhead for the 'spawn' method scales linearly with the number of processes; i.e. creating 8 worker processes takes twice as much time as creating 4 worker processes. It seems that Python creates the worker processes sequentially and not in parallel.
- Parallel efficiency is as follows (on a Windows machine using 'spawn'):
  - 2 workers: 0.91
  - 4 workers: 0.77
  - 6 workers: 0.70
  - 8 workers: 0.63
This seems to be similar to the efficiencies you would see for MPI-parallelized code (see, for example, Figure 2 in this random article) and is probably due to synchronization overhead.
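To make these numbers easier to interpret, here is a small sketch that converts them into speedups. It assumes the usual definition of parallel efficiency, i.e. speedup divided by the number of workers, where speedup is single-worker time divided by n-worker time; the comment does not spell out the definition, so treat that as an assumption. The function names are made up for illustration.

```python
def parallel_efficiency(t_single, t_parallel, workers):
    """Efficiency = speedup / workers, with speedup = t_single / t_parallel."""
    return (t_single / t_parallel) / workers


def implied_speedup(workers, efficiency):
    """Invert the definition above: speedup = workers * efficiency."""
    return workers * efficiency


# Efficiencies reported above (Windows, 'spawn').
measured = {2: 0.91, 4: 0.77, 6: 0.70, 8: 0.63}

for workers, efficiency in measured.items():
    speedup = implied_speedup(workers, efficiency)
    print(f"{workers} workers: about {speedup:.2f}x faster than one worker")
```

Under that assumption, 8 workers still give roughly a 5x speedup, so adding workers helps, just with diminishing returns.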
I suspect that the synchronization overhead is unavoidable if SoMeWeTa should also be able to do parallel tagging when reading from STDIN. However, if you have a large number of files that you want to tag in parallel, you could use the somewe-tagger-multifile script in the utils directory of this repository in combination with GNU parallel to achieve higher efficiencies:
parallel -j 8 -X "somewe-tagger-multifile --tag <model> --output-prefix tagged/ --output-suffix ''" ::: tokenized/*.txt
from someweta.