Comments (2)
You are right. Currently, as a developer, you need to implement parallel processing yourself, largely because parallel tagging was added quickly as an afterthought. However, cli.parallel_tagging does almost what you want: you only need to change a few parts so that it returns its output instead of printing it straight away.
Here I've turned it into a generator that yields tagged sentences, i.e. lists of (word, tag)-tuples. If you use it on XML input, note that tuples corresponding to XML tags have length 1 (only the word and no pos tag).
#!/usr/bin/env python3

import multiprocessing
import threading

from someweta import ASPTagger
import someweta.cli


def parallel_tagging(corpus, asptagger, parallel, xml=False):
    """Tag file object `corpus` using `parallel` worker processes."""
    def output_result(data, xml):
        if xml:
            i, result, lines, word_indexes = data
            tags = {idx: (t[1],) for idx, t in zip(word_indexes, result)}
            return [(t,) + tags.get(idx, ()) for idx, t in enumerate(lines)]
        else:
            i, result = data
            return result

    sentinel = someweta.cli.Sentinel()
    processes = min(parallel, multiprocessing.cpu_count())
    input_queue = multiprocessing.Queue(maxsize=processes * 100)
    output_queue = multiprocessing.Queue(maxsize=processes * 100)
    producer = threading.Thread(
        target=someweta.cli.fill_input_queue,
        args=(input_queue, corpus, processes, sentinel, xml),
    )
    with multiprocessing.Pool(
        processes=processes,
        initializer=someweta.cli.process_input_queue,
        initargs=(asptagger.tag_sentence, input_queue, output_queue, sentinel, xml),
    ):
        producer.start()
        observed_sentinels = 0
        current = 0
        cached_results = {}
        while True:
            data = output_queue.get()
            if isinstance(data, someweta.cli.Sentinel):
                observed_sentinels += 1
                if observed_sentinels == processes:
                    break
                else:
                    continue
            i = data[0]
            cached_results[i] = data
            while current in cached_results:
                yield output_result(cached_results[current], xml)
                del cached_results[current]
                current += 1
        corpus_size = input_queue.get()
        producer.join()


tagger = ASPTagger()
tagger.load("german_web_social_media_2020-05-28.model")
with open("input.txt", encoding="utf-8") as f:
    for sentence in parallel_tagging(f, tagger, parallel=4, xml=False):
        print("\n".join(["\t".join(t) for t in sentence]))
        print()
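Regarding the note on XML input: since tuples for XML tags have length 1 and tuples for tokens have length 2, downstream code can tell them apart by length alone. The following is a minimal sketch of that idea and does not need SoMeWeTa to run. The helper names `split_xml_and_tokens` and `format_sentence` are made up for this illustration, and the example sentence (including its tags) is hand-written, not actual tagger output.

```python
def split_xml_and_tokens(sentence):
    """Separate XML tag lines from (word, tag) pairs in one yielded sentence."""
    # Length-1 tuples carry only an XML tag line, no POS tag.
    xml_lines = [item[0] for item in sentence if len(item) == 1]
    # Length-2 tuples are regular (word, tag) pairs.
    tokens = [item for item in sentence if len(item) == 2]
    return xml_lines, tokens


def format_sentence(sentence):
    """Render one item per line; "\t".join works for both tuple lengths."""
    return "\n".join("\t".join(item) for item in sentence)


# Hand-written stand-in for one item yielded with xml=True (illustrative only).
example = [("<s>",), ("Hallo", "ITJ"), ("Welt", "NN"), ("</s>",)]

xml_lines, tokens = split_xml_and_tokens(example)
print(xml_lines)
print(tokens)
print(format_sentence(example))
```

Because `"\t".join` on a one-element tuple just returns that element, the printing loop from the script above should also work unchanged for XML input: tag lines are written without a trailing tab.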
Thank you so much for your prompt response. I really appreciate it!