What is the scenario for adding a new category?

How to add new category? about nyan HOT 3 OPEN

b0tm1nd commented on May 25, 2024

How to add new category?

from nyan.

Comments (3)

b0tm1nd commented on May 25, 2024

From what I understand, we need a new dataset in .jsonl format with text and labels.
Could you share the datasets this was trained on? Especially for not_news.
From reading about the Telegram contest, I see that for Russian content they mostly used the lenta.ru archive.
But what about Ukrainian?
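
For illustration, a minimal sketch of what such a .jsonl file could look like, assuming one JSON object per line with `text` and `labels` keys. The key names are inferred from this comment and the sample records are made up; neither is confirmed against the repository's actual files.

```python
import json

def dump_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

def load_jsonl(payload):
    """Parse JSON Lines back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in payload.split("\n") if line.strip()]

# Hypothetical records; "labels" is a list so one text can carry several categories.
records = [
    {"text": "Example article about a bribery investigation.", "labels": ["corruption"]},
    {"text": "Subscribe to our channel for more updates!", "labels": ["not_news"]},
]

payload = dump_jsonl(records)
restored = load_jsonl(payload)
print(payload)
```

`ensure_ascii=False` keeps Cyrillic text readable in the file instead of escaping it as `\uXXXX` sequences.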

NyanNyanovich commented on May 25, 2024

Here you go: https://github.com/NyanNyanovich/nyan/releases/download/can_annot/cat_markup.tar.gz
I used Lenta and GPT-4 annotations. Here is the script to query GPT-4: https://github.com/NyanNyanovich/nyan/blob/master/scripts/annotate_categories.py
And the training script: https://github.com/NyanNyanovich/nyan/blob/master/scripts/train_clf.py

b0tm1nd commented on May 25, 2024

@NyanNyanovich Thanks, I had already found train_clf.py and tried to train it with a single category, but then the classifier failed when running send.sh, probably because "not_news" was missing.

I took a dataset from a Ukrainian news website that tags its articles, kept only the items related to corruption, and got about 700 entries, which I merged with categories_train.jsonl.

After training I started getting much worse results: many items from war/politics now trigger corruption and end up as "unknown".
I found that the median text length in the added dataset is 1000+ characters, while in yours it is about 450.
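
The length comparison above can be sketched as follows. This is only an illustration on toy data: the `text` key, the `truncate_text` helper, and the 450-character budget are assumptions, and nothing in this thread establishes that truncating the added texts actually improves accuracy.

```python
from statistics import median

def median_text_length(records):
    """Median length, in characters, of the "text" field."""
    return median(len(r["text"]) for r in records)

def truncate_text(text, max_chars):
    """Cut text to at most max_chars, preferring the last space before the limit."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    head, sep, _ = cut.rpartition(" ")
    # Fall back to a hard cut when there is no usable space to break on.
    return head.rstrip() or cut

# Toy stand-ins for the existing and the added dataset (not real data).
existing = [{"text": "short item " * 40, "labels": ["war"]}]
added = [{"text": "long tagged article " * 60, "labels": ["corruption"]}]

budget = 450  # hypothetical target, taken from the median mentioned above
trimmed = [{**r, "text": truncate_text(r["text"], budget)} for r in added]

print(median_text_length(existing), median_text_length(added), median_text_length(trimmed))
```

Checking the medians before and after merging at least shows whether the new category differs from the rest in length as well as in topic.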

So I have a few questions about building a dataset for the new category:

Does a smaller article size improve accuracy?
Will multiple labels for the new category (like ["corruption", "war"] or ["corruption", "politics"]) increase accuracy?
What was your strategy (or was it random?) for selecting news for your training dataset? The current distribution is:

Labels sorted by Count:
politics: 1200 occurrences
war: 1062 occurrences
economy: 760 occurrences
incident: 699 occurrences
not_news: 451 occurrences
entertainment: 426 occurrences
tech: 418 occurrences
sports: 324 occurrences
science: 138 occurrences
other: 37 occurrences
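
A count like the one above can be reproduced with a short script. This is a sketch that assumes each record has a `labels` list (the key name is an assumption); the sample records are made up and do not come from the real training file.

```python
from collections import Counter

def count_labels(records):
    """Count how often each label occurs; multi-label records count once per label."""
    counts = Counter()
    for record in records:
        counts.update(record["labels"])
    return counts

def format_counts(counts):
    """Render counts sorted from most to least frequent."""
    return [f"{label}: {n} occurrences" for label, n in counts.most_common()]

# Toy records only, to show the mechanics.
sample = [
    {"text": "...", "labels": ["corruption", "politics"]},
    {"text": "...", "labels": ["corruption"]},
    {"text": "...", "labels": ["war"]},
]

counts = count_labels(sample)
for line in format_counts(counts):
    print(line)
```

Running the same count on the merged file shows how large the new category is relative to the existing ones.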

What are the other hints you might suggest?
