Multilabel Classification Dataset Loading about pyss3 HOT 4 CLOSED

sergioburdisso commented on May 18, 2024 4

from pyss3.

Comments (4)

sergioburdisso commented on May 18, 2024 1

😊 Following your suggestion, I've added a method called "load_from_files_multilabel" to carry out this task; it supports both dataset structures/formats. I've decided to put "multilabel" at the end so that, as with classify and classify_multilabel, any method XXX related to multilabel will have "_multilabel" as a suffix. This way it will be more consistent and easier for users to remember.

By catA I meant the label for category A, I'll edit my message to clarify this point (and to match my example with yours).

Now, following your example, you should be able to load your dataset simply with:

x_data, y_data = Dataset.load_from_files_multilabel(
    "path/to/text.txt",
    "path/to/labels.txt"
)

In case you need a different separator for labels, for instance, using commas, you could use the sep_label argument as follows:

x_data, y_data = Dataset.load_from_files_multilabel(
    "path/to/text.txt", "path/to/labels.txt",
    sep_label=","
)

And, finally, in case you need to use a document separator other than '\n', for instance, "\n---\n" you can use the sep_doc argument as follows:

x_data, y_data = Dataset.load_from_files_multilabel(
    "path/to/text.txt", "path/to/labels.txt",
    sep_doc="\n---\n"
)
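To make the expected file layout concrete, here is a minimal pure-Python sketch of what such a loader does conceptually. It is not the pyss3 implementation: the function name load_multilabel_sketch is hypothetical, and the defaults (";" for labels, "\n" for documents) are assumptions taken from this thread. The i-th document is paired with the i-th line of the labels file.

```python
import os
import tempfile


def load_multilabel_sketch(path_docs, path_labels, sep_label=";", sep_doc="\n"):
    """Hypothetical stand-in (not pyss3 code): pair document i with label line i."""
    with open(path_docs, encoding="utf-8") as f:
        # Split the raw text on the document separator and drop empty chunks.
        docs = [d.strip() for d in f.read().split(sep_doc) if d.strip()]
    with open(path_labels, encoding="utf-8") as f:
        # One line per document; an empty line means "no labels".
        labels = [
            [lab.strip() for lab in line.split(sep_label) if lab.strip()]
            for line in f.read().splitlines()
        ]
    if len(docs) != len(labels):
        raise ValueError(f"{len(docs)} documents but {len(labels)} label lines")
    return docs, labels


# Demo with throwaway files: comma-separated labels, "\n---\n" between documents.
with tempfile.TemporaryDirectory() as tmp:
    docs_path = os.path.join(tmp, "text.txt")
    labels_path = os.path.join(tmp, "labels.txt")
    with open(docs_path, "w", encoding="utf-8") as f:
        f.write("first document\nstill the first\n---\nsecond document\n")
    with open(labels_path, "w", encoding="utf-8") as f:
        f.write("label1,label2\nlabel3\n")
    x_data, y_data = load_multilabel_sketch(
        docs_path, labels_path, sep_label=",", sep_doc="\n---\n"
    )
```

With a custom sep_doc, a document can span several lines without being split, which is the reason for having two separate separator arguments.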

More details are given in the API documentation. 👍

Dataset
SemEval 2016 Task 5 sounds cool, feel free to send me the dataset, probably it'll be much better for a tutorial and a Live Demo than the one that I'm using now (toxic comments 💩).

sergioburdisso commented on May 18, 2024

Hi @angrymeir!
First of all, thanks for being interested in this project. Yeah, I agree that the file structure in the topic categorization tutorial is not well suited to multilabel classification; it follows the classic single-label dataset structure.
I don't have much previous experience working with multilabel classification. That's one of the main reasons I didn't implement full support for it in the first place. Fortunately, now is the time to implement it.

What do you think about having two separate classes for loading datasets from disk? One for "standard" single-label datasets (Dataset) and another for multilabel ones (MultiLabelDataset). For instance, for loading a dataset, we could use MultiLabelDataset.load_from_files.

Do you think we should provide support for another format/structure too?

For instance, having a file holding document name and category label pairs, like so:

doc1 label1
doc1 label2
doc2 label2
doc2 label3
doc3 label1
...

And a folder containing the actual documents. In that case, we should let the user specify the file where these pairs are (and also the separator/delimiter used: tab? comma? etc.) and the path to the folder where the actual documents are.
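As a rough illustration of the pairs format, the sketch below groups "document label" lines into a mapping from each document name to its list of labels. The helper name parse_label_pairs is hypothetical and not part of pyss3; sep=None means "split on any whitespace", which is just one possible default.

```python
from collections import defaultdict


def parse_label_pairs(lines, sep=None):
    """Hypothetical helper: group 'doc<sep>label' lines into {doc: [labels]}."""
    labels_by_doc = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split only once so the label itself may contain the separator.
        doc, label = line.split(sep, 1)
        labels_by_doc[doc].append(label.strip())
    return dict(labels_by_doc)


pairs = """doc1 label1
doc1 label2
doc2 label2
doc2 label3
doc3 label1""".splitlines()

labels_by_doc = parse_label_pairs(pairs)
```

Each key would then be resolved to a file inside the documents folder to read the actual text.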

The same should apply to your approach. The user should be able to provide the separator for the labels in the labels.txt file, which in your case is a semicolon (;).

What do you think the load_from_files arguments should be? What do you think about this approach:

x_train, y_train = MultiLabelDataset.load_from_files(docs="a file or folder", labels="a file", sep=";")

If docs is a folder, then the label file should have a format like the one I described above; if it is a file, it should have your structure.
The sep argument is by default "\s" if docs is a folder and ";" if it is a file (or should it be a comma like in a CSV?)
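The default rule just described could be sketched as follows. This only mirrors the proposal in this comment; default_label_sep is a hypothetical helper, not an existing pyss3 function.

```python
import os
import tempfile


def default_label_sep(docs_path):
    """Hypothetical helper: pick the proposed default label separator.

    Folder of documents -> pairs file, whitespace-separated ("\\s").
    Single documents file -> line-aligned labels file, ';'-separated.
    """
    return r"\s" if os.path.isdir(docs_path) else ";"


# Demo on a throwaway folder and file.
with tempfile.TemporaryDirectory() as tmp:
    docs_file = os.path.join(tmp, "text.txt")
    with open(docs_file, "w", encoding="utf-8") as f:
        f.write("a document\n")
    sep_for_folder = default_label_sep(tmp)
    sep_for_file = default_label_sep(docs_file)
```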

Can you recommend any particular dataset to work with while implementing full multilabel support? This dataset will also be the one used for the tutorial introducing multilabel support, similar to the ones that are already available. I'm currently using a Kaggle dataset for toxic comment classification.

sergioburdisso commented on May 18, 2024

I just realized we would need two sep arguments so the user can specify the separator used for labels and also the one used for documents. Otherwise, documents containing new lines would be split into separate documents, so it is better to let the user specify which separator/delimiter indicates where each document begins/ends (although it could be '\n' by default). Something like:

x_train, y_train = MultiLabelDataset.load_from_files(
    docs="the file or folder where the documents are",
    labels="the file containing the labels",
    sep_label="the separator used for labels e.g. ;",
    sep_doc="the separator used for documents e.g. \n"
)

What do you think about that?

angrymeir commented on May 18, 2024

Hey @sergioburdisso,

MultiLabelDataset.load_from_files vs Dataset.load_multilabel_from_file
I think, for consistency reasons, the decision whether to use a different class (MultiLabelDataset) or an additional method (e.g. load_multilabel_from_file) in the class Dataset depends on how multilabel data should be treated in general in this project.
Would you also create a different class for multilabel evaluation or rather add the functionality to the existing class?

Format/Structure
Assuming that catA corresponds to a combination of labels like:

toxic = -1, severe_toxic = 0, obscene = -1, threat = 1, insult = -1, identity_hate = 1

This would imply that there were 3^6 possible categories (in the toxic comment dataset) which seems just not feasible to annotate...
Would a combination of both approaches make sense?
Meaning having one file either containing the text or the link to the documents and another file that contains the labels as described in my initial suggestion?
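To spell out the 3^6 figure mentioned above: with six labels that can each take one of three values (-1, 0, 1), every distinct combination would become its own category. A quick check, purely illustrative:

```python
from itertools import product

n_labels = 6          # the six labels in the toxic comment dataset
values = (-1, 0, 1)   # the three possible values per label, as in the example

# Every possible assignment of a value to each of the six labels.
combinations = list(product(values, repeat=n_labels))
n_combined = len(combinations)  # equals 3 ** 6
```

This gives 729 combined categories, compared with only six labels when each one is annotated separately.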

Giving the user the option to specify both delimiters absolutely makes sense! I also agree about the default parameters.

Dataset
We're currently working with a parsed version of SemEval 2016 Task 5; I can provide you with the dataset if you would like.
The challenge with this dataset is that the number of labels for a given text ranges from 0 to 8.
