The zippy from ioepas

zippy's Issues

Remove redundant data

The data set contains a lot of redundant data and needs to be cleaned before it is explored.

Fix 'Use of unsafe yaml load. Allows instantiation of arbitrary objects. Consider yaml.safe_load().' issue in src\utils\params.py

CodeFactor found an issue: Use of unsafe yaml load. Allows instantiation of arbitrary objects. Consider yaml.safe_load().

It's currently on:
src\utils\params.py:33
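
A minimal sketch of the suggested change, assuming the file reads its parameters with PyYAML. The function name `load_params` and the sample keys are hypothetical; the actual code in src\utils\params.py is not shown in this issue.

```python
import yaml  # PyYAML, a third-party package


def load_params(text):
    """Parse YAML text into plain Python types (dict, list, str, numbers)."""
    # safe_load only builds basic types, so a tag such as
    # !!python/object/apply cannot instantiate arbitrary objects.
    return yaml.safe_load(text)


params = load_params("model:\n  lr: 0.01\n  epochs: 5\n")

# A document carrying a Python-specific tag is rejected rather than executed.
try:
    load_params("!!python/object/apply:os.system ['echo unsafe']")
    rejected = False
except yaml.YAMLError:
    rejected = True
```

If the parameters file only holds plain mappings, lists and scalars, switching to `yaml.safe_load` should not change what is loaded.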

Implement remote data storage

For storing data, we need a remote storage solution (AWS/Azure/GCP).

Set up Travis CI

Configure Travis CI for:

Linting
Unit Tests (#27)

Extracting thread information

Find the reply to an email.
Extract thread information from emails.
Check if an email is a draft.
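
One possible starting point for the first two items, assuming the emails carry the standard Message-ID, In-Reply-To and References headers (the data set may not include them, in which case threads would have to be inferred some other way). The function names are hypothetical, and draft detection is not covered here.

```python
from email import message_from_string


def thread_info(raw_email):
    """Return the headers that link an email to its thread."""
    msg = message_from_string(raw_email)
    return {
        "message_id": msg.get("Message-ID"),
        "in_reply_to": msg.get("In-Reply-To"),
        # References lists the ancestor Message-IDs, separated by whitespace.
        "references": (msg.get("References") or "").split(),
    }


def find_replies(raw_emails):
    """Map each parent Message-ID to the Message-IDs of its direct replies."""
    replies = {}
    for raw in raw_emails:
        info = thread_info(raw)
        if info["in_reply_to"]:
            replies.setdefault(info["in_reply_to"], []).append(info["message_id"])
    return replies


parent = "Message-ID: <a@x>\nSubject: Hi\n\nbody\n"
child = (
    "Message-ID: <b@x>\nIn-Reply-To: <a@x>\nReferences: <a@x>\n"
    "Subject: Re: Hi\n\nbody\n"
)
replies = find_replies([parent, child])
```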

Split from #2.

Enable stricter warnings on docs and set up Travis for docs

After #8 is merged, we can enable stricter docs warnings and even enable them in Travis.

Intent Analysis: Parakweet data set

The Parakweet data set consists of more than 3,000 labeled emails. Each email is labeled according to whether it requires action. Emails that require action by the user would be more important.

Fix 'Use of unsafe yaml load. Allows instantiation of arbitrary objects. Consider yaml.safe_load().' issue in scripts\utils.py

CodeFactor found an issue: Use of unsafe yaml load. Allows instantiation of arbitrary objects. Consider yaml.safe_load().

It's currently on:
scripts\utils.py:13

Intent Analysis: BC3 Corpus

For a better ranking algorithm, we could analyze the intent of the email using speech acts. The intent would determine if the email requires action. The BC3 Corpus consists of 40 email threads with sentences labeled with speech acts and subjectivity. You can check the dataset information here. The dataset is in XML.

For now, is anyone available to parse the XML to a proper dataframe?
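
A rough sketch of how the parsing could start, using only the standard library. The element names in `SAMPLE` (`thread`, `listno`, `DOC`, `Subject`, `Sent`) are an assumed layout for illustration and should be checked against the real corpus files. Each sentence becomes one row, and the resulting list of dicts could be passed to `pandas.DataFrame`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real corpus schema may differ.
SAMPLE = """<root>
  <thread>
    <listno>001</listno>
    <DOC>
      <Subject>Meeting</Subject>
      <Text><Sent id="1.1">Can you send the agenda?</Sent><Sent id="1.2">Thanks.</Sent></Text>
    </DOC>
  </thread>
</root>"""


def parse_threads(xml_text):
    """Flatten threads into one row (dict) per sentence."""
    root = ET.fromstring(xml_text)
    rows = []
    for thread in root.iter("thread"):
        thread_id = thread.findtext("listno")
        for doc in thread.findall("DOC"):
            subject = doc.findtext("Subject")
            for sent in doc.iter("Sent"):
                rows.append({
                    "thread_id": thread_id,
                    "subject": subject,
                    "sentence_id": sent.get("id"),
                    "sentence": sent.text,
                })
    return rows


rows = parse_threads(SAMPLE)
```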

Waste 1-3 weeks for data processing and ensure everything is correct.
Processing is never complete. Keep making it better with time and do not ever guarantee it's correct.

Things to do:

Split headers and message. Name them raw_headers and raw_message. Ensure the following correctness checks hold:
- headers + message = email
- headers contains only headers (no messages)
- message contains only a message (no headers)
- message contains forwarded and replied contents as well
Retrieve every possible header present in the message and create a column for it. How to test correctness? Name the columns raw_<header_name>.
Split all headers into their respective columns. Ensure everything is correct and accounted for.
Get information from the From and To headers. Test.
Get datetime, day, local datetime, year and other things from raw_datetime. Test.
Extract information from the subject, such as is_reply/is_forward.
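
The steps above can be sketched roughly as follows, under a few assumptions: emails use "\n" line endings, the first blank line separates headers from the message, and replies and forwards are marked by "Re:" and "Fw:"/"Fwd:" subject prefixes. The function names are hypothetical. Columns are named after the header itself, so the Date header ends up in raw_date here rather than raw_datetime.

```python
import re
from email import message_from_string
from email.utils import parsedate_to_datetime


def split_email(raw_email):
    """Split at the first blank line; the separator stays with the headers."""
    head, sep, body = raw_email.partition("\n\n")
    return head + sep, body


def subject_flags(subject):
    """Guess reply/forward status from the subject prefix."""
    return {
        "is_reply": bool(re.match(r"\s*re\s*:", subject, re.IGNORECASE)),
        "is_forward": bool(re.match(r"\s*fwd?\s*:", subject, re.IGNORECASE)),
    }


def process(raw_email):
    raw_headers, raw_message = split_email(raw_email)
    # Correctness check from the list: headers + message = email.
    assert raw_headers + raw_message == raw_email
    row = {"raw_headers": raw_headers, "raw_message": raw_message}
    # One raw_<header_name> column per header found in the email.
    for name, value in message_from_string(raw_headers).items():
        row["raw_" + name.lower().replace("-", "_")] = value
    row.update(subject_flags(row.get("raw_subject", "")))
    if "raw_date" in row:
        row["datetime"] = parsedate_to_datetime(row["raw_date"])
    return row


sample = (
    "Date: Mon, 14 May 2001 16:39:00 -0700\n"
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: RE: Budget\n"
    "\n"
    "Hello\n\n--- Forwarded ---\nold text\n"
)
row = process(sample)
```

Keeping the blank-line separator with the headers is one way to make the "headers + message = email" check exact. Forwarded and replied content stays in raw_message because only the first blank line is used as the split point.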

Created a new issue (#49) for the remaining tasks:

Extract threaded information from subject
Visit all steps again and ensure everything was done properly.
Document the data dictionary, an explanation of how it was retrieved (the process), and the method of reproducing it.
Perform basic analysis of the resultant data
