zippy's People

Contributors

amritghimire, dependabot-preview[bot], dependabot-support, dependabot[bot], divyamani1, dpakach, skshetry

zippy's Issues

Remove redundant data

The dataset contains a lot of redundant data and needs to be cleaned before we start exploring it.

Intent Analysis: Parakweet data set

The Parakweet dataset consists of more than 3,000 labeled emails. Each email is labeled according to whether it requires action. Emails that require action by the user would be treated as more important.

Intent Analysis: B3C Corpus

For a better ranking algorithm, we could analyze the intent of the email using speech acts. The intent would determine if the email requires action. The B3C Corpus consists of 40 email threads with sentences labeled with speech acts and subjectivity. You can check the dataset information here. The dataset is in XML.

For now, is anyone available to parse the XML into a proper dataframe?
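
As a starting point for the parsing task, here is a minimal sketch that flattens an XML file into one row per sentence using only the standard library. The element and attribute names (thread, DOC, Sent, id) are assumptions about the file layout, not taken from the dataset documentation, and the function name sentences_to_rows is hypothetical. Adjust them after inspecting the actual files.

```python
import xml.etree.ElementTree as ET


def sentences_to_rows(xml_text):
    """Flatten thread -> email -> sentence XML into a list of row dicts.

    ASSUMPTION: the tag names "thread", "DOC", "Sent" and the attribute
    "id" are placeholders; replace them with the names in the real corpus.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for thread_index, thread in enumerate(root.iter("thread")):
        for email_index, doc in enumerate(thread.iter("DOC")):
            for sent in doc.iter("Sent"):
                rows.append(
                    {
                        "thread": thread_index,
                        "email": email_index,
                        "sentence_id": sent.get("id"),
                        "text": (sent.text or "").strip(),
                    }
                )
    return rows
```

The resulting list of dicts can be passed to pandas.DataFrame(rows) to get a dataframe. The speech act and subjectivity labels would then need to be joined on the sentence identifier, depending on how the annotations are stored.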

Preprocessing of data

The data we get from make data/raw/emails.csv is very raw and crude. We need to process it into cleaner versions.

We can do either of the following:

  1. Waste 1-3 weeks for data processing and ensure everything is correct.
  2. Accept that processing is never complete. Keep improving it over time and do not ever guarantee it's correct.

Things to do:

  • Split headers and message. Name them raw_headers and raw_message. Ensure the following:
    • headers + message = email
    • headers contains only headers (no message content)
    • message contains only the message (no headers)
    • message contains forwarded and replied contents as well
  • Retrieve every possible header present in the message and create a column for each. Name the columns raw_<header_name>. How to test correctness?
  • Split all headers into their respective columns. Ensure everything is correct and accounted for.
  • Get information from the From and To headers. Test.
  • Get datetime, day, local datetime, year, and other fields from raw_datetime. Test.
  • Extract information from the subject, such as is_reply and is_forward.
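
The steps above could be sketched roughly as follows, using only the Python standard library. This is an illustration under assumptions, not the project's implementation: the function names split_email and extract_fields, the derived column names (from_addresses, to_addresses, datetime_local, datetime_utc), and the subject prefixes treated as reply/forward markers (Re:, Fw:, Fwd:) are all hypothetical choices.

```python
import re
from datetime import timezone
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime


def split_email(raw):
    """Split a raw email at the first blank line into (raw_headers, raw_message).

    The blank-line separator stays with the headers, so the two pieces
    always concatenate back to the original string.
    """
    match = re.search(r"\r?\n\r?\n", raw)
    if match is None:
        return raw, ""  # no body found: treat everything as headers
    return raw[: match.end()], raw[match.end():]


def extract_fields(raw):
    """Build one row (a dict) of raw and derived columns for a single email."""
    raw_headers, raw_message = split_email(raw)
    assert raw_headers + raw_message == raw  # headers + message = email

    parsed = message_from_string(raw_headers)
    row = {"raw_headers": raw_headers, "raw_message": raw_message}

    # One raw_<header_name> column per header found in this email.
    for name, value in parsed.items():
        row["raw_" + name.lower().replace("-", "_")] = value

    # Address lists from the From and To headers (empty entries dropped).
    row["from_addresses"] = [a for _, a in getaddresses(parsed.get_all("From", [])) if a]
    row["to_addresses"] = [a for _, a in getaddresses(parsed.get_all("To", [])) if a]

    # Date parts; a malformed Date header raises an exception here.
    if parsed["Date"]:
        local = parsedate_to_datetime(parsed["Date"])
        row["datetime_local"] = local
        row["datetime_utc"] = local.astimezone(timezone.utc)
        row["year"] = local.year
        row["day"] = local.strftime("%A")

    # ASSUMPTION: replies and forwards are marked by these subject prefixes.
    subject = parsed["Subject"] or ""
    row["is_reply"] = bool(re.match(r"\s*re\s*:", subject, re.IGNORECASE))
    row["is_forward"] = bool(re.match(r"\s*fwd?\s*:", subject, re.IGNORECASE))
    return row
```

The assertion inside extract_fields is one cheap way to test the "headers + message = email" requirement on every row. Checking that raw_headers contains no message content is harder and would probably need spot checks on a sample.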

Created a new issue (#49) for the remaining tasks:

  • Extract threaded information from subject
  • Visit all steps again and ensure everything was done properly.
  • Document the data dictionary, an explanation of how the data was retrieved (the process), and the method of reproducing it.
  • Perform basic analysis of the resulting data