convertEnronToCsv

This repository contains the source code and the executable JAR of a Java application that builds a CSV dataset file from the Enron data folders.

The application converts the unstructured Enron dataset into a structured dataset that can serve as input for data cleaning operations during the preprocessing stage.

The unstructured dataset is available to download from here.

Code organization / Directory structure

downloadEnronDataset.sh : Shell script that downloads the Enron dataset file and extracts it.

All executable code is present in the /executable directory.

  • execute.sh : Driver program. Shell script used to run the Java application.
  • createJar.sh : Shell script that compiles the Maven project, builds the .jar file, and copies it into this directory.
  • enron_to_csv-1.0.jar : .jar file encapsulating the Java application.

The enron_to_csv/ directory is the Maven project containing all the Java source code.

The structuredData/ directory contains the output CSV file generated by this application.

Running the application

Understanding flow of operations

This application takes the path to the maildir directory as input and produces one output CSV file.

  • The output CSV file contains the raw email text.

CSV format: "id","message"
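Raw email text usually contains commas, double quotes, and line breaks, so each field has to be quoted for the "id","message" layout to stay parseable. The following is a minimal sketch of one way to format such a row, assuming RFC 4180 style quoting (embedded double quotes are doubled). The class and method names are hypothetical and are not taken from the application's source.

```java
// Hypothetical helper, not part of the repository: formats one row of the
// "id","message" CSV layout described above.
final class CsvRowSketch {

    // Wraps a field in double quotes and doubles any embedded double quote
    // (assumption: RFC 4180 style escaping; the application may differ).
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Builds a single CSV row from an id and the raw email text.
    static String formatRow(String id, String message) {
        return quote(id) + "," + quote(message);
    }

    public static void main(String[] args) {
        // The message keeps its line break because the field is quoted.
        System.out.println(formatRow("1", "Subject: Hi\nShe said \"ok\""));
    }
}
```

A CSV reader that follows the same quoting rule will read the multi-line message back as a single field.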

This application requires 2 input parameters:

  • overAllLimitier: the upper limit on the total number of emails read and written to the output CSV dataset file. -1 indicates no limit.
  • emailLimiterPerUser: the upper limit on the number of emails per user read and written to the output CSV dataset file. -1 indicates no limit.
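The way the two limits interact can be illustrated with a short sketch. It assumes that emails are processed one user at a time, that the per-user limit ends the current user's mailbox early, and that the overall limit ends the whole run. The class, method, and variable names are hypothetical, and the application's actual traversal may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the two limits; not the application's code.
final class EmailLimitSketch {

    // A limit of -1 means "no limit"; otherwise count must stay below it.
    static boolean underLimit(int count, int limit) {
        return limit == -1 || count < limit;
    }

    // emailsPerUser holds one inner list of emails per user mailbox.
    static List<String> select(List<List<String>> emailsPerUser,
                               int overallLimit, int perUserLimit) {
        List<String> selected = new ArrayList<>();
        for (List<String> userEmails : emailsPerUser) {
            int takenForUser = 0;
            for (String email : userEmails) {
                if (!underLimit(selected.size(), overallLimit)) {
                    return selected; // overall limit reached: stop everything
                }
                if (!underLimit(takenForUser, perUserLimit)) {
                    break; // per-user limit reached: move to the next user
                }
                selected.add(email);
                takenForUser++;
            }
        }
        return selected;
    }
}
```

Under these assumptions, passing -1 for both arguments selects every email, while an overall limit of 4 with a per-user limit of 2 takes at most two emails from each user until four have been collected.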

Steps to run the application

  • Navigate to the /executable directory
  • Download and extract the Enron dataset by executing the script downloadEnronDataset.sh. To execute the script, run the following command:
    • ./downloadEnronDataset.sh
  • Execute the jar application by running the following command:
    • ./execute.sh -1 -1

Environment specifications

The following are the specifications of the environment in which this application was last executed:

  • Maven version: 3.8.6
  • openjdk version: "11.0.16.1" 2022-08-12
  • OpenJDK Runtime Environment Homebrew (build 11.0.16.1+0)
  • OpenJDK 64-Bit Server VM Homebrew (build 11.0.16.1+0, mixed mode)

Contributors

mitrjain
