
entree's People

Contributors

francescoperera


entree's Issues

Class Balancing

Given the new file I/O pipeline, apply class balancing.
Plan:

  • create a map of counters, one per class
  • as each data point is read and labeled, update the appropriate counter
  • once a class counter reaches the limit, ignore any further data point that would be labeled with that class.
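
The plan above could be sketched roughly as follows. This is a minimal Python illustration, not Entree's actual code: the names `balance_classes`, `label_of`, and `limit` are hypothetical, and it assumes data points arrive as a stream and that a label can be computed for each one.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def balance_classes(
    points: Iterable[T],
    label_of: Callable[[T], Hashable],  # hypothetical labeling function
    limit: int,                         # hypothetical per-class cap
) -> Iterator[Tuple[Hashable, T]]:
    """Yield (label, point) pairs, keeping at most `limit` points per class."""
    counts: Counter = Counter()  # one counter per class, created lazily
    for point in points:
        label = label_of(point)
        if counts[label] >= limit:
            # This class is full: ignore any further point with this label.
            continue
        counts[label] += 1
        yield label, point
```

Because the function is a generator, it can sit in a streaming pipeline without holding the whole data set in memory. Note that it keeps the first `limit` points per class in arrival order; it does not sample randomly.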

Fix S3 file listing bug

Currently, S3 file listing works fine when the files are under a folder path.
It does not seem to work when the files sit directly under the bucket (without folders).

Create a branch, use S3 dev mode, and fix this using the data in the cb-json-data bucket.

Update Docs

update existing files in doc
update README
add doc for user-input.json

Default user input values

If important parameters in user-input.json are not present, fall back to default values. In particular, set up a default data format object.
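
One possible shape for this is a recursive merge of the parsed user input over a defaults dictionary. The sketch below is only an illustration: the key names `data_format`, `type`, `delimiter`, and `class_limit` are hypothetical placeholders, since the real schema of user-input.json is not given here.

```python
import copy
from typing import Any, Dict

# Hypothetical defaults; the actual parameter names in user-input.json may differ.
DEFAULTS: Dict[str, Any] = {
    "data_format": {"type": "json", "delimiter": None},
    "class_limit": 1000,
}


def with_defaults(user_input: Dict[str, Any],
                  defaults: Dict[str, Any] = DEFAULTS) -> Dict[str, Any]:
    """Return a new dict where missing keys are filled in from `defaults`."""
    merged = copy.deepcopy(defaults)  # never mutate the shared defaults
    for key, value in user_input.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested objects so a partial data format keeps the other defaults.
            merged[key] = with_defaults(value, merged[key])
        else:
            merged[key] = value
    return merged
```

With this approach a user who supplies only part of the data format object still gets the remaining default fields.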

Balance size of output file

Given the new file I/O pipeline, output file sizes are random.

  • Add a mutable integer variable to the pipeline.
  • As you write data points to the output, check that the number of lines in the output file does not exceed the limit. If it does, close the file, send it to S3, delete the file, update the variable, create a new file, and write the next line to it.
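
A rough sketch of this rotation logic in Python, under stated assumptions: `write_in_chunks`, `max_lines`, and `on_file_closed` are hypothetical names, and the S3 upload and local delete are left to the caller's `on_file_closed` callback rather than implemented here.

```python
import os
from typing import Callable, Iterable, List


def write_in_chunks(
    lines: Iterable[str],
    out_dir: str,
    max_lines: int,                        # hypothetical line limit per file
    on_file_closed: Callable[[str], None],  # e.g. upload to S3, then delete
    prefix: str = "part",
) -> List[str]:
    """Write `lines` to numbered files holding at most `max_lines` lines each."""
    paths: List[str] = []
    handle = None
    written = 0  # the mutable counter for the current file
    try:
        for line in lines:
            if handle is None:
                path = os.path.join(out_dir, f"{prefix}-{len(paths):05d}.txt")
                handle = open(path, "w", encoding="utf-8")
                paths.append(path)
                written = 0
            handle.write(line.rstrip("\n") + "\n")
            written += 1
            if written >= max_lines:
                # Limit reached: close, hand the file off, start fresh next time.
                handle.close()
                handle = None
                on_file_closed(path)
    finally:
        if handle is not None:
            handle.close()
    if handle is not None:
        # Hand off the last, partially filled file.
        on_file_closed(paths[-1])
    return paths
```

Passing the upload step in as a callback keeps the writer testable without S3 access. If an error interrupts the stream, the partially written file is closed but not handed off.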

Better define the data object creation methods

Optimize and improve the current process of creating a data object:

  • createDataObject
  • getKeyValuePair
  • createUnknownObjects

getKeyValuePair does not need all of its current arguments. How do we differentiate between
createDataObject and createUnknownObjects?

Train - Test split

Add functionality to divide the dump data into training and test files.
By default, do an 80/20 split, but let the user define these values.
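
As a minimal sketch, assuming the data points fit in memory, the split could look like the following. The names `split_train_test`, `train_fraction`, and `seed` are hypothetical; only the 80/20 default comes from the issue.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    points: Sequence[T],
    train_fraction: float = 0.8,  # 80/20 by default, overridable by the user
    seed: Optional[int] = None,   # fix the seed for a reproducible split
) -> Tuple[List[T], List[T]]:
    """Shuffle `points` and split them into (train, test) lists."""
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(points)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

For a streaming pipeline that cannot hold everything in memory, a per-point random draw against `train_fraction` would be an alternative, at the cost of only approximating the requested ratio.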

cfnMap as input

Potentially define cfnMap as an input in user-input.json, and use the current cfnMap in CFNMappingCook as the default map.

automate column field name mapping

Column field name mapping (e.g. email_address -> (email_address, emailaddress, email)) is currently hardcoded. This is inefficient because any change in the mapping requires a change in the code.
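
One way to move the mapping out of the code is to read it from the parsed user input and fall back to a built-in default. The Python sketch below is an illustration only: the `cfnMap` key follows the name used in these issues, but its JSON shape (canonical name to list of aliases) and the helper names `load_cfn_map` and `build_alias_lookup` are assumptions.

```python
import json
from typing import Any, Dict, List

# Built-in fallback, using the example mapping from the issue text.
DEFAULT_CFN_MAP: Dict[str, List[str]] = {
    "email_address": ["email_address", "emailaddress", "email"],
}


def load_cfn_map(user_input: Dict[str, Any]) -> Dict[str, List[str]]:
    """Use the user's cfnMap when present and non-empty, else the default."""
    cfn_map = user_input.get("cfnMap")
    return cfn_map if cfn_map else dict(DEFAULT_CFN_MAP)


def build_alias_lookup(cfn_map: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert the map so each alias points at its canonical field name."""
    lookup: Dict[str, str] = {}
    for canonical, aliases in cfn_map.items():
        for alias in aliases:
            lookup[alias.lower()] = canonical
    return lookup


# Usage: parse a (hypothetical) user-input.json payload and resolve a column name.
user_input = json.loads('{"cfnMap": {"phone": ["phone", "phone_number"]}}')
lookup = build_alias_lookup(load_cfn_map(user_input))
```

With this layout, changing the mapping means editing the JSON file rather than the code.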

Add tests

add tests for the basic Entree functionality

Update Config file

Put every sensitive token, credential, or variable in the config file, and read the values from system environment variables.
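
A small sketch of reading credentials from the environment instead of hardcoding them. `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the standard AWS variable names; `ENTREE_S3_BUCKET`, `Config`, and `load_config` are hypothetical names introduced only for this example.

```python
import os
from typing import Mapping, NamedTuple


class Config(NamedTuple):
    aws_access_key_id: str
    aws_secret_access_key: str
    s3_bucket: str


def load_config(env: Mapping[str, str] = os.environ) -> Config:
    """Build the config from environment variables, failing fast if one is missing."""

    def require(name: str) -> str:
        value = env.get(name)
        if not value:
            raise KeyError(f"missing required environment variable: {name}")
        return value

    return Config(
        aws_access_key_id=require("AWS_ACCESS_KEY_ID"),
        aws_secret_access_key=require("AWS_SECRET_ACCESS_KEY"),
        s3_bucket=require("ENTREE_S3_BUCKET"),  # hypothetical variable name
    )
```

Accepting the environment as a parameter makes the loader easy to test with a plain dictionary, and failing fast on a missing variable avoids a confusing error later in the pipeline.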
