entree's Issues
Class Balancing
Given the new file I/O pipeline, apply class balancing.
Plan:
- Create a map of counters, one per class.
- As each data point is read and labeled, update the appropriate counter.
- Once a class counter reaches the limit, ignore any further data point that would be labeled with that class.
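The plan above can be sketched as a small streaming filter. This is a minimal illustration, not the project's implementation: it is written in Python purely for illustration, and the function name `balance_classes`, the `(label, point)` tuple shape, and the `limit` parameter are all assumptions.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def balance_classes(
    labeled_points: Iterable[Tuple[str, str]], limit: int
) -> Iterator[Tuple[str, str]]:
    """Yield (label, point) pairs, keeping at most `limit` points per label.

    Hypothetical helper: the real pipeline's data shapes are not shown
    in the issue, so plain tuples of strings are assumed here.
    """
    counts: Counter = Counter()  # one counter per class
    for label, point in labeled_points:
        if counts[label] >= limit:
            # The class is full: ignore any further point with this label.
            continue
        counts[label] += 1
        yield label, point
```

Because the filter is a generator, it can sit inside a streaming pipeline without holding the whole dump in memory; only the counter map grows, by one entry per class.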
Fix S3 file listing bug
Currently, S3 file listing works fine when the files are under a folder path.
It does not seem to work when the files sit directly under the bucket (without folders).
Create a branch, use S3 dev mode, and fix this using the data in the cb-json-data bucket.
Enforce encoding
Enforce UTF-8 encoding.
Update Docs
- Update existing files in doc
- Update README
- Add doc for user-input.json
Default user input values
If important parameters in user-input.json are not present, fall back to default values. In particular, set up a default data format object.
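One way to do this is to merge the parsed user-input.json over a dictionary of defaults. The sketch below is in Python for illustration only; the `DEFAULTS` contents, including the keys inside `data_format`, are invented placeholders, since the issue does not list the real parameters.

```python
import copy
from typing import Any, Dict

# Hypothetical defaults: the real parameter names are not given in the issue.
DEFAULTS: Dict[str, Any] = {
    "data_format": {"delimiter": ",", "has_header": True},
    "class_limit": 1000,
}


def with_defaults(user: Dict[str, Any], defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Return `defaults` overridden by `user`, merging nested objects key by key."""
    merged = copy.deepcopy(defaults)  # never mutate the shared defaults
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are objects: merge recursively so that a partial
            # "data_format" keeps the default fields it does not mention.
            merged[key] = with_defaults(value, merged[key])
        else:
            merged[key] = value
    return merged
```

Merging recursively means a user who sets only one field of the data format object still gets the remaining default fields.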
Balance size of output file
Given the new file I/O pipeline, output file sizes are random.
- Add a mutable integer variable to the pipeline.
- As you write data points to the output, check that the number of lines in the output file is not above the limit. If it is, close the file, send it to S3, delete the local file, update the variable, create a new file, and write the next line to it.
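The rotation steps above could look roughly like the following. This is a sketch under assumptions, written in Python for illustration: `write_in_chunks`, the `part-NNNNN.txt` naming, and the `upload` callback (a stand-in for whatever sends a file to S3) are all hypothetical, and the mutable integer is interpreted here as a file index plus a line counter.

```python
import os
from typing import Callable, Iterable, List


def _ship(handle, path: str, upload: Callable[[str], None]) -> None:
    """Close the file, hand it to the uploader, then delete the local copy."""
    handle.close()
    upload(path)  # assumed to send the file to S3
    os.remove(path)


def write_in_chunks(
    lines: Iterable[str],
    max_lines: int,
    out_dir: str,
    upload: Callable[[str], None],
) -> List[str]:
    """Write `lines` to files of at most `max_lines` lines; return shipped file names."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    shipped: List[str] = []
    file_index = 0      # mutable variable: which output file we are on
    lines_in_file = 0   # mutable variable: lines written to the current file
    path = None
    handle = None
    for line in lines:
        if handle is not None and lines_in_file >= max_lines:
            # Limit reached: ship the full file and start a new one.
            _ship(handle, path, upload)
            shipped.append(os.path.basename(path))
            file_index += 1
            lines_in_file = 0
            handle = None
        if handle is None:
            path = os.path.join(out_dir, f"part-{file_index:05d}.txt")
            handle = open(path, "w", encoding="utf-8")
        handle.write(line + "\n")
        lines_in_file += 1
    if handle is not None:
        # Ship the last, possibly shorter, file.
        _ship(handle, path, upload)
        shipped.append(os.path.basename(path))
    return shipped
```

Every file except possibly the last ends up with exactly `max_lines` lines, which makes the output sizes predictable as long as line lengths are comparable.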
Better define the data object creation methods
Optimize and improve the current process of creating a data object:
- createDataObject
- getKeyValuePair
- createUnknownObjects
getKeyValuePair does not need all of these arguments. How do we differentiate between createDataObject and createUnknownObjects?
Train - Test split
Add functionality to divide the dump data into training and test files.
By default, do an 80/20 split, and let the user define these values.
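A minimal sketch of such a split, in Python for illustration only. The function name `split_train_test` and its parameters are assumptions; it shuffles an in-memory list, which may not suit a dump too large for memory.

```python
import random
from typing import List, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    points: Sequence[T],
    train_fraction: float = 0.8,
    seed: Optional[int] = None,
) -> Tuple[List[T], List[T]]:
    """Shuffle `points` and cut them into (train, test) lists.

    `train_fraction` defaults to 0.8 (an 80/20 split) and can be
    overridden by the caller, e.g. from a user-supplied setting.
    """
    if not 0.0 <= train_fraction <= 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)  # a fixed seed gives a reproducible split
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Passing a fixed `seed` keeps the training and test files stable between runs, which helps when comparing results.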
cfnMap as input
Potentially define cfnMap as an input in user-input.json and use the current cfnMap in CFNMappingCook as the default map.
Stream files back into S3
Potentially look at Alpakka.
Automate column field name mapping
Column field name mapping (e.g. email_address -> (email_address, emailaddress, email)) is currently hardcoded. This is inefficient because any change in the mapping requires a change in the code.
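One option is to keep the mapping in a JSON document and build a reverse lookup at startup, so that changing the mapping means editing data rather than code. The sketch below is in Python for illustration only; `load_cfn_lookup` and the JSON layout (label mapped to a list of variants) are assumptions modeled on the email_address example.

```python
import json
from typing import Dict


def load_cfn_lookup(json_text: str) -> Dict[str, str]:
    """Build a variant -> label lookup from a JSON mapping.

    Assumed JSON shape (hypothetical):
        {"email_address": ["email_address", "emailaddress", "email"]}
    """
    raw = json.loads(json_text)
    lookup: Dict[str, str] = {}
    for label, variants in raw.items():
        for variant in variants:
            # Normalize so that "Email " and "email" resolve to the same label.
            lookup[variant.strip().lower()] = label
    return lookup


def label_for(column_name: str, lookup: Dict[str, str], default: str = "unknown") -> str:
    """Return the label for a column field name, or `default` if unmapped."""
    return lookup.get(column_name.strip().lower(), default)
```

The `default` label for unmapped columns is also an assumption; the issue does not say how unknown field names should be handled.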
Add tests
Add tests for the basic Entree functionality.
Change Arrays to Vectors in Maps
Create a cfn / label Map for Kaggle
Connect to Kaggle or download relevant Kaggle datasets to dump. Create a column field name mapping & breakdown map for Kaggle datasets.
Update Config file
Put every sensitive token, credential, or variable in the config file and use system environment variables.
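Reading secrets from the environment can be as small as the helper below. It is a sketch in Python for illustration only; `require_env` is a hypothetical helper, and the variable name in the usage line is a made-up example rather than one the project defines.

```python
import os
from typing import Mapping


def require_env(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the value of environment variable `name`, failing loudly if unset."""
    value = environ.get(name)
    if not value:
        # Fail at startup rather than later with a confusing credentials error.
        raise KeyError(f"required environment variable {name!r} is not set")
    return value
```

A caller would then write something like `bucket = require_env("ENTREE_S3_BUCKET")`, keeping the actual value out of the repository.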