
humanitas's People

Contributors

albu89, duynguyen, f4bd3v, grill, halccw, humanitas, jcboyd, mstefanro, tonyo

humanitas's Issues

Reservoir Computing (ESN) - Training

Echo State Networks

  • Batch learning with teacher forcing and ridge regression - implemented
    • Increase modelling power by increasing network size (constrained by number of training points!)
    • or, more easily, by applying an additional nonlinear transformation of the network states x(n) to the output: e.g. use a squared version of the states and expand W^{out} to 2N+2 entries, so that x_squares(n) = (u(n), x(n), u^2(n), x^2(n)) (see the sketch after this list).
      • Insert noise into the state update (look this up)
  • Online learning methods: recompute W^{out} after every update of the reservoir states x(n), which might give better performance
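
A minimal sketch of the batch step, assuming T training points, a reservoir of size N, scalar inputs u(n) and a teacher signal y(n); the state update and washout are left out, and all names here are illustrative rather than our actual implementation:

import numpy as np

def train_readout(X, U, Y, ridge=1e-6):
    """Batch ridge regression for W_out, using the squared-state expansion.
    X: (T, N) reservoir states, U: (T, 1) inputs, Y: (T,) teacher signal."""
    # extended state (u(n), x(n), u^2(n), x^2(n)) -> 2N+2 entries per time step
    Z = np.hstack([U, X, U**2, X**2])
    # ridge regression in the extended state space
    A = Z.T @ Z + ridge * np.eye(Z.shape[1])
    return np.linalg.solve(A, Z.T @ Y)          # W_out, shape (2N+2,)

def readout(W_out, u, x):
    # output for one time step from the extended state
    z = np.concatenate([[u], x, [u**2], x**2])
    return z @ W_out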

Optimization of the network size

As shown in this paper.

Fully-connected Reservoir Network with BPDC algorithm

  • It may be better to implement the backpropagation-decorrelation (BPDC) algorithm with a fully connected reservoir, which according to the paper outperforms RLS because it does not impose the conditions on the matrix W that an Echo State Network requires.

No matter which training algorithm we use, we should implement a form of 'bagging' by averaging the results obtained from multiple different reservoir initializations.
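
A rough sketch of that averaging, where build_reservoir, train_esn and run_esn are placeholders for whatever ESN code we end up with:

import numpy as np

def bagged_forecast(u_train, y_train, u_test, n_reservoirs=10, seed=0):
    """Average forecasts over several independently initialized reservoirs."""
    preds = []
    for k in range(n_reservoirs):
        rng = np.random.RandomState(seed + k)
        esn = build_reservoir(rng)            # placeholder: draw random W, W_in, ...
        train_esn(esn, u_train, y_train)      # placeholder: e.g. the ridge readout above
        preds.append(run_esn(esn, u_test))    # placeholder: free-running prediction
    return np.mean(preds, axis=0)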

Bootstrapping vs Cross-validation

Cross-validation and bootstrapping are both methods for estimating generalization error based on "resampling". The resulting estimates of generalization error are often used for choosing among various models, such as different network architectures.
Bootstrapping seems to work better than cross-validation in many cases (Efron, 1983). In the simplest form of bootstrapping, instead of repeatedly analyzing subsets of the data, you repeatedly analyze subsamples of the data. Each subsample is a random sample with replacement from the full
sample. Depending on what you want to do, anywhere from 50 to 2000 subsamples might be used. There are many more sophisticated bootstrap methods that can be used not only for estimating generalization error but also for estimating confidence bounds for network outputs. There is more on this topic on this website.
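
A toy sketch of one common variant (out-of-bag evaluation) of the bootstrap estimate described above; fit and error are placeholders for the model-fitting routine and the error measure:

import numpy as np

def bootstrap_error(X, y, fit, error, n_boot=200, seed=0):
    """Refit on bootstrap subsamples (drawn with replacement) and evaluate
    each refit on the points left out of its subsample."""
    rng = np.random.RandomState(seed)
    n = len(y)
    errs = []
    for _ in range(n_boot):
        idx = rng.randint(0, n, n)                # subsample with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # points not in the subsample
        if len(oob) == 0:
            continue
        model = fit(X[idx], y[idx])
        errs.append(error(model, X[oob], y[oob]))
    return np.mean(errs)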

Supermarket websites?

Find out whether any supermarkets publish their prices online. If the sites are not
in English, browse them through Google Translate.
If you find any, check whether web.archive.org's Wayback Machine is archiving them.

Proc. tweets - matching to predefined categories

Additional words for general food category:
'snack', 'rice'?, 'groceries', 'cook'

Work in progress.
Guys, we have to make absolutely sure we get all the relevant tweets through filtering. I think we could still refine our approach. This Pattern library is really powerful and we could run it on the tweets we write to the database. For filtering, we should use suggestions as well as edit distance and PoS tagging:

  • keep the associated PoS tags for all keywords
  • if a word doesn't match any keyword, compute a suggestion and check whether the PoS tags match and the edit-distance threshold (to the keyword) is satisfied; if no suggestion is available, just check the edit distance and compare PoS tags
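
A rough sketch of that matching rule, assuming NLTK (and its tagger data) is available for PoS tagging and edit distance; the suggestion step is left out here:

import nltk  # assumes the averaged_perceptron_tagger data package is installed

def matching_keyword(word, keywords, keyword_pos, max_dist=2):
    """keywords: list of keyword strings; keyword_pos: dict keyword -> PoS tag."""
    word_tag = nltk.pos_tag([word])[0][1]
    for kw in keywords:
        if word == kw:
            return kw
        # fall back to edit distance plus PoS comparison
        if nltk.edit_distance(word, kw) <= max_dist and word_tag == keyword_pos.get(kw):
            return kw
    return None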

Filtering & NLP of tweets

To make sense of the tweets we're collecting we have to cluster them according to indicators we want to feed into our Neural Networks.

The first step is to filter the tweets hierarchically according to certain categories:

general:

  • Price --> Food --> Indicator
  • Price --> Oil --> Indicator

specific:
Price --> Food --> Commodity --> Indicator

Indicator words are "increase", "decrease", "high", "low" and their synonyms.
What are good indicators for making a prediction of a price?

The tweets we group into these categories are then ordered by their timestamps, counted and fed into the network as a sequence for each category. The scaling coefficient will have to be found empirically.
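
A pandas sketch of the counting step, assuming the categorized tweets live in a dataframe with 'timestamp' and 'category' columns (the column names and the weekly frequency are just examples):

import pandas as pd

def tweet_count_series(tweets, freq='W'):
    """tweets: DataFrame with columns ['timestamp', 'category'].
    Returns one count series per category, resampled to the given frequency."""
    counts = (tweets.set_index('timestamp')
                    .sort_index()
                    .groupby('category')
                    .resample(freq)
                    .size()
                    .unstack('category', fill_value=0))
    return counts  # multiply by the empirical scaling coefficient before feeding the network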

The question is: given time constraints, do we want to implement simple filtering or a feature-based clustering algorithm?

If we implement the latter, do we use k-means or spectral clustering?
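
For reference, with scikit-learn both options are only a few lines once we have a feature matrix for the tweets (bag-of-words counts are just one possible choice of features):

from sklearn.cluster import KMeans, SpectralClustering

def cluster_tweets(X, k=5):
    """X: (n_tweets, n_features) feature matrix, e.g. bag-of-words counts."""
    kmeans_labels = KMeans(n_clusters=k).fit_predict(X)
    spectral_labels = SpectralClustering(n_clusters=k, affinity='nearest_neighbors').fit_predict(X)
    return kmeans_labels, spectral_labels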

Situation of retail daily dataset

This is a fairly consistent dataset.

  1. There is no subproduct for any product.
  2. Products have very similar distributions of NaNs.
  3. All products have about 60% valid data.
  4. The following 9 products have "fairly good" data for all 15 regions: Atta (Wheat), Gram Dal, Onion, Rice, Salt Pack, Sugar, Tea Loose, Tur, Vanaspati.

https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-retail/num_cities_0.4.csv
https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-retail/best_non_na_0.4.csv

Convert all tabular data into CSV files

Ideally, all the price data should be tabular, with the following columns:

  • time
  • averaging-period (week/month)
  • country (India/Indonesia)
  • region (optional)
  • product (e.g. milk)
  • subproduct (e.g. condensed; optional)
  • price

Example of a row:
02/01/2000,week,Indonesia,Banda Aceh,milk,condensed,4000.0
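
A minimal sketch of writing rows in this format with Python's csv module; the converter from each raw source into these dicts is up to whoever owns that dataset:

import csv

COLUMNS = ['time', 'averaging-period', 'country', 'region', 'product', 'subproduct', 'price']

def write_rows(path, rows):
    """rows: iterable of dicts keyed by COLUMNS; missing optional fields stay empty."""
    with open(path, 'w', newline='') as f:
        csv.DictWriter(f, fieldnames=COLUMNS).writerows(rows)  # no header row, as in the example

write_rows('prices.csv', [{'time': '02/01/2000', 'averaging-period': 'week',
                           'country': 'Indonesia', 'region': 'Banda Aceh',
                           'product': 'milk', 'subproduct': 'condensed', 'price': 4000.0}])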

Read price csv into multi-index Pandas dataframe

This part will be shared by all other analysis and prediction code, so I suggest we reach a consensus.

Currently I construct a multi-index dataframe from a CSV file like this:

Dataframe
product  sub            country  city      freq    date        price
Rice     Common/Coarse  India    Chittoor  week    2013-01-02  21.00
Rice     Common/Coarse  India    Guntur    week    2013-01-02  24.00
Rice     Fine           India    Asansol   week    2013-04-26  23.00
Rice     Fine           India    Salem     week    2013-04-26  24.00

Where the multi-index is built on 'product', 'sub', 'country', 'city' and 'freq'.
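
For reference, a sketch of how that index can be built (the filename and the parse_dates column are assumptions about the CSV layout):

import pandas as pd

df = pd.read_csv('india_weekly_retail.csv', parse_dates=['date'])
df = df.set_index(['product', 'sub', 'country', 'city', 'freq']).sort_index()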

Query

And with numexpr installed, we can extract any sub-dataframe we want like this:

sub1 = df.query('product == "Rice"')
sub2 = df.query('product == "Rice" & city == "Asansol" & sub == "Fine"')

where sub1 and sub2 are 2-column dataframes containing "date, price" sorted by date for each predicate. Finally, we can extract a certain time period like this:

after_july = sub1[sub1['date'] > '2013-07-01']

I'm quite new to Pandas. Any advice is welcome.

Find more data sources

Brainstorm ideas for other places where people may discuss food prices:
News articles, reddit comments, Forums in India etc.

India product selection

For the India weekly dataset, we may select our target products based on previous statistics and this table:

https://github.com/fabbrix/humanitas/blob/master/analysis/ts/na_table_org.csv

  1. I sorted products according to the column "city counts of cut off rate 0.2."
  2. Average rates show little difference among usable products.
  3. If we set the cutoff rate to 30% (although a bit much), we will have 35-40 cities for the top 10 products.
  4. Besides, it seems that we do not have to worry about the subproduct dimension except for rice.
  5. One interesting observation: most cities report prices at a nearly constant rate.

What are the reasons behind price fluctuations? What other data is correlated to price fluctuations?

Research the following questions:

  • What are the main factors causing food price fluctuations?

food production (<-- weather), food stocks/distribution, currency exchange rate (w.r.t $), crude oil price

  • The prices of what other things are correlated to food prices?

Consumer Price Index (CPI) --> inflation

  • What other time series could potentially be correlated to price series?
  • Are the price fluctuations of neighbouring countries correlated?
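
Once the price data is in the multi-index dataframe format from the other issue, checking such correlations is straightforward; a pandas sketch (the series names and the lag are placeholders):

import pandas as pd

def price_correlation(price_a, price_b, lag=0):
    """price_a, price_b: pd.Series indexed by date, e.g. a food price vs. CPI,
    or the same product in two neighbouring countries; lag shifts price_b."""
    aligned = pd.concat([price_a, price_b.shift(lag)], axis=1, join='inner').dropna()
    return aligned.iloc[:, 0].corr(aligned.iloc[:, 1])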

Layout CSV Data Indonesia

Has one of the data crunchers checked the CSV files for Indonesia? Is there a special format you'd like me to apply to the data?

Potential of daily wholesale data (2005-2014) for prediction vs. daily retail (2009-2013)

Before I dig into prediction, let me share and discuss some thoughts.

We have wholesale daily (2005-2014) and retail daily (2009-2013) datasets.

1. Include a few very good wholesale daily series into prediction goals

The wholesale daily dataset is sparse, but we have some very good series with more than 80-90% valid data over 10 years, which also appear very volatile and periodic. Although they cover only a tiny portion of the whole picture, I suggest we still make good use of them to produce individual predictions.
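
A sketch of how such series could be picked out of the multi-index dataframe from the other issue (the threshold is illustrative, and the series are assumed to contain NaN rows for missing dates):

def good_series(df, min_valid=0.8):
    """Return the (product, sub, country, city, freq) groups whose price series
    has at least min_valid fraction of non-NaN values."""
    levels = ['product', 'sub', 'country', 'city', 'freq']
    valid_frac = df['price'].notnull().groupby(level=levels).mean()
    return valid_frac[valid_frac >= min_valid].sort_values(ascending=False)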

Pre-interpolation graphs per region (zoom in or click them to see clearer graphs):

Uttar Pradesh
Apple and onion appear volatile and periodic, but we should discard the rice here, since its price is very stable.

West Bengal
Observe the periodic clustering of high volatility.

Gujarat
Super volatile potato.

NCT of Delhi
Wheat price

Some more to come tomorrow.

Visualisation of Results - Output format of analysis

Map of India with two layers that can be queried by date and commodity

  • Price prediction layer (percentage of increase per (marketplace and surrounding region | state), visualization of major influences on price)
  • Tweet analysis layer (tweets/inhabitants by region, relevant tweets by city/state, top-k relevant tweets for commodity price change)

Issue #1: Datamaps (http://datamaps.github.io/) doesn't provide a map of India so far. However, it is possible to create maps with TopoJSON; see this article on how to create a map with D3.js and TopoJSON.

Data Mining: Regional Scope of Analysis

Good morning team,
quick question concerning the regional scope of the analysis. Basically, I'd be able to create inferences between regions; however, depending on how we present our system, that might not make sense. Are we going to present information for specific regions, i.e. will the user click on a part of India and receive information about commodities in that specific area? In that case I would exclude inferences between regions, to keep it coherent with the rest of the system.

Happy Easter!

Alex

Situation of the wholesale daily dataset

It is very sparse.

If we set the valid-threshold to 70% (meaning we only keep series that have at least 70% non-NaN data), we get only 15 (product, subproduct) pairs, and most regions have data for only a few (product, subproduct) pairs.

See the following 2 tables:

num_cities: Each cell represents the number of cities that have at least 70% valid data in that region
https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-wholesale/num_cities_0.3.csv

best_non_na: Each cell represents the maximum valid-data percentage among the cities in that region
https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-wholesale/best_non_na_0.3.csv
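
For reference, a sketch of how both tables can be produced from a per-(region, city, product) table of valid-data fractions; the column names and layout are assumptions, not necessarily what the scripts in analysis/statistics do:

import pandas as pd

def summarize(valid_frac, threshold=0.7):
    """valid_frac: DataFrame with columns ['region', 'city', 'product', 'frac'],
    where frac is the fraction of non-NaN values in that city's series."""
    num_cities = (valid_frac[valid_frac['frac'] >= threshold]
                  .groupby(['region', 'product'])['city'].nunique()
                  .unstack('product', fill_value=0))
    best_non_na = valid_frac.groupby(['region', 'product'])['frac'].max().unstack('product')
    return num_cities, best_non_na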

Even if I reduce the valid-threshold to 60%, the data is still sparse.
https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-wholesale/num_cities_0.4.csv
https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-wholesale/best_non_na_0.4.csv

The attempt to reduce the time period to 3 years (2011-2014) in order to reduce sparsity did not work well; the result looks very similar to the one for the whole time span.
https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-wholesale/num_cities_3y_0.4.csv
