
humanitas's People

Contributors

albu89, duynguyen, f4bd3v, grill, halccw, humanitas, jcboyd, mstefanro, tonyo

humanitas's Issues

Reservoir Computing (ESN) - Training

Echo State Networks

  • Batch learning with teacher forcing and ridge regression - implemented
    • Increase modelling power by increasing network size (constrained by number of training points!)
    • or, more easily, by applying an additional nonlinear transformation of the network states x(n) to the output: e.g. use a squared version of the states and expand W^{out} to 2N+2 entries, so that x_squares(n) = (u(n), x(n), u^2(n), x^2(n)) (see the sketch after this list).
      • Insert noise into the state update (look this up)
  • Online learning methods: recompute W^{out} after every update of the reservoir states x(n), which might give better performance
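
A minimal sketch of the batch step, assuming T training points, a reservoir of size N, scalar inputs u(n) and a teacher signal y(n); the state update and washout are left out, and all names here are illustrative rather than our actual implementation:

import numpy as np

def train_readout(X, U, Y, ridge=1e-6):
    """Batch ridge regression for W_out, using the squared-state expansion.
    X: (T, N) reservoir states, U: (T, 1) inputs, Y: (T,) teacher signal."""
    # extended state (u(n), x(n), u^2(n), x^2(n)) -> 2N+2 entries per time step
    Z = np.hstack([U, X, U**2, X**2])
    # ridge regression in the extended state space
    A = Z.T @ Z + ridge * np.eye(Z.shape[1])
    return np.linalg.solve(A, Z.T @ Y)          # W_out, shape (2N+2,)

def readout(W_out, u, x):
    # output for one time step from the extended state
    z = np.concatenate([[u], x, [u**2], x**2])
    return z @ W_out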

Optimization of the network size

As shown in this paper.

Fully-connected Reservoir Network with BPDC algorithm

  • It may be better to implement the backpropagation-decorrelation (BPDC) algorithm with a fully connected reservoir, which according to the paper outperforms RLS because it does not impose the conditions on the matrix W that an Echo State Network requires.

No matter which training algorithm we use, we should implement a form of 'bagging' by averaging the results obtained from multiple different reservoir initializations.
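
A rough sketch of that averaging, where build_reservoir, train_esn and run_esn are placeholders for whatever ESN code we end up with:

import numpy as np

def bagged_forecast(u_train, y_train, u_test, n_reservoirs=10, seed=0):
    """Average forecasts over several independently initialized reservoirs."""
    preds = []
    for k in range(n_reservoirs):
        rng = np.random.RandomState(seed + k)
        esn = build_reservoir(rng)            # placeholder: draw random W, W_in, ...
        train_esn(esn, u_train, y_train)      # placeholder: e.g. the ridge readout above
        preds.append(run_esn(esn, u_test))    # placeholder: free-running prediction
    return np.mean(preds, axis=0)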

Bootstrapping vs Cross-validation

Cross-validation and bootstrapping are both methods for estimating generalization error based on "resampling". The resulting estimates of generalization error are often used for choosing among various models, such as different network architectures.
Bootstrapping seems to work better than cross-validation in many cases (Efron, 1983). In the simplest form of bootstrapping, instead of repeatedly analyzing subsets of the data, you repeatedly analyze subsamples of the data. Each subsample is a random sample with replacement from the full
sample. Depending on what you want to do, anywhere from 50 to 2000 subsamples might be used. There are many more sophisticated bootstrap methods that can be used not only for estimating generalization error but also for estimating confidence bounds for network outputs. There is more on this topic on this website.
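
A toy sketch of one common variant (out-of-bag evaluation) of the bootstrap estimate described above; fit and error are placeholders for the model-fitting routine and the error measure:

import numpy as np

def bootstrap_error(X, y, fit, error, n_boot=200, seed=0):
    """Refit on bootstrap subsamples (drawn with replacement) and evaluate
    each refit on the points left out of its subsample."""
    rng = np.random.RandomState(seed)
    n = len(y)
    errs = []
    for _ in range(n_boot):
        idx = rng.randint(0, n, n)                # subsample with replacement
        oob = np.setdiff1d(np.arange(n), idx)     # points not in the subsample
        if len(oob) == 0:
            continue
        model = fit(X[idx], y[idx])
        errs.append(error(model, X[oob], y[oob]))
    return np.mean(errs)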

Supermarket websites?

Find out whether any supermarkets publish their prices online. If the sites are not
in English, browse them through Google Translate.
If you find any, check whether web.archive.org's Wayback Machine is archiving them.

Proc. tweets - matching to predefined categories

Additional words for general food category:
'snack', 'rice'?, 'groceries', 'cook'

Work in progress.
Guys, we have to make absolutely sure we get all the relevant tweets through filtering. I think we could still refine our approach. This Pattern library is really powerful and we could run it on the tweets we write to the database. For filtering, we should use suggestions as well as edit distance and PoS tagging:

  • keep the associated PoS tags for all keywords
  • if a word doesn't match any keyword, compute a suggestion and check whether the PoS tags match and the edit-distance threshold (to the keyword) is satisfied; if no suggestion is available, just check the edit distance and compare PoS tags
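
A rough sketch of that matching rule, assuming NLTK (and its tagger data) is available for PoS tagging and edit distance; the suggestion step is left out here:

import nltk  # assumes the averaged_perceptron_tagger data package is installed

def matching_keyword(word, keywords, keyword_pos, max_dist=2):
    """keywords: list of keyword strings; keyword_pos: dict keyword -> PoS tag."""
    word_tag = nltk.pos_tag([word])[0][1]
    for kw in keywords:
        if word == kw:
            return kw
        # fall back to edit distance plus PoS comparison
        if nltk.edit_distance(word, kw) <= max_dist and word_tag == keyword_pos.get(kw):
            return kw
    return None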

Filtering & NLP of tweets

To make sense of the tweets we're collecting we have to cluster them according to indicators we want to feed into our Neural Networks.

The first step is to filter the tweets hierarchically according to certain categories:

general:

  • Price --> Food --> Indicator
  • Price --> Oil --> Indicator

specific:
Price --> Food --> Commodity --> Indicator

Indicator words are "increase", "decrease", "high", "low" and their synonyms.
What are good indicators for making a prediction of a price?

The tweets we group into these categories are then ordered by their timestamps, counted and fed into the network as a sequence for each category. The scaling coefficient will have to be found empirically.
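
A pandas sketch of the counting step, assuming the categorized tweets live in a dataframe with 'timestamp' and 'category' columns (the column names and the weekly frequency are just examples):

import pandas as pd

def tweet_count_series(tweets, freq='W'):
    """tweets: DataFrame with columns ['timestamp', 'category'].
    Returns one count series per category, resampled to the given frequency."""
    counts = (tweets.set_index('timestamp')
                    .sort_index()
                    .groupby('category')
                    .resample(freq)
                    .size()
                    .unstack('category', fill_value=0))
    return counts  # multiply by the empirical scaling coefficient before feeding the network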

The question is: given time constraints, do we want to implement simple filtering or a feature-based clustering algorithm?

If we implement the latter, do we use k-means or spectral clustering?
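
For reference, with scikit-learn both options are only a few lines once we have a feature matrix for the tweets (bag-of-words counts are just one possible choice of features):

from sklearn.cluster import KMeans, SpectralClustering

def cluster_tweets(X, k=5):
    """X: (n_tweets, n_features) feature matrix, e.g. bag-of-words counts."""
    kmeans_labels = KMeans(n_clusters=k).fit_predict(X)
    spectral_labels = SpectralClustering(n_clusters=k, affinity='nearest_neighbors').fit_predict(X)
    return kmeans_labels, spectral_labels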

Situation of retail daily dataset

This is a fairly consistent dataset.

  1. There is no subproduct for any product.
  2. Products have very similar distributions of NaNs.
  3. All products have about 60% valid data.
  4. The following 9 products have "fairly good" data for all 15 regions: Atta (Wheat), Gram Dal, Onion, Rice, Salt Pack, Sugar, Tea Loose, Tur, Vanaspati.

https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-retail/num_cities_0.4.csv
https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-retail/best_non_na_0.4.csv

Convert all tabular data into CSV files

Ideally, all the price data should be tabular, with the following columns:

  • time
  • averaging-period (week/month)
  • country (India/Indonesia)
  • region (optional)
  • product (e.g. milk)
  • subproduct (e.g. condensed; optional)
  • price

Example of a row:
02/01/2000,week,Indonesia,Banda Aceh,milk,condensed,4000.0
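
A minimal sketch of writing rows in this format with Python's csv module; the converter from each raw source into these dicts is up to whoever owns that dataset:

import csv

COLUMNS = ['time', 'averaging-period', 'country', 'region', 'product', 'subproduct', 'price']

def write_rows(path, rows):
    """rows: iterable of dicts keyed by COLUMNS; missing optional fields stay empty."""
    with open(path, 'w', newline='') as f:
        csv.DictWriter(f, fieldnames=COLUMNS).writerows(rows)  # no header row, as in the example

write_rows('prices.csv', [{'time': '02/01/2000', 'averaging-period': 'week',
                           'country': 'Indonesia', 'region': 'Banda Aceh',
                           'product': 'milk', 'subproduct': 'condensed', 'price': 4000.0}])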

Read price csv into multi-index Pandas dataframe

This part will be shared by all other analysis and prediction code, so I suggest we reach a consensus.

Currently I construct a multi-index dataframe from a CSV file like this:

Dataframe
product  sub            country  city      freq    date        price
Rice     Common/Coarse  India    Chittoor  week    2013-01-02  21.00
Rice     Common/Coarse  India    Guntur    week    2013-01-02  24.00
Rice     Fine           India    Asansol   week    2013-04-26  23.00
Rice     Fine           India    Salem     week    2013-04-26  24.00

Where the multi-index is built on 'product', 'sub', 'country', 'city' and 'freq'.
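
For reference, a sketch of how that index can be built (the filename and the parse_dates column are assumptions about the CSV layout):

import pandas as pd

df = pd.read_csv('india_weekly_retail.csv', parse_dates=['date'])
df = df.set_index(['product', 'sub', 'country', 'city', 'freq']).sort_index()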

Query

And with numexpr installed, we can extract any sub-dataframe we want like this:

sub1 = df.query('product == "Rice"')
sub2 = df.query('product == "Rice" & city == "Asansol" & sub == "Fine"')

where sub1 and sub2 are 2-column dataframes containing "date, price" sorted by date for each predicate. Finally, we can extract a certain time period like this:

after_july = sub1[sub1['date'] > '2013-07-01']

I'm quite new to Pandas. Any advice is welcome.

Find more data sources

Brainstorm ideas for other places where people may discuss food prices:
News articles, reddit comments, Forums in India etc.

India product selection

For the India weekly dataset, we may select our target products based on previous statistics and this table:

https://github.com/fabbrix/humanitas/blob/master/analysis/ts/na_table_org.csv

  1. I sorted products according to the column "city counts of cut off rate 0.2."
  2. Average rates show little difference among usable products.
  3. If we set the cutoff rate to 30% (although a bit much), we will have 35-40 cities for the top 10 products.
  4. Besides, it seems that we do not have to worry about the subproduct dimension except for rice.
  5. One interesting observation: most cities report prices at a nearly constant rate.

What are the reasons behind price fluctuations? What other data is correlated to price fluctuations?

Research the following questions:

  • What are the main factors causing food price fluctuations?

food production (<-- weather), food stocks/distribution, currency exchange rate (w.r.t $), crude oil price

  • The prices of what other things are correlated to food prices?

Consumer Price Index (CPI) --> inflation

  • What other time series could potentially be correlated to price series?
  • Are the price fluctuations of neighbouring countries correlated?
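
Once the price data is in the multi-index dataframe format from the other issue, checking such correlations is straightforward; a pandas sketch (the series names and the lag are placeholders):

import pandas as pd

def price_correlation(price_a, price_b, lag=0):
    """price_a, price_b: pd.Series indexed by date, e.g. a food price vs. CPI,
    or the same product in two neighbouring countries; lag shifts price_b."""
    aligned = pd.concat([price_a, price_b.shift(lag)], axis=1, join='inner').dropna()
    return aligned.iloc[:, 0].corr(aligned.iloc[:, 1])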

Layout CSV Data Indonesia

Has one of the data crunchers checked the CSV files for Indonesia? Is there a special format you'd like me to apply to the data?

Potential of daily wholesale data (2005-2014) for prediction vs. daily retail (2009-2013)

Before I dig into prediction, let me share and discuss some thoughts.

We have wholesale daily (2005-2014) and retail daily (2009-2013) datasets.

1. Include a few very good wholesale daily series into prediction goals

The wholesale daily dataset is sparse, but we have some very good series with more than 80-90% valid data over 10 years, which also appear very volatile and periodic. Although they cover only a tiny portion of the whole picture, I suggest we still make good use of them to produce individual predictions.
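
A sketch of how such series could be picked out of the multi-index dataframe from the other issue (the threshold is illustrative, and the series are assumed to contain NaN rows for missing dates):

def good_series(df, min_valid=0.8):
    """Return the (product, sub, country, city, freq) groups whose price series
    has at least min_valid fraction of non-NaN values."""
    levels = ['product', 'sub', 'country', 'city', 'freq']
    valid_frac = df['price'].notnull().groupby(level=levels).mean()
    return valid_frac[valid_frac >= min_valid].sort_values(ascending=False)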

Pre-interpolation graphs per region (zoom in or click them to see clearer graphs):

Uttar Pradesh
Apple and onion appear volatile and periodic, but we should discard the rice here, since its price is very stable.

West Bengal
Observe the periodic clustering of high volatility.

Gujarat
Super volatile potato.

NCT of Delhi
Wheat price

Some more to come tomorrow.

Visualisation of Results - Output format of analysis

Map of India with two layers that can be queried by date and commodity

  • Price prediction layer (percentage of increase per (marketplace and surrounding region | state), visualization of major influences on price)
  • Tweet analysis layer (tweets/inhabitants by region, relevant tweets by city/state, top-k relevant tweets for commodity price change)

Issue #1: Datamaps (http://datamaps.github.io/) doesn't provide a map of India so far. However, it is possible to create maps with TopoJSON; see this article on how to create a map with D3.js and TopoJSON.

Data Mining: Regional Scope of Analysis

Good morning team,
quick question concerning the regional scope of the analysis. Basically, I'd be able to create inferences between regions; however, depending on how we present our system, that might not make sense. Are we going to present information for specific regions, i.e. will the user click on a part of India and receive information about commodities in that specific area? In that case I would exclude inferences between regions, to keep it coherent with the rest of the system.

Happy Easter!

Alex

Situation of the wholesale daily dataset

It is very sparse.

If we set the valid-threshold to 70% (meaning we only keep series that have at least 70% non-NaN data), we get only 15 (product, subproduct) pairs, and most regions have data for only a few (product, subproduct) pairs.

See the following 2 tables:

num_cities: Each cell represents the number of cities that have at least 70% valid data in that region
https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-wholesale/num_cities_0.3.csv

best_non_na: Each cell represents the maximum valid-data percentage among the cities in that region
https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-wholesale/best_non_na_0.3.csv
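
For reference, a sketch of how both tables can be produced from a per-(region, city, product) table of valid-data fractions; the column names and layout are assumptions, not necessarily what the scripts in analysis/statistics do:

import pandas as pd

def summarize(valid_frac, threshold=0.7):
    """valid_frac: DataFrame with columns ['region', 'city', 'product', 'frac'],
    where frac is the fraction of non-NaN values in that city's series."""
    num_cities = (valid_frac[valid_frac['frac'] >= threshold]
                  .groupby(['region', 'product'])['city'].nunique()
                  .unstack('product', fill_value=0))
    best_non_na = valid_frac.groupby(['region', 'product'])['frac'].max().unstack('product')
    return num_cities, best_non_na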

Even if I reduce the valid-threshold to 60%, the data is still sparse.
https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-wholesale/num_cities_0.4.csv
https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-wholesale/best_non_na_0.4.csv

The attempt to reduce the time period to 3 years (2011-2014) in order to reduce sparsity did not work well; the result looks very similar to the one for the whole time span.
https://github.com/fabbrix/humanitas/blob/master/analysis/statistics/india-daily-wholesale/num_cities_3y_0.4.csv
