Comments (4)

mstefanro commented on August 17, 2024

@ChingChia

Does a cutoff rate of 0.2 mean picking only those series for which at least 80% of the values are known prior to interpolation?

Important:
This may be confusing, but the "region" column in the daily and weekly datasets really means city, NOT region. To get the region, you need to join the datasets with the /data/india/csv_daily/agmarknet.nic.in/regions.csv file. I suggest we rename the column now to avoid future confusion.
Since our prediction model is per-region rather than per-city, maybe you should base your stats on regions instead. Saying "35-40 cities" is not very informative, because they may all be from the same region. We are going to merge them using PCA (or averages?) in the end, so if they are indeed all in the same region, we would effectively have only one city.
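
To make the rename-and-join concrete, here is a minimal sketch on toy data. I have not checked the actual layout of regions.csv, so the "city" and "region" column names in the lookup table, and all the values, are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the daily dataset. Its "region" column really holds cities.
daily = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-01", "2014-01-01", "2014-01-02"]),
    "region": ["city_a", "city_b", "city_c"],
    "price": [20.0, 21.5, 18.0],
})

# Toy stand-in for regions.csv (column names are assumed, not verified).
regions = pd.DataFrame({
    "city": ["city_a", "city_b", "city_c"],
    "region": ["region_x", "region_x", "region_y"],
})

# Rename the misleading column first, then attach the true region.
daily = daily.rename(columns={"region": "city"})
merged = daily.merge(regions, on="city", how="left", validate="many_to_one")

# Per-region stats: how many distinct cities report in each region.
regions_covered = merged.groupby("region")["city"].nunique()
```

Counting distinct cities per region like this would show directly whether the "35-40 cities" are spread out or clustered in a few regions.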

Besides, it seems that we do not have to worry about the subproduct dimension except for rice.

Our most important dataset is the daily one, not the weekly one. In the daily one, rice has 100 subproducts, onion has 26, wheat has 68, and so on. So we do have to worry about both cities and subproducts.
What we would like to do is the following:

let D be a mapping from all (R, P) to a time series
for each region R:
|    for each product P:
|    |   let M be a matrix.
|    |   for each subproduct SP (of product P):
|    |   |   for each city C (of region R):
|    |   |   |   let T be the time-series corresponding to (R,P,SP,C)
|    |   |   |   interpolate T to obtain a full time-series
|    |   |   |   add the vector T as a column to matrix M
|    |   let T = PCA(M, 1)
|    |   store a mapping from (R, P) to T into D
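
A rough Python version of the loop above could look like the sketch below. It assumes a long-format DataFrame with columns date, region, city, product, subproduct and price (these names are my assumption, with "region" already being the true region), and it implements PCA(M, 1) as an SVD of the mean-centred matrix, which is one common choice and not necessarily what we will end up using.

```python
import numpy as np
import pandas as pd

def first_principal_component(M):
    """Return the scores of the rows of M along the first principal component."""
    X = M - M.mean(axis=0)                       # centre each series (column)
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, 0] * S[0]                        # note: the sign is arbitrary

def build_region_product_series(df, freq):
    """Return D, a dict mapping (region, product) to one combined series.

    freq is the pandas frequency of the date grid, e.g. "D" for the daily
    dataset. Dates in df are assumed to lie on that grid.
    """
    all_dates = pd.date_range(df["date"].min(), df["date"].max(), freq=freq)
    D = {}
    for (region, product), group in df.groupby(["region", "product"]):
        # One column per (subproduct, city), one row per date.
        M = group.pivot_table(index="date",
                              columns=["subproduct", "city"],
                              values="price")
        M = M.reindex(all_dates)
        # Interpolate interior gaps by time, then pad the edges.
        M = M.interpolate(method="time").bfill().ffill()
        T = first_principal_component(M.to_numpy())
        D[(region, product)] = pd.Series(T, index=all_dates)
    return D
```

Calling build_region_product_series(df, "D") on the daily data and the same function with a weekly frequency on the weekly data keeps the two code paths identical apart from the date grid.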

Try to write your code so that it works on both the daily and weekly datasets. The only differences between the datasets are the date range you have to pick and the gap between dates for interpolation (1 week vs. 1 day).
I can provide help with implementing this after we meet. We first need to go over your code.

One extra difficulty for the weekly dataset is that you might have to account for prices reported in the same week but on different days (I don't know if this occurs in the data, you should check). If that is the case, then you should really interpolate on a week-of-the-year index rather than a date index.
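
If it does turn out that prices land on different weekdays, one possible way to interpolate on a week index is sketched below. The function name is made up, and averaging several prices that fall in the same week is my assumption, not something we have agreed on.

```python
import pandas as pd

def interpolate_weekly(dates, prices):
    """Align prices to weeks ending on Friday, then fill missing weeks linearly."""
    # Map every date to the weekly period (Saturday to Friday) that contains it.
    weeks = pd.DatetimeIndex(dates).to_period("W-FRI")
    # Several prices in the same week are averaged (an assumption).
    weekly = pd.Series(list(prices), index=weeks).groupby(level=0).mean()
    # Build the gap-free week index and interpolate over it.
    full = pd.period_range(weekly.index.min(), weekly.index.max(), freq="W-FRI")
    return weekly.reindex(full).interpolate(method="linear")
```

For example, a Thursday and a Friday price from the same week collapse into one weekly value, and a week with no report at all is filled from its neighbours.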

from humanitas.

halccw commented on August 17, 2024

@mstefanro

Yes, 0.2 cutoff rate means choosing those series with at least 80% non-NaN data points before interpolation.
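
For reference, the cutoff rule can be written as a small filter. This is only a sketch: it assumes the series are held as columns of a wide DataFrame (one row per date), and the function name is hypothetical.

```python
import pandas as pd

def select_by_cutoff(wide, cutoff=0.2):
    """Keep the columns whose fraction of NaN values is at most `cutoff`."""
    nan_fraction = wide.isna().mean(axis=0)   # share of missing points per series
    return wide.loc[:, nan_fraction <= cutoff]
```

With cutoff=0.2, a series survives only if at least 80% of its points are present before any interpolation is done.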

We can easily group series in the same region by looping over a mapping such as region[0] = [city1, city2, ...]. I will add stats per region tomorrow.

The final point you mentioned is fine. Prices are always reported on Fridays.

in: all_dates_raw = sorted(list(set(df['date'])))
in: all_dates = pd.date_range(all_dates_raw[0], all_dates_raw[-1], freq='W-FRI')
in: list(set(all_dates) - set(all_dates_raw))

out: 
[Timestamp('2007-03-02 00:00:00', tz=None),
 Timestamp('2007-03-09 00:00:00', tz=None),
 Timestamp('2007-03-16 00:00:00', tz=None),
 Timestamp('2007-03-23 00:00:00', tz=None),
 Timestamp('2007-03-30 00:00:00', tz=None),
 Timestamp('2007-04-06 00:00:00', tz=None),
 Timestamp('2007-04-13 00:00:00', tz=None),
 Timestamp('2007-04-20 00:00:00', tz=None),
 Timestamp('2007-04-27 00:00:00', tz=None),
 Timestamp('2007-05-04 00:00:00', tz=None)]

mstefanro commented on August 17, 2024

Thanks for the feedback.
I don't think you have to redo the statistics. I merely wanted to let you know that in the end we're going to need at least one city in each region of interest.

f4bD3v commented on August 17, 2024

Among the series with an acceptable cutoff rate, we should select those for important commodities:

"Rice is the staple of the south, while bread => wheat is the staple of the north, of course with some cross over. Environmental conditions support this trend; with the largest rice growing in the south and wheat grown mainly in the north. Dal, which is Hindi for lentil, is eaten all over."

"Common vegetables used in cooking; potato, onion, okra, green beans, peas, cauliflower, capsicum, carrot (which are red), mushrooms, eggplant, chilli."

"Available fruits include apples, oranges, mandarins (which they call oranges), bananas, mango and pineapple."

source: http://www.thetravelalmanac.com/india/indian-food.htm

According to this PDF, groundnut oil and peanut oil are the most used oils in India:
http://www.umbrellaindia.com/Different-types-oils.pdf
