Ideally, all the price data should be tabular, with the following columns: <ul dir

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Convert all tabular data into CSV files about humanitas HOT 13 CLOSED

f4bd3v commented on August 17, 2024

Convert all tabular data into CSV files

from humanitas.

Comments (13)

tonyo commented on August 17, 2024

I'll add a script that traverses directories and converts all .XLS files to CSV.

from humanitas.

mstefanro commented on August 17, 2024

Great, but this may need some manual work. Ideally it should be easy to
programatically
get answers to queries of the type "give me the time series of region A
for product B",
where A or B can also be "ALL". As such, we should have a table with the
following columns:

Period (month-year, in some consistent formatting)
Region (region where the average is computed)
Product (product whose price we are interested in)
Value (actual price)

Those tabular files have multiple table headers, spanning across
multiple rows, with multiple sheets.
So just running some xls2csv is not going to be enough to be able to
e.g. load the data in R and work with it.

On 04/03/2014 07:50 PM, Anton Ovchinnikov wrote:

I'll add a script that traverses directories and converts all .XLS
files to CSV.

—
Reply to this email directly or view it on GitHub
#9 (comment).

from humanitas.

halccw commented on August 17, 2024

I tried to align everything manually but in vain. The problem is that for each product, the number of regions and states that provide data in each year could be different.

from humanitas.

tonyo commented on August 17, 2024

I think we should first focus on a couple of specific products (like rice, chicken, bread) and a few specific regions. What's the point of our system if it doesn't work for at least a subset?

Anyway, I added xls_to_csv.py to data_crunching/common, which converts all .xls files it can find under the specified directory. I decided not to push all .csv files to repo to avoid a mess, but if someone doesn't see a problem with it, feel free to commit them as well.

from humanitas.

mstefanro commented on August 17, 2024

Anton, an interesting thing to try is to learn them all at once (I have
a neural network model in mind).
The system may work better at predicting all of them than predicting a
subset, because they might be correlated.

On 04/03/2014 10:23 PM, Anton Ovchinnikov wrote:

I think we should first focus on a couple of specific products (like
rice, chicken, bread) and a few specific regions. What's the point of
our system if it doesn't work for at least a subset?

Anyway, I added xls_to_csv.py to data_crunching/common, which converts
all .xls files it can find under the specified directory. I decided
not to push all .csv files to repo to avoid a mess, but if someone
doesn't see a problem with it, feel free to commit them as well.

—
Reply to this email directly or view it on GitHub
#9 (comment).

from humanitas.

mstefanro commented on August 17, 2024

@ChingChia

There is no way we are going to attempt to implement neural networks in
VBA :D

I don't see why it makes a difference if the regions providing the data
are not the same. The
cells where the data is not provided is already filled with a dash or
something.
Can't you just run some xls2csv tool on each sheet, then open the
csv-s in python, reorganize the data in memory and then dump into a
separate csv?
I'd expect this to be ~40 lines of python code.

On 04/03/2014 09:44 PM, chingchia wrote:

I tried to align everything manually but in vain. The problem is that
for each product, the number of regions that provide data in each year
is different.

Other than R, considering the complexity of the data format, VBA for
Excel (visual basic for application) might be useful. Anyone used it
before?

Differentiating data for different regions can still be done by hand,
very painstaking but not infeasible...

—
Reply to this email directly or view it on GitHub
#9 (comment).

from humanitas.

halccw commented on August 17, 2024

@mstefanro

Yes I agree. Actually I meant using VBA to do the job you said using python. But using VBA seems a bad idea.

And yes, in some cases, missing data are filled with dashes. But if you look more closely to the original dataset (not the one I uploaded today), for each product you will find out that under each state, the number and the order of cities of that state will be different in different years.

Other than that, the subcategories of products differ in different years too. For example, rice has prices for "common" and "fine" in one year, and has data for "common", "fine", and "superfine" in another. So combining these two makes it harder to produce a correctly aligned version.

I also agree to Anton's suggestion. To hand pick some significant products in important states or cities for preliminary analysis.

from humanitas.

tonyo commented on August 17, 2024

A small update: I have added one column to daily CSV files that describes tonnage for the given product and region. So now CSV files have 8 columns:

date, averaging-period (week/month), country (India/Indonesia), region (optional), product (e.g. milk), subproduct (e.g. condensed; optional), price, tonnage.

Example of a row:
01/01/2005,day,India,Biharsharif,Rice,Fine,1000,5.0

Also, CSV files will have column names as their first row to avoid confusion. I'll upload/update daily data for several products tonight.

from humanitas.

mstefanro commented on August 17, 2024

Quick question: in the case of daily data, weren't we supposed to have
ranges of prices, as opposed to exact values?

On 04/14/2014 01:35 PM, Anton Ovchinnikov wrote:

A small update: I have added one column to daily CSV files that
describes tonnage for the given product and region, so now CSV files
have 8 columns:

date, averaging-period (week/month), country (India/Indonesia), region
(optional), product (e.g. milk), subproduct (e.g. condensed;
optional), price, tonnage.

Example of a row:
01/01/2005,day,India,Biharsharif,Rice,Fine,1000,5.0

Also, CSV files will have column names as their first row to avoid
confusion. I'll download/update daily data for several products tonight.

—
Reply to this email directly or view it on GitHub
#9 (comment).

from humanitas.

tonyo commented on August 17, 2024

Yes, there were ranges, but I used modal values (the last column). I believe ranges will only complicate implementation, but in any case min and max are present in raw data, which I also store.

from humanitas.

mstefanro commented on August 17, 2024

It's alright this way, for now (but keep the raw data).

Do we only have daily data for rice?

On 04/14/2014 05:52 PM, Anton Ovchinnikov wrote:

Yes, there were ranges, but I used modal values (the last column). I
believe ranges will only complicate implementation, but in any case
min and max are present in raw data, which I also store.

—
Reply to this email directly or view it on GitHub
#9 (comment).

from humanitas.

tonyo commented on August 17, 2024

Also for onion and wheat.
We should choose several important products (ten will suffice, I think). Downloading all products will be a waste of time, there's too many of them, and the majority is described by data which is pretty scarce.
Here's the product list: http://agmarknet.nic.in/cmmlist_today.asp?dt=14/04/2013

What about taking, say, apple, banana, carrot, potato, paddy(Dhan)? Unfortunately, there's no meat and bread there..

from humanitas.

mstefanro commented on August 17, 2024

You can see the products for which we have the best weekly data in
analysis/statistics/india-weekly-retail/products.txt
Maybe we should use those for which there is the most weekly data as well.

On 04/15/2014 12:13 AM, Anton Ovchinnikov wrote:

Also for onion and wheat.
We should choose several important products (ten will suffice, I
think). Downloading all products will be a waste of time, there's too
many of them, and the majority is described by data which is pretty
scarce.
Here's the product list:
http://agmarknet.nic.in/cmmlist_today.asp?dt=14/04/2013

What about taking, say, apple, banana, carrot, potato, paddy(Dhan)?
Unfortunately, there's no meat and bread there..

—
Reply to this email directly or view it on GitHub
#9 (comment).

from humanitas.

Convert all tabular data into CSV files about humanitas HOT 13 CLOSED

Comments (13)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent