I have used the code and data provided in this repository to create my own pipelin

Potentially Useful Data for Machine Learning about data HOT 14 CLOSED

wilschmidtt commented on May 19, 2024 1

Potentially Useful Data for Machine Learning

from data.

Comments (14)

dataf3l commented on May 19, 2024 1

@wilschmidtt, I suggest that, in addition to safety_measures, one includes a safety_measures_start_date so that the model is still useful when new countries adopt measures, given that so many countries have different measures.

Also, we can make a poll where we ask individuals from all countries to participate and provide information so that we can fill in this data easily. If you write a Google Forms poll, I can send it to friends in Nepal, India, the US, Colombia, Chile, Mexico, Australia, Peru, Belgium, and France, and I can also translate the poll into Spanish to share it with people in the Latin American region.

All you need is one nerd per country and you are set; this person can become an information source. The poll should also ask people what the source of their data is.

If the problem is data collection, I think we can find the people to help.

Just send the questions in English, and I'll send back the data in CSV or whatever format you want.

Remember, the fewer questions, the more datapoints.

owahltinez commented on May 19, 2024

Hey William, thanks for sharing -- this is pretty cool! I think that adding all the columns you propose might make the main dataset a bit bloated, but I'd love to add some of them if we can find a reliable source. Specifically, I'd like to get a better understanding of where you got the SafetyMeasures data from. If we can get a reliable source for that, we could add a column to the dataset with one of these values:

Unknown (null)
No measures ("none")
International travel restricted ("international_travel")
Local travel restricted ("local_travel")
Shelter in place enacted ("shelter_in_place")
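
As a rough illustration of how the values listed above could be handled in code, here is a minimal Python sketch. The names `SafetyMeasure` and `parse_safety_measure` are hypothetical and not part of this repo; the only thing taken from the list is the set of string values, with an empty cell assumed to stand for "Unknown (null)".

```python
from enum import Enum
from typing import Optional


class SafetyMeasure(Enum):
    """Hypothetical enum mirroring the column values proposed in the list."""

    NONE = "none"
    INTERNATIONAL_TRAVEL = "international_travel"
    LOCAL_TRAVEL = "local_travel"
    SHELTER_IN_PLACE = "shelter_in_place"


def parse_safety_measure(raw: Optional[str]) -> Optional[SafetyMeasure]:
    """Map a raw CSV cell to a SafetyMeasure, or None when the value is unknown."""
    if raw is None or raw.strip() == "":
        return None  # Unknown (null); assumes an empty cell means "no information"
    # Raises ValueError for anything outside the proposed set of values.
    return SafetyMeasure(raw.strip())
```

Rejecting unexpected strings instead of silently accepting them is one possible design choice; it would surface typos in the metadata files early.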

If you want to, you can open a PR and edit the relevant metadata_*.csv files and fill the Population and SafetyMeasures columns. Unless I missed something, you can infer the other columns that you mentioned from the data itself.

wilschmidtt commented on May 19, 2024

The SafetyMeasures column wasn't fetched from any online source. I looked online for a site that reported this information but couldn't find anything useful. I populated this column by dividing the number of confirmed cases by the population, and when the number of confirmed cases exceeded 0.002% of the population, I changed the SafetyMeasures column from 0 to 1. This method is a bit arbitrary, so I can see why it might not be the best feature to include. I chose 0.002% based on observing at what point different locations started to take action. From what I observed, this came right around the time 0.002% of a location's population had been infected by the virus.
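
For anyone who wants to reproduce the heuristic, it can be sketched roughly as below. This is only an illustration of the description above, not the actual pipeline code: the function name, its arguments, and the example numbers are hypothetical, and keeping the flag at 1 after it is first set is an assumption.

```python
# Threshold from the description: 0.002% of the population, as a fraction.
THRESHOLD = 0.002 / 100


def safety_measures_flags(confirmed_by_day, population, threshold=THRESHOLD):
    """Return one 0/1 flag per day.

    The flag becomes 1 on the first day the confirmed share of the
    population exceeds the threshold and stays at 1 afterwards
    (the "stays at 1" part is an assumption of this sketch).
    """
    flags = []
    triggered = False
    for confirmed in confirmed_by_day:
        if confirmed / population > threshold:
            triggered = True
        flags.append(1 if triggered else 0)
    return flags


# Made-up numbers: with 1,000,000 people, the flag flips once
# confirmed cases go above 20.
example_flags = safety_measures_flags([5, 12, 19, 21, 40], population=1_000_000)
print(example_flags)
```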

I agree that international_travel, local_travel, and shelter_in_place would all be much more reliable features. The only problem is that I am not sure where such data would be available.

I will open a PR to edit the metadata populations in the meantime.

dataf3l commented on May 19, 2024

Actually, I just noticed there is a date in the dataset, so never mind, my suggestion doesn't make sense.

owahltinez commented on May 19, 2024

@dataf3l I think your idea is still valid: we can put the safety measures in their own CSV table and then merge it during the data processing stage. In my opinion, the biggest difficulty would be keeping it up to date, since measures are changing very fast across different countries.
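
A minimal sketch of what such a merge could look like, using only the Python standard library. Everything here is hypothetical: the column names (`Date`, `CountryCode`, `Confirmed`, `StartDate`, `SafetyMeasures`), the country codes, and the numbers are made up for illustration and do not reflect the repo's actual schema or pipeline. It also assumes at most one measure per country.

```python
import csv
import io

# Made-up main table and a separate, made-up safety-measures table.
MAIN_CSV = """Date,CountryCode,Confirmed
2020-03-18,AA,93
2020-03-20,AA,145
2020-03-20,BB,621
"""

MEASURES_CSV = """CountryCode,StartDate,SafetyMeasures
AA,2020-03-19,shelter_in_place
"""


def merge_measures(main_csv: str, measures_csv: str) -> list:
    """Left-join the measures table onto the main table by country code.

    A row receives the measure only when its date is on or after StartDate;
    otherwise the SafetyMeasures cell is left empty (unknown).
    """
    measures = {
        row["CountryCode"]: row
        for row in csv.DictReader(io.StringIO(measures_csv))
    }
    merged = []
    for row in csv.DictReader(io.StringIO(main_csv)):
        measure = measures.get(row["CountryCode"])
        # ISO 8601 dates (YYYY-MM-DD) compare correctly as plain strings.
        active = measure is not None and row["Date"] >= measure["StartDate"]
        value = measure["SafetyMeasures"] if active else ""
        merged.append({**row, "SafetyMeasures": value})
    return merged


merged_rows = merge_measures(MAIN_CSV, MEASURES_CSV)
```

Keeping the start date in the separate table means the main dataset does not need to change when a country adopts new measures; only the small table has to be updated.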

wilschmidtt commented on May 19, 2024

@dataf3l this could still be a good idea. Like I said, the 'SafetyMeasures' column is pretty arbitrarily chosen at this point. I couldn't find a good source of data indicating when each location started issuing quarantines. I had to search all over the web, and each bit of information that I found was exclusive to one location, so trying to fill it in for every location would take far too long.

From what I observed, it seemed that right around 0.002% confirmed is when governments started to feel the pressure and issue warnings to the public. I tried to use this information to infer the date on which preventative measures were put into place, but if there were actual sources that could verify this date, I think that would be even better.

wilschmidtt commented on May 19, 2024

@dataf3l there is also the problem of keeping it up to date. The nice thing about the 0.002 % threshold is that it automates the process and doesn't require any manipulation of the data by the user.

dataf3l commented on May 19, 2024

I think that's interesting, what about renaming the column HasPassed2PercentSoWeGuesstimateMeasureHaveBeenTakenButHaveNoRealDataSoIt'sJustAGuess :p

dataf3l commented on May 19, 2024

I'm merely joking; I see that having no data is clearly an issue. Having up-to-date data will also be an issue.

wilschmidtt commented on May 19, 2024

@dataf3l this is a decent suggestion. But I was thinking something more along the lines of ArbitrarilyChosen2PercentBecauseImTooLazyToFindRealSourcesAndUpdateTheDataEachDaySoThisIsAllWeGot

dataf3l commented on May 19, 2024

Here is what the dataset could look like:

CO:2020-03-19:https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Colombia
PE:2020-03-22:https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Bolivia
BR:????:https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Brazil
CL:2020-03-22:https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Chile
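
Lines in this shape could be parsed along the following lines. This is only a sketch: `MeasureRecord` and `parse_record` are hypothetical names, and treating `????` (or any non-ISO date) as "unknown" is an assumption. The one real subtlety is that the URL itself contains colons, so the split has to stop after the second one.

```python
from datetime import date
from typing import NamedTuple, Optional


class MeasureRecord(NamedTuple):
    country_code: str
    start_date: Optional[date]  # None when the date is unknown, e.g. "????"
    source_url: str


def parse_record(line: str) -> MeasureRecord:
    """Parse one 'CODE:DATE:URL' line into a MeasureRecord."""
    # Split on the first two colons only; the URL contains ":" as well.
    code, raw_date, url = (part.strip() for part in line.split(":", 2))
    try:
        start = date.fromisoformat(raw_date)
    except ValueError:
        start = None  # assumption: anything that is not YYYY-MM-DD is unknown
    return MeasureRecord(code, start, url)


record = parse_record(
    "CO:2020-03-19:https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Colombia"
)
```

A regular CSV with separate columns for country code, date, and source would avoid the colon ambiguity altogether.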

Here is where I got the data from:

Other countries:
https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_South_America#Argentina
Other continents:
https://en.wikipedia.org/wiki/2019%E2%80%9320_coronavirus_pandemic_by_country_and_territory

I think as people spend more time on it, it is likely that we'll be able to improve the dataset.
Let's make this happen.

If you make a Google Forms doc, I'll send it around :)

owahltinez commented on May 19, 2024

@dataf3l thank you for those links. That makes me wonder if a better approach would be to propose the creation of a new table on the Wikipedia page rather than trying to collect that data in this repo. That way, the data would be available to a lot more people, and we could still scrape it from Wikipedia ourselves.

Personally, I would prefer to keep the efforts in this repo focused towards (automated) data aggregation rather than the creation of crowd-sourced data -- even though crowd-sourced data was the original intent of this repo!

dataf3l commented on May 19, 2024

Should mankind make an app to track movements and self-report if one has symptoms so that people can avoid paths with people with symptoms?

owahltinez commented on May 19, 2024

FYI I have added mobility and government measures datasets which are relevant to this discussion.
