data's People

Contributors

adamobeng, ajnewman, andrewflowers, ascheink, atmccann, bencasselman, bycoffe, charliesmart, dmil, fivethirtyeight-bot, forecasterenten, frankbi, gwezerek, hfuong, jayb, jooncodes, juruwolfe, monachalabi, neil-paine-1, peterthehan, ppaulojr, radcliffem, reubenfb, rhiever, ritchieking, rudeboybert, ryanabest, sfrostenson, stephenturner, tylerlittlefield

data's Issues

Method for normalization

Hi,

Could you please share the method that you used for normalizing the ratings and user count?

Thanks.

Separate TSVs for all March Madness updates?

@dmil Is there any reason that you're adding a new TSV file for every change? Git makes this unnecessary/undesirable. Viewing (or restoring) previous versions of files is easily done with Git (and GitHub), and keeping everything in the same file would make it easier to view what changed in each commit. As Git is designed to track changes to individual files, separating everything defeats some of its purpose and functionality.

If you still want to have multiple files, I would suggest no more than one per round. But the Round of 64 hasn't even started and we already have 7 separate files; so at this rate there will be a lot of clutter soon.

Oscars data available?

Hey 538 team,

Wondering if you guys could upload your Oscars data here? Or is there a good public DB/API that houses the data? Thanks!

Sworn vs all staff

This is actually for all police force staff and not just "Officers", right? The totals are completely wrong otherwise: Oakland's force is around 700, so the 1,530 has to be all OPD staff, not just cop cops. Can you clarify?

Burrito Bracket data missing

Thank you for making your data available.

Any chance I can get my hands on some delicious, delicious burrito data?

Add dataset for population and race percentages by county

I have compiled the following spreadsheet, which I think would be a good fit for this data repository. All data comes from 2010 census figures, and its advantage over said tables is that it doesn't require going through a 5000-line cross reference to figure out what each column means, and all figures are in the same sheet instead of split across 20 different ones over 3 files.

I understand that this repository is meant for datasets featured in 538 articles, but it seems likely that this sort of data will be used in the future, and I can't think of a better repository for this to live in.

Race % and Pop by County.xlsx

Suggestion on repo structure

I see that you're grouping all data sets in one repo. While there's some convenience to organizing things that way, I think it's going to make it more difficult for curious readers to sort through once you've published hundreds of data sets. It would probably be better in the long term to do one repo for each story or data set, and then link to that individual repo from the story.

Given that there are only three data sets posted so far, this will be easier to re-organize now than later.

Oakland, California numbers

Oakland is listed as having 1530 police. I doubt it has ever had that many. http://www.nytimes.com/2012/03/25/us/oakland-police-try-to-fill-the-ranks-but-keep-falling-behind.html says 837 in November, 2008, and 636 at the time of the article in March, 2012.

The graphic in http://fivethirtyeight.com/datalab/most-police-dont-live-in-the-cities-they-serve/ says the numbers are 2010 with source U.S. Census. Is this census data source online? I did some naive searches on census.gov and don't see anything obvious.

Add license

The repo contains no license file or link to license information that I could find.

Include reference to Creative Commons in LICENSE.md

The readme includes the text:

We hope you'll use it to check our work and to create stories and visualizations of your own. The data is available under the Creative Commons Attribution 4.0 International License and the code is available under the MIT License.

LICENSE.md only includes a reference to MIT.

When looking for licensing information, people (and scripts) tend to look at LICENSE.md and assume all the info is there.

It looks like someone has already made the mistake of thinking everything was under MIT.
https://www.kaggle.com/fivethirtyeight/fivethirtyeight

Data Dictionary for the datasets

Hi @BenCasselman, thanks for uploading these useful datasets. Can we also get a data dictionary explaining the columns? For example, in the grad_students dataset, the columns Grad_employed and Grad_unemployed don't add up to Grad_total, so a dictionary would go a long way to help. Also, how was the unemployment rate computed?
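To make the discrepancy concrete, here is a minimal sketch of the check being described. The column names Grad_total, Grad_employed and Grad_unemployed come from the issue; the Major column, the sample row and the rate formula (unemployed divided by employed plus unemployed) are assumptions for illustration, not a confirmed description of how the dataset was built. One possible explanation for the gap is graduates who are not in the labor force, but only a data dictionary could confirm that.

```python
import csv
import io


def labor_force_gaps(csv_text):
    """For each row, report how many graduates are neither employed nor unemployed."""
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        total = int(row["Grad_total"])
        employed = int(row["Grad_employed"])
        unemployed = int(row["Grad_unemployed"])
        labor_force = employed + unemployed
        results.append({
            "Major": row["Major"],  # hypothetical label column
            "unaccounted": total - labor_force,
            # Assumed definition: unemployed as a share of the labor force.
            "unemployment_rate": unemployed / labor_force if labor_force else None,
        })
    return results


# Made-up numbers, only to show the shape of the check.
SAMPLE = """Major,Grad_total,Grad_employed,Grad_unemployed
Example Major,1000,760,40
"""

gaps = labor_force_gaps(SAMPLE)
print(gaps)
```

If the rate computed this way does not match the published unemployment rate column, that would be a second question for the data dictionary to answer.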

Show your work/Data Hosting

Also, data hosting becomes interesting once your dataset gets past 100 MB (GitHub's max file upload size) or 1 GB in a repo (GitHub's max repo size), and Git itself becomes slow with 10 MB files and 100 MB repos. I've used Amazon S3 and post-pull hooks to create a /data directory listed in the .gitignore to avoid this issue in the past, as part of work at dssg. Anyway, you might need a bigger solution, but if not:
image
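As a rough way to see how close a checkout is to the thresholds mentioned above, a small script can list files above a given size. This is only a sketch: the function name is made up for illustration, and the demo uses a throwaway directory with a tiny limit so it runs instantly.

```python
import os
import tempfile


def files_over_limit(root, limit_bytes):
    """Return (path, size) pairs for files larger than limit_bytes, largest first."""
    oversized = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git's own object store; only working-tree files are of interest.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit_bytes:
                oversized.append((path, size))
    return sorted(oversized, key=lambda item: item[1], reverse=True)


# Demo on a temporary directory: one 10-byte file and one 500-byte file.
with tempfile.TemporaryDirectory() as demo_root:
    with open(os.path.join(demo_root, "small.csv"), "wb") as handle:
        handle.write(b"x" * 10)
    with open(os.path.join(demo_root, "big.csv"), "wb") as handle:
        handle.write(b"x" * 500)
    oversized = [
        (os.path.basename(path), size)
        for path, size in files_over_limit(demo_root, limit_bytes=100)
    ]

print(oversized)
```

On a real checkout the call would be something like `files_over_limit(".", 100 * 1024 * 1024)`, using the 100 MB figure quoted in this issue.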

Stream

Can you post code required to get the CSV data from the Twitter API? I would like to create and host a streaming / realtime version of this. It would be a cool addition to your site IMO.

Duplicate in CSV File

"The Grand Illusion" by Styx is duplicated in data/classic-rock/classic-rock-song-list.csv
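A quick way to spot rows like this is to count how often each song and artist pair occurs. The sketch below is illustrative only: the column names `song` and `artist` and the sample rows are assumptions, and the real file's headers may differ.

```python
import csv
import io
from collections import Counter


def find_duplicate_rows(csv_text, key_columns):
    """Return {key: count} for every key that appears more than once."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(
        # Normalize case and whitespace so near-identical entries match.
        tuple(row[col].strip().lower() for col in key_columns)
        for row in reader
    )
    return {key: n for key, n in counts.items() if n > 1}


# Made-up sample rows; the real CSV has more columns.
SAMPLE = """song,artist
The Grand Illusion,Styx
Renegade,Styx
The Grand Illusion,Styx
"""

duplicates = find_duplicate_rows(SAMPLE, ["song", "artist"])
print(duplicates)
```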

Irreproducible Research

No code or data has been shared for the following articles:

http://fivethirtyeight.com/datalab/the-return-of-mlbs-youth/
http://fivethirtyeight.com/datalab/what-to-expect-from-baseball-americas-top-100-prospects/
http://fivethirtyeight.com/features/the-hidden-value-of-the-nba-steal/

Some issues that could be addressed by publicly sharing data and code:

  • A steal is worth 9.1 times a point, but the article makes no mention of confidence intervals.
  • This conclusion is drawn from a sample of players who have missed at least 20 games and played at least 20 games in a season. Is this a representative sample? We have no way of assessing this because the underlying dataset has not been shared publicly.

Line endings are CR instead of LF or CR/LF

Hello!

First of all, thanks for posting the data for the stories, which makes it easier to follow the methodology described in each article. But, please, could you upload the CSV files with correct line endings? As of now, most of the CSV files have CR line endings instead of the more "canonical" LF or CR/LF. The following questions on StackOverflow provide some guidelines:

http://stackoverflow.com/questions/2332349/best-practices-for-cross-platform-git-config/
http://stackoverflow.com/questions/10491564/git-and-cr-vs-lf-but-not-crlf

Thanks!
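In the meantime, the endings can be detected and converted locally. This is a minimal sketch that works on raw bytes; the function names are made up for illustration. Note the caveat that it also rewrites any CR or CRLF embedded inside quoted CSV fields.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, bare CRs and bare LFs in a byte string."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "CR": data.count(b"\r") - crlf,  # CRs not followed by LF
        "LF": data.count(b"\n") - crlf,  # LFs not preceded by CR
    }


def normalize_line_endings(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF."""
    # Handle CRLF first so each pair becomes one newline, not two.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Made-up sample with classic Mac-style CR endings.
sample = b"a,b\r1,2\r3,4\r"
before = count_line_endings(sample)
fixed = normalize_line_endings(sample)
after = count_line_endings(fixed)
print(before, after)
```

For a file on disk, reading and writing in binary mode (`"rb"` and `"wb"`) keeps Python from translating the endings on its own.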

Race (Unknown-White) in Biopics CSV

I noticed that in the CSV of the biopics, every single line in which the race_known column was Unknown had the subject_race column as White; was it that White was the default race, or that you would guess (based on the subject's name and appearance, but without confirmation based on ancestry or self-reporting) that all 197 of the subjects were White? I realize that this wouldn't change the displays, because "Unknown" makes the race column meaningless, but it is a bit curious.
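The pattern can be reproduced with a simple cross-tabulation of the two columns. The column names race_known and subject_race are the ones mentioned above; the `title` column and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def crosstab(csv_text, col_a, col_b):
    """Count how often each (col_a, col_b) value pair occurs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter((row[col_a], row[col_b]) for row in reader)


# Made-up rows that mimic the pattern described in this issue.
SAMPLE = """title,race_known,subject_race
A,Known,White
B,Unknown,White
C,Unknown,White
D,Known,Asian
"""

table = crosstab(SAMPLE, "race_known", "subject_race")
# Which subject_race values occur on rows where race_known is "Unknown"?
unknown_values = {race for (known, race) in table if known == "Unknown"}
print(table)
print(unknown_values)
```

If the set printed for the real file contains only one value, that supports the reading that it is a default rather than a per-subject judgment, though only the authors can confirm which.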

Is the shoot out data available?

I teach high school computer science with a math teacher who also teaches AP Stats. The raw 269,000 match preferences would be an awesome data set for us to play with. Is that available? I only see your dataset, which contains the reduction of that.

College majors repo does not link to story

Can be changed in the readme by adding

[FiveThirtyEight's story on earnings of college majors](http://fivethirtyeight.com/features/the-economic-guide-to-picking-a-college-major/) 

as a link in README.md

Calculations for diverse/segregated cities?

It'd be great to see the calculations for http://fivethirtyeight.com/features/the-most-diverse-cities-are-often-the-most-segregated/

Particularly for calculating the integration-segregation index (and, I suppose by necessity, the trend line the index is based on). The diversity index calcs we can get from footnotes 5 and 6, and the data from, as the article notes, US Census tract data. This gives readers enough to calculate diversity indices for any community (e.g. smaller towns and suburbs), but not enough to create an integration index.

Soccer SPI data

Dear 538 team,

I am curious about the preparation of your soccer data, here: https://github.com/fivethirtyeight/data/tree/master/soccer-spi. There appears to be some missing data: there are 467 unique teams in the matches csv, but only 453 teams in the ranks csv. Is it possible to obtain a complete version of this dataset? Thanks in advance for your help!

Best,
Stephanie
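To list exactly which teams are missing, the two files can be compared with a set difference. This is a sketch only: the column names `team1`, `team2` and `name` are assumptions about the two CSVs, and the team names below are invented.

```python
import csv
import io


def teams_in_matches(csv_text, home_col="team1", away_col="team2"):
    """Collect every team that appears on either side of a match."""
    teams = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        teams.add(row[home_col])
        teams.add(row[away_col])
    return teams


def teams_in_rankings(csv_text, name_col="name"):
    """Collect every team that has a row in the rankings file."""
    return {row[name_col] for row in csv.DictReader(io.StringIO(csv_text))}


# Invented sample data standing in for the matches and ranks files.
MATCHES = """team1,team2
Alpha FC,Beta United
Gamma City,Alpha FC
"""
RANKS = """name,rank
Alpha FC,1
Beta United,2
"""

missing = sorted(teams_in_matches(MATCHES) - teams_in_rankings(RANKS))
print(missing)
```

Running the same comparison on the real files would show whether the 14-team gap comes from absent teams or from spelling differences between the two files.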

2016 Election Forecast data feed

(Feature request rather than data issue.)
Will you have a data feed for the 2016 Election Forecasts that can be used in 3rd party apps?
I've been writing voice applications for the Amazon Echo, including "Tweet Poll" which used "IBM Insights for Twitter" for state-by-state sentiment analysis of candidates during the primary season.
I'm interested in writing an interface that surfaces the daily 538 forecast. I would love to know if there is going to be a data feed for that, and what Terms and Conditions would apply to it. (Or who to talk to about setting up an app specific feed.)

Issues with data in the college-majors repo

Hi all,

I've been working with the data in the college majors repo over the weekend: https://github.com/fivethirtyeight/data/tree/master/college-majors

I think something went wrong during the data processing, since if you compare the gender-related data in recent-grads.csv and women-stem.csv, they don't match up. recent-grads.csv also indicates that 56% of all computer science majors are female, which is way off.

I'm looking into the data right now to try to find out what happened, but I'd appreciate a second look at this data set to make sure everything is in order.

Cheers,
Randy

Making the Data Available in Earlier Stages

First, thanks FiveThirtyEight for making your data available--this is very cool to see, and interesting to be able to replicate results.

One request I have, which may or may not be tenable, is to make data available from the earlier stages of variable construction. For instance, in Nate's recent piece on airline safety, the data we have access to is the number of incidents, fatalities, etc from 1985 to 1999, and again from 2000-2014. While the data is interesting to see, the same data by year would be even more interesting, as would a list of all incidents and how they are coded.

For example, it's easy to imagine analysis that could be done by combining the by-year incidents with airline sales in subsequent years (something Nate alludes to). However, this isn't something we're able to do, given that we can only see the data in 15-year chunks. The earlier the data is available to us, the more flexibility we'll have in using your data to develop our own theories and test them, which makes it of greater use to us.

I can appreciate the reasons why this might not be a good idea for FiveThirtyEight (in particular, it means any assumptions and decisions made in cleaning are potentially open to criticism) but to the extent possible, making the earlier stages of your data publicly available would be very appreciated.
