candy's Issues

Candy Thoughts

I've now mostly cleaned this dataset and want to share a few of the thoughts and questions I had along the way. Hopefully others can weigh in here and suggest the best way to proceed:

  • Candy Names: after converting to a long format, candy name becomes a character variable and therefore the candy names could be anything. However, some are long or contain odd characters (e.g. smart quotes) and I think some cleaning is in order. To what extent should they be cleaned? Can it be done in an automated fashion or should we just do it by hand for the problem cases? For the time being I've removed all non-alphanumeric characters and converted to lowercase.
  • Degrees of Separation: I've stored this as a separate table in long format (id, person, degrees). Degrees can be 1, 2, or >3. Should degrees be a factor with three levels or an integer with values 1-3? The latter requires grouping >3 into just 3. For the time being I've kept 2 variables, one factor and one integer.
  • Age: what ages should be considered unreasonable? There are several outliers over 80. Are they from real older people or are they just junk data? I'm inclined to think the latter, and I'd rather have no data than bad data, so I've set ages over 80 to NA. Thoughts?
  • # of Mints: lots of non-numeric data that I've set to NA.
  • Logical vs. 2-level Factor: there are several questions that have two possible answers: Friday/Sunday, Betty/Veronica, Blue&Black/White&Gold. They could be left as character/factor or converted to logical T/F values. Which is better?
  • Intelligent Design: this field allows two options, plus a third "other" field entered as free text. For the time being, I'm ignoring the text in the other field and storing this as a 3-level factor: interior design, bullshit, and other. I don't see any value in keeping the free text. Thoughts?
  • Tears of Sadness: this field comes from a set of 5 check boxes and is stored as the values of the chosen boxes separated by commas. I see two ways to store these data: as a separate 3 column table (id, thing, caused_tears (T/F)) or within the table of respondent level data as 5 T/F columns.
  • There are 4 totally free text fields that I'm pretty much ignoring for the time being: other joy, other despair, comments, and fonts.
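
To make a few of these choices concrete, here is a rough Python sketch of the cleaning rules described above: stripping candy names down to lowercase alphanumerics, turning non-numeric entries into missing values, dropping ages over 80, and splitting the comma-separated Tears of Sadness answers into one T/F flag per check box. It is not the code used on the dataset. The function names and the example option labels are made up for illustration, and `None` stands in for NA.

```python
import math
import re


def clean_candy_name(name):
    """Lowercase the name and drop every character that is not a-z or 0-9."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def parse_number(raw):
    """Return raw as a float, or None (standing in for NA) if it is not numeric."""
    try:
        value = float(str(raw).strip())
    except ValueError:
        return None
    # Treat "nan" and "inf" as junk rather than as real numbers.
    return value if math.isfinite(value) else None


def clean_age(raw, max_age=80):
    """Return the age as a float, or None if it is non-numeric or above max_age."""
    value = parse_number(raw)
    if value is None or value > max_age:
        return None
    return value


def tears_to_flags(raw, options):
    """Turn a comma-separated answer into one True/False flag per option.

    `options` is the list of check box labels (hypothetical here).
    """
    chosen = {part.strip() for part in raw.split(",") if part.strip()}
    return {option: option in chosen for option in options}


if __name__ == "__main__":
    print(clean_candy_name("Reese\u2019s Peanut Butter Cups"))
    print(clean_age("34"), clean_age("85"), parse_number("a few"))
    print(tears_to_flags("a, c", ["a", "b", "c"]))
```

The flag dictionary corresponds to the "5 T/F columns" option; the same set of chosen labels could just as easily be written out as rows of a long (id, thing, caused_tears) table. Note that the comma split would break if any check box label itself contained a comma.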

Profanity

The profanity in the "what's your favourite font?" question has two implications:

  • You could actually do some analysis of a binary indicator of profanity vs. no profanity. How does that relate to other things? Age, candy preference, favourite font, timestamp, ...
  • Before we can make this a data package for the world, that's got to get cleaned up. No way I would submit that to CRAN.
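
As a rough illustration of the binary indicator idea, the Python sketch below flags a free-text answer if any of its words appears in a word list. The list here is a harmless stand-in invented for the example, not one of the real lists, and `has_profanity` is a hypothetical helper name.

```python
import re

# Hypothetical stand-in list; a real one would be loaded from a published word list.
PLACEHOLDER_WORDS = {"darn", "heck"}


def has_profanity(text, words=PLACEHOLDER_WORDS):
    """Return True if any whole word in text appears in the word list."""
    # Split into lowercase word tokens so "heckler" does not match "heck".
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(token in words for token in tokens)


if __name__ == "__main__":
    for answer in ["Comic Sans, darn it", "Helvetica", "Heckler Grotesk"]:
        print(answer, has_profanity(answer))
```

Matching on whole words keeps false positives down, at the cost of missing creative spellings; the resulting True/False column could then be compared against age, candy preference, or timestamp.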

From helpful people on Twitter (see the replies to this), I've learned of some official naughty word lists:
