
dat_sf_12's Introduction

DAT SF 12


Instructor: Alessandro Gagliardi
EiRs: Ramesh Sampath, Otto Stegmaier, Alex Chao
Classes: 6:30pm-9:30pm, Tuesdays and Thursdays, January 15 – March 31
Office Hours: Alex Chao, 5:30-6:30 before class at GA
              Otto Stegmaier, 9:30-10:00 after class at GA
              Ramesh Sampath, 4:00-6:00 Saturdays (remote)
              Office hours can also be set by appointment

Homework is to be submitted by posting it to your own GitHub repo. Then post the URL and the folder where the homework lives here.


Tentative Course Outline

  1. Intro to Data Science, Relational Databases & SQL
  2. Getting started with IPython & Git
  3. APIs and semi-structured data
  4. IPython.parallel & StarCluster
  5. Hadoop Distributed File System and Spark
  6. Intro to ML: k-Nearest Neighbor Classification
  7. Clustering: Hierarchical and K-Means
  8. Probability, A/B Tests & Statistical Significance
  9. Multiple Linear Regression and ANOVA
  10. Project Elevator Pitches
  11. Logistic Regression and Generalized Linear Models
  12. Time Series Analysis & Midterm Review
  13. Principal Components Analysis
  14. Text Mining & Naïve Bayes
  15. Nonlinear Models
  16. Grid Search and Parameter Selection
  17. Bringing it Together
  18. Final Project Working Session
  19. Final Project Working Session
  20. Final Project Presentations (12 min. each)
  21. Final Project Presentations (12 min. each)
  22. Future Directions

Project Schedule

Date    | Due                                                    | Returned
1/22    | Preliminary Project Proposals (3-4 sentences)          |
1/27    | Homework 1                                             |
1/29    |                                                        | EiR Feedback on Project Proposals
2/3     |                                                        | EiR Feedback on Homework 1
2/5     | Formal Proposals (including data and methods chosen)   |
2/10    | Homework 2 Assigned                                    |
2/12    |                                                        | EiR Feedback on Formal Proposals
2/17    | Homework 2 Due                                         |
2/19    | Homework 3 Assigned; Project Elevator Pitch in class   |
        | (4 minutes each); Project Live on GitHub               |
2/24    | Homework 3 Due                                         | EiR Feedback on Homework 2
2/26    | Peer Feedback of Projects                              | Peer Feedback on Project
3/3     | Peer Feedback of Homework 3                            | Peer Feedback on Homework 3
3/10    | Midterm Assessment Due                                 |
3/17    | At least one working model                             |
3/24-26 | Final Presentations (12 minutes each)                  | Midterm Graded
Note: I am working through the feedback on my proposal and hope to have the formal proposal posted on 2/12/15.


dat_sf_12's People

Contributors

maddatascience, alexchaomander, jawilliams3000, sampathweb, vanessaohta


dat_sf_12's Issues

Project Proposal Feedback

@jawilliams3000

Thanks for the proposal. I'll do some inline comments below:

You wrote:
I would like to examine how major events involving violence in the US impact the popularity of music genres.
Comment:
Great - what sort of questions are you trying to answer? Where will your initial exploratory analysis go? I like the idea, but tell me more about what you're looking for and how the user might find this helpful.

You wrote:
I plan to get data from Spotify and/or Soundcloud to identify changes in popularity in musical genres around events such as the Boston Marathon bombings, ISIS executions, school shootings, and the Ferguson riots.

Comment:
This could be more difficult than you think. Have you contacted them about this, or is there a dataset available? Another option might be to use the Twitter API to pull tweets related to music and work with that data (a rough sketch follows below).
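
If you go the Twitter route, a minimal sketch with tweepy might look like the following. The credentials, genre keywords, and counts are placeholders, and Twitter's API access rules have changed over time, so treat this as a starting point rather than a recipe:

import tweepy

# Placeholder credentials from a Twitter developer app.
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

genres = ["hip hop", "country", "metal", "gospel"]   # illustrative genre keywords
tweets = []
for genre in genres:
    # tweepy >= 4 names this search_tweets; older versions call it api.search.
    for tweet in tweepy.Cursor(api.search_tweets, q=genre, lang="en").items(200):
        tweets.append({"genre": genre,
                       "created_at": tweet.created_at,
                       "text": tweet.text})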

Other notes:
If you could have your ideal dataset, what would it look like? What indicators would you be looking for to identify those trends around major traumatic events? The reason I ask is to help brainstorm other routes to answering your questions.

Let's keep this conversation going here. Tag me in your reply (@ostegm)

HW 2 review by OttoS

@jawilliams3000

Hey Jamie,

Hope all is well. I've been running through your HW2 submission and wanted to give you some feedback.

Firstly - the "official" solutions are posted now. You should be able to do a git pull and find those for review. They're also located here: https://github.com/ga-students/DAT_SF_12/tree/gh-pages/Solutions

So... some comments, in chronological order running through your homework.

  1. Your application of k-nearest neighbors and cross-validation is great. You arrived at the correct answer; however, I wanted to point out a few subtleties. It looks like you used the code from class (which is awesome), but be sure to think about what the inputs are. For example, you looped through the list of odd integers from 1 to 51:
n_neighbors = range(1, 51, 2)  # odd values from 1 up to (but not including) 51

Why 51? That was the number of observations we had in the demo example. In this case 51 isn't a bad choice, but I wanted to make sure it was a deliberate choice. In this data set we have 178 observations, so technically you could go up to 177, although I think 51 is a better choice. Anyway, small point, but I wanted to make sure you understood why we used 51.

  2. When you chose the number of neighbors (27), I see that you based it on the graph, which showed the best score at 27. Remember that that graph was built from only one slice of the data:

If you were to run the code below (with a random seed of 1), you'd find a different value for K:

X_train, X_test, y_train, y_test = train_test_split(wine_variables, wine['Wine'], test_size=0.3, random_state=1)

The point is, be careful picking K values based on a single random slice of the data; you can end up overfitting the model to that slice. Another way to do this would be to fit the model and score it with cross-validation before choosing your K value (see the cross-validation sketch at the end of these comments).

  3. This is kind of bonus material, but did you notice that proline is of a much bigger magnitude than all the other variables? It ranges into the thousands, while some of the other variables are much, much smaller. This scale problem over-amplifies the effect of proline in the distance calculation. If you scale the data, you can get an accuracy of 96%:

from sklearn.preprocessing import StandardScaler
features_scaler = StandardScaler()
X_train_scaled = features_scaler.fit_transform(X_train)   # fit the scaler on the training data only
X_test_scaled = features_scaler.transform(X_test)         # apply the same scaling to the test data

from sklearn import neighbors
clf_scaled = neighbors.KNeighborsClassifier(3, weights='uniform')
clf_scaled.fit(X_train_scaled, y_train)
clf_scaled.score(X_test_scaled, y_test)                   # accuracy on the scaled test set
  4. For part two, clustering: awesome job. I really like that you picked out the two most important features and ran the predictions in two dimensions so you could visualize them. It is possible to run those algorithms with more features; to see how, just review the solution set. Also keep the scaling issue in mind.
  5. At the end, I like that you tried to compare the prediction accuracy across models. It looked like you ran into some trouble getting the columns to match up on the index. When you do the predictions, they're all still indexed according to the original data, so you can just set new columns for each prediction:
wine_predict = pd.DataFrame(Y_hat_kmeans, columns=['Kmeans Predicted Class'])
wine_predict['Hierarchical Predicted Class'] = Y_hat_hier
wine_predict['Actual Class'] = wine2.Wine
wine_predict

Just remember that the class labels won't match up: a zero in one column won't mean the same class as a zero in another, but you can still see the patterns (see the crosstab sketch below for one quick way to check the agreement).
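
One quick way to eyeball that agreement is a cross-tabulation. This is just a minimal sketch that assumes the wine_predict DataFrame built above (the column names come from that snippet):

import pandas as pd

# Cross-tabulate each set of predicted labels against the actual classes.
# Each cluster should line up with (mostly) one true class, even though the
# label numbers themselves won't match.
print(pd.crosstab(wine_predict['Kmeans Predicted Class'], wine_predict['Actual Class']))
print(pd.crosstab(wine_predict['Hierarchical Predicted Class'], wine_predict['Actual Class']))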
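
Separately, here is a rough sketch of the cross-validation idea from comments 1 and 2, reusing the wine_variables and wine['Wine'] names from your homework. It is only meant to illustrate the approach; adjust the import to your scikit-learn version:

import numpy as np
from sklearn import neighbors
from sklearn.model_selection import cross_val_score  # older scikit-learn: sklearn.cross_validation

# Score each odd K with 5-fold cross-validation instead of a single
# train/test split, then pick the K with the best mean accuracy.
k_values = range(1, 52, 2)
cv_scores = []
for k in k_values:
    clf = neighbors.KNeighborsClassifier(n_neighbors=k, weights='uniform')
    scores = cross_val_score(clf, wine_variables, wine['Wine'], cv=5)
    cv_scores.append(scores.mean())

best_k = k_values[int(np.argmax(cv_scores))]
print(best_k, max(cv_scores))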

Let me know if you have questions!

Thanks

HW 1 review by OttoS

@jawilliams3000

Nice work overall!

Comments refer to question numbers:
#2. We were asking you to do len(data) and print(data[500]), so good job. Sorry that was unclear; we were just trying to get you to learn how to explore the JSON file.
#3. Another option is df.head(20), but both are correct.
#9. Well done - you have a clear handle on SQL :-) - but here's another way, potentially shorter:

pd.read_sql("""SELECT neighborhood, category FROM
                 (SELECT neighborhood, category, MAX(cat_count) FROM
                    (SELECT neighborhood, category, COUNT(category) AS cat_count
                     FROM event GROUP BY neighborhood, category)
                  GROUP BY neighborhood)
               WHERE category != 'Street and Sidewalk Cleaning'""", con)
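
For comparison, here is a rough pandas-only sketch of the same idea. It assumes you read the event table into a DataFrame first; the intermediate names are just illustrative:

import pandas as pd

# Pull just the columns we need from the event table.
events = pd.read_sql("SELECT neighborhood, category FROM event", con)

# Count each category within each neighborhood.
counts = events.groupby(['neighborhood', 'category']).size()

# Take the most common category per neighborhood, then drop the
# 'Street and Sidewalk Cleaning' rows to mirror the SQL above.
top = counts.groupby(level='neighborhood').idxmax()
top = pd.DataFrame(top.tolist(), columns=['neighborhood', 'category'])
top[top['category'] != 'Street and Sidewalk Cleaning']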
