dlab-berkeley / data-science-social-justice-2022

Materials for D-Lab / UC Berkeley Graduate Division's Data Science + Social Justice summer workshop. These materials provide an introduction to Python, natural language processing, text analysis, word embeddings, and network analysis. They also include discussions on critical approaches to data science to promote social justice.

Jupyter Notebook 97.32% Python 2.68%
data-science nlp python social-justice word-embeddings

data-science-social-justice-2022's People

Contributors

emilygrabowski, pssachdeva

data-science-social-justice-2022's Issues

module 02 notebook 01_preprocessing Navigating Around Section

Is this still true, or is it left over from Tom's organizational system? If it's deprecated, maybe take it out.

"Our default working directory is wherever we launched the notebook - in our case the "Week 1" folder. We want to access the "Data" folder, which is two levels "up", inside of the main "DIGHUM160" directory."

module 01 notebook 02_data_frames adding specific demographic context under "coding for social justice"

my addition in bold:

One of the most famous datasets that allows us to start thinking about social justice is the Titanic dataset. It contains information on all the passengers aboard the RMS Titanic, which unfortunately was shipwrecked. This dataset can be used to predict whether a given passenger survived or not. Specifically, the Titanic dataset allows us to explore the impact of a number of demographic attributes, such as socioeconomic status, gender, and age, on the likelihood of surviving the wreck.

Module 02 Notebook 01_preprocessing type error

(ran on local machine, not datahub)

Under the section Phrase Modeling with gensim, this line results in an error: bigram = Phrases(tokens, min_count=2, threshold=3, delimiter='_')

the error is the following:
TypeError: sequence item 0: expected str instance, bytes found
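
A possible workaround, sketched under an assumption: as far as I can tell, gensim 3.x expects the Phrases delimiter as bytes (b'_'), while gensim 4.x expects a str ('_'), so passing delimiter='_' on an older local install would produce exactly this kind of str/bytes mismatch. The helper name phrases_delimiter is hypothetical and not part of the notebook.

```python
def phrases_delimiter(gensim_version: str):
    """Pick a delimiter type that matches the installed gensim major version.

    Assumption: gensim < 4 wants a bytes delimiter, gensim >= 4 wants a str.
    """
    major = int(gensim_version.split(".")[0])
    return "_" if major >= 4 else b"_"


# Intended use in the notebook (requires gensim, shown as comments only):
#   import gensim
#   from gensim.models.phrases import Phrases
#   bigram = Phrases(tokens, min_count=2, threshold=3,
#                    delimiter=phrases_delimiter(gensim.__version__))
print(phrases_delimiter("3.8.3"), phrases_delimiter("4.2.0"))
```

Pinning gensim to a 4.x release in the environment file would probably be the simpler fix if everyone is expected to run the same version.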

Clarify bullet point 2 in objectives

  • Use these word vectors to reflect on implicit binaries and normativities in your data;

Consider returning to implicit binaries/normativities in the word vector section and contextualizing the results in light of these terms.

Binder missing packages

We need to add the following lines to install pandas and spaCy for the Binder version:
!pip install pandas
!pip install spacy

Target sets (notebook 2)

In the example career/family target set, it looks like part of another target set (math) got combined with it.

All Modules All Notebooks

Are we expecting the students to run these notebooks on Google Colab? If so, we should include a commented-out option with the link to the data.

clarify helper function

calculate_biased_words(model, target1, target2, 4)

Since this is a custom helper function defined elsewhere, clarify the requirements for each of the inputs

Datahub spacy import issue

Error in the spacy.load() line when running in Datahub:

OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
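
One way to guard against the missing model, as a rough sketch: try to load it, and download it only if loading raises OSError. The function name load_with_download is hypothetical; the loader and downloader are passed in so the snippet runs without spaCy installed. In the notebook they would be spacy.load and spacy.cli.download.

```python
def load_with_download(name, load, download):
    """Load a model by name, downloading it first if it is not installed."""
    try:
        return load(name)
    except OSError:
        # E050 means the model package is missing, so fetch it and retry.
        download(name)
        return load(name)


# Intended use in the notebook (requires spaCy, shown as comments only):
#   import spacy
#   nlp = load_with_download("en_core_web_sm", spacy.load, spacy.cli.download)
```

In some environments the kernel may need a restart before a freshly downloaded model becomes importable, so installing the model in the Datahub image itself is likely more robust.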

Datahub link still points to 'notebooks'

When I was opening the Datahub link in the main branch, it exited with this error:

error: pathspec 'notebooks/module02/01_preprocessing.ipynb' did not match any file(s) known to git

Module 02 Notebook 03_tf_idf CountVectorizer Error

Under implementing TF-IDF, this line results in an error: pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

this is the error message:
AttributeError: 'CountVectorizer' object has no attribute 'get_feature_names_out'
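
This looks like a scikit-learn version mismatch: get_feature_names_out was added in scikit-learn 1.0, and older releases only have get_feature_names. A small compatibility helper could cover both. The name feature_names is hypothetical and not part of the notebook.

```python
def feature_names(vectorizer):
    """Return the vocabulary terms of a fitted vectorizer as a list.

    Prefers get_feature_names_out (scikit-learn >= 1.0) and falls back to
    the older get_feature_names when the newer method is missing.
    """
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())


# Intended use in the notebook (shown as comments only):
#   pd.DataFrame(X.toarray(), columns=feature_names(vectorizer))
```

Alternatively, requiring scikit-learn 1.0 or later in the environment would let the notebook line stay as it is.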

Error in training t-SNE

Error in following lines:
tsne = TSNE(init='pca', learning_rate='auto')
tsne_vectors = tsne.fit_transform(model.wv.vectors)

UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U4'), dtype('float32')) -> None
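
The dtype('<U4') in the traceback suggests the string 'auto' is being multiplied as if it were a number, which is what I would expect on a scikit-learn release older than 1.0, where learning_rate='auto' is not supported. If upgrading is not an option, a numeric value can be passed instead. The sketch below reproduces the heuristic that the scikit-learn documentation gives for 'auto', max(N / early_exaggeration / 4, 50); the function name auto_learning_rate is hypothetical.

```python
def auto_learning_rate(n_samples, early_exaggeration=12.0):
    """Numeric stand-in for TSNE(learning_rate='auto').

    Follows the documented heuristic max(N / early_exaggeration / 4, 50),
    where N is the number of samples being embedded.
    """
    return max(n_samples / early_exaggeration / 4, 50)


# Intended use in the notebook (shown as comments only):
#   lr = auto_learning_rate(model.wv.vectors.shape[0])
#   tsne = TSNE(init='pca', learning_rate=lr)
print(auto_learning_rate(4800))
```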

module 01 notebook 01_python_fundamentals loops

"We will only focus on for loops in this section, because while loops are less used."
When I was studying for technical interviews, I had to do a deep dive on while loops. I wonder if it may be useful to let students know that while loops are used in particular contexts, but are outside of the scope of the workshop?
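
If we do add a sentence on this, a tiny example might help show the kind of context meant here: a while loop fits when the number of iterations is not known in advance. This is only an illustrative sketch, not text from the notebook, and the function name is made up.

```python
def halvings_until_below(value, threshold):
    """Count how many times value must be halved to drop below threshold."""
    if threshold <= 0:
        # Guard against a loop that would never end.
        raise ValueError("threshold must be positive")
    steps = 0
    # We cannot know the number of iterations up front, so a for loop
    # over a fixed range is not a natural fit here.
    while value >= threshold:
        value /= 2
        steps += 1
    return steps


print(halvings_until_below(100, 10))  # 100 -> 50 -> 25 -> 12.5 -> 6.25
```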

module 01 notebook 02_data_frames

I think the reflection part at the end of the notebook should be highlighted first under the Police Shootings Database. We need to carefully come up with a plan for how to approach that discussion and state our politics up front: that we, as scholars, as activists, and as an institution, support Black Lives Matter and recognize that police shootings disproportionately harm Black folks, and that cherry-picked data can be used to erase that fact.

module 01 notebook 01_python_fundamentals

At the very end of the notebook: "The program is complaining that it *"can't multiply sequence by non-int of type 'str'". What this means is that the `*` operator is not defined for two strings."

Is the star (*) meant to bold something?
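
For reference while fixing the markdown, here is a short sketch of the behavior that sentence describes. Multiplying a string by an integer repeats it, while multiplying two strings raises the TypeError quoted above. The variable names are illustrative only.

```python
# A string times an integer repeats the string.
repeated = "ab" * 3
print(repeated)

# A string times another string is not defined and raises TypeError.
try:
    "ab" * "3"
except TypeError as err:
    message = str(err)
    print(message)
```

Wrapping the operator in backticks in the notebook text should keep the asterisk from being read as emphasis markup.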

Error in word analogies

model.wv.most_similar(positive=['woman', 'king'], negative='man')

This raises a KeyError. I'm pretty sure the words in the example are not in the vocabulary of the trained model.
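
To confirm which words are missing before calling most_similar, something like the following could be run first. The helper name missing_from_vocab is hypothetical; I am assuming gensim 4.x, where the vocabulary is exposed as model.wv.key_to_index.

```python
def missing_from_vocab(words, vocab):
    """Return the words that are not present in vocab, in their given order."""
    return [word for word in words if word not in vocab]


# Intended use in the notebook (requires the trained model, comments only):
#   missing = missing_from_vocab(['woman', 'king', 'man'], model.wv.key_to_index)
#   if not missing:
#       model.wv.most_similar(positive=['woman', 'king'], negative=['man'])
print(missing_from_vocab(["woman", "king", "man"], {"woman", "man"}))
```

Passing negative as a list, as in the commented call, may also be safer than a bare string, since some gensim versions might treat the string as a sequence of characters.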

Comments vs posts data

In the introduction text, it says we will use comments from the subreddit, but from what I can tell it looks like we are using the same post data from previous notebooks, rather than comments.

Kernel dying module 2 notebook 3

I'm currently testing that all of the code runs on Datahub with 8 GB of RAM.

The kernel dies from running out of memory when you run these lines:
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(tfidf)
similarities.shape

That said, the biggest memory drain comes a few cells earlier, from the .to_dense() call.
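
A back-of-the-envelope estimate may help decide how much to subsample. The pairwise similarity matrix has one row and one column per document, so its dense size grows with the square of the document count. The sketch below assumes 8-byte float64 entries, and the function name dense_matrix_gib is hypothetical.

```python
def dense_matrix_gib(n_rows, n_cols=None, itemsize=8):
    """Estimate the size in GiB of a dense n_rows x n_cols matrix.

    itemsize is the number of bytes per entry (8 for float64).
    n_cols defaults to n_rows, as for a pairwise similarity matrix.
    """
    if n_cols is None:
        n_cols = n_rows
    return n_rows * n_cols * itemsize / 2**30


# 32,768 documents already give an 8 GiB similarity matrix.
print(dense_matrix_gib(32768))
```

Keeping the TF-IDF matrix sparse rather than densifying it, and computing similarities on a subset of rows, would probably keep the notebook within the 8 GB limit. cosine_similarity also accepts dense_output=False for sparse input, though whether that helps depends on how sparse the result is.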
