Materials for D-Lab / UC Berkeley Graduate Division's Data Science + Social Justice summer workshop. These materials provide an introduction to Python, natural language processing, text analysis, word embeddings, and network analysis. They also include discussions on critical approaches to data science to promote social justice.
Is this still true or left over from Tom's organizational system? Maybe take out if it's deprecated?
"Our default working directory is wherever we launched the notebook - in our case the "Week 1" folder. We want to access the "Data" folder, which is two levels "up", inside of the main "DIGHUM160" directory."
One of the most famous datasets for starting to think about social justice is the Titanic dataset. It contains information on the passengers aboard the RMS Titanic, which was tragically wrecked. The dataset can be used to predict whether a given passenger survived. Specifically, the Titanic dataset lets us explore the impact of demographic attributes, such as socioeconomic status, gender, and age, on the likelihood of surviving the wreck.
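The kind of exploration described here can be sketched in a few lines of pandas. The tiny DataFrame below is an illustrative stand-in for the real data (column names follow the common Kaggle version of the dataset, which is an assumption):

```python
import pandas as pd

# A tiny stand-in for the Titanic data; the real file has ~900 rows.
titanic = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 0],
})

# Survival rate by class and sex: a first look at how socioeconomic
# status and gender relate to the likelihood of surviving the wreck.
rates = titanic.groupby(["Pclass", "Sex"])["Survived"].mean()
print(rates)
```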
Are we expecting the students to run these notebooks on Google Colab? If so, we should include a commented-out option with the link to the data.
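One common pattern for this is a cell that detects Colab and switches the data path. A sketch, assuming the standard `import google.colab` detection idiom; the URL below is a placeholder, not the real data link:

```python
# Detect whether the notebook is running on Google Colab:
try:
    import google.colab  # noqa: F401
    ON_COLAB = True
except ImportError:
    ON_COLAB = False

# Placeholder URL -- the actual link to the data would go here:
data_path = ("https://example.com/Data/file.csv" if ON_COLAB
             else "../../Data/file.csv")
print(data_path)
```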
"We will only focus on for loops in this section, because while loops are less used."
When I was studying for technical interviews, I had to do a deep dive on while loops. It might be useful to let students know that while loops are used in particular contexts but are outside the scope of the workshop.
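If a one-line note like that goes in, a short sketch of the contrast could accompany it: for loops iterate over a known collection, while loops repeat until a condition is met, when the number of iterations isn't known up front.

```python
# A for loop covers iteration over a known collection:
for word in ["ship", "sea", "sky"]:
    print(word)

# A while loop suits cases where we don't know in advance how many
# iterations we need -- e.g., how many halvings bring 100 down to 1:
n = 100
steps = 0
while n > 1:
    n = n // 2
    steps += 1
print(steps)  # 6
```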
I think the reflection part at the end of the notebook should be highlighted first, under the Police Shootings Database. We need to carefully come up with a plan for how to approach that discussion and state our politics up front: we, as scholars, as activists, and as an institution, support Black Lives Matter, recognize that police shootings disproportionately harm Black folks, and understand that cherry-picked data can be used to erase that fact.
At the very end of the notebook, "The program is complaining that it *"can't multiply sequence by non-int of type 'str'"*. What this means is that the `*` operator is not defined for two strings."
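A minimal reproduction of that error, for reference: `*` repeats a string when the other operand is an int, but raises a `TypeError` when both operands are strings.

```python
# Repeating a string with an int works:
print("ab" * 3)  # ababab

# Two strings raise the error quoted above:
try:
    "ab" * "cd"
except TypeError as err:
    print(err)  # can't multiply sequence by non-int of type 'str'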
In the introduction text, it says we will use comments from the subreddit, but from what I can tell it looks like we are using the same post data from previous notebooks, rather than comments.
I'm currently testing that all of the code runs on the datahub with 8 GB of RAM.
The kernel dies from too much memory being used when you run this line:
```python
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(tfidf)
similarities.shape
```
Although the biggest memory drain is a few cells earlier, with the `.to_dense()` call.
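One possible fix, if memory is the bottleneck: `cosine_similarity` accepts scipy sparse matrices directly, so the dense conversion could be skipped entirely, and `dense_output=False` keeps the result sparse as well. A sketch with a random sparse matrix standing in for the notebook's TF-IDF matrix (the shape here is illustrative):

```python
from scipy.sparse import random as sparse_random
from sklearn.metrics.pairwise import cosine_similarity

# Random sparse stand-in for the notebook's TF-IDF matrix.
tfidf = sparse_random(1000, 500, density=0.01, format="csr", random_state=0)

# No .to_dense() needed: sparse in, sparse out.
similarities = cosine_similarity(tfidf, dense_output=False)
print(similarities.shape)  # (1000, 1000)
```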