Materials for D-Lab / UC Berkeley Graduate Division's Data Science + Social Justice summer workshop. These materials provide an introduction to Python, natural language processing, text analysis, word embeddings, and network analysis. They also include discussions on critical approaches to data science to promote social justice.
Is this still true or left over from Tom's organizational system? Maybe take out if it's deprecated?
"Our default working directory is wherever we launched the notebook - in our case the "Week 1" folder. We want to access the "Data" folder, which is two levels "up", inside of the main "DIGHUM160" directory."
One of the most famous datasets for starting to think about social justice is the Titanic dataset. It contains information on the passengers aboard the RMS Titanic, which was tragically wrecked. The dataset can be used to predict whether a given passenger survived. Specifically, the Titanic dataset lets us explore the impact of demographic attributes, such as socioeconomic status, gender, and age, on the likelihood of surviving the wreck.
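The kind of exploration described here can be sketched in a few lines of pandas. The tiny DataFrame below is an illustrative stand-in for the real data (column names follow the common Kaggle version of the dataset, which is an assumption):

```python
import pandas as pd

# A tiny stand-in for the Titanic data; the real file has ~900 rows.
titanic = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 0],
})

# Survival rate by class and sex: a first look at how socioeconomic
# status and gender relate to the likelihood of surviving the wreck.
rates = titanic.groupby(["Pclass", "Sex"])["Survived"].mean()
print(rates)
```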
Are we expecting the students to run these notebooks on Google Colab? If so, we should include a commented-out option with the link to the data.
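One common pattern for this is a cell that detects Colab and switches the data path. A sketch, assuming the standard `import google.colab` detection idiom; the URL below is a placeholder, not the real data link:

```python
# Detect whether the notebook is running on Google Colab:
try:
    import google.colab  # noqa: F401
    ON_COLAB = True
except ImportError:
    ON_COLAB = False

# Placeholder URL -- the actual link to the data would go here:
data_path = ("https://example.com/Data/file.csv" if ON_COLAB
             else "../../Data/file.csv")
print(data_path)
```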
"We will only focus on for loops in this section, because while loops are less used."
When I was studying for technical interviews, I had to do a deep dive on while loops. It might be useful to let students know that while loops are used in particular contexts but are outside the scope of the workshop.
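If a one-line note like that goes in, a short sketch of the contrast could accompany it: for loops iterate over a known collection, while loops repeat until a condition is met, when the number of iterations isn't known up front.

```python
# A for loop covers iteration over a known collection:
for word in ["ship", "sea", "sky"]:
    print(word)

# A while loop suits cases where we don't know in advance how many
# iterations we need -- e.g., how many halvings bring 100 down to 1:
n = 100
steps = 0
while n > 1:
    n = n // 2
    steps += 1
print(steps)  # 6
```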
I think the reflection part at the end of the notebook should be highlighted first, under the Police Shootings Database. We need to carefully come up with a plan for how to approach that discussion and state our politics up front: we, as scholars, as activists, and as an institution, support Black Lives Matter, recognize that police shootings disproportionately harm Black folks, and understand that cherry-picked data can be used to erase that fact.
At the very end of the notebook, "The program is complaining that it *"can't multiply sequence by non-int of type 'str'"*. What this means is that the `*` operator is not defined for two strings."
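A minimal reproduction of that error, for reference: `*` repeats a string when the other operand is an int, but raises a `TypeError` when both operands are strings.

```python
# Repeating a string with an int works:
print("ab" * 3)  # ababab

# Two strings raise the error quoted above:
try:
    "ab" * "cd"
except TypeError as err:
    print(err)  # can't multiply sequence by non-int of type 'str'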
In the introduction text, it says we will use comments from the subreddit, but from what I can tell it looks like we are using the same post data from previous notebooks, rather than comments.
I'm currently testing that all of the code runs on the datahub with 8 GB of RAM.
The kernel dies from too much memory being used when you run this line:
```python
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(tfidf)
similarities.shape
```
Although the biggest memory drain is a few cells earlier, with the `.to_dense()` call.
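One possible fix, if memory is the bottleneck: `cosine_similarity` accepts scipy sparse matrices directly, so the dense conversion could be skipped entirely, and `dense_output=False` keeps the result sparse as well. A sketch with a random sparse matrix standing in for the notebook's TF-IDF matrix (the shape here is illustrative):

```python
from scipy.sparse import random as sparse_random
from sklearn.metrics.pairwise import cosine_similarity

# Random sparse stand-in for the notebook's TF-IDF matrix.
tfidf = sparse_random(1000, 500, density=0.01, format="csr", random_state=0)

# No .to_dense() needed: sparse in, sparse out.
similarities = cosine_similarity(tfidf, dense_output=False)
print(similarities.shape)  # (1000, 1000)
```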