Encode NLP workshop 2023

A practical introduction to machine learning and natural language processing on papyrus data.

In this training activity, we cover how to explore papyrus data using natural language processing and machine learning techniques. We also show how to build simple machine learning models for classifying different papyrus characteristics. The activity takes participants through every step, from downloading and preparing a dataset to training a classification model. It runs in Google Colab, using Python scripts prepared in advance that participants can modify to achieve the desired outcomes. Basic experience with the Python programming language is recommended.

A workshop created by André Walsøe for the Encode Workshop AI and Ancient Writing Cultures, Bologna 23rd-27th January 2023

Part 1:

Introduction to machine learning and NLP: Slides
Setup of Google Colab to run the workshop material (15 minutes)
1. Open notebook:
2. Click "Connect" in the upper right corner
3. Log in to your Google account
4. Click "Copy to Drive" in the upper left corner. The notebook will then be copied to your Google Drive.
5. Click "Connect" in the upper right corner to connect to a computing instance.
Download dataset and install libraries

Part 2 Data Exploration and introduction to NLP techniques

1 Data exploration
1. Basic data exploration and filtering with Pandas
2. Application of filtering techniques based on data exploration findings
3. Hands-on-exercise: Data exploration and filtering
2 Introduction to NLP techniques
1. Lowercasing text
2. Tokenization
3. Stopword removal
4. Vectorization (count and tf-idf)
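
The four NLP steps listed above can be sketched in plain Python. This is a minimal illustration, not the workshop notebook: the example sentences, the stopword list, and the function names `preprocess`, `count_vectors`, and `tfidf_vectors` are all assumptions made here. The tf-idf weighting uses the textbook form tf × log(N / df); libraries such as scikit-learn apply a smoothed variant, so their numbers will differ.

```python
import math
import re
from collections import Counter

# Toy documents and stopword list: illustrative assumptions, not workshop data.
DOCS = [
    "The scribe copied the letter",
    "The letter mentions a tax receipt",
    "A receipt for the grain tax",
]
STOPWORDS = {"the", "a", "for"}


def preprocess(text):
    """Lowercase the text, tokenize it into runs of letters, drop stopwords."""
    # [^\W\d_]+ matches runs of Unicode letters, so Greek text is handled too.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def count_vectors(token_lists):
    """Return a sorted vocabulary and one term-count vector per document."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tfidf_vectors(count_vecs):
    """Weight raw counts by log(N / document frequency) for each term."""
    n_docs = len(count_vecs)
    n_terms = len(count_vecs[0])
    doc_freq = [sum(1 for vec in count_vecs if vec[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / doc_freq[j]) for j in range(n_terms)]
    return [[vec[j] * idf[j] for j in range(n_terms)] for vec in count_vecs]


token_lists = [preprocess(doc) for doc in DOCS]
vocab, counts = count_vectors(token_lists)
tfidf = tfidf_vectors(counts)

print(vocab)
print(counts[0])
print([round(weight, 3) for weight in tfidf[0]])
```

Terms that occur in fewer documents receive a larger weight than terms shared by many documents, which is the point of tf-idf compared with raw counts.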

Part 3 Building a text classification model

Building a text classification model
1. Choose what to classify and which input data to use
2. Split data into training and test sets
3. Transform/vectorize data
4. Train a logistic regression model
5. Test and evaluate metrics
6. Deploy model with Gradio
7. Hands-on exercise: Reflect on possible use cases for these techniques
Wrap-up
Resources for learning more.
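
Steps 2 to 5 of the classification outline above can be sketched with scikit-learn. This is a hedged example under stated assumptions: the toy texts, the two labels ("letter" and "receipt"), and the choice of tf-idf features are invented here for illustration and are not taken from the workshop notebook or its dataset. With so few examples the reported metrics carry no meaning; the sketch only shows the order of operations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Toy corpus: invented examples, not real papyrus data.
texts = [
    "greetings to my brother I pray for your health",
    "write to me about your health and the children",
    "I send greetings to your mother and sister",
    "please write to me soon my dear brother",
    "received payment of four drachmas for the grain tax",
    "paid in full the tax on wheat for this year",
    "receipt for payment of rent in drachmas",
    "received from the farmer the tax on grain",
]
labels = ["letter"] * 4 + ["receipt"] * 4

# Step 2: split into training and test sets, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Step 3: fit the vectorizer on the training texts only, then reuse it for
# the test texts so no information leaks from the test set.
vectorizer = TfidfVectorizer(lowercase=True)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Step 4: train a logistic regression classifier.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)

# Step 5: predict on held-out texts and report metrics.
predictions = model.predict(X_test_vec)
print("accuracy:", accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions, zero_division=0))
```

Fitting the vectorizer before splitting would let vocabulary and document frequencies from the test texts influence training, which is why the split comes first.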

Workshop agenda

Session 1 (45 min):

Introduction
Setup of Google Colab
Introduction to data exploration

Session 2 (45 min):

Hands-on task 1: Data Exploration
Introduction to basic NLP techniques
How to build a text classifier

Session 3 (45 min):

How to build a text classifier (continuation)
Brainstorming and discussion: How can ML and NLP be used in my field?
Wrap-up
