A practical introduction to machine learning and natural language processing on papyrus data.
In this training activity, we cover how to explore papyrus data using natural language processing (NLP) and machine learning techniques. We also show how to build simple machine learning models that classify different papyrus characteristics. The activity takes participants through every step, from downloading and preparing a dataset to training a classification model. It runs in Google Colab, using Python scripts prepared in advance that participants can modify to achieve the desired outcomes. Participants should have basic experience with the Python programming language.
A workshop created by André Walsøe for the Encode Workshop "AI and Ancient Writing Cultures", Bologna, 23rd-27th January 2023.
- Introduction to machine learning and NLP: Slides
- Set up Google Colab to run the workshop material (15 minutes)
  - Download the dataset and install libraries
- 1 Data exploration
  - Basic data exploration and filtering with Pandas
  - Application of filtering techniques based on data exploration findings
  - Hands-on exercise: Data exploration and filtering
- 2 Introduction to NLP techniques
  - Lowercasing text
  - Tokenization
  - Stopword removal
  - Vectorization (count and tf-idf)
- Building a text classification model
  - Choose what to classify and which input data to use
  - Split data into training and test sets
  - Transform/vectorize data
  - Train a logistic regression model
  - Test and evaluate metrics
  - Deploy model with Gradio
- Hands-on exercise: Reflect on possible use cases for these techniques
- Wrap-up
- Resources for learning more.
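
As a rough illustration of the NLP techniques listed in the outline (lowercasing, tokenization, stopword removal, count and tf-idf vectorization), here is a minimal pure-Python sketch. It is not the workshop's notebook code: the example sentences, the stopword list, and the function names `preprocess`, `count_vector`, and `tfidf_vector` are all made up for illustration. The tf-idf formula is the plain textbook variant, which differs from the smoothed versions that libraries usually apply.

```python
import math
import re
from collections import Counter

# Made-up English sentences standing in for papyrus transcriptions.
docs = [
    "The scribe writes a letter to the governor",
    "The governor receives the letter",
    "A contract for the sale of a vineyard",
]

# A tiny, hypothetical stopword list; a real project would use a
# list suited to the language of the texts.
STOPWORDS = {"a", "the", "to", "of", "for"}


def preprocess(text):
    """Lowercase the text, split it into runs of letters, drop stopwords."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


tokenized = [preprocess(d) for d in docs]

# Vocabulary: every distinct token, sorted so the column order is stable.
vocab = sorted({t for doc in tokenized for t in doc})

# Document frequency: in how many documents each term occurs.
n_docs = len(tokenized)
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}


def count_vector(tokens):
    """Count vectorization: one column per vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def tfidf_vector(tokens):
    """Textbook tf-idf: term count times log(N / document frequency)."""
    counts = Counter(tokens)
    return [counts[term] * math.log(n_docs / doc_freq[term]) for term in vocab]


count_matrix = [count_vector(doc) for doc in tokenized]
tfidf_matrix = [tfidf_vector(doc) for doc in tokenized]

for term, count, weight in zip(vocab, count_matrix[0], tfidf_matrix[0]):
    print(f"{term:10s} count={count} tfidf={weight:.3f}")
```

Terms that occur in several documents receive a lower tf-idf weight than terms unique to one document, which is the reason tf-idf is often preferred over raw counts as classifier input.
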
Session 1 (45 min):
- Introduction
- Setup of Google Colab
- Introduction to data exploration
Session 2 (45 min):
- Hands-on task 1: Data exploration
- Introduction to basic NLP techniques
- How to build a text classifier
Session 3 (45 min):
- How to build a text classifier (continuation)
- Brainstorming and discussion: How can ML and NLP be used in my field?
- Wrap-up
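
The classifier steps from the outline (filter the data, split it into training and test sets, vectorize, train a logistic regression model, evaluate) might look roughly like the following, assuming pandas and scikit-learn are installed, as they typically are in Google Colab. The toy texts, the `text` and `genre` column names, and the two genre labels are hypothetical and stand in for the real dataset. This is a sketch under those assumptions, not the workshop's actual notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Made-up toy data: two hypothetical genres with clearly different wording.
letters = [f"greetings to my brother, send the wheat by boat number {i}" for i in range(10)]
contracts = [f"agreement for the lease of land plot {i} with rent paid in silver" for i in range(10)]
papyri = pd.DataFrame({
    "text": letters + contracts + [None],  # one row with missing text
    "genre": ["letter"] * 10 + ["contract"] * 10 + ["letter"],
})

# Basic exploration: how many rows per class, how many texts are missing.
print(papyri["genre"].value_counts())
print("missing texts:", papyri["text"].isna().sum())

# Filtering based on the exploration: drop rows without text.
papyri = papyri.dropna(subset=["text"])

# Split into training and test sets, keeping class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    papyri["text"],
    papyri["genre"],
    test_size=0.25,
    stratify=papyri["genre"],
    random_state=0,
)

# Vectorize with tf-idf (lowercasing and tokenization happen inside the
# vectorizer), then train a logistic regression classifier.
model = make_pipeline(
    TfidfVectorizer(lowercase=True),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# Evaluate on the held-out test set.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("accuracy:", accuracy)
print(classification_report(y_test, predictions))
```

Wrapping the vectorizer and the classifier in one pipeline means the vectorizer is fitted on the training texts only, so no information from the test set leaks into training.
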