Introduction to Machine Learning with Python

Home Page: https://carpentries-incubator.github.io/machine-learning-novice-python/

License: Other

Ruby 1.09% Makefile 8.60% R 11.29% Shell 0.80% Python 76.84% HTML 1.39%

lesson english pre-alpha carpentries-incubator machine-learning python bootstrapping data-leakage evaluation auroc prediction

machine-learning-novice-python's Introduction

Introduction to Machine Learning in Python

This half-day lesson introduces common methods and terminology used in machine learning, with a focus on prediction. We cover areas such as data preparation and resampling, model building, and model evaluation.

It is a prerequisite for the other lessons in the machine learning curriculum. In later lessons we explore tree-based models for prediction, neural networks for image classification, and responsible machine learning.

Introduction to Machine Learning in Python [Lesson materials; Code repository]
Introduction to Tree Models in Python [Lesson materials; Code repository]
Introduction to artificial neural networks in Python [Lesson materials; Code repository]
Responsible machine learning in Python [Lesson materials; Code repository]

Workshop schedule

These lessons are being run at the University of Edinburgh as part of the Ed-DaSH Data Science training programme for Health and Biosciences.

The first lessons were taught in May 2022: https://edcarp.github.io/2022-05-24_ed-dash_machine-learning/. For a list of future lessons, see: https://edcarp.github.io/Ed-DaSH/workshops

Contributing

We welcome all contributions to improve the lesson! Maintainers will do their best to help you if you have any questions, concerns, or experience any difficulties along the way.

We'd like to ask you to familiarize yourself with our Contribution Guide and have a look at the more detailed guidelines on proper formatting, ways to render the lesson locally, and even how to write new episodes.

Please see the current list of issues for ideas for contributing to this repository. For making your contribution, we use the GitHub flow, which is nicely explained in the chapter Contributing to a Project in Pro Git by Scott Chacon. Look for the tag . This indicates that the maintainers will welcome a pull request fixing this issue.

Maintainer(s)

Current maintainers of this lesson are:

Tom Pollard (Website; GitHub)

Authors

A list of contributors to the lesson can be found in AUTHORS.

Citation

To cite this lesson, please consult CITATION.

machine-learning-novice-python's People

Contributors

tompollard eellzz hwarden162 jsteyn chrystinne lassehhansen anenadic pkant-0 mike-allaway githubdemau

machine-learning-novice-python's Issues

Clarify number of samples taken in the resample function

In https://github.com/carpentries-incubator/machine-learning-novice-python/blob/gh-pages/_episodes/07-bootstrapping.md, the following chunk is used to resample the datasets for bootstrapping:

X_bs, y_bs = resample(x_train, y_train, replace=True)

The number of samples isn't specified in the function call, so it is unclear how many samples are being taken.

According to the documentation at https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html the number of samples is specified in the n_samples argument:

"n_samples int, default=None
Number of samples to generate. If left to None this is automatically set to the first dimension of the arrays. If replace is False it should not be larger than the length of arrays."

By default resample will use the length of the array as the number of samples. We should either: (1) note this default or (2) provide the n_samples argument.
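
A minimal sketch of the default described above, using small made-up arrays in place of the lesson's x_train and y_train (the array contents and the random_state value are assumptions for illustration only):

```python
import numpy as np
from sklearn.utils import resample

# Toy stand-ins for the lesson's training data (hypothetical values).
# Row i of x_train is [2*i, 2*i + 1], and its label is i.
x_train = np.arange(20).reshape(10, 2)   # 10 rows, 2 features
y_train = np.arange(10)                  # 10 labels

# n_samples omitted: the bootstrap sample has as many rows as the input.
X_bs, y_bs = resample(x_train, y_train, replace=True, random_state=0)
print(X_bs.shape, y_bs.shape)  # (10, 2) (10,)

# n_samples given explicitly: the sample size is whatever we ask for.
X_small, y_small = resample(
    x_train, y_train, replace=True, n_samples=5, random_state=0
)
print(X_small.shape, y_small.shape)  # (5, 2) (5,)
```

Passing n_samples=len(x_train) explicitly would give the same sample size as the default while making the choice visible to learners.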

Add calibration to the evaluation section

The evaluation section should include discussion of model calibration:
https://carpentries-incubator.github.io/machine-learning-novice-python/06-evaluation/
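
As a possible starting point, here is a hedged sketch of a calibration check with scikit-learn's calibration_curve. The predictions are simulated rather than taken from the lesson's model, and the sample size and number of bins are arbitrary choices:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)

# Simulated predictions (not lesson data): outcomes are drawn so that the
# predicted probabilities are well calibrated by construction.
y_prob = rng.uniform(0, 1, size=20000)
y_true = (rng.uniform(0, 1, size=20000) < y_prob).astype(int)

# Observed event rate and mean predicted probability in each of 5 bins.
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=5)

for observed, predicted in zip(prob_true, prob_pred):
    print(f"mean predicted {predicted:.2f} -> observed rate {observed:.2f}")
```

For a well-calibrated model the two values in each bin should be close; plotting prob_pred against prob_true gives the usual reliability diagram.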

Use of "normal" to describe patients

This is a bit nit-picky, but in talking about imputing missing data, the word "normal" is equated with healthy when talking about patient data. I usually avoid using "normal" since it can carry ableist connotations.

Relabel y-axis as "frequency" in the bootstrapping plots

In https://github.com/carpentries-incubator/machine-learning-novice-python/blob/gh-pages/_episodes/07-bootstrapping.md, the page includes two "density" plots.

The y-axis should probably be relabelled as "Frequency" because the range shows an absolute count of data points.
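
To make the distinction concrete, here is a small sketch with simulated scores (not the lesson's bootstrap results) comparing a histogram of counts with a normalised density:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated values standing in for bootstrapped scores (hypothetical data).
scores = rng.normal(loc=0.8, scale=0.05, size=1000)

# Frequency: each bar is an absolute count, so the bars sum to the sample size.
counts, edges = np.histogram(scores, bins=20)
print(counts.sum())  # 1000

# Density: bars are rescaled so that the total area (height * width) is 1.
density, _ = np.histogram(scores, bins=20, density=True)
area = (density * np.diff(edges)).sum()
print(round(area, 6))  # 1.0
```

If the y-axis of a plot runs into the tens or hundreds for a sample like this, it is showing counts, and "Frequency" is the accurate label.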

Add section on regularization (perhaps L1 Lasso regression)

It would be good to add a section on regularization, perhaps building on https://github.com/carpentries-incubator/machine-learning-novice-python/blob/gh-pages/_episodes/04-modelling.md to introduce regression with an L1 constraint.

Definitions of sensitivity and specificity

Definition and examples of specificity and sensitivity

The discussion of sensitivity and specificity is a bit confusing, in part because it is not made completely explicit what is meant by 0/1, dead/alive and positive/negative. I think here death is the "positive" outcome, which is fine, but then the definition of specificity needs to change I think. (Or maybe I am just confused)
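
One way to make the convention explicit is to state the coding next to the formulas. The sketch below assumes 1 = died (the "positive" class) and 0 = survived (the "negative" class); the labels and predictions are made up for illustration:

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity), treating label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # died, predicted died
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # died, predicted survived
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # survived, predicted survived
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # survived, predicted died
    sensitivity = tp / (tp + fn)  # share of deaths that were correctly flagged
    specificity = tn / (tn + fp)  # share of survivors that were correctly cleared
    return sensitivity, specificity


# Assumed coding for illustration: 1 = died (positive), 0 = survived (negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
# sensitivity = 0.67, specificity = 0.80
```

Under this coding, sensitivity is computed only over patients who died and specificity only over patients who survived. Swapping which outcome counts as "positive" swaps the two quantities.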

Consider moving bootstrap section to machine-learning-trees-python

As a step in simplifying the content of this lesson, perhaps the section on bootstrapping could be moved to https://carpentries-incubator.github.io/machine-learning-trees-python/

Add section on regularisation?

Consider adding a section on regularisation. This section could:

explain regularisation
show how extra variables can be added to the logistic regression model
show how regularisation can be applied to this model.
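
A rough sketch of what such a section might demonstrate, using synthetic data rather than the lesson dataset. The feature count, the C values, and the choice of the liblinear solver are assumptions for illustration, and the exact parameter spelling for selecting an L1 penalty may differ between scikit-learn versions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data (not the lesson dataset): 5 standardised features, of which
# only the first influences the binary outcome.
X = rng.normal(size=(200, 5))
logits = 2.0 * X[:, 0]
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-logits))).astype(int)

# In scikit-learn, a smaller C means a stronger penalty.
# liblinear is one of the solvers that supports an L1 penalty.
weak = LogisticRegression(penalty="l1", solver="liblinear", C=10.0).fit(X, y)
strong = LogisticRegression(penalty="l1", solver="liblinear", C=1e-4).fit(X, y)

print("weak penalty coefficients:  ", weak.coef_.round(2))
print("strong penalty coefficients:", strong.coef_.round(2))
```

Comparing the two sets of coefficients shows the effect of the constraint: as the penalty strengthens, coefficients shrink towards zero, and L1 in particular can set them to exactly zero, which is why it is often described as performing variable selection.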

Adding learner profiles to guide curriculum

Here are some draft profiles:

Learner profiles

Workshop attendees include postgraduate students, early career researchers, postdocs, undergraduates, academic and non-academic staff, including those working in government and industry, and people working in clinical- and information-related roles. The learner profiles below provide examples of the diverse domain backgrounds, levels of computational experience, and career stages of learners.

Madaline

Madaline is an Associate Professor in urban development at a large teaching and research university. Madaline's specific area of interest is in developing cities that promote health and happiness. Prior to this role she studied urban planning as an undergraduate and she worked for several years at an architectural firm. Two years ago she completed an introduction to Python course with the Carpentries and now collaborates on a project that embeds interactive, digital objects across cities to assist people with directions. She loosely follows academic developments in machine learning, but she has limited practical experience.

Many of the new students at the university are interested in learning about machine learning and its potential. Madaline has been asked to help teach on a three-week summer course on machine learning in urban development next year, which has motivated her to get some firmer practical experience in creating and applying models. Aside from the teaching, she has an idea for a project that would forecast ground conditions, such as uncleared snow and leaves, that might affect the ability of people with impaired eyesight, like her, to navigate easily.

Machine Learning Carpentry will build on Madaline's existing programming experience, offering her practical skills in building and applying regression models, decision trees, and neural networks for prediction. The course will teach her about convolutional neural networks that can be used to make predictions based on images, which will help her to get started on her latest project. Her previous experience in the Carpentries encourages her to provide assistance to her fellow learners during the course.

Mei

Mei is Vice-President of a biotech company that researches and develops medications for chronic conditions such as hypertension and arthritis. She completed degrees in biochemistry and business administration over twenty years ago. She now oversees a management team and provides strategic direction for the company. She has in-depth knowledge of the drug development industry and has some experience in data analysis with SAS and R, both of which she uses to prepare presentations. She is aware that machine learning is an increasingly important technology but her only knowledge has come from popular media. She has found it difficult to separate the hype from the truth.

Many people within her company are pushing to incorporate machine learning tools into company processes. For example, the research and development team would like to use machine learning models to focus their efforts on therapeutic interventions that are most likely to be successful. The marketing team would like to use machine learning models to target certain groups of people who are most likely to benefit from their medications. One of her colleagues has raised technical and ethical concerns about the approaches. In both cases, Mei feels like she needs a firmer understanding of machine learning to be able to guide decision making.

Machine Learning Carpentry will introduce Mei to practical machine learning and its ethics and equip her with the base knowledge she needs for her job. She will understand the promise and limitations of the current state of the art in machine learning. She does not plan to continue writing machine learning code after completing the course, but she knows that she will increasingly be part of discussions about the applications of machine learning within her company.

Walter

Walter is a junior software engineer working for a social media company that has recently started to explore the area of health analytics and monitoring. He recently completed an undergraduate degree in computer science and he is well versed in multiple programming languages, including C++ and Python. During his degree he took a module in machine learning, but he would like to refresh his knowledge and familiarize himself with some of the common tools. He has noticed machine learning getting some bad press in recent months, but he hasn't thought much about why people are concerned.

Walter's company collects a large amount of data on its users, including information about gender, ethnicity, height, weight, age, and health conditions. One of the ideas that he would like to explore is whether this information can be used to predict who will need to visit a pharmacy in the coming days, so that the pharmacy can anticipate the visit and ensure appropriate goods are available for purchase. Walter would like to develop some baseline skills for working on the project, but he also welcomes the opportunity to think about whether or not the project is a good idea to pursue.

Machine Learning Carpentry will give Walter the ability to develop models to predict user behaviour and it will also give him focused time to think about the benefits and risks of such a project. His computer science background will enable him to help his fellow learners with programming and Python questions.

Ali

Ali is a governmental policy advisor for the Office of Science and Technology. She has an undergraduate degree in economics and a master's degree in international policy. She regularly reads papers and reports on machine learning and has completed a short online course in machine learning in Python. Since completing the course she has been working on some projects in her personal time for fun, building these projects in her public git repository.

The government is developing a national AI strategy to boost the country's capabilities in machine learning technologies. Ali is one of the team working on developing this strategy and she has been tasked with exploring the capabilities of machine learning and building relevant connections within the community. More personally, she is interested in continuing to work on interesting machine learning projects and would love to have the opportunity to ask questions and to meet like-minded people.

Machine Learning Carpentry will help Ali to build on her existing machine learning experience, introducing her to concepts that were only briefly covered in her online course and giving her the opportunity to clarify some points that were not clearly described, such as the difference between AI and machine learning. Ali will also have the opportunity to interact with people from a broad set of backgrounds, helping her to understand the areas where government efforts should be focused.

Misc notes from delivery

episode 1: "their response the new drug" -> "their response to the new drug"

episode 2: This is a style thing but the sentence structure is often needlessly complicated. eg, "It is often the case that our data includes categorical values." can be simplified to "Datasets like these often include categorical values."
Similarly "In our case, for example, the binary outcome we are trying to predict - in hospital mortality - is recorded as “ALIVE” and “EXPIRED”." can be simplified to "In our case, the binary outcome we are trying to predict (hospital mortality) is recorded as ALIVE and EXPIRED".

It is extremely weird to drop the categorical outcome variable and use it as y, including the encoded numeric variable in x. I realise this is an intro lesson but this seems to me a coding mistake that would be common for novices
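
If the concern is as described, one way to avoid it is to remove every outcome-derived column from the features before fitting. The sketch below uses a tiny made-up table; the column names (age, outcome, outcome_encoded) are hypothetical and may not match the lesson's dataset:

```python
import pandas as pd

# Tiny made-up cohort (hypothetical column names and values).
df = pd.DataFrame({
    "age": [70, 54, 81, 66],
    "outcome": ["ALIVE", "EXPIRED", "EXPIRED", "ALIVE"],
})

# Encode the categorical outcome as 0/1 (1 = EXPIRED).
df["outcome_encoded"] = (df["outcome"] == "EXPIRED").astype(int)

# Use the encoded outcome as the target ...
y = df["outcome_encoded"]

# ... and drop BOTH outcome columns from the features, so the model
# cannot simply read the answer from its inputs.
X = df.drop(columns=["outcome", "outcome_encoded"])

print(list(X.columns))  # ['age']
print(list(y))          # [0, 1, 1, 0]
```

Leaving either outcome column in X would leak the target into the features and make the model look far better than it is.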

To avoid data leaking between our training and test sets, we take the median from the training set only. The training median is then used to impute missing values in the held-out test set.

This isn't really explained at all. The data imputation section generally is a bit short. It'd be good to mention why imputing with the median is a bad idea in arguably most cases
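
A short sketch of the step quoted above may help the explanation. It uses made-up one-dimensional arrays rather than the lesson's data; the point is only that the median is computed once, from the training values, and then reused for the test set:

```python
import numpy as np

# Made-up feature values with missing entries (np.nan); not lesson data.
x_train = np.array([1.0, 2.0, np.nan, 4.0, 100.0])
x_test = np.array([np.nan, 10.0, np.nan, 20.0])

# Median of the observed TRAINING values only: median(1, 2, 4, 100) = 3.0
train_median = np.nanmedian(x_train)

# Fill gaps in both sets with the training median. The test set's own
# median (15.0 here) is never computed, so no test information leaks
# into the preprocessing.
x_train_imputed = np.where(np.isnan(x_train), train_median, x_train)
x_test_imputed = np.where(np.isnan(x_test), train_median, x_test)

print(train_median)     # 3.0
print(x_test_imputed)   # [ 3. 10.  3. 20.]
```

The same pattern applies to any fitted preprocessing step: estimate its parameters on the training set, then apply them unchanged to the held-out data.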
