whitead / dmol-book
Deep learning for molecules and materials book
Home Page: https://dmol.pub
License: Other
Thanks for chapter 1, I like that you bundled a bunch of concepts together on the one dataset; it flowed very well. I picked out a few typos -- looking for those is helping me pay attention. :-)
Machine learning has three types of learning: supervised, semi-supervised, and unsupervised. There are two tasks: regression and classification.
Suggestion: "There are two basic tasks in supervised learning: regression and classification"
Reason: It's not immediately clear that regression and classification are typically supervised.
Let's just see some specific example, extremes, and get a sense of the range of labels/data. We'll start with seeing what kind of molecules there are.
Suggestion: Let's first look at the first molecule, then the extreme values, and get a sense of the range of labels/data. We'll start with seeing what kind of molecules there are.
Reason: "Let's just see some specific example, extremes," could be interpreted that the extremes are the example.
This is first molecule in the dataset rendered using rdkit
Missing full stop.
ioinc compounds have higher solubility
ionic?
Outliers are extreme values that fall outside of your normal data distribution. They can be mistakes, be from a different distribution (e.g., metals instead of organic molecules), and can have a strong effect on model training.
Suggestion: Outliers are extreme values that fall outside of your normal data distribution. They could be mistakes or be from a different distribution (e.g., metals instead of organic molecules). Either way, they can have a strong effect on model training.
Reason: Just felt like distinct concepts that deserved a sentence break
(how often the molecule occured in the constituent databases),
occurred?
MolLogP, which is a computed estimated related to solubility, does correlate well.
computed estimate related to solubility?
parmaters
(in various other sections below, too) => parameters?
This computes an analytical derivative a python function.
analytical derivative of a Python function?
The loss is lower than wihtout batching
without?
Each of the features have different magnitudes. Like molecular weight, which is lagre, and number of rings which is small.
The second sentence might be nicer fleshed out a little, like:
For example, the molecular weight is large, compared to the number of rings, which is small.
from myst_nb import glue
glue('corr', np.round(np.corrcoef(labels, predicted_labels)[0,1], 2))
That is very neat, but I guess it's not supposed to be executed in Google Co-lab? If it is, the sphinx version needs some tweaking. If not, is there any way to mark that as non-executable?
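For readers running outside Sphinx, the underlying computation can be reproduced without myst_nb. A minimal numpy-only sketch with made-up toy arrays standing in for the book's labels/predicted_labels:

```python
import numpy as np

# Toy stand-ins for the book's labels / predicted_labels (made-up values)
labels = np.array([1.0, 2.0, 3.0, 4.0])
predicted_labels = np.array([1.1, 1.9, 3.2, 3.8])

# Same computation the glue() call wraps: Pearson correlation, rounded
corr = np.round(np.corrcoef(labels, predicted_labels)[0, 1], 2)
print(corr)
```
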
We’re gonig to zoom into the 99th percentile
gonig => going
we do not label the axes beceause they are arbitrary.
beceause => because
Although in PCA etc. it does seem common to label which principal component you're visualizing.
Yeah, that’s why clustering hard.
Maybe you left out the "is" for effect, but just in case.... :-)
It's not super relevant, but if you were thinking of linking to extra resources, I found Shao et al 2007 a useful MD-focused one and more generally the scikit-learn user guide.
We can extract the most centered data points (closest to cluster center) and consider them to be representatitive of the cluster.
representatitive => representative
We might try to gain insight about predicting solubility solubility
Extra solubility
Compute it’s gradient using jax.
its (no apostrophe)
Using the regularized features, show what effect batchsize has on training. Use batchsizes of 1, 8, 32, 256, 1024. Make sure you re-initialize your weights in between each run. Plot the log-loss for each batchsize on the same plot. Describe your results.
I think batch size is typically 2 words?
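Since the exercise is self-contained, here is a rough numpy-only sketch of the training loop it asks for (hypothetical data, learning rate, and epoch count; no claim this matches the book's actual setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 1024 examples, 2 features, linear ground truth
x = rng.normal(size=(1024, 2))
y = x @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=1024)


def train(batch_size, epochs=10, lr=0.1):
    w = np.zeros(2)  # re-initialize weights for each run, as the exercise says
    losses = []
    for _ in range(epochs):
        perm = rng.permutation(len(x))
        for i in range(0, len(x), batch_size):
            idx = perm[i:i + batch_size]
            # gradient of mean squared error for this mini-batch
            grad = 2 * x[idx].T @ (x[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
        losses.append(np.mean((x @ w - y) ** 2))
    return losses


# One loss curve per batch size, as the exercise asks
curves = {b: train(b) for b in [1, 8, 32, 256, 1024]}
```

Plotting each curve with something like `plt.semilogy(curve, label=str(b))` would then give the requested log-loss comparison on one plot.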
1.6.4. Clustering
You can still cluster if you have labeles,
labeles => labels
Distill recently published some GNN papers with really nice visualizations, e.g., https://distill.pub/2021/gnn-intro/. It would be a pity not to reuse them or at least reference them.
t-statistics is missing t CDF
Missing in some chapters when changed to self-hosting data
I liked how you showed the non-differentiability of the hard thresholding and also liked that you showed the need for a baseline model
naieve
(#60) I really enjoyed using the Co-lab notebook, it's very smooth (but for installation...). This is my first time using one. I had some suggestions on the image code.
SVGs just look nicer. If it doesn't mess up the book rendering, this line might be good in the import cell.
IPythonConsole.ipython_useSVG = True
# pardon this slop, we need to have a list of strings for legends
legend_text = [f'{i}: solubility = {s:.2f}' for i,s in zip(extremes.ID, extremes.Solubility)]
This didn't really seem that sloppy! But you could try the below if you prefer.
legend_text = [f'{x.ID}: solubility = {x.Solubility:.2f}'
for x in extremes.itertuples()]
In 1.3.3 Feature Correlation, much of the image code seems to replicate what seaborn is very good at. As seaborn is already imported, I wonder if the below would get the point of what you're doing across more directly.
Original code
features_start_at = list(soldata.columns).index('MolWt')
feature_names = soldata.columns[features_start_at:]
fig, axs = plt.subplots(nrows=5, ncols=4, sharey=True, figsize=(12, 8), dpi=300)
axs = axs.flatten() # don't want to think about i/j
for i,n in enumerate(feature_names):
ax = axs[i]
ax.scatter(
soldata[n], soldata.Solubility,
s = 6, alpha=0.4,
color = color_cycle[i % len(color_cycle)]) # add some color
if i % 4 == 0:
ax.set_ylabel('Solubility')
ax.set_xlabel(n)
# hide empty subplots
for i in range(len(feature_names), len(axs)):
fig.delaxes(axs[i])
plt.tight_layout()
plt.show()
Suggested code
mpl.rcParams['figure.dpi'] = 300 # maybe move this up to the imports cell
features_start_at = list(soldata.columns).index('MolWt')
feature_names = soldata.columns[features_start_at:]
tidy_data = pd.melt(soldata, id_vars=["Solubility"],
value_vars=feature_names,
var_name="Feature", value_name="Value")
g = sns.FacetGrid(tidy_data, col="Feature", hue="Feature",
col_wrap=4, sharey=True, sharex=False,
height=2, aspect=1.8,
palette=color_cycle, despine=False,
)
g.map(plt.scatter, "Value", "Solubility", s=6, alpha=0.4)
g.set_xlabels("")
g.set_titles(template="{col_name}")
plt.show()
The browser seems to resize the image whenever I try to screenshot it and the code at
the same time, but it gives pretty much what the old code did.
Edit: that dpi=300 might be why my other images are huge, though.
Reading the section "Framework Choice", I find the usage of the word "language" ambiguous. The author uses it to refer to any of several programming frameworks (or libraries); however, "language" implies a syntax and grammar and is thus more commonly shorthand for "programming language" in software development instructional materials, at least as published in the USA. Because the author has already selected a programming language, Python, I recommend that the word "language" not also be used to describe an application programming interface, framework, or library.
Add angles to list of rotation invariant terms.
I split this up from #26 for clarity. Most feedback is quibbling about language choice, so I broke that up into sections by heading as well.
Rank can be defined as the number of indices required to get individual elements of a tensor.
Matrix rank is a concept from linear algebra and has nothing to do with tensor rank.
I suppose using tensor order to distinguish from matrix rank is non-conventional? It's just a little confusing when linear algebra is fairly pertinent to a lot of machine learning.
A euclidian vector
A Euclidean vector?
sum(a, axis=0)
I understand that numpy is implied, but sum is also a built-in function, and does not take axis as a keyword argument. Perhaps np.sum would be clearer?
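To illustrate the distinction, a quick check (numpy assumed):

```python
import numpy as np

a = np.arange(6).reshape((2, 3))

# Python's built-in sum has no axis keyword argument
try:
    sum(a, axis=0)
    builtin_takes_axis = True
except TypeError:
    builtin_takes_axis = False

# numpy's sum understands axis; axis=0 sums down the columns
col_sums = np.sum(a, axis=0)
print(builtin_takes_axis, col_sums)  # False [3 5 7]
```
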
I've not used Jupyter Book before -- if intersphinx works on it, it could be a really great way to auto-link to other library's documentation.
In tensorflow and jax there is expand_dims 😠 You can also use reshape and ignore newaxis
I'm not really sure how to interpret this angry face... 😅 numpy has expand_dims too! And I think jax.numpy.newaxis also works?
>>> import jax.numpy as jnp
>>> arr = jnp.arange(12).reshape((3, 4))
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
>>> arr[..., jnp.newaxis].shape
(3, 4, 1)
>>> arr[jnp.newaxis].shape
(1, 3, 4)
>>> arr.reshape((2, 6))
DeviceArray([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11]], dtype=int32)
There is one special syntax element to shaping: -1 dimensions. -1 can appear once in a reshape command and means to have the computer figure out what goes there by following the rule that the number of elements in the tensor must remain the same. Let’s see some examples.
I found the second sentence hard to follow. Would some punctuation and formatting help?
-1 can appear once in a reshape command. It tells the computer to figure out what goes there by following the rule that the total number of elements in the tensor must remain the same.
Below I also suggest my personal understanding of -1.
-1 can appear once in a reshape command. Because the total number of elements in the reshaped tensor must remain the same, -1 stands for "everything else".
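A quick numpy illustration of that -1 convention (assuming the book's array API behaves the same way):

```python
import numpy as np

a = np.arange(12)  # 12 elements total

# -1 resolves to 4, because 3 * 4 = 12
print(a.reshape((3, -1)).shape)  # (3, 4)

# -1 resolves to 2, because 2 * 2 * 3 = 12
print(a.reshape((-1, 2, 3)).shape)  # (2, 2, 3)
```
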
There are operations that reduce ranks of tensors, like sum or mean.
This can be a pain to do consistently, but for functions that are also English words I like to format function names in code style (i.e. sum or mean) to distinguish when I'm talking about code vs maths.
Just a suggestion: many undergraduate courses check code function by providing some test cases and example answers. It might be nice to provide an example input-output pair in Markdown code so people can check their work. e.g.
input_arr = np.array([1, 10, 2, 3])
output_arr = np.array([0.0625, 0.625 , 0.125 , 0.1875])
assert np.all(normalize_vector(input_arr) == output_arr)
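Judging from that input/output pair, the exercise's normalize_vector presumably scales a vector so its elements sum to 1 (this is my reading, not the book's stated definition). A sketch consistent with the example:

```python
import numpy as np


def normalize_vector(v):
    # Hypothetical implementation matching the example pair:
    # divide by the total so the elements sum to 1
    v = np.asarray(v, dtype=float)
    return v / v.sum()


input_arr = np.array([1, 10, 2, 3])
print(normalize_vector(input_arr))  # [0.0625 0.625  0.125  0.1875]
```
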
write python code to compute their outter product.
outter -> outer
You have a tensor of unknown rank A and would like to subtract 3.5, and 2.5 from every element so that your output, which is a new tensor B, is rank of rank(A) + 1. The last axis of B should be dimension 2.
I had to read over this a few times before coming to an interpretation that accounted for every word. Is this what you're after?
>>> import numpy as np
>>> subtract_twice = lambda x: np.stack([x, x], axis=-1) - [3.5, 2.5]
>>> a = np.arange(24).reshape((6, 4))
>>> subtract_twice(a).shape
(6, 4, 2)
>>> subtract_twice(a.reshape((2, 3, 4))).shape
(2, 3, 4, 2)
If so, can I please suggest breaking up the question for more immediate understanding? It's more repetitive but people don't typically read textbooks for the prose.
You have a tensor of unknown rank A. You would like to perform two operations on it at once: firstly, subtracting 3.5 from every element; and secondly, subtracting 2.5 from every element. Your output should combine the results and should be a new tensor B with rank rank(A) + 1. Hint: the new axis of B should be last, and should have dimension 2.
(I originally sat wondering how subtracting 6 should give me dimension 2).
Presumably the book may not be mature enough yet to assign a definitive numbering to the chapters, but perhaps it would be a good idea to assign some temporarily so that we can reference sections (and subsections) in class/our notes, and get an idea of where different subsections start and end.
Reminder for myself to address those.
I did not find the package installation cell to work immediately, likely due to an update in RDKit since you wrote it. It's in section "1.3. Running This Notebook".
The first issue is that "sklearn" is installed -- I think the package to install is actually scikit-learn.
Secondly, I had significant trouble getting rdkit to work properly (version 2021.03.4), and it seems this is not unique to me
11
12 from rdkit import rdBase
---> 13 from rdkit.DataStructs import cDataStructs
14 from rdkit.DataStructs.cDataStructs import *
15
ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /usr/local/lib/python3.7/site-packages/rdkit/DataStructs/cDataStructs.so)
Pinning RDKit to an older version as suggested worked in my case. Just installing the missing packages with apt did not. I also bunched all the numeric-y libraries together in conda, as I've found it figures out dependencies better. (Pinning other packages might extend the notebook's lifespan in Google Co-lab, too.)
Suggested solution
!wget -c https://repo.continuum.io/miniconda/Miniconda3-py37_4.8.3-Linux-x86_64.sh
!chmod +x Miniconda3-py37_4.8.3-Linux-x86_64.sh
!time bash ./Miniconda3-py37_4.8.3-Linux-x86_64.sh -b -f -p /usr/local/
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')
!conda install -y -c conda-forge rdkit=2020.09.2 numpy jaxlib jax pandas scikit-learn
!pip install jupyter-book matplotlib seaborn tabulate
Install instructions are messed-up
Revise Vrep sampling and S5 example according to
mfinzi/equivariant-MLP#10 (comment)
Need to seed the random numbers (or pin version?). Something is different between Github actions and my local run.
Add information on positive only classification
https://www.sciencedirect.com/science/article/abs/pii/S2405471220304142
In section B 1.3.4, the loss function is defined as jnp.sqrt(jnp.mean((y - labels)**2)), where y is the predictions from the linear model. Note that here, the loss is equal to the square root of the mean. However, in equation 1.16, the loss seems to be defined as simply the mean (of the square of the differences), with no square root involved. I presume that equation 1.16 needs to have L² on the LHS for the loss definitions to be consistent.
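The discrepancy is easy to see numerically. With numpy standing in for jax.numpy and made-up values:

```python
import numpy as np

# Made-up values just to show the two definitions differ
y = np.array([1.0, 2.0, 3.0])
labels = np.array([0.0, 0.0, 0.0])

mse = np.mean((y - labels) ** 2)  # what equation 1.16 writes: (1 + 4 + 9) / 3
rmse = np.sqrt(mse)               # what the code in B 1.3.4 computes
print(mse, rmse)
```
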
similar to https://github.com/cheminfo/eln-docs/blob/main/.github/workflows/spellcheck.yml and https://github.com/cheminfo/eln-docs/blob/main/.github/workflows/check_links.yml.
If you think it is useful, I can make the PR. The spellcheck CI might be annoying at first, when you do not yet have a good whitelist, but it might be pretty useful in the long run.
Why do they not converge?
8 is better than 16 for nearest power.
First of all, thanks for this awesome resource.
In this line (l. 615 on the source blob), z6_phi was probably supposed to be p.
From flows chapter
Should change post-hoc/surrogate to use Miller/Lipton nomenclature of explanation vs interpretation
https://arxiv.org/abs/1706.07269
from #47 (which I'll now close)
Many figures are output without alt text. This should be fixed.
Number of constant values in equivariant linear basis is not the same as number of free parameters (?) Need to check on this.
from @kjappelbaum
I think you mention that there is currently no paper on this, and I think I agree with that.
A nice hint is the guacamol leaderboard https://www.benevolent.com/guacamol, and when we tested on SELFIES we also didn't find it to outperform the graph-based models. Clearly it is also related to the modeling, but as you write, I also think that the representation matters.
Would be interesting to do a proper benchmark, where one also considers the latent space of a model trained on string-based representation (e.g., a VAE, maybe trained jointly with some properties)
A method to link to external docs - could be used to link to keras, TF, jax docs.
Bayesian, Deep ensembling, Calibration, dropout, flows?
Using emlp
G = SO(3) * S(16)
Vin = V(G)**0 + V(G)
Vout = V(G)
Maybe QM9?
Language is a little unclear at beginning of VAE chapter and there is a hat missing on P(x) later on in details section.
Need to add warnings filter.
Hi @whitead, thanks for writing this book. It's neatly laid out and easy for beginners like me to understand. I hope you don't mind that I took LiveComSJ's tweet asking for feedback seriously.
Deep learning is specifically about connecting two types of data with a neural network function, which is differentiable and able to approximate any function. The classic example is connecting function and structure in molecules.
Suggestion: The classic example is connecting molecular structure to its function.
Reason: "function" just meant a totally different thing in the sentence before. Reading "The classic example is connecting function" without the molecular context meant that I first thought of "connecting function" as a noun, i.e., a mapping from one space to another.
it’s ability to generate new data.
its, no apostrophe
One example that sets deep learning apart from machine learning is in feature engineering.
I think many people would argue that deep learning is a form of machine learning. If I understand correctly, the next part of the paragraph argues that deep learning does not need feature engineering? Perhaps "One advantage that sets deep learning apart from other machine learning techniques is that it does not require feature engineering"?
Previously training and using models in machine learning was a tedious process and required deriving equations for each model change.
Comma after previously?
Deep learning is always a little tied-up in the implementation details. Thus, language choice can be a part of the learning process. In this book, we use Jax, Tensorflow, Keras, and scikit-learn for different purposes.
It was a little odd for me that you jumped straight to libraries instead of saying Python first, considering that the section is titled "Language choice". Throughout this section you also refer to them all as languages. I would personally consider all of those to be "frameworks" rather than languages.
Stackoverflow
Stack Overflow?