whitead / dmol-book
Deep learning for molecules and materials book
Home Page: https://dmol.pub
License: Other
Thanks for chapter 1, I like that you bundled a bunch of concepts together on the one dataset; it flowed very well. I picked out a few typos -- looking for those is helping me pay attention. :-)
Machine learning has three types of learning: supervised, semi-supervised, and unsupervised. There are two tasks: regression and classification.
Suggestion: "There are two basic tasks in supervised learning: regression and classification"
Reason: It's not immediately clear that regression and classification are typically supervised.
Let's just see some specific example, extremes, and get a sense of the range of labels/data. We'll start with seeing what kind of molecules there are.
Suggestion: Let's first look at the first molecule, then the extreme values, and get a sense of the range of labels/data. We'll start with seeing what kind of molecules there are.
Reason: "Let's just see some specific example, extremes," could be interpreted that the extremes are the example.
This is first molecule in the dataset rendered using rdkit
Missing full stop.
ioinc compounds have higher solubility
ionic?
Outliers are extreme values that fall outside of your normal data distribution. They can be mistakes, be from a different distribution (e.g., metals instead of organic molecules), and can have a strong effect on model training.
Suggestion: Outliers are extreme values that fall outside of your normal data distribution. They could be mistakes or be from a different distribution (e.g., metals instead of organic molecules). Either way, they can have a strong effect on model training.
Reason: Just felt like distinct concepts that deserved a sentence break
(how often the molecule occured in the constituent databases),
occurred?
MolLogP, which is a computed estimated related to solubility, does correlate well.
computed estimate related to solubility?
parmaters
(in various other sections below, too) => parameters?
This computes an analytical derivative a python function.
analytical derivative of a Python function?
The loss is lower than wihtout batching
without?
Each of the features have different magnitudes. Like molecular weight, which is lagre, and number of rings which is small.
The second sentence might be nicer fleshed out a little, like:
For example, the molecular weight is large, compared to the number of rings, which is small.
from myst_nb import glue
glue('corr', np.round(np.corrcoef(labels, predicted_labels)[0,1], 2))
That is very neat, but I guess it's not supposed to be executed in Google Co-lab? If it is, the sphinx version needs some tweaking. If not, is there any way to mark that as non-executable?
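For readers running outside Sphinx, the underlying computation can be reproduced without myst_nb. A minimal numpy-only sketch with made-up toy arrays standing in for the book's labels/predicted_labels:

```python
import numpy as np

# Toy stand-ins for the book's labels / predicted_labels (made-up values)
labels = np.array([1.0, 2.0, 3.0, 4.0])
predicted_labels = np.array([1.1, 1.9, 3.2, 3.8])

# Same computation the glue() call wraps: Pearson correlation, rounded
corr = np.round(np.corrcoef(labels, predicted_labels)[0, 1], 2)
print(corr)
```
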
We’re gonig to zoom into the 99th percentile
gonig => going
we do not label the axes beceause they are arbitrary.
beceause => because
Although in PCA etc. it does seem common to label which principal component you're visualizing.
Yeah, that’s why clustering hard.
Maybe you left out the "is" for effect, but just in case.... :-)
It's not super relevant, but if you were thinking of linking to extra resources, I found Shao et al 2007 a useful MD-focused one and more generally the scikit-learn user guide.
We can extract the most centered data points (closest to cluster center) and consider them to be representatitive of the cluster.
representatitive => representative
We might try to gain insight about predicting solubility solubility
Extra solubility
Compute it’s gradient using jax.
its (no apostrophe)
Using the regularized features, show what effect batchsize has on training. Use batchsizes of 1, 8, 32, 256, 1024. Make sure you re-initialize your weights in between each run. Plot the log-loss for each batchsize on the same plot. Describe your results.
I think batch size is typically 2 words?
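Since the exercise is self-contained, here is a rough numpy-only sketch of the training loop it asks for (hypothetical data, learning rate, and epoch count; no claim this matches the book's actual setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 1024 examples, 2 features, linear ground truth
x = rng.normal(size=(1024, 2))
y = x @ np.array([1.5, -2.0]) + 0.1 * rng.normal(size=1024)


def train(batch_size, epochs=10, lr=0.1):
    w = np.zeros(2)  # re-initialize weights for each run, as the exercise says
    losses = []
    for _ in range(epochs):
        perm = rng.permutation(len(x))
        for i in range(0, len(x), batch_size):
            idx = perm[i:i + batch_size]
            # gradient of mean squared error for this mini-batch
            grad = 2 * x[idx].T @ (x[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
        losses.append(np.mean((x @ w - y) ** 2))
    return losses


# One loss curve per batch size, as the exercise asks
curves = {b: train(b) for b in [1, 8, 32, 256, 1024]}
```

Plotting each curve with something like `plt.semilogy(curve, label=str(b))` would then give the requested log-loss comparison on one plot.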
1.6.4. Clustering
You can still cluster if you have labeles,
labeles => labels
Distill recently published some GNN papers with really nice visualizations, e.g., https://distill.pub/2021/gnn-intro/. It would be a pity not to reuse them or at least reference them.
t-statistics is missing t CDF
Missing in some chapters when changed to self-hosting data
I liked how you showed the non-differentiability of the hard thresholding and also liked that you showed the need for a baseline model
naieve
(#60) I really enjoyed using the Co-lab notebook, it's very smooth (but for installation...). This is my first time using one. I had some suggestions on the image code.
SVGs just look nicer. If it doesn't mess up the book rendering, this line might be good in the import cell.
IPythonConsole.ipython_useSVG = True
# pardon this slop, we need to have a list of strings for legends
legend_text = [f'{i}: solubility = {s:.2f}' for i,s in zip(extremes.ID, extremes.Solubility)]
This didn't really seem that sloppy! But you could try the below if you prefer.
legend_text = [f'{x.ID}: solubility = {x.Solubility:.2f}'
for x in extremes.itertuples()]
In 1.3.3 Feature Correlation, much of the image code seems to replicate what seaborn is very good at. As seaborn is already imported, I wonder if the below would get the point of what you're doing across more directly.
Original code
features_start_at = list(soldata.columns).index('MolWt')
feature_names = soldata.columns[features_start_at:]
fig, axs = plt.subplots(nrows=5, ncols=4, sharey=True, figsize=(12, 8), dpi=300)
axs = axs.flatten() # don't want to think about i/j
for i,n in enumerate(feature_names):
ax = axs[i]
ax.scatter(
soldata[n], soldata.Solubility,
s = 6, alpha=0.4,
color = color_cycle[i % len(color_cycle)]) # add some color
if i % 4 == 0:
ax.set_ylabel('Solubility')
ax.set_xlabel(n)
# hide empty subplots
for i in range(len(feature_names), len(axs)):
fig.delaxes(axs[i])
plt.tight_layout()
plt.show()
Suggested code
mpl.rcParams['figure.dpi'] = 300 # maybe move this up to the imports cell
features_start_at = list(soldata.columns).index('MolWt')
feature_names = soldata.columns[features_start_at:]
tidy_data = pd.melt(soldata, id_vars=["Solubility"],
value_vars=feature_names,
var_name="Feature", value_name="Value")
g = sns.FacetGrid(tidy_data, col="Feature", hue="Feature",
col_wrap=4, sharey=True, sharex=False,
height=2, aspect=1.8,
palette=color_cycle, despine=False,
)
g.map(plt.scatter, "Value", "Solubility", s=6, alpha=0.4)
g.set_xlabels("")
g.set_titles(template="{col_name}")
plt.show()
The browser seems to resize the image whenever I try to screenshot it and the code at
the same time, but it gives pretty much what the old code did.
Edit: that dpi=300 might be why my other images are huge, though.
Reading the section "Framework Choice", I find the usage of the word "language" ambiguous. The author uses it to refer to any of several programming frameworks (or libraries); however, "language" implies a syntax and grammar and is thus more commonly shorthand for "programming language" in software development instructional materials, at least as published in the USA. Because the author has already selected a programming language, Python, I recommend that the word "language" not also be used to describe an application programming interface, framework, or library.
Add angles to list of rotation invariant terms.
I split this up from #26 for clarity. Most feedback is quibbling about language choice, so I broke that up into sections by heading as well.
Rank can be defined as the number of indices required to get individual elements of a tensor.
Matrix rank is a concept from linear algebra and has nothing to do with tensor rank.
I suppose using tensor order to distinguish from matrix rank is non-conventional? It's just a little confusing when linear algebra is fairly pertinent to a lot of machine learning.
A euclidian vector
A Euclidean vector?
sum(a, axis=0)
I understand that numpy is implied, but sum is also a built-in function, and does not take axis as a keyword argument. Perhaps np.sum would be clearer?
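To illustrate the distinction, a quick check (numpy assumed):

```python
import numpy as np

a = np.arange(6).reshape((2, 3))

# Python's built-in sum has no axis keyword argument
try:
    sum(a, axis=0)
    builtin_takes_axis = True
except TypeError:
    builtin_takes_axis = False

# numpy's sum understands axis; axis=0 sums down the columns
col_sums = np.sum(a, axis=0)
print(builtin_takes_axis, col_sums)  # False [3 5 7]
```
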
I've not used Jupyter Book before -- if intersphinx works on it, it could be a really great way to auto-link to other library's documentation.
In tensorflow and jax there is expand_dims 😠 You can also use reshape and ignore newaxis
I'm not really sure how to interpret this angry face... 😅 numpy has expand_dims too! And I think jax.numpy.newaxis also works?
>>> import jax.numpy as jnp
>>> arr = jnp.arange(12).reshape((3, 4))
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
>>> arr[..., jnp.newaxis].shape
(3, 4, 1)
>>> arr[jnp.newaxis].shape
(1, 3, 4)
>>> arr.reshape((2, 6))
DeviceArray([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11]], dtype=int32)
There is one special syntax element to shaping: -1 dimensions. -1 can appear once in a reshape command and means to have the computer figure out what goes there by following the rule that the number of elements in the tensor must remain the same. Let’s see some examples.
I found the second sentence hard to follow. Would some punctuation and formatting help?
-1 can appear once in a reshape command. It tells the computer to figure out what goes there by following the rule that the total number of elements in the tensor must remain the same.
Below I also suggest my personal understanding of -1.
-1 can appear once in a reshape command. Because the total number of elements in the reshaped tensor must remain the same, -1 stands for "everything else".
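A quick numpy illustration of that -1 convention (assuming the book's array API behaves the same way):

```python
import numpy as np

a = np.arange(12)  # 12 elements total

# -1 resolves to 4, because 3 * 4 = 12
print(a.reshape((3, -1)).shape)  # (3, 4)

# -1 resolves to 2, because 2 * 2 * 3 = 12
print(a.reshape((-1, 2, 3)).shape)  # (2, 2, 3)
```
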
There are operations that reduce ranks of tensors, like sum or mean.
This can be a pain to do consistently, but for functions that are also English words I like to format function names in code style (i.e. sum or mean) to distinguish when I'm talking about code vs maths.
Just a suggestion: many undergraduate courses check code function by providing some test cases and example answers. It might be nice to provide an example input-output pair in Markdown code so people can check their work. e.g.
input_arr = np.array([1, 10, 2, 3])
output_arr = np.array([0.0625, 0.625 , 0.125 , 0.1875])
assert np.all(normalize_vector(input_arr) == output_arr)
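Judging from that input/output pair, the exercise's normalize_vector presumably scales a vector so its elements sum to 1 (this is my reading, not the book's stated definition). A sketch consistent with the example:

```python
import numpy as np


def normalize_vector(v):
    # Hypothetical implementation matching the example pair:
    # divide by the total so the elements sum to 1
    v = np.asarray(v, dtype=float)
    return v / v.sum()


input_arr = np.array([1, 10, 2, 3])
print(normalize_vector(input_arr))  # [0.0625 0.625  0.125  0.1875]
```
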
write python code to compute their outter product.
outter -> outer
You have a tensor of unknown rank A and would like to subtract 3.5, and 2.5 from every element so that your output, which is a new tensor B, is rank of rank(A) + 1. The last axis of B should be dimension 2.
I had to read over this a few times before coming to an interpretation that accounted for every word. Is this what you're after?
>>> import numpy as np
>>> subtract_twice = lambda x: np.stack([x, x], axis=-1) - [3.5, 2.5]
>>> a = np.arange(24).reshape((6, 4))
>>> subtract_twice(a).shape
(6, 4, 2)
>>> subtract_twice(a.reshape((2, 3, 4))).shape
(2, 3, 4, 2)
If so, can I please suggest breaking up the question for more immediate understanding? It's more repetitive but people don't typically read textbooks for the prose.
You have a tensor of unknown rank A. You would like to perform two operations on it at once: firstly, subtracting 3.5 from every element; and secondly, subtracting 2.5 from every element. Your output should combine the results and should be a new tensor B with rank rank(A) + 1. Hint: the new axis of B should be last, and should have dimension 2.
(I originally sat wondering how subtracting 6 should give me dimension 2).
Presumably the book may not be mature enough yet to assign a definitive numbering to the chapters, but perhaps it would be a good idea to assign some temporarily so that we can reference sections (and subsections) in class/our notes, and get an idea of where different subsections start and end.
Reminder for myself to address those.
I did not find the package installation cell to work immediately, likely due to an update in RDKit since you wrote it. It's in section "1.3. Running This Notebook".
The first issue is that "sklearn" is installed -- I think the package to install is actually scikit-learn.
Secondly, I had significant trouble getting rdkit to work properly (version 2021.03.4), and it seems this is not unique to me
11
12 from rdkit import rdBase
---> 13 from rdkit.DataStructs import cDataStructs
14 from rdkit.DataStructs.cDataStructs import *
15
ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /usr/local/lib/python3.7/site-packages/rdkit/DataStructs/cDataStructs.so)
Pinning RDKit to an older version as suggested worked in my case. Just installing the missing packages with apt did not. I also bunched all the numeric-y libraries together in conda, as I've found it figures out dependencies better. (Pinning other packages might extend the notebook's lifespan in Google Co-lab, too.)
Suggested solution
!wget -c https://repo.continuum.io/miniconda/Miniconda3-py37_4.8.3-Linux-x86_64.sh
!chmod +x Miniconda3-py37_4.8.3-Linux-x86_64.sh
!time bash ./Miniconda3-py37_4.8.3-Linux-x86_64.sh -b -f -p /usr/local/
import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')
!conda install -y -c conda-forge rdkit=2020.09.2 numpy jaxlib jax pandas scikit-learn
!pip install jupyter-book matplotlib seaborn tabulate
Install instructions are messed-up
Revise Vrep sampling and S5 example according to
mfinzi/equivariant-MLP#10 (comment)
Need to seed the random numbers (or pin version?). Something is different between Github actions and my local run.
Add information on positive only classification
https://www.sciencedirect.com/science/article/abs/pii/S2405471220304142
In section B 1.3.4, the loss function is defined as jnp.sqrt(jnp.mean((y - labels)**2)), where y is the predictions from the linear model. Note that here, the loss is equal to the square root of the mean. However, in equation 1.16, the loss seems to be defined as simply the mean (of the square of the differences), with no square root involved. I presume that equation 1.16 needs to have L² on the LHS for the loss definitions to be consistent.
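The discrepancy is easy to see numerically. With numpy standing in for jax.numpy and made-up values:

```python
import numpy as np

# Made-up values just to show the two definitions differ
y = np.array([1.0, 2.0, 3.0])
labels = np.array([0.0, 0.0, 0.0])

mse = np.mean((y - labels) ** 2)  # what equation 1.16 writes: (1 + 4 + 9) / 3
rmse = np.sqrt(mse)               # what the code in B 1.3.4 computes
print(mse, rmse)
```
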
similar to https://github.com/cheminfo/eln-docs/blob/main/.github/workflows/spellcheck.yml and https://github.com/cheminfo/eln-docs/blob/main/.github/workflows/check_links.yml.
If you think it is useful, I can make the PR. The spellcheck CI might be annoying at first, when you do not yet have a good whitelist, but it might be pretty useful in the long run.
Why do they not converge?
8 is better than 16 for nearest power.
First of all, thanks for this awesome resource.
In this line (l. 615 on the source blob), z6_phi was probably supposed to be p.
From flows chapter
Should change post-hoc/surrogate to use Miller/Lipton nomenclature of explanation vs interpretation
https://arxiv.org/abs/1706.07269
from #47 (which I'll now close)
Many figures are output without alt text. This should be fixed.
Number of constant values in equivariant linear basis is not the same as number of free parameters (?) Need to check on this.
from @kjappelbaum
I think you mention that there is currently no paper on this, and I think I agree with that.
A nice hint is the guacamol leaderboard https://www.benevolent.com/guacamol, and when we tested on SELFIES we also didn't find it to outperform the graph-based models. Clearly it is also related to the modeling, but as you write, I also think that the representation matters.
Would be interesting to do a proper benchmark, where one also considers the latent space of a model trained on string-based representation (e.g., a VAE, maybe trained jointly with some properties)
A method to link to external docs - could be used to link to keras, TF, jax docs.
Bayesian, Deep ensembling, Calibration, dropout, flows?
Using emlp
G = SO(3) * S(16)
Vin = V(G)**0 + V(G)
Vout = V(G)
Maybe QM9?
Language is a little unclear at beginning of VAE chapter and there is a hat missing on P(x) later on in details section.
Need to add warnings filter.
Hi @whitead, thanks for writing this book. It's neatly laid out and easy for beginners like me to understand. I hope you don't mind that I took LiveComSJ's tweet asking for feedback seriously.
Deep learning is specifically about connecting two types of data with a neural network function, which is differentiable and able to approximate any function. The classic example is connecting function and structure in molecules.
Suggestion: The classic example is connecting molecular structure to its function.
Reason: "function" just meant a totally different thing in the sentence before. Reading "The classic example is connecting function" without the molecular context meant that I first thought of "connecting function" as a noun, i.e., a mapping from one space to another.
it’s ability to generate new data.
its, no apostrophe
One example that sets deep learning apart from machine learning is in feature engineering.
I think many people would argue that deep learning is a form of machine learning. If I understand correctly, the next part of the paragraph argues that deep learning does not need feature engineering? Perhaps "One advantage that sets deep learning apart from other machine learning techniques is that it does not require feature engineering"?
Previously training and using models in machine learning was a tedious process and required deriving equations for each model change.
Comma after previously?
Deep learning is always a little tied-up in the implementation details. Thus, language choice can be a part of the learning process. In this book, we use Jax, Tensorflow, Keras, and scikit-learn for different purposes.
It was a little odd for me that you jumped straight to libraries instead of saying Python first, considering that the section is titled "Language choice". Throughout this section you also refer to them all as languages. I would personally consider all of those to be "frameworks" rather than languages.
Stackoverflow
Stack Overflow?