
dmol-book's People

Contributors

alekssro, arunppsg, duerrsimon, elanapearl, euhruska, frenio, geemi725, hgandhi2411, killiansheriff, kjappelbaum, mehradans92, oakif, partev, raulppelaez, rcrehuet, rmeli, samcox822, whitead


dmol-book's Issues

Chapter 1: typos

Thanks for chapter 1, I like that you bundled a bunch of concepts together on the one dataset; it flowed very well. I picked out a few typos -- looking for those is helping me pay attention. :-)

1.2 Supervised Learning

Machine learning has three types of learning: supervised, semi-supervised, and unsupervised. There are two tasks: regression and classification.

Suggestion: "There are two basic tasks in supervised learning: regression and classification"

Reason: It's not immediately clear that regression and classification are typically supervised.

1.3.2. Data Exploration

Let's just see some specific example, extremes, and get a sense of the range of labels/data. We'll start with seeing what kind of molecules there are.

Suggestion: Let's first look at the first molecule, then the extreme values, and get a sense of the range of labels/data. We'll start with seeing what kind of molecules there are.

Reason: "Let's just see some specific example, extremes," could be interpreted that the extremes are the example.

This is first molecule in the dataset rendered using rdkit

Missing full stop.

ioinc compounds have higher solubility

ionic?

Outliers are extreme values that fall outside of your normal data distribution. They can be mistakes, be from a different distribution (e.g., metals instead of organic molecules), and can have a strong effect on model training.

Suggestion: Outliers are extreme values that fall outside of your normal data distribution. They could be mistakes or be from a different distribution (e.g., metals instead of organic molecules). Either way, they can have a strong effect on model training.

Reason: Just felt like distinct concepts that deserved a sentence break

(how often the molecule occured in the constituent databases),

occurred?

1.3.3 Feature Correlation

MolLogP, which is a computed estimated related to solubility, does correlate well.

computed estimate related to solubility?

1.3.4 Linear Model¶

parmaters

(in various other sections below, too) => parameters?

1.3.5 Gradient Descent

This computes an analytical derivative a python function.

analytical derivative of a Python function?

1.3.6 Batching

The loss is lower than wihtout batching

without?

1.3.7 ML Trick

Each of the features have different magnitudes. Like molecular weight, which is lagre, and number of rings which is small.

The second sentence might be nicer fleshed out a little, like:

For example, the molecular weight is large, compared to the number of rings, which is small.

from myst_nb import glue
glue('corr', np.round(np.corrcoef(labels, predicted_labels)[0,1], 2))

That is very neat, but I guess it's not supposed to be executed in Google Co-lab? If it is, the sphinx version needs some tweaking. If not, is there any way to mark that as non-executable?

(screenshot of the rendered cell output attached)
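If it is meant to run there, one workaround sketch (just my guess, assuming the failure is only that myst_nb isn't installed in Colab) is to fall back to a plain print when the import fails:

# sketch of a Colab-friendly fallback (assumes myst_nb is simply missing there)
try:
    from myst_nb import glue
except ImportError:
    def glue(name, value, display=True):
        # crude stand-in so the cell still runs outside the book build
        print(f"{name}: {value}")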

1.4.1 Clustering

We’re gonig to zoom into the 99th percentile

gonig => going

we do not label the axes beceause they are arbitrary.

beceause => because

Although in PCA etc. it does seem common to label which principal component you're visualizing.

Yeah, that’s why clustering hard.

Maybe you left out the "is" for effect, but just in case.... :-)

It's not super relevant, but if you were thinking of linking to extra resources, I found Shao et al 2007 a useful MD-focused one and more generally the scikit-learn user guide.

1.4.2 Choosing Cluster Number

We can extract the most centered data points (closest to cluster center) and consider them to be representatitive of the cluster.

representatitive => representative

We might try to gain insight about predicting solubility solubility

Extra solubility

1.6.3. Minimizing Loss

Compute it’s gradient using jax.

its (no apostrophe)

Using the regularized features, show what effect batchsize has on training. Use batchsizes of 1, 8, 32, 256, 1024. Make sure you re-initialize your weights in between each run. Plot the log-loss for each batchsize on the same plot. Describe your results.

I think batch size is typically 2 words?

1.6.4. Clustering

You can still cluster if you have labeles,

labeles => labels

XAI

t-statistics is missing t CDF

urllib missing

Missing in some chapters after the change to self-hosted data

Classification chapter feedback

  • Not so sure about "The classic application of classification is designing new drugs, but it’s widely used in materials and chemistry. Many molecular design problems can be formulated as classification. For example, you can use it to design new organic photovoltaic materials [SZY+19] or antimicrobial peptides [BJW18].". I'm not sure drug design is per se a classification problem. Clearer examples might be Alan's image classification work on the labware, or maybe something with labels that are intrinsically discrete (e.g., oxidation states, or XRPD pattern to structure)
  • Kind of a question of taste: I would perhaps even add an image of the Mark I, as it allows one to explain the same concept in one additional way: having the dials for the potentiometers to tune the weights and photodiodes as the input vector

I liked how you showed the non-differentiability of the hard thresholding and also liked that you showed the need for a baseline model

  • maybe do a find/replace for naieve (#60 )
  • For the imbalanced case, a useful package I found is imbalanced-learn. A review I liked a lot is 10.1109/TKDE.2008.239

Plotting in chapter 1 co-lab notebook

I really enjoyed using the Co-lab notebook, it's very smooth (but for installation...). This is my first time using one. I had some suggestions on the image code.

SVG molecules

SVGs just look nicer. If it doesn't mess up the book rendering, this line might be good in the import cell.

from rdkit.Chem.Draw import IPythonConsole  # in case it isn't already in the import cell
IPythonConsole.ipython_useSVG = True

Iterating over Pandas rows

# pardon this slop, we need to have a list of strings for legends
legend_text = [f'{i}: solubility = {s:.2f}' for i,s in zip(extremes.ID, extremes.Solubility)]

This didn't really seem that sloppy! But you could try the below if you prefer.

legend_text = [f'{x.ID}: solubility = {x.Solubility:.2f}'
               for x in extremes.itertuples()]

Exploiting seaborn

In 1.3.3 Feature Correlation, much of the image code seems to replicate what seaborn is very good at. As seaborn is already imported, I wonder if the below would get the point across more of what you're doing.

Original code

features_start_at = list(soldata.columns).index('MolWt')
feature_names = soldata.columns[features_start_at:]

fig, axs = plt.subplots(nrows=5, ncols=4, sharey=True, figsize=(12, 8), dpi=300)
axs = axs.flatten() # don't want to think about i/j
for i,n in enumerate(feature_names):
    ax = axs[i]
    ax.scatter(
        soldata[n], soldata.Solubility, 
        s = 6, alpha=0.4,
        color = color_cycle[i % len(color_cycle)]) # add some color 
    if i % 4 == 0:
        ax.set_ylabel('Solubility')
    ax.set_xlabel(n)
# hide empty subplots
for i in range(len(feature_names), len(axs)):
    fig.delaxes(axs[i])
plt.tight_layout()
plt.show()

Suggested code

import matplotlib as mpl  # assuming mpl isn't already bound in the imports cell
mpl.rcParams['figure.dpi'] = 300  # maybe move this up to the imports cell

features_start_at = list(soldata.columns).index('MolWt')
feature_names = soldata.columns[features_start_at:]

tidy_data = pd.melt(soldata, id_vars=["Solubility"],
                    value_vars=feature_names,
                    var_name="Feature", value_name="Value")

g = sns.FacetGrid(tidy_data, col="Feature", hue="Feature",
                  col_wrap=4, sharey=True, sharex=False,
                  height=2, aspect=1.8,
                  palette=color_cycle, despine=False,
                  )
g.map(plt.scatter, "Value", "Solubility", s=6, alpha=0.4)
g.set_xlabels("")
g.set_titles(template="{col_name}")
plt.show()

The browser seems to resize the image whenever I try to screenshot it and the code at
the same time, but it gives pretty much what the old code did.

(screenshot of the resulting plot attached)

Edit: that dpi=300 might be why my other images are huge, though.

Use of "language" is ambiguous

Reading the section "Framework Choice", I find the usage of the word "language" ambiguous. The author uses it to refer to any of several programming frameworks (or libraries); however, "language" implies a syntax and grammar and is thus more commonly shorthand for "programming language" in software development instructional materials - at least as published in the USA. Because the author has already selected a programming language, Python, I recommend that the word "language" not also be used to describe an application programming interface, framework, or library.

Chapter 1. Tensors and shapes

I split this up from #26 for clarity. Most feedback is quibbling about language choice, so I broke that up into sections by heading as well.

1. Tensors and shapes

Rank can be defined as the number of indices required to get individual elements of a tensor.

Matrix rank is a concept from linear algebra and has nothing to do with tensor rank.

I suppose using tensor order to distinguish from matrix rank is non-conventional? It's just a little confusing when linear algebra is fairly pertinent to a lot of machine learning.
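For example, a small numpy illustration of the two meanings (the array is just a throwaway):

import numpy as np

a = np.ones((3, 3))
a.ndim                    # 2 -- "tensor rank" in the book's sense: number of indices
np.linalg.matrix_rank(a)  # 1 -- linear-algebra rank: the rows are all linearly dependent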

A euclidian vector

A Euclidean vector?

1.2.1 reduction operations

sum(a, axis=0)

I understand that numpy is implied, but sum is also a built-in function, and the built-in does not take axis as a keyword argument. Perhaps np.sum would be clearer?
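A quick illustration (a is just a throwaway array here):

import numpy as np

a = np.arange(6).reshape((2, 3))
np.sum(a, axis=0)   # array([3, 5, 7]) -- reduces along the first axis
# sum(a, axis=0)    # TypeError: the built-in sum() takes no keyword arguments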

I've not used Jupyter Book before -- if intersphinx works on it, it could be a really great way to auto-link to other libraries' documentation.

1.4 Modifying rank

In tensorflow and jax there is expand_dims 😠 You can also use reshape and ignore newaxis

I'm not really sure how to interpret this angry face... 😅 numpy has expand_dims too!

And I think jax.numpy.newaxis also works?

>>> import jax.numpy as jnp
>>> arr = jnp.arange(12).reshape((3, 4))
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
>>> arr[..., jnp.newaxis].shape
(3, 4, 1)
>>> arr[jnp.newaxis].shape
(1, 3, 4)
>>> arr.reshape((2, 6))
DeviceArray([[ 0,  1,  2,  3,  4,  5],
             [ 6,  7,  8,  9, 10, 11]], dtype=int32)

1.4.1 Reshaping

There is one special syntax element to shaping: -1 dimensions. -1 can appear once in a reshape command and means to have the computer figure out what goes there by following the rule that the number of elements in the tensor must remain the same. Let’s see some examples.

I found the second sentence hard to follow. Would some punctuation and formatting help?

-1 can appear once in a reshape command. It tells the computer to figure out what goes there by following the rule that the total number of elements in the tensor must remain the same.

Below I also suggest my personal understanding of -1.

-1 can appear once in a reshape command. Because the total number of elements in the reshaped tensor must remain the same, -1 stands for "everything else".
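A couple of throwaway examples of what I mean by "everything else":

import numpy as np

a = np.arange(12)
a.reshape((3, -1)).shape   # (3, 4): -1 becomes whatever keeps 12 elements
a.reshape((-1, 6)).shape   # (2, 6)
a.reshape((-1,)).shape     # (12,): flattening is the common use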

1.5 Chapter Summary

There are operations that reduce ranks of tensors, like sum or mean.

This can be a pain to do consistently, but for functions that are also English words I like to format functions in code-style (i.e. like sum or mean) to distinguish when I'm talking about code vs maths.

1.6.2 Reductions

Just a suggestion: many undergraduate courses check code function by providing some test cases and example answers. It might be nice to provide an example input-output pair in Markdown code so people can check their work. e.g.

input_arr = np.array([1, 10, 2, 3])
output_arr = np.array([0.0625, 0.625 , 0.125 , 0.1875])
assert np.all(normalize_vector(input_arr) == output_arr)

1.6.3 Broadcasting

write python code to compute their outter product.

outter -> outer

You have a tensor of unknown rank A and would like to subtract 3.5, and 2.5 from every element so that your output, which is a new tensor B, is rank of rank(A) + 1. The last axis of B should be dimension 2.

I had to read over this a few times before coming to an interpretation that accounted for every word. Is this what you're after?

>>> import numpy as np
>>> subtract_twice = lambda x: np.stack([x, x], axis=-1) - [3.5, 2.5]
>>> a = np.arange(24).reshape((6, 4))
>>> subtract_twice(a).shape
(6, 4, 2)
>>> subtract_twice(a.reshape((2, 3, 4))).shape
(2, 3, 4, 2)

If so, can I please suggest breaking up the question for more immediate understanding? It's more repetitive but people don't typically read textbooks for the prose.

You have a tensor of unknown rank A. You would like to perform two operations on it at once: firstly, subtracting 3.5 from every element; and secondly, subtracting 2.5 from every element. Your output should combine the results and should be a new tensor B with rank rank(A) + 1. Hint: the new axis of B should be last, and should have dimension 2.

(I originally sat wondering how subtracting 6 should give me dimension 2).

Section Numbering

Presumably the book may not be mature enough yet to assign a definitive numbering to the chapters, but perhaps it would be a good idea to assign some temporarily so that we can reference sections (and subsections) in class/our notes, and get an idea of where different subsections start and end.

Co-lab installation: chapter 1

I did not find the package installation cell to work immediately, likely due to an update in RDKit since you wrote it. It's in section "1.3. Running This Notebook".

The first issue is that "sklearn" is installed -- I think the package to install is actually scikit-learn.

Secondly, I had significant trouble getting rdkit to work properly (version 2021.03.4), and it seems this is not unique to me:

     11 
     12 from rdkit import rdBase
---> 13 from rdkit.DataStructs import cDataStructs
     14 from rdkit.DataStructs.cDataStructs import *
     15 
ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.26' not found (required by /usr/local/lib/python3.7/site-packages/rdkit/DataStructs/cDataStructs.so)

Pinning RDKit to an older version as suggested worked in my case. Just installing the missing packages with apt did not. I also bunched all the numeric-y libraries together in conda, as I've found it figures out dependencies better. (Pinning other packages might extend the notebook's lifespan in Google Co-lab, too.)

Suggested solution

!wget -c https://repo.continuum.io/miniconda/Miniconda3-py37_4.8.3-Linux-x86_64.sh
!chmod +x Miniconda3-py37_4.8.3-Linux-x86_64.sh
!time bash ./Miniconda3-py37_4.8.3-Linux-x86_64.sh -b -f -p /usr/local/

import sys
sys.path.append('/usr/local/lib/python3.7/site-packages/')
 
!conda install -y -c conda-forge rdkit=2020.09.2 numpy jaxlib jax pandas scikit-learn
!pip install jupyter-book matplotlib seaborn tabulate

NFlow Chapter

Need to seed the random numbers (or pin a version?). Something is different between GitHub Actions and my local run.

Equation for loss

In section B 1.3.4, the loss function is defined as jnp.sqrt(jnp.mean((y - labels)**2)), where y is the predictions from the linear model. Note that here the loss is equal to the square root of the mean. However, in equation 1.16, the loss seems to be defined as simply the mean of the squared differences, with no square root. I presume that equation 1.16 needs to have L² on the LHS for the two loss definitions to be consistent.
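To spell out what I mean (toy numbers, purely illustrative):

import jax.numpy as jnp

# made-up predictions and labels, just to show the mismatch
y = jnp.array([1.0, 2.0, 3.0])
labels = jnp.array([1.5, 1.5, 2.0])

rmse = jnp.sqrt(jnp.mean((y - labels) ** 2))  # what the notebook code computes
mse = jnp.mean((y - labels) ** 2)             # what equation 1.16 states as written
print(jnp.isclose(rmse ** 2, mse))            # True: consistent only if the LHS of eq. 1.16 is L squared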

QM9 App

8 is better than 16 for nearest power.

standard layers feedback

  • I think it is very important that you brought up equivariance/invariance. However, I feel that one could be a bit clearer about the fact that pooling makes convolutions approximately translation invariant (and perhaps use fig. 9.8 from https://www.deeplearningbook.org/contents/convnets.html)
  • I like that you put batchnorm in the regularization section. Perhaps it is a nice place to make the connection between randomness and regularization and to cite some of the works that postulate that batchnorm acts as a regularizer

Regression chapter feedback

from #47 (which I'll now close)

  • Formally, standardization should only be done on the training data (discussed nicely in Elements of Statistical Learning). For this reason, I would maybe do the split before the standardization (see the sketch after this list).
  • I'd perhaps be a bit less general with "L2 gives a better model and L1 gives a more interpretable result by zeroing features", as L1 also needs some tuning of lambda to make things really zero. Do you have a reference that shows L2 gives better performance than L1? My intuition would have been "it depends"
  • an awesome resource for a deeper dive into model selection is https://arxiv.org/abs/1811.12808
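A minimal sketch of the first point, using scikit-learn and made-up data:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# made-up features/labels standing in for the real dataset
X = np.random.rand(100, 5)
y = np.random.rand(100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)  # statistics come from the training split only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)   # test data reuses the training mean/std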

ML Intro Chapter Feedback

  • I'm not sure about the learning problem definitions. RL seems to be missing, and semi/self-supervised learning sits somewhere in between.
  • Maybe it is good to add a note that if EDA guides the model selection, the train/test split needs to happen before EDA to avoid data leakage
  • For teaching I would probably use the pandas bracket notation instead of the dot notation, as it always works (see the quick example after this list)
  • Might be interesting to add a note or exercise on why “batch size is usually as a power of 2 (e.g., 16, 128)”
  • It might also be interesting to hint at the research about the link between SGD (and the randomness it introduces) and regularization and generalization
  • “Here we assume ${y_i}$ is a class variable and try to separate our features into these classes” might not capture the case of density-based clustering well; similarly “1. We say that clustering is a type of unsupervised learning and that it predicts the labels. What exactly are the predicted labels in clustering? Write down what the predicted labels might look like for a few data points.”
  • I feel it could be really useful to highlight or hint at some caveats of graph-embedding-based clusterings (UMAP/t-SNE)
  • A reference for the zoo of methods to select k in k-means: https://journals.sagepub.com/doi/abs/10.1243/095440605x8298

from @kjappelbaum
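On the bracket vs. dot notation point, a tiny example (the DataFrame here is made up):

import pandas as pd

# stand-in for soldata, just to show the two access styles
df = pd.DataFrame({"Solubility": [-3.6, -2.1], "Molecular Weight": [180.2, 46.1]})

df["Solubility"]         # bracket notation: always works
df["Molecular Weight"]   # still works with a space in the column name
df.Solubility            # dot notation: fails for names with spaces or that shadow DataFrame attributes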

Chapter 16 (Predicting DFT Energies with GNNs) Feedback

  1. In table 16.1 (Label Description), the description for index 7 should be: Energy of lowest unoccupied molecular orbital (LUMO). Currently it says occupied.
  2. You used contractions in multiple places ("it's, we'll, I'll, we've", etc.). I personally don't like this language in books; I would suggest using "it is, we will, I will, we have", etc.
  3. On page 2 of chapter 16, in the paragraph "These are Tensorflow Tensors. They can be converted to numpy arrays via x.numpy(). The first item is the element vector 6,1,1,1,1. Do you recognize the elements? It’s C, H, H, H, H. The positions come next. Note that the there is an extra column containing the atom partial charges, which we will not use as a feature. Finally, the last tensor is the label vector.", there is an extra "the" in the sentence "Note that the there is an extra column containing the atom partial charges, which we will not use as a feature."
  4. For each plot, you should add units to the x- and y-axis labels where applicable, e.g., Val Loss, Epoch, Energy, and predicted Energy.
  5. There should be a comma after "thus" in the sentence "Thus we start at high learning rate and decrease." in the paragraph "This is poor performance, but it gives us a baseline value of what we can expect. One unusual detail I did in this training was to slowly reduce the learning rate. This is because our features and labels are all in different magnitudes. Our weights need to move far to get into the right order of magnitude and then need to fine-tune a little. Thus we start at high learning rate and decrease."
  6. It should be "variances" in the sentence "You can see that the broad trends about molecule size capture a lot of variance, but more work needs to be done."
  7. There should be a comma after "finally" in the sentence "The global node aggregation will also be a sum. Finally we have our graph feature vector update:"

string vs. graph representation

I think you mention that there is currently no paper on this, and I think I agree with that.
A nice hint is the guacamol leaderboard https://www.benevolent.com/guacamol and when we tested on SELFIES we also didn't find it to outperform the graph-based models. Clearly it is also related to the modeling, but as you write, I also think that the representation matters.

Would be interesting to do a proper benchmark, where one also considers the latent space of a model trained on string-based representation (e.g., a VAE, maybe trained jointly with some properties)

Intersphinx

A method to link to external docs - could be used to link to keras, TF, jax docs.
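A rough sketch of what the mapping could look like in the Sphinx config (numpy, jax, and scikit-learn publish Sphinx inventories; I'm not sure TF or Keras do):

# in conf.py, or under sphinx: config: in the Jupyter Book _config.yml
extensions = ["sphinx.ext.intersphinx"]

intersphinx_mapping = {
    "numpy": ("https://numpy.org/doc/stable/", None),
    "jax": ("https://jax.readthedocs.io/en/latest/", None),
    "sklearn": ("https://scikit-learn.org/stable/", None),
}
# a cross-reference like {func}`numpy.sum` in MyST would then link to the numpy docs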

VAE Wording

The language is a little unclear at the beginning of the VAE chapter, and there is a hat missing on P(x) later on in the details section.

Overview

Hi @whitead, thanks for writing this book. It's neatly laid out and easy for beginners like me to understand. I hope you don't mind that I took LiveComSJ's tweet asking for feedback seriously.

Overview

Deep learning is specifically about connecting two types of data with a neural network function, which is differentiable and able to approximate any function. The classic example is connecting function and structure in molecules.

Suggestion: The classic example is connecting molecular structure to its function.

Reason: "function" just meant a totally different thing in the sentence before. Reading "The classic example is connecting function" without the molecular context meant that I first thought of "connecting function" as a noun, i.e., a mapping from one space to another.

it’s ability to generate new data.

its, no apostrophe

One example that sets deep learning apart from machine learning is in feature engineering.

I think many people would argue that deep learning is a form of machine learning. If I understand correctly, the next part of the paragraph argues that deep learning does not need feature engineering? Perhaps "One advantage that sets deep learning apart from other machine learning techniques is that it does not require feature engineering"?

Previously training and using models in machine learning was a tedious process and required deriving equations for each model change.

Comma after previously?

Deep learning is always a little tied-up in the implementation details. Thus, language choice can be a part of the learning process. In this book, we use Jax, Tensorflow, Keras, and scikit-learn for different purposes.

It was a little odd for me that you jumped straight to libraries instead of saying Python first, considering that the section is titled "Language choice". Throughout this section you also refer to them all as languages. I would personally consider all of those to be "frameworks" rather than languages.

Stackoverflow

Stack Overflow?
