nickbrazeau / discent Goto Github PK

View Code? Open in Web Editor NEW

1.0 2.0 0.0 22.34 MB

Gradient descent for estimating inbreeding spatial coefficients from pairwise-IBD estimates

License: Other

R 47.81% C++ 52.04% C 0.14%

gradient-descent identity-by-descent spatial

discent's Introduction

DISCent

Overview

This is a simple package with a single function that runs a simple isolation by distance and identity by descent model for producing deme inbreeding spatial coefficients (DISCs). By theory, genetic relatnedness is strictly limited to measure of identity by descent. For a full mathematical overview of the simple model, please see the supplements from Brazeau et al. 2022.

Briefly, in the early 1970s, Malécot described a relationship between genetic relatedness and physical distance under the premise that pairs that are far apart are less likely to mate: isolation by distance. Capitalizing on this framework by using measures of identity by descent (IBD), we produce a deme inbreeding spatial coefficient (DISC) for point process data in continuous space. Essentially, this measure estimates the amount of "inbreeding" or within deme relatedness under the assumption that relatedness between two locations in space can be measured from the average pairwise IBD between those two locations conditional on the physical distance that seperates them. Further, we assume that geographic distance is scaled by a migration rate, which is a global parameter among all spatial locations (i.e. the migration rate is assumed to be constant through space). As a result, the model produces a DISC for each spatial location considered (user defined) and an overall migration rate parameter (m). Parameter optimization is achieved with a "vanilla" gradient descent algorithm. Given that "vanilla" gradient descent is used, we recommend running numerous models from a wide of array of start parameters to ensure that your final results are the global maximum and not a local maximum (i.e. a false peak).

Running the Model

For instructions on running the model, please see our vignettes: Vignettes

IMPORTANT NOTES

⚠️ This model has planned development and is not guaranteed to be backwards compatible. Version updates that are not backwards compatible will be noted in the NEWS.md file.

discent's People

Contributors

Stargazers

Watchers

discent's Issues

simulated annealing?

Cpp vector redef

Why does this produce a vector pushback in R for cost that is all the same?

this code works:

vector<double> cost(steps);
...
  // fill in for class
  discParticle.cost = cost;

this code produces single cost, not moving:

...
  // fill in for class
  discParticle.cost = vector<double>(steps);

NaN in m

Reprex:

library(discent)

load("../data/discent_sim_dat.rda")
input <- discent_sim_dat$discent_dat %>%
dplyr::filter(locat1 != locat2)
our_start_params <- rep(0.1, 30) # 100 sim clusters
names(our_start_params) <- 1:30
our_start_params <- c(our_start_params, "m" = 1e-5)
ret <- discent::deme_inbreeding_spcoef(K_gendist_geodist = input,
start_params = our_start_params,
f_learningrate = 1e-5,
m_learningrate = 1e-5,
momentum = 0.9,
steps = 1e4,
report_progress = T,
return_verbose = T)

Overflow cost catch

Costs pushing back to R result in NaN from overflow - need catch

cost not moving

Bug? Getting stuck here?

library(tidyverse)
library(discent)


input <- readRDS("data/sim_data/sim_gengeodat.rds") %>%
  dplyr::filter(q == "mq_good")
input <- input$gengeodat[[1]] %>%
  dplyr::filter(locat1 != locat2)

# misc
our_start_params <- rep(0.1, 100) # 100 sim clusters
names(our_start_params) <- 1:100
our_start_params <- c(our_start_params, "m" = 1e-2)
# run
ret <- discent::deme_inbreeding_spcoef(K_gendist_geodist = input,
                                       start_params = our_start_params,
                                       m_lowerbound = -.Machine$double.xmax,
                                       m_upperbound = 100,
                                       f_learningrate = 1e-3,
                                       m_learningrate = 1e-3,
                                       steps = 1e4,
                                       report_progress = T,
                                       return_verbose = T)
ret$m_run
firuns <- ret$fi_run %>% 
  do.call("rbind.data.frame", .)
names(firuns) <- ""
firuns[,1]
ret$Final_m

plotdat <- ret$cost[2:length(ret$cost)]
plot(plotdat, ylim = c(139, 141))
hist(ret$Final_Fis)

overflow +/- underflow catch - logical?

// Catch and Cap Extreme Costs
if (cost[0] > OVERFLO_DOUBLE | isnan(cost[0])) {
cost[0] = OVERFLO_DOUBLE;
}

adaptive learning rate

Implemented two different methods (code needs to be reviewed):
1. ADAdelta method (no learning rate tuning needed)
2. ADAgrad w/ normalization (learning rate tuning still needed)

The advantage of each is that learning rate should decay with iterations leading to more stable convergence. Disadvantage in (1) and (2) is that learning rate decays extremely fast... as our gradients are large in summing the upper/lower triangle....and ends up not accounting for information that can still be "learned" from the gradient

Have left for future enhancement considerations

Need clever "escape"

Need to have a clever escape when the cost reaches a minimum --- we are having this oscillation issue

This looks relatively common and momentum is an easy fix?
https://towardsdatascience.com/gradient-descent-algorithms-and-adaptive-learning-rate-adjustment-methods-79c701b086be

tuning for starting params

having a tuning setting for starting params so can escape seeds that aren't going to yield much

Negative Fis

Able to get negative values of Fi with code below (my computer seed)

there is nothing mathematically stopping Fi from being negative. conceptually does it make sense? think of relatedness as bounded but this is a inbreeding coeff

library(discent)
set.seed(48)
#......................
# sim
#......................
distmat <- matrix(c(0,100,200,100,0,300,200,300,0), nrow = 3, ncol = 3)
input <-  discent::sim_IBDIBD(demesize = c(3,5,5),
                              distmat = distmat,
                              rate = 1e-2,
                              Ft = 0.3)
input <- input %>%
  dplyr::filter(locat1 != locat2)

#......................
# run discent
#......................
our_start_params <- rep(0.2, 3)
names(our_start_params) <- 1:3
our_start_params <- c(our_start_params, "m" = 1e-5)
ret <- discent::deme_inbreeding_spcoef(K_gendist_geodist = input,
                                       start_params = our_start_params,
                                       f_learningrate = 1e-5,
                                       m_learningrate = 1e-10,
                                       momentum = 0.9,
                                       steps = 1e4,
                                       report_progress = T,
                                       return_verbose = T)
ret$Final_m
ret$Final_Fis

# cost
plot(ret$cost)
summary(ret$cost)
ret$cost[length(ret$cost)]
plot(ret$cost[(length(ret$cost)-500):length(ret$cost)])

# runs
plot(ret$m_run)
plot(ret$fi_run[,1])
plot(ret$fi_run[,3])
# updates
plot(ret$m_update)
plot(ret$fi_update[,1])
plot(ret$fi_update[,3])

simulation confounding & learning rate

interplay between what is a DISC with space? our simulation not producing intuitive results 2/2 confounding of geographic distance and IBD?

separately, there appears to be an adaptive learning rate issue, where we just need to find the right spot? Related to #1 and #3

library(tidyverse)
library(discent)


fulldat <- readRDS("data/sim_data/sim_gengeodat.rds") %>%
  dplyr::filter(q == "coi")
input <- fulldat$gengeodat[[1]] %>%
  dplyr::filter(locat1 != locat2)

# misc
our_start_params <- rep(0.1, 100) # 100 sim clusters
names(our_start_params) <- 1:100
our_start_params <- c(our_start_params, "m" = 1e-2)
# run
ret <- discent::deme_inbreeding_spcoef(K_gendist_geodist = input,
                                       start_params = our_start_params,
                                       m_lowerbound = -.Machine$double.xmax,
                                       m_upperbound = 100,
                                       f_learningrate = 1e-3,
                                       m_learningrate = 1e-6,
                                       steps = 1e5,
                                       report_progress = T,
                                       return_verbose = T)
plot(ret$m_run)
firuns <- ret$fi_run %>% 
  do.call("rbind.data.frame", .)
plot(firuns[,1])
ret$Final_m

plotdat <- ret$cost[1e4:length(ret$cost)]
plot(plotdat)
hist(ret$Final_Fis)

# plot 
locats <- readRDS("data/sim_data/lattice_model.rds") %>% 
  dplyr::select(-c("migration"))
finalfs <- tibble::tibble(deme = ret$deme_key$Deme,
                          finalf = ret$Final_Fis)

dplyr::left_join(locats, finalfs) %>% 
  ggplot() + 
  geom_point(aes(x = longnum, y = latnum, color = finalf),
             size = 2, alpha = 0.8) + 
  scale_color_viridis_c() + 
  theme_void()

plot_realIBD_discent(gengeodat = fulldat$gengeodat[[1]],
                     discentret = ret,
                     locats = locats)

is this confounding a signal of our expectation not being met --> are those important clusters? or is this just a bug... or even a limitation?

"Not real" Inbreeding Coefficients

Realized needed to remove the bounds [0,1] on the inbreeding coefficient estimate (y_{ij}) --> inappropriate "convergence"
Now, getting totally correct inbreeding estimates, they are just not bounded by [0,1]

A minimal reprex is below:
NB this reprex ignores the property of transitivity, but the model shouldn't know/care about that.

I am stumped how to get this bounded if you have any ideas, @bob. I was thinking a logit space transformation, but then that messes with the migration rate -- which may be OK if we just call that a nuissance parameter?

set.seed(48)
library(tidyverse)
library(discent)


locats2 <- tibble::tibble(deme = c(1,  2,  3,  4,  5),
                          long = c(0, -50, 50, -50, 50),
                          lat  = c(0, -50, -50, 50, 50))

gendat <- tibble::tibble(deme = c(1,  2,  3,  4,  5),
                         smpl1 = letters[1:5],
                         fibd = c(0, -50, 50, -50, 50),
                         lat  = c(0, -50, -50, 50, 50))


gendat <- as.data.frame(t(combn(x = 1:5, m = 2))) %>% 
  dplyr::mutate(fibd = ifelse(V1 == 1 | V2 == 1, 1, 0),
                geodist = ifelse(V1 == 1 | V2 == 1, 50, 100))


# viz
locats_x <- locats2 %>% 
  dplyr::rename(V1 = deme)
locats_y <- locats2 %>% 
  dplyr::rename(V2 = deme)

gendat %>% 
  dplyr::left_join(., locats_x, by = "V1") %>% 
  dplyr::left_join(., locats_y, by = "V2") %>% 
  ggplot() + 
  geom_segment(aes(x = long.x, xend = long.y,
                   y = lat.x, yend = lat.y,
                   size = fibd))

# make input
input <- gendat %>% 
  dplyr::rename(smpl1 = V1,
                smpl2 = V2,
                gendist = fibd) %>% 
  dplyr::mutate(locat1 = smpl1,
                locat2 = smpl2) %>% 
  dplyr::select(c("smpl1", "smpl2", "locat1", "locat2", "gendist", "geodist"))
# run discent 
our_start_params <- rep(0.5, 5) 
names(our_start_params) <- 1:5
our_start_params <- c(our_start_params, "m" = 1e-2)
# run
ret <- discent::deme_inbreeding_spcoef(K_gendist_geodist = input,
                                       start_params = our_start_params,
                                       f_learningrate = 1e-3,
                                       m_learningrate = 1e-8,
                                       steps = 1e4,
                                       report_progress = T,
                                       return_verbose = T)
ret$Final_Fis
plot(tail(ret$cost))
plot(tail(ret$cost))
plot(tail(ret$m_run))
ret$Final_m

# rij
exp(ret$Final_m * -50) * (ret$Final_Fis[1] + ret$Final_Fis[2])/2

dij <- 50
exp(-log(0.5) * (1/50) * dij) * (1 + 0)/2

Find Optimal Start Parameters

Ideally would have a better system for looking through start parameters than a search grid.

Wrote a simulated annealer function but required too much tuning to actually save time. Could better optimize it's moves or abandon SA and use a different approach that resembles hyperparameter tuning

SHA: 42c556b

NaN cost

Catch did not work for NaN cost

Think this is an overflow issue with a very large number is exponentiated in the gradient