Git Product home page Git Product logo

discent's Introduction

DISCent

main_build main_build Code Coverage

Overview

This is a simple package with a single function that runs a simple isolation by distance and identity by descent model for producing deme inbreeding spatial coefficients (DISCs). By theory, genetic relatnedness is strictly limited to measure of identity by descent. For a full mathematical overview of the simple model, please see the supplements from Brazeau et al. 2022.

Briefly, in the early 1970s, Malécot described a relationship between genetic relatedness and physical distance under the premise that pairs that are far apart are less likely to mate: isolation by distance. Capitalizing on this framework by using measures of identity by descent (IBD), we produce a deme inbreeding spatial coefficient (DISC) for point process data in continuous space. Essentially, this measure estimates the amount of "inbreeding" or within deme relatedness under the assumption that relatedness between two locations in space can be measured from the average pairwise IBD between those two locations conditional on the physical distance that seperates them. Further, we assume that geographic distance is scaled by a migration rate, which is a global parameter among all spatial locations (i.e. the migration rate is assumed to be constant through space). As a result, the model produces a DISC for each spatial location considered (user defined) and an overall migration rate parameter (m). Parameter optimization is achieved with a "vanilla" gradient descent algorithm. Given that "vanilla" gradient descent is used, we recommend running numerous models from a wide of array of start parameters to ensure that your final results are the global maximum and not a local maximum (i.e. a false peak).

Running the Model

For instructions on running the model, please see our vignettes: Vignettes

IMPORTANT NOTES

⚠️ This model has planned development and is not guaranteed to be backwards compatible. Version updates that are not backwards compatible will be noted in the NEWS.md file.

discent's People

Contributors

nickbrazeau avatar

Stargazers

 avatar

Watchers

 avatar  avatar

discent's Issues

Cpp vector redef

Why does this produce a vector pushback in R for cost that is all the same?

this code works:

vector<double> cost(steps);
...
  // fill in for class
  discParticle.cost = cost;

this code produces single cost, not moving:

...
  // fill in for class
  discParticle.cost = vector<double>(steps);

NaN in m

Reprex:

library(discent)

load("../data/discent_sim_dat.rda")
input <- discent_sim_dat$discent_dat %>%
dplyr::filter(locat1 != locat2)
our_start_params <- rep(0.1, 30) # 100 sim clusters
names(our_start_params) <- 1:30
our_start_params <- c(our_start_params, "m" = 1e-5)
ret <- discent::deme_inbreeding_spcoef(K_gendist_geodist = input,
start_params = our_start_params,
f_learningrate = 1e-5,
m_learningrate = 1e-5,
momentum = 0.9,
steps = 1e4,
report_progress = T,
return_verbose = T)

cost not moving

Bug? Getting stuck here?

library(tidyverse)
library(discent)


input <- readRDS("data/sim_data/sim_gengeodat.rds") %>%
  dplyr::filter(q == "mq_good")
input <- input$gengeodat[[1]] %>%
  dplyr::filter(locat1 != locat2)

# misc
our_start_params <- rep(0.1, 100) # 100 sim clusters
names(our_start_params) <- 1:100
our_start_params <- c(our_start_params, "m" = 1e-2)
# run
ret <- discent::deme_inbreeding_spcoef(K_gendist_geodist = input,
                                       start_params = our_start_params,
                                       m_lowerbound = -.Machine$double.xmax,
                                       m_upperbound = 100,
                                       f_learningrate = 1e-3,
                                       m_learningrate = 1e-3,
                                       steps = 1e4,
                                       report_progress = T,
                                       return_verbose = T)
ret$m_run
firuns <- ret$fi_run %>% 
  do.call("rbind.data.frame", .)
names(firuns) <- ""
firuns[,1]
ret$Final_m

plotdat <- ret$cost[2:length(ret$cost)]
plot(plotdat, ylim = c(139, 141))
hist(ret$Final_Fis)

adaptive learning rate

  • Implemented two different methods (code needs to be reviewed):
    1. ADAdelta method (no learning rate tuning needed)
    2. ADAgrad w/ normalization (learning rate tuning still needed)

The advantage of each is that learning rate should decay with iterations leading to more stable convergence. Disadvantage in (1) and (2) is that learning rate decays extremely fast... as our gradients are large in summing the upper/lower triangle....and ends up not accounting for information that can still be "learned" from the gradient

Have left for future enhancement considerations

Negative Fis

Able to get negative values of Fi with code below (my computer seed)

  • there is nothing mathematically stopping Fi from being negative. conceptually does it make sense? think of relatedness as bounded but this is a inbreeding coeff
library(discent)
set.seed(48)
#......................
# sim
#......................
distmat <- matrix(c(0,100,200,100,0,300,200,300,0), nrow = 3, ncol = 3)
input <-  discent::sim_IBDIBD(demesize = c(3,5,5),
                              distmat = distmat,
                              rate = 1e-2,
                              Ft = 0.3)
input <- input %>%
  dplyr::filter(locat1 != locat2)

#......................
# run discent
#......................
our_start_params <- rep(0.2, 3)
names(our_start_params) <- 1:3
our_start_params <- c(our_start_params, "m" = 1e-5)
ret <- discent::deme_inbreeding_spcoef(K_gendist_geodist = input,
                                       start_params = our_start_params,
                                       f_learningrate = 1e-5,
                                       m_learningrate = 1e-10,
                                       momentum = 0.9,
                                       steps = 1e4,
                                       report_progress = T,
                                       return_verbose = T)
ret$Final_m
ret$Final_Fis

# cost
plot(ret$cost)
summary(ret$cost)
ret$cost[length(ret$cost)]
plot(ret$cost[(length(ret$cost)-500):length(ret$cost)])

# runs
plot(ret$m_run)
plot(ret$fi_run[,1])
plot(ret$fi_run[,3])
# updates
plot(ret$m_update)
plot(ret$fi_update[,1])
plot(ret$fi_update[,3])

simulation confounding & learning rate

interplay between what is a DISC with space? our simulation not producing intuitive results 2/2 confounding of geographic distance and IBD?

separately, there appears to be an adaptive learning rate issue, where we just need to find the right spot? Related to #1 and #3

library(tidyverse)
library(discent)


fulldat <- readRDS("data/sim_data/sim_gengeodat.rds") %>%
  dplyr::filter(q == "coi")
input <- fulldat$gengeodat[[1]] %>%
  dplyr::filter(locat1 != locat2)

# misc
our_start_params <- rep(0.1, 100) # 100 sim clusters
names(our_start_params) <- 1:100
our_start_params <- c(our_start_params, "m" = 1e-2)
# run
ret <- discent::deme_inbreeding_spcoef(K_gendist_geodist = input,
                                       start_params = our_start_params,
                                       m_lowerbound = -.Machine$double.xmax,
                                       m_upperbound = 100,
                                       f_learningrate = 1e-3,
                                       m_learningrate = 1e-6,
                                       steps = 1e5,
                                       report_progress = T,
                                       return_verbose = T)
plot(ret$m_run)
firuns <- ret$fi_run %>% 
  do.call("rbind.data.frame", .)
plot(firuns[,1])
ret$Final_m

plotdat <- ret$cost[1e4:length(ret$cost)]
plot(plotdat)
hist(ret$Final_Fis)

# plot 
locats <- readRDS("data/sim_data/lattice_model.rds") %>% 
  dplyr::select(-c("migration"))
finalfs <- tibble::tibble(deme = ret$deme_key$Deme,
                          finalf = ret$Final_Fis)

dplyr::left_join(locats, finalfs) %>% 
  ggplot() + 
  geom_point(aes(x = longnum, y = latnum, color = finalf),
             size = 2, alpha = 0.8) + 
  scale_color_viridis_c() + 
  theme_void()

plot_realIBD_discent(gengeodat = fulldat$gengeodat[[1]],
                     discentret = ret,
                     locats = locats)


is this confounding a signal of our expectation not being met --> are those important clusters? or is this just a bug... or even a limitation?

"Not real" Inbreeding Coefficients

  • Realized needed to remove the bounds [0,1] on the inbreeding coefficient estimate (y_{ij}) --> inappropriate "convergence"
  • Now, getting totally correct inbreeding estimates, they are just not bounded by [0,1]

A minimal reprex is below:
NB this reprex ignores the property of transitivity, but the model shouldn't know/care about that.

I am stumped how to get this bounded if you have any ideas, @bob. I was thinking a logit space transformation, but then that messes with the migration rate -- which may be OK if we just call that a nuissance parameter?

set.seed(48)
library(tidyverse)
library(discent)


locats2 <- tibble::tibble(deme = c(1,  2,  3,  4,  5),
                          long = c(0, -50, 50, -50, 50),
                          lat  = c(0, -50, -50, 50, 50))

gendat <- tibble::tibble(deme = c(1,  2,  3,  4,  5),
                         smpl1 = letters[1:5],
                         fibd = c(0, -50, 50, -50, 50),
                         lat  = c(0, -50, -50, 50, 50))


gendat <- as.data.frame(t(combn(x = 1:5, m = 2))) %>% 
  dplyr::mutate(fibd = ifelse(V1 == 1 | V2 == 1, 1, 0),
                geodist = ifelse(V1 == 1 | V2 == 1, 50, 100))


# viz
locats_x <- locats2 %>% 
  dplyr::rename(V1 = deme)
locats_y <- locats2 %>% 
  dplyr::rename(V2 = deme)

gendat %>% 
  dplyr::left_join(., locats_x, by = "V1") %>% 
  dplyr::left_join(., locats_y, by = "V2") %>% 
  ggplot() + 
  geom_segment(aes(x = long.x, xend = long.y,
                   y = lat.x, yend = lat.y,
                   size = fibd))

# make input
input <- gendat %>% 
  dplyr::rename(smpl1 = V1,
                smpl2 = V2,
                gendist = fibd) %>% 
  dplyr::mutate(locat1 = smpl1,
                locat2 = smpl2) %>% 
  dplyr::select(c("smpl1", "smpl2", "locat1", "locat2", "gendist", "geodist"))
# run discent 
our_start_params <- rep(0.5, 5) 
names(our_start_params) <- 1:5
our_start_params <- c(our_start_params, "m" = 1e-2)
# run
ret <- discent::deme_inbreeding_spcoef(K_gendist_geodist = input,
                                       start_params = our_start_params,
                                       f_learningrate = 1e-3,
                                       m_learningrate = 1e-8,
                                       steps = 1e4,
                                       report_progress = T,
                                       return_verbose = T)
ret$Final_Fis
plot(tail(ret$cost))
plot(tail(ret$cost))
plot(tail(ret$m_run))
ret$Final_m

# rij
exp(ret$Final_m * -50) * (ret$Final_Fis[1] + ret$Final_Fis[2])/2

dij <- 50
exp(-log(0.5) * (1/50) * dij) * (1 + 0)/2


Find Optimal Start Parameters

Ideally would have a better system for looking through start parameters than a search grid.

Wrote a simulated annealer function but required too much tuning to actually save time. Could better optimize it's moves or abandon SA and use a different approach that resembles hyperparameter tuning

SHA: 42c556b

NaN cost

Catch did not work for NaN cost

Think this is an overflow issue with a very large number is exponentiated in the gradient

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.