emjun / rtisane


Author statistical models from domain knowledge

Home Page: https://rtisane.tisane-stats.org/

License: Apache License 2.0

R 59.84% HTML 0.26% JavaScript 0.22% CSS 0.01% Python 39.66%

generalized-linear-models

rtisane's Introduction

rTisane

rTisane is an R library and interactive system for authoring statistical models from domain knowledge. rTisane provides (i) a domain-specific language for expressing conceptual assumptions and (ii) an interactive disambiguation process for deriving an output statistical model.

rTisane is designed for analysts who have domain expertise, are not statistical experts, and are comfortable with minimal programming (e.g., many researchers). Currently, rTisane supports authoring generalized linear models.

rTisane is an active research project at UCLA Computer Science, previously at the University of Washington.

Our ACM CHI 2024 paper describes the design, development, and evaluation of rTisane:

@inproceedings{jun2023rTisane,
  title={rTisane: Externalizing Conceptual Models for Data Analysis Prompts Reconsideration of Domain Assumptions and Facilitates Statistical Modeling},
  author={Jun, Eunice and Misback, Edward and Heer, Jeffrey and Just, Ren{\'e}},
  booktitle={Proceedings of the 2024 ACM Conference on Human Factors in Computing Systems (CHI)},
  pages={1--16},
  year={2024}
}

Get in touch, especially if you are using rTisane, want to use it, or want to collaborate!

rtisane's Issues

Add summary() code in output model script

Add more examples

Examples where there is no hypothesize (invalidate early check)

Update Internal API docs

🐛 Conceptual Disambiguation: Removing all edges in cycle doesn't update graph correctly (in the UI)

Sample program:


library(rTisane)

pid <- Participant("ID")

age <- continuous(unit=pid, name="AGEP")
edu <- categories(unit=pid, name="SCHL", cardinality=10)
sex <- categories(unit=pid, name="SEX", cardinality=2)
income <- continuous(unit=pid, name="PINCP")

cm <- ConceptualModel() %>%
# Specify conceptual relationships
  assume(causes(age, income)) %>%
  hypothesize(causes(income, age))
  
updatedCM <- checkAndRefineConceptualModel(conceptualModel=cm)

bugCycleBreaking.mov
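The sample program above creates a two-edge cycle: `assume` adds AGEP -> PINCP and `hypothesize` adds PINCP -> AGEP. As a point of comparison for the UI behavior, here is a minimal Python sketch of finding such a cycle and removing all of its edges. It is not rTisane's implementation; `find_cycle` and `remove_edges` are hypothetical names, and the graph is a plain list of (cause, effect) pairs.

```python
# Hypothetical sketch; not rTisane's actual implementation.
def find_cycle(edges):
    """Return the edges of one directed cycle, or [] if the graph is acyclic."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in graph}
    path = []  # nodes on the current depth-first path

    def visit(node):
        color[node] = GRAY
        path.append(node)
        for nxt in graph[node]:
            if color[nxt] == GRAY:  # back edge closes a cycle
                cycle_nodes = path[path.index(nxt):] + [nxt]
                return list(zip(cycle_nodes, cycle_nodes[1:]))
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        color[node] = BLACK
        path.pop()
        return []

    for node in graph:
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return []


def remove_edges(edges, to_remove):
    """Return the edge list without the edges in to_remove."""
    drop = set(to_remove)
    return [edge for edge in edges if edge not in drop]


# Edges from the sample program: assume(...) and hypothesize(...)
edges = [("AGEP", "PINCP"), ("PINCP", "AGEP")]
cycle = find_cycle(edges)
remaining = remove_edges(edges, cycle)
print(cycle)
print(remaining)
```

In this sketch, removing every edge in the cycle leaves an empty edge list, which is the state the graph view would be expected to show.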

Get feedback on ideas for changing the language

Here is what I think the new DSL (in R) should look like! I use 👍 to indicate the versions of the syntax that I think we should implement/use among the ones I've brainstormed.

Feedback requested:

What are your reactions to the ✨ New ✨ ideas? Are you convinced they are good ideas based on your gut + our qualitative lab study? What are your biggest concerns about introducing/incorporating them?
For the ✨ New ✨ language constructs that you like/are on board with, which of the syntaxes do you prefer? Maybe you don't like any of them, why not? Note: I expect the query ideas will become clearer when we discuss the interaction model.

Variables declaration

Not many changes here. Just a couple changes:

new Participant class
new condition function (still a measure under the hood)

# Specify participant unit (new class)
member <- Participant("member", cardinality=386) 
motivation <- ordinal(unit=member, name="motivation", order=c(1, 2, 3, 4, 5, 6))
age <- numeric(unit=member, name="age", number_of_instances=1)
pounds_lost <- numeric(unit=member, name="pounds_lost")
# Specify group unit
group <- Unit("group", cardinality=40)  # 40 groups
# Specify condition (new function)
condition <- condition(unit=group, "treatment", cardinality=2, number_of_instances=1)

Data measurement

No changes. Keep above variable declarations + nests within:

nests_within(member, group)

Conceptual modeling

cm <- ConceptualModel() # explicitly handle to abide by functional programming model in R

✨New: Distinguish known vs. suspected relationships

Four different ways to realize this idea in the syntax listed below.

👍 Syntax 1: Unique functions (seems the easiest to remember and write)

suspect_causes(motivation, pounds_lost, cm)
suspect_causes(condition, pounds_lost, cm)
know_causes(age, pounds_lost, cm)
suspect_causes(age, motivation, cm)

Syntax 2: Nested know/suspect function calls

suspect(causes(motivation, pounds_lost), cm) # Modify cm object
suspect(causes(condition, pounds_lost), cm)
know(causes(age, pounds_lost), cm)
suspect(causes(age, motivation), cm)

Syntax 3: Add parameter

causes(motivation, pounds_lost, "suspect", cm)
causes(condition, pounds_lost, "suspect", cm)
causes(age, pounds_lost, "know", cm)

Syntax 4: Assume vs. Assert (check and error out) vs. Assess

Pro: Familiar concepts to programmers (e.g., assert statements)
Note: But maybe this is something that fits better later on in the interaction depending on query?

assess(causes(motivation, pounds_lost), cm) # suspect
assess(causes(condition, pounds_lost), cm) # suspect
assume(causes(age, pounds_lost), cm) # know
assert(causes(age, pounds_lost), cm) # know
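The syntaxes above differ on the surface but could all record the same information: a cause, an effect, and whether the relationship is known or suspected. The following Python sketch (illustrative only, not the rTisane API) stores that tag on each relationship and shows Syntax 1 as thin wrappers around Syntax 3. The `Relationship` class and its `certainty` field are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relationship:
    cause: str
    effect: str
    certainty: str  # "know" or "suspect"


@dataclass
class ConceptualModel:
    relationships: list = field(default_factory=list)


def causes(cause, effect, certainty, cm):
    """Syntax 3: the certainty is passed as a parameter."""
    if certainty not in ("know", "suspect"):
        raise ValueError(f"unknown certainty: {certainty!r}")
    cm.relationships.append(Relationship(cause, effect, certainty))
    return cm


def know_causes(cause, effect, cm):
    """Syntax 1: dedicated function for known relationships."""
    return causes(cause, effect, "know", cm)


def suspect_causes(cause, effect, cm):
    """Syntax 1: dedicated function for suspected relationships."""
    return causes(cause, effect, "suspect", cm)


cm = ConceptualModel()
suspect_causes("motivation", "pounds_lost", cm)
know_causes("age", "pounds_lost", cm)
suspected = [r for r in cm.relationships if r.certainty == "suspect"]
print(suspected)
```

Under this representation, the choice among the syntaxes is mostly about what is easiest to read and remember, since later stages only need the stored tag.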

✨New: Replace associates_with, use only causes + unobserved variables

Declare unobserved variable:

# Mediation
# age -> midlife_crisis -> pounds_lost
# age -> midlife_crisis -> motivation
midlife_crisis <- ts.Unobserved()
suspect_causes(age, midlife_crisis, cm)
suspect_causes(midlife_crisis, pounds_lost, cm)
suspect_causes(midlife_crisis, motivation, cm)
# Could have used know_causes as well - Does not change behavior later on.

# Common ancestor
# latent_var -> age, latent_var -> motivation
latent_var <- ts.Unobserved()
causes(latent_var, age, cm)
causes(latent_var, motivation, cm)
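One way to read the "common ancestor" pattern above is as a mechanical rewrite of an association: instead of an undirected edge between two variables, add a fresh unobserved node that causes both. The Python sketch below illustrates that rewrite under the assumption that a conceptual model is just a list of (cause, effect) pairs; `unobserved` and `associates_with` are hypothetical helpers, not rTisane functions.

```python
from itertools import count

_latent_ids = count(1)


def unobserved():
    """Create a fresh placeholder name for a variable that is not measured."""
    return f"unobserved_{next(_latent_ids)}"


def associates_with(a, b, edges):
    """Rewrite 'a is associated with b' as a shared unobserved cause of both."""
    latent = unobserved()
    return edges + [(latent, a), (latent, b)]


edges = associates_with("age", "motivation", [])
print(edges)  # [('unobserved_1', 'age'), ('unobserved_1', 'motivation')]
```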

✨New: Express relationships with more specificity

In these scenarios, "causes()" would be syntactic sugar (or vice versa, depending on how you view it)
The benefit of these functions is (i) a closer mapping to the level of detail at which analysts think about causal relationships and (ii) more thorough documentation to facilitate analysts' reflection

👍 Syntax 1: Add when/then constructs

"if" is a keyword in R, so have to use when. Could also use "_if" or something like that, but I think when reads nicer and is easier to remember.
Have to infer/follow up when not all categories in a categorical variable are stated - e.g., condition below

# Read as: I suspect that when motivation increases, then pounds_lost also increases in my conceptual model.
suspect(when(motivation, "increases").then(pounds_lost, "increases"), cm)

# Read as: I suspect that when condition is treatment, then pounds_lost increases in my conceptual model. 
suspect(when(condition, "==`treatment`").then(pounds_lost, "increases"), cm) # implied: when(condition, "!=`treatment`").then(pounds_lost, "decreases")?

# Read as: I suspect that when condition is not treatment, then pounds_lost decreases in my conceptual model.
suspect(when(condition, "!=`treatment`").then(pounds_lost, "decreases"), cm) # implied: when(condition, "==`treatment`").then(pounds_lost, "increases")?

# Read as: I know that when age increases, then pounds_lost increases in my conceptual model.
know(when(age, "increases").then(pounds_lost, "increases"), cm)

# Read as: I suspect that when age increases, then motivation increases in my conceptual model.
suspect(when(age, "increases").then(motivation, "increases"), cm)

Syntax 2: Add when construct (no explicit then construct)

Note: what comparisons (i.e., "increases") are valid depends on data type

# Read as: I suspect that when motivation increases, pounds_lost also increases in my conceptual model.
suspect(when(motivation, "increases", pounds_lost, "increases"), cm)

# Read as: I suspect that when condition is treatment, pounds_lost increases in my conceptual model.
suspect(when(condition, "==`treatment`", pounds_lost, "increases"), cm) # implied: when(condition, "==control", pounds_lost, "decreases")

# Read as: I suspect that when condition is not treatment, pounds_lost decreases in my conceptual model.
suspect(when(condition, "!=`treatment`", pounds_lost, "decreases"), cm) # implied: when(condition, "==`treatment`", pounds_lost, "increases")?

# Read as: I know when age increases, pounds_lost increases in my conceptual model. 
know(when(age, "increases", pounds_lost, "increases"), cm)

# Read as: I suspect that when age increases, motivation increases in my conceptual model. 
suspect(when(age, "increases", motivation, "increases"), cm)
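The note that valid comparisons depend on data type can be made concrete with a small validation step. The Python sketch below builds both when/then forms and rejects comparisons that do not fit a variable's type. The rule it encodes (directional comparisons only for numeric and ordinal variables, equality for any type) is an assumption for illustration, as are the `Variable`, `Clause`, and `When` names.

```python
from dataclasses import dataclass

# Assumed rule for this sketch: only ordered data types support
# directional comparisons such as "increases" or "decreases".
DIRECTIONAL_TYPES = {"numeric", "ordinal"}


@dataclass(frozen=True)
class Variable:
    name: str
    dtype: str  # "numeric", "ordinal", or "nominal"


@dataclass(frozen=True)
class Clause:
    variable: Variable
    comparison: str


def make_clause(variable, comparison):
    """Build a clause, rejecting comparisons that do not fit the data type."""
    if comparison.startswith(("==", "!=")):
        return Clause(variable, comparison)  # equality allowed for any type here
    if comparison in ("increases", "decreases") and variable.dtype in DIRECTIONAL_TYPES:
        return Clause(variable, comparison)
    raise ValueError(
        f"{comparison!r} is not valid for {variable.dtype} variable {variable.name}"
    )


class When:
    def __init__(self, variable, comparison):
        self.condition = make_clause(variable, comparison)

    def then(self, variable, comparison):
        """Syntax 1: pair the stored condition with an outcome clause."""
        return (self.condition, make_clause(variable, comparison))


def when(variable, comparison, outcome=None, outcome_comparison=None):
    """Return a When for Syntax 1, or the finished pair for Syntax 2."""
    condition = When(variable, comparison)
    if outcome is None:
        return condition
    return condition.then(outcome, outcome_comparison)


motivation = Variable("motivation", "ordinal")
pounds_lost = Variable("pounds_lost", "numeric")
condition = Variable("condition", "nominal")

rel1 = when(motivation, "increases").then(pounds_lost, "increases")  # Syntax 1
rel2 = when(motivation, "increases", pounds_lost, "increases")       # Syntax 2
print(rel1 == rel2)  # True
```

Calling `when(condition, "increases")` raises a `ValueError` in this sketch, because a nominal variable has no direction to increase in.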

✨New: Specify potential interactions

Trying to express the following interaction: motivation * age on pounds_lost

👍 Syntax 1: Multiple when clauses: Each tuple could be its own when clause (conjunction)

Note: what comparisons (i.e., "increases") are valid depends on data type

suspect(when((motivation, "==low"), (age, "increases")).then(pounds_lost, "increases"), cm)
# allow when to have many tuples of relationships
suspect(when((motivation, "==low"), (age, "increases")).then(pounds_lost, "baseline"), cm)
suspect(when((motivation, "==high"), (age, "increases")).then(pounds_lost, "increases"), cm)

Syntax 2: No tuple distinction, similar to some existing R/Python libraries

when(age, "increases", motivation, "decreases").then(pounds_lost, "decreases")

Queries to issue

How to use: In the "iv"/"ivs" parameter, include the variables the end-user really cares about; the system adds adjustment sets of other variables that should be included to account for confounding.

✨New: Assess a conceptual model holistically

Returns set of statistical models + conditional independencies we would expect to observe if the data supports the conceptual model.

Based on the conditional independencies that we test, we can see if we have evidence for the conceptual model. If we have evidence for all the conditional independencies (which suggests the conceptual model is not wrong), yay. If only some of the conditional independencies are supported, this suggests the conceptual model needs to change.
ts.assess(conceptual_model=cm)
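To illustrate what "conditional independencies we would expect to observe" could look like, the Python sketch below derives them from a causal graph using the local Markov property: each variable is independent of its non-descendants given its parents. This is one standard way to list testable implications of a DAG; it is not necessarily what `ts.assess` would do, and `implied_independencies` is a hypothetical name. The edges are the four relationships from the conceptual modeling examples above.

```python
def descendants(node, children):
    """Collect every node reachable from node by following causal edges."""
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def implied_independencies(edges):
    """List (pair, conditioning set) statements from the local Markov property."""
    nodes = sorted({n for edge in edges for n in edge})
    parents = {n: set() for n in nodes}
    children = {n: set() for n in nodes}
    for src, dst in edges:
        parents[dst].add(src)
        children[src].add(dst)

    statements = set()
    for v in nodes:
        excluded = descendants(v, children) | parents[v] | {v}
        for u in nodes:
            if u not in excluded:
                # v is independent of non-descendant u given v's parents
                statements.add((frozenset((v, u)), frozenset(parents[v])))
    return statements


edges = [
    ("motivation", "pounds_lost"),
    ("condition", "pounds_lost"),
    ("age", "pounds_lost"),
    ("age", "motivation"),
]
independencies = implied_independencies(edges)
for pair, given in sorted(independencies, key=lambda s: (sorted(s[0]), sorted(s[1]))):
    print(sorted(pair), "given", sorted(given))
```

For this graph the sketch yields three statements: age and condition are independent, condition and motivation are independent, and condition and motivation are independent given age. Each could then be tested against the data.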

Assess direct effect of an IV on a DV

Returns a statistical model that includes the IV and the minimal adjustment set needed to measure the influence of the IV on the DV

#Example:  How does condition affect pounds_lost?
ts.query(iv=condition, dv=pounds_lost, assuming=cm)
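As a rough illustration of how a query might turn a conceptual model into a model formula, the Python sketch below adjusts for the parents of the IV. When all parents are observed, the parents of the IV form a valid back-door adjustment set for the total effect of the IV on the DV, although not necessarily the minimal set the proposal calls for. The function names are hypothetical and the edge list is taken from the examples above.

```python
def parent_adjustment_set(edges, iv, dv):
    """Return the direct causes of iv as a simple (not minimal) adjustment set."""
    parents = {src for src, dst in edges if dst == iv}
    if dv in parents:
        raise ValueError(f"{dv} is a direct cause of {iv}; check the query direction")
    return parents


def model_formula(iv, dv, adjustment):
    """Write the model as an R-style formula string."""
    return f"{dv} ~ " + " + ".join([iv] + sorted(adjustment))


edges = [
    ("motivation", "pounds_lost"),
    ("condition", "pounds_lost"),
    ("age", "pounds_lost"),
    ("age", "motivation"),
]

f_condition = model_formula(
    "condition", "pounds_lost",
    parent_adjustment_set(edges, "condition", "pounds_lost"),
)
f_motivation = model_formula(
    "motivation", "pounds_lost",
    parent_adjustment_set(edges, "motivation", "pounds_lost"),
)
print(f_condition)   # pounds_lost ~ condition
print(f_motivation)  # pounds_lost ~ motivation + age
```

Here condition has no causes in the model, so nothing is added, while the query on motivation picks up age because age causes both motivation and pounds_lost.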

Assess direct effects of multiple IVs on a DV

Treat the IVs as the "exposures" and include the minimal adjustment set for them as the other covariates.
If the variables should not all be in one model because doing so may bias the causal effect estimates (e.g., due to a mediating causal structure), show a warning.

#Example: How do these IVs influence pounds_lost? 
ts.query(ivs=list(condition, motivation), dv=pounds_lost, assuming=cm)
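The warning described above could be triggered by a simple structural check: if one requested IV is an ancestor of another, the second lies on a causal path from the first and including both may distort the estimate for the first. The Python sketch below shows that check; `mediation_warnings` is a hypothetical helper and the message wording is made up for the example.

```python
def ancestors(node, edges):
    """Return every variable with a directed path into node."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    seen, stack = set(), [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def mediation_warnings(edges, ivs):
    """Warn about each pair of IVs where one causes the other."""
    warnings = []
    for upstream in ivs:
        for downstream in ivs:
            if upstream != downstream and upstream in ancestors(downstream, edges):
                warnings.append(
                    f"{upstream} causes {downstream}; including both in one model "
                    f"may bias the estimate for {upstream}"
                )
    return warnings


edges = [
    ("motivation", "pounds_lost"),
    ("condition", "pounds_lost"),
    ("age", "pounds_lost"),
    ("age", "motivation"),
]
print(mediation_warnings(edges, ["condition", "motivation"]))  # []
print(mediation_warnings(edges, ["age", "motivation"]))
```

The query from the example, with condition and motivation, produces no warning because neither causes the other in this conceptual model. Querying age and motivation together does, since age -> motivation -> pounds_lost is a mediating path.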

✨New: What are all the variables that influence a DV?

This is mostly useful for suggesting queries/statistical models the end-user could issue to make the interaction model more interactive.
Enumerates all sets of variables that influence the DV while still avoiding confounding among the possible IVs (due to mediation, for example)
Likely computationally intensive
ts.query(dv=pounds_lost, assuming=cm)
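A brute-force version of this enumeration helps show why it is likely to be computationally intensive. The Python sketch below collects every ancestor of the DV and keeps each subset in which no variable causes another, checking all 2^n - 1 non-empty subsets. This is an illustrative assumption about what "avoiding confounding among the possible IVs" could mean, not a description of the planned implementation.

```python
from itertools import combinations


def ancestors(node, edges):
    """Return every variable with a directed path into node."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    seen, stack = set(), [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def candidate_iv_sets(edges, dv):
    """List sets of causes of dv in which no member causes another member."""
    causes = sorted(ancestors(dv, edges))
    upstream = {v: ancestors(v, edges) for v in causes}
    result = []
    for size in range(1, len(causes) + 1):
        for subset in combinations(causes, size):
            if not any(a in upstream[b] for a in subset for b in subset):
                result.append(subset)
    return result


edges = [
    ("motivation", "pounds_lost"),
    ("condition", "pounds_lost"),
    ("age", "pounds_lost"),
    ("age", "motivation"),
]
iv_sets = candidate_iv_sets(edges, "pounds_lost")
for iv_set in iv_sets:
    print(iv_set)
```

For the running example this yields five candidate sets. The pair age and motivation is excluded because age causes motivation.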

Can't install rTisane

I have tried to install rTisane many times with the following steps:
install.packages("remotes")
remotes::install_github("emjun/tisaner")
But after that, I got this error.

byte-compile and prepare package for lazy loading
in method for 'assign_cardinality_from_data' with signature '"AbstractVariable","Dataset"': no definition for class "Dataset"
in method for 'calculate_cardinality_from_data' with signature '"AbstractVariable","Dataset"': no definition for class "Dataset"
Error in setClassUnion("integerORAbstractVariableORAtMostORPer", c("integer", :
the member classes must be defined: not true of "AtMost", "Per"
Error: unable to load R code in package 'tisaner'
Execution halted
ERROR: lazy loading failed for package 'tisaner'

removing 'C:/Users/Golden/AppData/Local/R/win-library/4.3/tisaner'

Revisit: Suggest interactions in the statistical model disambiguation phase + interface

See previous discussion here.

If analysts do not provide interacts, should rTisane suggest interactions in the statistical model disambiguation phase? This could be helpful for promoting deeper reflection, but comes with challenges in (i) generating appropriate (all?) possible interaction effects (or maybe: is there a smarter subset of interactions to suggest?) and (ii) surfacing/showing/guiding end-users through the possible effects in the interface.
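To give a sense of challenge (i), the Python sketch below enumerates every candidate interaction term up to a chosen order. It only illustrates how quickly the candidate list grows; it does not propose which subset is worth suggesting, and `candidate_interactions` is a hypothetical name.

```python
from itertools import combinations
from math import comb


def candidate_interactions(variables, max_order=2):
    """List interaction terms (R-style 'a:b') among variables up to max_order."""
    return [
        ":".join(combo)
        for order in range(2, max_order + 1)
        for combo in combinations(variables, order)
    ]


variables = ["condition", "motivation", "age"]
pairwise = candidate_interactions(variables)
up_to_three_way = candidate_interactions(variables, max_order=3)
print(pairwise)         # ['condition:motivation', 'condition:age', 'motivation:age']
print(up_to_three_way)  # the three pairs plus 'condition:motivation:age'
print(comb(10, 2))      # 45 pairwise candidates for a model with 10 variables
```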

Revisit: Exploration and evaluation of language constructs for expressing study designs/data collection procedures

Key questions:

How do people want to express their study designs?
What core language constructs are necessary to capture a large breadth of experiments + observational studies?
Could we borrow ideas/reuse these constructs for other tasks like data scraping, data collection, etc.?
Could we reuse some constructs from other sources? "Grammar of study design"?

Proposal: Remove interacts from conceptual modeling phase

Key points:

Proposal: Remove interacts from the DSL and conceptual model disambiguation phase.
Rationale: Interactions start to get at statistical formulation. They are a good example of a modeling decision that is pretty squarely in between/combination of conceptual and statistical concerns. However, for the process supported by rTisane, they seem better suited to be considered in the statistical modeling phase.
Alternative implementations that allow for interactions:
- (1) Update statistical model disambiguation process to always ask about interactions among confounders
- (2) Update query function to accept specification about interaction(s), only show/ask about interactions in the statistical modeling disambiguation interface when interactions are specified in query

Longer discussion:

Previous intention for including interacts: So that analysts could express research questions and hypotheses about interaction effects between two or more variables. In other words, we wanted analysts to be able to include interactions in their statistical models, either by including it in their query or by including interactions during statistical model disambiguation.
Current implementation: interacts constructs a new variable that analysts involve in conceptual relationships. Concretely, something like this is possible:

a <- categorical("a")
b <- categorical("b")
z <- continuous("z")
ixn <- interacts(a, b)  # Construct a variable representing an interaction

cm <- ConceptualModel() %>%
     assume(causes(a, z)) %>% 
     assume(causes(ixn, z)) # Involve interaction in a conceptual relationship

The troubles:
- Treating interactions as variables that are "equal" to their component variables in the conceptual model is a bit misleading. Interactions aren't actually equivalent concepts. In other words, the conceptual model is now not totally conceptual. With interaction variables included, the conceptual model starts to mix conceptual and statistical concepts.
- Reasoning about interactions is also up for debate in the causal diagramming community. Pearl has argued that interactions are already captured in causal diagrams, and others have proposed new graphical structures to explicitly represent and reason about interactions. Because we are relying on Cinelli, Forney, and Pearl's recommendations for graphical reasoning about statistical models, I'm not sure it's completely sound to reason about interactions as nodes in the underlying graph.
- Apart from the above reasons, because interactions start to call to mind statistical formulation, they are better considered during the statistical model disambiguation phase, when rTisane guides analysts to think at a somewhat lower level than the input spec/DSL.
Conclusion: Basically, I think we can have a cleaner delineation/separation of conceptual and statistical models by removing interacts from the DSL while retaining it as a consideration/possibility in the statistical model disambiguation phase. Either alternative (listed at the very top) would help achieve that, although alternative 1 may be even cleaner.

Open to comments, debate, deliberation.

Simplify, streamline install instructions for developers

Add more tests to ensure that variables are treated correctly in the output generated code script

Test for variables of all data types + family/link function combinations
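One way to organize such tests is to generate the full family/link matrix up front and run the same check on each case. The Python sketch below only builds the list of cases. The link names are those listed in R's `stats::family` documentation; how rTisane maps data types to families is not shown here, and `family_link_cases` is a hypothetical helper.

```python
# Links accepted by each family, as listed in R's stats::family documentation.
FAMILY_LINKS = {
    "gaussian": ["identity", "log", "inverse"],
    "binomial": ["logit", "probit", "cauchit", "log", "cloglog"],
    "poisson": ["log", "identity", "sqrt"],
    "Gamma": ["inverse", "identity", "log"],
    "inverse.gaussian": ["1/mu^2", "inverse", "identity", "log"],
}


def family_link_cases(family_links=FAMILY_LINKS):
    """Return one (family, link) pair per combination to test."""
    return [
        (family, link)
        for family, links in family_links.items()
        for link in links
    ]


cases = family_link_cases()
print(len(cases))  # 18
for family, link in cases[:3]:
    print(f'{family}(link = "{link}")')
```

Each pair could then drive one generated-script test, for example by checking that the variable names and the family call in the output script match the inputs.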
