jbloomaus / decisiontransformerinterpretability

Interpreting how transformers simulate agents performing RL tasks

Home Page: https://jbloomaus-decisiontransformerinterpretability-app-4edcnc.streamlit.app/

License: MIT License

Shell 0.10% Python 6.70% HTML 5.19% Jupyter Notebook 88.01%
reinforcement-learning mechanistic-interpretability

decisiontransformerinterpretability's Introduction

Decision Transformer Interpretability


This project is designed to facilitate mechanistic interpretability of decision transformers as well as RL agents using transformer architectures.

This is achieved by:

  • Training scripts for online RL agents using the PPO algorithm. This training script can be used to generate trajectories for training a decision transformer.
  • A decision transformer implementation and training script, based on a standard transformer backbone and the decision transformer architecture.
  • A streamlit app. This app enables researchers to play minigrid games whilst observing the decision transformer's predictions/activations.

Future work will include:

  • creating an interpretability portfolio, expanding various exploratory techniques already present in the streamlit app.
  • solving tasks which require memory or language instruction. Many MiniGrid tasks require agents to have memory, and currently our PPO agent only responds to the last timestep.
  • validating hypotheses about model circuits using causal scrubbing.

Write Up

You can find an initial technical report for this project here.

Package Overview

The package contains several important components:

  • The environments package which provides utilities for generating environments (mainly focussed on MiniGrid).
  • The decision_transformer package which provides utilities for training and evaluating decision transformers (via calibration curves).
  • The ppo package which provides utilities for training and evaluating PPO agents.
  • The streamlit app which provides a user interface for playing games and observing the decision transformer's predictions/activations.
  • The models package which provides a common trajectory-transformer class so as to keep architectures homogeneous across the project.

Other notable files/folders:

  • The scripts folder contains bash scripts which show how to use various interfaces in the project.
  • The test folder which contains extensive tests for the project.

Example Results

We've successfully trained a decision transformer on several games including DoorKey and Dynamic Obstacles.

Calibration plot for MiniGrid-Dynamic-Obstacles-8x8-v0 after 6000 batches: episode length 14, RTG 1.0, reward 0.955.

I highly recommend playing with the streamlit app if you are interested in this project. It relies heavily on an understanding of the Mathematical Framework for Transformer Circuits.

Running the scripts

Example bash scripts are provided in the scripts folder. They make use of argparse interfaces in the package.

Training a PPO agent

If you set 'track' to true, a Weights & Biases dashboard will be generated. A trajectories pickle file will be generated in the trajectories folder. This file can be used to train a decision transformer.

python -m src.run_ppo --exp_name "Test" \
    --seed 1 \
    --cuda \
    --track \
    --wandb_project_name "PPO-MiniGrid" \
    --env_id "MiniGrid-DoorKey-8x8-v0" \
    --view_size 5 \
    --total_timesteps 350000 \
    --learning_rate 0.00025 \
    --num_envs 8 \
    --num_steps 128 \
    --num_minibatches 4 \
    --update_epochs 4 \
    --clip_coef 0.2 \
    --ent_coef 0.01 \
    --vf_coef 0.5 \
    --max_steps 1000 \
    --one_hot_obs

Training a decision transformer

Targeting the trajectories file and setting the model architecture details and hyperparameters, you can run the decision transformer training script.

python -m src.run_decision_transformer \
    --exp_name MiniGrid-Dynamic-Obstacles-8x8-v0-Refactor \
    --trajectory_path trajectories/MiniGrid-Dynamic-Obstacles-8x8-v0bd60729d-dc0b-4294-9110-8d5f672aa82c.pkl \
    --d_model 128 \
    --n_heads 2 \
    --d_mlp 256 \
    --n_layers 1 \
    --learning_rate 0.0001 \
    --batch_size 128 \
    --train_epochs 5000 \
    --test_epochs 10 \
    --n_ctx 3 \
    --pct_traj 1 \
    --weight_decay 0.001 \
    --seed 1 \
    --wandb_project_name DecisionTransformerInterpretability-Dev \
    --test_frequency 1000 \
    --eval_frequency 1000 \
    --eval_episodes 10 \
    --initial_rtg -1 \
    --initial_rtg 0 \
    --initial_rtg 1 \
    --prob_go_from_end 0.1 \
    --eval_max_time_steps 1000 \
    --track True

Note: if you want the training data from the blog post, you can download it like so:

cd trajectories
gdown 1UBMuhRrM3aYDdHeJBFdTn1RzXDrCL_sr

Running the Streamlit app

To run the Streamlit app:

streamlit run app.py

To run the Streamlit app on Docker, see the Development section.

Setting up the environment

I haven't been too careful about this yet. Use Python 3.9.15 with the requirements.txt file. We're using the V2 branch of TransformerLens and MiniGrid 2.1.0.

conda create --name decision_transformer_interpretability python=3.9.15
conda activate decision_transformer_interpretability
pip install -r requirements.txt

The Dockerfile should work, and we can make more use of it once the project is further ahead, or if developers are rotating frequently and we see behaviour differ between machines.

./scripts/build_docker.sh
./scripts/run_docker.sh

Then you can ssh into the container, and a good IDE will handle credentials etc.

Development

Docker

If you're having trouble making the environment work, I recommend Docker. There's a dockerfile in the main folder - it takes a few minutes the first time, and 10-15 seconds for me when only changing code. If adding requirements it may take a bit longer. I (Jay) use Ubuntu through WSL and Docker Desktop, and it worked pretty easily for me.

To run it, first navigate to your project directory, then:

docker build -t IMAGE_NAME .
docker run -d -it -v $(pwd):/app --name CONTAINER_NAME IMAGE_NAME bash

To reset the container (e.g. you've changed the code and want to rerun your tests), use:

docker stop CONTAINER_NAME
docker rm CONTAINER_NAME
docker rmi IMAGE_NAME
docker build -t IMAGE_NAME .
docker run -p 8501:8501 -d -it -v $(pwd):/app --name CONTAINER_NAME IMAGE_NAME bash

I recommend setting this all up as a batch command so you can do it easily for a quick iteration time.

Finally, to run a command, use:

docker exec CONTAINER_NAME COMMAND

For instance, to run unit tests, you would use docker exec CONTAINER_NAME pytest tests/unit.

To run Streamlit on your local browser, you can use the following command:

docker exec CONTAINER_NAME streamlit run app.py --server.port=8501

Tests:

Ensure that the run_tests.sh script is executable:

chmod a+x ./scripts/run_tests.sh

Run the tests. Note: the end-to-end tests are excluded from the run_tests.sh script since they take a while to run. They create wandb dashboards and are useful for debugging, but they are not necessary for development.

To run end-to-end tests, you can use the command 'pytest -v --cov=src/ --cov-report=term-missing'. If the trajectories file 'MiniGrid-Dynamic-Obstacles-8x8-v0bd60729d-dc0b-4294-9110-8d5f672aa82c.pkl' is not found in the tests, the 'gdown' command has failed to download it. In that case, download it manually or run 'conda install -c conda-forge gdown' and try again.

./scripts/run_tests.sh

You should see something like this after the tests run. This is the coverage report. Ideally this is 100% but we're not there yet. Furthermore, it will be 100% long before we have enough tests. But if it's 100% and we have performant code with agents training and stuff otherwise working, that's pretty good.

---------- coverage: platform darwin, python 3.9.15-final-0 ----------
Name                                Stmts   Miss  Cover   Missing
-----------------------------------------------------------------
src/__init__.py                         0      0   100%
src/decision_transformer.py           132      8    94%   39, 145, 151, 156-157, 221, 246, 249
src/ppo.py                             20     20     0%   2-28
src/ppo/__init__.py                     0      0   100%
src/ppo/agent.py                      109     10    91%   41, 45, 112, 151-157
src/ppo/compute_adv_vectorized.py      30     30     0%   1-65
src/ppo/memory.py                      88     11    88%   61-64, 119-123, 147-148
src/ppo/my_probe_envs.py               99      9    91%   38, 42-44, 74, 99, 108, 137, 168
src/ppo/train.py                       69      6    91%   58, 74, 94, 98, 109, 113
src/ppo/utils.py                      146     54    63%   41-42, 61-63, 69, 75, 92-96, 110-115, 177-206, 217-235
src/utils.py                           40     17    58%   33-38, 42-65, 73, 76-79
src/visualization.py                   25     25     0%   1-34
-----------------------------------------------------------------
TOTAL                                 758    190    75%

Next Steps

  • Getting PPO to work with a transformer architecture.
  • Analyse this model/the decision transformer/a behavioural clone and publish the results.
  • Get a version of causal-scrubbing working
  • Study BabyAI (adapt all models to take an instruction token that is prepended to the context window)

Relevant Projects:

decisiontransformerinterpretability's People

Contributors

dalasnoin, echowne, felhof, jay-bailey, jaybaileycs, jbloomaus, mjahaha, ruphail, vlfernandez


decisiontransformerinterpretability's Issues

Load and sample from model checkpoints

Thanks to #27 we have model checkpoints from training. I'd like to do the following:

  • Load these state dictionaries into a blank model
  • Evaluate them
  • Sample from them and write trajectory files, which are much smaller than the full training runs for these bigger/messier tasks.

Implement another architecture such as GRU or LSTM for the PPO agent to generate trajectories for DT training

My initial plan was to look at how decision transformers solved various tasks and compare this to trajectory transformers trained with online RL (via PPO); however, this may be difficult due to the instability of standard transformers.

On Transforming Reinforcement Learning with Transformers: The Development Trajectory - https://arxiv.org/pdf/2212.14164.pdf
STABILIZING TRANSFORMERS FOR REINFORCEMENT LEARNING - https://arxiv.org/pdf/1910.06764.pdf

In light of this, and to make it easier to generate training data for the decision transformers, I want to try another architecture such as an LSTM.

Emulating the architecture from the BabyAI paper would probably be a good way to go (handling mission statements would be nice)- https://github.com/mila-iqia/babyai/tree/master

Tasks:

  • Implement an architecture such as LSTM in an agent class (such as LSTMAgent)
  • Train this class with PPO
  • Pass test cases as well as showing it is performant on tasks like the Memory task and others
  • Ensure that the resulting trajectories can be used to train decision transformers.

Implement Wandb sweeps for decision transformer training.

At some point, I implemented weights and biases sweeps to find good hyperparameters when doing PPO training. I suspect this will be very useful in the future and would like it to exist for the decision transformer script as well.

Examples from PPO:

Tasks:

  • make sure the PPO sweep works (I think there might be some funkiness with passing in the exp-name, so we might want to fix that first)
  • copy the pattern and implement it for scripts/run_decision_transformer.sh

Minigrid: Submit ViewSizeWrapper and RenderResizeWrapper PRs to Minigrid

I wrote a few wrappers for MiniGrid environments which should probably live in the Minigrid GitHub repo.

If someone has the time to submit that as a PR and write whatever tests their maintainers require (probably not a huge amount), then this would be greatly appreciated.

Checklist:
[ ] make a PR to Minigrid with the wrappers (double check the view size wrapper doesn't exist, it could be that it just had a bug that I didn't want to wait for them to fix)
[ ] add tests to the PR
[ ] have the PR accepted
[ ] when the PR is accepted, update this repo to use the new minigrid version and remove the wrapper code from this repo.

Reimplement BabyAI Recurrent AC Model for Trajectory Generation

After failing to generate a successful TransformerAC model, I'm pivoting to generating a recurrent AC model based on the BabyAI PyTorch implementation (https://github.com/mila-iqia/babyai/blob/master/babyai/model.py)

The tasks involved are:

  • Port over the model and get it running using our PPO methods, rewriting parts if needed
  • Write basic tests to ensure it works
  • Ensure we can generate trajectories from it that we can train our decision transformers on

Minigrid: Update DictObservationSpaceWrapper to only modify the observation space "mission" entry or write another wrapper to do so

It really bothers me that the minigrid wrapper for tokenising the observations also modifies the observation space for images. These should be untied.

class DictObservationSpaceWrapper(ObservationWrapper):
    """
    Transforms the observation space (that has a textual component) to a fully numerical observation space,
    where the textual instructions are replaced by arrays representing the indices of each word in a fixed vocabulary.

    This wrapper is not applicable to BabyAI environments, given that these have their own language component.

    Example:
        >>> import miniworld
        >>> import gymnasium as gym
        >>> import matplotlib.pyplot as plt
        >>> from minigrid.wrappers import DictObservationSpaceWrapper
        >>> env = gym.make("MiniGrid-LavaCrossingS11N5-v0")
        >>> obs, _ = env.reset()
        >>> obs['mission']
        'avoid the lava and get to the green goal square'
        >>> env_obs = DictObservationSpaceWrapper(env)
        >>> obs, _ = env_obs.reset()
        >>> obs['mission'][:10]
        [19, 31, 17, 36, 20, 38, 31, 2, 15, 35]
    """

    def __init__(self, env, max_words_in_mission=50, word_dict=None):
        """
        max_words_in_mission is the length of the array to represent a mission, value 0 for missing words
        word_dict is a dictionary of words to use (keys=words, values=indices from 1 to < max_words_in_mission),
                  if None, use the Minigrid language
        """
        super().__init__(env)

        if word_dict is None:
            word_dict = self.get_minigrid_words()

        self.max_words_in_mission = max_words_in_mission
        self.word_dict = word_dict

        image_observation_space = spaces.Box(
            low=0,
            high=255,
            shape=(self.agent_view_size, self.agent_view_size, 3),
            dtype="uint8",
        )
        self.observation_space = spaces.Dict(
            {
                "image": image_observation_space,
                "direction": spaces.Discrete(4),
                "mission": spaces.MultiDiscrete(
                    [len(self.word_dict.keys())] * max_words_in_mission
                ),
            }
        )

    @staticmethod
    def get_minigrid_words():
        colors = ["red", "green", "blue", "yellow", "purple", "grey"]
        objects = [
            "unseen",
            "empty",
            "wall",
            "floor",
            "box",
            "key",
            "ball",
            "door",
            "goal",
            "agent",
            "lava",
        ]

        verbs = [
            "pick",
            "avoid",
            "get",
            "find",
            "put",
            "use",
            "open",
            "go",
            "fetch",
            "reach",
            "unlock",
            "traverse",
        ]

        extra_words = [
            "up",
            "the",
            "a",
            "at",
            ",",
            "square",
            "and",
            "then",
            "to",
            "of",
            "rooms",
            "near",
            "opening",
            "must",
            "you",
            "matching",
            "end",
            "hallway",
            "object",
            "from",
            "room",
        ]

        all_words = colors + objects + verbs + extra_words
        assert len(all_words) == len(set(all_words))
        return {word: i for i, word in enumerate(all_words)}

    def string_to_indices(self, string, offset=1):
        """
        Convert a string to a list of indices.
        """
        indices = []
        # adding space before and after commas
        string = string.replace(",", " , ")
        for word in string.split():
            if word in self.word_dict.keys():
                indices.append(self.word_dict[word] + offset)
            else:
                raise ValueError(f"Unknown word: {word}")
        return indices

    def observation(self, obs):
        obs["mission"] = self.string_to_indices(obs["mission"])
        assert len(obs["mission"]) < self.max_words_in_mission
        obs["mission"] += [0] * (self.max_words_in_mission - len(obs["mission"]))

        return obs

Better encode/embed MiniGrid State to speed up training in DT's.

I have two main ideas for this:

Implement a variation of the BOW encoding that is used by BabyAI but add position to avoid building a convnet.

Although it is not discussed in the BabyAI paper itself, it seems they actually used a much better tokenization scheme than I am using, and this could plausibly be causing many of my problems.

class ImageBOWEmbedding(nn.Module):
   def __init__(self, max_value, embedding_dim):
       super().__init__()
       self.max_value = max_value
       self.embedding_dim = embedding_dim
       self.embedding = nn.Embedding(3 * max_value, embedding_dim)
       self.apply(initialize_parameters)

   def forward(self, inputs):
       offsets = torch.Tensor([0, self.max_value, 2 * self.max_value]).to(inputs.device)
       inputs = (inputs + offsets[None, :, None, None]).long()
       return self.embedding(inputs).sum(1).permute(0, 3, 1, 2)

This takes each cell and represents it as a unique embedding that is independent of position. For example, key + yellow + closed (object, colour, state) -> 13 or something -> maps to a specific vector. They then pass this into a convolutional network (which I was hoping to avoid).

I can do better than this, though, in the context of my model. I can have, bear with me, 5 separate embedding matrices (the current model effectively uses 3):

  • 1 state
  • 1 object
  • 1 color
  • 1 row position
  • 1 column position.

Then any given object/colour/state at any given position starts off with a unique representation. We can then look at how weight regularization acts as a feature selector over these and how the embeddings of each evolve in response to the circuits which use them. A minimal sketch of this scheme is given below.

One complication here is that for the positional embeddings, it seems like I should use something like sinusoidal embeddings but I want to ensure they are orthogonal to the previous sinusoidal embeddings. I have some ideas about how to do this but I will google it anyway.

E.g. this from the Gato paper (figure not reproduced here).
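
A minimal sketch of the five-embedding-matrix idea, assuming a (view_size, view_size, 3) grid of (object, colour, state) indices as in MiniGrid partial observations; the class name, vocabulary sizes and the final sum-pooling are illustrative assumptions, not the repo's implementation:

import torch
import torch.nn as nn


class FactoredGridEmbedding(nn.Module):
    """Sum of five embeddings: object, colour, state, row position, column position."""

    def __init__(self, view_size: int = 7, d_model: int = 128,
                 n_objects: int = 11, n_colours: int = 6, n_states: int = 3):
        super().__init__()
        self.object_emb = nn.Embedding(n_objects, d_model)
        self.colour_emb = nn.Embedding(n_colours, d_model)
        self.state_emb = nn.Embedding(n_states, d_model)
        self.row_emb = nn.Embedding(view_size, d_model)
        self.col_emb = nn.Embedding(view_size, d_model)
        self.view_size = view_size

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, view_size, view_size, 3) of integer indices
        obj, colour, state = obs[..., 0].long(), obs[..., 1].long(), obs[..., 2].long()
        rows = torch.arange(self.view_size, device=obs.device)
        cols = torch.arange(self.view_size, device=obs.device)
        cell = (self.object_emb(obj) + self.colour_emb(colour) + self.state_emb(state)
                + self.row_emb(rows)[None, :, None, :] + self.col_emb(cols)[None, None, :, :])
        # pool over the grid to get one state token per observation
        return cell.flatten(1, 2).sum(dim=1)  # (batch, d_model)


emb = FactoredGridEmbedding()
print(emb(torch.zeros(2, 7, 7, 3, dtype=torch.long)).shape)  # torch.Size([2, 128])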

Write a Rollout Sampling Utility for PPO Agents and add features that affect the generated distribution

#43 was insufficient to create an optimal behaviour agent with the current rollout data.

To falsify the hypothesis that our failure to generate good trajectories (ones that both solve the task and are at least somewhat RTG-modulated; calibration is unimportant for the time being) is a function of the data-generating distribution produced by the current rollout methodology and PPO agent, I would like to upgrade our ability to sample from PPO agents.

This will include:

  1. Adding an option to sample deterministically (always taking the best action), rather than randomly sampling from the distribution indicated by the logits.
  2. Adding an option to sample with temperature. This will enable us to calibrate the right amount of randomness.
  3. Adding an option to sample from anything except the maximal logit (unlike temperature, this will reliably produce anti-good trajectories rather than consistently good trajectories).

Furthermore, the utility for this should also provide some feedback on the sampled trajectories (for example, average RTG achieved, RTG distribution, trajectory length distribution etc). A minimal sketch of the three sampling modes is included at the end of this issue.

Tasks:

  • Write the end-to-end utility (i.e. wrap what we have currently in a runner)
  • Add each of the options above with tests
    • Deterministic Sampling
    • Temperature Sampling
    • Bottom-k sampling
  • Make it possible to select more than one of these and specify the number of rollouts to generate with each.
  • Make it possible to view some example rollouts when doing so/also have some performance metrics.
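
A minimal sketch of the three sampling modes, assuming we already have an actor logits tensor of shape (n_envs, n_actions); the function name sample_actions and its arguments are illustrative, not the repo's API:

import torch
from torch.distributions import Categorical


def sample_actions(logits: torch.Tensor, mode: str = "basic", temperature: float = 1.0) -> torch.Tensor:
    """Sample actions from actor logits of shape (n_envs, n_actions).

    mode:
        "basic"         - sample proportionally to softmax(logits)
        "deterministic" - always take the argmax action
        "temperature"   - sample from softmax(logits / temperature)
        "bottom_k"      - mask out the maximal logit and sample from the rest
                          (reliably produces anti-good trajectories)
    """
    if mode == "deterministic":
        return logits.argmax(dim=-1)
    if mode == "temperature":
        return Categorical(logits=logits / temperature).sample()
    if mode == "bottom_k":
        masked = logits.clone()
        best = logits.argmax(dim=-1, keepdim=True)
        masked.scatter_(-1, best, float("-inf"))  # exclude the best action
        return Categorical(logits=masked).sample()
    return Categorical(logits=logits).sample()


# Example: 8 envs, 7 MiniGrid actions
logits = torch.randn(8, 7)
print(sample_actions(logits, mode="temperature", temperature=0.5))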

Remove git lfs

While cloning or otherwise working with the repo I saw errors like these ones:

error: external filter 'git-lfs filter-process' failed

I assume it is related to this file:

fatal: trajectories/MiniGrid-Dynamic-Obstacles-8x8-v0bd60729d-dc0b-4294-9110-8d5f672aa82c.pkl:
Error downloading object: trajectories/MiniGrid-Dynamic-Obstacles-8x8-v0bd60729d-dc0b-4294-9110-8d5f672aa82c.pkl (8c06bc4): Smudge error: Error downloading trajectories/MiniGrid-Dynamic-Obstacles-8x8-v0bd60729d-dc0b-4294-9110-8d5f672aa82c.pkl

This error message is seemingly related to Git Large File Storage (LFS). The error message indicates that the git-lfs filter-process command failed. This error can occur if Git LFS is not installed or if the Git LFS filters are not configured correctly.

This solution appears to work for me:

Check if git-lfs is installed and install it if it is not.

git lfs install --skip-smudge
git clone <your-repo-url>
git lfs pull
git lfs install --force

The --skip-smudge option tells Git LFS to skip the smudge filter, which is responsible for downloading the large files. The git clone command clones the repository. The git lfs pull command downloads the large files. Finally, the --force option reinstates the smudge filter.

Maybe it would be helpful to add something on git lfs in the Readme or somewhere else.

Investigate the effect of Dropout / Stochastic Depth on Model training/interpretability

From Gato paper: "Regularization: We train with an AdamW weight decay parameter of 0.1. Additionally, we use stochastic depth (Huang et al., 2016) during pretraining, where each of the transformer sub-layers (i.e. each Multi-Head Attention and Dense Feedforward layer) is skipped with a probability of 0.1."

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. Preprint arXiv:1603.09382, 2016.

Stochastic depth seems plausibly super valuable to me via intuitions. I should read that paper at some point - Joseph

Rewrite evaluate_dt_agent to be parallelized (make use of gym sync vector env)

I used SyncVectorEnv to speed up game simulations elsewhere but rushed through writing the eval code and therefore didn't use it there.

Reimplementing evaluate_dt_agent with parallelization would speed up both the decision transformer training script and the calibration evaluation script.

Reasonable steps might look something like:

  • Write a test which ensures that evaluate_dt_agent works
  • Rewrite the function with parallelization (get statistics for each episode correctly, flatten them)
  • See that the test passes
  • Regenerate a calibration curve for models/MiniGrid-Dynamic-Obstacles-8x8-v0/demo_model_one_hot_overnight.pt, which should look (more or less) like the one in the post. A minimal parallel-rollout sketch follows.
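
A minimal sketch of the parallel rollout loop using gymnasium's SyncVectorEnv; the random action sampling stands in for the decision transformer's action selection, and the episode bookkeeping shows the "flatten per-episode statistics" step:

import gymnasium as gym
import minigrid  # noqa: F401  (registers the MiniGrid environments)
import numpy as np

n_envs = 8
envs = gym.vector.SyncVectorEnv(
    [lambda: gym.make("MiniGrid-Dynamic-Obstacles-8x8-v0") for _ in range(n_envs)]
)

obs, info = envs.reset(seed=1)
episode_returns, current_returns = [], np.zeros(n_envs)

for _ in range(1000):
    # placeholder: replace with the DT's action sampling given its context window
    actions = envs.action_space.sample()
    obs, rewards, terminated, truncated, info = envs.step(actions)
    current_returns += rewards
    done = terminated | truncated
    for i in np.where(done)[0]:
        episode_returns.append(current_returns[i])  # collect per-episode statistics
        current_returns[i] = 0.0

print(f"episodes finished: {len(episode_returns)}, mean return: {np.mean(episode_returns):.3f}")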

Write a utility for merging sampled rollouts into a single file

We can sample rollouts using sample_from_agents (below). However, it would be good to be able to merge trajectory datasets. In order to do this, it might be worth cleaning up the offline_dataset code and working out the right way to do it. A fast way would be to use torch utils to concat datasets, but this breaks the visualizer we have for the dataset. Possibly not worth doing anything else. A minimal file-merging sketch follows the code below.

def sample_from_agents(agents, rollout_length=2000, trajectory_path=None, num_envs=1):
    all_episode_lengths = []
    all_episode_returns = []
    # Sample rollouts from each agent
    for i, agent in enumerate(agents):
        memory = Memory(agent.envs, OnlineTrainConfig(
            num_envs=num_envs), device=agent.device)
        if trajectory_path:
            trajectory_writer = TrajectoryWriter(
                path=os.path.join(trajectory_path, f"rollouts_agent_{i}.gz"),
                run_config=RunConfig(track=False),
                environment_config=agent.environment_config,
                online_config=OnlineTrainConfig(num_envs=num_envs),
                model_config=agent.model_config
            )
        else:
            trajectory_writer = None
        agent.rollout(memory, rollout_length, agent.envs, trajectory_writer)
        if trajectory_writer:
            trajectory_writer.tag_terminated_trajectories()
            trajectory_writer.write(upload_to_wandb=False)
        # Process the episode lengths and returns
        df = process_memory_vars_to_log(memory.vars_to_log)
        all_episode_lengths.append(df['episode_length'])
        all_episode_returns.append(df['episode_return'])
    return all_episode_lengths, all_episode_returns
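
A minimal sketch of merging trajectory files, assuming the pickle layout produced by TrajectoryWriter.write (a dict with 'data' and 'metadata' keys, where 'data' holds observations/actions/rewards/dones/truncated/infos arrays); the file paths are placeholders:

import gzip
import pickle
import numpy as np


def load_trajectories(path):
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as f:
        return pickle.load(f)


def merge_trajectory_files(paths, out_path):
    merged = None
    for path in paths:
        data = load_trajectories(path)["data"]
        if merged is None:
            merged = {k: [v] for k, v in data.items()}
        else:
            for k, v in data.items():
                merged[k].append(v)
    # concatenate each field along the time axis
    data = {k: np.concatenate(v, axis=0) for k, v in merged.items()}
    payload = {"data": data, "metadata": {"merged_from": list(paths)}}
    with gzip.open(out_path, "wb") as f:
        pickle.dump(payload, f)


# usage (paths are placeholders):
# merge_trajectory_files(["rollouts_agent_0.gz", "rollouts_agent_1.gz"], "merged_rollouts.gz")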

Investigate/Possibly Add Gated MLP Units to Transformer Models.

Based on my understanding of GLUs from reading these three papers:

It seems likely that such a modification to our transformers will be valuable. Moreover, Neel told me that they are used in models including T5 (see here: https://huggingface.co/docs/transformers/model_doc/t5, they use gated gelu).

I think this shouldn't be too hard to add as a PR, and I could theoretically accept it myself but could talk to Neel first. At a high level we have:

  • 1. make sure you can specify this in the config reasonably
  • 2. make sure you can implement it without any problems (check it works as expected); have tests.
  • 3. make sure you get hooks for the relevant layers (we will need to add an activation)
  • 4. check if the default d_mlp should change if you do this.

And then a nice optional thing to do is to make a tutorial talking about how to interpret it. I think this would be best done once we have loaded T5 and are doing interp on it. Or we can train a very small model that uses it and use that as a demo. I also want to know what other people think about this. A minimal sketch of a gated MLP block follows.
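
A minimal sketch of what a gated (GEGLU-style) MLP block could look like; this is an assumption about the shape of the change, not TransformerLens's implementation:

import torch
import torch.nn as nn


class GatedMLP(nn.Module):
    """GEGLU-style MLP: out = W_out(gelu(W_gate x) * (W_in x))."""

    def __init__(self, d_model: int = 128, d_mlp: int = 256):
        super().__init__()
        self.W_gate = nn.Linear(d_model, d_mlp, bias=False)
        self.W_in = nn.Linear(d_model, d_mlp, bias=False)
        self.W_out = nn.Linear(d_mlp, d_model, bias=False)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # the gate path goes through the nonlinearity and multiplies the linear path
        return self.W_out(self.act(self.W_gate(x)) * self.W_in(x))


mlp = GatedMLP()
print(mlp(torch.randn(2, 3, 128)).shape)  # torch.Size([2, 3, 128])

Note that the GLU-variants paper shrinks the hidden width (to roughly two thirds of the usual value) to keep parameter counts comparable, which is the question about the default d_mlp in point 4.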

Error while sampling from new trajectories generated by LSTM model

https://wandb.ai/arena-ldn/PPO-MiniGrid/artifacts/trajectory/4eb3c096-8836-4d0f-973a-67685b89d0f0.gz/12e1daaebfd9c806051f

python -m src.run_decision_transformer \
    --exp_name MiniGrid-MemoryS7FixedStart-v0 \
    --trajectory_path trajectories/4eb3c096-8836-4d0f-973a-67685b89d0f0.gz \
    --d_model 128 \
    --n_heads 2 \
    --d_mlp 256 \
    --n_layers 1 \
    --learning_rate 0.0001 \
    --batch_size 128 \
    --train_epochs 5 \
    --test_epochs 1 \
    --n_ctx 23 \
    --pct_traj 1 \
    --weight_decay 0.001 \
    --seed 1 \
    --wandb_project_name DecisionTransformerInterpretability \
    --test_frequency 1000 \
    --eval_frequency 1000 \
    --eval_episodes 10 \
    --initial_rtg -1 \
    --initial_rtg 0 \
    --initial_rtg 1 \
    --prob_go_from_end 0.1 \
    --eval_max_time_steps 1000 \
    --track True
  File "/Users/josephbloom/miniforge3/envs/decision_transformer_interpretability/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/josephbloom/miniforge3/envs/decision_transformer_interpretability/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/josephbloom/GithubRepositories/DecisionTransformerInterpretability/src/run_decision_transformer.py", line 55, in <module>
    run_decision_transformer(
  File "/Users/josephbloom/GithubRepositories/DecisionTransformerInterpretability/src/decision_transformer/runner.py", line 114, in run_decision_transformer
    model = train(
  File "/Users/josephbloom/GithubRepositories/DecisionTransformerInterpretability/src/decision_transformer/train.py", line 142, in train
    evaluate_dt_agent(
  File "/Users/josephbloom/GithubRepositories/DecisionTransformerInterpretability/src/decision_transformer/train.py", line 342, in evaluate_dt_agent
    new_obs, new_reward, terminated, truncated, info = env.step(action)
  File "/Users/josephbloom/miniforge3/envs/decision_transformer_interpretability/lib/python3.10/site-packages/gymnasium/vector/vector_env.py", line 203, in step
    return self.step_wait()
  File "/Users/josephbloom/miniforge3/envs/decision_transformer_interpretability/lib/python3.10/site-packages/gymnasium/vector/sync_vector_env.py", line 149, in step_wait
    ) = env.step(action)
  File "/Users/josephbloom/miniforge3/envs/decision_transformer_interpretability/lib/python3.10/site-packages/gymnasium/wrappers/record_video.py", line 155, in step
    ) = self.env.step(action)
  File "/Users/josephbloom/miniforge3/envs/decision_transformer_interpretability/lib/python3.10/site-packages/gymnasium/core.py", line 408, in step
    return self.env.step(action)
  File "/Users/josephbloom/miniforge3/envs/decision_transformer_interpretability/lib/python3.10/site-packages/gymnasium/wrappers/record_episode_statistics.py", line 89, in step
    ) = self.env.step(action)
  File "/Users/josephbloom/miniforge3/envs/decision_transformer_interpretability/lib/python3.10/site-packages/gymnasium/wrappers/order_enforcing.py", line 56, in step
    return self.env.step(action)
  File "/Users/josephbloom/miniforge3/envs/decision_transformer_interpretability/lib/python3.10/site-packages/gymnasium/wrappers/env_checker.py", line 49, in step
    return self.env.step(action)
  File "/Users/josephbloom/GithubRepositories/DecisionTransformerInterpretability/src/environments/memory.py", line 167, in step
    if action == Actions.pickup:
RuntimeError: Boolean value of Tensor with more than one value is ambiguous

Upgrade Collect Demonstrations Workflow

The collect demonstrations utility is responsible for collecting example trajectories from a trained agent (one of the 3 PPO agent architectures supported; only two work). It provides a few different sampling procedures for doing this, such as basic (proportional to softmax), temperature sampling, bottom-k and top-k. These enable us to train decision transformers on a broader distribution of actions/observations, encouraging more robust features to be learned and a better calibrated RTG/behaviour relationship (if you do behavioural cloning, you are restricted to training on only the good trajectories of the offline agent, which doesn't lead to good features).

It's not super clear yet how useful this is, but there are very obvious next steps to improve this utility that seem like good engineering practice if anyone wants to help. The major goals are:

  • Hook up wandb tracking (add an arg for track, then log metrics to the dashboard)
  • Media to log: videos of each of the rollouts (much like the PPO rollout code). We want to see qualitatively what kinds of trajectories we're sampling.
  • Metrics to log: reward/time to finish for different rollout configs; we want to know which config settings sample what kinds of trajectories/outcomes.

This is possibly also an opportunity to improve the code around uploading videos, which is very janky.

Write much better tutorials for the repo.

I've explained how to use the repo in the docs, but this could be way better.

Some examples of pages that might be useful to add to the Sphinx docs:

  1. A page explaining what Minigrid is, and where to look at each MiniGrid environment (maybe include a table with environments as rows and columns for the different models (DT, BC, PPO-traj, standard PPO)).
  2. A page explaining the difference between the online method (PPO) and offline methods, explaining why we need the trajectories for the latter.
  3. A page/diagram explaining how trajectories are generated with the ppo method used to train offline agents.
  4. MOST IMPORTANT: pages for stuff you think I won't have thought to explain since I wrote the repo.

MiniGrid: Finish the Maze Environments for MiniGrid

A couple of weeks ago I worked with Navpreet to start writing a Maze Environment for Minigrid which is mostly working. Finishing this PR and possibly adding a version that uses Kruskal's algorithm and not Prim's could be a really valuable contribution.

The exciting thing about these environments being part of Minigrid is that it would offer opportunities to study algorithmic distillation and precise, memory-based reasoning in transformers. I think doing this could be the start of a lot of cool work.

Find a third party service for hosting the streamlit app and making it way faster to use

I currently host a public version of the interpretability app with the Streamlit Community Cloud, which is very slow.

I would be willing to pay for a service to host the app, but I need to know that deployment will be very easy.

Checklist:

Bug: Trajectory Dataset contains pre-emptively truncated trajectories from where PPO gets cut off

This is a bug in the TrajectoryWriter/offline dataset where we end up truncating some trajectories when we finish online training, and this leads to having "short" truncated trajectories, which are bad for our data. It would be good to remove them. They are visible in the visualization of reward over trajectory length as spots on the x-axis that are not at max length.

A link to the method I use to ensure that these get labelled as truncated to avoid bugs:

def tag_terminated_trajectories(self):


Explore Improvements to DT Training Procedure

Just wanted to have a meta card to track progress on these things, with links:

  • LayerNorm (I'll probably only try layernorm pre) (#52)
  • AdamW Optimizer
  • Adding a warmup stage with LambdaLR scheduler or cosine annealing
  • Implement gated MLP's (https://arxiv.org/pdf/2002.05202.pdf). Might need to be done in TransformerLens.
  • Make it possible to use GeLU not ReLU (try that out as well).
  • Better encode state. #61
  • Look into current init ranges for all the model components and consider proper init ranges
  • Look into where all the parameters are and consider how we can make a sparser model
  • Implement wandb sweeps for DT training (likely already exists a card for this so I should find it)
  • Implement masking rather than just having different tokens during padding. Might be important?

If we've implemented all of those and still no success with the memory env training, possibly try either much longer training runs, more variable sampling methods, or ask for advice (or go bug hunting).

Vectorize get trajectory minibatches method of memory class (useful for TrajPPO model)

I recently wrote a version of get_minibatches in the memory class of the ppo subpackage.

def get_trajectory_minibatches(self, timesteps: int, prob_go_from_end: float = 0.1) -> List[TrajectoryMinibatch]:
    '''Return a list of trajectory minibatches, where each minibatch contains
    experiences from a single trajectory.

    Args:
        - timesteps (int): the number of timesteps to include in each minibatch.

    Returns:
        - List[TrajectoryMinibatch]: a list of minibatches.
    '''
    obs, dones, actions, logprobs, values, rewards = [
        t.stack(arr) for arr in zip(*self.experiences)]

    next_values = t.cat([values[1:], self.next_value.unsqueeze(0)])
    next_dones = t.cat([dones[1:], self.next_done.unsqueeze(0)])
    # px.imshow(obs[:,1,:,:,0].transpose(-1,-2), animation_frame = 0, range_color = [0,10]).show()

    # set last value of dones to 1
    dones[-1] = t.ones(dones.shape[-1])

    # hack for now.
    # will cause problems if you only have one environment
    if logprobs.shape[-1] == 1:
        logprobs = logprobs.squeeze(-1)

    # rearrange to flatten out the env dimension (2nd dimension)
    obs = rearrange(obs, "T E ... -> (E T) ...")
    dones = rearrange(dones, "T E -> (E T)")
    next_dones = rearrange(next_dones, "T E -> (E T)")
    actions = rearrange(actions, "T E ... -> (E T) ...")
    logprobs = rearrange(logprobs, "T E -> (E T)")
    values = rearrange(values, "T E -> (E T)")
    next_values = rearrange(next_values, "T E -> (E T)")
    rewards = rearrange(rewards, "T E -> (E T)")

    # find the indices of the end of each trajectory
    traj_end_idxs = (t.where(dones)[0] + 1).tolist()

    # split these trajectories on the dones
    traj_obs = t.tensor_split(obs, traj_end_idxs)
    traj_actions = t.tensor_split(actions, traj_end_idxs)
    traj_logprobs = t.tensor_split(logprobs, traj_end_idxs)
    traj_values = t.tensor_split(values, traj_end_idxs)
    traj_rewards = t.tensor_split(rewards, traj_end_idxs)
    traj_dones = t.tensor_split(dones, traj_end_idxs)
    traj_next_values = t.tensor_split(next_values, traj_end_idxs)
    traj_next_dones = t.tensor_split(next_dones, traj_end_idxs)
    # px.imshow(traj_obs[0][:,:,:,0].transpose(-1,-2), animation_frame = 0, range_color = [0,10]).show()

    # so now we have lists of trajectories, what we want is to split each trajectory
    # so for each trajectory, sample an index and go n_steps back from that.
    # since we're encoding states and actions, we want to go context_length//2 back
    # if that happens to go off the end, then we
    minibatches = []

    # remove trajectories of length 0
    traj_obs = [traj for traj in traj_obs if len(traj) > 0]
    n_trajectories = len(traj_obs)
    trajectory_lengths = [len(traj) for traj in traj_obs]

    for _ in range(self.args.num_minibatches):
        minibatch_obs = []
        minibatch_actions = []
        minibatch_logprobs = []
        minibatch_advantages = []
        minibatch_values = []
        minibatch_returns = []
        minibatch_timesteps = []
        minibatch_rewards = []

        for _ in range(self.args.minibatch_size):
            # randomly select a trajectory
            traj_idx = np.random.randint(n_trajectories)
            # randomly select an end index from the trajectory
            # TODO later add a hyperparameter to oversample last step
            traj_len = trajectory_lengths[traj_idx]
            if traj_len <= timesteps:
                end_idx = traj_len
                start_idx = 0
            else:
                if prob_go_from_end is not None:
                    if random.random() < prob_go_from_end:
                        end_idx = traj_len
                        start_idx = end_idx - timesteps
                    else:
                        end_idx = np.random.randint(timesteps, traj_len)
                        start_idx = end_idx - timesteps
                else:
                    end_idx = np.random.randint(timesteps, traj_len)
                    start_idx = end_idx - timesteps

            # get the trajectory
            current_traj_obs = traj_obs[traj_idx][start_idx:end_idx]
            current_traj_actions = traj_actions[traj_idx][start_idx:end_idx]
            current_traj_logprobs = traj_logprobs[traj_idx][start_idx:end_idx]
            current_traj_values = traj_values[traj_idx][start_idx:end_idx]
            current_traj_dones = traj_dones[traj_idx][start_idx:end_idx]
            current_traj_rewards = traj_rewards[traj_idx][start_idx:end_idx]
            current_traj_next_value = traj_next_values[traj_idx][end_idx - 1]
            current_traj_next_done = traj_next_dones[traj_idx][end_idx - 1]

            # make timesteps
            current_traj_timesteps = t.arange(start_idx, end_idx)

            # Compute the advantages and returns for this trajectory.
            current_traj_advantages = self.compute_advantages(
                current_traj_next_value,
                current_traj_next_done,
                current_traj_rewards,
                current_traj_values,
                current_traj_dones,
                self.device,
                self.args.gamma,
                self.args.gae_lambda
            )
            current_traj_returns = current_traj_advantages + current_traj_values

            # we need to pad current_traj_obs and current_traj_actions
            current_traj_obs = pad_tensor(
                current_traj_obs,
                timesteps,
                ignore_first_dim=False,
                pad_token=0,
                pad_left=True
            )
            current_traj_actions = pad_tensor(
                current_traj_actions,
                timesteps,
                ignore_first_dim=False,
                pad_token=0,
                pad_left=True
            )
            current_traj_timesteps = pad_tensor(
                current_traj_timesteps,
                timesteps,
                ignore_first_dim=False,
                pad_token=0,
                pad_left=True
            )

            # add to minibatch
            minibatch_obs.append(current_traj_obs)
            minibatch_actions.append(current_traj_actions)
            minibatch_logprobs.append(current_traj_logprobs[-1])
            minibatch_advantages.append(current_traj_advantages[-1])
            minibatch_values.append(current_traj_values[-1])
            minibatch_returns.append(current_traj_returns[-1])
            minibatch_rewards.append(current_traj_rewards[-1])
            minibatch_timesteps.append(current_traj_timesteps)

        # stack the minibatch
        minibatch_obs = t.stack(minibatch_obs)
        minibatch_actions = t.stack(minibatch_actions)
        # only take the last values of the logprob, advantage,
        # value and return (relevant to the last step of each trajectory)
        minibatch_logprobs = t.stack(minibatch_logprobs)
        minibatch_advantages = t.stack(minibatch_advantages)
        minibatch_values = t.stack(minibatch_values)
        minibatch_returns = t.stack(minibatch_returns)
        minibatch_timesteps = t.stack(minibatch_timesteps)
        minibatch_rewards = t.stack(minibatch_rewards)

        minibatches.append(TrajectoryMinibatch(
            obs=minibatch_obs,
            actions=minibatch_actions,
            logprobs=minibatch_logprobs,
            advantages=minibatch_advantages,
            values=minibatch_values,
            returns=minibatch_returns,
            timesteps=minibatch_timesteps,
            rewards=minibatch_rewards
        ))

    return minibatches

TL;DR: This is important for sampling sections of trajectories, which is necessary for online training of trajectory models as opposed to models which only respond to the latest observation. I have a few ideas for what to do here:

Keep the logic more or less the same, but vectorize it. It's way too serialized and it doesn't have to be. Obviously write lots of tests.

Create an Object-Vector Calculator in the Streamlit App

Similar to the Maze experiments done by the shard theory team, calculate a vector corresponding to the key/ball in the memory env (or make a more general tool) that can be used in the app to analyse object configured behaviors.

PM me if you are interested in working on this.

Set padded RTG in training data to be true RTG until masking is implemented correctly.

We aren't currently masking padding tokens, and I'm not sure how big a deal that is. I'll reach out to ask someone for feedback, but in the meantime, I'm going to set the RTG in the padded training data to the true RTG, because the padding RTG is 0, which I think might have confused the model.

This is hopefully going to be the difference between a memory-task-solving DT completely solving the task and its current status of solving it but not using RTG to modulate behavior for the balls (strangely, it works if the key is the goal).

  • update offline training data to pad with the true RTG (a minimal illustration follows this list)
  • message someone about how essential padding is (I'm not sure why it isn't already implemented on the HookedTransformer class; maybe it is and I missed it)
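
A minimal illustration of the proposed change for a left-padded RTG sequence; the tensors are toy values, not the repo's dataset code:

import torch

rtg = torch.tensor([0.955, 0.80, 0.55])   # true RTG for the observed timesteps
context_len = 6

# current behaviour: pad on the left with 0, which the model may read as "RTG = 0"
padded_zero = torch.cat([torch.zeros(context_len - len(rtg)), rtg])

# proposed behaviour: pad with the true (initial) RTG instead
padded_true = torch.cat([torch.full((context_len - len(rtg),), rtg[0].item()), rtg])

print(padded_zero)  # tensor([0.0000, 0.0000, 0.0000, 0.9550, 0.8000, 0.5500])
print(padded_true)  # tensor([0.9550, 0.9550, 0.9550, 0.9550, 0.8000, 0.5500])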

Add lr scheduling options to the DT training code

As part of #53, I'm going to explore some techniques which may prove useful for improving the performance of DTs / BCs.

  • Add a wandb graph showing the lr of the scheduler over time (this will ensure that we can validate the methods we use)
  • Then add an LR scheduler utility (I think we'll just make use of the huggingface transformers utilities?); a minimal warmup sketch follows this list
  • Validate that each works and see if any affect convergence speed on Dynamic Obstacles.
  • Can try out on the memory env if desired.
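
A minimal sketch of a linear-warmup LambdaLR schedule with the lr logged so the wandb graph can validate it; the model, optimizer settings and warmup length are placeholders, not the repo's training loop:

import torch
import wandb  # only needed for the logging line below

model = torch.nn.Linear(10, 10)  # placeholder for the decision transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-3)

warmup_steps = 500


def lr_lambda(step: int) -> float:
    # linear warmup to the base lr, then constant
    return min(1.0, (step + 1) / warmup_steps)


scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# inside the training loop:
for step in range(2000):
    optimizer.step()   # after loss.backward() in the real loop
    scheduler.step()
    # wandb.log({"learning_rate": scheduler.get_last_lr()[0]}, step=step)

Cosine annealing could be swapped in via torch.optim.lr_scheduler.CosineAnnealingLR with the same logging pattern.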

Add checkpoints during Offline Training

Our PPO models are now stored with checkpoints but our offline trained models aren't. Creating some parity here would be good.

Please ensure:

  • It remains easy to load a model into the app
  • All of the required parameters to instantiate a model are saved with that model
  • Nice utilities facilitate saving/loading of checkpoints as is the case with the PPO models
  • Don't duplicate anything you don't have to with the PPO checkpoint set up.

Minigrid: Submit MultiEnv Environment PR so we don't need to maintain this code in DTI.

I wrote a class that mixes environments/hyperparameters, called a MultiEnv, which is kinda useful for training; however, it should probably live in the Minigrid GitHub repo.

If someone has the time to submit that as a PR and write whatever tests their maintainers require (probably not a huge amount), then this would be greatly appreciated.

Checklist:
[ ] make a PR to Minigrid with the Multienv
[ ] add tests to the PR
[ ] have the PR accepted
[ ] when the PR is accepted, update this repo to use the new minigrid version and remove the multienv code from this repo.

Enhancement: Beautify the Streamlit App

It's now possible to allow columns (up to one level of nesting) inside columns in streamlit apps thanks to streamlit/streamlit#5941. I think this creates an opportunity to make the streamlit interpretability app way more visually appealing. I don't have strong opinions now but basically suspect someone could have a lot of fun with this and it'll be really enjoyable later.

For reference, the current streamlit app uses a portrait view and a sidebar for selecting which analyses appear where. I think it'd be interesting to redesign this. Happy to do a "user interview" to describe how I use stuff and what I don't like about how I've currently implemented it.

Debug TransformerPPO model

I failed to get the TransformerPPO model working. I suspect this is because of bugs, but it could also be that transformers are inherently unstable.

If someone is interested in attempting to fix my implementation then I would greatly appreciate it. For the time being, I'm going to pivot to a recurrent architecture to get progress on the project going again.

Start here:

class TrajPPOAgent(PPOAgent):
    def __init__(self,
                 envs: gym.vector.SyncVectorEnv,
                 environment_config: EnvironmentConfig,
                 transformer_model_config: TransformerModelConfig,
                 device: t.device = t.device("cpu")
                 ):
        '''
        An agent for a Proximal Policy Optimization (PPO) algorithm.

        Args:
            - envs (gym.vector.SyncVectorEnv): the environment(s) to interact with.
            - device (t.device): the device on which to run the agent.
            - environment_config (EnvironmentConfig): the configuration for the environment.
            - transformer_model_config (TransformerModelConfig): the configuration for the transformer model.
            - device (t.device): the device on which to run the agent.
        '''
        super().__init__(envs=envs, device=device)
        self.environment_config = environment_config
        self.transformer_model_config = transformer_model_config

        self.obs_shape = get_obs_shape(envs.single_observation_space)
        self.num_obs = np.array(self.obs_shape).prod()
        self.num_actions = envs.single_action_space.n
        self.hidden_dim = transformer_model_config.d_model

        self.critic = CriticTransfomer(
            transformer_config=transformer_model_config,
            environment_config=environment_config,
        )
        self.layer_init(self.critic.value_predictor, std=0.01)

        self.actor = ActorTransformer(
            transformer_config=transformer_model_config,
            environment_config=environment_config,
        )
        self.layer_init(self.actor.action_predictor, std=0.01)

        self.device = device
        self.to(device)

    def rollout(self,
                memory: Memory,
                num_steps: int,
                envs: gym.vector.SyncVectorEnv,
                trajectory_writer=None) -> None:
        """Performs the rollout phase of the PPO algorithm, collecting experience by interacting with the environment.

        Args:
            memory (Memory): The replay buffer to store the experiences.
            num_steps (int): The number of steps to collect.
            envs (gym.vector.SyncVectorEnv): The vectorized environment to interact with.
            trajectory_writer (TrajectoryWriter, optional): The writer to
                log the collected trajectories. Defaults to None.
        """
        device = memory.device
        obs = memory.next_obs
        action = None  # will be set before used
        done = memory.next_done
        truncated = memory.next_done  # mem done represents done | truncated

        context_window_size = self.actor.transformer_config.n_ctx
        obs_timesteps = (context_window_size - 1) // 2 + 1  # (the current obs)
        actions_timesteps = obs_timesteps - 1
        action_pad_token = self.actor.environment_config.action_space.n
        n_envs = envs.num_envs

        if isinstance(device, str):
            device = t.device(device)
        cuda = device.type == "cuda"

        obss = t.zeros((n_envs, obs_timesteps, *obs.shape[1:]), device=device)
        acts = t.ones((n_envs, actions_timesteps, 1),
                      device=device).to(t.long) * action_pad_token
        timesteps = t.zeros((n_envs, obs_timesteps, 1),
                            device=device).to(t.long)
        obss[:, -1] = obs

        for step in range(num_steps):
            if len(memory.experiences) == 0:
                with t.inference_mode():
                    logits = self.actor(obss[:, -1:], None, timesteps[:, -1:])
                    values = self.critic(obss[:, -1:], None, timesteps[:, -1:])
                    value = values[:, -1].squeeze(-1)  # value is scalar
            else:
                # temporarily making this code worse, refactor soon.
                if obs_timesteps - 1 == 0:
                    obss = obs.unsqueeze(1)  # just add the current obs
                    acts = None
                else:
                    # obss
                    obss = t.cat((obss, obs.unsqueeze(1)),
                                 dim=1)  # add current obs
                    obss = obss[:, -obs_timesteps:]  # truncate
                    # acts
                    # add current action
                    acts = t.cat(
                        (acts, action.unsqueeze(1).unsqueeze(-1)), dim=1)
                    acts = acts[:, -actions_timesteps:]  # truncate
                    # timesteps
                    # add current timestep
                    timesteps = t.cat(
                        (timesteps, timesteps[:, -1:] + 1), dim=1)
                    if timesteps.max() > self.environment_config.max_steps:
                        assert False
                    timesteps = timesteps[:, -obs_timesteps:]  # truncate

                # Generate the next set of new experiences (one for each env)
                with t.inference_mode():
                    # Our actor generates logits over actions which we can then sample from
                    logits = self.actor(obss, acts, timesteps)
                    # Our critic generates a value function (which we use in the value loss, and to estimate advantages)
                    values = self.critic(obss, acts, timesteps)
                    values = values[:, -1].squeeze(-1)  # value is scalar

            # get the last state action prediction
            probs = Categorical(logits=logits[:, -1])
            action = probs.sample()
            logprob = probs.log_prob(action)

            next_obs, reward, next_done, next_truncated, info = envs.step(
                action.cpu().numpy())
            next_obs = memory.obs_preprocessor(next_obs)
            reward = t.from_numpy(reward).to(device)

            # in each case where an episode is done, we need to reset the context window
            # this is done by setting the last obs to the current obs and the rest to 0
            # all the actions are set to zero
            # timesteps are also reset
            next_done_or_truncated = next_done | next_truncated
            for i, d in enumerate(next_done_or_truncated):
                if d:
                    obss[i, -1] = obs[i]
                    obss[i, :-1] = 0
                    if acts is not None:
                        acts[i] = action_pad_token
                    timesteps[i] = 0

            if trajectory_writer is not None:
                obs_np = obs.detach().cpu().numpy() if cuda else obs.detach().numpy()
                reward_np = reward.detach().cpu().numpy() if cuda else reward.detach().numpy()
                action_np = action.detach().cpu().numpy() if cuda else action.detach().numpy()
                trajectory_writer.accumulate_trajectory(
                    next_obs=obs_np,
                    reward=reward_np,
                    action=action_np,
                    done=next_done,
                    truncated=next_truncated,
                    info=info
                )

            # Store (s_t, d_t, a_t, logpi(a_t|s_t), v(s_t), r_t+1)
            mem_done = (done.to(bool) | truncated.to(bool)).to(float)
            memory.add(info, obs, mem_done, action, logprob, value, reward)

            obs = t.from_numpy(next_obs).to(device)
            done = t.from_numpy(next_done).to(device, dtype=t.float)
            truncated = t.from_numpy(next_truncated).to(device, dtype=t.float)

        # Store last (obs, done, value) tuple, since we need it to compute advantages
        memory.next_obs = obs
        memory.next_done = done
        with t.inference_mode():
            obss = t.cat((obss, obs.unsqueeze(1)), dim=1)
            acts = t.cat((acts, action.unsqueeze(1).unsqueeze(-1)),
                         dim=1) if acts is not None else None
            obss = obss[:, -obs_timesteps:]
            actions = acts[:, -actions_timesteps:] if acts is not None else None
            timesteps = timesteps[:, -obs_timesteps:]
            values = self.critic(obss, actions, timesteps)
            memory.next_value = values[:, -1].squeeze(-1)

    def learn(self,
              memory: Memory,
              args: OnlineTrainConfig,
              optimizer: optim.Optimizer,
              scheduler: PPOScheduler,
              track: bool) -> None:
        """Performs the learning phase of the PPO algorithm, updating the agent's parameters
        using the collected experience.

        Args:
            memory (Memory): The replay buffer containing the collected experiences.
            args (OnlineTrainConfig): The configuration for the training.
            optimizer (optim.Optimizer): The optimizer to update the agent's parameters.
            scheduler (PPOScheduler): The scheduler attached to the optimizer.
            track (bool): Whether to track the training progress.
        """
        for _ in range(args.update_epochs):
            n_timesteps = (self.actor.transformer_config.n_ctx - 1) // 2 + 1
            minibatches = memory.get_trajectory_minibatches(
                n_timesteps, args.prob_go_from_end)

            # Compute loss on each minibatch, and step the optimizer
            for mb in minibatches:
                obs = mb.obs
                actions = mb.actions[:, :-1].unsqueeze(-1).to(
                    int) if mb.obs.shape[1] > 1 else None
                timesteps = mb.timesteps.unsqueeze(-1).to(int)

                logits = self.actor(obs, actions, timesteps)
                values = self.critic(obs, actions, timesteps)
                values = values[:, -1].squeeze(-1)
                probs = Categorical(logits=logits[:, -1])

                clipped_surrogate_objective = calc_clipped_surrogate_objective(
                    probs=probs,
                    mb_action=mb.actions[:, -1].squeeze(-1),
                    mb_advantages=mb.advantages,
                    mb_logprobs=mb.logprobs,
                    clip_coef=args.clip_coef)
                value_loss = calc_value_function_loss(
                    values, mb.returns, args.vf_coef)
                entropy_bonus = calc_entropy_bonus(probs, args.ent_coef)

                total_objective_function = clipped_surrogate_objective - value_loss + entropy_bonus

                optimizer.zero_grad()
                total_objective_function.backward()
                nn.utils.clip_grad_norm_(self.parameters(), args.max_grad_norm)
                optimizer.step()

        # Step the scheduler
        scheduler.step()

        # Get debug variables, for just the most recent minibatch (otherwise there's too much logging!)
        if track:
            with t.inference_mode():
                newlogprob = probs.log_prob(mb.actions.unsqueeze(-1))
                logratio = newlogprob - mb.logprobs
                ratio = logratio.exp()
                approx_kl = (ratio - 1 - logratio).mean().item()
                clipfracs = [
                    ((ratio - 1.0).abs() > args.clip_coef).float().mean().item()]

            memory.add_vars_to_log(
                learning_rate=optimizer.param_groups[0]["lr"],
                avg_value=values.mean().item(),
                value_loss=value_loss.item(),
                clipped_surrogate_objective=clipped_surrogate_objective.item(),
                entropy=entropy_bonus.item(),
                approx_kl=approx_kl,
                clipfrac=np.mean(clipfracs)
            )

Big tasks are:

  • Get the Traj PPO agent to pass the memory test.
  • passing some of the skipped acceptance tests might be useful feedback on the way.

Store Model Checkpoints during PPO Training

In order to have enough training data to give decision transformers a good distribution over good/bad trajectories, it would be good to be able to sample from agents of arbitrary quality (which means we need to store these agents).

As an initial step, storing model checkpoints during training of PPO agents would be useful. Storing these on wandb seems like a good first step (there are examples in the code of uploading artifacts, e.g. the trajectory writer below; a minimal checkpoint-saving sketch follows it).

if trajectory_writer is not None:
    trajectory_writer.tag_terminated_trajectories()
    trajectory_writer.write(upload_to_wandb=run_config.track)

def write(self, upload_to_wandb: bool = False):
    data = {
        'observations': np.array(self.observations, dtype=np.float64),
        'actions': np.array(self.actions, dtype=np.int64),
        'rewards': np.array(self.rewards, dtype=np.float64),
        'dones': np.array(self.dones, dtype=bool),
        'truncated': np.array(self.truncated, dtype=bool),
        'infos': np.array(self.infos, dtype=object)
    }

    if dataclasses.is_dataclass(self.args):
        metadata = {
            "args": asdict(self.args),  # Args such as ppo args
            "time": time.time()  # Time of writing
        }
    else:
        metadata = {
            "args": self.args,  # Args such as ppo args
            "time": time.time()  # Time of writing
        }

    if not os.path.exists(os.path.dirname(self.path)):
        os.makedirs(os.path.dirname(self.path))

    # use lzma to compress the file
    if self.path.endswith(".xz"):
        print(f"Writing to {self.path}, using lzma compression")
        with lzma.open(self.path, 'wb') as f:
            pickle.dump({
                'data': data,
                'metadata': metadata
            }, f)
    elif self.path.endswith(".gz"):
        print(f"Writing to {self.path}, using gzip compression")
        with gzip.open(self.path, 'wb') as f:
            pickle.dump({
                'data': data,
                'metadata': metadata
            }, f)
    else:
        print(f"Writing to {self.path}")
        with open(self.path, 'wb') as f:
            pickle.dump({
                'data': data,
                'metadata': metadata
            }, f)

    if upload_to_wandb:
        artifact = wandb.Artifact(
            self.path.split("/")[-1], type="trajectory")
        artifact.add_file(self.path)
        wandb.log_artifact(artifact)

    print(f"Trajectory written to {self.path}")

Mega Card: Improve Analysis App in various ways to facilitate better interpretability analysis of the new models

Analysis features

Static

Composition

  • Make composition maps
  • Replace composition scores with strip plots?
  • Create a meta-composition score. Something that measures total influence?
  • How do we check for composition between MLP_in and W_out? (seems expensive?, maybe tie to very specific hypotheses)

Dynamic

Logit Lens

  • By Layer
  • By Layer accumulated
  • By Head

Attention Maps:

  • Make it easier to export a nice visualization of the attention map (cv is actually not great for that).
  • Make it possible to calculate the rank-k approximation to the attention map (a minimal SVD sketch follows this list).
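
A minimal sketch of a rank-k approximation of an attention pattern via SVD; the attention matrix here is random, standing in for a cached pattern:

import torch

attn = torch.softmax(torch.randn(12, 12), dim=-1)  # placeholder attention pattern (query x key)

k = 2
U, S, Vh = torch.linalg.svd(attn)
attn_rank_k = U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

# how much of the pattern the first k singular directions capture
explained = (S[:k] ** 2).sum() / (S ** 2).sum()
print(f"rank-{k} reconstruction error: {(attn - attn_rank_k).norm():.4f}, variance explained: {explained:.2%}")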

Causal

Activation Patching (features)

  • Set up component
  • Set up RTG Metric
  • Residual stream patching.
  • Patching via Attn and MLP
  • Head All Pos Patching
  • Head Specific Pos Patching (do later)
  • Head All Pos by Component
  • MLP at different Positions
  • Show counterfactual attention map (ie: show difference in attention given intervention)
  • Show what the logit diff is for each metric score.

Activation Patching (token variations):
  • Action (fairly easy)
  • Key/Ball (important!)
  • Timestep (also fairly easy)

RTG Scan

  • Switch to using t-lens for decomp
  • Provide more than one level of decomp
  • Add a clustergram to show heads which mediate a similar relationship between RTG and logits/logit diff

Congruence -> If features aren't in superposition, what effect do they have on the predictions?

  • Pos
  • Time
  • W_in
  • W_Out
  • MLP Out

Renew old features:

  • QK circuit visualizations for action and RTG embeddings

SVD Decomp / Explore ways to use dimensionality reduction to quickly understand what heads are doing.

Cache Characterization?

  • Plot L2 norm of residual streams (along with mean and std)

Advanced

Implement Path Patching

  • Understand Callum's code.

Implement AVEC

  • Reread the post to see if we can find it.

Several things I feel are missing which are required for exploratory analysis to be more complete:

  • visualise dot product of time embeddings with each other
  • visualise dot product of positional embeddings with each other (a minimal sketch for these follows this list)
  • Use Jay's head type analysis but write specific patterns for attending to RTG, attending to positive RTG, attending to states, and attending to actions.
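
A minimal sketch of the embedding dot-product visualisation; the random matrix stands in for the model's learned time (or positional) embeddings:

import torch
import plotly.express as px

time_embedding = torch.randn(100, 128)  # placeholder for the learned time embeddings

# pairwise cosine similarity (drop the normalisation to get raw dot products instead)
normed = torch.nn.functional.normalize(time_embedding, dim=-1)
cosine = normed @ normed.T

px.imshow(cosine.numpy(), title="Cosine similarity between time embeddings").show()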

Several things I feel will be required for falsifying predictions of how the model is working:

  • implement a variant of path patching for DTs either in a notebook or as part of the app.
  • CaSc, not sure how feasible this is but it has always been the goal.

Improvement: Rewrite make_env to take an environment config file

I think the make_env function should take in the environment_config class we're using everywhere.

A link to the make_env function:

def make_env(
    env_id: str,
    seed: int,
    idx: int,
    capture_video: bool,
    run_name: str,
    render_mode="rgb_array",
    max_steps=100,
    fully_observed=False,
    flat_one_hot=False,
    agent_view_size=7,
    video_frequency=50
):

A link to the config class it should take:

@dataclass
class EnvironmentConfig():
    '''
    Configuration class for the environment.
    '''
    env_id: str = 'MiniGrid-Dynamic-Obstacles-8x8-v0'
    one_hot_obs: bool = False
    img_obs: bool = False
    fully_observed: bool = False
    max_steps: int = 1000
    seed: int = 1
    view_size: int = 7
    capture_video: bool = False
    video_dir: str = 'videos'
    render_mode: str = 'rgb_array'
    action_space: None = None
    observation_space: None = None
    device: str = 'cpu'

    def __post_init__(self):
        env = gym.make(self.env_id)

        if self.env_id.startswith('MiniGrid'):
            if self.fully_observed:
                env = FullyObsWrapper(env)
            elif self.one_hot_obs:
                env = OneHotPartialObsWrapper(env)
            elif self.img_obs:
                env = RGBImgPartialObsWrapper(env)

        if self.view_size != 7:
            env = ViewSizeWrapper(env, self.view_size)

        self.action_space = self.action_space or env.action_space
        self.observation_space = self.observation_space or env.observation_space

Be sure that the tests all pass! I think this will involve touching a lot of different pieces of code, but it's still conceptually very simple.

Implement IBAC/SNI and measure the effect on model interpretability

Selective Noise Injection (SNI) and Information Bottleneck Actor-Critic (IBAC) make models better at generalising (including in at least one MiniGrid environment). It seems like a fun hack-dayish kind of effort to test this out.

This would currently be bottlenecked by the TrajectoryPPO class possibly not working perfectly yet, but I'm putting this here in case someone is ambitious and wants to give it a shot.

References:

I consider this somewhat similar spiritually to Engineering Monosemanticity in Toy Models although I have no idea if this is exactly true.

Train a model using layer norm pre to see if this helps formation of calibrated, performant memory env agents.

Reading papers suggests this should help. I'm going to try it. It's already implemented except maybe in the app so I will:

  • Set layer-norm to LNPre
  • Train a few different models with this config (the memory env and maybe one of our original models).
  • Observe any differences in training metrics.
  • If the model turns out to be more performant, ensure app compatibility now; otherwise make a card for it.
