
Comments (23)

aadharna commented on June 19, 2024

Okay. I have PPO-BR PSRO up and running. Now I just need a method of evaluating the trained agents and to try the implementation on Connect-Four.


aadharna commented on June 19, 2024

If I just make the P2 population be a pointer to the P1 population, then I should recover something like my diagram above (since I'm assuming the symmetric case for now).

Turns out there's a symmetry flag in the PSRO code that does exactly this. It breaks a few other things (for example, turning the rectifier on without also setting training_strategy_selector to rectified, and the code warns that this might not be correct for certain applications, e.g., a DQN BR oracle with player_id information, so I turned off using player_id information, which should now be okay), but this together with PPO-BR is almost exactly what I had in mind.
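
For reference, roughly how that plugs in on my end (a sketch only; the argument names follow my reading of psro_v2.py and psro_v2_example.py, and everything else is a placeholder for what I already pass):

from open_spiel.python.algorithms.psro_v2 import psro_v2

def make_symmetric_solver(game, oracle, agents):
  # Sketch: symmetric_game=True makes the solver treat the populations as
  # shared; keep the rectifier off (per the warning above). The remaining
  # arguments are placeholders for whatever you already pass.
  return psro_v2.PSROSolver(
      game,
      oracle,
      initial_policies=agents,
      sims_per_entry=1000,
      rectifier="",
      symmetric_game=True)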


lanctot commented on June 19, 2024

You might need to be more specific.. "policy-gradient-based historical self-play" could mean many things.

There are basic policy gradient algorithms in:

  • python/policy_gradient.py
  • python/jax/policy_gradient.py
  • python/pytorch/policy_gradient.py

But I would guess you mean something closer to PSRO with a uniform meta-solver and PG-based best response? Is that what you mean by "historical self play"? (basically fictitious play where PG is used as the best responder)

If so, then you should be able to use the RL oracle in PSRO. The main implementation is in python/psro_v2 and there is an example here: psro_v2_example.

That said, the PG algorithms above are quite basic and they don't have A2C-style bootstrapping via n-step returns, but the JAX one does support TD(lambda). We do have a PPO implementation here (with example here) but it was mostly designed for Atari and AFAIK has never been tested within a PSRO loop. So, it may not work out-of-the-box, but I'd be happy to help get it working because we should really add it to the PSRO example too.


lanctot commented on June 19, 2024

I noticed you're at UBC. :) Turns out the people who added that PPO implementation are also at UBC, so maybe you can even try asking them about it in person: @newmanne @gregdeon (they may even have tried PSRO with a PPO based best-response at some point).


aadharna commented on June 19, 2024

I hadn't thought about it in the framing of PSRO with a uniform meta-solver and PPO best response, but I do think that maps cleanly onto what I have in mind.

What I had in mind as a baseline algorithm was something like this:

[diagram: the current PPO learner trains against opponents sampled uniformly from a growing self-play archive of its past checkpoints]

where we grow the SP-archive every k PPO updates by copying the current learner into the archive, and we sample a new opponent to improve against after each update step.
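
In rough pseudo-Python (every name here is illustrative of the loop structure only, not real OpenSpiel API):

import copy
import random

def historical_self_play(learner, env, num_updates, k):
  # Sketch of the diagram above: `learner` is a PPO agent with a hypothetical
  # train_against(opponent, env) method that performs one PPO update.
  archive = [copy.deepcopy(learner)]            # SP-archive seeded with the initial learner
  for t in range(num_updates):
    opponent = random.choice(archive)           # uniform sampling over the archive
    learner.train_against(opponent, env)        # one PPO update against that opponent
    if (t + 1) % k == 0:
      archive.append(copy.deepcopy(learner))    # grow the archive every k updates
  return learner, archive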

Yeah, I think with uniform sampling on the SP-archive this should map into PSRO w/ a PPO best response. I'll play around with PSRO tomorrow!

My (limited) understanding of PSRO is that we maintain a population for each of the two players so that we can give a matrix of payoffs to the meta-solver when updating both players.


I did see that PPO implementation (which looks like it was adapted from the cleanrl implementation), but I also noticed that it was pretty heavily tailored to Atari. The PPO I am currently using is basically the same as that one.


lanctot commented on June 19, 2024

Yes, their PPO was indeed adapted from cleanrl. It is heavily geared toward Atari as you see it in OpenSpiel, but I know that Neil and Greg used it for games, so I suggest contacting them as they may have some advice or maybe even some code to share.

Ok, based on that image, there are likely some subtle differences between what you have in mind and PSRO with a uniform meta-solver + PPO-based BR:

  • PSRO generally assumes the asymmetric case, that is, one population per player. This is because it's more general: in many games the players have different observation spaces, so it's easier to have a set of networks per player.
  • PSRO would start the BR from scratch at the start of each epoch, whereas the image (and your wording) implies that PPO would continue training from where it left off in the previous epoch.

Having the algorithm you're suggesting above outside the PSRO loop would be kind of cool and would make a welcome contribution, but it certainly does not currently exist.

Hope this helps, but please feel free to ask more questions if something is not clear.


aadharna commented on June 19, 2024

PSRO assumes the asymmetric case

That makes sense, and I see why one would then want a population per player. How do the per-player populations change over time? Do we just add the newly trained P1 BR-agent to the P1 population and then let P2 calculate a BR to the newly-expanded P1 population? Then these per-player populations should only ever grow.

If I just make the P2 population be a pointer to the P1 population, then I should recover something like my diagram above (since I'm assuming the symmetric case for now).

PSRO would start the BR from scratch at the start of each epoch

I'm totally fine with a variant of my algorithm that uses blank-slate NNs for each best response; I just imagine that would make the algorithm slower, as the network has to relearn the basics of the game every time. And in the SP works I usually think of (OpenAI Hide and Seek, Emergent Complexity via Multi-Agent Competition, AlphaStar, etc.) I don't recall seeing constant resetting of the learning agent.

I'll just need to think about the research I'm doing downstream from the baseline implementation and make sure that I don't break anything there as my research has me iteratively building an additional novelty-based population and then creating a response matrix of the SP population vs the novelty-population to guide exploration in the gamespace for the learner.

Having the algorithm you're suggesting above outside the PSRO loop would be kind of cool and would make a welcome contribution, but it certainly does not currently exist.

I don't follow this point, as the PPO-BR calculation would be inside PSRO, no? What I think you're saying is: what if I took my above diagram, but instead of PPO being the optimizer, I put a PPO-based PSRO there and kept the rest of the setup the same?


lanctot commented on June 19, 2024

To your first point, that is quite close to how it works, yes. Yes, the populations only ever grow; we have mostly just let them grow forever, and since the BR step takes so much longer, it has never been an issue. I think most of the PSRO code in OpenSpiel would not alternate the BRs but compute them simultaneously (every BR is learning against previous iterates), because that is closer to the standard game-theoretic algorithms they are built on. Yes, I agree that conceptually it is easy to get the symmetric case in the way you describe, and it may already be supported. But if not, it might be nontrivial to fit into the current code logic (I am not sure -- it might be easy).

Similar answer for your second point. PSRO tries to stay close to the game-theoretic foundations it is built upon, so it starts fresh with each new BR. Whether it is slower might depend a lot on the context. Often what you want is for PSRO to generate a rich enough "gamescape" to be able to represent specific strategies from the mixtures over the networks, and not starting the BRs from scratch could possibly lead to a pit where you repeatedly generate very similar BRs and have no way to escape. But I can see a counter-argument for some games where it is a waste of time because that richness is not needed.

Sorry for the confusion on the third point. If PSRO with a uniform meta solver and PPO based BR (with the above caveats or if they are easy to bypass) would work for your purposes, great! Otherwise, it might be nontrivial to hack the PSRO code to get the algorithm outlined in your image and might be easier to do it from scratch (still using the PPO code just not the PSRO code). If you go that route, I would love it if you contributed a basic version as a PR because it would be great to have in the suite. Hope that clarifies!


aadharna commented on June 19, 2024

might be easier to do it from scratch (still using the PPO code just not the PSRO code)

Gotcha. Regarding the diagram above, I already have code that does that (slowly build an archive of past solutions and train against them using PPO to update the same learner iteratively). It just results in bad agents (at the end of training, the agent only matches an MCTS agent with a search depth of 10 at Connect Four) even if I run it for a long time -- the learner just stagnates. Using the language of the Spinning Top paper, the learner is getting stuck in the extremely non-transitive section of learning.

Hence I was looking for a self-play implementation to use as a new baseline. So I think what I'm going to do first is port the tf PSRO code to pytorch and run a PPO-based PSRO with a uniform meta solver as my benchmark agent. Then, if I get a good agent I can start customizing that for my actual research. To that end, is there a pytorch-based PSRO implementation internally? I didn't see one.


lanctot commented on June 19, 2024

If that's the case then PSRO is likely to help because it was designed specifically for that case, but you might want to experiment with more than just the uniform meta-solver because that might be the underlying cause (opponent meta policy not changing fast enough for strategic variation to arise).

No, we don't, sorry: because we are at Google DeepMind, we support TF and JAX, since that is what we use for our research (mostly JAX nowadays). All of the PyTorch code in OpenSpiel has actually been due to external contributions. But I would think that the PSRO code should not be too dependent on the specific ML framework used, because the RL oracles just take in RL agent classes, which the PPO class inherits from. These details are all abstracted away, so it should just work (though it may not be included in the example). Have you run into any specific problems using PPO in the RLOracle?


aadharna commented on June 19, 2024

I almost have PPO integrated as the BR (with a slight bit of hacking) for PSRO. However, I am getting this error from the get_joint_strategy_from_marginals function in meta_strategies.py:

(Pdb) np.product(probas)
*** ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.

(Pdb) probas
[array([[0.5],
       [0.5]]), array([[0.5, 0.5]])]

Is this just supposed to be probas[0] * probas[1]? That gives me a 2x2 matrix via broadcast multiplication. I assume these are shaped differently because the first element is the column player's population and the second element is the row player's population? For the general solution, I could just flatten the two vectors and then take their outer product, no?
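
For concreteness, with the values above (just NumPy, to show what I mean):

import numpy as np

probas = [np.array([[0.5], [0.5]]), np.array([[0.5, 0.5]])]
# Flatten each player's marginal and take the outer product; for the
# two-player case this gives the 2x2 joint with every entry 0.25.
joint = np.outer(probas[0].flatten(), probas[1].flatten())
print(joint)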


Also, because I'm testing on Connect Four, I had to add a try/except to the rl_policy code:

try:
    self._obs["info_state"][cur_player] = (
        state.information_state_tensor(cur_player))
except Exception:
    # Connect Four doesn't implement the information state tensor, so fall
    # back to the observation tensor.
    self._obs["info_state"][cur_player] = (
        state.observation_tensor(cur_player))

Is that likely to break some subtle assumption?


lanctot commented on June 19, 2024

Hmm, I will ask @PaulFMMuller and/or Luke Marris to chime in on an answer for the first question, but you might not need to call that function. Do you need to extract the average strategies for any reason? Is your game small enough that you have space to store it? (Normally you do this to then go and compute exploitability, but maybe you don't intend to do that.)

That should work but there's likely a cleaner way to do it. After the environment step (or anytime by calling the get_timestep function IIRC), the environment gives back a time step object with the observations in it (and this will be the correct one depending on the environment). So the RL policy should probably use the ones in that timestep object. If your code can't get the timestep at that point, though, then what you have should work.


lanctot commented on June 19, 2024

Tagging @rezunli96 who might be able to answer the first question.


lanctot commented on June 19, 2024

@aadharna can you confirm that you're running into the problem with the basic PSRO using PPO as a best responder? (i.e. you didn't try to make the symmetric population version yet, right?)


rezunli96 commented on June 19, 2024

@aadharna For the first question, np.prod([array([[0.5], [0.5]]), array([[0.5, 0.5]])]) works perfectly fine for me and returns a 2x2 matrix with each element 0.25. Can I check what is your current numpy version?


aadharna commented on June 19, 2024

Do you need to extract the average strategies for any reason? Is your game small enough that you have space to store it? (Normally you do this to then go and compute exploitability, but maybe you don't intend to do that.)

For now, I want to do exactly what is usually done, especially while I'm adding something new and then running this as a benchmark.

can you confirm that you're running into the problem with the basic PSRO using PPO as a best responder? (i.e. you didn't try to make the symmetric population version yet, right?)

Yep. I haven't changed any of the PSRO logic. I have added an init_ppo_response function similar to init_br_response, plus a few small tweaks where PPO was failing (e.g., needing to wrap the timestep object in a list during step) or where the game choice caused a failure (e.g., using state.observation_tensor as mentioned above).
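
Roughly, my init_ppo_response follows the shape of the existing responder initializers in psro_v2_example.py (sketch only: the PPO wrapper class and its kwargs below are placeholders for my own code, and the RLOracle arguments mirror the DQN responder there):

from open_spiel.python.algorithms.psro_v2 import rl_oracle

def init_ppo_response(env, ppo_policy_class, num_players=2):
  # ppo_policy_class is my thin wrapper exposing the rl_agent-style step()
  # interface around the PyTorch PPO implementation (placeholder name).
  info_state_size = env.observation_spec()["info_state"][0]
  num_actions = env.action_spec()["num_actions"]
  agent_kwargs = {
      "info_state_size": info_state_size,  # placeholder kwargs for my wrapper
      "num_actions": num_actions,
  }
  oracle = rl_oracle.RLOracle(
      env,
      ppo_policy_class,
      agent_kwargs,
      number_training_episodes=10000)
  agents = [
      ppo_policy_class(env, player_id, **agent_kwargs)
      for player_id in range(num_players)
  ]
  for agent in agents:
    agent.freeze()
  return oracle, agents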

Can I check what is your current numpy version?

numpy version = '1.24.3'


Also, is there a way for me to silence pyspiel errors? Because of my try-catch, I keep getting the following error printed to the terminal and so I can't see any of the standard print outs: OpenSpiel exception: InformationStateTensorShape unimplemented.


Also, in the sample episode code:

  def sample_episode(self, unused_time_step, agents, is_evaluation=False):
    time_step = self._env.reset()
    cumulative_rewards = 0.0
    while not time_step.last():
      if time_step.is_simultaneous_move():
        action_list = []
        for agent in agents:
          output = agent.step(time_step, is_evaluation=is_evaluation)
          action_list.append(output.action)
        time_step = self._env.step(action_list)
        cumulative_rewards += np.array(time_step.rewards)
      else:
        player_id = time_step.observations["current_player"]

        # is_evaluation is a boolean that, when False, lets policies train. The
        # setting of PSRO requires that all policies be static aside from those
        # being trained by the oracle. is_evaluation could be used to prevent
        # policies from training, yet we have opted for adding frozen attributes
        # that prevents policies from training, for all values of is_evaluation.
        # Since all policies returned by the oracle are frozen before being
        # returned, only currently-trained policies can effectively learn.
        agent_output = agents[player_id].step(
            time_step, is_evaluation=is_evaluation)
        action_list = [agent_output.action]
        time_step = self._env.step(action_list)
        if not is_evaluation:
          # ppo wants this data for proper opt steps
          agents[player_id]._policy.post_step(time_step.rewards[player_id], time_step.last())
        cumulative_rewards += np.array(time_step.rewards)

    if not is_evaluation:
      for agent in agents:
        agent.step(time_step)

    return cumulative_rewards

Why is that last bit of code there, where all the agents take a step on the terminal state after the game is over?


lanctot commented on June 19, 2024

Brief responses because I only have a few minutes:

  • First one is definitely a NumPy backwards-incompatibility. Not the first time... we're on our third or fourth now :) That line also fails for me on NumPy 1.26.2. @rezunli96, which version of NumPy are you using? I'll talk to Paul & Luke tomorrow about a fix (which would need to handle the >2-player case as well). Meanwhile, if you have a proposed fix that works for the general case, please submit a PR.

  • Don't use try/except; inspect the game type, which contains this info. Example: state.get_game().get_type().provides_information_state_tensor is a bool. There's a similar one for observation_tensor. There's a lot of other useful info in the game type (see the top of spiel.h). See the sketch after this list.

  • You need that last agent step for turn-based games to signal to all the learning agents that the game ended and what the payoffs are. Imagine self-play DQN in Tic-Tac-Toe. If X takes a move to finish the game, O still needs to record O's last (state, action) pair that led to the state where X took the last action, and then record the transition as terminal. Without it, O would never see any terminal transitions (and hence no rewards). I may have explained this toward the end of the OpenSpiel tutorial video on YouTube (see the link on the main page).
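
For the second point, concretely, something like this in place of the try/except (a sketch only; the surrounding variables are the ones from your snippet):

# Check the game type once instead of catching the exception; both
# provides_information_state_tensor and provides_observation_tensor are
# booleans on the GameType.
game_type = state.get_game().get_type()
if game_type.provides_information_state_tensor:
    self._obs["info_state"][cur_player] = (
        state.information_state_tensor(cur_player))
else:
    self._obs["info_state"][cur_player] = (
        state.observation_tensor(cur_player))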

Hope this helps


rezunli96 commented on June 19, 2024

Mine is 1.23.5.

It is indeed very likely to be a backwards-incompatibility issue. It prints out:

/usr/local/lib/python3.10/dist-packages/numpy/core/fromnumeric.py:86: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)


lanctot commented on June 19, 2024

Okay. I have PPO-BR PSRO up and running.

Great! How did you fix the numpy issue? Can you copy your solution here, or submit a PR fix?


aadharna commented on June 19, 2024

How did you fix the numpy issue?

Replaced np.product with np.outer (which assumes a two-player game, so this isn't a total fix).
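
If it helps the discussion of a general fix, I think the N-player version would just reduce an outer product over each player's flattened marginal; an untested sketch:

import functools
import numpy as np

def joint_from_marginals(probas):
  # Outer-product the flattened per-player marginals one player at a time;
  # for two players this reduces to np.outer, and it also handles N > 2
  # (producing an N-dimensional joint table).
  flat = [np.asarray(p).flatten() for p in probas]
  return functools.reduce(np.multiply.outer, flat)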

Can you copy your solution here, or submit a PR fix?

I'll clean it up and put it into a PR tomorrow. (Note: I haven't been able to validate the implementation yet because it's very slow on Connect Four. It was also slow on Tic-Tac-Toe, but it did about 20 iterations over the course of yesterday.)


lanctot commented on June 19, 2024

Ok thanks.

Yeah it would be good to understand a bit more about what that code was designed for, because you are using it slightly out of context.

There are steps in that PSRO code that are intended for small (and imperfect information) games like Kuhn and Leduc poker. Specifically, the strategy conversion you pointed to traverses the entire game tree. The average strategy is used to compute exploitability (which requires several tree traversals).

That won't be possible in Connect Four: the game is too large. Also, in Tic-Tac-Toe the game will be exponentially larger than necessary if you use the information state tensor, due to perfect recall.

If you are looking to run PSRO on perfect information games, especially large ones, I would just skip the exploitability computation altogether (which would mean not having to compute the average strategy) and I would use observation state tensors.


lanctot commented on June 19, 2024

Replaced the np.product with np.outer (which assumes two player game, so this isn't a total fix)

Oh, ok. When I suggested submitting a PR, I only meant a fix for this; I will continue discussing a general fix with the team.

Do you think there is a clean implementation of PPO-BR that works within the PSRO code? If so, that'd be a welcome contribution too, but there is no rush because it is hard to find time to review the larger PRs these days anyway :)


aadharna commented on June 19, 2024

