
Comments (8)

StoneT2000 commented on August 23, 2024

So GPU sim is only helpful when you have a ton of workers/threads. In any simulator (ours, Isaac, MuJoCo), the GPU sim per thread is slower than the CPU sim per thread. It's just the fact that there are so many "threads" on a GPU that its overall throughput is higher.

You can activate the CPU sim automatically by setting num_envs=1. If you want multiple CPU simulations, that is also possible: you can use the gymnasium AsyncVectorEnv, which is easy to create by running the following:

import gymnasium as gym

env_kwargs = dict(obs_mode=obs_mode, control_mode=control_mode)
env = gym.make_vec(
    env_id, num_envs=num_envs,
    vectorization_mode="async", vector_kwargs=dict(context="spawn"),
    **env_kwargs,
)

If you are doing teleoperation, you should definitely use the CPU simulation backend. It runs faster, and the GUI lets you perturb the states of objects any way you like. An example of teleoperation code via a simple, hardware-free click+drag system is this one: https://maniskill.readthedocs.io/en/latest/user_guide/data_collection/teleoperation.html#click-drag-system which uses the CPU simulation.
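If memory serves, the click+drag demo on that page is launched with a command along these lines (check the linked docs for the exact entry point and flags):

python -m mani_skill.examples.teleoperation.interactive_panda -e "StackCube-v1"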

StoneT2000 commented on August 23, 2024

More notes on improving env.step() latency/fps:

Stripping out unused functions

For the absolute lowest latency you can also strip out functions in an environment. Environments are currently a bit more "RL focused", so by default they will generate state-based observations, rewards, etc.

e.g.

env = gym.make(env_id, obs_mode="none", reward_mode="none")

You can also try to avoid using any IK-based / end-effector controllers. They run an extra bit of computation to do inverse kinematics, which slows the sim down a tiny bit.
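A minimal sketch of picking a joint-space controller instead; "pd_joint_delta_pos" is the mode shown in the benchmark output further down, while "pd_ee_delta_pose" is one of the end-effector modes that runs IK each step:

import gymnasium as gym
import mani_skill.envs  # registers the ManiSkill environments

# joint-space control: no per-step IK solve
env = gym.make("PickCube-v1", control_mode="pd_joint_delta_pos")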

Modifying simulation configurations

The default simulation configurations for all environments are generally not heavily tuned. They are chosen so that, basically without any modification, the simulation will run as "intended" (very few to no object penetrations, no objects jittering around). This also means they run slower than peak performance.

For example, the PickCube-v1 environment seems to run fine if you lower the number of solver iterations of the physics engine from the default 15 to 8. Your FPS will shoot up just by reducing the solver iterations.
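A minimal sketch of that override, using the same sim_cfg keyword that appears in the test script later in this thread:

import gymnasium as gym
import mani_skill.envs  # registers the ManiSkill environments

# drop solver_position_iterations from the default 15 to 8
env = gym.make(
    "PickCube-v1",
    num_envs=512,
    sim_cfg=dict(scene_cfg=dict(solver_position_iterations=8)),
)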

Collisions

Another trick is to reduce the number of collisions in an environment, in the task itself and the robot; see https://maniskill.readthedocs.io/en/latest/user_guide/tutorials/custom_robots.html#fast-simulation-tricks

Fewer collisions mean less needs to be solved in each simulation step, meaning more FPS per thread.

Another trick is to use simple primitive shapes as opposed to convex collision meshes. In fact, the PickCube-v1 env can run even faster by simply deleting the table collision mesh and replacing it with a box collision.

Another trick is to keep as many objects as possible static, as opposed to dynamic/kinematic: https://maniskill.readthedocs.io/en/latest/user_guide/tutorials/custom_tasks/intro.html#building-actors
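A minimal sketch of these last two tricks together, assuming SAPIEN's ActorBuilder API as used inside a ManiSkill custom task's scene-loading hook; the sizes and pose here are illustrative:

import sapien

def _load_scene(self, options: dict):
    builder = self.scene.create_actor_builder()
    # primitive box collision instead of a convex mesh: cheaper to solve
    builder.add_box_collision(half_size=[0.7, 0.7, 0.025])
    builder.add_box_visual(half_size=[0.7, 0.7, 0.025])
    builder.initial_pose = sapien.Pose(p=[0, 0, -0.025])
    # build_static(): the table never moves, so the solver can skip it
    self.table = builder.build_static(name="table")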

Actually, after typing this out, a lot of these things should be collated into a documentation page listing GPU/CPU FPS/throughput improvement tricks.

rubendsa commented on August 23, 2024

Very cool, thanks so much for the detailed info @StoneT2000! I'll try benchmarking some of my tasks using the AsyncVectorEnv. Looks like there are also some good benchmarks here: https://github.com/haosulab/ManiSkill/blob/3b4c27e586698014aed9d66614cc58e2c1a464f8/mani_skill/examples/benchmarking/gpu_sim.py

I'm basically trying to get a large number of envs running (1000+) with ~2 640x480 RGB cameras per env, on a single large VM, with an env.step(action) rate around 60fps. I can do ~1500 envs at 60fps in Isaac Gym if it's state only (this includes some GPU-CPU copies for state serialization). With the 2x 640x480 RGB rendering, the performance in Isaac Gym is really bad (unusable for my use case). Given the 60fps constraint along with the large number of envs and env cameras, I was optimistic that batched GPU steps would pay off, but it appears this might not be the case.

StoneT2000 commented on August 23, 2024

I see, two 640x480 cameras is very big. The memory usage of 1500 parallel environments with 128x128 cameras and an extremely minimal shader (the renderer is basically already extremely optimized here in terms of GPU memory) is already near 20GB of GPU memory, iirc. I would guess the memory use of two 640x480 cameras at 1000 envs is easily 50GB+ of GPU memory. Another thing: if you need to render a lot in parallel, the CPU sim will not be helpful and will be slower, since you need to transfer CPU state to the renderer on the GPU.

A few questions to help see if we can help here:

  • Are 1000+ environments necessary? Is there any way to reduce this (and thus reduce the memory required to render so much)?
  • Are 640x480 image resolutions required, or can smaller resolutions be used (and reduce GPU memory)? Another solution is to use simple interpolation operations to scale a generated small image up to a bigger one if needed (see the sketch after this list).
  • Is 60FPS absolutely necessary, or how low can we cut it? Is this for teleoperation?
  • What GPUs do you have access to / what hardware do you plan to use? I can get a better estimate of what is possible in terms of FPS. The free Colab GPU is certainly not very strong and is much slower than e.g. a 3080 or 4090 GPU.
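On the interpolation point: a minimal sketch of upscaling a small rendered image, assuming obs comes back from the GPU sim as batched torch tensors (the keys match the observation space printed later in this thread); torch.nn.functional.interpolate is just one way to resize:

import torch
import torch.nn.functional as F

rgb = obs["sensor_data"]["base_camera"]["rgb"]         # (N, 120, 160, 3) uint8
x = rgb.permute(0, 3, 1, 2).float()                    # NCHW, float for bilinear
big = F.interpolate(x, size=(480, 640), mode="bilinear", align_corners=False)
big = big.round().to(torch.uint8).permute(0, 2, 3, 1)  # back to (N, 480, 640, 3)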

StoneT2000 commented on August 23, 2024

Here's the limit I can push on my 3080:

python -m mani_skill.examples.benchmarking.gpu_sim -e "PickCube-v1" -n=24 -o=rgbd --control-freq=100
# -------------------------------------------------------------------------- #
Task ID: PickCube-v1, 24 parallel environments, sim_backend=gpu
obs_mode=rgbd, control_mode=pd_joint_delta_pos
render_mode=sensors, sensor_details=RGBD(640x480), RGBD(640x480)
sim_freq=100, control_freq=100
observation space: Dict('agent': Dict('qpos': Box(-inf, inf, (24, 9), float32), 'qvel': Box(-inf, inf, (24, 9), float32)), 'extra': Dict('is_grasped': Box(False, True, (24,), bool), 'tcp_pose': Box(-inf, inf, (24, 7), float32), 'goal_pos': Box(-inf, inf, (24, 3), float32)), 'sensor_param': Dict('base_camera': Dict('extrinsic_cv': Box(-inf, inf, (24, 3, 4), float32), 'cam2world_gl': Box(-inf, inf, (24, 4, 4), float32), 'intrinsic_cv': Box(-inf, inf, (24, 3, 3), float32)), 'base_camera2': Dict('extrinsic_cv': Box(-inf, inf, (24, 3, 4), float32), 'cam2world_gl': Box(-inf, inf, (24, 4, 4), float32), 'intrinsic_cv': Box(-inf, inf, (24, 3, 3), float32))), 'sensor_data': Dict('base_camera': Dict('rgb': Box(0, 255, (24, 480, 640, 3), uint8), 'depth': Box(-32768, 32767, (24, 480, 640, 1), int16), 'segmentation': Box(-32768, 32767, (24, 480, 640, 1), int16)), 'base_camera2': Dict('rgb': Box(0, 255, (24, 480, 640, 3), uint8), 'depth': Box(-32768, 32767, (24, 480, 640, 1), int16), 'segmentation': Box(-32768, 32767, (24, 480, 640, 1), int16))))
(single) action space: Box(-1.0, 1.0, (8,), float32)
# -------------------------------------------------------------------------- #
start recording env.step metrics
env.step: 1487.386 steps/s, 61.974 parallel steps/s, 100 steps in 1.614s
start recording env.step+env.reset metrics
env.step+env.reset: 1476.223 steps/s, 61.509 parallel steps/s, 1000 steps in 16.258s

~60 FPS across 24 parallel environments where each has two 640x480 RGBD cameras.

This is the result of me making a few changes to the PickCube environment to push performance a bit further, at the cost of some other things.

rubendsa commented on August 23, 2024

This benchmark is super helpful @StoneT2000!

  • "Are 1000+ environments necessary? Is there any way to reduce this (and thus reduce the memory requirement to render so much)?"
    It could be lower; horizontally scaling across a bunch of GPU VMs is an option, although it's highly preferable to avoid this direction.
  • "Are 640x480 image resolutions required or can smaller resolutions be used (and reduce GPU memory)?"
    Definitely, 640x480 is very rich, particularly for small objects in a large scene. It could be reduced.
  • "Is 60FPS absolutely necessary or how low can we cut it? Is this for teleoperation?"
    Could probably cut it down to 30fps xD
  • "What GPUs do you have access to / what hardware do you plan to use? I can get a better estimate of what is possible in terms of FPS."
    L4s would likely be used, which aren't super great, but horizontal scaling will likely be needed regardless.

Dropping camera resolution + horizontal scaling might be the way to go for now. I haven't deep-dived through the SAPIEN codebase yet, but do you see any significant opportunity for latency improvement by dropping into cpp/PhysX?

StoneT2000 commented on August 23, 2024

Horizontal scaling works easily and is definitely your best bet.

If you have Colab premium you can use L4 GPUs and run the test code below, which modifies PickCube to have two smaller cameras (160x120). You can play around with this and tune it yourself. At best this nets ~128 parallel environments at ~31 calls to env.step per second.

import time

import gymnasium as gym
import numpy as np
import torch

import mani_skill.envs
from mani_skill.envs.tasks import PickCubeEnv
from mani_skill.sensors.camera import CameraConfig
from mani_skill.utils import sapien_utils
from mani_skill.utils.registration import register_env


@register_env("TestPickCube-v1", max_episode_steps=50, override=True)
class TestPickCubeEnv(PickCubeEnv):
    """modifies pick cube to use custom camera configs and put two of them for testing"""

    @property
    def _default_sensor_configs(self):
        pose = sapien_utils.look_at(eye=[0.3, 0, 0.6], target=[-0.1, 0, 0.1])
        return [CameraConfig("base_camera", pose, 160, 120, np.pi / 2, 0.01, 100),
                CameraConfig("base_camera2", pose, 160, 120, np.pi / 2, 0.01, 100)]


num_envs = 128
env = gym.make(
    "TestPickCube-v1",
    num_envs=num_envs,
    obs_mode="rgbd",
    sim_cfg=dict(control_freq=100, scene_cfg=dict(solver_position_iterations=8)),
)
env.unwrapped.print_sim_details()
obs, _ = env.reset(seed=0)

TIMESTEPS = 100
start_time = time.time()
for i in range(TIMESTEPS):
    # note that env.action_space is now a batched action space
    obs, rew, terminated, truncated, info = env.step(torch.from_numpy(env.action_space.sample()))

N = num_envs * TIMESTEPS
dt = time.time() - start_time
FPS = N / dt
print(f"Frames Per Second = {N} / {dt} = {FPS}")
print(f"Parallel Steps Per Second = {TIMESTEPS / dt}")

I get the following on Colab:

Frames Per Second = 12800 / 3.1169700622558594 = 4106.55211450321
Parallel Steps Per Second = 32.08243839455633

On the point of dropping into the cpp/PhysX level, I am not confident this will change much. Most of the FPS slowdown is dominated by rendering. You can first try using the direct SAPIEN Python API and build a task with raw SAPIEN. Dropping to the level below the Python API might not be worth the trouble since, again, the main bottleneck is the renderer and the Python overhead is fairly negligible.
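For reference, a minimal sketch of what "raw SAPIEN" looks like, assuming the SAPIEN 3 Python API with no ManiSkill task wrapper on top:

import sapien

scene = sapien.Scene()      # bare scene: physics + renderer, nothing else
scene.set_timestep(1 / 100)

builder = scene.create_actor_builder()
builder.add_box_collision(half_size=[0.02] * 3)
builder.add_box_visual(half_size=[0.02] * 3)
cube = builder.build(name="cube")
cube.set_pose(sapien.Pose(p=[0, 0, 0.02]))

for _ in range(100):
    scene.step()            # pure physics step: no observation/reward code runs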

StoneT2000 commented on August 23, 2024

Closing the issue now as it has become stale.
