
Comments (14)

xinghaochen avatar xinghaochen commented on June 8, 2024 2

@dhecloud We predict (u, v, d) directly. _transform_pose is used to convert normalized (u, v, d) coordinates in the cropped image into (u, v, d) in full-image coordinates.
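
For concreteness, here is a minimal sketch of that inverse mapping, assuming the crop window is known from its corners and the crop was resized to a square of input_size pixels. The helper name and signature are hypothetical, not the actual _transform_pose:

import numpy as np

def crop_to_full_uvd(joints, xstart, ystart, xend, yend, input_size=96):
    # joints: (J, 3) array of (u, v, d) predicted in the resized crop
    joints = joints.copy()
    joints[:, 0] = joints[:, 0] * (xend - xstart) / input_size + xstart
    joints[:, 1] = joints[:, 1] * (yend - ystart) / input_size + ystart
    # d is a real depth value, unchanged by the 2D crop and resize
    return joints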


guohengkai avatar guohengkai commented on June 8, 2024

@dhecloud Hi, your understanding is right. The augmentations are all 2D transformations, so they can easily be applied to the cropped 2D image just like in other 2D image tasks. Note that the labels should also be changed according to the transformation.
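
As an illustration of the idea (not the repo's actual code), here is a sketch that applies one shared 2D affine to both the depth crop and the (u, v) part of the labels; the use of cv2.warpAffine and all names are my assumptions:

import cv2
import numpy as np

def augment_2d(depth, joints, M):
    # warp the depth crop with the 2x3 affine M ...
    h, w = depth.shape
    depth_aug = cv2.warpAffine(depth, M, (w, h))
    # ... and apply the same affine to the (u, v) labels; d stays as-is
    joints = joints.copy()
    joints[:, :2] = joints[:, :2] @ M[:, :2].T + M[:, 2]
    return depth_aug, joints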


dhecloud avatar dhecloud commented on June 8, 2024

@guohengkai Thanks for your reply.

I have some confusion regarding the MSRA dataset. In your code, the bin files are loaded into a numpy array of shape (240, 320). I assume these are the height and width of the image, with each entry holding a single depth value.
However, in joints.txt the ground-truth (x, y, z) values are sometimes negative for x and y. For example, x for P0/5/00000_depth.bin is -0.747919 and y is -51.5306. Shouldn't the ground truth be >= 0, or am I interpreting this wrongly?


guohengkai avatar guohengkai commented on June 8, 2024

@dhecloud I think this is because the hands in some images of the MSRA dataset are out of the FOV. You can view the images to check. I'm not entirely sure, though.


xinghaochen avatar xinghaochen commented on June 8, 2024

@dhecloud In guohengkai's code, the labels are in the format (u, v, d), where u and v are pixel coordinates in the image. For example, u \in [0, 319] and v \in [0, 239] for the MSRA dataset. You should first convert the ground-truth labels from (x, y, z) to (u, v, d).
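
In formulas, this is the standard pinhole projection, with fx, fy the focal lengths and (cx, cy) the principal point; for MSRA, fx = fy = 241.42 and (cx, cy) = (160, 120) are the commonly used values:

u = x * fx / z + cx
v = y * fy / z + cy
d = z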


dhecloud avatar dhecloud commented on June 8, 2024

@guohengkai @xinghaochen Oh I see, that clarifies a lot; I thought xyz and uvd were interchangeable.
One more question. In your training, do you:

  1. predict (u,v,d) directly,
  2. or predict (x,y,z) then convert to (u,v,d) when drawing the pose?

It seems like you did the second one and then used _transform_pose to convert to (u, v, d). Just wanted to clarify.
Thanks a lot for your help!!


dhecloud avatar dhecloud commented on June 8, 2024

@xinghaochen @guohengkai Thanks a lot for your help!! Time to start training. :)


dhecloud avatar dhecloud commented on June 8, 2024

@xinghaochen Hi, could I check if my way of converting to uvd is correct? I got the formula from your other repo, which collates all the hand pose research.

def world2pixel(x):
    # project world (x, y, z) joints to pixel (u, v, d); x is a (J, 3) numpy array
    fx, fy, ux, uy = 241.42, 241.42, 160, 120  # MSRA camera intrinsics
    x[:, 0] = x[:, 0] * fx / x[:, 2] + ux  # u = x * fx / z + cx
    x[:, 1] = x[:, 1] * fy / x[:, 2] + uy  # v = y * fy / z + cy
    return x  # note: the input array is modified in place

When I tried visualizing the ground truth with OpenCV, it gave me weird thumb joints:
[screenshot: ground-truth joints drawn on the depth image]
Thank you so much for your help!!

EDIT: On closer inspection, it seems world2pixel is giving me horizontally flipped coordinates, i.e. if I flip the depth image and then draw (u, v, d), it works well.
[screenshot: joints drawn on the horizontally flipped depth image]
Is this normal, or did I miss some step in the conversion to (u, v, d)?


xinghaochen avatar xinghaochen commented on June 8, 2024

@dhecloud Hi, it seems the sample images you posted are from the MSRA dataset. As far as I can remember, the coordinate system of the MSRA pose annotations is a bit different from the conventional one. I first multiply y and z by -1 and then convert xyz to uvd, which matches exactly your situation of horizontally flipped coordinates.
It's OK to transform the coordinates as long as they correspond to the depth image.
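
In code, that preprocessing amounts to flipping the two axes before a projection like dhecloud's world2pixel above; a minimal sketch (the flip is exactly what the comment describes, the function name is mine):

import numpy as np

def msra_flip_axes(xyz):
    # MSRA annotations: negate y and z before projecting to (u, v, d)
    xyz = xyz.copy()
    xyz[:, 1] *= -1
    xyz[:, 2] *= -1
    return xyz

# usage (hypothetical): uvd = world2pixel(msra_flip_axes(joints_xyz))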


dhecloud avatar dhecloud commented on June 8, 2024

@xinghaochen Hi, thanks for your help. Just a minor follow-up: what did you set the probability of transforming your input depth to? I set mine to a 60% chance of a random translate/rotate/scale, but the training does not seem to be going well.


xinghaochen avatar xinghaochen commented on June 8, 2024

@dhecloud Hi, we set the probability to 100%, that is, each sample will go through random translation, rotation and scaling before being fed into the network for training.
How is the performance without data augmentation? You may first make sure the training without data augmentation works well and then add the data augmentation, so that you can find out whether the problem comes from data augmentation or not. If that's the case, what are the parameters of the ranges of random translation/rotation/scaling?
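
For instance, one random affine per sample could be drawn with ranges like those in the REN paper (translation in [-10, 10] px, rotation in [-180, 180] degrees, scale in [0.9, 1.1], as quoted below); a minimal sketch, with all names my own, not the repo's:

import cv2
import numpy as np

def random_affine(input_size=96, max_trans=10, max_deg=180, scales=(0.9, 1.1)):
    # compose a random rotation and scale about the crop center with a random shift
    angle = np.random.uniform(-max_deg, max_deg)
    scale = np.random.uniform(*scales)
    center = (input_size / 2, input_size / 2)
    M = cv2.getRotationMatrix2D(center, angle, scale)  # 2x3 affine
    M[:, 2] += np.random.uniform(-max_trans, max_trans, size=2)  # random translation
    return M

Such an affine would then be applied to every sample (probability 100%), e.g. with a warp like the augment_2d sketch earlier.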


dhecloud avatar dhecloud commented on June 8, 2024

@xinghaochen Without augmentation, the smooth L1 loss goes down to around 6-10 after 150 epochs. What was your loss at the end of training? I will probably redo all the training again.

I tried to follow the parameters in your paper: random [-10, 10] horizontal and [-10, 10] vertical translation, and rotation in [-180, 180] degrees. For scaling, I did not sample uniformly between 0.9 and 1.1, but instead randomly picked one of [0.9, 0.96, 1, 1.04, 1.1].

Could I clarify something? I read in this issue that you did the augmentation after cropping to 96x96, and I did it the same way. For example, if I horizontally translate the 96x96 input to the right by 10 pixels, I add 10 to the u coordinate of each corresponding joint. Is this right? Does translating the resized 96x96 depth image by 10 to the right correspond to the joints moving to the right by 10 too?


guohengkai avatar guohengkai commented on June 8, 2024

@dhecloud Have you normalized the joints according to the cropping? If so, what you have done is right.


dhecloud avatar dhecloud commented on June 8, 2024

@guohengkai Hi, no, I don't think so. All I did was convert xyz to uvd for the joints. This might be the reason why. The process is the same as in _crop_image, right?

Edit: this is my code for normalizing the joints:

import cv2
import numpy as np

def _normalize_joints(joints, center, is_debug=False):
    # map full-image (u, v, d) joints into 96x96 cropped-image coordinates
    _fx, _fy, _ux, _uy = 241.42, 241.42, 160, 120
    _cube_size = 150
    _input_size = 96
    # crop window around the hand center, projected to pixel bounds
    xstart = center[0] - _cube_size / center[2] * _fx
    xend = center[0] + _cube_size / center[2] * _fx
    ystart = center[1] - _cube_size / center[2] * _fy
    yend = center[1] + _cube_size / center[2] * _fy
    # affine mapping the crop window onto the 96x96 input
    src = [(xstart, ystart), (xstart, yend), (xend, ystart)]
    dst = [(0, 0), (0, _input_size - 1), (_input_size - 1, 0)]
    trans = cv2.getAffineTransform(np.array(src, dtype=np.float32),
                                   np.array(dst, dtype=np.float32))
    joints = get_translated_points(joints.reshape(21, 3), trans)
    return joints

def get_translated_points(joints, M):
    # apply the 2x3 affine M to the (u, v) columns of each joint; d is untouched
    for i in range(len(joints)):
        x = joints[i][0]
        y = joints[i][1]
        joints[i][0] = M[0, 0] * x + M[0, 1] * y + M[0, 2]
        joints[i][1] = M[1, 0] * x + M[1, 1] * y + M[1, 2]
    return joints

I drew it on the 96x96 image and it looks fine. Just one question, though: the d in (u, v, d) is largely untouched. d is usually in the 200-300 range while u and v are in [0, 96]. This difference is fine, right? Since the u and v coordinates are the most important when predicting.
