
Comments (14)

xinghaochen avatar xinghaochen commented on June 8, 2024 2

@dhecloud We predict (u, v, d) directly. _transform_pose is used to convert normalized (u, v, d) coordinates in the cropped image into (u, v, d) in full-image coordinates.
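
For concreteness, here is a minimal sketch of that inverse mapping, assuming the crop window is known from its corners and the crop was resized to a square of input_size pixels. The helper name and signature are hypothetical, not the actual _transform_pose:

import numpy as np

def crop_to_full_uvd(joints, xstart, ystart, xend, yend, input_size=96):
    # joints: (J, 3) array of (u, v, d) predicted in the resized crop
    joints = joints.copy()
    joints[:, 0] = joints[:, 0] * (xend - xstart) / input_size + xstart
    joints[:, 1] = joints[:, 1] * (yend - ystart) / input_size + ystart
    # d is a real depth value, unchanged by the 2D crop and resize
    return joints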


guohengkai avatar guohengkai commented on June 8, 2024

@dhecloud Hi, your understanding is right. The augmentations are all 2D transformations, so they can easily be applied to the cropped 2D image just like in other 2D image tasks. Note that the labels should also be changed according to the transformation.
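
As an illustration of the idea (not the repo's actual code), here is a sketch that applies one shared 2D affine to both the depth crop and the (u, v) part of the labels; the use of cv2.warpAffine and all names are my assumptions:

import cv2
import numpy as np

def augment_2d(depth, joints, M):
    # warp the depth crop with the 2x3 affine M ...
    h, w = depth.shape
    depth_aug = cv2.warpAffine(depth, M, (w, h))
    # ... and apply the same affine to the (u, v) labels; d stays as-is
    joints = joints.copy()
    joints[:, :2] = joints[:, :2] @ M[:, :2].T + M[:, 2]
    return depth_aug, joints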


dhecloud avatar dhecloud commented on June 8, 2024

@guohengkai Thanks for your reply.

I have some confusion regarding the MSRA dataset. In your code, the bin files are loaded into a numpy array of shape (240, 320). I assume these are the height and width of the image, with each entry holding a single depth value.
However, in joints.txt the ground-truth (x, y, z) values are sometimes negative for x and y. For example, x for P0/5/00000_depth.bin is -0.747919 and y is -51.5306. Shouldn't the ground truth be >= 0, or am I interpreting this wrongly?


guohengkai avatar guohengkai commented on June 8, 2024

@dhecloud I think this is because the hands in some images of the MSRA dataset are out of the FOV. You can view the images to check. I'm not entirely sure, though.


xinghaochen avatar xinghaochen commented on June 8, 2024

@dhecloud In guohengkai's code, the labels are in the format (u, v, d), where u and v are pixel coordinates in the image. For example, u \in [0, 319] and v \in [0, 239] for the MSRA dataset. You should first convert the ground-truth labels from (x, y, z) to (u, v, d).
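
In formulas, this is the standard pinhole projection, with fx, fy the focal lengths and (cx, cy) the principal point; for MSRA, fx = fy = 241.42 and (cx, cy) = (160, 120) are the commonly used values:

u = x * fx / z + cx
v = y * fy / z + cy
d = z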


dhecloud avatar dhecloud commented on June 8, 2024

@guohengkai @xinghaochen Oh I see, that clarifies a lot; I thought xyz and uvd were interchangeable.
One more question. In your training, do you:

  1. predict (u,v,d) directly,
  2. or predict (x,y,z) then convert to (u,v,d) when drawing the pose?

It seems like you did the second one and then used _transform_pose to convert to (u, v, d). Just wanted to clarify.
Thanks a lot for your help!!


dhecloud avatar dhecloud commented on June 8, 2024

@xinghaochen @guohengkai Thanks a lot for your help!! Time to start training. :)


dhecloud avatar dhecloud commented on June 8, 2024

@xinghaochen Hi, could I check if my way of converting to uvd is correct? I got the formula from your other repo, which collates all the hand pose research.

def world2pixel(x):
    # project world (x, y, z) joints to pixel (u, v, d); x is a (J, 3) numpy array
    fx, fy, ux, uy = 241.42, 241.42, 160, 120  # MSRA camera intrinsics
    x[:, 0] = x[:, 0] * fx / x[:, 2] + ux  # u = x * fx / z + cx
    x[:, 1] = x[:, 1] * fy / x[:, 2] + uy  # v = y * fy / z + cy
    return x  # note: the input array is modified in place

When I tried visualizing the ground truth with OpenCV, it gave me weird thumb joints:
[screenshot: ground-truth joints drawn on the depth image]
Thank you so much for your help!!

EDIT: On closer inspection, it seems world2pixel is giving me horizontally flipped coordinates, i.e. if I flip the depth image and then draw (u, v, d), it works well.
[screenshot: joints drawn on the horizontally flipped depth image]
Is this normal, or did I miss some step in the conversion to (u, v, d)?


xinghaochen avatar xinghaochen commented on June 8, 2024

@dhecloud Hi, it seems the sample images you posted are from the MSRA dataset. As far as I can remember, the coordinate system of the MSRA pose annotations is a bit different from the conventional one. I first multiply y and z by -1 and then convert xyz to uvd, which matches exactly your situation of horizontally flipped coordinates.
It's OK to transform the coordinates as long as they correspond to the depth image.
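
In code, that preprocessing amounts to flipping the two axes before a projection like dhecloud's world2pixel above; a minimal sketch (the flip is exactly what the comment describes, the function name is mine):

import numpy as np

def msra_flip_axes(xyz):
    # MSRA annotations: negate y and z before projecting to (u, v, d)
    xyz = xyz.copy()
    xyz[:, 1] *= -1
    xyz[:, 2] *= -1
    return xyz

# usage (hypothetical): uvd = world2pixel(msra_flip_axes(joints_xyz))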


dhecloud avatar dhecloud commented on June 8, 2024

@xinghaochen Hi, thanks for your help. Just a minor follow-up: what did you set the probability of transforming your input depth to? I set mine to a 60% chance of a random translate/rotate/scale, but the training does not seem to be going well.


xinghaochen avatar xinghaochen commented on June 8, 2024

@dhecloud Hi, we set the probability to 100%, that is, each sample will go through random translation, rotation and scaling before being fed into the network for training.
How is the performance without data augmentation? You may first make sure the training without data augmentation works well and then add the data augmentation, so that you can find out whether the problem comes from data augmentation or not. If that's the case, what are the parameters of the ranges of random translation/rotation/scaling?
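
For instance, one random affine per sample could be drawn with ranges like those in the REN paper (translation in [-10, 10] px, rotation in [-180, 180] degrees, scale in [0.9, 1.1], as quoted below); a minimal sketch, with all names my own, not the repo's:

import cv2
import numpy as np

def random_affine(input_size=96, max_trans=10, max_deg=180, scales=(0.9, 1.1)):
    # compose a random rotation and scale about the crop center with a random shift
    angle = np.random.uniform(-max_deg, max_deg)
    scale = np.random.uniform(*scales)
    center = (input_size / 2, input_size / 2)
    M = cv2.getRotationMatrix2D(center, angle, scale)  # 2x3 affine
    M[:, 2] += np.random.uniform(-max_trans, max_trans, size=2)  # random translation
    return M

Such an affine would then be applied to every sample (probability 100%), e.g. with a warp like the augment_2d sketch earlier.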


dhecloud avatar dhecloud commented on June 8, 2024

@xinghaochen Without augmentation, the smooth L1 loss goes down to around 6-10 after 150 epochs. What was your loss at the end of training? I will probably redo all the training again.

I tried to follow the parameters in your paper: random [-10, 10] horizontal and [-10, 10] vertical translation, and rotation in [-180, 180] degrees. For scaling, I did not sample uniformly between 0.9 and 1.1, but instead randomly picked one of [0.9, 0.96, 1, 1.04, 1.1].

Could I clarify something? I read in this issue that you did the augmentation after cropping to 96x96, and I did it the same way. For example, if I horizontally translate the 96x96 input to the right by 10 pixels, I add 10 to the u coordinate of each corresponding joint. Is this right? Does translating the resized 96x96 depth image by 10 to the right correspond to the joints moving to the right by 10 too?


guohengkai avatar guohengkai commented on June 8, 2024

@dhecloud Have you normalized the joints according to the cropping? If so, what you have done is right.


dhecloud avatar dhecloud commented on June 8, 2024

@guohengkai Hi, no, I don't think so. All I did was convert xyz to uvd for the joints. This might be the reason why. The process is the same as in _crop_image, right?

Edit: this is my code for normalizing the joints:

import cv2
import numpy as np

def _normalize_joints(joints, center, is_debug=False):
    # map full-image (u, v, d) joints into 96x96 cropped-image coordinates
    _fx, _fy, _ux, _uy = 241.42, 241.42, 160, 120
    _cube_size = 150
    _input_size = 96
    # crop window around the hand center, projected to pixel bounds
    xstart = center[0] - _cube_size / center[2] * _fx
    xend = center[0] + _cube_size / center[2] * _fx
    ystart = center[1] - _cube_size / center[2] * _fy
    yend = center[1] + _cube_size / center[2] * _fy
    # affine mapping the crop window onto the 96x96 input
    src = [(xstart, ystart), (xstart, yend), (xend, ystart)]
    dst = [(0, 0), (0, _input_size - 1), (_input_size - 1, 0)]
    trans = cv2.getAffineTransform(np.array(src, dtype=np.float32),
                                   np.array(dst, dtype=np.float32))
    joints = get_translated_points(joints.reshape(21, 3), trans)
    return joints

def get_translated_points(joints, M):
    # apply the 2x3 affine M to the (u, v) columns of each joint; d is untouched
    for i in range(len(joints)):
        x = joints[i][0]
        y = joints[i][1]
        joints[i][0] = M[0, 0] * x + M[0, 1] * y + M[0, 2]
        joints[i][1] = M[1, 0] * x + M[1, 1] * y + M[1, 2]
    return joints

I drew it on the 96x96 image and it looks fine. Just one question, though: the d in (u, v, d) is largely untouched. d is usually in the 200-300 range while u and v are in [0, 96]. This difference is fine, right? Since the u and v coordinates are the most important when predicting.
