Computer Vision Course 2024-01
pilab-cau / computervision-2401
License: Apache License 2.0
In the lecture, @yjyoo3312 explained that random cropping of images may exclude object regions, which can decrease accuracy.
However, I think Cut-Mix and Cut-Out can also exclude object regions, as shown in the image below. If the object region is small, Cut-Out may randomly cut a patch that contains it, and Cut-Mix has a similar issue. Is it necessary to additionally annotate where the object is and its size to avoid excluding the object region?
Hello Professor Youngjoon Yoo
This is a question I asked you in person in a class quite a while ago, and I wanted to write down my question and share it with other students.
The question I asked at the time was, "You said that the impulse function can be used to check the value of the kernel, but in computer vision we already know the value of the kernel, so why do we need to use the impulse function?"
My understanding of the answer was as follows.
'Impulse functions are used not only in computer vision but also in signal processing, where there is a real reason to use them because the kernel may be unknown. In computer vision, where the kernel is known, the impulse function is only a mathematical tool imported from signal processing.'
I'm raising this issue to confirm that my understanding is correct.
Hyunwoong LIM
Hello, professor @yjyoo3312. I understand that both pointwise convolution and depthwise convolution can reduce the number of parameters because they reduce the dimensions. So, do the two convolutions differ in the degree to which they reduce the parameters?
-Kim JiHyeon
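A quick back-of-the-envelope sketch may help frame the question (the layer sizes below are made-up examples, not values from the lecture): both reduce parameters, but to different degrees, and depthwise-separable blocks combine the two.

```python
# Parameter counts for standard, depthwise, and pointwise convolutions.
# C_in, C_out, K are hypothetical example sizes, not values from the slides.
C_in, C_out, K = 64, 128, 3

standard = C_in * C_out * K * K      # one KxK filter per (input, output) pair
depthwise = C_in * K * K             # one KxK filter per input channel only
pointwise = C_in * C_out             # 1x1 filters that only mix channels

separable = depthwise + pointwise    # depthwise separable = depthwise + pointwise
print(standard, depthwise, pointwise, separable)  # 73728 576 8192 8768
```

With these example sizes, the separable pair costs roughly an order of magnitude fewer parameters than one standard convolution.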
The reason for using the traditional bottleneck architecture is to reduce the number of parameters and computational cost. The inverted bottleneck, as I understand it, is illustrated on the right. I am curious if this inverted bottleneck structure, despite the intermediate expansion of channels, affects the number of parameters or computational cost.
Sumin Park
Thank you.
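As a rough numeric sketch of this question (my own illustrative channel counts, not the lecture's): the intermediate expansion does increase parameters and FLOPs relative to a classic bottleneck, but because the expanded 3x3 stage is depthwise, the cost stays far below what a dense 3x3 at that width would require.

```python
C, t = 64, 6                          # hypothetical input channels, expansion factor

# Classic ResNet bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand
mid = C // 4
classic = C * mid + mid * mid * 9 + mid * C

# Inverted bottleneck (MobileNet v2 style): 1x1 expand -> depthwise 3x3 -> 1x1 project
exp = C * t
inverted = C * exp + exp * 9 + exp * C

dense_3x3_at_width = exp * exp * 9    # what a NON-depthwise 3x3 would cost there
print(classic, inverted, dense_3x3_at_width)  # 4352 52608 1327104
```

So the expansion is affordable precisely because the middle stage is depthwise.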
I know that "Nesterov Momentum anticipates the next step's location by first applying momentum in the current direction from the current position, then computing the gradient at that position, and moving in that direction once more."
I think I have some understanding of what Nesterov momentum is. However, I don't understand the role of the `if nesterov` branch in the pseudocode in the image above.
My understanding is that if Nesterov momentum is used, it should be applied at every step, and if not, it should not be applied at all. I don't see how an `if nesterov` condition can exist within a single step.
I would be grateful if you could explain the `if nesterov` branch.
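My reading of this, as a sketch in the style of PyTorch's documented SGD pseudocode (not the exact code from the slide): `nesterov` is a fixed hyperparameter chosen when the optimizer is created, so the `if nesterov` branch runs at every step but always takes the same side; it selects which update formula is used, rather than deciding per step.

```python
# A sketch of SGD with momentum, following PyTorch-style pseudocode.
# `nesterov` is a fixed hyperparameter: the `if nesterov` branch is not a
# per-step decision, it simply selects which formula runs at EVERY step.
def sgd_step(w, grad, buf, lr=0.1, momentum=0.9, nesterov=False):
    buf = momentum * buf + grad          # velocity (momentum buffer) update
    if nesterov:
        step = grad + momentum * buf     # look-ahead gradient
    else:
        step = buf
    return w - lr * step, buf

w, buf = 1.0, 0.0
for _ in range(3):                       # the same branch is taken on every step
    w, buf = sgd_step(w, 2 * w, buf, nesterov=True)
print(w)
```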
@yjyoo3312 Hello Professor, my name is Jeon Yonghyeon, and I am currently enrolled in your class. I am writing to inquire about the reasons behind the differences in the algorithms of the Canny edge operator and the Harris corner detector.
In the images above, the algorithm for the Canny edge operator applies a Sobel filter to a Gaussian-smoothed image, whereas the Harris corner detector seems to calculate the gradient with the Sobel filter first, followed by Gaussian smoothing. What is the reason for this difference in the order of operations between the two algorithms?
My hypothesis is that the Harris corner detector necessitates the calculation of the window function (Gaussian smoothing) later in the process due to transformations in the formula.
However, in Lecture 3, there is an image showing the problems that arise when calculating the gradient before smoothing. Does the Harris corner detector not encounter these problems? Or would it be acceptable for the Canny edge detector to also compute the gradient before applying Gaussian smoothing?
Thanks for reading my issue!
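A small 1-D numerical sketch of one possible answer (my own toy example, not from the lecture): purely linear operations like Gaussian smoothing and derivative filtering commute, so for Canny the order is mathematically interchangeable; Harris, however, smooths the products of gradients, and that nonlinear step cannot be moved before differentiation.

```python
import numpy as np

sig = np.array([0., 0., 1., 3., 6., 3., 1., 0., 0.])  # toy 1-D signal
gauss = np.array([0.25, 0.5, 0.25])   # small smoothing kernel
deriv = np.array([-1., 0., 1.])       # central-difference kernel

# Canny's case: smoothing and differentiation are both linear, so full
# convolution gives identical results in either order.
a = np.convolve(np.convolve(sig, gauss), deriv)   # smooth, then derivative
b = np.convolve(np.convolve(sig, deriv), gauss)   # derivative, then smooth
print(np.allclose(a, b))              # True: linear operations commute

# Harris smooths PRODUCTS of gradients (e.g. Ix*Ix); squaring is nonlinear,
# so the Gaussian window cannot be moved before the derivative.
gx = np.convolve(sig, deriv)
c = np.convolve(gx * gx, gauss)       # smooth the squared gradient (Harris)
d = np.convolve(gx, gauss) ** 2       # squaring a smoothed gradient differs
print(np.allclose(c, d))              # False: the product breaks commutation
```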
This issue is about how group convolution achieves a computation reduction.
We would appreciate it if you could confirm that this math is correct and let us know if you have any corrections.
The idea of a group convolution is to divide the channels into groups.
However, a common misconception is that since you're dividing the channels into G groups, you're doing G separate convolutions, so there's no computational gain.
The reason for this misconception is that you must divide not only the input channels but also each kernel's channels into groups; this is sometimes well illustrated with pictures but is easy to miss.
Therefore, I have attached the math for this in the photo below.
Thank you.
Hyunwoong Lim
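To complement the photo, here is the same argument as a small counting sketch (hypothetical layer sizes): because each group's kernels see only C_in/G input channels, the parameter count and the multiply count both shrink by a factor of G.

```python
# Parameter/multiply counts for group convolution: dividing BOTH the input
# channels and each kernel's depth into G groups shrinks the cost by 1/G.
C_in, C_out, K, H, W = 64, 64, 3, 32, 32   # hypothetical layer sizes

def conv_params(c_in, c_out, k, groups=1):
    # each output channel sees only c_in // groups input channels
    return c_out * (c_in // groups) * k * k

for G in (1, 2, 4, 8):
    p = conv_params(C_in, C_out, K, G)
    print(G, p, p * H * W)   # parameters, multiplies per output feature map
```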
Hi, professor! I have a question about obtaining the weight parameter w of bounding-box regression.
When calculating w, the equation appears to use an MSE loss in ridge-regression form.
But you mentioned that BBox regression is not an easy process, so it cannot be solved using just an MSE loss.
I am confused about this part. Could you explain a little more about this issue?
Thank you!!!
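For reference, a minimal sketch of the ridge-regression closed form the slide appears to use (toy random data; the feature matrix X and targets t here are placeholders, not actual R-CNN pooled features):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # 100 proposals, 5-dim pooled features (toy)
t = rng.normal(size=(100,))          # regression targets, e.g. (G_x - P_x) / P_w
lam = 1.0                            # ridge regularization strength

# closed form: w = (X^T X + lam*I)^{-1} X^T t
w = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ t)
print(w.shape)                       # (5,)
```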
When I was trying to find the homography in the picture above, I was told that the array built up by A.append would make sense if I worked through it myself.
I created an issue to confirm my understanding and to share it with other students.
Here's what I want to write:
Also, as I understand it, I would like to know why it is okay to fix h_22 at 1.
If we extend this to v_1, we can see that the equation is satisfied.
However, I still don't understand why it's okay to set h_22 = 1.
I would like you to confirm whether the following understanding and explanation are correct.
Thank you.
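A minimal DLT sketch of my understanding (a toy example with a made-up point correspondence): each point pair appends two rows to A, the SVD gives h only up to an arbitrary scale, and that scale freedom is exactly why dividing through to force h_22 = 1 is allowed, as long as the true h_22 is nonzero.

```python
import numpy as np

# DLT sketch: each pair (x, y) -> (u, v) appends two rows to A; Ah = 0 is
# solved by SVD, and h is defined only up to scale, which is why fixing
# h_22 = 1 is allowed (whenever the true h_22 is nonzero).
def homography(src, dst):
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.array(A))
    h = Vt[-1]                          # null-space vector (up to scale)
    return (h / h[-1]).reshape(3, 3)    # normalize so h_22 = 1

src = [(0, 0), (1, 0), (1, 1), (0, 1)]
dst = [(0, 0), (2, 0), (2, 2), (0, 2)]  # a pure scaling by 2 (toy case)
print(np.round(homography(src, dst), 6))
```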
When you upload the issue, please follow the title format in this issue.
Hello, @yjyoo3312. I have a question about training the model.
When training an AI model, many factors besides the model architecture determine its performance.
It can be challenging to identify whether a high loss is due to issues with the architecture itself or with the training setting (including the optimizer, hyperparameters, learning-rate scheduler, or training duration).
Very often I can't decide whether the architecture or the training setting is the problem.
Do you have any personal tips for finding where the high loss comes from?
Thank you!!
It seems like convolution layers produce the desired number of outputs for the specified kernels even without the ReLU function. Then, for page 55, what is the necessity of the ReLU function here? I'm curious what output would be produced if the ReLU computation were added, compared to the output without it.
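One standard way to see the necessity (a toy numpy sketch with made-up matrices, not page 55's network): without a nonlinearity between them, two stacked linear layers collapse into a single linear map, so the extra layer adds no expressive power.

```python
import numpy as np

W1 = np.array([[1., -1.],
               [2.,  0.]])              # toy "layer 1" weights
W2 = np.array([[1., 1.]])               # toy "layer 2" weights
x = np.array([1., 2.])

two_layers = W2 @ (W1 @ x)              # no activation in between
one_layer = (W2 @ W1) @ x               # a single equivalent linear layer
print(np.allclose(two_layers, one_layer))   # True: depth collapsed away

relu = lambda z: np.maximum(z, 0)
with_relu = W2 @ relu(W1 @ x)           # the nonlinearity breaks the collapse
print(two_layers[0], with_relu[0])      # 1.0 2.0
```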
Minjung Kim (20210172)
@yjyoo3312 I have a question about the fast approximation in Harris corner detection!
As you can see in this figure's highlighted formula, the theta value is computed using eigenvalues of matrix M.
But you mentioned that the benefit of this fast approximation is that we do not need to calculate the eigenvalues.
How can I understand this part?
Thank you! :)
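My reading of this point, as a small numeric check (M below is a made-up 2x2 structure tensor): since det(M) = λ1·λ2 and trace(M) = λ1 + λ2, the response R = det(M) − k·trace(M)² can be computed without ever running an eigen-decomposition, and that is the "fast" part.

```python
import numpy as np

# The Harris response R = det(M) - k * trace(M)^2 equals
# lam1*lam2 - k*(lam1 + lam2)^2, so det and trace are all we need.
k = 0.04
M = np.array([[10.0, 3.0],
              [3.0, 2.0]])             # toy structure tensor (symmetric)

lam = np.linalg.eigvalsh(M)            # the "slow" way, via eigenvalues
via_eigs = lam[0] * lam[1] - k * (lam[0] + lam[1]) ** 2
via_det_trace = np.linalg.det(M) - k * np.trace(M) ** 2

print(np.isclose(via_eigs, via_det_trace))   # True: same response value
```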
Hello, @yjyoo3312. My name is JongHan Leem.
Upon reviewing Lecture 1, the part where we discuss the similarity of two images using the Bhattacharyya distance of their histograms, I noticed that the formula for the Bhattacharyya distance might be incorrect.
According to the definition of the Bhattacharyya distance, D_B(p, q) = -ln(BC(p, q)), where the Bhattacharyya coefficient is BC(p, q) = \sum_i \sqrt{p_i q_i}.
However, in the code, the Bhattacharyya coefficient is calculated as:
bc = np.sqrt(np.sum(hist1 * hist2))
Also, since the Bhattacharyya distance measures the similarity between two probability distributions, the input histograms should be normalized so that they can be treated as probability distributions, i.e. normalized with the L1 norm so that the elements of each histogram sum to 1. (Currently, the code normalizes the histograms with the L2 norm.)
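A minimal sketch of the fix being proposed (with my own toy histogram, not data from the notebook): L1-normalize first, and take the square root elementwise inside the sum rather than once over the whole sum.

```python
import numpy as np

# Sketch of the corrected computation: L1-normalize, then
# BC = sum(sqrt(p * q)) with the sqrt INSIDE the sum, and D_B = -ln(BC).
def bhattacharyya_distance(hist1, hist2):
    p = hist1 / np.sum(hist1)            # L1 normalization
    q = hist2 / np.sum(hist2)
    bc = np.sum(np.sqrt(p * q))          # coefficient, in [0, 1]
    return -np.log(bc)

h = np.array([1.0, 2.0, 3.0])            # toy histogram
print(bhattacharyya_distance(h, h))      # identical histograms -> 0.0
```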
It might be a subtle thing, but I will create a PR about this!
I also created a Jupyter Notebook file to show some visualizations :)
@yjyoo3312
I have a question about lecture 3.
In the lecture slide, the Sobel filter S_x is:
But in the source code of lecture 3, you used filter2D without flipping the Sobel kernel. I heard that cv2.filter2D actually computes correlation, not convolution. OpenCV doc: https://docs.opencv.org/2.4/modules/imgproc/doc/filtering.html#filter2d
h_x = (1/8)*np.array([[-1.0, 0.0, 1.0],
[-2.0, 0.0, 2.0],
[-1.0, 0.0, 1.0]])
h_y = (1/8)*np.array([[1.0, 2.0, 1.0],
[0.0, 0.0, 0.0],
[-1.0, -2.0, -1.0]])
lenna_grad_x = cv2.filter2D(lenna_gaussian, -1, h_x, borderType=cv2.BORDER_CONSTANT)
lenna_grad_y = cv2.filter2D(lenna_gaussian, -1, h_y, borderType=cv2.BORDER_CONSTANT)
So I was wondering: doesn't the kernel of the filter have to be flipped?
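A quick check of why the unflipped kernel still works in this particular case (using the h_x from the code above): a 180° flip of this Sobel kernel only negates it, so correlation and true convolution differ only in sign, and edge magnitudes are unchanged. For an asymmetric kernel, the flip would actually matter.

```python
import numpy as np

# cv2.filter2D computes correlation; true convolution flips the kernel
# 180 degrees first. For this Sobel kernel, flipping only negates it.
h_x = (1 / 8) * np.array([[-1.0, 0.0, 1.0],
                          [-2.0, 0.0, 2.0],
                          [-1.0, 0.0, 1.0]])

flipped = np.flip(h_x)                   # 180-degree flip (both axes)
print(np.allclose(flipped, -h_x))        # True: convolution here would just
                                         # negate the correlation result
```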
Hello,
I am writing to ask about the differences between the ResNet architectures shown in lecture 12-1 (page 15) and lecture 11 (page 45). It appears that there are discrepancies in the detailed structure of ResNet between the two slides, particularly in terms of the block structure and the filter size of the pooling layer, as highlighted by the boxes in the images. Could you explain the reasons behind these differences?
Thank you for your time.
I have a question about the notation in Fully Connected Layer slide, on page 6 of lecture 10. I think the notation "# hidden layer" should be "# hidden units" or "# hidden neurons", because we are calculating the dimensions within a layer.
Is my opinion correct? Or, do we use the two notations interchangeably?
I understand the idea of the residual block.
For an input x, the plain block is trained to produce the output H(x).
However, in the residual block, we define and train a residual F(x) = H(x) - x, so the block outputs F(x) + x.
Since it is very difficult to learn the ideal mapping H(x) directly, we learn F(x), which is a more trainable form.
By adding the input x to F(x), the residual method brings stability to the optimization process: the block learns only the additional information it needs, while the input itself is passed through unchanged.
Intuitively, it seems that adding x to otherwise identical blocks would lead to more computation.
However, the overall residual network has rather fewer FLOPs.
So my question is:
Thank you!
HaeSeong Kim
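The FLOPs point above can be sketched in a few lines (toy 1-D "layers" of my own, not actual convolutions): the skip connection adds only an elementwise addition, which is negligible next to the matrix products inside the block.

```python
import numpy as np

# Toy sketch of plain vs residual computation: the skip connection costs
# only an O(n) addition, versus the O(n^2) matrix products in the block.
relu = lambda z: np.maximum(z, 0)
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
x = rng.normal(size=(8,))

plain = W2 @ relu(W1 @ x)        # the block must model H(x) directly
residual = plain + x             # the block only models F(x) = H(x) - x;
                                 # the skip adds x back at almost no cost
print(residual.shape)            # (8,)
```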
Hello, professor, I'm raising this issue because I'd like to review what we covered in class today about ReLU6, ShuffleNet, etc. and make sure I'm correct. I'd also like to ask if you know anything about the Shift method for computational reduction.
This issue is a compilation of several questions, so I hope it doesn't confuse anyone; only the second question was covered in class, and the third question is from my private study.
Here's the question in a nutshell (the questions are labeled with numbers)
ReLU6 is min(max(0, x), 6); values above 6 are clipped to 6.
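As a one-line check of that definition (plain numpy, with my own example values):

```python
import numpy as np

# ReLU6 clips activations to the range [0, 6]: min(max(0, x), 6).
relu6 = lambda x: np.minimum(np.maximum(x, 0), 6)
print(relu6(np.array([-3.0, 2.0, 7.0])))   # [0. 2. 6.]
```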
"ShuffleNet"
"Shift"
This is something I came across while looking for methods for computational reduction, and the original paper referenced the following. (Wu, Bichen, et al. "Shift: A zero flop, zero parameter alternative to spatial convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.)
In a nutshell, instead of a spatial convolution, the feature maps are shifted up, down, left, and right, and the parts that fall beyond the original range are excluded.
Thank you.
Hyunwoong LIM
Regarding page 28 of yesterday's lecture slides,
On the left side, you mentioned that in the inverted residual network, the activation function is not used in the final 1x1 part due to the risk of information loss.
So I thought the performance of this network should generally be better.
However, the table on the right shows that the performance of the ReLU6 bottleneck is lower.
I'm wondering if there is something wrong with my understanding on this part.
Minjung Kim (20210172)
Why is Conv_channel1 on the first slide different from Conv_channel1 on the second slide in the type of convolution it represents (channel-wise conv vs. point-wise conv), even though they have the same components?
Next, I was wondering what determines the characteristics of each convolution. At first, I thought it was the components that make up each convolution (like C_in, C_out, ...), but now I guess the position where the convolution sits also affects it. I look forward to hearing from the expert :)
Hello everyone, in discussing how the Transformation Matrix R is derived, we've noticed the explanation might be a bit confusing, specifically regarding clockwise rotation.
Thanks for bringing up this question; we'll provide a revised explanation.
The essential idea is that to rotate a point by \theta, we must rotate the coordinate frame by -\theta.
Hence, the derivation process will be adjusted accordingly.
Thank you for the comment! The slide including the changes will be updated in eclass.
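The active-vs-passive relationship stated above can be checked numerically (a small sketch added for illustration, not from the slides):

```python
import numpy as np

# Active vs passive view: rotating a point by +theta gives the same
# coordinates as keeping the point fixed and rotating the axes by -theta.
def R(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s],
                     [s,  c]])

p = np.array([1.0, 0.0])
theta = np.pi / 2

active = R(theta) @ p            # rotate the point by +theta
B = R(-theta)                    # basis of a frame rotated by -theta
passive = B.T @ p                # coordinates of the fixed point in that frame
print(np.allclose(active, passive))   # True
```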
Thanks for the comment in the lecture.
I thought about the rotation invariance of the two window functions:
Certainly, the uniform window itself does not guarantee rotation invariance (but it is right that we find corners by examining the eigenvalues, which implicitly accounts for rotation invariance).
Also, a continuous 2D Gaussian filter is rotation invariant, but strictly speaking a 3x3 Gaussian filter is not.
(continuous Gaussian: rotation invariant; 3x3 Gaussian: actually not)
So, forget about rotation invariance in this slide; it would be controversial in my opinion :)
Thank you for attending today's course! We have two things to fix in the slide.
@yjyoo3312, when I'm attempting to solve the self-study materials, specifically on problem 2, I have concerns that the proof regarding shift invariance may not be properly constructed. Could you please advise if the proof I have formulated is sufficient? If not, could you suggest which properties should be utilized?
As professor @yjyoo3312 pointed out in the lecture, one of the key architectural changes in MobileNet v2 is the use of a linear bottleneck structure.
So here I examined the official PyTorch implementation of MobileNet v2 and compared it to ResNet.
ResNet's block design uses ReLU activation in its output.
As you can see here, MobileNet v2's block does not use an activation function in its final point-wise convolution, employing a linear activation.
Additionally, you can notice that the block uses skip-connection when the spatial dimensions and channels of the input and output match.
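To summarize the structure, here is a shape-only toy sketch of the block (plain numpy stand-ins, not the PyTorch implementation; the depthwise 3x3 is simplified to a per-channel multiply):

```python
import numpy as np

# Toy sketch of MobileNet v2's inverted residual: expand (1x1 + ReLU6) ->
# depthwise 3x3 (+ ReLU6) -> project (1x1, LINEAR, no activation) ->
# skip connection only when input and output shapes match.
relu6 = lambda x: np.minimum(np.maximum(x, 0), 6)

def inverted_residual(x, W_expand, W_dw, W_project):
    h = relu6(W_expand @ x)          # 1x1 expansion, C -> t*C
    h = relu6(W_dw * h)              # stand-in for the depthwise 3x3 stage
    out = W_project @ h              # 1x1 projection, t*C -> C, linear output
    if out.shape == x.shape:         # skip connection when shapes match
        out = out + x
    return out

rng = np.random.default_rng(0)
C, t = 4, 6
y = inverted_residual(rng.normal(size=(C,)),
                      rng.normal(size=(t * C, C)),
                      rng.normal(size=(t * C,)),
                      rng.normal(size=(C, t * C)))
print(y.shape)                       # (4,)
```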
As many have noticed, terms like filtering, convolution, and correlation are often used ambiguously in computer vision. The ambiguity often arises because much of the implementation follows the conventions of widely used libraries, such as OpenCV. For example, OpenCV's cv2.filter2D operation is implemented as a correlation by default, which is a common convention in the field, so filters are typically specified with correlation in mind.
Similarly, OpenCV uses BGR instead of RGB as its default color format, and while points are accessed in (x, y) order, matrix elements are accessed in (row, col) order, which can be quite confusing, even though there might be valid reasons for this.
Given this, many libraries in the field implement convolution and filtering (which is typically correlation) in different ways, so I often check the results to see how the operation actually works. If it's working as intended, I use it; otherwise, I flip the kernel and try again.
However, I need to be extra careful when it comes to exams; I'll be sure to define notations precisely in the questions. As long as you're aware of the difference between convolution and correlation, you should be fine.
Thanks for the comment! @nshuhsn
@yjyoo3312, I have two questions about Laplacian of Gaussian (LoG)!
When using LoG, if edge detection is based on zero-crossings, how do we differentiate between case 1 and case 2 in Figure, which are both zero-crossing but one is on edge and the other is not?
I heard that the reason for approximating LoG with DoG is due to computational complexity. However, since convolutional filters are fixed in advance, it doesn't seem necessary to worry about computation. Are there cases where LoG is computed multiple times?
Thank you very much!
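On the second question, a 1-D numerical sketch (my own, with assumed values σ = 1 and k = 1.2) shows how closely a difference of Gaussians tracks the Laplacian of Gaussian:

```python
import numpy as np

# 1-D comparison: DoG = G(k*sigma) - G(sigma) is nearly proportional to
# the second derivative of the Gaussian (the 1-D analogue of LoG).
x = np.linspace(-6, 6, 1001)
sigma, kk = 1.0, 1.2                 # assumed scale and scale ratio

def gauss(x, s):
    return np.exp(-x**2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)

dog = gauss(x, kk * sigma) - gauss(x, sigma)
log = (x**2 / sigma**4 - 1 / sigma**2) * gauss(x, sigma)   # G''(x)

# cosine similarity between the two kernel shapes
corr = np.dot(dog, log) / (np.linalg.norm(dog) * np.linalg.norm(log))
print(corr)
```

The similarity is close to 1, which is why DoG (cheap, reusing the Gaussian pyramid already computed across scales) stands in for LoG.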
Hi professor!
I have a question about setting the number of anchor boxes.
By default, we usually set it to 9.
However, if we set it to a lower value, I would expect higher FPS without a performance loss.
Is there any research, or are there any results, about adjusting this value?
Thank you!
First, I checked issue #39. I'm curious about the dimension of W. In (3x32x32)x(#hidden=10)=30K, what is the 3, what is the first 32, and what is the second 32? I've never studied neural networks, so please understand that I lack the relevant knowledge.
@yjyoo3312, I have 4 questions about Harris Corner and SIFT.
This question is about issue #5. I want to double-check whether my understanding is correct.
There are two ways to blur an image: applying a Gaussian blur, or downsampling and then resampling. If blurring is done using the former method, the result is a Gaussian pyramid, while the latter method leads to a Laplacian pyramid. Is that right?
It seems like the meaning of "scale" differs between the first and second images. In the first image, it appears to represent octaves, while in the second image, it seems to represent layers, i.e. different sigmas. Are both concepts related to the term "scale" and thus referred to as "scale"?
SIFT does not use Harris Corner Detector on the 'space' axis; instead, it employs DoG (Difference of Gaussians). As learned in the previous lecture, DoG approximates the Laplacian. Therefore, does SIFT treat the local maxima of the Laplacian as keypoints? And is it reasonable? I'm not sure what the local maxima of the Laplacian means.
Thank you for reading my lengthy text!
I don't quite understand the idea that "weight decay removes the effect of old parameters." I know that weight decay is a regularization technique used to prevent overfitting by reducing large weights, which helps the model generalize better. Could you explain what it means by "weight decay removes the effect of old parameters"? Does it simply mean that weight decay reduces the values of large weights?
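One way to read that phrase (a toy sketch with made-up lr and wd values): each decay step multiplies the weight by (1 − lr·wd), so whatever value the parameter had long ago is forgotten geometrically unless new gradients keep refreshing it.

```python
# Weight decay multiplies the weight by (1 - lr * wd) each step, so the
# contribution of old parameter values decays geometrically over time.
lr, wd = 0.1, 0.5                  # made-up hyperparameters
w = 10.0                           # stand-in for an "old" parameter value
for _ in range(50):
    grad = 0.0                     # no new gradient signal: only decay acts
    w = w - lr * (grad + wd * w)
print(w)                           # 10 * (1 - 0.05)^50, roughly 0.77
```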
I am writing to inquire about the specific limitations of the YOLO-v1 model as discussed in our recent lecture. YOLO-v1, while being an innovative and efficient object detection model, is known to have several limitations that impact its performance. I would like to understand these limitations better and verify their validity.
Could you please elaborate on the following points regarding YOLO-v1's limitations?
Detection of Multiple Objects in a Single Grid Cell: It has been noted that YOLO-v1 struggles to detect multiple objects within a single grid cell. How does this limitation affect the model’s performance in dense object scenarios?
Handling of Small Objects: The model reportedly has difficulties with small object detection due to its grid cell approach favoring larger objects. What are the specific challenges YOLO-v1 faces with small objects, and are there any particular cases where this limitation is most evident?
Bounding Box Regression Issues: YOLO-v1’s bounding box predictions can sometimes be inaccurate, leading to poor localization. How significant is this issue in practical applications, and are there known methods to mitigate it?
I would appreciate a detailed explanation of these points to understand the limitations of YOLO-v1 better. Additionally, if there are any insights or counterarguments that might provide a more balanced view, I would be very interested in hearing them.
Thank you.
NCWH -> NCHW
In PyTorch, the dimension order of input values is known to be NC"HW". However, in the CutMix slide, it seems to be in NC"WH" order. So I think in W = size[2], 2 should be changed to 3 and in H = size[3], 3 should be changed to 2.
Additionally, in the cutmix_data function on the left, since the third dimension is H and the fourth dimension is W, I think the order of bbx and bby should be swapped. Am I correct?
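A sketch of the suggested correction (modeled on the usual CutMix rand_bbox pattern, with hypothetical tensor sizes): read H from size[2] and W from size[3], then slice rows with the y bounds and columns with the x bounds.

```python
import numpy as np

# For an NCHW tensor, H = size[2] and W = size[3]; the random box is
# (bbx1, bby1, bbx2, bby2) in (x, y) terms, so slicing must put the y
# bounds on the H axis and the x bounds on the W axis.
def rand_bbox(size, lam, rng):
    H, W = size[2], size[3]                  # NCHW order
    cut_rat = np.sqrt(1.0 - lam)
    cut_w, cut_h = int(W * cut_rat), int(H * cut_rat)
    cx, cy = rng.integers(W), rng.integers(H)
    bbx1 = np.clip(cx - cut_w // 2, 0, W)
    bbx2 = np.clip(cx + cut_w // 2, 0, W)
    bby1 = np.clip(cy - cut_h // 2, 0, H)
    bby2 = np.clip(cy + cut_h // 2, 0, H)
    return bbx1, bby1, bbx2, bby2

rng = np.random.default_rng(0)
x = np.zeros((2, 3, 8, 16))                  # N, C, H=8, W=16
bbx1, bby1, bbx2, bby2 = rand_bbox(x.shape, lam=0.5, rng=rng)
x[:, :, bby1:bby2, bbx1:bbx2] = 1.0          # rows = H bounds, cols = W bounds
print(bbx2 <= 16 and bby2 <= 8)              # True: box stays inside the image
```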
np.int deprecated
I know that np.int has been deprecated in NumPy. Currently, np.int32 or np.int64 should be used instead. Could you please check this?
I know that the larger the scale, the smaller the image size should be, as shown below.
I also think that the larger the octave and scale, the smaller the image size should be, as shown below.
However, in our PPT, the scale at 0 octave is represented as the largest.
I think this problem can be solved by reversing the direction of the arrow for scale.
Is my way of thinking correct?
I apologize if this has already been mentioned in class.
Thank you!
HaeSeong Kim
When we calculate H in RANSAC, we set h22 = 1 because it maintains the 1 in the translation component. So my question is: why are h20 and h21 not zero? We don't need h20 and h21 for scaling, rotation, or translation, which is why in lecture 2 we set h20 and h21 to zero.
Are there any reasons we cannot set h20 and h21 to zero?
I am writing to inquire about the specific ways in which the auxiliary fully connected (FC) layers in GoogLeNet help mitigate the gradient vanishing problem during training. The gradient vanishing issue is a significant challenge in deep neural network training, where gradients become progressively smaller as they are backpropagated through the layers, leading to slow learning or even a complete halt in learning for the earlier layers.
Could you please elaborate on how these auxiliary FC layers address this problem within the context of GoogLeNet? I am particularly interested in understanding the mechanisms by which these layers influence the backpropagation process, the strategic placement of these layers within the network, and any additional benefits they provide beyond mitigating the gradient vanishing issue.
Thank you.