World to Image - 1
What's the aim of this article? Firstly, to understand the commonly used reference coordinate frames. Secondly, to understand the conversion from the 3D world coordinate frame to image pixels. Let’s break it down in layman's terms.
Understanding the World Coordinate Frame
We see an object (e.g., a Red Volkswagen Polo GT) in reality. Now we want to know its position vector (from the origin to the Polo's center of mass). What is the origin (reference point) in our case? We need to decide that. We see a man sitting on a bench across the road. Let’s assume that he is the origin, or reference point, for our 3D world coordinate system. Yes, we call it a 3D world coordinate system!
Let’s go into more detail about the origin. Where are its X, Y and Z axes? More assumptions: the direction to his left is the X-axis, the direction his head is facing (front) is the Y-axis, and the direction perpendicular to the ground is the Z-axis. Why? To follow the right-hand rule. Note that with X to his left and Y to the front, the right-hand rule makes Z point down into the ground. Now we can find the position vector of the Red Polo GT with respect to the sitting man (the origin of the world frame).
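As a quick sanity check of the right-hand rule, here is a minimal sketch in plain Python. It writes the man's left, front and up directions in an ordinary helper frame and takes the cross product X × Y to see where Z must point. The helper frame and the car position are made-up values for illustration only.

```python
def cross(a, b):
    """Cross product of two 3D vectors given as (x, y, z) tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

# Helper directions in an ordinary right-handed frame (right, front, up)
# centred on the sitting man. Assumed purely for this check.
RIGHT = (1, 0, 0)
FRONT = (0, 1, 0)
UP = (0, 0, 1)
LEFT = (-1, 0, 0)

# World axes as described in the text: X = his left, Y = his front.
x_axis = LEFT
y_axis = FRONT

# In a right-handed frame, Z must equal X cross Y.
z_axis = cross(x_axis, y_axis)
print(z_axis)  # (0, 0, -1): opposite to UP, i.e. pointing into the ground

# Hypothetical position vector of the car in this world frame (metres):
# 3 m to the man's left, 10 m in front of him, at the height of the origin.
p_car_world = (3.0, 10.0, 0.0)
```

So if we want X to be his left and Y to be his front, Z has to point downwards. If we would rather have Z pointing up, X should be the direction to his right instead.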
Understanding the Camera Coordinate Frame
Now where am I in this whole thing? Technically we will find that, but first, we'll make another assumption, i.e. I have a camera and I can see the world through it. I also take the liberty to assume that I know my position vector with respect to the sitting man/origin.
Now I am going to define another coordinate frame: the camera coordinate frame.
By the way, what does it mean to define a coordinate system? For easy understanding: it means that we define a new origin and the orientations of its X, Y and Z axes, and then measure everything with respect to that origin. Do you know the exact difference between a coordinate system and a coordinate frame?
Now coming back to defining the camera coordinate system: its origin is the optical center of the camera (the center of projection, roughly the center of the lens). Note that this is not the principal point, which is the point where the principal axis meets the image plane. The direction to the right of the camera is the X-axis, the direction down towards the ground is the Y-axis (assuming that I am standing almost perpendicular to the ground) and the direction along the principal axis, pointing out of the camera towards the scene, is the Z-axis, as shown in the picture.
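To make the change of frame concrete, here is a minimal sketch in plain Python. It assumes that the world Z-axis points into the ground (which the right-hand rule requires for X = left, Y = front), that I stand 5 m behind the sitting man, and that my camera looks in the same direction he does. The function name, the rotation matrix and all the numbers are hypothetical choices for this example.

```python
def world_to_camera(p_world, cam_pos_world, R):
    """Express a world-frame point in the camera frame.

    p_cam = R * (p_world - C), where C is the camera position in the world
    frame and each row of R is one camera axis written in world coordinates.
    """
    d = [p - c for p, c in zip(p_world, cam_pos_world)]
    return tuple(sum(r * di for r, di in zip(row, d)) for row in R)

# Camera axes written in the world frame (X = man's left, Y = his front,
# Z = down), for a camera looking the same way as the man:
R_WORLD_TO_CAM = (
    (-1, 0, 0),  # camera X (right)   = world -X (the man's right)
    (0, 0, 1),   # camera Y (down)    = world +Z (down)
    (0, 1, 0),   # camera Z (forward) = world +Y (front)
)

cam_pos_world = (0.0, -5.0, 0.0)  # hypothetical: 5 m behind the man
p_car_world = (3.0, 10.0, 0.0)    # hypothetical: 3 m left, 10 m in front

p_car_cam = world_to_camera(p_car_world, cam_pos_world, R_WORLD_TO_CAM)
print(p_car_cam)  # (-3.0, 0.0, 15.0)
```

Read in the camera frame, the car is 3 m to the left of the camera (negative X), at the same height (Y = 0) and 15 m ahead along the principal axis (Z = 15).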
Understanding the Image Coordinate Frame
You may be wondering: the images we get are in 2D pixels, but so far we have only talked about 3D coordinate frames (the world coordinate frame and the camera coordinate frame). If you were not wondering, I suggest you think about it and be more curious!
In our example, what do I see through the camera? I see an image of a Red Volkswagen Polo GT on a road, as I looked towards it and clicked a picture. As we all know, this image is 2D, i.e., we get a matrix of pixels and their corresponding intensities (0-255 for grayscale images). This 2D image is a projection of the real 3D world onto an (M x N) pixel grid. Let’s say we have an image of size 100 x 250 pixels (100 in height and 250 in width).
Now we can define one more, final frame: the image coordinate frame.
The top-left corner of the image is defined as the origin. The direction to the right, along the width of the image, is the X-axis, and the direction towards the bottom of the image, along its height, is the Y-axis.
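Here is a rough sketch of how a point in the camera frame could land on a pixel of the 100 x 250 image. It assumes a simple pinhole camera model with no lens distortion, which this article has not introduced yet. The focal length, the point where the principal axis hits the image, and the 3D point itself are all hypothetical values.

```python
HEIGHT, WIDTH = 100, 250  # image size from the example: 100 rows, 250 columns

# Hypothetical camera parameters (not given in the article):
FOCAL_PX = 200.0                # focal length expressed in pixels
CX, CY = WIDTH / 2, HEIGHT / 2  # where the principal axis hits the image

def project(p_cam):
    """Pinhole projection of a camera-frame point (X right, Y down, Z forward).

    Returns (x, y) in image coordinates (origin at the top-left corner),
    or None if the point is behind the camera or falls outside the image.
    """
    X, Y, Z = p_cam
    if Z <= 0:
        return None
    x = FOCAL_PX * X / Z + CX
    y = FOCAL_PX * Y / Z + CY
    if not (0 <= x < WIDTH and 0 <= y < HEIGHT):
        return None
    return x, y

# A grayscale image stored as a list of rows, all pixels black (0).
image = [[0] * WIDTH for _ in range(HEIGHT)]

# Hypothetical point: 3 m to the camera's left, 15 m in front of it.
pixel = project((-3.0, 0.0, 15.0))
print(pixel)  # (85.0, 50.0)

# Image X runs along the width (columns), image Y along the height (rows),
# so the pixel at (x, y) is stored at image[row = y][column = x].
col, row = int(pixel[0]), int(pixel[1])
image[row][col] = 255  # mark it white
```

Notice the index order: the image coordinate is written as (x, y), but the matrix is indexed as image[y][x], because rows run along the height and columns along the width.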
What is the goal of assuming and understanding all this?
Now can you tell me where this Polo GT is with respect to the sitting man? Yeah, that’s the goal⚽ or rather I’d say that’s a huge six🏏 as the whole match of computer vision has just started…
A trick to visualize a coordinate frame
For a moment, try to imagine yourself as Nani from the movie Eega, i.e., in the body of a fly. Now, if you want to comprehend visually what the world coordinate frame defined earlier (the man sitting on a bench) means, become an insect and fly towards the sitting man. Carefully sit on his nose, facing away from his eyes. What do you see? Where is the Red Polo GT from where you sit? Many computer vision applications need exactly this kind of answer.