NeRF

Sundry

This project is a group effort by Deniz Demirtas and Tan Sarp Gerzile. The link to the other project is: Neural Face Detection

Part 1: Fit a Neural Field to a 2D Image

To fit a neural field to a 2D image, the goal is to optimize a neural network (a multi-layer perceptron (MLP)) such that it learns a mapping from 2D pixel coordinates (u, v) to the RGB values of an image. The neural network architecture can be found below.

MLP Architecture:

Positional Encodings:

For all 2D coordinates in the picture that we input to the network, we first compute their positional encodings. The reason is that the raw 2 dimensions are simply not enough for the network to capture fine detail, so we expand them with sinusoidal encodings using frequency L = 10. The output of the network is the predicted RGB value for each pixel, and the target is the ground-truth RGB value extracted from the image. For efficiency, at each training iteration a random subsample of pixels is drawn. The network minimizes the MSE between the predicted and ground-truth RGB values. We want the PSNR to be as high as possible: since PSNR increases as MSE decreases, maximizing PSNR is equivalent to minimizing the reconstruction error. Another way to interpret this is that the noise the network introduces into the reconstruction is minimized.
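As a concrete illustration, here is a minimal sketch of such a sinusoidal positional encoding in PyTorch. The function name, the inclusion of the raw coordinates, and the exact frequency scaling are illustrative choices, not necessarily the exact formulation shown in the figure above.

```python
import torch

def positional_encoding(x, L=10):
    """Sinusoidal positional encoding.

    x: tensor of shape (..., d) with coordinates, e.g. normalized (u, v).
    Returns a tensor of shape (..., d + 2 * d * L): the original coordinates
    followed by sin/cos features at frequencies 2^0 * pi, ..., 2^(L-1) * pi.
    """
    features = [x]
    for i in range(L):
        freq = (2.0 ** i) * torch.pi
        features.append(torch.sin(freq * x))
        features.append(torch.cos(freq * x))
    return torch.cat(features, dim=-1)

# Example: encode a batch of (u, v) pixel coordinates normalized to [0, 1].
uv = torch.rand(4096, 2)
encoded = positional_encoding(uv, L=10)  # shape (4096, 42)
```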

The default parameters for training are 2000 iterations, a batch size of 10000, a learning rate of 1e-2, frequency L = 10, and channel sizes as shown in the network architecture. The results for the test image are below. The final PSNR we achieved was 26.7, which is not bad.
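For reference, the training loop for this part can be sketched roughly as below, assuming the `positional_encoding` helper above, an already constructed `model` MLP, and an `image` tensor of shape (H, W, 3) with values in [0, 1]; the names are illustrative rather than our exact code. PSNR follows directly from the loss as -10 * log10(MSE).

```python
import torch
import torch.nn.functional as F

H, W = image.shape[:2]
coords = torch.stack(torch.meshgrid(
    torch.arange(H), torch.arange(W), indexing="ij"), dim=-1).reshape(-1, 2)
colors = image.reshape(-1, 3)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
for it in range(2000):
    idx = torch.randint(0, coords.shape[0], (10000,))        # random pixel subsample
    uv = coords[idx].float() / torch.tensor([H - 1, W - 1])   # normalize to [0, 1]
    pred = model(positional_encoding(uv, L=10))               # predicted RGB
    loss = F.mse_loss(pred, colors[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    psnr = -10.0 * torch.log10(loss.detach())                 # higher is better
```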

Results of fitting a Neural Field to a 2D image

We then varied some of the hyperparameters to study how the model behaves under different settings. We first tested two learning rates apart from the 1e-2 baseline: 1e-4 and 1e-1. 1e-4 is a slower learning rate, so without increasing the number of training iterations we shouldn't expect a reconstructed image of the same quality. This is confirmed by the results with learning rate = 1e-4 below.

As can be seen from the images above, compared to the learning rate of 1e-2 at the same iteration count, we get a lower-quality reconstruction, which is a consequence of the slower learning. Below are the results for learning rate 1e-1:

As expected, a learning rate this high makes the optimization quickly and incorrectly collapse to an output of 0, which is the value for black in RGB, resulting in a completely wrong reconstructed image.

Then we varied the frequency L. Apart from the baseline L = 10, we tried the values 2 and 15. The results for L = 2 are below:

This is an interesting result. Lowering the frequency acts as a low-pass filter, much like a Gaussian blur, and blurs the reconstructed image. This makes sense: with L = 2 the positional encodings contain only low-frequency components, so the network has no high-frequency features to work with and fine detail is effectively attenuated. It is still surprising to see the effect so clearly.

In conjunction with the argument above, setting L = 15 is too high: the encoding over-emphasizes high-frequency components, which produces the noisy, edgy images above, even though the PSNR is high.

We repeated all of the above experiments with another image, sampled from Pink Floyd's Meddle album cover. The results are below, in the same sequential order as for the first test image:

Results for Pink Floyd Meddle Album Cover

Part 2.1: Creating Rays from Cameras

To create rays from cameras, we first need the transformation between camera coordinates and world coordinates. Any rigid body transformation can be represented by a rotation together with a translation, and the rigid body transformation below achieves this. In the dataset, we are given the camera-to-world transformations.
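For reference, the world-to-camera form we refer to below (our figure's notation may differ slightly) is, in homogeneous coordinates:

```latex
\begin{bmatrix} \mathbf{x}_c \\ 1 \end{bmatrix}
=
\begin{bmatrix}
\mathbf{R}_{3\times 3} & \mathbf{t} \\
\mathbf{0}_{1\times 3} & 1
\end{bmatrix}
\begin{bmatrix} \mathbf{x}_w \\ 1 \end{bmatrix}
```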

For the pinhole camera model with principal point (o_x, o_y), focal lengths (f_x, f_y), and no shear, we can define the calibration matrix K as follows:
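In standard pinhole form, with the focal lengths on the diagonal and the principal point in the last column:

```latex
K = \begin{bmatrix}
f_x & 0 & o_x \\
0 & f_y & o_y \\
0 & 0 & 1
\end{bmatrix}
```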

When we multiply the camera coordinates by the calibration matrix, we formally (and intuitively) obtain the depth s (a scalar) times the pixel coordinates in homogeneous coordinates. This implies that to recover the camera coordinates, it is enough to multiply the left-hand side by K inverse.
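Written out, this relation and its inversion are:

```latex
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \, \mathbf{x}_c
\quad\Longrightarrow\quad
\mathbf{x}_c = K^{-1} \left( s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \right)
```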

In this case, if we have the world-to-camera matrix defined as in the first equation, then the vector quantity:

would be the camera center expressed in the world frame, i.e. the origin of every ray cast from that camera. This makes sense because applying R inverse to the translational component rotates it back into the world frame, giving the camera's position relative to the world origin. With the origin known, the following equation represents the rays from the camera into the world frame in a normalized manner. We implement all of these transformations.
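Putting these pieces together, a minimal sketch of the pixel-to-ray conversion in PyTorch could look like the following. The function name and the choice of depth s = 1 for back-projecting pixels are illustrative; `K` and `c2w` stand for the intrinsics and camera-to-world matrices from the dataset.

```python
import torch

def pixels_to_rays(K, c2w, uv):
    """Convert pixel coordinates to world-space rays.

    K:   (3, 3) intrinsics matrix.
    c2w: (4, 4) camera-to-world transform.
    uv:  (N, 2) pixel coordinates (already shifted by 0.5 to pixel centers).
    Returns ray origins (N, 3) and normalized ray directions (N, 3).
    """
    N = uv.shape[0]
    # Back-project pixels to camera coordinates at depth s = 1.
    uv_h = torch.cat([uv, torch.ones(N, 1)], dim=-1)           # (N, 3)
    x_cam = (torch.inverse(K) @ uv_h.T).T                      # (N, 3)
    # Transform the back-projected points to world coordinates.
    x_cam_h = torch.cat([x_cam, torch.ones(N, 1)], dim=-1)     # (N, 4)
    x_world = (c2w @ x_cam_h.T).T[:, :3]                       # (N, 3)
    # Ray origin is the camera center in world coordinates (c2w applied to the origin).
    r_o = c2w[:3, 3].expand(N, 3)
    # Ray direction points from the camera center through each pixel, normalized.
    r_d = x_world - r_o
    r_d = r_d / torch.norm(r_d, dim=-1, keepdim=True)
    return r_o, r_d
```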

Parts 2.2 and 2.3: Sampling

To sample rays from images, we first generated a grid of pixel coordinates for all images (all operations support batched inputs). Rays are defined by their origins r_o and directions r_d, so we generated this tuple for every image. We offset the grid coordinates by 0.5 to shift them to pixel centers. For each grid coordinate, we compute the ray origin r_o and the ray direction r_d using the transformations and camera intrinsics provided by the dataset, and we sample N rays globally across all images. Then we implemented the functionality to sample points along rays: after obtaining the rays (r_o, r_d), we sample points along each ray in 3D space. The sampling method draws t values uniformly between near = 2.0 and far = 6.0 and, during training, adds small random offsets (perturbation) so that every location along the ray gets covered. The sampled point is then p = r_o + t * r_d. We incorporated this logic into the dataloader and used the provided ray visualization function, which resulted in the following image (pretty cool)!
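A minimal sketch of the point sampling along rays, with the uniform samples and the training-time perturbation described above (names are illustrative):

```python
import torch

def sample_points_along_rays(r_o, r_d, near=2.0, far=6.0,
                             num_samples=64, perturb=True):
    """Sample 3D points along each ray: p = r_o + t * r_d.

    r_o, r_d: (N, 3) ray origins and directions.
    Returns points of shape (N, num_samples, 3) and the t values (N, num_samples).
    """
    N = r_o.shape[0]
    t = torch.linspace(near, far, num_samples).expand(N, num_samples)
    if perturb:
        # Jitter the samples so every location along the ray is eventually covered.
        step = (far - near) / num_samples
        t = t + torch.rand(N, num_samples) * step
    points = r_o[:, None, :] + t[..., None] * r_d[:, None, :]
    return points, t
```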

Part 2.4: Neural Radiance Field

The network architecture for the 3D case is similar to how we fit a neural network to predict RGB values from positional encodings in Part 1. The network takes two key inputs: 3D coordinates representing points in 3D space, and a view direction representing the direction from which we are viewing each point. Both inputs go through positional encoding, as before, to capture higher-frequency details. The main backbone consists of 8 linear layers with ReLU activations, processing the encoded 3D coordinates. Halfway through the pipeline, the network concatenates the encoded view direction with the intermediate features. This injection of view information helps the network understand how the appearance of a point changes with viewing angle. The network has two outputs: a density output, which predicts how 'solid' each 3D point is, and a color (RGB) output, which predicts the color of each point when viewed from the given direction. This architecture allows the network to learn a volumetric representation of the scene, where each point in 3D space has both a density and a view-dependent color. In our network, the input 3D point is positionally encoded with frequency L = 10 and the ray direction is positionally encoded with L = 4.
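A condensed sketch of this architecture in PyTorch is shown below. The hidden width of 256, the ReLU on the density head, and the sigmoid on the color head are assumptions based on the description above rather than a verbatim copy of our implementation.

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    def __init__(self, pos_dim, dir_dim, hidden=256):
        super().__init__()
        # First half of the backbone on the encoded 3D position.
        self.backbone1 = nn.Sequential(
            nn.Linear(pos_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Second half, after the encoded view direction is injected.
        self.backbone2 = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Density head: how "solid" the point is (kept non-negative via ReLU).
        self.density_head = nn.Sequential(nn.Linear(hidden, 1), nn.ReLU())
        # Color head: view-dependent RGB in [0, 1] via a sigmoid.
        self.color_head = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, x_enc, d_enc):
        h = self.backbone1(x_enc)                        # encoded 3D point, L = 10
        h = self.backbone2(torch.cat([h, d_enc], -1))    # inject encoded direction, L = 4
        return self.density_head(h), self.color_head(h)
```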

Part 2.5: Volume Rendering

Volume rendering is how we turn the NeRF's outputs into actual 2D images. For any camera ray going through the scene, we sample multiple points along that ray. At each point, sigma(t) represents the density, c(t) represents the color, and T(t) represents the transmittance, i.e. how much light makes it through all the previous points. Dense points block more light from the points behind them, so the color we see is a weighted sum of the colors of all points along the ray, with points closer to the camera contributing more than distant ones. The discrete approximation makes this computable: we sample N points along each ray and, for each point i, compute the transmittance T_i and the probability 1 - exp(-sigma_i * delta_i) that the ray terminates at that point, weight the color c_i by their product, and sum up all these contributions. This rendering equation is what allows NeRF to create realistic 3D views.
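A minimal sketch of this discrete volume rendering step, assuming densities `sigmas` of shape (N_rays, N_samples, 1), colors `rgbs` of shape (N_rays, N_samples, 3), and a constant spacing `step_size` between samples (names are illustrative):

```python
import torch

def volume_render(sigmas, rgbs, step_size):
    """Discrete volume rendering along each ray.

    sigmas: (N_rays, N_samples, 1) densities.
    rgbs:   (N_rays, N_samples, 3) colors.
    step_size: distance delta between consecutive samples along a ray.
    Returns rendered colors of shape (N_rays, 3).
    """
    # alpha_i = 1 - exp(-sigma_i * delta_i): probability of terminating at sample i.
    alphas = 1.0 - torch.exp(-sigmas * step_size)
    # T_i = prod_{j < i} (1 - alpha_j): light surviving to reach sample i.
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=1)
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    weights = trans * alphas
    # Weighted sum of per-sample colors gives the rendered ray color.
    return (weights * rgbs).sum(dim=1)
```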

Part 2.4 and 2.5 continued: Training of the Network

We train this network in a batched manner. For each training iteration, the code samples a batch of random rays from the dataset. Along each ray, it samples multiple points (num_samples = 64) between the near and far bounds (2.0 and 6.0, respectively). These points and their corresponding view directions become the input to the network. In the forward pass, the MLP processes each 3D point and its viewing direction through positional encodings and, for each point, outputs a density value and an RGB color value. These outputs are then passed through discrete volume rendering to produce a predicted RGB color for each ray. The rendered colors are compared against the ground truth from the training images using the MSE loss, and the Adam optimizer updates the network weights to minimize the difference between predicted and actual colors. This process is repeated for 6000 iterations (thanks to an A100 GPU) with a batch size of 10000 rays. Through this process the network learns to represent the 3D scene as a continuous function that can predict density and color at any point in space from any viewing direction.
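Assembling the sketches above (point sampling, the NeRF MLP, and volume rendering), one training iteration can be outlined as follows. `sample_ray_batch` is a hypothetical stand-in for our dataloader, and the learning rate shown is an assumption since it is not stated above.

```python
import torch
import torch.nn.functional as F

# Instantiate the NeRF MLP sketched above (encoded input dims: 3 + 2 * 3 * L).
nerf = NeRFMLP(pos_dim=3 + 2 * 3 * 10, dir_dim=3 + 2 * 3 * 4)
optimizer = torch.optim.Adam(nerf.parameters(), lr=5e-4)   # learning rate is an assumption

for it in range(6000):
    # sample_ray_batch is a hypothetical stand-in for the dataloader's ray sampling.
    rays_o, rays_d, target_rgb = sample_ray_batch(batch_size=10000)
    points, t_vals = sample_points_along_rays(rays_o, rays_d, num_samples=64)

    # Encode points (L = 10) and directions (L = 4); broadcast one direction per sample.
    x_enc = positional_encoding(points, L=10)
    d_enc = positional_encoding(rays_d[:, None, :].expand_as(points), L=4)

    sigmas, rgbs = nerf(x_enc, d_enc)
    pred_rgb = volume_render(sigmas, rgbs, step_size=(6.0 - 2.0) / 64)

    loss = F.mse_loss(pred_rgb, target_rgb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```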

Loss, PSNR, Samples, and NeRF reconstruction from the model