autonomousvision / mip-splatting Goto Github PK

View Code? Open in Web Editor NEW

1.0K 23.0 66.0 21.15 MB

[CVPR'24 Best Student Paper] Mip-Splatting: Alias-free 3D Gaussian Splatting

Home Page: https://niujinshuchong.github.io/mip-splatting/

License: Other

Python 63.47% CMake 0.44% C++ 6.83% Cuda 28.90% C 0.36%

3d-reconstruction guassian-splatting nerf novel-view-synthesis anti-aliasing

mip-splatting's People

Contributors

Stargazers

Watchers

mip-splatting's Issues

Installation has extra 'submodules' in the code

for pip install submodules/simple-knn/ in the code on the main page there's an extra 'submodules'

Currently: pip install submodules/submodules/simple-knn/
Should be corrected to: pip install submodules/simple-knn

about eq. (9)

Thanks your great work again!

In gaussian_model.py, fiter_3D = distance / focal_length * scale1
according paper eq. (9), new gaussian covariance matrix's scale should be $\sqrt{scale^2 + filter3D}$
But, why in function get_scaling_with_3D_filter() is $\sqrt{scale^2 + filter3D^2}$ ?
Why square fiter_3D? Thanks

How to choose the variacne of 2D Mip Filter ?

Thanks for your great work! I understand more about 3dgs from your work!

In 6.1 Implementation, We choose the variance of our 2D Mip filter as 0.1, approximating a single pixel

My questions:
How to deduce from a single pixel to 0.1 value？
By the way, what is the unit of 0.1?

In my understande:
In computeCov2D function, cov[0][0] += kernel_size; means the scale (or unit) of kernel_size and cov should be the same, because they can add together.
cov is the covariance in ray space. So the unit of 0.1 should be the unit in ray space. But why 0.1 in $(x_0, x_1)$ plane of ray space occupies a aingle pixel?

From Fig. 8 ray space. of EWA Splatting

Thanks!

About 2D filter

I wanna ask why the 2D filter could work. In original 3DGS, it uses 0.3f, and you use 0.1f with a coef.

why does it work?

Low matching rate of data preprocessing using convert.by

I did not directly use the sample data mic from nerf_synthetic for training, but instead preprocessed the sample data mic using convert.by. However, the matching rate was very low. After preprocessing 100 sample data using convert.by, only two of them matched. May I ask, what is the reason for this?How can I preprocess data with high matching rates?

Questions about the stmt training on the mip360 dataset

Hi authors, thanks for your great work! I would like to ask about the stmt training on mip360 datasets. Your paper mentioned: "we train models on data downsampled by a factor of 8 and rendered at successively higher resolutions (1×, 2×, 4×, and 8×)".

However, I found that this training file still uses {scene}/images for training (when using -r 4) since I printed the image_path in the readColmapCameras functions. In my understanding, we should use {scene}/images4 in this case. Is it correct?

How to use custom data for training

What should I do if I want to use custom data for training。
Looking forward to your reply, thank you

About GPU Memory

Thank you for your awesome work! In the bicycle scene that is downsampled 4 times, at 3000 iterations, the 3090 24G memory is full and the training is slow. It seems that mip-splatting requires much more densified point clouds than the original 3DGS. What is the GPU memory overhead of the scene?

Thanks for your great work. Does the s in equation (9) refer to 0.2 in the code?

Does the s in equation (9) refer to 0.2 in the code?

Swapping out Original Gaussian Splatting

Hello, this is a great project, thanks for making it public!

I'm considering integrating mip-splatting with my code that uses the original gaussian splatting implementation. Is it okay if I simply swap out the diff-gaussian-rasterization package for the one in this repo (submodules/diff-gaussian-rasterization)? Are there any other changes I would have to make to the original gsplatting code? I am not working on reconstruction exactly, so I cannot use this repository as is, unfortunately. Thank you!

FileNotFoundError: [Errno 2] No such file or directory: 'benchmark_nerf_synthetic_ours_stmt/mic/cfg_args'

Looking for config file in benchmark_nerf_synthetic_ours_stmt/mic/cfg_args
Traceback (most recent call last):
File "render.py", line 60, in
args = get_combined_args(parser)
File "/home/jovyan/cfs/personal_data/gongyan/It/mip-splatting-main/arguments/init.py", line 105, in get_combined_args
with open(cfgfilepath) as cfg_file:
FileNotFoundError: [Errno 2] No such file or directory: 'benchmark_nerf_synthetic_ours_stmt/mic/cfg_args'

The low-pass filter in Eq. 8 and Eq. 9

Thank you for your work! I have a question about the low-pass filter in Eq. 8 and Eq.9, i.e., how do you construct it?

May I ask if any modifications have been made to the diff-Gaussian-transformation

I want to extract depth rendering results from 3D Gaussian, so may I ask if you have made any modifications to the diff Gaussian transformation file

Error: nvrtc: error: invalid value for --gpu-architecture (-arch)

hey, you did a great job. I encountered an error during the deployment of the project, which led to unsuccessful training and Unable to compute metrics for model. Can you please help me identify the reason for this?

the error is
nvrtc: error: invalid value for --gpu-architecture (-arch)
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]
Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off]

best wishes,

How to improve the quality of pointclouds?

Great job. But when i train on my own data, i found that the quality of pointclouds has not a hight quality, especially the road surface.

Unable to reproduce your results

There is my command and I didn't change anything from your code: python ./scripts/run_mipnerf360.py ,
but I uploaded the .ply generated by your code to the onlineviwer I found some floaters and white spots.

GPU Memory - Massive Issue

Work is needed on the GOU memory allocation. Cant get any results as compared to original paper as GPU memory max's out way to quickly.

Suggest to have another go basis using latest 3dgs implementation with your code revision....

Hope this helps...

It looks good, but cant use this at all or even see results...

Running on Apple / Metal?

Is it possible to run on a 128gb M3 Max?

If not on Metal/gpu, will it run in CPU only mode?

how to extract mesh?

how to extract mesh use this model?

can not display in the viewer which provided by graphdeco-inria

Hi, greate work,, but it seems can not display in the viewer which provided by graphdeco-inria?

installation

in installation section
https://github.com/autonomousvision/mip-splatting?tab=readme-ov-file#installation

conda install cudatoolkit-dev=11.3 -c conda-forge

after search
https://anaconda.org/search?q=cudatoolkit-dev

the lowest version is 11.4 now, can we use
conda install cudatoolkit-dev=11.3 -c conda-forge

thanks

radii > 0 - CUDA error illegal access memory

Hi !

Currently trying to give a try to your method on a custom scene I have but I'm facing quite a disturbing and strange issue. Training starts well, but at some points, I got some CUDA-kernel error I cannot get.

I highly suspect an issue related to the gaussians.compute_3D_filter method (https://github.com/autonomousvision/mip-splatting/blob/746a17c9a906be256ed85b8fe18632f5d53e832d/train.py#L164C1-L165C1) but I don't manage to investigate in a deeper way the error.

I've found some similar issue in the original GS repo (here: graphdeco-inria/gaussian-splatting#41 (comment)), made the corresponding changes by rebuilding the diff-gaussian-rasterizer submodule, but I still get the error.

Here is the log stack I get.

Do you have any clues / insights on what's happening and why ?

Thanks a lot for your time, your work,

Best,

Gaétan.

Custom data artifacts

I have validated your method and I think it is very great. But when I use custom data, the details are great, but there are many artifacts. Is there any way to deal with this?

About Eq.9

In Eq.9, it is the square of covariance plus the square of 3D Filter, but in the code, it is the square of exp(scaling) plus the square of 3D Filter squared. Why?

help

(mip-splatting) root@autodl-container-2a5049addd-1b9dbf2b:~/mip-splatting# OMP_NUM_THREADS=4 CUDA_VISIBLE_DEVICES=0 python train.py -s 360_v2/bonsai -m output --eval -r 8 --kernel_size 0.1
Optimizing output
Output folder: output [09/04 12:43:50]
Tensorboard not available: not logging progress [09/04 12:43:50]
Reading camera 292/292 [09/04 12:43:51]
Loading Training Cameras [09/04 12:43:51]
Loading Test Cameras [09/04 12:44:19]
Number of points at initialisation : 206613 [09/04 12:44:23]
Computing 3D filter [09/04 12:44:23]
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]Traceback (most recent call last):
File "train.py", line 268, in
training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
File "train.py", line 125, in training
render_pkg = render(viewpoint_cam, gaussians, pipe, background, kernel_size=dataset.kernel_size, subpixel_offset=subpixel_offset)
File "/root/mip-splatting/gaussian_renderer/init.py", line 60, in render
opacity = pc.get_opacity_with_3D_filter
File "/root/mip-splatting/scene/gaussian_model.py", line 132, in get_opacity_with_3D_filter
det1 = scales_square.prod(dim=1)
RuntimeError:
#define POS_INFINITY __int_as_float(0x7f800000)
#define INFINITY POS_INFINITY
#define NEG_INFINITY __int_as_float(0xff800000)
#define NAN __int_as_float(0x7fffffff)

typedef long long int int64_t;
typedef unsigned int uint32_t;
typedef signed char int8_t;
typedef unsigned char uint8_t; // NOTE: this MUST be "unsigned char"! "char" is equivalent to "signed char"
typedef short int16_t;
static_assert(sizeof(int64_t) == 8, "expected size does not match");
static_assert(sizeof(uint32_t) == 4, "expected size does not match");
static_assert(sizeof(int8_t) == 1, "expected size does not match");
constexpr int num_threads = 128;
constexpr int thread_work_size = 4; // TODO: make template substitution once we decide where those vars live
constexpr int block_work_size = thread_work_size * num_threads;
//TODO use _assert_fail, because assert is disabled in non-debug builds
#define ERROR_UNSUPPORTED_CAST assert(false);

namespace std {

using ::signbit;
using ::isfinite;
using ::isinf;
using ::isnan;

using ::abs;

using ::acos;
using ::acosf;
using ::asin;
using ::asinf;
using ::atan;
using ::atanf;
using ::atan2;
using ::atan2f;
using ::ceil;
using ::ceilf;
using ::cos;
using ::cosf;
using ::cosh;
using ::coshf;

using ::exp;
using ::expf;

using ::fabs;
using ::fabsf;
using ::floor;
using ::floorf;

using ::fmod;
using ::fmodf;

using ::frexp;
using ::frexpf;
using ::ldexp;
using ::ldexpf;

using ::log;
using ::logf;

using ::log10;
using ::log10f;
using ::modf;
using ::modff;

using ::pow;
using ::powf;

using ::sin;
using ::sinf;
using ::sinh;
using ::sinhf;

using ::sqrt;
using ::sqrtf;
using ::tan;
using ::tanf;

using ::tanh;
using ::tanhf;

using ::acosh;
using ::acoshf;
using ::asinh;
using ::asinhf;
using ::atanh;
using ::atanhf;
using ::cbrt;
using ::cbrtf;

using ::copysign;
using ::copysignf;

using ::erf;
using ::erff;
using ::erfc;
using ::erfcf;
using ::exp2;
using ::exp2f;
using ::expm1;
using ::expm1f;
using ::fdim;
using ::fdimf;
using ::fmaf;
using ::fma;
using ::fmax;
using ::fmaxf;
using ::fmin;
using ::fminf;
using ::hypot;
using ::hypotf;
using ::ilogb;
using ::ilogbf;
using ::lgamma;
using ::lgammaf;
using ::llrint;
using ::llrintf;
using ::llround;
using ::llroundf;
using ::log1p;
using ::log1pf;
using ::log2;
using ::log2f;
using ::logb;
using ::logbf;
using ::lrint;
using ::lrintf;
using ::lround;
using ::lroundf;

using ::nan;
using ::nanf;

using ::nearbyint;
using ::nearbyintf;
using ::nextafter;
using ::nextafterf;
using ::remainder;
using ::remainderf;
using ::remquo;
using ::remquof;
using ::rint;
using ::rintf;
using ::round;
using ::roundf;
using ::scalbln;
using ::scalblnf;
using ::scalbn;
using ::scalbnf;
using ::tgamma;
using ::tgammaf;
using ::trunc;
using ::truncf;

} // namespace std

// NB: Order matters for this macro; it is relied upon in
// promoteTypesLookup and the serialization format.
// Note, some types have ctype as void because we don't support them in codegen
#define AT_FORALL_SCALAR_TYPES_WITH_COMPLEX()
_(uint8_t, Byte) /* 0 /
_(int8_t, Char) / 1 /
_(int16_t, Short) / 2 /
_(int, Int) / 3 /
_(int64_t, Long) / 4 /
_(at::Half, Half) / 5 /
_(float, Float) / 6 /
_(double, Double) / 7 /
_(std::complexat::Half, ComplexHalf) / 8 /
_(std::complex, ComplexFloat) / 9 /
_(std::complex, ComplexDouble) / 10 /
_(bool, Bool) / 11 /
_(void, QInt8) / 12 /
_(void, QUInt8) / 13 /
_(void, QInt32) / 14 /
_(at::BFloat16, BFloat16) / 15 */ \

#define AT_FORALL_SCALAR_TYPES_WITH_COMPLEX_EXCEPT_QINT(_)
_(uint8_t, Byte)
_(int8_t, Char)
_(int16_t, Short)
_(int, Int)
_(int64_t, Long)
_(at::Half, Half)
_(float, Float)
_(double, Double)
_(std::complexat::Half, ComplexHalf)
_(std::complex, ComplexFloat)
_(std::complex, ComplexDouble)
_(bool, Bool)
_(at::BFloat16, BFloat16)

enum class ScalarType : int8_t {
#define DEFINE_ENUM(_1, n) n,
AT_FORALL_SCALAR_TYPES_WITH_COMPLEX(DEFINE_ENUM)
#undef DEFINE_ENUM
Undefined,
NumOptions
};

template <typename T, int size>
struct Array {
T data[size];

device T operator[](int i) const {
return data[i];
}
device T& operator[](int i) {
return data[i];
}
Array() = default;
Array(const Array&) = default;
Array& operator=(const Array&) = default;
device Array(T x) {
for (int i = 0; i < size; i++) {
data[i] = x;
}
}
};

template
struct DivMod {
T div;
T mod;

device DivMod(T _div, T _mod) {
div = _div;
mod = _mod;
}
};

//
struct IntDivider {
IntDivider() = default;

device inline unsigned int div(unsigned int n) const {
unsigned int t = __umulhi(n, m1);
return (t + n) >> shift;
}

device inline unsigned int mod(unsigned int n) const {
return n - div(n) * divisor;
}

device inline DivMod divmod(unsigned int n) const {
unsigned int q = div(n);
return DivMod(q, n - q * divisor);
}

unsigned int divisor; // d above.
unsigned int m1; // Magic number: m' above.
unsigned int shift; // Shift amounts.
};

template
struct TrivialOffsetCalculator {
// The offset for each argument. Wrapper around fixed-size array.
// The offsets are in # of elements, not in bytes.
Array<unsigned int, NARGS> get(unsigned int linear_idx) const {
Array<unsigned int, NARGS> offsets;
#pragma unroll
for (int arg = 0; arg < NARGS; arg++) {
offsets[arg] = linear_idx;
}
return offsets;
}
};

template
struct OffsetCalculator {
OffsetCalculator() = default;
device forceinline Array<unsigned int, NARGS> get(unsigned int linear_idx) const {
Array<unsigned int, NARGS> offsets;
#pragma unroll
for (int arg = 0; arg < NARGS; ++arg) {
offsets[arg] = 0;
}

  #pragma unroll
  for (int dim = 0; dim < 25; ++dim) {
  if (dim == dims) {
      break;
  }

  auto divmod = sizes_[dim].divmod(linear_idx);
  linear_idx = divmod.div;

  #pragma unroll
  for (int arg = 0; arg < NARGS; ++arg) {
      offsets[arg] += divmod.mod * strides_[dim][arg];
  }
  //printf("offset calc thread dim size stride offset %d %d %d %d %d %d %d %d\n",
  //threadIdx.x, dim, sizes_[dim].divisor, strides_[dim][0], offsets[0], linear_idx, divmod.div, divmod.mod);
  }
  return offsets;

}

int dims;
IntDivider sizes_[25];
// NOTE: this approach will not support nInputs == 0
unsigned int strides_[25][NARGS];

};

#define C10_HOST_DEVICE host device
#define C10_DEVICE device

template
device forceinline T WARP_SHFL_DOWN(T value, unsigned int delta, int width = warpSize, unsigned int mask = 0xffffffff)
{
return __shfl_down_sync(mask, value, delta, width);
}

#if 0
template
device forceinline std::complex WARP_SHFL_DOWN(std::complex value, unsigned int delta, int width = warpSize, unsigned int mask = 0xffffffff)
{
return std::complex(
__shfl_down_sync(mask, value.real(), delta, width),
__shfl_down_sync(mask, value.imag(), delta, width));
}
#endif

// aligned vector generates vectorized load/store on CUDA
template<typename scalar_t, int vec_size>
struct alignas(sizeof(scalar_t) * vec_size) aligned_vector {
scalar_t val[vec_size];
};

C10_HOST_DEVICE static void reduce_fraction(size_t &numerator, size_t &denominator) {
// get GCD of num and denom using Euclid's algorithm.
// Can replace this with std::gcd if we ever support c++17.
size_t a = denominator;
size_t b = numerator;
while (b != 0) {
a %= b;
// swap(a,b)
size_t tmp = a;
a = b;
b = tmp;
}

// a is now the GCD
numerator /= a;
denominator /= a;

}

struct ReduceConfig {
//has to match host-side ReduceConfig in the eager code
static constexpr int BLOCK_X = 0;
static constexpr int BLOCK_Y = 1;
static constexpr int CTA = 2;

static constexpr int input_vec_size = 4;
int element_size_bytes;
int num_inputs;
int num_outputs;
int step_input = 1;
int step_output = 1;
int ctas_per_output = 1;
int input_mult[3] = {0, 0, 0};
int output_mult[2] = {0, 0};

int block_width;
int block_height;
int num_threads;

bool vectorize_input = false;
int output_vec_size = 1;

C10_HOST_DEVICE bool should_block_x_reduce() const {
return input_mult[BLOCK_X] != 0;
}

C10_HOST_DEVICE bool should_block_y_reduce() const {
return input_mult[BLOCK_Y] != 0;
}

C10_HOST_DEVICE bool should_global_reduce() const {
return input_mult[CTA] != 0;
}

C10_DEVICE bool should_store(int output_idx) const {
return output_idx < num_outputs &&
(!should_block_x_reduce() || threadIdx.x == 0) &&
(!should_block_y_reduce() || threadIdx.y == 0);
}

C10_DEVICE bool should_reduce_tail() const {
return (!should_block_y_reduce() || threadIdx.y == 0) &&
(!should_global_reduce() || blockIdx.y == 0);
}

C10_HOST_DEVICE int input_idx() const {
int lane = threadIdx.x;
int warp = threadIdx.y;
int cta2 = blockIdx.y;
return (lane * input_mult[BLOCK_X] +
warp * input_mult[BLOCK_Y] +
cta2 * input_mult[CTA]);
}

template
C10_HOST_DEVICE int output_idx() const {
int lane = threadIdx.x;
int warp = threadIdx.y;
int cta1 = blockIdx.x;
return (lane * output_mult[BLOCK_X] +
warp * output_mult[BLOCK_Y] +
cta1 * step_output) * output_vec_size;
}

C10_DEVICE int shared_memory_offset(int offset) const {
return threadIdx.x + (threadIdx.y + offset) * blockDim.x;
}

C10_DEVICE int staging_memory_offset(int cta2) const {
int offset = cta2 + blockIdx.x * gridDim.y;
if (!should_block_x_reduce()) {
offset = threadIdx.x + offset * blockDim.x;
}
return offset;
}

};

//TODO this will need to be different for more generic reduction functions
namespace reducer {

using scalar_t = float;
using arg_t = float;
using out_scalar_t = float;

inline device arg_t combine(arg_t a, arg_t b) { return a * b; }

inline device out_scalar_t project(arg_t arg) {
return (out_scalar_t) arg;
}

inline device arg_t warp_shfl_down(arg_t arg, int offset) {
return WARP_SHFL_DOWN(arg, offset);
}

inline device arg_t translate_idx(arg_t acc, int64_t /idx/) {
return acc;
}

// wrap a normal reduction that ignores the index
inline device arg_t reduce(arg_t acc, arg_t val, int64_t idx) {
return combine(acc, val);
}
}

struct ReduceJitOp {
using scalar_t = float;
using arg_t = float;
using out_scalar_t = float;

using InputCalculator = OffsetCalculator<1>;
using OutputCalculator = OffsetCalculator<2>;

// static constexpr bool can_accumulate_in_output =
// std::is_convertible<arg_t, out_scalar_t>::value
// && std::is_convertible<out_scalar_t, arg_t>::value;

static constexpr int input_vec_size = ReduceConfig::input_vec_size;

arg_t ident;
ReduceConfig config;
InputCalculator input_calc;
OutputCalculator output_calc;
const void* src;
const char* dst[2]; //it accepts at most two destinations
// acc_buf used for accumulation among sub Tensor Iterator when accumulation on
// output is not permissible
void* acc_buf;
// cta_buf used for accumulation between blocks during global reduction
void* cta_buf;
int* semaphores;
int64_t base_idx;
bool accumulate;
bool final_output;
int noutputs;

C10_DEVICE void run() const {
extern shared char shared_memory[];
uint32_t output_idx = config.output_idx<1>();
uint32_t input_idx = config.input_idx();
auto base_offsets1 = output_calc.get(output_idx)[1];

using arg_vec_t = Array<arg_t, 1>;
arg_vec_t value;

if (output_idx < config.num_outputs && input_idx < config.num_inputs) {
  const scalar_t* input_slice = (const scalar_t*)((const char*)src + base_offsets1);

  value = thread_reduce<1>(input_slice);
}

if (config.should_block_y_reduce()) {
  value = block_y_reduce<1>(value, shared_memory);
}
if (config.should_block_x_reduce()) {
  value = block_x_reduce<1>(value, shared_memory);
}

using out_ptr_vec_t = Array<out_scalar_t*, 1>;
using offset_vec_t = Array<uint32_t, 1>;
offset_vec_t base_offsets;
out_ptr_vec_t out;

#pragma unroll
for (int i = 0; i < 1; i++) {
  base_offsets[i] = output_calc.get(output_idx + i)[0];
  out[i] = (out_scalar_t*)((char*)dst[0] + base_offsets[i]);
}

arg_vec_t* acc = nullptr;
if (acc_buf != nullptr) {
  size_t numerator = sizeof(arg_t);
  size_t denominator = sizeof(out_scalar_t);
  reduce_fraction(numerator, denominator);
  acc = (arg_vec_t*)((char*)acc_buf + (base_offsets[0] * numerator / denominator));
}

if (config.should_global_reduce()) {
  value = global_reduce<1>(value, acc, shared_memory);
} else if (config.should_store(output_idx)) {
  if (accumulate) {
    #pragma unroll
    for (int i = 0; i < 1; i++) {
      value[i] = reducer::translate_idx(value[i], base_idx);
    }
  }

  if (acc == nullptr) {
    if (accumulate) {
      value = accumulate_in_output<1>(out, value);
    }
    if (final_output) {
      set_results_to_output<1>(value, base_offsets);
    } else {
      #pragma unroll
      for (int i = 0; i < 1; i++) {
        *(out[i]) = get_accumulated_output(out[i], value[i]);
      }
    }
  } else {
    if (accumulate) {
      #pragma unroll
      for (int i = 0; i < 1; i++) {
        value[i] = reducer::combine((*acc)[i], value[i]);
      }
    }
    if (final_output) {
      set_results_to_output<1>(value, base_offsets);
    } else {
      *acc = value;
    }
  }
}

}

template
C10_DEVICE Array<arg_t, output_vec_size> thread_reduce(const scalar_t* data) const {
if (config.vectorize_input) {
assert(output_vec_size == 1);
// reduce at the header of input_slice where memory is not aligned,
// so that thread_reduce will have an aligned memory to work on.
return {input_vectorized_thread_reduce_impl(data)};
} else {
uint32_t element_stride = input_calc.strides_[0][0] / sizeof(scalar_t);
bool is_contiguous = (input_calc.dims == 1 && element_stride == 1);
if (is_contiguous) {
return thread_reduce_impl<output_vec_size>(data, [](uint32_t idx) { return idx; });
} else if (input_calc.dims == 1) {
return thread_reduce_impl<output_vec_size>(data, [&](uint32_t idx) { return idx * element_stride; });
} else {
return thread_reduce_impl<output_vec_size>(data, [&](uint32_t idx) { return input_calc.get(idx)[0] / sizeof(scalar_t); });
}
}
}

C10_DEVICE arg_t input_vectorized_thread_reduce_impl(const scalar_t* data) const {
uint32_t end = config.num_inputs;

// Handle the head of input slice where data is not aligned
arg_t value = ident;
constexpr int align_bytes = alignof(aligned_vector<scalar_t, input_vec_size>);
constexpr int align_elements = align_bytes / sizeof(scalar_t);
int shift = ((int64_t)data) % align_bytes / sizeof(scalar_t);
if (shift > 0) {
  data -= shift;
  end += shift;
  if(threadIdx.x >= shift && threadIdx.x < align_elements && config.should_reduce_tail()){
    value = reducer::reduce(value, data[threadIdx.x], threadIdx.x - shift);
  }
  end -= align_elements;
  data += align_elements;
  shift = align_elements - shift;
}

// Do the vectorized reduction
using load_t = aligned_vector<scalar_t, input_vec_size>;

uint32_t idx = config.input_idx();
const uint32_t stride = config.step_input;

// Multiple accumulators to remove dependency between unrolled loops.
arg_t value_list[input_vec_size];
value_list[0] = value;

#pragma unroll
for (int i = 1; i < input_vec_size; i++) {
  value_list[i] = ident;
}

scalar_t values[input_vec_size];

load_t *values_vector = reinterpret_cast<load_t*>(&values[0]);

while (idx * input_vec_size + input_vec_size - 1 < end) {
  *values_vector = reinterpret_cast<const load_t*>(data)[idx];
  #pragma unroll
  for (uint32_t i = 0; i < input_vec_size; i++) {
    value_list[i] = reducer::reduce(value_list[i], values[i], shift + idx * input_vec_size + i);
  }
  idx += stride;
}

// tail
uint32_t tail_start = end - end % input_vec_size;
if (config.should_reduce_tail()) {
  int idx = tail_start + threadIdx.x;
  if (idx < end) {
    value_list[0] = reducer::reduce(value_list[0], data[idx], idx + shift);
  }
}

// combine accumulators
#pragma unroll
for (int i = 1; i < input_vec_size; i++) {
  value_list[0] = reducer::combine(value_list[0], value_list[i]);
}
return value_list[0];

}

template <int output_vec_size, typename offset_calc_t>
C10_DEVICE Array<arg_t, output_vec_size> thread_reduce_impl(const scalar_t* data_, offset_calc_t calc) const {
uint32_t idx = config.input_idx();
const uint32_t end = config.num_inputs;
const uint32_t stride = config.step_input;
const int vt0=4;

using arg_vec_t = Array<arg_t, output_vec_size>;
using load_t = aligned_vector<scalar_t, output_vec_size>;
const load_t* data = reinterpret_cast<const load_t*>(data_);

// Multiple accumulators to remove dependency between unrolled loops.
arg_vec_t value_list[vt0];

#pragma unroll
for (int i = 0; i < vt0; i++) {
  #pragma unroll
  for (int j = 0; j < output_vec_size; j++) {
    value_list[i][j] = ident;
  }
}

load_t values[vt0];

while (idx + (vt0 - 1) * stride < end) {
  #pragma unroll
  for (uint32_t i = 0; i < vt0; i++) {
    values[i] = data[calc(idx + i * stride) / output_vec_size];
  }
  #pragma unroll
  for (uint32_t i = 0; i < vt0; i++) {
    #pragma unroll
    for (uint32_t j = 0; j < output_vec_size; j++) {
      value_list[i][j] = reducer::reduce(value_list[i][j], values[i].val[j], idx + i * stride);
    }
  }
  idx += stride * vt0;
}

// tail
int idx_ = idx;
#pragma unroll
for (uint32_t i = 0; i < vt0; i++) {
  if (idx >= end) {
    break;
  }
  values[i] = data[calc(idx) / output_vec_size];
  idx += stride;
}
idx = idx_;
#pragma unroll
for (uint32_t i = 0; i < vt0; i++) {
  if (idx >= end) {
    break;
  }
  #pragma unroll
  for (uint32_t j = 0; j < output_vec_size; j++) {
    value_list[i][j] = reducer::reduce(value_list[i][j], values[i].val[j], idx);
  }
  idx += stride;
}

// combine accumulators
#pragma unroll
for (int i = 1; i < vt0; i++) {
  #pragma unroll
  for (uint32_t j = 0; j < output_vec_size; j++) {
    value_list[0][j] = reducer::combine(value_list[0][j], value_list[i][j]);
  }
}
return value_list[0];

}
template
C10_DEVICE Array<arg_t, output_vec_size> block_x_reduce(Array<arg_t, output_vec_size> value, char* shared_memory) const {
using args_vec_t = Array<arg_t, output_vec_size>;
int dim_x = blockDim.x;
args_vec_t* shared = (args_vec_t*)shared_memory;
if (dim_x > warpSize) {
int address_base = threadIdx.x + threadIdx.y*blockDim.x;
shared[address_base] = value;
for (int offset = dim_x/2; offset >= warpSize; offset >>= 1) {
__syncthreads();
if (threadIdx.x < offset && threadIdx.x + offset < blockDim.x) {
args_vec_t other = shared[address_base + offset];
#pragma unroll
for (int i = 0; i < output_vec_size; i++) {
value[i] = reducer::combine(value[i], other[i]);
}
shared[address_base] = value;
}
}
dim_x = warpSize;
}

__syncthreads();

for (int offset = 1; offset < dim_x; offset <<= 1) {
  #pragma unroll
  for (int i = 0; i < output_vec_size; i++) {
    arg_t other = reducer::warp_shfl_down(value[i], offset);
    value[i] = reducer::combine(value[i], other);
  }
}
return value;

}

template
C10_DEVICE Array<arg_t, output_vec_size> block_y_reduce(Array<arg_t, output_vec_size> value, char* shared_memory) const {
using args_vec_t = Array<arg_t, output_vec_size>;
args_vec_t* shared = (args_vec_t*)shared_memory;
shared[config.shared_memory_offset(0)] = value;
for (int offset = blockDim.y / 2; offset > 0; offset >>= 1) {
__syncthreads();
if (threadIdx.y < offset && threadIdx.y + offset < blockDim.y) {
args_vec_t other = shared[config.shared_memory_offset(offset)];
#pragma unroll
for (int i = 0; i < output_vec_size; i++) {
value[i] = reducer::combine(value[i], other[i]);
}
shared[config.shared_memory_offset(0)] = value;
}
}
return value;
}

C10_DEVICE bool mark_block_finished() const {
shared bool is_last_block_done_shared;

__syncthreads();
if (threadIdx.x == 0 && threadIdx.y == 0) {
  int prev_blocks_finished = atomicAdd(&semaphores[blockIdx.x], 1);
  is_last_block_done_shared = (prev_blocks_finished == gridDim.y - 1);
}

__syncthreads();

return is_last_block_done_shared;

}

template
C10_DEVICE Array<arg_t, output_vec_size> accumulate_in_output(
Array<out_scalar_t*, output_vec_size> out,
Array<arg_t, output_vec_size> value
) const {
Array<arg_t, output_vec_size> ret;
#pragma unroll
for (int i = 0; i < output_vec_size; i++) {
ret[i] = reducer::combine(*(out[i]), value[i]);
}
return ret;
}

C10_DEVICE out_scalar_t get_accumulated_output(
out_scalar_t* out, arg_t value
) const {
assert(!final_output);
return (out_scalar_t)value;
}

template
C10_DEVICE void set_results(const T x, const uint32_t base_offset) const {
assert(noutputs == 1);
auto res = (out_scalar_t*)((char*)dst[0] + base_offset);
*res = x;
}

//TODO - multi-output reduction - we won't be able to use thrust::pair
//just explicitly specify typed output reads/writes
//Currently implemented for max of two outputs
// template<class T1, class T2>
// C10_DEVICE void set_results(const thrust::pair<T1, T2> x, const index_t base_offset) const {
// if (noutputs >= 1) {
// auto res0 = (T1*)((char*)dst[0] + base_offset);
// res0 = x.first;
// }
// if (noutputs >= 2) {
// // base offset is computed assuming element size being sizeof(T1), so we need to make a
// // correction to obtain the correct base offset
// auto res1 = (T2) ((char *) dst[1] + base_offset / sizeof(T1) * sizeof(T2));
// *res1 = x.second;
// }
// }

template
C10_DEVICE void set_results_to_output(Array<arg_t, output_vec_size> value, Array<uint32_t, output_vec_size> base_offset) const {
assert(final_output);
#pragma unroll
for (int i = 0; i < output_vec_size; i++) {
set_results(reducer::project(value[i]), base_offset[i]);
}
}

template
C10_DEVICE Array<arg_t, output_vec_size> global_reduce(Array<arg_t, output_vec_size> value, Array<arg_t, output_vec_size> acc, char shared_memory) const {
using arg_vec_t = Array<arg_t, output_vec_size>;
using out_ptr_vec_t = Array<out_scalar_t*, output_vec_size>;
using offset_vec_t = Array<uint32_t, output_vec_size>;

arg_vec_t* reduce_buffer = (arg_vec_t*)cta_buf;
uint32_t output_idx = config.output_idx<output_vec_size>();
offset_vec_t base_offsets;
out_ptr_vec_t out;

#pragma unroll
for (int i = 0; i < output_vec_size; i++) {
  base_offsets[i] = output_calc.get(output_idx + i)[0];
  out[i] = (out_scalar_t*)((char*)dst[0] + base_offsets[i]);
}

bool should_store = config.should_store(output_idx);
if (should_store) {
  uint32_t offset = config.staging_memory_offset(blockIdx.y);
  reduce_buffer[offset] = value;
}

__threadfence(); // make sure writes are globally visible
__syncthreads(); // if multiple warps in this block wrote to staging, make sure they're all done
bool is_last_block_done = mark_block_finished();

if (is_last_block_done) {
  value = ident;
  if (config.should_block_x_reduce()) {
    uint32_t input_offset = threadIdx.x + threadIdx.y * blockDim.x;
    uint32_t step = blockDim.x * blockDim.y;
    for (; input_offset < config.ctas_per_output; input_offset += step) {
      uint32_t idx = config.staging_memory_offset(input_offset);
      arg_vec_t next = reduce_buffer[idx];
      #pragma unroll
      for (int i = 0; i < output_vec_size; i++) {
        value[i] = reducer::combine(value[i], next[i]);
      }
    }
  } else {
    uint32_t input_offset = threadIdx.y;
    uint32_t step = blockDim.y;
    for (; input_offset < config.ctas_per_output; input_offset += step) {
      uint32_t idx = config.staging_memory_offset(input_offset);
      arg_vec_t next = reduce_buffer[idx];
      #pragma unroll
      for (int i = 0; i < output_vec_size; i++) {
        value[i] = reducer::combine(value[i], next[i]);
      }
    }
  }
  value = block_y_reduce(value, shared_memory);
  if (config.should_block_x_reduce()) {
    value = block_x_reduce<output_vec_size>(value, shared_memory);
  }
  if (should_store) {
    if (accumulate) {
      #pragma unroll
      for (int i = 0; i < output_vec_size; i++) {
        value[i] = reducer::translate_idx(value[i], base_idx);
      }
    }

    if (acc == nullptr) {
      if (accumulate) {
        value = accumulate_in_output<output_vec_size>(out, value);
      }
      if (final_output) {
        set_results_to_output<output_vec_size>(value, base_offsets);
      } else {
        #pragma unroll
        for (int i = 0; i < output_vec_size; i++) {
          *(out[i]) = get_accumulated_output(out[i], value[i]);
        }
      }
    } else {
      if (accumulate) {
        #pragma unroll
        for (int i = 0; i < output_vec_size; i++) {
          value[i] = reducer::combine((*acc)[i], value[i]);
        }
      }
      if (final_output) {
        set_results_to_output<output_vec_size>(value, base_offsets);
      } else {
        *acc = value;
      }
    }
  }
}

return value;

}
};

extern "C"
launch_bounds(512, 4)
global void reduction_prod_kernel(ReduceJitOp r){
r.run();
}
nvrtc: error: invalid value for --gpu-architecture (-arch)

Training progress: 0%| | 0/30000 [00:00<?, ?it/s]
(mip-splatting) root@autodl-container-2a5049addd-1b9dbf2b:~/mip-splatting#

A question about the number of gaussians.

Hi.

Congrats on the great work.

I have a question regarding regarding the total number of gaussians that are there in the final scene. Do the 3D smoothing and 2D Mip Filters end up helping in reducing the number of gaussians too?

If that is the case, could you reply with how much the effect is?

Thanks a lot.
Rahul Goel

artifacts in almost every predicted image

Hi,

thank you for your great work. but when i tried your method, I found that there are artifacts in almost every predicted image. Is it bacause that before rendering, we didn't utilize "create_fused_ply.py" to get the filtered solution? But after that I created the fused ply file and copy it into the model path and renamed it as point_cloud and I run "render.py", An error occured.
So, we don't need to fuse the point_cloud.py before we render? But, as I said in the beginning, artifacts exist in almost every predicted image which is totally different from waht you showed in your paper.
Could you tell me what's going wrong here?
thx

best,

Multi resolution training

Hey,

Thanks for publishing your code!
In the paper, you write that you train and test on multi-resolution images (40% full resolution).
In the code this is activated with the "sample_more_highres" flag.
But this flag is only set to True for the synthetic data.
Am I missing something or do you not sample different resolutions for the other datasets?

Thank you!

wrong paper link

it links to monosdf

Experimental results

Thank you for your open source.
Your paper does not include the results of single-scale training on the mip-nerf 360 dataset, and the results of each scene at the 4x magnification scale. There are only average results. Can you publish these results?

Why opacity compensation on the 3D filter? This makes small Gaussians non-opaque.

The justification for the 3D filter in the paper comes from the fact that "a primitive smaller than 2Tˆ may result in aliasing artifacts during the splatting process, since its size is below twice the sampling interval."

In practice, this means that there is a minimum Gaussian scale (filter_3D in the code), and the real scales are computed from the scales parameters that way:

        scales = torch.square(scales) + torch.square(self.filter_3D)
        scales = torch.sqrt(scales)

However, the code actually implements it as a 3D Gaussian filter, and also applies opactity compensation in 3D:

        scales_square = torch.square(scales)
        det1 = scales_square.prod(dim=1)
        
        scales_after_square = scales_square + torch.square(self.filter_3D) 
        det2 = scales_after_square.prod(dim=1) 
        coef = torch.sqrt(det1 / det2)
        return opacity * coef[..., None]

This last part (opacity compensation), in my opinion, is wrong, because this prevents small Gaussians from being 100% opaque.

For example, a round Gaussian that would exactly have the "parameter" scale filter_3D, and thus the "real" scale sqrt(2)*filter_3D, and a "parameter" opacity of 1, will have a "real" opacity of...

det1 = (filter_3D**2)**3
det2 = (2*filter_3D**2)**3
opacity = 1 * sqrt(det1/det2) = sqrt(1/8) = 0.35

35% !

The effect is similar on larger Gaussian: no Gaussian can be 100% opaque with this 3D opacity compensation.

Conclusion: The 3D filter should really be only about setting a minimum size, without adjusting the opacity, so that small Gaussians can be opaque.

Of course, I totally agree with opacity compensation on 2D Gaussians (which I proposed in Oct 2023 graphdeco-inria/gaussian-splatting#294 (comment))

OMP_NUM_THREADS not recognised

As I am trying to Train the model, I get the issue stating OMP_NUM_THREADS not recognised as an internal or external command...,
Any clue what I could be doing wrong?
I am pretty new to this subject, so any help would be appreciated.
Thanks

Would the online viewer be open-source?

Hi, thanks for great work.

Just wonder if the online viewer would also be open-source?

Thanks.

关于训练尺度

你好请问你的论文里面描述的1x 2x 4x 8x和 1 1/2 1/4 1/8分别是什么意思没太看明白

Slow speed than original 3DGS

I think it may be a little slower than original 3DGS. Have you ever tested its time efficiency?

the meaning of "Mip-Splatting w/o 2D Mip filter"

Hi, thanks for your great work!

I have a question about the part "8. Ablation": To evaluate the effectiveness of the 2D Mip filter, you show Table 5/6/7. And these three tables all have a row with the title "Mip-Splatting w/o 2D Mip filter".

I am confused about the "w/o 2D Mip filter" -- do you just remove the 2D Mip filter, or remove the 2D Mip filter and re-add the Dilation operator? That is to say, I wonder whether "Mip-Splatting w/o 2D Mip filter" means "3DGS + 3D smoothing filter" or "3DGS - Dilation + 3D smoothing filter".

From the part "8.1 Effectiveness of the 3D Smoothing Filter", you mention that "The absence of both the 3D smoothing filter and the 2D Mip filter leads to an excessive generation of small Gaussian primitives". So "removing both the 3D smoothing filter and the 2D Mip filter" means "3DGS - Dilation" (compare Table 5 and Table 6). So I guess that "Mip-Splatting w/o 2D Mip filter" means "3DGS - Dilation + 3D smoothing filter"?

Web Viewer Source Code

Originally posted by @niujinshuchong in #21 (comment)

Although the link provided points to the location of the Viewer's JavaScript source code, there are two issues that need to be addressed:

The source code is packaged, and it is unclear which parts of GaussianSplats3D have been modified;
The online viewer for mip-splatting has performance issues. For the same scene, such as Garden, the frame rate for mip-splatting is only 30 FPS, while the online demo for GaussianSplats3D achieves 60 FPS. How can the rendering performance of mip-splatting be improved?

About subpixel_offset

Great work! And I have a question about code? This parameter（subpixel_offset）is not used by default in the code. What is the meaning of this code?

 #TODO ignore border pixels
        if dataset.ray_jitter:
            subpixel_offset = torch.rand((int(viewpoint_cam.image_height), int(viewpoint_cam.image_width), 2), dtype=torch.float32, device="cuda") - 0.5
            # subpixel_offset *= 0.0
        else:
            subpixel_offset = None

        if dataset.resample_gt_image:
            gt_image = create_offset_gt(gt_image, subpixel_offset)

In Factor 2, the quality of the garden is lower.

After training the garden in Factor 2, I converted it according to the documentation and visualized it through the Viewer.

In Factor 2, however, there are artifacts and errors. Should other parameters also be changed when changing the factor?

The left side shows the training results of Factor 2, while the right side shows the training results of Factor 4.

the meaning of "3DG" and "3DGS + EWA"

Thanks for your great work!

I have a question. In the part "6.1. Implementation", you mention that

"3DGS + EWA" replaces the dilation of 3DGS with the EWA filter.

I'm confused about the meaning of "the EWA filter".

In the original Gaussian-Splatting (3DGS) implementation, there is a 2D low-pass filter, which chooses s=0.3 in the equation $\mathcal{G}_k^{2D}(\mathbf{x}) = e^{-\frac{1}{2}\left(\mathbf{x}-\mathbf{p}_k\right) ^T \left(\boldsymbol{\Sigma}_k^{2D} + s\mathbf{I}\right)^{-1} \left(\mathbf{x}-\mathbf{p}_k\right)}$, and the correlated codes are shown in https://github.com/graphdeco-inria/diff-gaussian-rasterization/blob/59f5f77e3ddbac3ed9db93ec2cfe99ed6c5d121d/cuda_rasterizer/forward.cu#L73-L113.

Do you mean the low-pass filter with s=0.3? If so, it seems that the EWA Splatting algotithm / EWA filter is already included in the official implementation of 3DGS. Could you please explain the differences between row titles of "3DGS" and "3DGS + EWA" that appear multiple times in tables of your paper? Do you mean:

"3DGS" ≈ official 3DGS - low-pass filter with s=0.3
"3DGS + EWA" ≈ official 3DGS implementation

Or to be more simplified, my question is presented in the image below:

about 5.1 3D Smoothing Filter

Thanks your great work!

One question:
In the last paragraph in 5.1 3D Smoothing Filter,
"By employing 3D Gaussian smoothing, we ensure that the highest frequency component of any Gaussian does not exceed half of its maximal sampling rate for at least one camera."

As I know, gaussian filter is soft smoothing filter, so most high frequency ($> \hat v_k/2$) component of a gaussian will be remove, but still very little high frequency ($> \hat v_k/2$) component will remain .
Am I corret?

In one word:
No any high frequency($> \hat v_k/2$) component remain, or still very little high frequency($> \hat v_k/2$) component remain?

extern "C" __launch_bounds(512, 4) global__ void reduction_prod_kernel(ReduceJitOp r){ r.run(); }

when i run python train.py -s data use my data happened that

and following is my data file

about the meaning of kernel_size

Hi,

Thanks for your great work!
A stupid question is that what exactly does kernel_size in the code mean, and why it is set as 0.1 for all the exps?