Surface Normals & Properties Estimation on Point Sets

Surface Normals & Properties Estimation on Point Sets

1.      Introduction

Point sets or point clouds (PC) are collection of points that represent a 3D shape. It is an inherently discontinuous primitive compared to triangles and volumetric for mesh representation. Often PC’s are acquired without normals from 3D scanners for engineering, medical, or visual effects applications. Point Normals are crucial for shading, determining connectivity among neighboring points, forming surfaces, and geometry processing operations such as denoising and resampling. The purpose of this project is to implement normals estimation techniques for point clouds, and explore the relevant surface parameters.

Go to Point Cloud posts

NB: Posts are excerpts from ongoing thesis reports.

Raytracing PCD as Implicit Surface

Implicit surface: higher quality than disk render, more accurate surface silhouettes.
Raytracing at interactive rates (video sped up)

3 clips:
1) No octree
2) In octree, radius 0.1: takes 60% time, very small speedup (compared to disk) since smooth surface requires more points. A denser model will benefit more from octree.
3) Radius 0.05: 36% time of 0.1 radius is a significant speedup. Smaller radius has less octree artifact due to less overlap, but less points mean it is statistically less stable as shown by the noise around ears.

Hierarchical LOD by Clustering & Frontface fix

Clustering – Point Cloud Simplification

Read the rest of this entry

Raytracing PCD as Implicit Surface

Iteratively project point x onto implicit surface, until x converges:

Read the rest of this entry

Diffuse 02 – Point Cloud Octree Raytracing

Video details: Diffuse pass, 1 directional light.
1) Octree partition artifact: overlap points duplicated across octree partitions to remove holes.
2) Redundant points from different leaves: fixed to avoid shading an overlap point twice.
3) Some backface and frontface fixes (a bit hacky), so the far frontface gets occluded by near frontface, regardless how large a leaf is.
4) Smooth Silhouette: removes the “disk” look.

Diffuse 01: previous artifacts

 

Octree Partition artifact:
Diagnostic:
Any duplicate points may cause discrete shading, which isn’t addressed in any papers that suggest redundant points in voxels. Tracing 2 rays 1 pixel apart, pixel 85 hits 4 leaves while pixel 84 hits 5 leaves. 2 overlap points appear in 3 leaves. Each point can appear in up to 7 leaves, and both rays hit 3 of the leaves.
Overlap points:
p2 [85, 256]:      dist prune IN. pos: 1.398461 1.886459 1.376074
p1 [85, 256]:      dist prune IN. pos: 1.395789 1.871096 1.369734

Pixel 85 gathers 2 duplicates per point, while pixel 84 gathers 3. So the former shades 17 points while later 19.

Problem 1: Different nnr_count (final shaded points count) results in different shading. I didn’t find a shading method that ignores duplicates (without explicit duplicates removal).
Problem 2: Overlap points are weighted multiple times if they exist in multiple leaves the ray hits.

I considered over three methods beforehand, but assumed the problem wouldn’t be visible and trusted the way mentioned in the papers.
Solutions: Either need to ensure overlap points are “unique” at octree construction phase. Or include the entire voxel which a point overlaps, forming a kind of nearest neighbor voxel structure.

High % of extra overlap points makes storage inefficient. Previous papers cite 3x-8x increase. Overlap points grow at more than 2x per level when the level scale is small, so need to detect when to terminate a branch. Yet point count has great effect on time, eg 150 points 160ms, 75 points 90ms. So need to balance between overlap redundancy vs point count per voxel.

CUDA: Raytracing PCD & Octree

video speed 4x. Screenshot explanations below.

Interactive octree visualization of the bunny ears. Red shows leaf nodes, green internal nodes. Various pointers and memory issues finally fixed after a whole week merging the Nvidia traversal algorithm with the system.

device time: 8.40~20.55ms
host time: 39~156ms

 

 

 

Raytracing oriented disks. Silhouette shows disk primitives, but has the least artifact from octree partitioning.

device time: 17~71ms
host time: 50~450ms

 

With nnr_count>=3, silhouette shows the artifacts. Compare with non-octree renders (previous post).

device time: 71ms
host time: 462ms

 

Octree partitioning artifact is more obvious for large radius. Points in a voxel are missing neighbors required to calculate a continuous silhouette & shading.

device time: 73ms
host time: 456ms

 

Each ray merges all points from leafnodes traversed, so two ears appear stacked on top transparently.

Benchmark

Non-octree raytrace times:
device time: 928~950ms
host time: 935~1000ms

Octree traversal overhead is small (<21ms). And octree PCD render takes 8% the time, or ~13x speedup due to octree. The host time is however highly unusual, seems to be 7~8x the device time. Device code supposedly does not send image data to host. There is a significant amount of CUDA printf’s compared to the non-octree version, but it should be counted as part of CUDA event?
Speed is highly sensitive to hit rays count, suggesting each ray shading is slow.

Plan

1) Diffuse shading. Currently only flat point color rendered. Diffuse exposes artifacts and noise better.

2) Octree partitioning artifact:
A common method [Kashyap10] describes is insert duplicate points into every voxel that contacts the point with respective radius. They further implement culling rays that remove the added duplicates if they are not visible from test rays. Resulting points data is 2~3x the original. Both the voxel-sphere hit test and culling take significant preprocessing time.
Another method briefly suggested is each leaf maintains neighbor leaves, but how a leaf discovers and stores neighbors from the same or different level requires further research.

3) Leaf occlusion: distance threshold heuristics?

4) Discover device and host time relationship

This is the first version to actually render using PCD and octree. So there is no optimization yet. Focus is on making parts work and fixing artifacts first. A later priority is to simplify the multiple for loops in gathering leaf points and shading.


Octree

An inefficiency issue is a few points (eg right ear tip) occupy a large voxel at coarser level (eg level 2). It means many rays hit the large voxel that contains points in a much limited bounding region. Here KD-tree or BVH has pruning advantage, though the octree suitability for GPU has its own strength that may need benchmark comparisons. 

GPU benchmark timing

CPU timer:
GLUT doesn’t work (only seconds resolution).
Most suggest OS dependent code. I used some public code that works on Linux & Windows.

GPU timer works. Resolution about 0.5 millisecond. I packaged up some code from the programming guide:

cudaEvent_t start, stop;
float time;
void cudaTimeStart(){
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord( start, 0 );
}
void cudaTimeStop(){
cudaEventRecord( stop, 0 );
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
cudaEventCreate(&start);
cudaEventCreate(&stop);
printf("time %.2fms\n\n", time);
}

CUDA notes: kernel memory & variables

Some following errors may be due to other compounded reasons, treat them as suggestions and confirm in your own code.

  • All kernel local variables allocated in sequence.
    Each variable must suit alignment req.
    Variables that violate alignment will affect variables allocated afterwards.
    Don’t initialize variables anywhere in loops (eg int i iterator in for(;;)). Caused array data write error.
    Only initialize variables in function scope (none in nested scope).
    Always initialize variables manually before read, like C. In C++ many constructors automatically init to 0. Without initialization, printing CUDA variables show 0 as well (missleading), but writing to elements in an array “spills over” to other elements (eg setting x[i]=7 also sets x[?]==7).
  • c++: code generation: struct member alignment: default | /Zp16 bytes. Don’t change this, keep at default.
  • Local memory: Different from what I learned in HPC over OpenCL GPGPU. “Local” is not near the core. The access is private to each thread, but the data is resides in the same chip as global memory (equally slow). __local__ qualifier doesn’t exist even though some online PDFs mention this, perhaps it was deprecated? From CUDA docs, it seems we should avoid this. Basically kernel variables default to private in fast register or L1 cache, and spill over to local memory only if too large.
  • __alignof(v) // always return alternating 4 0 4 0, regardless of variable order. Might find alignment issues by printing this property, but it hasn’t helped.

CUDA raytrace first images

9/10/2011 First CUDA images. It’s been a long way setting up the dev env, and learning basics of CUDA. Especially memory and some syntax have high learning curve. Finding and cooperating with helper libraries is also a challenge, and spend days fixing compiler/linker parameters.

Currently only x64 bit release configuration works, and debug config is “faked”. There’s a lot of dependency on debug version dlls, that certain libraries are missing. So I’ll just keep the env for now and focus on porting Matlab to CUDA code.

Camera is perspective while Matlab was in
orthographic, so some camera parameters must be adjusted for similar view.Blending shader fixed. The 2 ears were
black or flickering colors for quite awhile, before learning some more CUDA
code specifics, how it’s different from C/C++. Not every C syntax works in
kernel code, though most should work in the host portion.
Render points as oriented disks

 

Read the rest of this entry

Backface culling & Octree

Backface cull

Only slight speed up, because more set
checking overhead, but may help more on high point count.

backface may mess up contour because
backfacing points at contours contribute to normals blending.
BENCHMARK

setup general 0.0053
r_max setup 0.0017
r_max prune 1.1013
raytrace loop 9.2197
rescale normal->color 0.0513

profiler (CPU time): 16 seconds
Shrink all r to r_render instead of using larger nn.r. Minor raytrace changes.
Around 3.7 seconds off raytrace loop by smaller r, shows the r_max pre-pruning
step is very effective.

profiler (CPU time): 8.780 seconds

BENCHMARK

setup general 0.0143
r_max setup 0.0018
r_max prune 1.0260
raytrace loop 5.4854
rescale normal->color 0.0326

#################
backface: simple method 1:

Reasoning: Most points along the silhouette
have normals perpendicular to view ray, or is similar to normals of
front-facing points near the silhouette.

Method: Detect backface border points pj by
angle threshold & distance threshold to the centroid of frontface border
points pi.

Description: image008
point render, 4 pixel size raytrace xyz2 +3pts r_scale-min6
backfacecull dist_avgMUL1 borderface Vd lt.7071Method 1: Per ray block cull. Blocky artifacts mar the border & color because backface is
determined per ray block. Per ray may work but requires weighting the point
by ray distance, like raytracing through volumetric opacity, very expensive.Because of the border characteristic,
backface points cannot apply the same algorithm as from a local per triangle
culling.

Read the rest of this entry

Update 05/07/2011: Normals Fix & Radius Learning

Introduction

Normals fixes page 1. r radius estimation page 6.

Unify Normals

BSP is very fast but has a percentage of failed unifications at sharper regions, and required
tuning the r radii by remapping the per point r. MST ran for 2 hours but could
not finish. The following explain each improvement to minimize failed
unification.

r limits

nn.r has been scaled down by 0.5, and min
clamped by distance to the kth neighbor (set as degrees=6). Remap changes are
fairly unstable, because larger r unifies irrelevant neighbors. If nn.r is
minimized to 1st neighbor distance, it effectively becomes MST. But distance similarity
alone does not correctly unify neighbors.

k limits

Each unifier can only affect k nearest
ununified neighbors (set as degrees=6). Where k=1 models MST, k limit is a
tradeoff between normals propagation speed versus accuracy. Because corner
points have highly varied neighbors distances, k limit is more stable than r
limits. This also prevents unifiers unifying across two nearby parallel
surfaces (Figure k limits).

Figure k limits.
k=Inf (left). k=6 (right).
Although
BSP takes the same iterations (44), maximum branching unifiers increased from
471 to 515.

  Read the rest of this entry

Follow

Get every new post delivered to your Inbox.