Surface Normals & Properties Estimation on Point Sets
Surface Normals & Properties Estimation on Point Sets
1. Introduction
Point sets or point clouds (PC) are collection of points that represent a 3D shape. It is an inherently discontinuous primitive compared to triangles and volumetric for mesh representation. Often PC’s are acquired without normals from 3D scanners for engineering, medical, or visual effects applications. Point Normals are crucial for shading, determining connectivity among neighboring points, forming surfaces, and geometry processing operations such as denoising and resampling. The purpose of this project is to implement normals estimation techniques for point clouds, and explore the relevant surface parameters.
NB: Posts are excerpts from ongoing thesis reports.
Raytracing PCD as Implicit Surface
Implicit surface: higher quality than disk render, more accurate surface silhouettes.
Raytracing at interactive rates (video sped up)
3 clips:
1) No octree
2) In octree, radius 0.1: takes 60% time, very small speedup (compared to disk) since smooth surface requires more points. A denser model will benefit more from octree.
3) Radius 0.05: 36% time of 0.1 radius is a significant speedup. Smaller radius has less octree artifact due to less overlap, but less points mean it is statistically less stable as shown by the noise around ears.
Hierarchical LOD by Clustering & Frontface fix
Clustering – Point Cloud Simplification
Read the rest of this entry
Diffuse 02 – Point Cloud Octree Raytracing
Video details: Diffuse pass, 1 directional light.
1) Octree partition artifact: overlap points duplicated across octree partitions to remove holes.
2) Redundant points from different leaves: fixed to avoid shading an overlap point twice.
3) Some backface and frontface fixes (a bit hacky), so the far frontface gets occluded by near frontface, regardless how large a leaf is.
4) Smooth Silhouette: removes the “disk” look.
Diffuse 01: previous artifacts
Octree Partition artifact:
Diagnostic:
Any duplicate points may cause discrete shading, which isn’t addressed in any papers that suggest redundant points in voxels. Tracing 2 rays 1 pixel apart, pixel 85 hits 4 leaves while pixel 84 hits 5 leaves. 2 overlap points appear in 3 leaves. Each point can appear in up to 7 leaves, and both rays hit 3 of the leaves.
Overlap points:
p2 [85, 256]: dist prune IN. pos: 1.398461 1.886459 1.376074
p1 [85, 256]: dist prune IN. pos: 1.395789 1.871096 1.369734
Pixel 85 gathers 2 duplicates per point, while pixel 84 gathers 3. So the former shades 17 points while later 19.
Problem 1: Different nnr_count (final shaded points count) results in different shading. I didn’t find a shading method that ignores duplicates (without explicit duplicates removal).
Problem 2: Overlap points are weighted multiple times if they exist in multiple leaves the ray hits.
I considered over three methods beforehand, but assumed the problem wouldn’t be visible and trusted the way mentioned in the papers.
Solutions: Either need to ensure overlap points are “unique” at octree construction phase. Or include the entire voxel which a point overlaps, forming a kind of nearest neighbor voxel structure.
High % of extra overlap points makes storage inefficient. Previous papers cite 3x-8x increase. Overlap points grow at more than 2x per level when the level scale is small, so need to detect when to terminate a branch. Yet point count has great effect on time, eg 150 points 160ms, 75 points 90ms. So need to balance between overlap redundancy vs point count per voxel.
CUDA: Raytracing PCD & Octree
video speed 4x. Screenshot explanations below.
![]() |
|
Interactive octree visualization of the bunny ears. Red shows leaf nodes, green internal nodes. Various pointers and memory issues finally fixed after a whole week merging the Nvidia traversal algorithm with the system. device time: 8.40~20.55ms |
![]() |
![]() |
|
Raytracing oriented disks. Silhouette shows disk primitives, but has the least artifact from octree partitioning. device time: 17~71ms |
![]() |
|
With nnr_count>=3, silhouette shows the artifacts. Compare with non-octree renders (previous post). device time: 71ms |
![]() |
|
Octree partitioning artifact is more obvious for large radius. Points in a voxel are missing neighbors required to calculate a continuous silhouette & shading. device time: 73ms |
![]() |
|
Each ray merges all points from leafnodes traversed, so two ears appear stacked on top transparently. |
Benchmark
Non-octree raytrace times:
device time: 928~950ms
host time: 935~1000ms
Octree traversal overhead is small (<21ms). And octree PCD render takes 8% the time, or ~13x speedup due to octree. The host time is however highly unusual, seems to be 7~8x the device time. Device code supposedly does not send image data to host. There is a significant amount of CUDA printf’s compared to the non-octree version, but it should be counted as part of CUDA event?
Speed is highly sensitive to hit rays count, suggesting each ray shading is slow.
Plan
1) Diffuse shading. Currently only flat point color rendered. Diffuse exposes artifacts and noise better.
2) Octree partitioning artifact:
A common method [Kashyap10] describes is insert duplicate points into every voxel that contacts the point with respective radius. They further implement culling rays that remove the added duplicates if they are not visible from test rays. Resulting points data is 2~3x the original. Both the voxel-sphere hit test and culling take significant preprocessing time.
Another method briefly suggested is each leaf maintains neighbor leaves, but how a leaf discovers and stores neighbors from the same or different level requires further research.
3) Leaf occlusion: distance threshold heuristics?
4) Discover device and host time relationship
This is the first version to actually render using PCD and octree. So there is no optimization yet. Focus is on making parts work and fixing artifacts first. A later priority is to simplify the multiple for loops in gathering leaf points and shading.
Octree
An inefficiency issue is a few points (eg right ear tip) occupy a large voxel at coarser level (eg level 2). It means many rays hit the large voxel that contains points in a much limited bounding region. Here KD-tree or BVH has pruning advantage, though the octree suitability for GPU has its own strength that may need benchmark comparisons.
GPU benchmark timing
CPU timer:
GLUT doesn’t work (only seconds resolution).
Most suggest OS dependent code. I used some public code that works on Linux & Windows.
GPU timer works. Resolution about 0.5 millisecond. I packaged up some code from the programming guide:
cudaEvent_t start, stop;
float time;
void cudaTimeStart(){
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord( start, 0 );
}
void cudaTimeStop(){
cudaEventRecord( stop, 0 );
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
cudaEventCreate(&start);
cudaEventCreate(&stop);
printf("time %.2fms\n\n", time);
}
CUDA notes: kernel memory & variables
Some following errors may be due to other compounded reasons, treat them as suggestions and confirm in your own code.
- All kernel local variables allocated in sequence.
Each variable must suit alignment req.
Variables that violate alignment will affect variables allocated afterwards.
Don’t initialize variables anywhere in loops (eg int i iterator in for(;;)). Caused array data write error.
Only initialize variables in function scope (none in nested scope).
Always initialize variables manually before read, like C. In C++ many constructors automatically init to 0. Without initialization, printing CUDA variables show 0 as well (missleading), but writing to elements in an array “spills over” to other elements (eg setting x[i]=7 also sets x[?]==7). - c++: code generation: struct member alignment: default | /Zp16 bytes. Don’t change this, keep at default.
- Local memory: Different from what I learned in HPC over OpenCL GPGPU. “Local” is not near the core. The access is private to each thread, but the data is resides in the same chip as global memory (equally slow). __local__ qualifier doesn’t exist even though some online PDFs mention this, perhaps it was deprecated? From CUDA docs, it seems we should avoid this. Basically kernel variables default to private in fast register or L1 cache, and spill over to local memory only if too large.
- __alignof(v) // always return alternating 4 0 4 0, regardless of variable order. Might find alignment issues by printing this property, but it hasn’t helped.
CUDA raytrace first images
9/10/2011 First CUDA images. It’s been a long way setting up the dev env, and learning basics of CUDA. Especially memory and some syntax have high learning curve. Finding and cooperating with helper libraries is also a challenge, and spend days fixing compiler/linker parameters.
Currently only x64 bit release configuration works, and debug config is “faked”. There’s a lot of dependency on debug version dlls, that certain libraries are missing. So I’ll just keep the env for now and focus on porting Matlab to CUDA code.
![]() |
![]() |
| Camera is perspective while Matlab was in orthographic, so some camera parameters must be adjusted for similar view.Blending shader fixed. The 2 ears were black or flickering colors for quite awhile, before learning some more CUDA code specifics, how it’s different from C/C++. Not every C syntax works in kernel code, though most should work in the host portion. |
Render points as oriented disks
|
Backface culling & Octree
Backface cull
Only slight speed up, because more set
checking overhead, but may help more on high point count.
backface may mess up contour because
backfacing points at contours contribute to normals blending.
BENCHMARK
| setup general | 0.0053 |
| r_max setup | 0.0017 |
| r_max prune | 1.1013 |
| raytrace loop | 9.2197 |
| rescale normal->color | 0.0513 |
profiler (CPU time): 16 seconds
Shrink all r to r_render instead of using larger nn.r. Minor raytrace changes.
Around 3.7 seconds off raytrace loop by smaller r, shows the r_max pre-pruning
step is very effective.
profiler (CPU time): 8.780 seconds
BENCHMARK
| setup general | 0.0143 |
| r_max setup | 0.0018 |
| r_max prune | 1.0260 |
| raytrace loop | 5.4854 |
| rescale normal->color | 0.0326 |
#################
backface: simple method 1:
Reasoning: Most points along the silhouette
have normals perpendicular to view ray, or is similar to normals of
front-facing points near the silhouette.
Method: Detect backface border points pj by
angle threshold & distance threshold to the centroid of frontface border
points pi.
![]() |
![]() |
| point render, 4 pixel size | raytrace xyz2 +3pts r_scale-min6 backfacecull dist_avgMUL1 borderface Vd lt.7071Method 1: Per ray block cull. Blocky artifacts mar the border & color because backface is determined per ray block. Per ray may work but requires weighting the point by ray distance, like raytracing through volumetric opacity, very expensive.Because of the border characteristic, backface points cannot apply the same algorithm as from a local per triangle culling. |
Update 05/07/2011: Normals Fix & Radius Learning
Introduction
Normals fixes page 1. r radius estimation page 6.
Unify Normals
BSP is very fast but has a percentage of failed unifications at sharper regions, and required
tuning the r radii by remapping the per point r. MST ran for 2 hours but could
not finish. The following explain each improvement to minimize failed
unification.
r limits
nn.r has been scaled down by 0.5, and min
clamped by distance to the kth neighbor (set as degrees=6). Remap changes are
fairly unstable, because larger r unifies irrelevant neighbors. If nn.r is
minimized to 1st neighbor distance, it effectively becomes MST. But distance similarity
alone does not correctly unify neighbors.
k limits
Each unifier can only affect k nearest
ununified neighbors (set as degrees=6). Where k=1 models MST, k limit is a
tradeoff between normals propagation speed versus accuracy. Because corner
points have highly varied neighbors distances, k limit is more stable than r
limits. This also prevents unifiers unifying across two nearby parallel
surfaces (Figure k limits).
|
|
|
Figure k limits. |



















