Press F11 to view this in full screen.
Press Left/Right to advance through the presentation.
Make sure to click on the play button for Point Clouds!
Don't miss the vertical slides - you'll see up/down arrows on the bottom right!
You can press the "esc" key to go to a slide overview.
Perceptual Segmentation of Visual Streams
by Tracking of Objects and Parts
Georg-August-Universität Göttingen
Institut für Informatik
Göttingen, 17 October 2014
Disputation for the award of the degree "Doctor of Philosophy"
How do we learn to perceive objects?
“Infants appear to perceive objects by analyzing three-dimensional surface arrangements and motions... [they] divide perceptual arrays into units that move as connected wholes, that move separately from one another, that tend to maintain their size and shape over motion, and that tend to act upon each other only on contact.” *
There are multiple interacting elements essential to development:
- Coherent motion at multiple levels
- Temporal continuity of size and shape
- Objects act on each other only on contact
* Spelke, Elizabeth S. "Principles of object perception." Cognitive science 14, no. 1 (1990): 29-56.
Temporal Connections without Objects
How can we create partitions when we don't know what an object is beforehand?
We have no difficulty tracking the pieces of objects when they split.
- This implies maintenance of both low-level and object-level spatio-temporal tracking.
Parsing Video Streams - Existing Methodologies
Video Object Segmentation
e.g. Abramov et al.
Grundmann et al.
This parses a video into spatio-temporal volumes - “objects”
The core assumption is that “objects” must form continuous spatio-temporal volumes!
Processed on VideoSegmentation.com
Parsing Video Streams - Existing Methodologies
Semantic Event Chains - represent actions by analyzing the creation and deletion of edges in the segment adjacency graph.
Analysis of temporal evolution of graph structure yields semantics
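As a toy illustration of the event-chain idea (not code from the cited work, and with invented segment labels), tracking the graph's temporal evolution amounts to recording which adjacency edges appear or vanish between consecutive frames:

```python
# Toy sketch: each frame is a set of segment-adjacency edges; an "event"
# is the set of edges created or deleted between consecutive frames.
# Segment labels (H = hand, C = cup, T = table) are illustrative only.

def adjacency_events(frames):
    """frames: list of edge sets, one segment-adjacency graph per frame.
    Returns one (created, deleted) pair of edge sets per transition."""
    events = []
    for prev, curr in zip(frames, frames[1:]):
        created = curr - prev   # edges that appear: segments come into contact
        deleted = prev - curr   # edges that vanish: segments separate
        events.append((created, deleted))
    return events

# A hand touches a cup on the table, then releases it.
frames = [
    {frozenset({"H", "T"}), frozenset({"C", "T"})},   # apart
    {frozenset({"H", "T"}), frozenset({"C", "T"}),
     frozenset({"H", "C"})},                          # touching
    {frozenset({"H", "T"}), frozenset({"C", "T"})},   # released
]
events = adjacency_events(frames)
```

The sequence of such (created, deleted) pairs is what carries the action semantics.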
Maniac Dataset: Breakfast
This requires a priori knowledge of objects!
A Point Cloud
Advantages of 3D
- Avoids size/shape ambiguities of perspective transformation.
- Can reason about occlusions at a low level.
- Can use size and shape as a feature.
Building an Adjacency Graph
- A special octree type was developed that maintains voxel adjacency information
- This gives us back pixel-like (grid) relations, while keeping real 3D adjacency
- Region growing and connectivity graph become very efficient
Octree Adjacency Structure - Leaves now link to their spatial neighbors.
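The thesis implements this as a specialized octree; as a minimal stand-in for the idea, voxels can be keyed by integer grid coordinates so that each occupied voxel enumerates its occupied 26-connected neighbors with constant-time lookups, recovering pixel-like grid relations in true 3D:

```python
# Toy sketch of voxel adjacency (a hash map, not the thesis octree).
from itertools import product

def voxelize(points, resolution):
    """Map 3D points to occupied voxel keys (integer grid coordinates)."""
    return {tuple(int(c // resolution) for c in p) for p in points}

def neighbors(voxels, key):
    """Occupied 26-connected neighbors of an occupied voxel."""
    x, y, z = key
    return [(x + dx, y + dy, z + dz)
            for dx, dy, dz in product((-1, 0, 1), repeat=3)
            if (dx, dy, dz) != (0, 0, 0)
            and (x + dx, y + dy, z + dz) in voxels]

points = [(0.2, 0.2, 0.2), (1.3, 0.1, 0.4), (5.2, 0.0, 0.0)]
voxels = voxelize(points, resolution=1.0)
# The first two points fall in adjacent voxels; the third is isolated.
```

With adjacency available per leaf, region growing and connectivity-graph construction reduce to cheap neighbor queries.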
Voxel Cloud Connectivity Segmentation
- VCCS is a region-growing oversegmentation technique that uses local geometry to respect object boundaries
- Growth is constrained to flow across voxel connections
- Uses color, normals, and a spatial smoothness constraint
Test Scene
Iterative Expansion of Supervoxels using VCCS
Examples of Supervoxels
Example of Supervoxels with different seed sizes - from NYU Dataset
Performance of VCCS Compared to State-of-the-Art Methods
Quantitative Comparison to SLIC
SLIC Superpixels
Speed and Performance vs State of the Art
Performance of VCCS Compared to State-of-the-Art Methods
Speed of VCCS Compared to State-of-the-Art Methods
Supervoxels in a Point Cloud
Local Convexity Segmentation (LCCP)
Use a local convexity criterion on adjacency graph edges to split the graph.
Flow of segmentation: voxels to supervoxels to local convex patches.
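The convexity test itself is geometric: a connection between two patches is convex when their normals "open away" from each other along the line joining their centroids. A simplified sketch (the published criterion additionally uses an angle threshold and a sanity criterion, omitted here):

```python
# Simplified local convexity check between two adjacent surface patches,
# each given by a centroid x and a unit normal n.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def is_convex(x1, n1, x2, n2, eps=0.0):
    """Edge is convex when (n1 - n2) . (x1 - x2) >= -eps."""
    d = [a - b for a, b in zip(x1, x2)]
    dn = [a - b for a, b in zip(n1, n2)]
    return dot(dn, d) >= -eps

# Patches on the outside of a sphere (convex) vs. inside a bowl (concave):
convex = is_convex((1, 0, 0), (1, 0, 0), (0, 1, 0), (0, 1, 0))
concave = not is_convex((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0))
```

Segments then emerge as connected components of the adjacency graph after cutting all non-convex edges.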
LCCP Comparison on OSD Dataset
LCCP Comparison on NYU Dataset
LCCP Segments in a Point Cloud
Can segment huge full 3D scenes efficiently.
Sequential Clouds & Occlusion Reasoning
Occlusions appear as “shadows” in rendered point clouds.
For instance, here the lemon (which we want to keep track of) and much of the table is hidden by the bowl.
These blank areas limit our ability to maintain temporal continuity - object permanence.
Pointcloud without Occlusion Reasoning
Fortunately, we can perform some low-level reasoning about occlusions.
Sequentially Updated Octree
If we assume no camera motion, we can reason about why voxels “disappear”
Check for occlusion by ray-tracing paths from voxel to camera
Camera is facing us from this perspective - notice shadows extend towards the viewer.
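A simplified sketch of such a check (a coarse voxel ray march, not the thesis code): a voxel that vanished from the current frame is flagged as occluded if the straight path from it to the camera passes through an occupied voxel, and as truly gone otherwise.

```python
# Toy occlusion test: march from a vanished voxel toward the camera in
# half-voxel steps and look for an occupied voxel along the way.
import math

def is_occluded(voxel, camera, occupied, resolution):
    own_key = tuple(int(c // resolution) for c in voxel)
    direction = [c - v for v, c in zip(voxel, camera)]
    length = math.sqrt(sum(d * d for d in direction))
    steps = max(1, int(length / (resolution / 2.0)))
    for i in range(1, steps):
        t = i / steps
        p = [v + t * d for v, d in zip(voxel, direction)]
        key = tuple(int(c // resolution) for c in p)
        if key != own_key and key in occupied:
            return True
    return False

# Camera at the origin; a "wall" voxel at depth ~1 hides a voxel at
# depth ~2 that lies behind it, while an off-axis voxel stays visible.
occupied = {(0, 0, 10)}                      # occupied keys at resolution 0.1
camera = (0.0, 0.0, 0.0)
hidden = is_occluded((0.05, 0.05, 2.05), camera, occupied, resolution=0.1)
clear = is_occluded((1.05, 0.05, 2.05), camera, occupied, resolution=0.1)
```

Occluded voxels can then be kept in the world model instead of being deleted, which is what provides the object permanence discussed above.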
Demonstration of Occlusion Reasoning
Left frame shows input data without occlusion reasoning
Right shows the same input with ray-tracing checks
Pointcloud with Occlusion Reasoning
Particle filter tracking in Point Clouds
A correspondence-based particle filter approach is used.
Models used for tracking are point clouds, partitioned using supervoxels into strata for sampling.
Stratified Correspondence Sampling
Supervoxels are used to choose spatial strata for uniform random sampling.
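The sampling step can be sketched as follows (a rough illustration with invented names, not the thesis implementation): rather than drawing correspondences uniformly over the whole model cloud, a fixed number is drawn from every supervoxel stratum, so small but distinctive parts are never starved of samples.

```python
# Toy stratified sampling: every supervoxel stratum contributes the same
# number of sampled points, regardless of its size.
import random

def stratified_sample(labeled_points, samples_per_stratum, rng):
    """labeled_points: dict mapping supervoxel label -> list of points."""
    sample = []
    for label, points in labeled_points.items():
        k = min(samples_per_stratum, len(points))
        sample.extend(rng.sample(points, k))
    return sample

rng = random.Random(0)
strata = {"handle": [(0, 0, 0), (0, 0, 1)],
          "body": [(1, 0, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1)]}
picked = stratified_sample(strata, samples_per_stratum=2, rng=rng)
```

Here the small "handle" stratum contributes as many samples as the much larger "body", which is the point of stratifying.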
Results on Synthetic Benchmark
Results on VR Data
Plot of Displacement Error vs time per frame (ms) averaged across 50 VR Test Runs for different numbers of particles and samples per stratum.
Results on VR Data
Plot of Rotational Error vs time per frame (ms) averaged across 50 VR Test Runs for different numbers of particles and samples per stratum.
Tracking Low Level Patches - Why Temporal Supervoxels?
Tracking low-level patches would let us make temporal connections without needing to specify objects a priori.
Objects that split are problematic if we segment and track using a priori models. How do we label the pieces?
We have our low level patch representation - Supervoxels.
We have an efficient tracking method.
So, what's the problem?
Cortical Feedback Mechanisms
Humans appear to use top-down feedback mechanisms
Feedback allows high-level areas to influence low-level vision, even receptive fields
Feed-forward and Feedback Mechanisms in the Human Visual Cortex
Hierarchical Temporal (super)Voxel Fields (HTVF)
Press "a" and "d" to advance and go back through the algorithm.
HTVF - Occlusions - Without Voxel Raytracing
HTVF - Occlusions - With Voxel Raytracing 0
HTVF - Occlusions - With Voxel Raytracing 1
HTVF - Occlusions - With Voxel Raytracing 2
HTVF - Occlusions - With Voxel Raytracing 3
Occlusions - Just Occlusion Filling
Summary
We have presented a novel pipeline for creating spatio-temporal connections in point cloud video
Importantly, our method:
- Can handle occlusions - labels persist
- Does not make a priori assumptions about objects
- Handles rapid movement of people/cameras
- Provides stable temporal-supervoxels that can be used for learning
Other Contributions
- Oculus Vision GUI
- All algorithms have been released as Open Source
- 2D Tracking and Segmentation using Particle Filters
Oculus Vision GUI
Outlook and Future Work
Many opportunities exist now that we have low-level temporal connections
- Bootstrap learning - learn “objectness” from observations
- Higher levels in the hierarchy
- Sensor pose - improve performance with moving camera
- Object recognition - group parts into meaningful objects
- Occlusion reasoning - remove self occlusion, occluded movement
- Dynamic level of detail & attention
- Fewer samples on large uniform surfaces
- More samples on small irregular areas
Bootstrapping Visual Understanding...
iCub
So they never lose track of you.
iCub
Acknowledgements and Thanks
Thesis-related Publications
- Papon, J.; Wörgötter, F., Spatially Stratified Correspondence Sampling for Real-Time Point Cloud Tracking, Applications of Computer Vision (WACV), 2015 IEEE International Conference on, Jan. 2015.
- Stein, S.; Schoeler, M.; Papon, J.; Wörgötter, F., Object Partitioning using Local Convexity, Computer Vision and Pattern Recognition (CVPR) 2014, June 2014.
- Papon, J.; Kulvicius, T.; Aksoy, E.; Wörgötter, F., Point Cloud Video Object Segmentation using a Persistent Supervoxel World-Model, Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, Nov. 2013.
- Papon, J.; Abramov, A.; Schoeler, M.; Wörgötter, F., Voxel Cloud Connectivity Segmentation - Supervoxels for Point Clouds, Computer Vision and Pattern Recognition (CVPR) 2013, June 2013.
- Papon, J.; Abramov, A.; Wörgötter, F., Occlusion Handling in Video Segmentation via Predictive Feedback, European Conference on Computer Vision (ECCV) 2012, Workshops and Demonstrations, Oct. 2012.
- Papon, J.; Abramov, A.; Aksoy, E.; Wörgötter, F., A modular system architecture for online parallel vision pipelines, Applications of Computer Vision (WACV) 2012, Jan. 2012.
Dieter Hogrefe
Justus Piater
Florentin Wörgötter
Colleagues & Friends: Markus Schoeler, Alexey Abramov, Tomas Kulvicius, Mohammad Aein, Minija Tamosiunaite, Simon Stein, Simon Reich, Eren Aksoy, Christian Tetzlaff, Ursula Hahn-Wörgötter, Jan-Matthias Braun, Timo Luddecke, Timo Nachstedt, Alejandro Agostini, Michael Fauth, Xiaofeng Xiong, Sakya Dasgupta, Yinyun Li, Rajeeth Savarimuthu, Anders Buch, Sergey Alexandrov.
My loving and supportive parents - Jean-Marc and Marian.
HTVF - Camera Pan 2 - LCCP Overlay