Press F11 to view this in full screen.
Press Left/Right to advance through the presentation.
Make sure to click on the play button for Point Clouds!
Don't miss the vertical slides - you'll see up/down arrows on the bottom right!
You can press the "esc" key to go to a slide overview.
Perceptual Segmentation of Visual Streams
by Tracking of Objects and Parts
Georg-August-Universität Göttingen
Institut für Informatik
Göttingen, 17 October 2014
Disputation for the award of the degree "Doctor of Philosophy"
How do we learn to perceive objects?
“Infants appear to perceive objects by analyzing three-dimensional surface arrangements and motions... [they] divide perceptual arrays into units that move as connected wholes, that move separately from one another, that tend to maintain their size and shape over motion, and that tend to act upon each other only on contact.” *
There are multiple interacting elements essential to development:
- Coherent motion at multiple levels
- Temporal continuity of size and shape
- Objects act on each other only on contact
* Spelke, Elizabeth S. "Principles of object perception." Cognitive science 14, no. 1 (1990): 29-56.
Temporal Connections without Objects
How can we create partitions when we don't know what an object is beforehand?
We have no difficulty tracking the pieces of objects when they split.
- This implies maintenance of both low-level and object-level spatio-temporal tracking.
Parsing Video Streams - Existing Methodologies
Video Object Segmentation
e.g. Abramov et al.
Grundmann et al.
This parses a video into spatio-temporal volumes - “objects”
The core assumption is that “objects” must form continuous spatio-temporal volumes!
Processed on VideoSegmentation.com
Parsing Video Streams - Existing Methodologies
Semantic Event Chains - represent actions by analyzing the creation and deletion of edges in the segment adjacency graph.
Analysis of temporal evolution of graph structure yields semantics
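As a toy illustration of the event-chain idea (not code from the cited work, and with invented segment labels), tracking the graph's temporal evolution amounts to recording which adjacency edges appear or vanish between consecutive frames:

```python
# Toy sketch: each frame is a set of segment-adjacency edges; an "event"
# is the set of edges created or deleted between consecutive frames.
# Segment labels (H = hand, C = cup, T = table) are illustrative only.

def adjacency_events(frames):
    """frames: list of edge sets, one segment-adjacency graph per frame.
    Returns one (created, deleted) pair of edge sets per transition."""
    events = []
    for prev, curr in zip(frames, frames[1:]):
        created = curr - prev   # edges that appear: segments come into contact
        deleted = prev - curr   # edges that vanish: segments separate
        events.append((created, deleted))
    return events

# A hand touches a cup on the table, then releases it.
frames = [
    {frozenset({"H", "T"}), frozenset({"C", "T"})},   # apart
    {frozenset({"H", "T"}), frozenset({"C", "T"}),
     frozenset({"H", "C"})},                          # touching
    {frozenset({"H", "T"}), frozenset({"C", "T"})},   # released
]
events = adjacency_events(frames)
```

The sequence of such (created, deleted) pairs is what carries the action semantics.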
Maniac Dataset: Breakfast
This requires a priori knowledge of objects!
A Point Cloud
Advantages of 3D
- Avoids size/shape ambiguities of perspective transformation.
- Can reason about occlusions at a low level.
- Can use size and shape as a feature.
Building an Adjacency Graph
- A special octree type was developed that maintains voxel adjacency information
- This gives us back pixel-like (grid) relations, while keeping real 3D adjacency
- Region growing and connectivity graph become very efficient
Octree Adjacency Structure - Leaves now link to their spatial neighbors.
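The thesis implements this as a specialized octree; as a minimal stand-in for the idea, voxels can be keyed by integer grid coordinates so that each occupied voxel enumerates its occupied 26-connected neighbors with constant-time lookups, recovering pixel-like grid relations in true 3D:

```python
# Toy sketch of voxel adjacency (a hash map, not the thesis octree).
from itertools import product

def voxelize(points, resolution):
    """Map 3D points to occupied voxel keys (integer grid coordinates)."""
    return {tuple(int(c // resolution) for c in p) for p in points}

def neighbors(voxels, key):
    """Occupied 26-connected neighbors of an occupied voxel."""
    x, y, z = key
    return [(x + dx, y + dy, z + dz)
            for dx, dy, dz in product((-1, 0, 1), repeat=3)
            if (dx, dy, dz) != (0, 0, 0)
            and (x + dx, y + dy, z + dz) in voxels]

points = [(0.2, 0.2, 0.2), (1.3, 0.1, 0.4), (5.2, 0.0, 0.0)]
voxels = voxelize(points, resolution=1.0)
# The first two points fall in adjacent voxels; the third is isolated.
```

With adjacency available per leaf, region growing and connectivity-graph construction reduce to cheap neighbor queries.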
Voxel Cloud Connectivity Segmentation
- VCCS is a region-growing oversegmentation technique that uses local geometry to respect object boundaries
- Growth is constrained to flow across voxel connections
- Uses color, normals, and a spatial smoothness constraint
Test Scene
Iterative Expansion of Supervoxels using VCCS
Examples of Supervoxels
Example of Supervoxels with different seed sizes - from NYU Dataset
Performance of VCCS Compared to State-of-the-Art Methods
Quantitative Comparison to SLIC
SLIC Superpixels
Speed and Performance vs State of the Art
Performance of VCCS Compared to State-of-the-Art Methods
Speed of VCCS Compared to State-of-the-Art Methods
Supervoxels in a Point Cloud
Local Convexity Segmentation (LCCP)
Use a local convexity criterion on adjacency graph edges to split the graph.
Flow of segmentation: voxels to supervoxels to local convex patches.
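The convexity test itself is geometric: a connection between two patches is convex when their normals "open away" from each other along the line joining their centroids. A simplified sketch (the published criterion additionally uses an angle threshold and a sanity criterion, omitted here):

```python
# Simplified local convexity check between two adjacent surface patches,
# each given by a centroid x and a unit normal n.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def is_convex(x1, n1, x2, n2, eps=0.0):
    """Edge is convex when (n1 - n2) . (x1 - x2) >= -eps."""
    d = [a - b for a, b in zip(x1, x2)]
    dn = [a - b for a, b in zip(n1, n2)]
    return dot(dn, d) >= -eps

# Patches on the outside of a sphere (convex) vs. inside a bowl (concave):
convex = is_convex((1, 0, 0), (1, 0, 0), (0, 1, 0), (0, 1, 0))
concave = not is_convex((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0))
```

Segments then emerge as connected components of the adjacency graph after cutting all non-convex edges.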
LCCP Comparison on OSD Dataset
LCCP Comparison on NYU Dataset
LCCP Segments in a Point Cloud
Can segment huge full 3D scenes efficiently.
Sequential Clouds & Occlusion Reasoning
Occlusions appear as “shadows” in rendered point clouds.
For instance, here the lemon (which we want to keep track of) and much of the table is hidden by the bowl.
These blank areas limit our ability to maintain temporal continuity - object permanence.
Pointcloud without Occlusion Reasoning
Fortunately, we can perform some low-level reasoning about occlusions.
Sequentially Updated Octree
If we assume no camera motion, we can reason about why voxels “disappear”
Check for occlusion by ray-tracing paths from voxel to camera
Camera is facing us from this perspective - notice shadows extend towards the viewer.
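A simplified sketch of such a check (a coarse voxel ray march, not the thesis code): a voxel that vanished from the current frame is flagged as occluded if the straight path from it to the camera passes through an occupied voxel, and as truly gone otherwise.

```python
# Toy occlusion test: march from a vanished voxel toward the camera in
# half-voxel steps and look for an occupied voxel along the way.
import math

def is_occluded(voxel, camera, occupied, resolution):
    own_key = tuple(int(c // resolution) for c in voxel)
    direction = [c - v for v, c in zip(voxel, camera)]
    length = math.sqrt(sum(d * d for d in direction))
    steps = max(1, int(length / (resolution / 2.0)))
    for i in range(1, steps):
        t = i / steps
        p = [v + t * d for v, d in zip(voxel, direction)]
        key = tuple(int(c // resolution) for c in p)
        if key != own_key and key in occupied:
            return True
    return False

# Camera at the origin; a "wall" voxel at depth ~1 hides a voxel at
# depth ~2 that lies behind it, while an off-axis voxel stays visible.
occupied = {(0, 0, 10)}                      # occupied keys at resolution 0.1
camera = (0.0, 0.0, 0.0)
hidden = is_occluded((0.05, 0.05, 2.05), camera, occupied, resolution=0.1)
clear = is_occluded((1.05, 0.05, 2.05), camera, occupied, resolution=0.1)
```

Occluded voxels can then be kept in the world model instead of being deleted, which is what provides the object permanence discussed above.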
Demonstration of Occlusion Reasoning
Left frame shows input data without occlusion reasoning
Right shows the same input with ray-tracing checks
Pointcloud with Occlusion Reasoning
Particle filter tracking in Point Clouds
A correspondence-based particle filter approach is used.
Models used for tracking are point clouds, partitioned using supervoxels into strata for sampling.
Stratified Correspondence Sampling
Supervoxels are used to choose spatial strata for uniform random sampling.
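The sampling step can be sketched as follows (a rough illustration with invented names, not the thesis implementation): rather than drawing correspondences uniformly over the whole model cloud, a fixed number is drawn from every supervoxel stratum, so small but distinctive parts are never starved of samples.

```python
# Toy stratified sampling: every supervoxel stratum contributes the same
# number of sampled points, regardless of its size.
import random

def stratified_sample(labeled_points, samples_per_stratum, rng):
    """labeled_points: dict mapping supervoxel label -> list of points."""
    sample = []
    for label, points in labeled_points.items():
        k = min(samples_per_stratum, len(points))
        sample.extend(rng.sample(points, k))
    return sample

rng = random.Random(0)
strata = {"handle": [(0, 0, 0), (0, 0, 1)],
          "body": [(1, 0, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1)]}
picked = stratified_sample(strata, samples_per_stratum=2, rng=rng)
```

Here the small "handle" stratum contributes as many samples as the much larger "body", which is the point of stratifying.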
Results on Synthetic Benchmark
Results on VR Data
Plot of Displacement Error vs time per frame (ms) averaged across 50 VR Test Runs for different numbers of particles and samples per stratum.
Results on VR Data
Plot of Rotational Error vs time per frame (ms) averaged across 50 VR Test Runs for different numbers of particles and samples per stratum.
Tracking Low Level Patches - Why Temporal Supervoxels?
Tracking low-level patches would let us make temporal connections without needing to specify objects a priori.
Objects that split are problematic if we segment and track using a priori models. How do we label the pieces?
We have our low level patch representation - Supervoxels.
We have an efficient tracking method.
So, what's the problem?
Cortical Feedback Mechanisms
Humans appear to use top-down feedback mechanisms
Feedback allows high-level areas to influence low-level vision, even receptive fields
Feed-forward and Feedback Mechanisms in the Human Visual Cortex
Hierarchical Temporal (super)Voxel Fields (HTVF)
Press "a" and "d" to advance and go back through the algorithm.
HTVF - Occlusions - Without Voxel Raytracing
HTVF - Occlusions - With Voxel Raytracing 0
HTVF - Occlusions - With Voxel Raytracing 1
HTVF - Occlusions - With Voxel Raytracing 2
HTVF - Occlusions - With Voxel Raytracing 3
Occlusions - Just Occlusion Filling
Summary
We have presented a novel pipeline for creating spatio-temporal connections in point cloud video
Importantly, our method:
- Can handle occlusions - labels persist
- Does not make a priori assumptions about objects
- Handles rapid movement of people/cameras
- Provides stable temporal-supervoxels that can be used for learning
Other Contributions
- Oculus Vision GUI
- All algorithms have been released as Open Source
- 2D Tracking and Segmentation using Particle Filters
Oculus Vision GUI
Outlook and Future Work
Many opportunities exist now that we have low-level temporal connections
- Bootstrap learning - learn “objectness” from observations
- Higher levels in the hierarchy
- Sensor pose - improve performance with moving camera
- Object recognition - group parts into meaningful objects
- Occlusion reasoning - remove self occlusion, occluded movement
- Dynamic level of detail & attention
- Fewer samples on large uniform surfaces
- More samples on small irregular areas
Bootstrapping Visual Understanding...
iCub
So they never lose track of you.
iCub
Acknowledgements and Thanks
Thesis-related Publications
- Papon, J.; Wörgötter, F., Spatially Stratified Correspondence Sampling for Real-Time Point Cloud Tracking, Applications of Computer Vision (WACV), 2015 IEEE International Conference on, Jan. 2015.
- Stein, S.; Schoeler, M.; Papon, J.; Wörgötter, F., Object Partitioning using Local Convexity, Computer Vision and Pattern Recognition (CVPR) 2014, June 2014.
- Papon, J.; Kulvicius, T.; Aksoy, E.; Wörgötter, F., Point Cloud Video Object Segmentation using a Persistent Supervoxel World-Model, Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, Nov. 2013.
- Papon, J.; Abramov, A.; Schoeler, M.; Wörgötter, F., Voxel Cloud Connectivity Segmentation - Supervoxels for Point Clouds, Computer Vision and Pattern Recognition (CVPR) 2013, June 2013.
- Papon, J.; Abramov, A.; Wörgötter, F., Occlusion Handling in Video Segmentation via Predictive Feedback, European Conference on Computer Vision (ECCV) 2012, Workshops and Demonstrations, Oct. 2012.
- Papon, J.; Abramov, A.; Aksoy, E.; Wörgötter, F., A modular system architecture for online parallel vision pipelines, Applications of Computer Vision (WACV) 2012, Jan. 2012.
Dieter Hogrefe
Justus Piater
Florentin Wörgötter
Colleagues & Friends: Markus Schoeler, Alexey Abramov, Tomas Kulvicius, Mohammad Aein, Minija Tamosiunaite, Simon Stein, Simon Reich, Eren Aksoy, Christian Tetzlaff, Ursula Hahn-Wörgötter, Jan-Matthias Braun, Timo Luddecke, Timo Nachstedt, Alejandro Agostini, Michael Fauth, Xiaofeng Xiong, Sakya Dasgupta, Yinyun Li, Rajeeth Savarimuthu, Anders Buch, Sergey Alexandrov.
My loving and supportive parents - Jean-Marc and Marian.
HTVF - Camera Pan 2 - LCCP Overlay