Parsing Video Streams - Existing Methodologies – Examples of Supervoxels – Local Convexity Segmentation (LCCP)





jpapon.github.io

Disputation Slide Show

On Github jpapon / jpapon.github.io


Perceptual Segmentation of Visual Streams

by Tracking of Objects and Parts

Georg-August-Universität Göttingen Institut für Informatik Göttingen, 2014 Oct 17

Disputation for the award of the degree "Doctor of Philosophy"

How do we learn to perceive objects?

“Infants appear to perceive objects by analyzing three-dimensional surface arrangements and motions... [they] divide perceptual arrays into units that move as connected wholes, that move separately from one another, that tend to maintain their size and shape over motion, and that tend to act upon each other only on contact.” *

There are multiple interacting elements essential to development:

  • Coherent motion at multiple levels
  • Temporal continuity of size and shape
  • Interaction only on contact
* Spelke, Elizabeth S. "Principles of object perception." Cognitive science 14, no. 1 (1990): 29-56.

Temporal Connections without Objects

How can we create partitions when we don't know beforehand what an object is?

We have no difficulty tracking the pieces of objects when they split.

  • This implies maintenance of both low-level and object-level spatio-temporal tracking.

Parsing Video Streams - Existing Methodologies

Video Object Segmentation, e.g. Abramov et al., Grundmann et al.

This parses a video into spatio-temporal volumes - “objects”

The core assumption is that “objects” must form continuous spatio-temporal volumes!

Processed on VideoSegmentation.com

Abramov et al., Real-Time Segmentation of Stereo Videos on a Portable System With a Mobile GPU, IEEE Transactions on Circuits and Systems for Video Technology, 2012.
Grundmann et al., Efficient Hierarchical Graph Based Video Segmentation, Computer Vision and Pattern Recognition (CVPR), 2010.

Parsing Video Streams - Existing Methodologies

Processed on VideoSegmentation.com

Complete failure if this assumption is violated.

Parsing Video Streams - Existing Methodologies

Semantic Event Chains - Represent object-action relations by analyzing the creation & deletion of edges in the segment adjacency graph.

Analysis of temporal evolution of graph structure yields semantics

Maniac Dataset: Breakfast

This requires a-priori knowledge of objects!
Aksoy, Eren Erdal, et al. Learning the semantics of object–action relations by observation. The International Journal of Robotics Research (2011).
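
The edge creation and deletion idea behind Semantic Event Chains can be illustrated with a small sketch. This is not the implementation of Aksoy et al.; it only diffs the edge sets of a segment adjacency graph between consecutive frames. The segment labels and the helper names `edges` and `edge_events` are hypothetical and chosen for illustration.

```python
from typing import FrozenSet, List, Set, Tuple

Edge = FrozenSet[int]
Event = Tuple[int, str, Tuple[int, ...]]


def edges(*pairs: Tuple[int, int]) -> Set[Edge]:
    """Build an undirected edge set from (segment, segment) pairs."""
    return {frozenset(p) for p in pairs}


def edge_events(frames: List[Set[Edge]]) -> List[Event]:
    """List edges created or deleted between consecutive frames."""
    events: List[Event] = []
    for t in range(1, len(frames)):
        prev, curr = frames[t - 1], frames[t]
        for e in sorted(curr - prev, key=sorted):
            events.append((t, "created", tuple(sorted(e))))
        for e in sorted(prev - curr, key=sorted):
            events.append((t, "deleted", tuple(sorted(e))))
    return events


# Hypothetical segment labels: 1 = hand, 2 = knife, 3 = bread, 4 = table.
frames = [
    edges((3, 4)),                  # bread rests on the table
    edges((3, 4), (1, 2)),          # hand touches the knife
    edges((3, 4), (1, 2), (2, 3)),  # knife touches the bread
    edges((3, 4), (1, 2)),          # knife leaves the bread
]
events = edge_events(frames)
for event in events:
    print(event)
# (1, 'created', (1, 2))
# (2, 'created', (2, 3))
# (3, 'deleted', (2, 3))
```

Note that the sketch already assumes labeled segments per frame, which is the a-priori knowledge the slide points out.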

Overview of Methodology

A Point Cloud

Advantages of 3D

  • Avoids size/shape ambiguities of perspective transformation.
  • Can reason about occlusions at a low level.
  • Can use size and shape as a feature.

Building an Adjacency Graph

  • Special octree type developed which maintains adjacency information of voxels
  • This gives us back pixel-like (grid) relations, while keeping real 3D adjacency
  • Region growing and connectivity graph become very efficient
Octree Adjacency Structure - Leaves now link to their spatial neighbors.
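
As a rough illustration of voxel adjacency, the sketch below bins points into a regular grid and links each occupied voxel to its occupied neighbors in the 26-neighborhood. It uses a hash set in place of the octree described above, so it shows the adjacency idea only, not the octree structure itself. The names `voxelize` and `build_adjacency` and the resolution value are assumptions for this example.

```python
import math
from itertools import product
from typing import Dict, Iterable, List, Set, Tuple

Voxel = Tuple[int, int, int]
Point = Tuple[float, float, float]

# All 26 offsets to face-, edge- and corner-adjacent grid cells.
NEIGHBOR_OFFSETS = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]


def voxelize(points: Iterable[Point], resolution: float) -> Set[Voxel]:
    """Map metric points to the integer indices of the voxels containing them."""
    return {tuple(math.floor(c / resolution) for c in p) for p in points}


def build_adjacency(voxels: Set[Voxel]) -> Dict[Voxel, List[Voxel]]:
    """Link every occupied voxel to its occupied 26-neighbors."""
    adjacency: Dict[Voxel, List[Voxel]] = {v: [] for v in voxels}
    for (x, y, z) in voxels:
        for dx, dy, dz in NEIGHBOR_OFFSETS:
            neighbor = (x + dx, y + dy, z + dz)
            if neighbor in voxels:
                adjacency[(x, y, z)].append(neighbor)
    return adjacency


# Two nearby points and one far-away point, binned at 5 cm resolution.
cloud = [(0.01, 0.01, 0.01), (0.06, 0.01, 0.01), (0.52, 0.52, 0.52)]
voxels = voxelize(cloud, resolution=0.05)
adjacency = build_adjacency(voxels)
```

With adjacency stored per voxel, region growing only has to follow these links instead of searching for nearest neighbors.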

Voxel Cloud Connectivity Segmentation

  • VCCS is a region-growing oversegmentation technique that uses local geometry to respect object boundaries
  • Constrained to flow across voxel connections
  • Use color, normals, and a spatial smoothness constraint
Test Scene
Iterative Expansion of Supervoxels using VCCS
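
The following is a much-simplified sketch of seeded region growing that is constrained to flow across voxel connections. It is not the published VCCS algorithm: the distance is a plain weighted sum of color, spatial and normal terms, seed features are never re-estimated, and the weights, `seed_radius` and all function names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec = Tuple[float, float, float]


@dataclass(frozen=True)
class VoxelData:
    position: Vec
    color: Vec   # RGB in [0, 1]
    normal: Vec  # unit surface normal


def distance(seed: VoxelData, voxel: VoxelData, w_color: float = 0.2,
             w_spatial: float = 0.4, w_normal: float = 1.0,
             seed_radius: float = 1.0) -> float:
    """Weighted sum of color, spatial and normal differences (weights are assumptions)."""
    d_color = math.dist(seed.color, voxel.color)
    d_spatial = math.dist(seed.position, voxel.position) / seed_radius
    d_normal = 1.0 - abs(sum(a * b for a, b in zip(seed.normal, voxel.normal)))
    return w_color * d_color + w_spatial * d_spatial + w_normal * d_normal


def grow_supervoxels(data: Dict[int, VoxelData], adjacency: Dict[int, List[int]],
                     seeds: List[int], iterations: int = 10) -> Dict[int, int]:
    """Expand each seed one adjacency step per iteration; voxels take the closest seed."""
    labels = {s: label for label, s in enumerate(seeds)}
    best = {s: 0.0 for s in seeds}
    frontier = {label: [s] for label, s in enumerate(seeds)}
    for _ in range(iterations):
        next_frontier: Dict[int, List[int]] = {label: [] for label in frontier}
        for label, voxels in frontier.items():
            seed = data[seeds[label]]
            for v in voxels:
                if labels[v] != label:  # voxel was claimed by another supervoxel
                    continue
                for n in adjacency[v]:
                    d = distance(seed, data[n])
                    if d < best.get(n, math.inf):
                        best[n], labels[n] = d, label
                        next_frontier[label].append(n)
        frontier = next_frontier
        if not any(frontier.values()):
            break
    return labels


# Toy chain of six voxels: three red ones facing up, three blue ones facing sideways.
data = {i: VoxelData((float(i), 0.0, 0.0),
                     (1.0, 0.0, 0.0) if i < 3 else (0.0, 0.0, 1.0),
                     (0.0, 0.0, 1.0) if i < 3 else (1.0, 0.0, 0.0))
        for i in range(6)}
adjacency = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}
labels = grow_supervoxels(data, adjacency, seeds=[0, 5])
```

In this toy chain the boundary between the two supervoxels falls where color and normal change, not at the spatial midpoint alone, which is the behavior the bullets above describe.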

OSD Dataset Sergey Alexandrov

Papon et al., Voxel Cloud Connectivity Segmentation - Supervoxels for Point Clouds, Computer Vision and Pattern Recognition (CVPR) 2013.

Examples of Supervoxels

Example of Supervoxels with different seed sizes - from NYU Dataset

Performance of VCCS Compared to state of the art methods
Silberman et al., Indoor Segmentation and Support Inference from RGBD Images, European Conference on Computer Vision (ECCV) 2012.

Quantitative Comparison to SLIC

VCCS Supervoxels for increasing seed size.

Papon et al. CVPR 2013

SLIC Superpixels
Achanta et al., SLIC Superpixels Compared to State-of-the-art Superpixel Methods, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.

Speed and Performance vs State of the Art

Performance of VCCS Compared to state of the art methods
Speed of VCCS Compared to state of the art methods

Supervoxels in a Point Cloud

Local Convexity Segmentation (LCCP)

Use a local convexity criterion on adjacency graph edges to split graph.

Flow of segmentation: voxels to supervoxels to local convex patches.
Stein, S.; Schoeler, M.; Papon, J.; Wörgötter, F., Object Partitioning using Local Convexity, Computer Vision and Pattern Recognition (CVPR) 2014, June 2014.
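
A minimal sketch of a local convexity test on adjacency graph edges, followed by grouping along convex edges only. It assumes each supervoxel is summarized by a centroid and a unit normal, and it omits the additional checks described in the paper. The tolerance angle, the toy scene and the function names are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vec = Tuple[float, float, float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def is_convex(x1: Vec, n1: Vec, x2: Vec, n2: Vec, tolerance_deg: float = 10.0) -> bool:
    """Edge is convex if the normals open away from each other along the centroid line."""
    cos_angle = max(-1.0, min(1.0, dot(n1, n2)))
    if math.degrees(math.acos(cos_angle)) < tolerance_deg:
        return True  # nearly parallel normals: treat the flat connection as convex
    d = tuple(a - b for a, b in zip(x1, x2))
    length = math.sqrt(dot(d, d))
    if length == 0.0:
        return True
    return (dot(n1, d) - dot(n2, d)) / length > 0.0


def convex_segments(centroids: Sequence[Vec], normals: Sequence[Vec],
                    edges: Sequence[Tuple[int, int]]) -> List[List[int]]:
    """Merge supervoxels joined by convex edges (union-find); concave edges split the graph."""
    parent = list(range(len(centroids)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in edges:
        if is_convex(centroids[i], normals[i], centroids[j], normals[j]):
            parent[find(i)] = find(j)
    groups: Dict[int, List[int]] = {}
    for i in range(len(centroids)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy scene: two floor patches (0, 1), a wall (2), and a box top (3) and side (4).
centroids = [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 0.0, 0.5),
             (-0.5, 0.0, 1.0), (0.0, 0.0, 0.5)]
normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (-1.0, 0.0, 0.0),
           (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
graph_edges = [(0, 1), (1, 2), (3, 4), (0, 4)]
segments = convex_segments(centroids, normals, graph_edges)
print(segments)  # [[0, 1], [2], [3, 4]]
```

The floor-wall and floor-box connections are concave and get cut, while the box top and side stay together across their convex edge.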
LCCP Example Results
LCCP Comparison on OSD Dataset
LCCP Comparison on NYU Dataset

LCCP Segments in a Point Cloud

Can segment huge full 3D scenes efficiently.

Overview of Methodology

Sequential Clouds & Occlusion Reasoning

Occlusions appear as “shadows” in rendered point clouds.

For instance, here the lemon (which we want to keep track of) and much of the table is hidden by the bowl.

These blank areas limit our ability to have temporal continuity - object permanence.

Pointcloud without Occlusion Reasoning

Fortunately, we can perform some low-level reasoning about occlusions.

Sequentially Updated Octree

If we assume no camera motion, we can reason about why voxels “disappear”

Check for occlusion by ray-tracing paths from voxel to camera

Camera is facing us from this perspective - notice shadows extend towards the viewer.

Papon et al., Point Cloud Video Object Segmentation using a Persistent Supervoxel World-Model, Intelligent Robots and Systems (IROS) 2013.
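
The occlusion check can be sketched as follows, assuming a static camera and integer voxel indices. A voxel that is missing from the current frame is kept if another observed voxel lies on the line between it and the camera, and dropped otherwise. The ray is sampled at fixed steps instead of using an exact grid traversal, and `is_occluded`, `update_world` and the step size are hypothetical names and values.

```python
import math
from typing import Set, Tuple

Voxel = Tuple[int, int, int]
Vec = Tuple[float, float, float]


def is_occluded(voxel: Voxel, camera: Vec, occupied: Set[Voxel], step: float = 0.5) -> bool:
    """March from the voxel towards the camera and report any occupied cell on the way."""
    direction = [c - v for c, v in zip(camera, voxel)]
    length = math.sqrt(sum(d * d for d in direction))
    if length == 0.0:
        return False
    for i in range(1, int(length / step) + 1):
        t = i * step / length
        point = [v + t * d for v, d in zip(voxel, direction)]
        cell = tuple(math.floor(c + 0.5) for c in point)  # nearest voxel center
        if cell != voxel and cell in occupied:
            return True
    return False


def update_world(world: Set[Voxel], observed: Set[Voxel], camera: Vec) -> Set[Voxel]:
    """Keep observed voxels plus unobserved ones that are hidden behind something."""
    kept = set(observed)
    for voxel in world - observed:
        if is_occluded(voxel, camera, observed):
            kept.add(voxel)
    return kept


# Toy example in voxel units: the camera looks down the z axis from z = 10.
camera = (0.0, 0.0, 10.0)
world = {(0, 0, 0), (5, 0, 0)}  # lemon voxel and a table voxel from the last frame
observed = {(0, 0, 5)}          # only the bowl voxel is seen in the new frame
world = update_world(world, observed, camera)
```

Here the lemon voxel survives because the bowl blocks its line of sight, while the table voxel has a free line of sight and is removed.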

Demonstration of Occlusion Reasoning

Left frame shows input data without occlusion reasoning

Right shows the same input with ray-tracing checks

Pointcloud with Occlusion Reasoning

Overview of Methodology

Particle filter tracking in Point Clouds

Correspondence-Based Particle Filter approach is used.

Models used for tracking are point clouds, partitioned using supervoxels into strata for sampling.
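
To make the correspondence-based weighting concrete, here is a minimal sketch that is restricted to translation-only particles and leaves out prediction and resampling. Each particle is scored by the mean squared distance between the shifted model points and their nearest observed points. The Gaussian likelihood, `sigma` and all function names are assumptions, not the method used in the thesis.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def correspondence_error(model: Sequence[Vec], observed: Sequence[Vec], translation: Vec) -> float:
    """Mean squared distance from each shifted model point to its nearest observed point."""
    total = 0.0
    for m in model:
        moved = tuple(a + b for a, b in zip(m, translation))
        total += min(math.dist(moved, o) for o in observed) ** 2
    return total / len(model)


def particle_weights(model: Sequence[Vec], observed: Sequence[Vec],
                     particles: Sequence[Vec], sigma: float = 0.1) -> List[float]:
    """Turn correspondence errors into normalized weights with a Gaussian likelihood."""
    raw = [math.exp(-correspondence_error(model, observed, p) / (2.0 * sigma ** 2))
           for p in particles]
    total = sum(raw)
    if total == 0.0:
        return [1.0 / len(particles)] * len(particles)
    return [w / total for w in raw]


def weighted_mean(particles: Sequence[Vec], weights: Sequence[float]) -> Vec:
    """State estimate: weight-averaged particle translation."""
    return tuple(sum(w * p[k] for p, w in zip(particles, weights)) for k in range(3))


# The observed cloud is the model shifted by 0.2 along x.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
true_shift = (0.2, 0.0, 0.0)
observed = [tuple(a + b for a, b in zip(m, true_shift)) for m in model]
particles = [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0), (0.4, 0.0, 0.0)]
weights = particle_weights(model, observed, particles)
estimate = weighted_mean(particles, weights)
```

A full tracker would also handle rotation, propagate particles with a motion model and resample between frames.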

Stratified Correspondence Sampling

Supervoxels are used to choose spatial strata for uniform random sampling.
Papon et al., Spatially Stratified Correspondence Sampling for Real-Time Point Cloud Tracking, Applications of Computer Vision (WACV), 2015.
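
Stratified sampling over supervoxels can be sketched as below: points are grouped by supervoxel label, and the same number of points is drawn uniformly at random from each group. The function name and the cap for strata with too few points are assumptions for this example.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

Point = Tuple[float, float, float]


def stratified_sample(points: Sequence[Point], labels: Sequence[Hashable],
                      samples_per_stratum: int, rng: random.Random) -> Dict[Hashable, List[Point]]:
    """Draw up to `samples_per_stratum` points uniformly at random from each stratum."""
    strata: Dict[Hashable, List[Point]] = defaultdict(list)
    for point, label in zip(points, labels):
        strata[label].append(point)
    sample: Dict[Hashable, List[Point]] = {}
    for label, members in strata.items():
        k = min(samples_per_stratum, len(members))  # small strata contribute all they have
        sample[label] = rng.sample(members, k)
    return sample


# A large supervoxel (label 0) with 100 points and a small one (label 1) with 5 points.
points = [(float(i), 0.0, 0.0) for i in range(100)] + [(0.0, float(i), 1.0) for i in range(5)]
labels = [0] * 100 + [1] * 5
sample = stratified_sample(points, labels, samples_per_stratum=3, rng=random.Random(0))
```

Plain uniform sampling over all 105 points would rarely pick anything from the small supervoxel, whereas here both strata contribute three points each.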

Results in Real Application

IntellAct Project

Results on Synthetic Benchmark

Choi and Christensen, Object Tracking: A Particle Filter Approach on GPU, International Conference on Intelligent Robots and Systems (IROS), 2013.

Results on VR Data

Plot of Displacement Error vs time per frame (ms) averaged across 50 VR Test Runs for different numbers of particles and samples per stratum.

Results on VR Data

Plot of Rotational Error vs time per frame (ms) averaged across 50 VR Test Runs for different numbers of particles and samples per stratum.

Tracking Low Level Patches - Why Temporal Supervoxels?

Tracking low level patches would let us make temporal connections without needing to specify a-priori objects.

Objects that split are problematic if we segment and track using a-priori models. How do we label the pieces?

We have our low level patch representation - Supervoxels.

We have an efficient tracking method.

So, what's the problem?

Why can't we just track Supervoxels?

We cannot track exclusively at the low level because of the “aperture problem”: the motion of a small patch viewed in isolation is ambiguous.

MIT Perceptual Science Group

Cortical Feedback Mechanisms

Humans appear to use top-down feedback mechanisms

Feedback allows high-level areas to influence low-level vision, even receptive fields

Feed-forward and Feedback Mechanisms in the Human Visual Cortex
Gilbert and Wu Li. Top-down influences on visual processing, Nature Reviews Neuroscience, 2013.

Hierarchical Temporal (super)Voxel Fields (HTVF)


HTVF - Cutting Video 0

HTVF - Cutting Video 1

HTVF - Occlusions - Without Voxel Raytracing

HTVF - Occlusions - With Voxel Raytracing 0

HTVF - Occlusions - With Voxel Raytracing 1

HTVF - Occlusions - With Voxel Raytracing 2

HTVF - Occlusions - With Voxel Raytracing 3

Occlusions - Just Occlusion Filling

Summary

We have presented a novel pipeline for creating spatio-temporal connections in point cloud video

Importantly, our method:

  • Can handle occlusions - labels persist
  • Does not make a-priori assumptions about objects
  • Handles rapid movement of people/cameras
  • Provides stable temporal-supervoxels that can be used for learning

Other Contributions

  • Oculus vision GUI
  • All algorithms have been released as Open Source
  • 2D Tracking and Segmentation using Particle Filters
Oculus Vision GUI

Outlook and Future Work

Many opportunities exist now that we have low-level temporal connections

  • Bootstrap learning - learn “objectness” from observations
  • Higher levels in the hierarchy
    • Sensor pose - improve performance with moving camera
    • Object recognition - group parts into meaningful objects
    • Occlusion reasoning - remove self occlusion, occluded movement
  • Dynamic level of detail & attention
    • Fewer samples on large uniform surfaces
    • More samples on small irregular areas
Bootstrapping Visual Understanding...

iCub

So they never lose track of you.

iCub

Acknowledgements and Thanks

Thesis-related Publications

Dieter Hogrefe
Justus Piater
Florentin Wörgötter

Colleagues & Friends: Markus Schoeler, Alexey Abramov, Tomas Kulvicius, Mohammad Aein, Minija Tamosiunaite, Simon Stein, Simon Reich, Eren Aksoy, Christian Tetzlaff, Ursula Hahn-Wörgötter, Jan-Matthias Braun, Timo Luddecke, Timo Nachstedt, Alejandro Agostini, Michael Fauth, Xiaofeng Xiong, Sakya Dasgupta, Yinyun Li, Rajeeth Savarimuthu, Anders Buch, Sergey Alexandrov.

My loving and supportive parents - Jean-Marc and Marian.

Questions?

XKCD

HTVF - Camera Pan 0

HTVF - Camera Pan 2 - LCCP Overlay

HTVF - Camera Pan 1