Human Motion Tracking by Temporal-Spatial Local Gaussian Process Experts

Human pose estimation via motion tracking can be treated as a regression problem within a discriminative framework. Modeling the mapping from observation space to state space is challenging because the conditional distribution is multimodal and high-dimensional. To build the mapping, existing techniques usually rely on a large set of training samples in the learning process, yet they remain limited in their ability to deal with multimodality. In this work we propose a novel online sparse Gaussian Process (GP) regression model to recover 3-D human motion from monocular videos.

In particular, we exploit the fact that, for a given test input, the output is mainly determined by the training samples residing in its local neighborhood, where the neighborhood is defined in the unified input-output space. This leads to a mixture of local GP experts, each of which governs the mapping in a local region with a covariance function adapted to that region. To handle multimodality, we combine temporal and spatial information, obtaining two categories of local experts.
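The idea of a single local expert can be illustrated with a minimal sketch. It is not the model proposed here: it assumes a zero-mean GP with a fixed squared-exponential covariance, picks the neighborhood by distance in input space only (rather than in the unified input-output space), and does not adapt hyperparameters per region. The names `rbf`, `solve`, and `local_gp_predict`, and the parameters `k`, `length_scale`, and `noise`, are hypothetical.

```python
import math

def rbf(a, b, length_scale=1.0):
    """Squared-exponential covariance between two vectors."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq / length_scale ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def local_gp_predict(X, y, x_star, k=5, length_scale=1.0, noise=1e-3):
    """GP predictive mean at x_star using only the k nearest training inputs."""
    # Assumption: the neighborhood is chosen by Euclidean distance in input space.
    idx = sorted(range(len(X)), key=lambda i: math.dist(X[i], x_star))[:k]
    Xn = [X[i] for i in idx]
    yn = [y[i] for i in idx]
    # Small k-by-k covariance matrix with a noise term on the diagonal.
    K = [[rbf(a, b, length_scale) + (noise if i == j else 0.0)
          for j, b in enumerate(Xn)] for i, a in enumerate(Xn)]
    alpha = solve(K, yn)
    k_star = [rbf(a, x_star, length_scale) for a in Xn]
    return sum(ks * al for ks, al in zip(k_star, alpha))
```

Because only a k-by-k system is solved per test input, the cost of a prediction depends on the neighborhood size rather than on the size of the full training set, which is the source of the efficiency claimed for local experts.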

The temporal and spatial experts are integrated into a seamless hybrid system that is self-initialized and robust for visual tracking of nonlinear human motion. Learning and inference are efficient because all the local experts are defined online within very small neighborhoods. Extensive experiments on two real-world databases, HumanEva and PEAR, demonstrate the effectiveness of the proposed model, which significantly improves on the performance of existing models.

Existing System:

Only the spatial correlation of the pixels inside a single 2-D block is considered; the correlation with pixels of neighboring blocks is neglected. It is impossible to completely de-correlate the blocks at their boundaries using the DCT. Because the input image needs to be "blocked," correlation across the block boundaries is not eliminated, and the reconstructed images or video frames show noticeable and annoying "blocking artifacts," particularly at high compression ratios or very low bit rates.

Vision-based human motion tracking has been a fundamental open problem with pervasive real-world applications [1], such as surveillance, rehabilitation, diagnostics, and human-computer interaction. Among the large number of studies in this field, the discriminative approach [2] has been prevalent because it allows fast inference in real-world scenarios and adapts flexibly to different learning methods.

Because of the intrinsic visual-to-pose ambiguity, however, all discriminative approaches share the same difficulty: effectively modeling multimodal conditional distributions from small training sets in a high-dimensional space.

Proposed System:

We propose a novel mixture of local GP experts model in this work, which incorporates both temporal and spatial information. In theory, spatial information alone is insufficient to handle multimodality, since monocular human motion estimation is itself ill-posed. Introducing temporal information into the model is therefore necessary. However, existing discriminative methods lack a temporal estimation framework.
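A toy example may help show why spatial information alone cannot resolve the ambiguity. The sketch below is only an illustration under stated assumptions: the data are synthetic (each observation is paired with two opposite poses), the spatial neighborhood is taken in observation space, and the temporal neighborhood is taken around the previous pose estimate. The function names and this particular way of selecting temporal neighbors are hypothetical, not taken from the proposed model.

```python
import math

def nearest(points, query, k):
    """Indices of the k points closest to query (Euclidean distance)."""
    return sorted(range(len(points)),
                  key=lambda i: math.dist(points[i], query))[:k]

# Synthetic ambiguous data: every observation x has two valid poses, +x and -x.
observations, poses = [], []
for i in range(1, 11):
    x = i / 10.0
    for sign in (1.0, -1.0):
        observations.append((x,))
        poses.append((sign * x,))

def spatial_neighbors(obs, k=4):
    """Neighborhood chosen from the current observation only."""
    return nearest(observations, obs, k)

def temporal_neighbors(prev_pose, k=4):
    """Neighborhood chosen around the pose estimated at the previous frame."""
    return nearest(poses, prev_pose, k)

def mean_pose(indices):
    """Average pose over a neighborhood (a stand-in for an expert's prediction)."""
    return sum(poses[i][0] for i in indices) / len(indices)

spatial = spatial_neighbors((0.5,))
temporal = temporal_neighbors((0.45,))
```

Here the spatial neighborhood contains poses of both signs, so its average collapses toward zero, a pose that matches neither mode. The temporal neighborhood stays on the branch of the previous estimate.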

One exception is a parametric model in which temporal smoothness constraints are added to the BME model. It is also worth noting that the Gaussian Process Dynamical Model (GPDM) has been used to model the dynamics of human motion. Because the original GPDM is designed to find a low-dimensional latent space with associated dynamics, it is used to capture motion priors in the latent state space.

Modules:

  • Video source selection
  • Analyzing Video
  • Extracting Frames
  • Track the Objects
  • Reconstruct the Frames with motion identifiers
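
The flow through the modules above can be sketched in a few lines. This is a simplified stand-in, not the GP-based tracker: frames are plain 2-D lists of grayscale intensities, the first frame is assumed to be a static background, the tracked object is located by background subtraction, and the "motion identifier" is just a marker value written into a copy of the frame. All function names and the `threshold` and `marker` parameters are hypothetical, and video decoding is replaced by synthetic frames.

```python
def foreground(background, frame, threshold=30):
    """Pixels whose intensity differs from the background by more than threshold."""
    return [(r, c)
            for r, (bg_row, fr_row) in enumerate(zip(background, frame))
            for c, (b, f) in enumerate(zip(bg_row, fr_row))
            if abs(f - b) > threshold]

def centroid(pixels):
    """Mean (row, col) of a pixel list, or None when nothing moved."""
    if not pixels:
        return None
    n = len(pixels)
    return (sum(r for r, _ in pixels) / n, sum(c for _, c in pixels) / n)

def annotate(frame, position, marker=-1):
    """Return a copy of the frame with a marker written at the tracked position."""
    out = [row[:] for row in frame]
    if position is not None:
        out[round(position[0])][round(position[1])] = marker
    return out

def track(frames, threshold=30):
    """Assume frames[0] is the background; return (positions, annotated frames)."""
    background = frames[0]
    positions, annotated = [], []
    for frame in frames[1:]:
        pos = centroid(foreground(background, frame, threshold))
        positions.append(pos)
        annotated.append(annotate(frame, pos))
    return positions, annotated

def synthetic_frames(size=5, steps=4):
    """Stand-in for frame extraction: one bright pixel moving along row 2."""
    frames = [[[0] * size for _ in range(size)]]
    for t in range(1, steps + 1):
        frame = [[0] * size for _ in range(size)]
        frame[2][t] = 255
        frames.append(frame)
    return frames

positions, annotated = track(synthetic_frames())
```

In a real application the synthetic frames would be replaced by frames decoded from the selected video source, and the centroid step by the pose regression described earlier.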

Tools Used:

Front End : C#.NET