Title


We present a discriminative, example-based approach to pose recovery that uses histograms of oriented gradients as image descriptors. Such an approach has the potential to recover human poses in real time without knowledge of the subject, and with automatic (re)initialization. In this talk, we discuss our experiments on the HumanEva data sets. The results provide insight into the strengths and limitations of an example-based approach. Results are reported for monocular and multi-view settings, with both unseen and seen persons and actions. We also discuss ways to improve performance in terms of both accuracy and computational cost.
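The matching step of an example-based approach can be sketched as a nearest-neighbour lookup. The snippet below is a minimal illustration, not the method from the talk. It assumes each training image has already been reduced to a fixed-length descriptor (such as a histogram of oriented gradients) paired with a pose vector, and it estimates the pose of a query as the distance-weighted mean of the `k` closest examples. The function name `recover_pose`, the value of `k`, and the weighting scheme are assumptions made for illustration.

```python
import numpy as np


def recover_pose(query, example_descriptors, example_poses, k=3):
    """Estimate a pose as the distance-weighted mean of the k nearest examples.

    query:               (D,) descriptor of the input image (hypothetical input)
    example_descriptors: (N, D) descriptors of the stored examples
    example_poses:       (N, P) pose vectors paired with the examples
    """
    # Euclidean distance from the query to every stored example.
    dists = np.linalg.norm(example_descriptors - query, axis=1)
    # Indices of the k closest examples.
    nearest = np.argsort(dists)[:k]
    # Closer examples get larger weights; the epsilon avoids division by zero.
    weights = 1.0 / (dists[nearest] + 1e-9)
    weights /= weights.sum()
    return weights @ example_poses[nearest]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    descriptors = rng.random((50, 36))  # stand-ins for real image descriptors
    poses = rng.random((50, 10))        # stand-ins for joint parameters
    print(recover_pose(descriptors[7] + 0.01, descriptors, poses))
```

Because every query is matched independently against the stored examples, a lookup of this kind needs no pose from the previous frame, which is what makes automatic (re)initialization possible.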

Links:
Related publication
Workshop presentation
Research page

We propose to create an immersive interactive display system that can be projected on arbitrary surfaces using simple commodity hardware such as a projector, a video camera and mirrors. Cameras are used to determine the geometric and photometric properties of the projected displays, and they provide an avenue for user/system interaction. Our first application area is education, where inexpensive commodity hardware can be used to enrich the classroom experience.


Human posture estimation is a research topic attracting increasing interest. It has potentially large impact, with applications in home care, gaming and clinical settings. In this presentation, existing human posture estimation techniques are first reviewed. Because posture estimation balances image observations against a human model, we discuss the concerns on both sides as well as the "matching bridge" between them. Because single views are inherently ambiguous, recent work has shown growing interest in using multiple views. However, access to multiple views opens up new challenges, including how to select information so as to reduce redundancy and remove ambiguity, and how to communicate between cameras. In a multi-camera network, additional constraints are imposed. The presenter's recent work is briefly discussed as an illustrative example.

Links:
Related publication
Research page


We design a robust people detection module based on a probabilistic combination of fast human body part detectors. The representation is robust to partial occlusions, part detector false alarms and missed detections of body parts. Furthermore, we show how to use the fact that people walk on a known floor plane to detect them more reliably and efficiently. Finally, we show how our framework can be used to combine information from different sensors. This work is part of our research within the European project COGNIRON. The project's objective is to study the perceptual, representational, reasoning and learning capabilities of embodied robots in human-centered environments.

Links:
Related publications: (1) (2)
Related videos: (1) (2)
multisensor robot dataset and some matlab tools
Research page


We present a system for estimating 3D human upper body pose from multiple cameras. The system involves two processing stages. In the hypothesis generation stage, candidate poses are generated based on an analysis of the individual camera views. For this, a 3D human model is first synthesized in all possible poses off-line. The large number of resulting 2D exemplars is subsequently pruned efficiently on-line by a hierarchical shape-matching approach. In the hypothesis verification stage, candidate 3D poses are re-projected to the other camera views and ranked according to a multi-view matching score. The current system is constantly on the lookout for clearly distinguishable human poses, which it then uses to incrementally enrich its 3D shape model with texture information. In experiments with complex real-world data from a station hall, we demonstrate the robustness of this multi-stage approach.


Compared to traditional mono-view systems, stereo or, more generally, multi-view systems provide additional information about a captured scene, which can significantly facilitate content extraction. This property makes them very useful for many emerging applications, such as 3D TV and video surveillance. However, the use of such systems has been limited so far by the processing time and bandwidth that multi-view data require. These major drawbacks can only be relieved by the development of dedicated algorithms. In this presentation, we describe an efficient, flexible and content-aware coding method for a multi-view video system. The framework consists of a central processor and camera, complemented by a flexible number of smart Wyner-Ziv cameras. The latter provide a content-aware representation of their viewpoint, thus greatly reducing the amount of data to be sent to the central processor. By employing Distributed Video (DV) coding, i.e. joint decoding of the independently encoded frames of the different cameras, we achieve good coding efficiency without inter-camera communication.

Links:
Related publications: (1) (2)
Research page -- Linda Tessens
Research page -- Marleen Morbee


In this presentation we show some of the recent advances TNO has made in panoramic video processing and 3-D scene reconstruction. Omnivideo is a video capture technique that records omni-directional video by stitching 6 camera views together to get a complete 360° view of the surroundings. When mounted on a moving platform, it can record and create 3-D tours of the world that can be played back interactively like a 3-D computer game. Omnivideo has many applications: interactive exploration of museums, crime scene investigation, and reconnaissance and debriefing for military missions. We will show some of our indoor and outdoor recordings and their processing.

Currently, there are many applications that require accurate three-dimensional computer models of real-world scenes. Example applications can be found in crime scene investigation, engineering, construction work, and the entertainment industry. Often, the creation of realistic 3-D computer models for these applications involves costly and tedious manual labour. In this presentation we show our results on the automatic creation of 3-D computer models of real-world scenes. The idea is that a hand-held stereo camera is used to capture images of a scene from different viewpoints. A scene model can then be built automatically by merging the resulting 3-D stereo measurements based on the estimated camera ego-motion.
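The merging step can be pictured as moving every per-view point cloud into one common world frame. The sketch below is only an illustration under stated assumptions: it takes the ego-motion estimate for each view as a given camera-to-world rotation `R` and translation `t`, and the name `merge_point_clouds` is hypothetical. It does not show how the ego-motion itself is estimated, nor any filtering or fusion of overlapping measurements.

```python
import numpy as np


def merge_point_clouds(clouds, poses):
    """Merge per-view stereo point clouds into a single world-frame cloud.

    clouds: list of (N_i, 3) arrays, points in each camera's own coordinates
    poses:  list of (R, t) pairs, the assumed camera-to-world rotation (3, 3)
            and translation (3,) for each view
    """
    # Each row p (camera frame) maps to R @ p + t (world frame).
    merged = [points @ R.T + t for points, (R, t) in zip(clouds, poses)]
    return np.vstack(merged)


if __name__ == "__main__":
    identity = (np.eye(3), np.zeros(3))
    # Second view: rotated 90 degrees about the z-axis and shifted along x.
    R = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    t = np.array([1.0, 0.0, 0.0])
    world_point = np.array([[1.0, 2.0, 3.0]])
    seen_from_second = (world_point - t) @ R  # same point in the second camera frame
    print(merge_point_clouds([world_point, seen_from_second], [identity, (R, t)]))
```

In this toy case both rows of the result coincide, because the two views observed the same physical point.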


Visual surveillance in wide areas (e.g. airports) relies on cameras that observe non-overlapping scenes. Multi-person tracking requires re-identifying a person who leaves one field of view and later appears in another. For this, we use appearance cues and assume that all observations of a single person are Gaussian distributed. The observation/appearance model in our approach is therefore a Mixture of Gaussians (MoG). The Multi-Observations Newscast EM (MO-NEM) algorithm allows us to learn this MoG in a distributed way, where each camera learns both from its own observations and from communication with other cameras. MO-NEM relies on a gossip-based protocol to estimate the mean of a set of distributed values.

The presented algorithm is tested on artificially generated data and on a collection of real-world observations gathered by a system of cameras in an office building.
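The gossip idea of estimating a mean without a central node can be illustrated with plain pairwise averaging. The sketch below is not MO-NEM itself and is not taken from the related publication. It is a generic, simulated gossip-averaging loop in which every node repeatedly contacts a random peer and both replace their values with the pair's average. The function name `gossip_average`, the number of cycles and the peer-selection rule are assumptions.

```python
import random


def gossip_average(values, cycles=30, seed=0):
    """Simulate pairwise gossip averaging over a fully connected network.

    Each exchange preserves the sum of all estimates, so the global mean is
    unchanged while the estimates move towards it.
    """
    rng = random.Random(seed)
    estimates = list(values)
    n = len(estimates)
    for _ in range(cycles):
        for i in range(n):
            # Pick a random peer j different from i.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            # Both nodes adopt the average of their two current estimates.
            avg = (estimates[i] + estimates[j]) / 2.0
            estimates[i] = estimates[j] = avg
    return estimates


if __name__ == "__main__":
    local_values = [1.0, 5.0, 9.0, 13.0]  # one value per camera, made up
    print(gossip_average(local_values))
```

After enough cycles every node holds an estimate close to the mean of the initial values, even though no node ever saw all of them.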

Links:
Research page and related publication


The aim of the project is to research a method for finding and following elderly people inside their homes. The goal is to enable an AIBO robot to move around the house while it keeps an eye on the person living there. A second, subsequent goal is to detect when the person falls and to react to that, for example by alerting a nurse.

To reach the first goal, several algorithms for tracking and localizing people in the noisy (and, while the robot is moving, shaky) camera images should be considered. Possible algorithms are colour histogram tracking, salient point tracking and background estimation. The solution could consist of one of these algorithms, but is more likely to be an adaptable hybrid of several, in which the best features of each algorithm can be exploited.
The second goal, detecting falls of the person being watched, will probably be beyond the scope of this master's project. This part can be considered a completely separate research topic and could be taken up as future work.
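Of the candidate techniques named above, colour histogram tracking is the simplest to sketch. The example below is a minimal illustration rather than the project's implementation. It describes the target by a normalised joint RGB histogram and scans the frame for the window whose histogram is most similar, using the Bhattacharyya coefficient. The function names, bin count, stride and exhaustive window search are all assumptions; a practical tracker would typically search only near the previous position.

```python
import numpy as np


def colour_histogram(patch, bins=8):
    """Normalised joint RGB histogram of an (h, w, 3) uint8 image patch."""
    pixels = patch.reshape(-1, 3).astype(float)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256), (0, 256), (0, 256)))
    return hist.ravel() / hist.sum()


def bhattacharyya(p, q):
    """Similarity of two normalised histograms; 1.0 means identical."""
    return float(np.sum(np.sqrt(p * q)))


def track(frame, target_hist, size, stride=4):
    """Return the (top, left) of the best-matching window and its score."""
    h, w = size
    best_score, best_pos = -1.0, None
    for top in range(0, frame.shape[0] - h + 1, stride):
        for left in range(0, frame.shape[1] - w + 1, stride):
            window = frame[top:top + h, left:left + w]
            score = bhattacharyya(colour_histogram(window), target_hist)
            if score > best_score:
                best_score, best_pos = score, (top, left)
    return best_pos, best_score


if __name__ == "__main__":
    # Synthetic frame: grey background with a red 8x8 "person" patch.
    frame = np.full((32, 32, 3), 128, dtype=np.uint8)
    frame[8:16, 12:20] = (255, 0, 0)
    target = colour_histogram(frame[8:16, 12:20])
    print(track(frame, target, size=(8, 8)))
```

A histogram discards spatial layout, which makes it tolerant of the shaky images mentioned above but easy to confuse with similarly coloured background, one reason a hybrid of several algorithms may be attractive.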


A 3D video is typically obtained from a set of synchronized cameras that capture the same scene from different viewpoints (multi-view video). This technique enables applications such as free-viewpoint video and 3D-TV. Free-viewpoint video applications let the user interactively select a viewpoint of the scene. With 3D-TV, the depth of the scene can be perceived using a multi-view display that simultaneously shows several views of the same scene. In this talk, we present the architecture of an acquisition, compression and rendering system for 3D-TV and free-viewpoint video. We show that the proposed system has several advantages. First, the 3D video acquisition sub-system can be simplified by employing a Depth Image Based Representation (DIBR) of the 3D scene. Second, the proposed system achieves efficient compression of 3D/multi-view video by extending a standard H.264 encoder such that near backward compatibility is retained. Third, high-quality 3D rendering is obtained by appropriately handling occluded pixels. The current proposal allows a gradual introduction of cost-efficient 3D-TV and free-viewpoint systems.

Links:
Research page


Skeletons are compact shape descriptors that capture the topology and articulation of a shape in an effective manner. Applications range from shape analysis, segmentation, and matching, to motion planning and pose estimation. In this presentation I will give an overview of skeleton properties and applications, and present our novel 3D skeletonization and segmentation methods.

Links:
Related publication
Research page


Large displays are well suited to supporting discussions, because they provide a large physical plane that shows information relevant to the discussion. An intuitive, natural form of interaction with such a display is needed, and this novel interface should adapt to the discussants. To this end, visual cues in user behaviour in front of a large display are identified and interpreted. The aim is to find such cues in users' manual gesturing. Recordings of omics meetings serve as a corpus in which human motion is annotated. We plan to manually annotate user gesturing using a written sign language scheme that is extended to include the on-screen objects being referred to.

Links:
Related publication
Workshop presentation
Research page


The focus of our research is on image content analysis and object detection to enable content-adaptive local processing of TV video, with the aim of achieving higher levels of picture quality enhancement and facilitating content-based applications. Examples of important objects are sky, grass and human skin, which can be used for content-adaptive noise reduction, sharpness enhancement and color correction. Important factors here are the real-time nature and hardware constraints of TV video processing.

In the examples below, left is the input image, middle is the detection result for grass and sky in the first and the second image, respectively, and right is the result of content-adaptive color enhancement.



Last Updated: August 13th, 2007
Y. A.