2D and 3D focusing and tomography of UAV-borne Synthetic Aperture Radar images
Abstract:
This talk introduces the topic of Synthetic Aperture Radar (SAR) from the point of view of the Backprojection algorithm. This imaging method is based on the transmission of radio pulses and the subsequent recording of the received echoes. After software processing, 2D and 3D ground images in different frequency bands and modalities can be obtained.
Emphasis is placed on the challenges and particularities posed by the case under study: Time-Domain Backprojection for a UAV-mounted SAR platform flying arbitrary trajectories. Results are provided for 2D and 3D focusing in the P, L and C bands. GPU parallelization is discussed, and insights are given into performance improvements attainable in similar algorithms.
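As a minimal sketch of the core idea, assuming range-compressed complex echoes and a flat ground grid (the function and parameter names below are illustrative, not from the talk), each pixel is obtained by coherently summing every pulse's echo sample at the pixel's two-way delay:

```python
import numpy as np

C = 299_792_458.0  # speed of light (m/s)

def backproject_2d(echoes, positions, t0, dt, fc, grid_x, grid_y):
    """Time-domain backprojection onto a flat ground grid (z = 0).

    echoes:    (n_pulses, n_samples) complex range-compressed data
    positions: (n_pulses, 3) antenna position per pulse (arbitrary trajectory)
    t0, dt:    fast-time origin and sample spacing (s)
    fc:        carrier frequency (Hz), used to restore the carrier phase
    """
    image = np.zeros((grid_y.size, grid_x.size), dtype=complex)
    gx, gy = np.meshgrid(grid_x, grid_y)
    for p in range(echoes.shape[0]):
        dx = gx - positions[p, 0]
        dy = gy - positions[p, 1]
        dz = -positions[p, 2]
        r = np.sqrt(dx**2 + dy**2 + dz**2)   # pixel-to-antenna range
        tau = 2.0 * r / C                    # two-way delay
        idx = np.clip(np.round((tau - t0) / dt).astype(int),
                      0, echoes.shape[1] - 1)
        # coherent sum: sample the echo at the delay, compensate carrier phase
        image += echoes[p, idx] * np.exp(1j * 2 * np.pi * fc * tau)
    return np.abs(image)
```

The per-pulse loop is embarrassingly parallel over pixels, which is what makes the algorithm a natural fit for GPU parallelization.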
Back to the Future: Boosting 3D Perception with Privileged Temporal Information
Abstract:
Pseudo-labeling is a powerful tool to tap into large amounts of unlabeled data. The promise of such automatic annotation is especially appealing in safety-critical applications where performance requirements are extreme and data annotation is challenging. Sensing 3D dynamic environments for autonomous robotics is one such domain. In this work, we propose to boost pseudo-labeling for 3D dynamic perception by leveraging the whole temporal information contained in unlabeled sequences. Building on the teacher-student pseudo-labeling paradigm, in which the teacher provides annotations for unlabeled data to improve the training of the student, we leverage both past and future frames in different ways: (1) we boost the performance of the offline teacher through access to richer information, in particular the privileged information (PI) from the future, which the online student will not see at run time; (2) we learn and combine multiple teachers with different temporal horizons; (3) we improve the selection of the final pseudo-labels by including the agreement of the multiple teachers in the confidence-based selection criterion. We demonstrate the merit of our approach on the fundamental perception tasks of 3D semantic segmentation, 3D object detection, and motion forecasting in lidar point clouds.
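A minimal sketch of how multi-teacher agreement can be folded into a confidence-based selection criterion (the function name, thresholds, and exact combination rule are illustrative assumptions, not the paper's specification):

```python
import numpy as np

def select_pseudo_labels(teacher_probs, conf_thresh=0.9, min_agree=2):
    """Combine per-point class probabilities from several teachers (each
    with a different temporal horizon) and keep only the points where
    enough teachers agree on a sufficiently confident ensemble prediction.

    teacher_probs: (n_teachers, n_points, n_classes) softmax outputs
    Returns (labels, mask): ensemble labels and a boolean keep-mask.
    """
    preds = teacher_probs.argmax(axis=-1)         # per-teacher hard labels
    mean_probs = teacher_probs.mean(axis=0)       # ensemble probability
    labels = mean_probs.argmax(axis=-1)
    confidence = mean_probs.max(axis=-1)
    agreement = (preds == labels).sum(axis=0)     # teachers voting with ensemble
    mask = (confidence >= conf_thresh) & (agreement >= min_agree)
    return labels, mask
```

Only points passing the mask would be kept as pseudo-labels for training the online student.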
Incremental learning on large visual datasets
Abstract:
This work presents a streaming learning method for large visual recognition datasets in which models must learn from new data as soon as it becomes available. Two datasets are considered: ImageNet200 for image recognition, and the BIRDS video dataset for risk situations of frail people. We trained models based on ResNet50 for ImageNet200 and on a pooling vision transformer for BIRDS. We then trained our models on the streaming set by passing data points one at a time. Our approach builds on the existing Move-to-Data (MTD) continual learning method, which uses vector projection for weight updates. We introduce MTD with gradient, which fuses MTD with gradient-based weight updates and a buffer. We used ExStream streaming learning as a baseline for comparison. On ImageNet200, the newly proposed MTD with gradient achieves an accuracy of 67.24%, surpassing the baseline by 0.5%. On BIRDS, only ExStream and MTD were evaluated, and MTD did not perform well in this setting.
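A rough sketch in the spirit of a Move-to-Data-style update, assuming a last-layer weight matrix and one incoming sample at a time (the exact projection rule of MTD is not detailed in the abstract, so this simplified nudge-toward-the-feature form and all names are assumptions):

```python
import numpy as np

def move_to_data_update(W, feature, label, lr=0.01):
    """One streaming update: nudge the weight vector of the ground-truth
    class toward the (normalized) feature of the incoming sample,
    without computing a full gradient.

    W:       (n_classes, d) last-layer weights, updated in place
    feature: (d,) backbone feature of the new sample
    label:   ground-truth class index
    """
    f = feature / (np.linalg.norm(feature) + 1e-12)
    W[label] = (1.0 - lr) * W[label] + lr * f
    return W
```

The appeal of such an update for streaming is its cost: one sample triggers one cheap vector operation, with no replay over past data (the gradient-fused variant and buffer described above add a standard gradient step on buffered samples).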
Image Processing in Nuclear Medicine
Abstract:
Nuclear medicine uses radioactive material inside the body to see how organs or tissue are functioning (for diagnosis) or to target and destroy damaged or diseased organs or tissue (for treatment). Image processing has a crucial role during the entire imaging chain, starting from image acquisition, through reconstruction, to the post-processing of tomographic images. The presentation aims to provide insight into the various image processing methods and problems.
Object detection and segmentation for 360 urban stocktaking and Equirectangular projection for data augmentation.
Abstract:
Computer Vision (CV) and Deep Learning (DL) algorithms can play an essential role in urban mobile mapping, allowing large data volumes to be processed by intelligent automatic systems that support decision-making and management. This talk presents the development of two DL-based modules that detect objects at different granularity levels, for Semantic Segmentation (SS) and Object Detection (OD), so that they can form part of a more extensive intelligent stocktaking system for urban public infrastructure, providing essential information for high-level decisions. The work shows the performance of standard SoTA models on a 360 urban-context dataset. The experiments show that SoTA SS models trained on generic datasets can perform remarkably well on 360 images. Moreover, SoTA OD models can be trained on 360 datasets, with some considerations related to the high deformation near the equirectangular poles. Furthermore, to understand the impact of equirectangular geometric deformations on SoTA DL models, a novel research line was pursued that assesses geometrical projections as data augmentation transformations to improve the prediction capabilities of deep models. It shows that the equirectangular projection positively impacts the models' performance.
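One geometric property that makes equirectangular panoramas convenient for augmentation is that a rotation of the camera about the vertical axis corresponds exactly to a horizontal circular shift of the image. A minimal sketch of this distortion-free augmentation (the function name is illustrative; the talk's actual augmentation pipeline may differ):

```python
import numpy as np

def yaw_rotate_equirectangular(img, degrees):
    """Rotate a 360° view about the vertical axis by circularly shifting
    the equirectangular image along its horizontal axis.

    img: (H, W, C) equirectangular image covering 360° horizontally
    """
    w = img.shape[1]
    shift = int(round(degrees / 360.0 * w)) % w
    return np.roll(img, shift, axis=1)
```

Unlike generic crops or affine warps, this transform produces a valid panorama for every angle, so it can enlarge a 360 training set without introducing artificial distortion.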
Analysis and Extensions of Adversarial Training for Video Classification
Abstract:
Adversarial training (AT) is a simple yet effective defense against adversarial attacks on image classification systems, based on augmenting the training set with attacks that maximize the loss. However, the effectiveness of AT as a defense for video classification has not been thoroughly studied. Our first contribution is to show that generating optimal attacks for video requires carefully tuning the attack parameters, especially the step size. Notably, we show that the optimal step size varies linearly with the attack budget. Our second contribution is to show that using a smaller (sub-optimal) attack budget at training time leads to more robust performance at test time. Based on these findings, we propose three defenses against attacks with variable attack budgets. The first, Adaptive AT, draws the attack budget from a distribution that is adapted as training iterations proceed. The second, Curriculum AT, increases the attack budget as training iterations proceed. The third, Generative AT, further couples AT with a denoising generative adversarial network to boost robust performance. Experiments on the UCF101 dataset demonstrate that the proposed methods improve adversarial robustness against multiple attack types.
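A minimal sketch of the step-size/budget coupling, using an L-infinity PGD attack on a linear softmax classifier so the gradient can be written analytically (the `step_ratio` constant and all names are illustrative assumptions, not the paper's tuned values):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def pgd_linf(W, x, y, budget, n_steps=5, step_ratio=0.5):
    """L-inf PGD on a linear softmax classifier, with the step size set
    as a fixed fraction of the budget, mirroring the observation that
    the optimal step size varies linearly with the attack budget.

    W: (n_classes, d) weights;  x: (n, d) inputs;  y: (n,) labels
    """
    step = step_ratio * budget
    delta = np.zeros_like(x)
    for _ in range(n_steps):
        p = softmax((x + delta) @ W.T)     # (n, n_classes)
        p[np.arange(len(y)), y] -= 1.0     # d loss / d logits (cross-entropy)
        grad = p @ W                       # d loss / d input
        delta = np.clip(delta + step * np.sign(grad), -budget, budget)
    return x + delta
```

With the step tied to the budget this way, the same attack schedule remains well-scaled as Adaptive or Curriculum AT varies the budget during training.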
Classification of Breast Cancer Subtypes and Evaluation of Biomarker Expression in Whole Slide Images
Abstract:
Breast cancer is one of the most commonly diagnosed cancers in women worldwide, and overall survival depends on several factors, including accurate and prompt diagnosis and treatment. Important prognostic factors are its morphological subtype and biomarker status. The shift to the digital era has made it possible to use gigapixel images of breast biopsies to determine the best possible treatment for each patient. However, challenges such as 1) the high morphological heterogeneity, 2) the size of the images, 3) the presence of various artifacts, and 4) the size of the datasets still need to be addressed in order to apply deep learning techniques efficiently. To this end, a multiple instance learning approach was implemented to cope with the problem of weak and noisy labels in breast cancer subtype classification. For biomarker status prediction, on the other hand, the advantages of transfer learning were exploited. Further evaluation of each methodology should be performed to decrease costs and timelines in cancer treatment while increasing patient survival.
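The abstract does not specify the aggregation used, but a common instantiation of multiple instance learning on gigapixel slides is attention-based pooling over tile embeddings, where the slide-level (bag) label supervises a weighted combination of its patches. A minimal sketch under that assumption (all names and parameters are illustrative):

```python
import numpy as np

def attention_mil_pool(instances, V, w):
    """Attention-based pooling over instance (patch) embeddings: patches
    with higher attention dominate the bag embedding used for the final
    slide-level (e.g. subtype) prediction.

    instances: (n_patches, d) patch embeddings
    V: (h, d), w: (h,) attention parameters
    Returns (bag_embedding, attention_weights).
    """
    scores = np.tanh(instances @ V.T) @ w  # (n_patches,) raw attention
    a = np.exp(scores - scores.max())
    a /= a.sum()                           # softmax attention weights
    return a @ instances, a                # (d,), (n_patches,)
```

Because only the bag needs a label, this sidesteps patch-level annotation, which is exactly what makes it attractive for weak and noisy slide labels.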
Triplet networks for cross-modal document image retrieval
Abstract:
A particularly successful approach to detecting fraud in documents consists of comparing a new document with a legitimate one in order to spot anomalies. However, to make this comparison viable, it is necessary to retrieve similar documents from a database. Such similarity can be established by considering textual, structural or visual features. Hybrid models, i.e., those that combine all these kinds of features, have shown the best performance. In view of this, a content-based document image retrieval system with a novel and robust architecture is proposed. It is designed as a triplet neural network whose branches consist of a visual encoder, a text encoder with a spatial-aware self-attention mechanism, and a pooling module. Because each block expects inputs of a different nature, it can be considered a cross-modal neural network, which offers the advantage of leveraging the benefits of the individual sources of information while compensating for each other's flaws. The resulting model achieves a precision comparable to the state of the art, even though the training and test datasets consist of samples captured by end users in non-controlled environments.
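The objective trained by a triplet network is the standard margin-based triplet loss; a minimal sketch (the margin value and names are illustrative, and in the cross-modal setting the three embeddings come from the shared output space of the branches):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss: pull the anchor (e.g. a query document
    embedding) toward a similar document and push it away from a
    dissimilar one, until the negative is at least `margin` farther
    than the positive.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)
```

Once trained this way, retrieval reduces to a nearest-neighbor search in the learned embedding space, which is what makes the comparison against a database of legitimate documents viable.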
Automatic Quantification of Children’s Lived Experience
Abstract:
One core assumption in developmental psychology is that children's cognition develops in interaction with their social and physical environment. One way to study this relation is to code interactions from video recordings of children's daily activities. However, this coding is usually done by hand and is therefore very labor-intensive. Modern Computer Vision (CV) techniques, such as automatic people and object detection, can significantly reduce this effort and thereby facilitate the study of cognitive development. We want to use these techniques to evaluate a dataset of children's daily activities, that is, to automatically quantify children's interactions with people and objects. For this, we are collecting video recordings at home and in kindergartens using small, lightweight bodycams. So far, we have recorded ten hours of video from six children. To evaluate the accuracy of various models, we hand-coded a subset of the videos and then compared 11 state-of-the-art CV detectors for people and objects against this hand-coded subset. The detection accuracy is between 30% and 35%, leaving room for improvement. We identified key limitations of the state-of-the-art models by specifying systematic detection errors (i.e., conditions under which a model fails to detect a person). This is the basis for improving our processing pipeline. Our next step is to improve the state-of-the-art models by fine-tuning or changing the model architecture. Once the detection is sufficiently accurate, we want to use these models to study the effect of children's daily activities on their cognitive development at scale.