Over the last few years, the field of Artificial Intelligence has witnessed significant growth, largely fueled by the development of large-scale machine learning models. These foundation models are characterized by extensive training on diverse datasets that encompass various input modalities (e.g., images, text, audio, 3D data), and they show excellent flexibility and effectiveness across a wide range of standard NLP and Computer Vision tasks. Such general-purpose solutions often reveal potential that goes beyond what their creators originally envisioned, motivating users to adopt these models for a broad spectrum of applications. Nevertheless, the knowledge embedded in these models may not be enough when the final goal goes beyond perception benchmarks. These considerations spark important questions that can only be answered through a collaborative dialogue between the researchers developing these models (creators) and those employing them in downstream tasks (users). Each group brings a unique perspective that will be crucial in shaping the future of this technology.

The goal of this workshop is to identify and discuss strategies to assess both positive and negative (possibly unexpected) behaviors in the development and use of foundation models.

Particular attention will be given to applications that diverge significantly from the scenarios encountered during the training phase of foundation models. These include application-specific visual understanding, uncertainty evaluation, goal-conditioned reasoning, learning of human habits, task and motion planning, scene navigation, and vision-based manipulation. Our purpose is to foster an open discussion between foundation model creators and users, targeting the analysis of the most pressing open questions for the two communities and encouraging new fruitful collaborations.

A few examples of such questions are:

  1. Should we always lean towards versatile generalist models, or are there circumstances where specialized task-specific models are still preferred?
  2. What are the risks of blindly using foundation models?
  3. What are the downstream tasks that are still challenging for current foundation models?
  4. How may the peculiarities of those tasks guide an effective re-design of foundation models?
  5. Do such difficulties arise from data-related issues, or are they rather caused by the learning objective and network design aspects?
  6. What level of common sense and abstract reasoning is necessary for these models to operate effectively in the real world, possibly onboard robotic agents?
  7. How can we properly benchmark these capabilities in current models?

Keynote Speakers


Note: Quoted times are in CEST. The schedule is non-definitive and may be subject to slight changes.

14:00 - 14:10  Opening remarks
14:10 - 14:45  Invited talk #1 by Zsolt Kira
14:45 - 15:20  Invited talk #2 by Ishan Misra
15:20 - 16:00  Coffee break ☕
16:00 - 16:35  Invited talk #3 by Hilde Kuehne
16:35 - 17:45  Poster Session
17:45 - 18:00  Closing remarks

Call for Contributions

We invite participants to submit their work to the FOCUS Workshop as either a full paper (Proceedings, max. 14 pages excluding references) or a short paper (Non-archival, max. 4 pages excluding references).

Full-Paper Submissions

Full papers must present original research, not published elsewhere, and follow the ECCV main conference policies and format with a maximum length of 14 pages (extra pages with references only are allowed). Supplemental materials are not allowed. Accepted full papers will be included in the ECCV 2024 Workshop proceedings.

Short-paper Submissions

We welcome short papers, which may suit works of a more speculative or preliminary nature that may not be fit for a full-length paper. Authors are also welcome to submit short papers for previously or concomitantly published works that could foster the workshop objectives. Short papers will have a maximum length of 4 pages (extra pages with references only are allowed). They will be presented at the workshop but will not be included in the ECCV 2024 Workshop proceedings.

Topics of Interest

The FOCUS workshop welcomes contributions on a broad range of topics concerning robustness, generalization, and transparency of computer vision. Areas of interest encompass, but are not limited to:

  1. New vision-and-language applications
  2. Supervised vs. unsupervised foundation models and downstream tasks
  3. Zero-shot, few-shot, continual, and lifelong learning of foundation models
  4. Open set, out-of-distribution detection and uncertainty estimation
  5. Perceptual reasoning and decision making: alignment with human intents and modeling
  6. Prompt and visual instruction tuning
  7. Novel evaluation schemes and benchmarks
  8. Task-specific vs general-purpose models
  9. Robustness and generalization
  10. Interpretability and explainability
  11. Ethics and bias in prompting


All submissions must be made through the CMT submission portal before the submission deadline (July 15th, 2024).

Authors must follow the ECCV 2024 main conference policy on anonymity and must adhere to the official ECCV template. Each submission will undergo review by a minimum of two reviewers under a double-blind policy.

We also encourage authors to follow the ECCV 2024 Suggested Practices for Authors, except with regard to supplemental material, which is not allowed.

Important Dates

All dates are listed in the YYYY-MM-DD format and all times refer to Central European Summer Time (CEST, UTC+2).

CMT submission portal opens.
Submission Deadline - 11:59 PM CEST
Decision to Authors
Full-paper camera-ready deadline
Workshop day - starting at 2:00 PM CEST


Supported by
