Over the last few years, the field of Artificial Intelligence has witnessed significant growth,
largely fueled by the development of large-scale machine learning models.
These foundation models are characterized by extensive training on diverse datasets that encompass
various input modalities (e.g. images, text, audio, 3D data), showing excellent flexibility and
effectiveness across a wide range of standard NLP and Computer Vision tasks.
Such general-purpose solutions often reveal potential that goes beyond what their creators originally
envisioned, motivating users to adopt these models for a broad spectrum of applications.
Nevertheless, the knowledge embedded in these models may not be enough when the final goal goes
beyond perception benchmarks.
These considerations spark important questions that can only be answered with a collaborative
dialogue between researchers developing these models (creators) and those employing them in
downstream tasks (users). Each group brings a unique perspective that will be crucial in shaping the
future of this technology.
The goal of this workshop is to identify and discuss strategies to assess both positive and
negative (possibly unexpected) behaviors in the development and use of foundation models.
Particular attention will be given to applications that diverge significantly from the scenarios
encountered during the training phase of foundation models. These include application-specific
visual understanding, uncertainty evaluation, goal-conditioned reasoning,
learning of human habits, task and motion planning, scene navigation, and vision-based manipulation.
Our purpose is to foster an open discussion between foundation model creators and users, targeting
the analysis of the most pressing open questions for the two communities and encouraging new
fruitful collaborations.
A few examples of such questions are:
- Should we always lean towards versatile generalist models, or are there circumstances where
specialized task-specific models are still preferred?
- What are the risks of blindly using foundation models?
- What are the downstream tasks that are still challenging for current foundation models?
- How might the peculiarities of those tasks guide an effective re-design of foundation models?
- Do such difficulties arise from data-related issues, or are they rather caused by the learning
objective and network design aspects?
- What level of common sense and abstract reasoning is necessary for these models to operate
effectively in real-world and possibly onboard robotic agents?
- How can we properly benchmark these capabilities in current models?
Program
The workshop will be held at MiCo Milano, Suite 2. (All times listed are in CEST)
14:00 - 14:10
Opening remarks
14:10 - 14:45
Invited talk #1 by Zsolt Kira
14:45 - 15:20
Invited talk #2 by Ishan Misra
15:30 - 16:00
Coffee break ☕
16:00 - 16:35
Invited talk #3 by Hilde Kuehne
16:35 - 17:30
Poster Session
17:30 - 18:00
Discussion and closing remarks
Call for Contributions
We invite participants to submit their work to the FOCUS Workshop as either a full paper (proceedings, max. 14
pages excluding references) or a short paper (non-archival, max. 4 pages excluding references).
Full-Paper Submissions
Full papers must present original research, not published elsewhere, and follow the ECCV main
conference policies and format, with a maximum length of 14 pages (additional pages containing only
references are allowed). Supplemental materials are not allowed. Accepted full papers will be included
in the ECCV 2024 Workshop proceedings.
Short-paper Submissions
We welcome short papers, which may suit works of a more speculative or preliminary nature that may
not be fit for a full-length paper. Authors are also welcome to submit short papers for previously
or concurrently published works that could further the workshop objectives. Short papers will have a
maximum length of 4 pages (additional pages containing only references are allowed). They will be presented
at the workshop but will not be included in the ECCV 2024 Workshop proceedings.
Topics of Interest
The FOCUS workshop welcomes contributions on a broad range of topics concerning robustness,
generalization, and transparency of computer vision. Areas of interest encompass, but are not limited
to:
- New vision-and-language applications
- Supervised vs. unsupervised foundation models and downstream tasks
- Zero-shot, few-shot, continual, and lifelong learning of foundation models
- Open set, out-of-distribution detection and uncertainty estimation
- Perceptual reasoning and decision making: alignment with human intents and modeling
- Prompt and visual instruction tuning
- Novel evaluation schemes and benchmarks
- Task-specific vs general-purpose models
- Robustness and generalization
- Interpretability and explainability
- Ethics and bias in prompting
Guidelines
All submissions must be made through the CMT submission portal before the submission deadline
(July 29th, 2024).
Authors must follow the ECCV 2024 main conference policy on anonymity and must adhere to the official ECCV template.
Each submission will undergo review by a minimum of two reviewers under a double-blind policy.
We also encourage authors to follow the ECCV 2024 Suggested Practices for Authors, except with regard to supplemental
material, which is not allowed.
Important Dates
All dates are in 2024 and all times refer to Central European Summer Time (CEST,
UTC+2).
Jun 15th
CMT submission portal opens.
Jul 29th
Submission Deadline - 23:59 CEST
Aug 10th
Decision to Authors
Aug 19th
Full-paper camera-ready deadline
Sept 30th
Workshop day - starting at 14:00, Workshop Location: Suite 2 at MiCo Milano