Salesforce Open-Sources Language-Vision AI Toolkit LAVIS

Salesforce Research recently open-sourced LANguage-VISion (LAVIS), a unified library for language-vision research. LAVIS supports more than 10 language-vision tasks on 20 public datasets and includes pre-trained weights for over 30 fine-tuned models.

The release was announced on the Salesforce Research blog. LAVIS features a modular design that allows easy integration of new models and provides standard interfaces for model inference. The built-in models, which are trained on public datasets, allow researchers to use LAVIS as a benchmark for evaluating their own work; the models can also be used as-is in AI applications. LAVIS also includes other tools, such as utilities and GUIs for downloading and browsing common public training datasets. According to the Salesforce team:

[We built LAVIS to] make the intelligence and emerging capabilities of language vision accessible to a wider audience, promote their practical adoptions, and reduce repetitive effort in future development.

Multimodal deep learning models, especially those that perform combined language-vision tasks, are an active area of research. InfoQ has covered several leading models such as OpenAI’s CLIP and DeepMind’s Flamingo. However, the Salesforce researchers note that training and evaluating such models can be challenging for new practitioners, due to inconsistencies “between models, datasets, and task evaluations.” There have been other efforts to create toolkits similar to LAVIS, including Microsoft’s UniLM and Meta’s MMF and TorchMultimodal.

LAVIS supports language-vision tasks in seven different categories: end-to-end pre-training, multimodal retrieval, captioning, visual question answering, multimodal classification, visual dialogue, and multimodal feature extraction. These tasks are performed by fine-tuned models based on four foundation models, which include OpenAI’s CLIP as well as three models developed by Salesforce: ALign BEfore Fuse (ALBEF), Bootstrapping Language-Image Pretraining (BLIP), and ALign and PROmpt (ALPRO).
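
For a quick look at which foundation models and fine-tuned variants a given release provides, the library ships a model zoo that can be inspected from Python. The short sketch below assumes the model_zoo helper described in the LAVIS README; the exact entries printed will depend on the release:

from lavis.models import model_zoo

# print a table of supported architectures (ALBEF, BLIP, ALPRO, CLIP)
# and the fine-tuned "model types" available for each architecture
print(model_zoo)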

LAVIS High Level Architecture

Image source: https://blog.salesforceairesearch.com/lavis-language-vision-library/

The figure above shows the high-level architecture of LAVIS. In addition to models, the library exposes preprocessors for text and image input, which are applied before passing data to the model. The code below shows an example of using LAVIS to generate a caption for an input image:

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# load an input image (the path here is a placeholder)
raw_image = Image.open("example.jpg").convert("RGB")
# loads the BLIP caption base model, with checkpoints fine-tuned on the MSCOCO captioning dataset;
# this also loads the associated image processors
model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)
# preprocess the image
# vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
# generate a caption
model.generate({"image": image})
# ['a large fountain spewing water into the air']

In addition to pre-trained models, LAVIS includes several tools for researchers interested in developing new models. These include scripts for downloading training and test datasets, code for training and evaluating the included models, and benchmark results on the test datasets. The documentation includes tutorials on how to add new modules to the library, including datasets, preprocessors, models, and tasks.
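
As a hedged illustration of the dataset tooling, the sketch below assumes the dataset_zoo and load_dataset helpers described in the LAVIS README; listing the registered builders needs no data on disk, while loading the MSCOCO captioning splits assumes the images have already been fetched with the bundled download scripts:

from lavis.datasets.builders import dataset_zoo, load_dataset

# list the dataset builders registered with the library
print(dataset_zoo.get_names())

# load the MSCOCO captioning dataset (assumes the images were downloaded
# beforehand with the download scripts shipped in the repository)
coco_dataset = load_dataset("coco_caption")
print(coco_dataset.keys())        # expected splits, e.g. 'train', 'val', 'test'
print(coco_dataset["train"][0])   # a single annotated training example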

Salesforce is actively developing LAVIS and recently added a new visual question answering model, PNP-VQA. Instead of fine-tuning models, this new framework focuses on zero-shot learning. A pre-trained image model generates “question-driven” captions, which are passed to a pre-trained language model as context for answering questions. According to Salesforce, “At 11B settings, it outperforms the 80B-setting Flamingo model by 8.5% on VQAv2.”
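
Visual question answering models in LAVIS are loaded through the same load_model_and_preprocess interface shown above. The identifiers for PNP-VQA may differ between releases, so the hedged sketch below instead runs the BLIP VQA checkpoint from the README as an illustration of the question-answering API; the image path and question are placeholders:

import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
raw_image = Image.open("example.jpg").convert("RGB")  # placeholder path
question = "What is spewing water into the air?"      # placeholder question

# load the BLIP VQA model fine-tuned on VQAv2, plus its image and text processors
model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_vqa", model_type="vqav2", is_eval=True, device=device)
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
question = txt_processors["eval"](question)

# generate a free-form answer to the question about the image
print(model.predict_answers(samples={"image": image, "text_input": question}, inference_method="generate"))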

The LAVIS source code is available on GitHub, as are several Jupyter notebooks demonstrating its use in various language-vision tasks. A web demo of the GUI is listed as “coming soon.”

