[NOTE] Vision and NLP Models, AI Agents
Topics on vision and language models, including classic DNN-based image classification, visual question answering, visual object tracking, transformer-based LLM, and their potential applications such as AI agents or robots.
Visual Object Tracking (VOT)
VOT Basics
-
Motion Modeling: estimating the object's position in upcoming frames to reduce the search space. However, motion models may suffer from abrupt drifts; appearance modeling can be an alternative to motion modeling.
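A minimal sketch (my own illustration, not from the note) of motion modeling with a constant-velocity assumption: predict the next position from the last two observations, then search only a window around the prediction. The function names are hypothetical.

```python
# Constant-velocity motion model: next = curr + (curr - prev).
# If motion drifts abruptly, this prediction fails and the reduced
# search window may miss the target -- hence appearance modeling as a fallback.

def predict_next_position(prev, curr):
    """Predict the object's position in the next frame (constant velocity)."""
    return tuple(c + (c - p) for p, c in zip(prev, curr))

def search_window(center, radius):
    """Reduced search region (axis-aligned box) around the predicted center."""
    x, y = center
    return (x - radius, y - radius, x + radius, y + radius)

# Object center moved from (10, 10) to (14, 12); predict frame t+1.
pred = predict_next_position((10, 10), (14, 12))
box = search_window(pred, radius=8)
```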
-
LiDAR (light detection and ranging): an active sensing system that emits laser light from a source and measures the distance/elevation of the target object from the return time and intensity of the reflected pulse.
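The time-of-flight calculation behind LiDAR ranging can be sketched as follows (the pulse travels to the target and back, so the one-way distance is half of speed-of-light times round-trip time):

```python
# LiDAR time-of-flight range equation: distance = c * t / 2.

C = 299_792_458.0  # speed of light in m/s

def lidar_range(round_trip_time_s):
    """Distance to target from the measured round-trip time of a laser pulse."""
    return C * round_trip_time_s / 2.0

# A pulse returning after ~667 ns corresponds to a target ~100 m away.
d = lidar_range(667e-9)
```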
-
Siamese-based Networks: a pair of identical network branches used in visual object tracking to compare the similarity between the target object and the search region.
- Two-stream: features or templates and search region are extracted separately
- Two-stage: feature extraction (CNN backbones) + relation modeling (e.g., correlation layer to fuse features for state estimate, or using transformer attention mechanism).
- Feature extraction is fixed after offline training, with no feature-target interaction; not well suited for one-shot tracking
- One-stream, light relation-modeling: the relation is modeled within self-attention. Inference is performed only on tokens adaptively selected from the search region, which reduces computational cost.
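The relation-modeling step in a two-stream Siamese tracker can be sketched as a cross-correlation: slide the template feature over the search-region feature and read the peak of the response map as the target location. This is a simplified numpy illustration (real trackers correlate deep CNN features, not raw pixels):

```python
import numpy as np

def cross_correlate(template, search):
    """Slide the template over the search region; the response map peaks
    where the search patch is most similar to the template."""
    th, tw = template.shape
    sh, sw = search.shape
    out = np.empty((sh - th + 1, sw - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(template * search[i:i + th, j:j + tw])
    return out

search = np.zeros((8, 8))
search[3:5, 4:6] = 1.0        # target located at row 3, col 4
template = np.ones((2, 2))    # appearance template of the target
resp = cross_correlate(template, search)
peak = np.unravel_index(np.argmax(resp), resp.shape)
```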
-
FPN (Feature Pyramid Network):
- An architecture where feature maps at different scales are processed in parallel (fine-grained details and coarse contextual information).
- The multi-scale feature maps are fused together (by concatenation or addition) to generate a comprehensive representation. FPN is backbone-agnostic.
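The top-down fusion can be sketched in numpy as follows. This is a simplified illustration: the lateral 1x1 convolutions are omitted and channel counts are assumed to already match, so fusion is plain element-wise addition after 2x upsampling.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_topdown(c3, c4, c5):
    """Fuse backbone features coarse-to-fine: upsample the coarser map
    and add it to the next finer map (lateral projections omitted)."""
    p5 = c5
    p4 = c4 + upsample2x(p5)
    p3 = c3 + upsample2x(p4)
    return p3, p4, p5

c5 = np.ones((4, 4, 8))      # coarsest, most semantic
c4 = np.ones((8, 8, 8))
c3 = np.ones((16, 16, 8))    # finest, most detailed
p3, p4, p5 = fpn_topdown(c3, c4, c5)
```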
-
Predictor Metrics
-
Metrics to evaluate the prediction quality of a tracker
- Precision = $\frac{TP}{TP + FP}$ (in all positive predictions, how many are correct)
- Recall = $\frac{TP}{TP + FN}$ (in all ground truth, how many are correctly predicted)
- F1 Score = $\frac{2 \times Precision \times Recall}{Precision + Recall}$ (harmonic mean of precision and recall)
- F1 is a better metric than accuracy when the dataset is imbalanced (e.g., 99% of the data is negative, and 1% is positive).
- F1 penalizes FN/FP more heavily than accuracy does, whereas accuracy is dominated by TN/TP on imbalanced data.
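The three formulas above can be computed directly from raw counts. The example counts below are hypothetical, chosen to show accuracy looking good on an imbalanced set while F1 exposes the weak positive class:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Imbalanced example: 990 TN, 5 TP, 5 FP, 5 FN.
# Accuracy = (990 + 5) / 1005 ~ 0.99, but precision/recall/F1 are only 0.5.
p, r, f1 = precision_recall_f1(tp=5, fp=5, fn=5)
```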
-
Aggregate Metrics:
- Take into account the confidence threshold of the prediction
- A 2D curve of precision-recall relation: lower confidence threshold will lead to higher recall, but lower precision
- AUC (area under curve) is the integral of the precision-recall curve
- AP (average precision) is the average precision at different recall levels.
- mAP (mean average precision) is the mean of APs across different classes.
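One common way to compute AP (a sketch of the ranking-based variant, not the only definition in use) averages the precision at each rank where a true positive is retrieved; ground-truth objects that are never detected contribute zero precision:

```python
def average_precision(scored, num_gt):
    """AP over detections scored = [(confidence, is_true_positive)],
    with num_gt ground-truth objects in total."""
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    tp = 0
    precisions = []
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            tp += 1
            precisions.append(tp / rank)   # precision at this recall level
    return sum(precisions) / num_gt        # missed GT count as 0 precision

# 3 ground-truth objects; detections by descending confidence: TP, FP, TP, FP, TP
ap = average_precision([(0.9, True), (0.8, False), (0.7, True),
                        (0.6, False), (0.5, True)], num_gt=3)
```

mAP would then be the plain mean of this quantity over all object classes.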
-
Multi-Modal Models
- OpenAI CLIP (Contrastive Language-Image Pre-training)
- A transferable visual model trained with text supervision; it can be used for image classification and for querying images with natural language.
- CLIP is pre-trained by maximizing the cosine similarity between each image and its corresponding text (while minimizing the similarity of mismatched pairs).
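The contrastive objective can be sketched as a symmetric cross-entropy over the batch's cosine-similarity matrix, where matched (image, text) pairs sit on the diagonal. This is a simplified numpy illustration with random embeddings standing in for the encoders; the temperature value is an assumption:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over the cosine-similarity matrix:
    the i-th image should match the i-th text (diagonal) and no other."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature     # scaled cosine similarities
    labels = np.arange(len(img))           # diagonal = matched pairs

    def xent(l):                           # row-wise softmax cross-entropy
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))
loss_aligned = clip_contrastive_loss(emb, emb)                      # matched pairs
loss_random = clip_contrastive_loss(emb, rng.normal(size=(4, 16)))  # unrelated text
```

Aligned pairs drive the loss toward zero; unrelated text embeddings leave it high, which is exactly the signal the pre-training exploits.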