[NOTE] Vision and NLP Models, AI Agents
Topics on vision and language models, including classic DNN-based image classification, visual question answering, visual object tracking, transformer-based LLM, and their potential applications such as AI agents or robots.
Visual Object Tracking (VOT)
VOT Basics
-
Motion Modeling: estimating the object's position in upcoming frames to reduce the search space. However, motion models may suffer from abrupt drifts; appearance modeling can be an alternative to motion modeling.
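A minimal sketch (my own illustration, not from the note) of motion modeling with a constant-velocity assumption: predict the next position from the last two observations, then search only a window around the prediction. The function names are hypothetical.

```python
# Constant-velocity motion model: next = curr + (curr - prev).
# If motion drifts abruptly, this prediction fails and the reduced
# search window may miss the target -- hence appearance modeling as a fallback.

def predict_next_position(prev, curr):
    """Predict the object's position in the next frame (constant velocity)."""
    return tuple(c + (c - p) for p, c in zip(prev, curr))

def search_window(center, radius):
    """Reduced search region (axis-aligned box) around the predicted center."""
    x, y = center
    return (x - radius, y - radius, x + radius, y + radius)

# Object center moved from (10, 10) to (14, 12); predict frame t+1.
pred = predict_next_position((10, 10), (14, 12))
box = search_window(pred, radius=8)
```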
-
LiDAR (light detection and ranging): an active sensing system that emits laser light from a source and measures the distance/elevation of the target object from the return time and intensity of the reflected pulse.
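The time-of-flight calculation behind LiDAR ranging can be sketched as follows (the pulse travels to the target and back, so the one-way distance is half of speed-of-light times round-trip time):

```python
# LiDAR time-of-flight range equation: distance = c * t / 2.

C = 299_792_458.0  # speed of light in m/s

def lidar_range(round_trip_time_s):
    """Distance to target from the measured round-trip time of a laser pulse."""
    return C * round_trip_time_s / 2.0

# A pulse returning after ~667 ns corresponds to a target ~100 m away.
d = lidar_range(667e-9)
```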
-
Siamese-based Networks: a pair of identical network branches used in visual object tracking to compare the similarity between the target object and the search region.
- Two-stream: features or templates and search region are extracted separately
- Two-stage: feature extraction (CNN backbones) + relation modeling (e.g., correlation layer to fuse features for state estimate, or using transformer attention mechanism).
- Feature extraction is fixed after offline training, with no feature-target interaction; not well suited for one-shot tracking
- One-stream, light relation-modeling: the relation is modeled within self-attention. Inference is performed only on tokens adaptively selected from the search region, which reduces computational cost.
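The relation-modeling step in a two-stream Siamese tracker can be sketched as a cross-correlation: slide the template feature over the search-region feature and read the peak of the response map as the target location. This is a simplified numpy illustration (real trackers correlate deep CNN features, not raw pixels):

```python
import numpy as np

def cross_correlate(template, search):
    """Slide the template over the search region; the response map peaks
    where the search patch is most similar to the template."""
    th, tw = template.shape
    sh, sw = search.shape
    out = np.empty((sh - th + 1, sw - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(template * search[i:i + th, j:j + tw])
    return out

search = np.zeros((8, 8))
search[3:5, 4:6] = 1.0        # target located at row 3, col 4
template = np.ones((2, 2))    # appearance template of the target
resp = cross_correlate(template, search)
peak = np.unravel_index(np.argmax(resp), resp.shape)
```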
-
FPN (Feature Pyramid Network):
- An architecture where feature maps at different scales are processed in parallel (fine-grained details and coarse contextual information).
- The multi-scale feature maps are fused together (by concatenation or addition) to generate a comprehensive representation. FPN is backbone-agnostic.
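The top-down fusion can be sketched in numpy as follows. This is a simplified illustration: the lateral 1x1 convolutions are omitted and channel counts are assumed to already match, so fusion is plain element-wise addition after 2x upsampling.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_topdown(c3, c4, c5):
    """Fuse backbone features coarse-to-fine: upsample the coarser map
    and add it to the next finer map (lateral projections omitted)."""
    p5 = c5
    p4 = c4 + upsample2x(p5)
    p3 = c3 + upsample2x(p4)
    return p3, p4, p5

c5 = np.ones((4, 4, 8))      # coarsest, most semantic
c4 = np.ones((8, 8, 8))
c3 = np.ones((16, 16, 8))    # finest, most detailed
p3, p4, p5 = fpn_topdown(c3, c4, c5)
```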
-
Predictor Metrics
-
Metrics to evaluate the prediction quality of a tracker
- Precision = $\frac{TP}{TP + FP}$ (in all positive predictions, how many are correct)
- Recall = $\frac{TP}{TP + FN}$ (in all ground truth, how many are correctly predicted)
- F1 Score = $\frac{2 \times Precision \times Recall}{Precision + Recall}$ (harmonic mean of precision and recall)
- F1 is a better metric than accuracy when the dataset is imbalanced (e.g., 99% of the data is negative, and 1% is positive).
- F1 penalizes FN/FP more heavily than accuracy does, whereas accuracy is dominated by TN/TP on imbalanced data.
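The three formulas above can be computed directly from raw counts. The example counts below are hypothetical, chosen to show accuracy looking good on an imbalanced set while F1 exposes the weak positive class:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Imbalanced example: 990 TN, 5 TP, 5 FP, 5 FN.
# Accuracy = (990 + 5) / 1005 ~ 0.99, but precision/recall/F1 are only 0.5.
p, r, f1 = precision_recall_f1(tp=5, fp=5, fn=5)
```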
-
Aggregate Metrics:
- Take into account the confidence threshold of the prediction
- A 2D curve of precision-recall relation: lower confidence threshold will lead to higher recall, but lower precision
- AUC (area under curve) is the integral of the precision-recall curve
- AP (average precision) is the average precision at different recall levels.
- mAP (mean average precision) is the mean of APs across different classes.
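One common way to compute AP (a sketch of the ranking-based variant, not the only definition in use) averages the precision at each rank where a true positive is retrieved; ground-truth objects that are never detected contribute zero precision:

```python
def average_precision(scored, num_gt):
    """AP over detections scored = [(confidence, is_true_positive)],
    with num_gt ground-truth objects in total."""
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    tp = 0
    precisions = []
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        if is_tp:
            tp += 1
            precisions.append(tp / rank)   # precision at this recall level
    return sum(precisions) / num_gt        # missed GT count as 0 precision

# 3 ground-truth objects; detections by descending confidence: TP, FP, TP, FP, TP
ap = average_precision([(0.9, True), (0.8, False), (0.7, True),
                        (0.6, False), (0.5, True)], num_gt=3)
```

mAP would then be the plain mean of this quantity over all object classes.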
-
Multi-Modal Models
- OpenAI CLIP (Contrastive Language-Image Pre-training)
- A transferable visual model trained with text supervision; it can be used for image classification and for querying images with natural language.
- CLIP is pre-trained by maximizing the cosine similarity between each image and its corresponding text (while minimizing the similarity of mismatched pairs).
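The contrastive objective can be sketched as a symmetric cross-entropy over the batch's cosine-similarity matrix, where matched (image, text) pairs sit on the diagonal. This is a simplified numpy illustration with random embeddings standing in for the encoders; the temperature value is an assumption:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over the cosine-similarity matrix:
    the i-th image should match the i-th text (diagonal) and no other."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature     # scaled cosine similarities
    labels = np.arange(len(img))           # diagonal = matched pairs

    def xent(l):                           # row-wise softmax cross-entropy
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))
loss_aligned = clip_contrastive_loss(emb, emb)                      # matched pairs
loss_random = clip_contrastive_loss(emb, rng.normal(size=(4, 16)))  # unrelated text
```

Aligned pairs drive the loss toward zero; unrelated text embeddings leave it high, which is exactly the signal the pre-training exploits.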