An 8B-parameter multimodal model that understands images and videos with GPT-4V-level performance.