Multimodality is King - Bridging the Gap Between Language and Vision in AI
(Header image: Midjourney)

LLMs are already yesterday's news - LMMs are the new frontier. Researchers from Microsoft have investigated the capabilities of OpenAI's multimodal model GPT-4V, with the 'V' standing for 'Vision'. The results documented in their recent paper are impressive.

Large multimodal models (LMMs) extend the capabilities of large language models (LLMs) by incorporating multisensory capabilities such as visual understanding to improve general intelligence.

Building upon the foundations of GPT-4 and its popular predecessor GPT-3.5 (the model behind the original ChatGPT), GPT-4V adds the ability to seamlessly integrate visual and textual information. This integration opens up exciting possibilities in human-computer interaction, particularly through the interpretation of visual cues, a feature known as visual referring prompting.

The Versatility of GPT-4V

GPT-4V's primary strength lies in its ability to process and understand both text and images. The model provides detailed, nuanced descriptions of images, effortlessly handles tasks that involve images with embedded text, and can, for example, solve mathematical problems presented as images. It offers flexible ways to combine text and images to address a wide range of queries.
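
As a rough illustration of what such an image-plus-text query looks like in practice, here is a minimal sketch using OpenAI's Python SDK. The image URL is a placeholder, and the exact model name and parameters may differ depending on your API access:

```python
# Minimal sketch: asking GPT-4V a question about an image.
# Assumes the OpenAI Python SDK (v1) and an API key in OPENAI_API_KEY;
# the image URL below is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # vision-capable model at the time of writing
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What is shown in this image, and what text does it contain?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/receipt.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)

print(response.choices[0].message.content)
```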

(Image source: The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision), 10/2023)


Combining Language and Visual Understanding

One of the most notable aspects of GPT-4V is its seamless integration of language and visual understanding. It overcomes language barriers with ease, allowing users to have images described in different languages and styles. Symbols, annotations, and handwritten text in images are accurately recognized and interpreted. The model can even handle inputs consisting of multiple interleaved blocks of text and images, processing them in a meaningful way to solve complex tasks, as the sketch below illustrates.
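
A hedged sketch of such an interleaved input, reusing the client from the previous example: the content array simply alternates text and image blocks, so the model sees them in order (the URLs are again placeholders):

```python
# Sketch of an interleaved image-text prompt: content blocks are
# processed in order, so the text can refer to the images around it.
# URLs are placeholders; `client` is reused from the sketch above.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Here is my fridge:"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/fridge.jpg"}},
            {"type": "text", "text": "And here is a recipe I found:"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/recipe.jpg"}},
            {"type": "text", "text": "Which ingredients am I still missing?"},
        ],
    }
]

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=messages,
    max_tokens=300,
)
print(response.choices[0].message.content)
```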

(Image source: The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision), 10/2023)


Limitations and Challenges

Although GPT-4V has remarkable capabilities, it is not without constraints. It sometimes struggles with spatial relationships, which affects its ability to accurately interpret geometric problems. Interestingly, images generated by GPT-4 itself pose a particular challenge. Understanding humor, a nuanced aspect of human communication, presents another hurdle: the model can describe the content of funny images but often misses the mark when explaining why they are funny - although, let's be honest, some humans struggle with this too...

Multimodal Plugins and Multimodal Chains

Embedding GPT-4V in a broader tool ecosystem opens the door to a wide range of possibilities. Multimodal plugins, such as Bing Image Search, are key to extending the capabilities of LMMs with up-to-date external knowledge. In addition, the concept of multimodal chains, which combine LMMs with plugins for advanced reasoning and interaction, paves the way for a more comprehensive understanding and analysis of multimodal data; a toy sketch of such a chain follows below.
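
To make the idea of a multimodal chain concrete, here is a hypothetical sketch, not taken from the paper: the model is invited to request reference images, a made-up `bing_image_search()` placeholder stands in for a real plugin, and its results are fed back for a final answer:

```python
# Hypothetical sketch of a simple multimodal chain: the LMM decides
# whether to call a plugin, the plugin result is appended to the
# conversation, and the LMM reasons over the combined context.
# `bing_image_search` is a made-up placeholder, not a real API.

def bing_image_search(query: str) -> list[str]:
    """Placeholder plugin: would return image URLs for the query."""
    return [f"https://example.com/search/{query.replace(' ', '_')}/1.jpg"]

def multimodal_chain(client, question: str, image_url: str) -> str:
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text",
                 "text": question
                 + "\nIf you need reference images, reply exactly 'SEARCH: <query>'."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]
    reply = client.chat.completions.create(
        model="gpt-4-vision-preview", messages=messages, max_tokens=300
    ).choices[0].message.content

    # If the model asked for a search, run the plugin and feed the results back.
    if reply.startswith("SEARCH:"):
        results = bing_image_search(reply.removeprefix("SEARCH:").strip())
        messages.append({"role": "assistant", "content": reply})
        messages.append({
            "role": "user",
            "content": [{"type": "text", "text": "Search results:"}]
                     + [{"type": "image_url", "image_url": {"url": u}}
                        for u in results],
        })
        reply = client.chat.completions.create(
            model="gpt-4-vision-preview", messages=messages, max_tokens=300
        ).choices[0].message.content
    return reply
```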

(Image source: The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision), 10/2023)


The Future of LMMs

Considering the speed of current developments, the potential of these AI models seems, at least at present, almost unlimited. They are expected to evolve toward generating interleaved image-text content, ushering in a new era of multimedia learning and content creation. The inclusion of further modalities, such as video, audio, and sensor data, will expand their capabilities. In addition, models such as GPT-4V are likely to learn from multiple sources, including web content and real-world environments, enabling continuous self-improvement.

Finally, the emergence of large multimodal models, of which GPT-4V is a prime example, will shape the AI landscape in 2023 and beyond. It will also be interesting to see what the upcoming launch of Google's Gemini brings; Google already hinted at the model's far-reaching multimodal capabilities a few weeks ago.

These models, with their unique ability to seamlessly combine language and image processing, will be a driving force for innovation across a wide range of applications. From automating insurance claims checks to improving the checkout experience in supermarkets, their potential for transformative change is undeniable. As the gap between technical capability and practical implementation narrows, we are excited to see the many applications that will emerge in the coming years to realize the full potential of these powerful AI systems.


#ai #genai #generativeai #gpt4v #multimodal #strategy #innovation #digitaltransformation


Sources:

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision), Microsoft, October 2023, arXiv:2309.17421
