Getting started with Gemini/Gemma & Multimodal

Lesly Zerna
5 min read · May 24, 2024


Hello world, and welcome to another blog about LLMs 🗣️, in this case Google's LLM "Gemini" and "Gemma" (the open-model version), plus some important concepts and examples about multimodality. I'll try to keep it short, but this topic already covers a lot 😜. Psst: you'll find some resources at the end of this blog to learn more about these topics.

Let’s dive in.

LLM: Large Language Models

Imagine an AI that can generate human-quality writing, translate languages flawlessly, and answer your questions in an informative way. That’s the power of LLMs! These AI models are trained on massive amounts of text data, allowing them to understand and process language with remarkable sophistication. (written by Gemini)

LLMs are expected to become even more versatile and impactful 🧐

https://gemini.google.com/

Gemini is a multimodal LLM developed by Google AI.
It can handle text, images, and code together, making it an invaluable asset for applications like generating code from natural language descriptions, creating captions for images, and summarizing video content. 🤓
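To make that concrete, here is a minimal sketch of calling Gemini from Python with the google-generativeai SDK to caption an image. The model name and file path are placeholders, and you'd need an API key from Google AI Studio:

```python
import google.generativeai as genai
import PIL.Image

# Configure the SDK with an API key from Google AI Studio
genai.configure(api_key="YOUR_API_KEY")

# A multimodal Gemini model (model names may change over time)
model = genai.GenerativeModel("gemini-1.5-flash")

# Mix a text instruction and an image in a single request
image = PIL.Image.open("mountain_photo.jpg")  # placeholder image
response = model.generate_content(["Write a short caption for this image.", image])
print(response.text)
```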

https://ai.google.dev/gemma

Gemma: a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.
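Since Gemma's weights are openly available, you can also try it locally. A minimal sketch using the Hugging Face transformers library, assuming you have accepted the Gemma license on Hugging Face and picked the instruction-tuned 2B variant:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Gemma 2B instruction-tuned checkpoint (requires accepting the license on Hugging Face)
model_id = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generate a short completion locally
inputs = tokenizer("Suggest three mountains to hike near La Paz, Bolivia.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```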

Gemini vs Gemma

Source: https://g.co/gemini/share/a3b41f95a0e4

Multimodal 🖼 📹 📝

LLMs are really good at working with text; however, the real world is rich with other forms of data such as images, audio, and video.

Multimodal learning allows AI models to process and understand these different data types, leading to a more comprehensive understanding of the world. 🗺

Now, with those definitions in place, we are ready for a small experiment with multimodality and Gemini, which we'll run in Google AI Studio.

Google AI Studio

Google AI Studio is a good option to get started! Its user-friendly interface makes it easy to build and deploy machine learning models, even for those without extensive coding experience.

You only need a Gmail account to get started (and yes, I use dark mode a lot 🙊).

Alright, now we are ready for a quick experiment 🔬

So, this is the context: last April, I went on a short vacation to visit Machu Picchu 🙌
I met lots of tourists who usually go to Cusco, Peru, and its surroundings, and from there many continue to Bolivia 🇧🇴, mostly to visit the Salt Flats in Uyuni, though some are also interested in mountain hiking 🗻 activities! (We have lots of mountains near La Paz.)
So, I exchanged information about my favorite outdoor activities in Bolivia, and of course I said I'd send updated info about hiking, prices, etc. I got useful information for climbing some mountains; however, most of this info is in Spanish, so 🤓 I saw a great opportunity to ask the AI for help with it.

I decided to try Gemini's multimodal capabilities in Google AI Studio.

Screenshot captured in April 2024. The Google AI Studio interface might differ a little today.

What you see in the image above is the prompt, the "System Instructions", and the result (the highlighted text). Let's look at some details:

  • prompt:
    "According to the attached document, I need the following information:
    - Level of difficulty for this activity
    - How many days this trip takes
    - How much it costs
    - The different altitudes reached while walking
    - What is included in this trip"
  • 'my information': a file uploaded in PDF format (this document describes the climbing activity; it has pictures, and all the text is written in Spanish).
  • and then RUN 🏃 🏃‍♀

And voilà, I got a very nice answer, in English, to all of those questions (see the prompt) from my PDF file (all the info about the climbing activity).

So, with these multimodal options (image, video, audio, file…), you can ask for specific information via the prompt.
For example, I could have used only the text prompt "give me information about climbing mountains in Bolivia, along with prices and activities" and I would surely get an answer, but in this specific experiment I wanted answers from my PDF file, which contains the information for a specific tour I found.

It looks simple, but… I can grab a code template to take this "experiment" further.

A code template in Python ❤! How cool 😻 would it be to have a 'project' based on this 'idea' and build a nicer, super user-friendly interface, where you can select which tour info you want (even if it is in Spanish) and ask questions about it in English or any other language and get good answers!
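Here is a rough sketch of what that template looks like when adapted to my experiment. The file name, model name, and API key are placeholders, and the File API upload is how the google-generativeai SDK handles documents like PDFs:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

# Upload the Spanish-language tour PDF via the File API
tour_pdf = genai.upload_file(path="tour_info.pdf")  # placeholder file name

prompt = """According to the attached document, I need the following information:
- Level of difficulty for this activity
- How many days this trip takes
- How much it costs
- The different altitudes reached while walking
- What is included in this trip"""

# Ask in English about a document written in Spanish
response = model.generate_content([tour_pdf, prompt])
print(response.text)
```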

I'm thinking of using Streamlit or Hugging Face's Gradio to build a nicer interface (work in progress 🐒).
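A minimal Streamlit sketch of that idea could look like this. The secret name and the temp-file handling are my assumptions; it just wires the Gemini call from above to a file uploader and a text box:

```python
import streamlit as st
import google.generativeai as genai

# Assumes the API key is stored in Streamlit secrets under this (hypothetical) name
genai.configure(api_key=st.secrets["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

st.title("Bolivia Tour Q&A 🗻")
uploaded = st.file_uploader("Upload a tour PDF (any language)", type="pdf")
question = st.text_input("Ask a question in English (or any other language)")

if uploaded and question:
    # The File API expects a path, so write the upload to a local file first
    with open("uploaded_tour.pdf", "wb") as f:
        f.write(uploaded.read())
    doc = genai.upload_file(path="uploaded_tour.pdf")
    response = model.generate_content([doc, question])
    st.write(response.text)
```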

In any case, I'm so happy 😼 to have had the chance to "prototype" this idea in Google AI Studio, using the LLM "Gemini", and to get more ideas for building a digital product 🚀

This is a simple experiment that helps in understanding multimodality, Gemini, and prompting. Looking forward to more experiments!

What's next? Resources to keep learning! 🤓

As promised, here are some resources to learn more 🚀

Till next blogpost :)
