---
license: gemma
library_name: keras-nlp
extra_gated_heading: Access PaliGemma on Hugging Face
extra_gated_prompt: >-
  To access PaliGemma on Hugging Face, you’re required to review and agree to
  Google’s usage license. To do this, please ensure you’re logged-in to Hugging
  Face and click below. Requests are processed immediately.
extra_gated_button_content: Acknowledge license
pipeline_tag: image-text-to-text
---

PaliGemma is a set of multi-modal large language models published by Google and based on the Gemma model. Both pre-trained and instruction-tuned models are available. See the model card below for benchmarks, data sources, and intended use cases.


## Links

* [PaliGemma API Documentation](https://keras.io/api/keras_nlp/models/pali_gemma/)
* [KerasNLP Beginner Guide](https://keras.io/guides/keras_nlp/getting_started/)
* [KerasNLP Model Publishing Guide](https://keras.io/guides/keras_nlp/upload/)

## Installation

Keras and KerasNLP can be installed with:

```
pip install -U -q keras-nlp
pip install -U -q "keras>=3"
```

JAX, TensorFlow, and Torch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the [Keras Getting Started](https://keras.io/getting_started/) page.

## Presets

The following model checkpoints are provided by the Keras team. Full code examples for each are available below.

| Preset name           | Parameters | Description                                                 |
|-----------------------|------------|-------------------------------------------------------------|
| [paligemma-3b-224-mix-keras](https://huggingface.co/google/paligemma-3b-224-mix-keras) | 2.92B      | image size 224, mix fine tuned, text sequence length is 256 |
| [paligemma-3b-448-mix-keras](https://huggingface.co/google/paligemma-3b-448-mix-keras) | 2.92B      | image size 448, mix fine tuned, text sequence length is 512 |
| [paligemma-3b-224-keras](https://huggingface.co/google/paligemma-3b-224-keras)     | 2.92B      | image size 224, pre trained, text sequence length is 128    |
| [**paligemma-3b-448-keras**](https://huggingface.co/google/paligemma-3b-448-keras)     | 2.92B      | image size 448, pre trained, text sequence length is 512    |
| [paligemma-3b-896-keras](https://huggingface.co/google/paligemma-3b-896-keras)     | 2.93B      | image size 896, pre trained, text sequence length is 512    |
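
The table above can also be kept as a small lookup, which helps when checking that input images match a preset's resolution. This is a minimal sketch: `PRESETS`, `preset_handle`, and `presets_for_image_size` are hypothetical names used for illustration and are not part of KerasNLP. The `hf://google/...` handle format is assumed from the loading example in this card.

```python
# Hypothetical lookup mirroring the presets table; not a KerasNLP API.
PRESETS = {
    "paligemma-3b-224-mix-keras": {"image_size": 224, "sequence_length": 256, "mix": True},
    "paligemma-3b-448-mix-keras": {"image_size": 448, "sequence_length": 512, "mix": True},
    "paligemma-3b-224-keras": {"image_size": 224, "sequence_length": 128, "mix": False},
    "paligemma-3b-448-keras": {"image_size": 448, "sequence_length": 512, "mix": False},
    "paligemma-3b-896-keras": {"image_size": 896, "sequence_length": 512, "mix": False},
}


def preset_handle(name):
    """Return the assumed `hf://` handle for a preset listed in the table."""
    if name not in PRESETS:
        raise KeyError(f"Unknown preset: {name!r}")
    return f"hf://google/{name}"


def presets_for_image_size(image_size):
    """List preset names whose expected image size equals `image_size`."""
    return [name for name, info in PRESETS.items() if info["image_size"] == image_size]
```

For example, `preset_handle("paligemma-3b-448-keras")` yields the handle string that could be passed to `from_preset()`.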

## Prompts

The PaliGemma `"mix"` models can handle a number of prompting structures out of the box. It is important to stick exactly to these prompts, including the trailing newline. `{lang}` can be a language code such as `"en"` or `"fr"`. Support for languages other than English varies by prompt type.

* `"cap {lang}\n"`: very raw short caption (from WebLI-alt).
* `"caption {lang}\n"`: coco-like short captions.
* `"describe {lang}\n"`: somewhat longer, more descriptive captions.
* `"ocr\n"`: optical character recognition.
* `"answer en {question}\n"`: question answering about the image contents.
* `"question {lang} {answer}\n"`: question generation for a given answer.
* `"detect {thing} ; {thing}\n"`: count objects in a scene.

Presets other than `"mix"` should be fine-tuned for a specific task.
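
Because the prompt formats listed above must be reproduced exactly, building them from templates can reduce typos. The following is a minimal sketch: `PROMPT_TEMPLATES`, `build_prompt`, and `build_detect_prompt` are hypothetical helpers for illustration, not part of KerasNLP.

```python
# Hypothetical templates copied from the prompt list; each ends with "\n".
PROMPT_TEMPLATES = {
    "cap": "cap {lang}\n",
    "caption": "caption {lang}\n",
    "describe": "describe {lang}\n",
    "ocr": "ocr\n",
    "answer": "answer en {question}\n",
    "question": "question {lang} {answer}\n",
}


def build_prompt(task, **fields):
    """Fill in the template for `task`; raises KeyError on unknown tasks or missing fields."""
    return PROMPT_TEMPLATES[task].format(**fields)


def build_detect_prompt(things):
    """Join one or more object names with ' ; ' as in the detect format."""
    return "detect " + " ; ".join(things) + "\n"


print(repr(build_prompt("caption", lang="en")))
print(repr(build_prompt("answer", question="where is the cow standing?")))
print(repr(build_detect_prompt(["cow", "bird"])))
```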

```
!pip install -U -q keras-nlp
```
Pick a backend of your choice:
```
import os
os.environ["KERAS_BACKEND"] = "jax"
```
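
Keras reads `KERAS_BACKEND` when it is first imported, so the variable needs to be set before `import keras` runs. The sketch below wraps that rule in a small guard. `select_backend` and `VALID_BACKENDS` are hypothetical names, not part of Keras, and the list is limited to the three backends named in this card.

```python
import os
import sys
import warnings

# The three backends mentioned in this card; the tuple is illustrative only.
VALID_BACKENDS = ("jax", "tensorflow", "torch")


def select_backend(name):
    """Set KERAS_BACKEND, warning if keras was already imported."""
    if name not in VALID_BACKENDS:
        raise ValueError(f"Unknown backend {name!r}; expected one of {VALID_BACKENDS}")
    if "keras" in sys.modules:
        # The backend is fixed at import time, so a late change is ignored.
        warnings.warn("keras is already imported; changing KERAS_BACKEND has no effect")
    os.environ["KERAS_BACKEND"] = name
    return name


select_backend("jax")
```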
Now we can load the PaliGemma "causal language model" from the Hugging Face Hub. A causal language model is an LLM that is ready for generation: it is trained with a causal mask and generates one token at a time in a recurrent loop.
```
import keras
import keras_nlp

# Use bfloat16 as the default float type.
keras.config.set_floatx("bfloat16")
pali_gemma_lm = keras_nlp.models.PaliGemmaCausalLM.from_preset(
    "hf://google/paligemma-3b-448-keras"
)
```
Define a function that reads an image from a given URL:
```
import io

import numpy as np
import PIL.Image
import requests

def read_image(url):
    contents = io.BytesIO(requests.get(url).content)
    image = PIL.Image.open(contents)
    image = np.array(image)
    # Remove the alpha channel if necessary.
    if image.shape[2] == 4:
        image = image[:, :, :3]
    return image
```
```
image_url = 'https://storage.googleapis.com/keras-cv/models/paligemma/cow_beach_1.png'
image = read_image(image_url)
```
Call `generate()` with a single image and prompt. The text prompt
must end with `\n`.
```
prompt = 'answer en where is the cow standing?\n'
output = pali_gemma_lm.generate(
    inputs={
        "images": image,
        "prompts": prompt,
    }
)
print(output)
```
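
Since a missing trailing newline is easy to overlook in hand-written prompts, a small guard can normalize them before calling `generate()`. This is a sketch only; `ensure_trailing_newline` is a hypothetical helper, not a KerasNLP function.

```python
def ensure_trailing_newline(prompt):
    """Return `prompt` with exactly one trailing newline."""
    return prompt.rstrip("\n") + "\n"


prompt = ensure_trailing_newline("answer en where is the cow standing?")
print(repr(prompt))
```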
Call `generate()` with batched images and prompts:
```
prompts = [
    'answer en where is the cow standing?\n',
    'answer en what color is the cow?\n',
    'describe en\n',
    'detect cow\n',
    'segment cow\n',
]
images = [image, image, image, image, image]
outputs = pali_gemma_lm.generate(
    inputs={
        "images": images,
        "prompts": prompts,
    }
)
for output in outputs:
    print(output)
```
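
When the same image is queried with several prompts, the inputs dictionary can be assembled programmatically instead of repeating the image by hand. The sketch below assumes, as in the batched example, that `generate()` expects one image per prompt. `make_generate_inputs` is a hypothetical helper, not part of KerasNLP.

```python
def make_generate_inputs(image, prompts):
    """Pair a single image with each prompt in `prompts`."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("At least one prompt is required")
    # Repeat the same image reference so both lists have equal length.
    return {"images": [image] * len(prompts), "prompts": prompts}


# A placeholder object stands in for a real image array here.
placeholder_image = object()
inputs = make_generate_inputs(placeholder_image, ["describe en\n", "detect cow\n"])
print(len(inputs["images"]), len(inputs["prompts"]))
```

The resulting dictionary has the same `"images"` and `"prompts"` keys used in the example above.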

Call `fit()` on a single batch:
```
import keras_nlp
import numpy as np

# Random placeholder image; 448 matches the image size of the preset loaded below.
image = np.random.uniform(-1, 1, size=(448, 448, 3))
x = {
    "images": [image, image],
    "prompts": ["answer en Where is the cow standing?\n", "caption en\n"],
}
y = {
    "responses": ["beach", "A brown cow standing on a beach next to the ocean."],
}
pali_gemma_lm = keras_nlp.models.PaliGemmaCausalLM.from_preset(
    "hf://google/paligemma-3b-448-keras"
)
pali_gemma_lm.fit(x=x, y=y, batch_size=2)
```