Thomasboosinger nielsr HF staff committed on
Commit
aa885a0
0 Parent(s):

Duplicate from google/owlv2-large-patch14-ensemble


Co-authored-by: Niels Rogge <[email protected]>

.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
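As a rough illustration of what these attribute lines imply, the sketch below checks which of the files in this commit would be routed through Git LFS. It relies on Python's `fnmatch`, which only approximates gitattributes pattern matching: it is adequate for the simple `*.ext` globs, but not for a path pattern such as `saved_model/**/*`. The helper name `is_lfs_tracked` and the shortened pattern list are assumptions made for the example.

```python
from fnmatch import fnmatch

# A handful of the patterns from the .gitattributes above (not the full list).
LFS_PATTERNS = ["*.bin", "*.safetensors", "*.onnx", "*.pt", "*.h5", "*tfevents*"]


def is_lfs_tracked(filename: str) -> bool:
    """Return True if the bare filename matches any LFS glob (approximation)."""
    return any(fnmatch(filename, pattern) for pattern in LFS_PATTERNS)


# Files added in this commit.
for name in ["pytorch_model.bin", "config.json", "vocab.json", "merges.txt"]:
    kind = "Git LFS" if is_lfs_tracked(name) else "regular git"
    print(f"{name}: {kind}")
```

Under these patterns only `pytorch_model.bin` is stored as an LFS pointer, which is consistent with its three-line diff in this commit.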
README.md ADDED
@@ -0,0 +1,95 @@
+ ---
+ license: apache-2.0
+ tags:
+ - vision
+ - zero-shot-object-detection
+ inference: false
+ ---
+
+ # Model Card: OWLv2
+
+ ## Model Details
+
+ The OWLv2 model (short for Open-World Localization) was proposed in [Scaling Open-Vocabulary Object Detection](https://arxiv.org/abs/2306.09683) by Matthias Minderer, Alexey Gritsenko, Neil Houlsby. OWLv2, like OWL-ViT, is a zero-shot text-conditioned object detection model that can be used to query an image with one or multiple text queries.
+
+ The model uses CLIP as its multi-modal backbone, with a ViT-like Transformer to get visual features and a causal language model to get the text features. To use CLIP for detection, OWL-ViT removes the final token pooling layer of the vision model and attaches a lightweight classification and box head to each transformer output token. Open-vocabulary classification is enabled by replacing the fixed classification layer weights with the class-name embeddings obtained from the text model. The authors first train CLIP from scratch and fine-tune it end-to-end with the classification and box heads on standard detection datasets using a bipartite matching loss. One or multiple text queries per image can be used to perform zero-shot text-conditioned object detection.
+
+
+ ### Model Date
+
+ June 2023
+
+ ### Model Type
+
+ The model uses a CLIP backbone with a ViT-L/14 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss. The CLIP backbone is trained from scratch and fine-tuned together with the box and class prediction heads with an object detection objective.
+
+
+ ### Documents
+
+ - [OWLv2 Paper](https://arxiv.org/abs/2306.09683)
+
+
+ ### Use with Transformers
+
+ ```python3
+ import requests
+ from PIL import Image
+ import torch
+
+ from transformers import Owlv2Processor, Owlv2ForObjectDetection
+
+ processor = Owlv2Processor.from_pretrained("google/owlv2-large-patch14-ensemble")
+ model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-large-patch14-ensemble")
+
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+ texts = [["a photo of a cat", "a photo of a dog"]]
+ inputs = processor(text=texts, images=image, return_tensors="pt")
+ outputs = model(**inputs)
+
+ # Target image sizes (height, width) to rescale box predictions [batch_size, 2]
+ target_sizes = torch.Tensor([image.size[::-1]])
+ # Convert outputs (bounding boxes and class logits) to absolute (xmin, ymin, xmax, ymax) boxes
+ results = processor.post_process_object_detection(outputs=outputs, threshold=0.1, target_sizes=target_sizes)
+
+ i = 0  # Retrieve predictions for the first image for the corresponding text queries
+ text = texts[i]
+ boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
+
+ # Print detected objects and rescaled box coordinates
+ for box, score, label in zip(boxes, scores, labels):
+     box = [round(coord, 2) for coord in box.tolist()]
+     print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
+ ```
+
+
+ ## Model Use
+
+ ### Intended Use
+
+ The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, text-conditioned object detection. We also hope it can be used for interdisciplinary studies of the potential impact of such models, especially in areas that commonly require identifying objects whose label is unavailable during training.
+
+ #### Primary intended uses
+
+ The primary intended users of these models are AI researchers.
+
+ We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models.
+
+ ## Data
+
+ The CLIP backbone of the model was trained on publicly available image-caption data. This was done through a combination of crawling a handful of websites and using commonly used pre-existing image datasets such as [YFCC100M](http://projects.dfki.uni-kl.de/yfcc100m/). A large portion of the data comes from our crawling of the internet. This means that the data is more representative of people and societies most connected to the internet. The prediction heads of OWL-ViT, along with the CLIP backbone, are fine-tuned on publicly available object detection datasets such as [COCO](https://cocodataset.org/#home) and [OpenImages](https://storage.googleapis.com/openimages/web/index.html).
+
+ (to be updated for v2)
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @misc{minderer2023scaling,
+   title={Scaling Open-Vocabulary Object Detection},
+   author={Matthias Minderer and Alexey Gritsenko and Neil Houlsby},
+   year={2023},
+   eprint={2306.09683},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV}
+ }
+ ```
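The README's description of open-vocabulary classification (class-name embeddings standing in for fixed classification weights) can be pictured with a toy example. The sketch below is not the model's actual head: it scores one image-token embedding against text-query embeddings by cosine similarity and omits any learned scaling or thresholding the real model may apply. The function names and the 3-dimensional vectors are made up for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def classify_token(token_embedding, query_embeddings):
    """Score one image-token embedding against each text-query embedding."""
    token = l2_normalize(token_embedding)
    scores = []
    for query in query_embeddings:
        unit_query = l2_normalize(query)
        scores.append(sum(t * q for t, q in zip(token, unit_query)))
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy embeddings (hypothetical values; real embeddings are far larger).
queries = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
}
token = [0.9, 0.1, 0.0]

best, scores = classify_token(token, list(queries.values()))
print(list(queries)[best], [round(s, 3) for s in scores])
```

Because the "class weights" are just text embeddings, changing the set of detectable classes amounts to passing different query strings, with no retraining of a classification layer.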
added_tokens.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "!": 0,
+   "<|endoftext|>": 49407,
+   "<|startoftext|>": 49406
+ }
config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "architectures": [
+     "Owlv2ForObjectDetection"
+   ],
+   "initializer_factor": 1.0,
+   "logit_scale_init_value": 2.6592,
+   "model_type": "owlv2",
+   "projection_dim": 768,
+   "text_config": {
+     "hidden_size": 768,
+     "intermediate_size": 3072,
+     "model_type": "owlv2_text_model",
+     "num_attention_heads": 12
+   },
+   "torch_dtype": "float32",
+   "transformers_version": "4.35.0.dev0",
+   "vision_config": {
+     "hidden_size": 1024,
+     "image_size": 1008,
+     "intermediate_size": 4096,
+     "model_type": "owlv2_vision_model",
+     "num_attention_heads": 16,
+     "num_hidden_layers": 24,
+     "patch_size": 14
+   }
+ }
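A few quantities follow directly from the `vision_config` values above. The short calculation below is only arithmetic on the numbers in this file; the interpretation that each patch token yields one candidate box rests on the README's statement that a box head is attached to every output token.

```python
# Values copied from vision_config in config.json.
vision_config = {
    "hidden_size": 1024,
    "image_size": 1008,
    "patch_size": 14,
    "num_attention_heads": 16,
}

# Number of non-overlapping patches along one image side and in total.
patches_per_side = vision_config["image_size"] // vision_config["patch_size"]
num_patches = patches_per_side ** 2

# Width of each attention head in the vision encoder.
head_dim = vision_config["hidden_size"] // vision_config["num_attention_heads"]

print(patches_per_side, num_patches, head_dim)  # 72 5184 64
```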
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
preprocessor_config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "do_normalize": true,
+   "do_pad": true,
+   "do_rescale": true,
+   "do_resize": true,
+   "image_mean": [
+     0.48145466,
+     0.4578275,
+     0.40821073
+   ],
+   "image_processor_type": "Owlv2ImageProcessor",
+   "image_std": [
+     0.26862954,
+     0.26130258,
+     0.27577711
+   ],
+   "processor_class": "Owlv2Processor",
+   "resample": 2,
+   "rescale_factor": 0.00392156862745098,
+   "size": {
+     "height": 1008,
+     "width": 1008
+   }
+ }
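To make the `rescale_factor`, `image_mean`, and `image_std` fields concrete, the sketch below applies rescaling and per-channel normalization to a single RGB pixel. It is a simplified stand-in, not the `Owlv2ImageProcessor` implementation: padding and resizing are left out, and the helper name `preprocess_pixel` is hypothetical.

```python
# Values copied from preprocessor_config.json.
RESCALE_FACTOR = 0.00392156862745098  # numerically 1 / 255
IMAGE_MEAN = [0.48145466, 0.4578275, 0.40821073]
IMAGE_STD = [0.26862954, 0.26130258, 0.27577711]


def preprocess_pixel(rgb):
    """Rescale one 8-bit RGB pixel to [0, 1], then normalize each channel."""
    rescaled = [channel * RESCALE_FACTOR for channel in rgb]
    return [
        (value - mean) / std
        for value, mean, std in zip(rescaled, IMAGE_MEAN, IMAGE_STD)
    ]


print(preprocess_pixel((255, 255, 255)))  # white pixel: all channels positive
print(preprocess_pixel((0, 0, 0)))        # black pixel: all channels negative
```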
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2934e1f32c68b49f62e9b7a415c22080a8bf197c50c6f4408f4a60e21e0be252
+ size 1750647637
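The three lines above are a Git LFS pointer, not the weights themselves. A downloaded `pytorch_model.bin` could be checked against it roughly as sketched below; the function names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the snippet assumes the file is already on local disk.

```python
import hashlib

# Pointer contents copied from the diff above.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:2934e1f32c68b49f62e9b7a415c22080a8bf197c50c6f4408f4a60e21e0be252
size 1750647637
"""


def parse_lfs_pointer(text):
    """Split 'key value' lines into a dict with the hash algorithm separated out."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Hash a local file in chunks and compare digest and size with the pointer."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
```

Calling `matches_pointer("pytorch_model.bin", pointer)` on a complete download should return `True`; a truncated or corrupted file would fail either the size or the digest comparison.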
special_tokens_map.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "bos_token": "<|startoftext|>",
+   "eos_token": "<|endoftext|>",
+   "pad_token": "!",
+   "unk_token": "<|endoftext|>"
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,41 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "!",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "49406": {
+       "content": "<|startoftext|>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "49407": {
+       "content": "<|endoftext|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     }
+   },
+   "additional_special_tokens": [],
+   "bos_token": "<|startoftext|>",
+   "clean_up_tokenization_spaces": true,
+   "do_lower_case": true,
+   "eos_token": "<|endoftext|>",
+   "errors": "replace",
+   "model_max_length": 16,
+   "pad_token": "!",
+   "processor_class": "Owlv2Processor",
+   "tokenizer_class": "CLIPTokenizer",
+   "tokenizer_file": "/Users/nielsrogge/.cache/huggingface/hub/models--openai--clip-vit-base-patch32/snapshots/e6a30b603a447e251fdaca1c3056b2a16cdfebeb/tokenizer.json",
+   "unk_token": "<|endoftext|>"
+ }
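Taken together, `model_max_length`, `pad_token`, and the token ids in `added_tokens.json` suggest how a text query ends up as a fixed-length sequence. The sketch below illustrates that layout under the assumption that each query is wrapped as BOS + content + EOS and then padded with id 0 up to 16 positions; it does not call the real `CLIPTokenizer`, and the content ids are made up.

```python
# Ids taken from added_tokens.json; length from tokenizer_config.json.
BOS_ID, EOS_ID, PAD_ID = 49406, 49407, 0
MAX_LENGTH = 16


def pad_query(token_ids):
    """Wrap content ids with BOS/EOS, truncate to fit, and pad to MAX_LENGTH."""
    content = list(token_ids)[: MAX_LENGTH - 2]  # leave room for BOS and EOS
    ids = [BOS_ID, *content, EOS_ID]
    padding = MAX_LENGTH - len(ids)
    attention_mask = [1] * len(ids) + [0] * padding
    return ids + [PAD_ID] * padding, attention_mask


# Made-up content ids standing in for a tokenized query.
ids, mask = pad_query([11, 22, 33, 44, 55])
print(ids)
print(mask)
```

With a limit of 16 positions, at most 14 content tokens survive in this layout, so long free-form queries would be cut short.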
vocab.json ADDED
The diff for this file is too large to render. See raw diff