Rough estimates for text generation?
Hi there,
I'm new to transformers, torch, and basically any ML development from the last decade and I'm trying to get back into it.
I've set up a Jupyter notebook with torch and CUDA enabled, and I have an RTX 2080 with 8 GB of VRAM. I'm not expecting blistering performance, but should that be sufficient to build a pipeline from a pretrained model and get it to give me answers in, say, less than 10 minutes?
This code runs without error in about 8 minutes or so (imports added for context; `InstructionTextGenerationPipeline` is from the Dolly repo's `instruct_pipeline.py`):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from instruct_pipeline import InstructionTextGenerationPipeline  # from the Dolly repo

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-12b", padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-12b",
    offload_folder="offload",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    load_in_8bit=True,
)
generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)
```
but `generate_text("tell a short story")` just seems to hang.
I thought the pipeline inference would be relatively quick compared to loading the model. Are my expectations wrong?
`device_map='auto'` is causing so much confusion. You don't have nearly enough GPU RAM to load the model, so it loads most of it on the CPU, and it works, but very slowly. Maybe we should just set the example to force CUDA 0 so it fails explicitly if it doesn't fit.
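A rough back-of-the-envelope calculation (my own sketch, not from the thread) shows why the 12B model can't fit on an 8 GB card even in 8-bit, which is what pushes most layers to the CPU:

```python
def model_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Rough weight-only memory footprint in GiB (ignores activations and KV cache)."""
    return n_params * bytes_per_param / 2**30

params = 12e9  # dolly-v2-12b: ~12 billion parameters
print(model_memory_gib(params, 1))  # 8-bit: ~11.2 GiB, already over an 8 GB card
print(model_memory_gib(params, 2))  # bf16:  ~22.4 GiB
```

Real usage is higher still, since activations and the KV cache need room on top of the weights.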
For 16 GB GPUs you can get it to load in 8-bit. For 8 GB it won't work. Use the 2.7B model?
To answer your question: it should be more like 10-20 seconds on an A10.
Hi @srowen, sorry to follow up on a closed discussion, but I'm wondering how to specify the `device_map` argument to force CUDA 0 and fail explicitly, as you suggested?
You just set `device="cuda:0"` then, and you don't need `accelerate` to figure out a device mapping in that case.
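For anyone landing here later, a minimal sketch of the single-GPU variant, assuming a smaller model that actually fits in 8 GB (I'm using `databricks/dolly-v2-3b` here as an example; `InstructionTextGenerationPipeline` again comes from the Dolly repo's `instruct_pipeline.py`):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from instruct_pipeline import InstructionTextGenerationPipeline  # from the Dolly repo

tokenizer = AutoTokenizer.from_pretrained("databricks/dolly-v2-3b", padding_side="left")

# No device_map="auto": move the whole model to GPU 0 explicitly.
# If the weights don't fit, this raises a CUDA out-of-memory error
# instead of silently offloading layers to the CPU.
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dolly-v2-3b", torch_dtype=torch.bfloat16
).to("cuda:0")

generate_text = InstructionTextGenerationPipeline(model=model, tokenizer=tokenizer)
print(generate_text("tell a short story"))
```

With everything on one device there is no CPU offload, so generation speed should be limited by the GPU rather than host-memory transfers.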
Thank you! That's clear and works like a charm.