Help with finetuning

#5
by manojbalaji1 - opened

We tried finetuning the model and we are getting the following error:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.


After setting CUDA_LAUNCH_BLOCKING=1, we are getting the following:
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [505,0,0], thread: [62,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [505,0,0], thread: [63,0,0] Assertion srcIndex < srcSelectDimSize failed.
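
For context, this assertion generally means that an index passed to an index-select operation (an embedding lookup, for example) is at or beyond the size of the dimension being indexed. One way to check for that on the CPU, before the batch reaches the GPU, is sketched below. It assumes the token ids are available as plain lists of integers; the function name `find_out_of_range_ids` is hypothetical and not part of any library.

```python
from typing import Iterable, List, Tuple


def find_out_of_range_ids(
    batches: Iterable[Iterable[int]], num_embeddings: int
) -> List[Tuple[int, int, int]]:
    """Return (row, position, token_id) for every id outside [0, num_embeddings)."""
    bad = []
    for row, ids in enumerate(batches):
        for pos, token_id in enumerate(ids):
            # An embedding table with num_embeddings rows accepts ids 0..num_embeddings-1.
            if token_id < 0 or token_id >= num_embeddings:
                bad.append((row, pos, token_id))
    return bad


if __name__ == "__main__":
    # Toy data: the table has 10 rows, so id 10 in the second row is invalid.
    print(find_out_of_range_ids([[0, 5, 9], [10, 3]], num_embeddings=10))
```

With a Hugging Face model, the table size can usually be read from `model.get_input_embeddings().num_embeddings`. If tokens were added to the tokenizer, `model.resize_token_embeddings(len(tokenizer))` is the usual way to grow the table. Inputs longer than the model's maximum number of positions may trigger the same assertion through the position embedding, so sequence length is worth checking as well.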


Any help appreciated.

P.S. We could do fairseq-based finetuning, but most of our data utility functions are already written for the Hugging Face model. So we wanted to make one final attempt at figuring this out before putting effort into moving to the fairseq-based model. Thanks in advance.
