---
license: cdla-permissive-2.0
datasets:
- mythicinfinity/libritts_r
- mythicinfinity/libritts
- keithito/lj_speech
- ginger-turmeric/LibriLight
- corvj/daps
language:
- en
base_model:
- descript/dac_24khz
tags:
- speech
- autoencoder
- tokenizer
- speech coding
- vocoder
---

## Model Summary
[DAC auto-encoder models](https://github.com/descriptinc/descript-audio-codec) provide a compact discrete tokenization of speech and audio signals that facilitates both signal generation by cascaded generative AI models (e.g., multi-modal generative AI models) and high-quality reconstruction of the original signals. [The current finetuned models](https://www.isca-archive.org/interspeech_2024/shechtman24_interspeech.pdf) improve upon the [original DAC models](https://github.com/descriptinc/descript-audio-codec) by providing a more compact representation of wide-band speech signals while preserving high-quality reconstruction. The models achieve speech reconstruction that is [nearly indistinguishable from PCM](https://ibm.biz/IS24SpeechRVQ) at a rate of 150-300 tokens per second (1500-3000 bps). [The evaluation](https://www.isca-archive.org/interspeech_2024/shechtman24_interspeech.pdf) used comprehensive English speech data covering different recording conditions, including studio settings.

| Model     | Speech Sample Rate    | Codebooks | Bit Rate  | Token Rate | Version |
| :---:     | :---:                 | :---:     | :---:     | :---:      | :---:   |
| weights_24khz_3.0kbps_v1.0.pth | 24 kHz   | 4 | 3 kbps   | 300 tokens/s | 1.0 |
| weights_24khz_1.5kbps_v1.0.pth | 24 kHz   | 2 | 1.5 kbps | 150 tokens/s | 1.0 |
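
The token and bit rates in the table follow directly from the codec configuration; a quick back-of-the-envelope check, assuming the underlying 24 kHz DAC encoder's ~75 analysis frames per second (hop of 320 samples) and 10-bit codebooks:

```py
# Back-of-the-envelope check of the rates above, assuming the underlying
# 24 kHz DAC encoder (hop of 320 samples -> 75 frames/s) and 10-bit codebooks.
frame_rate_hz = 24_000 / 320   # 75 analysis frames per second
bits_per_token = 10            # log2(1024-entry codebook)

for n_codebooks in (4, 2):
    token_rate = frame_rate_hz * n_codebooks   # tokens per second
    bit_rate = token_rate * bits_per_token     # bits per second
    print(f"{n_codebooks} codebooks: {token_rate:.0f} tokens/s, {bit_rate / 1000:.1f} kbps")
# 4 codebooks: 300 tokens/s, 3.0 kbps
# 2 codebooks: 150 tokens/s, 1.5 kbps
```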

## Usage
* Follow the [DAC](https://github.com/descriptinc/descript-audio-codec) installation instructions.

* Clone this repository:
```
git clone https://huggingface.co/ibm/DAC.speech.v1.0
cd DAC.speech.v1.0
```

### Compress audio
```
python3 -m dac encode /path/to/input --output /path/to/output/codes --weights_path weights_24khz_3.0kbps_v1.0.pth
```

This command creates a `.dac` file with the same name as each input file, preserving the directory structure relative to the input root and re-creating it under the output directory. Use `python -m dac encode --help` for more options.

### Reconstruct audio from compressed codes
```
python3 -m dac decode /path/to/output/codes --output /path/to/reconstructed_input --weights_path weights_24khz_3.0kbps_v1.0.pth
```

This command creates a `.wav` file with the same name as each input file, preserving the directory structure relative to the input root and re-creating it under the output directory. Use `python -m dac decode --help` for more options.
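
The same batch behavior can also be reproduced from Python using the API shown in the Programmatic Usage section below. A minimal sketch, assuming placeholder `in_dir`/`out_dir` paths and `.wav` inputs:

```py
# Minimal sketch: batch-compress every .wav under a source directory while
# mirroring its layout, using the API from the Programmatic Usage section.
# `in_dir` and `out_dir` are placeholder paths.
from pathlib import Path

import dac
from audiotools import AudioSignal

model = dac.DAC.load('weights_24khz_3.0kbps_v1.0.pth')
model.to('cuda')  # or 'cpu'

in_dir = Path('/path/to/input')
out_dir = Path('/path/to/output/codes')

for wav_path in in_dir.rglob('*.wav'):
    signal = AudioSignal(str(wav_path))        # loaded on CPU
    compressed = model.compress(signal)        # chunked encoding; handles long files
    target = out_dir / wav_path.relative_to(in_dir).with_suffix('.dac')
    target.parent.mkdir(parents=True, exist_ok=True)
    compressed.save(str(target))
```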

### Programmatic Usage
```py
import dac
from audiotools import AudioSignal

# Download a model
model_path = 'weights_24khz_3.0kbps_v1.0.pth'
model = dac.DAC.load(model_path)

model.to('cuda')

# Load audio signal file
signal = AudioSignal('input.wav')

# Encode audio signal as one long file
# (may run out of GPU memory on long files)
signal.to(model.device)

x = model.preprocess(signal.audio_data, signal.sample_rate)
z, codes, latents, _, _ = model.encode(x)

# Decode audio signal
y = model.decode(z)

# Alternatively, use the `compress` and `decompress` functions
# to compress long files.

signal = signal.cpu()
x = model.compress(signal)

# Save and load to and from disk
x.save("compressed.dac")
x = dac.DACFile.load("compressed.dac")

# Decompress it back to an AudioSignal
y = model.decompress(x)

# Write to file
y.write('output.wav')
```
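
For downstream token-based generative modeling (the use case described in the Model Summary), the discrete codes produced above can be inspected and reshaped directly. A brief sketch continuing from the variables in the example above; the shapes and value range are assumptions based on the 4-codebook, 1024-entry-codebook model:

```py
# The discrete tokens for downstream generative modeling are the `codes`
# returned by model.encode, or the `codes` attribute of a DACFile.
# Shapes/values below assume the 4-codebook model with 1024-entry codebooks.
print(codes.shape)    # (1, 4, T): 4 parallel token streams, ~75 frames/s each
print(x.codes.shape)  # same layout inside the DACFile produced by compress

# One possible flattening into a single token sequence (the interleaving
# strategy is up to the downstream model):
tokens = x.codes[0].T.reshape(-1)   # length T * 4, integer ids in [0, 1023]
```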

## Citing & Authors
        
If you find this model helpful, feel free to cite our publication [Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer](https://www.isca-archive.org/interspeech_2024/shechtman24_interspeech.pdf):
```bibtex 
@inproceedings{shechtman24_interspeech,
  title     = {Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer},
  author    = {Slava Shechtman and Avihu Dekel},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {4174--4178},
  doi       = {10.21437/Interspeech.2024-2366},
  issn      = {2958-1796},
}
```