Abstract: Current Text-to-audio (TTA) models mainly use coarse text descriptions as inputs to generate audio, which hinders models from generating audio with fine-grained control of content and style.
Some results have been hidden because they may be inaccessible to you
Show inaccessible results