Hector Garcia Rodriguez
Deep Learning has found great success in a variety of tasks relevant to humans, such as real-time dialogue translation or navigation assistance for visually impaired individuals. Recently, a single family of techniques, large language models (LLMs), has been shown to be extremely proficient across a wide variety of language-mediated problems. An equivalently powerful and versatile generalist approach for tasks combining several natural stimuli, such as visual, auditory, tactile and kinesthetic feedback, or video, is yet to be developed.
Natural language possesses a key characteristic that helps explain the earlier success of these approaches at the current data and compute scale: text is discrete, which yields a much smaller input and target space and renders its understanding and generation simpler and more efficient, even if less expressive. Other modalities inherently bear a higher information density, which makes them more descriptive, but limits the usable training size and requires more complex objectives. Efficiency techniques can therefore be crucial to unlock the potential of multimodal data.
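The gap in information density can be made concrete with a back-of-the-envelope comparison (the vocabulary and patch sizes below are illustrative assumptions, not figures from this proposal): a single token drawn from a BPE vocabulary of roughly 50k entries carries about 16 bits, whereas even a small raw image patch spans thousands of bits.

```python
import math

# Illustrative sizes (assumed): a GPT-style BPE vocabulary versus the
# raw value space of a small RGB image patch.
vocab_size = 50_257                      # common BPE vocabulary size
bits_per_token = math.log2(vocab_size)   # ~15.6 bits per discrete token

patch_values = 16 * 16 * 3               # 16x16 RGB patch, one byte per channel
bits_per_patch = patch_values * 8        # 6144 bits per raw patch

print(round(bits_per_token, 1), bits_per_patch)
```

Hundreds of times more raw bits per element is what makes dense modalities more descriptive but also more expensive to model, motivating the efficiency techniques discussed next.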
Among the wide variety of efficiency techniques applicable in Deep Learning, few are used in an input-dependent manner. Yet reducing the cost of processing easy samples and extending the compute budget for harder tasks would naturally provide more flexibility in finding a better cost-performance trade-off for a generalist agent that tackles a wide variety of tasks using different types of stimuli. On the one hand, techniques for adaptively reducing the cost of inferring simple inputs have not been shown to generalize across diverse tasks or modalities, or to be practical on current hardware accelerators. In particular, methods that leverage the permutation equivariance of the currently dominant architecture (transformers) can obtain efficiency gains by sparsifying the activation sequence, but have yet to be successfully applied in wider multimodal settings. On the other hand, increasing the decoding budget of language models by producing more output elements in the sequence before the final answer has proven a simple and effective way to improve model proficiency at language tasks (chain-of-thought, CoT). However, the mechanisms to learn, control and understand such a "reasoning" process remain very limited, and these techniques have not been satisfactorily extended to modalities beyond text.
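The input-dependent sparsification idea above can be sketched in a few lines. This is a minimal illustration, not the proposal's method: token importance is approximated by the activation norm (a common heuristic), and the fraction of tokens kept is scaled by a hypothetical upstream difficulty estimate, so easy inputs get a smaller compute budget.

```python
import numpy as np

def prune_tokens(x, keep_ratio):
    """Keep only the highest-scoring tokens of an activation sequence.

    x: (seq_len, d) array of per-token activations.
    keep_ratio: fraction of tokens to retain, in (0, 1].
    """
    scores = np.linalg.norm(x, axis=-1)       # proxy importance: activation norm
    k = max(1, int(len(x) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])   # top-k indices, original order kept
    return x[keep]

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))                  # toy sequence: 16 tokens, width 8

difficulty = 0.5                              # hypothetical estimate in [0, 1]
pruned = prune_tokens(x, keep_ratio=0.25 + 0.5 * difficulty)
print(pruned.shape)                           # (8, 8): half the tokens survive
```

Because transformer layers operate on sets of tokens (plus positional information), the surviving subsequence can be processed directly, and the saved computation scales with how many tokens are dropped per input.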
During the PhD, we will explore techniques that increase the proficiency of multimodal models in a wide range of human-relevant tasks by making the most of the available computational resources through the adaptive use of efficiency techniques.