By Heng Jiang | CMU 10-605 Machine Learning with Large Datasets
Introduction
Fine-tuning pre-trained large models presents significant computational challenges in the field of Machine Learning with Large Datasets. Scaling these processes and adapting them to specific domains are equally crucial and are being actively studied. To address the high computational cost of fine-tuning models with enormous numbers of parameters, Low-Rank Adaptation (LoRA) was introduced; it reduces the number of trainable parameters far below that of full fine-tuning and greatly alleviates the computational overhead. However, conventional LoRA has various limitations when tasked with the second challenge, adapting models to domain-specific tasks. To solve this problem, researchers have been modifying conventional LoRA, building on its low-rank adaptation framework while applying techniques that help it adapt to domain-specific tasks.
This blog explores the adaptation of LoRA to domain-specific challenges through three distinct domains and a comparative analysis of three corresponding state-of-the-art techniques, all within the LoRA framework but each featuring a tailored solution to a unique domain challenge: Conv-LoRA, in computer vision, enhances LoRA's capability to capture local spatial features; LongLoRA, in long-text comprehension, extends the context length of language models; and Mixture of LoRA Experts (MoLE), in knowledge composition, enables a dynamic, efficient, and low-cost mechanism for adjusting the influence of LoRAs applied to different layers, optimizing the composition of specialized LoRAs to meet diverse task-specific requirements within a single model.
By examining and comparing these effective adaptations of LoRA, this blog provides a more comprehensive understanding of the interaction between conventional frameworks and domain-specific learning tasks and offers a glimpse into the robustness of LoRA's foundational framework. Lastly, it aims to provide some value-added insights into how these domain-specific techniques could interplay and work in hybrid models, pointing to directions for future research.
Visual Models: The Application of Conv-LoRA for the Capture of Local Spatial Features
Conv-LoRA is an ingenious adaptation of the LoRA framework, specially designed to tackle unique challenges in vision-related tasks. While LoRA has transformed parameter-efficient fine-tuning (PEFT) by injecting trainable low-rank matrices into pre-trained models, its traditional design primarily targets fully connected layers, focusing on global representations. This approach works well in many domains but falls short in computer vision tasks that rely heavily on local spatial features. Conv-LoRA solves this problem by incorporating lightweight convolutional operations, unlocking a new dimension of efficiency and accuracy for vision models.
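To make the baseline concrete, here is a minimal sketch of the standard LoRA update for a single fully connected layer, written in PyTorch. The class and hyperparameter names (r, alpha) are illustrative defaults rather than values taken from any of the papers discussed here.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W0 x + (alpha / r) * B(A(x)), with W0 frozen."""
    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():          # freeze the pre-trained weights
            p.requires_grad = False
        self.A = nn.Linear(base_linear.in_features, r, bias=False)   # down-projection
        self.B = nn.Linear(r, base_linear.out_features, bias=False)  # up-projection
        nn.init.normal_(self.A.weight, std=0.02)
        nn.init.zeros_(self.B.weight)             # zero init: no change at the start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.B(self.A(x))
```

Only the small A and B matrices receive gradients; the pre-trained weight stays frozen, which is what makes the method parameter-efficient.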
Vision models, such as convolutional neural networks (CNNs) and Vision Transformers (ViTs), have demonstrated exceptional capabilities in extracting both global and localized features for tasks like image classification, segmentation, and object detection. However, traditional LoRA falls short in leveraging these localized patterns because it primarily operates within fully connected layers, focusing on global representations. This limitation causes LoRA to overlook a fundamental property of visual data: local dependencies. For example, detecting a cat in an image may require the model to combine global context with the ability to recognize smaller, localized features such as fur texture or whiskers. Without explicitly incorporating spatial structure, standard LoRA cannot fully capitalize on the hierarchical nature of visual information captured by CNNs and ViTs.
Conv-LoRA emerged from the recognition that convolutional operations, which are foundational to many vision models, can dramatically enhance LoRA's utility in vision tasks. By embedding trainable convolutional layers into the LoRA framework, Conv-LoRA aligns fine-tuning capabilities with the inherent characteristics of visual data. These convolutional filters are integrated into the architecture alongside the low-rank matrices typically used in LoRA, so the original LoRA's strength in efficiently adapting global representations is preserved while the filters add the capability to learn localized spatial features.
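The sketch below illustrates one way such a branch could look, assuming the lightweight convolution sits inside the low-rank bottleneck and that the input is a ViT-style token sequence arranged on a square grid. The multi-scale Mixture-of-Experts routing described later is omitted, and all names are illustrative rather than the authors' implementation.

```python
import math
import torch
import torch.nn as nn

class ConvLoRALinear(nn.Module):
    """Illustrative Conv-LoRA-style branch: a small convolution inside the
    low-rank bottleneck lets the adapter model local spatial structure."""
    def __init__(self, base_linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Linear(base_linear.in_features, r, bias=False)
        self.conv = nn.Conv2d(r, r, kernel_size=3, padding=1)  # lightweight local mixing
        self.B = nn.Linear(r, base_linear.out_features, bias=False)
        nn.init.zeros_(self.B.weight)
        self.scaling = alpha / r

    def forward(self, x):
        # x: (batch, num_tokens, dim), assuming a square token grid with no CLS token
        b, n, _ = x.shape
        h = w = int(math.sqrt(n))
        z = self.A(x)                                # (b, n, r)
        z = z.transpose(1, 2).reshape(b, -1, h, w)   # tokens -> 2D feature map
        z = self.conv(z)                             # inject local spatial priors
        z = z.reshape(b, -1, n).transpose(1, 2)      # back to a token sequence
        return self.base(x) + self.scaling * self.B(z)
```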
The effectiveness of Conv-LoRA has been validated across multiple computer vision benchmarks, showcasing its superiority in segmentation tasks across diverse domains such as medical imaging, agriculture, and remote sensing. Conv-LoRA consistently outperformed other parameter-efficient fine-tuning (PEFT) methods, including BitFit, SAM-Adapter, and standard LoRA, achieving higher segmentation metrics such as Jaccard Index (Jac) and Dice Similarity Coefficient (Dice). For instance, on the ISIC 2017 medical dataset, Conv-LoRA delivered significant improvements, particularly in fine-grained segmentation accuracy. The integration of Mixture of Experts (MoE) further enhanced performance by dynamically injecting local priors into feature maps at optimal scales, leading to a 1.54x training speedup and reduced memory usage compared to static multi-scale strategies. These results highlight Conv-LoRA’s ability to adapt to dataset-specific needs while maintaining a minimal parameter overhead.
While Conv-LoRA offers substantial performance gains for vision tasks, it comes with increased computational complexity and a narrower, more domain-specific scope of use. Despite these limitations, the reduced number of trainable parameters and the resulting efficiency make Conv-LoRA an attractive option for resource-constrained environments where accuracy is paramount.
In conclusion, Conv-LoRA is a powerful example of how domain-specific adaptations can expand the applicability of foundational techniques like LoRA. By infusing convolutional operations into the fine-tuning process, Conv-LoRA unlocks new potential for pre-trained vision models, enabling them to perform exceptionally well in tasks requiring a keen understanding of local spatial features. This adaptation paves the way for broader adoption of LoRA in the computer vision domain, making state-of-the-art models more accessible and efficient for a wide range of applications. Whether it’s detecting intricate patterns in medical images, enhancing autonomous vehicle perception, or enabling high-resolution satellite imagery analysis, Conv-LoRA stands out as a game-changing advancement in the field of machine learning.
Long Context Text Models: The Application of LongLoRA for Long-Sequence Tasks
Have you ever noticed that ChatGPT and other language models find it difficult to track long contexts? For example, when tasked with summarizing a long document or following long and complex text instructions, they often fail to capture the entirety of the context or miss instructions from early in the text as they progress into the latter parts. This is because these language models are trained with predefined context sizes: the maximum number of tokens the model can process in a single input, which essentially determines how much text the model can attend to at once. In simple words, when given long texts or pages and pages of words, the model doesn't get to look at or digest the entire document at once; it has to forget some parts before moving on to others. This is why users have the impression that GPT forgets previous content once it is occupied with new content. Language models are given context size limits because training them with long contexts is computationally expensive and requires extended time and hardware resources. Fine-tuning these models is also considerably expensive. This makes extended context lengths unaffordable or financially undesirable for individual researchers, institutions, and businesses. The question is, are there ways to extend the context window of these LLMs cost-effectively?
Of course, it is a good starting point to revisit LoRA, which greatly reduces the number of trainable parameters. However, there are a few problems with using conventional LoRA in this context. First, it is not effective: according to the LongLoRA authors, plain low-rank adaptation for long-context extension results in high perplexity even when the rank is increased, which essentially means LoRA by itself is unable to handle the complexity of long-context models even when its capacity is raised. Second, when it comes to efficiency, computational overhead grows dramatically with context size under the self-attention mechanism (which relates every token to every other token), so training hours under LoRA remain substantial. LongLoRA was introduced to address this inefficiency and ineffectiveness of conventional LoRA. It is an efficient fine-tuning approach (like conventional LoRA) that extends the context size without dramatically expanding computation cost. It features two additional capabilities: Shifted Sparse Attention (S2-Attn) and an improved LoRA for long contexts. It decreases the accuracy gap between conventional LoRA and full fine-tuning while maintaining a relatively low memory cost.
Shifted Sparse Attention (S2-Attn) is a fine-tuning-time substitute for standard self-attention that retains the original attention architecture during inference. It reduces computational overhead by dividing tokens into smaller groups and computing attention locally within those smaller, manageable groups. To make sure information still flows across groups, S2-Attn shifts the tokens for some of the attention heads; this shifted attention enables communication between token groups and access to a coherent overall context. During inference, the original attention mechanism is reactivated to ensure compatibility with existing infrastructure, and the shifting prevents the model from overfitting to specific attention patterns. Implementing Shifted Sparse Attention in practice takes only three steps: splitting the features along the head dimension into two chunks, shifting the tokens in one of the chunks by half of the group size (so that its groups slightly overlap with the other chunk's groups), and then splitting the tokens into groups and reshaping them into the batch dimension. Attention is computed locally within groups, but information flows across groups via the shift, ensuring overall coherence.
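The following is a simplified sketch of those three steps applied to a projected query/key/value tensor. It covers only the shift-and-group part of fine-tuning; the attention call itself, the reverse shift on the output, and the inference-time fallback to full attention are omitted, and the tensor layout is an assumption for illustration rather than the authors' code.

```python
import torch

def shift_and_group(qkv: torch.Tensor, group_size: int) -> torch.Tensor:
    """Apply the S2-Attn token shift and grouping to a projected qkv tensor.

    qkv: (batch, seq_len, 3, num_heads, head_dim), a hypothetical layout.
    Returns a tensor shaped (batch * seq_len // group_size, group_size, 3,
    num_heads, head_dim) so attention can be computed within each group.
    """
    b, n, three, h, d = qkv.shape
    # Step 1: split along the head dimension into two chunks.
    chunk_a, chunk_b = qkv.chunk(2, dim=3)
    # Step 2: shift tokens in the second chunk by half the group size, so its
    # groups straddle the boundaries of the first chunk's groups.
    chunk_b = chunk_b.roll(-group_size // 2, dims=1)
    shifted = torch.cat((chunk_a, chunk_b), dim=3)
    # Step 3: split tokens into groups and fold the groups into the batch
    # dimension; attention is then computed independently within each group.
    return shifted.reshape(b * n // group_size, group_size, three, h, d)
```

After local attention, the output would be reshaped back and the shifted half of the heads rolled in the opposite direction, so the two halves re-align before the next layer.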
As mentioned, although conventional LoRA decreases the number of parameters modified during fine-tuning, it does not perform well at large context lengths and results in overly high perplexity even when the rank is increased. To address this issue, the researchers introduced an improved LoRA that makes two additional groups of layers trainable: embedding layers and normalization layers. Embedding layers map input tokens into vector representations the model can process and are usually frozen under conventional LoRA. Normalization layers stabilize training and account for only a very small percentage (less than 0.1%) of total model parameters. However, making them trainable proved to have a significant effect on LoRA's adaptation to long contexts. According to the experiments, training the normalization and embedding layers narrows the perplexity gap between LoRA and full fine-tuning and improves performance, even though the number of additional parameters is small. The experiments also demonstrated that, when combined with the Shifted Sparse Attention mechanism, these additional trainable layers bring model performance close to that of full fine-tuning.
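In practice, this improved-LoRA recipe amounts to unfreezing a few extra parameter groups on top of the usual low-rank matrices. Below is a minimal sketch, assuming a Hugging Face-style model whose embedding and normalization parameters contain "embed" and "norm" in their names (naming conventions vary across implementations, so the string matches are an assumption).

```python
import torch.nn as nn

def unfreeze_embeddings_and_norms(model: nn.Module) -> int:
    """Make embedding and normalization layers trainable, as in LongLoRA's
    improved LoRA recipe; the LoRA matrices themselves are handled separately."""
    unfrozen = 0
    for name, param in model.named_parameters():
        if "embed" in name or "norm" in name:   # e.g. embed_tokens, input_layernorm
            param.requires_grad = True
            unfrozen += param.numel()
    return unfrozen  # number of extra trainable parameters, for sanity checking
```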
Cross-Domain Models: The Application of MoLE for Dynamic and Efficient LoRA Composition
Mixture of LoRA Experts (MoLE) solves the challenge of effectively combining multiple trained LoRAs by treating each layer of LoRAs as a separate expert. Each of these experts specializes in a specific type of knowledge, such as identifying textures or shapes in images or understanding grammatical structures in text. MoLE uses a mechanism called hierarchical weight control, enabled by learnable gating functions, to dynamically decide how much each expert contributes based on the specific task. This ensures that the unique strengths of each LoRA are preserved while working together to produce better results. At the same time, MoLE achieves this with minimal additional computational costs.
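A minimal sketch of what such a learnable gate over several LoRA experts at a single layer might look like is shown below. The gating design (a softmax over expert outputs, conditioned on a pooled summary of the input) follows the general description above rather than the exact architecture in the MoLE paper, and all class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class LoRAMixture(nn.Module):
    """Combine several frozen LoRA branches at one layer, with a learnable
    gate deciding how much each expert contributes to the output."""
    def __init__(self, base_layer: nn.Module, lora_experts: list, dim: int):
        super().__init__()
        self.base = base_layer
        self.experts = nn.ModuleList(lora_experts)      # each maps (..., dim) -> (..., out_dim)
        self.gate = nn.Linear(dim, len(lora_experts))   # learnable gating function

    def forward(self, x):
        # x: (batch, tokens, dim). Gate on a pooled summary of the input.
        weights = torch.softmax(self.gate(x.mean(dim=1)), dim=-1)        # (batch, num_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)   # (batch, tokens, out, E)
        mixed = (expert_out * weights[:, None, None, :]).sum(dim=-1)     # weighted combination
        return self.base(x) + mixed
```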
To understand knowledge composition, imagine you are building a team of specialists to solve a problem. Each LoRA layer represents one specialist: for example, one might excel in detecting colors in images, another in understanding textures, and another in identifying objects. MoLE acts as a smart project manager that dynamically adjusts how much each specialist contributes to the project. If the task shifts, such as focusing more on textures than objects, MoLE reallocates the weights assigned to each specialist, ensuring optimal results. Additionally, MoLE can remove certain specialists (LoRAs) if their expertise is not needed, redistributing the workload without having to retrain the entire system. This makes MoLE both flexible and efficient in handling diverse tasks.
Extensive experiments validate MoLE’s superiority over existing methods in both Natural Language Processing (NLP) and Vision & Language (V&L) domains. In the V&L domain, MoLE outperformed baseline methods such as normalized linear arithmetic composition (NLA) and SVDiff in multi-subject image generation tasks. MoLE achieved higher text- and image-alignment scores, meaning it better matched the provided descriptions with the generated images. Moreover, MoLE preserved fine details like textures and object shapes, as demonstrated in qualitative results where competing methods often failed to retain such details or introduced errors in the composition. For instance, SVDiff and NLA often mixed up features or omitted subjects, whereas MoLE consistently captured all elements accurately.
In the NLP domain, MoLE excelled in tasks such as translation, natural language inference, and structured text generation. On benchmarks like the Big-Bench Hard (BBH) dataset, MoLE achieved significant performance improvements over existing LoRA composition methods, such as LoRAHub and PEMs. It demonstrated robust generalization, scaling effectively to tasks that required combining knowledge from multiple LoRAs, and consistently outperformed other methods as the number of LoRAs increased.
One of MoLE’s key strengths is its scalability. While other methods often degrade in performance as the number of LoRAs increases, MoLE continues to perform well. For example, even when combining dozens of LoRAs, MoLE maintains superior performance, though challenges arise when the number becomes extremely large (e.g., 128), where all methods, including MoLE, show some decline. This limitation highlights the need for future research into handling very large-scale LoRA compositions.
MoLE also offers flexibility by allowing for different levels of control over the composition process. For instance, it can assign weights to individual layers (layer-wise gating) or larger groups of layers (block-wise gating), with intermediate granularities often yielding the best results. Additionally, the inclusion of a gating balancing loss ensures that MoLE does not overly rely on a few LoRAs, promoting balanced contributions from all experts. This balance enhances overall performance and prevents underutilization of certain LoRAs.
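The gating balancing loss can be illustrated with the kind of load-balancing term used in the sparse Mixture-of-Experts literature (Shazeer et al., cited below); MoLE's exact formulation may differ, so the snippet should be read as an assumption-laden sketch rather than the paper's loss.

```python
import torch

def gating_balance_loss(gate_weights: torch.Tensor) -> torch.Tensor:
    """Penalize imbalanced expert usage.

    gate_weights: (batch, num_experts) softmax outputs from a gate.
    Uses the squared coefficient of variation of per-expert importance,
    following Shazeer et al. (2017); added to the task loss with a small weight.
    """
    importance = gate_weights.sum(dim=0)                       # total weight routed to each expert
    return importance.var() / (importance.mean() ** 2 + 1e-10)
```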
In summary, MoLE represents a significant advancement in LoRA composition by combining multiple experts dynamically and efficiently. Its ability to adapt to different tasks, preserve individual strengths, and outperform competing methods makes it a valuable tool for applications in NLP, V&L, and beyond. While there are challenges in scaling to extreme numbers of LoRAs, MoLE sets a strong foundation for future innovations in this area, demonstrating the power of knowledge composition to achieve superior results.
Comments and Conclusions
The exploration of Conv-LoRA, LongLoRA, and MoLE reveals the adaptability of the LoRA framework. It lays a robust yet versatile foundation for aligning models with domain-specific tasks. Whether enhancing performance in visual, language, or multi-task applications, each of these adaptations demonstrates common principles of Machine Learning with Large Datasets: improving efficiency, scalability, and alignment. Given the modern approach of training a general-purpose model and then fine-tuning and aligning it for specific tasks and domains, the LoRA framework will continue to find its way into more application domains. The comparative analysis of the three techniques also reveals clues about how they could interplay, providing inspiration for one another and potentially working in hybrid forms to adapt to their respective domains even better. This potential synergy between techniques paves the way for future research.
For example, Shifted Sparse Attention from LongLoRA could potentially be applied to Conv-LoRA to reduce its computational overhead. In Conv-LoRA, the addition of a convolutional layer increases computational cost, particularly for the large or high-resolution images used in, for example, medical settings. This is analogous to the problem of long context windows, where the input is too large to be digested at once. The principles behind S2-Attn could be applied to Conv-LoRA in settings such as medical imaging by dividing the image (with pixels analogous to text tokens) into smaller groups and applying convolutional operations within those individual groups. Cross-group communication could be achieved by slightly overlapping the pixel groups, enabling coherence and attention to the overall context. This could reduce memory and computational costs while still enabling the detection of local spatial features such as textures and edges, and could let the convolutional layer prioritize regions of concern where they exist.
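Purely as an illustration of this hypothetical hybrid (it comes from neither paper), the sketch below applies a convolution to overlapping windows of a large feature map instead of the whole map at once, echoing how S2-Attn processes shifted token groups; the window and overlap sizes are arbitrary.

```python
import torch
import torch.nn as nn

def windowed_conv(x: torch.Tensor, conv: nn.Conv2d,
                  window: int = 64, overlap: int = 8) -> torch.Tensor:
    """Hypothetical hybrid: run a Conv-LoRA-style convolution over overlapping
    windows of a large image, averaging results where windows overlap.
    Assumes x is (batch, channels, H, W) with H, W >= window and that conv
    preserves the channel count and spatial size (e.g. 3x3 with padding=1)."""
    b, c, h, w = x.shape
    stride = window - overlap
    out = torch.zeros_like(x)
    count = torch.zeros_like(x)
    for top in range(0, h - window + 1, stride):
        for left in range(0, w - window + 1, stride):
            patch = x[:, :, top:top + window, left:left + window]
            out[:, :, top:top + window, left:left + window] += conv(patch)
            count[:, :, top:top + window, left:left + window] += 1
    return out / count.clamp(min=1)   # average overlapping contributions
```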
Another example of interplay is the potential application of MoLE's hierarchical weight control to Conv-LoRA and LongLoRA. MoLE's hierarchical weight control offers a powerful opportunity to enhance both models by dynamically managing the contributions of their specialized components. In Conv-LoRA, MoLE could treat convolutional filters as "experts," adjusting their influence based on task requirements, such as prioritizing fine-grained texture detection in medical imaging or broader spatial patterns for object detection in autonomous driving. For LongLoRA, MoLE could allocate weights to specific attention mechanisms or token groups, enabling context-aware adaptation in tasks like summarizing long documents or handling multi-domain inputs. Furthermore, MoLE could facilitate hybrid models that combine Conv-LoRA and LongLoRA for multi-modal tasks, such as vision-language applications or hierarchical multi-task learning, where visual and textual inputs need to be integrated dynamically. By allowing these techniques to interplay, MoLE introduces a scalable and flexible framework for optimizing performance across diverse domains.
More research on domain applications of the LoRA framework is being carried out in the field of Machine Learning with Large Datasets. One potential focus of future work is the hybrid application of techniques originally designed for distinct domains. Such approaches could unlock new possibilities in efficiency, scalability, and alignment, and could further broaden the accessibility of large pre-trained models across more domains.
References
“BigBench | Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data.” Accessed November 29, 2024. https://dl.acm.org/doi/abs/10.1145/2463676.2463712.
Chen, Tianrun, Lanyun Zhu, Chaotao Ding, Runlong Cao, Yan Wang, Zejian Li, Lingyun Sun, Papa Mao, and Ying Zang. “SAM Fails to Segment Anything? — SAM-Adapter: Adapting SAM in Underperformed Scenes: Camouflage, Shadow, Medical Image Segmentation, and More.” arXiv, May 2, 2023. https://doi.org/10.48550/arXiv.2304.09148.
Chen, Zhe, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. “Vision Transformer Adapter for Dense Predictions.” arXiv, February 13, 2023. https://doi.org/10.48550/arXiv.2205.08534.
Chen, Yukang, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. “LongLoRA: Efficient Fine-Tuning of Long-Context Large Language Models,” 2024.
Dosovitskiy, Alexey, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, et al. “An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale.” arXiv, June 3, 2021. https://doi.org/10.48550/arXiv.2010.11929.
Han, Ligong, Yinxiao Li, Han Zhang, Peyman Milanfar, Dimitris Metaxas, and Feng Yang. “SVDiff: Compact Parameter Space for Diffusion Fine-Tuning,” 7323–34, 2023. https://openaccess.thecvf.com/content/ICCV2023/html/Han_SVDiff_Compact_Parameter_Space_for_Diffusion_Fine-Tuning_ICCV_2023_paper.html.
Huang, Chengsong, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. “LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition.” arXiv, August 19, 2024. https://doi.org/10.48550/arXiv.2307.13269.
O’Shea, Keiron, and Ryan Nash. “An Introduction to Convolutional Neural Networks.” arXiv, December 2, 2015. https://doi.org/10.48550/arXiv.1511.08458.
Shazeer, Noam, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.” arXiv, January 23, 2017. https://doi.org/10.48550/arXiv.1701.06538.
Wu, Xun, Shaohan Huang, and Furu Wei. “Mixture of LoRA Experts,” 2024.
Zaken, Elad Ben, Shauli Ravfogel, and Yoav Goldberg. “BitFit: Simple Parameter-Efficient Fine-Tuning for Transformer-Based Masked Language-Models.” arXiv, September 5, 2022. https://doi.org/10.48550/arXiv.2106.10199.
