Liheng Chen


I am an undergraduate student researcher at the University of Hong Kong (HKU). My research interests include Parameter-Efficient Fine-Tuning (PEFT) and conditional non-autoregressive text generation. I am also interested in 📷 photography and ⛰️ hiking.

Love is the one thing we're capable of perceiving that transcends dimensions of time and space

Dr. Amelia Brand in Interstellar

🎓 Education
  • University of Hong Kong
    Bachelor of Engineering (Computer Science)
    Sep. 2021 - Jul. 2025
  • University of California, Berkeley
    Visiting Student
    Jan. 2024 - May 2024
  • Fudan University
    School of Economics (SOE) Winter School
    Dec. 2022 - Jan. 2023
🎊 Honors & Awards
  • Teaching Development and Language Enhancement Grant (TDLEG)
    2024
  • HKU Reaching Out Award (ROA) Exchange Scholarship
    2024
  • Dean's Honors List, Department of Computer Science, HKU
    2022
  • China Soong Ching Ling Foundation Zhiyuan Bursary Recipient
    2021
🔬 Experience
  • University of Hong Kong
    Research Assistant
    Jul. 2021 - Feb. 2024
News
  • Sep 24, 2024: 🔥🔥🔥 DoT is accepted at NeurIPS 2024 (Poster)!
  • May 15, 2024: 🔥🔥🔥 ProLoRA is accepted at ACL 2024 (Main conference)!
  • May 15, 2024: 🔥🔥🔥 HiddenKey is accepted at ACL 2024 (Findings)! See you in 🏝️🥥 Thailand!
Selected Publications (view all)
MoS: Unleashing Parameter Efficiency of Low-Rank Adaptation with Mixture of Shards

Sheng Wang*, Liheng Chen*, Pengan Chen, Jingwei Dong, Boyang Xue, Jiyue Jiang, Lingpeng Kong, Chuan Wu (* equal contribution)

arXiv preprint, 2024

The rapid scaling of large language models necessitates more lightweight finetuning methods to reduce the explosive GPU memory overhead when numerous customized models are served simultaneously. Targeting more parameter-efficient low-rank adaptation (LoRA), parameter sharing presents a promising solution. Empirically, our research into high-level sharing principles highlights the indispensable role of differentiation in reversing the detrimental effects of pure sharing. Guided by this finding, we propose Mixture of Shards (MoS), incorporating both inter-layer and intra-layer sharing schemes, and integrating four nearly cost-free differentiation strategies, namely subset selection, pair dissociation, vector sharding, and shard privatization. Briefly, it selects a designated number of shards from global pools with a Mixture-of-Experts (MoE)-like routing mechanism before sequentially concatenating them into low-rank matrices. Hence, it retains all the advantages of LoRA while offering enhanced parameter efficiency, and effectively circumvents the drawbacks of peer parameter-sharing methods. Our empirical experiments demonstrate approximately 8x parameter savings in a standard LoRA setting. The ablation study confirms the significance of each component. Our insights into parameter sharing and the MoS method may illuminate future developments of more parameter-efficient finetuning methods.
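
A minimal, hypothetical sketch of the shard-sharing idea described above, using a soft relaxation of the routing step; the class names, shapes, and routing rule are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn


class SharedShardPool(nn.Module):
    """Global pool of d-dimensional shards shared across layers."""

    def __init__(self, num_shards: int, dim: int):
        super().__init__()
        self.shards = nn.Parameter(torch.randn(num_shards, dim) * 0.02)


class MoSLoRAFactor(nn.Module):
    """Builds one rank-r LoRA factor by routing over shards in a shared pool."""

    def __init__(self, pool: SharedShardPool, rank: int):
        super().__init__()
        self.pool = pool
        # MoE-like router: learned logits over the pool for each of the r rows.
        self.router_logits = nn.Parameter(torch.zeros(rank, pool.shards.shape[0]))

    def forward(self) -> torch.Tensor:
        # Soft relaxation of "select shards, then concatenate them row-wise":
        # each row of the r x d factor is a routed mixture of pool shards.
        weights = torch.softmax(self.router_logits, dim=-1)  # (rank, num_shards)
        return weights @ self.pool.shards                    # (rank, dim)


pool = SharedShardPool(num_shards=64, dim=768)
lora_A = MoSLoRAFactor(pool, rank=8)   # one low-rank factor for one layer
print(lora_A().shape)                  # torch.Size([8, 768])
```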

Diffusion of Thought: Chain-of-Thought Reasoning in Diffusion Language Models

Jiacheng Ye*, Shansan Gong*, Liheng Chen*, Lin Zheng, Jiahui Gao, Han Shi, Chuan Wu, Xin Jiang, Zhenguo Li, Wei Bi, Lingpeng Kong (* equal contribution)

Annual Conference on Neural Information Processing Systems (NeurIPS 2024)

Recently, diffusion models have garnered significant interest in the field of text processing due to their many potential advantages compared to conventional autoregressive models. In this work, we propose Diffusion-of-Thought (DoT), a novel approach that integrates diffusion models with Chain-of-Thought, a well-established technique for improving the reasoning ability of autoregressive language models. In contrast to autoregressive language models that make decisions in a left-to-right, token-by-token manner, DoT allows reasoning steps to diffuse over time through a diffusion language model and offers greater flexibility in trading off computation for reasoning performance. Our experimental results demonstrate the effectiveness of DoT in multi-digit multiplication, boolean logic, and grade school math problems, with a small diffusion model outperforming a much larger autoregressive model in both efficiency and accuracy. In addition, DoT showcases promising self-correction abilities and benefits from existing reasoning-enhancing techniques like self-consistency decoding. Our findings contribute to the understanding and development of reasoning with diffusion language models.
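
As a rough illustration of refining reasoning steps in parallel rather than decoding them left to right, here is a toy iterative-unmasking loop; the MASK_ID constant, the unmasking schedule, and the model interface are my assumptions, not the DoT codebase:

```python
import torch

MASK_ID = 0  # hypothetical id of a [MASK]/noise token


def denoise_step(model, tokens: torch.Tensor, step: int, num_steps: int) -> torch.Tensor:
    """One reverse-diffusion-style step: re-predict every masked position in parallel."""
    logits = model(tokens)                                  # (batch, seq_len, vocab)
    preds = logits.argmax(dim=-1)
    # Unmask a growing fraction of positions at each step (a very simple schedule).
    keep = torch.rand(tokens.shape) < (step + 1) / num_steps
    return torch.where(keep & (tokens == MASK_ID), preds, tokens)


def generate_rationale(model, prompt: torch.Tensor, rationale_len: int, num_steps: int = 8):
    """Start from a fully masked rationale and refine it over num_steps passes."""
    masked = torch.full((prompt.shape[0], rationale_len), MASK_ID, dtype=torch.long)
    tokens = torch.cat([prompt, masked], dim=1)
    for step in range(num_steps):       # more steps trade compute for better reasoning
        tokens = denoise_step(model, tokens, step, num_steps)
    return tokens


# Demo with a dummy "model" that just returns random logits over a toy vocabulary.
dummy_model = lambda toks: torch.randn(toks.shape[0], toks.shape[1], 100)
out = generate_rationale(dummy_model, prompt=torch.ones(1, 4, dtype=torch.long), rationale_len=16)
print(out.shape)                        # torch.Size([1, 20])
```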

How Far Can Cantonese NLP Go? Benchmarking Cantonese Capabilities of Large Language Models

Jiyue Jiang, Liheng Chen, Pengan Chen, Sheng Wang, Qinghang Bao, Lingpeng Kong, Yu Li, Chuan Wu

arXiv preprint, 2024

The rapid evolution of large language models (LLMs) has transformed the competitive landscape in natural language processing (NLP), particularly for English and other data-rich languages. However, underrepresented languages like Cantonese, spoken by over 85 million people, face significant development gaps, which is particularly concerning given the economic significance of the Guangdong-Hong Kong-Macau Greater Bay Area and the substantial Cantonese-speaking populations in places like Singapore and North America. Despite its wide use, Cantonese has scant representation in NLP research, especially compared to other languages from similarly developed regions. To bridge these gaps, we outline current Cantonese NLP methods and introduce new benchmarks designed to evaluate LLM performance in factual generation, mathematical logic, complex reasoning, and general knowledge in Cantonese, which aim to advance open-source Cantonese LLM technology. We also propose future research directions and recommended models to enhance Cantonese LLM development.

PRoLoRA: Partial Rotation Empowers More Parameter-Efficient LoRA

Sheng Wang, Boyang Xue, Jiacheng Ye, Jiyue Jiang, Liheng Chen, Lingpeng Kong, Chuan Wu

Annual Meeting of the Association for Computational Linguistics (ACL 2024)

With the rapid scaling of large language models (LLMs), serving numerous LoRAs concurrently has become increasingly impractical, leading to unaffordable costs and necessitating more parameter-efficient finetuning methods. In this work, we introduce Partially Rotation-enhanced Low-Rank Adaptation (PRoLoRA), an intra-layer sharing mechanism comprising four essential components: broadcast reduction, rotation enhancement, partially-sharing refinement, and a rectified initialization strategy. As a superset of LoRA, PRoLoRA retains its advantages and effectively circumvents the drawbacks of peer parameter-sharing methods with superior model capacity, practical feasibility, and broad applicability. Empirical experiments demonstrate the remarkably higher parameter efficiency of PRoLoRA in both specific parameter budget and performance target scenarios, and its scalability to larger LLMs. Notably, with half as many trainable parameters, PRoLoRA still outperforms LoRA on multiple instruction tuning datasets. Subsequently, an ablation study is conducted to validate the necessity of individual components and highlight the superiority of PRoLoRA over three potential variants. Hopefully, the conspicuously higher parameter efficiency can establish PRoLoRA as a resource-friendly alternative to LoRA.
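
A minimal sketch of the intra-layer sharing intuition, where a small trainable chunk is broadcast to fill a low-rank factor and each copy is circularly rotated to differentiate it; the shapes, names, and shift schedule are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn


class PartiallyRotatedFactor(nn.Module):
    """One low-rank factor built from a broadcast, rotation-differentiated chunk."""

    def __init__(self, rank: int, dim: int, num_copies: int):
        super().__init__()
        assert dim % num_copies == 0
        self.num_copies = num_copies
        # Only a 1/num_copies slice of the factor is actually trainable.
        self.chunk = nn.Parameter(torch.randn(rank, dim // num_copies) * 0.02)

    def forward(self) -> torch.Tensor:
        copies = []
        for i in range(self.num_copies):
            # Rotation enhancement: each broadcast copy gets a different circular
            # shift along the rank dimension to break the symmetry of pure sharing.
            copies.append(torch.roll(self.chunk, shifts=i, dims=0))
        return torch.cat(copies, dim=1)  # (rank, dim), from far fewer parameters


factor = PartiallyRotatedFactor(rank=8, dim=768, num_copies=4)
print(factor().shape)                                # torch.Size([8, 768])
print(sum(p.numel() for p in factor.parameters()))   # 1536 trainable vs 6144 for plain LoRA
```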

LoRA Meets Dropout under a Unified Framework

Sheng Wang*, Liheng Chen*, Jiyue Jiang, Boyang Xue, Lingpeng Kong, Chuan Wu (* equal contribution)

Annual Meeting of the Association for Computational Linguistics (ACL 2024)

With their remarkable capabilities, large language models (LLMs) have emerged as essential elements in numerous NLP applications, while parameter-efficient finetuning, especially LoRA, has gained popularity as a lightweight approach for model customization. Meanwhile, various dropout methods, initially designed for full finetuning with all the parameters updated, alleviate the overfitting associated with excessive parameter redundancy. Hence, a possible contradiction arises between the negligible trainable parameters of LoRA and the effectiveness of previous dropout methods, which has been largely overlooked. To fill this gap, we first confirm that parameter-efficient LoRA is also overfitting-prone. We then revisit transformer-specific dropout methods, and establish their equivalence and distinctions mathematically and empirically. Building upon this comparative analysis, we introduce a unified framework for a comprehensive investigation, which instantiates these methods based on dropping position, structural pattern, and compensation measure. Through this framework, we reveal their new preferences and comparative performance when only limited trainable parameters are involved. This framework also allows us to amalgamate the most favorable aspects into a novel dropout method named HiddenKey. Extensive experiments verify the remarkable superiority and sufficiency of HiddenKey across multiple models and tasks, which highlights it as the preferred approach for high-performance and parameter-efficient finetuning of LLMs.
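
To illustrate the framework's three axes, here is a hypothetical dropout module configurable by structural pattern, with simple rescaling standing in for the compensation measure (the dropping position is just where the module is applied in the transformer); this sketches the analysis axes only, not the HiddenKey recipe itself:

```python
import torch
import torch.nn as nn


class ConfigurableDropout(nn.Module):
    """Dropout parameterized by a structural pattern and a compensation measure."""

    def __init__(self, p: float = 0.1, pattern: str = "element"):
        super().__init__()
        self.p = p
        self.pattern = pattern   # structural pattern: "element" or "column"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.p == 0.0:
            return x
        if self.pattern == "element":
            mask = (torch.rand_like(x) > self.p).float()
        else:
            # Drop whole feature columns: a coarser structural pattern.
            cols = (torch.rand(x.shape[-1], device=x.device) > self.p).float()
            mask = cols.expand_as(x)
        # Rescaling acts here as the compensation measure; other compensations
        # from the framework could be substituted at this point.
        return x * mask / (1.0 - self.p)


drop = ConfigurableDropout(p=0.1, pattern="column")
drop.train()
hidden = torch.randn(2, 16, 768)   # the "dropping position" is wherever this is applied
print(drop(hidden).shape)          # torch.Size([2, 16, 768])
```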

Data Augmentation of Multi-turn Psychotherapy Dialogue via Knowledge-driven Progressive Thought Prompting

Jiyue Jiang, Liheng Chen, Sheng Wang, Lingpeng Kong, Yu Li, Chuan Wu

arXiv preprint, 2024

Existing dialogue data augmentation (DA) techniques predominantly focus on augmenting utterance-level dialogues, which makes it difficult to take dialogue contextual information into account. The advent of large language models (LLMs) has simplified the implementation of multi-turn dialogues. However, due to the absence of professional understanding and knowledge, it remains challenging to deliver satisfactory performance in low-resource domains such as psychotherapy dialogue. DA involves creating new training or prompting data based on existing data, which helps the model better understand and generate psychotherapy-related responses. In this paper, we aim to address the issue of multi-turn dialogue data augmentation for boosted performance in the psychotherapy domain. We propose a knowledge-driven progressive thought prompting method to guide an LLM to generate multi-turn psychotherapy-related dialogue. This method integrates a progressive thought generator, a psychotherapy knowledge generator, and a multi-turn dialogue generator. The thought generated by the progressive thought generator serves as a prompt to prevent the generated dialogue from having significant semantic deviations, while the psychotherapy knowledge generator produces psychotherapy knowledge to serve as the dialogue history for the LLM, guiding the dialogue generator to create multi-turn psychotherapy-related dialogue. To ensure the precision of psychotherapy-related multi-turn dialogue generation by the LLM, a meticulous professional evaluation is required. Extensive experiments conducted on three psychotherapy-related datasets verify the effectiveness of the proposed method.
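
A structural sketch of the three-stage pipeline described above, with a placeholder llm callable; the prompts and the interface are hypothetical and only meant to show how the three generators chain together:

```python
from typing import Callable, List


def augment_dialogue(topic: str, num_turns: int, llm: Callable[[str], str]) -> List[str]:
    """Generate one synthetic multi-turn psychotherapy dialogue for a given topic."""
    # 1) Progressive thought generator: a guiding plan that keeps later turns on track.
    thought = llm(f"Write a brief step-by-step therapeutic plan for the topic: {topic}")
    # 2) Psychotherapy knowledge generator: domain knowledge used as dialogue history.
    knowledge = llm(f"Summarize key psychotherapy knowledge relevant to: {topic}")
    # 3) Multi-turn dialogue generator: conditioned on the plan and the knowledge.
    history: List[str] = [f"[knowledge] {knowledge}"]
    for turn in range(num_turns):
        history.append(llm(
            f"Plan: {thought}\nHistory so far: {' | '.join(history)}\n"
            f"Write counselor-client turn {turn + 1} of the dialogue."
        ))
    return history


# Example with a stub LLM (it just echoes a prompt prefix) to show the call pattern.
print(augment_dialogue("exam anxiety", num_turns=2, llm=lambda prompt: prompt[:40] + "..."))
```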

All publications