HYU Natural Language Processing (NLP) Laboratory

Welcome to the Natural Language Processing (NLP) Lab at Hanyang University.

We study a broad range of problems and approaches related to natural language, chiefly building on machine learning and AI technologies.

We are looking for MS/Ph.D. students (and interns) who are self-motivated and passionate about doing research in NLP.

Please submit your information on this page if you are interested in applying to our lab.

News!

(24/09/24) Two papers have been accepted at HCLT 2024. Congrats to Deokyeong, Changhyeon, and Seung Hee! Also honored to share that both papers have received the Best Paper Award (최우수논문상) from the conference!

(24/09/20) Three papers (one Main and two Findings) have been accepted at EMNLP 2024. Two of them are the result of collaboration with SNU and Naver. The other is the outcome of an internal project. Congratulations to Deokyeong and Ki Jung!

(24/08/22) Yejin and Jii have graduated with their Master's degrees. Wish them all the best! Yejin is going to begin her Ph.D. studies!

(24/06/14) Four papers (3 oral and 1 poster) have been accepted for presentation at KCC 2024. Congrats to Yejin, Jungyeon, Jisu, Youngwoo, Kang Min, Dong Geon, and Jungmin! Also excited to share that 한국어 발화의 다중 의도 감지 연구 (Multi-Intent Detection for Korean Spoken Language) has received the Outstanding Paper Award (우수논문상) from the conference!

(24/05/16) Two papers have been accepted for presentation at ACL 2024: (1) Analysis of Multi-Source Language Training in Cross-Lingual Transfer (authors: Seong Hoon, Taejun, Jinhyeon) and (2) Hyper-CL: Conditioning Sentence Representations with Hypernetworks (authors: Young Hyun, Jii, Changhyeon). Big congrats to the authors!

Recent Publications

Abstract

Modular programming, which aims to construct the final program by integrating smaller, independent building blocks, has been regarded as a desirable practice in software development. However, with the rise of recent code generation agents built upon large language models (LLMs), a question emerges: is this traditional practice equally effective for these new tools? In this work, we assess the impact of modularity in code generation by introducing a novel metric for its quantitative measurement. Surprisingly, and contrary to conventional wisdom on the topic, we find that modularity is not a core factor in improving the performance of code generation models. We also explore potential explanations for why LLMs do not exhibit a preference for modular code over non-modular code. Our code is available at https://github.com/HYU-NLP/Revisiting-Modularity.
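The metric itself is defined in the paper; as a rough, hypothetical illustration of how modularity in generated code could be quantified, the Python sketch below scores a program by the fraction of its statements that sit inside function definitions. The `modularity_score` helper and this particular proxy are assumptions for illustration, not the paper's actual metric.

```python
import ast

def modularity_score(source: str) -> float:
    # Hypothetical proxy (NOT the paper's metric): the fraction of all
    # statements that live inside function definitions.
    tree = ast.parse(source)
    stmts = [n for n in ast.walk(tree) if isinstance(n, ast.stmt)]
    funcs = [n for n in ast.walk(tree)
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    inside = {id(s) for f in funcs for s in ast.walk(f)
              if isinstance(s, ast.stmt) and s is not f}
    return len(inside) / len(stmts) if stmts else 0.0

program = """
def add(a, b):
    return a + b

print(add(2, 3))
"""
print(f"modularity proxy: {modularity_score(program):.2f}")  # 0.33
```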

Hyper-CL: Conditioning Sentence Representations with Hypernetworks

Abstract

While the introduction of contrastive learning frameworks in sentence representation learning has significantly contributed to advancements in the field, it remains unclear whether state-of-the-art sentence embeddings can capture the fine-grained semantics of sentences, particularly when conditioned on specific perspectives. In this paper, we introduce Hyper-CL, an efficient methodology that integrates hypernetworks with contrastive learning to compute conditioned sentence representations. In our proposed approach, the hypernetwork is responsible for transforming pre-computed condition embeddings into corresponding projection layers. This enables the same sentence embeddings to be projected differently according to various conditions. Evaluation on two representative conditioning benchmarks, namely conditional semantic text similarity and knowledge graph completion, demonstrates that Hyper-CL is effective in flexibly conditioning sentence representations while also showcasing its computational efficiency. We also provide a comprehensive analysis of the inner workings of our approach, leading to a better interpretation of its mechanisms. Our code is available at https://github.com/HYU-NLP/Hyper-CL.
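As a minimal sketch of the mechanism described above, the toy module below uses a linear hypernetwork to turn a pre-computed condition embedding into a condition-specific projection matrix, which is then applied to a sentence embedding. The class name, dimensions, and layer choices are illustrative assumptions; the official implementation lives at the linked repository.

```python
import torch
import torch.nn as nn

class ConditionHypernet(nn.Module):
    """Toy sketch of the Hyper-CL idea (not the official code): a
    hypernetwork maps a condition embedding to the weights of a
    condition-specific projection layer."""

    def __init__(self, dim: int, proj_dim: int):
        super().__init__()
        self.proj_dim = proj_dim
        # Emits a (proj_dim x dim) projection matrix per condition.
        self.weight_gen = nn.Linear(dim, proj_dim * dim)

    def forward(self, sent_emb: torch.Tensor, cond_emb: torch.Tensor):
        w = self.weight_gen(cond_emb).view(-1, self.proj_dim, sent_emb.size(-1))
        # The same sentence embedding is projected differently per condition.
        return torch.bmm(w, sent_emb.unsqueeze(-1)).squeeze(-1)

hyper = ConditionHypernet(dim=256, proj_dim=64)   # toy dimensions
sents = torch.randn(4, 256)  # pre-computed sentence embeddings
conds = torch.randn(4, 256)  # pre-computed condition embeddings
print(hyper(sents, conds).shape)  # torch.Size([4, 64])
```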

Analysis of Multi-Source Language Training in Cross-Lingual Transfer

Abstract

The successful adaptation of multilingual language models (LMs) to a specific language-task pair critically depends on the availability of data tailored for that condition. While cross-lingual transfer (XLT) methods have contributed to addressing this data scarcity problem, there is still ongoing debate about the mechanisms behind their effectiveness. In this work, we focus on one of the promising assumptions about the inner workings of XLT: that it encourages multilingual LMs to place greater emphasis on language-agnostic or task-specific features. We test this hypothesis by examining how the patterns of XLT change with a varying number of source languages involved in the process. Our experimental findings show that the use of multiple source languages in XLT, a technique we term Multi-Source Language Training (MSLT), leads to increased mingling of embedding spaces for different languages, supporting the claim that XLT benefits from making use of language-independent information. On the other hand, we discover that using an arbitrary combination of source languages does not always guarantee better performance. We suggest simple heuristics for identifying effective language combinations for MSLT and empirically demonstrate their effectiveness.
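As a small sketch of the data side of MSLT, the snippet below mixes training splits from several source languages before fine-tuning. XNLI as the task and the en/de/es combination are arbitrary assumptions for illustration; the paper's actual tasks, languages, and mixing recipe may differ.

```python
from datasets import load_dataset, interleave_datasets

# Illustrative MSLT-style data construction (not the paper's exact recipe):
# fine-tuning data is drawn from several source languages instead of one.
source_langs = ["en", "de", "es"]  # a hypothetical language combination

splits = [load_dataset("xnli", lang, split="train") for lang in source_langs]
# Mix the sources so each training batch can contain multiple languages.
multi_source = interleave_datasets(splits, seed=42)
print(multi_source)  # this mixed set would then be used for fine-tuning
```

As the abstract notes, the choice of combination matters: an arbitrary or poorly matched set of source languages can hurt rather than help.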

Abstract

Task-oriented dialogue (TOD) systems are commonly designed with the presumption that each utterance represents a single intent. However, this assumption may not accurately reflect real-world situations, where users frequently express multiple intents within a single utterance. While there is emerging interest in multi-intent detection (MID), existing in-domain datasets such as MixATIS and MixSNIPS have limitations in their formulation. To address these issues, we present BlendX, a suite of refined datasets featuring more diverse patterns than their predecessors, elevating both complexity and diversity. For dataset construction, we utilize both rule-based heuristics and a generative tool, OpenAI's ChatGPT, which is augmented with a similarity-driven strategy for utterance selection. To ensure the quality of the proposed datasets, we also introduce three novel metrics that assess the statistical properties of an utterance related to word count, conjunction use, and pronoun usage. Extensive experiments on BlendX reveal that state-of-the-art MID models struggle with the challenges posed by the new datasets, highlighting the need to reexamine the current state of the MID field. The dataset is available at https://github.com/HYU-NLP/BlendX.
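The snippet below is a toy illustration of the surface statistics the three quality metrics are described as inspecting, namely word count, conjunction use, and pronoun usage. The word lists and the `utterance_stats` helper are illustrative assumptions, not the metric definitions from the paper.

```python
# Illustrative word lists; the paper's metric definitions are not reproduced here.
CONJUNCTIONS = {"and", "but", "or", "then", "also"}
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they", "me", "my"}

def utterance_stats(utterance: str) -> dict:
    # Per-utterance statistics related to word count, conjunctions, and pronouns.
    tokens = utterance.lower().split()
    return {
        "word_count": len(tokens),
        "conjunctions": sum(t in CONJUNCTIONS for t in tokens),
        "pronouns": sum(t in PRONOUNS for t in tokens),
    }

print(utterance_stats("book a flight to Seoul and then play my playlist"))
# {'word_count': 10, 'conjunctions': 2, 'pronouns': 1}
```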

Abstract

Cross-lingual transfer (XLT) is an emergent ability of multilingual language models that largely preserves their performance on a task when they are evaluated in languages that were not included in the fine-tuning process. While English, due to its widespread usage, is typically regarded as the primary language for model adaptation in various tasks, recent studies have revealed that the efficacy of XLT can be amplified by selecting the most appropriate source languages based on specific conditions. In this work, we propose using sub-network similarity between two languages as a proxy for predicting the compatibility of the languages in the context of XLT. Our approach is model-oriented, better reflecting the inner workings of foundation models. In addition, it requires only a moderate amount of raw text from candidate languages, distinguishing it from the majority of previous methods that rely on external resources. In experiments, we demonstrate that our method is more effective than baselines across diverse tasks. Specifically, it shows proficiency in ranking candidates for zero-shot XLT, achieving an improvement of 4.6% on average in terms of NDCG@3. We also provide extensive analyses that confirm the utility of sub-networks for XLT prediction.
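As a hedged sketch of the general idea, the code below extracts a binary sub-network mask per language by keeping the largest-magnitude importance scores and compares two languages via Jaccard overlap. The magnitude-based mask extraction, the Jaccard measure, and both helper functions are assumptions for illustration, not the paper's actual procedure.

```python
import torch

def subnetwork_mask(scores: torch.Tensor, keep: float = 0.1) -> torch.Tensor:
    # Assumed rule: a language's sub-network keeps the top `keep` fraction
    # of parameters by importance-score magnitude.
    k = max(1, int(keep * scores.numel()))
    threshold = scores.abs().flatten().topk(k).values.min()
    return scores.abs() >= threshold

def mask_similarity(m1: torch.Tensor, m2: torch.Tensor) -> float:
    # Jaccard overlap between two binary masks as a similarity proxy.
    inter = (m1 & m2).sum().item()
    union = (m1 | m2).sum().item()
    return inter / union if union else 0.0

# Stand-ins for importance scores estimated from raw text in two languages.
scores_lang_a = torch.randn(1000)
scores_lang_b = torch.randn(1000)
sim = mask_similarity(subnetwork_mask(scores_lang_a), subnetwork_mask(scores_lang_b))
print(f"sub-network similarity: {sim:.3f}")  # higher would suggest a better pairing
```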