
NepaliGPT — Efficient Adaptation of LLMs for Nepali via Distributed Training

Adapting open-source LLMs to Nepali using parameter‑efficient techniques (LoRA/QLoRA) and carefully scoped distributed training on the Tribhuvan University High Performance Computing (TU‑HPC) system. Focus: resource‑aware methods that preserve quality while enabling reproducibility and open access.

Author: Aatiz Ghimire
Advisors: Dr. Madhav Prasad Ghimire, Siman Giri
School of Mathematical Sciences, Tribhuvan University — Kirtipur, Kathmandu
Central Department of Physics, Tribhuvan University — Kirtipur, Kathmandu
Compute: 2× GPUs
CPU nodes: 500+ cores, 5 TB+ memory
Network: 1 Gbps
Storage: ≈10 TB
Abstract

Most state‑of‑the‑art LLMs are trained on high‑resource languages; Nepali remains under‑served despite tens of millions of speakers. This project adapts an open‑source LLM to the Nepali domain using parameter‑efficient fine‑tuning (LoRA/QLoRA) with quantization and memory‑aware optimizations, enabling single‑GPU training on Tribhuvan University High Performance Computing System while preserving quality. We emphasize reproducibility and open release of models and code as NepaliGPT.

Problem Statement
  • Resource constraints: 2× GPUs, 1 Gbps interconnect; memory and bandwidth limit naive fine‑tuning.
  • Efficiency vs quality: Do LoRA/QLoRA match full fine‑tuning on Nepali perplexity and downstream tasks?
  • Low‑resource adaptation: Tokenizer coverage for Devanagari and strategies for vocabulary extension or retention.
  • Reproducibility & open access: Transparent scripts, environment, and license‑compliant release of weights/adapters.
Objectives
  1. Adapt a strong open LLM to Nepali, addressing Devanagari tokenization and embeddings.
  2. Implement PEFT: LoRA and QLoRA to enable billion‑parameter fine‑tuning on a single GPU.
  3. Compare against full fine‑tuning (7B) on performance, efficiency, and scalability.
  4. Evaluate Nepali quality via perplexity and task benchmarks (QA, summarization), plus human judgments.
  5. Release reproducible code and Nepali adapters as NepaliGPT artifacts.
Rationale
  • Bridge the low‑resource language gap with a generative Nepali model (beyond BERT‑style encoders).
  • Advance efficient LLM adaptation science under hard compute limits.
  • Promote open, reproducible AI for Nepal’s research and civic ecosystem.
Working Reviews
Transformer‑based LLMs (GPT, LLaMA, etc.) dominate high‑resource languages; domain/language adaptation is required for Nepali.
LoRA: injects low‑rank adapters into attention/MLP weights; updates a tiny fraction of parameters with minimal memory overhead; adapters are mergeable for inference (the update rule is sketched at the end of this review).
QLoRA: 4‑bit NF4 quantization + LoRA; paged optimizers; enables fine‑tuning larger bases (e.g., ≥33B) on a single high‑VRAM GPU with quality parity.
FSDP/ZeRO and mixed precision are key enablers on HPC; constrained networks favor single‑node training and CPU offload where needed.
Prior Nepali efforts (e.g., NepaliBERT / NepBERTa) focus on understanding; this work targets generative Nepali via PEFT.
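For reference, the standard LoRA update rule behind the summary above, in its general form from the LoRA literature (notation is not project-specific):

```latex
% LoRA: a frozen pretrained weight W_0 is augmented by a trainable low-rank update.
W \;=\; W_0 + \Delta W \;=\; W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
```

Only A and B are trained, so the per-matrix trainable parameter count drops from dk to r(d+k); after training, (α/r)BA can be merged into W_0, leaving inference cost unchanged.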
Methodology

Environment

  • GPU training; CPU nodes for preprocessing and parallel jobs.
  • PyTorch + Transformers + PEFT + bitsandbytes; optional FSDP/DeepSpeed‑ZeRO for offload.
  • Slurm scripts, reproducible env, experiment logging.

Data

  • Assemble Nepali corpora (e.g., Nepali Wikipedia, OSCAR Nepali, large‑scale Nepali text), with cleaning/dedup & held‑out splits.
  • Tokenizer study: baseline vs extended SentencePiece for Devanagari coverage; optional vocab growth + embedding resize.
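To make the tokenizer study concrete, a minimal sketch of measuring Devanagari fragmentation and optionally growing the vocabulary, assuming the Hugging Face tokenizers API; the base-model id, sample sentence, and added tokens are placeholders, not the project's final choices:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-hf"  # placeholder base model; actual choice TBD
tokenizer = AutoTokenizer.from_pretrained(BASE)

# Fertility: tokens per character on a Nepali sample; high values indicate the
# SentencePiece vocab is fragmenting Devanagari into bytes/rare pieces.
sample = "नेपाल दक्षिण एसियामा अवस्थित एक भूपरिवेष्ठित देश हो।"
ids = tokenizer(sample, add_special_tokens=False)["input_ids"]
print(f"tokens={len(ids)} chars={len(sample)} fertility={len(ids)/len(sample):.2f}")

# Optional vocab growth: add frequent Nepali pieces, then resize embeddings so
# the model has rows for the new tokens.
new_pieces = ["नेपाल", "काठमाडौं"]  # illustrative; a real list comes from corpus statistics
added = tokenizer.add_tokens(new_pieces)
if added:
    model = AutoModelForCausalLM.from_pretrained(BASE)
    model.resize_token_embeddings(len(tokenizer))  # new rows are randomly initialized
```

The new embedding rows are untrained, so vocabulary extension only pays off if the added pieces are then learned during fine-tuning.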

Models

  • Start with 7B for full‑tune baseline; LoRA/QLoRA on 7B/13B; attempt ≥33B with QLoRA if feasible.

Fine‑Tuning Regimes

  1. Full fine‑tuning (7B): FP16, gradient checkpointing, small effective batch; baseline upper bound.
  2. LoRA: ranks 8–16 on attention/MLP; fewer trainable params, faster iterations.
  3. QLoRA: 4‑bit NF4 weights + LoRA; paged optimizers; scale to larger bases.
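A minimal sketch of regime 3 using the Transformers/PEFT/bitsandbytes stack listed under Environment; the base-model id, hyperparameters, and optimizer settings are illustrative assumptions rather than the project's final configuration:

```python
import torch
from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
                          TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE = "meta-llama/Llama-2-7b-hf"  # placeholder; 13B/33B bases follow the same pattern

# 4-bit NF4 quantization of the frozen base weights (QLoRA).
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb_cfg, device_map="auto")
model = prepare_model_for_kbit_training(model)  # stabilizes k-bit training, enables grad checkpointing

# Low-rank adapters on attention/MLP projections; only these are trained.
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
                      task_type="CAUSAL_LM",
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                                      "gate_proj", "up_proj", "down_proj"])
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically <1% of base parameters

# Paged optimizer + gradient accumulation keep the run inside single-GPU memory.
args = TrainingArguments(output_dir="out/qlora-ne",
                         per_device_train_batch_size=2,
                         gradient_accumulation_steps=32,
                         num_train_epochs=1, learning_rate=2e-4, bf16=True,
                         gradient_checkpointing=True,
                         optim="paged_adamw_32bit", logging_steps=50)
# Trainer(model=model, args=args, train_dataset=...) completes the loop;
# dataset preparation is omitted here.
```

LoRA (regime 2) is the same sketch without the BitsAndBytesConfig, loading the base in FP16/BF16 instead.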

Evaluation

  • Intrinsic: Perplexity on held‑out Nepali.
  • Extrinsic: QA/summarization prompts; human preference for fluency/faithfulness.
  • Systems: tokens/sec, peak GPU/CPU mem, GPU hours; cost/energy estimates.
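For the intrinsic metric, a minimal perplexity sketch over held-out Nepali text (chunked, token-weighted evaluation; the file path in the usage comment is a placeholder):

```python
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text: str, max_len: int = 1024) -> float:
    """Token-weighted perplexity of a causal LM over one long text."""
    device = next(model.parameters()).device
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    nll_sum, n_tokens = 0.0, 0
    for start in range(0, ids.size(0) - 1, max_len):
        chunk = ids[start:start + max_len + 1].unsqueeze(0).to(device)
        if chunk.size(1) < 2:
            break
        out = model(chunk, labels=chunk)   # loss = mean NLL over the chunk
        n = chunk.size(1) - 1              # number of predicted tokens
        nll_sum += out.loss.item() * n
        n_tokens += n
    return math.exp(nll_sum / n_tokens)

# Usage (model/tokenizer as loaded for fine-tuning; path is a placeholder):
# text = open("data/nepali_heldout.txt", encoding="utf-8").read()
# print(f"held-out PPL = {perplexity(model, tokenizer, text):.2f}")
```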
Figure (placeholder): Tokens/sec vs. number of GPUs (1, 2, 4, 8), comparing Data Parallel (DP) against Hybrid (DP+TP+ZeRO). Illustrative only; to be replaced with TU‑HPC measurements.

Key Findings
  • LoRA/QLoRA enable billion‑param adaptation on single GPU with strong PPL gains over base.
  • Scaling efficiency limited by 1 Gbps; single‑node favored; offload viable for larger bases.
  • Tokenizer tweaks for Devanagari can reduce sequence fragmentation and improve fluency.

These findings are provisional and will be replaced with measured outcomes and ablations.

Experiment Registry
ID              Base Model  Tokenizer  Method      Seq Len  Batch    Epochs  PPL  Notes
E‑FT‑7B‑01      7B          SP‑base    Full FT     1024     1× GA8   1       TBD  Baseline
E‑LoRA‑13B‑02   13B         SP‑base    LoRA r=8    1024     2× GA16  2       TBD  Attn+MLP
E‑QLoRA‑33B‑03  33B         SP‑ext     QLoRA r=16  1024     2× GA32  1       TBD  NF4 + paged opt
(GA = gradient‑accumulation steps; PPL values to be filled in from completed runs.)
Expected Outcomes
  • NepaliGPT release: fluent generative Nepali model (or LoRA adapters) with model card and usage guide.
  • Efficiency data: concrete comparisons of full FT vs LoRA vs QLoRA on time, memory, and quality.
  • HPC insights: practical notes on single‑GPU training with CPU offload and constrained networking.
  • Open science: scripts, configs, and evaluation benchmarks for community reuse.
Working Schedule
Jan-Feb — Review
Survey LLM adaptation, PEFT, QLoRA, and distributed methods; define baselines and metrics.
Feb-Apr — Data Work & Setup
Assemble/clean corpora; tokenizer analysis/extension; environment & Slurm scripts.
Apr-Jul — Fine‑Tuning
Full FT (7B) baseline; LoRA (7B/13B); QLoRA (≥13B/33B) with ablations.
Jul-Nov — Analysis, Refinement & Evaluation
Compute PPL & task metrics; human eval; efficiency and cost studies; iterate.
Nov-Dec — Writing & Documentation
Finalize paper; prepare model card; release code and adapters; update website.
Release
  • NepaliGPT release: December 2025.
Team & Acknowledgments

Author: Aatiz Ghimire (MSc Data Science, Tribhuvan University)

Advisors: Dr. Madhav Prasad Ghimire (Associate Professor, Central Department of Physics) and Siman Giri (Center of AI, Herald College Kathmandu)

Infrastructure: Tribhuvan University High Performance Computing

© Aatiz Ghimire • This site hosts the NepaliGPT paper, code, data cards, and reproducibility artifacts.