
NepaliGPT — Efficient Adaptation of LLMs for Nepali via Distributed Training

Adapting open-source LLMs to Nepali using parameter‑efficient techniques (LoRA/QLoRA) and carefully scoped distributed training on the Tribhuvan University High Performance Computing (TU‑HPC) system. Focus: resource‑aware methods that preserve quality while enabling reproducibility and open access.

Author: Aatiz Ghimire
Advisors: Dr. Madhav Prasad Ghimire, Siman Giri
School of Mathematical Sciences, Tribhuvan University — Kirtipur, Kathmandu
Central Department of Physics, Tribhuvan University — Kirtipur, Kathmandu
Compute: 2× GPUs
CPU nodes: 500+ cores, 5 TB+ memory
Network: 1 Gbps
Storage: ≈10 TB
Abstract

Most state‑of‑the‑art LLMs are trained on high‑resource languages; Nepali remains under‑served despite tens of millions of speakers. This project adapts an open‑source LLM to the Nepali domain using parameter‑efficient fine‑tuning (LoRA/QLoRA) with quantization and memory‑aware optimizations, enabling single‑GPU training on Tribhuvan University High Performance Computing System while preserving quality. We emphasize reproducibility and open release of models and code as NepaliGPT.

Problem Statement
  • Resource constraints: 2× GPUs, 1 Gbps interconnect; memory and bandwidth limit naive fine‑tuning.
  • Efficiency vs quality: Do LoRA/QLoRA match full fine‑tuning on Nepali perplexity and downstream tasks?
  • Low‑resource adaptation: Tokenizer coverage for Devanagari and strategies for vocabulary extension or retention.
  • Reproducibility & open access: Transparent scripts, environment, and license‑compliant release of weights/adapters.
Objectives
  1. Adapt a strong open LLM to Nepali, addressing Devanagari tokenization and embeddings.
  2. Implement PEFT: LoRA and QLoRA to enable billion‑parameter fine‑tuning on a single GPU.
  3. Compare against full fine‑tuning (7B) on performance, efficiency, and scalability.
  4. Evaluate Nepali quality via perplexity and task benchmarks (QA, summarization), plus human judgments.
  5. Release reproducible code and Nepali adapters as NepaliGPT artifacts.
Rationale
  • Bridge the low‑resource language gap with a generative Nepali model (beyond BERT‑style encoders).
  • Advance efficient LLM adaptation science under hard compute limits.
  • Promote open, reproducible AI for Nepal’s research and civic ecosystem.
Working Reviews
Transformer‑based LLMs (GPT, LLaMA, etc.) dominate high‑resource languages; domain/language adaptation is required for Nepali.
LoRA: injects low‑rank adapters into attention/MLP weights; updates a tiny fraction of parameters with minimal memory overhead; adapters are mergeable for inference (the update rule is sketched at the end of this review).
QLoRA: 4‑bit NF4 quantization + LoRA; paged optimizers; enables fine‑tuning larger bases (e.g., ≥33B) on a single high‑VRAM GPU with quality parity.
FSDP/ZeRO and mixed precision are key enablers on HPC; constrained networks favor single‑node training and CPU offload where needed.
Prior Nepali efforts (e.g., NepaliBERT / NepBERTa) focus on understanding; this work targets generative Nepali via PEFT.
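For reference, the standard LoRA update rule behind the summary above, in its general form from the LoRA literature (notation is not project-specific):

```latex
% LoRA: a frozen pretrained weight W_0 is augmented by a trainable low-rank update.
W \;=\; W_0 + \Delta W \;=\; W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
```

Only A and B are trained, so the per-matrix trainable parameter count drops from dk to r(d+k); after training, (α/r)BA can be merged into W_0, leaving inference cost unchanged.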
Methodology

Environment

  • GPU training; CPU nodes for preprocessing and parallel jobs.
  • PyTorch + Transformers + PEFT + bitsandbytes; optional FSDP/DeepSpeed‑ZeRO for offload.
  • Slurm scripts, reproducible env, experiment logging.

Data

  • Assemble Nepali corpora (e.g., Nepali Wikipedia, OSCAR Nepali, large‑scale Nepali text), with cleaning/dedup & held‑out splits.
  • Tokenizer study: baseline vs extended SentencePiece for Devanagari coverage; optional vocab growth + embedding resize.
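To make the tokenizer study concrete, a minimal sketch of measuring Devanagari fragmentation and optionally growing the vocabulary, assuming the Hugging Face tokenizers API; the base-model id, sample sentence, and added tokens are placeholders, not the project's final choices:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-hf"  # placeholder base model; actual choice TBD
tokenizer = AutoTokenizer.from_pretrained(BASE)

# Fertility: tokens per character on a Nepali sample; high values indicate the
# SentencePiece vocab is fragmenting Devanagari into bytes/rare pieces.
sample = "नेपाल दक्षिण एसियामा अवस्थित एक भूपरिवेष्ठित देश हो।"
ids = tokenizer(sample, add_special_tokens=False)["input_ids"]
print(f"tokens={len(ids)} chars={len(sample)} fertility={len(ids)/len(sample):.2f}")

# Optional vocab growth: add frequent Nepali pieces, then resize embeddings so
# the model has rows for the new tokens.
new_pieces = ["नेपाल", "काठमाडौं"]  # illustrative; a real list comes from corpus statistics
added = tokenizer.add_tokens(new_pieces)
if added:
    model = AutoModelForCausalLM.from_pretrained(BASE)
    model.resize_token_embeddings(len(tokenizer))  # new rows are randomly initialized
```

The new embedding rows are untrained, so vocabulary extension only pays off if the added pieces are then learned during fine-tuning.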

Models

  • Start with 7B for full‑tune baseline; LoRA/QLoRA on 7B/13B; attempt ≥33B with QLoRA if feasible.

Fine‑Tuning Regimes

  1. Full fine‑tuning (7B): FP16, gradient checkpointing, small effective batch; baseline upper bound.
  2. LoRA: ranks 8–16 on attention/MLP; fewer trainable params, faster iterations.
  3. QLoRA: 4‑bit NF4 weights + LoRA; paged optimizers; scale to larger bases.
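A minimal sketch of regime 3 using the Transformers/PEFT/bitsandbytes stack listed under Environment; the base-model id, hyperparameters, and optimizer settings are illustrative assumptions rather than the project's final configuration:

```python
import torch
from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
                          TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE = "meta-llama/Llama-2-7b-hf"  # placeholder; 13B/33B bases follow the same pattern

# 4-bit NF4 quantization of the frozen base weights (QLoRA).
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb_cfg, device_map="auto")
model = prepare_model_for_kbit_training(model)  # stabilizes k-bit training, enables grad checkpointing

# Low-rank adapters on attention/MLP projections; only these are trained.
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
                      task_type="CAUSAL_LM",
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                                      "gate_proj", "up_proj", "down_proj"])
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically <1% of base parameters

# Paged optimizer + gradient accumulation keep the run inside single-GPU memory.
args = TrainingArguments(output_dir="out/qlora-ne",
                         per_device_train_batch_size=2,
                         gradient_accumulation_steps=32,
                         num_train_epochs=1, learning_rate=2e-4, bf16=True,
                         gradient_checkpointing=True,
                         optim="paged_adamw_32bit", logging_steps=50)
# Trainer(model=model, args=args, train_dataset=...) completes the loop;
# dataset preparation is omitted here.
```

LoRA (regime 2) is the same sketch without the BitsAndBytesConfig, loading the base in FP16/BF16 instead.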

Evaluation

  • Intrinsic: Perplexity on held‑out Nepali.
  • Extrinsic: QA/summarization prompts; human preference for fluency/faithfulness.
  • Systems: tokens/sec, peak GPU/CPU mem, GPU hours; cost/energy estimates.
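For the intrinsic metric, a minimal perplexity sketch over held-out Nepali text (chunked, token-weighted evaluation; the file path in the usage comment is a placeholder):

```python
import math
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text: str, max_len: int = 1024) -> float:
    """Token-weighted perplexity of a causal LM over one long text."""
    device = next(model.parameters()).device
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    nll_sum, n_tokens = 0.0, 0
    for start in range(0, ids.size(0) - 1, max_len):
        chunk = ids[start:start + max_len + 1].unsqueeze(0).to(device)
        if chunk.size(1) < 2:
            break
        out = model(chunk, labels=chunk)   # loss = mean NLL over the chunk
        n = chunk.size(1) - 1              # number of predicted tokens
        nll_sum += out.loss.item() * n
        n_tokens += n
    return math.exp(nll_sum / n_tokens)

# Usage (model/tokenizer as loaded for fine-tuning; path is a placeholder):
# text = open("data/nepali_heldout.txt", encoding="utf-8").read()
# print(f"held-out PPL = {perplexity(model, tokenizer, text):.2f}")
```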
Figure (placeholder): Tokens/sec vs. number of GPUs (1, 2, 4, 8), comparing Data Parallel (DP) against Hybrid (DP+TP+ZeRO). Illustrative only; to be replaced with TU‑HPC measurements.

Key Findings
  • LoRA/QLoRA enable billion‑param adaptation on single GPU with strong PPL gains over base.
  • Scaling efficiency limited by 1 Gbps; single‑node favored; offload viable for larger bases.
  • Tokenizer tweaks for Devanagari can reduce sequence fragmentation and improve fluency.

These findings are provisional and will be replaced with measured outcomes and ablations.

Experiment Registry
ID              Base Model  Tokenizer  Method      Seq Len  Batch    Epochs  PPL  Notes
E‑FT‑7B‑01      7B          SP‑base    Full FT     1024     1× GA8   1       TBD  Baseline
E‑LoRA‑13B‑02   13B         SP‑base    LoRA r=8    1024     2× GA16  2       TBD  Attn+MLP
E‑QLoRA‑33B‑03  33B         SP‑ext     QLoRA r=16  1024     2× GA32  1       TBD  NF4 + paged opt
(GA = gradient‑accumulation steps; PPL values to be filled in from completed runs.)
Expected Outcomes
  • NepaliGPT release: fluent generative Nepali model (or LoRA adapters) with model card and usage guide.
  • Efficiency data: concrete comparisons of full FT vs LoRA vs QLoRA on time, memory, and quality.
  • HPC insights: practical notes on single‑GPU training with CPU offload and constrained networking.
  • Open science: scripts, configs, and evaluation benchmarks for community reuse.
Working Schedule
Jan-Feb — Review
Survey LLM adaptation, PEFT, QLoRA, and distributed methods; define baselines and metrics.
Feb-Apr — Data Work & Setup
Assemble/clean corpora; tokenizer analysis/extension; environment & Slurm scripts.
Apr-Jul — Fine‑Tuning
Full FT (7B) baseline; LoRA (7B/13B); QLoRA (≥13B/33B) with ablations.
Jul-Nov — Analysis, Refinement & Evaluation
Compute PPL & task metrics; human eval; efficiency and cost studies; iterate.
Nov-Dec — Writing & Documentation
Finalize paper; prepare model card; release code and adapters; update website.
Release
  • NepaliGPT release: December 2025.
Team & Acknowledgments

Author: Aatiz Ghimire (MSc Data Science, Tribhuvan University)

Advisors: Dr. Madhav Prasad Ghimire (Associate Professor, Central Department of Physics) and Siman Giri (Center of AI, Herald College Kathmandu)

Infrastructure: Tribhuvan University High Performance Computing

© Aatiz Ghimire • This site hosts the NepaliGPT paper, code, data cards, and reproducibility artifacts.