
GreenTune: Energy-Efficient Low-Rank Tuning of LLMs with ThreeE Evaluation under 4-/8-bit Quantization

Research output: Book Chapters | Papers in Conference Proceedings › Conference paper (refereed) › Research › peer-review

Abstract

As web-scale Generative AI (GenAI) services become core Internet infrastructure, the energy use and carbon emissions of LLM inference are becoming a significant sustainability concern. Yet deployment decisions for conversational endpoints remain almost entirely driven by accuracy and latency, implicitly treating energy as a free resource and offering little visibility into the energy and CO2e cost of individual requests or into how 4-bit and 8-bit precision choices affect the trade-offs between energy, performance, and task quality. We address this gap with GreenTune, a request-centric, length-aware energy and carbon accounting protocol for web-scale GenAI services and an energy-aware quantized tuning pipeline built on top of it. GreenTune samples GPU power, subtracts an idle baseline, and integrates traces to obtain per-request energy. From these measurements it derives length-dependent Joules-per-token profiles, throughput, and tail latency and converts device-side energy into per-token and per-session CO2e estimates under standard carbon accounting assumptions. On this accounting layer, we train low-rank adapters on 4-bit and 8-bit quantized chat backbones and evaluate the resulting configurations under shared workloads to obtain fair comparisons. To support deployment decisions, GreenTune introduces the Energy–Efficiency–Effectiveness (ThreeE) composite metric, which combines energy efficiency, inference efficiency, and task effectiveness into a transparent, tunable deployment score while still exposing the underlying axes. Instantiated on ~1.1B- and 7B-parameter chat models on consumer-grade GPUs, GreenTune shows that carefully configured 4-bit variants reduce energy consumption and CO2e per generated token by about 60% relative to 8-bit baselines, while increasing throughput by up to 3× and reducing p95 latency by more than half at comparable task accuracy. These results turn precision choices such as 4-bit vs. 8-bit from a narrow performance tweak into a system-level lever for meeting explicit carbon budgets and provide a model-agnostic path from power traces to actionable per-request and fleet-wide CO2e budgets for web-scale GenAI endpoints.
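The accounting steps named in the abstract (sample GPU power, subtract an idle baseline, integrate the trace, then normalise per token and convert to CO2e) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the trapezoidal integration rule, the clamping of negative net power to zero, and the grid-intensity parameter are all assumptions made for illustration, and the power samples are supplied by the caller rather than read from a real GPU.

```python
from typing import Sequence


def request_energy_joules(times_s: Sequence[float],
                          power_w: Sequence[float],
                          idle_w: float) -> float:
    """Integrate (power - idle) over time with the trapezoidal rule.

    Assumption: net power below the idle baseline is clamped to zero.
    """
    if len(times_s) != len(power_w):
        raise ValueError("times_s and power_w must have the same length")
    energy = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        p_prev = max(power_w[i - 1] - idle_w, 0.0)
        p_curr = max(power_w[i] - idle_w, 0.0)
        energy += 0.5 * (p_prev + p_curr) * dt
    return energy


def joules_per_token(energy_j: float, n_tokens: int) -> float:
    """Normalise per-request energy by the number of generated tokens."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return energy_j / n_tokens


def co2e_grams(energy_j: float, grid_g_per_kwh: float) -> float:
    """Convert device-side energy to grams of CO2e.

    grid_g_per_kwh is a hypothetical grid carbon intensity chosen by the
    caller; 1 kWh equals 3.6e6 J.
    """
    return energy_j / 3.6e6 * grid_g_per_kwh
```

For example, a trace sampled at 0 s, 1 s, and 2 s can be passed to `request_energy_joules` together with a measured idle wattage, and the result fed to `joules_per_token` and `co2e_grams`. Per-session figures follow by summing the per-request values.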

Original language: English
Title of host publication: WWW 2026 - Proceedings of the ACM Web Conference 2026
Editors: Hakim HACID, Yoelle MAAREK
Publisher: Association for Computing Machinery, Inc
Pages: 9397-9408
Number of pages: 12
ISBN (Electronic): 9798400723070
ISBN (Print): 9798400723070
Publication status: Published - 12 Apr 2026
Event: ACM Web Conference 2026 - Dubai, United Arab Emirates
Duration: 13 Apr 2026 – 17 Apr 2026

Conference

Conference: ACM Web Conference 2026
Abbreviated title: WWW '26
Country/Territory: United Arab Emirates
City: Dubai
Period: 13/04/26 – 17/04/26

Bibliographical note

Publisher Copyright:
© 2026 Owner/Author.

Funding

This work has been supported by Lingnan University through Faculty Research Grants (SDS24A2 and SDS24A12) and the Lam Woo Research Fund (No. LWP20040). This work is partly supported by the Major Key Project of PCL (Grant No. PCL2024A05) and the National Natural Science Foundation of China under grant No. 62403433.

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 7 - Affordable and Clean Energy

Keywords

  • Large Language Model
  • Low-Rank Adaptation
  • Environmental Awareness
  • Environmental, social, and governance (ESG)
