Abstract
As web-scale Generative AI (GenAI) services become core Internet infrastructure, the energy use and carbon emissions of LLM inference are becoming a significant sustainability concern. Yet deployment decisions for conversational endpoints remain almost entirely driven by accuracy and latency, implicitly treating energy as a free resource and offering little visibility into the energy and CO2e cost of individual requests or into how 4-bit and 8-bit precision choices affect the trade-offs between energy, performance, and task quality. We address this gap with GreenTune, a request-centric, length-aware energy and carbon accounting protocol for web-scale GenAI services and an energy-aware quantized tuning pipeline built on top of it. GreenTune samples GPU power, subtracts an idle baseline, and integrates traces to obtain per-request energy. From these measurements it derives length-dependent Joules-per-token profiles, throughput, and tail latency and converts device-side energy into per-token and per-session CO2e estimates under standard carbon accounting assumptions. On this accounting layer, we train low-rank adapters on 4-bit and 8-bit quantized chat backbones and evaluate the resulting configurations under shared workloads to obtain fair comparisons. To support deployment decisions, GreenTune introduces the Energy-Efficiency-Effectiveness (ThreeE) composite metric, which combines energy efficiency, inference efficiency, and task effectiveness into a transparent, tunable deployment score while still exposing the underlying axes. Instantiated on ~1.1B- and 7B-parameter chat models on consumer-grade GPUs, GreenTune shows that carefully configured 4-bit variants reduce energy consumption and CO2e per generated token by about 60% relative to 8-bit baselines, while increasing throughput by up to 3× and reducing p95 latency by more than half at comparable task accuracy. These results turn precision choices such as 4-bit vs. 8-bit from a narrow performance tweak into a system-level lever for meeting explicit carbon budgets and provide a model-agnostic path from power traces to actionable per-request and fleet-wide CO2e budgets for web-scale GenAI endpoints.
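The accounting steps named in the abstract (sample power, subtract an idle baseline, integrate, normalize per token, convert to CO2e) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the trapezoidal integration rule, the clamping of negative values, and the grid carbon intensity of 400 gCO2e/kWh are all assumptions made for illustration, and the abstract does not say which tool collects the power samples.

```python
from typing import Sequence

JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ


def request_energy_joules(times_s: Sequence[float],
                          power_w: Sequence[float],
                          idle_w: float) -> float:
    """Integrate idle-subtracted power over time (trapezoidal rule).

    Hypothetical helper: samples below the idle baseline are clamped
    to zero so that sensor noise cannot yield negative energy.
    """
    if len(times_s) != len(power_w):
        raise ValueError("times_s and power_w must have the same length")
    energy = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        p_prev = max(power_w[i - 1] - idle_w, 0.0)
        p_curr = max(power_w[i] - idle_w, 0.0)
        energy += 0.5 * (p_prev + p_curr) * dt
    return energy


def joules_per_token(energy_j: float, generated_tokens: int) -> float:
    """Normalize per-request energy by the number of generated tokens."""
    if generated_tokens <= 0:
        raise ValueError("generated_tokens must be positive")
    return energy_j / generated_tokens


def co2e_grams(energy_j: float, grid_g_per_kwh: float) -> float:
    """Convert device-side energy to grams of CO2e.

    grid_g_per_kwh is an assumed grid carbon intensity, not a value
    taken from the paper.
    """
    return energy_j / JOULES_PER_KWH * grid_g_per_kwh


# Illustrative trace for one request: 5 samples at 0.5 s spacing,
# with an assumed 60 W idle baseline and 50 generated tokens.
times = [0.0, 0.5, 1.0, 1.5, 2.0]
power = [60.0, 160.0, 180.0, 170.0, 60.0]
energy = request_energy_joules(times, power, idle_w=60.0)
per_token = joules_per_token(energy, generated_tokens=50)
grams = co2e_grams(energy, grid_g_per_kwh=400.0)
```

Repeating this calculation across requests grouped by output length would give a length-dependent Joules-per-token profile of the kind the abstract describes, and summing `co2e_grams` over the requests of a session would give a per-session estimate.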
| Original language | English |
|---|---|
| Title of host publication | WWW 2026 - Proceedings of the ACM Web Conference 2026 |
| Editors | Hakim HACID, Yoelle MAAREK |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 9397-9408 |
| Number of pages | 12 |
| ISBN (Electronic) | 9798400723070 |
| ISBN (Print) | 9798400723070 |
| DOIs | |
| Publication status | Published - 12 Apr 2026 |
| Event | ACM Web Conference 2026, Dubai, United Arab Emirates. Duration: 13 Apr 2026 → 17 Apr 2026 |
Conference
| Conference | ACM Web Conference 2026 |
|---|---|
| Abbreviated title | WWW '26 |
| Country/Territory | United Arab Emirates |
| City | Dubai |
| Period | 13/04/26 → 17/04/26 |
Bibliographical note
Publisher Copyright: © 2026 Owner/Author.
Funding
This work has been supported by Lingnan University through Faculty Research Grants (SDS24A2 and SDS24A12) and the Lam Woo Research Fund (No. LWP20040). This work is partly supported by the Major Key Project of PCL (Grant No. PCL2024A05) and the National Natural Science Foundation of China under grant No. 62403433.
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 7: Affordable and Clean Energy
Keywords
- Large Language Model
- Low-Rank Adaptation
- Environmental Awareness
- Environmental, social, and governance (ESG)
Fingerprint
Research topics of 'GreenTune: Energy-Efficient Low-Rank Tuning of LLMs with ThreeE Evaluation under 4-/8-bit Quantization'.