BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//Silicon Valley Engineering Council - ECPv6.15.20//NONSGML v1.0//EN
CALSCALE:GREGORIAN
METHOD:PUBLISH
X-WR-CALNAME:Silicon Valley Engineering Council
X-ORIGINAL-URL:https://svec.org
X-WR-CALDESC:Events for Silicon Valley Engineering Council
REFRESH-INTERVAL;VALUE=DURATION:PT1H
X-Robots-Tag:noindex
X-PUBLISHED-TTL:PT1H
BEGIN:VTIMEZONE
TZID:America/Los_Angeles
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:20240310T020000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:20241103T020000
END:STANDARD
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:20250309T020000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:20251102T020000
END:STANDARD
BEGIN:DAYLIGHT
TZOFFSETFROM:-0800
TZOFFSETTO:-0700
TZNAME:PDT
DTSTART:20260308T020000
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0700
TZOFFSETTO:-0800
TZNAME:PST
DTSTART:20261101T020000
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTART;TZID=America/Los_Angeles:20251101T090000
DTEND;TZID=America/Los_Angeles:20251101T103000
DTSTAMP:20260424T042538Z
CREATED:20251021T123428Z
LAST-MODIFIED:20251021T123428Z
UID:77301-1761987600-1761993000@svec.org
SUMMARY:High Performance Inferencing for LLMs
DESCRIPTION:Inferencing has become ubiquitous across cloud\, regional\, edge\, and device environments\, powering a wide spectrum of AI use cases spanning vision\, language\, and traditional machine learning applications. In recent years\, Large Language Models (LLMs)\, initially developed for natural language tasks\, have expanded to multimodal applications including vision\, speech\, reasoning\, and planning\, each demanding distinct service-level objectives (SLOs). Achieving high-performance inferencing for such diverse workloads requires both model-level and system-level optimizations.\nThis talk focuses on system-level optimization techniques that maximize token throughput and minimize cost per token while meeting user-experience metrics and maintaining inference service-provider efficiency. We review several recent innovations\, including KV caching\, Paged/Flash/Radix Attention\, Speculative Decoding\, and KV Routing\, and explain how these mechanisms enhance performance by reducing latency\, memory footprint\, and compute overhead. These techniques are implemented in leading open-source inference frameworks such as vLLM\, SGLang\, Hugging Face TGI\, and NVIDIA NIM\, which form the backbone of large-scale public and private LLM serving platforms.\nThe use of GPU training\, inference\, and analysis clusters with Multi-Instance GPUs (MIG)\, and of federated models with QML applications\, has now become practical.\nAttendees will gain a practical understanding of the challenges in delivering scalable\, low-latency LLM inference\, and of the architectural and algorithmic innovations driving next-generation high-performance inference systems.\nCo-sponsored by: eMerging Open Tech Foundation\nSpeaker(s): Dr. Ravishankar Ravindran\nAgenda:\n– Introduction to INGR with AIML & QIT working groups (Baw Chng + Prakash Ramchandran) – 10 min\n– High Performance Inferencing for LLMs – Dr. Ravishankar Ravindran (Tech. Director\, eOTF – Advisory) – 60 min\n– Q&A – 20 min\nVirtual: https://events.vtools.ieee.org/m/508671
URL:https://svec.org/event/high-performance-inferencing-for-llms/
LOCATION:Virtual: https://events.vtools.ieee.org/m/508671
END:VEVENT
END:VCALENDAR