David Patterson's visit to UW
Table of Contents
- Student Meetings
- Talk
- TPU
- Other companies
- 10 lessons learned
- DNN Model Growth
- DNN Workloads evolve with ML breakthroughs
- Can optimize DNN as well as compiler and hardware
- Inference SLO limit is P99 latency rather than batch size
- Production inference needs multi-tenancy
- Energy limits modern chips
- DSA optimize for domain while being flexible
- Unequal changes in semiconductor technology
- Maintain Compiler optimizations and ML compatibility
- Optimize Perf/TCO vs. Perf/CapEx
- Carbon emissions
- Recommendations for ML research and practice
Student Meetings
Why shift from RISC-V to the current project? (Anna)
Industry: build hardware, deploy it, and then publish the paper.
Not too much responsibility.
More like an individual contributor; enjoyed the time, especially during the pandemic.
Meaning of RISC-V (Katie)
NVIDIA's inference chip is open source (NVDLA).
Alibaba has also open-sourced designs.
Universities can benefit from an open-source instruction set.
Any new accelerator domains? (Tapan)
Moore's law is slowing.
Accelerators are a way to compensate for that.
General-purpose processors scale out with chiplets.
Machine learning is exciting, but there are more domains open for innovation.
Software stack changes (Ani)
Libraries force software people to change the way they program.
What are the appropriate APIs for hardware people to expose to programmers?
Metric of "good" hardware (Luis)
Energy efficiency, cost-performance, security.
Security (speculation) (Anna)
Intel is good at turning transistors into performance.
TPUs do not have caches.
Software-Hardware Separation (Zihao)
A DSA paper needs to understand the whole stack: the models and the applications.
Hardware and model people must work together.
Danger: building the wrong thing.
Talk
TPU
Started in 2013; development was fast, and it was running in data centers after 15 months.
Announced publicly in 2016.
Other companies
Intel has acquired DSA companies over the years; Alibaba and Amazon have developed their own inference chips.
10 lessons learned
DNN Model Growth
1.5x per year.
1 year to design, 1 year to deploy, 3 years of serving.
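Back-of-envelope: that lifetime spans ~5 years of 1.5x/year growth, i.e. 1.5^5 ≈ 7.6x, so a chip must be provisioned for models several times larger than those that exist at design time.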
DNN Workloads evolve with ML breakthroughs
BERT
Can optimize DNN as well as compiler and hardware
Platform-aware AutoML.
Inference SLO limit is P99 latency rather than batch size
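A toy Python sketch of the point (all latency numbers below are invented for illustration, not from the talk): serving picks the largest batch whose P99 latency still meets the SLO, rather than simply maximizing batch size.

```python
import numpy as np

rng = np.random.default_rng(0)
slo_ms = 10.0  # hypothetical P99 latency SLO

def p99_latency_ms(batch):
    # Hypothetical latency model: fixed cost + per-example cost + long tail.
    samples = 1.5 + 0.08 * batch + rng.exponential(0.5, size=10_000)
    return np.percentile(samples, 99)

# Largest batch that still meets the SLO at the 99th percentile.
best = max(b for b in (1, 2, 4, 8, 16, 32, 64, 128)
           if p99_latency_ms(b) <= slo_ms)
print("largest SLO-compliant batch:", best)
```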
Production inference needs multi-tenancy
Energy limits modern chips
Memory accesses dominate energy, not FLOPs.
A systolic array is energy efficient:
~100,000 ALUs amortize the energy of each memory access.
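A minimal Python sketch (my illustration, not TPU code) of why the systolic organization saves energy: each weight is fetched from memory once and then reused for many multiply-accumulates as inputs stream past it.

```python
import numpy as np

def systolic_matmul(A, B):
    """Weight-stationary systolic-style matmul: B is 'pinned' in the array;
    rows of A stream through, so each fetched weight feeds many MACs."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    out = np.zeros((n, m))
    for i in range(k):
        for j in range(m):
            w = B[i, j]               # one memory fetch ...
            out[:, j] += A[:, i] * w  # ... amortized over n MACs
    return out

A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```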
DSA optimize for domain while being flexible
TPUv1 -> TPUv2
TPUv2 was made more flexible.
Unequal changes in semiconductor technology
TPUv2 -> TPUv3
2x MXUs
+30% freq
+30% b/w
2x capacity
Maintain Compiler optimizations and ML compatibility
XLA -> HLO ops (machine-independent) and LLO ops (machine-dependent)
XLA manages all memory transfers.
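A small sketch of this split using JAX (assuming a recent JAX version; LLO itself is TPU-internal and not exposed): a jit-compiled program is first lowered to a machine-independent IR (HLO/StableHLO), which XLA optimizes before the machine-dependent compile.

```python
import jax
import jax.numpy as jnp

def f(x, w):
    return jnp.tanh(x @ w)

x = jnp.ones((8, 16))
w = jnp.ones((16, 4))

lowered = jax.jit(f).lower(x, w)
print(lowered.as_text())      # machine-independent IR, what XLA optimizes
compiled = lowered.compile()  # machine-dependent compilation step
print(compiled(x, w).shape)   # (8, 4)
```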
Backward compatibility (floating-point associativity).
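Why associativity matters for compatibility, in a few lines of Python: reordering a floating-point reduction changes the result, so a compiler or chip that re-associates sums can break bit-for-bit reproducibility.

```python
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                 # 0.6000000000000001
print(a + (b + c))                 # 0.6
print((a + b) + c == a + (b + c))  # False: reassociation changes the bits
```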
Optimize Perf/TCO vs. Perf/CapEx
TCO: total cost of ownership
CapEx: capital expenditure (purchase cost)
OpEx: operating expenditure (running cost)
Carbon emissions
kWh = hours to train × #processors × average power per processor × PUE (power usage effectiveness)
Reduced energy ~100x and CO2e ~1000x (see the back-of-envelope sketch after the list below).
4Ms for ML Energy Efficiency:
- Model (Transformer to Primer) 4x
- Machine (P100 to TPUv4) 14x
- Mechanization (datacenter efficiency) 1.4x
- Maps (geographical location, energy source) 9x
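A back-of-envelope sketch tying the formula and the 4M factors together (the training-run numbers below are invented placeholders, not from the talk; the 4M multipliers are the ones listed above):

```python
# kWh = hours to train * #processors * avg power per processor * PUE
hours = 120            # hypothetical training run
processors = 512       # hypothetical chip count
avg_power_kw = 0.3     # hypothetical average power per chip, in kW
pue = 1.1              # datacenter power usage effectiveness
kwh = hours * processors * avg_power_kw * pue
print(f"energy: {kwh:,.0f} kWh")  # ~20,275 kWh

# The first three Ms cut energy; Maps additionally cuts CO2e per kWh.
energy_gain = 4 * 14 * 1.4   # Model * Machine * Mechanization ~= 78x ("~100x")
co2e_gain = energy_gain * 9  # * Maps ~= 706x ("~1000x")
print(f"energy reduction ~{energy_gain:.0f}x, CO2e reduction ~{co2e_gain:.0f}x")
```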
4Ms for NLP: the GLaM paper, which uses MoE (mixture of experts).
ML at Google is <= 15% of overall energy use.
Statistics from the last three years:
Each year, ~3/5 of ML energy goes to inference and ~2/5 to training.
DNNs were 70-80% of FLOPs yet only 10-15% of energy.
Dire ML estimates were faulty
Some published papers on NAS cost were faulty.
Open question: how do we correct published papers?
Recommendations for ML research and practice
4Ms
- Model: better models in terms of cost
- Machine: better hardware (e.g., sparse systolic arrays)
- Mechanization: datacenter efficiency
- Maps: use the greenest data centers