David Patterson's visit to UW
Table of Contents
- Student Meetings
- Talk
- TPU
- Other companies
- 10 lessons learned
- DNN Model Growth
- DNN Workloads evolve with ML breakthroughs
- Can optimize DNN as well as compiler and hardware
- Inference SLO limit is P99 latency rather than batch size
- Production inference needs multi-tenancy
- Energy limits modern chips
- DSA optimize for domain while being flexible
- Unequal changes in semiconductor technology
- Maintain Compiler optimizations and ML compatibility
- Optimize Perf/TCO vs. Perf/CapEx
- Carbon emissions
- Recommendations for ML research and practice
Student Meetings
Why shift from RISC-V to the current project? (Anna)
Industry: build hardware, deploy it, and then publish the paper.
Not too much responsibility.
More like an individual contributor; enjoyed the time, especially during the pandemic.
Meaning of RISC-V (Katie)
NVIDIA's inference chip is open source (NVDLA).
Alibaba has also open-sourced designs.
Universities can benefit from an open-source instruction set.
Any new accelerator domains? (Tapan)
Moore's law is slowing.
Accelerators are a way to compensate for that.
General-purpose processors scale out with chiplets.
Machine learning is exciting, but there are more domains open for innovation.
Software stack changes (Ani)
Libraries force software people to change the way they program.
What are the appropriate APIs for hardware people to expose to programmers?
Metric of "good" hardware (Luis)
Energy efficiency, cost-performance, security.
Security (speculation) (Anna)
Intel is good at turning transistors into performance.
TPUs do not have caches.
Software-Hardware Separation (Zihao)
A DSA paper needs to understand the whole stack: the models and the applications.
Hardware and model people must work together.
Danger: building the wrong thing.
Talk
TPU
Started in 2013; development was fast, and it was running in data centers after 15 months.
Announced publicly in 2016.
Other companies
Intel has acquired DSA companies over the years; Alibaba and Amazon have developed their own inference chips.
10 lessons learned
DNN Model Growth
1.5x per year.
1 year to design, 1 year to deploy, 3 years of serving.
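Back-of-envelope: that lifetime spans ~5 years of 1.5x/year growth, i.e. 1.5^5 ≈ 7.6x, so a chip must be provisioned for models several times larger than those that exist at design time.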
DNN Workloads evolve with ML breakthroughs
BERT
Can optimize DNN as well as compiler and hardware
Platform-aware AutoML.
Inference SLO limit is P99 latency rather than batch size
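A toy Python sketch of the point (all latency numbers below are invented for illustration, not from the talk): serving picks the largest batch whose P99 latency still meets the SLO, rather than simply maximizing batch size.

```python
import numpy as np

rng = np.random.default_rng(0)
slo_ms = 10.0  # hypothetical P99 latency SLO

def p99_latency_ms(batch):
    # Hypothetical latency model: fixed cost + per-example cost + long tail.
    samples = 1.5 + 0.08 * batch + rng.exponential(0.5, size=10_000)
    return np.percentile(samples, 99)

# Largest batch that still meets the SLO at the 99th percentile.
best = max(b for b in (1, 2, 4, 8, 16, 32, 64, 128)
           if p99_latency_ms(b) <= slo_ms)
print("largest SLO-compliant batch:", best)
```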
Production inference needs multi-tenancy
Energy limits modern chips
Memory accesses dominate energy, not FLOPs.
A systolic array is energy efficient:
~100,000 ALUs amortize the energy of each memory access.
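A minimal Python sketch (my illustration, not TPU code) of why the systolic organization saves energy: each weight is fetched from memory once and then reused for many multiply-accumulates as inputs stream past it.

```python
import numpy as np

def systolic_matmul(A, B):
    """Weight-stationary systolic-style matmul: B is 'pinned' in the array;
    rows of A stream through, so each fetched weight feeds many MACs."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    out = np.zeros((n, m))
    for i in range(k):
        for j in range(m):
            w = B[i, j]               # one memory fetch ...
            out[:, j] += A[:, i] * w  # ... amortized over n MACs
    return out

A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```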
DSA optimize for domain while being flexible
TPUv1 -> TPUv2
TPUv2 was made more flexible.
Unequal changes in semiconductor technology
TPUv2 -> TPUv3
2x MXUs
+30% freq
+30% b/w
2x capacity
Maintain Compiler optimizations and ML compatibility
XLA -> HLO ops (machine-independent) and LLO ops (machine-dependent)
XLA manages all memory transfers.
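A small sketch of this split using JAX (assuming a recent JAX version; LLO itself is TPU-internal and not exposed): a jit-compiled program is first lowered to a machine-independent IR (HLO/StableHLO), which XLA optimizes before the machine-dependent compile.

```python
import jax
import jax.numpy as jnp

def f(x, w):
    return jnp.tanh(x @ w)

x = jnp.ones((8, 16))
w = jnp.ones((16, 4))

lowered = jax.jit(f).lower(x, w)
print(lowered.as_text())      # machine-independent IR, what XLA optimizes
compiled = lowered.compile()  # machine-dependent compilation step
print(compiled(x, w).shape)   # (8, 4)
```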
Backward compatibility (floating-point associativity).
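Why associativity matters for compatibility, in a few lines of Python: reordering a floating-point reduction changes the result, so a compiler or chip that re-associates sums can break bit-for-bit reproducibility.

```python
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                 # 0.6000000000000001
print(a + (b + c))                 # 0.6
print((a + b) + c == a + (b + c))  # False: reassociation changes the bits
```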
Optimize Perf/TCO vs. Perf/CapEx
TCO: total cost of ownership
CapEx: capital expenditure (purchase cost)
OpEx: operating expenditure (running cost)
Carbon emissions
kWh = hours to train × #processors × average power per processor × PUE (power usage effectiveness)
Reduced energy ~100x and CO2e ~1000x (see the back-of-envelope sketch after the list below).
4Ms for ML Energy Efficiency:
- Model (Transformer to Primer) 4x
- Machine (P100 to TPUv4) 14x
- Mechanization (datacenter efficiency) 1.4x
- Maps (geographical location, energy source) 9x
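A back-of-envelope sketch tying the formula and the 4M factors together (the training-run numbers below are invented placeholders, not from the talk; the 4M multipliers are the ones listed above):

```python
# kWh = hours to train * #processors * avg power per processor * PUE
hours = 120            # hypothetical training run
processors = 512       # hypothetical chip count
avg_power_kw = 0.3     # hypothetical average power per chip, in kW
pue = 1.1              # datacenter power usage effectiveness
kwh = hours * processors * avg_power_kw * pue
print(f"energy: {kwh:,.0f} kWh")  # ~20,275 kWh

# The first three Ms cut energy; Maps additionally cuts CO2e per kWh.
energy_gain = 4 * 14 * 1.4   # Model * Machine * Mechanization ~= 78x ("~100x")
co2e_gain = energy_gain * 9  # * Maps ~= 706x ("~1000x")
print(f"energy reduction ~{energy_gain:.0f}x, CO2e reduction ~{co2e_gain:.0f}x")
```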
4Ms for NLP: the GLaM paper, which uses MoE (mixture of experts).
ML at Google is <= 15% of overall energy use.
Statistics from the last three years:
Each year, ~3/5 of ML energy goes to inference and ~2/5 to training.
DNNs were 70-80% of FLOPs yet only 10-15% of energy.
Dire ML estimates were faulty
Some published papers on NAS cost were faulty.
Open question: how do we correct published papers?
Recommendations for ML research and practice
4Ms
- Model: better models in terms of cost
- Machine: better hardware (e.g., sparse systolic arrays)
- Mechanization: datacenter efficiency
- Maps: use the greenest data centers