Who am I
- About 20 years of graphics programming experience
- 10 years professionally
- Former driver-developer at Falanx / ARM's Mali team
- Involved in the development of OpenGL ES 1.1 and 2.0
- Active open source contributor
- Lots of Git patches
- Linux, Android, ANGLE, Mesa, ...
- Demoscener
What is Grate
An effort to reverse engineer the Tegra GPU
...and eventually to create open source drivers for it.
Probably the furthest behind of the reverse-engineered ARM SoC driver efforts
Disclaimer
Everything in this presentation is based on reverse-engineering
Most information presented might be wrong
The history so far
- In the summer/fall of 2012, I got envious of the Lima guys, so I decided to start looking into Tegra
- Around FOSDEM 2013, I had:
- Command list capturing and parsing
- Envytools-style RNNDB descriptions of most OpenGL ES 2.0 non-shader state
- A very rough fragment shader disassembler
- Reverse-engineered the rough interface to the shader compiler
- Got bored with it
Enter Thierry
- A month later, Luc told me to get my ass on IRC
- Turns out, while I was procrastinating, Thierry Reding had picked up the ball:
- Linux DRI/KMS
- libDRM
- Got command-stream replay working
- Started on a DDX-driver
- Even did the initial work on a Gallium driver!
- Then Thierry got hired by NVIDIA to maintain the DRI driver
- I'm slowly trying to follow in his footsteps
- However, my biggest interest is reverse-engineering
Awesome work, Thierry!
Rob Clark also helped out a bit.
Tegra 2
- This is the core I've focused on
- Command stream dumping
- Basic rendering through command stream replay
- Can modify a lot of state by tampering with the command stream
- Upstream Linux DRM driver
- Downstream libDRM support
- Very, very unfinished downstream Mesa/Gallium driver
- Can only do glClear and glReadPixels with GR2D
Tegra 3
Replay seems to just work. Identical 3D core?
Tegra 4
- Some additional registers discovered
- Not strictly compatible?
- But modified Tegra 2 command-streams have been replayed
Tegra K1
- Kepler based
- Only the 3D core, lacks most other components of GeForce
- No work done
- Maybe something for Nouveau instead?
- Won't be covered further in this talk
Tegra 2 GPU overview
- Code-named AR20
- Immediate-mode renderer
- Consists of (at least) three components:
- GR2D
- GR3D
- Video
Clients are programmed through Host1x
- DMA engine for writing registers
- Proprietary OpenGL ES drivers
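The "DMA engine for writing registers" idea can be sketched as a tiny command-stream decoder. This is a minimal sketch, not the project's actual parser: it assumes the opcode layout used by the upstream Linux host1x driver headers (opcode in bits 28-31, register offset in bits 16-27, count or immediate value in bits 0-15), handles only three opcodes, and all function names are made up for illustration.

```python
# Hypothetical helper names; the bit layout is an assumption taken from
# the upstream Linux host1x driver headers.
OP_INCR, OP_NONINCR, OP_IMM = 1, 2, 4


def incr(offset, count):
    """Header word: write `count` words to consecutive registers."""
    return (OP_INCR << 28) | ((offset & 0xFFF) << 16) | (count & 0xFFFF)


def nonincr(offset, count):
    """Header word: write `count` words to the same register."""
    return (OP_NONINCR << 28) | ((offset & 0xFFF) << 16) | (count & 0xFFFF)


def imm(offset, value):
    """Single word: write a 16-bit immediate to one register."""
    return (OP_IMM << 28) | ((offset & 0xFFF) << 16) | (value & 0xFFFF)


def decode_writes(words):
    """Expand a list of 32-bit words into (register, value) writes."""
    writes = []
    i = 0
    while i < len(words):
        word = words[i]
        i += 1
        opcode = word >> 28
        offset = (word >> 16) & 0xFFF
        if opcode in (OP_INCR, OP_NONINCR):
            count = word & 0xFFFF
            if i + count > len(words):
                raise ValueError("truncated command stream")
            for n in range(count):
                reg = offset + n if opcode == OP_INCR else offset
                writes.append((reg, words[i + n]))
            i += count
        elif opcode == OP_IMM:
            writes.append((offset, word & 0xFFFF))
        else:
            raise ValueError(f"opcode {opcode} not handled in this sketch")
    return writes
```

Replaying a modified stream then amounts to editing the data words (or the list of writes) and submitting the result again.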
GR2D
- Documented in the publicly available TRM
- Requires signing up and agreeing to an EULA
- Example source code available
- Blits / fills / patterns
- Tiling / linear source and destination
- Stretching
- Rotation / flipping
- Blending
- CSAA resolve
- ROP3
- Lots more, see TRM
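For readers unfamiliar with ROP3: a ternary raster operation combines pattern, source and destination bits through an 8-entry truth table. The sketch below uses the conventional GDI-style numbering (pattern is bit 2 of the table index, source bit 1, destination bit 0). Whether GR2D encodes ROP3 codes the same way is an assumption here; the TRM is the authority.

```python
def rop3(code, pattern, source, dest, bits=32):
    """Apply an 8-bit ROP3 truth table to every bit position at once.

    Assumed (GDI-style) convention: for each bit position, the table
    index is (P << 2) | (S << 1) | D, and bit `index` of `code` gives
    the output bit.
    """
    mask = (1 << bits) - 1
    result = 0
    for index in range(8):
        if (code >> index) & 1:
            # Select positions where (P, S, D) matches this table row.
            p = pattern if index & 4 else ~pattern
            s = source if index & 2 else ~source
            d = dest if index & 1 else ~dest
            result |= p & s & d
    return result & mask
```

With this numbering, `0xCC` copies the source, `0xF0` copies the pattern, `0x66` is source XOR destination and `0x88` is source AND destination.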
GR3D
- Non-unified shader
- Performs blending in the fragment shader
- 16-bit depth buffer
- Tegra 4 also supports 24-bit depth
- 16 render targets (including depth/stencil)
- Occlusion queries
- Texturing:
- Floating-point textures
- Texture arrays
- Anisotropic filtering
- ETC1, S3TC, DXT1, LATC
- Non-pow2-ish textures
- GL_OES_standard_derivatives
- GL_NV_draw_path
Video
- No work so far
- I'm not a video expert
- Up for grabs!
Vertex shader ISA
- NV30 subset
- 4-component vector ALU
- Scalar SFU
- No control flow
- Straightforward to generate code for
- Share code with Nouveau?
Fragment shader ISA
- Registers are 1 x 20-bit floating-point or 2 x 10-bit fixed-point
- At least 3 separate instruction streams:
- ALU - Arithmetic/Logic Unit
- MFU - Multi-Function Unit
- Varying interpolation
- Complex function evaluation
- Not executed in the same clock?
- TEX - Texturing Unit
- EXPORT?
- Others? (import for spilling?)
- No control flow
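The dual view of a register (one 20-bit value or two 10-bit values) can be sketched as plain bit manipulation. Everything about the number format below is an assumption made for illustration: which half is the "low" lane, that the fixed-point lanes are two's complement, and the default of 8 fractional bits. None of it is confirmed by the slides.

```python
LANE_BITS = 10
LANE_MASK = (1 << LANE_BITS) - 1  # 0x3FF


def split_lanes(reg20):
    """Split a 20-bit register into (low, high) 10-bit lanes."""
    if not 0 <= reg20 < (1 << 20):
        raise ValueError("register value does not fit in 20 bits")
    return reg20 & LANE_MASK, (reg20 >> LANE_BITS) & LANE_MASK


def fixed_to_float(lane, frac_bits=8):
    """Interpret a 10-bit lane as two's complement fixed point (assumed)."""
    if lane & (1 << (LANE_BITS - 1)):
        lane -= 1 << LANE_BITS  # sign-extend
    return lane / (1 << frac_bits)


def float_to_fixed(value, frac_bits=8):
    """Round and clamp a float into a 10-bit lane (assumed format)."""
    raw = round(value * (1 << frac_bits))
    raw = max(-(1 << (LANE_BITS - 1)), min((1 << (LANE_BITS - 1)) - 1, raw))
    return raw & LANE_MASK
```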
Fragment shader ISA: ALU
- Pretty much understood
- Instructions come in packets
- Can perform 4 scalar ops per instruction packet
- Or 3 scalar ops with 2 x 20-bit / 4 x 10-bit embedded constants
- Glorified MAD
- 1 destination, 3 source operands
- D = A * B + C
- D = A * B + C * C
- D = (A + C) * B
- D += ...
- MIN/MAX/CSEL
- Predicate instructions
- Saturate result
- Absolute / negate source operands
- Scale source operands by 2, result by 0.5, 2 or 4
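The arithmetic listed above can be modelled in a few lines. This is a sketch of the semantics only, under stated assumptions: the instruction encoding is not modelled, the order in which modifiers apply (absolute, then negate, then scale on sources; result scale, then saturate) is a guess, saturate is taken to mean a clamp to [0, 1], and the names `Src`, `alu` and the form strings are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Src:
    """A source operand with its optional modifiers."""
    value: float
    absolute: bool = False
    negate: bool = False
    times2: bool = False

    def read(self):
        v = self.value
        if self.absolute:
            v = abs(v)
        if self.negate:
            v = -v
        if self.times2:
            v *= 2.0
        return v


def alu(form, a, b, c, result_scale=1.0, saturate=False):
    """Evaluate one scalar ALU op in one of the three MAD-like forms."""
    x, y, z = a.read(), b.read(), c.read()
    if form == "mad":        # D = A * B + C
        r = x * y + z
    elif form == "mad_sq":   # D = A * B + C * C
        r = x * y + z * z
    elif form == "add_mul":  # D = (A + C) * B
        r = (x + z) * y
    else:
        raise ValueError(f"unknown form: {form}")
    r *= result_scale        # 0.5, 2 or 4 according to the slide
    if saturate:
        r = min(max(r, 0.0), 1.0)
    return r
```

For example, `alu("mad", Src(2.0), Src(3.0), Src(1.0))` evaluates 2 * 3 + 1.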
Fragment shader ISA: MFU
- Probably based on "A High-Performance Area-Efficient Multifunction Interpolator", Oberman et al., 2005
- Complex function evaluation pretty much understood
- NOP, RCP, RSQ, LG2, EX2, SQRT, SIN, COS, FRC
- PREEX2, PRESIN, PRECOS
- Not unlike NVIDIA with two-step trig
- Varying write is still a mystery :(
- This is a major blocker
- Help, please!
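The "two-step trig" remark refers to splitting sine and cosine into a range-reduction step followed by evaluation on the reduced argument. The sketch below only illustrates that general idea; what PRESIN and PRECOS actually compute on this hardware is not stated in the slides, so the reduction to a fraction of a full turn is an assumption, as are the function names.

```python
import math

TWO_PI = 2.0 * math.pi


def presin(x):
    """Step 1 (assumed): reduce the angle to a fraction of a turn in [0, 1)."""
    turns = x / TWO_PI
    return turns - math.floor(turns)


def sin_reduced(t):
    """Step 2: evaluate sine on an argument already reduced to turns."""
    return math.sin(TWO_PI * t)


def cos_reduced(t):
    """Step 2: evaluate cosine on an argument already reduced to turns."""
    return math.cos(TWO_PI * t)


def two_step_sin(x):
    return sin_reduced(presin(x))


def two_step_cos(x):
    # The same reduction is reused here; a separate PRECOS step may differ.
    return cos_reduced(presin(x))
```

The point of the split is that the evaluation step only ever has to be accurate over one period.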
Fragment shader ISA: TEX
- Somewhat understood, but...
- Not clear where texture coordinates come from
- Progressing on this feels pointless until varying writes are understood
- Seems simple
- 2D texture and cube map lookups compile to bitwise-identical code
- No need to normalize cubemap inputs
Fragment shader ISA: EXPORT
- Render-target index found
- The rest is pretty much a mystery :(
TODO
- Finish/upstream libDRM patches
- X.org DDX driver
- GR2D is completely documented
- Helps harden the libDRM interface
- Reverse engineering
- Varying writes!!!
- Fragment shader exporting
- Register spilling
- ...
- Mesa / Gallium driver