Motivation
ByT5 is very expensive: byte-level tokenization makes sequences much longer, and the encoder has to carry a residual stream / hidden state for every single byte token.
MrT5
MrT5 adds a deletion gate: at pretraining time it softly masks "deleted" tokens out of the attention, and at inference time the cut is hard, i.e. the gated tokens are actually removed from the sequence.
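Rough sketch of that mechanism (PyTorch-style; the names `gate_logits`, `soft_delete_attention`, `hard_delete` and the threshold are mine for illustration, not from the paper):

```python
import torch
import torch.nn.functional as F

def soft_delete_attention(q, k, v, gate_logits):
    """Soft deletion during training: add a (very negative) gate value to the
    attention logits of every key position we want to 'delete', so those tokens
    are effectively ignored without changing the sequence length."""
    # gate_logits: (batch, seq), values in (-inf, 0]; ~0 keeps a token,
    # very negative values mask it out of the attention softmax.
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # (batch, seq, seq)
    scores = scores + gate_logits.unsqueeze(1)              # mask key columns
    return F.softmax(scores, dim=-1) @ v

def hard_delete(hidden, gate_logits, threshold=-10.0):
    """Hard deletion at inference: actually drop the gated tokens so later
    layers run on a shorter sequence (this is where the speedup comes from)."""
    keep = gate_logits > threshold                           # (batch, seq) bool
    # Simplification: assume batch size 1 so the sequence can shrink freely.
    return hidden[:, keep[0], :]
```

The point of the soft/hard split is that the soft mask keeps everything differentiable during pretraining, while the hard drop at inference is what actually shortens the sequence and saves compute.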
Cool: MrT5 learns language-dependent compression rates (different languages end up with different deletion rates).