“A Byte Level transformer, with some compression”
Key insight: put a [CLS] token in front of every word so a small encoder acts as a learned “tokenizer”, run a normal transformer over the [CLS] tokens, and then autoregressively decode the individual bytes back out.
Method
Hierarchical Autoregressive Transformers
We put a [CLS] token in front of every word, so the input looks like
[CLS] M y _ [CLS] n a m e _ [CLS] i s
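A minimal sketch of that input construction, assuming whitespace-delimited words and a reserved id 256 for [CLS] (both of these are my guesses, not spelled out above):

```python
CLS_ID = 256  # assumed: a reserved id just past the 0-255 byte range

def to_byte_ids_with_cls(text: str) -> list[int]:
    """Prefix every whitespace-delimited word with a [CLS] marker,
    then spell the word out as raw UTF-8 byte ids."""
    ids = []
    for word in text.split(" "):
        ids.append(CLS_ID)
        ids.extend((word + " ").encode("utf-8"))  # keep the separating space byte
    return ids

print(to_byte_ids_with_cls("My name is"))
# [256, 77, 121, 32, 256, 110, 97, 109, 101, 32, 256, 105, 115, 32]
```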
We then run a small encoder over each word's bytes and take the encoded [CLS] as that word's embedding. A normal (larger) transformer runs over the sequence of [CLS] embeddings, and a small decoder autoregressively spells out the next word's bytes from the backbone's output.
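A rough sketch of that three-stage forward pass; the layer counts, widths, and the trick of conditioning the byte decoder by prepending the backbone state are illustrative choices, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

VOCAB = 257   # 256 byte values + 1 [CLS] marker (assumed layout)
D = 64        # assumed model width

class HierarchicalByteLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        layer = lambda: nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.word_encoder = nn.TransformerEncoder(layer(), num_layers=2)  # small, per word
        self.backbone = nn.TransformerEncoder(layer(), num_layers=6)      # "normal" transformer over [CLS]s
        self.byte_decoder = nn.TransformerEncoder(layer(), num_layers=2)  # small, per word
        self.to_bytes = nn.Linear(D, VOCAB)

    def forward(self, words: list[torch.Tensor]) -> list[torch.Tensor]:
        # 1) encode each [CLS]-prefixed word with the small encoder,
        #    keep only the encoded [CLS] position as the word embedding
        cls_vecs = []
        for w in words:                                   # w: (word_len,) ids
            h = self.word_encoder(self.embed(w)[None])    # (1, word_len, D)
            cls_vecs.append(h[:, 0])                      # encoded [CLS]
        seq = torch.stack(cls_vecs, dim=1)                # (1, n_words, D)

        # 2) causal backbone over the sequence of word embeddings
        n = seq.size(1)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        ctx = self.backbone(seq, mask=causal)             # (1, n_words, D)

        # 3) for each position, decode the next word's bytes autoregressively,
        #    conditioned by prepending the backbone state to the target bytes
        logits = []
        for i in range(n - 1):
            tgt = self.embed(words[i + 1])[None]          # teacher-forced next word
            inp = torch.cat([ctx[:, i : i + 1], tgt], dim=1)
            m = inp.size(1)
            dm = torch.triu(torch.ones(m, m, dtype=torch.bool), diagonal=1)
            out = self.byte_decoder(inp, mask=dm)
            logits.append(self.to_bytes(out[:, :-1]))     # predict each next byte
        return logits

model = HierarchicalByteLM()
words = [torch.tensor([256, 77, 121, 32]),                 # [CLS] M y _
         torch.tensor([256, 110, 97, 109, 101, 32]),       # [CLS] n a m e _
         torch.tensor([256, 105, 115])]                    # [CLS] i s
logits = model(words)   # logits[i] scores the bytes of words[i + 1]
```

The point of the hierarchy is that the expensive backbone sees one position per word rather than one per byte, which is where the "some compression" in the title comes from.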
Dynamically Allocating Compute
Because the "words" are just whatever spans you put [CLS] tokens in front of, you can dynamically allocate compute: place more [CLS] tokens over a span to give the backbone more positions there, or fewer to compress it harder.
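One way this could look in code; the policy here (force an extra [CLS] every few bytes inside long words) is purely my guess at an allocation rule, not the paper's:

```python
CLS_ID = 256  # same reserved id as above

def to_byte_ids_dynamic(text: str, max_chunk: int = 8) -> list[int]:
    """Like to_byte_ids_with_cls, but long words are split into chunks of at
    most max_chunk bytes, each chunk getting its own [CLS] (and hence its
    own backbone position)."""
    ids = []
    for word in text.split(" "):
        raw = list((word + " ").encode("utf-8"))
        for start in range(0, len(raw), max_chunk):
            ids.append(CLS_ID)
            ids.extend(raw[start : start + max_chunk])
    return ids

# a 29-byte word gets 4 [CLS] tokens, i.e. 4 backbone positions
print(to_byte_ids_dynamic("antidisestablishmentarianism"))
```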