Talks
Downsides of Subword Tokenization
- not learned end to end: the vocabulary is fixed up front, so it can’t adapt to input difficulty
- non-smoothness: similar inputs get mapped to very different token sequences (see the first sketch after this list)
  - [token][ization]
  - typo: [token][zi][ation] <- suddenly a much worse segmentation despite a one-character typo
- huge vocabs: embedding and output matrices scale with vocabulary size (modern vocabs run to 100k+ entries)
- non-adaptive compression ratio: you can’t choose how much to compress, so FLOPs per document are set by the tokenizer, not the model (see the second sketch below)
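
A minimal sketch of the non-smoothness point, assuming the `tiktoken` library and its `cl100k_base` vocabulary (an illustrative choice, not the talk's); the exact splits depend on the vocab, but a one-character typo typically fragments a word into more, rarer pieces:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# "tokenziation" is "tokenization" with the "iz" swapped
for word in ["tokenization", "tokenziation"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]  # show each subword piece
    print(f"{word!r} -> {pieces} ({len(ids)} tokens)")
```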
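
And a rough sketch of the huge-vocab and fixed-compression points with the same assumed tokenizer: the vocabulary size is baked in, and chars-per-token is whatever the learned merges happen to give you, with no knob to trade tokens for compute on a given document:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print(f"vocab size: {enc.n_vocab}")  # ~100k rows in the embedding and output matrices

text = "Subword tokenizers compress text at a ratio fixed by their training corpus."
ids = enc.encode(text)
ratio = len(text) / len(ids)  # compression ratio is a property of the vocab, not a choice
print(f"{len(text)} chars -> {len(ids)} tokens (~{ratio:.2f} chars/token)")
```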