text normalization
two main parts:
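tokenization: splitting raw text into tokens (words, punctuation marks)
lemmatization: mapping each word form to its dictionary form (lemma), e.g. "mice" -> "mouse"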
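
rough sketch of the two parts in Python, standard library only. assumptions, not from the notes: the regex token pattern, lowercasing before lookup, and the tiny hand-made LEMMAS table (hypothetical; a real lemmatizer would use a full dictionary and usually part-of-speech information). the function names tokenize, lemmatize and normalize are made up for illustration.

```python
import re

# Hypothetical toy lemma table, for illustration only. A real lemmatizer
# relies on a full dictionary and usually on part-of-speech tags.
LEMMAS = {
    "is": "be",
    "are": "be",
    "was": "be",
    "were": "be",
    "ran": "run",
    "running": "run",
    "mice": "mouse",
    "cats": "cat",
}

# Assumed token pattern: a run of word characters, or a single
# non-space, non-word character (punctuation).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Part 1: split text into lowercase word and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())


def lemmatize(token):
    """Part 2: look up the dictionary form; unknown tokens pass through unchanged."""
    return LEMMAS.get(token, token)


def normalize(text):
    """Run both parts in order: tokenize, then lemmatize each token."""
    return [lemmatize(tok) for tok in tokenize(text)]


if __name__ == "__main__":
    print(normalize("The mice were running."))
    # ['the', 'mouse', 'be', 'run', '.']
```

the lookup is done by dictionary rather than by stripping suffixes, since suffix stripping would be stemming, not lemmatization.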