Honest evaluation on small datasets
BLEU, F1, AUC, and confusion matrices say different things on a few thousand examples than on a million. A note on which numbers to trust, which to caveat, and which to ignore when you're working at student-project scale.