Some of our tests likely introduce bias because they rely on datasets that are commonly found online. We should add more tests that use datasets we synthesize ourselves or that are not available on the internet.
At least two of the new tests should supply sample documents and check that the model uses them correctly.
We should add at least 5 new tests in total.
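One possible shape for a sample-document test, as a rough sketch rather than a finished implementation: build the document from random tokens so its content cannot appear on the internet, then check that the model's answer contains the fact stated in the document. Everything named here is hypothetical (`make_synthetic_document`, `uses_sample_document`, the prompt layout, and the stand-in models); the real model interface is assumed to be a callable that takes a prompt string and returns a string.

```python
import random
import string
from typing import Callable, Tuple


def make_synthetic_document(rng: random.Random) -> Tuple[str, str, str]:
    """Return (document, question, expected_answer) built from random tokens.

    The project name and code are random, so the answer can only come
    from the document itself, not from anything published online.
    """
    name = "".join(rng.choices(string.ascii_uppercase, k=8))
    code = str(rng.randint(100_000, 999_999))
    document = f"Internal memo: the access code for project {name} is {code}."
    question = f"What is the access code for project {name}?"
    return document, question, code


def uses_sample_document(model: Callable[[str], str], seed: int) -> bool:
    """Check that the model's answer contains the fact from the sample document."""
    rng = random.Random(seed)  # seeded so a failing case can be reproduced
    document, question, expected = make_synthetic_document(rng)
    # Assumed prompt layout; adapt to however the real harness passes documents.
    prompt = f"Document:\n{document}\n\nQuestion: {question}"
    return expected in model(prompt)


# Stand-in models for demonstration only (assumptions, not the real model).
def echo_model(prompt: str) -> str:
    """Returns the prompt unchanged, so the document's fact is always present."""
    return prompt


def blind_model(prompt: str) -> str:
    """Ignores the document entirely."""
    return "I don't know."
```

Running the check over several seeds gives several independent cases from one helper, for example `all(uses_sample_document(model, seed) for seed in range(5))`. A substring match is a deliberately loose criterion; a stricter comparison may be needed if the model tends to quote the whole document back.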