
SpamAssassin Corpus Policy
--------------------------

SpamAssassin relies on corpus data to generate good scores.  Here's the policy
we use to judge if a corpus is "good" or not.  It should be:

  - hand-verified as "spam" and "nonspam" piles -- *not* just classified using
    existing spam-classification algorithms (such as SpamAssassin itself)

  - containing a representative mix of non-spam mail -- that includes
    commercial-sounding-but-non-spam messages, legitimate business discussion
    (which may include talk of "sales", "marketing", "offers" etc), or verified
    opt-in mail newsletters. This is a *very* important point!

  - cleaned of virii, and forwarded spam messages.  These will skew the
    results.

  - and finally, cleaned of discussion of spam or virus messages or signatures
    (such as SpamAssassin-talk or bugtraq mailing list messages).  Even though
    they are non-spam, these often contain snippets of code that incorrectly
    trigger tests, and again will skew the results.  (Rewriting the tests to
    avoid triggering on SpamAssassin-talk messages is not realistic!)

Once you run "mass-check" on a corpus, see the instructions in "CORPUS_SUBMIT"
for details of how to verify that the top scorers are not accidental spam that
got through.


lastmod: Aug 12 2002 jm 
