What Is Spam?
You have probably seen an increase in the amount of junk mail that shows up in your email inbox or on your favorite newsgroup. The activities of a small number of people are becoming a bigger problem for the Internet.
Chain letters that ask for money, whether in exchange for "reports" or outright, are illegal in the US whether they arrive by postal mail or by email. Report these frauds to your local US Postmaster. You may also see email that claims to come from Nigeria or another African country, sent by someone who wants to use your bank account to transfer 20 million dollars. This is called a ‘419’ scam, after the section of the Nigerian criminal code that covers fraud, and people have been killed over it.
Spam is flooding the Internet with many copies of the same message, in an attempt to force the message on people who would not otherwise choose to receive it. Most spam is commercial advertising, often for dubious products, get-rich-quick schemes, or quasi-legal services. Spam costs the sender very little to send: most of the costs are borne by the recipient or the carriers rather than by the sender.
To the recipient, spam is easily recognizable. If you hired someone to read your mail and discard the spam, they would have little trouble doing it. How much do we have to do, short of AI, to automate this process? I think we will be able to solve the problem with fairly simple algorithms. In fact, I’ve found that you can filter present-day spam acceptably well using nothing more than a Bayesian combination of the spam probabilities of individual words. Using a slightly tweaked (as described below) Bayesian filter, we now miss fewer than 5 per 1000 spams, with 0 false positives.
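To make the idea of combining per-word spam probabilities concrete, here is a minimal sketch in Python. It is not the tweaked filter mentioned above, only an illustration under stated assumptions: the function names (tokenize, train, spam_probability), the add-one smoothing, and the neutral 0.5 score for unseen words are all choices made for this example, and the words in a message are treated as independent.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def train(spam_messages, ham_messages):
    """Estimate P(spam | word) for every word seen in the training messages."""
    # Count each word at most once per message.
    spam_counts = Counter(w for m in spam_messages for w in set(tokenize(m)))
    ham_counts = Counter(w for m in ham_messages for w in set(tokenize(m)))
    n_spam, n_ham = len(spam_messages), len(ham_messages)
    probs = {}
    for word in set(spam_counts) | set(ham_counts):
        # Assumption: add-one smoothing, so no frequency is exactly 0 or 1.
        spam_freq = (spam_counts[word] + 1) / (n_spam + 2)
        ham_freq = (ham_counts[word] + 1) / (n_ham + 2)
        probs[word] = spam_freq / (spam_freq + ham_freq)
    return probs


def spam_probability(text, probs, unknown=0.5):
    """Combine per-word probabilities, assuming the words are independent."""
    log_spam = 0.0
    log_ham = 0.0
    for word in set(tokenize(text)):
        # Assumption: words never seen in training are treated as neutral.
        p = probs.get(word, unknown)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Same as prod(p) / (prod(p) + prod(1 - p)), computed in log space.
    # The difference is clamped so math.exp cannot overflow on long messages.
    diff = max(-700.0, min(700.0, log_ham - log_spam))
    return 1.0 / (1.0 + math.exp(diff))
```

A message would then be flagged when spam_probability(message, probs) exceeds some cutoff, for example 0.9. The cutoff is also an assumption of this sketch and would need tuning against real mail, since a false positive (a legitimate message discarded) is far more costly than a missed spam.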
One particularly nasty variant of email spam is sending spam to mailing lists (public or private email discussion forums). Because many mailing lists limit activity to their subscribers, spammers use automated tools to subscribe to as many lists as possible, so that they can harvest the lists of addresses or use the mailing list itself as a direct target for their attacks.