A little history behind the origins of this might be useful here. We created our own database of these "spamvertised sites" in order to pursue a related anti-spam effort. The data that SURBL is based on was originally created to provide a data service for Eric Kolve's SpamCopURI project. SpamCopURI is a SpamAssassin plugin that uses the SpamCop URI data to tag messages whose bodies contain spam-referenced URIs or domains.
Earlier, Eric's SpamCopURI plugin used our database in the form of the web directories under "domains", but then he changed it to query and cache the SpamCop data directly. Now he has added a method for SpamCopURI to use the SURBL data, so in a sense we've come full circle, meeting somewhere in the middle with SURBL. By getting the data in the form of our RBL, SpamCopURI leverages the data distribution efficiencies of DNS, which all RBLs similarly enjoy. URIDNSBL is a SpamAssassin 3 plugin that provides the urirhsbl command to compare message body URI domains with name-based RBLs such as SURBL. SA 3 installations can therefore also use SURBL + urirhsbl to score messages based on SURBL's spam URI domain data. Other programs are being adapted or developed to use SURBL.
Using an RBL to serve this data leverages existing networking and mail processing technologies, and may therefore be a promising way to block spam messages based on the web sites they mention. In particular, the DNS distribution and caching mechanism used by RBLs is highly efficient, and the code to make use of RBLs is widely available and understood. Even if our use of RBLs here is different from the norm, checking body domains against an RBL may be at least partially facilitated in the many existing mail agents and filters that are able to use RBLs more conventionally. In other words, not too many modifications to existing mail handling code should be necessary to use SURBL to block spam based on message body domains.
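As a rough illustration of how little code such a lookup needs, the sketch below checks a message body domain by prepending it to the zone name and resolving the result as an ordinary address record. The zone sc.surbl.org and the 127.0.0.2 answer are taken from the test points in the Notes; the function names and the injectable resolve argument are assumptions made for illustration and are not taken from SpamCopURI or URIDNSBL.

```python
import socket

SURBL_ZONE = "sc.surbl.org"  # zone name as it appears in the test points


def surbl_query_name(domain, zone=SURBL_ZONE):
    """Build the DNS name to look up for a message body domain."""
    return domain.strip().strip(".").lower() + "." + zone


def is_listed(domain, resolve=socket.gethostbyname, zone=SURBL_ZONE):
    """Return True if the domain resolves inside the zone.

    `resolve` is injectable (an assumption, for testing without a network);
    by default it performs a real DNS lookup.
    """
    try:
        answer = resolve(surbl_query_name(domain, zone))
    except OSError:
        # Name does not exist or the lookup failed: treat as not listed.
        return False
    # Listed entries answer with a loopback address such as 127.0.0.2.
    return answer.startswith("127.")
```

A caller would first reduce each URI in the message body to its base domain and then call is_listed on it; repeated lookups of the same name are answered from the local DNS cache.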
The scripts that power the database and SURBL creation grab data from SpamCop's "Spamvertised Web Sites" web page every couple of minutes or so, then merge new entries and expire the data so that it's never more than 4 days old. Due to the way the merge is performed, multiple reports of the same URI within the same minute are collapsed into a single report. However, when reports of a URI occur during different minutes, each of those minutes is recorded as a separate entry. These minute-unique entries are then counted, and only URIs that generate more than 10 such reports [1] (lowered from the original threshold of 20 [2]) are included in the SURBL data. This minimum-report threshold in particular greatly reduces false positives. Some of the top reported sites and counts can be seen at the current top sites page. (Note that the top sites page does not use quite the same data cutoff as SURBL, and "www." and related prefixes are folded in with the base domain in SURBL. Also, numeric "subdomains" like "10.20.30" of numeric URIs like "http://10.20.30.40/" are not included in SURBL. Only complete numeric addresses containing all four octets are included in SURBL, whereas they are additionally broken out per octet in the spamcheck data.)
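The merge, expiry and threshold steps described above can be sketched roughly as follows. This is not the project's actual script: the in-memory dictionary, the function names and the (timestamp, uri) report format are assumptions for illustration, while the 4-day age limit and the "more than 10" threshold come from the description.

```python
from collections import defaultdict
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=4)  # data is never more than 4 days old
THRESHOLD = 10               # more than 10 minute-unique reports are required


def merge_reports(store, reports):
    """Merge (timestamp, uri) reports into store, a dict of uri -> set of minutes.

    Reports of the same URI within the same minute collapse into one entry.
    """
    for timestamp, uri in reports:
        store[uri].add(timestamp.replace(second=0, microsecond=0))
    return store


def expire(store, now):
    """Drop minute entries older than MAX_AGE, and URIs left with none."""
    cutoff = now - MAX_AGE
    for uri in list(store):
        store[uri] = {minute for minute in store[uri] if minute >= cutoff}
        if not store[uri]:
            del store[uri]
    return store


def listed_uris(store):
    """Return the URIs whose minute-unique report count exceeds THRESHOLD."""
    return sorted(uri for uri, minutes in store.items() if len(minutes) > THRESHOLD)
```

Storing a set of minutes per URI makes the same-minute collapse automatic: no URI is ever dropped by the merge, only the count of duplicates within a minute.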
Data is grabbed often enough that none of the top 100 spamvertised sites mentioned at any given time on the SpamCop page should be missed, unless there happens to be an unusually large number of reports in a short period of time or the original spam timestamps are very old. Also, data is handled in such a way that every unique URI and domain found is recorded. The "compression" of duplicate reports that happen within the same minute is lossless in the sense that no URIs or domains are lost, even if some of the count of duplicates per minute is lost. In other words, there will always be at least one report of every unique URI and domain that occurs during any given minute of reporting that we can see.
Part of the reason we can comfortably argue that most of the domains listed in SURBL are genuine spam domains is the thresholding of the time-compressed data, meaning that many independent spam reports by SpamCop users are required in order to get a domain onto the list. Another feature to improve the quality of the data is a simple exclusion of some whitelisted domains. (We have a separate list of ccTLDs to filter out some problematic "short" TLDs like co.uk, com.pl, com.br. Source: http://www.bestregistrar.com/help/ccTLD.htm.) Looking at the resulting list of domains (this is live data) [3], few if any legitimate sites make it through the reporting threshold and the simple, short whitelist [4]. A live log of newly whitelisted domains is also visible now. We also gain the advantage of positive feedback: the more spammers promote their sites, the more likely they are to get reported and thus get onto this list. This is a democratic effect, improved by manual de-selection of legitimate domains by SpamCop users when they submit their reports. More reports mean more votes that a given site is indeed spam. The quality of the data is reinforced by the conscientious efforts of good people in reporting the spam. In this sense it is democracy in action.
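One way the prefix folding, ccTLD filtering and whitelisting could fit together is sketched below. The lists are tiny hypothetical excerpts (the whitelist entry is a placeholder), the function names are invented, and using the ccTLD list both to find the base domain and to keep bare entries like co.uk off the list is an assumption about how the real scripts use it.

```python
import re

# Hypothetical excerpts for illustration; the real lists are longer.
TWO_LEVEL_TLDS = {"co.uk", "com.pl", "com.br"}
WHITELIST = {"example.com"}  # placeholder entry, not from the real whitelist

IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def base_domain(host):
    """Fold "www." and other prefixes into the registered base domain."""
    host = host.lower().strip(".")
    if IPV4.match(host):
        return host  # complete four-octet addresses are kept whole
    labels = host.split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in TWO_LEVEL_TLDS:
        return ".".join(labels[-3:])  # e.g. spammer.co.uk
    return ".".join(labels[-2:])


def candidate(host):
    """Return the base domain to count, or None if it must never be listed."""
    domain = base_domain(host)
    if domain in WHITELIST or domain in TWO_LEVEL_TLDS:
        return None
    return domain
```

Filtering before counting means a whitelisted domain can never reach the threshold, however many reports mention it.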
More detailed information about how the data is handled and a tarball of the scripts and source code used can be found at the home page of that project at: spamcheck.freeapp.net
We are now working with SpamCop to get the spam URI data directly from them and compose SURBL from it. SpamCop or any other source could also compile their own RBL of spam domains, with similar thresholding and whitelisting principles. Until that happens, SURBL may be a useful way to make use of the data. Thus far, I am not aware of any spam message body domain data being made available as an automated RBL. This is somewhat surprising considering how useful this approach can be, as given in some of the reasons above.
Notes
Name: test.sc.surbl.org.sc.surbl.org
Address: 127.0.0.2
Similarly, at Eric's behest, the three-level domain name, test.surbl.org, represented by the SURBL entry test.surbl.org.sc.surbl.org will always resolve:
Name: test.surbl.org.sc.surbl.org
Address: 127.0.0.2
Note, however, that the three and four-level domains above won't work with most programs which reduce URIs to their base domains. So use the following two-level domain for testing instead.
At Justin Mason's suggestion we've changed the example.com test point to use a more obscure domain, surbl-org-permanent-test-point.com. Use this two-level domain in a message body with something like "http://" to test if a message is getting correctly scored.
Name: surbl-org-permanent-test-point.com.sc.surbl.org
Address: 127.0.0.2
We have added permanent numeric testpoints:
Name: 2.0.0.127.sc.surbl.org
Address: 127.0.0.2
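A small script along the following lines could be used to confirm that the test points still resolve. The query names are those quoted above; the numeric test point is assumed to be the address 127.0.0.2 with its octets in reverse order, as is conventional for address-based RBLs. The helper names and the injectable resolve argument are hypothetical.

```python
import ipaddress
import socket

ZONE = "sc.surbl.org"

# Entries behind the test points quoted in these notes.
TEST_POINTS = [
    "test.sc.surbl.org",
    "test.surbl.org",
    "surbl-org-permanent-test-point.com",
    "127.0.0.2",
]


def query_name(entry, zone=ZONE):
    """Build the lookup name; numeric addresses get their octets reversed."""
    try:
        address = ipaddress.IPv4Address(entry)
    except ValueError:
        # Not a numeric address: use the domain name as-is.
        return entry.lower().strip(".") + "." + zone
    octets = str(address).split(".")
    return ".".join(reversed(octets)) + "." + zone


def check_test_points(resolve=socket.gethostbyname):
    """Map each test point query name to its answer, or None on failure."""
    results = {}
    for entry in TEST_POINTS:
        name = query_name(entry)
        try:
            results[name] = resolve(name)
        except OSError:
            results[name] = None
    return results
```

With a working resolver, every value in the returned dictionary would be expected to be 127.0.0.2; a None indicates that the lookup failed.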