sc.surbl.org Data

Source data for the sc.surbl.org spam URI list comes from reports sent to SpamCop.

Some history behind the origins of this might be useful here. We created our own database of these "spamvertised sites" in order to pursue a related anti-spam effort. The data SURBL is based on was originally created to provide a data service for Eric Kolve's SpamCopURI project. SpamCopURI is a SpamAssassin plugin that uses the SpamCop URI data to tag messages whose bodies contain spam-referenced URIs or domains.

Earlier, Eric's SpamCopURI plugin used our database in the form of the web directories under "domains", but he later changed it to query and cache the SpamCop data directly. He has now added a method for SpamCopURI to use the SURBL data, so in a sense we've come full circle, meeting somewhere in the middle with SURBL. By getting the data in the form of our RBL, SpamCopURI benefits from the data distribution efficiencies of DNS, which all RBLs enjoy. URIDNSBL is a SpamAssassin 3 plugin whose urirhsbl command compares message body URI domains against name-based RBLs such as SURBL. SA 3 installations can therefore also use SURBL with urirhsbl to score messages based on SURBL's spam URI domain data. Other programs are being adapted or developed to use SURBL.


Using an RBL to serve this data leverages existing networking and mail processing technologies, and may therefore be a promising way to block spam messages based on the web sites they contain. In particular, the DNS distribution and caching mechanism used by RBLs is highly efficient, and the code to make use of RBLs is widely available and understood. Even though our use of RBLs here differs from the norm, checking body domains may be at least partially supported already in the many existing mail agents and filters that are able to use RBLs more conventionally. In other words, not too many modifications to existing mail handling code should be necessary to use SURBL to block spam based on message body domains.
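As a rough illustration of how little code a lookup needs, here is a minimal Python sketch. It assumes the usual RBL convention that a listed name resolves and an unlisted one does not; the function names query_name and is_listed are made up for this example and are not part of any SURBL software.

```python
import socket

ZONE = "sc.surbl.org"  # zone name as given on this page


def query_name(domain, zone=ZONE):
    """Build the DNS name to look up for a message-body domain."""
    return f"{domain.strip('.').lower()}.{zone}"


def is_listed(domain, zone=ZONE):
    """Return True if the name resolves (assumed to mean 'listed')."""
    try:
        socket.gethostbyname(query_name(domain, zone))
        return True
    except socket.gaierror:
        # No such name: treated here as 'not listed'.
        return False
```

A mail filter would call is_listed() on each base domain it extracts from a message body; the resolver's normal DNS caching then does the rest.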

The scripts that power the database and create SURBL grab data from SpamCop's "Spamvertised Web Sites" web page every couple of minutes, then merge new entries and expire old ones so that the data is never more than 4 days old. Because of the way the merge is performed, multiple reports of the same URI within the same minute are collapsed into a single report. However, when a URI is reported during different minutes, each of those minutes is recorded as a separate entry. These minute-unique entries are then counted, and only URIs that generate more than 10 such reports [1] (lowered from the original threshold of 20 [2]) are included in the SURBL data. This minimum-report threshold in particular greatly reduces false positives. Some of the top reported sites and counts can be seen at the current top sites page. (Note that the top sites page does not use quite the same data cutoff as SURBL, and that "www." and related prefixes are folded in with the base domain in SURBL. Also, numeric "subdomains" like "10.20.30" of numeric URIs like "http://10.20.30.40/" are not included in SURBL. Only complete numeric addresses containing all four octets are included in SURBL, whereas they are additionally broken out per octet in the spamcheck data.)
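The merge, expiry and threshold steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual spamcheck scripts: the input format (timestamp, URI pairs) and the name build_list are invented here, and the cutoff is read as "strictly more than 10".

```python
from collections import Counter
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=4)  # data is never more than 4 days old
THRESHOLD = 10               # "more than 10" minute-unique reports


def build_list(reports, now):
    """reports: iterable of (datetime, uri) pairs (assumed format).

    Returns the set of URIs that pass the threshold.
    """
    # Collapse duplicates: a (minute, uri) pair is kept once, however
    # many times the URI was reported during that minute.
    minute_entries = {
        (ts.replace(second=0, microsecond=0), uri)
        for ts, uri in reports
        if now - ts <= MAX_AGE  # expire old reports
    }
    # Count the minute-unique entries per URI and apply the threshold.
    counts = Counter(uri for _minute, uri in minute_entries)
    return {uri for uri, n in counts.items() if n > THRESHOLD}
```

Note that a burst of many reports inside a single minute counts only once, so a URI needs reports spread over more than 10 distinct minutes to be listed.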

Data is grabbed often enough that none of the top 100 spamvertised sites mentioned at any given time on the SpamCop page should be missed, unless there happens to be an unusually large number of reports in a short period of time or the original spam timestamps are very old. The data is also handled in such a way that every unique URI and domain found is recorded. The "compression" of duplicate reports that happen within the same minute is lossless in the sense that no URIs or domains are lost, even if some of the count of duplicates per minute is lost. In other words, there will always be at least one report of every unique URI and domain that occurs during any given minute of reporting that we can see.

Part of the reason we can comfortably argue that most of the domains listed in SURBL are genuine spam domains is the thresholding of the time-compressed data: many independent spam reports by SpamCop users are required to get a domain onto the list. Another feature that improves the quality of the data is a simple exclusion of some whitelisted domains. (We have a separate list of ccTLDs to filter out some problematic "short" TLDs like co.uk, com.pl, com.br. Source: http://www.bestregistrar.com/help/ccTLD.htm.) Looking at the resulting list of domains (this is live data) [3], few if any legitimate sites make it through the reporting threshold and the simple, short whitelist [4]. A live log of newly whitelisted domains is also visible now. We also gain the advantage of positive feedback: the more spammers promote their sites, the more likely they are to be reported and thus to get onto this list. This is a democratic effect, improved by manual de-selection of legitimate domains by SpamCop users when they submit their reports. More reports mean more votes that a given site is indeed spam. The quality of the data is reinforced by the conscientious efforts of good people in reporting the spam. In this sense it is democracy in action.
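The exclusion step might look something like the following Python sketch. It is an assumption-laden illustration rather than the real scripts: the whitelist entry is hypothetical, only the "www." prefix is folded, and the ccTLD set holds just the three examples named above.

```python
WHITELIST = {"example.org"}                 # hypothetical whitelist entry
CCTLD_SLDS = {"co.uk", "com.pl", "com.br"}  # "short" TLD examples from the text
PREFIXES = ("www.",)                        # only "www." is assumed here


def fold(host):
    """Lower-case a host and fold a leading 'www.' into the base name."""
    host = host.lower().rstrip(".")
    for prefix in PREFIXES:
        if host.startswith(prefix):
            return host[len(prefix):]
    return host


def filter_candidates(hosts):
    """Drop whitelisted domains and bare ccTLD registry names."""
    folded = {fold(h) for h in hosts}
    return {d for d in folded if d not in WHITELIST and d not in CCTLD_SLDS}
```

Because both lists are short, plain set membership is enough; nothing here depends on how the candidates were thresholded beforehand.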

More detailed information about how the data is handled, along with a tarball of the scripts and source code used, can be found at the home page of that project: spamcheck.freeapp.net

We are now working with SpamCop to get the spam URI data directly from them and compose SURBL from it. SpamCop or any other source could also compile their own RBL of spam domains, using similar thresholding and whitelisting principles. Until that happens, SURBL may be a useful way to make use of the data. Thus far, I am not aware of any other spam message body domain data being made available as an automated RBL. This is somewhat surprising considering how useful this approach can be, for the reasons given above.

Notes

  1. The minimum report count threshold of 10 was pulled out of thin air. Some statistical consulting about more meaningful values for the threshold would be welcomed. The full, dynamic dataset with counts can be found at http://spamcheck.freeapp.net/top-sites.txt. This file runs around 10,000 lines long and is sorted by decreasing count. The entries towards the top of the list tend to occur in the most spams and are therefore of the greatest value to block.

  2. We have changed the threshold for inclusion in SURBL from 20 to 10, as of 30 March 2004. The main reason is that I (! ;-) got a spam for a domain med6547.biz that at the time had 18 hits in our data: a little under the count of 20 previously needed. Right after I (and presumably other spam victims) reported it to SC, the count went up to 38 or so. (Interestingly it's up to 129 about half a day later, so more people got spammed and reported it after I did.) A threshold of 10 or 12 would probably have caught the spam before it got to me. Future versions of SURBL will use more scientific methods of setting inclusion thresholds and expiration times. This should result in more hits on spam.

  3. A log of new domains as they get added to SURBL is available. Note that numeric addresses are in reverse order, as is customary in RBLs. Note also that legitimate-looking domains that appear in this log eventually get added to the whitelist and therefore won't be blocked.

  4. SURBL has a whitelist which is used only internally to prevent certain domains from getting onto SURBL in the first place. Since the whitelisted domains do not appear in SURBL, any matches tried against them will correctly fail. Our whitelist is not necessarily intended to be used externally to prevent the testing of domains in the wild. Using it on the client side could reduce some SURBL queries, but the number of domains on the whitelist is so small that the overhead from using it may be greater than that from any queries successfully prevented.

  5. At the good suggestion of Julian Haight of SpamCop, we have added a permanent testpoint which will always resolve in all SURBLs:
      Name:     test.sc.surbl.org.sc.surbl.org
      Address:  127.0.0.2
    
    Similarly at Eric's behest, the three-level domain name, test.surbl.org, represented by the SURBL entry test.surbl.org.sc.surbl.org will always resolve:
      Name:     test.surbl.org.sc.surbl.org
      Address:  127.0.0.2
    
    Note, however, that the three- and four-level domains above won't work with most programs, which reduce URIs to their base domains. So use the following two-level domain for testing instead.

    At Justin Mason's suggestion we've changed the example.com test point to use a more obscure domain, surbl-org-permanent-test-point.com. To test whether a message is getting correctly scored, put this two-level domain in a message body as a URI, prefixed with something like "http://".

      Name:     surbl-org-permanent-test-point.com.sc.surbl.org
      Address:  127.0.0.2
    
    We have also added a permanent numeric testpoint for 127.0.0.2, listed in reverse octet order:
      Name:     2.0.0.127.sc.surbl.org
      Address:  127.0.0.2
    
  6. We have added a manual blacklist, mainly for domains I happen to get in spams that are not already on the list. Often these domains have some hits in the data, but not yet enough to pass the threshold. Since I am a spam victim and can check out the spam manually, it's safe to add these. However, the whitelist and blacklist should normally be small if things are working well. Like the whitelist, this blacklist is internal to the list. Your SA plugins, MTA or other mail processing programs may have their own whitelists and blacklists for you to adjust. Hopefully those local lists will not be needed much for the SURBL data either.
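To illustrate the name format used by the testpoints in note 5 and the reversed numeric addresses mentioned in note 3, here is a small Python sketch. The helper name surbl_name is invented for this example, and the rule "reverse only complete four-octet addresses" is inferred from the text above rather than taken from any SURBL code.

```python
def surbl_name(host, zone="sc.surbl.org"):
    """Return the RBL entry name for a domain or numeric address."""
    parts = host.strip(".").split(".")
    # Complete dotted-quad addresses are stored with octets reversed,
    # as is customary in RBLs; ordinary domain names are left as-is.
    if len(parts) == 4 and all(p.isdigit() for p in parts):
        parts.reverse()
    return ".".join(parts + [zone])
```

For the testpoints above, surbl_name("127.0.0.2") gives 2.0.0.127.sc.surbl.org and surbl_name("surbl-org-permanent-test-point.com") gives surbl-org-permanent-test-point.com.sc.surbl.org.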

Data Sources of Other SURBL Lists

Please see the next Lists section for more information about the data sources of SURBL lists other than SC.


data.html version 1.78 on 7/15/07