Introduction

There is a growing nuisance for users and administrators of sites that run web servers, and particularly for blogs like this one: comment, trackback and referer spam. I figured I'd take this lazy Saturday afternoon as an opportunity to lay out the various issues and the attempts to combat this increasingly costly problem.

First, a Note on Spelling

Referrer or referer? "Referer" is, in fact, a misspelling. A common misspelling, however -- so common that it made it into the HTTP/1.1 spec. This is funny, but annoying. For the purposes of this article and others, we will resist being pedantic and simply refer (ha!) to it as "referer" spam, because it refers to the "Referer" header. We'll use the proper spelling when referring to the actual referrer (the page that sent the client to us).

What is HTTP Referer?

The HTTP 1.1 RFC defines the "Referer" header as:

The Referer[sic] request-header field allows the client to specify,
for the server's benefit, the address (URI) of the resource from
which the Request-URI was obtained (the "referrer", although the
header field is misspelled.)

Essentially, it's a way for an HTTP client to tell the server, in the request headers, the URI of the page that sent it there. For example, when I search for "centreblog" in Google and click on the link to our blog, my web browser sends the following header:


Referer: http://www.google.com/search?num=100&hl=en&lr=&q=centreblog&btnG=Search

This is handy, because it gives the site administrator some insight into where the traffic on his webserver is coming from. Further, many of the most popular webserver log analyzers depend on this info to provide statistics on the most common referrers. Not every web browser and HTTP client sends the Referer: header, however, so it should not be depended upon in any sort of web programming or analysis. Some web browsers, for example Opera, give you the ability to turn off the sending of the Referer: header. While this improves your privacy as a web user, it can impair your web browsing, because many websites check the Referer: header on requests for images, movies, and other forms of media likely to be "stolen" -- i.e. accessed directly rather than through the website hosting them.
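As a rough illustration of the kind of Referer check described above, here is a minimal Python sketch. The function name is_hotlink and the host blog.example.com are made up for this example; they are not taken from any particular server or plugin.

```python
from urllib.parse import urlsplit

# Hypothetical set of hosts that are allowed to embed our media.
ALLOWED_HOSTS = {"blog.example.com"}


def is_hotlink(referer, allowed_hosts=ALLOWED_HOSTS):
    """Return True if the Referer names a host outside allowed_hosts."""
    if not referer:
        # No Referer at all: let it through, since not every client sends one.
        return False
    host = urlsplit(referer).hostname  # lowercased by urlsplit, or None
    return host is None or host not in allowed_hosts


print(is_hotlink("http://blog.example.com/post/1"))
print(is_hotlink("http://other.example.net/page"))
```

Letting empty Referers through is a design choice in this sketch. A stricter site could reject them instead, which is exactly what breaks browsing for people who turn the header off.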

OK, What is Referer Spam?

While the HTTP Referer: header is very useful, it's also completely arbitrary. There's nothing to stop a web browser or HTTP client from sending a forged Referer: header with any request to a web server. You can even do it by hand:

$ telnet blog.centresource.com 80
Trying 70.84.100.10...
Connected to picasso.centresource.com.
Escape character is '^]'.
GET / HTTP/1.0
Host: blog.centresource.com
Referer: http://whitehouse.gov/

HTTP/1.1 200 OK

It's that easy. And, much as spammers have taken advantage of the fact that there is no provision for authentication in SMTP, they have also caught on to this openness, sending specially crafted requests with their own websites in the Referer: header.
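To show how little is involved, here is a small Python sketch that builds the same raw request as the telnet session above. The helper name build_request is my own invention for this example, and the sketch only constructs the bytes; it does not send anything.

```python
def build_request(host, referer, path="/"):
    """Build a raw HTTP/1.0 GET request carrying an arbitrary Referer."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        f"Referer: {referer}",  # nothing verifies this value
        "",  # a blank line ends the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


request = build_request("blog.centresource.com", "http://whitehouse.gov/")
print(request.decode("ascii"))
```

The Referer line is just text supplied by the client, which is the whole problem.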

Why Referer Spam?

You're probably thinking to yourself "Okay, I understand how Referer: spam works, but why would someone bother spamming something only the site administrator will see in the logs?"

That's a good question, and others have speculated at length on the various motivations. But, briefly, the reasons are generally:

  1. an attempt at boosting search engine rankings
  2. simply to show up in any stats published by the site. That is, if the site being spammed runs Webalizer, AWStats or some other webserver log-analyzing software, the spammer can get their URL in the "top referers" section.

There are probably other reasons, but we won't waste our time psychoanalyzing the degenerate mind of the spammer.

What is Comment/Trackback Spam?

Comment spam and trackback spam are much more direct methods that spammers have discovered. As the popularity of blogs has grown exponentially, so has their popularity with spammers. Initially, most blogging software (Movable Type, WordPress, etc.) and most blogging sites (blogger.com, livejournal.com, etc.) placed very few restrictions on who could post a comment. There was simply an HTML form that POSTed to a CGI script, which accepted and displayed the comment.

Of course it takes very little technical knowledge to see that this is easily exploited by spammers who want to get their goods in front of people's eyes. These spammers have automated tools that constantly search for blogs that allow comments or trackbacks and then spam them with POST requests containing generic comments and the spammers' URLs. The URL may be in the body of the comment, in the "URL" field of the comment, or both.

Consequences

These new methods of spamming have many consequences for those of us trying to use the Internet and the WWW in a reasonably productive way.

Tragedy of the Commons

The first and most obvious is that comments on blogs have been overwhelmed by hundreds of comments advertising "c1al1s", "v1agra" and other wonderful things. Many blog operators have simply given up and disabled comments and trackback, delivering a serious blow to one of the most powerful aspects of blogging -- the network of communication and relationships built between people and blogging communities.

Bandwidth, Bot-nets and DoS

A more serious consequence is that comment, trackback and Referer: spam is often performed via an HTTP "GET" or "POST" request, which retrieves the entire body of the document being spammed. For example, if a spammer is sending "GET /index.html" to deliver his Referer: header, and index.html is a 30k document, all 30k is transferred across your Internet pipe. The ever-optimizing engineer in me feels compelled to point out to the spammers that they could simply issue a "HEAD" request and accomplish the same thing without wasting my bandwidth, but I don't want to encourage them.

Anyway, this all results in quite a bit of traffic on your webserver, and bandwidth is not cheap.

Further complicating the situation is the increasing prevalence of bot-nets on the Internet. These massive networks of compromised computers are being used more and more to distribute the process of comment/trackback spam. This means that the potential for bandwidth usage increases enormously. It also means that comment spamming attacks can amount to an effective Denial of Service attack.

While CentreBlog is relatively new and hasn't had much of a problem with comment/Referer: spam, my personal blog at chris.quietlife.net has been in operation for around 4 years now. I'd estimate that roughly 70-80% of the traffic on my site is comment/Referer: spam. My WordPress blacklist plugin blocks, on average, around 500-600 comment spams a day (more on prevention techniques later). On at least 4 or 5 different occasions my website was entirely shut down by comment/referer spam attacks. In the past I had been able to simply firewall off the offending IP addresses, but these were attacks by bot-nets, meaning they were massively distributed and impossible to block -- each request came from a different IP address.

I simply had to shut down Apache (my HTTP server) and wait it out. It was quite infuriating, because there was nothing I could do.

Solutions (and Non-Solutions)

So, what's a webserver administrator to do? Here is a list of some of the prevention methods and their effectiveness:

Firewalling

Noticing an attack and firewalling off the offending addresses is effective against the occasional limited attack, but in general this is a losing battle, and the advent of bot-net attacks has rendered it impotent.

.htaccess

By and large, this is also an unwinnable battle, but you can blacklist certain referers in .htaccess. For example, this one.

The ease with which spammers register thousands of domains and rotate them as quickly as they are blacklisted has rendered this ineffective as well.
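The blacklist itself would live in .htaccess, but the matching logic is simple enough to sketch in Python. This is only an illustration of the idea, not actual .htaccess syntax, and the domains listed are hypothetical placeholders.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist; a real one would need constant updating.
BLACKLISTED_DOMAINS = {"spam.example", "pills.example"}


def referer_is_blacklisted(referer, blacklist=BLACKLISTED_DOMAINS):
    """Return True if the Referer host is a blacklisted domain or a subdomain of one."""
    host = urlsplit(referer or "").hostname
    if host is None:
        return False
    return any(host == d or host.endswith("." + d) for d in blacklist)


print(referer_is_blacklisted("http://www.spam.example/buy-now"))
print(referer_is_blacklisted("http://www.google.com/search?q=centreblog"))
```

The weakness is visible in the first few lines: the check is only as good as the set of domains, and the spammers can register new ones faster than anyone can list them.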

Comment/Trackback Blacklisting

This is a largely effective technique to prevent comment spammers from wreaking too much havoc on your blog/website. It will do nothing to prevent a large-scale bot-net attack from bringing down your webserver (or eating up your bandwidth), however.

For WordPress, I have had great luck with Fahim Farook's WPBlacklist plugin.

For Movable Type, there's the ubiquitous MT-Blacklist written by Jay Allen.

These plugins simply check comments/trackbacks against certain blacklisted URLs, IP addresses, etc. and deny the request if there's a match. This prevents the site from being overwhelmed, but each request still uses bandwidth and system resources.

As mentioned before, my personal blog gets around 500 spam comments per day, and the WPBlacklist plugin catches most of them.
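The general shape of such a check can be sketched in a few lines of Python. To be clear, this is not the code of WPBlacklist or MT-Blacklist; the function name, patterns and IP address are all assumptions made up for illustration.

```python
import re

# Hypothetical blacklist entries, for illustration only.
BLACKLISTED_PATTERNS = [r"c1al1s", r"v1agra", r"pills\.example"]
BLACKLISTED_IPS = {"192.0.2.10"}


def is_spam(comment_body, author_url, remote_ip):
    """Return True if the comment matches a blacklisted IP or pattern."""
    if remote_ip in BLACKLISTED_IPS:
        return True
    # Check both the comment body and the "URL" field of the comment.
    text = f"{comment_body}\n{author_url}"
    return any(re.search(p, text, re.IGNORECASE) for p in BLACKLISTED_PATTERNS)


print(is_spam("Cheap V1AGRA here", "http://pills.example/", "198.51.100.7"))
print(is_spam("Nice post, thanks!", "http://blog.example.org/", "198.51.100.7"))
```

Note that the check only runs after the POST has already arrived, which is why blacklisting does nothing for bandwidth.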

DNSBL checks

A newer but more difficult technique for stopping bot-net attacks is to check requests to a webserver against DNS blacklists.

I've outlined the process by which you can use mod_access_rbl in Apache to accomplish this in this article. The technique involves checking each web request against many of the popular DNS blacklists typically used to fight e-mail spam. This can help mitigate attacks by bot-nets, since these compromised computers are often in blacklisted IP space, or in IP space at least flagged as dynamic/dial-up/broadband. You have to be careful with this, though, because networks that are blacklisted from sending e-mail are not always suitable for blocking from web access -- that is, you may inadvertently cut off legitimate users of your site.
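For the curious, a DNSBL lookup is just a DNS query for the client's IP address with its octets reversed, prepended to the blacklist's zone. Here is a minimal Python sketch of that convention. It is not how mod_access_rbl is implemented, and the zone name dnsbl.example.org is a placeholder rather than a real blacklist.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNSBL query name: reversed IPv4 octets followed by the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """Return True if the query name resolves, i.e. the IP is listed."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        # No record found. Note that a DNS failure also lands here,
        # so this sketch fails open.
        return False
    return True


print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
```

Every request costs a DNS lookup with this approach, so caching the results would matter on a busy site.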

Conclusion

This sort of spamming is a growing problem. As the popularity of standards for blogging interaction increases, so do the opportunities for spammers to use them to pollute the web. Further, the rise of bot-nets does not seem to be abating, which makes the problem even harder to rein in. Like e-mail spam, this is tricky to deal with (but not impossible *cough*cough*Swirbo*cough*). We will continue to keep this blog updated with the latest developments and tactics in fighting it.

Comments

Referrer SPAM is an example of humanity using technology to attack humanity -- shoot ourselves in the foot, if you will. Can there be legitimate commercial track backs? Unlike commercial email where there are subscriptions, track backs are open for everyone by design. What if you were posting about “spamming blogs link” and you got a single track back advertising “spamming blogs link” for sale. Would that be spam? What if they actually linked to the page so that it was a 100% legitimate track back?
I delete hundreds of track back spam every couple of days. I also got some useful help from a valuable site on Refer-spam, Comment Spam, Anti-Spam. It is a big problem, but I think the question of acceptable commercial track backs is a valid one.

[...] able botnet which is then sold to the highest bidder, for wonderful things from hacking to referer and comment spam. Almost all of the comment/referer spam I get comes from [...]

So here is my question. Can there be legitimate commercial trackbacks? Unlike commercial email where there are subscriptions, trackbacks are open for everyone by design. What if you were posting about “the purple pill” and you got a single trackback advertising “the purple pill” for sale. Would that be spam? What if they actually linked to the page so that it was a 100% legitimate trackback?

This is a tricky one. I think this is more just an issue of etiquette. There's a disagreement among a lot of bloggers as to when and when not to use trackback -- i.e., is it appropriate to send a trackback ping without actually linking to the URL you pinged? I think unless the URL being pinged is explicitly designed to aggregate content, you should actually link to the URL you are pinging -- it facilitates the "you scratch my back, I'll scratch yours" feedback mechanism in trackback pings.

So, to use your "purple pill" example, I don't think it would be appropriate if the ping was from a URL that was nothing more than an advertisement for the purple pill, because it adds nothing to the URL being pinged and doesn't involve a thread of conversation. That said, I think it's really just very context sensitive. For example, if someone were to come up with the One True Solution to referer spam, I wouldn't see any problem with them pinging this URL, even if they didn't link to it directly.

True spam of course is not really relevant to these nuanced discussions of etiquette, though, because there's no content analysis going on. They are just spamming their URLs to as many places as possible at once.

I don't think there's any real problem with commercial-oriented trackbacking as long as a) it doesn't overwhelm or clutter up the post being pinged unnecessarily, and b) there's an indication that there was some forethought behind the ping -- i.e. there's an obvious contextual link between the URL being pinged and the URL offered.

Another reason for referer spam is that several blog packages used to enable a "top referrers" sidebar plugin by default. The idea was to know who was linking to you, but it was too easily abused. I just realized a little while ago that the blog I set up for my wife had it enabled and was giving her good page rank to the spammers.

So here is my question. Can there be legitimate commercial trackbacks? Unlike commercial email where there are subscriptions, trackbacks are open for everyone by design. What if you were posting about "the purple pill" and you got a single trackback advertising "the purple pill" for sale. Would that be spam? What if they actually linked to the page so that it was a 100% legitimate trackback?

I delete hundreds of trackback spam every couple of days. It is a big problem, but I think the question of acceptable commercial trackbacks is a valid one.

I was railing against Referrer SPAM over a year ago and no one seemed to understand what I was talking about. I'm glad to finally find someone who does.

I think the biggest danger posed by Referrer SPAM is the loss of commercial value it brings to not just blogs but all websites. After all, until the 'bots start shopping, advertisers are only going to be interested in human eyeballs. In other words: when an advertiser asks me, "How many page views do you get each month?" that advertiser isn't wanting to know how many 'bots read my site, but how many people read my site. Referrer SPAM is an example of humanity using technology to attack humanity -- shoot ourselves in the foot, if you will. Interestingly enough, those who use Referrer SPAM will also see a devaluation of their own websites as more and more idiots take up its use.

Such morons we've become as we bow to the technology we've created.

[...] eferer/comment spam Filed under: geek — Chris @ 2:16 pm I've put together a summary post on the growing problem of comment/referer spam over at the CentreBlog. [...]
