Comment Spam — A Case Study and a Proposed Deeper Solution that Google Should Provide

I have a friend who is a poet, mother and journalist in Norway. Her name is Aina Gerner-Mathisen. She recently emailed me trying to figure out who was posting in her name on various sites. I looked into it and discovered that she is a victim of comment-spam identity-theft. Comment-spam is a new type of spam that is spreading across weblogs. It is one of the problems that we will see in the emerging Metaweb — and it comes about because there is still no standard, widely used authentication protocol on the Internet. No responsible content provider should allow non-authenticated postings on their pages.

Click here to view this case in Google’s search results. Almost every one of the results of this search is a fake posting generated in Aina’s name. Open some of them and notice the auto-generated quote. Then click on her name and see where it goes. Yikes! It is quite extensive and all seems to have happened recently — possibly via a bot. This seems to be a new use of comment-spam — hijacking someone’s actual identity in order to use it to post spam ads.

Why didn’t the spammer just use fake names? Why did they use real people’s names? This was not necessary; they could have easily used made-up or even randomly generated names. That is one argument for this being a deliberate attack against certain people. Or maybe the spammer was just too lazy to make up names so they crawled for them instead? Could this be the output of a hidden worm that is stealing address entries from Outlook and using them for this purpose? Aina’s name appears to be in address-book format. The question is: did advertisers pay for this, or was it perpetrated by the site-owners of the linked-to sites, or did someone do this to her just to be mean?

This article proposes several solutions to the growing comment-spam problem — as well as a new proposal for a service that Google should provide to solve this problem.

But before you read further, first do me one favor: Help me rescue my friend Aina from the hell of identity-theft and comment spam. I have made a brief webpage about Aina on my blog at this official page about Aina Gerner-Mathisen. Can you link to it from your Blog please (with a nice comment about our friend Aina!) so that this legitimate page about her gets higher ranking in Google than all the comment-spam pages? That would help her out of this mess. Thanks. Then read on to find out about my new proposal for Google. I think it’s interesting. Maybe you will think so too. Please comment!

Ok now for how to solve the problem of comment-spam…

The immediate and most basic solution is very simple: all Websites that enable comments to be posted should require an email confirmation of the comment from whatever email address is provided with the posting. The email confirmation should link to a web page where the party can confirm their identity and approve the posting — and this page should also contain the standard “type in this word” challenge designed to foil bots. This would eliminate 90% of comment-spam immediately.
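As a rough illustration of the confirmation step described above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `CommentQueue` and `PendingComment` names, the in-memory storage, and the idea that the "type in this word" challenge is a single word chosen when the comment is submitted. Sending the actual email is left out; the token returned by `submit` is what the confirmation link would carry.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class PendingComment:
    email: str
    author: str
    text: str
    challenge_word: str  # the word the poster must type on the confirmation page


@dataclass
class CommentQueue:
    """Hypothetical in-memory store; a real weblog would use its own database."""
    pending: dict = field(default_factory=dict)    # token -> PendingComment
    published: list = field(default_factory=list)  # confirmed comments only

    def submit(self, email, author, text, challenge_word):
        """Hold the comment back and return the token for the emailed link."""
        token = secrets.token_urlsafe(32)  # unguessable, so bots cannot self-confirm
        self.pending[token] = PendingComment(email, author, text, challenge_word)
        return token

    def confirm(self, token, typed_word):
        """Publish only if the token is known and the challenge word matches."""
        comment = self.pending.get(token)
        if comment is None:
            return False
        typed = typed_word.strip().lower().encode()
        expected = comment.challenge_word.lower().encode()
        if not secrets.compare_digest(typed, expected):
            return False  # comment stays pending so the poster can retry
        del self.pending[token]
        self.published.append(comment)
        return True
```

A comment submitted with `submit(...)` never reaches `published` until someone who received the email follows the link and passes the challenge, which is the rule of thumb the proposal rests on.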

All Weblog providers should immediately implement this system; it is the responsible thing to do. People’s reputations are being harmed and Weblog comments are filling up with spam. Both of these problems can be solved easily with one simple rule of thumb: never allow unconfirmed comments to be posted.

The next step beyond this is to require authentication of posters’ names — for example, if someone posts a comment on your blog claiming to be “Bill Gates” that comment should not appear in your blog unless the poster has signed their posting with Bill Gates’ authentic digital signature. All the technology for this exists and is available — why aren’t Weblog providers (and other content providers that accept comment postings) implementing this today? Mainly because most people don’t have digital signatures yet. ISPs should provide every user with a digital signature as part of their paid service. That would solve a lot of problems.

A final measure that is needed is a way to help rescue people who have been unfortunate victims of comment-spam to an extent that may hurt their reputation by messing up their Google rankings (for example, my friend Aina). If you do a search in Google for “Aina Gerner-Mathisen” or “GernerMathisen Aina” you will get only pages that contain comment spam generated using her name. Aina is actually a leading writer and journalist in Norway, yet due to the comment spam problem, if you search her name on Google you get links that do not accurately represent her views. This is a real problem for her, and she is not technical enough to solve it herself. Her professional and personal reputation could be harmed by this. So for people like Aina, there is a need for a system to “dig out” from under comment spam in their name.

I think that since most comment-spam is used to increase Google standings of the spammers’ sites, Google is the logical place to combat comment-spam. By making it harder for comment-spammers to manipulate Google effectively, we can make comment-spam inefficient and that will cause the spammers to stop doing it.

Here’s how it works. To use the proposed Google “Eliminate Spam” feature you have to sign up and pay with your credit card. It costs a minimal amount — $3 per year. The reason that a credit card (or Paypal) payment is required is that this is the easiest way to authenticate people and screen out spammers. Once you have signed up and paid, you then get an account in Google’s “Eliminator” service. When you are logged into this account on Google, you may use a new “Report Spam” feature (this option should appear in the header of the cached version of every page that they index).

By selecting the “Report Spam” option on Google’s cached version of a particular page, the user is given a new page that enables them to “edit” the text of the cached page. Users may not delete anything, but they can select any string and designate it as “spam.” (Perhaps it should cost $0.05 per string designated, billed to the user’s account, to report spam so that people don’t do it spuriously.)

Each string that a user cites as “spam” will be removed from Google’s index. Thus that page will no longer appear in Google search results for searches on that string. Alternatively the page could still be indexed but might appear in a different (and by default, collapsed) section of Google search results called “Omitted Results: Cited as Spam”.
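To make the indexing behavior concrete, here is a small Python sketch of how citing a string as spam could affect search results. It is only a toy model under stated assumptions: `CachedIndex`, `cite_spam`, and `search` are hypothetical names, matching is a plain case-insensitive substring test rather than anything Google actually does, and the five-cent fee is the figure suggested in the text.

```python
from collections import defaultdict

FEE_PER_STRING_CENTS = 5  # the five-cent per-string fee suggested in the proposal


class CachedIndex:
    """Toy stand-in for a search index over cached pages (hypothetical)."""

    def __init__(self):
        self.pages = {}                # url -> cached page text
        self.cited = defaultdict(set)  # url -> lowercased strings cited as spam

    def add_page(self, url, text):
        self.pages[url] = text

    def cite_spam(self, url, string):
        """Mark a string on one page as spam; the page text is never altered.

        Returns the fee, in cents, to bill to the reporting user's account.
        """
        if string.lower() not in self.pages[url].lower():
            raise ValueError("string does not occur on the cached page")
        self.cited[url].add(string.lower())
        return FEE_PER_STRING_CENTS

    def search(self, query):
        """Return (results, omitted): normal hits, and hits cited as spam."""
        q = query.lower()
        results, omitted = [], []
        for url, text in self.pages.items():
            if q not in text.lower():
                continue
            if q in self.cited.get(url, set()):
                omitted.append(url)  # the collapsed "Omitted Results" section
            else:
                results.append(url)
        return results, omitted
```

Note that a page cited for one string still appears normally for every other query, and its text stays intact; only the association between that string and that page changes.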

As an added safeguard however (to prevent users from falsely designating content on pages as “spam”), before any page is changed in Google, the webmaster of that page should get a confirmation email from Google saying, “Please approve change to your Google indexing” — which includes the identity of the party asking for the change, the text that is cited as spam, and optional comments from the party asking for the change. If no response is provided by the webmaster then that should default to “permission granted” for Google to change the indexing of the page. Webmasters should also be given accounts in Google (for free?), where they can manage their indexing and view and approve/decline indexing change requests in an aggregate manner for greater efficiency. Perhaps via their accounts they could also at any time decline an indexing change-request, for example one that they got in the past but didn’t have time to decline before.
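The approval rule in the paragraph above (silence counts as permission, but the webmaster can still decline later) can be sketched in Python as follows. The `ChangeRequest` class and its fields are hypothetical, and the 14-day response window is purely an assumption, since the proposal does not say how long Google should wait.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumption: the proposal names no deadline; 14 days is a placeholder.
RESPONSE_WINDOW = timedelta(days=14)


@dataclass
class ChangeRequest:
    """One 'Please approve change to your Google indexing' request (hypothetical)."""
    requester: str        # identity of the party asking for the change
    page_url: str
    cited_string: str     # the text cited as spam
    comment: str          # optional comments from the requester
    sent_at: datetime     # when the confirmation email went to the webmaster
    decision: Optional[bool] = None  # None means the webmaster has not answered

    def respond(self, approve: bool) -> None:
        """Record the webmaster's answer; a late answer overrides the default."""
        self.decision = approve

    def in_effect(self, now: datetime) -> bool:
        """True if the indexing change should currently be applied."""
        if self.decision is not None:
            return self.decision
        # No response: permission is granted by default once the window passes.
        return now - self.sent_at >= RESPONSE_WINDOW
```

Because `in_effect` is computed from the stored decision each time, a webmaster who declines a request weeks after it took effect by default simply flips the result back, which matches the "decline at any time" option mentioned above.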

The above proposal would help to make Google more accurate and spam-free, and would also help victims of comment-spam regain their reputations. Comment-spam is really an abuse of Google, and it is only effective because services like Google aggregate information centrally. Major portals such as Google have a social responsibility to improve their systems so that they cannot be co-opted to damage people’s reputations, hijack identities, etc. They have this responsibility because of their position and role in society as the central arbiters of requests for information.

Search engine companies might attempt to avoid this responsibility with an excuse such as “but it’s not search engines that are the problem, it’s the content providers’ lack of authentication in their posting mechanisms. The content providers should bear the cost of this, not the search providers.” But in fact such a position is really no different than gun manufacturers claiming that “Guns don’t kill people, bullets do” or tobacco companies claiming, “Cigarettes don’t harm people, it’s smoking them that causes harm.” My point is that if a company provides a product or service that could be used to harm people, intentionally or unintentionally, then it is their responsibility to provide adequate safety measures to protect people from that potential harm, to the full extent that is possible to implement on their part.

Even if other links in the chain are part of the cause of harm, every link in the chain must do its part. Search engine companies must provide protections for victims of identity-theft, libel, comment-spamming, etc., even though the content containing the offending material is actually created somewhere else and is caused by weaknesses in external services. Search engines, by indexing content, are aggregators of a sort, and if they provide that content to users they are publishers. This means they do have a liability. In fact, Google and other search engines have in the past submitted to legal pressure (for example, from the Scientologists) to remove links to material that was claimed to be “illegitimate” for some reason. This is a tacit acknowledgement of this legal liability.

But is there a danger that opening the door to removing “unwanted” index-associations from Google could enable censorship and misuse (such as deliberate removal of index-associations as a means to lower the standing of some party with respect to a given term in Google)? I think that the method I have proposed above may mitigate most of the risk on both counts. First of all, in my proposal no content is actually removed; it is simply indexed differently in Google. So nothing has been “censored.” Rather, the search results have simply been “tuned.” The content is still there: if one goes to some page X they will still see the content on that page, but they won’t see page X as a search result under “Aina Gerner-Mathisen,” at least not if Aina cited that index-association (from the string representing her name to page X) as spam.

Regarding the second potential concern, that the system might be abused to deliberately lower a party’s standing in Google, the fact that all uses of this system are authenticated and nominally charged for should largely eliminate that problem. The final protection is that Webmasters are given the opportunity to view and approve (and possibly, at any future date, decline) change requests for how they are indexed.

In conclusion, the above proposal represents one very effective means for eliminating comment-spam. It also provides a way for Google to enable authenticated users to help improve the quality of its index, which has many other benefits. What do you think?

8 Responses to Comment Spam — A Case Study and a Proposed Deeper Solution that Google Should Provide

  1. Nova, if you changed your link from “this link” to something like “page about Aina Gerner-Mathisen” and also changed the page title of that page from “Damsel in Distress” to something with Aina Gerner-Mathisen in the title, both would significantly help the search results in Google.
    When Google’s pagerank algorithm has to pick between pages for a given search, it weights pages with links containing the search phrase in the link description and pages with the search phrase(s) in the title higher than those without.

  2. Robin Hood and his merry men.

    Well…here’s an interesting paradox. Here is someone claiming that their friend, a person named Aina Gerner-Mathiesen has been the subject of blog spam Google identity theft. What the hell is this you ask? Well, basically it’s the idea that by post…

  3. Nova Spivack says:

    Good, I’ll do that now!

  4. anon says:

    restricting comments in that way would effectively kill them. so maybe you should simply disable comments at all.

  5. Jeremy Ellison-Gladstone says:

    I am also an Oberlin alum and I have had the same thing happen to me. Someone has linked my name, along with the names of countless other Oberlin students and alums to all kinds of spam sites through “nonsense” postings on weblogs. It seems that someone has targeted the Oberlin Alumni database and it’s really creepy. Just wanted to let you know, you’re not alone.

  6. John says:

    Your plan worked. The comment spam isn’t coming up first in the listings anymore when searching her name.
    That was similar to blog bombing.

  7. Bianca June says:

    I enjoy reading through this informal place. I will surely visit you again to see if anything new appears on it.
    Good luck for the future.

  8. e-sign act says:

    It is great to see that digital signature is coming very quickly in every system. It is used for security purpose. In fact, it also maximizes the individual privacy and compatibility.