Summary

Web Analytics Tutorial

 

Lesson 12 – Investigating Troublemakers


What to Do About Troublemakers

Once you have determined that someone is causing trouble on your site, you will want to act on it. We have already covered a few changes you can make to your web site or server to reduce the ability of users to cause problems. If the problems persist or are extremely debilitating, you may want to find the responsible parties and take them to task for it.

Responsible Parties

The only record you really have of the source of requests to your web site is the host name or IP number where each request originated. You may have referrer information for some requests, but most problem traffic is generated directly, not referred from other sites. If Summary is configured to do DNS lookups (it is by default), then the domain information will tell you who owns the hosts that were causing the problems. If you have created a filter to show only the offending traffic, you can use the Domains Report to find out which domain or domains are responsible. You can then use Network Solutions’ Whois tool to look up contact information for the domains in question.

If the IP numbers of the hosts did not resolve, or you have not enabled DNS lookups in Summary, you can use ARIN’s Whois tool to look up ownership information for the hosts that are causing trouble and get contact information for the parties responsible for those hosts.
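The same lookup can be scripted. The following is a minimal sketch, assuming the plain WHOIS protocol (one query line over TCP port 43) and the server name whois.arin.net. The function names are illustrative, and the fields that come back vary from registry to registry, so the parser only collects whatever "Key: value" lines it finds.

```python
import socket


def whois_query(target, server="whois.arin.net", port=43, timeout=10.0):
    """Send one WHOIS query and return the raw text reply.

    The server name and the bare-address query format are assumptions;
    other registries may expect a different query syntax.
    """
    with socket.create_connection((server, port), timeout=timeout) as conn:
        conn.sendall(target.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_fields(reply):
    """Collect 'Key: value' lines, skipping blank and comment lines."""
    fields = {}
    for line in reply.splitlines():
        line = line.strip()
        if not line or line[0] in "#%":
            continue
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields.setdefault(key.strip(), []).append(value.strip())
    return fields
```

Calling `parse_fields(whois_query("192.0.2.15"))` (a documentation address, used here only as a placeholder) would return a dictionary you can search for organization and contact entries. Keep in mind that the contact you find may be an Internet service provider rather than the person behind the traffic.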

However, contacting the coordinator for a domain registration or an IP number netblock may not be very effective. If the problem involved an access attempt or a Denial of Service attack, the source is likely to be an Internet service provider, and a major one at that. While these companies do not condone this kind of activity, they also handle such large volumes of traffic that they may not easily be able to determine which client was assigned a given IP number for the period of time covered. In fact, due to privacy regulations and concerns, some major Internet service providers refuse to correlate access or use information (such as IP numbers) with customer account data.

Validity of Information

Unfortunately, all the information you collect in your web site logs is self-reported by the agent making the request. This means that you cannot guarantee the validity of any of the data you might want to use when trying to contact responsible parties. This is covered in greater detail in Appendix B - Technical Details of Metric Accuracy. As we previously mentioned when discussing content mirroring, it is possible (and not very hard) for a visitor to misrepresent her user agent or browser. The wget mirroring tool lets you do this quite simply, and the Opera browser allows you to select one of several other browsers that it can masquerade as. Robots are supposed to include a contact email address and/or web site in the user agent string so you can ask about their behavior. Poorly behaved robots are unlikely to include either.

Even more important is that it is possible for a user to fake the IP address of the host from which she is connecting. This is not easily done and is generally only something that experienced system crackers accomplish. However, you should be aware before contacting a “responsible party” for a given host that the host address may have been forged in the communication with your server.

Counter Measures

Even though you may not be able to find out who was responsible for causing problems on your site, you may still be able to take steps to reduce that kind of traffic, or its effect on your server, in the future. As already mentioned, you can install a security mechanism to make sure that users cannot try a brute force password attack on a secure portion of your web site. If your mechanism completely blocks traffic from the offending host for a half hour or so, then the attacker will have to keep switching hosts to even get a request to your server. This will greatly reduce the load caused by these attacks. Similarly, if you have experienced Denial of Service attacks on particular pages or files on your site, you could make the page or file more efficient for the server to deliver, or limit access to it from a given host for a period of time. A simpler (and more immediate) solution is to rename the page that is getting the high volume of hits and replace it with a small, non-CGI file. This will reduce the load on the server for a time while the attacking tool continues to load the old page (until the attacker notices and resets the tool).
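The half-hour block described above could be sketched as follows. This is only an illustration under stated assumptions: the class name, the threshold of five failures, and the in-memory dictionaries are all hypothetical, and a real server would need to hook this into its own login handling.

```python
import time


class LoginLockout:
    """Block a host for a fixed period after repeated failed logins."""

    def __init__(self, max_failures=5, lockout_seconds=1800,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.lockout_seconds = lockout_seconds  # 1800 s = half an hour
        self.clock = clock                      # injectable for testing
        self._failures = {}                     # host -> failure count
        self._blocked_until = {}                # host -> unblock time

    def is_blocked(self, host):
        """Return True while the host's lockout period is still running."""
        until = self._blocked_until.get(host)
        if until is None:
            return False
        if self.clock() >= until:
            del self._blocked_until[host]       # lockout has expired
            return False
        return True

    def record_failure(self, host):
        """Count a failed login and start a lockout at the threshold."""
        count = self._failures.get(host, 0) + 1
        if count >= self.max_failures:
            self._blocked_until[host] = self.clock() + self.lockout_seconds
            self._failures.pop(host, None)
        else:
            self._failures[host] = count

    def record_success(self, host):
        """A successful login clears the host's failure count."""
        self._failures.pop(host, None)
```

The server would call `is_blocked(host)` before handling a login request and refuse the request outright when it returns True, so a blocked host costs almost nothing to turn away.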

Limiting Robots

For well-behaved robots, you can limit their access by creating a file called robots.txt in the root of your web site. You will need to put some directives in this file, as described in the Standard for Robot Exclusion, to tell each robot which files to ignore. If you want to allow robots to index your site, but want to stop getting Failed Requests for the robots.txt file, you can create a file that contains just this, allowing all robots to access all paths on your site:

  User-Agent: *
  Disallow: 
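
One way to check how a compliant robot would read a robots.txt file is to run it through a parser. The sketch below uses Python's standard `urllib.robotparser` module. The robot name "ExampleBot" and the paths are made up for illustration, and the helper function is not part of any standard.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(text):
    """Parse robots.txt content held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


# The allow-all file: an empty Disallow value excludes nothing.
ALLOW_ALL = "User-Agent: *\nDisallow:\n"

# For contrast, a file that keeps all robots out of one directory
# (the directory name is hypothetical).
BLOCK_CGI = "User-Agent: *\nDisallow: /cgi-bin/\n"

allow_all = parse_robots(ALLOW_ALL)
block_cgi = parse_robots(BLOCK_CGI)

print(allow_all.can_fetch("ExampleBot", "/cgi-bin/search"))  # True
print(block_cgi.can_fetch("ExampleBot", "/cgi-bin/search"))  # False
print(block_cgi.can_fetch("ExampleBot", "/index.html"))      # True
```

This only shows what a robot that honors the Standard would do. It says nothing about robots that ignore robots.txt.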

If you do not have control of the robots.txt file or only want to control specific files, you can include ‘meta-tags’ in the page source that may control the activities of some robots. These meta-tags are described in detail in their proposal. You should note, however, that these are proposed control mechanisms and many robots will not recognize some of the tags. The ‘ROBOTS’ meta-tag, however, is supported by many robots. To prevent a page from being indexed or parsed by a robot, add this HTML tag to the HEAD section of the page source:

  <meta name="ROBOTS" content="NOFOLLOW,NOINDEX">

If you are trying to block poorly behaved robots that do not adhere to the Standard, then you will need to make changes to your web site or server. As most robots work from a single host, you could configure your server to block all traffic from the hosts where offending robots reside. If you only want to block robot access to a given section of your site, you will need to secure it. Usually you can do this simply at the web server level. The robots will not have the access information, so they will not be able to get into the secured section of the site. This is really the only way you can prevent robots from accessing any of the information you have on your server. If you do not want your web site to be public, then you must configure it to require a login for access.
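Most web servers have their own configuration syntax for blocking hosts, so consult your server's documentation for the real mechanism. The underlying check is simple, though, and can be sketched with Python's standard `ipaddress` module. The addresses below come from ranges reserved for documentation and stand in for the hosts you have identified; the function name is illustrative.

```python
import ipaddress

# Hypothetical block list: one single host and one whole netblock.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.15/32"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(remote_addr, blocked=BLOCKED_NETWORKS):
    """Return True if the requesting address falls in a blocked network."""
    address = ipaddress.ip_address(remote_addr)
    return any(address in network for network in blocked)


print(is_blocked("192.0.2.15"))    # True: the single blocked host
print(is_blocked("198.51.100.7"))  # True: inside the blocked netblock
print(is_blocked("203.0.113.9"))   # False: not on the list
```

Blocking a whole netblock also shuts out every legitimate visitor who shares it, so it is usually better to start with single hosts.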

Limiting Mirroring Tools

You may be able to limit the use of mirroring tools by configuring your server to block traffic from particular user agents, such as wget. Unfortunately, it is simple for users of these tools to change the user agent to look like, for example, Microsoft Internet Explorer 5, which you really would not want to block. Beyond that, it is quite complicated to stop mirroring. As far as the web server is concerned, it looks very much like regular web site traffic. Some web servers will limit the number of simultaneous connections from a given host, which may help reduce excessive load caused by greedy mirroring software, but a “friendly” mirror is indistinguishable from visitor traffic and virtually impossible to detect and block.
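
The user agent check, and its weakness, can be sketched in a few lines. The token list, the function name, and the sample user agent strings are illustrative assumptions; a real server would apply the equivalent rule in its own configuration.

```python
# Substrings that mark a request as coming from a mirroring tool.
# Only "wget" comes from the discussion above; add others as needed.
BLOCKED_AGENT_TOKENS = ("wget",)


def agent_allowed(user_agent):
    """Return False if the user agent string names a blocked tool."""
    agent = (user_agent or "").lower()
    return not any(token in agent for token in BLOCKED_AGENT_TOKENS)


# A mirroring tool that identifies itself honestly is turned away.
print(agent_allowed("Wget/1.8.1"))  # False

# The same tool, reconfigured to claim it is a browser, gets through.
print(agent_allowed("Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"))  # True
```

The second call is the whole problem in miniature: the check can only judge what the client chooses to report about itself.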




Copyright 2002 by Summary.Net - Updated 16.Apr.2002