Web Analytics Tutorial
Lesson 12 – Investigating Troublemakers

What to Do About Troublemakers

Once you have determined that someone is causing trouble on your site, you will want to act on it. We have already covered a few ways you can change your web site or server to reduce the ability of users to cause problems. If the problems persist or are extremely debilitating, you may want to find the responsible parties and take them to task for it.

Responsible Parties

The only record you really have of the source of requests to your web site is the host name or IP number where the request originated. You may have referrer information for some requests, but most problem traffic is generated directly, not referred from other sites. If your Summary is configured to do DNS lookups (it is by default), then the domain information will tell you who owns the hosts that were causing the problems. If you have created a filter to show only the offending traffic, you can use the [...]

If the IP numbers of the hosts did not resolve, or you have not enabled DNS lookups in Summary, you can use ARIN’s Whois tool to look up ownership information for the hosts that are causing trouble and get contact information for the parties responsible for those hosts.

However, contacting the coordinator for a domain registration or an IP number netblock may not be very effective. If the problem involved an access attempt or a Denial of Service attack, the source is likely to be an Internet service provider, and a major one at that. While these companies do not condone this kind of activity, they handle such large volumes of traffic that they may not easily be able to determine which client was assigned a given IP number during the period in question. In fact, due to privacy regulations and concerns, some major Internet service providers refuse to correlate access or use information (such as IP numbers) with customer account data.

Validity of Information

Unfortunately, all the information you collect in your web site logs is self-reported by the agent making the request. This means that you cannot guarantee the validity of any of the data you might want to use when trying to contact responsible parties. This is covered in greater detail in Appendix B - Technical Details of Metric Accuracy.

As we previously mentioned when discussing content mirroring, it is possible (and not very hard) for a visitor to misrepresent her user agent or browser. The wget mirroring tool lets you do this quite simply, and the Opera browser allows you to select one of several other browsers that it can masquerade as. Robots are supposed to include a contact email address and/or web site in the user agent string so you can ask about their behavior. Poorly behaved robots are unlikely to include either.

Even more important, it is possible for a user to fake the IP address of the host from which she is connecting. This is not easily done and is generally something that only experienced system crackers accomplish. However, before contacting a “responsible party” for a given host, you should be aware that the host address may have been forged in the communication with your server.

Counter Measures

Even though you may not be able to find out who was responsible for causing problems on your site, you may still be able to take steps to reduce that kind of traffic, or its effect on your server, in the future.

As already mentioned, you can install a security algorithm to make sure that users cannot try a brute force password attack on a secure portion of your web site. If your algorithm completely blocks traffic from the offending host for a half hour or so, then the attacker will have to keep switching hosts to even get a request to your server. This will greatly reduce the load caused by these attacks.
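The half-hour block described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a feature of Summary or of any particular web server: the class name HostLockout, the threshold of five failures, and the injectable clock are all hypothetical choices made for the example.

```python
import time
from collections import defaultdict


class HostLockout:
    """Hypothetical helper: block a host for a while after repeated failed logins."""

    def __init__(self, max_failures=5, block_seconds=30 * 60, clock=time.monotonic):
        self.max_failures = max_failures    # assumed threshold, tune for your site
        self.block_seconds = block_seconds  # "a half hour or so"
        self.clock = clock                  # injectable so the logic can be tested
        self.failures = defaultdict(int)    # host -> failed attempts since last reset
        self.blocked_until = {}             # host -> time at which the block expires

    def is_blocked(self, host):
        expires = self.blocked_until.get(host)
        if expires is None:
            return False
        if self.clock() >= expires:
            # The block has run out: forget it and let the host try again.
            del self.blocked_until[host]
            return False
        return True

    def record_failure(self, host):
        self.failures[host] += 1
        if self.failures[host] >= self.max_failures:
            # Too many failures: refuse this host for block_seconds.
            self.blocked_until[host] = self.clock() + self.block_seconds
            self.failures[host] = 0

    def record_success(self, host):
        # A successful login clears the failure count for that host.
        self.failures.pop(host, None)
```

A login handler would call is_blocked(host) before checking a password and record_failure(host) after each bad attempt. Because the state is keyed on the host address, it inherits the limits discussed under Validity of Information: an attacker who switches hosts starts with a clean count.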

Similarly, if you have experienced Denial of Service attacks on particular pages or files on your site, you could make the page or file more efficient for the server to deliver, or limit access to it from a given host for a period of time. A simpler (and more immediate) solution is to rename the page that is getting the high volume of hits and replace it with a small, non-CGI file. This will reduce the load on the server for a time while the attacking tool continues to load the old page (until the attacker notices and resets the tool).

Limiting Robots

For well-behaved robots, you can limit their access by creating a file called robots.txt:

    User-Agent: *
    Disallow:

Each Disallow line names a path prefix that robots should not request; an empty value, as shown here, excludes nothing.

If you do not have control of the robots.txt file, you can instead add a robots meta tag to individual pages:

    <meta name="ROBOTS" content="NOFOLLOW,NOINDEX">

If you are trying to block poorly behaved robots that do not adhere to the Robots Exclusion Standard, then you will need to make changes to your web site or server. As most robots work from a single host, you could configure your server to block all traffic from the hosts where offending robots reside.

If you only want to block robot access to a given section of your site, you will need to secure it. Usually you can do this simply at the web server level. The robots will not have the access information, so they will not be able to get into the secured section of the site.

This is really the only way you can prevent robots from accessing any of the information you have on your server. If you do not want your web site to be public, then you must configure it to require a login for access.

Limiting Mirroring Tools

You may be able to limit the use of mirroring tools by configuring your server to block traffic from particular user agents, such as wget. Unfortunately, it is simple for users of these tools to change the user agent to look like, for example, Microsoft Internet Explorer 5, which you really would not want to block.

Beyond that, it is quite complicated to stop mirroring. As far as the web server is concerned, it looks very much like regular web site traffic. Some web servers will limit the number of simultaneous connections from a given host, which may help reduce excessive load caused by greedy mirroring software, but a “friendly” mirror is indistinguishable from visitor traffic and virtually impossible to detect and block.
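
Returning to the robots.txt rules under Limiting Robots: before publishing a rule set, it can help to check what a compliant robot would conclude from it. The sketch below uses Python's standard urllib.robotparser module for that. The robot name ExampleBot and the /private/ path are hypothetical, chosen only for illustration, and the result says nothing about robots that ignore the file.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, user_agent, path):
    """Return True if robots_txt permits user_agent to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# The rules as shown in the text: an empty Disallow excludes nothing.
OPEN_RULES = "User-Agent: *\nDisallow:\n"

# Hypothetical variant that asks all robots to stay out of /private/.
PRIVATE_RULES = "User-Agent: *\nDisallow: /private/\n"

if __name__ == "__main__":
    for name, rules in (("open", OPEN_RULES), ("private", PRIVATE_RULES)):
        for path in ("/index.html", "/private/report.html"):
            print(name, path, allowed(rules, "ExampleBot", path))
```

With the empty Disallow, every path is reported as permitted. With the /private/ rule, only paths under that prefix are refused. Either way this is a request, not an enforcement mechanism, which is why the text recommends securing sections that must stay private.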

Table of Contents
1: What is Web Analytics?
2: Where are My Visitors Coming From?
3: Search Engines
4: Advertising
5: Revenue Modeling
6: Design Considerations
7: Determining Visitor Behavior Patterns
8: Examining Subsets of Traffic
9: Incorporating Business Goals
10: Bandwidth Management
11: Site and Server Diagnostics
12: Investigating Troublemakers
Appendix A: Making Reports More Usable
Appendix B: Technical Details of Metric Accuracy

Copyright 2002 by Summary.Net - Updated 16.Apr.2002