Summary

Web Analytics Tutorial

 

Lesson 12 – Investigating Troublemakers

IN THIS LESSON
* Unusual Access Patterns
   Drilling Down
* Bad Robots
   Denial of Service Attacks
   Worms
   Content Mirroring
* Digging Deeper
* What to Do About Troublemakers
   Responsible Parties
   Validity of Information
   Counter Measures
   Limiting Robots
   Limiting Mirroring Tools

Bad Robots

Troublemakers are not always individuals. They may be automated tools that are either poorly behaved or run by users who do not understand how to use them properly. Properly behaved robots abide by some self-enforced rules to make sure that they do not adversely affect your server’s performance. Robots should not make large numbers of requests in a short period of time; most robot etiquette suggests at least a second between requests. Some robots round-robin their requests across many servers, so that several minutes or hours may pass between successive requests to your site. Well-behaved robots also adhere to the Standard for Robot Exclusion, which states that they must read the file /robots.txt in the root of each web site and index only the content that the file allows them to index. All the robots listed in Summary’s Known Robots report abide by the Standard for Robot Exclusion. Summary also includes the Possible Robots report, which gives statistics for hosts that made a request to /robots.txt.
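
As a rough illustration of what a well-behaved robot does with /robots.txt, the sketch below uses Python's standard urllib.robotparser module. The robots.txt content, the user agent name "ExampleBot" and the paths are hypothetical and chosen only for the example; Crawl-delay is a common extension rather than part of the original standard, and not every robot honors it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths and the delay are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /downloads/
Disallow: /cgi-bin/
Crawl-delay: 2
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched from the site."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
agent = "ExampleBot"  # hypothetical user agent name

print(parser.can_fetch(agent, "/index.html"))           # True: no rule excludes it
print(parser.can_fetch(agent, "/downloads/trial.zip"))  # False: under /downloads/
print(parser.crawl_delay(agent))                        # 2: seconds between requests
```

A polite robot would check can_fetch before every request and sleep for at least the crawl delay (or about a second if none is given) between requests to the same server.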

Unfortunately, some robots or spiders do not behave according to these rules. They may not be malicious, just poorly designed or implemented. For example, some robots will crawl your site and download large files, such as trial software or documents that you make available, even when the indexing tool they are connected to cannot process those file types. If this becomes common, it can lead to excessive bandwidth usage and drain your server resources. The same can be true for dynamically generated pages. If you have a database-driven section of your site that creates dynamic pages linking to more database content, robots can quickly become mired in loading the entire volume of data available for indexing. If the links that your dynamic site generates are all unique, the robot can get lost in an endless loop of loading new pages, even when their content is the same as pages it has already loaded.

Figure 5. High traffic volume in the Host report could be a robot.

Robots generally make successive requests from the same host. You can use the Host Report, as in Figure 5, to see whether any hosts are making a large number of requests. You can then filter traffic from just this host (in a subreport with Summary SP) and examine the request report. If there are about the same number of requests to each page on your site, the activity is likely caused by a robot. If the requests arrive in high volume over a very short period of time (e.g. several requests per second), you can be almost certain that this is the behavior of a robot. You can also change the sort order of the Host Report (by clicking on the column names) to look for hosts that request a large number of pages (robots do not usually load graphics) or that have a large number of visits (some robots spread their crawls over such a long time that they appear to ‘visit’ the site once for each page they index).
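
If you want to run the same check directly against a raw log, it can be approximated with a short script. The following is a minimal sketch, assuming the server writes Common Log Format; the function name host_activity and the sample lines are made up for illustration and are not part of Summary.

```python
import re
from collections import Counter, defaultdict

# Assumes Common Log Format; adjust the pattern if your server logs differently.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)')


def host_activity(lines):
    """Return {host: (total_requests, peak_requests_in_one_second)}."""
    totals = Counter()
    per_second = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not look like Common Log Format
        host, stamp = match.group(1), match.group(2)
        totals[host] += 1
        # CLF timestamps have one-second resolution, so the raw string
        # works as a per-second bucket key.
        per_second[host][stamp] += 1
    return {h: (totals[h], max(per_second[h].values())) for h in totals}


SAMPLE = [  # made-up log lines, for illustration only
    '192.0.2.10 - - [16/Apr/2002:10:00:00 -0500] "GET /a.html HTTP/1.0" 200 512',
    '192.0.2.10 - - [16/Apr/2002:10:00:00 -0500] "GET /b.html HTTP/1.0" 200 512',
    '192.0.2.10 - - [16/Apr/2002:10:00:00 -0500] "GET /c.html HTTP/1.0" 200 512',
    '198.51.100.7 - - [16/Apr/2002:10:00:05 -0500] "GET /a.html HTTP/1.0" 200 512',
]

for host, (total, peak) in sorted(host_activity(SAMPLE).items()):
    print(host, total, peak)
```

A host whose peak is several requests in a single second, as with 192.0.2.10 in the sample, is a likely robot; a human visitor rarely requests more than one page per second.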

Denial of Service Attacks

Figure 6. Highly requested files in the All Requests report can signify a DoS attack.

Some robots are malicious. The most common of these are designed to carry out a Denial-of-Service (DoS) attack. A DoS attack floods a server with such a large volume of requests that the server becomes too busy to handle all traffic, and the web site (or other service) is no longer available to normal users. Usually, these attacks request dynamic pages or large objects from the server to consume bandwidth and processing power as effectively as possible. If your server becomes slow and unresponsive, you can use Summary’s Host Report to see which host originated the attack. Using the All Requests report you can often tell, as in Figure 6, which pages or files on your site are being requested, and where possible make adjustments to reduce the load they put on your server.

Unfortunately, there is a more complicated attack, called Distributed Denial of Service (DDoS), where the flood of requests comes from a range of hosts. This makes it harder to determine who is responsible for the action (in fact, most DDoS attacks are carried out by worms that have infected the systems that are hitting your server). In addition, a DDoS attack is harder to detect, because you do not see the obvious traffic flood from a single host; it can often look like a large influx of regular traffic to the site. You may still notice traffic spikes in the All Requests report, telling you that certain pages on your site are receiving much more traffic than expected. If these pages are either high-load (e.g. they require complex database queries) or large files, and are not what your normal visitors regularly request, this is your clue that your site is under attack and not just flooded with a large volume of visitor traffic.
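
The page-level check described here can be sketched as a comparison of per-page request counts against what each page normally receives. The function name path_spikes, the threshold factor of 5 and the sample data below are assumptions made for illustration; a sensible baseline and threshold depend entirely on your own traffic.

```python
from collections import defaultdict


def path_spikes(requests, baseline, factor=5.0):
    """Flag paths whose request count exceeds factor times their usual count.

    requests: iterable of (host, path) pairs from the period under review.
    baseline: {path: typical request count for a period of the same length}.
    Returns {path: (request_count, distinct_hosts)} for the flagged paths.
    """
    counts = defaultdict(int)
    hosts_by_path = defaultdict(set)
    for host, path in requests:
        counts[path] += 1
        hosts_by_path[path].add(host)

    flagged = {}
    for path, count in counts.items():
        # Assumption: a path with no baseline normally gets about one request.
        usual = baseline.get(path, 1)
        if count > factor * usual:
            flagged[path] = (count, len(hosts_by_path[path]))
    return flagged


# Illustrative data: 60 different hosts each request one expensive page once,
# while the home page sees slightly less traffic than usual.
requests = [(f"203.0.113.{i}", "/search.cgi") for i in range(60)]
requests += [("192.0.2.10", "/index.html")] * 20
baseline = {"/search.cgi": 4, "/index.html": 25}

print(path_spikes(requests, baseline))  # {'/search.cgi': (60, 60)}
```

A flagged page requested by many distinct hosts, none of which stands out in the Host Report, matches the distributed pattern; the script cannot tell an attack from a genuine surge of interest, so the result is only a prompt to look more closely.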

Worms

Worms are malicious robots that look for security vulnerabilities in web (or other) servers that allow them to break in and infect the server, thus propagating themselves across the Internet. On the infected server they may carry out a Distributed Denial of Service attack, as described previously; they may cripple the server, making it fail or even destroying the data stored on it; or they may open a ‘back door’ that allows intruders to access the system and read, change or destroy any information on it. Two of the better-known worms are Code Red and Nimda, which achieved notoriety in 2001 by infecting large numbers of Microsoft IIS web servers on the Internet.

Figure 7. The All Requests report can show the footprint of Nimda or other worms.

Web server worms can generally be detected easily by looking for access patterns in the All Requests report, because they propagate by exploiting vulnerabilities in the web server’s request processing. Figure 7 shows the footprint of a Nimda worm attack as it might appear in the All Requests report of an infected server. Note that on a server that is not vulnerable to such an exploit, these requests would appear in the Failed Requests report, indicating that the worm did not gain access through any of these methods. CERT maintains a list of known exploits in Internet servers that you can monitor to find information about worms and other attacks. Usually, each advisory contains information about the footprint or another method for determining whether your system has been attacked or infected. If your system has been infected or is vulnerable, you should immediately install the patch available from your server vendor to secure your software.
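
Footprint matching of this kind is easy to automate. The sketch below looks for a few substrings that typically appear in Nimda probes (requests for cmd.exe and root.exe) and Code Red probes (requests for default.ida) and records whether each request succeeded. The signature list is deliberately short and incomplete, and the function name and sample requests are hypothetical; the relevant advisories remain the authoritative source for footprints.

```python
# Substrings commonly seen in Nimda and Code Red probes against IIS.
# Illustrative and incomplete; consult the relevant advisories for full footprints.
WORM_SIGNATURES = {
    "Nimda": ("/cmd.exe", "/root.exe"),
    "Code Red": ("/default.ida",),
}


def classify_probes(requests):
    """Match request paths against known worm footprints.

    requests: iterable of (path, status_code) pairs.
    Returns a list of (worm_name, path, succeeded) tuples, where succeeded
    is True when the server answered with a 2xx status.
    """
    findings = []
    for path, status in requests:
        lowered = path.lower()
        for worm, needles in WORM_SIGNATURES.items():
            if any(needle in lowered for needle in needles):
                findings.append((worm, path, 200 <= status < 300))
                break  # one match per request is enough
    return findings


SAMPLE = [  # made-up requests, for illustration only
    ("/scripts/root.exe?/c+dir", 404),
    ("/c/winnt/system32/cmd.exe?/c+dir", 404),
    ("/default.ida?NNNNNNNN", 404),
    ("/index.html", 200),
]

for worm, path, succeeded in classify_probes(SAMPLE):
    print(worm, path, "SUCCEEDED" if succeeded else "failed")
```

Probes that failed correspond to entries in the Failed Requests report. A probe that succeeded does not prove infection on its own, but it is a strong reason to check the server against the advisory for that worm.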

Content Mirroring

Figure 8. The Browser Report lists Wget and other mirroring tools.

Just as search engine robots read every page of your site in order to add it to their indexes, there are robots that people can use to read every page of your site and save it on their own computers. This ‘content mirroring’ may be something that you do not want visitors doing. Two common tools that allow this are Wget and Microsoft Internet Explorer’s offline content feature. In normal usage, both of these tools identify themselves properly, as you can see in the Browser Report in Figure 8. Wget, however, allows the user to masquerade as any user agent they choose, so a mirroring user may not show up obviously in the Browser Report. If you suspect that users are mirroring your site while masquerading as standard browsers, you can look at the access patterns of particular hosts, much as you would for poorly behaved robots (which is really what these are). A host that requests each file on your site once is mirroring it.
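
Both checks, the user agent and the access pattern, can be combined in a few lines. This is a minimal sketch: the function name mirroring_suspects, the 90% coverage threshold and the sample data are assumptions for illustration, and the set of site paths would have to come from your own file listing.

```python
from collections import defaultdict


def mirroring_suspects(requests, site_paths, min_coverage=0.9):
    """Find hosts that look like they are mirroring the site.

    requests: iterable of (host, user_agent, path) triples.
    site_paths: non-empty set of all file paths on the site.
    Returns {host: reason}.
    """
    seen = defaultdict(set)
    suspects = {}
    for host, agent, path in requests:
        if "wget" in agent.lower():
            suspects[host] = "user agent identifies as Wget"
        if path in site_paths:
            seen[host].add(path)

    # Hosts with an ordinary-looking user agent may still have fetched
    # nearly every file on the site.
    for host, paths in seen.items():
        coverage = len(paths) / len(site_paths)
        if host not in suspects and coverage >= min_coverage:
            suspects[host] = f"requested {coverage:.0%} of site files"
    return suspects


# Made-up data, for illustration only.
site_paths = {"/", "/a.html", "/b.html", "/c.html"}
requests = [("192.0.2.10", "Wget/1.8", "/a.html")]
requests += [("198.51.100.7", "Mozilla/4.0 (compatible; MSIE 6.0)", p) for p in sorted(site_paths)]
requests += [("203.0.113.5", "Mozilla/4.0", "/")]

for host, reason in sorted(mirroring_suspects(requests, site_paths).items()):
    print(host, reason)
```

The coverage test is what catches a masquerading tool: 198.51.100.7 presents a browser user agent but is still flagged because it fetched every file, while 203.0.113.5 loaded a single page and is not.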




Copyright 2002 by Summary.Net - Updated 16.Apr.2002