Bad Robots
Troublemakers are not always individuals. They may be automated tools that
are either poorly behaved or run by users who do not understand the proper
conduct for using them. Properly behaved robots abide by some self-enforced
rules to make sure that they do not adversely affect your server’s performance.
Robots should not make large numbers of requests in a short period of time. Most
robot etiquette suggests at least a second between requests. Some robots
round-robin their requests across many servers so that they may have several
minutes or hours between subsequent requests to your site. Well-behaved robots
also adhere to the Standard for Robot Exclusion, which states that they must
read the file /robots.txt in the root of each web site and index only the
content that the file allows them to index. All the robots listed in Summary’s
Known Robots report abide by the Standard for Robot Exclusion. Summary also
includes the Possible Robots report, which gives statistics for hosts that made
a request to /robots.txt.
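As a rough illustration of what a well-behaved robot does with /robots.txt, the following Python sketch uses the standard library's urllib.robotparser module. The robots.txt content, the "ExampleBot" user agent, and the www.example.com URLs are all made up for the example; a real robot would fetch the file from the site root and would also pause between requests.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch).
# A real robot would download it from the root of the web site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /downloads/
Disallow: /catalog/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Return only the URLs that the exclusion rules let this user agent fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(ROBOTS_TXT)
candidates = [
    "http://www.example.com/index.html",
    "http://www.example.com/downloads/trial.exe",
    "http://www.example.com/catalog/item?id=1",
]
permitted = allowed_urls(parser, "ExampleBot", candidates)
print(permitted)
```

With these sample rules, only the index page remains in the list; the large download and the database-driven catalog page are skipped.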
Unfortunately, there are some robots or spiders that do not behave according
to these rules. They may not be malicious, just poorly designed or implemented.
For example, some robots will crawl your site and download large files, like
trial software or documents that you make available, even when the indexing
tool that they are connected to cannot process those file types. If this becomes
common, it can lead to excessive bandwidth usage and put a drain on your server
resources. The same can be true for dynamically generated pages. If you have a
database-driven section of your site that creates dynamic pages that link to
more database content, robots can quickly become mired in loading the entire
volume of data available for indexing. If the links that your dynamic pages
generate are all unique, the robot can get lost in an endless loop of loading new
URLs, even when the content is the same as that of pages it has already loaded.
Figure 5. High traffic volume in the Host report could be a robot.
Robots generally make subsequent requests from the same host. You can use the
Host Report, as in Figure 5, to see if any
hosts are making a large number of requests. You can then filter traffic from
just this host (in a subreport with Summary SP) and examine the request report.
If there are about the same number of requests to each page on your site, it is
likely that the activity is caused by a robot. If the requests arrive in high
volume over a very short period of time (for example, several requests per
second), you can be almost certain that this is the behavior of a robot. You can
also change the sort order of the Host Report (by clicking on the column names) to
look for hosts that request a large number of pages (robots do not usually load
graphics), or that have a large number of visits (some robots spread their
crawls over such a long time that they appear to ‘visit’ the site once
for each page they index).
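The same checks can be approximated directly against a raw access log. The sketch below is not how Summary works internally; it assumes the log is in Common Log Format, and the sample lines, function names, and thresholds are invented for illustration. It counts total requests per host and the largest number of requests each host made within a single second.

```python
import re
from collections import Counter, defaultdict
from datetime import datetime

# Assumption: Common Log Format lines, e.g.
# host - - [10/Oct/2002:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)


def parse_line(line):
    """Return (host, timestamp, path) for a log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return match.group("host"), when, match.group("path")


def summarize_hosts(lines):
    """Map each host to (total requests, most requests seen in any one second)."""
    totals = Counter()
    per_second = defaultdict(Counter)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        host, when, _path = parsed
        totals[host] += 1
        per_second[host][when] += 1  # log timestamps have one-second resolution
    return {host: (totals[host], max(per_second[host].values())) for host in totals}


def likely_robots(summary, min_requests=3, min_peak_per_second=2):
    """Hosts with many requests and a burst of several requests in one second."""
    return sorted(
        host
        for host, (total, peak) in summary.items()
        if total >= min_requests and peak >= min_peak_per_second
    )


SAMPLE_LOG = [
    '192.0.2.10 - - [10/Oct/2002:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '192.0.2.10 - - [10/Oct/2002:13:55:36 -0700] "GET /about.html HTTP/1.0" 200 1200',
    '192.0.2.10 - - [10/Oct/2002:13:55:36 -0700] "GET /products.html HTTP/1.0" 200 4100',
    '198.51.100.7 - - [10/Oct/2002:13:56:02 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '198.51.100.7 - - [10/Oct/2002:13:58:40 -0700] "GET /logo.gif HTTP/1.0" 200 512',
]
print(likely_robots(summarize_hosts(SAMPLE_LOG)))
```

The thresholds here are deliberately tiny so that the sample data triggers them; on a real site they would need to be tuned to normal traffic levels.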
Denial of Service Attacks
Figure 6. Highly requested files in the All Requests report can signify a DoS attack.
Some robots are malicious. The most common of these are ones designed to
implement a Denial-of-Service (DoS) attack. DoS attacks are accomplished by
flooding a server with a very large volume of requests such that the server
becomes too busy to handle all traffic and the web site (or other service) is no
longer available for normal users. Usually, these DoS attacks will request
dynamic pages or large objects from the server to consume bandwidth and
processing power as effectively as possible. If your server becomes slow and
unresponsive, you can use Summary’s Host Report
to see which host originated the attack. Using the All Requests report, you can
often tell, as in Figure 6, which pages or files on your site are being
requested and make adjustments to reduce the load they put on your server, where
possible.
Unfortunately, there is a more complicated attack, called Distributed Denial
of Service (DDoS), where the flood of requests comes from a range of hosts. This
makes it harder to determine who is responsible for the attack (in fact, most
DDoS attacks are carried out by worms that have infected the systems that are
hitting your server). In addition, a Distributed Denial of Service attack is
harder to detect, because you do not see the obvious traffic flood from a single
host. In fact, it can often look like a large influx of regular traffic to the
site. You may still notice the traffic spikes in the All Requests report,
telling you that certain pages on your site are receiving much more traffic than
expected. If these pages are either high-load (e.g. they require complex
database queries) or large files, and are not regular requests of your normal
visitors, this is your clue that your site is under attack, and not just flooded
with a large volume of visitor traffic.
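One way to look for this pattern outside of Summary is to tally, for each requested path, how many hits it received and how many distinct hosts requested it. The Python sketch below assumes the log has already been reduced to (host, path) pairs; the list of expensive paths, the thresholds, and the sample data are hypothetical and would have to come from your own knowledge of the site.

```python
from collections import Counter, defaultdict


def path_profile(records):
    """Map each path to (number of hits, number of distinct requesting hosts)."""
    hits = Counter()
    hosts = defaultdict(set)
    for host, path in records:
        hits[path] += 1
        hosts[path].add(host)
    return {path: (hits[path], len(hosts[path])) for path in hits}


def suspicious_paths(profile, expensive_paths, min_hits, min_hosts):
    """Expensive paths that are hit unusually often by many different hosts."""
    return sorted(
        path
        for path, (hit_count, host_count) in profile.items()
        if path in expensive_paths and hit_count >= min_hits and host_count >= min_hosts
    )


# Hypothetical data: four hosts each request a database-driven page twice,
# while ordinary visitors load the home page.
records = [(f"203.0.113.{n}", "/search.cgi") for n in range(1, 5)] * 2
records += [("198.51.100.7", "/index.html"), ("198.51.100.8", "/index.html")]

profile = path_profile(records)
flagged = suspicious_paths(
    profile,
    expensive_paths={"/search.cgi", "/downloads/trial.exe"},  # assumed high-load URLs
    min_hits=8,
    min_hosts=4,
)
print(flagged)
```

A result like this is only a hint: a popular page can produce the same numbers, so the flagged paths still need to be compared against what your normal visitors request.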
Worms
Worms are malicious robots that look for security vulnerabilities in web (or
other) servers that allow them to break in and infect the server, thus
propagating themselves across the Internet. On the infected server they may
launch a Distributed Denial of Service attack, as described previously; they
may cripple the server, making it fail or even destroying the data stored on it;
or they may open a ‘back door’ that allows intruders to access the system
and read, change or destroy any information on it. Two of the better-known
worms are Code Red and Nimda, which achieved notoriety in 2001 by infecting
large numbers of Microsoft IIS web servers on the Internet.
Figure 7. The All Requests report can show the footprint of Nimda or other worms.
Web server worms can generally be easily detected by looking for access
patterns in the All Requests report because
they propagate by exploiting vulnerabilities in the web server request
processing. Figure 7 shows the footprint of a Nimda worm attack as it might appear
in the All Requests report of an infected server. Note that on a server that is
not vulnerable to such an exploit, these requests would appear in the Failed
Requests report, indicating that the worm did not
gain access through any of these methods. CERT maintains a list of known exploits in
Internet servers that you can monitor to find information about worms and
other attacks. Usually, each advisory contains information about the footprint
or other detection method for determining if your system has been attacked or
infected. If your system has been infected or is vulnerable, you should
immediately install the patch available from your server vendor to secure your
software.
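The footprint check described above can also be scripted. The following Python sketch searches requested paths for a few fragments that are typical of Nimda (requests for cmd.exe and root.exe) and Code Red (requests for default.ida) probes, and separates requests that succeeded from those that failed. The signature table is a small illustrative subset, not a complete or authoritative list, and the sample requests are invented; the relevant advisories remain the proper source for full footprints.

```python
# Illustrative subset of path fragments seen in worm probes (an assumption for
# this sketch; consult the published advisories for complete footprints).
WORM_SIGNATURES = {
    "Nimda": ("/cmd.exe", "/root.exe"),
    "Code Red": ("/default.ida",),
}


def match_worm(path):
    """Return the name of the worm whose signature appears in the path, if any."""
    lowered = path.lower()
    for worm, fragments in WORM_SIGNATURES.items():
        if any(fragment in lowered for fragment in fragments):
            return worm
    return None


def worm_footprint(requests):
    """Count worm-like requests per worm, split into succeeded (2xx) and failed."""
    report = {}
    for path, status in requests:
        worm = match_worm(path)
        if worm is None:
            continue
        counts = report.setdefault(worm, {"succeeded": 0, "failed": 0})
        outcome = "succeeded" if 200 <= status < 300 else "failed"
        counts[outcome] += 1
    return report


# Hypothetical (path, HTTP status) pairs taken from an access log.
sample_requests = [
    ("/scripts/root.exe?/c+dir", 404),
    ("/MSADC/root.exe?/c+dir", 404),
    ("/c/winnt/system32/cmd.exe?/c+dir", 404),
    ("/scripts/..%255c../winnt/system32/cmd.exe?/c+dir", 404),
    ("/default.ida?NNNNNNNN", 404),
    ("/index.html", 200),
]
footprint = worm_footprint(sample_requests)
print(footprint)
```

In this sample every probe failed, which is what you would expect on a server that is not vulnerable. Any nonzero "succeeded" count would be a reason to investigate the server immediately.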
Content Mirroring
Figure 8. The Browser Report lists Wget and other mirroring tools.
Just as search engine robots will read every page of your site in order to
add it to their indexes, there are robots that people can use to read every page
of your site and save it on their computer. This ‘content mirroring’
may be something that you do not want visitors to do. Two common tools that provide
this functionality are “wget” and Microsoft Internet Explorer’s
offline content feature. In normal usage, both of these tools properly identify
themselves, as you can see in the Browser Report in Figure 8. Wget, however,
allows the user to masquerade as any user agent they choose, so a mirroring user
may not show up obviously in the Browser Report. If you suspect that users are
mirroring your site while masquerading as standard browsers, you can look at the
access patterns of particular hosts (if a host requests each file on your site
once, that is mirroring activity), much as you would for poorly behaved robots
(which is really what these are).
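The "each file once" pattern lends itself to a simple check. The Python sketch below assumes you have a list of the files that make up your site and the log reduced to (host, path) pairs; the function name, the 90 percent coverage threshold, and the sample data are all hypothetical. It flags hosts that fetched nearly every file on the site exactly once.

```python
from collections import Counter, defaultdict


def mirroring_suspects(records, site_files, min_coverage=0.9):
    """Hosts that requested almost every site file, each exactly once."""
    if not site_files:
        return []
    per_host = defaultdict(Counter)
    for host, path in records:
        if path in site_files:
            per_host[host][path] += 1
    suspects = []
    for host, counts in per_host.items():
        coverage = len(counts) / len(site_files)  # share of the site this host fetched
        fetched_once = all(count == 1 for count in counts.values())
        if coverage >= min_coverage and fetched_once:
            suspects.append(host)
    return sorted(suspects)


# Hypothetical site contents and requests.
site_files = {"/index.html", "/about.html", "/products.html", "/logo.gif"}
records = [
    ("203.0.113.5", "/index.html"),
    ("203.0.113.5", "/about.html"),
    ("203.0.113.5", "/products.html"),
    ("203.0.113.5", "/logo.gif"),
    ("198.51.100.7", "/index.html"),
    ("198.51.100.7", "/logo.gif"),
    ("198.51.100.7", "/index.html"),
]
suspects = mirroring_suspects(records, site_files)
print(suspects)
```

Because the check ignores the user agent entirely, it still works when a mirroring tool masquerades as a standard browser.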