Web Analytics Tutorial
Appendix B – Technical Details of Metrics Accuracy
Proxies, Caches and Firewalls

Everyone who connects to the Internet goes through an access provider at some point. For individuals and home users, this is usually an Internet Service Provider like AOL or EarthLink. These access providers want to reduce the amount of traffic they carry and to increase the efficiency and security of their service. In the same manner, businesses need to share Internet access and maintain the security of their internal network. To do this, these companies usually employ special servers called Web Caches, Proxies and Firewalls. Each of these servers affects the way requests to your web site are seen by your server.

Caches

To reduce the network load, major Internet Service Providers implement web caching servers across their service. A web cache is similar to a browser cache except that it works for all users on the system. A browser cache recognizes files that have already been requested and gets them from the cache instead of the Internet for a single user. A web cache does this for multiple users.

Imagine that a user connecting through a major provider browses your site. The same day another user (who may be in a different city or state) also navigates to your site. Every page that the first user saw is served to the second user from the provider’s cache by the web caching server rather than from your web server. These requests are never registered on your server. You may have no record that the second visitor even saw your site.

There are different kinds of web caches depending on what the provider has determined is most important to cache. Some caching servers will only store graphics; others only pages; some store anything. Generally the caches store items for a limited period of time, perhaps a day or a week. Once that time has elapsed, the next request for the file prompts the caching server to send your web server a query asking whether the file has changed. This query and its response are much smaller than the original file, so they use less bandwidth.
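The query described above can be pictured with a small sketch. It assumes the cache sends an HTTP conditional request carrying an If-Modified-Since date, which is one common way such a check is made; the function name `revalidate` and the sample page are hypothetical and only illustrate the idea.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def revalidate(if_modified_since, last_modified, body):
    """Answer a cache's "has this file changed?" query.

    Hypothetical helper: returns (status, response_body).
    """
    cached_at = parsedate_to_datetime(if_modified_since)
    # HTTP dates have one-second resolution, so drop microseconds first.
    if last_modified.replace(microsecond=0) <= cached_at:
        return 304, b""   # unchanged: status and headers only, no file body
    return 200, body      # changed: the full file is sent again


page = b"<html>" + b"x" * 5000 + b"</html>"
stored = datetime(2002, 4, 16, 12, 0, 0, tzinfo=timezone.utc)

# The date the cache remembers, formatted the way HTTP headers carry dates.
header = format_datetime(stored, usegmt=True)

# Case 1: the file has not been modified since the cache stored it.
status_same, body_same = revalidate(header, stored, page)

# Case 2: the file was edited a day after the cache stored its copy.
edited = datetime(2002, 4, 17, 12, 0, 0, tzinfo=timezone.utc)
status_new, body_new = revalidate(header, edited, page)
```

In the unchanged case the response carries no file body at all, which is why the exchange costs so little bandwidth even though the request still reaches your server and appears in its log.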
Summary counts these queries as hits even though the actual file was never sent. The file was requested and that is what Summary is tracking. (For byte counts Summary uses the size of the response, not the file size, so byte counts are accurate too.)

Controlling Caching

Caches are an important part of the user experience: while caching reduces the request information that you log, it improves the performance of the site as the user experiences it. However, there may be some pages that you do not want cached: live data, news pages where new material appears often, or other dynamic content.

There are a couple of ways you can tell caching servers not to cache your page or tell them when the content expires. Not all servers support them, but many of the more common ones do.

The first method involves adding HTTP 1.1 header fields to the response when a page is sent out. This can sometimes be done with embedded scripting languages such as ASP, PHP or JSP, but is usually managed at the server level. Section 14.9 of the HTTP 1.1 specification provides for the ‘Cache-Control’ header field that you can use to tell compliant caches not to cache a particular page, or to only cache it for a given period of time. You can also use the ‘Expires’ header to tell how long the document will be valid for. (The Expires header is an HTTP 1.0 field so it may work on more systems.)

If you do not have the ability to influence the headers on a particular page, you can insert meta-tags that will take the place of them. Meta-tags are rarely read by caches (because they do not read the page, just the headers) but do influence browser caching. The meta-tag uses an ‘http-equiv’ attribute to indicate that it should be treated as if it were an HTTP header.
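As a rough illustration of the header-based method described above, the sketch below builds the ‘Cache-Control’ and ‘Expires’ values a script might attach to a response. The function name `caching_headers` and its parameters are hypothetical; how the headers are actually attached depends on your server or scripting language.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def caching_headers(max_age_seconds, now=None):
    """Build caching headers for a response (hypothetical helper).

    max_age_seconds <= 0 asks caches not to reuse the page without checking;
    a positive value lets them keep it for that many seconds.
    """
    now = now or datetime.now(timezone.utc)
    if max_age_seconds <= 0:
        return {
            # no-cache: a stored copy must be revalidated before it is reused.
            "Cache-Control": "no-cache",
            # An Expires date equal to "now" marks the page as already stale,
            # which covers caches that only understand the HTTP 1.0 field.
            "Expires": format_datetime(now, usegmt=True),
        }
    expires_at = now + timedelta(seconds=max_age_seconds)
    return {
        "Cache-Control": f"max-age={max_age_seconds}",
        "Expires": format_datetime(expires_at, usegmt=True),
    }


sent_at = datetime(2002, 4, 16, 12, 0, 0, tzinfo=timezone.utc)
live_page = caching_headers(0, now=sent_at)        # e.g. live data, news pages
hourly_page = caching_headers(3600, now=sent_at)   # may be cached for one hour
```

Sending both fields together is a deliberate choice in this sketch: caches that understand HTTP 1.1 can follow ‘Cache-Control’, while older ones can fall back on ‘Expires’.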
For example, the following meta-tag would tell supporting browsers not to cache a page:

<meta http-equiv="Cache-Control" content="no-cache">

Proxies

While caches will alter the amount of data you can collect for content metrics, proxy servers will significantly affect the value of data based on host counting, including visit tracking metrics. A proxy server acts as an intermediary between a group of users and the outside world. Often this takes the form of a firewall to secure a company’s internal network or a NAT box to allow one Internet connection to work for many users (many web caches also act as proxies).

When a proxy exists between a group of users and your website, you see all requests from those users as if they came from the one host (the proxy server). Even if three different people at a company are browsing your site, it looks like they are all one host to your web server. As visit tracking depends in part on the host making the request, this can make the visit statistics much different from actual visits. Fortunately, Summary has some techniques to detect multiple visitors behind a proxy and distinguish their visits, but they will not always work.

Many proxies contain web caches and most web caches act as a proxy so the effects of both of these are seen at once. For those that support it, you can use the above techniques to communicate with the cache part of the server. In the final section of the appendix we discuss some advanced techniques you can use to attempt to maintain visit distinctions through a proxy.

Firewalls

Firewalls are computers that usually reside between a company’s internal network and the Internet to secure the internal network from malicious public access. Many firewalls, because of their function and location, work like proxies. Some implement web caches.

In addition to proxying visitor traffic, firewalls limit what can be sent in either direction. Often the limitation is by protocol. If all of your traffic is HTTP (i.e. a web site), then that traffic should pass through. However, if your site uses Java applets that connect to your server, that would be a different protocol and probably blocked by the firewall (because it is unknown). Some firewalls also limit or strip content, such as Java applets, plugins and their content (like Shockwave Flash). Stripping firewalls will even remove cookies or referrer information. Because of this it may be very hard to distinguish individual visits on the other side of the firewall, and you will have less referrer information to analyze. (For security, it is in the interest of the firewall to make requests as anonymous as possible.) The hosts will likely be the same (the proxy problem), and sometimes the requests will be modified so as to make them all appear to be from a single computer.

Proxy Clustering

On top of all of this, some service providers (AOL being the most prevalent) have implemented a new technique called ‘proxy clustering.’ Proxy clustering works by routing subsequent requests from a user to different proxy servers. If a given user pulls up your home page, the browser will request the page and then all the graphics on that page. Each graphics request will be passed through a different proxy so that they can all be processed simultaneously. This means that each request looks like it came from a different host (the proxy server). Trying to detect which requests are part of each visitor’s session is very hard when the hosts are not the same, as this is the basis for visit detection.

If your web site gets a lot of traffic from AOL customers, standard visit detection will count requests from different hosts as different visits, so your visit counts will be greatly inflated. In addition, your unique host count will not represent the number of connected visitors’ hosts, but will be lower as all AOL visitors will be using the same few proxy servers.
Fortunately, Summary Plus and SP are aware of this and make special-case adjustments to accurately recover visit information across these proxy servers.
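The two distortions described in this appendix can be seen in a toy example. The sketch below counts visits in the simplest possible way, one visit per distinct requesting host; this is a deliberate oversimplification for illustration, not the method Summary uses, and the host names and the `naive_visits` function are made up.

```python
def naive_visits(requests):
    """Count one visit per distinct requesting host (oversimplified on purpose)."""
    return len({host for host, _path in requests})


# Three different people behind one company proxy:
# every request arrives with the proxy's host name.
behind_proxy = [
    ("proxy.example.com", "/index.html"),     # person A
    ("proxy.example.com", "/products.html"),  # person B
    ("proxy.example.com", "/index.html"),     # person C
]

# One person behind a proxy cluster:
# the page and each graphic arrive through a different proxy.
clustered = [
    ("cache-1.example.net", "/index.html"),
    ("cache-2.example.net", "/logo.gif"),
    ("cache-3.example.net", "/banner.gif"),
]

proxy_count = naive_visits(behind_proxy)   # three real visitors
cluster_count = naive_visits(clustered)    # one real visitor
```

Here the ordinary proxy collapses three visitors into a single host, while the cluster spreads one visitor across three hosts, so host-based counting is too low in the first case and too high in the second.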
Table of Contents
1: What is Web Analytics?
2: Where are My Visitors Coming From?
3: Search Engines
4: Advertising
5: Revenue Modeling
6: Design Considerations
7: Determining Visitor Behavior Patterns
8: Examining Subsets of Traffic
9: Incorporating Business Goals
10: Bandwidth Management
11: Site and Server Diagnostics
12: Investigating Troublemakers
Appendix A: Making Reports More Usable
Appendix B: Technical Details of Metric Accuracy

Copyright 2002 by Summary.Net - Updated 16.Apr.2002