Summary

Web Analytics Tutorial

 

Appendix B – Technical Details of Metrics Accuracy

IN THIS APPENDIX
* Limitations of Metrics Accuracy
   Visit Detection
* Proxies, Caches and Firewalls
   Proxies
   Caches
   Controlling Caching
   Firewalls
   Proxy Sharing
* Validity of Agent Data
   User Agents
   Referrers
   Hosts
   Validity of Reports
* Visit Time Issues
   View Time
   Visit Duration
* Advanced Solutions
   Cookies
   Session Keys in URLs
   Client-side applets

Validity of Agent Data

In Lesson 12 - Investigating Troublemakers, when we talked about finding responsible parties, we mentioned that most of the data gathered in web server logs is agent-provided and therefore not entirely reliable. The data that your server itself supplies, and that is therefore reliable, includes the date and time of each request, the file requested (including CGI arguments), the authorized username (if access was granted), the status code and the bytes sent. Additional fields, like request method and protocol, are a contract between the agent and the server, so they are likely to be accurate. The most interesting information, namely the host (including domain and TLD or country), the referrer (and search terms) and the browser details, is agent-provided and can be forged, although host information is much more likely to be accurate than the rest.
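To make the grouping above concrete, here is a minimal Python sketch that splits one log line into the three groups of fields. It assumes your server writes the common NCSA Combined Log Format; the pattern, the group names and the sample line are illustrative only and are not part of Summary.

```python
import re

# Pattern for the NCSA Combined Log Format (an assumption about your log layout).
COMBINED = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Grouping follows the paragraph above; the labels themselves are hypothetical.
SERVER_PROVIDED = ("time", "path", "user", "status", "bytes")
NEGOTIATED = ("method", "protocol")
AGENT_PROVIDED = ("host", "referrer", "agent")


def classify(line):
    """Split one log line into fields grouped by how far they can be trusted.

    Returns None when the line does not match the assumed format.
    """
    match = COMBINED.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "server": {key: fields[key] for key in SERVER_PROVIDED},
        "negotiated": {key: fields[key] for key in NEGOTIATED},
        "agent": {key: fields[key] for key in AGENT_PROVIDED},
    }


# A made-up log line using documentation-only addresses and host names.
SAMPLE = ('192.0.2.10 - - [16/Apr/2002:10:15:32 -0500] '
          '"GET /index.html HTTP/1.0" 200 2326 '
          '"http://www.example.com/links.html" "Mozilla/4.0 (compatible; MSIE 6.0)"')

result = classify(SAMPLE)
```

Everything under `result["agent"]` is what the rest of this section treats with caution; the values under `result["server"]` were recorded by your own server.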

User Agents

Some user agents (or browsers) allow the user to change how they identify themselves to your server. The mirroring tool Wget allows the user to define any user agent string to be sent with each request. The Opera web browser allows the user to choose one of several more common browsers that it can masquerade as. Therefore, all reports based on user agent data are only as good as the information collected. This includes the Browsers, Browser Brands, Platforms, Known Robots and Agent reports. Summary calculates these reports accurately from the reported agent data, so the reports are correct with respect to your logs. It is possible, though, that some of the reported user agents were not what was really being used.
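To see how little effort this takes, consider the following sketch, which uses Python's standard urllib module rather than Wget or Opera. It builds a request that claims to come from a browser it is not, without actually sending anything. The URL and the agent string are made-up examples.

```python
from urllib.request import Request


def build_request(url, claimed_agent):
    """Build (but do not send) a request carrying an arbitrary User-Agent string."""
    return Request(url, headers={"User-Agent": claimed_agent})


# Hypothetical values: any text at all can be supplied as the agent.
CLAIMED = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
req = build_request("http://www.example.com/", CLAIMED)

# urllib stores header names in capitalized form, hence "User-agent".
print(req.get_header("User-agent"))
```

If such a request were sent, the server would log the claimed string in its agent field, and nothing in the log would distinguish it from a genuine browser.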

Referrers

Referrer information is also easily forged. Opera (again) allows the user to configure it to not provide any referrer information. Some firewalls will strip this data from HTTP requests. Neither of these is forgery, but both affect your results and counts. There have been reports of unscrupulous “marketers” who have used robots to fill logs with referrer information pointing back to their own web sites. The only reason for this would be to attract the attention of those reading the reports. This kind of traffic is easy to spot and remove with a filter. (See Lesson 8 - Examining Subsets of Traffic for details on filters.)

There is not much other value in faking a referrer, so it is not very commonly done. However, you should be aware that all the referrer data, and the reports based on it, reflect the reported referrers, whatever page the visitor was really on before the request to your site. The reports that depend on referrer data include Referring Domains and related reports, Referrers and related reports, Search Engines, Search Words and Search Phrases and reports related to those, and the Refers To and Referred From reports. In addition, all the Path Analysis reports are dependent on referrers, as are the Bad Links and Failed Referrers reports.
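The idea of spotting and filtering the referrer-filling robots described above can be sketched outside of Summary as well. The Python below is not Summary's filter mechanism; it is a hypothetical illustration that assumes each request has already been parsed into a dictionary with a "referrer" key, and that you decide by inspection which domains are spam.

```python
from collections import Counter
from urllib.parse import urlsplit


def referrer_domain(referrer):
    """Return the lower-cased host part of a referrer URL, or '' if there is none."""
    if not referrer or referrer == "-":
        return ""
    return (urlsplit(referrer).hostname or "").lower()


def top_referrer_domains(records, n=10):
    """Count requests per referring domain so that unusual spikes stand out."""
    counts = Counter(referrer_domain(record["referrer"]) for record in records)
    counts.pop("", None)  # ignore requests that carried no referrer
    return counts.most_common(n)


def drop_spam(records, spam_domains):
    """Keep only the records whose referring domain is not in spam_domains."""
    return [r for r in records if referrer_domain(r["referrer"]) not in spam_domains]


# Made-up records; spam.example stands in for a referrer-spamming site.
records = [
    {"referrer": "http://spam.example/buy-now"},
    {"referrer": "http://spam.example/buy-now"},
    {"referrer": "http://SPAM.example/cheap"},
    {"referrer": "http://www.example.com/links.html"},
    {"referrer": "-"},
]

suspects = top_referrer_domains(records)
clean = drop_spam(records, {"spam.example"})
```

A domain that suddenly tops the list without any matching rise in genuine visits is the usual sign of this kind of traffic. The decision about what counts as spam still rests with you.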

Hosts

It is not easy to fake the host IP address that your server receives with a request. Unlike user agent and referrer information, which is included in the HTTP header (a set of text lines sent before the content), the host address is carried in the IP header of each packet sent over the network. In order for someone to fake her IP address, she must change the TCP/IP stack (or network driver) on her computer to submit fake information. Tools to do so are certainly available, and there are people in the world who have the skill to use them, but there is no commonly available software that allows this activity (as Wget and Opera do for the other data). On the other hand, it is simple for whoever controls the reverse DNS records for an IP address to fake the host and domain name returned when Summary does DNS look-ups, but this is rarely done. You can assume that the bulk of your host-based data is accurate, especially when aggregated. However, when using the host information to identify an individual visit, especially one of malicious intent, you should be skeptical of its validity.
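One common way to gain some confidence in a looked-up host name is to check that the name resolves back to the same IP address. The sketch below shows that check in Python using the standard socket module. It is an illustration only, not a description of how Summary performs DNS look-ups; it handles IPv4 addresses only, and the resolver parameters exist so the function can be tried without network access.

```python
import socket


def forward_confirmed(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return the host name for ip only if that name resolves back to ip.

    Returns None when either look-up fails or the addresses do not match.
    """
    try:
        name, _aliases, _addresses = reverse(ip)
        _canonical, _aliases, addresses = forward(name)
    except OSError:  # socket.herror and socket.gaierror are both OSError
        return None
    return name if ip in addresses else None


# Stand-in resolvers with made-up data, so the example runs offline.
def fake_reverse(ip):
    return ("host.example.com", [], [ip])


def honest_forward(name):
    return (name, [], ["192.0.2.10"])


def forged_forward(name):
    return (name, [], ["198.51.100.7"])


confirmed = forward_confirmed("192.0.2.10", fake_reverse, honest_forward)
rejected = forward_confirmed("192.0.2.10", fake_reverse, forged_forward)
```

Calling `forward_confirmed("192.0.2.10")` with no other arguments uses the real resolvers. A name that fails the check is not proof of malice, since many hosts simply have mismatched records, but a name that passes is harder to forge.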

Validity of Reports

At this point you may have some concerns about the validity of web analytics in general. Several factors contribute to making web analytics a valuable tool, despite the possibility of false data. First, the vast majority of users, especially those in whom you are most interested, will not bother faking or omitting this information from their requests. More than 99% of web users use a major browser that accurately reports all information in requests. Robots may fake information, but how many do so is hard to detect; it is perhaps one or two percent. Second, Summary and other web analytics tools accurately report on the information in your logs. As there is no way of knowing what the “correct” information is, it is usually valid enough to analyze the information that your users wished you to get from them. Third, because of the low quantity of falsified data, any summaries that cover data in aggregate will have a very low margin of error and can be accepted as nearly accurate. Finally, whether or not the metrics you choose to track represent any real-world quantities, they do represent measurable quantities. In analyzing trends you can compare values from distinct time periods and produce valuable growth information from that.
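The third and fourth points can be illustrated with a little arithmetic. The Python sketch below assumes that at most a known fraction of requests carry falsified data; under that assumption, the true share of any category can differ from the reported share by no more than that fraction. The function names and the figures are hypothetical.

```python
def share_bounds(reported_share, forged_fraction):
    """Bound the true share of a category, given the reported share.

    Assumes at most forged_fraction of all requests carry false values.
    Forged requests could all be wrongly counted in the category (lower
    bound) or all be wrongly left out of it (upper bound).
    """
    low = max(0.0, reported_share - forged_fraction)
    high = min(1.0, reported_share + forged_fraction)
    return low, high


def growth(previous, current):
    """Relative change between two periods, as a fraction of the earlier value."""
    return (current - previous) / previous


# Made-up figures: a browser reported at 60% of visits, with up to 2% forged.
low, high = share_bounds(0.60, 0.02)

# Made-up figures: 12,000 visits last month and 15,000 this month.
change = growth(12000, 15000)
```

With these figures the true share lies between roughly 58% and 62%, and the month-over-month growth is 25%. Provided the level of falsified data stays roughly the same from one period to the next, the growth figure is affected even less than the shares are.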


Copyright 2002 by Summary.Net - Updated 16.Apr.2002