Web Analytics Tutorial
Appendix B – Technical Details of Metrics Accuracy
Validity of Agent Data
In Lesson 12 - Investigating Troublemakers, when we talked about finding responsible parties, we mentioned that most of the data gathered in web server logs is agent-provided and therefore not entirely reliable. The data that your server itself provides, and that is therefore reliable, includes the date and time of each request, the file requested (including CGI arguments), the authorized username (if access was granted), the status code, and the bytes sent. Additional fields, such as request method and protocol, are a contract between the agent and the server, so they are likely to be accurate. The most interesting information, namely host (including domain and TLD or country), referrer (and search terms), and browser details, is agent-provided and can be forged, although host information is much more likely to be accurate than the rest.

User Agents
Some user agents (or browsers) allow the user to change how they identify themselves to your server. The mirroring tool Wget allows the user to define any user agent string to be sent with each request. The Opera web browser allows the user to choose one of several more common browsers that it can masquerade as. Therefore, all reports based on user agent data are only as good as the information collected. This includes the

Referrers
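Both the user agent string and the referrer reach your server as ordinary request headers that the client fills in. A minimal sketch of why they cannot be trusted, using Python's standard `urllib.request` module; the URLs and the browser string are made-up examples, and the request is only built, never sent:

```python
from urllib.request import Request


def build_request(url, user_agent, referrer):
    """Build (but do not send) a request with client-chosen headers."""
    return Request(url, headers={"User-Agent": user_agent, "Referer": referrer})


# Hypothetical values: the client may put any text it likes in these fields.
req = build_request(
    "http://www.example.com/index.html",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "http://www.example.org/some-page.html",
)

# urllib stores header names via str.capitalize(), hence "User-agent".
print(req.get_header("User-agent"))
print(req.get_header("Referer"))
```

Nothing in the request proves that the client really is the browser it names or really came from the page it cites; the server simply logs whatever text arrives.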
Referrer information is also easily forged or withheld. Opera (again) allows the user to configure it not to provide any referrer information, and some firewalls strip this data from HTTP requests. Neither of these is forging, but both affect your results and counts.

There have been reports of unscrupulous “marketers” who have used robots to fill logs with referrer information pointing back to their own web sites. The only reason for doing this is to attract the attention of those reading the reports. This kind of traffic is easy to spot and remove with a filter. (See Lesson 8 - Examining Subsets of Traffic for details on filters.) There is not much other value in faking a referrer, so it is not very commonly done. However, you should be aware that all referrer data, and the reports based on it, reflect the referrers that were reported, whatever the real page was before the request to your site. The reports that depend on referrer data include

Hosts
It is not easy to fake the host IP number that your server receives with a request. Unlike user agent and referrer information, which is included in the HTTP header (a set of text lines sent before the content), the host address is carried in the IP header that wraps each packet sent over the network. In order for someone to fake her IP address, she must change the TCP/IP stack (or network driver) on her computer to submit false information. Tools to do so are certainly available, and there are people who have the skill to use them, but there is no commonly available software that allows this activity (as Wget and Opera do for the other data). On the other hand, it is simple to fake the host and domain name returned for a given IP address when Summary does DNS look-ups, but this is rarely done. You can assume that the bulk of your host-based data is accurate, especially when aggregated. However, when using host information to identify an individual visit, especially one of malicious intent, you should be skeptical of its validity.

Validity of Reports
At this point you may have some concerns about the validity of web analytics in general. Several factors contribute to making web analytics a valuable tool, despite the possibility of false data.

First, the vast majority of users, especially those whom you are most interested in, will not bother faking or omitting this information from their requests. More than 99% of web users use a major browser that accurately reports all information in requests. Robots may fake information, but how many do so is hard to detect, perhaps one or two percent.

Second, Summary and other web analytics tools accurately report on the information in your logs. As there is no way of knowing what the “correct” information is, it is usually valid enough to analyze the information that your users wished you to get from them.
Third, because of the low quantity of falsified data, any summaries that cover data in aggregate will have a very low margin of error and can be accepted as nearly accurate.

Finally, whether or not the metrics you choose to track represent real-world quantities, they do represent measurable quantities. In analyzing trends, you can compare values from distinct time periods and derive valuable growth information from them.
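To make the aggregate argument concrete, here is a rough worst-case calculation. It is only a sketch: the function name `share_bounds` and the 2% ceiling on falsified requests are illustrative assumptions, not measured values or part of any tool. The reasoning is that if at most a fraction f of all requests carry false data, a reported count can be wrong by at most f times the total, in either direction.

```python
def share_bounds(reported_count, total_requests, max_falsified_fraction):
    """Worst-case range for a category's true share of all requests.

    Assumes (hypothetically) that no more than max_falsified_fraction of
    the requests carry false data. Each falsified request can either be
    wrongly counted in the category or wrongly missing from it.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    slack = max_falsified_fraction * total_requests
    low = max(0.0, reported_count - slack)
    high = min(float(total_requests), reported_count + slack)
    return low / total_requests, high / total_requests


# Illustrative numbers: 6,000 of 10,000 requests report a given browser,
# and we assume no more than 2% of requests are falsified.
low, high = share_bounds(6000, 10000, 0.02)
print(f"true share lies between {low:.1%} and {high:.1%}")
```

With these made-up numbers the reported 60% share can be off by at most two percentage points (58% to 62%), which is why aggregate reports stay useful even when individual entries cannot be trusted.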
Table of Contents
1: What is Web Analytics?
2: Where are My Visitors Coming From?
3: Search Engines
4: Advertising
5: Revenue Modeling
6: Design Considerations
7: Determining Visitor Behavior Patterns
8: Examining Subsets of Traffic
9: Incorporating Business Goals
10: Bandwidth Management
11: Site and Server Diagnostics
12: Investigating Troublemakers
Appendix A: Making Reports More Usable
Appendix B: Technical Details of Metric Accuracy
Copyright 2002 by Summary.Net - Updated 16.Apr.2002