A Primer on Internet Privacy

By Karen Coyle 1

Western Regional Director

Computer Professionals for Social Responsibility

Many people are convinced that computers, by their very nature, invade our privacy. While that isn't the case, it is true that computers have facilitated the gathering and processing of data about us practically from their very conception.

One of the earliest uses of computers was with the US Bureau of the Census. The census is just a huge file of data on who we are and where we live plus a smattering of details on our homes, our families and our earnings. What computers do best is compute, which means that they provide an ability to perform those calculations that aggregate and compare large amounts of data. This is exactly what the census needs since without this advancement in technology the entire decade between each census would not be long enough to produce the many important tables of data that the census makes possible.

The census was the first database to take advantage of computing technology, but others soon followed suit. By the 1970's many other government agencies had computerized their data and had developed new capabilities in making use of their databases. It was this very ability to combine and analyze data, even from different databases, that led to the Privacy Act of 1974. This act forbids government agencies that have gathered personally identifying data for their own purposes from combining these data across agencies. This stemmed from a fear that an all-encompassing mega-database could be created that would carry everything about us. The Privacy Act did not address the issue of databases owned by non-governmental agencies since at the time only the government had large databases of information about individuals.

The Internet

Today we are in a new era in computing and communications. Computers and databases are no longer just the realm of government agencies or large corporations. Over the last two decades we have seen a generalization of the computing ability along with a personalization of digital communications. Many of us have on our desktops computing power that would have formerly been the pride of a respectable research organization.

Millions of people are using the Internet as a personal and institutional communications system. Initially developed and used by only a small group of people in the academic and research community, it opened to a larger public after 1990 when the first "user-friendly" finding tools were developed: first Gopher and then the World Wide Web. The Internet user base has doubled every 12-18 months since its inception and is now large enough to be considered truly a part of public life in this country.

This same system that gives us the capability to communicate instantly on a global scale also is becoming a tool for the gathering of information about us and our communications. There has been a great deal of publicity about the potential of the Internet to act as a system for invading our privacy. While there certainly are legitimate privacy concerns, some fears are based on a lack of understanding of the actual functioning of the Internet. The thesis of this essay is not that you should not be concerned about the Internet and privacy, but that it is preferable that your fears be based on the facts.

Internet Protocols

The Internet works on a complex set of rules called "protocols." These protocols define how each computer will communicate with other computers on the network for each function (e-mail, file transfer, World Wide Web, etc.). Each interaction is between computers, not between people. As a matter of fact, it isn't even relevant to the working of the Internet whether there is a human being associated with a transaction. Essentially, any computer on the Internet only knows that it has engaged in a transaction with another machine.

For the Internet to work, each computer has an address. The address is a set of numbers always written as four groups of numbers separated by periods: 128.48.104.15. Because we humans don't remember numbers well, we can create word-like names that are equivalent to numerical addresses. So an address like www.company.com is just another way to say 198.18.14.4. A request for information to a computer includes the return address of the computer requesting the information. The requested information is delivered across the Internet using this address. Essentially, this is all two computers need to know about each other's identities in order to communicate.
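To make the address format concrete, here is a minimal Python sketch using the standard ipaddress module. It shows that the four groups are a readable spelling of one 32-bit number. The NAME_TABLE dictionary and the resolve function are hypothetical stand-ins for a real name lookup, which this sketch never contacts; the name and number in it are simply the example pair used above.

```python
import ipaddress

# The dotted form is a readable spelling of one 32-bit number.
addr = ipaddress.IPv4Address("128.48.104.15")
groups = [int(part) for part in str(addr).split(".")]
as_number = int(addr)

# Hypothetical name table standing in for a real name lookup;
# the pair below is the example from the text, not a live record.
NAME_TABLE = {"www.company.com": "198.18.14.4"}


def resolve(name):
    """Return the numerical address for a name, or None if unknown."""
    return NAME_TABLE.get(name.lower())


print(groups)
print(as_number == (128 << 24) + (48 << 16) + (104 << 8) + 15)
print(resolve("www.company.com"))
```

Nothing in either form of the address says anything about who is sitting at the machine.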

Transaction Logging

One of the potentials for invasion of privacy is in Internet system transaction logging. Most computers on the Internet keep some logs of transactions they have recently performed. Basic system logs contain the addresses of requesting computers and the request command along with the date and time. These logs can be used to provide statistics on the use of that computer's services, to track peak usage times, and to show which materials on the site are the most popular. This is information that allows system administrators to make decisions that will improve their service. Some information can be gleaned from the actual addresses of the visiting computers such as counting the visits by broad network division (.com, .mil, etc.) or by country of origin. Each system on the Internet makes its own decisions about how long it stores these logs and how they are used. Some systems keep logs for only a few days or a week, long enough to gather some statistics and to use the logs in case of an unexplained system failure (it sometimes helps to know what was happening when the system went south). Other systems may store logs for a longer time and do longer term analysis on their system's activity.

Figure 1. Some Web Server Log Entries

ppp156210.asahi-net.or.jp - - [29/Mar/1998:05:45:00 -0800] "GET /~kec/best.html HTTP/1.0" 200
DIAL8-ASYNC38.DIAL.NET.NYU.EDU - - [29/Mar/1998:12:23:11 -0800] "GET /~kec HTTP/1.0" 302
gwc-mel.afgwc.af.mil - - [29/Mar/1998:14:24:09 -0800] "GET /~kec/ HTTP/1.0" 304 -

Note that these logs do not include information about people, only machines, and the Internet address of a particular computer does not necessarily identify an individual person. With most online services today, users are temporarily assigned an address each time they log on and the address is recycled to a new user when it is freed up. So it may be possible for an Internet site to know that it had an information request from an AOL or a Netcom computer, but there's no immediate way to know who the human user was behind that address. With varying degrees of effort, depending on the particulars of the interaction, it is possible to use these logs in combination with other system information to identify the person behind a transaction, but this is generally reserved for those occasions when it is necessary to trace criminal activity such as computer intrusion.
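The kind of statistics a system administrator might draw from such logs can be sketched in a few lines of Python. This is only an illustration: the three entries are those of Figure 1, while the regular expression and the field names (host, when, path, status) are my own assumptions about how to split the lines, not the definition of any particular server's log format.

```python
import re
from collections import Counter

# The three entries from Figure 1.
LOG_LINES = [
    'ppp156210.asahi-net.or.jp - - [29/Mar/1998:05:45:00 -0800] "GET /~kec/best.html HTTP/1.0" 200',
    'DIAL8-ASYNC38.DIAL.NET.NYU.EDU - - [29/Mar/1998:12:23:11 -0800] "GET /~kec HTTP/1.0" 302',
    'gwc-mel.afgwc.af.mil - - [29/Mar/1998:14:24:09 -0800] "GET /~kec/ HTTP/1.0" 304 -',
]

# Assumed layout: host, two unused fields, [date], "request", status code.
ENTRY = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" (?P<status>\d{3})'
)


def parse(line):
    """Split one log entry into named fields, or return None if it does not match."""
    match = ENTRY.match(line)
    return match.groupdict() if match else None


entries = [e for e in (parse(line) for line in LOG_LINES) if e]

# Count visits by the last part of the host name (the broad network division).
by_division = Counter(e["host"].rsplit(".", 1)[-1].lower() for e in entries)
# Count requests per page to see which materials are most popular.
by_page = Counter(e["path"] for e in entries)

print(by_division)
print(by_page)
```

Every field the sketch extracts describes a machine or a request; none of them names a person.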

Clickstream and Privacy

You may have read that the gathering of "clickstream" data is a privacy issue. Clickstream refers to the information that I have described in Internet logs. Logging programs can trace what page you visited immediately prior to your request to their site and where you go next; that is, what link you click on to exit their site. In this way they can log used paths and roads not taken. They can also know whenever a banner ad is clicked on and from what page, something that advertisers are very interested in for their own planning. While clickstream information may be useful to site administrators, it rarely can be linked back to personally identifying data. The objection most users have to the gathering of clickstream data is the general unease that we feel at the idea that our every move is being watched or recorded.
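As a rough illustration of what reconstructing a clickstream involves, the Python sketch below groups requests by the address of the requesting computer. The records are invented for this example, and the three-part layout (requesting address, page visited immediately before, page requested) is an assumption made for simplicity rather than the format of any real logging program.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration, in time order:
# (requesting address, page visited immediately before, page requested).
RECORDS = [
    ("198.18.14.4", "http://search.example/results", "/index.html"),
    ("198.18.14.4", "/index.html", "/sports.html"),
    ("198.18.0.7", "", "/index.html"),
    ("198.18.14.4", "/sports.html", "/autos.html"),
]


def paths_by_address(records):
    """Group requested pages by requesting address, keeping request order."""
    paths = defaultdict(list)
    for address, _previous, page in records:
        paths[address].append(page)
    return dict(paths)


def arrivals_from_outside(records):
    """List the outside pages that led visitors to the site."""
    return [prev for _addr, prev, _page in records if prev.startswith("http")]


print(paths_by_address(RECORDS))
print(arrivals_from_outside(RECORDS))
```

The result is a path per machine address, which is why such data can rarely be tied to a particular person.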

Cookies

Cookies are another topic that comes up frequently in the discussion of the Internet and privacy. The origin of cookies had really nothing to do with gathering personal data. Cookies were a fix to the "statelessness" of the World Wide Web's underlying functioning.

What do we mean by stateless? Well, part of the efficiency of the Web's basic protocols is that each interaction between two computers is a single, separate activity. For example, when you visit a web site, you may start at the home page. Your computer sends a request to the computer where that home page resides and asks to have it sent. The other computer sends that page, and then the interaction between those two computers ends: the transaction is completed. When you move on to another page on that same machine by clicking on a link on that home page, this is an entirely separate transaction, unrelated to the first, at least as far as the two computers are concerned. The efficiency of this approach is that the two computers are freed up to do other transactions at computer speeds while humans read pages and decide what they want to do next.

Statelessness works fine until you need to carry some piece of information along from one interaction to another. Let's say that the home page you visit has a menu with five choices and the webmaster wants to change the banner you see depending on which choices you've already selected. This requires some way to carry the information about your selections through a series of "stateless" transactions. You first choose number one and from that page you select number three. But the transaction that got you number three is separate from the transaction that got you number one; the sending computer does not know that these two requests are experienced as a single session by the human sitting at the machine. This can be solved with a cookie.

A cookie is a small file that is stored on your computer. That file is written by the remote computer that is sending you information across the World Wide Web. In the case of our Web site with 5 choices, the webmaster has arranged that a cookie will be written that keeps track of what selections you make during that visit to the site. As you request each new page, a program on the remote computer reads the cookie to determine what pages you've already visited, and writes information to the cookie about the new page you've requested.
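The menu example can be sketched in Python with the standard http.cookies module. This is a simplified model, not how any particular Web server is written: handle_request stands in for the program on the remote computer, the cookie name "visited" and the use of a period to separate selections are assumptions of mine, and the "network" is just a string passed back and forth.

```python
from http.cookies import SimpleCookie


def handle_request(choice, cookie_header=""):
    """Stateless handler: all it knows about earlier requests comes from the cookie."""
    jar = SimpleCookie()
    jar.load(cookie_header)
    # Read back the selections recorded so far, if a cookie was sent at all.
    visited = jar["visited"].value.split(".") if "visited" in jar else []
    if choice not in visited:
        visited.append(choice)
    # Write the updated list into the cookie that goes back to the browser.
    reply = SimpleCookie()
    reply["visited"] = ".".join(visited)
    return visited, reply["visited"].OutputString()


# The browser's side: keep whatever cookie came back and send it next time.
cookie = ""
visited, cookie = handle_request("1", cookie)
visited, cookie = handle_request("3", cookie)
print(visited, cookie)

# Without the cookie, the second request looks like a first visit.
print(handle_request("3"))
```

The handler keeps nothing between calls; the only memory of the first selection is the text the browser hands back.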

Figure 2. A Cookie File

.microsoft.com	TRUE	/	FALSE	937422000	MC1	GUID=CA2E2FD0807B11D19D3F0000F84121EB
.msn.com	TRUE	/	FALSE	937422000	MC1	GUID=a059c54db43211d18b2208002bb74f3f
.imgis.com	TRUE	/	FALSE	1046789230	JEB2	-1240408683|
home.netscape.com	FALSE	/	FALSE	942189160	NGUserID	cdda9c49-22688-892745369-1
.netscape.com	TRUE	/	FALSE	946684799	NETSCAPE_ID	10010408,127cd357
.doubleclick.net	TRUE	/	FALSE	1920499140	id	26312e6b
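The lines in Figure 2 follow the tab-separated layout used by Netscape browsers, in which each line carries seven fields: the domain, whether all machines in that domain may read the cookie, the path, whether a secure connection is required, the expiration time in seconds since January 1, 1970, the cookie's name, and its value. The Python sketch below, offered as an illustration under that assumption, splits two of the entries into those fields; the Cookie class and its field names are my own labels.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Cookie(NamedTuple):
    domain: str
    include_subdomains: bool
    path: str
    secure_only: bool
    expires: datetime
    name: str
    value: str


def parse_cookie_line(line):
    """Parse one tab-separated line of a Netscape-style cookie file."""
    domain, subdomains, path, secure, expires, name, value = line.split("\t")
    return Cookie(
        domain,
        subdomains == "TRUE",
        path,
        secure == "TRUE",
        datetime.fromtimestamp(int(expires), tz=timezone.utc),
        name,
        value,
    )


# Two entries from Figure 2, with the tabs written out as \t.
LINES = [
    ".microsoft.com\tTRUE\t/\tFALSE\t937422000\tMC1\tGUID=CA2E2FD0807B11D19D3F0000F84121EB",
    ".doubleclick.net\tTRUE\t/\tFALSE\t1920499140\tid\t26312e6b",
]

for cookie in map(parse_cookie_line, LINES):
    print(cookie.domain, cookie.name, cookie.value, cookie.expires.date())
```

The layout of each line can be decoded this way, but the value itself is an opaque identifier that means something only to the site that wrote it.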

Because the cookie is a file on your computer, it will be there the next time you visit that Web site. In this way, selections that you have made or forms that you have filled in can be carried over from one visit to another. Note again that the cookie is attached to your computer, not to you. If you move between computers, such as between work and home, your cookies don't follow you from one to the other. And if someone else uses your computer at work or if you share one, the systems creating and reading the cookies are unaware and unconcerned about this change in human actors and will blithely present this new user with the preferences most recently stored on the machine's hard drive. Cookies work best on a computer that is used consistently by one person.

There is, however, a "dark side" to cookies. Some people generally feel uncomfortable that a remote machine is able to write files on their personal computer. Their discomfort might be eased if they could themselves read those files and know what is in them, but unfortunately cookies are generally quite cryptic and not interpretable by humans.

You can set your Web browser to notify you each time a site attempts to write a cookie and you can either allow the cookie to be written or not. But you cannot know the content of the cookie that is being written. 2

Most people's biggest fear relating to cookies is that if they purchase something over the Web their credit card information will be stored as a cookie and will therefore be available to any site they visit on the Net. While this is unlikely to happen, for various reasons that I will soon explain, it is not entirely impossible.

You should have only one cookie file on your computer and it contains the cookies from each site that has written one. Cookies are a protocol, or set of rules, in the same way that all other functions on the Web are. The rules for cookies say that each WWW site can only read cookies that it has created and this is enforced based on the Internet name of the machine writing and reading the cookie. So even if a site should store sensitive information about you, only that site would be able to read it back at another time.
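The name-based rule can be illustrated with a short Python sketch. It is a deliberate simplification of what browsers do (paths, secure connections and expiration are ignored), and the function names site_may_read and readable_by are hypothetical. The stored domains are those shown in Figure 2.

```python
def site_may_read(cookie_domain, requesting_host):
    """Simplified check: may this host read a cookie stored for this domain?"""
    domain = cookie_domain.lower()
    host = requesting_host.lower()
    if domain.startswith("."):
        # A leading period means any machine within that domain qualifies.
        return host == domain[1:] or host.endswith(domain)
    # Otherwise only the exact machine name qualifies.
    return host == domain


# Domains of the cookies stored in the file shown in Figure 2.
STORED = [
    ".microsoft.com",
    ".msn.com",
    ".imgis.com",
    "home.netscape.com",
    ".netscape.com",
    ".doubleclick.net",
]


def readable_by(host):
    """Return the stored cookie domains that a given host would be sent."""
    return [d for d in STORED if site_may_read(d, host)]


print(readable_by("www.microsoft.com"))
print(readable_by("home.netscape.com"))
print(readable_by("www.company.com"))
```

A site whose name matches none of the stored domains, like www.company.com here, is sent nothing at all.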

There is nothing to keep Internet sites from sharing their cookie information after they have read it from your machine. There is at least one service that will set cookies for Web sites so that multiple Web sites can benefit from the cookie information that is gathered and analyzed across those sites. In this scenario, information that you provide, such as a credit card number, could theoretically be passed on to other companies although legitimate companies would have no use for that information. They are more interested in learning about the surfing habits of what they see as their customer base. Cookies have become a way to overcome the paucity of marketing information that the standard Internet logs provide. If you are an advertiser trying to analyze which Internet sites would reach your target audience, a simple list of Internet addresses is not useful. Cookies can be used by an Internet site to keep track of return visits and to build limited profiles of users and their activities on the host system (e.g. users who read the sports news are more likely to also visit the automobile classifieds). These profiles, crude as they are, begin to look like the demographic information that magazines and newspapers use when wooing their advertising accounts.

Cookies can also be used to target advertising, or to "personalize" it. A site can use its cookie information to make sure that visitors don't see the same advertisements over and over, and can correlate ads to information that it can glean about reader tastes and interests. Rather than providing a service to the person browsing, these cookies are providing a service to the advertisers and site owners, although in the literature of marketing there is much made of the benefit to the customer.

Free Registration, or the Information Barter

Between logs and cookies, site owners can know everything that was viewed on their site, exactly when, down to the tenths of a second, and can even know if this is a return visit by one or more people using the same machine. It seems like a lot of information, but it doesn't answer the kinds of questions that need to be answered for the many sites that depend on advertising for their revenue: mainly, who the readers are and what their demographic characteristics are. To properly target advertising, marketers need to know more about the actual people they are in contact with.

In the end, Web sites resort to the same techniques of data gathering as do magazines and appliance companies: they ask you to fill out forms of information about yourself. The incentive to do so can be in the guise of entry in a contest, or it can be a benefit such as some added access to services on their Web site. Although the latter is usually presented as a "free registration," it is essentially a payment of information for access, and personal information is fast becoming the coin of the realm on the Internet.

The information that is gathered in this way is much more revealing in most cases than anything that can be gleaned just from the Internet interaction. Some sites ask for only a name and an e-mail address, but many ask for much more: a street address, demographic information like sex, age and level of education. Others require household data like ages of household members and the family income. Each time you visit the site your actions there can be correlated with this demographic information.

Many Internet users are quite happy to participate in this system of information barter. It's not unlike our agreement to barter our attention to advertisements for the television and radio programming that we receive "for free." It can be seen as problematic, however, when the site being accessed is one of ideas rather than mere consumerism, such as in the case of an online newspaper or magazine. By logging on with an ID that is linked to personally identifiable information, it is possible that a profile can be built based on one's readings. This is analogous to the patron circulation files that libraries protect as confidential. A list of which newspaper articles one person has read over the last year can reveal more than just their potential as consumers; it can reveal social, political and other personal concerns.

The real question is whether those with privacy concerns can opt out of this particular barter economy and still have access to the full value of the Internet. Currently this seems to be the case, but that may change in the future.

Privacy as a Protocol

Many of today's Internet users are unaware that commercial activity (advertising, sales, etc.) was only allowed on the Internet for the first time in 1994. Prior to that time a portion of the basic Internet structure was still funded by the National Science Foundation and this public funding meant that the system could not be used for commercial gain. The basic technology of the Internet was developed in a time when online commercial transactions were not allowed and certain functions, like those that support online purchasing, were not developed.

Today, however, there is a strong interest in facilitating online purchasing and other interactions where an exchange of information about the consumer is required. Citing the inefficiency for Web users of having to key in this information separately at each site that requires it (e.g. name, address and credit card number for online purchases), some Internet companies designed a new protocol that would allow users to key this information only once into their Web browser and to exchange it securely with requesting sites as needed. They named this protocol the Open Profiling Standard, or OPS.

OPS was incorporated into work being done at the World Wide Web Consortium (W3C). W3C is a membership organization that is the standards body for the World Wide Web. The resulting standard, which has not yet been implemented, is called the "Platform for Privacy Preferences" and is given the acronym P3P since the acronym "PPP" was already in wide use for an unrelated protocol. In the P3P scenario, there is a set of rules for the exchange of information that includes privacy provisions. For example, a user can state that she is willing to give full identifying and demographic information to a Web site that will not exchange or sell that information to any third parties but that she is only willing to give demographic information (with no personally identifying information attached) to a site that will be passing that information along to others. In this way, her browser and the Web site negotiate in the background for the level of information exchange without taking the user's time. P3P can also facilitate purchases, since the user can fill in full payment information and authorize its use at the time of purchase.
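The preference in that example amounts to a simple rule, sketched below in Python. This is purely an illustration of the decision the browser would make on the user's behalf: the field names, the two categories and the function fields_to_release are hypothetical and are not taken from the P3P vocabulary or from any implementation.

```python
# Hypothetical field names, for illustration only; not P3P's vocabulary.
IDENTIFYING = {"name", "street_address", "email"}
DEMOGRAPHIC = {"age", "sex", "education"}


def fields_to_release(profile, site_shares_with_third_parties):
    """Apply the example preference from the text to a stored profile."""
    if site_shares_with_third_parties:
        # Sites that pass data along get demographics only.
        allowed = DEMOGRAPHIC
    else:
        # Sites that keep data to themselves get the full profile.
        allowed = IDENTIFYING | DEMOGRAPHIC
    return {key: value for key, value in profile.items() if key in allowed}


# An invented profile, keyed in once by the user.
profile = {
    "name": "A. Reader",
    "email": "reader@example.org",
    "age": 34,
    "education": "college",
}

print(fields_to_release(profile, site_shares_with_third_parties=True))
print(fields_to_release(profile, site_shares_with_third_parties=False))
```

Whatever rule is in place when the user first opens the browser is the one that runs silently in the background.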

There are a number of potential problems with a protocol like P3P, not the least of which is that many users may be unable to comprehend its options and therefore may not be good managers of the functions that it offers. Studies show that a huge number of computer users never change the defaults on their programs or their Web browsers; in the case of a function like P3P, how the defaults are set could have a great deal of influence on the degree of privacy that the average user has on the Web. 3

The other possible problem is that P3P facilitates the exchange of personal information for access, therefore it feeds the trend of using data about ourselves as payment for online resources. In marketing literature this is presented as a very positive scenario and there are some compelling arguments in its favor. Marketers cite the ability to provide more personalized service to customers as well as promising an environment where advertising will be so intelligently targeted that you will never see ads for products in which you have no interest. Not seeing advertising at all, of course, is not one of the options.

For librarians, with their policy of patron confidentiality, this exchange of personal data for access to information violates the principle of freedom to read. As users of online information sources become aware that their selections are being monitored and recorded, they will feel less free to explore those areas that have some element of danger or social stigma attached. This might keep people from visiting sites that are sexually explicit or that promote the use of illegal drugs, but it also might discourage access to information on safe sex or addiction treatment. Few of us would feel free visiting alternative health treatment sites if we thought our medical insurers could obtain our Web surfing records, 4 and job hunters would not want their current employer to be monitoring their visits to other companies' sites. 5

Privacy: A Choice, Not a Technology

It is not the case that computers invade our privacy. As the above examples show, gathering information about computer users is a conscious decision of system and software developers and takes effort. There is nothing inherent or natural about the use of this technology to facilitate an exchange of personal data for access to online resources. We could just as easily decide to use the same systems and the same computers to protect the privacy of readers and information seekers. In some sense, this is also being done on the Internet today, where some companies make clear their respect for the privacy of their customers and gather data on visits to their site without seeking to identify the individuals who visit.

Education is the key to the future of our freedom to read. As the technologies we use to access information grow more complex it is harder for people to understand and take responsibility for their own privacy, even when they value that privacy greatly. Those of us who are in a position to teach friends, colleagues and members of the public about information access should take every opportunity to inform them of the privacy implications of this technology and help them make informed decisions about the circumstances in which they reveal information about themselves.


Footnotes

1 The author is on the staff of the California Digital Library at the University of California, although she speaks and writes on the issue of privacy mainly in her capacity as a concerned volunteer with CPSR. CPSR can be reached at http://www.cpsr.org. The author maintains a Web site of her writings at http://www.kcoyle.net.

2 There are a number of ways to eliminate or thwart cookies, such as programs that will automatically delete cookies from your hard drive. Each WWW browser has an option that you can set to refuse all cookies, or to notify you for each cookie and ask whether you want to accept it. Most users find the interruption of "Do you accept this cookie?" annoying since cookies are used abundantly on the Web. And as I've stated above, there's no way to know exactly what information the cookie is trying to write to your hard drive so the question itself is nearly meaningless. An easier solution for PC users is to find the file called cookies.txt on their hard drive, empty it and resave it, then set the file's attributes to "read only." Cookies will be accepted by the browser and used during a single browser session, but when the browser program is closed the cookies simply will not be written to the hard drive.

3 Note that although users can opt to disallow the writing of cookies through a browser option, the default in both Internet Explorer and Netscape is that cookies are allowed and the user is not notified when cookies are being written. This means that many users are totally unaware of the existence of cookies even though they are being written to their hard drive.

4 On a health-related Usenet site that I frequent, a woman posted a message criticizing her doctor and her health plan. (Such statements are very common.) It happens that she named her health plan in her message, and a short time later she received an email from the plan's customer relations department inviting her to contact them if she wished. There are companies whose business it is to scan the Internet for mentions of company names and to bring this to the attention of the companies - not unlike newspaper clipping services that do the same thing. There is a difference, however: many people use the Internet not for formal publication but for personal communication, and it is easy to forget that posting to a Usenet group is a very public act.

5 The issue of privacy in the workplace is complex, but suffice it to say that employees using the Internet at work should not expect to have privacy in their online communications. There is a growing market in software that employers can use that will log all sites that employees visit from their desktop computer and that can also block access to non-work related sites that employers feel will be too tempting (essentially the 3 S's: sex, sports and shopping).


©Copyright Karen Coyle, 1998