Posted by Danny Dover
Update: You can now download the complete list of Google User Data by clicking here.
Google Inc. is first and foremost a data company. In the past, it competed on a level playing field by manipulating publicly available data better than its competition. By doing this, it had unprecedented success.
Enter Web 2.0. Hard drives, processors, bandwidth, and even workers are now all relatively inexpensive. This has drastically lowered the barriers to entry in the search field. As Google’s competition has started to catch up (MSN Image Search) and new competitors arise (Cuill), the search engine is looking for some kind of advantage. Since everyone has reasonably equal access to the internet’s content, leaders have been striving to gain access to private data. The most cost-effective way for the engines to do this is to collect data from the users who already use their services. Google has been increasingly serving its users by using their personal data to manipulate public data in individualized ways. These methods are impossible to copy without the necessary personal data.
The Methods Google Uses to Get Data
Click Tracking - Google logs all the navigational clicks (ads, actions, feature clicks, etc.) of all of its users on all of its services.
Forms - Along with the data the user enters directly into the forms (username, password, etc.), Google logs the time, date, and location of submission.
Code From Google Account Sign Up
1. Input type is hidden so user doesn’t see or enter data into given field
2. Location to send user after submitting (hidden)
3. Input type is hidden so user doesn’t see or enter data into given field
4. User’s referrer data is used and sent via the form so Google knows where user clicked "Sign Up" (hidden)
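To see which hidden fields a form carries, the markup can be scanned for inputs of type "hidden". The sketch below is a minimal illustration using Python's standard HTML parser. The sample form, its field names (next_url, referrer), and its URLs are hypothetical and are not copied from Google's sign-up page.

```python
from html.parser import HTMLParser

# Hypothetical sign-up form. Field names and URLs are illustrative only,
# not taken from Google's actual markup.
SAMPLE_FORM = """
<form action="/signup" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="hidden" name="next_url" value="https://example.com/welcome">
  <input type="hidden" name="referrer" value="https://example.com/landing">
</form>
"""


class HiddenFieldCollector(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr_map = dict(attrs)
        # Attributes without a value are reported as None, so guard with "or".
        if (attr_map.get("type") or "").lower() == "hidden":
            self.hidden[attr_map.get("name")] = attr_map.get("value") or ""


def find_hidden_fields(html):
    """Return a dict of hidden field names to their preset values."""
    collector = HiddenFieldCollector()
    collector.feed(html)
    return collector.hidden


if __name__ == "__main__":
    for name, value in find_hidden_fields(SAMPLE_FORM).items():
        print(f"{name} = {value}")
```

Everything listed by this scan is submitted along with the visible fields, even though the user never typed it.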
Cookies - Google uses cookies on all of its web properties. Additionally, it sets advertising (DoubleClick) cookies to track users’ movement around the web. By doing this, Google can track individual users on any page that carries either DoubleClick or AdSense ads, which includes millions of pages that are not on Google’s web properties.
Unique cookies stored on user’s computer from multiple Google web properties
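The reason an advertising cookie can follow a user across unrelated sites is that the same cookie value comes back with every ad request, whichever page shows the ad. The sketch below groups page visits by that value. The cookie name (id), the cookie values, and the page URLs are all made up for illustration, and real ad servers are certainly more involved than this.

```python
from collections import defaultdict
from http.cookies import SimpleCookie

# Hypothetical ad requests: (Cookie header sent by the browser, page showing the ad).
REQUESTS = [
    ("id=abc123", "http://news.example/story"),
    ("id=xyz789", "http://blog.example/post"),
    ("id=abc123", "http://shop.example/cart"),
]


def pages_by_visitor(requests, cookie_name="id"):
    """Group visited pages by the value of one cookie."""
    history = defaultdict(list)
    for cookie_header, page in requests:
        jar = SimpleCookie()
        jar.load(cookie_header)
        if cookie_name in jar:
            history[jar[cookie_name].value].append(page)
    return dict(history)


if __name__ == "__main__":
    for visitor, pages in pages_by_visitor(REQUESTS).items():
        print(visitor, pages)
```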
Server Requests Stored in Log Files - Every request made to any of Google’s servers (e.g. GET http://www.google.com) is stored in log files. The content stored depends on the type of request. (See ‘normal search’ below for more details.)
Example of a log file
URL - "http://www.google.com/search?hl=en&q=seomoz&ie=UTF-8"
1. IP Address from user making request. This can be used to geo-locate the user
2. Date, time, and time zone offset of user
3. Language of requested result (in this case, English)
4. Search query
5. Operating system of user
6. Browser of user
The additional information is less important but details the type of server request, the server response, and the rendering engine.
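The fields listed above can be pulled out of a log entry with a few lines of standard-library Python. This is only a sketch: the " - " separated layout, the IP address, the timestamp, and the browser string are invented for illustration, and only the URL comes from the example above.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical log line. Only the URL is taken from the post; the layout
# and every other value are assumptions made for this example.
SAMPLE_LINE = (
    "203.0.113.7 - 27/Jun/2008 07:03:00 -0700 - "
    "http://www.google.com/search?hl=en&q=seomoz&ie=UTF-8 - "
    "Firefox 3.0; Windows NT 5.1"
)


def parse_log_line(line):
    """Split one log line into the fields described in the list above."""
    ip, timestamp, url, agent = line.split(" - ")
    query = parse_qs(urlsplit(url).query)
    browser, operating_system = [part.strip() for part in agent.split(";")]
    return {
        "ip": ip,                                  # 1. can be geo-located
        "timestamp": timestamp,                    # 2. date, time, offset
        "language": query.get("hl", [""])[0],      # 3. hl=en
        "search_query": query.get("q", [""])[0],   # 4. q=seomoz
        "operating_system": operating_system,      # 5.
        "browser": browser,                        # 6.
    }


if __name__ == "__main__":
    for field, value in parse_log_line(SAMPLE_LINE).items():
        print(f"{field}: {value}")
```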
JavaScript - Google has small amounts of JavaScript embedded in websites all over the internet. When a user’s browser executes the script in the background, Google is able to learn a lot about that person’s browsing setup and habits (location, operating system, browser type and version, etc.).
Web Beacons - Google embeds small (1 pixel by 1 pixel) transparent .gifs into many of its checkout screens. Just as with the JavaScript, when a user’s browser downloads the invisible image, it sends information about the user’s computer to Google.
Example of a Web Beacon (What, you can’t see it? That is the point.)
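For a sense of how small such an image is, the sketch below decodes a widely circulated 1 x 1 transparent GIF and reads its header. The base64 string is an assumption chosen for illustration and is not a file taken from any Google page.

```python
import base64

# A commonly used 1x1 transparent GIF, base64-encoded. Illustrative only,
# not extracted from a Google checkout screen.
PIXEL_B64 = "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"


def describe_gif(data):
    """Read the signature and logical screen size from a GIF header."""
    signature = data[:6].decode("ascii")
    width = int.from_bytes(data[6:8], "little")
    height = int.from_bytes(data[8:10], "little")
    return signature, width, height


pixel = base64.b64decode(PIXEL_B64)

if __name__ == "__main__":
    print(describe_gif(pixel), len(pixel), "bytes")
```

The image itself carries nothing of interest. What matters is the request the browser makes to fetch it, which arrives with the user's IP address, browser details, and any cookies set for the serving domain.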
Understanding What Google Does with the Data
Store - Google uses an internal database called BigTable spread over approximately one million servers.
Google Data In 2006
| Data | Size (TB) |
|---|---|
| Crawl Index | 800 |
| Google Analytics | 200 |
| Google Base | 2 |
| Google Earth | 70 |
| Orkut | 9 |
| Personalized Search | 4 |
(Source: Bigtable: A Distributed Storage System for Structured Data)
These are the sizes of the compressed data in terabytes (1 TB = 1,024 GB). The listed projects add up to 1,085 TB, which puts Google’s disclosed data size at over 1 petabyte (1,048,576 GB).
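The petabyte claim can be checked by adding up the figures in the table. The short script below does that arithmetic with binary units (1 TB = 1,024 GB, 1 PB = 1,024 TB). The variable names are only for this example.

```python
# Compressed sizes in terabytes, as listed in the table above.
SIZES_TB = {
    "Crawl Index": 800,
    "Google Analytics": 200,
    "Google Base": 2,
    "Google Earth": 70,
    "Orkut": 9,
    "Personalized Search": 4,
}

GB_PER_TB = 1024
TB_PER_PB = 1024

total_tb = sum(SIZES_TB.values())   # 1,085 TB
total_gb = total_tb * GB_PER_TB     # 1,111,040 GB
total_pb = total_tb / TB_PER_PB     # a little over 1 PB

if __name__ == "__main__":
    print(f"{total_tb:,} TB = {total_gb:,} GB = {total_pb:.2f} PB")
```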
GREAT GOOGLEY MOOGLEY! This doesn’t even consider AdSense, Gmail, Google Maps, Street View, Google Images, or other private databases. This would be considered a lot of data even now, and these stats are from over two years ago, before the Web 2.0 Data Rush.
Massive Data Analysis - This is a little like Charlie and the Chocolate Factory. We know that a lot of data goes into Google, and we know a lot of useful manipulated data comes out. We just don’t know what happens in between.
Oompa Loompas working hard at Google writing pretty primary-colored code.
We know that Google has many algorithms to sort and organize its data. PageRank is the most well known. It is also known that Google has many complicated spam filters, duplicate content filters, pattern detection algorithms, natural language interpreters, image recognition software, and loads of other complicated software.
Permanent Backup - The final resting place for data at Google is likely in permanent storage. Google’s privacy policies hint that some user data can never be completely deleted because of permanent backups.
Understanding What Specific User Data Google Collects
Below is a list of every self-declared piece of data that Google collects when a user interacts with its many web services. Because the list covers only what Google has disclosed, there is likely even more user data gathered by Google that is unknown to the public. Be forewarned, ignorance is bliss. After you read this you may feel inclined to wear a tinfoil hat.
The Comprehensive List of All the Data Google Admits to Collecting from Users
Source: SEOmoz Daily SEO Blog. Publish Date: 6/27/2008 7:03:00 AM