Our goal is a free, fair, and open web!
What is the open web?
The open web is based on the concept that websites participate fairly with one another and allow data to be published and accessed without being blocked by a walled garden of technical or legal obstacles.
Without the open web, a start-up like Google, which gathers content across the whole web, indexes it, and supports search results, would never have been able to get off the ground.
Much of the user-generated data on the web is published within the public domain, or is available under fair use or a Creative Commons license.
Problems arise when publicly viewable data is selectively made accessible to some while others are denied access on a discriminatory basis.
We believe that when data is published publicly, for free, there is an implied license that permits crawlers to access this content in a fair manner.
Why should I care about the open web?
An open web is critical to the future of the Internet and innovation. Data is the air that web startups breathe, and without it innovation on the web will be stifled.
Society depends on this innovation. The entire social media analytics industry, a billion-dollar industry, relies completely on open and fair access to web data.
Users should have rights around their own data. They have the right to export this data, the right to share it with other companies, and the right to benefit when new companies build innovative products around their data.
There are direct and indirect benefits to society that come from open access to web data such as wider and more egalitarian distribution of information, data-driven decision tools, and a more thriving economy. All these rely on open and fair access to web data.
What are some common uses and examples of open data?
Fetching bulk content and building derivative applications.
This involves a third party fetching large amounts of content from the Internet and building new and derivative applications. Examples include Blekko, Topsy, and Summize (which was acquired by Twitter in 2008).
Clients that can post new data to blogs and social networks.
This would include Twitter clients that allow you to easily post new tweets or images, weblog clients that allow you to post, and photo applications like Instagram.
Synchronizing private data between social networks and exporting data for backup purposes.
This would involve applications which cross post between sites, export data from one site and upload it to another site, or simply store the data for offline backup.
In the past, these types of applications have been aggressively blocked by hosting providers.
What is non-discriminatory access?
A major component of a fair access policy is non-discriminatory access. It's paramount that a hosting provider treat each party in an identical manner. The access rules must remain consistent if access is to be judged fair and reasonable.
We have seen web publishers create APIs to build an ecosystem of developers around their platform of public data, only to arbitrarily terminate access for particular parties that become too successful in adjacent marketplaces that the publisher wants to protect. Manipulating access to public data in this way thwarts innovation and competition, and is inherently detrimental to both users and society as a whole.
This has a chilling effect: companies are discouraged from investing significantly in an API because they may be cut off at any time.
This is very similar to the way patents are licensed within open standards using FRAND (Fair, Reasonable, and Non-Discriminatory licensing):
Non-discriminatory relates to both the terms and the rates included in licensing agreements. As the name suggests this commitment requires that licensors treat each individual licensee in a similar manner. This does not mean that the rates and payment terms can't change dependent on the volume and creditworthiness of the licensee. However it does mean that the underlying licensing condition included in a licensing agreement must be the same regardless of the licensee. This obligation is included in order to maintain a level playing field with respect to existing competitors and to ensure that potential new entrants are free to enter the market on the same basis.
Discriminatory access often happens with web crawlers, where data is given to Google and Microsoft but not to anyone else.
Without non-discriminatory access, the web is not open.
Don't these websites OWN the data?
Yes and no. Very few websites own the data submitted by the user. They often request permission to use the data within their application but do not attempt to assert exclusive copyright on their user-generated content.
The copyright on the data is owned by the user, and the data is often in the public domain, under a Creative Commons license, or available for access under doctrines of fair use or implied license.
Furthermore, some particularly unfair (but legal) TOUs are written in such voluminous and dense legalese that no one can understand just what is and is not permissible regarding access to data or interoperation on a site. The TOUs can then be arbitrarily interpreted in any way the publisher so deems, ensuring that equally arbitrary termination, blocking, or suing are all chilling threats held over the heads of third parties that wish to work with the data of a particular site.
Recently, Craigslist decided to (briefly) assert exclusive copyright on user-generated content but then did the right thing and backed down from that position:
In a welcome course correction, craigslist has removed its short-lived provision that required users to grant it an exclusive license to--in other words granting them ownership of--every post. We were unhappily surprised to see this click-through demand, but are glad to see that craigslist has promptly removed it.
For many years, craigslist has been a good digital citizen. Its opposition to SOPA/PIPA was critically important, and it has been at the forefront of challenges to Section 230 and freedom of expression online. We understand that craigslist faces real challenges in trying to preserve its character and does not want third parties to simply reuse its content in ways that are out of line with its user community's expectations and could be harmful to its users.
Nevertheless, it was important for craigslist to remove the provision because claiming an exclusive license to the user's posts--to the exclusion of everyone, including the original poster--would have harmed both innovation and users' rights, and would have set a terrible precedent. We met with craigslist to discuss this recently and are pleased about their prompt action.
Don't crawlers use excessive resources when indexing a site?
No. Not fair crawlers.
We want to make it clear that we do not support or defend excessive resource utilization by misbehaving crawlers and robots. These systems cause havoc and are very costly to many websites.
However, for large sites, fetching the home page or an RSS feed of public content has a negligible impact on the site's performance. For a website receiving 1M hits per day, fetching the home page once per hour adds only 24 requests per day, increasing the load on the website by about 0.0024%.
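The arithmetic behind this estimate can be checked in a few lines. The sketch below uses the figures from the paragraph above (1M hits per day, one home page fetch per hour); the function name is made up for illustration.

```python
def added_load_percent(site_hits_per_day: int, crawler_fetches_per_day: int) -> float:
    """Return the extra load a crawler adds, as a percentage of existing traffic."""
    return 100.0 * crawler_fetches_per_day / site_hits_per_day

# Figures from the example in the text: 1M hits per day, and a crawler
# fetching the home page once per hour (24 times per day).
extra = added_load_percent(site_hits_per_day=1_000_000, crawler_fetches_per_day=24)
print(f"{extra:.4f}%")
```

Under these assumptions the added load is roughly 0.0024%, well under a hundredth of a percent. Even a crawler fetching once per minute (1,440 requests per day) would add only about 0.144%.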
Six Apart went further and provided its feed to anyone who wanted it, immediately. It also removed private posts from the stream to minimize user concerns about their private data being indexed by search engines.
Often, the resource utilization issue is a red herring used as an excuse to continue unfair access policies.
Six Apart still provides access to the same feed but has migrated to a more modern protocol: pubsub.
Don't hosting providers have the right to publish content under any terms they desire?
Yes. A hosting provider is well within its rights to publish data under any license it chooses. These terms could range from completely unfair to very fair.
Users have the right to know in plain terms about the access policies of sites they use, similar to the way that responsible websites publish clear-cut privacy policies.
Further, the OAC and the user community are free to criticize these sites and encourage users to take their business elsewhere.
We believe it's in the users' best interest to insist on using sites that have fair access. Additionally, we feel it's vital to the growth of the industry to have a free, fair, and open Internet.
What are some common examples that violate fair access?
Blocking all crawlers via robots.txt - except Google:
This happens far too often. It is a problem for new search engines because they must decide whether to yield to the robots.txt block, contact the website owner (which could involve thousands of websites), or provide a sub-standard experience to their users. Far too often, a decision to crawl in the same way that Google does becomes a contractual, legal, and even criminal violation.
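As a sketch of what this looks like in practice, the hypothetical robots.txt below admits Googlebot and turns away every other crawler. The file contents, the `ExampleBot` name, and the example.com URL are illustrative assumptions, not taken from any real site; the check uses Python's standard `urllib.robotparser` module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: an empty Disallow lets Googlebot fetch everything,
# while the wildcard entry blocks every other crawler from the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/public/page.html"
googlebot_allowed = parser.can_fetch("Googlebot", url)
newcomer_allowed = parser.can_fetch("ExampleBot", url)  # made-up crawler name

print("Googlebot allowed:", googlebot_allowed)
print("ExampleBot allowed:", newcomer_allowed)
```

A new search engine that honors this file sees none of the site, even though the same pages are fully available to Google.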
Differential pricing based on size, or hidden partnerships:
There have been situations where API licenses are given based on hidden partnerships, private investments, or the size of a company, which has prevented open use of the data.
One example is Twitter's firehose license to Google and Microsoft.
Twitter licensed its firehose to Google (for millions of dollars), but Google eventually objected to the terms and cancelled the license.
The issue seems to be that Twitter was attempting to charge an excessive price for a full license, which Google refused to pay.
There have also been situations where large social networks have harassed both individuals and small companies with large lawsuits, or threats of lawsuits, simply for accessing data in new and compelling ways, even when that access was perfectly lawful.
These lawsuits can be extremely frightening for small independent developers:
In 2010, Pete Warden was threatened with a lawsuit by Facebook for crawling content that had been posted publicly:
I scratched my head a bit and thought "well, how hard can it be to build my own search engine?". As it turned out, it was very easy. Checking Facebook's robot.txt, they welcome the web crawlers that search engines use to gather their data, so I wrote my own in PHP (very similar to this Google Profile crawler I open-sourced) and left it running for about 6 months. Initially all I wanted to gather was people's names and locations so I could search on those to find public profiles. Talking to a few other startups they also needed the same sort of service so I started looking into either exposing a search API or sharing that sort of 'phone book for the internet' information with them.
I noticed Facebook were offering some other interesting information too, like which pages people were fans of and links to a few of their friends. I was curious what sort of patterns would emerge if I analyzed these relationships, so as a side project I set up fanpageanalytics.com to allow people to explore the data. I was getting more people asking about the data I was using, so before that went live I emailed Dave Morin at Facebook to give him a heads-up and check it was all kosher. We'd chatted a little previously, but I didn't get a reply, and he left the company a month later so my email probably got lost in the chaos.
On Sunday around 25,000 people read the article, via YCombinator and Reddit. After that a whole bunch of mainstream news sites picked it up, and over 150,000 people visited it on Monday. On Tuesday I was hanging out with my friends at Gnip trying to make sense of it all when my cell phone rang. It was Facebook's attorney.
He was with the head of their security team, who I knew slightly because I'd reported several security holes to Facebook over the years. The attorney said that they were just about to sue me into oblivion, but in light of my previous good relationship with their security team, they'd give me one chance to stop the process. They asked and received a verbal assurance from me that I wouldn't publish the data, and sent me on a letter to sign confirming that. Their contention was robots.txt had no legal force and they could sue anyone for accessing their site even if they scrupulously obeyed the instructions it contained. The only legal way to access any web site with a crawler was to obtain prior written permission.
Explicitly blocking user agents:
HTTP requests include a User-Agent header, which identifies the agent requesting a URL.
For example, the Googlebot user agent is:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Few sites can afford to block Googlebot, because their web traffic (and revenue) would fall off significantly. However, smaller websites and startups often have their crawlers blocked based on User-Agent.
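To make the mechanism concrete, here is a minimal sketch of how a server might filter on this header. The function `is_blocked`, the blocked-token list, and the `ExampleStartupBot` crawler are all hypothetical and chosen only for illustration.

```python
# Hypothetical list of crawler tokens a site has decided to refuse.
BLOCKED_TOKENS = ("examplestartupbot",)

def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)

googlebot_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
# Made-up crawler that identifies itself honestly, just as Googlebot does.
startup_ua = "Mozilla/5.0 (compatible; ExampleStartupBot/1.0; +http://example.com/bot.html)"

print(is_blocked(googlebot_ua))
print(is_blocked(startup_ua))
```

In this sketch a substring match on a self-reported header is all it takes to turn away a crawler, and the crawlers affected are the ones that identify themselves honestly.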
What is rate limiting? What are IP throttles?
Rate limiting is often used by websites to ostensibly avoid excessive and expensive resource utilization.
It may in fact be used for this purpose in a number of circumstances, but by definition, once you allow anyone to bypass the throttle, you have discriminatory access.
Rate limiting is generally done on an API key, which is required to be included with each request.
An IP throttle is similar but is designed to allow a given IP address only a few requests per hour (or per some other arbitrary duration, such as per minute).
IP throttles are especially troubling as they often hurt legitimate companies but don't hurt companies willing to bypass the IP blocks through controversial measures.
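A minimal sketch of such a throttle follows, assuming a simple fixed-window counter keyed by client identifier (an IP address here, though an API key would work the same way). The class name, the limits, and the `exempt` set are hypothetical; they are only meant to show where a bypass turns a neutral limit into discriminatory access.

```python
import time

class FixedWindowThrottle:
    """Allow each client at most `limit` requests per `window_seconds`."""

    def __init__(self, limit, window_seconds, exempt=(), clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.exempt = set(exempt)  # clients allowed to bypass the throttle
        self.clock = clock         # injectable so the sketch can be tested
        self._windows = {}         # client -> (window start, request count)

    def allow(self, client):
        """Return True if this request from `client` should be served."""
        if client in self.exempt:
            return True  # the bypass: this is where access becomes discriminatory
        now = self.clock()
        start, count = self._windows.get(client, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # the old window expired, start a fresh one
        allowed = count < self.limit
        if allowed:
            count += 1
        self._windows[client] = (start, count)
        return allowed

# Documentation-range IP addresses, used purely for illustration.
throttle = FixedWindowThrottle(limit=3, window_seconds=3600, exempt={"192.0.2.10"})
small_crawler = [throttle.allow("198.51.100.7") for _ in range(5)]
favored_partner = [throttle.allow("192.0.2.10") for _ in range(5)]
print(small_crawler)
print(favored_partner)
```

With an empty `exempt` set every client faces the same limit. Adding even one entry means the rule no longer applies equally to all parties.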