Archive for the 'Intellectual Property' Category

Open Data Dead on Arrival

In 1984 Karl Popper wrote a private letter to an inquirer he didn’t know, responding to enclosed interview questions. The response was subsequently published and in it he wrote, among other things, that:

“Every intellectual has a very special responsibility. He has the privilege and opportunity of studying. In return, he owes it to his fellow men (or ‘to society’) to represent the results of his study as simply, clearly and modestly as he can. The worst thing that intellectuals can do — the cardinal sin — is to try to set themselves up as great prophets vis-a-vis their fellow men and to impress them with puzzling philosophies. Anyone who cannot speak simply and clearly should say nothing and continue to work until he can do so.”

Aside from the offensive sexism in referring to intellectuals as males, there is another way this imperative should be updated for intellectualism today. The movement to make data available online is picking up momentum — as it should — and open code is following suit (see for example). But data should not be confused with facts, and applying the simple communication that Popper refers to beyond the written or spoken word is the only way open data will produce dividends. It isn’t enough to post raw data, or undocumented code. Data and code should be considered part of intellectual communication, and made as simple as possible for “fellow men” to understand. Just as knowledge of adequate English vocabulary is assumed in the nonquantitative communication Popper refers to, certain basic coding and data knowledge can be assumed as well. This means the same thing as it does in the literary case; the elimination of extraneous information and obfuscating terminology. No need to bury interested parties in an Enron-like shower of bits. It also means using a format for digital communication that is conducive to reuse, such as a flat text file or another non-proprietary format, for example pdf files cannot be considered acceptable to either data or code. Facilitating reproducibility must be the gold standard for data and code release.

And who are these “fellow men”?

Well, fellow men and women that is, but back to the issue. Much of the history of scientific communication has dealt with the question of demarcation of the appropriate group to whom the reasoning behind the findings would be communicated, the definition of the scientific community. Clearly, communication of very technical and specialized results to a layman would take intellectuals’ time away from doing what they do best, being intellectual. On the other hand some investment in explanation is essential for establishing a finding as an accepted fact — assuring others that sufficient error has been controlled for and eliminated in the process of scientific discovery. These others ought to be able to verify results, find mistakes, and hopefully build on the results (or the gaps in the theory) and thereby further our understanding. So there is a tradeoff. Hence the establishment of the Royal Society for example as a body with the primary purpose of discussing scientific experiments and results. Couple this with Newton’s surprise, or even irritation, at having to explain results he put forth to the Society in his one and only journal publication in their journal Philosophical Transactions (he called the various clarifications tedious, and sought to withdraw from the Royal Society and subsequently never published another journal paper. See the last chapter of The Access Principle). There is a mini-revolution underfoot that has escaped the spotlight of attention on open data, open code, and open scientific literature. That is, the fact that the intent is to open to the public. Not open to peers, or appropriately vetted scientists, or selected ivory tower mates, but to anyone. Never before has the standard for communication been “everyone,” in fact quite the opposite. Efforts had traditionally been expended narrowing and selecting the community privileged enough to participate in scientific discourse.

So what does public openness mean for science?

Recall the leaked files from the University of East Anglia’s Climatic Research Unit last November. Much of the information revealed concerned scientifically suspect (and ethically dubious) attempts not to reveal data and methods underlying published results. Although that tack seems to have softened now some initial responses defended the climate scientists’ right to be closed with regard to their methods due to the possibility of “denial of service attacks” – the ripping apart of methodology (recall all science is wrong, an asymptotic progression toward to truth at best) not with the intent of finding meaningful errors that halt the acceptance of findings as facts, but merely to tie up the climate scientists so they cannot attend to real research. This is the same tradeoff as described above. An interpretation of this situation cannot be made without the complicating realization that peer review — the review process that vets articles for publication — doesn’t check computational results but largely operates as if the papers are expounding results from the pre-computational scientific age. The outcome, if computational methodologies are able to remain closed from view, is that they are directly vetted nowhere. Hardly an acceptable basis for establishing facts. My own view is that data and code must be communicated publicly with attention paid to Popper’s admonition: as simply and clearly as possible, such that the results can be replicated. Not participating in dialog with those insufficiently knowledgable to engage will become part of our scientific norms, in fact this is enshrined in the structure of our scientific societies of old. Others can take up those ends of the discussion, on blogs, in digital forums. But public openness is important not just because taxpayers have a right to what they paid for (perhaps they do, but this quickly falls apart since not all the public are technically taxpayers and that seems a wholly unjust way of deciding who shall have access to scientific knowledge and who not, clearly we mean society), but because of the increasing inclusiveness of the scientific endeavor. How do we determine who is qualified to find errors in our scientific work? We don’t. Real problems will get noticed regardless of with whom they originate, many eyes making all bugs shallow. And I expect peer review for journal publishing to incorporate computational evaluation as well.

Where does this leave all the open data?

Unused, unless efforts are expended to communicate the meaning of the data, and to maximize the usability of the code. Data is not synonymous with facts – methods for understanding data, and turning its contents into facts, are embedded within the documentation and code. Take for granted that users understand the coding language or basic scientific computing functions, but clearly and modestly explain the novel contributions. Facilitate reproducibility. Without this data may be open, but will remain de facto in the ivory tower.

Post 3: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government

The following comments were posted in response to the OSTP’s call as posted here: The first wave, comments posted here, asked for feedback on implementation issues. The second wave requested input on Features and Technology (our post is here). For the third and final wave on Management, Chris Wiggins, Matt Knepley, and I posted the following comments:

Q1: Compliance. What features does a public access policy need to ensure compliance? Should this vary across agencies?

One size does not fit all research problems across all research communities, and a heavy-handed general release requirement across agencies could result in de jure compliance – release of data and code as per the letter of the law – without the extra effort necessary to create usable data and code facilitating reproducibility (and extension) of the results. One solution to this barrier would be to require grant applicants to formulate plans for release of the code and data generated through their research proposal, if funded. This creates a natural mechanism by which grantees (and peer reviewers), who best know their own research environments and community norms, contribute complete strategies for release. This would allow federal funding agencies to gather data on needs for release (repositories, further support, etc.); understand which research problem characteristics engender which particular solutions, which solutions are most appropriate in which settings, and uncover as-yet unrecognized problems particular researchers may encounter. These data would permit federal funding agencies to craft release requirements that are more sensitive to barriers researchers face and the demands of their particular research problems, and implement strategies for enforcement of these requirements. This approach also permits researchers to address confidentiality and privacy issues associated with their research.


One exemplary precedent by a UK funding agency is the January 2007 “Policy on data management and sharing”
adopted by The Wellcome Trust ( according to which “the Trust will require that the applicants provide a data management and sharing plan as part of their application; and review these data management and sharing plans, including any costs involved in delivering them, as an integral part of the funding decision.” A comparable policy statement by US agencies would be quite useful in clarifying OSTP’s intent regarding the relationship between publicly-supported research and public access to the research products generated by this support.

Continue reading ‘Post 3: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government’

Post 2: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government

The following comments were posted in response to the second wave of the OSTP’s call as posted here: The first wave, comments posted here and on the OSTP site here (scroll to the second last comment), asked for feedback on implementation issues. The second wave requests input on Features and Technology and Chris Wiggins and I posted the following comments:

We address each of the questions for phase two of OSTP’s forum on public access in turn. The answers generally depend on the community involved and (particularly question 7, asking for a cost estimate) on the scale of implementation. Inter-agency coordination is crucial however in (i) providing a centralized repository to access agency-funded research output and (ii) encouraging and/or providing a standardized tagging vocabulary and structure (as discussed further below).

Continue reading ‘Post 2: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government’

Nathan Myhrvold advocates for Reproducible Research on CNN

On yesterday’s edition of Fareed Zakaria’s GPS on CNN former Microsoft CTO and current CEO of Intellectual Ventures Nathan Myhrvold said reproducible research is an important response for climate science in the wake of Climategate, the recent file leak from a major climate modeling center in England (I blogged my response to the leak here). The video is here, see especially 16:27, and the transcript is here.

The OSTP's call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government

The following comments were posted in response to the OSTP’s call as posted here:

Open access to our body of federally funded research, including not only published papers but also any supporting data and code, is imperative, not just for scientific progress but for the integrity of the research itself. We list below nine focus areas and recommendations for action.

Continue reading ‘The OSTP's call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government’

My Interview with ITConversations on Reproducible Research

On September 30, I was interviewed by Jon Udell from in his Interviews with Innovators series, on Reproducibility of Computational Science.

Here’s the blurb: “If you’re a writer, a musician, or an artist, you can use Creative Commons licenses to share your digital works. But how can scientists license their work for sharing? In this conversation, Victoria Stodden — a fellow with Science Commons — explains to host Jon Udell why scientific output is different and how Science Commons aims to help scientists share it freely.”

Stuart Shieber and the Future of Open Access Publishing

Back in February Harvard adopted a mandate requiring its faculty member to make their research papers available within a year of publication. Stuart Shieber is a computer science professor at Harvard and responsible for proposing the policy. He has since been named director of Harvard’s new Office for Scholarly Comminication.

On November 12 Shieber gave a talk entitled “The Future of Open Access — and How to Stop It” to give an update on where things stand after the adoption of the open access mandate. Open access isn’t just something that makes sense from an ethical standpoint, as Shieber points out that (for-profit) journal subscription costs have risen out of proportion with inflation costs and out of proportion with the costs of nonprofit journals. He notes that the cost per published page in a commercial journal is six times that of the nonprofits. With the current library budget cuts, open access — meaning both access to articles directly on the web and shifting subscriptions away from for-profit journals — is something that appears financially unavoidable.

Here’s the business model for an Open Access (OA) journal: authors pay a fee upfront in order for their paper to be published. Then the issue of the journal appears on the web (possibly also in print) without an access fee. Conversely, traditional for-profit publishing doesn’t charge the author to publish, but keeps the journal closed and charges subscription fees for access.

Shieber recaps Harvard’s policy:

1. The faculty member grants permission to the University to make the article available through an OA repository.

2. There is a waiver for articles: a faculty member can opt out of the OA mandate at his or her sole discretion. For example, if you have a prior agreement with a publisher you can abide by it.

3. The author themselves deposits the article in the repository.

Shieber notes that the policy is also because it allows Harvard to make a collective statement of principle, systematically provide metadata about articles, it clarifies the rights accruing to the article, it allows the university to facilitate the article deposit process, it allows the university to negotiate collectively, and having the mandate be opt out rather than opt in might increase rights retention at the author level.

So the concern Shieber set up in his talk is whether standards for research quality and peer review will be weakened. Here’s how the dystopian argument runs:

1. all universities enact OA policies
2. all articles become OA
3. libraries cancel subscriptions
4. prices go up on remaining journals
5. these remaining journals can’t recoup their costs
6. publishers can’t adapt their business model
7. so the journals and the logistics of peer review they provide, disappear

Shieber counters this argument: 1 through 5 are good because journals will start to feel some competitive pressure. What would be bad is if publishers cannot change their way of doing business. Shieber thinks that even if this is so it will have the effect of pushing us towards OA journals, which provide the same services, including peer review, as the traditional commercial journals.

But does the process of getting there cause a race to the bottom? The argument goes like this: since OA journals are paid by the number of articles published they will just publish everything, thereby destroying standards. Shieber argues this won’t happen because there is price discrimination among journals – authors will pay more to publish in the more prestigious journals. For example, PLOS costs about $3k, Biomed Central about $1000, and Scientific Publishers International is $96 for an article. Shieber also makes an argument that Harvard should have a fund to support faculty who wish to publish in an OA journal and have no other way to pay the fee.

This seems to imply that researchers with sufficient grant funding or falling under his proposed Harvard publication fee subsidy, would then be immune to the fee pressure and simply submit to the most prestigious journal and work their way down the chain until their paper is accepted. This also means that editors/reviewers decide what constitutes the best scientific articles by determining acceptance.

But is democratic representation in science a goal of OA? Missing from Shieber’s described market for scientific publications is any kind of feedback from the readers. The content of these journals, and the determination of prestige, is defined solely by the editors and reviewers. Maybe this is a good thing. But maybe there’s an opportunity to open this by allowing readers a voice in the market. This could done through ads or a very tiny fee on articles – both would give OA publishers an incentive to respond to the preferences of the readers. Perhaps OA journals should be commercial in the sense of profit-maximizing: they might have a reason to listen to readers and might be more effective at maximizing their prestige level.

This vision of OA publishing still effectively excludes researchers who are unable to secure grants or are not affiliated with a university that offers a publication subsidy. The dream behind OA publishing is that everyone can read the articles, but to fully engage in the intellectual debate quality research must still find its way into print, and at the appropriate level of prestige, regardless of the affiliation of the researcher. This is the other side of OA that is very important for researchers from the developing world or thinkers whose research is not mainstream (see, for example, Garrett Lisi a high impact researcher who is unaffiliated with an institution).

The OA publishing model Shieber describes is a clear step forward from the current model where journals are only accessible by affiliates of universities who have paid the subscription fees. It might be worth continuing to move toward an OA system where, not only can anyone access publications, but any quality research is capable of being published, regardless of the author’s affiliation and wealth. To get around the financial constraints one approach might be to allow journals to fund themselves through ads, or provide subsidies to certain researchers. This also opens up the idea of who decides what is quality research.