Archive for the 'Law' Category

Post 3: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government

The following comments were posted in response to the OSTP’s call as posted here: The first wave, comments posted here, asked for feedback on implementation issues. The second wave requested input on Features and Technology (our post is here). For the third and final wave on Management, Chris Wiggins, Matt Knepley, and I posted the following comments:

Q1: Compliance. What features does a public access policy need to ensure compliance? Should this vary across agencies?

One size does not fit all research problems across all research communities, and a heavy-handed general release requirement across agencies could result in de jure compliance – release of data and code as per the letter of the law – without the extra effort necessary to create usable data and code facilitating reproducibility (and extension) of the results. One solution to this barrier would be to require grant applicants to formulate plans for release of the code and data generated through their research proposal, if funded. This creates a natural mechanism by which grantees (and peer reviewers), who best know their own research environments and community norms, contribute complete strategies for release. This would allow federal funding agencies to gather data on needs for release (repositories, further support, etc.); understand which research problem characteristics engender which particular solutions, which solutions are most appropriate in which settings, and uncover as-yet unrecognized problems particular researchers may encounter. These data would permit federal funding agencies to craft release requirements that are more sensitive to barriers researchers face and the demands of their particular research problems, and implement strategies for enforcement of these requirements. This approach also permits researchers to address confidentiality and privacy issues associated with their research.


One exemplary precedent by a UK funding agency is the January 2007 “Policy on data management and sharing”
adopted by The Wellcome Trust ( according to which “the Trust will require that the applicants provide a data management and sharing plan as part of their application; and review these data management and sharing plans, including any costs involved in delivering them, as an integral part of the funding decision.” A comparable policy statement by US agencies would be quite useful in clarifying OSTP’s intent regarding the relationship between publicly-supported research and public access to the research products generated by this support.

Continue reading ‘Post 3: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government’

Post 2: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government

The following comments were posted in response to the second wave of the OSTP’s call as posted here: The first wave, comments posted here and on the OSTP site here (scroll to the second last comment), asked for feedback on implementation issues. The second wave requests input on Features and Technology and Chris Wiggins and I posted the following comments:

We address each of the questions for phase two of OSTP’s forum on public access in turn. The answers generally depend on the community involved and (particularly question 7, asking for a cost estimate) on the scale of implementation. Inter-agency coordination is crucial however in (i) providing a centralized repository to access agency-funded research output and (ii) encouraging and/or providing a standardized tagging vocabulary and structure (as discussed further below).

Continue reading ‘Post 2: The OSTP’s call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government’

The OSTP's call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government

The following comments were posted in response to the OSTP’s call as posted here:

Open access to our body of federally funded research, including not only published papers but also any supporting data and code, is imperative, not just for scientific progress but for the integrity of the research itself. We list below nine focus areas and recommendations for action.

Continue reading ‘The OSTP's call for comments regarding Public Access Policies for Science and Technology Funding Agencies Across the Federal Government’

The Climate Modeling Leak: Code and Data Generating Published Results Must be Open and Facilitate Reproducibility

On November 20 documents including email and code spanning more than a decade were leaked from the Computing Climatic Research Unit (CRU) at East Anglia University in the UK.

The Leak Reveals a Failure of Reproducibility of Computational Results

It appears as though the leak came about through a long battle to get the CRU scientists to reveal the code and data associated with published results, and highlights a crack in the scientific method as practiced in computational science. Publishing standards have not yet adapted to the relatively new computational methods used pervasively across scientific research today.

Other branches of science have long-established methods to bring reproducibility into their practice. Deductive or mathematical results are published only with proofs, and there are long established standards for an acceptable proof. Empirical science contains clear mechanisms for communication of methods with the goal of facilitation of replication. Computational methods are a relatively new addition to a scientist’s toolkit, and the scientific community is only just establishing similar standards for verification and reproducibility in this new context. Peer review and journal publishing have generally not yet adapted to the use of computational methods and still operate as suitable for the deductive or empirical branches, creating a growing credibility gap in computational science.

The key point emerging from the leak of the CRU docs is that without the code and data it is all but impossible to tell whether the research is right or wrong, and this community’s lack of awareness of reproducibility and blustery demeanor does not inspire confidence in their production of reliable knowledge. This leak and the ensuing embarrassment would not have happened if code and data that permit reproducibility had been released alongside the published results. When mature, computational science will produce routinely verifiable results.

Verifying Computational Results without Clear Communication of the Steps Taken is Near-Impossible

The frequent near-impossibility of verification of computational results when reproducibility is not considered a research goal is shown by the miserable travails of “Harry,” a CRU employee with access to their system who was trying to reproduce the temperature results. The leaked documents contain logs of his unsuccessful attempts. It seems reasonable to conclude that CRU’s published results aren’t reproducible if Harry, an insider, was unable to do so after four years.

This example also illustrates why a decision to leave reproducibility to others, beyond a cursory description of methods in the published text, is wholly inadequate for computational science. Harry seems to have had access to the data and code used and he couldn’t replicate the results. The merging and preprocessing of data in preparation for modeling and estimation encompasses a potentially very large number of steps, and a change in any one could produce different results. Just as when fitting models or running simulations, parameter settings and function invocation sequences must be communicated, again because the final results are a culmination of many decisions and without this information each small step must match the original work – a Herculean task. Responding with raw data when questioned about computational results is merely a canard, not intended to seriously facilitate reproducibility.

The story of Penn State professor of meteorology Michael Mann‘s famous hockey stick temperature time series estimates is an example where lack of verifiability had important consequences. In February 2005 two panels examined the integrity of his work and debunked the results, largely from work done by Peter Bloomfield, a statistics professor at North Carolina State University, and Ed Wegman, statistics professor at George Mason University. (See also this site for further explanation of statistical errors.) Release of the code and data used to generate the results in the hockey stick paper likely would have caught the errors earlier, avoided the convening of the panels to assess the papers, and prevented the widespread promulgation of incorrect science. The hockey stick is a dramatic illustration of global warming and became something of a logo for the U.N.’s Intergovernmental Panel of Climate Change (IPCC). Mann was an author of the 2001 IPCC Assessment report, and was a lead author on the “Copenhagen Diagnosis,” a report released Nov 24 and intended to synthesize the hundreds of research papers about human-induced climate change that have been published since the last assessment by the IPCC two years ago. The report was prepared in advance of the Copenhagen climate summit scheduled for Dec 7-18. Emails between CRU researchers and Mann are included in the leak, which happened right before the release of the Copenhagen Diagnosis (a quick search of the leaked emails for “Mann” provided 489 matches).

These reports are important in part because of their impact on policy, as CBS news reports, “In global warming circles, the CRU wields outsize influence: it claims the world’s largest temperature data set, and its work and mathematical models were incorporated into the United Nations Intergovernmental Panel on Climate Change’s 2007 report. That report, in turn, is what the Environmental Protection Agency acknowledged it “relies on most heavily” when concluding that carbon dioxide emissions endanger public health and should be regulated.”

Discussions of Appropriate Level of Code and Data Disclosure on, Before and After the CRU Leak

For years researchers had requested the data and programs used to produce Mann’s Hockey Stick result, and were resisted. The repeated requests for code and data culminated in Freedom of Information (FOI) requests, in particular those made by Willis Eschenbach, who tells his story of requests he made for underlying code and data up until the time of the leak. It appears that a file,, was placed on CRU’s FTP server and then comments alerting people to its existence were posted on several key blogs.

The thinking regarding disclosure of code and data in one part of the climate change community is illustrated in this fascinating discussion on the blog in February. (Thank you to Michael Nielsen for the pointer.) has 5 primary authors, one of whom is Michael Mann, and its primary author is Gavin Schmidt who was described earlier this year as a “computer jockeys for Nasa’s James Hansen, the world’s loudest climate alarmist.” In this RealClimate blog post from November 27, Where’s the Data, the position seems to be now very much all in favor of data release, but the first comment asks for the steps taken in reconstructing the results as well. This is right – reproducibility of results should be the concern but does not yet appear to be taken seriously (as also argued here).

Policy and Public Relations

The Hill‘s Blog Briefing Room reported that Senator Inhofe (R-Okla.) will investigate whether the IPCC “cooked the science to make this thing look as if the science was settled, when all the time of course we knew it was not.” With the current emphasis on evidence-based policy making, Inhofe’s review should recommend code and data release and require reliance on verified scientific results in policy making. The Federal Research Public Access Act should be modified to include reproducibility in publicly funded research.

A dangerous ramification from the leak could be an undermining of public confidence in science and the conduct of scientists. My sense is that had this climate modeling community made its code and data readily available in a way that facilitated reproducibility of results, not only would they have avoided this embarrassment but the discourse would have been about scientific methods and results rather than potential evasions of FOIA requests, whether or not data were fudged, or scientists acted improperly in squelching dissent or manipulating journal editorial boards. Perhaps data release is becoming an accepted norm, but code release for reproducibility must follow. The issue here is verification and reproducibility, without which it is all but impossible to tell whether the core science done at CRU was correct or not, even for peer reviewing scientists.

My Interview with ITConversations on Reproducible Research

On September 30, I was interviewed by Jon Udell from in his Interviews with Innovators series, on Reproducibility of Computational Science.

Here’s the blurb: “If you’re a writer, a musician, or an artist, you can use Creative Commons licenses to share your digital works. But how can scientists license their work for sharing? In this conversation, Victoria Stodden — a fellow with Science Commons — explains to host Jon Udell why scientific output is different and how Science Commons aims to help scientists share it freely.”

Optimal Information Disclosure Levels: and "Taleb's Criticism"

I was listening to the audio recording of last Friday’s “Scientific Data for Evidence Based Policy and Decision Making” symposium at the National Academies, and was struck by the earnest effort on the part of members of the Whitehouse to release governmental data to the public. Beth Noveck, Obama’s Deputy Chief Technology Officer for Open Government, frames the effort with a slogan, “Transparency, Participation, and Collaboration.” A plan is being developed by the Whitehouse in collaboration with the OMB to implement these three principles via a “massive release of data in open, downloadable, accessible for machine readable formats, across all agencies, not only in the Whitehouse,” says Beth. “At the heart of this commitment to transparency is a commitment to open data and open information..”

Vivek Kundra, Chief Information Officer in the Whitehouse’s Open Government Initiative, was even more explicit – saying that “the dream here is that you have a grad student, sifting through these datasets at 3 in the morning, who finds, at the intersection of multiple datasets, insight that we may not have seen, or developed a solution that we may not have thought of.”

This is an extraordinary vision. This discussion comes hot on the heels of a debate in Congress regarding the level of information they are willing to release to the public in advance of voting on a bill. Last Wednesday CBS reports, with regard to the health care bill, that “[t]he Senate Finance Committee considered for two hours today a Republican amendment — which was ultimately rejected — that would have required the “legislative” language of the committee’s final bill, along with a cost estimate for the bill, to be posted online for 72 hours before the committee voted on it. Instead, the committee passed a similar amendment, offered by Committee Chair Max Baucus (D-Mont.), to put online the “conceptual” or “plain” language of the bill, along with the cost estimate.” What is remarkable is the sense this gives that somehow the public won’t understand the raw text of the bill (I noticed no compromise position offered that would make both versions available, which seems an obvious solution).

The Whitehouse’s efforts have the potential to test this hypothesis: if given more information will people pull things out of context and promulgate misinformation? The Whitehouse is betting that they won’t, and Kundra does state the Whitehouse is accompanying dataset release with efforts to provide contextual meta-data for each dataset while safeguarding national security and individual privacy rights.

This sense of limits in openness isn’t unique to governmental issues and in my research on data and code sharing among scientists I’ve termed the concern “Taleb’s crticism.” In a 2008 essay on The Edge website, Taleb worries about the dangers that can result from people using statistical methodology without having a clear understanding of the techniques. An example of concern about Taleb’s Criticism appeared on UCSF’s EVA website, a repository of programs for automatic protein structure prediction. The UCSF researchers won’t release their code publicly because, as stated on their website, “We are seriously concerned about the ‘negative’ aspect of the freedom of the Web being that any newcomer can spend a day and hack out a program that predicts 3D structure, put it on the web, and it will be used.” Like the congressmen seemed to fear, for these folks openness is scary because people may misuse the information.

It could be argued, and for scientific research should be argued, that an open dialog of an idea’s merits is preferable to no dialog at all, and misinformation can be countered and exposed. Justice Brandeis famously elucidated this point in Whitney v. California (1927), writing that “If there be time to expose through discussion the falsehood and fallacies, to avert the evil by the processes of education, the remedy to be applied is more speech, not enforced silence.” is an experiment in context and may bolster trust in the public release of complex information. Speaking of the project, Noveck explained that “the notion of making complex information more accessible to people and to make greater sense of that complex information was really at the heart.” This is a very bold move and it will be fascinating to see the outcome.

Crossposted on Yale Law School’s Information Society Project blog.

Sunstein speaks on extremism

Cass Sunstein, Professor at Harvard Law School, is speaking today on Extremism: Politics and Law. Related to this topic, he is the author of Nudge, 2.0, and Infotopia. He discussed Republic 2.0 with Henry Farrell on this diavlog, which touches on the theme of extremism in discourse and the web’s role is facilitating polarization of political views (notably, Farrell gives a good counterfactual to Sunstein’s claims, and Sunstein ends up agreeing with him).

Sunstein is in the midst of writing a new book on extremism and this talk is a teaser. He gives us a quote from Churchill: “Fanatics are people who can’t change their minds and will not change the subject.” Political scientist Hardin says he agrees with the first clause epistemologically but the second clause is wrong because they *cannot* change the subject. Sunstein says extremism in multiple domains (The Whitehouse, company boards, unions) results from group polarization.

He thinks the concept of group polization should replace the notion of group think in all fields. Group Polarization involves both information exchange and reputation. His thesis is that like-minded people talking with other like-minded people tend to move to more extreme positions upon disucssion – partly because of the new information and partly because of the pressure from peer viewpoints.

His empirical work on this began with his Colorado study. He and his coauthors recorded the private views on 3 issues (climate change, same sex marriage and race conscious affirmative action) for citizens in Boulder and for citizens in Colorado Springs. Boulder is liberal so they screened people to ensure liberalness: if they liked Cheney they were excused from the test. They asked the same Cheney question in Colorado Springs and if they didn’t like him they were excused. Then he interviewed them to determine their private view after deliberation, and well as having come to a group consensus.
Continue reading ‘Sunstein speaks on extremism’