On September 13, 2018 I gave a lightning talk called "Reproducibility and Dataverse" at the Whole Tale Workshop on Tools and Approaches for Publishing Reproducible Research. You can watch the talk at https://www.youtube.com/watch?v=SuyQTsOGugc, look at the slides at https://www.slideshare.net/philipdurbin/reproducibility-and-dataverse, and read the transcript below:
Hi. My name is Philip Durbin and I've been a developer for Dataverse for almost six years. Since this is a computational crowd, I'll mention that for the six years prior to that I worked in research computing environments, so I helped researchers with scheduling systems like Condor, Slurm, and LSF. I'm from Columbus, Ohio, home of the Ohio State Buckeyes, so I'm excited to be here in the Big Ten Center. Thanks for inviting me.
Dataverse is open source research data repository software. Dataverse is also a community and I wanted to thank Victoria Stodden for giving the keynote address last year at our third annual community meeting. She introduced Whole Tale to our community so you're on our radar already, you Whole Tale people. There are 33 installations of Dataverse around the world. It's an active and growing community.
We were asked, "What cool things is your project doing to address reproducibility?" Fundamentally, one of the primary things Dataverse does is provide an incentive for data sharing in the first place. We give researchers credit for their data. We have features like our guestbook where researchers can learn how their data is being used. We provide metrics. For example, this dataset has been downloaded nine thousand times. In the future, we're going to be using additional metrics systems such as Make Data Count. More on that in a bit.
We have a concept of "replication datasets." We host six thousand of these in Harvard Dataverse. I made the text bigger from one of these datasets to show that the Odum Institute at UNC has professional curators that replicate tables and figures in the primary article. This is a requirement for being published in The American Journal of Political Science. I was hoping Jon Crabtree from Odum would be here today but there's some kind of hurricane.
We do have a computation story in Dataverse. These are the existing tools we have. Primarily these work at the file level. For tabular files we will present an "explore" button and you can open up a tool called Data Explorer where you can do cross tabulation. You can open up a tool called TwoRavens where you can do statistical analysis. That's at the file level, these tools. We have a more experimental "compute" button and you have to have a special setup of Dataverse where you're storing your files on Swift, which is an OpenStack thing. Then you click the "compute" button and it launches the compute side of OpenStack and you can play around with your data and run arbitrary computation there. That's existing, what we have today.
This is the perfect venue to announce that we were recently awarded a grant from the Alfred P. Sloan Foundation. There's a blog post coming on this any day but I'm just going to read the title of the grant: Increasing Scientific Dataset Quality Through Reproducibility and Curation Tools and Targeted Services in Dataverse Repositories. To unpack that a little bit, what we're really doing is working on the tools side and the human side, the curation services side. I've put the four tools here that we are planning to integrate with: Code Ocean, Encapsulator, CoRe2, and Make Data Count.
On the curation side, part of the grant is trying to come up with a sustainable model for offering curation services from Harvard Dataverse. Right now it's free data hosting. We do some amount of curation, but we want to be able to offer some paid tiers. We're not promising this is going to be sustainable. We want to come up with a model that we think will work through a pilot program. I wanted to mention that the Dataverse community is really into this concept of data quality and data reuse and many installations are pursuing this CoreTrustSeal certification. Tilburg was the first to announce to the Dataverse community that they've already achieved this certification.
The Sloan grant covers stuff we're definitely on the hook for, but meanwhile our community has their own ideas of what they want to do for computation and reproducibility, so I thought I'd just mention three right here. We have a researcher at Harvard Medical School who's about to launch his installation of Dataverse. He uses this "local data access path" config option where he can mount all the files from his Dataverse right in his cluster. So he can just tell his researchers to "cd" to this NFS mount and go crazy with the data. There's a ton of excitement about Jupyter notebooks and there is a group at UC Berkeley who we've been talking to who want to add a "launch in Binder" button, sort of similar to what we heard earlier about the "launch in Whole Tale" where you enter a DOI and then would be able to play around with the data in a Jupyter notebook. And then, this last one, there were some students at BU this last semester who were playing around with integrating Dataverse with Spark, which was really exciting.
That's it. Thank you. You can find me on Twitter as @philipdurbin. Our website is dataverse.org.