The Data Archivist – The archivist’s role in data management and preservation




Presentation for TCDL 2016 - The Data Archivist: the archivist’s role in data management and preservation

On GitHub: sallain/tcdl-data-archivist

The Data Archivist

The archivist’s role in data management and preservation

Sara Allain & Sarah Romkey | Artefactual Systems, Inc. | May 26, 2016 | TCDL 2016, Austin, Texas

Today's talking points

The role of the archivist in research data management

Basic intro to Archivematica

Three case studies:

  • Ontario Council of University Libraries + Dataverse
  • University of York & University of Hull + Hydra
  • Compute Canada + Globus

Archivists+RDM

RDM Isn't New

We've been thinking about the role of the library in research data management for several years.

The Digital Preservation Gap

Digital management platforms must adequately preserve data.

Domain-specific tools and proprietary formats make this difficult.

Assertions

Research data management is a digital preservation problem.

Archivists are pretty good at digital preservation.

Why Archivematica?

Definition

A web- and standards-based, open-source application that allows your institution to preserve long-term access to trustworthy, authentic and reliable digital content.

So... Why Archivematica?

Based on standards and best practices

Format and repository agnostic

Small enough to run on a laptop

Robust enough to handle petabytes of data

Modular

Free and open source

Familiar

Archivematica is for Archivists

It was built around archival standards, using archival terminology, and it's meant to anticipate archival digital preservation workflows. (Of course, everyone's welcome to use it!)

Luckily, since RDM is a digital preservation problem, it's well suited to RDM workflows as well.

York/Hull+Hydra

Case Study 1

Research Data Spring

Jisc-funded projects aimed at encouraging tool and workflow development to tackle various aspects of research data management.

Available project funding was anywhere from £250k to £1m.

Research Data Spring

York and Hull were successful at obtaining funding for all three phases of the project.

The goal was to use Archivematica's modularity to integrate it into a research data management architecture that includes other applications for deposit, management, and so on.

York & Hull at the Outset

Established Hydra-based institutional repository, but no digital preservation capacity.

Wanted to be able to offer assured long-term preservation to faculty members.

Archivematica Falls Short!

After Phase 1 (testing), the archivists at York and Hull identified several areas where Archivematica was not sufficient to meet their RDM needs.

They applied for Phase 2 funding to begin developing solutions for the identified problems.

Winter of Our Discontent Development

Five deliverables:

  • On-demand automated DIP generation
  • METS parsing
  • Generic search REST API
  • Multiple checksum algorithms
  • Handling of unidentified files

Disclaimer: York and Hull are lovely to work with! But who can resist a Shakespeare joke?
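
To make the "multiple checksum algorithms" deliverable concrete, here is a minimal Python sketch of computing several digests for one file in a single pass. It is an illustration only, not Archivematica's implementation; the function name `compute_checksums` and its parameters are hypothetical.

```python
import hashlib


def compute_checksums(path, algorithms=("md5", "sha1", "sha256", "sha512"),
                      chunk_size=1024 * 1024):
    """Return {algorithm: hex digest} for the file at `path`.

    Hypothetical helper for illustration. The file is read once, in chunks,
    and every requested hasher is updated with each chunk, so large files
    never need to fit in memory.
    """
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Calling `compute_checksums("data.csv", ("md5", "sha256"))` would return a dictionary with one hex digest per requested algorithm, which a workflow could then compare against whichever checksum type a depositor supplied.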

Deploy! Deploy!

York and Hull successfully applied for Phase 3 funding to build a proof-of-concept platform, making use of the deliverables to integrate Archivematica with Hydra.
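
As a rough idea of what METS parsing can look like on the consuming side, the sketch below lists the files recorded in a METS file section using only the Python standard library. The sample document is invented for illustration and is far simpler than a real Archivematica METS file; the function name `list_mets_files` is hypothetical and this is not the code developed for the project.

```python
import xml.etree.ElementTree as ET

# Standard METS and XLink namespace URIs.
METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"


def list_mets_files(mets_xml):
    """Return one dict per <mets:file> element found in the METS document."""
    root = ET.fromstring(mets_xml)
    entries = []
    for file_el in root.iter(f"{{{METS_NS}}}file"):
        flocat = file_el.find(f"{{{METS_NS}}}FLocat")
        entries.append({
            "id": file_el.get("ID"),
            "checksum": file_el.get("CHECKSUM"),
            "checksum_type": file_el.get("CHECKSUMTYPE"),
            "href": flocat.get(f"{{{XLINK_NS}}}href") if flocat is not None else None,
        })
    return entries


# Invented sample, trimmed to the file section only.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                            xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file ID="file-1" CHECKSUM="abc123" CHECKSUMTYPE="SHA-256">
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="SYSTEM" xlink:href="objects/data.csv"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""

for entry in list_mets_files(SAMPLE_METS):
    print(entry["id"], entry["checksum_type"], entry["href"])
```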

Meanwhile, Artefactual is currently bundling the new features into the 1.5 and 1.6 releases of Archivematica.

OCUL+Dataverse

Case Study 2

Dataverse at OCUL

Open source repository platform developed at Harvard.

Ontario Council of University Libraries' tech branch, Scholars Portal, hosts a Dataverse instance that is available to academics at Ontario's 21 universities.

Deposit and Access Reign

Dataverse excels as a deposit and access system, but has limited digital preservation functionality.

The goal of the project was to let users deposit content through Dataverse while Archivematica runs preservation tasks in the background.

Important: users can deposit content over time, rather than all at once!

Automate It!

The integration makes use of Automation Tools, an Archivematica library that facilitates requests for updated information from Dataverse's API. An ingest script was also developed to manage ingest tasks.

Diagram legend: orange = Automation Tools; green = Dataverse; blue = Archivematica
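
Because users deposit over time, the automation has to decide which datasets have changed since they were last preserved. The Python sketch below shows one possible form of that decision. It is an assumption-laden illustration: the record fields `id` and `updated_at` and the function `datasets_needing_ingest` are hypothetical and do not reflect Dataverse's API responses or the Automation Tools code.

```python
from datetime import datetime


def datasets_needing_ingest(datasets, last_ingested):
    """Return the IDs of datasets that are new or changed since their last ingest.

    datasets:      iterable of dicts with hypothetical keys 'id' and
                   'updated_at' (an ISO 8601 timestamp with a UTC offset).
    last_ingested: dict mapping dataset ID to the timezone-aware datetime
                   of its most recent ingest.
    """
    pending = []
    for record in datasets:
        updated = datetime.fromisoformat(record["updated_at"])
        previous = last_ingested.get(record["id"])
        # Never ingested, or modified after the last ingest: queue it.
        if previous is None or updated > previous:
            pending.append(record["id"])
    return pending
```

In a setup like this, a scheduled job would fetch the current dataset records, call the function, and start a transfer for each ID it returns.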

An Experiment

The Dataverse integration project resulted in a proof-of-concept workflow that isn't currently scheduled for release. However, it's available as a separate public branch of the project on GitHub.

At some point in the future, we would love to generalize the code and make it available in a public release.

Compute Canada+Globus

Case Study 3

Compute Canada

A national, non-profit organization that provides high-performance research computing resources for 70 institutions and 10,000+ researchers.

Compute Canada uses Globus' Transfer Service and Publication Service tools to store and provide access to research data.

Canadian Polar Data Network Pilot

Scholars Portal holds terabytes of climate data from the CPDN. This corpus was used to pilot an integration where Archivematica acts as a bridge between the Globus Transfer and Publication Services and Compute Canada datastores.

Another Experiment

This proof of concept is also not scheduled for release. We're working on getting it into a separate public branch of the Archivematica project on GitHub.

Archivists+RDM

Get In Touch!

Twitter: @archivalistic | @archivematica

Email: sallain@artefactual.com or info@artefactual.com

This presentation: bit.do/data-archivist