Grab a bucket! It's raining data!


Slides plus talk notes.


It's raining data!

Grab a bucket!

Dorothea Salo, University of Wisconsin

Access 2009

Hi there. Thanks very much to Mark Leggott for inviting me here, and to all of you for lending me your ears for a time.

You'll have noticed that the title of this talk in the program notes is very formal and buttoned-down: "Representing and managing the data deluge." Well, okay. I am not a formal and buttoned-down person, but when Mark approached me to speak here, I was actually scared to death to accept, and so I wrote this really terribly boring title -- so like Peter, I just up and changed it. The REAL title is "Grab a bucket -- it's raining data!"

To hear some folks tell it, it's a golden age to be a digital librarian. Here we have an entire new form of scholarly publication -- digital research data -- and it's ours for the asking! In times when we're all worried about the future of libraries (and, let's face it, librarians), this feels heaven-sent. Grab a bucket, it's raining data, hallelujah!


of Open Access

Painting: Cassandra, Evelyn de Morgan

In some quarters, I am now styled the Cassandra of Open Access. Cassandra, for those not up on their Greek myth, was a Trojan prophetess who was cursed such that nobody believed what she said until it was too late. Being from Troy, which was of course completely doomed, most of her prophecies were fairly dire, too. "Hey, the Greeks are about to wheel a big wooden horse into your city so they can burn it down and kill everybody!" Not happy-making stuff we're talking about.

I've got nothing against

but the reality was...

Some people have mistaken my Cassandra-nature for an animus against open access generally and institutional repositories in particular. I've never had it in for open access! Who doesn't like open access? It's similar to what Cory said yesterday: it's hard to be against an unambiguous good like open access without sounding like a total jerk... which hasn't stopped some publishers, of course. (*CLICK*) But I've been running institutional repositories for close to five years now and the on-the-ground reality has been quite a bit...

... blurrier.




something for nothing?

fit between content and container?

fit between user needs and system?

and so now, I may be becoming... blurrier. Conflicting, contradictory, and in some cases flatly impossible goals. Minimal means, because of people who seem to have been reading the mythical "Frommer's Institutional Repositories on $5 A Day." Asking for time, effort, and data from faculty without giving them any real service or any return on their time investment that made sense to them. We crammed things into IRs that just didn't fit with the very limited IR view of the digital universe, just because we hadn't anywhere else to put them: our content didn't fit in the container we had. And we completely ignored faculty needs and desires.

I'm seeing some of the same thought and design processes happening now with regard to e-science, e-research, cyberinfrastructure, data curation, whatever you want to call it. And this troubles me. So I can't help but wonder if I'm becoming...


of Data Curation?

But, optimistically, it's early days yet. There's no reason we have to make the same mistakes with data that we made with IRs. So, I don't want anyone to think that I'm raising the problems I'm going to raise in this talk because I'm somehow AGAINST research data curation, or because I think libraries shouldn't get involved with it.

I am all for research-data curation, and I believe very strongly that libraries need to get involved. I just think we should know what we're getting ourselves into, and if that means I'm a little Cassandraic, okay, so be it.



something for nothing?

fit between content and container?

fit between user needs and system?

I could spend hours talking about all these things, but I guarantee that nobody here wants to listen to me for hours. So I'm going to focus this talk on the fit between content and container, though I may touch on other things.

I'm going to examine some of the qualities of typical research data, then talk about digital libraries and IRs, looking hard at some of the impedance mismatches we're liable to run into, and maybe strategize a little bit about how to make ourselves and our systems better now, before we run headlong into another mess.

And the lens I'm going to be looking through is a human lens, not so much a technological lens. THIS IS NOT JUST A TECHNOLOGY PROBLEM; I can't say that loudly enough.

What do we know about data?


So what do we know about research data, speaking very broadly and generally?

There's a lot of data.


Richard talked about this yesterday, but I'll just reiterate: Even if we admit that the Large Hadron Collider types are probably going to take care of themselves -- and this isn't something I necessarily admit; I know huge, well-funded projects that are making huge messes with their data -- even if we admit that, we're still looking at an incredible flood of stuff.

Have we got big enough buckets? I dunno. At this juncture I feel it incumbent upon me to say the word "cloud." Cloud. There. I have said it. I now feel no need at all to say it again.

Look, I understand that storage and networking are problems that have to be solved before we can do anything else. I get that. Just -- to me, it's necessary but not sufficient, even though it seems to be getting all the attention right now. So I'm going to move on to characteristics of research data that I'm more interested in.

Data are there to be interacted with.

One thing I think we need to keep in mind about data is that they are not an end in themselves. We don't keep data just to keep data; we do it because researchers can pick up shovels and dig around in the sands and build knowledge like sand castles!

Data are there to do things with. To be examined, cleaned up, verified, refuted, corrected, number-crunched, mashed up with other data, graphed, charted, visualized... and if we treat them as though they were unchangeable museum objects -- look but don't touch, like books chained to a medieval lectern -- we are actually getting in the way of making new knowledge. If nobody can do things with data, there is no point in keeping them! That's what CC0 is about, as Richard mentioned in his Q&A session yesterday: removing legal barriers to messing about with data. We, we librarians, need to remove TECHNICAL barriers to messing about with data.

What's more, different kinds of data have different affordances. You don't use a plastic sand-shovel to dig a rock quarry, just the way you don't use a backhoe to build a sand castle. The way a sociologist interacts with census data is just wildly different from the way a medical researcher interacts with MRI data. The data buckets we build will have to internalize and respect those affordances, or at the VERY least allow RESEARCHERS to build tools on top that respect those affordances.

Data are wildly diverse in nature...

... as are their technical environments.

In other words, data are diverse, so the buckets we put them in will need to be different shapes and colors in order to respect that diversity.

Now, differences in data can sometimes be skin-deep. The difference between a digital image of a sculpture and a digital image of a physics field station in Antarctica is in some ways not much for our purposes, however different our researchers may think they are. But sometimes the differences really do matter. You can't treat a book in TEI markup the same as a book of page-scanned images; you will be doing violence to readers of one or the other. A microscopy researcher on my campus does cell sections digitally; you can train a microscope to focus from the top of the cell all the way through and down, and then you can create a 3D cell image to play with. It's really cool! But a system that treats each section image as a wholly separate and unrelated thing (*cough*DSpace*cough*) is making it impossible to get any knowledge out of those data.

Think for a moment about a single bucket that works for the TEI book, the book of page scans, the images of the Antarctic field station, and the microscopy data, and you're starting to realize the scope of the data-diversity problem.

Again, we don't control the technical environments our researchers are using to generate data. Some of those environments are proprietary, and Mike Rylander talked yesterday about why that's a dangerous, dangerous problem. But even leaving that aside, if we're really, really lucky, we might have a chance to make recommendations to researchers about their data. For the most part, though, WE are the ones who will have to adapt to whatever they're doing.

Data are already out there.

Photo: NASA, Multiwavelength M81

Why is that? We're not creating all the digital research data out there; the researchers are. And they've created it in huge volumes already. So I'm really interested when Dan Chudnov says that the Library of Congress is working to capture data at world-scale and web-scale, because I want them to teach ME how to do that.

So, researchers. They're not thinki