ELAR and Digital Archiving for Documentation of Endangered Languages

David Nathan, Endangered Languages Archive, SOAS University of London. LingDy, Feb 15, 2013.


Transcript of ELAR and Digital Archiving for Documentation of Endangered Languages

Page 1: ELAR  and Digital Archiving  for Documentation of Endangered Languages

1

David Nathan

Endangered Languages Archive

SOAS University of London

LingDy, Feb 15, 2013

ELAR and Digital Archiving for Documentation of Endangered Languages

Page 2: ELAR  and Digital Archiving  for Documentation of Endangered Languages

2

What is a digital language archive?

a trusted repository created and maintained by an institution with a commitment to the long-term preservation of archived material

has policies and processes for acquiring, cataloguing, preserving, disseminating, and migrating (updating formats)

a platform for building and supporting relationships between data providers and data users

Page 3: ELAR  and Digital Archiving  for Documentation of Endangered Languages

3

General archiving functions

advise

acquire

preserve

add value

provide access

develop trust

Page 4: ELAR  and Digital Archiving  for Documentation of Endangered Languages

4

Why is language archiving different?

what is a language?

unlike business data, it is not conventionalised (like $, age, year of publication etc) – what and how to code?

varying and competing expectations

Page 5: ELAR  and Digital Archiving  for Documentation of Endangered Languages

5

And endangered languages archiving?

extremely diverse context – languages, cultures, communities, individuals, projects

typical source - fieldworkers

typical materials - documentation

difficult for archive staff to manage sensitivities and restrictions

Page 6: ELAR  and Digital Archiving  for Documentation of Endangered Languages

6

What can a language archive offer?

Security - keep your electronic materials safe

Preservation - store your materials for the long term

Discovery - help others to find out about your materials, and you to find out about users

Protocols - respect and implement sensitivities, restrictions

Sharing - share results of your work, if appropriate

Acknowledgement - create citable acknowledgement

Mobilisation - create usable language materials

Quality and standards - advice for assuring your materials are of the highest quality and robust standards

Page 7: ELAR  and Digital Archiving  for Documentation of Endangered Languages

7

There are different kinds of language archives

from local to global - different coverage, contexts, methods, collection policies

consider placing your materials in more than one …

there are also sites for aggregating different archives’ holdings, eg Virtual Language Observatory, OLAC

Page 8: ELAR  and Digital Archiving  for Documentation of Endangered Languages

8

Why digital?

preservation: digitisation is the only way that audio and video (non-symbolic material) can be preserved for the future … because it can be copied and transmitted with zero loss

also good for cataloguing, sharing, dissemination, repurposing

Page 9: ELAR  and Digital Archiving  for Documentation of Endangered Languages

9

Digital disadvantages

digital data is fragile and ephemeral

cost (human, equipment, maintenance)

requires strategy and luck to get right

preservation depends on file and data formats

formats depend on tools and software

some formats require particular software (can we archive the software?)

formats: prefer standard, stable, open, explicit, long-lasting

some materials may have to be ‘migrated’

Page 10: ELAR  and Digital Archiving  for Documentation of Endangered Languages

10

What do depositors have to do?

select and contact an archive

prepare materials

  select

  structure

  suitable encodings and formats

  complete metadata, metadocumentation, agreements

send materials to archive(s)

work with archive during curation etc

ongoing management, updating, dissemination

Page 11: ELAR  and Digital Archiving  for Documentation of Endangered Languages

11

OAIS model

OAIS archives define three types of ‘packages’: ingestion, archive, dissemination

[Diagram: Producers → Ingestion → Archive → Dissemination → Designated communities]

Page 12: ELAR  and Digital Archiving  for Documentation of Endangered Languages

12

ELAR - architecture

reduced boundaries between depositors, users and archive: users add, update content; negotiate access

[Diagram: Archive, linked to Users & Producers by arrows labelled request, give access, contribute, edit]

Page 13: ELAR  and Digital Archiving  for Documentation of Endangered Languages

13

Redefining the digital EL archive

a platform for developing and conducting relationships between knowledge producers and knowledge users – a social networking archive

level the playing field between researchers and community members/other stakeholders

encourage, recognise and cater for diversity

Page 14: ELAR  and Digital Archiving  for Documentation of Endangered Languages

14

Data management and archiving

use good data management practices whether or not you plan to archive materials

document decisions, steps, conventions, structures, encodings

appropriate and conventional data encoding methods (e.g. Unicode)

be explicit and consistent

plan for flowing data, working with others, across different systems (cf Bird and Simons, ‘Seven Dimensions of Portability’)

good data management practices will make a future archiving process easier and better

Page 15: ELAR  and Digital Archiving  for Documentation of Endangered Languages

15

Users and potential users

depositors – deposit, access or update materials

speakers and their descendants

other researchers - comparative/historical linguists, typologists, theoreticians, anthropologists, historians, musicologists etc

other “stakeholders”, eg educationalists, funders

journalists and the wider public

Page 16: ELAR  and Digital Archiving  for Documentation of Endangered Languages

16

ELAR facts and figures

archived collections: ~200

online (published) collections: 150

average collection size about 80 GB

online data bundles: ~25,000

online bundles access: unrestricted 10,000, restricted 15,000

total number of files held: around 200,000

total volume of files held: around 10 TB

registered users: ~800

annual number of website "hits": 230,000

Page 17: ELAR  and Digital Archiving  for Documentation of Endangered Languages

17

ELAR facts and figures – users

increasing number of community members, including Aleut (Canada), Tai-Ahom, Wadar (India), Burushaski (Pakistan), Serrano, Cahuilla, Arapaho (USA), Iraqi Jewish (Iraq), Saami (Finland), Wabena (Tanzania), Torwali (Pakistan), Hani, Bai (China), Irish

comments: “I found your site while looking up my grandmother, and i found her on your site speaking our language. and i would love for my children her great grandchildren to hear our language coming from her.”

many interdisciplinary researchers, particularly archivists and anthropologists

Page 18: ELAR  and Digital Archiving  for Documentation of Endangered Languages

18

Our task

… to preserve and disseminate documentation of endangered languages

Page 19: ELAR  and Digital Archiving  for Documentation of Endangered Languages

19

Why is this important?

over 50% of the world’s 7000 languages:

  are endangered

  likely to cease to be spoken this century

  little or nothing known about the majority of them

language documentations and the archives that support, preserve, and disseminate them, will become the means of transmission of many languages

Page 20: ELAR  and Digital Archiving  for Documentation of Endangered Languages

20

A perfect storm?

documentation methods expose sensitivities & vulnerabilities

documentation performed by and for linguists and “others”

“big data” – resources channeled to analysis, broader audiences

“open data” – push for unmoderated access

Page 21: ELAR  and Digital Archiving  for Documentation of Endangered Languages

21

Protocol

the sensitivities and access restrictions associated with EL resources need to be discussed, collected and recorded in the field

Page 22: ELAR  and Digital Archiving  for Documentation of Endangered Languages

22

Protocol and access control

principles:

  granularity – file, bundle or collection

  access is a relation between object and user

  protocol values can be changed over time

ELAR’s URCS system

  User

  Researcher

  Community member

  Subscriber

Page 23: ELAR  and Digital Archiving  for Documentation of Endangered Languages

23

ELAR’s protocol values

U – resource available to all registered users

R – resource available to users registered as researchers

C – resource available to users endorsed as members of relevant language community

S – resource available to users who have been given individual access rights for that resource
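
The four protocol values can be read as a small access rule that relates a user to a resource. The following is a minimal sketch in Python, not ELAR's actual implementation: the `User` and `Resource` records, their field names, and the `can_access` function are all hypothetical and only illustrate the U/R/C/S logic described on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    # Hypothetical user record; field names are illustrative only.
    name: str
    registered: bool = True
    is_researcher: bool = False
    communities: set = field(default_factory=set)  # communities the user is endorsed for


@dataclass
class Resource:
    # Hypothetical resource record (a file, bundle or collection).
    name: str
    protocol: str                                   # one of "U", "R", "C", "S"
    community: str = ""                             # relevant language community
    subscribers: set = field(default_factory=set)   # users given individual access


def can_access(user: User, resource: Resource) -> bool:
    """Access is a relation between a user and an object."""
    if not user.registered:
        return False
    if resource.protocol == "U":
        return True
    if resource.protocol == "R":
        return user.is_researcher
    if resource.protocol == "C":
        return resource.community in user.communities
    if resource.protocol == "S":
        return user.name in resource.subscribers
    return False
```

Because the protocol value is stored on the resource, it can be changed over time without touching user records, and the same rule applies whether the resource is a file, a bundle or a collection.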

Page 24: ELAR  and Digital Archiving  for Documentation of Endangered Languages

24

Page 25: ELAR  and Digital Archiving  for Documentation of Endangered Languages

25

Page 26: ELAR  and Digital Archiving  for Documentation of Endangered Languages

26

Page 27: ELAR  and Digital Archiving  for Documentation of Endangered Languages

27

User xx has just applied for access to restricted material in the deposit solega-107128. The following message was attached to the application:

"Hi [depositor],

Please delegate me for access to the material on Solegas."

Subscription application: formal

Page 28: ELAR  and Digital Archiving  for Documentation of Endangered Languages

28

This email is to inform you that user xx's application for access to restricted material in the deposit musgrave2007tulehu has just been approved. The depositor included the following note to the user:

"The researcher is known to me personally and I know that his interest is legitimate."

Subscription response: formal

Page 29: ELAR  and Digital Archiving  for Documentation of Endangered Languages

29

User xx has just applied for access to restricted material in the deposit budd2008beirebo. The following message was attached to the application:

"I'm xx. I like to learn Bislama language, but never heard what it sounds like. Am very curious "

Subscription application: “curious”

Page 30: ELAR  and Digital Archiving  for Documentation of Endangered Languages

30

User xx has just applied for access to restricted material in the deposit verstraete2010paman. The following message was attached to the application:

"I am currently doing my masters in Linguistics and I'm researching on an endangered language in Malaysia. I would like to see a sample of the data from the fieldwork since I'm not use to this yet. I hope that I can gain more understanding in carrying out the fieldwork."

Subscription application: establish credentials and reason

Page 31: ELAR  and Digital Archiving  for Documentation of Endangered Languages

31

This email is to inform you that user xx's application for access to restricted material in the deposit verstraete2010paman has just been rejected. The depositor included the following note:

"Dear xx,

I am sorry we cannot give you access to this deposit. The Lamalama community has asked us to restrict access to community members.

With best wishes,

[depositor]"

Subscription response: rejected, with reason

Page 32: ELAR  and Digital Archiving  for Documentation of Endangered Languages

32

This email is to inform you that user xx’s application for access to restricted material in the deposit caballero2009raramuri has just been approved. The depositor included the following note to the user:

"Please let me know if you're looking for any specific materials or if you have any questions."

Subscription response: offering further help

Page 33: ELAR  and Digital Archiving  for Documentation of Endangered Languages

33

This email is to inform you that user xx's application for access to restricted material in the deposit kunbarlang-389 has just been approved. The depositor included the following note to the user:

"Hi xx

I've approved your access to this collection, but you should know that there is an update in the material I've just deposited, with much more information on both music and texts. I'd be happy to give you access to that when it is processed.

Next time I come to London (October or November this year) I'd be happy to meet up if you would like to discuss."

Response: further info and offer to meet

Page 34: ELAR  and Digital Archiving  for Documentation of Endangered Languages

34

What can you archive (at ELAR)?

media - audio, video

graphics - images, scans

texts - fieldnotes, grammars, description, analysis

structured data - aligned and annotated transcriptions, databases, lexica

metadata, metadocumentation - contextual information about the materials, both structured and unstructured

Page 35: ELAR  and Digital Archiving  for Documentation of Endangered Languages

35

Archive objects

an “object” could be a file, a set of files, a directory, or a set of files with their relationships explicitly defined

like other archives, ELAR uses a set principle; we call these sets “bundles” (like DoBeS’ sessions)

See bundles at ELAR

Page 36: ELAR  and Digital Archiving  for Documentation of Endangered Languages

36

Archive objects

[Diagram: hierarchy of archive objects. ELAR contains Collections; each Collection contains Bundles; each Bundle contains Files]

Page 37: ELAR  and Digital Archiving  for Documentation of Endangered Languages

37

What is required to make a deposit?

resource(s) for an endangered language

  it could be just one file

catalogue / metadata

deposit form view

existing deposits can also be updated, added to, and metadata added/modified

Page 38: ELAR  and Digital Archiving  for Documentation of Endangered Languages

38

Archive material should be selected

example: Depositor’s question: How much video can I archive?

answer: ...

Page 39: ELAR  and Digital Archiving  for Documentation of Endangered Languages

39

How can I deliver data?

hard disks

  we return them

  we also send them out

flash cards and USB sticks

email

  good for samples for evaluation

  OK for most text materials

Dropbox etc

a web upload facility may be provided

one day we can download from your server

Page 40: ELAR  and Digital Archiving  for Documentation of Endangered Languages

40

What about CDs and DVDs?

we have found CDs, and especially DVDs, to be very unreliable

  DVD fail rate > 10%

cause confusion as files are allocated to fit on disks, not according to corpus structure

create a lot of work for depositors and for ELAR

Page 41: ELAR  and Digital Archiving  for Documentation of Endangered Languages

41

Express yourself - Metadata

metadata is data about data containers data about data

its functions

• for identification, management, retrieval of data

• provides the context and understanding of that data

• carries those understandings into the future, and to others

Page 42: ELAR  and Digital Archiving  for Documentation of Endangered Languages

42

Express yourself - Metadata

metadata reflects the knowledge and practices of data providers

… and therefore defines and constrains audiences and usages for the data

all value-adding to recordings of events (annotations, transcriptions, translations, glosses, comments, interpretations, part of speech tagging etc) can be considered metadata

data and metadata lie on a spectrum and depend on how they are used rather than being absolutely different things

Page 43: ELAR  and Digital Archiving  for Documentation of Endangered Languages

43

Express yourself - Metadata

distinguish between metadata scheme (eg set of categories) and the way that scheme is expressed

Page 44: ELAR  and Digital Archiving  for Documentation of Endangered Languages

relational (filename: sessions.xls)

ID   audio          transcription
1    TRS00065.wav   bjt_02.txt
2    TRS00066.wav   krs_43.txt

tagged (filename: sessions.xml)

<sessions>
  <session id="1">
    <audio>TRS00065.wav</audio>
    <transcription>bjt_02.txt</transcription>
  </session>
  <session id="2">
    <audio>TRS00066.wav</audio>
    <transcription>krs_43.txt</transcription>
  </session>
</sessions>
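
The same scheme (ID, audio, transcription) can be moved between the relational and the tagged expression mechanically. A minimal sketch using only the Python standard library, assuming the spreadsheet has been exported as comma-separated text; the `TABLE` string and the `table_to_xml` helper are illustrative names, not part of any ELAR tool.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed CSV export of the sessions spreadsheet shown on the slide.
TABLE = (
    "ID,audio,transcription\n"
    "1,TRS00065.wav,bjt_02.txt\n"
    "2,TRS00066.wav,krs_43.txt\n"
)


def table_to_xml(text: str) -> ET.Element:
    """Express each table row as a <session> element."""
    root = ET.Element("sessions")
    for row in csv.DictReader(io.StringIO(text)):
        session = ET.SubElement(root, "session", id=row["ID"])
        ET.SubElement(session, "audio").text = row["audio"]
        ET.SubElement(session, "transcription").text = row["transcription"]
    return root


root = table_to_xml(TABLE)
print(ET.tostring(root, encoding="unicode"))
```

The categories stay the same in both forms; only the way they are expressed changes.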

Page 45: ELAR  and Digital Archiving  for Documentation of Endangered Languages

45

Express yourself - Metadata

example

you could choose categories from OLAC, IMDI etc schemes or formulate your own

this would be a scheme of logical categories (speaker, location, date etc)

you could express these in different language(s)

you could structure the categories and values in different ways, eg as spreadsheet, database, XML

Page 46: ELAR  and Digital Archiving  for Documentation of Endangered Languages

46

Express yourself - Metadata

you need to choose

a set of metadata categories applying across whole collection

+ metadata categories that apply to particular types of objects (eg transcriptions, video), or to individual objects

+ ways of expressing and encoding all that metadata

Page 47: ELAR  and Digital Archiving  for Documentation of Endangered Languages

47

Page 48: ELAR  and Digital Archiving  for Documentation of Endangered Languages

48

Page 49: ELAR  and Digital Archiving  for Documentation of Endangered Languages

49

Example

Ju|’hoan (Biesele)

Page 50: ELAR  and Digital Archiving  for Documentation of Endangered Languages

50

Potential sources of metadata

deposit form

spreadsheets

MS Word tables, CSV etc

IMDI and OLAC XML files

custom XML

notes, correspondence and reports

filenames

direct input to ELAR interface

audio files

images (/captions)

meta-metadata files

Page 51: ELAR  and Digital Archiving  for Documentation of Endangered Languages

51

A survey

we collected information from about 50 ELAR deposits

Page 52: ELAR  and Digital Archiving  for Documentation of Endangered Languages

About 80% of most frequently occurring categories can be mapped to OLAC

count  term               OLAC
20     language           Subject.language
17     date               Date
17     description        Description
16     id                 Identifier
16     speaker            Contributor
16     title              Title
15     format             Format
13     type               Type
12     creator            Creator
12     file name          Identifier
12     notes
11     rights             Rights
10     duration           Coverage
9      content            Description
9      contributor        Contributor
9      name               Contributor
9      relation           Relation
8      age
8      comment
8      genre              Type.linguistic
8      subject.language   Subject.language
7      date recorded      Date
7      document 1
7      gender
7      place              Coverage
6      directory          Identifier
5      location           Coverage
5      rec_date           Date
5      recorder           Contributor
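
The "about 80%" figure can be checked directly against the terms listed on this slide. A minimal sketch: `SURVEY` transcribes the slide's rows as (count, depositor term, OLAC element), with `None` where the slide gives no OLAC equivalent; the variable names are illustrative.

```python
# (count, depositor term, OLAC element or None), transcribed from the slide.
SURVEY = [
    (20, "language", "Subject.language"),
    (17, "date", "Date"),
    (17, "description", "Description"),
    (16, "id", "Identifier"),
    (16, "speaker", "Contributor"),
    (16, "title", "Title"),
    (15, "format", "Format"),
    (13, "type", "Type"),
    (12, "creator", "Creator"),
    (12, "file name", "Identifier"),
    (12, "notes", None),
    (11, "rights", "Rights"),
    (10, "duration", "Coverage"),
    (9, "content", "Description"),
    (9, "contributor", "Contributor"),
    (9, "name", "Contributor"),
    (9, "relation", "Relation"),
    (8, "age", None),
    (8, "comment", None),
    (8, "genre", "Type.linguistic"),
    (8, "subject.language", "Subject.language"),
    (7, "date recorded", "Date"),
    (7, "document 1", None),
    (7, "gender", None),
    (7, "place", "Coverage"),
    (6, "directory", "Identifier"),
    (5, "location", "Coverage"),
    (5, "rec_date", "Date"),
    (5, "recorder", "Contributor"),
]

mapped = [term for _, term, olac in SURVEY if olac is not None]
unmapped = [term for _, term, olac in SURVEY if olac is None]
share = len(mapped) / len(SURVEY)

print(f"{len(mapped)} of {len(SURVEY)} terms map to OLAC ({share:.0%})")
print("no OLAC equivalent:", ", ".join(unmapped))
```

On this list, 24 of the 29 most frequent terms have an OLAC equivalent, which is consistent with the slide's "about 80%".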

Page 53: ELAR  and Digital Archiving  for Documentation of Endangered Languages

53

Depositors also add categories such as:

detailed locations

metadata in Spanish

indigenous genres and titles (eg of songs)

parents’ and spouse’s mother tongues, birthplaces

number of children, their language competence

L2, L3 and competencies

languages heard

clan/moiety

occupation

education level

Page 54: ELAR  and Digital Archiving  for Documentation of Endangered Languages

54

… more metadata:

date left home country

photos (/captions) of consultants, field sessions etc

equipment

microphone

workflow status

naming and organisational codes and principles

recorder/linguist experience level

biography and project description (“meta-documentation”)

Page 55: ELAR  and Digital Archiving  for Documentation of Endangered Languages

55

What is the distribution?

Page 56: ELAR  and Digital Archiving  for Documentation of Endangered Languages

56

Term frequency   Number of terms
20               1
17               2
16               3
15               1
13               1
12               3
11               1
10               1
9                4
8                4
7                4
6                1
5                3
4                5
3                17
2                51
1                613

Page 57: ELAR  and Digital Archiving  for Documentation of Endangered Languages

57

[Chart: frequency (axis 0 to 25) of a sample of metadata terms, including language, speaker, creator, duration, relation, subject.language, place, recorder, location, elan, media, occupation, subject, abstract, code, communicative_event, file_bundle: video_file, contributor, author, dialect, equipment, file_bundle: audio_file, indigenous title, item, date, media file, readme, session_name, toolbox id, image_file, name, actor.deafness.status, actor.family.deaf.primarycommunication, fn, filepath, speech sound, name of the item (in spanish/english)]

Page 58: ELAR  and Digital Archiving  for Documentation of Endangered Languages

58

A visualisation

Page 59: ELAR  and Digital Archiving  for Documentation of Endangered Languages

59

Page 60: ELAR  and Digital Archiving  for Documentation of Endangered Languages

60

Page 61: ELAR  and Digital Archiving  for Documentation of Endangered Languages

61

Page 62: ELAR  and Digital Archiving  for Documentation of Endangered Languages

62

Discussion and conclusions

for endangered language documentation, the metadata framework is to be discovered, not predefined (cf Jeff Wallman, TBRC)

Page 63: ELAR  and Digital Archiving  for Documentation of Endangered Languages

63

MD and resource discovery

“discovery” is not neutral: what is emphasized/distilled? who gains? who does the work?

MD is also about the distribution of labor and resources

Page 64: ELAR  and Digital Archiving  for Documentation of Endangered Languages

64

MD and users

MD is more responsible for the form, presentation, and usage of documentation than generally acknowledged

MD should be equally accessible to and relevant for community members – it may even be more relevant to them than any “linguistic” data

Page 65: ELAR  and Digital Archiving  for Documentation of Endangered Languages

65

Common metadata standards

OLAC: Open Language Archives Community: Title, Identifier, Creator, Contributor, Language, Subject.language, Date, Description, Format, Type, Rights, Coverage, Relation

IMDI: ISLE Metadata Initiative – more categories, software specific

ELAR: for endangered language documentation, metadata framework is to be discovered, not predefined

Page 66: ELAR  and Digital Archiving  for Documentation of Endangered Languages

66

Types of metadata

people metadata – creator’s / participants’ details

descriptive metadata – content of data

administrative metadata – eg. who did what when, relationships between objects, IPR and permissions

structural metadata – how collection and its objects are organised, associated, formatted

preservation metadata – character encoding, file format

access and usage protocols

Page 67: ELAR  and Digital Archiving  for Documentation of Endangered Languages

67

Examples

example - XLS

example - XML

example – key

example – key XML

example – summary and requests

example - notes

Page 68: ELAR  and Digital Archiving  for Documentation of Endangered Languages

68

Meta-documentation

Nathan (2010): “think of metadata as meta-documentation, the documentation of your data itself, and the conditions (linguistic, social, physical, technical, historical, biographical) under which it was produced. Such meta-documentation should be as rich and appropriate as the documentary materials themselves.”

Page 69: ELAR  and Digital Archiving  for Documentation of Endangered Languages

69

Meta-documentation

identity of stakeholders involved, and their roles

attitudes of language consultants, towards their languages and towards the documenter and documentation project

relationships with consultants and community (Good 2010 mentions what he called ‘the 4 Cs’: ‘contact, consent, compensation, culture’)

goals and methodology of researcher, including research methods and tools, corpus theorisation (Woodbury 2011), theoretical assumptions behind annotation, potential for revitalisation

Page 70: ELAR  and Digital Archiving  for Documentation of Endangered Languages

70

Meta-documentation

project and researcher biography: knowledge and experience of the researcher and consultants (eg. researcher’s knowledge at beginning of project, what training researcher and consultants received)

for funded projects: grant application, reports, email communications

agreements entered into – formal or informal (eg. Memorandum of Understanding, compensation arrangements), and promises made to stakeholders

relationships between this and other projects

Page 71: ELAR  and Digital Archiving  for Documentation of Endangered Languages

71

Formats/encoding

format choices at these levels:

  representation of information

  representation of characters

  how characters are assembled into files (file formats)

Page 72: ELAR  and Digital Archiving  for Documentation of Endangered Languages

72

Characters

use UTF-8 (an encoding of Unicode / ISO 10646)

be aware of using characters outside ASCII (common US keyboard characters) – these can break if UTF-8 is not used

distinguish character encoding and fonts (a font is simply a set of images for a “character set”)

  something may be coded perfectly in UTF-8 but there is no suitable font applied

  some fonts may display special characters correctly but this does not mean that encoding is correct
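
The point about characters outside ASCII can be demonstrated in a few lines. A minimal sketch in Python; the sample string and the helper names `non_ascii_chars` and `survives` are illustrative assumptions. It shows which characters fall outside ASCII, that UTF-8 round-trips them, and that decoding UTF-8 bytes with the wrong encoding silently produces garbage.

```python
def non_ascii_chars(text: str) -> list:
    """List the distinct characters that fall outside ASCII."""
    return sorted({ch for ch in text if ord(ch) > 127})


def survives(text: str, encoding: str) -> bool:
    """True if the text can be written and read back unchanged in this encoding."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False


sample = "schwa: ə, eng: ŋ"

print(non_ascii_chars(sample))
print("utf-8 ok:", survives(sample, "utf-8"))
print("ascii ok:", survives(sample, "ascii"))

# Reading UTF-8 bytes as if they were Latin-1 raises no error,
# but the special characters come out wrong.
misread = sample.encode("utf-8").decode("latin-1")
print("misread equals original:", misread == sample)
```

None of this depends on fonts: the check operates on the stored code points and bytes, which is what an archive preserves.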

Page 73: ELAR  and Digital Archiving  for Documentation of Endangered Languages

73

File formats

audio

  WAV (what if original is not WAV??)

  resolution: 16 bit, 44.1 kHz, stereo or better

video

  changing frequently

  MPEG4 or MTS/H264/AVCHD

  aspect, resolution: depends on project

  get advice from archive before depositing
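
The audio guideline can be checked before sending files to an archive. A minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV files; the function names and the decision to treat all three values (16 bit, 44.1 kHz, stereo) as minimums are assumptions made for illustration, so check the thresholds with your archive.

```python
import wave


def wav_properties(path: str) -> dict:
    """Read channel count, bit depth and sample rate from a PCM WAV file."""
    with wave.open(path, "rb") as w:
        return {
            "channels": w.getnchannels(),
            "bits": w.getsampwidth() * 8,   # sample width is reported in bytes
            "rate_hz": w.getframerate(),
        }


def meets_guideline(props: dict, min_bits: int = 16,
                    min_rate_hz: int = 44100, min_channels: int = 2) -> bool:
    """Assumed reading of '16 bit, 44.1 kHz, stereo or better'."""
    return (props["bits"] >= min_bits
            and props["rate_hz"] >= min_rate_hz
            and props["channels"] >= min_channels)
```

For a deliberately mono recording, pass `min_channels=1`; the check is only a convenience and does not replace advice from the archive.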

Page 74: ELAR  and Digital Archiving  for Documentation of Endangered Languages

74

File formats

images

  TIFF OR original from device

  resolution: archive quality is 300dpi or better

Page 75: ELAR  and Digital Archiving  for Documentation of Endangered Languages

75

File formats

text

  best is plain text

  PDF/A often acceptable, may pose problem

  if MS-Word or ODF, check with archive

structured data (spreadsheets, databases)

  original format should be supplied

  provide a preservable derivative as well (eg csv, PDF)

common linguistic software (ELAN, Transcriber, Toolbox, Praat etc)

  their file formats are generally preservable

Page 76: ELAR  and Digital Archiving  for Documentation of Endangered Languages

76

Can I still use MS Word?

ELAR no longer accepts MS Word files

but Word is still useful

  quicker to type up

  useful tables, functions, macros etc

solutions

  think “text only”

  tables as spreadsheets (are they bad too?)

  (advanced) complex materials formatted as styles, then export as marked up PDF/A – but not a perfect solution

Page 77: ELAR  and Digital Archiving  for Documentation of Endangered Languages

77

My cells have multiple values!

example: keywords

this is probably OK, as keywords are atomic

just consistently use a suitable delimiter

e.g. use comma - if data values cannot have commas

ELAR recommends double pipe “||”
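
A consistently used delimiter makes multi-valued cells easy to process later. A minimal sketch in Python using the double pipe; the helper names and the sample keywords are illustrative assumptions.

```python
DELIM = "||"


def split_values(cell: str) -> list:
    """Split a multi-valued cell, ignoring stray spaces and empty values."""
    return [value.strip() for value in cell.split(DELIM) if value.strip()]


def join_values(values: list) -> str:
    """Join values into one cell, refusing values that contain the delimiter."""
    for value in values:
        if DELIM in value:
            raise ValueError(f"value contains delimiter: {value!r}")
    return DELIM.join(values)


cell = "hunting || song||kinship, marriage"
print(split_values(cell))
```

Because the delimiter is "||" rather than a comma, a value such as "kinship, marriage" stays intact.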

Page 78: ELAR  and Digital Archiving  for Documentation of Endangered Languages

78

My cells have multiple values!

example: speakers in a recording

speakers are probably not atomic – they have other attributes

create a separate “speakers” sheet

give each speaker an ID (number or initials)

use the IDs in the original sheet, with delimiter (implements one to many)

(advanced) or make another sheet to associate recordings with speakers (implements many to many)
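
The two options can be sketched with plain Python structures standing in for spreadsheet sheets. All names and values below (speaker IDs, file names, attributes) are invented for illustration.

```python
# "speakers" sheet: one row per speaker, keyed by ID.
speakers = {
    "S1": {"name": "Speaker One", "gender": "f"},
    "S2": {"name": "Speaker Two", "gender": "m"},
}

# Original sheet: speaker IDs in one cell, separated by a delimiter.
recordings = [
    {"file": "rec01.wav", "speakers": "S1||S2"},
    {"file": "rec02.wav", "speakers": "S2"},
]


def speakers_for(recording: dict) -> list:
    """Look up the full speaker rows for one recording."""
    return [speakers[sid] for sid in recording["speakers"].split("||")]


# (advanced) separate link sheet: one row per recording/speaker pair.
links = [
    (rec["file"], sid)
    for rec in recordings
    for sid in rec["speakers"].split("||")
]


def recordings_for(speaker_id: str) -> list:
    """Use the link sheet to find every recording a speaker appears in."""
    return [file for file, sid in links if sid == speaker_id]


print([s["name"] for s in speakers_for(recordings[0])])
print(recordings_for("S2"))
```

Speaker attributes are entered once in the speakers sheet, so a correction there applies to every recording that refers to that ID.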

Page 79: ELAR  and Digital Archiving  for Documentation of Endangered Languages

79

Standards

we have already mentioned some standards – UTF-8, WAV etc

there are other relevant standards, eg

  ISO 639-3 (language/dialect names)

  metadata systems

you can also establish project-local standards, eg

  to handle special characters (eg \e = schwa)

  data field names

document them! – for your usage and for correspondence to wider standards
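
A documented project-local convention can double as a conversion table to the wider standard. A minimal sketch in Python: the `\e` = schwa entry comes from the slide, while the `\N` entry, the `CONVENTIONS` name and the `to_unicode` helper are hypothetical additions for illustration.

```python
# Project-local codes and the Unicode characters they stand for.
CONVENTIONS = {
    "\\e": "\u0259",  # \e = schwa (from the slide)
    "\\N": "\u014b",  # \N = eng (hypothetical extra entry)
}


def to_unicode(text: str) -> str:
    """Replace each documented project-local code with its Unicode character."""
    for code, char in CONVENTIONS.items():
        text = text.replace(code, char)
    return text


print(to_unicode(r"s\en"))
print(to_unicode(r"ta\Na"))
```

Keeping the table in one documented place serves both purposes named on the slide: your own usage, and correspondence to wider standards.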

Page 80: ELAR  and Digital Archiving  for Documentation of Endangered Languages

80

Page 81: ELAR  and Digital Archiving  for Documentation of Endangered Languages

81

Page 82: ELAR  and Digital Archiving  for Documentation of Endangered Languages

82

THANK YOU!

www.elar-archive.org

David Nathan