Building Greenstone Collections from the Command Line


Page 1: Building Greenstone Collections from the Command Line

Building Greenstone Collections from the Command Line

Page 2: Building Greenstone Collections from the Command Line

Basic commands

Type “setup.bat” (for Windows users) or “setup.sh” (for Unix/Linux users) when you’re in the Greenstone installation directory

To create a collection, type “perl -S mkcol.pl -creator [email protected] collection_name”

To import documents into a collection, type “perl -S import.pl collection_name”

To build a collection, type “perl -S buildcol.pl collection_name”

For further details, read pages 9 to 19 of the developer’s guide

Page 3: Building Greenstone Collections from the Command Line

Building a Collection in Greenstone

[Diagram: Documents -> Import with import.pl (plugins) -> Archives (XML documents) -> Index with buildcol.pl (classifiers) -> Web (browsing and full text)]

Page 4: Building Greenstone Collections from the Command Line

Importing documents

Plugins are used to process source documents in different formats and associate the corresponding metadata with them

The output of this process is XML documents encoded in the Greenstone Archive format specified by the following DTD:

<!DOCTYPE GreenstoneArchive [
<!ELEMENT Section (Description,Content,Section*)>
<!ELEMENT Description (Metadata*)>
<!ELEMENT Content (#PCDATA)>
<!ELEMENT Metadata (#PCDATA)>
<!ATTLIST Metadata name CDATA #REQUIRED>
]>

Page 5: Building Greenstone Collections from the Command Line

Automating collection building tasks

Batch files can automate many of the tasks

You can create a batch file to import and rebuild a collection

Try copying and pasting the following lines into a batch file named “rebuild.bat”:

perl -S import.pl -removeold %1
perl -S buildcol.pl %1

Execute the batch file by typing “rebuild.bat collection_name”

There are many commands that you can combine in a batch file
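The same two steps can also be scripted in a portable way. The following is a minimal Python sketch of what rebuild.bat does, assuming that perl and the Greenstone scripts are already on the path (that is, the setup script has been run). The function names rebuild_commands and rebuild are illustrative and not part of Greenstone.

```python
import subprocess


def rebuild_commands(collection):
    """Return the two command lines that rebuild.bat runs for a collection."""
    return [
        ["perl", "-S", "import.pl", "-removeold", collection],
        ["perl", "-S", "buildcol.pl", collection],
    ]


def rebuild(collection, runner=subprocess.run):
    """Run the import step, then the build step, stopping at the first failure."""
    for command in rebuild_commands(collection):
        # check=True makes subprocess.run raise if a command exits with an error
        runner(command, check=True)


# Show the commands without executing them
for command in rebuild_commands("collection_name"):
    print(" ".join(command))
```

Calling rebuild("collection_name") would execute both commands in order. The runner parameter exists only so the sequence can be exercised without a Greenstone installation.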

Page 6: Building Greenstone Collections from the Command Line

Importing documents (cont.)

An example:

<Section>
<Description>
<Metadata name="gsdlsourcefilename">ec158e.txt</Metadata>
<Metadata name="Title">Freshwater Resources in Arid Lands</Metadata>
<Metadata name="Identifier">HASH0158f56086efffe592636058</Metadata>
<Metadata name="gsdlassocfile">cover.jpg:image/jpeg:</Metadata>
<Metadata name="gsdlassocfile">p07a.png:image/png:</Metadata>
</Description>
<Section>

Note: gsdlsourcefilename is the original file from which the Greenstone archive file was generated, and gsdlassocfile is a file associated with the document (e.g. an image file)
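To see how the metadata in such a section can be read back, here is a small Python sketch using the standard library XML parser. The sample closes the section with an empty Content element so that it is well-formed; that closing part is an assumption added for the example, and the helper name section_metadata is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample section based on the example above; the empty <Content> and the
# closing </Section> are assumptions added so the fragment parses.
SAMPLE = """<Section>
  <Description>
    <Metadata name="gsdlsourcefilename">ec158e.txt</Metadata>
    <Metadata name="Title">Freshwater Resources in Arid Lands</Metadata>
    <Metadata name="Identifier">HASH0158f56086efffe592636058</Metadata>
    <Metadata name="gsdlassocfile">cover.jpg:image/jpeg:</Metadata>
    <Metadata name="gsdlassocfile">p07a.png:image/png:</Metadata>
  </Description>
  <Content></Content>
</Section>"""


def section_metadata(section):
    """Map each metadata name to the list of values in the section's Description."""
    values = {}
    for item in section.findall("./Description/Metadata"):
        # A name such as gsdlassocfile may occur several times, so keep a list
        values.setdefault(item.get("name"), []).append(item.text or "")
    return values


metadata = section_metadata(ET.fromstring(SAMPLE))
print(metadata["Title"])
print(metadata["gsdlassocfile"])
```

Values are kept in lists because the same metadata name can appear more than once, as gsdlassocfile does here.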

Page 7: Building Greenstone Collections from the Command Line

Document Metadata

Greenstone plugins recognize only a small set of metadata tags

There are three ways to assign metadata to documents in a collection: 1) index.txt, 2) metadata.xml and 3) modify an existing Greenstone plugin

An index.txt file is a space-separated file that assigns a list of metadata to documents in a collection. It should be placed in the collection’s import directory

Page 8: Building Greenstone Collections from the Command Line

Document Metadata (cont.)

To inform Greenstone about the existence of this file, include the IndexPlug plugin in your collect.cfg file or add this plugin to your plugin list in GLI

An example of the index.txt file is as follows:

key: Title Date Cast Director
"analyze.html" "Analyze That" "2002" "Robert De Niro, Billy Crystal, Lisa Kudrow" "Harold Ramis"
"majestic.html" "Majestic, The" "2001" "Jim Carrey, Bob Balaban, Jeffrey DeMunn" "Frank Darabont"

The fields in this file are separated by spaces and enclosed in double quotes. Their positions are matched with the listing of fields shown in the first line of the file

Note that the first field of each entry must be the filename of a document

The trailers collection uses this approach to assign metadata to documents in a collection
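The matching of quoted fields to the names on the key line can be illustrated with a short Python sketch. This is not how IndexPlug itself is implemented; it only mirrors the layout described above, and the function name parse_index is hypothetical.

```python
import shlex

SAMPLE = '''key: Title Date Cast Director
"analyze.html" "Analyze That" "2002" "Robert De Niro, Billy Crystal, Lisa Kudrow" "Harold Ramis"
"majestic.html" "Majestic, The" "2001" "Jim Carrey, Bob Balaban, Jeffrey DeMunn" "Frank Darabont"
'''


def parse_index(text):
    """Return {filename: {field name: value}} for an index.txt style listing."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0]
    if not header.startswith("key:"):
        raise ValueError("first line must start with 'key:'")
    names = header[len("key:"):].split()
    records = {}
    for line in lines[1:]:
        # shlex.split keeps each double-quoted field together, spaces included
        fields = shlex.split(line)
        filename, values = fields[0], fields[1:]
        records[filename] = dict(zip(names, values))
    return records


records = parse_index(SAMPLE)
print(records["analyze.html"]["Director"])
```

The first quoted field is treated as the filename, and the remaining fields are paired by position with the names on the key line.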

Page 9: Building Greenstone Collections from the Command Line

Document Metadata (cont.)

The second approach uses an XML file to assign metadata to documents in a collection

To inform Greenstone that you would like to use the metadata.xml file, include the string “plugin RecPlug -use_metadata_files” in your collect.cfg file or check the use_metadata_files flag after clicking on the configure plugin button in the GLI

The benefit of using an XML file over the previous approach is that the browser can perform tag checking for you

Page 10: Building Greenstone Collections from the Command Line

Document Metadata (cont.)

<?xml version="1.0" ?>
<DirectoryMetadata>
  <FileSet>
    <FileName>MARTYN_DR_02002066.html</FileName>
    <Description>
      <Metadata name="PlayerID">MARTYN_DR_02002066</Metadata>
      <Metadata name="PlayerProfile"></Metadata>
      <Metadata name="PlayerName">Damien Richard Martyn</Metadata>
      <Metadata name="FullSizeImage">http://www-usa.cricket.org//perl/picture.cgi/030730</Metadata>
      <Metadata name="ThumbnailImage">http://www-usa.cricket.org//perl/picture.cgi/030730/inline?alt=1</Metadata>
      <Metadata name="CoverImage">MARTYN_DR_02002066.jpg</Metadata>
      <Metadata name="Country">Australia</Metadata>
      <Metadata name="BattingStyle">Right Hand Bat</Metadata>
      <Metadata name="BowlingStyle">Right Arm Medium</Metadata>
    </Description>
  </FileSet>
  <FileSet>
    <FileName>POTHECARY_JE_03001137.html</FileName>
    <Description>
      <Metadata name="PlayerID">POTHECARY_JE_03001137</Metadata>
      <Metadata name="PlayerProfile"></Metadata>
      <Metadata name="PlayerName">James Edward Pothecary</Metadata>
      <Metadata name="Country">South Africa</Metadata>
      <Metadata name="BattingStyle">Right Hand Bat</Metadata>
      <Metadata name="BowlingStyle">Right Arm Medium</Metadata>
    </Description>
  </FileSet>
</DirectoryMetadata>

Can you recognize the XML structure this uses?

Page 11: Building Greenstone Collections from the Command Line

Document Metadata (cont.)

Here’s the answer:

<DirectoryMetadata>
  <FileSet>
    <FileName>text</FileName>
    <Description>
      <Metadata name="name1">some text</Metadata>
      <Metadata name="name2">some text</Metadata>
      other Metadata tags…
    </Description>
  </FileSet>
  other FileSet tags …
</DirectoryMetadata>

Note that XML is case sensitive

The cricket collection uses the metadata.xml file to assign metadata to the documents
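A file with this structure can also be generated rather than typed by hand. The following Python sketch builds the DirectoryMetadata tree with the standard library. The helper name build_directory_metadata and the shape of its input are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_directory_metadata(entries):
    """Build a DirectoryMetadata tree from {filename: [(name, value), ...]}."""
    root = ET.Element("DirectoryMetadata")
    for filename, pairs in entries.items():
        file_set = ET.SubElement(root, "FileSet")
        ET.SubElement(file_set, "FileName").text = filename
        description = ET.SubElement(file_set, "Description")
        for name, value in pairs:
            # Element and attribute names are spelled exactly as in the
            # structure above, because XML is case sensitive
            ET.SubElement(description, "Metadata", name=name).text = value
    return root


root = build_directory_metadata({
    "MARTYN_DR_02002066.html": [
        ("PlayerName", "Damien Richard Martyn"),
        ("Country", "Australia"),
    ],
})
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Generating the file this way avoids unbalanced or misspelled tags, which is the kind of error the tag checking mentioned earlier would otherwise have to catch.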

Page 12: Building Greenstone Collections from the Command Line

Document Metadata (cont.)

We can also customize a plugin to extract metadata from a document

We will look at modifying the TextPlug to extract Ratings, Genre and Subject from a few documents in the trailers collection

Page 13: Building Greenstone Collections from the Command Line

Structuring Documents into Sections

Sometimes source documents have to be structured into sections and subsections

This can be done easily by incorporating the following HTML tags into your documents:

<!--
<Section>
<Description>
<Metadata name="Title">Realizing human rights for poor people: Strategies for achieving the international development targets</Metadata>
</Description>
-->
(text of section goes here)
<!--
</Section>
-->

You can also embed subsections within another section by embedding another level of <Section> before the </Section> tag

Look at one of the HTML files in the demo collection for an example
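The comment markers can be added to existing text programmatically. Here is a minimal Python sketch that wraps a block of text in the section markers shown above. The function name section_markup is hypothetical, and the sketch assumes the title contains no characters that need escaping inside an HTML comment.

```python
def section_markup(title, body):
    """Wrap body text in the HTML comment markers that delimit one section."""
    opening = (
        "<!--\n"
        "<Section>\n"
        "<Description>\n"
        f'<Metadata name="Title">{title}</Metadata>\n'
        "</Description>\n"
        "-->"
    )
    closing = "<!--\n</Section>\n-->"
    return f"{opening}\n{body}\n{closing}"


print(section_markup("Preface", "<p>Text of the section goes here.</p>"))
```

For a subsection, the output of one call can be passed as part of the body of another call, which places the inner markers before the outer closing marker.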

Page 14: Building Greenstone Collections from the Command Line

Browsing Indexes

Page 15: Building Greenstone Collections from the Command Line

Types of Browsing Indexes

SectionList
AZList
AZSectionList
DateList
Hierarchy

Page 16: Building Greenstone Collections from the Command Line

Creating Browsing Indexes

Certain classifiers generate browsing structures that are hierarchical

They are useful for subject classifications and organization hierarchies

Therefore specific hierarchies will have to be provided using the flag -hfile <filename> when the classifier is defined in the collect.cfg file

For example:

classify Hierarchy -hfile sub.txt -metadata Subject -sort Title

Page 17: Building Greenstone Collections from the Command Line

Creating Browsing Indexes (cont.)

Note that sub.txt has to reside in the collection’s etc directory

Certain classifiers don’t require explicit hierarchies to be defined. For instance, the AZList, DateList and List classifiers generate a selection list of the corresponding metadata:

classify List -metadata Howto
classify AZList -metadata Title

Page 18: Building Greenstone Collections from the Command Line

Creating Browsing Indexes (cont.)

Explicit hierarchies have to be defined according to the following format:

<identifier> <position in hierarchy> <name>

For example:

1 1 "General reference"
1.2 1.2 "Something else"
2 2 "…."

What this means is that a document will be assigned to the first classification if the metadata associated with the current classifier has the value 1 within the document

Look at the demo collections for examples
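The three-column format can be read with a few lines of Python. The sketch below parses the first two example entries and looks up the classification name for a metadata value. The function names and the derived depth field are assumptions for illustration, not part of Greenstone.

```python
import shlex

SAMPLE = '''1 1 "General reference"
1.2 1.2 "Something else"
'''


def parse_hierarchy(text):
    """Parse lines of the form: <identifier> <position in hierarchy> <name>."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # shlex.split keeps the quoted name together as one field
        identifier, position, name = shlex.split(line)
        entries.append({
            "identifier": identifier,
            "position": position,
            "name": name,
            # Assumption: each dot in the position marks one extra level
            "depth": position.count(".") + 1,
        })
    return entries


def classification_for(value, entries):
    """Return the name of the entry whose identifier equals the metadata value."""
    for entry in entries:
        if entry["identifier"] == value:
            return entry["name"]
    return None


entries = parse_hierarchy(SAMPLE)
print(classification_for("1", entries))
```

A document whose metadata has the value 1 is matched to the entry with identifier 1, which corresponds to the behaviour described above.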

Page 19: Building Greenstone Collections from the Command Line

Creating Browsing Indexes (cont.)

Documents are treated internally as tree nodes by Greenstone

There are three types of nodes: VList, HList and DateList

For example, an AZList consists of a collection of VList nodes that represent documents

Arguments accepted by various classifiers are on page 48 of the developer’s guide

Page 20: Building Greenstone Collections from the Command Line

Formatting Browsing Indexes

Each classifier has an implicit name from its position in the collect.cfg file. For example, the third classifier specified in the file is called CL3

Tags in the formatting strings:

[Text] – document text
[link] … [/link] – link to the document itself
[icon] – icon representing the resource
[metadata-name] – value of the metadata associated with this document
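The implicit naming of classifiers by position can be checked with a short script. The following Python sketch numbers the classify lines of a collect.cfg style text in order of appearance. The sample lines are taken from the earlier examples, their order here is arbitrary, and the function name classifier_names is hypothetical.

```python
SAMPLE = """classify AZList -metadata Title
classify List -metadata Howto
classify Hierarchy -hfile sub.txt -metadata Subject -sort Title
"""


def classifier_names(config_text):
    """Map CL1, CL2, ... to the classify lines in order of appearance."""
    names = {}
    position = 0
    for line in config_text.splitlines():
        words = line.split()
        # Only lines that begin with the word "classify" define a classifier
        if words and words[0] == "classify":
            position += 1
            names[f"CL{position}"] = " ".join(words[1:])
    return names


names = classifier_names(SAMPLE)
print(names["CL3"])
```

Reordering the classify lines changes which classifier each CLn name refers to, so format statements that use these names have to be updated together with the order.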

Page 21: Building Greenstone Collections from the Command Line

Formatting Browsing Indexes (cont.)

For example:

format CL4VList "<br>[link][Howto][/link]"

Conditional statements are supported in the formatting string. They are enclosed by the ‘{’ and ‘}’ characters in these formats:

{If}{[metadata], then clause, else clause}
{Or}{action, another-action, another-action, etc}

The {If} statement works the same way as in most programming languages

The {Or} statement evaluates the items in the list and stops when one of them is non-null. Its value is sent to the output and evaluation is terminated.
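The behaviour of the two statements can be modelled in a few lines of Python. This is only a sketch of the semantics described above, assuming that a missing metadata value is represented by an empty string or None. The function names are illustrative, and the real evaluation is performed by Greenstone when it processes the format string.

```python
def format_if(metadata_value, then_clause, else_clause=""):
    """{If}: use the then clause when the metadata has a value, else the else clause."""
    return then_clause if metadata_value else else_clause


def format_or(*values):
    """{Or}: return the first non-empty value and ignore the rest."""
    for value in values:
        if value:
            return value
    return ""


document = {"Title": "", "Howto": "Build a collection", "HasAudio": None}

# The empty Title is skipped, so the Howto value is used
print(format_or(document["Title"], document["Howto"], "Untitled"))
# HasAudio has no value, so the else clause (empty by default) is used
print(repr(format_if(document["HasAudio"], "<a href=...>audio</a>")))
```

The else clause is optional in this model, which matches the way the {If} statement is used without one in a format string.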

Page 22: Building Greenstone Collections from the Command Line

Formatting Browsing Indexes (cont.)

For example:

format VList "<td valign=top>[link]<img src=_httpprefix_/collect/cricket/images/[PlayerID].jpg border=0>[/link]</td><td>[link][Title][/link]</td><td>{If}{[HasAudio],<a href=[audioURL]><img src=_httpprefix_/collect/cricket/images/wav.jpg border=0></a>}</td>"

Page 23: Building Greenstone Collections from the Command Line

Customizing the look and feel of Greenstone

Page 24: Building Greenstone Collections from the Command Line

Customizing the look and feel of Greenstone

Involved files are in the gsdl/macros directory:

Base.dm – global macros, such as custom buttons
English.dm – text for the corresponding language
Home.dm – the main GSDL page
Gsdl.dm – About Greenstone page
Style.dm – page layout
Query.dm – query form layout

Page 25: Building Greenstone Collections from the Command Line

Customizing the look and feel of Greenstone (cont.)

Background image (chalk.gif)

Base.dm:

_httpiconchalk_ {_httpimg_/chalk.gif}
_widthchalk_ {2000}
_heightchalk_ {10}

Custom Button

Base.dm:

_Genrewidth_ {_widthtGenrex_}
_imageGenre_ {_gsimage_(_httpbrowseGenre_,_httpicontGenreof_,_httpicontGenreon_,Genre,_textimageGenre_)}
_icontabGenregreen_ {<img src="_httpicontGenregr_" width=_widthtGenrex_ border=0>}
_icontabGenregreen_[v=1] {_texticontabGenregreen_}

Page 26: Building Greenstone Collections from the Command Line

Customizing the look and feel of Greenstone (cont.)

Document.dm

_textGenrepage_ {_texticonhGenre_}
_iconGenrepage_ {<img src="_httpiconhGenre_" width="_widthhGenre_" height="_heighthGenre_">}
_iconGenrepage_ [v=1] {<h2>_texticonhGenre_</h2>}

Page 27: Building Greenstone Collections from the Command Line

Customizing the look and feel of Greenstone (cont.)

English.dm

_textimageGenre_ {Browse by Genre}
_texticontabGenregreen_ {Genre}
_httpicontGenregr_ {_httpimg_/tGenregr.gif}
_httpicontGenreon_ {_httpimg_/tGenreon.gif}
_httpicontGenreof_ {_httpimg_/tGenreof.gif}
_widthtGenrex_ {114}
_texticonhGenre_ {Genre}
_httpiconhGenre_ {_httpimg_/h\_Genre.gif}
_widthhGenre_ {250}
_heighthGenre_ {57}
_textGenreshort_ {access publications by Genre}
_textGenrelong_ { <p>You can <i>access my documents by whatever I have defined</i> by pressing the <i>Genre</i> button. This brings up a list of documents. }