Building Greenstone Collections from the Command Line
Basic commands
Type “setup.bat” (for Windows users) or “setup.sh” (for Unix/Linux users) when you’re in the Greenstone installation directory
To create a collection, type “perl -S mkcol.pl -creator [email protected] collection_name”
To import documents into a collection, type “perl -S import.pl collection_name”
To build a collection, type “perl -S buildcol.pl collection_name”
For further details, read pages 9 to 19 of the developer’s guide
Building A Collection In Greenstone
[Figure: the collection-building pipeline. Documents → Import: import.pl (plugins) → Archives (XML documents) → build.pl (classifiers) → Index (browsing and full text) → Web]
Importing documents
Plugins are used to process source documents in different formats and associate the corresponding metadata with them
The output of this process is XML documents encoded in the Greenstone Archive format specified by the following DTD
<!DOCTYPE GreenstoneArchive [
<!ELEMENT Section (Description,Content,Section*)>
<!ELEMENT Description (Metadata*)>
<!ELEMENT Content (#PCDATA)>
<!ELEMENT Metadata (#PCDATA)>
<!ATTLIST Metadata name CDATA #REQUIRED>
]>
Automating collection building tasks
Batch files can automate many of the tasks
You can create a batch file to import and rebuild a collection
Try copying and pasting the following lines into a batch file named “rebuild.bat”:
perl -S import.pl -removeold %1
perl -S buildcol.pl %1
Execute the batch file by typing “rebuild.bat collection_name”
There are many commands that you can combine in a batch file
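The same two steps can also be scripted in a way that runs on both Windows and Unix/Linux. The following is a minimal Python sketch and not part of Greenstone: the function names `rebuild_commands` and `rebuild` are hypothetical, and it assumes that `perl` is on the PATH and that the Greenstone setup script has already been run in the current shell.

```python
import subprocess


def rebuild_commands(collection):
    """Return the two command lines that rebuild.bat would run, in order."""
    return [
        ["perl", "-S", "import.pl", "-removeold", collection],
        ["perl", "-S", "buildcol.pl", collection],
    ]


def rebuild(collection, dry_run=False):
    """Import, then build. With dry_run=True the commands are only printed."""
    for cmd in rebuild_commands(collection):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a step fails,
            # so the build is skipped when the import goes wrong.
            subprocess.run(cmd, check=True)
```

Calling `rebuild("collection_name", dry_run=True)` prints the commands without touching the collection, which is a safe way to check the script first.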
Importing documents (cont.)
An example:
<Section>
<Description>
<Metadata name="gsdlsourcefilename">ec158e.txt</Metadata>
<Metadata name="Title">Freshwater Resources in Arid Lands</Metadata>
<Metadata name="Identifier">HASH0158f56086efffe592636058</Metadata>
<Metadata name="gsdlassocfile">cover.jpg:image/jpeg:</Metadata>
<Metadata name="gsdlassocfile">p07a.png:image/png:</Metadata>
</Description>
<Section>
Note: gsdlsourcefilename is the original file from which the Greenstone archive file was generated, and gsdlassocfile is a file associated with the document (e.g. an image file)
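To make the archive format more concrete, here is a small Python sketch that reads the metadata out of a section like the one above using only the standard library. The helper name `section_metadata` is hypothetical, and the sample string is a shortened, properly closed copy of the example rather than a complete Greenstone archive file.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Section>
  <Description>
    <Metadata name="gsdlsourcefilename">ec158e.txt</Metadata>
    <Metadata name="Title">Freshwater Resources in Arid Lands</Metadata>
    <Metadata name="gsdlassocfile">cover.jpg:image/jpeg:</Metadata>
    <Metadata name="gsdlassocfile">p07a.png:image/png:</Metadata>
  </Description>
  <Content>(document text)</Content>
</Section>"""


def section_metadata(section):
    """Collect one Section's metadata as a dict of name -> list of values."""
    result = {}
    # Only this section's own Description is read, not nested subsections.
    for item in section.findall("./Description/Metadata"):
        result.setdefault(item.get("name"), []).append(item.text or "")
    return result


meta = section_metadata(ET.fromstring(SAMPLE))
print(meta["Title"])
print(meta["gsdlassocfile"])
```

Values are kept in lists because a name such as gsdlassocfile can appear more than once in the same Description.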
Document Metadata
Greenstone plugins recognize only a small set of metadata tags
There are three ways to assign metadata to documents in a collection: 1) index.txt, 2) metadata.xml and 3) modifying an existing Greenstone plugin
An index.txt file is a space-separated file that assigns a list of metadata to documents in a collection. It should be placed in the collection’s import directory
Document Metadata (cont.)
To inform Greenstone about the existence of this file, include the IndexPlug plugin in your collect.cfg file or add this plugin to your plugin list in GLI
An example of the index.txt file is as follows:
key: Title Date Cast Director
"analyze.html" "Analyze That" "2002" "Robert De Niro, Billy Crystal, Lisa Kudrow" "Harold Ramis"
"majestic.html" "Majestic, The" "2001" "Jim Carrey, Bob Balaban, Jeffrey DeMunn" "Frank Darabont"
The fields in this file are separated by spaces and enclosed in double quotes. They are matched by position with the field names listed in the first line of the file
Note that the first field of each entry must be the filename of a document
The trailers collection uses this approach to assign metadata to its documents
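The layout of index.txt can be illustrated with a short Python sketch. This is only a reading aid, not how IndexPlug itself works: the function name `parse_index` is hypothetical, and the sketch assumes that every value is wrapped in double quotes as in the example above.

```python
import shlex

SAMPLE = '''key: Title Date Cast Director
"analyze.html" "Analyze That" "2002" "Robert De Niro, Billy Crystal, Lisa Kudrow" "Harold Ramis"
"majestic.html" "Majestic, The" "2001" "Jim Carrey, Bob Balaban, Jeffrey DeMunn" "Frank Darabont"
'''


def parse_index(text):
    """Map each filename to a dict of field name -> value."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0]
    if not header.startswith("key:"):
        raise ValueError("first line must start with 'key:'")
    fields = header[len("key:"):].split()
    records = {}
    for line in lines[1:]:
        # shlex.split keeps each double-quoted value together as one field.
        values = shlex.split(line)
        filename, rest = values[0], values[1:]
        records[filename] = dict(zip(fields, rest))
    return records


records = parse_index(SAMPLE)
print(records["majestic.html"]["Director"])
```

The filename is taken from the first position and the remaining values are paired with the names on the key line, which mirrors the positional matching described above.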
Document Metadata (cont.)
The second approach uses an XML file to assign metadata to documents in a collection
To inform Greenstone that you would like to use the metadata.xml file, include the string “plugin RecPlug -use_metadata_files” in your collect.cfg file or check the use_metadata_files flag after clicking on the configure plugin button in the GLI
The benefit of using an XML file over the previous approach is that the browser can perform tag checking for you
Document Metadata (cont.)
<?xml version="1.0" ?>
<DirectoryMetadata>
  <FileSet>
    <FileName>MARTYN_DR_02002066.html</FileName>
    <Description>
      <Metadata name="PlayerID">MARTYN_DR_02002066</Metadata>
      <Metadata name="PlayerProfile"></Metadata>
      <Metadata name="PlayerName">Damien Richard Martyn</Metadata>
      <Metadata name="FullSizeImage">http://www-usa.cricket.org//perl/picture.cgi/030730</Metadata>
      <Metadata name="ThumbnailImage">http://www-usa.cricket.org//perl/picture.cgi/030730/inline?alt=1</Metadata>
      <Metadata name="CoverImage">MARTYN_DR_02002066.jpg</Metadata>
      <Metadata name="Country">Australia</Metadata>
      <Metadata name="BattingStyle">Right Hand Bat</Metadata>
      <Metadata name="BowlingStyle">Right Arm Medium</Metadata>
    </Description>
  </FileSet>
  <FileSet>
    <FileName>POTHECARY_JE_03001137.html</FileName>
    <Description>
      <Metadata name="PlayerID">POTHECARY_JE_03001137</Metadata>
      <Metadata name="PlayerProfile"></Metadata>
      <Metadata name="PlayerName">James Edward Pothecary</Metadata>
      <Metadata name="Country">South Africa</Metadata>
      <Metadata name="BattingStyle">Right Hand Bat</Metadata>
      <Metadata name="BowlingStyle">Right Arm Medium</Metadata>
    </Description>
  </FileSet>
</DirectoryMetadata>
Can you recognize the XML structure this uses?
Document Metadata (cont.)
Here’s the answer:
<DirectoryMetadata>
  <FileSet>
    <FileName>text</FileName>
    <Description>
      <Metadata name="name1">some text</Metadata>
      <Metadata name="name2">some text</Metadata>
      other Metadata tags…
    </Description>
  </FileSet>
  other FileSet tags…
</DirectoryMetadata>
Note that XML is case sensitive
The cricket collection uses metadata.xml to assign metadata to its documents
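Because the structure is so regular, a metadata.xml file can be generated rather than typed by hand. The sketch below uses Python's standard library to build the DirectoryMetadata / FileSet / FileName / Description / Metadata nesting shown above. The function name `build_directory_metadata` is hypothetical and the input is cut down to two metadata values for brevity.

```python
import xml.etree.ElementTree as ET


def build_directory_metadata(files):
    """Build a DirectoryMetadata tree from {filename: {metadata name: value}}."""
    root = ET.Element("DirectoryMetadata")
    for filename, metadata in files.items():
        fileset = ET.SubElement(root, "FileSet")
        ET.SubElement(fileset, "FileName").text = filename
        description = ET.SubElement(fileset, "Description")
        for name, value in metadata.items():
            ET.SubElement(description, "Metadata", {"name": name}).text = value
    return root


root = build_directory_metadata({
    "MARTYN_DR_02002066.html": {
        "PlayerName": "Damien Richard Martyn",
        "Country": "Australia",
    },
})
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Generating the file this way also sidesteps the case-sensitivity problem, since every tag name is spelled in exactly one place.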
Document Metadata (cont.)
We can also customize a plugin to extract metadata from a document
We will look at modifying the TextPlug to extract Ratings, Genre and Subject from a few documents in the trailers collection
Structuring Documents into Sections
Sometimes source documents have to be structured into sections and subsections
This can be done easily by incorporating the following tags, wrapped in HTML comments, into your documents:
<!--
<Section>
<Description>
<Metadata name="Title">Realizing human rights for poor people: Strategies for achieving the international development targets</Metadata>
</Description>
-->
(text of section goes here)
<!--
</Section>
-->
You can also nest a subsection inside a section by adding another <Section> block before the enclosing </Section> tag
Look at one of the HTML files in the demo collection for an example
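If many documents need this markup, the comment markers can be produced by a small helper. The following Python sketch is an illustration only: `wrap_section` is a hypothetical name, and escaping the title with `html.escape` is an assumption rather than a documented Greenstone requirement.

```python
from html import escape


def wrap_section(title, body):
    """Wrap body text in comment-delimited Section markers."""
    opening = (
        "<!--\n"
        "<Section>\n"
        "<Description>\n"
        f'<Metadata name="Title">{escape(title, quote=False)}</Metadata>\n'
        "</Description>\n"
        "-->\n"
    )
    closing = "<!--\n</Section>\n-->"
    return f"{opening}{body}\n{closing}"


# A subsection is nested by placing it inside the body of its parent,
# so that it appears before the parent's closing </Section> marker.
inner = wrap_section("Background", "<p>Subsection text</p>")
outer = wrap_section("Chapter 1", "<p>Section text</p>\n" + inner)
print(outer)
```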
Browsing Indexes
Types of Browsing Indexes
SectionList
AZList
AZSectionList
DateList
Hierarchy
Creating Browsing Indexes
Certain classifiers generate browsing structures that are hierarchical
They are useful for subject classifications and organization hierarchies
These hierarchies have to be supplied explicitly, using the flag -hfile <filename> when the classifier is defined in the collect.cfg file
For example:
classify Hierarchy -hfile sub.txt -metadata Subject -sort Title
Creating Browsing Indexes (cont.)
Note that sub.txt has to reside in the collection’s etc directory
Certain classifiers don’t require explicit hierarchies to be defined. For instance, the AZList, DateList and List classifiers generate a selection list from the corresponding metadata
classify List -metadata Howto
classify AZList -metadata Title
Creating Browsing Indexes (cont.)
Explicit hierarchies have to be defined according to the following format:
<identifier> <position in hierarchy> <name>
For example:
1 1 "General reference"
1.2 1.2 "Something else"
2 2 "…."
What this means is that a document will be assigned to the first classification if the metadata type associated with the current classifier has the value 1 within the document
Look at the demo collections for examples
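The three-column format can be read with a few lines of Python, which may help when checking a hierarchy file before building. This is a sketch under stated assumptions: the names `HierarchyEntry`, `parse_hierarchy` and `depth` are hypothetical, and treating the number of dotted parts in the position as the nesting level is an interpretation of the example, not something taken from Greenstone's code.

```python
import shlex
from typing import NamedTuple


class HierarchyEntry(NamedTuple):
    identifier: str
    position: str
    name: str


def parse_hierarchy(text):
    """Parse lines of the form: <identifier> <position in hierarchy> <name>."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # shlex.split keeps the double-quoted name together as one field.
        identifier, position, name = shlex.split(line)
        entries.append(HierarchyEntry(identifier, position, name))
    return entries


def depth(entry):
    """Nesting level implied by the dotted position (assumption: '1.2' is level 2)."""
    return len(entry.position.split("."))


SAMPLE = '''1 1 "General reference"
1.2 1.2 "Something else"
'''
entries = parse_hierarchy(SAMPLE)
for entry in entries:
    print("  " * (depth(entry) - 1) + entry.name)
```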
Creating Browsing Indexes (cont.)
Documents are treated internally as tree nodes by Greenstone
There are three types of nodes: VList, HList and DateList
For example, an AZList consists of a collection of VList nodes that represent documents
Arguments accepted by the various classifiers are listed on page 48 of the developer’s guide
Formatting Browsing Indexes
Each classifier has an implicit name from its position in the collect.cfg file. For example, the third classifier specified in the file is called CL3
Tags in the formatting strings:
[Text] – document text
[link] … [/link] – link to the document itself
[icon] – icon representing the resource
[metadata-name] – value of the metadata associated with this document
Formatting Browsing Indexes (cont.)
For example:
format CL4VList "<br>[link][Howto][/link]"
Conditional statements are supported in the formatting string. They are enclosed by the ‘{’ and ‘}’ characters in these formats:
{If}{[metadata], then clause, else clause}
{Or}{action, another-action, another-action, etc}
The {If} statement works like an if statement in most programming languages
The {Or} statement evaluates the items in the list in order and stops at the first one that is non-null. The value of that item is sent to the output
Formatting Browsing Indexes (cont.)
For example:
format VList "<td valign=top>[link]<img src=_httpprefix_/collect/cricket/images/[PlayerID].jpg border=0>[/link]</td><td>[link][Title][/link]</td><td>{If}{[HasAudio],<a href=[audioURL]><img src=_httpprefix_/collect/cricket/images/wav.jpg border=0></a>}</td>"
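The behaviour of {If} and {Or} can be mimicked in a few lines of Python, which may make the rules easier to reason about. This is a rough analogy and not Greenstone's implementation: `format_if` and `format_or` are hypothetical names, the metadata values are invented, and an unset metadata item is modelled as a missing or empty dictionary entry.

```python
def format_or(*candidates):
    """Return the first candidate that is set and non-empty, like {Or}."""
    for value in candidates:
        if value:
            return value
    return ""


def format_if(metadata_value, then_clause, else_clause=""):
    """Return then_clause when the metadata is set, otherwise else_clause."""
    return then_clause if metadata_value else else_clause


# Hypothetical metadata for one document: Title is empty, HasAudio is unset.
doc = {"Title": "", "Howto": "Fix a tyre"}

label = format_or(doc.get("Title"), doc.get("Howto"), "Untitled")
badge = format_if(doc.get("HasAudio"), "[audio]")
print(label)
print(repr(badge))
```

Here the label falls through the empty Title to the Howto value, and the badge is empty because HasAudio is not set and no else clause was given, which matches the {If} in the format string above that has only a then clause.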
Customizing the look and feel of Greenstone
The files involved are in the gsdl/macros directory:
Base.dm – global macros, such as custom buttons
English.dm – text for the corresponding language
Home.dm – the main GSDL page
Gsdl.dm – About Greenstone page
Style.dm – page layout
Query.dm – query form layout
Customizing the look and feel of Greenstone (cont.)
Background image (chalk.gif)
Base.dm:
_httpiconchalk_ {_httpimg_/chalk.gif}
_widthchalk_ {2000}
_heightchalk_ {10}
Custom Button
Base.dm:
_Genrewidth_ {_widthtGenrex_}
_imageGenre_ {_gsimage_(_httpbrowseGenre_,_httpicontGenreof_,_httpicontGenreon_,Genre,_textimageGenre_)}
_icontabGenregreen_ {<img src="_httpicontGenregr_" width=_widthtGenrex_ border=0>}
_icontabGenregreen_[v=1] {_texticontabGenregreen_}
Customizing the look and feel of Greenstone (cont.)
Document.dm
_textGenrepage_ {_texticonhGenre_}
_iconGenrepage_ {<img src="_httpiconhGenre_" width="_widthhGenre_" height="_heighthGenre_">}
_iconGenrepage_ [v=1] {<h2>_texticonhGenre_</h2>}
Customizing the look and feel of Greenstone (cont.)
English.dm
_textimageGenre_ {Browse by Genre}
_texticontabGenregreen_ {Genre}
_httpicontGenregr_ {_httpimg_/tGenregr.gif}
_httpicontGenreon_ {_httpimg_/tGenreon.gif}
_httpicontGenreof_ {_httpimg_/tGenreof.gif}
_widthtGenrex_ {114}
_texticonhGenre_ {Genre}
_httpiconhGenre_ {_httpimg_/h\_Genre.gif}
_widthhGenre_ {250}
_heighthGenre_ {57}
_textGenreshort_ {access publications by Genre}
_textGenrelong_ { <p>You can <i>access my documents by whatever I have defined</i> by pressing the <i>Genre</i> button. This brings up a list of documents. }