Building and Running ManifoldCF

2015/7/18 Building ManifoldCF

http://manifoldcf.apache.org/release/release-2.1/en_US/how-to-build-and-deploy.html#Running+ManifoldCF

Last Published: 05/05/2015 08:23:01

Building ManifoldCF
  Building overview
  Building the framework and the connectors using Apache Ant
    Building and testing the legacy Alfresco connector
    Building and testing the Alfresco Webscript connector
    Building and running the Documentum connector
    Building and running the FileNet connector
    Building and running the JDBC connector, including Oracle, MSSQL, MySQL, SQLServer, and Sybase JDBC drivers
    Building and running the jCIFS/Windows Shares connector
    Building and running the LiveLink connector
    Building the Meridio connector
    Building and running the SharePoint connector
    Running the Apache Solr output connector
    Running the ElasticSearch output connector
  Building the framework and the connectors using Apache Maven
    Preparation
    How to build
  Building ManifoldCF's Apache2 plugin
Running ManifoldCF
  Overview
  Binary organization
  Example deployments
    Quick-start single process model
    Single-process deployable war
    Simplified multi-process model using file-based synchronization
    Simplified multi-process model using ZooKeeper-based synchronization
    Command-driven multi-process model
  The connectors.xml configuration file
  Running connector-specific processes
  Database selection
    Configuring a PostgreSQL database
    Configuring a MySQL database
    Configuring an HSQLDB database
  The ManifoldCF configuration files
    properties.xml file properties
    Logging configuration file properties
Running the ManifoldCF Apache2 plug in
  Configuring the ManifoldCF Apache2 plug in
Running ManifoldCF with Apache Maven
Integrating ManifoldCF into another application
  Integrating the Quick Start example
  Integrating a multi-process setup
  Integrating ManifoldCF with a search engine

Building ManifoldCF

ManifoldCF consists of a framework, a set of connectors, and an optional Apache2 plug-in module. These can be built as follows.

Building overview

There are two ways to build ManifoldCF. The primary means of building (and the most supported) is via Apache Ant. The ant build is used to create ManifoldCF releases and to run tests, load tests, and UI tests. Maven is also supported, but for development builds only. Maven ManifoldCF builds have many restrictions and challenges and are of secondary priority for the development team.

The ManifoldCF framework is built without any dependencies on connector code. It consists of a set of jars, a family of web applications, and a number of java command classes. Connectors are then built that have well-defined dependencies on the framework modules. A properly built connector typically consists of:

One or more jar files, to be placed in the library area reserved for connector jars and their dependencies.

Possibly some java commands, which are meant to support or configure the connector in some way.

Possibly a connector-specific process or two, each requiring a distinct classpath, which usually serves to isolate the crawler-ui servlet, authority-service servlet, agents process, and any commands from problematic aspects of the client environment.

A recommended set of java "define" variables, which should be used consistently with all involved processes, e.g. the agents process, the application server running the authority-service and crawler-ui, and any commands. (This is historical; no connectors as of this writing have any of these any longer.)

An individual connector package will typically supply an output connector, or a transformation connector, or a mapping connector, or a repository connector, or sometimes both a repository connector and an authority connector. The main ant build script automatically assembles each individual connector's contribution into the overall package.

Building the framework and the connectors using Apache Ant

To build the ManifoldCF framework code, and the particular connectors you are interested in, you currently need to do the following:

1. Check out the desired release from https://svn.apache.org/repos/asf/manifoldcf/tags, or unpack the desired source distribution.

2. cd to the top-level directory.

3. EITHER: overlay the lib directory from the corresponding lib distribution (preferred, where possible), OR run "ant make-core-deps" to build the code dependencies. The latter is the only possibility if you are building from trunk, but it is not guaranteed to work for older releases.

4. Run "ant make-deps", to download LGPL and other open source but non-Apache compatible libraries.

5. Install proprietary build dependencies. See below for details.

6. Run "ant build".

7. Install desired dependent proprietary libraries. See below for details.
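
The Ant steps above can be strung together in a small wrapper. The following is only a sketch under stated assumptions: the function names run and build_manifoldcf and the DRY_RUN variable are hypothetical helpers, not part of ManifoldCF, and the sketch takes the "ant make-core-deps" branch of step 3 rather than overlaying a lib distribution. Only the three ant targets come from the instructions. It prints the commands by default, so nothing is executed until DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical helper: echo the command in dry-run mode (the default),
# execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around steps 2-6 of the Ant build.
# $1 is the directory holding the checked-out or unpacked source.
build_manifoldcf() {
  src_dir=$1
  run cd "$src_dir" || return 1          # step 2: top-level directory
  run ant make-core-deps || return 1     # step 3: build the code dependencies
  run ant make-deps || return 1          # step 4: LGPL and other non-Apache libraries
  # step 5: install proprietary build dependencies by hand before building
  run ant build || return 1              # step 6
}
```

Calling build_manifoldcf with the source directory prints the four commands in order. Proprietary libraries (steps 5 and 7) still have to be installed manually, as described in the connector-specific sections.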

If you do not run the ant "make-deps" target, and you supply NO LGPL or proprietary libraries, not all capabilities of ManifoldCF will be available. The framework itself and the following repository connectors will be built:

Alfresco Webscript connector

CMIS connector

EMC Documentum connector, built against a Documentum API stub

DropBox connector

Email connector

FileNet connector, built against a FileNet API stub

WGET-compatible filesystem connector

Generic XML repository connector

Google Drive connector

GridFS connector (mongoDB)

HDFS connector

JDBC connector, with just the PostgreSQL JDBC driver

Atlassian Jira connector

OpenText LiveLink connector, built against a LiveLink API stub

Meridio connector, built against modified Meridio API WSDLs and XSDs

RSS connector

Microsoft SharePoint connector, built against SharePoint API WSDLs

Webcrawler connector

Wiki connector

The following authority connectors will be built:

Active Directory authority

Alfresco Webscript authority

CMIS authority

EMC Documentum authority

Atlassian Jira authority

LDAP authority

OpenText LiveLink authority

Meridio authority, built against modified Meridio API WSDLs and XSDs

Null authority

Microsoft SharePoint/AD authority

Microsoft SharePoint/Native authority, built against SharePoint API WSDLs

The following output connectors will be built:

WGET-compatible filesystem output connector

MetaCarta GTS output connector

Apache Solr output connector

OpenSearchServer output connector

ElasticSearch output connector

HDFS output connector

Null output connector

The following transformation connectors will be built:

Field mapping transformation connector

Document filter transformation connector

Null transformation connector

Tika extractor transformation connector

The following mapping connectors will be built:

Regular-expression mapping connector

The dependencies and build limitations of each individual LGPL and proprietary connector are described in separate sections below.

Building and testing the legacy Alfresco connector

The legacy Alfresco connector requires the Alfresco Web Services Client provided by Alfresco in order to be built. Place this jar into the directory connectors/alfresco/lib-proprietary before you build. This will occur automatically if you execute the ant target "make-deps" from the ManifoldCF root directory.

To run integration tests for the connector, you have to copy the alfresco.war including H2 support, created by the Maven module test-materials/alfresco-4-war (using "mvn package" inside the folder), into the connectors/alfresco/test-materials-proprietary folder. Then use "ant test" or "mvn integration-test" with the standard build to execute the integration tests.

Building and testing the Alfresco Webscript connector

The Alfresco Webscript connector is built against an open-source Alfresco Indexer client, which requires a corresponding Alfresco Indexer plugin to be installed on your Alfresco instance. This Alfresco Indexer plugin is included with ManifoldCF distributions. Installation of the plugin should follow the standard Alfresco installation steps, as described here. See this page for configuration details, and for the plugin itself.

Building and running the Documentum connector

The Documentum connector requires EMC's DFC product in order to be run, but is built against a set of stub classes. The stubs mimic the class structure of DFC 6.0. If your DFC is newer, it is possible that the class structure of the DFC classes might have changed, and you may need to build the connector yourself.

If you need to supply DFC classes during build time, copy the DFC and dependent jars to the source directory connectors/documentum/lib-proprietary, and build using "ant build". The jars will be copied into the right place in your dist directory automatically.

For a binary distribution, just copy the DFC jars to processes/documentum-server/lib-proprietary instead.

If you have done everything right, you should be able to start the Documentum connector's registry and server processes, as per the instructions.

Building and running the FileNet connector

The FileNet connector requires IBM's FileNet P8 API jar in order to be run, but is usually built against a set of stub classes. The stubs mimic the class structure of FileNet P8 API 4.5. If your FileNet is newer, it is possible that the class structure of the API might have changed, and you may need to build the connector yourself.

If you need to supply your own Jace.jar at build time, copy it to the source directory connectors/filenet/lib-proprietary, and build using "ant build". The Jace.jar will be copied into the right place in your dist directory automatically.

If you do not wish to build, simply copy your Jace.jar and the other dependent jars from that installation into the distribution directory processes/filenet-server/lib-proprietary.

If correctly done, you will be able to start the FileNet connector's registry and server processes, as per the instructions.

Building and running the JDBC connector, including Oracle, MSSQL, MySQL, SQLServer, and Sybase JDBC drivers

The JDBC connector also knows how to work with Oracle, SQLServer, and Sybase JDBC drivers. In order to support these databases, start by placing the mysql-connector-java.jar and the jtds.jar in the lib-proprietary directory. The ant target "make-deps" will do this for you automatically. For Oracle, download the appropriate Oracle JDBC jar from the Oracle site, and copy it into the same directory before you build ManifoldCF.

Building and running the jCIFS/Windows Shares connector

To build this connector, you need to download jcifs.jar from http://jcifs.samba.org, and copy it into the connectors/jcifs/lib-proprietary directory before building. You can also just type "ant make-deps" from the root ManifoldCF directory and this step will be done for you.

If you have downloaded a binary distribution, place the jcifs.jar into the connector-lib-proprietary directory, and uncomment the Windows Shares line in the connectors.xml file.

Building and running the LiveLink connector

This connector needs OpenText's LAPI package in order to be run. It is usually built against a set of stubs. The stubs, however, mimic the class structure of LAPI 9.7.1. Later versions (such as 10.x) have a different class structure. Therefore, you may need to rebuild ManifoldCF against your lapi.jar, in order for the connector to work properly.

If you need to supply your own lapi.jar and llssl.jar at build time, copy them to the source directory connectors/livelink/lib-proprietary, and build using "ant build". The jars will be copied into the right place in your dist directory automatically.

If you do not wish to build, simply copy your lapi.jar and llssl.jar into the binary distribution's connector-lib-proprietary directory, and uncomment the LiveLink-related connector lines in the connectors.xml file.

Building the Meridio connector

The Meridio connector generates interface classes using checked-in wsdls and xsds originally obtained from an installed Meridio instance using disco.exe, and subsequently modified to work around limitations in Apache Axis. The disco.exe utility is installed as part of Microsoft Visual Studio, and is typically found under "c:\Program Files\Microsoft SDKs\Windows\V6.x\bin". If desired, you can obtain unmodified wsdls and xsds by interrogating the following Meridio web services:

http[s]://<meridio_server>/DMWS/MeridioDMWS.asmx

http[s]://<meridio_server>/RMWS/MeridioRMWS.asmx

Building and running the SharePoint connector

The SharePoint connector generates interface classes using checked-in wsdls originally obtained from an installed Microsoft SharePoint instance using disco.exe. The disco.exe utility is installed as part of Microsoft Visual Studio, and is typically found under "c:\Program Files\Microsoft SDKs\Windows\V6.x\bin". If desired, you can obtain unmodified wsdls by interrogating the following SharePoint web services:

http[s]://<server_name>/_vti_bin/Permissions.asmx

http[s]://<server_name>/_vti_bin/Lists.asmx

http[s]://<server_name>/_vti_bin/Dspsts.asmx

http[s]://<server_name>/_vti_bin/usergroup.asmx

http[s]://<server_name>/_vti_bin/versions.asmx

http[s]://<server_name>/_vti_bin/webs.asmx

Important: For SharePoint instances version 3.0 (2007) or higher, in order to run the connector, you also must deploy a custom SharePoint web service on the SharePoint instance you intend to connect to. This is required because Microsoft overlooked support for web-service-based access to file and folder security information when SharePoint 2007 was released. For SharePoint version 4.0 (2010), the service is even more critical, because backwards compatibility was not maintained and without this service no crawling can occur. SharePoint version 5.0 (2013) also requires a plugin; although its functionality is the same as for SharePoint 4.0, the binary you install is built against SharePoint 2013 resources rather than SharePoint 2010 resources, so there is a different distribution.

The versions of this service can be found in the distribution directory plugins/sharepoint. Pick the version appropriate for your SharePoint installation, and install it following the instructions in the file Installation Readme.txt, found in the corresponding directory.

Running the Apache Solr output connector

The Apache Solr output connector requires no special attention to build or run within ManifoldCF. However, in order for Apache Solr to be able to enforce document security, you must install and properly configure a plugin for Solr. This plugin is available for both Solr 3.x and for Solr 4.x, and can be used either as a query parser plugin, or as a search component. Additional index fields are also required to contain the necessary security information. Much more information can be found in the README.txt file in the plugins themselves.

The correct versions of the plugins are included in the plugins/solr directory of the main ManifoldCF distribution. You can also download updated versions of the plugins from the ManifoldCF download page. The compatibility matrix is as follows:

Apache ManifoldCF Solr 3.x and Solr 4.x plugin compatibility

ManifoldCF versions | Plugin version
0.1.x-1.4.x | 0.x
1.5.x | 1.x
>=1.6.x | 2.x

If the proper version of the plugin is not deployed on Solr, documents will not be properly secured. Thus, it is essential to verify that the proper plugin version has been deployed for the version of ManifoldCF you are using.
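
Since a mismatched plugin silently leaves documents unsecured, it can help to encode the matrix in a small check. The helper below is a sketch only: mcf_plugin_series is a hypothetical name, not a ManifoldCF command, and it does nothing more than map a ManifoldCF version string onto the three rows of the compatibility matrix.

```shell
#!/bin/sh
# Hypothetical helper: print the plugin series (0.x, 1.x or 2.x) that the
# compatibility matrix lists for a given ManifoldCF version such as "2.1".
mcf_plugin_series() {
  version=$1
  major=${version%%.*}      # text before the first dot
  rest=${version#*.}        # text after the first dot
  minor=${rest%%.*}         # second version component

  # Reject anything that is not numeric in its first two components.
  case $major in ''|*[!0-9]*) echo "bad version: $version" >&2; return 1 ;; esac
  case $minor in ''|*[!0-9]*) echo "bad version: $version" >&2; return 1 ;; esac

  if [ "$major" -ge 2 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 6 ]; }; then
    echo "2.x"              # >=1.6.x
  elif [ "$major" -eq 1 ] && [ "$minor" -eq 5 ]; then
    echo "1.x"              # 1.5.x
  else
    echo "0.x"              # 0.1.x-1.4.x
  fi
}
```

For example, mcf_plugin_series 2.1 prints 2.x. The deployed plugin's own version still has to be read from the plugin itself; the README.txt in the plugin remains the authoritative source.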

Running the ElasticSearch output connector

The ElasticSearch output connector requires no special attention to build or run within ManifoldCF. However, in order for ElasticSearch to be able to enforce document security, you must install, properly configure, and code against a toolkit plugin for ElasticSearch. Additional index fields are also required to contain the necessary security information. Much more information can be found in the README.txt file in the plugin itself.

The correct version of the plugin is included in the plugins/elasticsearch directory of the main ManifoldCF distribution. You can also download updated versions of the plugin from the ManifoldCF download page. The compatibility matrix is as follows:

Apache ManifoldCF ElasticSearch plugin compatibility

ManifoldCF versions | Plugin version
0.1.x-1.4.x | 0.x
1.5.x | 1.x
>=1.6.x | 2.x

If the proper version of the plugin is not deployed and properly integrated, documents will not be properly secured. Thus, it is essential to verify that the proper plugin version has been deployed for the version of ManifoldCF you are using.

To work with ManifoldCF, your ElasticSearch instance must also have the appropriate indexes created. Here are some simple steps for creating an ElasticSearch index, using the CURL utility:

% curl -XPUT 'http://localhost:9200/manifoldcf'
% curl -XPUT 'http://localhost:9200/manifoldcf/attachment/_mapping' -d '
{
  "attachment" : {
    "_source" : {
      "excludes" : [ "file" ]
    },
    "properties" : {
      "allow_token_document" : { "type" : "string" },
      "allow_token_parent" : { "type" : "string" },
      "allow_token_share" : { "type" : "string" },
      "attributes" : { "type" : "string" },
      "createdOn" : { "type" : "string" },
      "deny_token_document" : { "type" : "string" },
      "deny_token_parent" : { "type" : "string" },
      "deny_token_share" : { "type" : "string" },
      "lastModified" : { "type" : "string" },
      "shareName" : { "type" : "string" },
      "file" : {
        "type" : "attachment",
        "path" : "full",
        "fields" : {
          "file" : {
            "store" : true,
            "term_vector" : "with_positions_offsets",
            "type" : "string"
          }
        }
      }
    }
  }
}'

This command creates an index called manifoldcf with a mapping named attachment, which has some generic fields for access tokens and a field file which makes use of the ElasticSearch attachment mapper plugin. It is configured for highlighting ("term_vector" : "with_positions_offsets").

The following part prevents the source json from being saved in the index, which reduces the index size significantly. Be aware that you shouldn't do this if you will need to re-index data on the ElasticSearch side or you need access to the whole document:

"_source" : { "excludes" : [ "file" ] },

Building the framework and the connectors using Apache Maven

ManifoldCF includes some support for building jars under Maven. Apache Ant is considered to be ManifoldCF's primary build system, so your mileage with Maven may vary.

The biggest limitation of the current Maven build is that it does not support any of the proprietary connectors or the multi-process model of execution. The build includes only the Apache-licensed and LGPL-licensed connectors, thus avoiding conditional compilation, and executes under Maven using only the Quick Start example.

Canned versions of all configuration files are included as resources. If you want to change the configuration in any way, you will need to rebuild with Maven accordingly.

Preparation

No special preparation is required, other than to have access to the Apache Maven repository.

How to build

Building is straightforward. In the ManifoldCF root, type:

mvn clean install

This should generate all the necessary artifacts to run with, and also run the Hsqldb-based tests.

To build and skip only the integration tests, type:

mvn clean install -DskipITs

When you have the default package installed locally in your Maven repository, to only build ManifoldCF artifacts, type:

mvn clean package

NOTE: Due to current limitations in the ManifoldCF Maven poms, you MUST run a complete "mvn clean install" as the first step. You cannot skip steps, or the build will fail.

Building ManifoldCF's Apache2 plugin

To build the mod-authz-annotate plugin, you need to start with a Unix system that has the apache2 development tools installed on it, plus the curl development package (from http://curl.haxx.se or elsewhere). Then, cd to mod-authz-annotate, and type "make". The build will produce a file called mod-authz-annotate.so, which should be copied to the appropriate Apache2 directory so it can be used as a plugin.

Running ManifoldCF

Overview

ManifoldCF consists of several components. These are enumerated below:

A database, which is where ManifoldCF keeps all of its configuration and state information, usually PostgreSQL

A synchronization directory, which is how ManifoldCF coordinates activity among its various processes

An agents process, which is the process that actually crawls documents and ingests them

A crawler-ui servlet, which presents the UI users interact with to configure and control the crawler

An authority-service servlet, which responds to requests for authorization tokens, given a user name

An api-service servlet, which responds to REST API requests

These underlying components can be packaged in many ways. For example, the three servlets can be deployed in separate war files as separate web applications. One may also deploy all three servlets in one combined web application, and also include the agents process.

Binary organization

Whether you build ManifoldCF yourself, or download a binary distribution, you will need to know what is what in the build result. If you build ManifoldCF yourself, the binary build result can be found in the subdirectory dist. In a binary distribution, the contents of the distribution are the contents of the dist directory. These contents are described below.

Distribution directories and files

dist file/directory | Meaning
connectors.xml | an xml file describing the connectors that should be registered
connector-lib | jars for all the connectors, referred to by properties.xml
connector-lib-proprietary | proprietary jars for all the connectors, referred to by properties.xml; not included in binary release
obfuscation-utility | a utility to obfuscate passwords, for inclusion in properties.xml fields
lib | jars for all of the examples, referenced by the example scripts
lib-proprietary | proprietary jars for all of the examples, referenced by the proprietary example scripts
processes | scripts, classpath jars, and -D switch values needed for the required connector-specific processes
script-engine | jars and scripts for running the ManifoldCF script interpreter
example | a jetty-based example that runs in a single process (except for any connector-specific processes), excluding all proprietary libraries
example-proprietary | a jetty-based example that runs in a single process (except for any connector-specific processes), including proprietary libraries; not included in binary release
multiprocess-file-example | scripts and jars for an example that uses the multiple process model using file-based synchronization, excluding all proprietary libraries
multiprocess-file-example-proprietary | scripts and jars for an example that uses the multiple process model using file-based synchronization, including proprietary libraries; not included in binary release
multiprocess-zk-example | scripts and jars for an example that uses the multiple process model using ZooKeeper-based synchronization, excluding all proprietary libraries
multiprocess-zk-example-proprietary | scripts and jars for an example that uses the multiple process model using ZooKeeper-based synchronization, including proprietary libraries; not included in binary release
web | app-server deployable web applications (wars), excluding all proprietary libraries
web-proprietary | app-server deployable web applications (wars), including proprietary libraries; not included in binary release
doc | javadocs for framework and all included connectors
plugins | pre-built integration components to deploy on target systems, e.g. for Solr

If you downloaded the binary distribution, you may notice that the connector-lib-proprietary directory contains only a number of <connector>-README.txt files. This is because under Apache licensing rules, incompatibly-licensed jars may not be redistributed. Each such <connector>-README.txt describes the jars that you need to add to the connector-lib-proprietary directory in order to get the corresponding connector working. You will also then need to uncomment the appropriate entries in the connectors.xml file accordingly to enable the connector for use.
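
A quick way to see whether any proprietary jars have been added yet is to list the jar files in that directory. The helper below is a minimal sketch: mcf_list_proprietary_jars is a hypothetical name, and the default directory assumes the command is run from the root of the distribution.

```shell
#!/bin/sh
# Hypothetical helper: print the names of the jar files found in a
# connector-lib-proprietary directory. Returns non-zero when the directory
# holds no jars (for instance only <connector>-README.txt files).
mcf_list_proprietary_jars() {
  dir=${1:-connector-lib-proprietary}
  status=1
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue     # glob did not match anything
    printf '%s\n' "${jar##*/}"    # strip the directory part
    status=0
  done
  return $status
}
```

A non-zero return on a fresh binary distribution is expected. The check says nothing about whether the right jars are present; the README.txt files remain the reference for that.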

NOTE: The prebuilt binary distribution cannot, at this time, include support for MySQL. Nor can the JDBC Connector access MySQL, MSSQL, SyBase, or Oracle databases in that distribution. In order to use these JDBC drivers, you must build ManifoldCF yourself. Start by downloading the drivers and placing them in the lib-proprietary directory. The command ant download-dependencies will do most of this for you, with the exception of the Oracle JDBC driver.

The directory titled processes includes separate processes which must be started in order for the associated connector to function. The number of processes subdirectories produced may vary, because optional individual connectors may or may not supply processes that must be run to support the connector. For each of the processes subdirectories above, any scripts that pertain to that connector-supplied process will be placed in the root level of the subdirectory. The supplied scripts for a process generally take care of building an appropriate classpath and setting necessary -D switches. (Note: none of the current connectors require special -D switches at this time.) If you need to construct a classpath by hand, it is important to remember that "more" is not necessarily "better". The process deployment strategy implied by the build structure has been carefully thought out to avoid jar conflicts. Indeed, several connectors are structured using multiple processes precisely for that reason.

The proprietary libraries required by the secondary processes in the processes subdirectories should be in the directory processes/xxx/lib-proprietary. These jars are not included in the binary distribution, and you will need to supply them in order to make the process work. A README.txt file is placed in each lib-proprietary directory describing what needs to be provided there.

The plugins directory contains components you may need to deploy on the target system to make the associated connector function correctly. For example, the Solr connector includes plug-in classes for enforcing ManifoldCF security on Solr 3.x and 4.x. See the README file in each directory for detailed instructions on how to deploy the components.

Inside the example directory, you will find everything you need to fire up ManifoldCF in a single-process model under Jetty. Everything is included so that all you need to do is change to that directory, and start it using the command <java> -jar start.jar. This is described in more detail later, and is the recommended way for beginners to try out ManifoldCF. The directory example-proprietary contains an equivalent example that includes proprietary connectors and jars. This is the standard place to start if you build ManifoldCF yourself.

Example deployments

There are many different ways to run ManifoldCF out-of-the-box. These are enumerated below:

Quick-start single process model

Single-process deployable war

Simplified multi-process model

Command-driven multi-process model

Each way has advantages and disadvantages. For example, single-process models limit the flexibility of deploying ManifoldCF components. Multi-process models require that inter-process synchronization be properly configured. If you are just starting out with ManifoldCF, we suggest you try the quick-start single process model first, since that is the easiest.

Quick-start single process model

You can run most of ManifoldCF in a single process, for evaluation and convenience. This single-process version uses Jetty to handle its web applications, and Hsqldb as an embedded database. All you need to do to run this version of ManifoldCF is to follow the Ant-based build instructions above, and then:

cd example
start[.bat|.sh]

In the quick-start model, all database initialization and connector registration takes place automatically whenever ManifoldCF is started (at the cost of some startup delay). The crawler UI can be found at http://<host>:8345/mcf-crawler-ui. The authority service can be found at http://<host>:8345/mcf-authority-service/UserACLs. The programmatic API is at http://<host>:8345/mcf-api-service.
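
Once the quick-start process is up, the three endpoints can be probed from a shell. This is a sketch, not part of ManifoldCF: mcf_quickstart_urls is a hypothetical helper that only prints the URLs quoted above for a given host, and the probing loop is left as a comment because it needs a running instance.

```shell
#!/bin/sh
# Hypothetical helper: print the quick-start URLs for a host
# (defaults to localhost), one per line.
mcf_quickstart_urls() {
  host=${1:-localhost}
  base="http://$host:8345"
  printf '%s\n' \
    "$base/mcf-crawler-ui" \
    "$base/mcf-authority-service/UserACLs" \
    "$base/mcf-api-service"
}

# Possible use against a running instance: print the HTTP status of each URL.
#   for url in $(mcf_quickstart_urls localhost); do
#     curl -s -o /dev/null -w "%{http_code} $url\n" "$url"
#   done
```

Which status codes count as healthy is not specified here; redirects or authentication prompts may be normal for the UI, so treat the output as a reachability check only.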

You can stop the quick­start ManifoldCF at any time using ^C, or by using the script stop[.bat|.sh]

Bear in mind that Hsqldb is not as full-featured a database as is PostgreSQL. This means that any performance testing you may do against the quick start example may not be applicable to a full installation. Furthermore, embedded Hsqldb only permits one process at a time to be connected to its databases, so you cannot use any of the ManifoldCF commands (as described below) while the quick-start ManifoldCF is running.

Another caveat that you will need to be aware of with the quick-start version of ManifoldCF is that it in no way removes the need for you to run any separate processes that individual connectors require. Specifically, the Documentum and FileNet connectors require processes to be independently started in order to function. You will need to read about these connector-specific processes below in order to use the corresponding connectors. Scripts for running these processes can be found in the directories named processes/xxx.

Single-process deployable war

Under the distribution directory web/war, there is a war file called mcf-combined-service.war. This web application contains the exact same functionality as the quick-start example, but bundled up as a single war instead. An example script is provided to run this web application under Jetty. You can execute the script as follows:

cd example
start-combined[.sh|.bat]

The combined web service presents the crawler UI at the root path for the web application, which is http://<host>:8345/mcf/. The authority service functionality can be found at http://<host>:8345/mcf/UserACLs, similar to the quick-start example. However, the programmatic API service has a path other than the root: http://<host>:8345/mcf/api/.

The script that starts the combined-service web application uses the same database instance (Hsqldb by default) as does the quick-start, and the same properties.xml file. The same caveats about required individual connector processes also apply as they do for the quick-start example.

Running single-process combined war example using Tomcat

In order to run the ManifoldCF single-process combined war example under Tomcat, you will need to take the following steps:

1. Modify the Tomcat startup script, or use the Tomcat service administration client, to set a Java "-Dorg.apache.manifoldcf.configfile" switch to point to the example's properties.xml file.

2. Start Tomcat.

3. Deploy and start the mcf-combined-service web application, preferably using the Tomcat administration client.

Simplified multi-process model using file-based synchronization

ManifoldCF can also be deployed in a simplified multi-process model which uses files to synchronize processes. Inside the multiprocess-file-example directory, you will find everything you need to do this. (The multiprocess-file-example-proprietary directory is similar but includes proprietary material and is available only if you build ManifoldCF yourself.) Below is a list of what you will find in this directory.

File-based multiprocess example files and directories (multiprocess-file-example)

File/directory                  Meaning
web                             Web applications that should be deployed on Tomcat or the equivalent, plus recommended application server -D switch names and values
processes                       Classpath jars that should be included in the class path for all non-connector-specific processes, along with -D switches, using the same convention as described for Tomcat, above
properties.xml                  An example ManifoldCF configuration file, in the right place for the multiprocess script to find it
logging.ini                     An example ManifoldCF logging configuration file, in the right place for the properties.xml to find it
syncharea                       An example ManifoldCF synchronization directory, which must be writable in order for multiprocess ManifoldCF to work
logs                            Where the ManifoldCF logs get written to
start-database[.sh|.bat]        Script to start the HSQLDB database
initialize[.sh|.bat]            Script to create the database instance, create all database tables, and register connectors
start-webapps[.sh|.bat]         Script to start Jetty with the ManifoldCF web applications deployed
start-agents[.sh|.bat]          Script to start the (first) agents process
start-agents-2[.sh|.bat]        Script to start a second agents process
stop-agents[.sh|.bat]           Script to stop all running agents processes cleanly
lock-clean[.sh|.bat]            Script to clean up dirty locks (run only when all webapps and processes are stopped)

Initializing the database and running

If you run the file-based multiprocess model, after you first start the database (using start-database[.sh|.bat]), you will need to initialize the database before you start the agents process or use the crawler UI. To do this, run the initialize[.sh|.bat] script. Then start the web applications (using start-webapps[.sh|.bat]) and the agents process (using start-agents[.sh|.bat]), and optionally the second agents process (using start-agents-2[.sh|.bat]).

Running multiprocess file-based example using Tomcat

In order to run the ManifoldCF multiprocess file-based example under Tomcat, you will need to take the following steps:

1. Start the database (using start-database[.sh|.bat])

2. Initialize the database (using initialize[.sh|.bat])

3. Start the agents process (using start-agents[.sh|.bat], and optionally start-agents-2[.sh|.bat])

4. Modify the Tomcat startup script, or use the Tomcat service administration client, to set a Java "-Dorg.apache.manifoldcf.configfile" switch to point to the example's properties.xml file.

5. Start Tomcat.

6. Deploy and start the mcf-crawler-ui, mcf-authority-service, and mcf-api-service web applications, preferably using the Tomcat administration client.

Simplified multi-process model using ZooKeeper-based synchronization

ManifoldCF can be deployed in a simplified multi-process model which uses Apache ZooKeeper to synchronize processes. Inside the multiprocess-zk-example directory, you will find everything you need to do this. (The multiprocess-zk-example-proprietary directory is similar but includes proprietary material and is available only if you build ManifoldCF yourself.) Below is a list of what you will find in this directory.

ZooKeeper-based multiprocess example files and directories (multiprocess-zk-example)

File/directory                  Meaning
web                             Web applications that should be deployed on Tomcat or the equivalent, plus recommended application server -D switch names and values
processes                       Classpath jars that should be included in the class path for all non-connector-specific processes, along with -D switches, using the same convention as described for Tomcat, above
properties.xml                  An example ManifoldCF configuration file, in the right place for the multiprocess script to find it
properties-global.xml           An example ManifoldCF shared configuration file, in the right place for the setglobalproperties script to find it
logging.ini                     An example ManifoldCF logging configuration file, in the right place for the properties.xml to find it
zookeeper                       The example ZooKeeper storage directory, which must be writable in order for ZooKeeper to work
logs                            Where the ManifoldCF logs get written to
runzookeeper[.sh|.bat]          Script to run a ZooKeeper server instance
setglobalproperties[.sh|.bat]   Script to initialize ZooKeeper with properties from properties-global.xml
start-database[.sh|.bat]        Script to start the HSQLDB database
initialize[.sh|.bat]            Script to create the database instance, create all database tables, and register connectors
start-webapps[.sh|.bat]         Script to start Jetty with the ManifoldCF web applications deployed
start-agents[.sh|.bat]          Script to start the (first) agents process
start-agents-2[.sh|.bat]        Script to start a second agents process
stop-agents[.sh|.bat]           Script to stop all running agents processes cleanly

Initializing the database and running

If you run the ZooKeeper-based multiprocess example, you must follow these steps:

1. Start ZooKeeper (using the runzookeeper[.sh|.bat] script)

2. Initialize the ManifoldCF shared configuration data (using setglobalproperties[.sh|.bat])

3. Start the database (using start-database[.sh|.bat])

4. Initialize the database (using initialize[.sh|.bat])

5. Start the agents process (using start-agents[.sh|.bat], and optionally start-agents-2[.sh|.bat])

6. Start the web applications (using start-webapps[.sh|.bat])
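The ordering of these steps matters, so it can help to keep it in one place. The following Python sketch only computes the script paths in the order given above; it does not launch anything. The function name startup_scripts, its parameters, and the idea of passing the example directory as an argument are assumptions of this sketch, not part of ManifoldCF.

```python
import os

# Startup order for the ZooKeeper-based example, as listed in the steps above.
ZK_STARTUP_STEPS = [
    "runzookeeper",
    "setglobalproperties",
    "start-database",
    "initialize",
    "start-agents",
    "start-webapps",
]


def startup_scripts(example_dir, windows=False, second_agents=False):
    """Return the script paths to run, in order.

    example_dir is wherever multiprocess-zk-example lives (an assumption of
    this sketch). Scripts that run a server, such as runzookeeper, may keep
    running in the foreground, so this helper only computes the order.
    """
    ext = ".bat" if windows else ".sh"
    steps = list(ZK_STARTUP_STEPS)
    if second_agents:
        # The optional second agents process goes right after the first one.
        steps.insert(steps.index("start-agents") + 1, "start-agents-2")
    return [os.path.join(example_dir, name + ext) for name in steps]


if __name__ == "__main__":
    for path in startup_scripts("multiprocess-zk-example", second_agents=True):
        print(path)
```

Under Tomcat, the last entry (start-webapps) would be replaced by the Tomcat-specific steps described in the next section.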

Running multiprocess ZooKeeper example using Tomcat

In order to run the ManifoldCF ZooKeeper example under Tomcat, you will need to take the following steps:

1. Start ZooKeeper (using the runzookeeper[.sh|.bat] script)

2. Initialize the ManifoldCF shared configuration data (using setglobalproperties[.sh|.bat])

3. Start the database (using start-database[.sh|.bat])

4. Initialize the database (using initialize[.sh|.bat])

5. Start the agents process (using start-agents[.sh|.bat], and optionally start-agents-2[.sh|.bat])

6. Modify the Tomcat startup script, or use the Tomcat service administration client, to set a Java "-Dorg.apache.manifoldcf.configfile" switch to point to the example's properties.xml file.

7. Start Tomcat.

8. Deploy and start the mcf-crawler-ui, mcf-authority-service, and mcf-api-service web applications, preferably using the Tomcat administration client.

Command-driven multi-process model

The most generic way of deploying ManifoldCF involves calling ManifoldCF operations using scripts. There are a number of Java classes among the ManifoldCF classes that are intended to be called directly, to perform specific actions in the environment or in the database. These classes are usually invoked from the command line, with appropriate arguments supplied, and are thus considered to be ManifoldCF commands. Basic functionality supplied by these command classes is as follows:

Create/Destroy the ManifoldCF database instance

Start/Stop the agents process

Register/Unregister an agent class (there's currently only one included)

Register/Unregister an output connector

Register/Unregister a transformation connector

Register/Unregister a repository connector

Register/Unregister an authority connector

Register/Unregister a mapping connector

Clean up synchronization directory garbage resulting from an ungraceful interruption of a ManifoldCF process

Query for certain kinds of job-related information

Individual connectors may contribute additional command classes and processes to this picture.

The multiprocess command execution scripts are delivered in the processes subdirectory. The script for executing commands is processes/executecommand[.sh|.bat]. This script requires two environment variables to be set before execution: JAVA_HOME, and MCF_HOME, which should point to ManifoldCF's home execution directory, where the properties.xml file is found.

The basic steps required to set up and run ManifoldCF in command-driven file-based multi-process mode are as follows:

Install PostgreSQL or MySQL. The PostgreSQL JDBC driver included with ManifoldCF is known to work with version 9.1, so that version is the currently recommended one. If you want to use MySQL, the ant "download-dependencies" build target will fetch the appropriate MySQL JDBC driver.

Configure the database for your environment; the default configuration is acceptable for testing and experimentation.

Create the database instance (see commands below)

Initialize the database instance (see commands below)

Register the pull agent (org.apache.manifoldcf.crawler.system.CrawlerAgent, see below)

Register your connectors and authorities (see below)

Install a Java application server, such as Tomcat.

Deploy the war files from web/war, except for mcf-combined-service.war, to your application server (see below).

Set the starting environment variables for your app server to include any -D commands found in web/define. The -D commands should be of the form "-D<file name>=<file contents>". You will also need a "-Dorg.apache.manifoldcf.configfile=<properties file>" define option, or the equivalent, in the application server's JVM startup in order for ManifoldCF to be able to locate its configuration file.

Use the processes/executecommand[.bat|.sh] command to execute the appropriate commands from the next section below, being sure to first set the JAVA_HOME and MCF_HOME environment variables properly.

Start any supporting processes that result from your build. (Some connectors, such as Documentum and FileNet, have auxiliary processes you need to run to make these connectors functional.)

Start your application server.

Start the ManifoldCF agents process.

At this point, you should be able to interact with the ManifoldCF UI, which can be accessed via the mcf-crawler-ui web application.

The detailed list of commands is presented below.

Commands

After you have created the necessary configuration files, you will need to initialize the database, register the "pull-agent" agent, and then register your individual connectors. ManifoldCF provides a set of commands for performing these actions, and others as well. The classes implementing these commands are specified below.

Core Command Class                                          Arguments               Function
org.apache.manifoldcf.core.DBCreate                         dbuser [dbpassword]     Create ManifoldCF database instance
org.apache.manifoldcf.core.DBDrop                           dbuser [dbpassword]     Drop ManifoldCF database instance
org.apache.manifoldcf.core.LockClean                        None                    Clean out synchronization directory
org.apache.manifoldcf.core.Obfuscate                        string                  Obfuscate a string, for use as an obfuscated parameter value

Agents Command Class                                        Arguments               Function
org.apache.manifoldcf.agents.Install                        None                    Create ManifoldCF agents tables
org.apache.manifoldcf.agents.Uninstall                      None                    Remove ManifoldCF agents tables
org.apache.manifoldcf.agents.Register                       classname               Register an agent class
org.apache.manifoldcf.agents.UnRegister                     classname               Un-register an agent class
org.apache.manifoldcf.agents.UnRegisterAll                  None                    Un-register all current agent classes
org.apache.manifoldcf.agents.SynchronizeAll                 None                    Un-register all registered agent classes that can't be found
org.apache.manifoldcf.agents.RegisterOutput                 classname description   Register an output connector class
org.apache.manifoldcf.agents.UnRegisterOutput               classname               Un-register an output connector class
org.apache.manifoldcf.agents.UnRegisterAllOutputs           None                    Un-register all current output connector classes
org.apache.manifoldcf.agents.SynchronizeOutputs             None                    Un-register all registered output connector classes that can't be found
org.apache.manifoldcf.agents.RegisterTransformation         classname description   Register a transformation connector class
org.apache.manifoldcf.agents.UnRegisterTransformation       classname               Un-register a transformation connector class
org.apache.manifoldcf.agents.UnRegisterAllTransformations   None                    Un-register all current transformation connector classes
org.apache.manifoldcf.agents.SynchronizeTransformations     None                    Un-register all registered transformation connector classes that can't be found
org.apache.manifoldcf.agents.AgentRun                       None                    Main agents process class
org.apache.manifoldcf.agents.AgentStop                      None                    Stops the running agents process

Crawler Command Class                                       Arguments               Function
org.apache.manifoldcf.crawler.Register                      classname description   Register a repository connector class
org.apache.manifoldcf.crawler.UnRegister                    classname               Un-register a repository connector class
org.apache.manifoldcf.crawler.UnRegisterAll                 None                    Un-register all repository connector classes
org.apache.manifoldcf.crawler.SynchronizeConnectors         None                    Un-register all registered repository connector classes that can't be found
org.apache.manifoldcf.crawler.ExportConfiguration           filename [passcode]     Export crawler configuration to a file
org.apache.manifoldcf.crawler.ImportConfiguration           filename [passcode]     Import crawler configuration from a file

NOTE: If you add a passcode as a second argument to the ExportConfiguration command class, the exported file will be encrypted using the AES algorithm. This can be useful to prevent repository passwords from being stored in clear text. In order to use this functionality, you must add a salt value to your configuration file. The same passcode, along with the salt value, is used to decrypt the file with the ImportConfiguration command class. See the documentation for the commands and properties to find the correct arguments and settings.

Authorization Domain Command Class                          Arguments               Function
org.apache.manifoldcf.authorities.RegisterDomain            domainname description  Register an authorization domain
org.apache.manifoldcf.authorities.UnRegisterDomain          domainname              Un-register an authorization domain

User Mapping Command Class                                  Arguments               Function
org.apache.manifoldcf.authorities.RegisterMapper            classname description   Register a mapping connector class
org.apache.manifoldcf.authorities.UnRegisterMapper          classname               Un-register a mapping connector class
org.apache.manifoldcf.authorities.UnRegisterAllMappers      None                    Un-register all mapping connector classes
org.apache.manifoldcf.authorities.SynchronizeMappers        None                    Un-register all registered mapping connector classes that can't be found

Authority Command Class                                     Arguments               Function
org.apache.manifoldcf.authorities.RegisterAuthority         classname description   Register an authority connector class
org.apache.manifoldcf.authorities.UnRegisterAuthority       classname               Un-register an authority connector class
org.apache.manifoldcf.authorities.UnRegisterAllAuthorities  None                    Un-register all authority connector classes
org.apache.manifoldcf.authorities.SynchronizeAuthorities    None                    Un-register all registered authority connector classes that can't be found

Remember that you need to include all the jars under multiprocess-file-example/processes/lib in the classpath whenever you run one of these commands. Fortunately, there are scripts which do this for you. These can be found in multiprocess-file-example/processes/executecommand[.sh|.bat]. The scripts require some environment variables to be set, such as MCF_HOME and JAVA_HOME, and expect the configuration file to be found at MCF_HOME/properties.xml.
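To make the pieces concrete, here is a minimal Python sketch of how a command invocation could be assembled. It assumes that the executecommand script sits in <MCF_HOME>/processes and that it takes the command class name followed by that command's arguments; both are assumptions based on the description above, not a verified interface. The helper names build_command and command_environment, the sample path, and the sample database user and password are hypothetical.

```python
import os


def build_command(mcf_home, command_class, *args, windows=False):
    """Build the argument list for one ManifoldCF command invocation.

    Assumes executecommand lives in <MCF_HOME>/processes and accepts the
    command class followed by its arguments (an assumption of this sketch).
    """
    script = "executecommand.bat" if windows else "executecommand.sh"
    return [os.path.join(mcf_home, "processes", script), command_class] + list(args)


def command_environment(mcf_home, java_home, base_env=None):
    """Return a copy of the environment with MCF_HOME and JAVA_HOME set."""
    env = dict(os.environ if base_env is None else base_env)
    env["MCF_HOME"] = mcf_home
    env["JAVA_HOME"] = java_home
    return env


if __name__ == "__main__":
    # Dry run: print the first few setup commands instead of executing them.
    home = "multiprocess-file-example"  # placeholder path
    for cls, cls_args in [
        ("org.apache.manifoldcf.core.DBCreate", ("dbuser", "dbpassword")),
        ("org.apache.manifoldcf.agents.Install", ()),
        ("org.apache.manifoldcf.agents.Register",
         ("org.apache.manifoldcf.crawler.system.CrawlerAgent",)),
    ]:
        print(" ".join(build_command(home, cls, *cls_args)))
```

To actually run a command, the list returned by build_command could be passed to subprocess.run together with env=command_environment(...). Printing the commands first is a cheap way to check the order before touching the database.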

Deploying the mcf-crawler-ui, mcf-authority-service, and mcf-api-service web applications

If you built ManifoldCF using ant, then the ant build will have constructed four war files for you under web/war. You should ignore the mcf-combined-service war in this directory for this deployment model. If you intend to run ManifoldCF in multiprocess mode, you will need to deploy the other web applications on your application server. There is no requirement that the mcf-crawler-ui, mcf-authority-service, and mcf-api-service web applications be deployed on the same instance of the application server. With the current architecture of ManifoldCF, they must be deployed on the same physical server, however.

For each of the application servers involved with ManifoldCF, you must set the following define, so that the ManifoldCF web applications can locate the configuration file:

-Dorg.apache.manifoldcf.configfile=<configuration file path>

Running the agents process

The agents process is the process that actually performs the crawling for ManifoldCF. Start this process by running the command "org.apache.manifoldcf.agents.AgentRun". This class will run until stopped by invoking the command "org.apache.manifoldcf.agents.AgentStop". It is highly recommended that you stop the process in this way. You may also stop the process using a SIGTERM signal, but "kill -9" or the equivalent is NOT recommended, because that may result in dangling locks in the ManifoldCF synchronization directory. (If you have to, clean up these locks by shutting down all ManifoldCF processes, including the application server instances that are running the web applications, and invoking the command "org.apache.manifoldcf.core.LockClean".)

The connectors.xml configuration file

The quick-start, combined, and simplified multi-process sample deployments of ManifoldCF have their own configuration file, called connectors.xml, which is used to register the available connectors in the database. The file has this basic format:

<?xml version="1.0" encoding="UTF-8" ?>
<connectors>
  (clauses)
</connectors>

The following tags are available to specify your connectors and authorization domains:

<repositoryconnector name="pretty_name" class="connector_class"/>

<authorityconnector name="pretty_name" class="connector_class"/>

<mappingconnector name="pretty_name" class="connector_class"/>

<outputconnector name="pretty_name" class="connector_class"/>

<transformationconnector name="pretty_name" class="connector_class"/>

<authorizationdomain name="pretty_name" domain="domain_name"/>

The connectors.xml file typically has some connectors commented out, namely the ones built with stubs, which require you to supply a third-party library in order for the connector to run. If you build ManifoldCF yourself, the example-proprietary, multiprocess-file-example-proprietary, and multiprocess-zk-example-proprietary directories instead use connectors-proprietary.xml. The connectors you build against the proprietary libraries you supply will not have their connectors-proprietary.xml tags commented out.
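For readers who generate this file from a script, the following Python sketch builds a connectors.xml document from a list of clauses using the standard library. The function name build_connectors_xml is made up for this example, and the connector class and domain values shown are placeholders, not real ManifoldCF classes; only the element and attribute names come from the tag list above.

```python
import xml.etree.ElementTree as ET


def build_connectors_xml(entries):
    """Build a connectors.xml document as a string.

    entries is a list of (tag, attributes) pairs, where tag is one of the
    element names listed above and attributes is a dict of its attributes.
    """
    root = ET.Element("connectors")
    for tag, attrs in entries:
        ET.SubElement(root, tag, attrs)
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8" ?>\n' + body


if __name__ == "__main__":
    # Placeholder values; substitute the connector classes you actually built.
    print(build_connectors_xml([
        ("repositoryconnector",
         {"name": "Example repository", "class": "com.example.ExampleRepositoryConnector"}),
        ("authorizationdomain",
         {"name": "Example domain", "domain": "ExampleDomain"}),
    ]))
```

The output is written on a single line after the XML declaration; that is valid XML, but you may prefer to indent it before checking it into a deployment.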

Running connector-specific processes

Connector-specific processes require the classpath for their invocation to include all the jars that are in the corresponding processes/<process_name> directory. The Documentum and FileNet connectors are the only two connectors that currently require additional processes. Start these processes using the commands listed below, and stop them with SIGTERM (or ^C, if they are running in a shell).

Connector   Process                        Main class                                              Script name (relative to dist)
Documentum  processes/documentum-server    org.apache.manifoldcf.crawler.server.DCTM.DCTM          processes/documentum-server/run[.sh|.bat]
Documentum  processes/documentum-registry  org.apache.manifoldcf.crawler.registry.DCTM.DCTM        processes/documentum-registry/run[.sh|.bat]
FileNet     processes/filenet-server       org.apache.manifoldcf.crawler.server.filenet.Filenet    processes/filenet-server/run[.sh|.bat]
FileNet     processes/filenet-registry     org.apache.manifoldcf.crawler.registry.filenet.Filenet  processes/filenet-registry/run[.sh|.bat]

The registry process in all cases must be started before the corresponding server process, or the server process will report an error. (It will, however, retry after some period of time.) The scripts all require an MCF_HOME environment variable pointing to the place where properties.xml is found, as well as a JAVA_HOME environment variable pointing to the JDK. The server scripts also require other environment variables, consistent with the needs of the DFC or the FileNet API respectively. For example, DFC requires the DOCUMENTUM environment variable to be set, while the FileNet server script requires the WASP_HOME environment variable.

It is important to understand that the scripts work by building a classpath out of all jars that get copied into the lib and lib-proprietary directories underneath each process during the ant build. The lib-proprietary jars cannot be distributed in the binary version of ManifoldCF, so if you use this option you will still need to copy them there yourself for the processes to run. If you build ManifoldCF yourself, these jars are copied from the lib-proprietary directories underneath the documentum or filenet connector directories. For the server startup scripts to work properly, the lib-proprietary directories should have all of the jars needed to allow the API code to function.

Database selection

You have a variety of open-source databases to choose from when deploying ManifoldCF. The supported databases each have their own strengths and weaknesses, and are listed below:

PostgreSQL (preferred)

MySQL (preferred)

MariaDB (not yet evaluated)

HSQLDB

You can select the database of your choice by setting the appropriate properties in the applicable properties.xml file. The choice of database is largely orthogonal to the choice of deployment model. The ManifoldCF deployment examples provided can thus be readily altered to use the database you desire. The details and caveats of each choice are described below.

Configuring a PostgreSQL database

Despite having an internal architecture that cleanly abstracts from specific database details, ManifoldCF is currently fairly specific to PostgreSQL. There are a number of reasons for this.

ManifoldCF uses the database for its document queue, which places a significant load on it. The back-end database is thus a significant factor in ManifoldCF's performance. But, in exchange, ManifoldCF benefits enormously from the underlying ACID properties of the database.

The strategy for getting optimal query plans from the database is not abstracted. For example, PostgreSQL 8.3+ is very sensitive to certain statistics about a database table, and in some cases will not generate a performant plan if the statistics are even slightly inaccurate. So, for PostgreSQL, the database table must be analyzed very frequently, to avoid catastrophically bad plans. Fortunately, PostgreSQL is pretty good at doing analysis quickly. Oracle, on the other hand, takes a very long time to perform analysis, but its plans are much less sensitive.

PostgreSQL always does a sequential scan in order to count the number of rows in a table, while other databases return this efficiently. This has affected the design of the ManifoldCF UI.

The choice of query form influences the query plan. Ideally, this would not be true, but for both PostgreSQL and (say) Oracle, it is.

PostgreSQL has a high degree of parallelism and lack of internal single-threadedness.

ManifoldCF has been tested against versions 8.3.7, 8.4.5, 9.1, 9.2, and 9.3 of PostgreSQL. We recommend the following configuration settings to work optimally with ManifoldCF:

A default database encoding of UTF-8

postgresql.conf settings as described in the table below

pg_hba.conf settings to allow password access for TCP/IP connections from ManifoldCF

A maintenance strategy involving cronjob-style vacuuming, rather than PostgreSQL autovacuum

postgresql.conf parameters

postgresql.conf parameter     Tested value
standard_conforming_strings   on
shared_buffers                1024MB
checkpoint_segments           300
maintenance_work_mem          2MB
tcpip_socket                  true
max_connections               400
checkpoint_timeout            900
datestyle                     ISO,European
autovacuum                    off

Note well: The standard_conforming_strings parameter setting is important to prevent any possibility of SQL injection attacks. While ManifoldCF uses parameterized queries in almost all cases, when it does do string quoting it presumes that the SQL standard for quoting is adhered to. It is in general good practice to set this parameter when working with PostgreSQL for this reason.

A note about PostgreSQL database maintenance

PostgreSQL's architecture causes it to accumulate dead tuples in its data files, which do not interfere with its performance but do bloat the database over time. The usage pattern of ManifoldCF is such that it can cause significant bloat to occur to the underlying PostgreSQL database in only a few days, under sufficient load. PostgreSQL has a feature to address this bloat, called vacuuming. This comes in three varieties: autovacuum, manual vacuum, and manual full vacuum.

We have found that PostgreSQL's autovacuum feature is inadequate under such conditions, because it not only fights for database resources pretty much all the time, but it falls further and further behind as well. PostgreSQL's in-place manual vacuum functionality is a bit better, but is still much, much slower than actually making a new copy of the database files, which is what happens when a manual full vacuum is performed.

Dead-tuple bloat also occurs in indexes in PostgreSQL, so tables that have had a lot of activity may benefit from being reindexed at the time of maintenance.

We therefore recommend periodic, scheduled maintenance operations instead, consisting of the following:

VACUUM FULL VERBOSE;

REINDEX DATABASE <the_db_name>;

During maintenance, PostgreSQL locks tables one at a time. Nevertheless, the crawler UI may become unresponsive for some operations, such as when counting outstanding documents on the job status page. ManifoldCF thus has the ability to check for the existence of a file prior to such sensitive operations, and will display a useful "maintenance in progress" message if that file is found. This allows a user to set up a maintenance system that gives a ManifoldCF user adequate feedback on the overall status of the system.

Configuring a MySQL database

MySQL is not quite as fast as PostgreSQL, but it is a relatively close second in performance tests. Nevertheless, the ManifoldCF team does not have a large amount of experience with this database at this time. More details will be added to this section as information and experience become available.

Configuring an HSQLDB database

HSQLDB's performance seems closely tied to how much of the database can be actually held in memory. Performance at this time is about half that of PostgreSQL.

HSQLDB can be used with ManifoldCF in either an embedded fashion (which only works with single-process deployments), or in external fashion, with a database instance running in a separate process. See the properties.xml property descriptions for configuration details.

The ManifoldCF configuration files

Currently, ManifoldCF requires two configuration files: the main configuration property file, and the logging configuration file.

properties.xml file properties

The properties.xml property file path can be specified by the system property "org.apache.manifoldcf.configfile". If not specified through a -D switch, its name is presumed to be <user_home>/lcf/properties.xml. The property file is XML, of the following basic form:

<?xml version="1.0" encoding="UTF-8" ?>
<configuration>
  (clauses)
</configuration>

The properties.xml file allows properties to be specified. A property clause has the form:

<property name="property_name" value="property_value"/>

One of the optional properties is the name of the logging configuration file. This property's name is "org.apache.manifoldcf.logconfigfile". If not present, the logging configuration file will be assumed to be <user_home>/manifoldcf/logging.ini. The logging configuration file is a standard commons-logging property file, and should be formatted accordingly.

Note that all properties described below can also be specified on the command line, via a -D switch. If both methods of setting the property are used, the -D switch value will override the property file value.
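The precedence rule can be illustrated with a short Python sketch that reads the property clauses and then applies overrides standing in for -D switch values. This mirrors the documented behavior only; it is not ManifoldCF's own loading code, and the function name load_properties and the SAMPLE document are made up for this example.

```python
import xml.etree.ElementTree as ET

# A small sample in the documented <configuration>/<property> form.
SAMPLE = """<configuration>
  <property name="org.apache.manifoldcf.login.name" value="admin"/>
  <property name="org.apache.manifoldcf.ui.maxstatuscount" value="500000"/>
</configuration>"""


def load_properties(xml_text, overrides=None):
    """Read <property name=... value=.../> clauses and apply overrides.

    overrides stands in for values given through -D switches; as described
    above, a -D value wins over the value in the property file.
    """
    root = ET.fromstring(xml_text)
    props = {}
    for clause in root.findall("property"):
        props[clause.get("name")] = clause.get("value")
    props.update(overrides or {})
    return props


if __name__ == "__main__":
    merged = load_properties(
        SAMPLE, {"org.apache.manifoldcf.ui.maxstatuscount": "1000"})
    for name in sorted(merged):
        print(name, "=", merged[name])
```

Keeping the override step last makes the precedence explicit: any value supplied on the command line replaces the value from the file, and properties that appear only in the file are left untouched.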

The following table describes the configuration property file properties, and what they do:

properties.xml properties

| Property | Required? | Function |
| --- | --- | --- |
| org.apache.manifoldcf.login.name | No | Crawler UI login user ID (defaults to "admin") |
| org.apache.manifoldcf.login.password | No | Crawler UI login user password (defaults to "admin") |
| org.apache.manifoldcf.login.password.obfuscated | No | Obfuscated crawler UI login user password (defaults to "admin") |
| org.apache.manifoldcf.login.apiname | No | API login user ID (defaults to "") |
| org.apache.manifoldcf.login.apipassword | No | API login user password (defaults to "") |
| org.apache.manifoldcf.login.apipassword.obfuscated | No | Obfuscated API login user password (defaults to "") |
| org.apache.manifoldcf.crawleruiwarpath | Yes, for Jetty | Location of Crawler UI war |
| org.apache.manifoldcf.authorityservicewarpath | Yes, for Jetty | Location of Authority Service war |
| org.apache.manifoldcf.apiservicewarpath | Yes, for Jetty | Location of API Service war |
| org.apache.manifoldcf.usejettyparentclassloader | Yes, for Jetty | true for single-process example, false for multiprocess example. |
| org.apache.manifoldcf.connectorsconfigurationfile | No | Location of connectors.xml file, for QuickStart, so ManifoldCF can register connectors. |
| org.apache.manifoldcf.dbsuperusername | No | Database superuser name, for QuickStart, so ManifoldCF can create database instance. |
| org.apache.manifoldcf.dbsuperuserpassword | No | Database superuser password, for QuickStart, so ManifoldCF can create database instance. |
| org.apache.manifoldcf.dbsuperuserpassword.obfuscated | No | Obfuscated database superuser password, for QuickStart, so ManifoldCF can create database instance. |
| org.apache.manifoldcf.ui.maxstatuscount | No | The maximum number of documents ManifoldCF will try to count for the job status display. Defaults to 500000. |
| org.apache.manifoldcf.databaseimplementationclass | No | Specifies the class to use to implement database access. Default is a built-in Hsqldb implementation. Supported choices are: org.apache.manifoldcf.core.database.DBInterfacePostgreSQL, org.apache.manifoldcf.core.database.DBInterfaceMySQL, org.apache.manifoldcf.core.database.DBInterfaceMariaDB, org.apache.manifoldcf.core.database.DBInterfaceHSQLDB |
| org.apache.manifoldcf.postgresql.hostname | No | PostgreSQL server host name, or localhost if not specified. |
| org.apache.manifoldcf.postgresql.port | No | PostgreSQL server port, or standard port if not specified. |
| org.apache.manifoldcf.postgresql.ssl | No | Set to "true" for ssl communication with PostgreSQL. |
| org.apache.manifoldcf.mysql.server | No | The MySQL or MariaDB server name. Defaults to 'localhost'. |
| org.apache.manifoldcf.mysql.client | No | The MySQL or MariaDB client property. Defaults to 'localhost'. You may want to set this to '%' for a multi-machine setup. |
| org.apache.manifoldcf.hsqldbdatabasepath | No | Absolute or relative path to HSQLDB database; default is '.'. |
| org.apache.manifoldcf.hsqldbdatabaseprotocol | Yes, for remote HSQLDB connection | The HSQLDB JDBC protocol; choices are 'hsql', 'http', or 'https'. Default is blank (which means an embedded instance) |
| org.apache.manifoldcf.hsqldbdatabaseserver | Yes, for remote HSQLDB connection | The HSQLDB remote server name. |
| org.apache.manifoldcf.hsqldbdatabaseport | No | The HSQLDB remote server port. |
| org.apache.manifoldcf.hsqldbdatabaseinstance | No | The HSQLDB remote database instance name. |
| org.apache.manifoldcf.lockmanagerclass | No | Specifies the class to use to implement synchronization. Default is either file-based synchronization or in-memory synchronization, using the org.apache.manifoldcf.core.lockmanager.LockManager class. Options include org.apache.manifoldcf.core.lockmanager.BaseLockManager, org.apache.manifoldcf.core.FileLockManager, and org.apache.manifoldcf.core.lockmanager.ZooKeeperLockManager. |
| org.apache.manifoldcf.synchdirectory | Yes, if file-based synchronization class is specified | Specifies the path of a synchronization directory. All ManifoldCF process owners must have read/write privileges to this directory. |
| org.apache.manifoldcf.zookeeper.connectstring | Yes, if ZooKeeper-based synchronization class is specified | Specifies the ZooKeeper connection string, consisting of comma-separated hostname:port pairs. |
| org.apache.manifoldcf.zookeeper.sessiontimeout | No | Specifies the ZooKeeper session timeout, if ZooKeeperLockManager is specified. Defaults to 2000. |
| org.apache.manifoldcf.database.maxhandles | No | Specifies the maximum number of database connection handles that will be pooled. Recommended value is 200. |
| org.apache.manifoldcf.database.handletimeout | No | Specifies the maximum time a handle is to live before it is presumed dead. Recommend a value of 604800, which is the maximum allowable. |
| org.apache.manifoldcf.database.connectiontracking | No | True or false. When "true", will track all allocated database connection handles, and will dump an allocation stack trace when the pool is exhausted. Useful for diagnosing connection leaks. |
| org.apache.manifoldcf.logconfigfile | No | Specifies location of logging configuration file. |
| org.apache.manifoldcf.database.name | No | Describes database name for ManifoldCF; defaults to "dbname" if not specified. |
| org.apache.manifoldcf.database.username | No | Describes database user name for ManifoldCF; defaults to "manifoldcf" if not specified. |
| org.apache.manifoldcf.database.password | No | Describes database user's password for ManifoldCF; defaults to "local_pg_password" if not specified. |
| org.apache.manifoldcf.database.password.obfuscated | No | Obfuscated database user's password for ManifoldCF; defaults to "local_pg_password" if not specified. |
| org.apache.manifoldcf.crawler.threads | No | Number of crawler worker threads created. Suggest a value of 30. |
| org.apache.manifoldcf.crawler.expirethreads | No | Number of crawler expiration threads created. Suggest a value of 10. |
| org.apache.manifoldcf.crawler.cleanupthreads | No | Number of crawler cleanup threads created. Suggest a value of 10. |
| org.apache.manifoldcf.crawler.deletethreads | No | Number of crawler delete threads created. Suggest a value of 10. |
| org.apache.manifoldcf.crawler.historycleanupinterval | No | Milliseconds to retain history records. Default is 0. Zero means "forever". |
| org.apache.manifoldcf.misc | No | Miscellaneous debugging output. Legal values INFO, WARN, or DEBUG. |
| org.apache.manifoldcf.db | No | Database debugging output. Legal values INFO, WARN, or DEBUG. |
| org.apache.manifoldcf.lock | No | Lock management debugging output. Legal values INFO, WARN, or DEBUG. |
| org.apache.manifoldcf.cache | No | Cache management debugging output. Legal values INFO, WARN, or DEBUG. |
| org.apache.manifoldcf.agents | No | Agent management debugging output. Legal values INFO, WARN, or DEBUG. |
| org.apache.manifoldcf.perf | No | Performance logging debugging output. Legal values INFO, WARN, or DEBUG. |
| org.apache.manifoldcf.crawlerthreads | No | Log crawler thread activity. Legal values INFO, WARN, or DEBUG. |
| org.apache.manifoldcf.hopcount | No | Log hopcount tracking activity. Legal values INFO, WARN, or DEBUG. |
| org.apache.manifoldcf.jobs | No | Log job activity. Legal values INFO, WARN, or DEBUG. |
| org.apache.manifoldcf.connectors | No | Log connector activity. Legal values INFO, WARN, or DEBUG. |
| org.apache.manifoldcf.scheduling | No | Log document scheduling activity. Legal values INFO, WARN, or DEBUG. |
| org.apache.manifoldcf.authorityconnectors | No | Log authority connector activity. Legal values INFO, WARN, or DEBUG. |
| org.apache.manifoldcf.authorityservice | No | Log authority service activity. Legal values INFO, WARN, or DEBUG. |
| org.apache.manifoldcf.salt | Yes, if file encryption is used | Specify the salt value to be used for encrypting the file to which the crawler configuration is exported. |
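
Several properties are required only when another property selects a particular mode. As a rough illustration (a sketch, not part of ManifoldCF), the following Python function checks three of those conditional requirements against a dictionary of property values. The function name is hypothetical, and matching the lock manager by the end of its class name is a simplification chosen for the sketch.

```python
P = "org.apache.manifoldcf."

def missing_required(props):
    """Return the conditionally required properties that are absent."""
    missing = []
    lock_class = props.get(P + "lockmanagerclass", "")
    # Match on the simple class name so the sketch does not depend on the package.
    if lock_class.endswith("ZooKeeperLockManager"):
        if P + "zookeeper.connectstring" not in props:
            missing.append(P + "zookeeper.connectstring")
    if lock_class.endswith("FileLockManager"):
        if P + "synchdirectory" not in props:
            missing.append(P + "synchdirectory")
    # A non-blank protocol means a remote HSQLDB connection is requested.
    if props.get(P + "hsqldbdatabaseprotocol", ""):
        if P + "hsqldbdatabaseserver" not in props:
            missing.append(P + "hsqldbdatabaseserver")
    return missing

# Made-up configuration that selects ZooKeeper locking and remote HSQLDB
# but omits the properties those choices require.
example = {
    P + "lockmanagerclass": "org.apache.manifoldcf.core.lockmanager.ZooKeeperLockManager",
    P + "hsqldbdatabaseprotocol": "hsql",
}
print(missing_required(example))
```

An empty configuration passes this check, since every property it examines is optional until a lock manager class or remote protocol is chosen.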

The following table describes 'advanced' configuration property file properties. They shouldn't need to be changed but provide a greater level of customization:

Advanced properties.xml properties

| Property | Required? | Default | Function |
| --- | --- | --- | --- |
| org.apache.manifoldcf.crawler.repository.store_history | No | true | If you do not require reports from within ManifoldCF, set this to false to disable logging to the repository history (the reports will still run, but they will not contain any content). This can increase throughput and reduce the rate of growth of the database. |
| org.apache.manifoldcf.db.postgres.analyze.<tablename> | No | 2000 | For PostgreSQL, specify how many changes should be carried out before carrying out an 'ANALYZE' on the specified table. |
| org.apache.manifoldcf.db.postgres.reindex.<tablename> | No | 250000 | For PostgreSQL, specify how many changes should be carried out before carrying out a 'REINDEX' on the specified table. |
| org.apache.manifoldcf.db.mysql.analyze.<tablename> | No | 2000 | For MySQL or MariaDB, specify how many changes should be carried out before carrying out an 'ANALYZE' on the specified table. |
| org.apache.manifoldcf.ui.maxstatuscount | No | 500000 | Set the upper limit for the precise document count to be returned on the 'Status and Job Management' page. |

The configuration file can also specify a set of directories which will be searched for connector jars. The directive that adds to the classpath is:

<libdir path="path"/>

Note that the path can be relative. For the purposes of path resolution, "." means the directory in which the properties.xml file is itself located.
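
The resolution rule can be sketched in Python as follows. This is an illustration only: the file locations, the root element name "configuration", and the function name are assumptions, and the sketch uses POSIX-style paths.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath

# Illustrative configuration; the root element name and paths are assumptions.
CONFIG_XML = """
<configuration>
  <libdir path="./connector-lib"/>
  <libdir path="/opt/shared/lib"/>
</configuration>
"""

def resolve_libdirs(properties_xml_path, xml_text):
    """Resolve each <libdir> path against the directory holding properties.xml."""
    base = PurePosixPath(properties_xml_path).parent
    root = ET.fromstring(xml_text)
    # Joining with an absolute path yields that absolute path unchanged.
    return [str(base / d.get("path")) for d in root.iter("libdir")]

for directory in resolve_libdirs("/opt/mcf/example/properties.xml", CONFIG_XML):
    print(directory)
```

With these sample inputs the relative entry resolves under /opt/mcf/example, while the absolute entry is left as written.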

Logging configuration file properties

The logging.ini file contains Apache commons-logging properties in a standard Java <name>=<value> format. The way the ManifoldCF logging output is formatted is controlled through this file, as are any loggers that ManifoldCF doesn't explicitly define (e.g. loggers for Apache commons-httpclient). Other resources are therefore best suited to describe the parameters that can be used and to what effect.

Running the ManifoldCF Apache2 plug-in

The ManifoldCF Apache2 plugin, mod-authz-annotate, is designed to take an authenticated principal (e.g. from mod-auth-kerb) and query a set of authority services for access tokens using an HTTP request. These access tokens are then passed to a (not included) search engine UI, which can use them to help compose a search that properly excludes content that the user is not supposed to see.

The list of authority services so queried is configured in Apache's httpd.conf file. This project includes only one such service: the Java authority service, which uses authority connections defined in the crawler UI to obtain appropriate access tokens.

In order for mod-authz-annotate to be used, it must be placed into Apache2's extensions directory, and configured appropriately in the httpd.conf file.

Note: The ManifoldCF project now contains support for converting a Kerberos principal to a list of Active Directory SIDs. This functionality is contained in the Active Directory Authority. The following connectors are expected to make use of this authority:

FileNet

CIFS

SharePoint

Configuring the ManifoldCF Apache2 plug-in

mod-authz-annotate understands the following httpd.conf commands:

| Command | Meaning | Values |
| --- | --- | --- |
| AuthzAnnotateEnable | Turn on/off the plugin | "On", "Off" |
| AuthzAnnotateAuthority | Point to an authority service that supports ACL queries, but not ID queries | The authority URL |
| AuthzAnnotateACLAuthority | Point to an authority service that supports ACL queries, but not ID queries | The authority URL |
| AuthzAnnotateIDAuthority | Point to an authority service that supports ID queries, but not ACL queries | The authority URL |
| AuthzAnnotateIDACLAuthority | Point to an authority service that supports both ACL queries and ID queries | The authority URL |

Running ManifoldCF with Apache Maven

If you build ManifoldCF with Maven, then you will need to run ManifoldCF under Maven. You currently don't get a lot of options here; the only model offered is the QuickStart single process model. To run it, all you need to do is:

cd framework/jetty-runner
mvn exec:exec

Integrating ManifoldCF into another application

ManifoldCF can be integrated into another application through a variety of methods. We'll cover these below.

Integrating the Quick Start example

The Quick Start example can readily be integrated into a single-process application, by using the support methods found in the org.apache.manifoldcf.jettyrunner.ManifoldCFJettyRunner class. For the web application components of ManifoldCF, you can either use this class to start them under Jetty, or you can choose to deploy them yourself. Please note, however, that if you start the ManifoldCF agents process within a web application, you are effectively not running a single-process version of ManifoldCF anymore, because each web application will effectively have its own set of static classes.

If you want to try the single-process integration, you should learn what you need by reading the Javadoc for the ManifoldCFJettyRunner class.

Integrating a multi-process setup

In a multi-process setup, all of the ManifoldCF processes might as well exist on their own. You can learn how to programmatically start the agents process by looking at the code in the AgentRun command class, as described above. Similarly, the command classes that register connectors are very small and should be easy to understand.

Integrating ManifoldCF with a search engine

ManifoldCF's Authority Service is designed to allow maximum flexibility in integrating ManifoldCF security with search engines. The service receives a user identity (as a set of authorization domain/user name tuples), and produces a set of tokens. It also returns a summary of the status of all authorities that were involved in the assembly of the set of tokens, as a nicety. A search engine user interface could thus signal the user when the results they might be seeing are incomplete, and why.

The Authority Service expects the following arguments, passed as URL arguments and properly URL encoded:

Authority Service URL parameters

| Authority Service URL parameter | Meaning |
| --- | --- |
| username | the username, if there is only one authorization domain |
| domain | the optional authorization domain if there is only one authorization domain (defaults to empty string) |
| username_XX | username number XX, where XX is an integer starting at zero |
| domain_XX | authorization domain XX, where XX is an integer starting at zero |
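
As a minimal sketch of how a client might assemble such a request, the Python function below builds a URL-encoded query string from a list of (domain, username) tuples. The endpoint URL, the sample identities, and the function name are assumptions for the illustration; substitute the address of your own Authority Service deployment.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; replace with the URL of your Authority Service.
AUTHORITY_URL = "http://localhost:8345/mcf-authority-service/UserACLs"

def build_authority_url(base_url, identities):
    """Build a request URL from a list of (domain, username) tuples."""
    if not identities:
        raise ValueError("at least one identity is required")
    if len(identities) == 1:
        # Single authorization domain: plain username, optional domain.
        domain, username = identities[0]
        params = [("username", username)]
        if domain:
            params.append(("domain", domain))
    else:
        # Several domains: numbered parameters starting at zero.
        params = []
        for i, (domain, username) in enumerate(identities):
            params.append((f"username_{i}", username))
            params.append((f"domain_{i}", domain))
    # urlencode takes care of URL-encoding each value.
    return base_url + "?" + urlencode(params)

print(build_authority_url(AUTHORITY_URL, [("", "user@example.com")]))
print(build_authority_url(AUTHORITY_URL, [("AD", "alice"), ("LDAP", "bob")]))
```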

Access tokens and authority statuses are returned in the HTTP response separated by newline characters. Each line has a prefix as follows:

Authority Service response prefixes

| Authority Service response prefix | Meaning |
| --- | --- |
| TOKEN: | An access token |
| AUTHORIZED: | The name of an authority that found the user to be authorized |
| UNREACHABLEAUTHORITY: | The name of an authority that was found to be unreachable or unusable |
| UNAUTHORIZED: | The name of an authority that found the user to be unauthorized |
| USERNOTFOUND: | The name of an authority that could not find the user |

It is important to remember that only the "TOKEN:" lines actually matter for security. Even if any of the error conditions apply, the set of tokens returned by the Authority Service will be correctly supplied in order to apply appropriate security to documents being searched.
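
A custom integration might separate the tokens from the status lines along these lines. This is a sketch under stated assumptions: the function name is hypothetical and the sample response, including the token and authority names, is made up for the illustration.

```python
STATUS_PREFIXES = ("AUTHORIZED", "UNREACHABLEAUTHORITY", "UNAUTHORIZED", "USERNOTFOUND")

def parse_authority_response(body):
    """Split a response body into access tokens and per-status authority names."""
    tokens = []
    statuses = {prefix: [] for prefix in STATUS_PREFIXES}
    for line in body.splitlines():
        # Split at the first colon only, so values may contain colons.
        prefix, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or unprefixed lines
        if prefix == "TOKEN":
            tokens.append(value)
        elif prefix in statuses:
            statuses[prefix].append(value)
    return tokens, statuses

# Made-up response body for illustration.
SAMPLE = (
    "AUTHORIZED:ActiveDirectory\n"
    "TOKEN:S-1-5-21-1234\n"
    "TOKEN:S-1-5-32-545\n"
    "UNREACHABLEAUTHORITY:Documentum\n"
)

tokens, statuses = parse_authority_response(SAMPLE)
print(tokens)
print(statuses["UNREACHABLEAUTHORITY"])
```

Only the tokens list would be applied to the search query; the statuses could be used to tell the user which authorities did not contribute and that results may therefore be incomplete.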

If you choose to deploy a search-engine plugin supplied by the Apache ManifoldCF project (for example, the Solr plugin), you will not need to know any of the above, since part of the plugin's purpose is to communicate with the Authority Service and apply the access tokens that are returned to the search query automatically. Some plugins, such as the ElasticSearch plugin, are more or less like toolkits, but still hide most of the above from the integrator. In a more highly customized system, however, you may need to develop your own code which interacts with the Authority Service in order to meet your goals.

Copyright © 2009-2015 The Apache Software Foundation. Apache ManifoldCF, ManifoldCF, Apache Forrest, Forrest, Apache Solr, Solr, Apache, the Apache feather logo, the Apache Forrest logo, and the Apache ManifoldCF logo are trademarks of The Apache Software Foundation. Documentum and EMC are trademarks of EMC Corporation. SharePoint, Windows, and Microsoft are trademarks of Microsoft, Inc. FileNet P8 and IBM are trademarks of IBM, Inc. LiveLink and OpenText are trademarks of OpenText, Inc. QBase, MetaCarta, and GTS are trademarks of QBase, Inc. Meridio and Autonomy are trademarks of Hewlett Packard, Inc. Alfresco is a trademark of Alfresco Software, Inc. Jira is a trademark of Atlassian, Inc.