

Supplementary Material for Hadoop Recognition of Biomedical Named Entity Using Conditional Random Fields

Kenli Li, Wei Ai*, Zhuo Tang, Fan Zhang, Lingang Jiang, Keqin Li, and Kai Hwang

1 CONDITIONAL RANDOM FIELDS MODEL

This section provides a more detailed description of the CRF approach.

1.1 Conditional Random Fields

In Figure 1, we give a simple but concrete example to explain the two-phase biomedical named entity recognition using CRF. In Figure 1(a), B-protein in the output denotes the beginning of an entity of the protein class, and I-protein denotes the rest of that entity. Since the final recognition result is in BIO format, the phrase "peri-kappa B factor" in the input is identified as a protein. Figure 1(b) gives the major components of the CRF model.
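The BIO convention described above can be illustrated with a short sketch that groups tagged tokens back into entities. The tokenization of "peri-kappa B factor", the surrounding words, and the function name `bio_to_entities` are assumptions made for illustration; they are not taken from Figure 1.

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity text, entity class) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity being built, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I- tag of the same class continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag with no matching B-) ends any open entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


# Hypothetical tokenization and context words, for illustration only.
tokens = ["the", "peri-kappa", "B", "factor", "binds"]
tags = ["O", "B-protein", "I-protein", "I-protein", "O"]
entities = bio_to_entities(tokens, tags)
print(entities)
```

In this sketch an I- tag that does not follow a matching B- tag is treated as outside any entity; other conventions are possible.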

1.2 L-BFGS Algorithm for CRF

As shown in Figure 2, L-BFGS first sets the parameters to an initial value $\vec{\lambda}_0$. It then repeatedly improves the parameter estimates: $\vec{\lambda}_1, \vec{\lambda}_2, \ldots$. The whole process is a loop with a given number $L$ of iterations. To move from $\vec{\lambda}_t$ to $\vec{\lambda}_{t+1}$, L-BFGS finds the search direction $\vec{P}_t$, decides the step length $a_t$, and moves in the direction $\vec{P}_t$. The key to finding the search direction $\vec{P}_t$ is to calculate the gradient vector $\nabla L_t$.
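The loop described above can be sketched in a few lines of Python. This is a minimal, generic L-BFGS under stated assumptions, not the paper's implementation. It minimizes a toy quadratic rather than a CRF objective, uses a simple backtracking line search to choose the step length, and keeps a short history of five update pairs. All function and variable names are hypothetical.

```python
from collections import deque


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lbfgs(f, grad, lam0, num_iters=50, history=5, tol=1e-10):
    """Minimize f starting from lam0; returns the final parameter vector."""
    lam = list(lam0)
    g = grad(lam)
    pairs = deque(maxlen=history)  # recent (s, y) = (parameter change, gradient change)
    for _ in range(num_iters):
        if dot(g, g) < tol:
            break
        # Two-loop recursion: approximate (inverse Hessian) * gradient.
        q = list(g)
        alphas = []
        for s, y in reversed(pairs):  # newest pair first
            rho = 1.0 / dot(y, s)
            a = rho * dot(s, q)
            alphas.append((a, rho))
            q = [qi - a * yi for qi, yi in zip(q, y)]
        if pairs:
            s, y = pairs[-1]
            gamma = dot(s, y) / dot(y, y)  # initial Hessian scaling
            q = [gamma * qi for qi in q]
        for (s, y), (a, rho) in zip(pairs, reversed(alphas)):  # oldest pair first
            b = rho * dot(y, q)
            q = [qi + (a - b) * si for qi, si in zip(q, s)]
        p = [-qi for qi in q]  # search direction P_t

        # Backtracking line search for the step length a_t (Armijo condition).
        step, f0, slope = 1.0, f(lam), dot(g, p)
        while step > 1e-12 and f([l + step * pi for l, pi in zip(lam, p)]) > f0 + 1e-4 * step * slope:
            step *= 0.5

        new_lam = [l + step * pi for l, pi in zip(lam, p)]
        new_g = grad(new_lam)
        s = [a - b for a, b in zip(new_lam, lam)]
        y = [a - b for a, b in zip(new_g, g)]
        if dot(s, y) > 1e-12:  # keep only pairs with positive curvature
            pairs.append((s, y))
        lam, g = new_lam, new_g
    return lam


# Toy objective (an assumption for illustration): minimum at (1, -2).
def toy_f(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2


def toy_grad(x):
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]


solution = lbfgs(toy_f, toy_grad, [0.0, 0.0])
print(solution)
```

For CRF training, `f` would be the negative log-likelihood and `grad` its gradient, so each iteration is dominated by the cost of the gradient computation.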

1.3 Viterbi Algorithm for CRF

We give a simple example in Figure 3 to explain the detailed steps of Algorithm 2. Figure 3(a) shows Steps 1–3, and Figure 3(b) illustrates Step 4.
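As a companion to this example, the following is a minimal sketch of a generic Viterbi decoder over per-position label scores and label-to-label transition scores. It is not necessarily identical to Algorithm 2. The score values, the two-label setup, and the names `alpha` (best score so far) and `phi` (back-pointer) are assumptions chosen for illustration.

```python
def viterbi(node, trans):
    """Return the highest-scoring label sequence and its score.

    node[t][j]  : non-negative score of label j at position t
    trans[i][j] : non-negative score of moving from label i to label j
    """
    n_pos, n_labels = len(node), len(node[0])
    alpha = [[0.0] * n_labels for _ in range(n_pos)]  # best score ending in label j at t
    phi = [[0] * n_labels for _ in range(n_pos)]      # best previous label for (t, j)

    # Initialization: the first position has no predecessor.
    for j in range(n_labels):
        alpha[0][j] = node[0][j]

    # Recursion: extend the best partial path into each label.
    for t in range(1, n_pos):
        for j in range(n_labels):
            best_i = max(range(n_labels), key=lambda i: alpha[t - 1][i] * trans[i][j])
            phi[t][j] = best_i
            alpha[t][j] = alpha[t - 1][best_i] * trans[best_i][j] * node[t][j]

    # Termination: pick the best final label.
    last = max(range(n_labels), key=lambda j: alpha[-1][j])

    # Backtracking: follow the stored pointers from the end to the start.
    path = [last]
    for t in range(n_pos - 1, 0, -1):
        path.append(phi[t][path[-1]])
    path.reverse()
    return path, alpha[-1][last]


# Hypothetical scores for two labels (0 and 1) over three positions.
node_scores = [[0.6, 0.4], [0.3, 0.7], [0.8, 0.2]]
trans_scores = [[0.7, 0.3], [0.4, 0.6]]
best_path, best_score = viterbi(node_scores, trans_scores)
print(best_path, best_score)
```

With these scores the decoder selects label 0 at every position. Although label 1 has the higher score at the second position, the transition scores make the path 0, 0, 0 the best overall.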

• Kenli Li, Wei Ai, Zhuo Tang, Lingang Jiang, and Keqin Li are with the College of Information Science and Engineering, Hunan University, Changsha, Hunan, China, 410082. E-mail: Kenli Li ([email protected]), Wei Ai ([email protected]), Zhuo Tang ([email protected]), Lingang Jiang ([email protected]).

• Fan Zhang is with the Kavli Institute for Astrophysics and Space Research, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. E-mail: f [email protected]

• Keqin Li is also with the Department of Computer Science, State University of New York, New Paltz, New York 12561, USA. E-mail: [email protected].

• Kai Hwang is with the Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089, USA. E-mail: [email protected].

• Corresponding author: Wei Ai

(a) The two-phase approach using CRF

(b) The CRF model

Fig. 1: An example of biomedical NER based on CRF

2 EXPERIMENTAL SOFTWARE AND HARDWARE CONFIGURATIONS

Tables 1 and 2 provide the software and hardware configurations in the Hadoop cluster.


Fig. 2: The process of the L-BFGS algorithm

(a) Steps 1–3

(b) Step 4

Fig. 3: An example for the process of the Viterbi algorithm


TABLE 1: The software and hardware configurations in the Hadoop cluster

Node type   Operating system   CPU                Memory   Quantity
NameNode    openSUSE 11        4-core, 3.07 GHz   8 GB     1
DataNode    openSUSE 11        4-core, 2.70 GHz   8 GB     40

TABLE 2: Key configuration in the Hadoop cluster

Configuration item   Configuration property                     Value
Map slots            mapred.tasktracker.map.tasks.maximum       4
Reduce slots         mapred.tasktracker.reduce.tasks.maximum    2
Copy threads         mapred.reduce.parallel.copies              5
HDFS replications    dfs.replication                            1
Input file size      dfs.block.size                             64 MB
DataNode heartbeat   dfs.heartbeat.interval                     3 s
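As a rough illustration of how the settings in Table 2 could be applied, the sketch below renders them as Hadoop `<configuration>` XML. It assumes a Hadoop 1.x style layout in which `mapred.*` properties go in `mapred-site.xml` and `dfs.*` properties go in `hdfs-site.xml`. It also assumes that `dfs.block.size` is given in bytes and `dfs.heartbeat.interval` in seconds. The reduce-slot property is written in lowercase, as Hadoop spells it. The helper name `build_site_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Values from Table 2; 64 MB is converted to bytes (an assumption about units).
SETTINGS = {
    "mapred.tasktracker.map.tasks.maximum": "4",
    "mapred.tasktracker.reduce.tasks.maximum": "2",
    "mapred.reduce.parallel.copies": "5",
    "dfs.replication": "1",
    "dfs.block.size": str(64 * 1024 * 1024),
    "dfs.heartbeat.interval": "3",
}


def build_site_xml(settings, prefix):
    """Render all settings whose name starts with `prefix` as Hadoop XML."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        if name.startswith(prefix):
            prop = ET.SubElement(root, "property")
            ET.SubElement(prop, "name").text = name
            ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


mapred_site = build_site_xml(SETTINGS, "mapred.")  # content for mapred-site.xml
hdfs_site = build_site_xml(SETTINGS, "dfs.")       # content for hdfs-site.xml
print(mapred_site)
print(hdfs_site)
```

A replication factor of 1 means each HDFS block is stored on a single DataNode, so this configuration trades fault tolerance for storage and write cost.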