Supplementary Material for Hadoop Recognition of ...lik/publications/Wei-Ai-IEEE... · Hadoop...
Supplementary Material for Hadoop Recognition of Biomedical Named Entity Using Conditional Random Fields
Kenli Li, Wei Ai*, Zhuo Tang, Fan Zhang, Lingang Jiang, Keqin Li, and Kai Hwang
1 CONDITIONAL RANDOM FIELDS MODEL
This section provides a more detailed description of the CRF approach.
1.1 Conditional Random Fields
In Figure 1, we give a simple but concrete example to explain the two-phase biomedical named entity recognition using CRF. In Figure 1(a), B-protein in the output denotes the beginning of an entity of the protein class, and I-protein denotes the rest of that entity. Because the final recognition result uses the BIO format, the phrase "peri-kappa B factor" in the input is recognized as a protein. Figure 1(b) gives the major components of the CRF model.
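The BIO convention described above can be illustrated with a short sketch that groups tagged tokens back into entities. The function name `bio_to_entities`, the three-token split of "peri-kappa B factor", and the surrounding words "the" and "binds" are assumptions made for illustration; they are not taken from Figure 1.

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity text, class) pairs."""
    entities = []
    current, label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity; close any open one first.
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # An I- tag of the same class continues the open entity.
            current.append(token)
        else:
            # An O tag (or a stray I- tag) closes any open entity.
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities


# Hypothetical tokenization of the example phrase.
tokens = ["the", "peri-kappa", "B", "factor", "binds"]
tags = ["O", "B-protein", "I-protein", "I-protein", "O"]
entities = bio_to_entities(tokens, tags)
print(entities)
```

Under these assumptions the three tokens tagged B-protein, I-protein, I-protein are merged into the single protein mention "peri-kappa B factor".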
1.2 L-BFGS Algorithm for CRF
As shown in Figure 2, L-BFGS first sets the parameters to an initial value $\vec{\lambda}_0$. Then L-BFGS repeatedly improves the parameter estimates: $\vec{\lambda}_1, \vec{\lambda}_2, \ldots$. The whole process uses a loop with a given number $L$ of iterations. From $\vec{\lambda}_t$ to $\vec{\lambda}_{t+1}$, L-BFGS finds the search direction $\vec{P}_t$, decides the step length $a_t$, and moves in the direction $\vec{P}_t$. The key to finding the search direction $\vec{P}_t$ is to calculate the gradient vector $\nabla L_t$.
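The loop described above (compute the gradient, derive a search direction, pick a step length, move) can be sketched in a few lines. This is a minimal illustration, not the implementation used in the paper: the helper names `dot` and `lbfgs`, the history size `m`, the Armijo backtracking rule for the step length, and the toy quadratic objective standing in for the CRF negative log-likelihood are all assumptions.

```python
from collections import deque


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lbfgs(f, grad, x0, iters=50, m=5, tol=1e-10):
    """Minimize f with at most `iters` L-BFGS iterations (illustrative sketch)."""
    x = list(x0)
    g = grad(x)
    hist = deque(maxlen=m)  # recent (s, y, rho) pairs, oldest first
    for _ in range(iters):
        if dot(g, g) < tol:
            break
        # Two-loop recursion: approximate (inverse Hessian) * gradient.
        q, alphas = list(g), []
        for s, y, rho in reversed(hist):  # newest to oldest
            a = rho * dot(s, q)
            alphas.append(a)
            q = [qi - a * yi for qi, yi in zip(q, y)]
        gamma = dot(hist[-1][0], hist[-1][1]) / dot(hist[-1][1], hist[-1][1]) if hist else 1.0
        r = [gamma * qi for qi in q]
        for (s, y, rho), a in zip(hist, reversed(alphas)):  # oldest to newest
            b = rho * dot(y, r)
            r = [ri + (a - b) * si for ri, si in zip(r, s)]
        p = [-ri for ri in r]  # search direction P_t
        # Step length a_t by backtracking until the Armijo condition holds.
        step, fx, slope = 1.0, f(x), dot(g, p)
        while True:
            x_new = [xi + step * pi for xi, pi in zip(x, p)]
            if f(x_new) <= fx + 1e-4 * step * slope or step < 1e-12:
                break
            step *= 0.5
        g_new = grad(x_new)
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        if dot(s, y) > 1e-12:  # keep only pairs with positive curvature
            hist.append((s, y, 1.0 / dot(s, y)))
        x, g = x_new, g_new
    return x


# Toy objective (hypothetical) with its minimum at (1, -2).
def toy_loss(x):
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2


def toy_grad(x):
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]


solution = lbfgs(toy_loss, toy_grad, [0.0, 0.0])
print(solution)
```

In CRF training the gradient call dominates the cost of each iteration, since it requires a pass over all training sequences; the direction and step-length computations are comparatively cheap vector operations.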
1.3 Viterbi Algorithm for CRF
In Figure 3, we give a simple example to explain the detailed steps of Algorithm 2. Figure 3(a) shows Steps 1–3, and Figure 3(b) illustrates Step 4.
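The general shape of Viterbi decoding, a forward pass that records the best score and the best predecessor for each label, followed by a backward pass that reads off the label sequence, can be sketched as follows. This is a generic illustration and not a transcription of Algorithm 2: the function name `viterbi`, the factorization of scores into `init`, `trans`, and `emit` tables, the labels "P" and "O", and all numeric values are hypothetical.

```python
def viterbi(init, trans, emit, obs):
    """Return the highest-scoring label path and its score."""
    states = list(init)
    # alpha[t][s]: best score of any path ending in label s at position t.
    # phi[t][s]:   the previous label on that best path (back-pointer).
    alpha = [{s: init[s] * emit[s][obs[0]] for s in states}]
    phi = [{}]
    for t in range(1, len(obs)):
        alpha.append({})
        phi.append({})
        for s in states:
            prev = max(states, key=lambda r: alpha[t - 1][r] * trans[r][s])
            alpha[t][s] = alpha[t - 1][prev] * trans[prev][s] * emit[s][obs[t]]
            phi[t][s] = prev
    # Backward pass: start at the best final label and follow back-pointers.
    last = max(states, key=lambda s: alpha[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(phi[t][path[-1]])
    path.reverse()
    return path, alpha[-1][last]


# Hypothetical scores for two labels and two observation symbols.
init = {"P": 0.5, "O": 0.5}
trans = {"P": {"P": 0.7, "O": 0.3}, "O": {"P": 0.4, "O": 0.6}}
emit = {"P": {"x": 0.9, "y": 0.2}, "O": {"x": 0.1, "y": 0.8}}
path, score = viterbi(init, trans, emit, ["x", "x", "y"])
print(path, score)
```

With these toy values the decoded path is P, P, O. The cost is linear in the sequence length and quadratic in the number of labels, because each position compares every pair of adjacent labels once.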
• Kenli Li, Wei Ai, Zhuo Tang, Lingang Jiang, and Keqin Li are with the College of Information Science and Engineering, Hunan University, Changsha, Hunan, China, 410082. E-mail: Kenli Li ([email protected]), Wei Ai ([email protected]), Zhuo Tang ([email protected]), Lingang Jiang ([email protected]).
• Fan Zhang is with Kavli Institute for Astrophysics and Space Research, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. E-mail: f [email protected]
• Keqin Li is also with the Department of Computer Science, State University of New York, New Paltz, New York 12561, USA. E-mail: [email protected].
• Kai Hwang is with the Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089, USA. E-mail: [email protected].
• Corresponding author: Wei Ai
Fig. 1: An example of biomedical NER based on CRF. (a) The two-phase approach using CRF. (b) The CRF model.
2 EXPERIMENTAL SOFTWARE AND HARDWARE CONFIGURATIONS
Tables 1 and 2 provide the software and hardware configurations in the Hadoop cluster.
Fig. 2: The process of the L-BFGS algorithm
Fig. 3: An example for the process of the Viterbi algorithm. (a) Steps 1–3. (b) Step 4.
TABLE 1: The software and hardware configurations in the Hadoop cluster
Node type    Operating system    CPU                 Memory    Quantity
NameNode     openSUSE 11         4-core, 3.07 GHz    8 GB      1
DataNode     openSUSE 11         4-core, 2.70 GHz    8 GB      40
TABLE 2: Key configuration in the Hadoop cluster
Configuration item    Configuration property                     Value
Map slots             mapred.tasktracker.map.tasks.maximum       4
Reduce slots          mapred.tasktracker.reduce.tasks.maximum    2
Copy threads          mapred.reduce.parallel.copies              5
HDFS replications     dfs.replication                            1
Input file size       dfs.block.size                             64 MB
DataNode heartbeat    dfs.heartbeat.interval                     3 s
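As a rough illustration of how the settings in Table 2 map onto Hadoop 1.x site files, the sketch below renders them as `<property>` entries and derives the cluster-wide slot counts from Table 1. Several points are assumptions rather than statements from the paper: that `mapred.*` properties go in `mapred-site.xml` and `dfs.*` properties in `hdfs-site.xml`, that "64 M" means 64 × 1024 × 1024 bytes, that the reduce-slot property uses the lowercase Hadoop spelling, and that every DataNode also runs a TaskTracker. The names `TABLE2`, `DATANODES`, `site_file`, and `render` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Values from Table 2; dfs.block.size is given in bytes (assumed 64 MiB).
TABLE2 = {
    "mapred.tasktracker.map.tasks.maximum": "4",
    "mapred.tasktracker.reduce.tasks.maximum": "2",
    "mapred.reduce.parallel.copies": "5",
    "dfs.replication": "1",
    "dfs.block.size": str(64 * 1024 * 1024),
    "dfs.heartbeat.interval": "3",
}
DATANODES = 40  # from Table 1


def site_file(prop):
    """Pick the site file a property would conventionally live in."""
    return "hdfs-site.xml" if prop.startswith("dfs.") else "mapred-site.xml"


def render(props):
    """Render a dict of properties as a Hadoop <configuration> document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Split the table by target file and render each one.
files = {}
for name, value in TABLE2.items():
    files.setdefault(site_file(name), {})[name] = value
rendered = {fname: render(props) for fname, props in files.items()}

# Cluster-wide task capacity, assuming one TaskTracker per DataNode.
total_map_slots = DATANODES * int(TABLE2["mapred.tasktracker.map.tasks.maximum"])
total_reduce_slots = DATANODES * int(TABLE2["mapred.tasktracker.reduce.tasks.maximum"])

for fname, xml in rendered.items():
    print(fname, xml)
print(total_map_slots, total_reduce_slots)
```

Under these assumptions the cluster can run up to 160 map tasks and 80 reduce tasks concurrently.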