CIS392 Sp 03 Assign#1 1
CIS392 Text Processing, Retrieval, and Mining
Spring 03
Instructor: Dr. Y. F. Brook Wu
BOW toolkit:
http://www.cs.cmu.edu/~mccallum/bow
CIS392 Sp 03 Assign#1 2
Logging in to AFS
On campus: go to a computer lab in GITC 2305.
At home: make sure the internet connection has been established. (We assume everyone has Windows at home.)
Click Start, then Run, and type “telnet afs1.njit.edu” (without quotes; the first screen shows some useful information).
Enter your user name and password.
If your account doesn’t work: call the help desk at 973.596.2900; they can reset your password for you.
CIS392 Sp 03 Assign#1 3
Useful UNIX commands
Note: all filenames and commands on a UNIX system are case sensitive.
General syntax: command [option] argument
Options modify the way a command works, and they are optional.
Arguments are usually files; sometimes they are optional too.
Ex: rm -r directory_name
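The command/option/argument pattern can be tried out with a short sketch like the one below. It works in a scratch directory created with mktemp (an assumption made here so that nothing in your home directory is touched), and the names demo_dir and notes.txt are hypothetical.

```shell
# Work in a throwaway directory so nothing important is removed.
scratch=$(mktemp -d)
cd "$scratch"

mkdir demo_dir              # command: mkdir, argument: demo_dir
touch demo_dir/notes.txt    # create an empty file inside it

ls demo_dir                 # no option: lists names only
ls -l demo_dir              # option -l: long listing with attributes

rm -r demo_dir              # option -r: remove the directory and its contents
```

Note that the options are typed with a plain hyphen (-), not a dash.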
CIS392 Sp 03 Assign#1 4
Note
Typing two “-” next to each other in MS PowerPoint makes them look like “—”. The BOW and UNIX commands shown in these slides may therefore be misleading. Please refer to the BOW help file and the UNIX documentation for their actual usage.
CIS392 Sp 03 Assign#1 5
Useful UNIX commands
man (manual), ex: man ls (manual for the ls command)
cd (change directory)
ls (list files and attributes)
dir (list files)
mkdir (create a directory)
rm (delete a file)
rm -fr directory_name (delete the whole directory and the files inside it)
CIS392 Sp 03 Assign#1 6
Useful UNIX commands
rmdir (remove an empty directory)
cp (copy)
pwd (print the current working directory)
pico (a text editor)
more filename (read a plain text file one screen at a time; press the space bar to continue and “q” to quit)
quota (disk space)
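A short practice session with several of these commands might look like the sketch below. It is only an illustration: the scratch directory (made with mktemp) and the names papers, a.txt and b.txt are hypothetical and not part of the assignment.

```shell
scratch=$(mktemp -d)     # throwaway directory for the demo
cd "$scratch"
pwd                      # print the current working directory

mkdir papers             # create a directory
echo "hello" > a.txt     # create a small text file
cp a.txt papers/b.txt    # copy it into the new directory
ls -l papers             # list the copy with its attributes

rm papers/b.txt          # delete the file
rmdir papers             # rmdir only succeeds on an empty directory
ls                       # only a.txt remains
```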
CIS392 Sp 03 Assign#1 7
More useful UNIX commands
http://www.njit.edu/CSD/Docs/unixcmds.html
http://www.njit.edu/Directory/Admin/CSD/Academic_Computing/Manuals/UNIX/UNIX.html
CIS392 Sp 03 Assign#1 8
How to create your home page on the AFS system
Help info: http://www-ec.njit.edu/ec_info/newuser/web/web.html
Execute this command at the UNIX prompt: /usr/ec/bin/home.page.setup
Your URL: http://www-ec.njit.edu/~yourusername
CIS392 Sp 03 Assign#1 9
Overview of Retrieval Experiment
Create a sub-directory for CIS392 assignments under ~your_user_name/public_html
Create 3 sub-directories under the above directory for the 3 automatic indexing activities
Perform 3 automatic indexing activities with 3 different options
CIS392 Sp 03 Assign#1 10
Overview of Retrieval Experiment (cont.)
Perform 3 retrievals for each of the above 3 automatic indexing activities.
Analyze how different indexing options affect retrieval.
Make an HTML page to present your results.
CIS392 Sp 03 Assign#1 11
Creating sub-directories
Change directory to public_html by typing: cd public_html
mkdir cis392 (now you’ve created a directory for your CIS392 retrieval assignments)
cd cis392 (go inside the cis392 directory)
CIS392 Sp 03 Assign#1 12
Creating three sub-directories
mkdir model1 (this directory stores results from the default settings: no stemming, and stop words removed.)
mkdir model2 (this directory stores results from the following settings: no stemming, and stop words INCLUDED.)
mkdir model3 (this directory stores results from the following settings: stemming, and stop words removed.)
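The directory layout described above can also be created in one step. The function below is a minimal sketch, not part of the original slides; the name make_cis392_dirs is hypothetical, and it relies on the standard mkdir -p option, which creates missing parent directories and does not complain if a directory already exists.

```shell
# Create cis392/model1, model2 and model3 under the given base directory.
# make_cis392_dirs is a hypothetical helper name used for illustration.
make_cis392_dirs() {
    base=$1
    for model in model1 model2 model3; do
        mkdir -p "$base/cis392/$model"
    done
}
```

Calling make_cis392_dirs "$HOME/public_html" would produce the same result as the individual mkdir commands on these slides.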
CIS392 Sp 03 Assign#1 13
URL of your retrieval experiment
http://www-ec.njit.edu/~yourusername/cis392/cis392re.html
See a sample page created by Prof Wu: http://www-ec.njit.edu/~wu/cis392/cis392re.html
CIS392 Sp 03 Assign#1 14
Getting Access to BOW and Test Collection
There are three directories under ~wu/IR_Tools:
bow (the BOW system); to execute BOW, change directory to ~wu/IR_Tools/bow/bin
som (the self-organizing map program; do NOT use it now!)
tc (the test collection, Library and Information Science Abstracts); the text is under ~wu/IR_Tools/tc/lisa/text/group0 to group5
CIS392 Sp 03 Assign#1 15
Test Collection: LISA
The sample queries are stored in ~wu/IR_Tools/tc/lisa/LISA.QUE
The relevant documents corresponding to the queries are stored in ~wu/IR_Tools/tc/lisa/LISA.REL (“-1” marks the end of an entry.)
CIS392 Sp 03 Assign#1 16
Operating Arrow of BOW
Read the information on BOW’s web site (again, the URL is listed in the “Resources” section of the class syllabus).
Read Arrow’s help file (available on the syllabus page; you should print a copy of the help file).
CIS392 Sp 03 Assign#1 17
Automatic Indexing
To begin the retrieval tasks, you first need to index the whole document collection. Specify the lexing options (stop word removal and/or stemming) at this time.
arrow -d ~yourusername/public_html/cis392 --index ~wu/IR_Tools/tc/lisa/text/*
The * sign is a wildcard that represents all files and directories under ~wu/IR_Tools/tc/lisa/text
CIS392 Sp 03 Assign#1 18
Automatic Indexing
The -d parameter specifies where the statistics resulting from indexing will be stored. (You have to specify this directory both when you index and when you retrieve documents.)
The path after --index specifies the location of the text collection.
The default lexing settings for the above task are: NO stemming performed, and stop words REMOVED.
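To keep the three indexing runs straight, it can help to write the command lines out before running them. The sketch below only prints the commands; it does not run arrow. It assumes that each model's statistics go into its own model directory, and that the lexing flags are --no-stoplist and --use-stemming. Both are assumptions that should be checked against Arrow's help file. The function name index_command is hypothetical.

```shell
# Print (do not run) the indexing command for one model directory.
# ASSUMPTION: --no-stoplist and --use-stemming are the lexing flags;
# verify them in Arrow's help file before use.
index_command() {
    model=$1
    case $model in
        model1) lex_opts="" ;;                 # defaults: no stemming, stop words removed
        model2) lex_opts="--no-stoplist" ;;    # keep stop words
        model3) lex_opts="--use-stemming" ;;   # apply stemming
        *) echo "unknown model: $model" >&2; return 1 ;;
    esac
    # The quotes keep ~ and * from being expanded while printing.
    echo "arrow -d ~yourusername/public_html/cis392/$model ${lex_opts:+$lex_opts }--index ~wu/IR_Tools/tc/lisa/text/*"
}

for m in model1 model2 model3; do
    index_command "$m"
done
```

Printing the commands first makes it easy to compare the three runs and to confirm that only the lexing option differs between them.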
CIS392 Sp 03 Assign#1 19
Query assigned for retrieval
Please refer to the retrieval experiment section of the online syllabus to see which query you get for the experiment. (http://web.njit.edu/~wu/teaching/sp03/CIS392/CIS392-Sp03.htm)
CIS392 Sp 03 Assign#1 20
Retrieval
First specify where the indexing statistics are stored, and then the query to be performed.
arrow -d ~yourusername/public_html/cis392/model1 --num-hits-to-show=25 --query > ~yourusername/public_html/cis392/model1/retrieved_docs
The greater-than sign (>) redirects the output to the named file and so determines where it will be stored.
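The behavior of the greater-than sign can be seen with any command, not just arrow. The sketch below uses echo in a scratch directory; the file name retrieved_docs mirrors the slide, and the scratch location is an assumption made so that no real results are overwritten.

```shell
scratch=$(mktemp -d)                 # throwaway directory for the demo
out="$scratch/retrieved_docs"        # stand-in for the real output file

echo "first run"  > "$out"           # > creates the file (or overwrites it)
echo "second run" > "$out"           # overwrites: the first line is gone
echo "appended"  >> "$out"           # >> appends instead of overwriting

cat "$out"                           # prints "second run" then "appended"
```

Because > overwrites without warning, rerunning a retrieval with the same output path replaces the earlier results.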
CIS392 Sp 03 Assign#1 21
Presenting your RE
Create a page named cis392re.html under your ~/public_html/cis392 directory.
This page should contain several pieces of information; see: http://web.njit.edu/~wu/cis392/cis392re.html
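For those comfortable with basic HTML tags, a starting skeleton can be written from the UNIX prompt. The sketch below is only a placeholder: the function name write_results_page, the page layout, and the link targets are all assumptions, and the required content should be taken from the sample page referenced above.

```shell
# Write a bare-bones results page into the given directory.
# ASSUMPTION: the layout and link targets are placeholders, not the required content.
write_results_page() {
    dir=$1
    cat > "$dir/cis392re.html" <<'EOF'
<html>
<head><title>CIS392 Retrieval Experiment</title></head>
<body>
<h1>CIS392 Retrieval Experiment</h1>
<ul>
<li><a href="model1/retrieved_docs">Model 1: no stemming, stop words removed</a></li>
<li><a href="model2/retrieved_docs">Model 2: no stemming, stop words included</a></li>
<li><a href="model3/retrieved_docs">Model 3: stemming, stop words removed</a></li>
</ul>
</body>
</html>
EOF
}
```

Calling write_results_page "$HOME/public_html/cis392" would create the file, which can then be edited with pico.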
CIS392 Sp 03 Assign#1 22
Presenting your RE
You can create this HTML page with the pico editor in UNIX (if you know basic HTML tags), Microsoft Word (save the file in HTML format), or Netscape Composer.
If you use an HTML editor, you might need FTP software. http://www.zdnet.com/downloads/stories/info/0,10615,30994,00.html
Before the due date: please check all items on your HTML page and make sure all of them are displayed properly.
After the due date: do not make changes. I can check when the files were last updated.