Quality assessment of a current awareness system Thomas Krichel LIU & HГУ 2007-10-23.

Page 1:

Quality assessment of a current awareness system

Thomas Krichel

LIU & HГУ

2007-10-23

Page 2:

acknowledgments

• Thanks to organizers.

• I am grateful for comments by

– Bernardo Bátiz-Lazo

– Joanna P. Davies

– Marco Novarese

– Christian Zimmermann

• I thank everybody involved in RePEc and NEP, as well as JISC.

Page 3:

current awareness

• Current awareness, aka “Selective Dissemination of Information”, is a simple idea: a user is informed about new documents in her area of interest.

• Current awareness generates a double classification in

– subject ...

– time ...

• matter.

Page 4:

why bother?

• It is a niche activity that has been neglected by the search engines.

• I have registered with Google and Amazon. They give me tips but these are generally poor.

• We cannot trust computers to do it.

– Neither on subject matter

– Nor on time

Page 5:

types of current awareness

• personal (Amazon) vs collective (Google News)

• machine generated vs human generated?

• Actually I claim the only human-generated current awareness service for academic documents is NEP.

Page 6:

computer-based + keywords

• In computer-generated current awareness one can filter for keywords.

• In academic digital libraries, the papers describe research results, so they contain “ideas” that have not been seen before.

• Therefore getting the keywords right is impossible.

Page 7:

computer-based + categories

• It is possible to classify documents based on categories, say “football” vs “tennis”.

• It works fine when the vocabulary used in different categories is quite different.

• For some academic areas the differences are just too subtle.

Page 8:

computers and time problem

• In a digital library the “date” of a document can mean anything.

• The metadata may be dated in some implicit form.

– Recently arrived records can be calculated.

– But record handles may be unstable.

– Recently arrived records do not automatically mean new documents.

Page 9:

we need humans

• Catalogers are expensive.

• We need volunteers to do the work.

• Junior researchers have good incentives. They

– need to be aware of latest literature;

– are absent from the informal circulation channels of top level academics;

– need to “get their name around” among researchers in the field.

Page 10:

introducing NEP

• There is only one large freely-available human-based current awareness service.

• It is “NEP: New Economics Papers” at http://nep.repec.org

• Remainder of this talk is about NEP.

Page 11:

NEP: New Economics Papers

• NEP is a current awareness system for the working papers in the RePEc digital library about economics.

• Published articles are excluded because they are way too old.

Page 12:

NEP service model

• There is a basic model behind this service that we could call the “NEP Service Model”.

– two stages...

– flat report space...

Page 13:

general two stage setup

• First stage: A general editor compiles a list of all new papers. This forms an issue of the “allport”.

• Second stage: A group of subject editors filter the new allport issue into subject reports. Each editor does this independently for the subject reports she looks after.

Page 14:

a flat space

• There is a series of reports. Each report has a number of issues over time.

• There is an “allport”, a report that contains all papers that are new in the period covered by the issue.

Page 15:

first stage in NEP

• General editor compiles a list of recent additions to the RePEc working papers data.

– Computer generated

– Journal articles are excluded

– Examined by the General Editor (GE, a person)

• This list forms an issue of nep-all.

Page 16:

first stage in NEP

• nep-all contains all new papers

• Circulated to

– nep-all subscribers

– Editors of subject-reports

Page 17:

second stage

• Each editor creates, independently, a subject report for her subject. She does this by removing papers from the nep-all issue.

• A subject report issue (sri) is the result of this process.

• There have been over 47,000 sris issued through the lifetime of NEP to date.

Page 18:

history

• There are basically two phases in NEP: the pre-ernad phase, 1998 to 2004, and the post-ernad phase.

• I will deal with pre-ernad history here.

• Some research on NEP has been conducted in the pre-ernad phase.

• This has informed the work that went into ernad.

Page 19:

early history

• System was conceived by Thomas Krichel

• Name “NEP” by Sune Karlsson

• Implemented by José Manuel Barrueco Cruz.

• Started to run in May 1998.

Page 20:

starting setup

• First the system was all email based.

• The nep-all was composed as an email.

• It was sent to editors as an email.

• Editors composed their report emails with whatever tool they normally used.

Page 21:

web interface

• John S. Irons issued the first web interface for report composition on 2000-02-01.

• This would just compose the report.

• Editors would still cut and paste the results of the form into email clients.

Page 22:

historic mail support

• First mail support was given by mailbase.ac.uk.

• When this was closed in 2000-11, NEP moved to jiscmail.ac.uk.

• Since the mailing list service was only supposed to be for the UK academic community, it was deemed not sustainable.

• Thomas Krichel started hosting lists on 2002-11-16. It is a nightmare.

Page 23:

Aeroflot document

• The Aeroflot document was a thinking piece that Thomas Krichel wrote as early as 2001. http://openlib.org/home/krichel/work/aeroflot.html

• This paper already sets out ideas for what would be ernad.

• At that time the Siberian RePEc team promised help with building such a system.

Page 24:

discover disaster

• In 2002-2003 Jeremiah Cochise Trinidad Christensen and Thomas Krichel were the first people to try to get a systematic picture of how NEP works.

• They discovered that this was exceedingly difficult.

Page 25:

mail log parsing

• Logs were not moved from Mailbase to JISCMail.

• Mailbase removed the logs in 2002-11. Thomas Krichel got them just before they were destroyed.

• The mail logs were the only source for historic NEP information.

Page 26:

parsing targets

• handles: severely compromised by cut-n-paste operations, editor locales, etc.

• date of issue: editors were free to set dates, nep-all dates may not be preserved

• time of issue: an email is almost impossible to time.

Page 27:

state of pre-ernad data

• After a regular expressions orgy, we can get some approximate idea about the handles that were used.

• Thus the thematic component is roughly intact.

• We have a problem with a bug in the discovery program that made many papers appear several times in nep-all. This makes it difficult to associate subject and allport issues.

Page 28:

state of pre-ernad data

• Timing of emails is extremely difficult, even with full headers.

• The logs of the Mailbase system only have times for when the email client said it sent the mail. This is the editor's local PC time, which can be years out of whack!

• We still have some data for research...

Page 29:

research conducted on NEP

• Most of the research conducted on NEP has been done in the pre-ernad phase.

• The difficulties of some of this work have informed the construction of ernad.

Page 30:

Chu and Krichel (2003)

• Heting Chu & Thomas Krichel (2003) “NEP: Current awareness service of the RePEc digital library”, http://www.dlib.org/dlib/december03/chu/12chu.html, talks about NEP in general terms. It notes, despite the very shaky data, that there is a problem of timeliness in the subject report issues.

Page 31:

Barrueco Cruz et al. (2003)

• José Manuel Barrueco Cruz, Thomas Krichel and Jeremiah Cochise Trinidad Christensen, “Organizing Current Awareness in a Large Digital Library”, http://openlib.org/home/krichel/papers/espoo.pdf, has two themes

– overlap between reports...

– coverage ratio...

• as well as history and suggestions.

Page 32:

overlap

• Barrueco Cruz et al. (2003) argue that overlap occurs not when the same papers appear in two reports, but when the two reports are read by the same readers.

• They have data on pairwise overlap between reports, based on crude membership data.
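
A minimal sketch of how pairwise overlap could be computed from membership data. The slides do not say which measure Barrueco Cruz et al. used; the Jaccard index below, the function name, the report names and the subscriber ids are all assumptions made for illustration.

```python
from itertools import combinations


def pairwise_overlap(subscribers):
    """Map each pair of reports to the Jaccard index of their subscriber sets.

    `subscribers` maps a report name to a set of subscriber addresses.
    The Jaccard index is an assumed measure, not necessarily the paper's.
    """
    overlaps = {}
    for a, b in combinations(sorted(subscribers), 2):
        shared = subscribers[a] & subscribers[b]
        union = subscribers[a] | subscribers[b]
        overlaps[(a, b)] = len(shared) / len(union) if union else 0.0
    return overlaps


# Made-up membership data for three hypothetical reports.
membership = {
    "report-a": {"u1", "u2", "u3"},
    "report-b": {"u2", "u3", "u4"},
    "report-c": {"u5"},
}
for pair, value in pairwise_overlap(membership).items():
    print(pair, value)
```

With this measure, two reports overlap fully only when exactly the same people read both, whatever papers they announce.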

Page 33:

overlap puzzle

• Here is a puzzle to think about

– If a person is interested in two subject areas because they are close, she will subscribe to both reports.

– But since they are thematically close, she will sometimes receive the same papers twice.

• With mail technology and asynchronous issue generation, this appears difficult to solve.

Page 34:

coverage ratio

• We call the coverage ratio the share of papers in nep-all that have been announced in at least one subject report.

• We can define this ratio

– for each nep-all issue

– for a subset of nep-all issues

– for NEP as a whole
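
The definition can be sketched in a few lines of code. This is only an illustration: the function name and the paper handles are made up, and the sketch treats the coverage ratio as the fraction of the papers in one nep-all issue that appear in at least one subject report issue.

```python
def coverage_ratio(allport_handles, subject_issues):
    """Fraction of nep-all papers announced in at least one subject report issue."""
    allport = set(allport_handles)
    if not allport:
        return 0.0
    announced = set()
    for issue in subject_issues:
        announced.update(issue)
    return len(allport & announced) / len(allport)


# Hypothetical handles: four papers in nep-all, two subject report issues.
nep_all = ["RePEc:aaa:wpaper:1", "RePEc:aaa:wpaper:2",
           "RePEc:bbb:wpaper:7", "RePEc:ccc:wpaper:3"]
issues = [["RePEc:aaa:wpaper:1", "RePEc:bbb:wpaper:7"],
          ["RePEc:bbb:wpaper:7"]]
print(coverage_ratio(nep_all, issues))
```

For a subset of nep-all issues, or for NEP as a whole, the same function can be applied to the pooled handles.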

Page 35:

coverage ratio theory & evidence

• Over time more and more NEP reports have been added. As this happens, we expect the coverage ratio to increase.

• However, the evidence, from research by Barrueco Cruz, Krichel and Trinidad Christensen, is

– The coverage ratio of different nep-all issues varies a great deal.

– Overall, it remains at around 70%.

• We need some theory as to why.

Page 36:

Krichel & Bakkalbaşı (2005)

• Thomas Krichel and Nisa Bakkalbaşı “Developing a predictive model of editor selectivity in a current awareness service of a large digital library”. http://openlib.org/home/krichel/papers/boston.pdf

Page 37:

coverage ratio theories

• Krichel & Bakkalbaşı (2005) build two theories to explain the observations of Barrueco Cruz et al. (2003)

• They are

– Target-size theory

– Quality theory

• descriptive quality

• substantive quality

Page 38:

theory 1: target size theory

• When editors compose a report issue, they have a size of the issue in mind.

• If the nep-all issue is large, editors will take a narrow interpretation of the report subject.

• If the nep-all issue is small, editors will take a wide interpretation of the report subject.

Page 39:

target size theory & static coverage

• There are two things going on

– The opening of new subject reports improves the coverage ratio.

– The expansion of RePEc implies that the size of nep-all, though varying in the short run, grows in the long run. Target size theory implies that the coverage ratio deteriorates.

• The static coverage ratio is the result of both effects canceling out.

Page 40:

theory 2: quality theory

• George W. Bush version of quality theory

– Some papers are rubbish. They will not get announced.

– The amount of rubbish in RePEc remains constant.

– This implies constant coverage.

• Reality is slightly more subtle.

Page 41:

2 versions of quality theory

• Descriptive quality theory: papers that are badly described

– misleading titles

– no abstract

– languages other than English

• Substantive quality theory: papers that are well described, but not good

– from unknown authors

– issued by institutions with unenviable research reputation

Page 42:

practical importance

• We do care whether one or the other theory is true.

– Target size theory implies that NEP should open more reports to achieve perfect coverage.

– Quality theory suggests that opening more reports will have little to no impact on coverage.

• Since operating more reports is costly, there should be an optimal number of reports.

Page 43:

results

• Krichel & Bakkalbaşı (2005) build a binary logistic regression analysis model.

• They find positive evidence for both target size and quality theory.

• The NEP editors don't like the results. They insist that they only filter by topic.

Page 44:

Bátiz-Lazo & Krichel (2005)

• Bernardo Bátiz-Lazo and Thomas Krichel, “On-line distribution of working papers through NEP: A Brief Business History”, http://openlib.org/home/krichel/papers/kassel.pdf, has an early history of NEP that covers organizational details I don't talk about here.

Page 45:

ernad

• stands for editing reports on new academic documents.

• Software system designed by Thomas Krichel at http://openlib.org/home/krichel/work/altai.html.

• Software written in Perl by Roman D. Shapiro. Cost $2000.

• Started to work after 2004-12.

Page 46:

cut editor freedom I

• Editors no longer send mail to lists.

• Only one email address sends mail.

• But the mail appears to come from the editor:

From: Marcus Desjardin <[email protected]>

Reply-To: Marcus Desjardin <[email protected]>

Page 47:

cut editor freedom II

• Editors can no longer edit report issue emails, e.g. add announcements of conferences.

• The emails are generated from XML files into standardized text and HTML files bound together by MIME multipart/alternative.

• Editors cannot change dates of issue.
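
As an illustration of the mail format described on the last two slides, here is a minimal sketch using Python's standard email package. It is not ernad's code (ernad is written in Perl); the function name, the addresses and the subject line are assumptions. It builds a multipart/alternative message with a text and an HTML part, with From and Reply-To naming the editor.

```python
from email.message import EmailMessage
from email.utils import formataddr


def build_issue_mail(editor_name, editor_email, list_address,
                     subject, text_body, html_body):
    """Build a report issue mail with text and HTML alternatives."""
    editor = formataddr((editor_name, editor_email))
    msg = EmailMessage()
    msg["From"] = editor        # the mail appears to come from the editor
    msg["Reply-To"] = editor
    msg["To"] = list_address
    msg["Subject"] = subject
    msg.set_content(text_body)                      # text/plain part
    msg.add_alternative(html_body, subtype="html")  # text/html part
    return msg


# Hypothetical addresses and content.
mail = build_issue_mail(
    "Marcus Desjardin", "editor@example.org", "nep-xxx@lists.example.org",
    "nep-xxx 2007-10-13, two papers",
    "1. First paper\n2. Second paper\n",
    "<ol><li>First paper</li><li>Second paper</li></ol>",
)
print(mail.get_content_type())
```

Because the software builds the whole message, the editor has no opportunity to alter the body or the date.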

Page 48:

help editors

• Provide a simple-to-use interface for the composition of reports

– provide an easy to scroll input

– allow for easy sorting of report

– do a better job at pretty-printing

• Get ready for the introduction of pre-sorting

• Actually presorting was only introduced in 2005-08.

Page 49:

statistical learning

• The idea is that a computer may be able to make decisions on the current nep-all issue based on the observation of earlier editorial decisions. This is known as pre-sorting.

• Thomas Krichel “Information retrieval performance measures for a current awareness report composition aid” http://openlib.org/home/krichel/sendai.pdf deals with the evaluation of presorting.

Page 50:

presorting

• When an allport issue is created, it is presorted.

• In the allport rif each paper has a number in document order. That number is still reported in the presorted rif.

• The method is support vector machines (SVM), using svm_light.

Page 51:

pre-sorting reconceived

• We should not think of pre-sorting via SVM as something to replace the editor.

• We should not think of it as encouraging editors to be lazy.

• Instead, we should think of it as an invitation to examine some papers more closely than others.

Page 52:

headline vs. bottomline data

• The editors really have a three-stage decision process.

– They read title, author names.

– They read the abstract.

– They read the full text.

• A lot of papers fail at the first hurdle.

• SVM can read the abstract and prioritize papers for abstract reading.

• Editors are happy with the presorting system.

Page 53:

what is the value of an editor?

• It turns out that reports have very different forecastability. Some are almost perfect, others are weak.

• If the forecast is very weak the editor may be

– a genius

– a prankster.

Page 54:

svm training set

• The positive examples are taken from the report up to a certain time limit, called the experience length.

• The negative examples are taken from nep-all, from the date of the last issued subject report until the experience length.

• The experience length is fixed in ernad.conf. For NEP it is 13 months.

Page 55:

feature selection

• We use individual words out of the contents from titles, author names, abstracts, classification data and the id of the RePEc series.

• We normalize the Euclidean sum of the feature weights.

• We run svm_light with the default settings.
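
A rough sketch of the feature construction described above, producing lines in the svm_light input format (target, then index:value pairs with ascending indices). Several details are assumptions, since the slide does not give them: the tokenization rule, the use of raw word counts as weights, the field names, the handling of the series id as a single token, and the reading of the normalization step as scaling each vector to Euclidean length one.

```python
import math
import re
from collections import Counter

TEXT_FIELDS = ("title", "authors", "abstract", "classification")


def tokens(paper):
    """Yield words from the text fields, plus the series id as one token."""
    for field in TEXT_FIELDS:
        yield from re.findall(r"[a-z0-9]+", paper.get(field, "").lower())
    if paper.get("series"):
        yield "series=" + paper["series"]


def feature_vector(paper, vocabulary):
    """Return {feature index: weight}, scaled to Euclidean length 1."""
    counts = Counter()
    for token in tokens(paper):
        # svm_light feature numbers start at 1; new words get the next number.
        index = vocabulary.setdefault(token, len(vocabulary) + 1)
        counts[index] += 1
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {i: c / norm for i, c in counts.items()} if norm else {}


def svm_light_line(label, vector):
    """Format one example as '<target> <index>:<value> ...'."""
    features = " ".join(f"{i}:{v:.6f}" for i, v in sorted(vector.items()))
    return f"{label:+d} {features}".rstrip()


# A made-up paper description; +1 marks a paper the editor selected.
vocabulary = {}
paper = {
    "title": "Central bank independence",
    "authors": "A. Author",
    "abstract": "We study central bank policy.",
    "classification": "E58",
    "series": "RePEc:xxx:wpaper",
}
print(svm_light_line(+1, feature_vector(paper, vocabulary)))
```

The vocabulary dictionary has to be shared between training and presorting so that a word keeps the same feature number.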

Page 56:

presorting timeline

• When a nep-all issue has been created, a customized version of its rif is created in the source/us directory.

• This issue is then presorted. The presorted version is stored in the source/ps directory.

• Presorting therefore only takes account of the information available at nep-all creation time.

Page 57:

underlying technologies

• Written in Perl, using LibXSLT.

• Uses mod_perl under Apache 2.

• Runs on Debian GNU/Linux, could run on similar systems.

• Ernad needs to use some sort of mailing system but is not geared to a specific system. It basically just sends mail.

Page 58:

underlying information

• AMF is a format for description of academic documents and academics.

• http://amf.openlib.org/doc/amf.html

• It is based on XML Schema, itself based on XML.

• Report data and issue data are encoded in AMF.

Page 59:

ernad.conf

• Ernad uses a single configuration file, ernad.conf.

• It has a simple attribute = value structure.
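
A file with an attribute = value structure can be read with very little code. The sketch below is only an illustration: the attribute names in the sample are invented, and the treatment of blank lines and of lines starting with # as comments is an assumption, not something the slides state about ernad.conf.

```python
def parse_conf(text):
    """Parse 'attribute = value' lines into a dictionary."""
    settings = {}
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):  # assumed comment syntax
            continue
        name, separator, value = line.partition("=")
        if not separator or not name.strip():
            raise ValueError(f"line {number}: expected 'attribute = value'")
        settings[name.strip()] = value.strip()
    return settings


# Hypothetical attribute names, for illustration only.
sample = """
# experience length in months
experience_length = 13
delivery_domain = d.example.org
"""
print(parse_conf(sample))
```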

Page 60:

affordances and domains

• There are basically four things that an ernad-based current awareness system provides for.

• For each of these “affordances”, we have a separate domain.

• This allows for distinct affordances to be run by distinct domains.

Page 61:

the composition domain

• This is the domain used by the report issue composition interface.

• This is the virtual domain that the ernad Apache is running under.

• The ernad process creates files so the Apache server is best run as the user ernad.

• Recall ernad requires mod_perl, which in turn is incompatible with suexec.

Page 62:

the service domain

• This is where potential readers look at information about the ernad service

– what reports are available

– who edits them

• This domain is fixed through the reports configuration file reports.amf.xml.

Page 63:

the list domain

• This is the domain that the mailing lists are under

– domain of web interface

– domain of the mailing lists

• Each report with the id report has a list report@listdomain, where listdomain is the list domain.

• This domain is fixed through the reports configuration file reports.amf.xml.

Page 64:

delivery domain

• The links to the full text encode the identifier of the paper and the identifier of the report.

• This allows us to see what the readers of a report requested.

• It is imperative that these links are not further disseminated. There should be no archives of nep lists.

• It is fixed in ernad.conf
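
To make the idea concrete, here is a sketch of a link that carries both identifiers as query parameters, and of reading the report back from a logged link. The path /get, the parameter names paper and report, the domain and the handle are all hypothetical; the actual link layout used by ernad is not given in the slides.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def delivery_link(delivery_domain, paper_handle, report_id):
    """Build a full-text link that records paper and report."""
    query = urlencode({"paper": paper_handle, "report": report_id})
    return f"http://{delivery_domain}/get?{query}"


def link_identifiers(link):
    """Recover (paper handle, report id) from a delivery link."""
    params = parse_qs(urlsplit(link).query)
    return params["paper"][0], params["report"][0]


link = delivery_link("d.example.org", "RePEc:xxx:wpaper:1", "nep-cba")
print(link)
print(link_identifiers(link))
```

If such links were archived or passed on, downloads by outsiders would be counted as downloads by readers of the report, which is why the links must stay inside the mailing lists.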

Page 65:

reports.amf.xml

• The first part has the definition of NEP itself. (next slide)

• The second part has the definition of a report (slide after)

• The allport is the first listed

• reports.amf.xml fixes

– report handles

– editor information, (incl. here: editor ids)

– list domain

– service domain

Page 66:

<amf xmlns="http://amf.openlib.org"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://amf.openlib.org http://amf.openlib.org/2001/amf.xsd"
     xmlns:ernad="http://ernad.openlib.org">
 <collection id="RePEc:nep">
  <title>NEP: New Economics Papers</title>
  <accesspoint>ftp://all.repec.org/RePEc/wop/conf</accesspoint>
  <homepage>http://nep.repec.org</homepage>
  <haspart>
   <collection id="RePEc:nep:nepall">

Page 67:

<collection id="RePEc:nep:nepcba">
 <title>Central Banking</title>
 <homepage>http://lists.repec.org/mailman/listinfo/nep-cba</homepage>
 <ernad:password>...</ernad:password>
 <haseditor>
  <person>
   <name>Alexander Mihailov</name>
   <name xml:lang="en">Alexander Mihailov</name>
   <homepage>http://econpapers.repec.org/RAS/pmi59.htm</homepage>
   <email>...</email>
   <ispartof>
    <organization>...</organization>
   </ispartof>
  </person>
 </haseditor>
</collection>

Page 68:

operation

• A rif always has a name yyyy-mm-dd_tist,

– where yyyy is the year, mm month, dd day of the nep-all issue.

– tist is a UNIX time stamp, i.e. the number of seconds that have passed since the first of January 1970

• rifs are never deleted. When an operation is made, a new version of the rif with a new tist is written.
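
The naming scheme can be sketched as follows. The function names are invented, and any file extension is left out because the slide does not mention one; only the yyyy-mm-dd_tist pattern and the rule that the newest tist is the current version come from the text above.

```python
import re
import time
from datetime import date

RIF_NAME = re.compile(r"^(\d{4})-(\d{2})-(\d{2})_(\d+)$")


def rif_name(issue_date, tist=None):
    """Name for a new rif version: nep-all issue date plus UNIX time stamp."""
    if tist is None:
        tist = int(time.time())  # seconds since 1970-01-01
    return f"{issue_date.isoformat()}_{tist}"


def parse_rif_name(name):
    """Split a rif name into (issue date, tist)."""
    match = RIF_NAME.match(name)
    if match is None:
        raise ValueError(f"not a rif name: {name!r}")
    year, month, day, tist = map(int, match.groups())
    return date(year, month, day), tist


def latest_version(names):
    """Since rifs are never deleted, the highest tist is the current version."""
    return max(names, key=lambda name: parse_rif_name(name)[1])


print(rif_name(date(2007, 10, 14), 1192699117))
```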

Page 69:

creating a sri 0

• After logging in to ernad, an editor sees a set of allport issues to work on.

• This is the report selection stage.

• If no allport issue needs working on, a “sorry” message is displayed.

Page 70:

creating a sri 1

• When a subject report issue is created, it is copied from the source/ps or source/us directory, depending on whether the editor chooses to work with the presorted or the unsorted version of the issue.

Page 71:

creating a sri 1.5

• If there is no paper worth including in the report, the editor can move back from the paper selection to the issue selection stage. There she can delete the issue.

• The rif of the issue is not deleted. Instead an empty issue is created in the final “sent” directory.

Page 72:

creating a sri 2

• Once papers have been selected, a rif is created in the directory “selected”. This rif only contains the selected papers.

• If there are changes in the selections, new rifs are created, as soon as the editor moves to the next screen.

Page 73:

creating a sri 3

• Once papers have been ordered, a rif is created in the directory “ordered”.

• If there are changes in the order, new rifs are created, as soon as the editor moves to the next screen.

Page 74:

creating a sri 4

• Once the sri has been previewed the editor can click the “send” button. The rif is stored in the “sent” directory.

• Ernad creates a mail file containing an HTML and a text version of the sri and places it in the “mail” directory.

• This file can be sent again if there is an email problem.

Page 75:

a lot of data

ernad@khufu:~$ date
Thu Oct 18 05:18:37 EDT 2007
ernad@khufu:~$ du -s ernad/var/reports/
24147956 ernad/var/reports/
ernad@khufu:~$ find ernad/var/reports/ -type f | grep -c ^
76043

Page 76:

maintenance

• Thomas Krichel has written a technical guide, mainly for the director and the general editor.

• It is at http://nep.repec.org/technical.

• It illustrates well that the technical maintenance is still quite heavy.

• A lot of maintenance scripts still have to be written by Thomas.

Page 77:

assessment

• How well does NEP work?

• Some criteria are already discussed in the literature

– delay

– coverage ratio

– overlap

Page 78:

coverage to lossage

• There is a web site, http://nep.repec.org/lossage, that shows the papers that have not been sent, as well as the coverage ratio for each issue.

• It appears that coverage has improved.

Page 79:

overlap

• There is no script to compute current overlap data.

• There is quite good historical subscriber data for most of the post-ernad period.

• Thus it is possible to calculate overlap of reports for various nep-all issues.

Page 80:

to improve coverage

• It would be interesting to redo the work of Krichel and Bakkalbaşı (2005)

• For English papers, we can try to presort nep-all issues for a virtual “nep-no” report. This could help to identify thematic gaps.

• Should we open language-specific reports?

Page 81:

delay

• The site http://nep.repec.org/delay shows average delays of editors.

• Half of the editors appear to do a good job. A good job is when the average delay is below a week.

Page 82:

editor activity

• There is a web site containing activity data of editors.

• http://nep.repec.org/editor_activity.

• There appear some minor problems but overall it appears ok.

• Data is only available from 2005-06, because of a misunderstanding between Roman D. Shapiro and Thomas Krichel.

Page 83:

downloads

• This is the ultimate measure of success.

• Downloads from a report can be measured because of a CGI parameter identifying the report.

• Parsing the logs and matching the handles with handles in reports is difficult.

Page 84:

downloads from issues

• A lot of downloads are made by editors when they compose the report issue.

• A lot of others are made by robots.

• As for the rest, data is at http://nep.repec.org/downloads

• At this time this is preliminary.

Page 85:

the Kiev framework

• This is a framework I want to discuss today to assess NEP.

• Objective: maximize downloads of papers through NEP per paper announced.

• Means:

– targeted report

– large and targeted audience

• both can be influenced by the editor

Page 86:

unit of assessment

• The unit of assessment is the report issue. This is not an assessment of

– the report

– the editor

• The dependent variable is related to the independent variables through linear regression.

Page 87:

dependent variable

• It is the number of downloads made by users from a report.

• We try to get to the true user data.

• We only look at data after pre-sorting was introduced, say 2006 and 2007.

• In the following, I am looking at the independent variables (i.v.)

Page 88:

i.v. of normalization

• i.v. 1: issue_size

– This is the number of papers in the report issue.

• i.v. 2: membership_size

– This is the number of members that get the report just before the issue is mailed.

Page 89:

i.v. of membership

• One indication of membership quality is that it is dispersed.

• i.v. 3: concentration_1

– A measure of concentration of subscribers' top-level domain.

• i.v. 4: concentration_2

– A measure of concentration of subscribers' top and second-level domain.
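
The slide does not say which concentration measure is meant. As one possible reading, the sketch below uses a Herfindahl-style index: the sum of squared shares of subscribers per domain, which is 1 when all subscribers share one domain and approaches 0 when they are widely dispersed. The measure, the function names and the addresses are assumptions.

```python
from collections import Counter


def domain_key(address, levels):
    """Last `levels` labels of the mail domain: 1 = top level, 2 = top and second level."""
    host = address.rsplit("@", 1)[-1].lower()
    return ".".join(host.split(".")[-levels:])


def concentration(addresses, levels):
    """Sum of squared domain shares (a Herfindahl-style index, assumed here)."""
    counts = Counter(domain_key(a, levels) for a in addresses)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((n / total) ** 2 for n in counts.values())


# Made-up subscriber addresses.
members = ["a@alpha.org", "b@beta.org", "c@alpha.com", "d@alpha.net"]
print(concentration(members, levels=1))  # candidate for concentration_1
print(concentration(members, levels=2))  # candidate for concentration_2
```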

Page 90:

i.v. of timeliness

• i.v. 5: all_time

– the time between nep-all and the current subject issue

• i.v. 6: neighbour_time

– the minimum of

• the time between the current issue and the next issue

• the time between the current issue and the previous issue
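
The two timeliness variables can be written down directly. In this sketch the function and argument names are invented, and returning the single available gap for the first or last issue of a report (or None when there is no neighbour at all) is an assumption, since the slide does not cover those cases.

```python
from datetime import datetime, timedelta


def all_time(allport_time, issue_time):
    """Time between the nep-all issue and the subject report issue."""
    return issue_time - allport_time


def neighbour_time(issue_time, previous_time=None, next_time=None):
    """Smaller of the gaps to the previous and to the next issue of the report."""
    gaps = []
    if previous_time is not None:
        gaps.append(issue_time - previous_time)
    if next_time is not None:
        gaps.append(next_time - issue_time)
    return min(gaps) if gaps else None


# Made-up send times for one report.
sent = datetime(2007, 10, 10)
print(all_time(datetime(2007, 10, 7), sent))
print(neighbour_time(sent, datetime(2007, 10, 3), datetime(2007, 10, 15)))
```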

Page 91:

i.v. of composition

• i.v. 7: composition_duration

– the total time of composition

• i.v. 8: ordering_step

– the total number of times the report was ordered

• i.v. 9: trailer

– the position of the last paper selected

• i.v. 10: all_size

– the size of the corresponding nep-all

Page 92:

i.v. of season

• We know that the activity of RePEc is seasonal.

• i.v. 11 to 22: m1 to m12

– dummies to indicate the month

Page 93:

anything missing?

• the report?

• the editor?

Page 94:

Leonardo Fernandes Souto

• is a Brazilian PhD student working on current awareness services.

• He has (correctly) identified NEP as the model that all should follow ;-)

• His questionnaire is at http://nep.repec.org/research/NEP_questionnaire_2007-10-06.doc.

Page 95:

future extensions of ernad 1

• Use editor identities to build a customized experience length.

• Use a collection of multi-word RePEc keywords to aid pre-sorting.

Page 96:

future extensions of ernad 2

• Deal better with duplicate papers under different handles.

– use lists before ordering as a basis for inclusion into pre-sorting. This will allow editors to delete duplicates without confusing the SVM

– potentially detect duplicates at allport composition time.

Page 97:

future extensions of NEP

• Use RSS as an alternative dissemination method.

Page 98:

tough problems

• Filtering for new papers is deficient as the date on papers is not mandatory. Presorting for age seems impossible.

• The fight with spam is a problem for anyone who sends out a lot of mail.

Page 99:

finally: stop the press...

• 2007-10-17 Christian Zimmermann wrote “At http://ideas.repec.org/i/e.html, I have attempted to classify registered authors by field. For this I used their papers as disseminated by NEP, and if at least one fourth are in a report, authors are considered to be a specialist of that field.”
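
The rule in the quote can be sketched as below. This is an illustration of the one-fourth threshold only, not Christian Zimmermann's code; the function name, the report names and the paper ids are invented, and taking the author's NEP-announced papers as the denominator is an assumption.

```python
from fractions import Fraction


def specialist_fields(author_papers, report_papers, threshold=Fraction(1, 4)):
    """Reports in which at least `threshold` of the author's papers appeared."""
    papers = set(author_papers)
    if not papers:
        return []
    fields = []
    for report, announced in sorted(report_papers.items()):
        share = Fraction(len(papers & set(announced)), len(papers))
        if share >= threshold:
            fields.append(report)
    return fields


# Made-up data: one author with four papers announced through NEP.
author = ["p1", "p2", "p3", "p4"]
reports = {
    "report-a": ["p1"],              # 1 of 4 papers: exactly one fourth
    "report-b": ["p2", "p3", "p9"],  # 2 of 4 papers
    "report-c": ["p8"],              # none of the author's papers
}
print(specialist_fields(author, reports))
```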

Page 100:

http://openlib.org/home/krichel

Thank you for your attention!