WebInfoMall: the Chinese Web Archive, how we got started and how it is now
Huang Lianen and Li Xiaoming
Peking University, China
Digital Archive Workshop
August 27, 2007, Xi'an, China
Institute of Network Computing and Information Systems
Outline
Motivation developed in 2001
  In 2001, I was not able to give an answer when someone asked me what had been on the Chinese web in 1996.
  In 2100, I would like to be able to answer concretely if someone asks me what was on the Chinese web in 2001.
Archiving technology
  For long-term web crawling and storage, what technology should be used, especially in a university lab environment?
Exhibition of the archive
  How do we show the archive to society?
On the ephemeral nature of Web data
Li Xiaoming, "On the estimation of the number of previous Chinese Web pages", Journal of Peking University, Vol. 39, No. 3, May 2003, 394-398.
As a by-product, we also obtained the result that the time for 50% of current web pages to disappear is about 0.99 year.
Observing this ephemeral nature, can we archive the pages before they are gone?
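To make the 0.99-year figure concrete, here is a minimal sketch that assumes page disappearance follows simple exponential decay. That model is an assumption made for illustration; the slides do not say which model the cited paper uses. The function name `surviving_fraction` is hypothetical.

```python
HALF_LIFE_YEARS = 0.99  # from the slide: time for 50% of current pages to disappear


def surviving_fraction(years: float, half_life: float = HALF_LIFE_YEARS) -> float:
    """Fraction of today's pages still online after `years`.

    Assumes exponential decay, which is an illustrative assumption and
    not necessarily the model used in the cited paper.
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    return 0.5 ** (years / half_life)


if __name__ == "__main__":
    for years in (0.99, 2, 5):
        print(f"after {years} years: {surviving_fraction(years):.1%} of pages remain")
```

Under this assumption, only a few percent of the pages online in a given year would still exist five years later, which is why the 1996 question could not be answered in 2001.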
We have some advantages
With a search engine, 50% of the work is done!
The system work started in 2001
The progress and current status
The crawl started in 2001, and the first batch of data was put online on Jan 18, 2002.
As of today, the repository holds over 2.5 billion distinct Chinese web pages; more precisely, pages crawled from mainland China's web.
About 1 million pages are added every day.
Initially we used tapes for storage, but later changed to hard disks.
Total online data volume (compressed) ≈ 30 TB, with an offline backup.
In spring 2002, "historical browsing" was provided; in summer 2006, "backward browsing" entered beta testing.
Example: the InfoMall interface
Example: entering www.sina.com.cn
Example: Sina on 2002.1.18
Bin Laden's headquarters was bombed.
Links are preserved
The first air strike of the new year: the US Air Force bombed Bin Laden's headquarters.
Links continue to be preserved
2002.10.8
2003.9.2
2004.5.28
Featured collections: SARS
Featured collections: the first manned space vehicle
We ask three questions:
What's the use?
  Preserving historical information before it is lost
  Implying great opportunities for deep mining
  Providing access to past information much more conveniently than libraries, even if they have kept it
Can we do it? (or at least get a pretty good start)
  "We": a university lab.
How do we do it?
Can we do it? (resource requirements)
"Hard" resources
  Crawler system: 4 computers at $5,000 each
  Storage system: about 50 million pages per 1 TB, amounting to $4,000. If you need a backup, double the investment.
  Access web server: $4,000
  Space (not big, but reliable) to house these machines
  High-speed network connection: ? per month
"Soft" resources
  Permission for crawling and keeping
  A staff member to handle the daily routine matters
  Persistent enthusiasm for this undertaking
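The hardware figures above can be combined into a rough budget. The sketch below is only an illustration: it reads the slide's "$4,000" as the cost of 1 TB of storage (an assumption), it leaves out the network cost because the slide gives no figure, and the function name `hardware_budget` is hypothetical.

```python
import math

# Figures taken from the slide.
CRAWLER_COUNT = 4
CRAWLER_UNIT_COST = 5_000    # USD per crawler computer
PAGES_PER_TB = 50_000_000    # about 50 million pages per 1 TB
WEB_SERVER_COST = 4_000      # USD for the access web server

# Assumption: the slide's "$4,000" is the cost of 1 TB of storage.
COST_PER_TB = 4_000


def hardware_budget(total_pages: int, with_backup: bool = False) -> dict:
    """Estimate one-off hardware cost in USD for archiving `total_pages`."""
    terabytes = math.ceil(total_pages / PAGES_PER_TB)
    storage = terabytes * COST_PER_TB
    if with_backup:
        storage *= 2  # "If you need a backup, double the investment."
    crawlers = CRAWLER_COUNT * CRAWLER_UNIT_COST
    return {
        "terabytes": terabytes,
        "crawlers_usd": crawlers,
        "storage_usd": storage,
        "web_server_usd": WEB_SERVER_COST,
        "total_usd": crawlers + storage + WEB_SERVER_COST,
    }
```

For example, `hardware_budget(50_000_000, with_backup=True)` covers the first 50 million pages with one backup copy.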
How do we do it?
Incremental crawling
  A scheduled daily operation that collects about one to two million new pages a day, fingerprint-compared with previous pages
Data storage and incorporation
  Once every few weeks, after enough data has been collected
Accessibility
  Wayback Machine style
  Featured exhibitions
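The fingerprint comparison in the incremental crawl could look roughly like the sketch below. It is not the WebInfoMall implementation: the choice of MD5, the decision to fingerprint page content only, and the names `fingerprint` and `select_new_pages` are all assumptions made for illustration.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a content fingerprint (MD5 here; the hash choice is an assumption)."""
    return hashlib.md5(content).hexdigest()


def select_new_pages(crawled, seen_fingerprints):
    """Keep only pages whose content has not been stored before.

    crawled: iterable of (url, content_bytes) pairs from today's crawl.
    seen_fingerprints: set of fingerprints of previously stored pages;
        it is updated in place with the newly accepted ones.
    """
    new_pages = []
    for url, content in crawled:
        fp = fingerprint(content)
        if fp in seen_fingerprints:
            continue  # identical content already archived, skip it
        seen_fingerprints.add(fp)
        new_pages.append((url, content))
    return new_pages
```

In a daily run, the set of seen fingerprints would be loaded from the existing repository, and only the pages returned by `select_new_pages` would be queued for the next storage batch.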
WebInfoMall: hierarchical module data organization
Assurance of scalability and dynamic re-configurability
Convenient for coping with changes at all levels
record : file : batch : disk : node : system
Matching the logical data organization to the physical device structure as closely as possible
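One way to picture the record : file : batch : disk : node : system hierarchy is as an address with one field per level. The sketch below is hypothetical: the class `RecordAddress`, the path layout, and the `remount` helper are illustrative assumptions, not the published WebInfoMall storage format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordAddress:
    """Location of one record in the hierarchy record : file : batch : disk : node."""
    node: str
    disk: str
    batch: str
    file: str
    record: int

    def path(self) -> str:
        """Directory-style path of the file holding the record (hypothetical layout)."""
        return f"/{self.node}/{self.disk}/{self.batch}/{self.file}"


def remount(addr: RecordAddress, node: str, disk: str) -> RecordAddress:
    """Re-home a record after its batch moves to another disk or node.

    Only the physical levels change; batch, file, and record stay the same.
    """
    return RecordAddress(node, disk, addr.batch, addr.file, addr.record)
```

Because the logical levels (batch, file, record) are separate from the physical ones (disk, node), moving a batch to a new disk changes only two fields of the address, which is the kind of re-configurability the slide describes.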
The architecture
The operations under the hood
Comparison
A survey done by the National Library of China
Web InfoMall is the only large-scale web archive in China, operated in a university lab!
In the flattened world,
"small can act big!"
Resource sharing
We have published the data storage format and provide WebInfoMall data to the research community for free.
The beneficiary research units include Peking University, Tsinghua University, Chinese Academy of Sciences, Shanghai Jiaotong University, Renmin University of China, Harbin Institute of Technology, ...
In particular, we built the largest Chinese Web Test collection, with 200 GB of compressed web pages (CWT200g), for the evaluation of Chinese web information retrieval technologies.
Summary
WebInfoMall, http://www.infomall.cn, has been the Chinese web archive since 2001, with over 2.5 billion pages in its repository as of 2007.
Straightforward technology has been used for building WebInfoMall: Linux box + Berkeley DB + hierarchical module data organization.
We are looking into different ways to access the data, to get value beyond information preservation and history browsing.
Thanks for your attention
[email protected]@pku.edu.cn