Image Information Mining Conference: The Sentinels Era

Image Information Mining Conference 05/03/2014

Quantifying the Value of Federated Datasets in Earth Observation Information Mining and Analytics

P.G. Marchetti (*), M. Iapaolo (**)

* European Space Agency (ESA/ESRIN)

EOP Research and Ground Segment Technology Section

** Randstad Italia c/o ESA/ESRIN


1. Introduction

2. EO Datasets Value

3. Representation Capacity and Information Content for EO Datasets

4. Initial Results

5. Towards “Big Data”

6. Future work and perspectives

Outline


1. The volume of EO data systematically collected, processed and stored is continuously increasing

2. It is becoming more and more difficult to evaluate their “value”

3. Datasets are made available by different institutions (agencies, commercial providers, etc.), forming a federation of EO datasets

4. The Dataset Value is a vague concept: it covers the inherent information content, its possible exploitation, its relation to the user’s application needs, etc.

How can the value of an EO dataset be evaluated in a typical scenario of a network of federated datasets?

Introduction


Communication networks: the value of a network (its growth potential) grows as a quadratic function (n^2) of the number of network nodes n (Metcalfe’s Law)

Generic concept of value (importance) applicable to a wide range of natural phenomena (occurrences of words in a text, population size of big cities, etc.): the kth ranked item has a value (frequency, size) of about 1/k of the first one (Zipf’s Law)

Total Value (for a single node) = sum of decreasing 1/k values over all the n items ≈ log(n)

Applying to all n nodes: Total Value ≈ n log(n)

EO Datasets Value
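The logarithmic approximation of the sum of 1/k values can be checked numerically. The following is a minimal sketch, not part of the original slides; the function names are hypothetical, and the natural logarithm is assumed (the slides do not state the base).

```python
import math

def zipf_value_per_node(n: int) -> float:
    """Sum of the Zipf weights 1/k for k = 1..n (the n-th harmonic number)."""
    return sum(1.0 / k for k in range(1, n + 1))

def zipf_total_value(n: int) -> float:
    """Total value when each of the n nodes sees the same Zipf-weighted sum."""
    return n * zipf_value_per_node(n)

# Compare the exact sums with the log(n) and n*log(n) approximations.
for n in (10, 100, 1000, 10000):
    per_node = zipf_value_per_node(n)
    print(f"n={n:>6}  sum(1/k)={per_node:8.4f}  ln(n)={math.log(n):8.4f}  "
          f"n*sum(1/k)={zipf_total_value(n):12.1f}  n*ln(n)={n * math.log(n):12.1f}")
```

The difference between the sum and ln(n) settles near a constant (about 0.58), so the two grow at the same rate, which is all the n log(n) estimate requires.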


The crossover point is reached at a larger n under Zipf’s law than under Metcalfe’s law

EO Datasets Value

Plot of the growing function n log(n), compared with the linear and quadratic ones

The origin is set at n=1.
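One way to read the crossover statement is as the point where the value curve overtakes a linear curve. The sketch below assumes a linear reference of c·n with a hypothetical slope c = 5 and natural logarithms; neither is given in the slides, and the function names are illustrative only.

```python
import math

def crossover(value, slope: float, n_max: int = 10**6):
    """Smallest n >= 1 at which value(n) exceeds the linear reference slope * n."""
    for n in range(1, n_max + 1):
        if value(n) > slope * n:
            return n
    return None  # no crossover found up to n_max

def metcalfe_value(n: int) -> float:
    """Quadratic growth, n^2."""
    return float(n * n)

def zipf_value(n: int) -> float:
    """n log(n) growth (natural log assumed)."""
    return n * math.log(n)

c = 5.0  # hypothetical slope of the linear reference curve
print("Metcalfe crossover:", crossover(metcalfe_value, c))
print("Zipf crossover:    ", crossover(zipf_value, c))
```

Under these assumptions the quadratic curve crosses the line as soon as n exceeds c, while the n log(n) curve needs n to exceed e^c, which is why its crossover appears at a much larger n.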


EO Datasets Value and Information Content

1. In the EO context, it is of paramount importance to assess the value of datasets from the information content point of view (rather than from growth potential or market value)

2. The actual exploitation of federated datasets is mainly based on their information content, extracted through time series analysis and image information mining techniques and analytics

3. The relative value (i.e. the information content) of an EO dataset makes it possible to:

o estimate the number of EO products (or samples) to be used

o select which datasets are relevant for an analysis

Need for a theoretical framework for the assessment of the value (information content) of a federation of EO datasets


Representation Capacity

Given a family of n non-overlapping datasets in a federation, D={D1, D2, …, Dn};

Select from D a sample S={S1, S2, …, Sn}, where each Sh is contained in Dh (h=1,2,…,n);

Our aim here is to assess and quantify how representative S is of D, and how it can characterise the value of D

The Representation Capacity of D, K(D), is a measure of the degree of arbitrariness in choosing the sample S from D

K(D) should be a non-decreasing function f(x), where x is the size of the set from which the images must be extracted



K(D) = f(x) = f(d1·d2·…·dn) = f(d1)+f(d2)+…+f(dn) = K(D1)+K(D2)+…+K(Dn), where dh is the cardinality of Dh

f(x) = k log(x)

Assuming sh (the number of samples taken from Dh) is proportional to K(Dh): sh = k log(dh)

Method for building the sample S from D: extract a representative number of samples sh from each dataset Dh, so as to preserve the relative value (information content) of the corresponding datasets Dh.
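The sampling rule sh = k log(dh) can be sketched as follows. This is an illustration only: the dataset names and cardinalities are hypothetical, the natural logarithm is assumed, and rounding up and clamping to the range 1..dh are choices made here rather than anything stated in the slides.

```python
import math

def sample_sizes(cardinalities: dict, k: float = 1.0) -> dict:
    """Number of samples s_h = k * log(d_h) to draw from each dataset D_h.

    Rounding up and clamping to [1, d_h] are assumptions of this sketch.
    """
    sizes = {}
    for name, d in cardinalities.items():
        if d < 1:
            raise ValueError(f"dataset {name!r} must contain at least one product")
        s = math.ceil(k * math.log(d))
        sizes[name] = min(d, max(1, s))
    return sizes

# Hypothetical federation: two datasets whose sizes differ by a factor of 1000.
federation = {"dataset_A": 1_000, "dataset_B": 1_000_000}
print(sample_sizes(federation, k=10.0))
```

In this example the second dataset is a thousand times larger than the first but contributes only about twice as many samples, which is the logarithmic behaviour the method relies on.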


Information Content

Denoting Ih = log(dh):

Ih represents the average information associated with the random choice (with uniform probability 1/dh) of a single product from each Dh

M = k I

The number of EO product samples M is therefore proportional to the average Shannon information I associated with the n random choices (one per dataset Dh)
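The step from the per-dataset quantities to M = k I can be written out as below. This is a sketch under two assumptions not spelled out in the slides: that I is the sum of the Ih over the n independent uniform choices, and that M is the total number of samples, i.e. the sum of the sh. Here d_h, s_h and I_h denote dh, sh and Ih.

```latex
\begin{align}
I &= \sum_{h=1}^{n} I_h = \sum_{h=1}^{n} \log d_h
   = \log\Bigl(\prod_{h=1}^{n} d_h\Bigr), \\
M &= \sum_{h=1}^{n} s_h = \sum_{h=1}^{n} k \log d_h = k\, I .
\end{align}
```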


1. The Representation Capacity of an EO product dataset D is proportional to the log of the cardinality of the EO dataset

2. The value of a federation of datasets should take into account the Representation Capacity, and therefore grows with the log of the size of the individual datasets

3. In order to evaluate and compare different datasets in a federation for further processing, a general methodology to preserve the relative information content has been defined



1. Additional constraints could be imposed by further processing, image mining, time series analysis and statistics/analytics objectives and requirements

2. The simplified approach presented in this paper could make it possible to assess the value (information content) of a federation of EO datasets according to Shannon’s theoretical framework

3. This approach should complement the one derived from Zipf’s law, based on the number n of datasets in the federation, to help decision makers in evaluating the wealth of available information.

Comments


Information Content

Denoting Ih = log(dh):

Ih represents the average information associated with the random choice (with uniform probability 1/dh) of a single product from each Dh

To comply with this requirement of representativeness, we define as Informative Units the vectors of EO product samples v={v1,v2,…,vn}, with each vh belonging to the corresponding dataset Dh. It is important to note that the Informative Unit is neither a single EO product in Dh nor an arbitrary n-tuple of EO products.
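An Informative Unit, as defined above, can be illustrated by drawing one product uniformly from each dataset. The product identifiers and the function name below are hypothetical, and the fixed seed is used only to make the sketch repeatable.

```python
import math
import random

def draw_informative_unit(datasets, rng):
    """Return a vector v = [v_1, ..., v_n] with v_h drawn uniformly from D_h."""
    return [rng.choice(products) for products in datasets]

# Hypothetical federation of n = 3 non-overlapping datasets.
datasets = [
    ["a1", "a2", "a3"],
    ["b1", "b2"],
    ["c1", "c2", "c3", "c4"],
]

rng = random.Random(0)  # seeded generator, for repeatability only
unit = draw_informative_unit(datasets, rng)
print("Informative Unit:", unit)

# The number of distinct units is d_1 * d_2 * ... * d_n, so its log is the sum of the I_h.
n_units = math.prod(len(products) for products in datasets)
print("Possible units:", n_units, " log:", math.log(n_units))
```

Each unit holds exactly one product per dataset, in dataset order, which is what separates it from a single product or from an unconstrained n-tuple.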


Initial results 1-3

1. General approach for the assessment of the value – in terms of information content – of a federation of EO datasets

2. Interpretation of results under the Shannon information theoretical framework:

o The information content of a dataset is proportional to its cardinality

o Considering a sample of data extracted from the whole dataset, the Representation Capacity of the dataset is proportional to the log of its cardinality

o As a consequence, the value (information content) of a federation of EO datasets grows with the log of the size of the individual datasets


ESA UNCLASSIFIED – For Official Use

Initial results 2-3

[Figure: bar chart of the number of papers published on IEEE per mission (CRYOSAT, ENVISAT, ERS, GOCE, IKONOS, KOMPSAT, LANDSAT, MODIS, PROBA, RAPIDEYE, SMOS); vertical axis from 0 to 3000; search performed on 14.02.2014]

Oops, if we have a look at the papers…


Initial results 3-3

The identification of a general method for evaluating, comparing and selecting different datasets cannot ignore other information elements like:

• the papers published and their quality, content, relevance, citations and impact factors, e.g. the h-index (see Hirsch [1])

• the papers published and related parameters: mission, sensor, area, …

• the web pages published (see PageRank [2])

• Social media

• …
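As an illustration of one of the bibliometric elements listed above, the h-index of Hirsch [1] is the largest h such that h papers have at least h citations each. The sketch below uses hypothetical citation counts; how such an index would be combined with the information-content measure is not specified in the slides.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts for papers related to one mission.
print(h_index([10, 8, 5, 4, 3]))
```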


1. New models for research and service support are emerging in the Earth Observation context, and data availability from forthcoming missions will increase rapidly

2. Facilities for EO dissemination and processing services, geographically distributed in a federated domain and largely scalable with reliable Quality of Service, are urgently needed

3. Federated domains shall federate both computing and storage resources. The federation is valued and sustained by the underpinning Earth Observation datasets and their information content

4. To value dataset federations in wider contexts (e.g. Big Data, Web 2.0), R&D activities are needed to fully exploit the information they contain

5. A programmatic framework to sustain such R&D activities must be set up to cover the various aspects involved (IIM, TS analysis, EO data analytics, multi-dimensional databases, semantic web, visual analytics, etc.)

Future Work, towards “Big Data”


1. The programmatic framework should span a time frame of 5-10 years

2. It should include a strong user validation step (possibly involving hundreds of users and laboratories)

3. It should be extended to include other domains (not only EO!!): Earth and Space Science, Engineering … see the announced “Big Data from Space” Conference!

4. Recent work (Mazzucato) demonstrates the benefits of funding large and strongly supported research programmes (venture capital and the market will follow, exploiting earlier consistent investments by state-funded institutions)

5. Research on value-enhanced search for EO data may help in adding value and is needed to exploit the great variety of data which will be made available!

Future Work, towards “Big Data”

http://congrexprojects.com/2014-events/BigDatafromSpace


References

[1] J.E. Hirsch, An index to quantify an individual's scientific research output, Proceedings of the National Academy of Sciences 102(46):16569-16572, 2005. http://arxiv.org/find/physics/1/au:+Hirsch_J/0/1/0/all/0/1

[2] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Technical Report, Stanford InfoLab, 1999




Thank you!!
