Meta Data and Quality of Data for OGD Platform India

Open Government Data Platform India (https://data.gov.in) Meta Data and Quality of Data By: Sunil Babbar, Scientist-C, NIC

Transcript of Meta Data and Quality of Data for OGD Platform India

Page 1: Meta Data and Quality of Data for OGD Platform India

Open Government Data Platform India

(https://data.gov.in)

Meta Data and Quality of Data

By: Sunil Babbar, Scientist-C, NIC

Page 2: Meta Data and Quality of Data for OGD Platform India

Data Contributors and Their Role

• Nominated by the Chief Data Officer (CDO)
• Coordinate and identify datasets which can be contributed
• Prepare the datasets
– Getting them cleaned
– Preparing metadata for the datasets in the predefined format
– Ensuring the quality and correctness of the datasets of his/her unit/division
• Contribute Catalogs/Resources (Datasets) through the pre-defined workflow (Data Contributor → Chief Data Officer (CDO) for review and publish → PMU to publish on the OGD Platform)

Page 3: Meta Data and Quality of Data for OGD Platform India

Resources (Datasets / Apps)

• A data set (or dataset) is a collection of data
• A data set corresponds to the contents of a single table or statistical data matrix, where every column represents a particular variable and each row corresponds to a given member of the data set in question
• Open Data Formats: CSV, XLS, ODF, XML/RDF, JSON, RSS/Atom, KML/GML
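The table view of a dataset can be sketched in a few lines of Python. This is only an illustration: the column names and values are made up, and `to_csv` and `to_json` are hypothetical helpers, not part of the OGD Platform. Each dictionary is one row (a member of the dataset) and each key is one column (a variable).

```python
import csv
import io
import json

# Illustrative rows only; the values are invented for this sketch.
rows = [
    {"state": "State A", "year": 2012, "rainfall_mm": 2500.0},
    {"state": "State B", "year": 2012, "rainfall_mm": 550.0},
]


def to_csv(records):
    """Serialize a list of row dicts to CSV text with one header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_json(records):
    """Serialize the same rows to JSON, another open format."""
    return json.dumps(records, indent=2)


print(to_csv(rows))
print(to_json(rows))
```

The same rows can be written to either format because the structure (one variable per column, one member per row) does not depend on the file type.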

Page 4: Meta Data and Quality of Data for OGD Platform India

Catalog

• A catalog is a grouping of similar resources (Datasets/Apps)
• A catalog represents a collection of resources that you group together
• Acts like a directory of information about resources
• Benefits of a Catalog
– Facilitates data access for users who are first interested in a particular kind of data
– Groups resources with the same theme/subject, so that users can find a specific dataset/resource easily
– Reduces effort for Ministries/Departments: the same set of resources can be uploaded, or a dataset updated for a new period, without writing the metadata again and again
– Makes navigation and searching for relevant data easier for users

Page 5: Meta Data and Quality of Data for OGD Platform India

Catalog Formation

• Catalog with the same resource for different time periods (Annual, Quarterly, Monthly, Weekly and Daily). E.g. Annual Rainfall Data
• Catalog with the same resource for different jurisdictions (India, States, Districts, Block, Village). E.g. States/UTs-wise Forest and Tree Cover
• Catalog with the same resource for different categories (Scheduled Caste, Scheduled Tribe, General, Religion etc.). E.g. District-wise crimes committed against Scheduled Caste
• Catalog with similar types of resource under the same report: resources of a similar nature from the same report/survey can be grouped under the same catalog. E.g. Primary Census Abstract 2011 - India and States
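One way to picture the catalog/resource relationship is as a small data structure. The sketch below is an assumption made for illustration, not the platform's actual data model: the `Catalog` and `Resource` classes, their field names, and the description text are all hypothetical. It shows the shared metadata being written once on the catalog, with each new time period added as a resource.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    title: str   # resource title carries the time frame
    period: str  # the period this resource covers


@dataclass
class Catalog:
    title: str        # no time period in the catalog title
    description: str
    keywords: list
    resources: list = field(default_factory=list)

    def add_resource(self, title, period):
        """Attach a resource; the catalog metadata is reused as-is."""
        resource = Resource(title=title, period=period)
        self.resources.append(resource)
        return resource


# Hypothetical catalog grouping the same resource over several years.
rainfall = Catalog(
    title="Annual Rainfall Data",
    description="Illustrative description text.",
    keywords=["rainfall", "weather"],
)
for year in ("2011", "2012"):
    rainfall.add_resource(f"Rainfall of the year {year}", year)

print([r.title for r in rainfall.resources])
```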

Page 6: Meta Data and Quality of Data for OGD Platform India

MetaData

• Is the information that describes the data
– What the data is (About Data)
– Data source
– Who created it
– When it was created
– Etc.
• Metadata allows the data to be traced to its origin and its quality to be known

Page 7: Meta Data and Quality of Data for OGD Platform India

Metadata Elements for Catalogs

• Title (Required): A unique name for the catalog (group of resources)
– Should contain the general terms which describe the essential properties/characteristics of the datasets/resources
– Should be in plain English and include sufficient detail to facilitate search and discovery
– The time period should normally not be mentioned in the catalog title, so that similar resources containing the same type of data for the next time period (periodic updates) can be accommodated in the same catalog
– In exceptional cases, however, it can contain the time period, particularly for periodic surveys/censuses which contain a huge number of datasets/resources belonging to the same period/year
– E.g. Current Population Survey, Consumer Price Index, Variety wise Daily Market Prices Data, State wise Construction of Deep Tube wells over the years, etc.
• Description (Required): Provide a detailed description of the catalog
– An abstract stating the nature and purpose of the catalog
– Contains the names of the variables which are available in the datasets
– Can also contain the definitions of some variables

Page 8: Meta Data and Quality of Data for OGD Platform India

Metadata Elements for Catalogs

• Keywords (Required): A list of terms, separated by commas, describing and indicating the content of the catalog. Example: rainfall, weather, monthly statistics. Keywords help users discover your dataset; please include terms that would be used by both technical and non-technical users.
• Group Name: An optional field to provide a Group Name for multiple catalogs, in order to show that they may be presented as a group or a set.
• Sector & Sub Sector (Required): Choose the sector(s)/sub-sector(s) that most closely apply to your catalog.
• Asset Jurisdiction (Required): Identifies the exact location or area to which the catalog and its resources (datasets/apps) cater, viz. entire country, state/province, district, city, etc.
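The required catalog fields lend themselves to a simple pre-submission check. The following is a minimal sketch under stated assumptions: the dictionary keys (`title`, `description`, `keywords`, `sector`, `asset_jurisdiction`) and both helper functions are hypothetical names chosen for this example, not the platform's form fields or API.

```python
# Assumed keys for the fields marked "Required" in the catalog metadata.
REQUIRED_CATALOG_FIELDS = (
    "title",
    "description",
    "keywords",
    "sector",
    "asset_jurisdiction",
)


def parse_keywords(raw):
    """Split a comma-separated keyword string into clean terms."""
    return [term.strip() for term in raw.split(",") if term.strip()]


def missing_catalog_fields(metadata):
    """Return the required fields that are absent or empty."""
    return [
        name
        for name in REQUIRED_CATALOG_FIELDS
        if not str(metadata.get(name) or "").strip()
    ]


draft = {
    "title": "Annual Rainfall Data",
    "keywords": parse_keywords("rainfall, weather, monthly statistics"),
}
print(missing_catalog_fields(draft))
```

Group Name is left out of the check because it is optional.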

Page 9: Meta Data and Quality of Data for OGD Platform India

Example - Creation of catalog

Catalog Title:
• Incorrect: Company Master Data 2015 (contains a time frame, so if in future we want to add data under this catalog, e.g. Company Master Data for 2016, it would not be possible to upload it under this catalog)
• Correct: Company Master Data

Catalog Description:
• Incorrect: Get data of Company master data..?? (does not contain detailed information; the description should contain the names of the variables which are available in the datasets)
• Correct: Get data on master details of any company registered with the Registrar of Companies (RoC). The data contains various information such as Corporate Identification Number (CIN), Company Name, Company Status, Company Class, Company Category, Authorized Capital in INR, Paid-up Capital in INR, Date of Registration, Registered State, Registrar of Companies, Principal Business Activity, Registered Office Address and Sub Category.

Keywords:
• Incorrect: Company Master Data, ….?? (keywords are a list of terms describing and indicating the content of the catalog; all the possible search keywords should be included)
• Correct: Registered Companies, Company Master Data, Company Data, Indian Companies, Company, Company Details, Corporate Identification Number, CIN, Company Address
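The catalog-title rule in this example can be approximated with a small check that looks for a four-digit year. This is a rough sketch: the regular expression and the function name `years_in_title` are assumptions made for illustration, and a match should be read as a prompt for review rather than an error, since periodic surveys and censuses may legitimately carry a year.

```python
import re

# Assumption: a time frame shows up as a four-digit year from 1900 to 2099.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def years_in_title(title):
    """Return any four-digit years found in a title."""
    return YEAR_RE.findall(title)


for title in ("Company Master Data 2015", "Company Master Data"):
    found = years_in_title(title)
    status = f"review: contains {found}" if found else "ok"
    print(f"{title!r}: {status}")
```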

Page 10: Meta Data and Quality of Data for OGD Platform India

Metadata Elements for Resources

• Title (Required): A unique name for the resource
– Should be self-explanatory, viz. Consumer Price Index for <Month/Year> etc.
– The resource title should contain the time frame, so that no duplication occurs in future, e.g. Consumer Price Index from April-2000 to April-2015, Rainfall of the year 2012
• Access Method (Required): How the user is going to get the data: Upload a Dataset or Single Click Link to Dataset
• Category (Required): Whether it is a Dataset or an Application
• Reference URLs: These may include descriptions of the study design, instrumentation, implementation, limitations, and appropriate use of the dataset or tool. In the case of multiple documents or URLs, please delimit with commas or enter on separate lines.

Page 11: Meta Data and Quality of Data for OGD Platform India

Metadata Elements for Resources

If the Resource Category is Dataset:
• Granularity of Data: The time interval over which the data inside the dataset is collected/updated on a regular basis (one-time, annual, hourly, etc.)
• Frequency (Required): The time interval at which the dataset is published on the OGD Platform (one-time, annual, hourly, etc.)
• Access Type: The type of access, viz. Open, Priced, Registered Access or Restricted Access (G2G)

If the Resource Category is App:
• App Type (Required): The type of App being contributed, viz. Web App, Web Service, Mobile App, Web Map Service, RSS, APIs etc.

Page 12: Meta Data and Quality of Data for OGD Platform India

Metadata Elements for Resources

• Date Released: The release date of the Dataset/App.
• Note: Any further information the contributor/Chief Data Officer wishes to provide to the data consumer about the resource
– The resource note should contain proper explanations of any special characters/notations, such as *, #, NA etc., which are used in the datasets
– Other relevant information regarding the dataset should also be provided in the note section
– Information regarding figures in the data should also be provided, e.g. "Figures are in numbers", "Unit: (Rs./qtl.)"
– Footnotes available under a report should be part of the Resource Note
• NDSAP Policy Compliance: Indicates whether the dataset is in conformity with the National Data Sharing and Access Policy of the Govt. of India.

Page 13: Meta Data and Quality of Data for OGD Platform India

Example - Creation of Resource

Resource Title:
• Incorrect: Number of Registered Motor Vehicles (Transport & Non-Transport) in Delhi (the resource title should contain the time frame, so that no duplication occurs in future)
• Correct: Number of Registered Motor Vehicles (Transport & Non-Transport) in Delhi during 2009-2010

Resource Note:
• Incorrect: NIL (no note is given, although the dataset contains special notations like *, # etc., some cells contain NA, and other relevant information is also available for this particular dataset)
• Correct: Figures are in numbers; NA: Not available; $: Category-wise data not received; *: Included in cars; Totals are provisional, representing summation of available data

Resource Category:
• Incorrect: Application (it is a dataset, not an application)
• Correct: Dataset
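The resource-note rule (every special notation used in the data must be explained in the note) can be checked mechanically. The sketch below is an assumption-laden illustration: the set of notations (`*`, `#`, `$`, `NA`), the sample cells, and the function names are hypothetical, and "explained" is approximated as "the notation appears somewhere in the note text".

```python
# Assumed set of symbol notations to look for inside cells.
SYMBOLS = ("*", "#", "$")


def notations_used(cells):
    """Collect the special notations that occur in the given cell values."""
    found = set()
    for cell in cells:
        text = cell.strip()
        if text == "NA":
            found.add("NA")
        found.update(symbol for symbol in SYMBOLS if symbol in text)
    return found


def unexplained_notations(cells, note):
    """Return notations present in the data but not mentioned in the note."""
    return sorted(n for n in notations_used(cells) if n not in note)


# Made-up cell values for the sketch.
cells = ["1200", "NA", "35*", "$"]
good_note = (
    "Figures are in numbers; NA: Not available; "
    "$: Category-wise data not received; *: Included in cars"
)
print(unexplained_notations(cells, "NIL"))
print(unexplained_notations(cells, good_note))
```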

Page 14: Meta Data and Quality of Data for OGD Platform India

Quality of Datasets

• Data Compositeness/Completeness/Consistency
– Check for the constituent elements (variables) within the dataset
– The dataset should be well explained in terms of the variables present in it, through descriptive metadata
– The metadata should clearly describe the time period, units, definitions, frequency, data source, jurisdiction and notes on anything needing special mention in the dataset
– Time series data should be continuous in nature
• Data Coverage
– The dataset should be made available at the lowest possible level, to allow users to correctly describe the phenomena being measured
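The continuity requirement for time series can be tested with a short gap check. This is a minimal sketch assuming an annual series identified by integer years; `missing_years` is a hypothetical helper and the sample years are invented.

```python
def missing_years(years):
    """Return the years absent between the first and last year present."""
    present = set(years)
    if not present:
        return []
    return [
        year
        for year in range(min(present), max(present) + 1)
        if year not in present
    ]


# Invented example: an annual series with two gaps.
print(missing_years([2009, 2010, 2012, 2014]))
```

An empty result means the series is continuous over its own range; it says nothing about whether the range itself is the expected one.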

Page 15: Meta Data and Quality of Data for OGD Platform India

Quality of Datasets

• Standard process of “data cleansing”:
– Assign string, date, character and number types to the required fields
– Replace abbreviations and acronyms with full forms
– No special characters or blank spaces in the matrix (replace them with NA)
– Column headers should be self-explanatory
– Uniform font size, with no formulas or merged columns
– The dataset should be de-normalized, without any merged columns
– No formula or calculated column, such as a Total or Average of the available columns or rows, should appear in the dataset
– Above all, it must be in a machine-readable format, viz. CSV, XML, JSON, ODS, XLS etc.
– The file name should not contain special characters other than _ and -; no blank spaces should be present in the file name
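A few of these cleansing rules are easy to automate. The sketch below covers three of them under stated assumptions: a file name is taken to be ASCII letters, digits, `_` and `-` followed by a single extension; a blank cell is replaced with NA; and a formula is recognised by a leading `=` as in spreadsheet exports. All function names are hypothetical.

```python
import re

# Assumption: name part uses only letters, digits, "_" and "-",
# followed by one dot and an alphanumeric extension.
FILE_NAME_RE = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9]+")


def is_valid_file_name(name):
    """True if the file name has no blanks or other special characters."""
    return FILE_NAME_RE.fullmatch(name) is not None


def clean_cell(value):
    """Trim surrounding whitespace and replace a blank cell with NA."""
    text = value.strip()
    return text if text else "NA"


def has_formula(value):
    """True if the cell looks like a spreadsheet formula."""
    return value.lstrip().startswith("=")


print(is_valid_file_name("annual_rainfall_2012.csv"))
print(is_valid_file_name("annual rainfall (2012).csv"))
print([clean_cell(c) for c in ["42", "  ", ""]])
```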

Page 16: Meta Data and Quality of Data for OGD Platform India

THANK YOU