Meta Data and Quality of Data for OGD Platform India
Transcript of Meta Data and Quality of Data for OGD Platform India
Open Government Data Platform India
(https://data.gov.in)
Meta Data and Quality of Data
By: Sunil Babbar, Scientist-C, NIC
Data Contributors and Their Role
• Nominated by the Chief Data Officer
• Coordinate and identify datasets that can be contributed
• Prepare the datasets
  – Getting them cleaned
  – Preparing metadata for the datasets in the predefined format
  – Ensuring the quality and correctness of the datasets of his/her unit/division
• Contribute catalogs/resources (datasets) through the pre-defined workflow (Data Contributor → Chief Data Officer (CDO) for review and publish → PMU to publish on the OGD Platform)
Resources (Datasets / Apps)
• A data set (or dataset) is a collection of data
• A data set corresponds to the contents of a single table or statistical data matrix, where every column represents a particular variable and each row corresponds to a given member of the data set in question
• Open Data Formats: CSV, XLS, ODF, XML/RDF, JSON, RSS/Atom, KML/GML
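To make the table view of a dataset concrete, here is a minimal sketch that serializes the same small table to two of the open formats listed above, CSV and JSON. The column names and values are placeholders invented for illustration, not data from the OGD Platform.

```python
import csv
import io
import json

# Placeholder rows: each dict is one member (row), each key one variable (column).
rows = [
    {"state": "State A", "year": 2012, "rainfall_mm": 100},
    {"state": "State B", "year": 2012, "rainfall_mm": 200},
]


def to_csv(records):
    """Serialize records as CSV: one header line, then one line per row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialize the same records as a JSON array of objects."""
    return json.dumps(records, indent=2)


csv_text = to_csv(rows)
json_text = to_json(rows)
```

Both outputs carry the same variables and members; only the container format differs.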
Catalog
• A catalog is a grouping of similar resources (datasets/apps)
• A catalog represents a collection of resources that you group together
• It acts like a directory of information about resources
Benefits of a Catalog
• Makes data access easier for users who are primarily interested in a particular kind of data
• Groups resources with the same theme/subject, so users can find a specific dataset/resource easily
• Reduces effort for Ministries/Departments, who can upload the same set of resources or update a dataset for a new period without writing the metadata again and again
• Gives users easier navigation and searching for relevant data
Catalog Formation
• Catalog with the same resource for different time periods (Annual, Quarterly, Monthly, Weekly and Daily). E.g. Annual Rainfall Data
• Catalog with the same resource for different jurisdictions (India, States, Districts, Blocks, Villages). E.g. States/UTs-wise Forest and Tree Cover
• Catalog with the same resource for different categories (Scheduled Caste, Scheduled Tribe, General, Religion, etc.). E.g. District-wise crimes committed against Scheduled Caste
• Catalog with similar resources under the same report: resources of a similar nature from the same report/survey can be grouped under the same catalog. E.g. Primary Census Abstract 2011 - India and States
MetaData
• Metadata is the information that describes the data
  – What the data is (about the data)
  – Data source
  – Who created it
  – When it was created
  – Etc.
• Metadata allows users to trace the data and know its origin and quality
Metadata Elements for Catalogs
• Title (Required): A unique name for the catalog (group of resources)
  – Should contain general terms that describe the essential properties/characteristics of the datasets/resources
  – Should be in plain English and include sufficient detail to facilitate search and discovery
  – The time period should normally not be mentioned in the catalog title, so that similar resources containing the same type of data for the next time period (periodic updates) can be accommodated in the same catalog
  – In exceptional cases the title can contain the time period, particularly for periodic surveys/censuses that contain a huge number of datasets/resources belonging to the same period/year
  – E.g. Current Population Survey, Consumer Price Index, Variety wise Daily Market Prices Data, State wise Construction of Deep Tube wells over the years, etc.
• Description (Required): A detailed description of the catalog
  – An abstract stating the nature and purpose of the catalog
  – Contains the names of the variables available in the datasets
  – Can also contain the definitions of some variables
Metadata Elements for Catalogs
• Keywords (Required): A list of terms, separated by commas, that describe and indicate the content of the catalog. Example: rainfall, weather, monthly statistics. Keywords help users discover your dataset; please include terms that both technical and non-technical users would use.
• Group Name: An optional field that gives a group name to multiple catalogs, to show that they may be presented as a group or a set.
• Sector & Sub Sector (Required): Choose the sector(s)/sub-sector(s) that most closely apply to your catalog.
• Asset Jurisdiction (Required): Identifies the exact location or area that the catalog and its resources (datasets/apps) cater to, viz. entire country, state/province, district, city, etc.
Example - Creation of catalog
Catalog Title:
• Company Master Data 2015 (Incorrect - contains a time frame, so data for a later period, e.g. Company Master Data for 2016, could not be uploaded under this catalog in future)
• Company Master Data (Correct)
Catalog Description:
• Get data of Company master data..?? (Incorrect - does not contain detailed information; the description should contain the names of the variables available in the datasets)
• Get data on master details of any company registered with Registrar of Companies (RoC). Data contains various information like Corporate Identification Number (CIN), Company Name, Company Status, Company Class, Company Category, Authorized Capital in INR, Paid-up Capital in INR, Date of Registration, Registered State, Registrar of Companies, Principal Business Activity, Registered Office Address and Sub Category. (Correct)
Keywords:
• Company Master Data, ….?? (Incorrect - keywords are a list of terms describing and indicating the content of the catalog; all possible search keywords should be included)
• Registered Companies, Company master Data, Company Data, Indian Companies, Company, Company Details, Corporate Identification Number, CIN, Company Address (Correct)
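The catalog rules illustrated in this example can be approximated with a simple pre-submission check. The sketch below is illustrative only: the function name `check_catalog_metadata` and the thresholds (15 words, 3 keywords) are assumptions, not limits defined by the OGD Platform. A year in the title is reported as a warning rather than an error, since periodic surveys/censuses may legitimately carry one.

```python
import re

# Matches a four-digit year such as 2015 (assumed range: 1900-2099).
YEAR_PATTERN = re.compile(r"\b(19|20)\d{2}\b")


def check_catalog_metadata(title, description, keywords,
                           min_description_words=15, min_keywords=3):
    """Return a list of warnings for a catalog's title, description and keywords.

    The thresholds are hypothetical defaults chosen for this sketch.
    """
    warnings = []
    if YEAR_PATTERN.search(title):
        warnings.append("title contains a time period")
    if len(description.split()) < min_description_words:
        warnings.append("description lacks detail")
    terms = [term.strip() for term in keywords.split(",") if term.strip()]
    if len(terms) < min_keywords:
        warnings.append("too few keywords")
    return warnings


bad = check_catalog_metadata(
    "Company Master Data 2015",
    "Get data of Company master data",
    "Company Master Data",
)
good = check_catalog_metadata(
    "Company Master Data",
    "Get data on master details of any company registered with Registrar of "
    "Companies (RoC). Data contains Corporate Identification Number (CIN), "
    "Company Name and Company Status.",
    "Registered Companies, Company Data, Indian Companies, CIN",
)
```

Here `bad` collects all three warnings and `good` is empty. A word count is only a rough proxy for a description that actually names the variables, so a human review is still needed.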
Metadata Elements for Resources
• Title (Required): A unique name for the resource
  – Should be self-explanatory, viz. Consumer Price Index for <Month/Year>, etc.
  – Should contain the time frame, so that no duplication occurs in future, e.g. Consumer Price Index from April-2000 to April-2015, Rainfall of the year 2012
• Access Method (Required): How the user is going to get the data: Upload a Dataset or Single Click Link to Dataset
• Category (Required): Whether it is a Dataset or an Application
• Reference URLs: May include descriptions of the study design, instrumentation, implementation, limitations, and appropriate use of the dataset or tool. For multiple documents or URLs, delimit with commas or enter them on separate lines.
Metadata Elements for Resources
If Resource Category is Dataset:
• Granularity of Data: The time interval over which the data inside the dataset is collected/updated on a regular basis (one-time, annual, hourly, etc.)
• Frequency (Required): The time interval at which the dataset is published on the OGD Platform (one-time, annual, hourly, etc.)
• Access Type: The type of access, viz. Open, Priced, Registered Access or Restricted Access (G2G)
If Resource Category is App:
• App Type (Required): The type of app being contributed, viz. Web App, Web Service, Mobile App, Web Map Service, RSS, APIs, etc.
Metadata Elements for Resources
• Date Released: The release date of the dataset/app.
• Note: Any further information the contributor/Chief Data Officer wishes to provide to the data consumer about the resource
  – Should properly explain any special characters/notations such as *, #, NA, etc. used in the datasets
  – Should include other relevant information regarding the dataset
  – Should state what the figures represent, e.g. "Figures are in numbers", "Unit: (Rs./qtl.)"
  – Footnotes available under a report should be part of the Resource Note
• NDSAP Policy Compliance: Indicates whether the dataset conforms to the National Data Sharing and Access Policy of the Govt. of India.
Example - Creation of Resource
Resource Title:
• Number of Registered Motor Vehicles (Transport & Non-Transport) in Delhi (Incorrect - the resource title should contain the time frame, so that no duplication occurs in future)
• Number of Registered Motor Vehicles (Transport & Non-Transport) in Delhi during 2009-2010 (Correct)
Resource Note:
• NIL (Incorrect - there is no note, but the dataset contains special notations such as *, #, etc., some cells contain NA, and other relevant information is also available for this dataset)
• Figures are in numbers; NA: Not available; $: Category-wise data not received; *: Included in cars; Totals are provisional representing summation of available data (Correct)
Resource Category:
• Application (Incorrect - it is a dataset, not an application)
• Dataset (Correct)
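The two resource checks in this example (a time frame in the title, and a note that explains every special notation) can be sketched as below. The function names and the default list of notations are assumptions made for illustration; they are not part of the OGD Platform.

```python
import re

# Matches a four-digit year such as 2009 (assumed range: 1900-2099).
TIME_FRAME = re.compile(r"\b(19|20)\d{2}\b")


def title_has_time_frame(title):
    """Return True if the resource title mentions at least one year."""
    return bool(TIME_FRAME.search(title))


def unexplained_notations(cell_values, note, notations=("*", "#", "$", "NA")):
    """Return notations that occur in the data but are not mentioned in the note.

    Symbols are matched anywhere in a cell; word-like notations such as NA
    are matched only when they are the whole cell value.
    """
    used = set()
    for value in cell_values:
        text = str(value).strip()
        for mark in notations:
            found = (text == mark) if mark.isalnum() else (mark in text)
            if found:
                used.add(mark)
    return sorted(mark for mark in used if mark not in note)


cells = ["12*", "NA", "$", 40]  # placeholder cell values for illustration
missing = unexplained_notations(cells, "NIL")
explained = unexplained_notations(
    cells,
    "Figures are in numbers; NA: Not available; "
    "$:Category-wise data not received; *:Included in cars",
)
```

With the note "NIL", all three notations used in the cells are reported as unexplained; with the corrected note from the example, the result is empty. The check only confirms that a notation is mentioned in the note, not that the explanation is adequate.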
Quality of Datasets
• Data Compositeness/Completeness/Consistency
  – Check for the constituent elements (variables) within the dataset
  – The dataset should be well explained, in terms of the variables present in it, through descriptive metadata
  – The metadata should clearly describe the time period, units, definitions, frequency, data source, jurisdiction and any notes needing special mention in the dataset
  – Time series data should be continuous in nature
• Data Coverage
  – Datasets should be made available at the lowest possible level to allow users to correctly describe the phenomena being measured
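The requirement that time series data be continuous can be checked mechanically. The sketch below assumes annual granularity and a hypothetical helper name, `missing_years`; other granularities (monthly, daily) would need a different step.

```python
def missing_years(years):
    """Return the years absent between the earliest and latest year present."""
    present = set(years)
    if not present:
        return []
    return [
        year
        for year in range(min(present), max(present) + 1)
        if year not in present
    ]


gaps = missing_years([2009, 2010, 2012, 2013])
```

For the series above, `gaps` is `[2011]`, signalling that the dataset should either be completed or the gap explained in the resource note.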
Quality of Datasets
• Standard process of “data cleansing”:
  – Assign string, date, character and number types to the required fields
  – Replace abbreviations and acronyms with their full forms
  – No special characters or blank spaces in the matrix (blanks replaced with NA)
  – Column headers should be self-explanatory
  – Uniform font size, with no formulas and no merged columns
  – Datasets should be de-normalized, without any merged columns
  – No formula or calculated column, such as a Total or Average of the available columns or rows, should appear in the dataset
  – Above all, the dataset must be in a machine-readable format, viz. CSV, XML, JSON, ODS, XLS, etc.
  – File names should not contain special characters except _ and -, and should not contain blank spaces
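Two of the cleansing rules above lend themselves to a quick automated check: the file-name rule and the replacement of blank cells with NA. This is a minimal sketch with hypothetical function names; allowing a single dot before the file extension is an assumption, since the rule itself mentions only _ and -.

```python
import re

# Letters, digits, _ and - only, plus one optional extension such as ".csv"
# (the extension dot is an assumption of this sketch).
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)?")


def is_valid_file_name(name):
    """Return True if the name has no blank spaces or disallowed characters."""
    return SAFE_NAME.fullmatch(name) is not None


def fill_blanks(rows, placeholder="NA"):
    """Trim each cell and replace empty or whitespace-only cells with NA."""
    return [
        [cell.strip() if cell.strip() else placeholder for cell in row]
        for row in rows
    ]


cleaned = fill_blanks([["Delhi", " ", "12"], ["", "7", "3"]])
```

`cleaned` becomes `[["Delhi", "NA", "12"], ["NA", "7", "3"]]`. Merged columns and embedded formulas cannot be detected from plain CSV text, so those rules still need to be checked in the source spreadsheet.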
THANK YOU