1 of 38

Data Mining in Ensembl with BioMart

2 of 38

Simple Text-based Search Engine

3 of 38

‘Mouse Gene’ Gives Us Results

4 of 38

A More Complex Query is Not as Useful

5 of 38

BioMart: Data Mining

• BioMart is a search engine that can find multiple terms and put them into a table format.

• For example: human gene IDs, chromosome and base pair position

• No programming required!

6 of 38

General or Specific Data Tables

• All the genes for one species

• Or… only genes on one specific region of a chromosome

• Or… genes on one region of a chromosome associated with a disease

7 of 38

BioMart Data Sets

• Ensembl genes

• Vega genes

• SNPs

• Markers

• Phenotypes

• Gene expression information

• Gene ontology

• Homology predictions

• Protein annotation

8 of 38

Web Interface

With BioMart, quickly extract gene-associated information from the Ensembl databases.

9 of 38

Information Flow

• Choose the species of interest (Dataset)

• Decide what you would like to know about the genes (Attributes): sequences, IDs, description…

• Narrow down to a smaller gene set (Filters): enter IDs, choose a region…
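
The three choices above can be pictured as one small record. The sketch below is only an illustration of that idea: the `MartQuery` class and its field values are hypothetical and are not part of BioMart itself.

```python
from dataclasses import dataclass, field


@dataclass
class MartQuery:
    """Hypothetical container mirroring the three choices on the slide."""
    dataset: str                                     # the species of interest
    attributes: list = field(default_factory=list)   # what we want to know
    filters: dict = field(default_factory=dict)      # what we already know

    def describe(self) -> str:
        # Render the query as a plain sentence for inspection.
        cols = ", ".join(self.attributes) or "(nothing chosen)"
        conds = ", ".join(f"{k}={v}" for k, v in self.filters.items())
        conds = conds or "(no filters: all genes)"
        return f"From {self.dataset}: show {cols} where {conds}"


query = MartQuery(
    dataset="Homo sapiens genes",
    attributes=["gene ID", "chromosome", "base pair position"],
    filters={"region": "a chosen chromosome region"},
)
print(query.describe())
```

Leaving `filters` empty corresponds to asking for every gene in the dataset.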

10 of 38

Web Interface

Three main stages: Dataset, Attributes and Filters.

Dataset: choose the species of interest.

Attributes: choose what information to view.

Filters: choose the gene set using what we know.

11 of 38

The First Step: Choose the Dataset

Homo sapiens genes are the default.

12 of 38

The Second Step: Attributes

Attributes are what we want to know about the genes.

Four output pages.

13 of 38

The SNP Attribute Page

Output variation information such as SNP reference ID and alleles.

14 of 38

Filters Allow Gene Selection

Choose the gene set by region, gene ID(s) or protein/domain type.

15 of 38

Export Sequence or Tables

Genes and attributes are exported as sequence (Fasta format) or tables.
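
For readers who have not seen Fasta format: each record is a header line starting with `>` followed by the sequence, usually wrapped to a fixed width. A minimal sketch, in which the helper `to_fasta`, the header text and the sequence are all made up for illustration (BioMart's own headers depend on the attributes chosen):

```python
def to_fasta(header, sequence, width=60):
    """Format one sequence as a Fasta record, wrapping lines at `width`."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# Placeholder header and an 80-base toy sequence, not real Ensembl data.
record = to_fasta("example_gene_id", "ATGC" * 20, width=60)
print(record, end="")
```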

16 of 38

Query:

• For all mouse genes on chromosome 10 that are protein coding, I would like to know the IDs in both Ensembl and MGI.

• In the query:

Attributes: what we want to know.

Filters: what we know
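
The web interface needs no programming, but the same query can also be written down as a BioMart XML document. The sketch below only builds that XML and a request URL; it does not contact any server. The internal dataset, filter and attribute names (`mmusculus_gene_ensembl`, `chromosome_name`, `biotype`, `mgi_symbol`, `mgi_id`) and the endpoint URL are assumptions that vary between releases and should be checked against the mart being used.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_query_xml(dataset, filters, attributes):
    """Build a BioMart-style XML query: filters = what we know,
    attributes = what we want to know."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",      # ask for a tab-separated table
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": value})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# All internal names below are assumed, not taken from the slides.
xml_query = build_query_xml(
    dataset="mmusculus_gene_ensembl",
    filters={"chromosome_name": "10", "biotype": "protein_coding"},
    attributes=["ensembl_gene_id", "mgi_symbol", "mgi_id"],
)

# Assumed endpoint; the XML travels in the `query` parameter.
url = "https://www.ensembl.org/biomart/martservice?" + urlencode({"query": xml_query})
print(xml_query)
```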

19 of 38

A Brief Example

Change dataset to mouse (Mus musculus).

20 of 38

A Brief Example

Dataset has changed.

21 of 38

Attributes (Output Options)

Click Attributes.

Attributes allow us to choose what we wish to know.

IDs are found in the ‘Features’ page.

Click on ‘GENE’.

22 of 38

Attributes (Output Options)

Default options selected: Ensembl Gene ID and Transcript ID.

Ensembl Gene ID is selected.

23 of 38

Attributes (Output Options)

Scroll down to select MGI symbol. Also select the accession number.

‘Markersymbol ID’ will give us the MGI ID.

24 of 38

The Results Table

‘Results’ gives us gene IDs for all mouse genes in the Ensembl database.

25 of 38

Select a Smaller Gene Set

Select ‘Filters’

Expand the REGION panel

Instead of all mouse genes, select protein coding genes on chromosome 10.

26 of 38

Select Genes on Chromosome 10

Select chromosome 10.

Instead of all mouse genes, select protein coding genes on chromosome 10.

27 of 38

Select Protein Coding Genes

Filters are set to chromosome 10 and protein-coding genes. Genes must meet BOTH criteria to be in the result table.

Gene type: protein coding
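
The BOTH rule means that filters combine with a logical AND. A toy illustration with invented records (the gene names and values below are placeholders, not Ensembl data):

```python
# Three made-up genes; only one satisfies both filters.
genes = [
    {"id": "gene_A", "chromosome": "10", "gene_type": "protein coding"},
    {"id": "gene_B", "chromosome": "10", "gene_type": "pseudogene"},
    {"id": "gene_C", "chromosome": "2", "gene_type": "protein coding"},
]

filters = {"chromosome": "10", "gene_type": "protein coding"}


def passes_all(gene, filters):
    """A gene is kept only if it matches every filter (logical AND)."""
    return all(gene.get(key) == value for key, value in filters.items())


selected = [g["id"] for g in genes if passes_all(g, filters)]
print(selected)  # ['gene_A']
```

Adding a further filter can therefore only shrink the gene set, never enlarge it.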

28 of 38

Results (Preview)

This is a preview. If you are happy with the table, click ‘Go’.

For the full result table: Go

29 of 38

Full Result Table

Columns: Ensembl Gene ID, Transcript ID, MGI symbol, MGI Accession Number
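
Once saved, a table with these four columns is easy to read back in. The sketch below assumes the table was exported as tab-separated text with a header row; every value in it is a placeholder, not a real Ensembl or MGI identifier.

```python
import csv
import io

# Placeholder rows standing in for an exported result table.
tsv_text = (
    "Ensembl Gene ID\tTranscript ID\tMGI symbol\tMGI Accession Number\n"
    "GENE_ID_1\tTRANSCRIPT_ID_1\tSymbol1\tMGI_ACC_1\n"
    "GENE_ID_1\tTRANSCRIPT_ID_2\tSymbol1\tMGI_ACC_1\n"
    "GENE_ID_2\tTRANSCRIPT_ID_3\tSymbol2\tMGI_ACC_2\n"
)

rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

# A gene with several transcripts appears on several rows,
# so collapse the table to one MGI accession per Ensembl gene.
gene_to_mgi = {}
for row in rows:
    gene_to_mgi[row["Ensembl Gene ID"]] = row["MGI Accession Number"]

print(len(rows), "rows;", len(gene_to_mgi), "genes")
```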

30 of 38

Original Query:

• For all mouse genes on chromosome 10 that are protein coding, I would like to know the IDs in both Ensembl and MGI.

• In the query:

Attributes: columns in the Result Table

Filters: what we know

31 of 38

Other Export Options (Attributes)

• Sequences: UTRs, flanking sequences, cDNA and peptides, etc.

• Gene IDs from Ensembl and external sources (MGI, Entrez, etc.)

• Microarray data

• Protein functions/descriptions (InterPro, GO)

• Orthologous gene sets

• SNP/variation data

32 of 38

Central Server

www.biomart.org

33 of 38

WormBase

34 of 38

HapMap

Population frequencies

Inter-population comparisons

Gene annotation

35 of 38

DictyBase

36 of 38

UniProt, MSD

37 of 38

GRAMENE

Rice, Maize, Arabidopsis genomes…

38 of 38

How to Get There

• Either www.biomart.org/biomart/martview

• Or click on ‘BioMart’ from Ensembl

Q&A

Thanks: Arek Kasprzyk, Benoît Ballester, Syed Haider, Richard Holland, Damian Smedley