Data Mining in Ensembl with EnsMart

Posted: 20-Dec-2015


2 of 24

Possible queries…

• All genes from a candidate region

• Genes with a particular protein domain

• Members of a protein family

• Genes associated with SNPs

3 of 24

Specific queries

• Disease-related genes between markers D10S255 and D10S259

• Transmembrane proteins with an Ig-MHC domain (IPR003006) on chromosome 2

• Genes with associated coding SNPs on chromosomal band 5q35.3

• Mouse homologues for human disease genes

4 of 24

More specific queries

• Human genes whose upstream regions are conserved with respect to mouse

• Upstream sequence for all Ensembl genes mapped to the U95A chip (similarly, complete genomic annotation of MG_U74)

• Genomic location and description of all mouse, rat and fugu homologues of all human genes that have transmembrane domains, are expressed in the cardiovascular system and have non-synonymous SNPs

5 of 24

EnsMart – vertical and horizontal data integration

• Ensembl Genes

• EST Genes

• Vega Genes

• SNPs

Species: Human, Mouse, Rat, Zebrafish, Anopheles, Fugu

6 of 24

Ensembl data sets

• Genes

• EST

• Markers

• Diseases

• Protein Annotation

• SNPs

• Homology

• Expression

7 of 24

EnsMart

• Data retrieval tool

• Query builder interface

• Gene or SNP lists

• Associated features or sequences

• Various output formats

8 of 24

Information flow

start → filter → output

[Diagram labels, in slide order]

• SPECIES, FOCUS

• REGION, SNP, PROTEIN, HOMOLOGY, GENE, EXPRESSION

• REFSEQ, INTERPRO, GO, SWISSPROT, EMBL, AFFY

• REGION, SNP, PROTEIN, HOMOLOGY, GENE, EXPRESSION

• FASTA, FILE, EXCEL, TEXT, GTF, HTML
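
The start, filter, output flow can be pictured as a small pipeline: pick a species and focus, restrict the records, then choose what to return. The following is only a rough sketch in Python, not EnsMart's own interface: the `Query` class, its method names, the record fields and the gene identifiers are all hypothetical, and the data is made up. It uses one of the specific queries from the earlier slide (transmembrane proteins with domain IPR003006 on chromosome 2) as the example.

```python
from dataclasses import dataclass, field

# Toy records standing in for a gene data set; every value is invented.
DATASETS = {
    "gene": [
        {"gene_id": "GENE_A", "species": "human", "chromosome": "2",
         "interpro": {"IPR003006"}, "transmembrane": True},
        {"gene_id": "GENE_B", "species": "human", "chromosome": "2",
         "interpro": {"IPR000001"}, "transmembrane": True},
        {"gene_id": "GENE_C", "species": "human", "chromosome": "5",
         "interpro": {"IPR003006"}, "transmembrane": False},
        {"gene_id": "GENE_D", "species": "mouse", "chromosome": "2",
         "interpro": {"IPR003006"}, "transmembrane": True},
    ],
}


@dataclass
class Query:
    """Hypothetical three-stage query: start, filter, output."""
    species: str                                     # start: which species
    focus: str = "gene"                              # start: which data set
    filters: list = field(default_factory=list)      # filter stage
    attributes: list = field(default_factory=list)   # output stage

    def restrict(self, predicate):
        """Add one filter; a record must satisfy every filter to pass."""
        self.filters.append(predicate)
        return self

    def select(self, *names):
        """Choose which attributes appear in each output row."""
        self.attributes.extend(names)
        return self

    def run(self, datasets):
        """Apply start, filter and output stages to the chosen data set."""
        rows = []
        for record in datasets[self.focus]:
            if record["species"] != self.species:
                continue
            if all(predicate(record) for predicate in self.filters):
                rows.append(tuple(record[name] for name in self.attributes))
        return rows


query = (
    Query(species="human", focus="gene")
    .restrict(lambda g: g["chromosome"] == "2")
    .restrict(lambda g: "IPR003006" in g["interpro"])
    .restrict(lambda g: g["transmembrane"])
    .select("gene_id", "chromosome")
)
rows = query.run(DATASETS)
print(rows)
```

Each `restrict` call narrows the result set, so adding a filter can only shrink the list, which mirrors the "restrict your query" step of the interface.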

9 of 24

Species and focus

10 of 24

Restrict your query

11 of 24

Restrict your query

12 of 24

Select output options

13 of 24

Select output options

14 of 24

Output formats

HTML
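
To illustrate what "output formats" means in practice, the sketch below renders the same result rows two ways: as tab-separated text and as an HTML table. It is an assumption-laden illustration rather than EnsMart's own exporter: the function names `to_text` and `to_html`, the column names and the rows are all invented.

```python
import csv
import io
from html import escape

# Invented result rows; a real query would supply these.
header = ["gene_id", "chromosome", "description"]
rows = [
    ("GENE_A", "2", "example <membrane> protein"),
    ("GENE_B", "5", "another example"),
]


def to_text(header, rows):
    """Render rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def to_html(header, rows):
    """Render rows as a minimal HTML table, escaping cell contents."""
    head = "".join(f"<th>{escape(name)}</th>" for name in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


text_output = to_text(header, rows)
html_output = to_html(header, rows)
print(text_output)
print(html_output)
```

Keeping the rows separate from their rendering means one query result can be exported in several formats without re-running the query.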

15 of 24

Obtaining sequences

16 of 24

Ensembl core database

• Normalised

• Each data point stored only once

• Quick updates

• Minimal storage requirements

But:

• Many tables

• Many joins for complicated queries

• Slow for data mining questions

17 of 24

Mart database

• De-normalised

• Tables with ‘redundant’ information

• Query-optimised

• Fast and flexible

• Ideal for data mining
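
The trade-off between the two designs can be shown with a toy example. The sketch below uses Python's built-in `sqlite3` module; the table names (`gene`, `transcript`, `protein_domain`, `gene_mart`), columns and rows are hypothetical and far simpler than the real Ensembl core or mart schemas. The normalised layout needs two joins to answer a question that the de-normalised table answers from a single table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalised layout: each fact is stored once, in its own table.
cur.executescript("""
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT, chromosome TEXT);
CREATE TABLE transcript (transcript_id INTEGER PRIMARY KEY, gene_id INTEGER);
CREATE TABLE protein_domain (transcript_id INTEGER, interpro_id TEXT);

INSERT INTO gene VALUES (1, 'GENE_A', '2'), (2, 'GENE_B', '2'), (3, 'GENE_C', '5');
INSERT INTO transcript VALUES (10, 1), (20, 2), (30, 3);
INSERT INTO protein_domain VALUES
    (10, 'IPR003006'), (20, 'IPR000001'), (30, 'IPR003006');
""")

# Answering "genes on chromosome 2 with domain IPR003006" takes two joins.
normalised_sql = """
SELECT g.name
FROM gene g
JOIN transcript t ON t.gene_id = g.gene_id
JOIN protein_domain d ON d.transcript_id = t.transcript_id
WHERE g.chromosome = '2' AND d.interpro_id = 'IPR003006'
ORDER BY g.name
"""
normalised_rows = cur.execute(normalised_sql).fetchall()

# De-normalised layout: do the joins once, up front, and store the result
# as one wide table. Gene details are repeated for every domain row.
cur.execute("""
CREATE TABLE gene_mart AS
SELECT g.gene_id, g.name, g.chromosome, d.interpro_id
FROM gene g
JOIN transcript t ON t.gene_id = g.gene_id
JOIN protein_domain d ON d.transcript_id = t.transcript_id
""")

# The same question now reads from a single table, with no joins.
mart_sql = """
SELECT name FROM gene_mart
WHERE chromosome = '2' AND interpro_id = 'IPR003006'
ORDER BY name
"""
mart_rows = cur.execute(mart_sql).fetchall()

print(normalised_rows, mart_rows)
conn.close()
```

The wide table costs extra storage and must be rebuilt when the source tables change, which is the price paid for simpler and faster read queries.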