Praktikum Information Integration

Information Integration
Assignment 3: Consistency of
Annotations (NCBI – UniProtKB)
Ulf Leser
Overview
• Reuse all existing data
• We want to integrate and compare functional annotation
from another data source – the UniProtKB
• UniProtKB can be accessed through several interfaces
– Your choice
• We don’t want to produce a DDoS – restrict comparison to
a small region of a chromosome
• Therefore, we need to clean MAP_LOCATION information
– Finally :-)
Ulf Leser: Information Integration, Praktikum, WS 2008/2009
Task 1: Data Cleansing MAP_LOCATION
• So far, we left MAP_LOCATION as it was, though we have seen many
problems
• We go for a quick-and-dirty solution to keep things simple
– Set the field to NULL if the value is “-”
– Remove all parts after a “|”
– Remove all parts after a “;”
– Remove all parts after a “-”
– Remove all parts after a “ ” (blank)
• Perform the same changes with CHROMOSOME
– We leave the few inconsistencies between CHROMOSOME and
MAP_LOCATION (for now)
• Produces 1:1 relationship between a gene and a chromosomal region
– In varying levels of granularity
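The cleansing rules above can be sketched as a small Python function. This is only one possible reading of the rules: the function name is hypothetical, and trimming surrounding whitespace and mapping empty strings to NULL are assumptions that the slide does not state.

```python
import re
from typing import Optional

# Characters after which everything is dropped: "|", ";", "-" and blank.
_CUT = re.compile(r"[|;\- ]")

def clean_map_location(value: Optional[str]) -> Optional[str]:
    """Apply the quick-and-dirty rules to one MAP_LOCATION or CHROMOSOME value."""
    if value is None:
        return None
    value = value.strip()  # assumption: surrounding whitespace carries no meaning
    if value in ("", "-"):
        return None  # "-" becomes NULL; treating "" the same way is an assumption
    # Keep only the part before the first separator.
    return _CUT.split(value, maxsplit=1)[0] or None
```

For example, `clean_map_location("21q22.1-q22.2")` keeps only `21q22.1`, so a range collapses to its first band.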
Task 2: NCBI – UniProtKB Mapping ??? VERSION!
• We need a mapping from GENE_ID (NCBI) to UniProtKB-ID (EBI)
– This is a little complicated
– GENE_ID 1:N PROTEIN_ACCESSION 1:N UniProtKB-ID
• Step 1
– Create a 1:N relationship between NCBI genes and NCBI
PROTEIN_ACCESSION
• This should have been done already in assignment 2
– Remove from PROTEIN_ACCESSION the “version” part
• Everything after a “.”
• Step 2
– Download the file gene_refseq_uniprot_collab from the NCBI web site
– Upload data into a database table (2 columns)
• Check that many PROTEIN_ACCESSIONs appear in the first column of
this table
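Both steps might look as follows in Python with SQLite. This is a minimal sketch under assumptions: the table name `refseq_uniprot` and its column names are hypothetical, and the file is assumed to be tab-separated with comment lines starting with “#”. Stripping the version from the first column is harmless if the file carries none.

```python
import sqlite3
from typing import Iterable

def strip_version(accession: str) -> str:
    """Drop everything after the first '.', e.g. 'NP_000005.2' -> 'NP_000005'."""
    return accession.split(".", 1)[0]

def load_collab(conn: sqlite3.Connection, lines: Iterable[str]) -> int:
    """Load the two-column mapping into a (hypothetical) table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS refseq_uniprot ("
        "protein_accession TEXT, uniprot_id TEXT)"
    )
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and the assumed '#' header line
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip malformed lines instead of failing
        rows.append((strip_version(fields[0]), fields[1]))
    conn.executemany("INSERT INTO refseq_uniprot VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)
```

The check in Step 2 then reduces to joining this table with your own PROTEIN_ACCESSION table and counting the matches.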
Task 3: Access UniProtKB
• Query all proteins whose genes are located in 21q22.1 (exact match)
– Should be around 50
• You may use one of three ways (choose now! We want diversity!)
– Option 1: Use HTTP and parse flatfile
• See http://www.ebi.ac.uk/Tools/webservices/services/dbfetch_rest
– Option 2: Use HTTP and parse XML
• See http://www.ebi.ac.uk/Tools/webservices/services/dbfetch_rest
– Option 3: Use a Java library and parse nothing
• See: http://www.ebi.ac.uk/uniprot/remotingAPI/
• See also: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/24/10/1321
• Download/access/extract the following information per protein
– Taxonomy ID(s)
– All functional annotations with GO terms
• Update your database schema to add this information
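One possible schema extension is sketched below with SQLite. All table and column names are hypothetical and your existing schema may call for a different design. The sketch only illustrates that taxonomy IDs and GO terms are both N:M relationships to a UniProtKB-ID.

```python
import sqlite3
from typing import Iterable

# Hypothetical tables: one row per (protein, taxon) and per (protein, GO term).
SCHEMA = """
CREATE TABLE IF NOT EXISTS uniprot_taxonomy (
    uniprot_id TEXT    NOT NULL,
    tax_id     INTEGER NOT NULL,
    PRIMARY KEY (uniprot_id, tax_id)
);
CREATE TABLE IF NOT EXISTS uniprot_go (
    uniprot_id TEXT NOT NULL,
    go_id      TEXT NOT NULL,
    PRIMARY KEY (uniprot_id, go_id)
);
"""

def create_uniprot_tables(conn: sqlite3.Connection) -> None:
    conn.executescript(SCHEMA)

def store_annotations(conn: sqlite3.Connection, uniprot_id: str,
                      tax_ids: Iterable[int], go_ids: Iterable[str]) -> None:
    """Insert annotations for one protein; duplicates are silently ignored."""
    conn.executemany("INSERT OR IGNORE INTO uniprot_taxonomy VALUES (?, ?)",
                     [(uniprot_id, t) for t in tax_ids])
    conn.executemany("INSERT OR IGNORE INTO uniprot_go VALUES (?, ?)",
                     [(uniprot_id, g) for g in go_ids])
    conn.commit()
```

The composite primary keys make repeated downloads of the same protein idempotent.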
Where is the data?
• XML
– <dbReference type="NCBI Taxonomy" key="2" id="10116"/>
– <dbReference type="Go" key="32" id="GO:0004866">
• Flatfile
– OX   NCBI_TaxID=10116;
– DR   GO; GO:0004866; F:endopeptidase inhibitor activity;…
• API
– getNcbiTaxonomyIds()
– getGoTerms()
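For the flatfile and XML options, extraction could be sketched as follows. The function names are hypothetical, and the code assumes that entries look like the fragments shown above. It compares only local tag names, because real UniProt XML is namespaced, and it accepts both "Go" and "GO" as the reference type.

```python
import re
import xml.etree.ElementTree as ET
from typing import List, Tuple

_OX = re.compile(r"^OX\s+NCBI_TaxID=(\d+)", re.MULTILINE)
_DR_GO = re.compile(r"^DR\s+GO;\s*(GO:\d+);", re.MULTILINE)

def parse_flatfile(text: str) -> Tuple[List[int], List[str]]:
    """Return (taxonomy IDs, GO IDs) found in OX and DR lines of a flatfile entry."""
    tax_ids = [int(t) for t in _OX.findall(text)]
    go_ids = _DR_GO.findall(text)
    return tax_ids, go_ids

def parse_xml(text: str) -> Tuple[List[int], List[str]]:
    """Return (taxonomy IDs, GO IDs) found in dbReference elements of an XML entry."""
    tax_ids: List[int] = []
    go_ids: List[str] = []
    for elem in ET.fromstring(text).iter():
        # Strip an optional '{namespace}' prefix before comparing the tag name.
        if elem.tag.rsplit("}", 1)[-1] != "dbReference":
            continue
        ref_type = elem.get("type", "")
        if ref_type == "NCBI Taxonomy":
            tax_ids.append(int(elem.get("id")))
        elif ref_type.upper() == "GO":
            go_ids.append(elem.get("id"))
    return tax_ids, go_ids
```

Both functions return the same shape of result, so the rest of the program does not need to know which access method was used.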
Which method?
Group  Method
1      API
2      API
3      Flatfile
4      XML
5      Flatfile
6      API
7      XML
10     Flatfile
11     API
12     XML
13     API
14     Flatfile
15     XML
16     API
Task 4: Compare Functional Annotation
• Knowledge about the function of a gene or protein is often
highly diverging among data sources
• Let’s see
Task 5: Queries
• Answer the following queries
1. How many different MAP_LOCATIONS remain?
2. How many genes are located on each chromosome?
3. Compute a frequency table: How many NCBI genes are connected
to 1,2,… UNIPROTKB-IDs?
4. How many NCBI genes have a TAX_ID attached that is different
from one of the TAX_IDs of at least one associated UniProtKB-ID?
5. How many UniProtKB – GO associations did you collect for how
many UniProtKB-IDs for how many NCBI genes (three numbers)?
6. How many gene-GO associations exist in the NCBI data that are
not in the UniProtKB data?
7. How many protein-GO associations exist in the UniProtKB data
that are not in the NCBI data?
8. How many NCBI genes have at least one diverging annotation?
– That is, a GO term in NCBI that is not in UniProt, or vice versa
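The logic behind queries 3 and 8 can be sketched in Python on in-memory dictionaries. In practice you would probably express it in SQL. The function and parameter names are hypothetical. The sketch also rests on one interpretation of query 8: a gene's UniProt annotation is taken as the union of the GO terms of all its associated proteins, and only genes with a UniProtKB mapping are considered.

```python
from collections import Counter
from typing import Dict, Set

def frequency_table(gene_to_uniprot: Dict[int, Set[str]]) -> Counter:
    """Query 3: map 'number of UniProtKB-IDs' to 'number of genes with that many'."""
    return Counter(len(ids) for ids in gene_to_uniprot.values())

def diverging_genes(ncbi_go: Dict[int, Set[str]],
                    uniprot_go: Dict[str, Set[str]],
                    gene_to_uniprot: Dict[int, Set[str]]) -> Set[int]:
    """Query 8 (one interpretation): genes whose NCBI GO set differs from the
    union of the GO sets of their associated UniProtKB proteins."""
    result = set()
    for gene, proteins in gene_to_uniprot.items():
        from_uniprot: Set[str] = set()
        for protein in proteins:
            from_uniprot |= uniprot_go.get(protein, set())
        if ncbi_go.get(gene, set()) != from_uniprot:
            result.add(gene)
    return result
```

Queries 6 and 7 follow the same pattern with the two one-sided set differences instead of the inequality test.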
Competition
• Accessing a data source over the web can be slow – or fast
• Access the requested information as fast as possible
– This includes HTTP connection, parsing, writing in your database,
etc.
• Write a program that
– Takes as input a MAP_LOCATION
– Computes all NCBI genes at this location
– Retrieves all GO and TAX data for all associated UniProtKB proteins
• The program must be executable “as is” on gruenau2
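One way to speed up retrieval is to issue a few requests in parallel and to measure the total wall-clock time. The sketch below assumes a caller-supplied function `fetch_one` (hypothetical) that downloads and parses a single UniProtKB entry. The default of four workers is an arbitrary choice, kept small so that the remote server is not flooded.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, Iterable, Tuple

def fetch_all(uniprot_ids: Iterable[str],
              fetch_one: Callable[[str], Any],
              max_workers: int = 4) -> Dict[str, Any]:
    """Call fetch_one for every ID using a small, bounded thread pool."""
    ids = list(uniprot_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input IDs.
        results = list(pool.map(fetch_one, ids))
    return dict(zip(ids, results))

def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start
```

Threads suit this task because the work is dominated by waiting on the network. Whether parallel requests actually help depends on the server and on your chosen access method.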
Deliverables
• By Monday 8.12. or Wednesday 10.12., 23:59
– Three weeks
• Send by mail as ASCII
– An updated schema graph with the new table(s) for UniProt data
– Queries and answers for 8 questions
– For the competition: Executable code as specified on the previous slide