Praktikum Information Integration
Information Integration
Assignment 3: Consistency of Annotations (NCBI – UniProtKB)
Ulf Leser

Overview
• Reuse all existing data
• We want to integrate and compare functional annotation from another data source – the UniProtKB
• UniProtKB can be accessed through several interfaces
  – Your choice
• We don't want to produce a DDoS – restrict the comparison to a small region of a chromosome
• Therefore, we need to clean the MAP_LOCATION information
  – Finally :-)

Ulf Leser: Information Integration, Praktikum, WS 2008/2009

Task 1: Data Cleansing MAP_LOCATION
• So far, we left MAP_LOCATION as it was, though we have seen many problems
• We go for a quick-and-dirty solution to keep things simple
  – Set the field to NULL if the value is "-"
  – Remove all parts after a "|"
  – Remove all parts after a ";"
  – Remove all parts after a "-"
  – Remove all parts after a " " (blank)
• Perform the same changes with CHROMOSOME
  – We leave the few inconsistencies between CHROMOSOME and MAP_LOCATION (for now)
• This produces a 1:1 relationship between a gene and a chromosomal region
  – In varying levels of granularity

Task 2: NCBI – UniProtKB Mapping ??? VERSION!
• We need a mapping from GENE_ID (NCBI) to UniProtKB-ID (EBI)
  – This is a little complicated
  – GENE_ID 1:N PROTEIN_ACCESSION 1:N UniProtKB-ID
• Step 1
  – Create a 1:N relationship between NCBI genes and NCBI PROTEIN_ACCESSION
    • This should have been done already in assignment 2
  – Remove the "version" part from PROTEIN_ACCESSION
    • Everything after a "."
• Step 2
  – Download the file gene_refseq_uniprot_collab from the NCBI web site
  – Upload the data into a database table (2 columns)
    • Check that many PROTEIN_ACCESSIONs appear in the first column of this table

Task 3: Access UniProtKB
• Query all proteins whose genes are located in 21q22.1 (exact match)
  – Should be around 50
• You may use any one of three ways (choose now! We want diversity!)
  – Option 1: Use HTTP and parse flatfile
    • See http://www.ebi.ac.uk/Tools/webservices/services/dbfetch_rest
  – Option 2: Use HTTP and parse XML
    • See http://www.ebi.ac.uk/Tools/webservices/services/dbfetch_rest
  – Option 3: Use a Java library and parse nothing
    • See: http://www.ebi.ac.uk/uniprot/remotingAPI/
    • See also: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/24/10/1321
• Download/access/extract the following information per protein
  – Taxonomy ID(s)
  – All functional annotations with GO terms
• Update your database schema to add this information

Where is the data?
• XML
  – <dbReference type="NCBI Taxonomy" key="2" id="10116"/>
  – <dbReference type="Go" key="32" id="GO:0004866">
• Flatfile
  – OX   NCBI_TaxID=10116;
  – DR   GO; GO:0004866; F:endopeptidase inhibitor activity;…
• API
  – getNcbiTaxonomyIds()
  – getGoTerms()

Which method?
  Group   Method
  1       API
  2       API
  3       Flatfile
  4       XML
  5       Flatfile
  6       API
  7       XML
  10      Flatfile
  11      API
  12      XML
  13      API
  14      Flatfile
  15      XML
  16      API

Task 4: Compare Functional Annotation
• Knowledge about the function of a gene or protein often diverges considerably between data sources
• Let's see

Task 5: Queries
• Answer the following queries
  1. How many different MAP_LOCATION values remain?
  2. How many genes are located on each chromosome?
  3. Compute a frequency table: How many NCBI genes are connected to 1, 2, … UniProtKB-IDs?
  4. How many NCBI genes have a TAX_ID that differs from a TAX_ID of at least one associated UniProtKB-ID?
  5. How many UniProtKB – GO associations did you collect, for how many UniProtKB-IDs, for how many NCBI genes (three numbers)?
  6. How many gene-GO associations exist in the NCBI data that are not in the UniProtKB data?
  7. How many protein-GO associations exist in the UniProtKB data that are not in the NCBI data?
  8. How many NCBI genes have at least one diverging annotation?
     – That is, a GO term that is in NCBI but not in UniProt, or vice versa

Competition
• Accessing a data source over the web can be slow – or fast
• Access the requested information as fast as possible
  – This includes the HTTP connection, parsing, writing to your database, etc.
• Write a program that
  – Takes a MAP_LOCATION as input
  – Computes all NCBI genes at this location
  – Retrieves all GO and TAX data for all associated UniProtKB proteins
• The program must be executable "as is" on gruenau2

Deliverables
• By Monday 8.12., or Wednesday 10.12, 23:59
  – Three weeks
• Send by mail as ASCII
  – An updated schema graph with the new table(s) for UniProt data
  – Queries and answers for the 8 questions
  – For the competition: executable code as specified on the previous slide
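
Sketch for Tasks 1 and 2: the cleansing rules for MAP_LOCATION and CHROMOSOME, and the removal of the version part from PROTEIN_ACCESSION, could look roughly as follows in Python. The function names are hypothetical. Two points are assumptions rather than part of the assignment: the separator itself is dropped together with everything after it, and surrounding blanks are stripped before the rules are applied.

```python
from typing import Optional

# Separators from Task 1, applied in the order given on the slide.
SEPARATORS = ("|", ";", "-", " ")


def clean_location(value: Optional[str]) -> Optional[str]:
    """Apply the Task 1 rules to a MAP_LOCATION or CHROMOSOME value."""
    if value is None:
        return None
    cleaned = value.strip()  # assumption: ignore surrounding blanks
    if cleaned == "-":
        return None  # "-" stands for "no value" and becomes NULL
    for sep in SEPARATORS:
        # Keep only the part before the first occurrence of the separator.
        cleaned = cleaned.split(sep, 1)[0]
    # Assumption: a value that ends up empty is treated like a missing one.
    return cleaned or None


def strip_version(accession: str) -> str:
    """Drop the version part of a PROTEIN_ACCESSION (everything after the first '.')."""
    return accession.split(".", 1)[0]
```

In practice the same rules can also be written as UPDATE statements inside the database; the Python form is only meant to make the order of the rules explicit.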
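
Sketch for the flatfile option of Task 3: given the text of one UniProtKB flatfile entry, the taxonomy IDs and GO terms can be picked out of the OX and DR lines shown on the "Where is the data?" slide. This is a minimal sketch, not a full flatfile parser. The function name is hypothetical, fetching the entry over HTTP is left out, and it assumes that GO identifiers have the usual form "GO:" followed by seven digits.

```python
import re
from typing import List, Set, Tuple

TAX_ID = re.compile(r"NCBI_TaxID=(\d+)")
GO_ID = re.compile(r"GO:\d{7}")  # assumption: GO ids are "GO:" plus seven digits


def parse_flatfile_entry(text: str) -> Tuple[Set[str], List[str]]:
    """Return (taxonomy ids, GO ids) found in one flatfile entry."""
    tax_ids: Set[str] = set()
    go_terms: List[str] = []
    for line in text.splitlines():
        # Every flatfile line starts with a two-letter line code.
        code, content = line[:2], line[2:].strip()
        if code == "OX":
            tax_ids.update(TAX_ID.findall(content))
        elif code == "DR" and content.startswith("GO;"):
            match = GO_ID.search(content)
            if match:
                go_terms.append(match.group(0))
    return tax_ids, go_terms
```

Called on an entry that contains the two example lines from the slide, this yields the taxonomy ID 10116 and the GO term GO:0004866; all other line types are ignored.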
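
Sketch for queries 6 to 8 of Task 5: once both annotation sets are available as pairs, the comparison reduces to set differences. The code below is one possible reading, not the reference solution. It compares at the gene level, which means UniProtKB protein-GO pairs are first projected onto NCBI genes through the Task 2 mapping (query 7 could also be counted per protein). All names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def uniprot_gene_go(gene_to_uniprot: Iterable[Pair], uniprot_go: Iterable[Pair]) -> Set[Pair]:
    """Project (UniProtKB-ID, GO) pairs onto (GENE_ID, GO) pairs via the mapping."""
    go_by_protein: Dict[str, Set[str]] = {}
    for protein, go in uniprot_go:
        go_by_protein.setdefault(protein, set()).add(go)
    return {
        (gene, go)
        for gene, protein in gene_to_uniprot
        for go in go_by_protein.get(protein, ())
    }


def compare(ncbi: Set[Pair], uniprot: Set[Pair]) -> Dict[str, int]:
    """Count gene-GO pairs found in only one source, and the genes affected."""
    only_ncbi = ncbi - uniprot
    only_uniprot = uniprot - ncbi
    diverging_genes = {gene for gene, _ in only_ncbi | only_uniprot}
    return {
        "only_ncbi": len(only_ncbi),
        "only_uniprot": len(only_uniprot),
        "diverging_genes": len(diverging_genes),
    }
```

The same comparison can be expressed in SQL with EXCEPT (or NOT EXISTS) on the corresponding tables, which is likely the more natural form for the deliverable.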