Tuesday, 12 April 2016

Do you use ontologies? Then read on.

Do you read papers on ontologies? Do you author papers on ontologies? Do you review papers on ontologies? Do you love surveys, any sort of survey, regardless of content or length or topic?

"I've found the methods section!
No wait, it was a crumb."
If you've read, written or reviewed papers describing how an ontology came into being then you might well have experienced the issue that ontology write-ups can be a bit... tricky. It's not always obvious what you should include when you're writing one, and when you review one you often think 'yes, but I don't understand this part' and 'why is this missing?'. Ontology is not a science in the same way experimental biology is, so writing up your ontology method can be a bit of a shot in the dark at times; different reviewers have different ideas of what should be included, and this can lead to an unsatisfactory experience at all ends of the process.

We're trying to improve this with your help.

I have been working with Robert Stevens and Chris Mungall to put together some guidelines for Minimal Information for Reporting an Ontology (MIRO). We would like to publish the guidelines with as much community input as possible, and to that end we have a survey that you can take (the average response time so far is nine minutes):



The survey asks you to rate the importance of each guideline and optionally comment on each guideline - on any aspect including wording. There's also an opportunity to say what you believe is missing.

Once we have input we will review what we get as feedback and revise the MIRO guidelines appropriately. We will also publish an anonymised summary of the responses to the survey and what we plan to do in response.

By attempting to establish a set of guidelines on what should be reported when writing up an ontology, we hope to make the process of writing a paper easier and the experience of understanding how an ontology came to be more consistent. To make this as useful as possible, we need your help, as we want to reflect as many concerns as is reasonable. So please do join in, whether you use ontologies, read about them, write about them, review them or have an addiction to completing surveys about ontologies.

As always, we thank you in advance, you lovely people.

photo credit: Lichen Survey 12 via photopin (license)

Friday, 8 April 2016

Blockchain in pole position to be the next Big Thing, bioinformatics should take note

Over in Computer Science there's a growing murmur/surge/hysteria of interest in a technology for recording transactions in a way that is secure, robust and immune to fraud. It's called Blockchain. Many of you will have heard of the most famous application to use Blockchain: Bitcoin, the digital currency. Although Bitcoin will have been the first contact most of us have had with this area of distributed digital transactions, the underlying technology is beginning to take some of the limelight as applications in other areas become a reality. More importantly, there's now money in them there hills, and this is driving interest towards what I've no doubt will be a frenzy of hype.

Organisations and companies in the area of finance have been aware of Blockchain technology for a while now, and various financial institutions are beginning to investigate its use for capturing financial transactions. But now the biomedical world is also waking up to the possibilities it may present. Philips this year announced a new Blockchain lab investigating applications in 'connected health', and it will not be the only health care company looking into the potential of the technology. I'm told by a trusted source that governments are also looking into the use of Blockchain. So could this be a future technology for sharing biomedical data? Is this really a 'game changer'?

What the Hell is Blockchain?


Blockchain keeps the things you want to protect secure:
medical records, drug data, stuffed rabbits, etc.
Blockchain is a mechanism by which transactions can be stored in a decentralised way, and which enables those transactions to be recorded in a robust, tamper-proof manner. For those familiar with computer science, it has some similarities to how a singly linked list is stored, with each block referring back to the previous block in the chain (sometimes called the parent block). This chain links all the way back to the original block (sometimes called the genesis block). A block's identity is defined, in part, by this link address (produced using a hash algorithm). This means that a change to a block somewhere necessitates a change to each block referring to it in turn, since each block's identity relies on its parent's. The computational cost associated with recalculating all of these identities is so large as to be unfeasible, and this is one of the reasons Blockchain is considered very robust.
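The parent-hash linking described above can be sketched in a few lines of Python. This is a toy illustration with made-up helper names, not how any real blockchain is implemented:

```python
import hashlib
import json

def block_hash(block):
    """A block's identity: the hash of its contents, which include the parent's hash."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def add_block(chain, data):
    """Append a block whose 'parent' field is the hash of the previous block."""
    parent = block_hash(chain[-1]) if chain else "0" * 64  # genesis block has no parent
    chain.append({"parent": parent, "data": data})

def verify(chain):
    """A chain is valid if every block's 'parent' matches the previous block's hash."""
    for prev, curr in zip(chain, chain[1:]):
        if curr["parent"] != block_hash(prev):
            return False
    return True

chain = []
add_block(chain, "genesis")
add_block(chain, "transaction A")
add_block(chain, "transaction B")
print(verify(chain))           # True
chain[1]["data"] = "tampered"  # altering one block changes its hash...
print(verify(chain))           # False: every link after it is now broken
```

Tampering with any block changes its hash, breaking the 'parent' link of every subsequent block; redoing all of that linking work across the network is what becomes computationally unfeasible at scale.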

The decentralised way in which these transactions are stored means that should a 'node' - essentially a database storing some of the transaction data - fail, it can be replaced by other nodes. The data stored on one node is not stored only on that node; it has copies across the network, so there is a robustness to the overall architecture that avoids a single point of failure.

There are also features of Blockchains that make them desirable for trust. Bitcoin is a good example. In the Bitcoin network it is possible to test whether or not a new bitcoin is genuine, via the so-called 'proof of work' principle. If nodes don't agree on the validity of a bitcoin then it is rejected. This adds trust and authentication to the system, and underpins how Bitcoin works in practice as a real currency.
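The 'proof of work' principle can also be illustrated with a toy sketch: finding a nonce whose hash meets a target is expensive, but checking someone else's answer takes a single hash. (Real Bitcoin mining hashes block headers with double SHA-256 at vastly higher difficulty; this only shows the asymmetry.)

```python
import hashlib

def proof_of_work(data, difficulty=2):
    """Search for a nonce such that sha256(data + nonce) starts with
    `difficulty` zero hex digits. Finding it is costly; checking it is cheap."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{data}:{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce, digest
        nonce += 1

nonce, digest = proof_of_work("block payload")
# Any node can re-hash once to confirm the work was genuinely done:
print(digest.startswith("00"))  # True
```

Because honest nodes only accept blocks carrying valid work, faking a block means redoing the work, which is what makes cheating uneconomical.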

Jeni Tennison's (Tech Director at ODI) blog offers an accessible but more detailed overview if you're interested.

Data sharing and trust


Open data is a Big Thing now in bioinformatics and has very much become a first class citizen. I see the hashtag #opendata all the time (though I also see Kanye West trending a lot too, so, you know, caveat Twitter emptor). Most public grants require that their outputs are shared openly, including the data (with a few exceptions, such as those regarding patient privacy), and there is a general movement towards even commercial organisations now beginning to open up some of their less sensitive data.

When data is openly available and shared, then there is a responsibility on the consumer and it's a fairly onerous one; how much do you trust the data you are using? Data that is passed around naturally has a source, often has a few people coming into contact with it, such as through some analysis they've performed, and then has results about it published - along with the data hopefully.

There are also 'exotic' data to consider - for instance from patient self-reporting. The capability a person now has to monitor various aspects of their own health, lifestyle and environment is staggering and is set to increase this decade. FitBit is a perfect example of this, but there are now countless others. This mix of exotic and conventional data (i.e. data from a study or clinic) all feeds into a picture of increasingly available data, increasing in diversity and probably complexity.

I've previously discussed Big data in a blog, and why I think the variability and meaning are the key things rather than size, but there is another dimension to this movement: trust. If we are to start taking all of this data seriously and treating it with the same level of importance, then there are aspects we need to be sure of. Is the data reliable? Did it come from the self-reported source? Did those that consumed and modified it also record their source? As a data provider, can I trust the limits on whose hands my data will end up in? The latter is of critical importance to people offering up their own patient data voluntarily. Patients need to trust us for this to work.

In the work I do in data curation and knowledge sharing we have a largely closed ecosystem; we know who the curators are, we know the users, we know how things happened and when. In an open global world of data sharing this is much harder to capture. How do you know when someone changed something, who changed it and in what way? How can I stop my data from being misused?

A Bioinformatics Blockchain


I can conceive of several ways in which a Bioinformatics Blockchain might work. I don't claim to have a full spec of course - this is a blog after all. But conversations I've had recently with others in the technology field have led me to believe there is potential for bioinformatics.

  1. Sharing Anonymised, Trustworthy Patient Data. The volume of self-reported patient data continues to grow, although it is of arguably mixed quality and reliability. At the same time, it is possible (at least in the UK) to access your own 'proper' clinical record from your local GP or your hospital, including your medical tests. I recently signed up to get access to mine; it's quite amazing what they hold and what they're able to share. As a patient, there are likely many questions about what to do with your own data: how can you get the most out of it, and who can you trust to share it with? As a bioinformatician, the questions are similar: how can I best consume this data to get the most value out of it (in turn hopefully helping the patient), and how can I ensure the data remains in the right hands? The two are intrinsically linked, and I believe more secure and trustworthy sharing, such as through a blockchain, could help. Using a blockchain of anonymised clinical data, as a patient I could control who can see my data and which parts - it should be possible to set detailed metadata, such as permissions, on transactions which cannot then be modified. This could help overcome barriers to the sharing of personal medical data. 
  2. Tracking supply chain semantics. I recently came across a white paper from provenance.org on tracking the provenance of supplies in the 'farm to fork' movement of tracing how a foodstuff is farmed, fed, processed, stored, packaged and sold. This issue came to national attention in recent years with the meat adulteration scandal, whereby traces of undeclared horse meat were found in beef products. The use of blockchain provides trustworthy, unforgeable records of how a product is handled through the supply chain. Importantly, Blockchain is not a centralised system, so it is not possible for a rogue supplier to doctor the held records.
  3. Sample tracking and biobanks. This is a similar scenario to tracking supply chains in many regards. Biomedical samples collected in biobanks typically come with a great deal of information; indeed, this data is critical. Again, tracking the provenance of biobank samples through blockchain may add trust to that provenance. It may also enable those supplying samples to biobanks to have more control than before over whom their samples are shared with. 
  4. Distributed curation and crowd sourcing. Curation of data is the lifeblood of bioinformatics, and one thing is certain: with the increase in data, the capacity of a finite number of curators at institutes like the EBI to curate this public data will soon be insufficient. It's simple economics. Crowd sourcing curation in more distributed ways may well be part of the solution to this challenge. Blockchain as a digital signature - authenticating when data has been curated, in what way and by whom - may provide a mechanism to add trust to a more open strategy of pooling curation resource. 
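The fourth idea - authenticating who curated what, and that it hasn't been altered since - can be sketched minimally. This uses a shared-secret HMAC purely for illustration (a real distributed system would use public-key signatures), and every name here (curators, keys, fields) is made up:

```python
import hashlib
import hmac
import json

# Hypothetical curator registry; in practice each curator would hold a private key.
CURATOR_KEYS = {"curator_jane": b"jane-secret-key"}

def sign_record(curator, record):
    """Attach a signature binding the curator to exactly this record content."""
    payload = json.dumps(record, sort_keys=True).encode()
    sig = hmac.new(CURATOR_KEYS[curator], payload, hashlib.sha256).hexdigest()
    return {"curator": curator, "record": record, "signature": sig}

def verify_record(entry):
    """Check who curated the record and that it is unaltered since signing."""
    payload = json.dumps(entry["record"], sort_keys=True).encode()
    expected = hmac.new(CURATOR_KEYS[entry["curator"]], payload,
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(entry["signature"], expected)

entry = sign_record("curator_jane", {"gene": "BRCA1", "annotation": "GO:0006281"})
print(verify_record(entry))                  # True: signature matches content
entry["record"]["annotation"] = "GO:0000000"
print(verify_record(entry))                  # False: tampering is detectable
```

Recording such signed entries as blockchain transactions is what would make the history append-only: anyone could later verify when data was curated, in what way and by whom.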

Circle of Hype


Whenever I think about writing about any new technology (or in this case new application of an existing technology) I'm always wary of the lessons that I've been taught by the Semantic Web. The hype surrounding some of those early (and frankly even recent) publications damaged the field as it over-promised and under-delivered, not because the idea was a bad one, and not just because the technology was immature. The greatest barrier to the adoption of semantic technologies has been the social aspects of adopting something which has high up front costs with promise of reward coming much later. This is especially important when we consider that existing technologies can be 'made' to do about 80% of the sem web stuff with a bit of hackery (for instance flattening out an ontology into a database table). 

Ask: is this new technology so much better or different than existing technology, and is there a social will to make it happen? Perhaps so, in this case. The technology is more or less proven by the success of Bitcoin. Whatever the ups and downs of the currency's value, this is unrelated to the trust and security that backs it through Blockchain, and Blockchain's methods are transparent and there for all to see. Bitcoin, for instance, isn't some dark magic; it's primarily extremely clever maths and computer science, although it was created by some mysterious genius who is still unknown to this day... cue theories of government conspiracies.

The social aspect may be trickier, but again, there seems to be some will - and perhaps more importantly some funding. In their latest cancer 'moonshot' announcement the US government suggested that the project would aim to enhance "data access" and "collaborations with researchers, doctors, philanthropies, patients, and patient advocates, and biotechnology and pharmaceutical companies." Sounds like secure and trustworthy methods for sharing data should be near the top of the list of requirements.

Perhaps more importantly the project is going to be worth around $1 billion. Money talks, bullshit walks as they say. We have the technology, do we have the will?

Tuesday, 5 January 2016

Confessions of a Local Organiser: An unqualified volunteer's advice

Having just published a blog on my thoughts on how SWAT4LS 2015 went scientifically, I wanted to publish alongside it a more nuts-and-bolts blog about my experiences in running such a conference, as I've organised a few such meetings now. In particular, I know a lot of us run them without any qualifications and sometimes without much prior experience; we're all basically unqualified volunteers. So on the off chance you'd like tips from someone who is unqualified and has learnt this through doing, making mistakes, and improving, here it is.

No surprises

My mantra is always 'no surprises' or at least keeping them to a bare minimum. So this means trying to plan as many aspects as possible but also considering things that could go wrong. Here are a few things I consider important.
  1. Work with people you trust. I have organised meetings with Robert Stevens on several occasions now. We organised the first UK Ontology Network workshop together and have organised several industry training programmes and workshops for funded projects. We play well together and Robert is the King of not giving me any nasty surprises so I never doubt for a second the things we organise. We also asked Simon Jupp to organise the hackathon at EBI and Helen Parkinson helped as an additional local organiser - again people I know well. If I had to rank most important things in organising these meetings, this would be top.
  2. Do a participant walkthrough. In early 2015 Robert and I did a walkthrough of the whole SWAT4LS flow, starting at writing the call for papers, ending at publishing the proceedings, and covering everything in between. It was immensely useful for working out what comes next in a broad sense. At a point in October, after Robert and I had settled on the scientific programme, I wrote out a fine-grained flow of what would happen during the three days. Doing this helps you think of all the things that need to happen as people arrive and each session unfolds. It helped me think about details such as: when people arrive, will there be signs? Will they find the wifi passwords? Will they find the name badges? Will there be sufficient power when they sit down and plug their laptops in? For the poster session I thought about how people would position themselves if they just entered the room randomly, so I prepared staggered poster board numbers with paper titles on them to spread people around for each session. They are small details, but it all helps, and it also prevents being asked the same question 50 times - "Where do I hang my poster?", "Where are the rooms?", "Why is my laptop on fire?" etc.
  3. Have long breaks. From the off, we wanted long breaks: minimally, 30 minutes for coffee and 90 minutes for lunch. People find this time very important for talking with other attendees, some of whom they may only see at such events a few times a year. Don't fear the empty time slots; to some people they are the most important part!
  4. Less is more. We tried to maintain a good level of participation but also kept in mind that simply accepting everything is not worthwhile. For one thing it undermines the excellent job our PC (the oft unsung heroes) do in reviewing the papers and coming to a decision for us. So we planned out reasonable-length sessions and filled them with the best papers until they were full, rather than expanding the days until every submission fitted. 
  5. Make sure someone is keeping a tab on finances. This is one of the most overlooked, unglamorous but most important jobs in organising even smaller meetings. For SWAT we had projections for numbers, what the rooms would cost us and how we could best meet our costs whilst charging the most reasonable rate possible. This is always a balancing act. Charge too much and you might put people off; charge too little and even with good numbers you might not cover your costs, and then you're in trouble. You need to make sensible and cautious predictions so that if you get fewer attendees than expected you can still break even. If you get many more than you expect, you can always reinvest that next time round with some bursaries for students etc. Whatever you pick, you must keep track of it all, because there are many hidden costs you can miss and someone needs to pay for them, e.g. poster board hire, printing conference proceedings, even printing name badges. And of course my massive fee* (*I did not receive a fee).
  6. Be prepared to set aside a stupid amount of time. I consider myself a good organiser, in fact I hate being disorganised, but I still think I spent a full-time month towards the end of 2015 organising SWAT4LS. For someone now running a new company that is a luxury I probably couldn't afford really but I had accepted this a year in advance knowing this would take a lot of my time. There are always things to do so make sure you are committed knowing this in advance. Everything takes time and you need to be prepared to invest it. Good co-organisers help of course (see #1).
  7. Pay for an EasyChair. We used the free version and it was a total pain in the ass for multiple submission types. Honestly, pay for a premium version, it's £100 and it is definitely worth it.
  8. Do a participant survey at end of conference. But not later than that. Do it while it's fresh e.g. the last session of last day. I think gathering people's thoughts on what they liked, didn't like and what they might suggest e.g. for topics or keynotes is always useful (even if sometimes you don't like hearing it). 
  9. Have best paper awards even if you have no monetary prize. My company sponsored an award for best poster, but in truth the iPad is for Christmas; the glory is for life. The recognition is the thing, so have an award even if you don't have a prize. We didn't have a best student award (we had best paper and best poster) but this was, with hindsight, remiss. I realised this when I saw that a PhD student's work was nominated for the best poster prize but was up against huge consortia contributions. How do you reconcile that? A best student award would have resolved it (Bas Stringer was the student, and he was the runner-up for the best poster prize).
  10. Try not to lose your mind. It's hard, granted. You're probably doing it right now, in fact, reading this. "I don't agree with this bollocks," you're screaming. My, you are an angry individual. Anyway, the main thing is to keep calm and carry on, as the Brits say. Stuff is bound to go wrong, but if you take care of the main things people can live with the rest. Or they'll complain loudly, which, let's face it, we all enjoy doing from time to time anyway.

"My blog on what I thought about SWAT4LS 2015 and what I might do differently next time" blog

The Semantic Web Applications and Tools for Life Sciences (SWAT4LS) international conference took place in Cambridge in December and I was honoured to help organise and run it. I've been attending the conference for a number of years now and have always enjoyed my experiences and previously blogged about the Paris event. I wanted to give two perspectives related to this year's event. This blog is my take on the conference itself. I have penned a second on my thoughts on organising and running a conference like this because most of us that organise these meetings are basically unqualified volunteers, so some may find it a useful perspective. If you're interested in an additional perspective on this year's event you can also see this excellent blog from Egon Willighagen.

Conference: 7-9th December

Day one of the conference featured two parallel sessions of tutorials. As a local organiser I didn't get a chance to attend any one of the tutorials in full as I was largely running around so it would be unfair of me to pass any judgement. I got the impression from speaking with those attending that they enjoyed most of them, that pacing was occasionally an issue (one massively under-ran) but that overall they were packed with goodies. Hopefully we'll hear more detail on this from others who did attend - the organising committee are planning a participant survey in January and I'll blog a suitable version of the outcome in due course.

Day two was the first day of the main conference programme. The first morning kicked off with what Robert and I (as scientific chairs we were responsible for putting the programme together) had informally called the 'biomedical data' sessions. In the late afternoon the 'ontology' session also took place. There were some nice highlights from these sessions:
  • Marco Roos presented work using linked data to identify correlations between the tissue specificity of a Transcription Start Site (TSS) and the frequency and size of genetic variants in the genomic region covering the TSS, using, amongst other data, FANTOM5 ontology-annotated data sets converted into RDF and integrated.
  • Chris Baker described the iCyrus semantic framework for linked open biomedical image discovery; a demo is available at http://cbakerlab.unbsj.ca:8080/icyrus/ 
  • Kevin Dalleau presented work on text mining linked data to identify pharmacogenes, using a variety of data sources including PharmGKB, DisGeNET, ClinVar and DrugBank. There was a question from Robert Hoehndorf around how the test and training sets were selected and whether or not this introduces confirmation bias into the results. It's an interesting bit of work anyway, and I'd recommend reading the paper yourself to make your own call. 
  • Hugo Leroux presented work on integrating clinical data in Australia using two common standards, CDISC ODM and FHIR (or, more specifically, integrating ODM data into FHIR). 
  • Elvira Mitraka presented Wikidata, and work on populating it with semantic network information on genes, drugs and diseases. 
  • Simon Jupp presented the new Ontology Lookup Service at EBI and the Webulous collaborative ontology development framework and connected Google sheets add-on.
Main conference auditorium. No laptops on fire = going well.
Bijan Parsia gave the first keynote on Representing All Clinical Knowledge for use in complex computational systems (here's a link to Bijan's keynote slides). It was an entertaining and thought-provoking talk. To me it demonstrated that there are many issues still to be tackled in representing clinical knowledge in a computationally amenable way, and many solutions which might help in resolving them, but perhaps more than anything that there is still a gap (chasm?). Some of Bijan's most pertinent words were on trying to identify study design and on interpreting results in literature from said studies. It's an almost endless task tracking down all the bits needed to make sense of them, especially in the longer term. It made me feel like this part of the scientific process is still very much broken, which seems strange for what is a huge industry ($9.4 billion revenue generated by the scientific publishing industry according to this Nature article). This became more pertinent following Melissa's keynote on day three (see below). 

On day three, the second day of the main conference programme, we opened with the Industry session, followed by what we had internally called a 'technical' session and closing on the 'metadata' session. Some highlights included:
  • Elizabeth Wu presented the Alzforum, a website and community that curates Alzheimer-related research. What was interesting about this talk was not that it was a semantic web application, but rather that it was an excellent use case for why semantics and linked data could help this community. And that was essentially what Elizabeth came to present - openly looking for collaborators. I know by the end of the conference some initial discussions had begun, which is always good to hear. 
  • Daniel Herzig demoed GraphScope, which allows for simple, user-friendly keyword search across SPARQL endpoints. It's certainly been a complaint for many years that SPARQL is too hard for people to use (I've previously suggested that SPARQL is not intrinsically hard, but rather that poor RDF and poor schema documentation/lack of examples are the problem). This tool aims to tackle some of these issues. 
  • Julian Everett presented the Cochrane PICO Annotator, a tool for annotating evidence in clinical trial reports and reviews, using a bit of MedDRA, SNOMED-CT and ATC, and exposing these annotations as linked data. They can thus be queried, enabling some very rich queries over these valuable resources. It struck me that this was a very simple but effective demo of how linked data can really help in the medical record parts of bio.
Melissa Haendel gave the second keynote, titled "Not everyone can become a great artist; but a great artist can come from anywhere: Envisioning a world where everyone helps solve disease" (slides here). Her talk had a nice story to it, around how discoveries and the research that goes into them can come from almost anywhere, and in fact does come from almost everywhere. Putting data together presents one challenge, a largely but not exclusively technical one. Melissa demonstrated some of the work involved in putting together one such investigation, into a STIM1 mutation, which involved a whole raft of genetic and phenotypic data integration from human and animal model data. But putting together the trail of contributions, from data generators and lab scientists to data gurus, is another, in some aspects more difficult, cultural challenge. Semantics help with both, but only to the extent that you can find them. At least now the culture of describing data is blooming. The culture of annotating your data with who did what and when, less so. 

The point was simple in many ways - the work you do, the data you collect, the analysis you publish - can all be helpful to someone, somewhere, at some point in time, probably. Finding it, knowing who did it and how someone may appropriately apply it, can be challenges in and of themselves, never mind the biomedical research part. Cogent I thought, particularly following Bijan's keynote the day previous. And of course all of these things are challenges the semantic web aims to help with. 

The conference ended with a discussion panel featuring a set of questions we had crowd sourced before we invited the panel members. The panelists were fab - all credit to them - and after an initial slow audience start, it eventually got going and a lively debate followed. The set questions were largely around whether poor quality RDF is damaging to the overall efforts and to what extent sem web technology could ever become a technology of choice. This quickly descended into discussion about scientific literature, something on which I know Egon Willighagen (panelist) and Phil Lord (chair) have voiced strong opinions previously. Frank Gibson (panelist) has also previously worked in the publishing industry. So perhaps a topic for next year's SWAT4LS then...

Thoughts on SWAT

Robert (right) and I (left) at registration table in
a non-selfie-selfie or 'photograph'.
My overall feeling was that SWAT4LS was a success. Numbers were good (>120 people), content was good (overall acceptance rate was 54%) and the session chairs did a marvellous job - all credit to them. The wifi network and food were also great.

One aspect Robert and I had emphasised from the start was to try and add more gender balance. I can't lie, this isn't easy in a field dominated by computer scientists, and it is a problem that goes deeper than this one conference. But it's an issue we chose not to ignore, and we tried. We had the most female PC members to date (34%), had a mix of female session chairs (38%) and we added female organisers to the committee for the first time. Of course this is not just about numbers - quality and competence are equally important - but we took some steps. This still needs to improve in the longer term; I look at the numbers and they still need work to get closer to parity. M. Scott Marshall is organising next year's SWAT4LS and he is a big supporter of this effort, so long may it continue to improve.

Balancing content is always tricky when you have a lot of submissions (52 papers, 33 posters). We aimed for a mix of sessions - long papers, short papers, demos and flash presentations, with two poster sessions, two keynotes and a panel discussion. I thought on the whole they were very good. The flash presentations were particularly excellent given how tricky they can be. I especially enjoyed Melanie Courtot's one-minute poem. Genius.

We also opted for a dedicated industry session to try and bring in a different perspective, excellently chaired by Kerstin Forsberg. I personally enjoyed the session and took a lot away from it but a couple of people did suggest to me that some of the talks focused a little too much on how great their products were - natural when you're a company I suppose - but that this was not of primary interest to this audience who wanted the guts not the glamour. There were certainly some gems though - Alzforum was one in particular (mentioned previously). I still wonder how SWAT4LS can bring in more industry insights without wandering into sales sessions, which it can naturally tend towards. I'd welcome suggestions.

It seems our experiment on crowd sourcing panel questions did not quite work. The crowd got going once the topics (massively) strayed into semantic publishing and online access but not so much before. We only had 7 submissions of panel questions so perhaps this was why. This may just be a communication issue i.e. getting the call for panel questions to a wider audience. I had a thought that next time I do this I would ask for questions from the audience on day one for a panel occurring on day two. Just on bits of card perhaps and a box to post them in. People are busy, so naturally they are most engaged when they are there, sat in the conference rather than weeks or months in advance. You also don't want to surprise the panelists of course, so again, it's a balancing act.

Next Year?

SWAT4LS will be held in Amsterdam in 2016, chaired by M. Scott Marshall. I don't have many insights into what Scott is planning just yet; all will be revealed no doubt. Certainly I envision a trend towards clinical applications, as this is a more general trend in bioinformatics. And there is also the issue of literature. Elsevier are of course based in Amsterdam, so perhaps there is scope to bring them into the room to join the discussion. I also saw other graph solutions popping up in various presentations that were not exclusively RDF - Neo4j appeared several times (and indeed sponsored the event). I would be interested to see how Neo4j might interact with this community of sem webbers. Little steps.

</swat4ls2015>

Tuesday, 1 September 2015

Collecting Questions for a Conference Panel from the Community

Robert Stevens and I are the Scientific Chairs for Semantic Web Applications and Tools for Life Sciences (SWAT4LS) 2015, Cambridge UK, in December 2015. As we’re coming to the end of the paper submission period we’ll be reviewing submissions and forming the programme. Rather than wall-to-wall talks, we want to break up the day a bit, and one of the ways we are doing this is by including a panel session. They're usually lively, fun and informative affairs; friendships are made, broken and made again as experts in the field vigorously debate the burning issues at hand.

This year we want to do it slightly differently. Rather than form the panel first and then choose appropriate questions, we would like to elicit questions from the community and then choose the panel to suit. 
Panels are typically lively, absorbing and full of energy.

So, this is a “call for panel questions”, or CFPQ. Add your proposed questions to the short CFPQ survey and Robert and I, along with the rest of the SWAT4LS organising committee, will choose some questions and invite a panel.

Those people that have their question used for a panel will receive:

  • An acknowledgement of their contribution (unless they prefer not to be named).
  • A £10 Amazon voucher (or equivalent in some other country’s currency).

This CFPQ will close on 1st October 2015.

Disclaimer: No Sem Web experts will be intentionally harmed in the making of this panel.

Monday, 10 August 2015

Making RDF Mainstream Requires Simple Not Clever Things

I'll be honest, this post initially came out of frustration rather than some insightful epiphany. I almost started it "Dear World, please stop publishing poor quality RDF" but on reflection thought this was not constructive.

My initial frustration has been borne out of reviewing a number of papers over the last few months, for conferences and journals, describing one of two things: publishing RDF (and sometimes Linked Data), and trying to consume RDF (and sometimes Linked Data). The latter's struggle was very informative about the former - I felt their pain. It was not that the work was uninteresting; some of it was very interesting indeed. It was that the publishers often neglected the simple things that make online services usable, and all the more frustrating knowing that at least some of the problems users face could be easily addressed if RDF publishers paid a little more care to those simple things.

There are two issues at stake here, quality of service and quality of content.

Quality of Service and the SPARQL Endpoint Graveyard


The inability to hit a running endpoint using the documented examples is top of my list of missing simple things. I have previously taken to blogging examples of how you might explore a SPARQL endpoint, primarily because so much RDF sits behind a SPARQL endpoint with poor documentation (most often none) and poor examples (most often none). This is a typical SPARQL endpoint on the web, which took me seconds to find:

Figure 1. I know this one:
10 PRINT "SPARQL RULEZ"
20 GOTO 10
OK, I hacked the button text, but ask yourself: would you publish a REST API like this? Who is this aimed at? I can only see two user groups this could be of interest to: 1) the developers of the endpoint themselves and 2) SPARQL hackers. Now of course, I am not precluding that machines can also query this endpoint, and perhaps they know more about the schema and data than I am giving them credit for. But who programmes those bots? Presumably someone somewhere has to work out what the data is about and how it is being described? Particularly when I do some basic exploratory querying and find a lot of locally minted URIs describing types, with little use of shared, commonly used ontologies. More importantly, are those the only two user groups the RDF community wish to engage with?
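For the record, the sort of basic exploratory query I mean is nothing exotic - a sketch that should work against almost any endpoint, listing the most commonly used types in the data:

```sparql
# Count how often each rdf:type is used, most common first.
SELECT ?type (COUNT(?s) AS ?count)
WHERE {
   ?s a ?type .
}
GROUP BY ?type
ORDER BY DESC(?count)
LIMIT 20
```

Even this much, pasted into the documentation as a worked example, would tell a new user far more than an empty query box.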

It also concerns me how many SPARQL endpoints are published only to join the SPARQL Endpoint Graveyard. I am not entirely immune to this, having produced a very old Gene Atlas endpoint in 2010 which we subsequently replaced (although the endpoint is still live: http://www.ebi.ac.uk/microarray-srv/openrdf-sesame/repositories/gxa and the old webpage also redirects to the latest Atlas offering https://www.ebi.ac.uk/rdf/services/atlas/). But there are so many more that feature in publications and portals yet are dead within months. I am not going to name them here; that is not my objective.

Quality of Content


Beyond the service, I also don't see the point in publishing poor quality data, alongside poor quality (or on some occasions entirely absent) semantics, as RDF. This is just a format conversion; the RDF adds nothing, particularly if writing queries for it is an archaeological exploration of which Howard Carter would be proud (see Quality of Service above).

There are many papers and blogs on reusing ontologies with RDF and on whether you should mint your own ontology or use existing ones (I note a special issue of the Semantic Web Journal is on the way). I advocate reusing existing, commonly used ontologies if your aim is data integration with the wider world, or at least reusing the parts you want to integrate on (for example, Gene Ontology URIs). Local ontologies to add a bit of structure are probably OK. If the publisher's use cases are not concerned with data integration then mint away, but then documentation is even more important if the schema came straight out of the author's head.

I do draw the line, though, at content that simply parses in a huge file and spits it out as RDF with no thought as to how it should be structured semantically, and I've seen too much of that lately. The claim "it's now on the Semantic Web as RDF" should be followed with the question "how will a new user consume it?".

Worse still, I believe this is actually damaging to the cause of those interested in making a success of the semantic web. Poor data and poor service provision are the bane of my sem web reviewing life. It has to end, and I believe much can be done by addressing simple things.

Simple 5 Star RDF Offerings


My frustration made me think back to TBL's 5 star Linked Open Data criteria. For those working outside the semantic web community these are still lofty goals, so the bar to achieve five stars remains fairly high. But I believe those of us working in this community should do better. We should be 5 star by TBL's standards, and more than that. Here's what I think we should be aiming for:


  1. Provide a SPARQL endpoint that is accessible most of the time (there are 10,080 minutes in a week; if it's down for just 5% of the time, that's 504 minutes - over 8 hours - of downtime, and that's a lot).
  2. Provide example queries.
  3. Make your ontology/schema separately accessible and downloadable.
  4. Make your URIs dereference to a webpage.
  5. Give things a sensible rdfs:label where possible.
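Point 5 is also cheap to check on your own output. A sketch that lists resources missing a label entirely:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# Subjects that appear in the data but have no rdfs:label at all.
SELECT DISTINCT ?s
WHERE {
   ?s ?p ?o .
   FILTER NOT EXISTS { ?s rdfs:label ?label }
}
LIMIT 50
```

Run against your own endpoint before publishing, this takes seconds and catches the worst offenders.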


They're not explosive and ground-breaking, sure, but neither is a good RESTful API, and that is what we need to be comparing SPARQL APIs with if we want someone to actually use our RDF as a first class service. Would you publish any other API on the web and not provide documentation and examples on how to use it? Would you persist with a REST API that was offline 50% of the time? There are a few other nice-to-haves I would mention but am not fascist about - including a VoID description, for instance, is very useful for both human and machine to gain an insight into the data set a user may be about to explore. Making previous versions available is also a nice-to-have, though I do understand this is often not practically possible because of size and compute power.
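A VoID description need not be elaborate either. A minimal sketch in Turtle, using placeholder URIs and an invented dataset purely for illustration:

```turtle
@prefix void:    <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# Hypothetical dataset description; substitute your own URIs and counts.
<http://example.org/dataset> a void:Dataset ;
    dcterms:title "Example dataset" ;
    dcterms:description "Short description of what the data covers." ;
    void:sparqlEndpoint <http://example.org/sparql> ;
    void:exampleResource <http://example.org/resource/1> ;
    void:triples 1000000 .
```

Ten lines, and both a human and a crawler now know where the endpoint is and roughly what to expect.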

I should also add that I'm not against the mantra of publishing early and often, far from it; I'm fully engaged with agile methods of early delivery and continuous incremental improvement. But early and often was also intended to engage with users in order to constantly improve. If we want that feedback we need users to engage in the first place, so we need to make it easier for them.

Rules are better written as positives, so I avoided any "don'ts" above, but there are a few worth mentioning:

1. Don't publish a SPARQL endpoint unless you can maintain uptime (unless it's a sandbox, see 3 below).

2. Don't reinvent URIs for your schema/ontology if good, commonly used URIs already exist and data integration is your aim. Otherwise, provide sameAs mappings. Expecting integration via mapping to be undertaken user side is asking a lot of the user.

3. Be honest about the maturity of your offering, if it's beta or sandbox say so. Definitely don't write a paper on it claiming "this valuable resource is now done" if in a month's time it's dead.

4. Don't convert bad quality data into RDF; this just creates bad quality data in RDF. Semantics don't come for free: clean it, structure it, or leave it alone.

5. Don't publish RDF you, as the publisher, couldn't query in a year's time.

I will finish on a positive note. I think there is a place for RDF both as a research tool and as a more mainstream API to rich data; I wouldn't work in the field if I thought otherwise. I just wish I didn't have to start so many SPARQL queries by first trying to understand the developer's mind. They're typically not safe places to go alone, at night, in the dark... so many brackets...


Thursday, 4 June 2015

SWAT4LS Conference coming to Cambridge

I am pleased to write that Cambridge will be hosting Semantic Web Applications and Tools for Life Sciences (SWAT4LS) this year. And for the first time in its seven-year history SWAT4LS will be a full conference, with the main scientific programme consisting of a two-day conference, a one-day tutorial and workshop - all at Clare College, Cambridge University - and a one-day hackathon hosted at the European Bioinformatics Institute. Prof Robert Stevens and I will act as scientific chairs for the conference.

Beautiful, green, leafy Cambridge depicted here in industrial dark grey.
Winter is coming (in December).
SWAT4LS is a meeting I've attended several times over the years and I have always been pleasantly surprised by the quality of the content. That may sound like damning with faint praise, so let me clarify. The Sem Web community has been responsible for what I consider to be a lot of unsubstantiated hype over the years, perhaps beginning with the original paper Tim Berners-Lee published. The truth is that the vision he had for agents crawling the Sem Web has largely happened without a lot of the sorts of technology we thought we would need: insurance comparison sites, aggregator shopping websites, hotel and holiday recommendation sites, even social networks.

But there is also no doubt in my mind that semantic technologies are becoming increasingly important, particularly in the life sciences, as we deal with ever larger data sets of increased complexity and veracity. How best can we describe such data and, in a world where open shared data is the new norm, how can we best let others reuse it? Using ontologies and common exchange formats, such as RDF or JSON-LD, seems a natural solution to these challenges. What I have observed at SWAT4LS over the last few years has shown me that this technology is maturing and the community is producing real-world applications which solve real-world problems in life sciences. Long may it continue at this year's SWAT4LS conference in Cambridge.

The deadline for paper submission is 15th September 2015, see the call for papers for details:  http://www.swat4ls.org/workshops/cambridge2015/call-for-papers/

If you are interested in attending the hackathon at EBI please contact Simon Jupp: jupp@ebi.ac.uk

If you are interested in submitting a tutorial please contact Scott Marshall: info@swat4ls.org

Tuesday, 26 May 2015

RDF Training at EBI

We ran an RDF training workshop last week at EBI in conjunction with the EBI industry programme. I thought it might be useful to blog the course materials for those trying to use the EBI RDF Platform, as they contain some step-by-step guides for building queries and some examples you can copy and paste. I've copied the text below, but I have also uploaded the file as a Word doc (the PDF breaks when you try to copy and paste the queries because of some weird line encoding issues). If you find any errors please let me know. The answers to all of the questions are at the end of the page.

EBI RDF Training Material
James Malone, Simon Jupp


Resources

The EBI RDF Platform is available at: http://www.ebi.ac.uk/rdf
The EFO ontology homepage is available at: http://www.ebi.ac.uk/efo

1. Exploring a SPARQL endpoint

Aims:
• Enable you to query a SPARQL endpoint with exploratory queries to gain some initial familiarity with the data and schema
• Explore types, predicates and basic ontology querying


Q 1.1 What types are used in the triple store?

We can use a simple pattern to extract any subject which has a type and then list those types to help us explore what sort of schema is in the triple store. The following query does this. Note the shorthand ‘a’ in the query – this is shorthand for rdf:type – either can be used.

SELECT DISTINCT ?type
WHERE {
   ?subject a ?type . 
}


We can refine this further and ask for just the ontology classes that are used as types in the triple store. Here we ask the same query but extend it to ask for types that are themselves typed as owl:Class:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?type
WHERE {
   ?subject a ?type .
   ?type a owl:Class .
}

Q 1.2 What are the labels on a given class?

The above query is useful for getting the URIs of the classes, but typically the label is more informative for humans. To get the label we use rdfs:label. Extending the above example to get labels:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?type ?label 
WHERE {
   ?type a owl:Class .
   ?type rdfs:label ?label .
}

Q 1.3 What predicates are used in the triple store?

As well as the types for subjects and objects in the triple store, it is often important and useful to know the predicates which can connect them. Here we can ask for a list of predicates using:
SELECT DISTINCT ?p 
WHERE {
   ?subject ?p ?object .
}

Q 1.4 What are the parent classes for an ontology class?

Given one of the ontology classes we have identified from the previous queries, we can explore the graph around it. For example, we can ask what the parent classes of a type are by asking which things the class is a subClassOf:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT DISTINCT ?parentClass ?label 
WHERE {
   atlasterms:Assay rdfs:subClassOf ?parentClass .
   ?parentClass rdfs:label ?label .
}

This will return the direct subclass relationships only, however. If we want to explore the tree of subClassOf relationships up to the root node we need to use a transitive query (in SPARQL this is done with a property path). A shortcut for this is to add an asterisk to the predicate in question. Try this:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT DISTINCT ?parentClass ?label
WHERE {
   atlasterms:Assay rdfs:subClassOf* ?parentClass .
   ?parentClass rdfs:label ?label .
}

2. Exploring the Gene Expression Atlas

Aims:
• Explore the data in the Gene Expression Atlas
• Show expression levels for a gene
• Find experiments for a given gene
• Find genes differentially expressed for a condition in a species

In this section we will work through a series of questions designed to explore the Expression Atlas: http://www.ebi.ac.uk/rdf/services/atlas/  To write the SPARQL you will use the documentation on the Atlas RDF pages, in particular the schema diagram which can be found on the documentation page: http://www.ebi.ac.uk/rdf/documentation/atlas The example queries on the website may also be used to help you.


Q 2.1 What types are used in the Atlas triple store?

Use the examples from section 1 to extract a list of the types in the Atlas triple store and explore them.

Q 2.2 What triples connect to an Experiment in Atlas? What are the types of those things that are connected to an Experiment?

Use the examples from section 1 to query the predicates and types from the experiment class in the Atlas RDF. Hint: the experiment class is:

      http://rdf.ebi.ac.uk/terms/atlas/Experiment

For the second question, you will need to use rdf:type or its shorthand, which is just ‘a’. Complete the following:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT DISTINCT ?type
WHERE {
   ?experiment a atlasterms:Experiment .
   _______  ?predicate  _______ .
   ______ a ?type .
}

Q 2.3 Which genes/proteins are studied in experiment E-GEOD-1085?

For this query we would like to connect the experiment part of the graph with the gene part. The elements required are shown below in the fragment of the Atlas schema diagram. To get from experiment to gene, one path through the graph would be experiment->differentialExpressionAnalysis->DifferentialExpressionRatio->ProbeDesignElement->DatabaseReference

This would result in database references for the probe in question – both genes and proteins. Let’s extract all of these with a SPARQL query first.




The URI for an experiment is:

      http://rdf.ebi.ac.uk/resource/atlas/ID_NUMBER

So for E-GEOD-1085 it is:

      http://rdf.ebi.ac.uk/resource/atlas/E-GEOD-1085
   
or if we add a PREFIX then we can use:

      atlas:E-GEOD-1085

Fill in the gaps for the following stub using the schema and diagram:

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT distinct ?dbXref
WHERE {
      atlas:E-GEOD-1085 atlasterms:hasAnalysis ?analysis . 
      ?analysis    ______      ______ .
      ______   atlasterms:isMeasurementOf   ______ .
      ______    ______   ?dbXref .
}

Q 2.4 Which genes are studied in experiment E-GEOD-1085?

In this query we want to limit the results from the previous query to just genes and ignore proteins or any other dbXrefs. There are several ways of achieving this but one way is to identify the types of those dbXrefs and then include only entities which are types we are interested in.

First we need to extend the query to discover what types of things the dbXrefs are so we can investigate them. Complete the following query to pull out the types, remembering that the predicate for a type in RDF is rdf:type

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo:<http://purl.obolibrary.org/obo/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/> 
SELECT distinct ?dbXref ?type
WHERE {
      atlas:E-GEOD-1085 atlasterms:hasAnalysis ?analysis .
      ?analysis    ______      ______ .
      ______   atlasterms:isMeasurementOf  ______ .
      ______    ______   ?dbXref .
      ______    ______   ?type .
}

This will list the types in the ?type variable for each dbXref in the result. Here we can explore the entity types, click on some of these links e.g. http://www.ebi.ac.uk/rdf/services/atlas/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fterms%2Fatlas%2FEnsemblDatabaseReference
You will see that gene types like the above have been classified as subclasses of the Gene database reference class: http://www.ebi.ac.uk/rdf/services/atlas/describe?uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fterms%2Fatlas%2FGeneDatabaseReference

Modify your query above to fetch back only genes which are a type of this Gene ID class. Hint: this will involve using the subClassOf predicate.

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT distinct ?dbXref
WHERE {
   atlas:E-GEOD-1085 atlasterms:hasAnalysis ?analysis .
   ?analysis    ______  ______ .
   ______   atlasterms:isMeasurementOf  ______  .
   ______    ______   ?dbXref .
   ?dbXref    a  ?type  .
   ?type ______   _______ .
}

Q 2.5 Which experiments is the gene Ikbke studied in?

Given the above query, we should now be able to simply reverse it to retrieve experiments for a given gene, walking the graph in the other direction. Modify your answer from the previous queries to answer this.  Hint: gene IDs in the Atlas RDF are of the form:

      http://identifiers.org/ensembl/ENSMUSG00000042349

To query based on a label e.g. Ikbke you will need to use the rdfs:label triple. As a reminder, you can extract a label on a class as follows:

      ?class rdfs:label ?label

Querying based on text rather than on ID requires the use of a FILTER construct in SPARQL. FILTER matches based on a given expression and includes only those results which satisfy it. For example, if we wanted to include just those Ensembl entries whose label is ‘Brca1’ we could use the following query (the ‘i’ in the query means ignore case):

PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT distinct ?dbXref ?label
WHERE { 
   ?dbXref rdf:type atlasterms:EnsemblDatabaseReference .
   ?dbXref rdfs:label ?label .
   FILTER regex(?label, "Brca1", "i" )
}
Combine the filter with your query to find experiments which study the gene Ikbke.


Q 2.6 Which genes are differentially expressed in liver cancer? 

Connecting genes to the experimental factors (the conditions they are studied under) requires connecting several parts of the graph. We can see in the image below that analyses performed over experiments are connected to values (atlas:DifferentialExpressionRatio), which are connected by hasFactorValue to experimental factors. It is in this part of the graph that the ontology EFO is used to model the experimental conditions. You can visit the EFO homepage at http://www.ebi.ac.uk/efo to browse the ontology classes which are used to model disease, anatomy and so on.



To query for a condition (such as liver cancer) we require the ontology class for it – otherwise we can fall back on a regex query as before, although this can be slow in SPARQL. The ontology class for liver cancer is:

      http://www.ebi.ac.uk/efo/EFO_0000182

To find genes for this condition we need to connect this factor value part of the graph up to the query for genes performed previously. Complete the stub below, which builds on previous queries, to query for expression value, p-value, gene name and the experimental condition value:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT distinct ?expressionValue ?dbXref ?pvalue ?propertyValue 
WHERE {
   ?analysis atlasterms:hasExpressionValue _____ . 
   _____ a atlasterms:IncreasedDifferentialExpressionRatio . 
   _____ rdfs:label ?expressionValue . 
   ?value atlasterms:pValue ?pvalue . 
   ?value atlasterms:hasFactorValue _____ . 
   _____  atlasterms:isMeasurementOf _____ . 
   _____  atlasterms:dbXref ?dbXref .
   ?factor atlasterms:propertyValue _____ . 
   ?factor rdf:type  _____ . 
} 

Q 2.7 Which human genes are differentially expressed in liver cancer?

From the previous query, extend the graph to include only genes which are typed as human genes. This should exploit the organism part of the graph. To query for all species you can use the query:
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT distinct ?organism ?label
WHERE {
   ?dbXref atlasterms:taxon ?organism .
   ?organism rdfs:label ?label .
}

Now extend using the correct URI for Homo sapiens. Here is a stub to help:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT distinct ?expressionValue ?dbXref ?pvalue ?propertyValue
WHERE {
   ?analysis atlasterms:hasExpressionValue _____ . 
   _____ a atlasterms:IncreasedDifferentialExpressionRatio .
   _____ rdfs:label ?expressionValue .
   ?value atlasterms:pValue ?pvalue .
   ?value atlasterms:hasFactorValue _____ .
   _____  atlasterms:isMeasurementOf _____ .
   ?probe atlasterms:dbXref ?dbXref .
   ______  atlasterms:taxon ______ .
   ?factor atlasterms:propertyValue ?propertyValue .
   ?factor rdf:type  ______ . 
}

3. Federating Queries

Aims:
• Use SERVICE query
• Connect a query from Reactome to Gene Expression Atlas
• Connect a query from Gene Expression Atlas to ChEMBL

One of the advantages of SPARQL endpoints is that they represent a ‘universal API’ to data, which enables a SPARQL query at one endpoint to call other endpoints from within that single query – often called federation. Here we explore federation as a way of integrating multiple data sets.
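The general shape of such a federated query, with a hypothetical remote endpoint URL standing in for a real one, is:

```sparql
# Join triples at the local endpoint against a remote endpoint via SERVICE.
SELECT ?s ?remoteValue
WHERE {
   ?s ?p ?o .                               # evaluated at the local endpoint
   SERVICE <http://example.org/sparql> {    # sent to the remote endpoint
      ?o ?p2 ?remoteValue .
   }
}
LIMIT 10
```

The bindings for ?o from the local pattern are joined with the results coming back from the remote pattern, which is exactly the mechanism the queries below rely on.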

Q 3.1 What genes are differentially expressed for diabetes type II and which pathways are they involved in?

To resolve this query we will utilise the Expression Atlas and Reactome pathway endpoints. From the previous section you should now be able to write a query which generates a list of differentially expressed genes for a condition – diabetes type II in this case. To help, the EFO class for diabetes type II is:

   http://www.ebi.ac.uk/efo/EFO_0001360

Reactome uses Uniprot proteins as references, so to connect Atlas genes to Reactome we first need to extract the Uniprot protein references from the Expression Atlas endpoint. Where previous queries asked for Ensembl references, here we require Uniprot references, which are entities typed as:

   atlasterms:UniprotDatabaseReference

Complete the stub below to pull out the Uniprot references for diabetes Type II:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>

SELECT distinct ?dbXrefProt ?expressionValue ?propertyValue
WHERE {
   #Get differentially expressed genes (and proteins) where the factor is diabetes type II
   ?value atlasterms:pValue ?pvalue .
   ?value atlasterms:hasFactorValue ?factor .
   ?value rdfs:label ?expressionValue .
   ?value atlasterms:isMeasurementOf ?probe .
   ?probe atlasterms:dbXref ?dbXrefProt .
   ?dbXrefProt a atlasterms:UniprotDatabaseReference .
   ?factor atlasterms:propertyType ?propertyType .
   ?factor atlasterms:propertyValue ?propertyValue .
   ?factor rdf:type efo:EFO_0001360 .
}

Next we need to connect this to the Reactome endpoint to pull out pathways. Calling a remote SPARQL endpoint uses the SERVICE keyword with the URL of the SPARQL endpoint to be queried in the form:

   SERVICE <URL>

For Reactome this would be:

   SERVICE <http://www.ebi.ac.uk/rdf/services/reactome/sparql>

To query for pathways based on a Uniprot URI the following query can be used:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biopax3: <http://www.biopax.org/release/biopax-level3.owl#>

SELECT DISTINCT ?pathway ?pathwayname
WHERE {
   ?protein rdf:type biopax3:Protein .
   ?protein biopax3:memberPhysicalEntity
      [biopax3:entityReference ?dbXrefProt] .
   ?pathway biopax3:displayName ?pathwayname .
   ?pathway biopax3:pathwayComponent ?reaction .
   ?reaction ?predicate ?protein .
}

In this query if we replace the ?dbXrefProt with the Uniprot protein URI we can query for pathways for a specific protein, e.g. protein uniprot:Q4V3C8:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX biopax3: <http://www.biopax.org/release/biopax-level3.owl#>
PREFIX uniprot: <http://purl.uniprot.org/uniprot/>

SELECT DISTINCT ?pathway ?pathwayname
WHERE {
   ?protein rdf:type biopax3:Protein .
   ?protein biopax3:memberPhysicalEntity
      [biopax3:entityReference uniprot:Q4V3C8] .
   ?pathway biopax3:displayName ?pathwayname .
   ?pathway biopax3:pathwayComponent ?reaction .
   ?reaction ?predicate ?protein .
}

We now have the components to put the query together. We need to combine the query for differentially expressed genes, pull out their corresponding Uniprot references and then feed those into the Reactome pathway query. The connection point between the two queries is ?dbXrefProt. This feeds the protein identifiers into the second query in Reactome, returning pathways only for those proteins returned by the first query. Complete the stub below to combine the two and satisfy the query (remember the class for diabetes type II is efo:EFO_0001360). You should query for the protein, the expression value, the property value and the pathway name:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX biopax3: <http://www.biopax.org/release/biopax-level3.owl#>

SELECT distinct ?dbXrefProt ?expressionValue ?propertyValue  ________
WHERE {
  #Get differentially expressed genes (and proteins) for diabetes TII
  ?value atlasterms:pValue ?pvalue .
  ?value atlasterms:hasFactorValue ?factor .
  ?value rdfs:label ?expressionValue .
  ?value atlasterms:isMeasurementOf ?probe .
  ?probe atlasterms:dbXref ?dbXrefProt .
  ?dbXrefProt a ______ .
  ?factor atlasterms:propertyType ?propertyType .
  ?factor atlasterms:propertyValue ?propertyValue .
  ?factor rdf:type _______ .

      #call the reactome sparql endpoint
      _______   _______ {
     ?protein rdf:type biopax3:Protein .
     ?protein biopax3:memberPhysicalEntity [biopax3:entityReference ?dbXrefProt] .
      ______  biopax3:displayName ?pathwayname .
     ?pathway biopax3:pathwayComponent ?reaction .
     ?reaction ?rel ?protein .
     }
}

Q 3.2 What genes are differentially expressed for diabetes type II and which compounds target them?

For this query we will use the previous Atlas query in combination with the ChEMBL triple store. The ChEMBL SPARQL endpoint is available at:

   http://www.ebi.ac.uk/rdf/services/chembl/sparql

To retrieve compounds that target a particular protein we need to connect several parts of the ChEMBL schema. Starting from an activity, we need to connect the molecule which is used to target the proteins of interest from the gene expression results. In a similar way to the Reactome query, we will use ?dbXrefProt as the protein reference into which the gene expression query results will be fed. Looking at the ChEMBL part first, the query below will return molecules and corresponding targets in ChEMBL, along with the corresponding Uniprot references for human targets:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
SELECT DISTINCT ?moleculeLabel ?target ?dbXrefProt

WHERE{
   ?activity a cco:Activity .
   ?activity cco:hasMolecule ?molecule .
   ?activity cco:hasAssay ?assay .
   ?molecule rdfs:label ?moleculeLabel .
   ?assay cco:hasTarget ?target .
   ?target cco:hasTargetComponent ?targetcmpt .
   ?targetcmpt cco:targetCmptXref ?dbXrefProt .
   ?targetcmpt cco:taxonomy <http://identifiers.org/taxonomy/9606> .
   ?dbXrefProt a cco:UniprotRef .
}
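All of the queries in this tutorial can also be run programmatically over HTTP rather than through the web forms. A minimal sketch using only the Python standard library (the helper names are my own; it assumes the endpoint accepts GET requests with a `query` parameter, and the EBI endpoint URLs may of course change over time):

```python
import urllib.parse
import urllib.request

def sparql_query_url(endpoint, query):
    """Build a GET URL for a SPARQL endpoint, percent-encoding the query."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})

def run_sparql(endpoint, query):
    """Execute the query and return the raw response body as a string.

    Requests JSON results via the standard SPARQL results media type.
    """
    req = urllib.request.Request(
        sparql_query_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# Example (needs network access; endpoint availability is an assumption):
# print(run_sparql("https://www.ebi.ac.uk/rdf/services/chembl/sparql",
#                  "SELECT * WHERE { ?s ?p ?o } LIMIT 1"))
```

The same helper works for the Atlas and Reactome endpoints; only the endpoint URL and query text change.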

Connect this query to the Gene Expression Atlas query to return genes differentially expressed in diabetes type II, together with the labels of the ChEMBL molecules that target them. The query should start:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>

SELECT distinct ?dbXrefProt ?expressionValue ?propertyValue  ?moleculeLabel
WHERE {

Answers

Q2.1 What types are used in the Atlas database? Click here to run query
SELECT DISTINCT ?type 
WHERE {
   ?object a ?type .
}

Q2.2 Which triples are connected to an Experiment in Atlas? What are the types of those things that are connected to an Experiment? Click here to run query
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>

SELECT DISTINCT ?predicate ?object 
WHERE {
   ?experiment a atlasterms:Experiment .
   ?experiment ?predicate ?object
}

To get the types of the things that are connected to an Experiment: Click here to run
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>

SELECT DISTINCT ?type 
WHERE {
   ?experiment a atlasterms:Experiment .
   ?experiment ?predicate ?object .
   ?object a ?type .
}

Q 2.3 Which genes are studied in experiment E-GEOD-1085? Click here to run
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>

SELECT distinct ?dbXref
WHERE {
   atlas:E-GEOD-1085 atlasterms:hasAnalysis ?analysis .
   ?analysis atlasterms:hasExpressionValue ?value .
   ?value atlasterms:isMeasurementOf ?probe .
   ?probe atlasterms:dbXref ?dbXref .
}

Q 2.4 Which genes are studied in experiment E-GEOD-1085, and what are the types of their database cross-references? Click here to run
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>

SELECT distinct ?dbXref ?type
WHERE {
   atlas:E-GEOD-1085 atlasterms:hasAnalysis ?analysis .
   ?analysis atlasterms:hasExpressionValue ?value .
   ?value atlasterms:isMeasurementOf ?probe .
   ?probe atlasterms:dbXref ?dbXref .
   ?dbXref rdf:type ?type .
}

Q 2.4 part 2, restricting to cross-references typed as a subclass of GeneDatabaseReference: Click here to run
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>

SELECT distinct ?dbXref
WHERE {
   atlas:E-GEOD-1085 atlasterms:hasAnalysis ?analysis .
   ?analysis atlasterms:hasExpressionValue ?value .
   ?value atlasterms:isMeasurementOf ?probe .
   ?probe atlasterms:dbXref ?dbXref .
   ?dbXref rdf:type ?type .
   ?type rdfs:subClassOf atlasterms:GeneDatabaseReference .
}

Q 2.5 Which experiments is the gene Ikbke studied in? Click here to run
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>

SELECT distinct ?experiment
WHERE {
   ?experiment atlasterms:hasAnalysis ?analysis .
   ?analysis atlasterms:hasExpressionValue ?value .
   ?value atlasterms:isMeasurementOf ?probe .
   ?probe atlasterms:dbXref ?dbXref .
   ?dbXref rdf:type ?type .
   ?type rdfs:subClassOf atlasterms:GeneDatabaseReference .
   ?dbXref rdfs:label ?label .
   FILTER regex(?label, "Ikbke", "i" )
}

Q 2.6 Which genes are differentially expressed in liver cancer? Click here to run
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>

SELECT distinct ?expressionValue ?dbXref ?pvalue ?propertyValue 
WHERE {
   ?analysis atlasterms:hasExpressionValue ?value .
   ?value a atlasterms:IncreasedDifferentialExpressionRatio .
   ?value rdfs:label ?expressionValue .
   ?value atlasterms:pValue ?pvalue .
   ?value atlasterms:hasFactorValue ?factor .
   ?value atlasterms:isMeasurementOf ?probe .
   ?probe atlasterms:dbXref ?dbXref .
   ?factor atlasterms:propertyValue ?propertyValue .
   ?factor rdf:type efo:EFO_0000182 .
}

Q 2.7 Which human genes are differentially expressed in liver cancer? Click here to run
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
SELECT distinct ?expressionValue ?dbXref ?pvalue ?propertyValue
WHERE {
   ?analysis atlasterms:hasExpressionValue ?value .
   ?value a atlasterms:IncreasedDifferentialExpressionRatio .
   ?value rdfs:label ?expressionValue .
   ?value atlasterms:pValue ?pvalue .
   ?value atlasterms:hasFactorValue ?factor .
   ?value atlasterms:isMeasurementOf ?probe .
   ?probe atlasterms:dbXref ?dbXref .
   ?dbXref atlasterms:taxon obo:NCBITaxon_9606 .
   ?factor atlasterms:propertyValue ?propertyValue .
   ?factor rdf:type efo:EFO_0000182 .
}

Q 3.1 What genes are differentially expressed for diabetes type II and which pathways are they involved in?

Part 1: protein references for diabetes type II: Click here to run
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX biopax3: <http://www.biopax.org/release/biopax-level3.owl#>

SELECT distinct ?dbXrefProt ?expressionValue ?propertyValue
WHERE {
   #Get differentially expressed genes (and proteins) for diabetes TII
   ?value atlasterms:pValue ?pvalue .
   ?value atlasterms:hasFactorValue ?factor .
   ?value rdfs:label ?expressionValue .
   ?value atlasterms:isMeasurementOf ?probe .
   ?probe atlasterms:dbXref ?dbXrefProt .
   ?dbXrefProt a atlasterms:UniprotDatabaseReference .
   ?factor atlasterms:propertyType ?propertyType .
   ?factor atlasterms:propertyValue ?propertyValue .
   ?factor rdf:type efo:EFO_0001360 .
}

Final query using Reactome and Atlas: Click here to run
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX biopax3: <http://www.biopax.org/release/biopax-level3.owl#>

SELECT distinct ?dbXrefProt ?expressionValue ?propertyValue  ?pathwayname
WHERE {
   #Get differentially expressed genes (and proteins) for diabetes TII
   ?value atlasterms:pValue ?pvalue .
   ?value atlasterms:hasFactorValue ?factor .
   ?value rdfs:label ?expressionValue .
   ?value atlasterms:isMeasurementOf ?probe .
   ?probe atlasterms:dbXref ?dbXrefProt .
   ?dbXrefProt a atlasterms:UniprotDatabaseReference .
   ?factor atlasterms:propertyType ?propertyType .
   ?factor atlasterms:propertyValue ?propertyValue .
   ?factor rdf:type efo:EFO_0001360 .
   #call the reactome sparql endpoint
   SERVICE   <http://www.ebi.ac.uk/rdf/services/reactome/sparql> {
   ?protein rdf:type biopax3:Protein .
   ?protein biopax3:memberPhysicalEntity [biopax3:entityReference ?dbXrefProt] .
   ?pathway  biopax3:displayName ?pathwayname .
   ?pathway biopax3:pathwayComponent ?reaction .
   ?reaction ?rel ?protein .
   }
}
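The SERVICE clause in the query above is SPARQL 1.1 federation: the Atlas endpoint forwards the ?dbXrefProt bindings to the Reactome endpoint and joins the results. If an endpoint does not support SERVICE, the same join can be approximated client-side: run the first query, then splice its bindings into the second query as a VALUES clause. A hypothetical sketch (the `#VALUES` placeholder marker and helper name are my own invention, not part of SPARQL or the EBI services):

```python
def inject_values(query_template, var, iris):
    """Replace a '#VALUES' marker in a query template with a VALUES clause
    binding ?var to the given IRIs, so a remote join can be done client-side."""
    clause = "VALUES ?%s { %s }" % (var, " ".join("<%s>" % iri for iri in iris))
    return query_template.replace("#VALUES", clause)

# Usage: take the ?dbXrefProt IRIs returned by the Atlas query and feed them
# into the Reactome pathway query in place of the SERVICE join.
reactome_template = """
SELECT ?pathwayname WHERE {
  #VALUES
  ?protein biopax3:memberPhysicalEntity [biopax3:entityReference ?dbXrefProt] .
  ?pathway biopax3:displayName ?pathwayname .
  ?pathway biopax3:pathwayComponent ?reaction .
  ?reaction ?rel ?protein .
}"""
```

This trades one federated query for two round trips plus a client-side substitution, which can be useful when a SERVICE call times out on large intermediate result sets.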

Q 3.2 What genes are differentially expressed for diabetes type II and which compounds target them? Click here to run
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX efo: <http://www.ebi.ac.uk/efo/>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>

SELECT distinct ?dbXrefProt ?expressionValue ?propertyValue ?moleculeLabel
WHERE {
   #Get differentially expressed genes (and proteins) for diabetes TII
   ?value atlasterms:pValue ?pvalue .
   ?value atlasterms:hasFactorValue ?factor .
   ?value rdfs:label ?expressionValue .
   ?value atlasterms:isMeasurementOf ?probe .
   ?probe atlasterms:dbXref ?dbXrefProt .
   ?dbXrefProt a atlasterms:UniprotDatabaseReference .
   ?factor atlasterms:propertyType ?propertyType .
   ?factor atlasterms:propertyValue ?propertyValue .
   ?factor rdf:type efo:EFO_0001360 .

   #call chembl for compounds targeting them
   SERVICE <http://www.ebi.ac.uk/rdf/services/chembl/sparql> {
      ?act a cco:Activity ;
      cco:hasMolecule ?molecule ;
      cco:hasAssay ?assay .
      ?molecule rdfs:label ?moleculeLabel .
      ?assay cco:hasTarget ?target .
      ?target cco:hasTargetComponent ?targetcmpt .
      ?targetcmpt cco:targetCmptXref ?dbXrefProt .
      ?targetcmpt cco:taxonomy <http://identifiers.org/taxonomy/9606> .
      ?dbXrefProt a cco:UniprotRef .
   }
}