DO NOT MERGE - Test converting a slice of Reactome BioPAX #4

Open · wants to merge 1 commit into master
Conversation

dustine32 (Contributor)

For issue #3. But this PR is not at all meant to be merged (you'll notice all but 13 models have been deleted).

This branch contains the 13 result models that were produced using @deustp01's URL. The saved source BioPAX is in source/ and the ShEx reports are in products/.

To generate the models:

wget https://github.com/geneontology/pathways2GO/releases/download/v1.1.3/biopax2go.jar
wget https://curator.reactome.org/ReactomeRESTfulAPI/RESTfulWS/biopaxExporter/Level3/15869 -O source/15869.owl
java -jar -Xmx8G biopax2go.jar -b source/15869.owl -bg products/blazegraph.jnl -o models/ -e REACTO -dc https://orcid.org/0000-0002-7334-7852 -dp https://reactome.org -lego go-lego.owl

To generate ShEx reports with Minerva (using geneontology/minerva@fbcfbf9):

MINERVA_CLI_MEMORY=12G bin/minerva-cli.sh --validate-go-cams --shex -i /Users/ebertdu/pathways2go/models/ -r /Users/ebertdu/pathways2go/products/ -ontojournal blazegraph-lego.jnl

@kltm Maybe this PR setup is a model for some future "experiments"?

kltm (Member) commented Jul 27, 2021

@dustine32 Looks great.
I think a first question we might want to tackle is: do we want to have separate test/experimental "pipelines" for every source, or could we make do with a list of sources to process? If the latter, I think we could make pretty quick headway with our current pipeline setup.

dustine32 (Contributor, Author)

@kltm By "list of sources," do you mean something like source/15869.owl, source/15870.owl, ... being processed through pathways2go.jar? Or more like all BioPAX, all SynGO, and all WB/MGI/etc. sources being processed through their respective import software in the same Jenkinsfile?

kltm (Member) commented Jul 27, 2021

@dustine32 The former. For example, in this repo, if we had a file (in sources?) that was like:

- name: foo
  url: http://location1
- name: bar
  url: http://location2

A script could get all of them, process them, and give them separate reports. That way, we could use skyhook as it stands now to have different pipelines for different sets of these if necessary, but mainly work with a single one.
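Something like the following, as a rough sketch only: it assumes the file is named sources.yaml, keeps the exact two-key layout above, assumes each URL points at a single BioPAX OWL file, uses a deliberately naive awk parse rather than a real YAML parser, and just copies the biopax2go flags from the PR description.

#!/usr/bin/env bash
# Hypothetical sketch: fetch and process every entry in sources.yaml, keeping
# per-source models and reports so each source gets its own QC output.
set -euo pipefail

# Naive parse of the two-key layout shown above: "- name: foo" then "  url: http://...".
awk '/name:/ {name=$NF} /url:/ {print name, $NF}' sources.yaml |
while read -r name url; do
  mkdir -p "source/$name" "models/$name" "products/$name"
  # Assumes each URL resolves to a single BioPAX OWL file.
  wget "$url" -O "source/$name/input.owl"
  # Flags copied from the PR description above.
  java -jar -Xmx8G biopax2go.jar -b "source/$name/input.owl" \
    -bg "products/$name/blazegraph.jnl" -o "models/$name/" \
    -e REACTO -dc https://orcid.org/0000-0002-7334-7852 -dp https://reactome.org -lego go-lego.owl
done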
A thought, anyway. It may be easier to just drop a Jenkinsfile here and have the GO build machine start processing them as another project set within the org. In that case, we could just do it as a list in the metadata (or a YAML file like the one above).

dustine32 (Contributor, Author)

Ah, ok! Yeah, that metadata file should work.

The "main" pipeline, I assume, would only run on the Reactome release BioPAX file Homo_sapiens.owl, which is currently used to create the R-HSA-* models. A slight wrench is that Homo_sapiens.owl lives inside the larger release zip file and would need to be further specified in the metadata. For instance:

- name: reactome_release
  url: https://reactome.org/download/current/biopax.zip
  specific_files: Homo_sapiens.owl

Or we could arrange for Reactome to provide a direct URL to this file? Or am I overcomplicating it?
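For illustration, the fetch step for that entry could look something like the sketch below. This is only a sketch: the specific_files key is just my made-up example above, and the unzip -j extraction assumes Homo_sapiens.owl sits at (or can be matched near) the top of the zip.

# Hypothetical fetch step for a zipped release that names a specific member file.
wget https://reactome.org/download/current/biopax.zip -O source/biopax.zip
# Extract only the named member into source/, junking any internal directory paths.
unzip -j source/biopax.zip '*Homo_sapiens.owl' -d source/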

deustp01 commented Jul 28, 2021

Or we could arrange for Reactome to provide a direct URL to this file?

A direct URL would be one that pointed only to the human-annotated content in the BioPAX and not to any content, computationally inferred or manually annotated, for model organisms? I can ask Guanming Wu whether that is hard or dangerous, if you'd like.

Actually, thinking for a minute, now I'm confused. That computational inference is performed on the database of material that has been approved for release, after the release process has started. The gk_central database, the one that we edit for curation purposes and that drives the curator.reactome.org web site and the BioPAX3 generation process that yielded your file, does not contain any of the computationally inferred material. It does contain some manually curated non-human pathways. For example, a long time ago we annotated parts of chicken nucleotide metabolism, in pathway R-GGA-419470, and a bit of rice nucleotide metabolism, R-OSA-5655149. Is that what you're seeing? (In which case I'm forgetful - I forgot that gk_central has manual nucleotide annotations for non-human species - and not so confused.)

Continuing to think, maybe this is exactly your point: you would like to go to Reactome and specify a pathway (nucleotide metabolism, or signal transduction) and a species, and get a BioPAX3 file with exactly that material?

kltm (Member) commented Jul 28, 2021

@deustp01
I think we're still exploring the possible at our end. What we're trying to feel our way through is how to (ideally) automate Dustin, so that interested users can modify a metadata file and have the pipeline run on its own with little or no human attention once a metadata PR has been accepted. Part of the answer to that is going to be "what is the likely space that will need to be worked on by people at GO and Reactome" and "can Reactome provide better access to that space". Currently, our pipeline is very limited--essentially: get file, unzip, grab the known file that we want. To automate, we're looking for a pattern that will hopefully meet future needs for this sub-project (and could hopefully be used for future efforts along the same lines).
It would be good for us to know, the next time you talk to Guanming Wu, what extensions to the Reactome pipeline would be (easily) possible--it may be that there is some happy overlap.

ukemi commented Jul 28, 2021

But something to also think about is making this all generic in the future. If there is work on the Reactome side and they are the responsible party, is this going to be the standard for all the groups we support? If so, we need some type of commitment that they will live up to their end of the bargain, don't we?

kltm (Member) commented Jul 28, 2021

The contract I think we can commit to now is that:

  • upstream data goes into the *-go-cams repo's /source/ directory (ideally pushed, but pull could be negotiated as well, so the contract moves to an API or HTTP location)
  • a process documented in the README is applied, producing intermediate products, including QC reports and objective criteria for whether a data load passes
  • a process is applied (possibly the one above) to produce a final set of models in the models/ directory

The first step there is obviously the big one, so we're still trying to get a feel for what makes sense. My druthers at the moment would be to have the upstream push (leaving the responsibility with them for updates), but, practically, it may be easier to do it some other way.
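To make that concrete with the 15869 example from the PR description, the three steps could chain together roughly as below. The grep at the end is only a placeholder gate: the actual report format and the objective pass criteria still need to be agreed on.

# 1. Upstream data lands in source/ (pushed by Reactome, or pulled as here).
wget https://curator.reactome.org/ReactomeRESTfulAPI/RESTfulWS/biopaxExporter/Level3/15869 -O source/15869.owl

# 2. Documented process producing intermediate products, including QC reports.
java -jar -Xmx8G biopax2go.jar -b source/15869.owl -bg products/blazegraph.jnl -o models/ -e REACTO -dc https://orcid.org/0000-0002-7334-7852 -dp https://reactome.org -lego go-lego.owl
MINERVA_CLI_MEMORY=12G bin/minerva-cli.sh --validate-go-cams --shex -i models/ -r products/ -ontojournal blazegraph-lego.jnl

# 3. Objective pass/fail gate before anything in models/ is accepted
#    (hypothetical marker string; substitute whatever criterion the reports actually support).
if grep -rq "nonconformant" products/; then
  echo "ShEx validation failures found; rejecting this data load" >&2
  exit 1
fi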
