
Tracking pandora compare RAM usage and runtime #148

Closed
leoisl opened this issue Jun 27, 2019 · 19 comments

@leoisl
Collaborator

leoisl commented Jun 27, 2019

This is not really an issue - just a way of documenting pandora compare RAM usage across commits, and of knowing whether we need to improve it.

Dataset for massif profiling:
downsampled klebs dataset with 8000 PRGs and 50 samples at 45x coverage (for the PRG, reads, command-line, etc, see /hps/nobackup/iqbal/leandro/compare_test/profiling_test_8000_PRGs_50_samples on yoda).

Massif profiling:

Git commit: leoisl@5b329a7

Massif output:
massif.out.DEBUG_5b329a.txt

Peak snapshot screenshot:
[image: massif.out.DEBUG_5b329a peak snapshot]

Comments on the three allocation sites using the most RAM:

  1. https://github.com/leoisl/pandora/blob/5b329a78d89eaa0f85592258a8fb378677f71587/include/kmergraph.h#L91

Takes 413MB / 1.5GB (26.4% of the RAM): the current RAM bottleneck. The RAM usage of this data structure grows with the number of PRGs, the number of nodes in the PRGs, and the number of samples. It might be tricky to improve on this.

  2. https://github.com/leoisl/pandora/blob/5b329a78d89eaa0f85592258a8fb378677f71587/src/kmergraph.cpp#L299

Takes 226MB / 1.5GB (14.45% of RAM) - this grows only with the number of PRGs and the number of nodes in the PRGs - it does not grow with the number of samples, so I don't think we need to worry about improving the memory usage here.

  3. https://github.com/leoisl/pandora/blob/5b329a78d89eaa0f85592258a8fb378677f71587/src/pangenome/panread.cpp#L37

Takes 167MB / 1.5GB (10.68% of RAM) - could be improved (see the TODOs in https://github.com/leoisl/pandora/blob/5b329a78d89eaa0f85592258a8fb378677f71587/include/minihit.h ), but it does not seem to be a bottleneck.

@leoisl leoisl self-assigned this Jun 27, 2019
@leoisl
Collaborator Author

leoisl commented Jun 27, 2019

Heavy dataset just to measure maximum RAM usage (no massif profiling):

Full klebs dataset with 61340 PRGs and 151 samples at 45x coverage (see /hps/nobackup/iqbal/leandro/compare_test/150_samples_test for details).

Git commit: leoisl@888fff3

Maximum RAM usage: 17.1 GB

Info about time performance:
Number of requested threads: 16
Percent of CPU this job got: 811% (on average, 8.1 threads working simultaneously)
Wall clock time: 5h 21min

@leoisl
Collaborator Author

leoisl commented Jun 27, 2019

Please tell me:

  1. Should we track pandora compare RAM usage on another dataset?

  2. Should we improve RAM usage?

  3. Should we improve runtime?

Heavier datasets than the full klebs dataset are very welcome.

@iqbal-lab
Collaborator

For that first one, surely we don't need a uint32 for the number of samples.
Trickier to organise would be getting rid of the floats and quantizing the likelihoods.

@rmcolq
Collaborator

rmcolq commented Jun 27, 2019

In that first one we don't have a uint32 for the number of samples; we have a pair of uint32 to represent the forward and reverse coverage of each sample on each node in the kmergraph. We still don't need uint32, though. At some point we updated almost all uints in pandora to uint32, on Robyn's advice, to reduce the possibility of implicit uint conversions. This may be the time to rethink that.

@rmcolq
Collaborator

rmcolq commented Jun 27, 2019

I see no reason to track pandora compare RAM in another dataset just yet. This is a 150X improvement in RAM use over what I had before, and means that we can in theory scale to a dataset of 1000 samples with 45X coverage on yoda already (assuming linear growth, which isn't quite accurate?).

If there are obvious things that would reduce RAM like changing the coverage to a uint16 (and making sure we had some catch just in case there were some bizarrely common kmer) we should do those.

Is runtime still linear? My gut feeling is that maybe that could be improved a bit next?

@leoisl
Collaborator Author

leoisl commented Jun 27, 2019

Agreed. This is how quantizing coverage is done in GATB: GATB/gatb-core#19 (comment) ... there are probably many ways to do this... then we would not need a uint32 for the coverage, but we would have some error... I think even a very simple encoding using uint16 in pandora is enough: coverages < 65535 are encoded exactly, and coverages >= 65535 are all encoded as 65535. I don't think we will have many of the latter cases anyway, assuming uniform read coverage, so the error would be close to 0 and it would cut the RAM bottleneck in half.

If we want to decrease RAM usage even further, we could go for uint8, accepting some error, using e.g. the GATB encoding.

What do you think?

@rmcolq
Collaborator

rmcolq commented Jun 27, 2019

That's a neat method! My only hesitation is the idea of using any more GATB...

@leoisl
Collaborator Author

leoisl commented Jun 27, 2019

> I see no reason to track pandora compare RAM in another dataset just yet. This is a 150X improvement in RAM use over what I had before, and means that we can in theory scale to a dataset of 1000 samples with 45X coverage on yoda already (assuming linear growth, which isn't quite accurate?).
>
> If there are obvious things that would reduce RAM like changing the coverage to a uint16 (and making sure we had some catch just in case there were some bizarrely common kmer) we should do those.
>
> Is runtime still linear? My gut feeling is that maybe that could be improved a bit next?

My guess is that it scales in RAM to thousands of samples. The peak memory usage shown here occurs while pandora is mapping a single sample (creating the pangraph from the sample). This does not change with the number of samples, since we process each sample one by one. If we look strictly after all the reads are mapped, we have 1.3 GB of RAM usage. The 2nd and 3rd biggest RAM consumers are the KmerGraphs and the Index, and these do not depend on the number of samples, but rather on the PRGs. The biggest RAM consumer is the coverage info on all KmerNodes, and that is 415MB for 150 samples. I think it is linear, so for 1000 samples we would have 2.7GB of coverage information (which can be decreased by using uint16 or uint8). So it seems to scale fine, but I could be totally wrong; the best thing is to simply run and check, but we need such a dataset.

And you are right about runtime - this will be the bottleneck... I'd guess from a few days to a week or more for 1000 samples, which is not very nice...

@rmcolq
Collaborator

rmcolq commented Jun 27, 2019

That is really great!! Runtime it is then...

@leoisl
Collaborator Author

leoisl commented Jun 27, 2019

> That's a neat method! My only hesitation is the idea of using any more GATB...

GATB is already a dependency due to the de novo module, so I think there is no issue with using GATB stuff here... If you prefer, we could replicate what they do for the coverage encoding: https://github.com/GATB/gatb-core/blob/2332064bf74032e801537d43dce8f87f018cea4a/gatb-core/src/gatb/tools/collections/impl/MapMPHF.hpp#L96-L157

@leoisl
Collaborator Author

leoisl commented Jun 27, 2019

> That is really great!! Runtime it is then...

Yeah... but these are just guesses; I'd prefer to work with data, so we can be sure we are working on the true bottlenecks...

Do we have any dataset with thousands of samples? If not, can we simply duplicate the samples in the klebs dataset until we reach thousands? If exact duplication is not realistic, maybe simulate reads or something like that?

@leoisl leoisl changed the title Tracking pandora compare RAM usage Tracking pandora compare RAM usage and runtime Jun 27, 2019
@rmcolq
Collaborator

rmcolq commented Jun 27, 2019

@iqbal-lab said he had some with 1000s to process in the near future. Probably doesn't have them right now.

Ha, let's not reimplement - ignore what I said and either use GATB, or use uint16.

@leoisl
Collaborator Author

leoisl commented Jun 27, 2019

> I see no reason to track pandora compare RAM in another dataset just yet. This is a 150X improvement in RAM use over what I had before, and means that we can in theory scale to a dataset of 1000 samples with 45X coverage on yoda already (assuming linear growth, which isn't quite accurate?).
>
> If there are obvious things that would reduce RAM like changing the coverage to a uint16 (and making sure we had some catch just in case there were some bizarrely common kmer) we should do those.
>
> Is runtime still linear? My gut feeling is that maybe that could be improved a bit next?

> My guess is that it scales in RAM to thousands of samples. The peak memory usage shown here occurs while pandora is mapping a single sample (creating the pangraph from the sample). This does not change with the number of samples, since we process each sample one by one. If we look strictly after all the reads are mapped, we have 1.3 GB of RAM usage. The 2nd and 3rd biggest RAM consumers are the KmerGraphs and the Index, and these do not depend on the number of samples, but rather on the PRGs. The biggest RAM consumer is the coverage info on all KmerNodes, and that is 415MB for 150 samples. I think it is linear, so for 1000 samples we would have 2.7GB of coverage information (which can be decreased by using uint16 or uint8). So it seems to scale fine, but I could be totally wrong; the best thing is to simply run and check, but we need such a dataset.
>
> And you are right about runtime - this will be the bottleneck... I'd guess from a few days to a week or more for 1000 samples, which is not very nice...

Ooooops, I was totally mistaken here... I computed these stats on the downsampled dataset used for massif profiling, and presented them as if they were for the full dataset... I think linear RAM growth is an upper bound, so this seems runnable on yoda, but the best way to know is to run such a dataset itself.

@leoisl
Collaborator Author

leoisl commented Jun 27, 2019

Ok, let's switch to uint16 for the coverage information then - almost no error and half the RAM usage for recording coverage. Let's go for uint8 only if RAM becomes a bottleneck.

@iqbal-lab
Collaborator

Yes we do have thousands now! 3000 klebs and 8000 E. coli.
I think it's runtime that's the performance barrier now, and we need to move to parallelising across the cluster. We could do huge numbers very fast.

But also, in terms of priorities, we have bigger holes to fill around how we run pandora smoothly: integrating the output of de novo back in, make prg, etc.

@iqbal-lab
Collaborator

Re coverage:

  1. Rachel is now potentially doing viruses - beware very-high-coverage samples.
  2. Is Illumina a bigger issue than Nanopore in terms of repeats causing very high coverages?

@leoisl
Collaborator Author

leoisl commented Jun 27, 2019

> Yes we do have thousands now! 3000 klebs and 8000 E. coli.
> I think it's runtime that's the performance barrier now, and we need to move to parallelising across the cluster. We could do huge numbers very fast.
>
> But also, in terms of priorities, we have bigger holes to fill around how we run pandora smoothly: integrating the output of de novo back in, make prg, etc.

Agreed, I will postpone pandora compare performance improvements for later then. But I am interested to know how the current version performs on these huge datasets... Could you point me to them on yoda or ebi-cli?

> Re coverage:
>
> 1. Rachel is now potentially doing viruses - beware very-high-coverage samples.
> 2. Is Illumina a bigger issue than Nanopore in terms of repeats causing very high coverages?

We will encode with uint16: coverages < 65535 are encoded exactly, and coverages >= 65535 are all encoded as 65535. Do you think this is enough?

@iqbal-lab
Collaborator

Sounds good to me!

@leoisl
Collaborator Author

leoisl commented May 7, 2021

Closing, outdated. We have better performance tracking with the plots generated to evaluate pandora performance for the paper.

@leoisl leoisl closed this as completed May 7, 2021
3 participants