
Tracking pandora compare RAM usage and runtime #148

Closed
leoisl opened this issue Jun 27, 2019 · 19 comments

@leoisl
Collaborator

leoisl commented Jun 27, 2019

This is not really an issue - just a way of documenting pandora compare RAM usage across commits, and of knowing whether we need to improve it.

Dataset for massif profiling:
downsampled klebs dataset with 8000 PRGs and 50 samples at 45x coverage (for the PRG, reads, command-line, etc, see /hps/nobackup/iqbal/leandro/compare_test/profiling_test_8000_PRGs_50_samples on yoda).

Massif profiling:

Git commit: leoisl@5b329a7

Massif output:
massif.out.DEBUG_5b329a.txt

Peak snapshot screenshot:
[image: massif.out.DEBUG_5b329a peak snapshot]

Comments on the three allocation sites using the most RAM:

  1. https://github.com/leoisl/pandora/blob/5b329a78d89eaa0f85592258a8fb378677f71587/include/kmergraph.h#L91

Takes 413MB / 1.5GB (26.4% of the RAM): the current RAM bottleneck. The RAM usage of this data structure grows with the number of PRGs, the number of nodes in the PRGs, and the number of samples. It might be tricky to improve on this.

  2. https://github.com/leoisl/pandora/blob/5b329a78d89eaa0f85592258a8fb378677f71587/src/kmergraph.cpp#L299

Takes 226MB / 1.5GB (14.45% of RAM) - this grows only with the number of PRGs and the number of nodes in the PRGs - it does not grow with the number of samples, so I don't think we need to worry about improving the memory usage here.

  3. https://github.com/leoisl/pandora/blob/5b329a78d89eaa0f85592258a8fb378677f71587/src/pangenome/panread.cpp#L37

Takes 167MB / 1.5GB (10.68% of RAM) - could be improved (see the TODOs in https://github.com/leoisl/pandora/blob/5b329a78d89eaa0f85592258a8fb378677f71587/include/minihit.h ), but it does not seem to be a bottleneck.

@leoisl leoisl self-assigned this Jun 27, 2019
@leoisl
Collaborator Author

leoisl commented Jun 27, 2019

Heavy dataset just to measure maximum RAM usage (no massif profiling):

Full klebs dataset with 61340 PRGs and 151 samples at 45x coverage (see /hps/nobackup/iqbal/leandro/compare_test/150_samples_test for details).

Git commit: leoisl@888fff3

Maximum RAM usage: 17.1 GB

Info about time performance:
Number of requested threads: 16
Percent of CPU this job got: 811% (on average, 8.1 threads working simultaneously)
Wall clock time: 5h 21min

@leoisl
Collaborator Author

leoisl commented Jun 27, 2019

Please tell me:

  1. Should we track pandora compare RAM usage on another dataset?

  2. Should we improve RAM usage?

  3. Should we improve runtime?

Heavier datasets than the full klebs dataset are very welcome.

@iqbal-lab
Collaborator

For that first one, surely we don't need a uint32 for the number of samples.
Trickier to organise would be getting rid of the floats and quantizing the likelihoods.

@rmcolq
Collaborator

rmcolq commented Jun 27, 2019

In that first one we don't have a uint32 for the number of samples; we have a pair of uint32 to represent the forward and reverse coverage of each sample on each node in the kmergraph. We still don't need uint32, though. At some point we updated almost all uints in pandora to uint32, on Robyn's advice, to reduce the possibility of implicit uint conversions. This may be the time to rethink that.

@rmcolq
Collaborator

rmcolq commented Jun 27, 2019

I see no reason to track pandora compare RAM in another dataset just yet. This is a 150X improvement in RAM use over what I had before, and means that we can in theory scale to a dataset of 1000 samples with 45X coverage on yoda already (assuming linear growth, which isn't quite accurate?).

If there are obvious things that would reduce RAM like changing the coverage to a uint16 (and making sure we had some catch just in case there were some bizarrely common kmer) we should do those.

Is runtime still linear? My gut feeling is that maybe that could be improved a bit next?

@leoisl
Collaborator Author

leoisl commented Jun 27, 2019

Agreed. This is how quantizing coverage is done in GATB: GATB/gatb-core#19 (comment) ... there are probably many ways to do this... then we would not need a uint32 for the coverage, but we would have some error... I think even a very simple encoding using uint16 in pandora is enough: coverages < 65535 are encoded exactly, and coverages >= 65535 are all encoded as 65535. I don't think we will have many of the latter cases anyway, assuming uniform read coverage, so the error would be close to 0 and it would cut the RAM bottleneck in half.

If we want to decrease RAM usage even further, we could go for uint8, accepting some error, using e.g. the GATB encoding.

What do you think?

@rmcolq
Collaborator

rmcolq commented Jun 27, 2019

That's a neat method! My only hesitation is the idea of using any more GATB...

@leoisl
Collaborator Author

leoisl commented Jun 27, 2019

> I see no reason to track pandora compare RAM in another dataset just yet. This is a 150X improvement in RAM use over what I had before, and means that we can in theory scale to a dataset of 1000 samples with 45X coverage on yoda already (assuming linear growth, which isn't quite accurate?).
>
> If there are obvious things that would reduce RAM like changing the coverage to a uint16 (and making sure we had some catch just in case there were some bizarrely common kmer) we should do those.
>
> Is runtime still linear? My gut feeling is that maybe that could be improved a bit next?

My guess is that it scales in RAM to thousands of samples. The peak memory usage shown here occurs while pandora is mapping a single sample (creating the pangraph from the sample). This does not change with the number of samples, since we process each sample one by one. If we look strictly after all the reads are mapped, we have 1.3 GB of RAM usage. The 2nd and 3rd biggest RAM consumers are the KmerGraphs and the Index, and these do not depend on the number of samples, but rather on the PRGs. The biggest RAM consumer is the coverage info on all KmerNodes, and that is 415MB for 150 samples. I think it is linear, so for 1000 samples we would have 2.7GB of coverage information (which can be decreased by using uint16 or uint8). So it seems to scale fine, but I could be totally wrong; the best thing is to simply run and check, but we need such a dataset.

And you are right about runtime - this will be the bottleneck... I'd guess from a few days to a week or more for 1000 samples, which is not very nice...

@rmcolq
Collaborator

rmcolq commented Jun 27, 2019

That is really great!! Runtime it is then...

@leoisl
Collaborator Author

leoisl commented Jun 27, 2019

> That's a neat method! My only hesitation is the idea of using any more GATB...

GATB is already a dependency due to the de novo module, so I think there is no issue with using GATB stuff here... If you prefer, we could replicate what they do for the coverage encoding: https://github.com/GATB/gatb-core/blob/2332064bf74032e801537d43dce8f87f018cea4a/gatb-core/src/gatb/tools/collections/impl/MapMPHF.hpp#L96-L157

@leoisl
Collaborator Author

leoisl commented Jun 27, 2019

> That is really great!! Runtime it is then...

Yeah... but these are just guesses; I'd prefer to work with data, so we can be sure we are working on the true bottlenecks...

Do we have any dataset with thousands of samples? If not, can we simply duplicate the samples in the klebs dataset until we reach thousands? If exact duplication is not realistic, maybe simulate reads or something like that?

@leoisl leoisl changed the title Tracking pandora compare RAM usage Tracking pandora compare RAM usage and runtime Jun 27, 2019
@rmcolq
Collaborator

rmcolq commented Jun 27, 2019

@iqbal-lab said he had some with 1000s to process in the near future. Probably doesn't have them right now.

Ha, let's not reimplement - ignore what I said and either use GATB, or use uint16.

@leoisl
Collaborator Author

leoisl commented Jun 27, 2019

> I see no reason to track pandora compare RAM in another dataset just yet. This is a 150X improvement in RAM use over what I had before, and means that we can in theory scale to a dataset of 1000 samples with 45X coverage on yoda already (assuming linear growth, which isn't quite accurate?).
>
> If there are obvious things that would reduce RAM like changing the coverage to a uint16 (and making sure we had some catch just in case there were some bizarrely common kmer) we should do those.
>
> Is runtime still linear? My gut feeling is that maybe that could be improved a bit next?

> My guess is that it scales in RAM to thousands of samples. The peak memory usage shown here occurs while pandora is mapping a single sample (creating the pangraph from the sample). This does not change with the number of samples, since we process each sample one by one. If we look strictly after all the reads are mapped, we have 1.3 GB of RAM usage. The 2nd and 3rd biggest RAM consumers are the KmerGraphs and the Index, and these do not depend on the number of samples, but rather on the PRGs. The biggest RAM consumer is the coverage info on all KmerNodes, and that is 415MB for 150 samples. I think it is linear, so for 1000 samples we would have 2.7GB of coverage information (which can be decreased by using uint16 or uint8). So it seems to scale fine, but I could be totally wrong; the best thing is to simply run and check, but we need such a dataset.
>
> And you are right about runtime - this will be the bottleneck... I'd guess from a few days to a week or more for 1000 samples, which is not very nice...

Ooooops, I was totally mistaken here... I computed these stats on the downsampled dataset used for massif profiling, and presented them as if they were for the full dataset... I think linear RAM growth is an upper bound, so this seems runnable on yoda, but the best way to know is to run such a dataset itself.

@leoisl
Collaborator Author

leoisl commented Jun 27, 2019

Ok, let's switch to uint16 for the coverage information then - almost no error and half the RAM usage for recording coverage. Let's go for uint8 only if RAM becomes a bottleneck.

@iqbal-lab
Collaborator

Yes we do have thousands now! 3000 klebs and 8000 E. coli.
I think it's runtime that's the performance barrier now, and we need to move to parallelising across the cluster. We could do huge numbers very fast.

But also, in terms of priorities, we have bigger holes to fill around how we run pandora smoothly: integrating the output of de novo back in, make prg, etc.

@iqbal-lab
Collaborator

Re coverage:

  1. Rachel is now potentially doing viruses - beware very-high-coverage samples.
  2. Is Illumina a bigger issue than Nanopore in terms of repeats causing very high coverages?

@leoisl
Collaborator Author

leoisl commented Jun 27, 2019

> Yes we do have thousands now! 3000 klebs and 8000 E. coli.
> I think it's runtime that's the performance barrier now, and we need to move to parallelising across the cluster. We could do huge numbers very fast.
>
> But also, in terms of priorities, we have bigger holes to fill around how we run pandora smoothly: integrating the output of de novo back in, make prg, etc.

Agreed, I will postpone pandora compare performance improvements for later then. But I am interested to know how the current version performs on these huge datasets... Could you point me to them on yoda or ebi-cli?

> Re coverage:
>
> 1. Rachel is now potentially doing viruses - beware very-high-coverage samples.
> 2. Is Illumina a bigger issue than Nanopore in terms of repeats causing very high coverages?

We will encode with uint16: coverages < 65535 are encoded exactly, and coverages >= 65535 are all encoded as 65535. Do you think this is enough?

@iqbal-lab
Collaborator

Sounds good to me!

@leoisl
Collaborator Author

leoisl commented May 7, 2021

Closing, outdated. We have better performance tracking with the plots generated to evaluate pandora performance for the paper.

@leoisl leoisl closed this as completed May 7, 2021
3 participants