Performance: Use file indexer when scanning with file source #3333

Open: adammcclenaghan wants to merge 2 commits into main

adammcclenaghan
Contributor

Description

This PR alters file_source.go to use a new file indexer, rather than the existing directory indexer.

Currently, when scanning a non-archive file, file_source.go applies a filter function to the directory indexer so that the indexer ignores every file other than the file being scanned and its parent directory. See here.

This approach becomes problematic when the scanned file is inside a directory with a large number of files, for two reasons:

  1. Go’s filepath.Walk guarantees lexical ordering by reading the entire directory listing into memory before visiting any entry. For a sufficiently large containing directory, scanning a single file can take many GB of memory.
  2. The total time to scan the single file also grows with the number of files in the containing directory, because of the directory walk itself and the memory allocation it requires.

This pprof output shows heap allocation when scanning a file within a directory containing a large number of files. I’m including it here as evidence for my root cause analysis.
[Screenshot: pprof heap profile, 2024-10-15]

Walking all of the files in the containing directory is redundant when using a file source, since as mentioned above the filter function will ignore everything other than the scanned file and its parent dir.

In this change, I have added a new file indexer which should match the existing behaviour of the directory indexer for a single file source. However, instead of walking the file system, it simply makes an attempt to index the containing directory and the file target.

I have also added file.go to satisfy the resolver interface when using the file indexer. Much of the functionality matches that of directory.go, so there is some duplicated code. Suggestions for improvement here would be welcome.

The existing directory.go has many unit tests to verify behaviour in the event that the directory being walked contains symlinks etc. I have attempted to simplify the unit tests for file.go as it does not have to handle all of the complexity that directory.go does, but I would really appreciate extra review attention in this area as I may not be aware of all the ways a target for file_source may be defined.

I haven’t got a pprof diagram for the new approach, but memstat profiling has shown, as expected, O(1) heap use with respect to the number of files in the containing directory when using a file source.

Creating a resolver via a file_source also now takes O(1) time with respect to the number of files in the containing directory.
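One way to check heap use of this kind is sketched below. It assumes the measurement is based on Go's `runtime.MemStats`, which is an assumption rather than a description of the exact method used for this PR; `heapAllocatedBy` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink keeps allocations reachable so the compiler cannot elide them.
var sink []byte

// heapAllocatedBy returns the number of heap bytes allocated while fn runs,
// using the cumulative TotalAlloc counter from runtime.MemStats.
func heapAllocatedBy(fn func()) uint64 {
	var before, after runtime.MemStats
	runtime.GC() // settle the heap before taking the first reading
	runtime.ReadMemStats(&before)
	fn()
	runtime.ReadMemStats(&after)
	return after.TotalAlloc - before.TotalAlloc
}

func main() {
	for _, n := range []int{1 << 10, 1 << 20} {
		n := n
		got := heapAllocatedBy(func() { sink = make([]byte, n) })
		fmt.Printf("requested=%d measured=%d\n", n, got)
	}
}
```

Wrapping resolver creation in such a helper for directories of increasing size would show whether allocation stays flat as the sibling count grows.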

Type of change

  • Performance (make Syft run faster or use less memory, without changing visible behavior much)

Checklist:

  • I have added unit tests that cover changed behavior
  • I have tested my code in common scenarios and confirmed there are no regressions
  • I have added comments to my code, particularly in hard-to-understand sections

Prevents filesystem walks when scanning a single file, to
optimise memory & scan times in case the scanned file
lives in a directory containing many files.

Signed-off-by: adammcclenaghan <adam@mcclenaghan.co.uk>
return nil
}

// TODO: These are Copy-pasted from Directory.go - should we consider splitting them out into a shared place?
Contributor

It might be a good time to make a filetree resolver (unexported, for internal use only) that simply takes an index and handles access to it (the duplicated parts). The file and directory resolvers would then embed this type, so the shared functionality and its tests live in one spot (the new filetree resolver) while the file and directory resolvers get to leverage them.
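A minimal sketch of that embedding, with hypothetical type and function names and a plain map standing in for the real index, might look like this:

```go
package main

import "fmt"

// filetreeResolver is a hypothetical unexported type that owns an index and
// the shared access logic. The map is a stand-in for the real index.
type filetreeResolver struct {
	index map[string]struct{}
}

// HasPath reports whether the path was indexed. It is defined once here and
// promoted to every type that embeds filetreeResolver.
func (r *filetreeResolver) HasPath(path string) bool {
	_, ok := r.index[path]
	return ok
}

// File and Directory differ only in how they build the index; all access
// goes through the embedded resolver.
type File struct{ filetreeResolver }

type Directory struct{ filetreeResolver }

// newFile indexes just the parent directory and the target file.
func newFile(parent, target string) *File {
	return &File{filetreeResolver{index: map[string]struct{}{
		parent: {},
		target: {},
	}}}
}

// newDirectory indexes every path handed to it, as a walk would.
func newDirectory(paths ...string) *Directory {
	idx := make(map[string]struct{}, len(paths))
	for _, p := range paths {
		idx[p] = struct{}{}
	}
	return &Directory{filetreeResolver{index: idx}}
}

func main() {
	f := newFile("/data", "/data/app.jar")
	d := newDirectory("/data", "/data/app.jar", "/data/other.jar")
	fmt.Println(f.HasPath("/data/other.jar"), d.HasPath("/data/other.jar"))
}
```

With this shape, tests for the shared access methods can target `filetreeResolver` alone, and each resolver's own tests only need to cover index construction.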

Shared behaviour for resolving indexed filetrees.

Signed-off-by: adammcclenaghan <adam@mcclenaghan.co.uk>

2 participants