Performance: Use file indexer when scanning with file source #3333

adammcclenaghan · 2024-10-15T11:03:26Z

Description

This PR alters file_source.go to use a new file indexer, rather than the existing directory indexer.

Currently, when scanning a non-archive file, file_source.go applies a filter function to the directory indexer such that all files other than the file being scanned and its parent directory are ignored by the directory indexer. See here.

This approach becomes problematic when the scanned file is inside a directory with a large number of files, for two reasons:

Go’s filepath.Walk guarantees lexical ordering by reading the entire directory into memory before walking it. For sufficiently large containing directories, scanning a single file takes many GB of memory.
The total time to scan the single file also increases with the number of files in the containing directory, due to the directory walk and the time spent on memory allocation.

This pprof output shows heap allocation when scanning a file within a directory containing a large number of files. I’m including it here as evidence for my root cause analysis.

Walking all of the files in the containing directory is redundant when using a file source, since as mentioned above the filter function will ignore everything other than the scanned file and its parent dir.
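To illustrate the redundancy described above, here is a minimal sketch of a filtered walk. It is not Syft's directory indexer: walkWithFilter and makeCrowdedDir are hypothetical helpers written for this example, and the filter simply keeps the parent directory and the target file. The walk still visits every sibling even though the filter discards all but two entries.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// walkWithFilter walks root with filepath.Walk and reports how many entries
// were visited versus how many the filter kept. It is a stand-in for a
// directory indexer with a path filter, not Syft's actual implementation.
func walkWithFilter(root string, keep func(path string) bool) (visited, kept int, err error) {
	err = filepath.Walk(root, func(path string, _ fs.FileInfo, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		visited++ // every entry is visited, whether or not the filter keeps it
		if keep(path) {
			kept++
		}
		return nil
	})
	return visited, kept, err
}

// makeCrowdedDir creates a temporary directory holding the given number of
// sibling files plus one target file (hypothetical test fixture).
func makeCrowdedDir(siblings int) (dir, target string, cleanup func(), err error) {
	dir, err = os.MkdirTemp("", "crowded-")
	if err != nil {
		return "", "", func() {}, err
	}
	cleanup = func() { _ = os.RemoveAll(dir) }
	for i := 0; i < siblings; i++ {
		name := filepath.Join(dir, fmt.Sprintf("sibling-%05d", i))
		if err = os.WriteFile(name, nil, 0o600); err != nil {
			return dir, "", cleanup, err
		}
	}
	target = filepath.Join(dir, "target.txt")
	err = os.WriteFile(target, []byte("x"), 0o600)
	return dir, target, cleanup, err
}

func main() {
	dir, target, cleanup, err := makeCrowdedDir(1000)
	defer cleanup()
	if err != nil {
		panic(err)
	}
	// Keep only the containing directory and the scanned file.
	keep := func(p string) bool { return p == dir || p == target }
	visited, kept, err := walkWithFilter(dir, keep)
	if err != nil {
		panic(err)
	}
	fmt.Printf("visited=%d kept=%d\n", visited, kept)
}
```

With 1000 siblings the walk visits 1002 entries (the directory, the siblings, and the target) to keep 2, which is the cost this PR aims to avoid.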

In this change, I have added a new file indexer which should match the existing behaviour of the directory indexer for a single file source. However, instead of walking the file system, it simply makes an attempt to index the containing directory and the file target.
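As a rough sketch of the idea (not the code in this PR), indexing a single file without a walk could look like the following. The names indexEntry, indexSingleFile and makeTargetWithSiblings are hypothetical and exist only for this example; the real file indexer builds Syft's own index structures.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// indexEntry is a hypothetical, simplified index record.
type indexEntry struct {
	Path  string
	IsDir bool
	Size  int64
}

// indexSingleFile stats only the containing directory and the target file.
// The work done does not depend on how many siblings the target has.
func indexSingleFile(target string) ([]indexEntry, error) {
	abs, err := filepath.Abs(target)
	if err != nil {
		return nil, err
	}
	entries := make([]indexEntry, 0, 2)
	for _, p := range []string{filepath.Dir(abs), abs} {
		info, err := os.Lstat(p) // Lstat: do not follow symlinks while indexing
		if err != nil {
			return nil, fmt.Errorf("unable to stat %q: %w", p, err)
		}
		entries = append(entries, indexEntry{Path: p, IsDir: info.IsDir(), Size: info.Size()})
	}
	return entries, nil
}

// makeTargetWithSiblings creates a one-byte target file in a temporary
// directory alongside the given number of sibling files (test fixture).
func makeTargetWithSiblings(siblings int) (target string, cleanup func(), err error) {
	dir, err := os.MkdirTemp("", "single-")
	if err != nil {
		return "", func() {}, err
	}
	cleanup = func() { _ = os.RemoveAll(dir) }
	for i := 0; i < siblings; i++ {
		name := filepath.Join(dir, fmt.Sprintf("sibling-%05d", i))
		if err = os.WriteFile(name, nil, 0o600); err != nil {
			return "", cleanup, err
		}
	}
	target = filepath.Join(dir, "target.txt")
	err = os.WriteFile(target, []byte("x"), 0o600)
	return target, cleanup, err
}

func main() {
	target, cleanup, err := makeTargetWithSiblings(1000)
	defer cleanup()
	if err != nil {
		panic(err)
	}
	entries, err := indexSingleFile(target)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Printf("%s dir=%t size=%d\n", e.Path, e.IsDir, e.Size)
	}
}
```

The sketch always produces exactly two entries, which mirrors what the filter function would have let through after a full walk.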

I have also added file.go to satisfy the resolver interface when using the file indexer. Much of its functionality matches that of directory.go, so there is some duplicated code; suggestions for improving this are welcome.

The existing directory.go has many unit tests to verify behaviour in the event that the directory being walked contains symlinks etc. I have attempted to simplify the unit tests for file.go as it does not have to handle all of the complexity that directory.go does, but I would really appreciate extra review attention in this area as I may not be aware of all the ways a target for file_source may be defined.

I haven’t got a pprof diagram for the new approach, but memstat profiling has shown, as expected, O(1) heap use with respect to the number of files in the containing directory when using a file source.

Additionally, creating a resolver via file_source now takes O(1) time with respect to the number of files in the containing directory.
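For anyone wanting to reproduce the heap measurement, one possible approach is to read runtime.MemStats before and after the operation. This is an assumption about how such a check could be done, not the exact profiling used for this PR; allocatedDuring and sink are names made up for the example.

```go
package main

import (
	"fmt"
	"runtime"
)

// sink keeps allocations reachable so the compiler cannot place them on the stack.
var sink []byte

// allocatedDuring returns the number of bytes heap-allocated while fn runs,
// based on the cumulative TotalAlloc counter in runtime.MemStats.
// Other goroutines allocating concurrently will inflate the figure.
func allocatedDuring(fn func()) uint64 {
	var before, after runtime.MemStats
	runtime.GC() // settle the heap so the readings are less noisy
	runtime.ReadMemStats(&before)
	fn()
	runtime.ReadMemStats(&after)
	return after.TotalAlloc - before.TotalAlloc
}

func main() {
	small := allocatedDuring(func() { sink = make([]byte, 1<<10) })
	large := allocatedDuring(func() { sink = make([]byte, 1<<24) })
	fmt.Printf("small=%d bytes, large=%d bytes\n", small, large)
}
```

Wrapping resolver creation in fn for directories of increasing size and comparing the results would show whether heap use stays flat.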

Type of change

Performance (make Syft run faster or use less memory, without changing visible behavior much)

Checklist:

I have added unit tests that cover changed behavior
I have tested my code in common scenarios and confirmed there are no regressions
I have added comments to my code, particularly in hard-to-understand sections

Prevents filesystem walks when scanning a single file, to optimise memory & scan times in case the scanned file lives in a directory containing many files. Signed-off-by: adammcclenaghan <adam@mcclenaghan.co.uk>

wagoodman · 2024-10-15T17:35:00Z

syft/internal/fileresolver/file.go

+	return nil
+}
+
+// TODO: These are Copy-pasted from Directory.go - should we consider splitting them out into a shared place?


It might be a good time to make a filetree resolver (unexported, for internal use only) that just takes an index and deals with the access (the duplicated parts). The file and directory resolvers would then embed this type, so the tests and functionality live in one spot (the new filetree resolver) while the file and directory resolvers get to leverage them.
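A minimal sketch of the embedding suggested here, with hypothetical type and method names (the real resolvers work against Syft's index and resolver interface, which are not reproduced in this example):

```go
package main

import (
	"fmt"
	"sort"
)

// filetreeResolver is a hypothetical unexported type that owns the index and
// the shared access logic. A set of paths stands in for the real index.
type filetreeResolver struct {
	index map[string]struct{}
}

func newFiletreeResolver(paths ...string) filetreeResolver {
	idx := make(map[string]struct{}, len(paths))
	for _, p := range paths {
		idx[p] = struct{}{}
	}
	return filetreeResolver{index: idx}
}

// HasPath reports whether the path is present in the index.
func (r filetreeResolver) HasPath(path string) bool {
	_, ok := r.index[path]
	return ok
}

// AllPaths returns every indexed path in sorted order.
func (r filetreeResolver) AllPaths() []string {
	out := make([]string, 0, len(r.index))
	for p := range r.index {
		out = append(out, p)
	}
	sort.Strings(out)
	return out
}

// directoryResolver and fileResolver embed the shared type, so its methods
// are promoted and only need to be implemented and tested once.
type directoryResolver struct {
	filetreeResolver
	root string
}

type fileResolver struct {
	filetreeResolver
	target string
}

func main() {
	d := directoryResolver{
		filetreeResolver: newFiletreeResolver("/data", "/data/a.jar", "/data/b.jar"),
		root:             "/data",
	}
	f := fileResolver{
		filetreeResolver: newFiletreeResolver("/data", "/data/a.jar"),
		target:           "/data/a.jar",
	}
	fmt.Println(d.root, d.AllPaths())
	fmt.Println(f.target, f.HasPath("/data/b.jar"))
}
```

Only the construction of the index differs between the two resolvers; everything that reads from it is shared.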

Shared behaviour for resolving indexed filetrees. Signed-off-by: adammcclenaghan <adam@mcclenaghan.co.uk>

Use file indexer when scanning with file source

987578e



Create filetree resolver

038ecae

