Add option to generate hashes from list of files #971

Open
wants to merge 8 commits into
base: dev

Ian-Clowes
Contributor

Use --generate (-g) option to consume [files] on command line as containers of lists of files to generate hashes for

@Ian-Clowes
Contributor Author

Finally figured out all the CI snags. I'm not massively familiar with git - let me know if I should retry this.

@t-mat
Contributor

t-mat commented Oct 8, 2024

Hi @Ian-Clowes, thanks for your effort.

But I have a quick question: how about backtick "`" notation?

(edit: backtick notation with "< file" is incompatible with sh and POSIX shells though)

cd 🐈/path/to/xxHash🐈
printf "README.md\nLICENSE\n" > pr-971-args
cat pr-971-args
# README.md
# LICENSE
./xxhsum `< pr-971-args`
# 9c871fd3096b631d  README.md
# 4a2ace65fa00bd3e  LICENSE

(edit) : alternative notation

./xxhsum $(< pr-971-args)

@Ian-Clowes
Contributor Author

Ian-Clowes commented Oct 8, 2024

You mean as an alternative way of passing in multiple files from a static or dynamically generated list? I agree that makes sense where things like the shell and helpers like xargs are available, although the recipes can be a bit opaque.

I added the option mainly because...

  • I have limited native shell / utility flexibility as I'm using it on Windows
  • The task is to find which of the c. 1,000,000 files on two disks that should be copies of each other have picked up empty sectors after one disk was re-cloned from a failing SSD, so a single invocation with the files as parameters would be challenging
  • I didn't want to have the overhead of the xxhsum binary being repeatedly invoked if I did install xargs or similar - preferring a single instance that would work through all the files
  • The file names are full of spaces, which would also make a multi-parameter command line harder to craft trivially, although --print0 and --arg0 could make it workable
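
The point about spaces in file names can be illustrated with a small sketch. It is not taken from the PR: the file names and the list file are made up, and `$(cat filelist)` is used in place of `$(< filelist)` so that it also runs under a plain POSIX sh. It compares how many arguments an unquoted command substitution produces with how many names a line-by-line reader sees.

```shell
#!/bin/sh
# Illustrative only: the file names and the list file are made up.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

printf 'alpha\n' > 'my report.txt'
printf 'beta\n'  > 'notes.txt'
printf '%s\n' 'my report.txt' 'notes.txt' > filelist

# Unquoted command substitution splits on spaces as well as newlines,
# so 'my report.txt' becomes two separate arguments.
set -- $(cat filelist)
split_count=$#

# Reading the list one line at a time keeps each name intact.
line_count=0
while IFS= read -r name; do
    if [ -f "$name" ]; then
        line_count=$((line_count + 1))
    fi
done < filelist

echo "word-split arguments: $split_count"            # 3
echo "line-based names found on disk: $line_count"   # 2
```

A tool that reads the list file itself avoids the word-splitting step entirely, which is the situation the bullet points above describe.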

@Cyan4973
Owner

Cyan4973 commented Oct 9, 2024

This new capability probably deserves to be documented.

I see xxhsum -h has been updated, which is a good start, but this help format is designed to be very condensed, essentially a one-line reminder.

For more details on what the capability does and why (including the relevant comment from @t-mat), one could use the man page template at cli/xxhsum.1.md to provide them.
If your feature allows circumventing shell expansion limitations, for example, and works on Windows where shell expansion barely exists, these are good rationales to justify its existence.

Another thing I would like to see is what the format of [files] should be when it's used to ingest file names, as opposed to checking their hashes as it's used today.
It seems to expect one file per line, in which case it's simply better stated than guessed.
I suppose it wouldn't work if filenames were just space separated, or comma separated.
There might be some other complex details around special characters worth mentioning too.
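
As a rough illustration of the one-name-per-line format discussed here, the sketch below builds a list with `find` and consumes it line by line. It only assumes that the list is read one name per line, as the comment above supposes; the directory layout and names are made up, and `cksum` stands in for xxhsum so the sketch runs without the patched binary.

```shell
#!/bin/sh
# Illustrative only: directory layout and names are made up, and cksum
# stands in for xxhsum so the sketch runs without the patched binary.
set -eu
workdir=$(mktemp -d)
cd "$workdir"
mkdir data
printf 'one\n' > 'data/a file.bin'
printf 'two\n' > 'data/b,c.bin'

# One name per line: spaces and commas inside a name need no escaping.
find data -type f | sort > filelist

hashed=0
missing=0
while IFS= read -r name; do
    if [ -f "$name" ]; then
        cksum "$name"
        hashed=$((hashed + 1))
    else
        echo "not found: $name" >&2
        missing=$((missing + 1))
    fi
done < filelist

# A comma-separated line is read as one name, which does not exist.
printf '%s\n' 'data/a file.bin,data/b,c.bin' > badlist
bad_missing=0
while IFS= read -r name; do
    [ -f "$name" ] || bad_missing=$((bad_missing + 1))
done < badlist

echo "hashed=$hashed missing=$missing bad_missing=$bad_missing"
```

A name that itself contains a newline cannot be expressed in this format, which is one of the special-character details that may be worth stating in the documentation.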

Describe the use and rationale for the --generate option in the man page
Fix two snags:
- Bug / quirk of Visual Studio causes failure to process CMakeLists.txt if .gitignore contains *.txt
- .vs/ folder artefacts are not to be controlled
@Ian-Clowes
Contributor Author

This new capability probably deserves to be documented.

For more details on what the capability does and why (including the relevant comment from @t-mat), one could use the man page template at cli/xxhsum.1.md to provide them.

Added some new material there

.gitignore Outdated
@@ -47,12 +47,14 @@ tmp*
tests/*.unicode
tests/unicode_test*
*.txt
!CMakeLists.txt
Owner

There may be a topic about improving the .gitignore content,
but I would prefer to keep it separate from this topic (i.e. a different PR),
so that we can then focus the review on why the .gitignore would benefit an update.

Contributor Author

Shall I just revert and repush, or do some rebase-type thing (which is where I have previously created some unlikely tangles due to my newness with git)?

Owner

In theory, one should go back to the commit that made the unwanted change,
and either remove it from the chain if that's all it does,
or modify it to remove the change, if it's part of a larger commit, and rebase everything on top of it.

But rewriting the commit chain can be complex and error prone.
So at this point, it's simpler to just erase the unwanted changes manually and make that a new commit.

What matters is that, as a whole, this multi-commit PR doesn't modify .gitignore (and we can have a separate PR for that).
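
One way to do the "new commit" route is sketched below in a throwaway repository. The branch name `feature`, the file `feature.c`, and the commit messages are made up for illustration; only the base branch name `dev` comes from this PR. In a real checkout the base may live under a remote name instead, which is an assumption to verify locally.

```shell
#!/bin/sh
# Illustrative only: a throwaway repository with made-up branch and file names.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/dev
git config user.name 'Demo User'
git config user.email 'demo@example.com'
git config commit.gpgsign false

printf '*.txt\n' > .gitignore
git add .gitignore
git commit -q -m 'Base .gitignore'

# Feature branch whose commit mixes a wanted and an unwanted change.
git checkout -q -b feature
printf '!CMakeLists.txt\n' >> .gitignore
printf 'feature work\n' > feature.c
git add .gitignore feature.c
git commit -q -m 'Feature plus stray .gitignore edit'

# Undo only the .gitignore edit by restoring the base branch's copy
# and recording that as a new commit; no history is rewritten.
git checkout dev -- .gitignore
git commit -q -m 'Restore .gitignore to match dev'

# List the files that still differ between dev and feature.
changed=$(git diff --name-only dev feature)
echo "$changed"
```

After the restoring commit, the branch as a whole no longer differs from `dev` in `.gitignore`, even though the earlier commit that touched it is still in the history.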

@Cyan4973
Owner

Cyan4973 commented Oct 9, 2024

OK, almost there!

Beyond the minor change requested for this PR (.gitignore),
I am curious about the name of the capability, aka -g or --generate.

I'm wondering if this is a good name to describe this capability.
--generate makes me think about some kind of data generator, so I don't intuitively think about a list of files to read.

So, what about --filelist, for example?
Or --files-from (I think this one is used in tar or rsync)?

@Ian-Clowes
Contributor Author

Ian-Clowes commented Oct 9, 2024

I think --files-from, possibly without a short form?

man tar has...

-T, --files-from=FILE
              Get names to extract or create from FILE.

While xargs uses:

 -a file, --arg-file=file
              Read items from file instead of standard input.

Casting vote to rsync:

--files-from=FILE       read list of source-file names from FILE

@Cyan4973 Cyan4973 self-assigned this Oct 9, 2024
- Use --files-from instead of --generate for new feature
- Remove the need for --quiet when generating hashes from file list
@gcflymoto

FYI ... zstd and lz4 use

--filelist LIST Read a list of files to operate on from LIST.

@Cyan4973
Owner

That's a good point, @gcflymoto.

I guess we could have both.

4 participants