Add counter telemetry #2741

Open
wants to merge 1 commit into
base: main

me-diru
Contributor

@me-diru me-diru commented Aug 21, 2024

Fixes #2564

I think this captures the metrics of LLM and Key-Value store

@me-diru
Contributor Author

me-diru commented Aug 21, 2024

Captured the LLM model and prompt information, and the KV store's get and set key information.

Not sure if the key is accessible when querying on Prometheus/Grafana
Screenshot from 2024-08-21 11-50-55

cc: @calebschoepp

@@ -92,6 +92,9 @@ impl key_value::HostStore for KeyValueDispatch {
store: Resource<key_value::Store>,
key: String,
) -> Result<Result<Option<Vec<u8>>, Error>> {
// Log key value host component get feature
Collaborator

I like the intention behind these comments, but I think they're needlessly verbose. Reading the spin_telemetry::counter call directly makes it pretty clear what it's doing.

@@ -92,6 +92,9 @@ impl key_value::HostStore for KeyValueDispatch {
store: Resource<key_value::Store>,
key: String,
) -> Result<Result<Option<Vec<u8>>, Error>> {
// Log key value host component get feature
spin_telemetry::counter!(spin.key_value_get = 1, key = key);
Collaborator

These should all probably be monotonic_counters because they're monotonically increasing.
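
To illustrate the distinction behind this suggestion, here is a minimal, self-contained sketch. It does not use spin_telemetry at all: `MonotonicCounter` and `UpDownCounter` are hypothetical stand-ins built on std atomics, and they only show why a count of operations performed suits a monotonic counter (it is only ever added to), while a value that can go down needs a different instrument.

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};

/// Hypothetical stand-in for a monotonic counter: the only operation is
/// adding an unsigned amount, so the total can never decrease.
struct MonotonicCounter(AtomicU64);

impl MonotonicCounter {
    const fn new() -> Self {
        Self(AtomicU64::new(0))
    }

    /// Adds `n` and returns the new total.
    fn add(&self, n: u64) -> u64 {
        self.0.fetch_add(n, Ordering::Relaxed) + n
    }

    fn total(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Hypothetical stand-in for an up-down counter: deltas may be negative,
/// which suits values such as "connections currently open".
struct UpDownCounter(AtomicI64);

impl UpDownCounter {
    const fn new() -> Self {
        Self(AtomicI64::new(0))
    }

    /// Applies a signed delta and returns the new value.
    fn add(&self, delta: i64) -> i64 {
        self.0.fetch_add(delta, Ordering::Relaxed) + delta
    }

    fn value(&self) -> i64 {
        self.0.load(Ordering::Relaxed)
    }
}

fn main() {
    // Each key-value `get` call adds 1; nothing ever subtracts.
    let key_value_get = MonotonicCounter::new();
    key_value_get.add(1);
    key_value_get.add(1);
    println!("key_value_get total: {}", key_value_get.total());

    // A gauge-like quantity goes up and down, so it is not monotonic.
    let open_stores = UpDownCounter::new();
    open_stores.add(1);
    open_stores.add(-1);
    println!("open stores: {}", open_stores.value());
}
```

Because the monotonic total never decreases, a backend can safely derive rates from it, which is presumably why it fits "number of get/set calls" better than a plain counter.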

@@ -102,6 +105,9 @@ impl key_value::HostStore for KeyValueDispatch {
key: String,
value: Vec<u8>,
) -> Result<Result<(), Error>> {
// Log key value host component set feature
spin_telemetry::counter!(spin.key_value_set = 1, key = key);
Collaborator

It occurs to me that you're adding all this to the host components, which we're going to be deleting very soon. This should all probably be added to the factors instead, which means this would have to be a PR against the factors branch.

spin_telemetry::counter!(
spin.llm_infer = 1,
model_name = model,
prompt_given = prompt
Collaborator

I don't think we want this as an attribute. The prompt could be very large, and it is also likely to be high cardinality, which is not a good fit for a metric attribute.
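
A rough sketch of the cardinality concern, under the usual assumption that a metrics backend tracks one time series per distinct combination of metric name and attribute values. The `series_count` helper and the sample values below are hypothetical and are not part of the PR.

```rust
use std::collections::HashSet;

/// Counts how many distinct time series a single attribute would produce
/// for one metric: one series per unique (metric, attribute value) pair.
fn series_count(metric: &str, attribute_values: &[&str]) -> usize {
    let distinct: HashSet<(&str, &str)> = attribute_values
        .iter()
        .map(|value| (metric, *value))
        .collect();
    distinct.len()
}

fn main() {
    // Illustrative model names: a small, bounded set of values.
    let models = ["llama2-chat", "llama2-chat", "codellama-instruct"];

    // Illustrative prompts: effectively unbounded, nearly every call differs.
    let prompts = [
        "Summarize this article about Rust",
        "Translate 'hello' into French",
        "Write a haiku about telemetry",
    ];

    println!("series by model:  {}", series_count("spin.llm_infer", &models));
    println!("series by prompt: {}", series_count("spin.llm_infer", &prompts));
}
```

The model attribute stays bounded no matter how many requests arrive, whereas the prompt attribute grows with almost every request, and each value may itself be large.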

@me-diru me-diru force-pushed the OTel-metric branch 2 times, most recently from cd4491a to ea08fb6 Compare August 22, 2024 21:40
@itowlson
Contributor

@me-diru is this intended to land in the factors branch, or should it be rebased off main? Currently, merging this would merge a whole bunch of unrelated factors stuff too.

@calebschoepp
Collaborator

@me-diru is this intended to land in the factors branch, or should it be rebased off main? Currently, merging this would merge a whole bunch of unrelated factors stuff too.

This should land on factors (or main once factors merges in there).

@me-diru me-diru changed the base branch from main to factors August 22, 2024 22:07
@me-diru
Contributor Author

me-diru commented Aug 22, 2024

I changed the base to factors, and it should only reflect my code changes. Thanks for checking in @itowlson !

@me-diru
Contributor Author

me-diru commented Aug 22, 2024

@calebschoepp
I am not sure how to test the telemetry metrics in factors. For llm-compute, I think I am capturing it in the right place. However, when I run the build using

4318/v1/traces OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://localhost:4318/v1/metrics ../../../spin/target/debug/spin up --runtime-config-file ../runtime-config.toml

it gives me an error of

Error: unused runtime config key(s): llm_compute

I tried to run the factor_test.rs for the factor-llm, but it gave me another error on

error[E0277]: the trait bound `url::Url: Deserialize<'_>` is not satisfied
  --> crates/factor-llm/src/spin.rs:91:17
   |
91 | #[derive(Debug, serde::Deserialize)]
   |                 ^^^^^^^^^^^^^^^^^^ the trait `Deserialize<'_>` is not implemented for `url::Url`
   |

Which I think satisfies

On the other hand, for factor-key-value, I think it is utilizing the spin-key-value crate to do the set and get functions? So I guess the monotonic counters for key-value should suffice.

Would be great to have your input

@calebschoepp
Collaborator

@me-diru I still see 3 commits that aren't yours that you'll want to take out of this diff.

@itowlson
Contributor

itowlson commented Aug 22, 2024

@me-diru

the trait Deserialize<'_> is not implemented for url::Url

You might need to enable the serde feature for the url crate. https://docs.rs/url/latest/url/#feature-serde

(This is a common pattern in Rust utility crates, where it's useful for people to be able to serialise the types, but they don't want to force a heavyweight serde dependency on people who just want to sling URLs or times or whatever.)

@me-diru
Contributor Author

me-diru commented Aug 22, 2024

You might need to enable the serde feature for the url crate. https://docs.rs/url/latest/url/#feature-serde

That did the trick, tests passed :D

I am just curious how the tests passed before, though 😅 In the current case of factor-llm, we don't have to deserialize the Url?

@itowlson
Contributor

@me-diru It will work if any crate in the build turns the feature on. This can cause surprises when you use your crate in a slightly different build context: the other crate that happened to make things work is no longer there, and boom, your code stops compiling.

This is a significant pain point for features but I gather there is not much that can be done.
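
For reference, a sketch of how a crate can declare the feature itself instead of relying on another crate in the build to enable it. The `serde` feature of the `url` crate is the one linked above; the version requirement shown here is an assumption and should match whatever the workspace already uses.

```toml
[dependencies]
# Enable the optional `serde` feature explicitly, so deriving
# `serde::Deserialize` on a struct containing `url::Url` does not depend
# on some other crate in the build happening to turn the feature on.
# The version requirement is illustrative.
url = { version = "2", features = ["serde"] }
```

Declaring the feature in the crate that needs it keeps the crate compiling when it is built on its own, for example when running only its tests.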

@calebschoepp
Collaborator

Are you still having errors running it @me-diru? I would need to see the runtime config you're using to help more.

@me-diru
Contributor Author

me-diru commented Aug 23, 2024

Are you still having errors running it @me-diru? I would need to see the runtime config you're using to help more.

Yes, it's still happening. When I run the same command with the latest spin cli release(2.7), it works fine.

anonymized runtime-config file

[llm_compute]
type = "remote_http"
url = "<URL>"
auth_token = "<AUTH-TOKEN>"

I checked with factors branch spin binary and the same error occurs. I don't think the metrics code is causing this one.

Maybe @lann could give more insight

@lann
Collaborator

lann commented Aug 26, 2024

☝️ This hopefully fixed it.

@me-diru
Contributor Author

me-diru commented Sep 5, 2024

@calebschoepp I think it's now capturing the metrics in the new factors code!
image

@calebschoepp
Collaborator

Sweet, @me-diru is this ready for a final review?

@me-diru me-diru changed the title from "Add counter telemetry(WIP)" to "Add counter telemetry" Sep 21, 2024
@me-diru
Contributor Author

me-diru commented Sep 21, 2024

@calebschoepp Yes! Please review :D

Collaborator

@calebschoepp calebschoepp left a comment

Good work. LGTM!

Could someone from @fermyon/spin-core-maintainers enable CI and merge this?

@@ -98,6 +98,8 @@ impl key_value::HostStore for KeyValueDispatch {
store: Resource<key_value::Store>,
key: String,
) -> Result<Result<Option<Vec<u8>>, Error>> {
spin_telemetry::monotonic_counter!(spin.key_value_get = 1, key = key);
Collaborator

I think we might be emitting them in traces already, but I'm not sure it's a good idea to be sending KV store keys by default.

Collaborator

Danielle opened a somewhat related issue for being more careful with what we're emitting in telemetry. https://github.com/orgs/fermyon/projects/62/views/1?pane=issue&itemId=79180245

If we wanted to get fancy, we could emit these potentially sensitive values only when some sort of local debug flag is set. But for simplicity's sake, we're probably best off just not emitting them.
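
A minimal sketch of what the "fancy" option could look like, purely as an illustration. Nothing here exists in Spin: the `EXAMPLE_EMIT_SENSITIVE_ATTRS` environment variable, the `store` attribute, and both helper functions are hypothetical names made up for this example.

```rust
/// Hypothetical opt-in check: sensitive attributes are emitted only when
/// the flag is explicitly set to "1" or "true".
fn sensitive_telemetry_enabled(flag: Option<&str>) -> bool {
    matches!(flag, Some("1") | Some("true"))
}

/// Builds the attribute list for a key-value metric. The key is attached
/// only when the caller has opted in; otherwise it is left out entirely.
fn kv_metric_attributes(
    store_label: &str,
    key: &str,
    include_sensitive: bool,
) -> Vec<(String, String)> {
    let mut attrs = vec![("store".to_string(), store_label.to_string())];
    if include_sensitive {
        attrs.push(("key".to_string(), key.to_string()));
    }
    attrs
}

fn main() {
    // Hypothetical environment variable name, used only for this sketch.
    let flag = std::env::var("EXAMPLE_EMIT_SENSITIVE_ATTRS").ok();
    let include = sensitive_telemetry_enabled(flag.as_deref());

    let attrs = kv_metric_attributes("default", "user:42", include);
    println!("attributes to emit: {:?}", attrs);
}
```

Defaulting to "off" means a forgotten or misspelled flag fails safe, though simply not emitting the key, as suggested, avoids the extra configuration surface altogether.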

Collaborator

Yeah, if we don't have a specific need for it, let's omit it until we can come up with a better approach.

@vdice
Member

vdice commented Sep 23, 2024

Could someone from @fermyon/spin-core-maintainers enable CI and merge this?

CI didn't run as the base branch is still factors. Perhaps rebase this on main? Then CI should run.

@calebschoepp
Collaborator

Good catch, @me-diru you'll have to point this onto main.

@me-diru
Contributor Author

me-diru commented Sep 23, 2024

The plan was to merge it to factors branch capturing metrics in the new codebase. @calebschoepp I wonder if the main branch was updated to point the PR there? 👀

I see all commits from factors branch are merged to main. I will do that, thanks! :D

@me-diru me-diru changed the base branch from factors to main September 23, 2024 23:55
@me-diru me-diru force-pushed the OTel-metric branch 2 times, most recently from 88bc510 to 8b03d12 Compare September 24, 2024 00:24
@me-diru
Contributor Author

me-diru commented Sep 24, 2024

I have excluded capturing key details in the metrics. However, I was still getting some weird linting issues. It would be great to get more context on them!

@calebschoepp
Collaborator

I have excluded capturing key details in the metrics. However, I was still getting some weird linting issues. It would be great to get more context on them!

Seems like there aren't any linting errors in CI. Are you just experiencing those linting errors locally? What are they?

@me-diru
Contributor Author

me-diru commented Sep 25, 2024

Seems like there aren't any linting errors in CI. Are you just experiencing those linting errors locally? What are they?

I think so. This is what I got.

Screenshot from 2024-09-23 17-15-45

Interesting to see it not being reflected in CI though

@itowlson
Contributor

@me-diru CI is currently locked to Rust 1.79. Is it possible you are on a more recent version of Rust with shiny new lints?

@rylev
Collaborator

rylev commented Sep 26, 2024

FWIW the lint warnings there have been fixed in #2866 so a rebase will get rid of those.

@me-diru
Contributor Author

me-diru commented Oct 2, 2024

@itowlson I am using Rust 1.81! That makes sense. Thanks for chiming in!

It's good to go now. I will rebase and get it merged @rylev

@me-diru
Contributor Author

me-diru commented Oct 2, 2024

Interesting to see most of my merge conflicts were version downgrades

[[package]]
name = "wit-parser"
version = "0.217.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
Collaborator

Looks like you need to fix this lockfile for linting to pass.

Signed-off-by: Rohit Dandamudi <rohitdandamudi.1100@gmail.com>
@me-diru
Contributor Author

me-diru commented Oct 3, 2024

There was a timeout error in all-integration-tests under Zig Setup. Rerunning it just to make sure.

Run goto-bus-stop/setup-zig@v2
AggregateError [ETIMEDOUT]: 
    at internalConnectMultiple (node:net:1117:18)
    at internalConnectMultiple (node:net:1185:5)
    at Timeout.internalConnectMultipleTimeout (node:net:1711:5)
    at listOnTimeout (node:internal/timers:575:11)
    at process.processTimers (node:internal/timers:514:7)

Successfully merging this pull request may close these issues.

Record OTel metrics in host components
6 participants