Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EES-5017, EES-5016 Add data set GET query endpoint #4709

Merged
merged 12 commits into from
Apr 11, 2024
Merged

EES-5017, EES-5016 Add data set GET query endpoint #4709

merged 12 commits into from
Apr 11, 2024

Conversation

ntsim
Copy link
Collaborator

@ntsim ntsim commented Mar 27, 2024

This PR adds the data set GET query endpoint - /api/v1/data-sets/{dataSetId}/query. This provides functionality to let users make data set queries using query parameters e.g. ?indicators=headcount&filters.in=kaNhs,zvUFQ&sorts=geographicLevel|Asc

Note that this doesn't provide test coverage for most of this functionality. This will be covered in EES-5015.

Query parsing

Parsing data set queries to this endpoint roughly proceeds like so:

  1. Validate the GET query string (as DataSetGetQueryRequest)
  2. Convert to DataSetQueryRequest (will be used in POST query variant) in DataSetQueryService
  3. Parse and validate query criteria using DataSetQueryParser, producing a SQL WHERE fragment
  4. Parse and validate indicators from data set query
  5. Parse and validate sorts and their respective column orderings from the data set query
  6. Execute data set query using all requisite parts (where, columns and sorts)
  7. Return the paginated results

We've now implemented the majority of the base query criteria parsing functionality to handle the various facet types and their comparators. This doesn't include handling of more complex operators like and, or and not. This will be implemented in EES-4722.

Dapper and SQL builders

As there is no ORM support for DuckDB, the querying functionality has been implemented using a combination of Dapper and InterpolatedSql.

Dapper itself is very simplistic and only works with raw SQL strings out of the box. This isn't particularly safe in the context of SQL injection, so we've complemented it with InterpolatedSql which provides protection using its SqlBuilder API e.g.

var myParam = "test";
var builder = new SqlBuilder($"SELECT * FROM something WHERE field = {myParam}");

var sql = builder.Build(); 
sql.Sql; // Outputs: "SELECT * FROM something WHERE field = @p0"
sql.SqlParameters; // Outputs: Single parameter with value "test"

For real usages of this, check out the various repositories that have been created. We use a modified variant of InterpolatedSql's Dapper integration to make it a bit nicer to build and execute queries.

DuckDB issues

Unfortunately, there were some issues with DuckDB that we've had to work around, primarily around named parameters. Dapper only really works with named parameters, but this seems to cause issues with the DuckDB.NET library that we're using.

When using named parameters, DuckDB.NET would consistently fail to throw exceptions around parameter binding (need to create an issue for this), or memory access violations when debugging.

I'm communicating with the library maintainer to see if there's anything we can do here. For the time-being, I've been able to implement workarounds by wrapping the DuckDB.NET database connection code and forcing the use of positional auto-incrementing parameters (i.e. ? parameters) in generated SQL.

MiniProfiler

To help with debugging and performance, we've added MiniProfiler which lets us profile our code. This is particularly useful for data set queries, where it can help identify performance bottlenecks.

MiniProfiler is enabled via the MiniProfiler.Enabled app setting, so it can be disabled or enabled on a per-environment basis.

Related changes

  • Fixes GeographicLevels in the data set meta summary not being populated correctly from a data set version. This now uses the GeographicLevelMeta instead of LocationMetas to populate it.

Other changes

  • Improve error details for location, time period and sort string validators by now showing the parsed fields.
  • Fixed null pointer exception being thrown when gathering message placeholders from ValidationFailure in FluentValidationActionFilter

@ntsim ntsim changed the title EES-5017 Add data set GET query endpoint EES-5017, EES-5016 Add data set GET query endpoint Mar 27, 2024
@ntsim ntsim force-pushed the ees-5017 branch 20 times, most recently from 9407d80 to ec53169 Compare April 3, 2024 11:36
@@ -0,0 +1,9 @@
namespace GovUk.Education.ExploreEducationStatistics.Public.Data.Api.ViewModels;

public record ErrorViewModel : Common.ViewModels.ErrorViewModel
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand the need for this record, rather than just using the base?

Copy link
Collaborator Author

@ntsim ntsim Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some weaknesses with the common ErrorViewModel in that its fields are essentially all optional.

This is problematic as it means as this means that the view model is potentially going to be wrong in the generated docs.

There's also some other issues with the current generated docs in terms of required properties. Instead of doing this here, I'm going to try and tackle this in EES-5057

public record ErrorViewModel
{
/// <summary>
/// The error message.
/// </summary>
public string Message { get; init; } = string.Empty;
public virtual string Message { get; init; } = string.Empty;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand the need for this change

Copy link
Collaborator Author

@ntsim ntsim Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per other comment - this is due to issues with the current view model. We ideally want to be able to extend this to create a more correct view model for the public API.

For now, I'm leaving this to implement it in EES-5057 instead.

@@ -29,6 +31,7 @@ public record DataSetQueryResultViewModel
/// <summary>
/// The geographic level relevant to this result's data.
/// </summary>
[JsonConverter(typeof(EnumToEnumValueJsonConverter<GeographicLevel>))]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this not work out of the box with the custom schema filter we have for GeographicLevel?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't for Swagger. It's for JSON serialisation 😉

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was definitely a bit tired when reviewing these bits of the code.....

@ntsim ntsim force-pushed the ees-5017 branch 3 times, most recently from 18866c1 to 0108572 Compare April 5, 2024 13:22
@ntsim ntsim force-pushed the ees-5017 branch 7 times, most recently from 68dbf4d to 5c5c709 Compare April 8, 2024 17:27
This includes upgrading DuckDB.NET to 0.10.1.2 and introduces:

- Dapper - for executing queries and mapping results
- InterpolatedSql - for writing SQL queries and mitigating SQL injection
- MiniProfiler - for performance profiling
…rrectly

This changes the `GeographicLevels` to be correctly populated from the
`GeographicLevelMeta` instead of the `LocationMetas`.

As part of this, we've also updated `DataSetVersion` generator
extensions to set the geographic levels meta based on the locations meta
in the default cases. This is sensible default to avoid having to
explicitly set the geographic levels every time.
…e consistent

This updates the location and time period string validator errors so
that they're consistent with recent changes to the sort string format
(and its error output).
@ntsim ntsim force-pushed the ees-5017 branch 5 times, most recently from 90ae671 to b8b6484 Compare April 11, 2024 00:10
{
var command = duckDbConnection.SqlBuilder(
$"""
SELECT DISTINCT {FiltersTable.Cols.ColumnName:raw}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jack-hive penny for your thoughts as I've gone back and forth on this in my head a few times, but what do you think about this column name?

I'm kinda considering renaming it to FilterId (filter_id in the table itself). However, this is a touch confusing given the table itself is called filters.parquet and we have other id columns too.

Like I said, a bit on the fence about this. It might be fine as-is, but maybe not. Any thoughts?

Copy link
Collaborator

@jack-hive jack-hive Apr 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm......

I'm a bit on the fence too.

I think FilterId makes more sense in terms of the code - but is less intuitive in terms of the domain context.

Maybe CsvColumnName might be more explicit and intuitive than ColumnName. I must admit, it wasn't immediately obvious to me what we meant by ColumnName when I was first reviewing the code. So might be best we change it.

I think I'd either choose CsvColumnName or FilterId

This change changes data seeding so that null meta IDs in the data
Parquet table are coalesced into `0`. This avoids unexpected behaviour
in queries where rows with null meta IDs will not be included in the
results correctly.
This changes the location and filter Parquet tables to `filter_options`
and `location_options` respectively.

As part of this, we've also renamed `filter_options.column_name` to
`filter_options.filter_id` to make it more obvious what this is.
@ntsim ntsim merged commit 8760887 into dev Apr 11, 2024
7 checks passed
@ntsim ntsim deleted the ees-5017 branch April 11, 2024 13:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants