Methodology for normalization with max value division #180

Open
andtsouch opened this issue Nov 18, 2022 · 6 comments

@andtsouch

Hi! I am creating this issue to discuss the way the tidyCovariateData function normalizes the data when normalize = TRUE.

From what I can see in the code, the data are normalized by dividing by the maximum value of each covariate:

https://github.com/OHDSI/FeatureExtraction/blob/main/R/Normalization.R (line 162)

if (normalize) {
  ParallelLogger::logInfo("Normalizing covariates")
  newCovariates <- newCovariates %>%
    inner_join(covariateData$maxValuePerCovariateId, by = "covariateId") %>%
    mutate(covariateValue = .data$covariateValue / .data$maxValue) %>%
    select(-.data$maxValue)
  metaData$normFactors <- covariateData$maxValuePerCovariateId %>%
    collect()
}
newCovariateData$covariates <- newCovariates

I would like to discuss the choice of this normalization method, since it is not really recommended in the literature. Have you seen that it has some specific advantages for machine learning models? As far as I am aware, methods like min-max scaling or z-scoring are suggested most of the time. Is that perhaps a feature for a potential future update, or would you recommend some way to provide a custom normalization function in the current version of the package?
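For concreteness, a toy illustration (my own, not package code) of the three scalings under discussion:

x <- c(1, 2, 5, 10)

x / max(x)                        # max-value division: 0.1 0.2 0.5 1.0
(x - min(x)) / (max(x) - min(x))  # min-max scaling:    0.000 0.111 0.444 1.000
(x - mean(x)) / sd(x)             # z-score: mean 0, sd 1, values may be negative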

Thank you in advance for your consideration!

@anthonysena
Collaborator

Hi @andtsouch - this is a good question. Just to call out the block of code you referenced:

if (normalize) {
  ParallelLogger::logInfo("Normalizing covariates")
  newCovariates <- newCovariates %>%
    inner_join(covariateData$maxValuePerCovariateId, by = "covariateId") %>%
    mutate(covariateValue = .data$covariateValue / .data$maxValue) %>%
    select(-.data$maxValue)
  metaData$normFactors <- covariateData$maxValuePerCovariateId %>%
    collect()
}
newCovariateData$covariates <- newCovariates

Reading this code, it looks like the intent is to make sure the covariateValue lies within [0, 1].

I would like to discuss the choice of this normalization method, since it is not really recommended in the literature. Have you seen that it has some specific advantages for machine learning models?

I'm not sure I have a good answer to this so I'm going to tag @schuemie and @jreps to ask them to weigh in on this approach for normalization.

@jreps
Contributor

jreps commented Jan 10, 2023

I agree - dividing by the max value is not the type of normalization commonly used. I used to subtract the min value and divide by (max - min). We could add an option for the type of normalization that defaults to the old way (so things stay backwards compatible) but lets users pick a different approach?
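A rough sketch (mine, not an actual proposal from the package) of what that option could look like, mirroring the dplyr style of the existing block; the minMaxPerCovariateId table is hypothetical, as only maxValuePerCovariateId exists today:

if (normalize) {
  ParallelLogger::logInfo("Normalizing covariates (min-max)")
  newCovariates <- newCovariates %>%
    inner_join(covariateData$minMaxPerCovariateId, by = "covariateId") %>%
    mutate(covariateValue = (.data$covariateValue - .data$minValue) /
             (.data$maxValue - .data$minValue)) %>%
    select(-.data$minValue, -.data$maxValue)
  # When minValue is 0, this reduces to the current max-value division.
  metaData$normFactors <- covariateData$minMaxPerCovariateId %>%
    collect()
}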

@schuemie
Member

The choice for this type of normalization was mainly a practical one: if, for example, you aim to make the mean equal to 0 and the SD equal to 1 (a common form of normalization), the covariates will no longer be sparse; everyone will have a non-zero value for all covariates, which would blow up memory.

Jenna's proposal would work. For almost all variables (where the min value equals 0) it would be identical to the current approach.
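To make the sparsity point concrete, a toy example (mine, not from the package):

# A covariate that is 0 for most subjects stays sparse under max-value
# division, but z-scoring gives every subject a non-zero value:
x <- c(0, 0, 0, 0, 0, 0, 0, 0, 2, 10)

x / max(x)             # zeros stay zero, so the sparse representation survives
(x - mean(x)) / sd(x)  # every entry becomes non-zero, so sparsity is lost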

@egillax

egillax commented Jan 11, 2023

Why would you do anything to the binary covariates? This type of normalization is normally only done on continuous covariates.

tidyCovariateData already computes which covariates are binary, so it should be easy to save some CPU cycles and only normalize the non-binary features.
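Something along these lines, perhaps (a hedged sketch of mine; binaryCovariateIds is a hypothetical one-column table of the covariate IDs that tidyCovariateData identifies as binary):

continuousCovariates <- newCovariates %>%
  anti_join(binaryCovariateIds, by = "covariateId") %>%
  inner_join(covariateData$maxValuePerCovariateId, by = "covariateId") %>%
  mutate(covariateValue = .data$covariateValue / .data$maxValue) %>%
  select(-.data$maxValue)
# Binary covariates pass through untouched:
binaryCovariates <- newCovariates %>%
  inner_join(binaryCovariateIds, by = "covariateId")
newCovariates <- union_all(continuousCovariates, binaryCovariates)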

But back to @andtsouch's question: I agree that min-max scaling and z-scoring are the most common in my experience, but I'm not aware of any literature showing the superiority of one method over the other. There are, though, some papers trying to answer this question, like this one.

@andtsouch, since you reference the literature, I'd be very interested to know if there is a specific paper you had in mind when you say max-abs scaling is not recommended.

@schuemie
Member

Why would you do anything to the binary covariates?

Many machine learning algorithms make no distinction between binary and continuous covariates. For example, for LASSO you would ideally set the mean to 0 and the SD to 1 for all covariates, so the (single) hyperparameter scales well across all of them.

But if all you want to do is make sure all covariates are in the 0-1 range, then yes, you don't need to touch the binary covariates.

@andtsouch
Author

Hi @anthonysena, @egillax, @jreps, @schuemie,

Thank you for the interesting discussion.

First of all, I also agree that normalization should apply only to continuous variables, e.g. measurement values, which I believe is already what the package is doing.

Thank you for the clarification @schuemie; I see how max-scaling helps with sparse data. I guess the answer is that there is no one-size-fits-all approach to normalization: ultimately, the choice of method should be guided by the characteristics of the data and the goals of the analysis. For example, min-max scaling would be a better fit for data that is normally distributed, has a small range, and includes negative values. Ideally, I would explore the data (e.g. the measurement covariates), experiment with different normalization methods, and test the effect on model performance to determine the most appropriate approach.

Therefore, I think the best solution is to be able to customize the choice of normalization where possible. After some related work, I think this is actually possible with custom code via the featureEngineeringSettings mechanism of the PLP package, since that way one can manipulate the train and test data separately. I guess @jreps might have a better suggestion from the available tools.
