Methodology for normalization with max value division #180

Open
andtsouch opened this issue Nov 18, 2022 · 6 comments

@andtsouch

Hi! I am creating this issue to discuss the way the tidyCovariateData function normalizes the data when normalize = TRUE.

From what I can see in the code, the data are normalized by dividing by the maximum value of each covariate:

https://github.com/OHDSI/FeatureExtraction/blob/main/R/Normalization.R (line 162)

if (normalize) {
  ParallelLogger::logInfo("Normalizing covariates")
  newCovariates <- newCovariates %>%
    inner_join(covariateData$maxValuePerCovariateId, by = "covariateId") %>%
    mutate(covariateValue = .data$covariateValue / .data$maxValue) %>%
    select(-.data$maxValue)
  metaData$normFactors <- covariateData$maxValuePerCovariateId %>%
    collect()
}
newCovariateData$covariates <- newCovariates

I would like to discuss the choice of this normalization method, since it is not really recommended in the literature. Have you seen that it has some specific advantages for machine learning models? As far as I am aware, methods like min-max scaling or z-scoring are suggested most of the time. Is that perhaps a feature for a potential future update, or would you recommend some way to provide a custom normalization function in the current version of the package?
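For concreteness, a toy illustration (my own, not package code) of the three scalings under discussion:

x <- c(1, 2, 5, 10)

x / max(x)                        # max-value division: 0.1 0.2 0.5 1.0
(x - min(x)) / (max(x) - min(x))  # min-max scaling:    0.000 0.111 0.444 1.000
(x - mean(x)) / sd(x)             # z-score: mean 0, sd 1, values may be negative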

Thank you in advance for your consideration!

@anthonysena
Collaborator

Hi @andtsouch - this is a good question. Just to call out the block of code you referenced:

if (normalize) {
  ParallelLogger::logInfo("Normalizing covariates")
  newCovariates <- newCovariates %>%
    inner_join(covariateData$maxValuePerCovariateId, by = "covariateId") %>%
    mutate(covariateValue = .data$covariateValue / .data$maxValue) %>%
    select(-.data$maxValue)
  metaData$normFactors <- covariateData$maxValuePerCovariateId %>%
    collect()
}
newCovariateData$covariates <- newCovariates

Reading this code, it looks like the intent is to make sure the covariateValue lies within [0, 1].

I would like to discuss the choice of this normalization method, since it is not really recommended in the literature. Have you seen that it has some specific advantages for machine learning models?

I'm not sure I have a good answer to this so I'm going to tag @schuemie and @jreps to ask them to weigh in on this approach for normalization.

@jreps
Contributor

jreps commented Jan 10, 2023

I agree - dividing by the max value is not the type of normalization commonly used. I used to subtract the min value and divide by (max - min). We could add an option for the type of normalization that defaults to the old way (so things stay backwards compatible) but lets users pick a different approach?
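A rough sketch (mine, not an actual proposal from the package) of what that option could look like, mirroring the dplyr style of the existing block; the minMaxPerCovariateId table is hypothetical, as only maxValuePerCovariateId exists today:

if (normalize) {
  ParallelLogger::logInfo("Normalizing covariates (min-max)")
  newCovariates <- newCovariates %>%
    inner_join(covariateData$minMaxPerCovariateId, by = "covariateId") %>%
    mutate(covariateValue = (.data$covariateValue - .data$minValue) /
             (.data$maxValue - .data$minValue)) %>%
    select(-.data$minValue, -.data$maxValue)
  # When minValue is 0, this reduces to the current max-value division.
  metaData$normFactors <- covariateData$minMaxPerCovariateId %>%
    collect()
}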

@schuemie
Member

The choice for this type of normalization was mainly a practical one: if, for example, you aim to make the mean equal to 0 and the SD equal to 1 (a common form of normalization), the covariates will no longer be sparse; everyone will have a non-zero value for all covariates, which would blow up memory.

Jenna's proposal would work. For almost all variables (where the min value equals 0) it would be identical to the current approach.
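To make the sparsity point concrete, a toy example (mine, not from the package):

# A covariate that is 0 for most subjects stays sparse under max-value
# division, but z-scoring gives every subject a non-zero value:
x <- c(0, 0, 0, 0, 0, 0, 0, 0, 2, 10)

x / max(x)             # zeros stay zero, so the sparse representation survives
(x - mean(x)) / sd(x)  # every entry becomes non-zero, so sparsity is lost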

@egillax

egillax commented Jan 11, 2023

Why would you do anything to the binary covariates? This type of normalization is normally only done on continuous covariates.

tidyCovariateData already computes which covariates are binary, so it should be easy to save some CPU cycles and only normalize the non-binary features.
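Something along these lines, perhaps (a hedged sketch of mine; binaryCovariateIds is a hypothetical one-column table of the covariate IDs that tidyCovariateData identifies as binary):

continuousCovariates <- newCovariates %>%
  anti_join(binaryCovariateIds, by = "covariateId") %>%
  inner_join(covariateData$maxValuePerCovariateId, by = "covariateId") %>%
  mutate(covariateValue = .data$covariateValue / .data$maxValue) %>%
  select(-.data$maxValue)
# Binary covariates pass through untouched:
binaryCovariates <- newCovariates %>%
  inner_join(binaryCovariateIds, by = "covariateId")
newCovariates <- union_all(continuousCovariates, binaryCovariates)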

But back to @andtsouch's question: I agree that min-max scaling and z-scoring are the most common in my experience, but I'm not aware of any literature showing the superiority of one method over the other. There are, though, some papers trying to answer this question, like this one.

@andtsouch, since you reference the literature, I'd be very interested to know if there is a specific paper you had in mind when you say max-abs scaling is not recommended.

@schuemie
Member

Why would you do anything to the binary covariates?

Many machine learning algorithms make no distinction between binary and continuous covariates. For example, for LASSO you would ideally set the mean to 0 and the SD to 1 for all covariates, so the (single) hyperparameter scales well across all of them.

But if all you want to do is make sure all covariates are in the 0-1 range, then yes, you don't need to touch the binary covariates.

@andtsouch
Author

Hi @anthonysena, @egillax, @jreps, @schuemie,

Thank you for the interesting discussion.

First of all, I also agree that normalization should apply only to continuous variables, e.g. measurement values, which I believe is already what the package is doing.

Thank you for the clarification @schuemie; I see how max-scaling helps with sparse data. I guess the answer is that there is no one-size-fits-all approach to normalization: ultimately, the choice of method should be guided by the characteristics of the data and the goals of the analysis. For example, min-max scaling would be a better fit for data that is normally distributed, has a small range, and includes negative values. Ideally, I would explore the data (e.g. the measurement covariates), experiment with different normalization methods, and test the effect on model performance to determine the most appropriate approach.

Therefore, I think the best solution is to be able to customize the choice of normalization where possible. After some related work, I think this is actually possible with custom code via the featureEngineeringSettings mechanism of the PLP package, since that way one can manipulate the train and test data separately. I guess @jreps might have a better suggestion from the available tools.
