Your Career Platform for Big Data

Be part of the digital revolution in Switzerland

 

Latest Jobs

Arobase Geneva, Switzerland
17/01/2019
Full time
For one of our clients in Pratteln BL, we are looking for a consultant who, reporting to the Swiss management team and working within a sales team, advises existing and new customers on the efficient implementation of their own strategy. You have in-depth knowledge and expertise of the SOLIDWORKS product portfolio and all related processes and projects.

Your tasks: Carrying out projects in your area of responsibility, from concept to implementation; understanding the customer's strategy in order to identify projects; analysing strengths and weaknesses of existing and new customers; defining a roadmap and a business plan based on the identified projects; feasibility and risk analysis; presenting our products and solutions to the customer; evaluating the project and introducing any necessary changes; transferring knowledge to users; identifying new opportunities and risks.

Your profile: You are proactive and take the initiative. You have excellent analytical skills and are persuasive. You work independently, can work from home (around 60% of the work is performed off-site) and integrate into the existing sales and business consultant team. You work in a solution-oriented way. You are stress-resistant and resilient. You are self-confident and decisive. You are an expert communicator, both verbally and in writing. You speak German and can also communicate in English; French language skills are a plus. You have experience in project work while always keeping an eye on the big picture. You grasp and integrate new developments quickly. In-depth knowledge of the SOLIDWORKS product portfolio is a strong advantage.

Interested in a new challenge? We look forward to receiving your application!
Logitech Lausanne, Switzerland
17/01/2019
Full time
Logitech is transforming into a connected company, where devices and cloud services work hand in hand to create new experiences and business models. The Logitech CTO Office has the mission of defining and helping to implement Logitech's technology strategies. It works across horizontal functions as well as business groups to develop platforms and technical components that are shared across the development of products and services in different categories. The Artificial Intelligence Group is part of the CTO Office and is an organization of world-class scientists and engineers in machine learning and computer science. The mission of the group is to drive breakthrough innovations and to provide all of Logitech's Business Groups (BGs) with state-of-the-art technologies in artificial intelligence. The Artificial Intelligence Group is hiring at Logitech's Lausanne (Switzerland) site on the EPFL campus. We are looking for exceptional individuals who are interested both in innovating in the field and in applying and integrating innovations into products and services.

Your Contribution: Be Yourself. Be Open. Stay Hungry and Humble. Collaborate. Challenge. Decide and just Do. These are the behaviors you'll need for success at Logitech. For this role, we are looking for: 1+ year of relevant work or academic experience in Machine Learning or Artificial Intelligence, including software development experience. Exposure to a broad range of Machine Learning techniques, incl. deep learning. Familiarity with computer vision, audio, speech and language processing. A pragmatic attitude and the ability to rapidly dive into new scientific fields. Innovative, curious, self-starting and autonomous. A desire to collaborate in a team of both researchers and software developers.

Key Qualifications: For consideration, you must bring the following minimum skills and behaviors to our team: General-purpose programming languages incl. Java, C/C++ or Python. Training and evaluation of deep neural networks, e.g. CNNs, RNNs, GANs. A deep learning framework such as PyTorch or TensorFlow (Keras). Software development on embedded platforms such as ARM. Digital signal processing (FFT, MFCCs, beamforming, filter design, Matlab).

Education: MSc in Computer Science, Computational Sciences, Machine Learning or related sciences.

Logitech is the sweet spot for people who are passionate about products, making a mark, and having fun doing it. As a company, we're small and flexible enough for every person to take initiative and make things happen. But we're big enough in our portfolio, and reach, for those actions to have a global impact. That's a pretty sweet spot to be in and we're always striving to keep it that way.

"All qualified applicants will receive consideration for employment without regard to race, sex, color, religion, sexual orientation, gender identity, national origin, protected veteran status, or on the basis of disability."
Logitech Lausanne, Switzerland
16/01/2019
Full time
Logitech is transforming into a connected company, where devices and cloud services work hand in hand to create new user experiences. You will be a Data Engineer within the CTO Office, a transversal organisation developing a common data platform, enabling big data / internet of things analytics and other advanced technologies such as Machine Learning for Logitech's Business Groups. We will be leveraging public clouds, like AWS, Azure, and GCP, as well as tools like Apache Spark, Snowflake, D3.js and Tableau. We will be developing the worldwide infrastructure and operational best practices serving several million customers and devices. Due to the nature of the CTO Office team, this is a challenging role that requires being able to anticipate business needs and focus on business success, as well as strong technical skills, willingness to experiment with technology, and the ability to deliver on multiple projects under pressure.

Your Contribution: Be Yourself. Be Open. Stay Hungry and Humble. Collaborate. Challenge. Decide and Do. These are the behaviors you'll need for success at Logitech. In this role you will: Develop and maintain ETL flows for loading data into the warehouse from the systems collecting data from devices. Work with engineering teams and business users to define data schemas for device event data stored in a common data warehouse. Define and manage views in the warehouse to meet the requirements of data scientists using platforms like Spark and business users using visualization tools like Tableau. Work with data scientists to productize analytics and data models, developing new ETL flows for new applications driven by model-based analytics. As business needs grow, develop and maintain data stream processing workflows for device event data, supporting the needs of business users for up-to-date information and customer-facing services.

Your Skills: For consideration, you must bring the following skills and behaviors to our team: 2 years of relevant work experience in building pipelines for conventional, unstructured, streaming or big data sets using tools like Spark, Flink or Hadoop. Programming proficiency in at least one major language (Python, Scala or Java). Experience building fault-tolerant distributed systems. Strong problem-solving skills. Good presentation skills. Fluent in English.

Desired Skills: A conceptual understanding of data science and machine learning applications like recommender systems, classification, predictive modeling and clustering. Familiarity with consumer-oriented analytic techniques like segmentation and user profiling. Understanding of parallelized data ingestion techniques (not essential but useful). A pragmatic attitude and the ability to rapidly iterate and evolve ideas into products. Experience assessing and anticipating business needs and focusing on delivering business-relevant results. A desire to collaborate in a team of researchers, data scientists and software developers.

Education: MSc degree in Computer Science, Data Science, Machine Learning or a related technical field, or equivalent practical experience.

Logitech is the sweet spot for people who are passionate about products, making a mark, and having fun doing it. As a company, we're small and flexible enough for every person to take initiative and make things happen. But we're big enough in our portfolio, and reach, for those actions to have a global impact. That's a pretty sweet spot to be in and we're always striving to keep it that way.
"All qualified applicants will receive consideration for employment without regard to race, sex, color, religion, sexual orientation, gender identity, national origin, veteran status, or on the basis of disability."

DataCareer Blog

Random Forest is a powerful ensemble learning method that can be applied to various prediction tasks, in particular classification and regression. The method uses an ensemble of decision trees as a basis and therefore inherits the advantages of decision trees, such as high accuracy, easy usage, and no need to scale the data. It also has a very important additional benefit, namely resistance to overfitting (unlike single decision trees), since many trees are combined. In this tutorial, we will predict the price of diamonds from the Diamonds dataset (part of ggplot2) using a Random Forest regressor in R. We then visualize and analyze the resulting predictive model and look into the tuning of hyperparameters and the importance of the available features.

Loading and preparing data

# Import the dataset
diamond <- diamonds
head(diamond)

## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

The dataset contains information on about 54,000 diamonds: the price as well as 9 other attributes. Some features are stored as text (ordered factors), and we need to encode them in numerical format.

# Convert the variables to numerical
diamond$cut <- as.integer(diamond$cut)
diamond$color <- as.integer(diamond$color)
diamond$clarity <- as.integer(diamond$clarity)
head(diamond)

## # A tibble: 6 x 10
##   carat   cut color clarity depth table price     x     y     z
##   <dbl> <int> <int>   <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23      5     2       2  61.5    55   326  3.95  3.98  2.43
## 2 0.21      4     2       3  59.8    61   326  3.89  3.84  2.31
## 3 0.23      2     2       5  56.9    65   327  4.05  4.07  2.31
## 4 0.290     4     6       4  62.4    58   334  4.2   4.23  2.63
## 5 0.31      2     7       2  63.3    58   335  4.34  4.35  2.75
## 6 0.24      3     7       6  62.8    57   336  3.94  3.96  2.48

As already mentioned, one of the benefits of the Random Forest algorithm is that it doesn't require data scaling. So, to use this algorithm, we only need to define the features and the target that we are trying to predict. We could potentially create numerous features by combining the available attributes; for simplicity, we will not do that here. If you are trying to build the most accurate model, feature creation is definitely a key part, and substantial time should be invested in creating features (e.g. through interactions).

# Create features and target
X <- diamond %>% select(carat, depth, table, x, y, z, clarity, cut, color)
y <- diamond$price

Training the model and making predictions

At this point, we have to split our data into training and test sets. We use 75% of all rows for training and the remaining 25% as test data.

# Split data into training and test sets
index <- createDataPartition(y, p = 0.75, list = FALSE)
X_train <- X[index, ]
X_test <- X[-index, ]
y_train <- y[index]
y_test <- y[-index]

# Train the model
regr <- randomForest(x = X_train, y = y_train, maxnodes = 10, ntree = 10)

Now we have a trained model and can predict values for the test data. We then compare the predicted values with the actual values in the test set and analyze the accuracy of the model.
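Note: the extracted post does not show its package setup. The code in this tutorial relies on ggplot2 (the diamonds data and plotting), dplyr (select() and the %>% pipe), caret (createDataPartition(), postResample(), train()) and randomForest. A minimal setup sketch, assuming these packages are installed:

# Packages assumed throughout this tutorial
library(ggplot2)       # diamonds dataset and plotting
library(dplyr)         # select() and the %>% pipe
library(caret)         # createDataPartition(), postResample(), train()
library(randomForest)  # randomForest(), varImpPlot()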
To make this comparison more illustrative, we will show it both as a table and as a plot of price against carat.

# Make prediction
predictions <- predict(regr, X_test)

result <- X_test
result['price'] <- y_test
result['prediction'] <- predictions
head(result)

## # A tibble: 6 x 11
##   carat depth table     x     y     z clarity   cut color price prediction
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <int> <int> <int> <int>      <dbl>
## 1  0.24  62.8    57  3.94  3.96  2.48       6     3     7   336       881.
## 2  0.23  59.4    61  4     4.05  2.39       5     3     5   338       863.
## 3  0.2   60.2    62  3.79  3.75  2.27       2     4     2   345       863.
## 4  0.32  60.9    58  4.38  4.42  2.68       1     4     2   345       863.
## 5  0.3   62      54  4.31  4.34  2.68       2     5     6   348       762.
## 6  0.3   62.7    59  4.21  4.27  2.66       3     3     7   351       863.

# Import library for visualization
library(ggplot2)

# Build scatterplot
ggplot() +
  geom_point(aes(x = X_test$carat, y = y_test, color = 'red', alpha = 0.5)) +
  geom_point(aes(x = X_test$carat, y = predictions, color = 'blue', alpha = 0.5)) +
  labs(x = "Carat", y = "Price", color = "", alpha = 'Transparency') +
  scale_color_manual(labels = c("Predicted", "Real"), values = c("blue", "red"))

The figure shows that the predicted prices (blue points) match the real ones (red points) well, especially in the region of small carat values. To assess the model more precisely, we look at the mean absolute error (MAE), the mean squared error (MSE), and the R-squared score.

# Import library for metrics
library(Metrics)

## 
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
## 
##     precision, recall

print(paste0('MAE: ', mae(y_test, predictions)))
## [1] "MAE: 742.401258870433"
print(paste0('MSE: ', caret::postResample(predictions, y_test)['RMSE']^2))
## [1] "MSE: 1717272.6547428"
print(paste0('R2: ', caret::postResample(predictions, y_test)['Rsquared']))
## [1] "R2: 0.894548902990278"

We obtain high error values (MAE and MSE). To improve the predictive power of the model, we should tune the hyperparameters of the algorithm. We could do this manually, but it would take a lot of time. In order to tune the parameters ntree (number of trees in the forest) and maxnodes (maximum number of terminal nodes that trees in the forest can have), we build a custom Random Forest model for caret, obtain the best set of parameters, and compare the output for various combinations of the parameters.

Tuning the parameters

# If training the model takes too long, try setting a lower value of N
N = 500 # length(X_train)
X_train_ = X_train[1:N, ]
y_train_ = y_train[1:N]

seed <- 7
metric <- 'RMSE'

customRF <- list(type = "Regression", library = "randomForest", loop = NULL)
customRF$parameters <- data.frame(parameter = c("maxnodes", "ntree"),
                                  class = rep("numeric", 2),
                                  label = c("maxnodes", "ntree"))
customRF$grid <- function(x, y, len = NULL, search = "grid") {}
customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
  randomForest(x, y, maxnodes = param$maxnodes, ntree = param$ntree, ...)
}
customRF$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL) predict(modelFit, newdata)
customRF$prob <- function(modelFit, newdata, preProc = NULL, submodels = NULL) predict(modelFit, newdata, type = "prob")
customRF$sort <- function(x) x[order(x[, 1]), ]
customRF$levels <- function(x) x$classes

# Set grid search parameters
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, search = 'grid')

# Outline the grid of parameters
tunegrid <- expand.grid(.maxnodes = c(70, 80, 90, 100), .ntree = c(900, 1000, 1100))
set.seed(seed)

# Train the model
rf_gridsearch <- train(x = X_train_, y = y_train_, method = customRF,
                       metric = metric, tuneGrid = tunegrid, trControl = control)

Let's visualize the impact of the tuned parameters on RMSE. The plot shows how the model's performance develops with different combinations of the parameters. For maxnodes = 80 and ntree = 900, the model seems to perform best. We would now use these parameters in the final model.

plot(rf_gridsearch)

Best parameters:

rf_gridsearch$bestTune
##   maxnodes ntree
## 5       80  1000

Defining and visualizing variable importance

For this model we used all available diamond features, but some of them carry more predictive power than others. Let's build a plot with the list of features on the y-axis. On the x-axis we plot the incremental decrease in node impurity from splitting on the variable, averaged over all trees; for regression it is measured by the residual sum of squares and therefore gives us a rough idea of the predictive power of each feature. Generally, it is important to keep in mind that a random forest does not allow for any causal interpretation.

varImpPlot(rf_gridsearch$finalModel, main = 'Feature importance')

From the figure above you can see that the size of the diamond (x, y, z refer to length, width, and depth) and its weight (carat) contain the major part of the predictive power.
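To close the loop, the tuned values can be plugged back into a final model trained on the full training set. A minimal sketch, assuming the X_train, y_train, X_test, y_test and rf_gridsearch objects created above (note that refitting with around 1,000 trees on all training rows takes noticeably longer than the toy model at the start of the tutorial):

# Refit a final forest with the tuned hyperparameters
best <- rf_gridsearch$bestTune
regr_final <- randomForest(x = X_train, y = y_train,
                           maxnodes = best$maxnodes, ntree = best$ntree)

# Re-evaluate on the held-out test set
predictions_final <- predict(regr_final, X_test)
print(paste0('MAE: ', mae(y_test, predictions_final)))
print(paste0('R2: ', caret::postResample(predictions_final, y_test)['Rsquared']))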
A common and very challenging problem in machine learning is overfitting, and it comes in many different forms. It is one of the major aspects of training a model. Overfitting occurs when the model captures too much noise in the training data set, which leads to poor prediction accuracy when applying the model to new data. One way to avoid overfitting is regularization. In this tutorial, we will examine Ridge regression and Lasso, which extend classical linear regression. Earlier, we showed how to work with Ridge and Lasso in Python; this time we will build and train our models using R and the caret package.

Like classical linear regression, Ridge and Lasso also build a linear model, but their fundamental peculiarity is regularization. The goal of these methods is to modify the loss function so that it depends not only on the sum of squared differences but also on the regression coefficients. One of the main problems in the construction of such models is the correct selection of the regularization parameter. Compared to linear regression, Ridge and Lasso models are more resistant to outliers and the spread of the data. Overall, their main purpose is to prevent overfitting. The main difference between Ridge regression and Lasso is how they assign the penalty to the coefficients. We will explore this with our example, so let's start.

We will work with the Diamonds dataset, which you can download from http://vincentarelbundock.github.io/Rdatasets/datasets.html . It contains the prices and other attributes of almost 54,000 diamonds. We will predict the price of diamonds using the other attributes and compare the results for Ridge, Lasso, and OLS.

# Upload the dataset
diamonds <- read.csv('diamonds.csv', header = TRUE, sep = ',')
head(diamonds)

##   X carat       cut color clarity depth table price    x    y    z
## 1 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

# Drop the index column
diamonds <- diamonds[ , -which(names(diamonds) == 'X')]
head(diamonds)

##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

# Print unique values of text features
print(levels(diamonds$cut))
## [1] "Fair"      "Good"      "Ideal"     "Premium"   "Very Good"
print(levels(diamonds$clarity))
## [1] "I1"   "IF"   "SI1"  "SI2"  "VS1"  "VS2"  "VVS1" "VVS2"
print(levels(diamonds$color))
## [1] "D" "E" "F" "G" "H" "I" "J"

As you can see, there is a finite number of levels, so we can change these categorical variables to numerical ones.

diamonds$cut <- as.integer(diamonds$cut)
diamonds$color <- as.integer(diamonds$color)
diamonds$clarity <- as.integer(diamonds$clarity)
head(diamonds)

##   carat cut color clarity depth table price    x    y    z
## 1  0.23   3     2       4  61.5    55   326 3.95 3.98 2.43
## 2  0.21   4     2       3  59.8    61   326 3.89 3.84 2.31
## 3  0.23   2     2       5  56.9    65   327 4.05 4.07 2.31
## 4  0.29   4     6       6  62.4    58   334 4.20 4.23 2.63
## 5  0.31   2     7       4  63.3    58   335 4.34 4.35 2.75
## 6  0.24   5     7       8  62.8    57   336 3.94 3.96 2.48

Before building the models, let's first scale the data.
Scaling centers each feature and divides it by its standard deviation, so a feature with a larger scale won't have a larger impact on the model simply because of its units. To scale a feature, we subtract the mean from every value and divide by the standard deviation. For scaling the data and training the models we'll use the caret package. The caret package (short for Classification And REgression Training) is a wrapper around a number of other packages and provides a unified interface for data preparation, model training, and metric evaluation.

# Create features and target matrixes
X <- diamonds %>% select(carat, depth, table, x, y, z, clarity, cut, color)
y <- diamonds$price

# Scale data
preprocessParams <- preProcess(X, method = c("center", "scale"))
X <- predict(preprocessParams, X)

Now we can build the Lasso and Ridge models. We'll split the data into a training and a test dataset, but for now we won't tune the regularization parameter lambda; it is set to 1. The "glmnet" method in caret has an alpha argument that determines what type of model is fit: if alpha = 0, a ridge regression model is fit, and if alpha = 1, a lasso model is fit. Here we use caret as a wrapper for glmnet.

# Splitting the data into two parts based on the outcome: 75% and 25%
index <- createDataPartition(y, p = 0.75, list = FALSE)
X_train <- X[index, ]
X_test <- X[-index, ]
y_train <- y[index]
y_test <- y[-index]

# Create and fit Lasso and Ridge objects
lasso <- train(y = y_train, x = X_train, method = 'glmnet',
               tuneGrid = expand.grid(alpha = 1, lambda = 1))
ridge <- train(y = y_train, x = X_train, method = 'glmnet',
               tuneGrid = expand.grid(alpha = 0, lambda = 1))

# Make the predictions
predictions_lasso <- lasso %>% predict(X_test)
predictions_ridge <- ridge %>% predict(X_test)

# Print R squared scores
data.frame(
  Ridge_R2 = R2(predictions_ridge, y_test),
  Lasso_R2 = R2(predictions_lasso, y_test)
)
##    Ridge_R2  Lasso_R2
## 1 0.8584854 0.8813328

# Print RMSE
data.frame(
  Ridge_RMSE = RMSE(predictions_ridge, y_test),
  Lasso_RMSE = RMSE(predictions_lasso, y_test)
)
##   Ridge_RMSE Lasso_RMSE
## 1   1505.114   1371.852

# Print coefficients
data.frame(
  as.data.frame.matrix(coef(lasso$finalModel, lasso$bestTune$lambda)),
  as.data.frame.matrix(coef(ridge$finalModel, ridge$bestTune$lambda))
) %>% rename(Lasso_coef = X1, Ridge_coef = X1.1)
##              Lasso_coef Ridge_coef
## (Intercept)  3934.27842 3934.23903
## carat        5109.77597 2194.54696
## depth        -210.68494  -92.16124
## table        -207.38199 -156.89671
## x           -1178.40159  709.68312
## y               0.00000  430.43936
## z               0.00000  423.90948
## clarity       499.77400  465.56366
## cut            74.61968   83.61722
## color        -448.67545 -317.53366

These two models give very similar results. Now let's choose the regularization parameter with the help of tuneGrid. The model with the highest R-squared score will give us the best parameter.
parameters <- c(seq(0.1, 2, by = 0.1), seq(2, 5, 0.5), seq(5, 25, 1))

lasso <- train(y = y_train, x = X_train, method = 'glmnet',
               tuneGrid = expand.grid(alpha = 1, lambda = parameters),
               metric = "Rsquared")
ridge <- train(y = y_train, x = X_train, method = 'glmnet',
               tuneGrid = expand.grid(alpha = 0, lambda = parameters),
               metric = "Rsquared")
linear <- train(y = y_train, x = X_train, method = 'lm', metric = "Rsquared")

print(paste0('Lasso best parameters: ', lasso$finalModel$lambdaOpt))
## [1] "Lasso best parameters: 2.5"
print(paste0('Ridge best parameters: ', ridge$finalModel$lambdaOpt))
## [1] "Ridge best parameters: 25"

predictions_lasso <- lasso %>% predict(X_test)
predictions_ridge <- ridge %>% predict(X_test)
predictions_lin <- linear %>% predict(X_test)

data.frame(
  Ridge_R2 = R2(predictions_ridge, y_test),
  Lasso_R2 = R2(predictions_lasso, y_test),
  Linear_R2 = R2(predictions_lin, y_test)
)
##    Ridge_R2  Lasso_R2 Linear_R2
## 1 0.8584854 0.8813317 0.8813157

data.frame(
  Ridge_RMSE = RMSE(predictions_ridge, y_test),
  Lasso_RMSE = RMSE(predictions_lasso, y_test),
  Linear_RMSE = RMSE(predictions_ridge, y_test)
)
##   Ridge_RMSE Lasso_RMSE Linear_RMSE
## 1   1505.114   1371.852    1505.114

print('Best estimator coefficients')
## [1] "Best estimator coefficients"

data.frame(
  ridge = as.data.frame.matrix(coef(ridge$finalModel, ridge$finalModel$lambdaOpt)),
  lasso = as.data.frame.matrix(coef(lasso$finalModel, lasso$finalModel$lambdaOpt)),
  linear = (linear$finalModel$coefficients)
) %>% rename(lasso = X1, ridge = X1.1)
##                  lasso       ridge      linear
## (Intercept) 3934.23903  3934.27808  3934.27526
## carat       2194.54696  5103.53988  5238.57024
## depth        -92.16124  -210.18689  -222.97958
## table       -156.89671  -207.17217  -210.64667
## x            709.68312 -1172.30125 -1348.94469
## y            430.43936     0.00000    23.65170
## z            423.90948     0.00000    22.01107
## clarity      465.56366   499.73923   500.13593
## cut           83.61722    74.53004    75.81861
## color       -317.53366  -448.40340  -453.92366

Our scores changed only slightly; with these values of lambda there is not much difference. Let's build coefficient plots to see how the value of lambda influences the coefficients of both models. We will use the glmnet function to train the models and then the plot() function, which produces a coefficient profile plot of the coefficient paths for a fitted "glmnet" object. The xvar argument of plot() defines what is on the x-axis and can take three values: "norm", "lambda", or "dev", where "norm" is the L1-norm of the coefficients, "lambda" the log-lambda sequence, and "dev" the percent deviance explained. We'll set it to "lambda". To train glmnet, we need to convert our X_train object to a matrix.

# Set lambda coefficients
paramLasso <- seq(0, 1000, 10)
paramRidge <- seq(0, 1000, 10)

# Convert X_train to a matrix for use with the glmnet function
X_train_m <- as.matrix(X_train)

# Build Ridge and Lasso over the lambda sequences defined above
rridge <- glmnet(
  x = X_train_m,
  y = y_train,
  alpha = 0,      # Ridge
  lambda = paramRidge
)

llaso <- glmnet(
  x = X_train_m,
  y = y_train,
  alpha = 1,      # Lasso
  lambda = paramLasso
)

(The post then shows the coefficient-path plots for both models.) As a result, you can see that when we raise lambda in the Ridge regression, the magnitude of the coefficients decreases but never reaches zero. In Lasso, raising lambda affects the large coefficients less, while the small coefficients are shrunk all the way to zero.
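The plot() calls that produce the coefficient-path figures are not visible in the extracted post; a sketch of how they would typically be generated, assuming the rridge and llaso objects fitted above:

# Coefficient paths against log-lambda for the fitted glmnet objects
plot(rridge, xvar = "lambda", label = TRUE)
title("Ridge coefficient paths", line = 2.5)

plot(llaso, xvar = "lambda", label = TRUE)
title("Lasso coefficient paths", line = 2.5)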
We can conclude from the plot that the "carat" feature has the most impact on the target variable. Other important features are "clarity" and "color", while features like "z" and "cut" have barely any impact. Lasso can therefore also be used to determine which features are important and have strong predictive power. Ridge regression gives a uniform penalty to all features and in this way reduces model complexity and prevents multicollinearity. Now, it's your turn!
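As a design note, glmnet also ships with cv.glmnet, which selects lambda by cross-validation directly instead of scanning a hand-picked grid. A brief sketch, assuming the X_train_m and y_train objects from above:

# Cross-validated lambda selection for the Lasso (alpha = 1)
cv_lasso <- cv.glmnet(x = X_train_m, y = y_train, alpha = 1, nfolds = 10)
cv_lasso$lambda.min               # lambda with the lowest cross-validated error
cv_lasso$lambda.1se               # largest lambda within one SE of the minimum
coef(cv_lasso, s = "lambda.min")  # coefficients at the selected lambda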
Companies have a growing demand to visualize their data with business intelligence tools. We compared salaries across 10 different European countries using Glassdoor, which offers self-reported salary information by location and employer, giving us some key insights into the salaries of people with "Business Intelligence" in their job title.

Switzerland with the highest salary for Business Intelligence

There are few reliable resources available to track salaries in the workplace. Most people are private about how much they earn, and many companies choose not to share salary data. Glassdoor allows access to self-reported salary information by location and employer, and can provide us with some insights into the salaries of job titles containing "business intelligence". We collected the salary data from 10 different European countries: Austria, Belgium, France, Germany, Ireland, Italy, the Netherlands, Spain, Switzerland, and the UK. To make salaries comparable, we converted them into annual salaries in euros (GBP/EUR: 1.12, CHF/EUR: 0.88).

The figure plots the average nominal salaries per year by country and rank. Junior-level salaries range from EUR 24,900 in Italy to EUR 108,400 in Switzerland. Germany and Ireland follow with EUR 62,300 and EUR 51,900. For senior positions in Switzerland, wages are EUR 125,000, making it the clear leader in Europe.

When the cost of living is adjusted, the difference between Germany and Switzerland decreases

These nominal salaries don't tell us much about the underlying purchasing power. European cities like Geneva and Zurich are famous for being expensive places to live. To take into account the difference in the cost of living, and to compute real wages, we use the OECD price level index. The table below shows the average annual salary for junior-level positions with prices adjusted. The cost of living in Switzerland is about 70% higher than in Spain, so some of the differences in business intelligence salaries can be explained by the cost of living. Even with these adjustments, however, the ranking does not change much: Switzerland still stands out clearly with the highest wage, while Italy remains the lowest. The difference between Germany and Switzerland, however, shrinks from around EUR 45,000 per year to around EUR 12,700 per year.

Country           Annual Salary in 1000 EUR   OECD Price Level   Adjusted Salary
Italy             24.9                        91                 27.3
Spain             28.1                        83                 33.9
United Kingdom    41.0                        108                38.0
Netherlands       44.9                        103                43.6
France            46.9                        101                46.4
Belgium           47.8                        101                47.3
Ireland           51.9                        102                50.8
Germany           62.3                        98                 63.6
Switzerland       108.4                       142                76.3
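For reference, the cost-of-living adjustment simply divides the nominal salary by the OECD price level (index 100 = average). A quick R sketch using the figures from the table above, which reproduces the adjusted column up to rounding:

# Nominal junior-level salaries (in 1000 EUR) and OECD price level indices
salaries <- data.frame(
  country     = c("Italy", "Spain", "United Kingdom", "Netherlands", "France",
                  "Belgium", "Ireland", "Germany", "Switzerland"),
  nominal     = c(24.9, 28.1, 41.0, 44.9, 46.9, 47.8, 51.9, 62.3, 108.4),
  price_level = c(91, 83, 108, 103, 101, 101, 102, 98, 142)
)

# Real (purchasing-power adjusted) salary = nominal / (price level / 100)
salaries$adjusted <- round(salaries$nominal / (salaries$price_level / 100), 1)
salaries   # e.g. Switzerland: 108.4 / 1.42 = 76.3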
View all blog posts