Your Career Platform for Big Data

Be part of the digital revolution in Switzerland

 

Latest Jobs

Onedot AG Schlieren, Switzerland
18/04/2019
Internship
Wanted: a structured, customer-oriented data enthusiast with first experience in data processing. We founded Onedot because we believe that data should be easier to consume across different sources and organizations. In reality, we live in a world full of unstructured and distributed data, which keeps us from solving the most important global problems across a wide range of industries. Onedot offers a dynamic, flexible and inspiring work culture with plenty of room for personal development and autonomy, where you can actively shape our company and make a substantial contribution to our success. As a Data Management Intern you are at the center of our company and work with our customers, their product data and our proprietary technology. You already have first experience in quickly identifying problems in data and recognizing the potential hidden in it. Structured thinking, an independent and precise way of working and a strong sense of quality come naturally to you. State-of-the-art technology is your thing, and you know how to turn complex problems into simple solutions. Onedot develops state-of-the-art machine learning algorithms, probabilistic methods and statistical text analysis to automatically turn unstructured data into structured information. Our pioneering software achieves results with superhuman precision, makes today's data standards obsolete and accelerates the growth of organizations in all industries. Onedot is a startup backed by leading venture capital firms and world-class entrepreneurs.
Your responsibilities: You independently support customer projects using Onedot technology (automated schema mapping, data integration, product classification, segmentation, data harmonization). You analyze customer requirements, map them onto our technology and help build the end-to-end data processing pipelines. You independently carry out quality assurance, post-processing and reporting on customer data. You continuously improve the automation of our industrialized data processing pipelines. You work closely with sales, product management and product development.
Our requirements: First experience in data processing or reporting. Knowledge of scripting languages (Bash, Python or R), data formats (Excel, CSV, XML, JSON) and office software (Excel). Excellent German and English, spoken and written, C2 or higher. Affinity for e-commerce, product data and enterprise software. You are a big thinker, live a getting-things-done mentality and enjoy working in a fast-paced environment!
Our benefits: Performance-based compensation, because we like to reward good work that creates a lot of impact. Flexible working hours, because your results count, not just time in the office. Occasional work from home, so you can stay focused and productive. A team of many different nationalities, so we can learn from each other and move forward together. A cool office with plenty of space, because we like it cozy and comfortable. Tasty snacks, drinks, coffee and other amenities are included. Frequent team events, so we also get to know each other personally. Planned education and training, so we keep broadening our horizons. As part of our commitment to a diverse company culture, Onedot is committed to equal employment opportunity regardless of race, color, national origin, ethnicity, gender, disability, sexual orientation, gender identity or religion. This is an 80-100% internship position at our headquarters in Schlieren/Zurich. Apply today and become part of the Onedot success story!
AiM Services SA Geneva, Switzerland
16/04/2019
Full time
We are looking for a Data Engineer for one of our Geneva-based customers. Responsibilities: Strengthen agile development capacity in the database area. Participate in the evolution of the databases and in the modeling of the hub and the data warehouse. Develop and support ETL feeds. Analyze, model and integrate external data sources. Provide database support for application development and web services. Implement best practices for the data model, the database life cycle and testing. Profile and required skills: Very good knowledge of relational databases, SQL, optimization of SQL queries, PL/SQL. Very good knowledge of Oracle. Very good knowledge of hub and data warehouse databases, with practical experience. Very good ability to model financial data; experience with alternative funds an asset. Good knowledge of agile principles; experience in Scrum practice an asset. Knowledge of Informatica an asset. Knowledge of Sisense an asset. Languages: French: mother tongue. English: advanced. Training and qualifications: EPF diploma, HES, university degree or training deemed equivalent. Residence in Switzerland for more than 3 years required. Activity rate: 100%. Start date: as soon as possible.
SIX Group AG Zürich, Switzerland
16/04/2019
Full time
As a member of the Business Analysis department within the Strategic Transformation organization, you will review data provider and stock exchange notices, and analyze and document requirements to ensure the needs of stakeholders are met within product and project parameters. You ideally have experience working in a stakeholder/matrix-style management environment. Familiarity with stock exchange data feed terminology is preferred but not obligatory; however, you must demonstrate an ability to work effectively within a global team as well as independently to proactively identify and resolve problems. Main duties: Support the correct and consistent integration of financial data from multiple sources into our feed and display products, including real-time, intraday and end-of-day delivery of market and reference data. Analyze the impact of required changes on our central host database as well as on client products, and define the necessary adjustments to the respective data structures and rule sets in order to meet client needs. Identify business needs and analyze and conceptualize them, in order to allow better-informed business decisions and to provide strong guidance for software development. Execute user acceptance testing as well as the preparation of client information on changes. Assist project management during the project planning and delivery phases. Your profile: Bachelor's or higher education degree in Finance, Economics or Computer Science. At least 2 years of working experience in financial services and as a business analyst, and robust knowledge of financial markets and securities instruments. Knowledge of an object-oriented or other programming language and familiarity with various file encodings (e.g. UTF, ANSI) and file structures (e.g. XML, JSON) is a great plus. Experience with agile development projects. Analytical, technical, problem-solving and communication skills are key strengths. Excellent written and verbal communication skills in English and German; French is a plus. If you have any questions, please call Valeria Schmid, Human Resources, +41 58 399 80 17. We only accept direct online applications. We strive for a diverse workforce and welcome all applicants regardless of personal background.
Seervision Zürich, Switzerland
16/04/2019
Full time
We get our hands on huge amounts of data while producing video. Efficient ways to parse the data and build a database suitable for training detection and behavioral analytics algorithms will be a fundamental part of future development. Responsibilities: Develop mechanisms to automatically extract and store metadata from the live footage and the camera operation of every Seervision robotic camera (labeling for machine learning). Format footage and metadata for use in (un)supervised machine learning tasks. Develop software bridges between footage metadata and commercial video processing software. Potential: Take full control of the machine learning tasks and focus solely on R&D in close collaboration with ETH. Control over team hiring decisions. Advantages: 2+ years of experience in machine learning, preferably with neural networks. Experience with video processing software (e.g. Adobe Premiere).

DataCareer Blog

In the modern world, the flow of information that confronts a person is overwhelming. This has led to a rather abrupt change in the basic principles of data perception, and visualization is becoming the main tool for presenting information. With the help of visualization, information is presented to the audience in a more accessible, clear and visual form. A properly chosen visualization method makes it possible to structure large data sets, de-emphasize elements that are insignificant in content, and make information more comprehensible.

One of the most popular languages for data processing and analysis is Python, largely due to how quickly libraries are created and developed for it, which grants practically unlimited possibilities for data processing. The same is true for data visualization libraries. In this article, we will look at the basic tools for visualizing data that are used in the Python environment.

Matplotlib

Matplotlib is perhaps the most widely known Python library for data visualization. While being easy to use, it offers ample opportunities to fine-tune the way data is displayed.

[Figure: Polar area chart]

The library provides the main visualization algorithms, including scatter plots, line plots, histograms, bar plots, box plots, and more. It is worth noting that the library has fairly extensive documentation, which makes it comfortable to work with even for beginners in data processing and visualization.

[Figure: Multicategorical plot]

One of the main advantages of this library is its well-thought-out hierarchical structure. The highest level is represented by the functional interface called matplotlib.pyplot, which allows users to create complex infographics with just a couple of lines of code by choosing ready-made solutions from the functions the interface offers.

[Figure: Histogram]

The convenience of creating visualizations with Matplotlib comes not only from the number of built-in plotting commands but also from the rich arsenal of options for configuring the standard forms. Settings include the ability to set arbitrary colors, shapes, line or marker types, line thickness, transparency level, font size and type, and so on.

Seaborn

Despite the wide popularity of the Matplotlib library, it has one drawback that can become critical for some users: its API is low-level, so creating truly complex infographics may require writing a lot of generic code.

[Figure: Hexbin plot]

Fortunately, this problem is addressed by the Seaborn library, which is essentially a high-level wrapper over Matplotlib. With its help, users can create colorful, specialized visualizations: heat maps, time series, violin charts, and much more.

[Figure: Seaborn heatmap]

Being highly customizable, Seaborn makes it easy to give charts a unique and polished look quickly and with little effort.

ggplot

Users who have experience with R have probably heard of ggplot2, a powerful data visualization tool within the R programming environment. This package is recognized as one of the best tools for the graphical presentation of information. Fortunately, the extensive capabilities of this library are now also available in the Python environment thanks to a port of the package, which is available there under the name ggplot.

[Figure: Box plot]

As we mentioned earlier, the process of data visualization has a deep internal structure.
In other words, creating a visualization is a clearly structured process, which largely shapes the way you think while building infographics. ggplot teaches the user to think within this structured approach, so that while consistently building up commands, the user automatically starts detecting patterns in the data.

[Figure: Scatter plot]

Moreover, the library is very flexible. ggplot provides users with ample opportunities for customizing how data will be displayed and for preprocessing datasets before they are rendered.

Bokeh

Despite the rich potential of the ggplot library, some users may miss interactivity. For those who need interactive data visualization, the Bokeh library has been created.

[Figure: Stacked area chart]

Bokeh is an open-source Python library with a JavaScript client side that allows users to create flexible, powerful and beautiful visualizations for web applications. With its help, users can create everything from simple bar charts to complex, highly detailed interactive visualizations without writing a single line of JavaScript. Please have a look at this gallery to get an idea of the interactive features of Bokeh.

plotly

For those who need interactive diagrams, we also recommend checking out the plotly library. It is positioned primarily as an online platform on which users can create and publish their own visualizations. However, the library can also be used offline, without uploading the visualization to the plotly server.

[Figure: Contour plot]

Because this library is positioned by its developers mostly as a standalone product, it is constantly being refined and expanded. It provides users with very broad possibilities for data visualization, whether interactive graphics or contour plots. You can find examples of Plotly at the link below and have a look at the library's features: https://plot.ly/python/

Conclusion

Over the past few years, the data visualization tools available to Python developers have made a significant leap forward. Many powerful packages have appeared and keep expanding, implementing quite complex ways of graphically representing information. This allows users not only to create various infographics but also to make them truly attractive and understandable to the audience.
The more carefully you process the data and go into the details, the more valuable information you can extract for your benefit. Data visualization is an efficient and handy tool for gaining insights from data. Moreover, with the help of visualization tools you can make the data far more understandable, colorful and pleasant to look at. As data changes every second, it is an urgent task to investigate it carefully and get insights as fast as possible. Data visualization tools cover a full range of features and additional functions designed to facilitate the visualization process for you. Below, we give an overview of the most popular and useful libraries for data visualization in R.

ggplot2

ggplot2 is a system for creating charts based on the Grammar of Graphics, and it has proved to be one of the best R libraries for visualization. ggplot2 works with both univariate and multivariate numerical and categorical data, which makes it very flexible. The plot specification is at a high level of abstraction, and the package provides a complete graphics system. It contains a variety of labels, themes, different coordinate systems, and so on. You therefore get the opportunity to control data values with the scales option; filter, sort and summarize datasets; and create complex plots. However, some things are not available with ggplot2, such as 3D graphics, graph-theory-type graphs, and interactive graphics. Here are several examples of plots made with the help of ggplot2.

[Figures: Density plot, Boxplot, Scatterplot]

Plotly

Plotly is an online platform for data visualization, available in R and Python. The package creates interactive web-based plots using the plotly.js library. Its advantage is that it can build contour plots, candlestick charts, maps, and 3D charts, which cannot be created using most packages. In addition, it has 30 repositories available. Plotly gives you the opportunity to interact with graphs, change their scale and point out the necessary record. The library also supports hovering over graphs. Moreover, you can easily embed Plotly in knitr/R Markdown or Shiny apps. Have a look at several plots and charts created with Plotly.

[Figures: Contour plot, Candlestick chart, 3D scatterplot]

Dygraphs

Dygraphs is an R interface to the dygraphs JavaScript charting library. This library has proved to be fast and flexible, and it facilitates working with dense data. Dygraphs is a useful tool for charting time-series data in R. The benefits of this library include support for visualizing xts objects, support for graph overlays such as shaded regions, event lines and point annotations, interaction with graphs, upper/lower bars, synchronization, the range selector, and more. You can also easily embed Dygraphs in knitr/R Markdown or Shiny apps. Moreover, huge datasets with millions of points don't affect its speed, and you can use RColorBrewer with Dygraphs to increase the range of colors.

[Figure: time-series chart created with Dygraphs]

Leaflet

Leaflet is a well-known package based on JavaScript libraries for interactive maps. It is widely used for mapping and for customizing and designing interactive maps. Besides, Leaflet makes it possible to build maps that are mobile-friendly.
Its abilities include: interaction with plots and changing their scale; map design (markers, pop-ups, GeoJSON); easy integration with knitr/R Markdown or Shiny apps; working with latitude/longitude columns; support for Shiny logic using map bounds and mouse events; and visualization of maps in non-spherical Mercator projections.

rgl

The rgl package is a perfect fit for creating interactive 3D plots in R. It offers a variety of 3D shapes, lighting effects, different types of objects, and even the ability to animate your 3D scene. rgl contains both high-level graphics commands and low-level structure. The plot types include points, lines, segments from z = 0, and spheres. Moreover, with rgl you can interact with graphs, apply various decorative functions, easily embed rgl in knitr/R Markdown or Shiny apps, and create new shapes.

Conclusion

To sum up, data visualization is more than a pretty picture of your data: it is a chance to look at the data under the hood. R is a powerful visualization tool. Using R you can build a variety of charts, from a simple pie chart to more sophisticated ones such as 3D graphs, interactive graphs, maps, and so on. Of course, this list is not complete, and many other great visualization tools exist that can bring their own specific benefits to your data visualization; nevertheless, we compiled this list from our experience. Summarizing everything mentioned above: Plotly, Dygraphs, and Leaflet support zooming and moving your graphs, and if you are plotting time series you can filter dates using a range selector. For building 3D models, rgl is highly suitable. Do your best with these handy R visualization tools!
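To give a concrete feel for the concise, layered specification style described above, here is a minimal ggplot2 sketch. It is an illustrative example, not taken from the original post, and it uses the built-in mtcars dataset with arbitrary aesthetic choices:

# Minimal ggplot2 example: scatterplot with a per-group linear trend
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +                    # one point per car
  geom_smooth(method = "lm", se = FALSE) +  # linear trend for each cylinder group
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders",
       title = "Fuel efficiency vs. weight") +
  theme_minimal()                           # a clean built-in theme

Each + adds an independent layer or setting, which is the grammar-based approach that makes ggplot2 plots easy to extend.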
Random Forest is a powerful ensemble learning method that can be applied to various prediction tasks, in particular classification and regression. The method uses an ensemble of decision trees as a basis and therefore has all the advantages of decision trees, such as high accuracy, easy usage, and no need to scale the data. Moreover, it has a very important additional benefit: because many trees are combined, it is much more resistant to overfitting than a single decision tree. In this tutorial, we will try to predict the value of diamonds from the diamonds dataset (part of ggplot2) by applying a Random Forest regressor in R. We then visualize and analyze the obtained predictive model and look into the tuning of hyperparameters and the importance of the available features.

Loading and preparing data

# Load the required libraries
library(ggplot2)      # provides the diamonds dataset
library(dplyr)        # select() and the pipe operator
library(caret)        # createDataPartition() and model tuning
library(randomForest) # the random forest implementation

# Import the dataset
diamond <- diamonds
head(diamond)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

The dataset contains information on almost 54,000 diamonds: the price as well as 9 other attributes. Three features (cut, color and clarity) are ordered factors, and we need to encode them numerically.

# Convert the ordered factors to numerical values
diamond$cut <- as.integer(diamond$cut)
diamond$color <- as.integer(diamond$color)
diamond$clarity <- as.integer(diamond$clarity)
head(diamond)
## # A tibble: 6 x 10
##   carat   cut color clarity depth table price     x     y     z
##   <dbl> <int> <int>   <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23      5     2       2  61.5    55   326  3.95  3.98  2.43
## 2 0.21      4     2       3  59.8    61   326  3.89  3.84  2.31
## 3 0.23      2     2       5  56.9    65   327  4.05  4.07  2.31
## 4 0.290     4     6       4  62.4    58   334  4.2   4.23  2.63
## 5 0.31      2     7       2  63.3    58   335  4.34  4.35  2.75
## 6 0.24      3     7       6  62.8    57   336  3.94  3.96  2.48

As we already mentioned, one of the benefits of the Random Forest algorithm is that it doesn't require data scaling. So, to use this algorithm, we only need to define the features and the target we are trying to predict. We could potentially create numerous features by combining the available attributes; for simplicity, we will not do that now. If you are trying to build the most accurate model, feature creation is definitely a key part, and substantial time should be invested in it (e.g. through interaction terms).

# Create features and target
X <- diamond %>% select(carat, depth, table, x, y, z, clarity, cut, color)
y <- diamond$price

Training the model and making predictions

At this point, we have to split our data into training and test sets. As a training set, we will take 75% of all rows and use the remaining 25% as test data.

# Split data into training and test sets
index <- createDataPartition(y, p = 0.75, list = FALSE)
X_train <- X[index, ]
X_test <- X[-index, ]
y_train <- y[index]
y_test <- y[-index]

# Train the model
regr <- randomForest(x = X_train, y = y_train, maxnodes = 10, ntree = 10)

Now we have a trained model and can predict values for the test data. We then compare the predicted values with the actual values in the test data and analyze the accuracy of the model.
To make this comparison more illustrative, we will show it both as a table and as a plot of price against carat value.

# Make predictions
predictions <- predict(regr, X_test)
result <- X_test
result['price'] <- y_test
result['prediction'] <- predictions
head(result)
## # A tibble: 6 x 11
##   carat depth table     x     y     z clarity   cut color price prediction
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <int> <int> <int> <int>      <dbl>
## 1  0.24  62.8    57  3.94  3.96  2.48       6     3     7   336       881.
## 2  0.23  59.4    61  4     4.05  2.39       5     3     5   338       863.
## 3  0.2   60.2    62  3.79  3.75  2.27       2     4     2   345       863.
## 4  0.32  60.9    58  4.38  4.42  2.68       1     4     2   345       863.
## 5  0.3   62      54  4.31  4.34  2.68       2     5     6   348       762.
## 6  0.3   62.7    59  4.21  4.27  2.66       3     3     7   351       863.

# Import library for visualization
library(ggplot2)

# Build scatterplot
ggplot() +
  geom_point(aes(x = X_test$carat, y = y_test, color = 'red', alpha = 0.5)) +
  geom_point(aes(x = X_test$carat, y = predictions, color = 'blue', alpha = 0.5)) +
  labs(x = "Carat", y = "Price", color = "", alpha = 'Transparency') +
  scale_color_manual(labels = c("Predicted", "Real"), values = c("blue", "red"))

The figure shows that the predicted prices (blue points) coincide well with the real ones (red points), especially in the region of small carat values. But to assess the model more precisely, we will look at the mean absolute error (MAE), the mean squared error (MSE), and the R-squared score.

# Import library for metrics
library(Metrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
##     precision, recall
print(paste0('MAE: ', mae(y_test, predictions)))
## [1] "MAE: 742.401258870433"
print(paste0('MSE: ', caret::postResample(predictions, y_test)['RMSE']^2))
## [1] "MSE: 1717272.6547428"
print(paste0('R2: ', caret::postResample(predictions, y_test)['Rsquared']))
## [1] "R2: 0.894548902990278"

We obtain rather high error values (MAE and MSE). To improve the predictive power of the model, we should tune the hyperparameters of the algorithm. We could do this manually, but it would take a lot of time. In order to tune the parameters ntree (the number of trees in the forest) and maxnodes (the maximum number of terminal nodes a tree in the forest can have), we will build a custom Random Forest model for caret, obtain the best set of parameters, and compare the output for various combinations of the parameters.

Tuning the parameters

# If training the model takes too long, try setting a lower value of N
N <- 500 # length(X_train)
X_train_ <- X_train[1:N, ]
y_train_ <- y_train[1:N]

seed <- 7
metric <- 'RMSE'

# Define a custom random forest model for caret with tunable maxnodes and ntree
customRF <- list(type = "Regression", library = "randomForest", loop = NULL)
customRF$parameters <- data.frame(parameter = c("maxnodes", "ntree"),
                                  class = rep("numeric", 2),
                                  label = c("maxnodes", "ntree"))
customRF$grid <- function(x, y, len = NULL, search = "grid") {}
customRF$fit <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
  randomForest(x, y, maxnodes = param$maxnodes, ntree = param$ntree, ...)
}
customRF$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
  predict(modelFit, newdata)
customRF$prob <- function(modelFit, newdata, preProc = NULL, submodels = NULL)
  predict(modelFit, newdata, type = "prob")
customRF$sort <- function(x) x[order(x[, 1]), ]
customRF$levels <- function(x) x$classes

# Set grid search parameters
control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, search = 'grid')

# Outline the grid of parameters
tunegrid <- expand.grid(.maxnodes = c(70, 80, 90, 100), .ntree = c(900, 1000, 1100))
set.seed(seed)

# Train the model
rf_gridsearch <- train(x = X_train_, y = y_train_, method = customRF,
                       metric = metric, tuneGrid = tunegrid, trControl = control)

Let's visualize the impact of the tuned parameters on the RMSE. The plot shows how the model's performance develops for the different combinations of parameters. The model performs best for maxnodes = 80 and ntree = 1000, and we would now use these parameters in the final model.

plot(rf_gridsearch)

Best parameters:

rf_gridsearch$bestTune
##   maxnodes ntree
## 5       80  1000

Defining and visualizing variable importance

For this model we used all available diamond features, but some of them contain more predictive power than others. Let's build a plot with the list of features on the y-axis. The x-axis shows the decrease in node impurity from splitting on the variable, averaged over all trees; for regression it is measured by the residual sum of squares and therefore gives us a rough idea of the predictive power of each feature. Generally, it is important to keep in mind that a random forest does not allow for any causal interpretation.

varImpPlot(rf_gridsearch$finalModel, main = 'Feature importance')

The figure shows that the size of the diamond (x, y and z refer to length, width and depth) and its weight (carat) contain the major part of the predictive power.
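As a possible follow-up step, not shown in the original tutorial, one could retrain a final model on the full training set with the tuned parameters and evaluate it on the held-out test set. Below is a minimal sketch, assuming the objects defined above (X_train, y_train, X_test, y_test) and the tuned values maxnodes = 80 and ntree = 1000:

# Retrain a final random forest with the tuned hyperparameters (illustrative sketch)
set.seed(7)
final_rf <- randomForest(x = X_train, y = y_train, maxnodes = 80, ntree = 1000)

# Evaluate the tuned model on the held-out test set
final_predictions <- predict(final_rf, X_test)
print(paste0('MAE: ', mae(y_test, final_predictions)))
print(paste0('R2: ', caret::postResample(final_predictions, y_test)['Rsquared']))

Note that training on the full training set with 1,000 trees can take a while; reducing ntree or sampling the training rows keeps the sketch quick to run.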