The open-source project R is among the leading tools for data science and machine learning. Because it is open source, contributions are continuous, and new packages with new features appear frequently. At the time of writing, the CRAN package repository features 12,525 available packages. This post takes a look at the most popular and useful packages that have set the standards for solving data manipulation, visualization, and machine learning problems.
Dplyr is an essential library for fast and easy data wrangling and analysis in R. It is designed to work with tabular data, including remote tables in MySQL, PostgreSQL, SQLite, and Google BigQuery. The distinctive features of dplyr are its simple command syntax and high performance.
The main concept of dplyr is to provide a few simple functions that handle the most common data manipulation problems. These five core verbs are mutate(), select(), filter(), summarise(), and arrange().
There is also the group_by() function, which lets you perform any of these operations by group.
Furthermore, thanks to its C++ backend, dplyr runs all these commands quickly, which makes it especially popular for processing large amounts of data.
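As a rough illustration of how the verbs and group_by() chain together, here is a minimal sketch. It assumes the built-in mtcars dataset; the threshold and the derived column name hp_per_wt are made up for the example.

```r
library(dplyr)

# mtcars ships with R; the column names below come from that dataset.
result <- mtcars %>%
  filter(hp > 100) %>%                   # keep rows matching a condition
  select(mpg, cyl, hp, wt) %>%           # keep only the columns we need
  mutate(hp_per_wt = hp / wt) %>%        # add a derived column (illustrative name)
  group_by(cyl) %>%                      # later steps run once per cylinder count
  summarise(mean_mpg = mean(mpg),        # collapse each group to a single row
            n_cars = n()) %>%
  arrange(desc(mean_mpg))                # order groups by average mileage

print(result)
```

Each step takes a data frame and returns a data frame, which is why the calls compose so easily with the pipe.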
Data.table is a concise library designed for heavy-duty data wrangling. With its help, you can do many operations in just one line. Moreover, in some cases data.table is faster than dplyr, which can be the deciding factor when memory or performance is limited. The library supports operations such as subsetting, grouping, updating, joining, and many more.
Data.table has a very concise general structure: DT[i, j, by]. Here i selects rows, j selects or computes on columns, and by defines the groups. In other words: subset the rows using i, then calculate j, grouped by by.
Sometimes data.table is used together with dplyr.
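The DT[i, j, by] form described above can be sketched as follows. This is only an illustration on the built-in mtcars dataset; the column names model and kpl are assumptions introduced for the example.

```r
library(data.table)

# Convert the built-in mtcars data frame; row names become a "model" column.
dt <- as.data.table(mtcars, keep.rownames = "model")

# i: rows with hp > 100; j: compute summaries; by: one result row per cyl value
res <- dt[hp > 100, .(mean_mpg = mean(mpg), n_cars = .N), by = cyl]

# Calls can be chained; here the summary is sorted by mean_mpg, descending
res <- res[order(-mean_mpg)]

# := adds or updates a column by reference, without copying the table
dt[, kpl := mpg * 0.425144]  # miles per gallon to km per litre

print(res)
```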
Ggplot2 is one of the most popular packages for data visualization among R users. It implements the idea of the grammar of graphics, describing a plot through a system of concepts: data (univariate and multivariate, numerical and categorical), aesthetic mappings, geometric objects, statistical transformations of variables, coordinate systems, and so on. Plots are built layer by layer from these building blocks until you get the desired display. Ggplot2 also takes care of many secondary plot specifications, such as whether a legend is needed, where to place it, or what limits to choose for the axes, which lets you concentrate on the principal task. Ggplot2 is also often used as a foundation for libraries that offer more complex graphs out of the box.
However, the package has limitations: it is not meant for 3-dimensional or interactive graphics.
Image source: RStudio
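The layer-by-layer approach of ggplot2 described above can be sketched like this. The example assumes the built-in mtcars dataset, and the labels and axis limits are arbitrary choices for illustration.

```r
library(ggplot2)

# Data and aesthetic mappings come first; every "+" adds another building block.
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 2) +                                     # geometric object
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # statistical transformation
  scale_x_continuous(limits = c(1, 6)) +                     # axis limits
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       colour = "Cylinders") +
  theme_minimal() +
  theme(legend.position = "bottom")                          # legend placement

print(p)
```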
Ggvis is designed to produce visualizations similar to those of ggplot2, but in an interactive, web-based form. As a visualization backend, ggvis uses Vega, which in turn builds on D3.js. For user interaction, the package relies on Shiny, and for data transformation it uses the dplyr grammar.
Its limitations include the inability to perform complex interactions, such as turning certain layers on or off or switching between datasets, and the need for a connection to a running R session, which is not ideal for publication.
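A minimal sketch of a ggvis plot with interactive controls might look like the following. It assumes the built-in mtcars dataset, and the slider ranges are arbitrary. The sliders only become live when the plot is displayed from a running R session.

```r
library(ggvis)

# ~name maps a column to a property; := sets a raw (unscaled) value.
vis <- mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_smooths(span = input_slider(0.5, 1, value = 1)) %>%    # smoothing controlled by a slider
  layer_points(size := input_slider(100, 1000, value = 100))   # point size controlled by a slider

# Printing `vis` in an interactive session renders the plot with its controls.
```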
Image source: plotly for R
The visualizations plotly produces, and the data behind them, can be viewed and modified in a web browser. Nevertheless, plotly shares a drawback with ggvis: charts that react to server-side logic, such as those embedded in a Shiny app, require a running R session.
Some of the most useful features that can be implemented in tables are filtering, pagination, sorting, and many others. You can also style and edit the table and choose how it is displayed.
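As a rough sketch, an interactive plotly scatter plot can be built as shown below. It assumes the built-in mtcars dataset; the title and axis labels are illustrative only.

```r
library(plotly)

# ~name refers to a column of the data frame passed as the first argument.
p <- plot_ly(mtcars, x = ~wt, y = ~mpg, color = ~factor(cyl),
             type = "scatter", mode = "markers")

# layout() adjusts titles and axes on the existing plot object.
p <- layout(p,
            title = "Mileage vs. weight",
            xaxis = list(title = "Weight (1000 lbs)"),
            yaxis = list(title = "Miles per gallon"))

# Printing `p` opens the chart in the viewer or browser,
# where it can be zoomed, panned, and inspected on hover.
```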
Image source: RStudio Blog
The Classification And REgression Training (caret) package is a set of tools for various machine learning tasks, from data splitting and pre-processing to building predictive models and estimating their performance. In other words, the library combines powerful functions and algorithms for model training and prediction behind one interface. There are 238 models available, although all of them are for regression and classification only. Many popular metrics are implemented in the library, and you can also write your own quality metrics and wrapper methods for models. Caret is also well integrated with other algorithm-specific packages.
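The workflow described above (splitting, pre-processing, training, evaluating) can be sketched as follows. This is a minimal example under assumptions: it uses the built-in iris dataset and a k-nearest-neighbours model, chosen only because it needs no extra packages.

```r
library(caret)

set.seed(42)

# Stratified 80/20 split on the outcome
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[idx, ]
test_set  <- iris[-idx, ]

# 5-fold cross-validation for tuning
ctrl <- trainControl(method = "cv", number = 5)

# Train k-nearest neighbours with centring and scaling as pre-processing
fit <- train(Species ~ ., data = train_set,
             method = "knn",
             preProcess = c("center", "scale"),
             trControl = ctrl,
             tuneLength = 3)

# Evaluate on the held-out rows
pred <- predict(fit, newdata = test_set)
cm <- confusionMatrix(pred, test_set$Species)
print(cm$overall[["Accuracy"]])
```

Swapping the method string is usually all it takes to try a different model with the same splitting and evaluation code.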
Gradient boosting is a machine learning technique for regression and classification problems that builds a prediction model by combining multiple weak models, usually decision trees. The resulting gain in performance makes gradient boosting one of the most powerful predictive tools you can learn to use in machine learning. The gbm package (Generalized Boosted Regression Models) implements an extension to Freund and Schapire's AdaBoost algorithm and Friedman's gradient boosting machine. It provides tools for quick modeling, variable selection, and final-stage precision modeling, ensuring robust and competitive performance. Gbm includes regression methods for least squares, absolute loss, t-distribution loss, quantile regression, logistic, multinomial logistic, Poisson, Cox proportional hazards partial likelihood, AdaBoost exponential loss, Huberized hinge loss, and Learning to Rank measures.
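A minimal gbm sketch for least-squares regression might look like this. It assumes the built-in mtcars dataset; the hyperparameter values are illustrative, not recommendations, and are kept small because the dataset has only 32 rows.

```r
library(gbm)

set.seed(42)

# "gaussian" selects the least-squares loss
fit <- gbm(mpg ~ wt + hp + disp,
           data = mtcars,
           distribution = "gaussian",
           n.trees = 200,            # number of boosting iterations
           interaction.depth = 2,    # depth of each weak tree
           shrinkage = 0.05,         # learning rate
           n.minobsinnode = 3,       # small value because the dataset is tiny
           bag.fraction = 0.8)

# Predictions from the full ensemble
pred <- predict(fit, newdata = mtcars, n.trees = 200)

# Relative influence of each predictor (variable selection aid)
imp <- summary(fit, plotit = FALSE)
print(imp)
```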
Another highly popular, powerful, and versatile machine learning algorithm is Random Forest. It combines several decision trees to build a stronger model and improve generalization in classification and regression tasks. An implementation of this algorithm is provided in the randomForest R package. Each tree is grown on a random sample of the observations, and the final prediction is the output that most trees agree on (for classification) or the average of the trees' outputs (for regression).
It’s important to note that randomForest works with numeric or factor variables. It can also be used in an unsupervised mode for evaluating closeness among data points.
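Both modes mentioned above can be sketched briefly. The example assumes the built-in iris dataset; the number of trees is an arbitrary choice.

```r
library(randomForest)

set.seed(42)

# Supervised mode: classify species from the four numeric measurements
rf <- randomForest(Species ~ ., data = iris, ntree = 200, importance = TRUE)
pred <- predict(rf, newdata = iris)
print(importance(rf))

# Unsupervised mode: no response is supplied, and proximity = TRUE
# returns a matrix measuring how often pairs of rows share a leaf
rf_unsup <- randomForest(x = iris[, 1:4], ntree = 200, proximity = TRUE)
prox <- rf_unsup$proximity
```

The proximity matrix can then be passed to clustering or scaling methods to explore the structure of the data.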
Last but not least on our list is the Extreme Gradient Boosting (xgboost) library, an implementation of the Gradient Boosted Decision Trees algorithm with an R interface. Again, the idea is to build an ensemble of successively refined elementary models to solve supervised machine learning problems. It includes both a linear model solver and tree learning algorithms. Built-in objectives cover regression, classification, and ranking, and you can also define your own objective function. The package is valued for its parallel computing, built-in cross-validation, and regularization. Together, these features give xgboost very high predictive power and very good speed.
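A small sketch of training and cross-validating an xgboost model is shown below. It assumes the built-in mtcars dataset, with the am column (transmission type) as a binary label; all parameter values are illustrative.

```r
library(xgboost)

set.seed(42)

# xgboost expects a numeric matrix and a numeric label vector
X <- as.matrix(mtcars[, c("mpg", "wt", "hp", "disp")])
y <- mtcars$am
dtrain <- xgb.DMatrix(data = X, label = y)

params <- list(objective = "binary:logistic",  # built-in classification objective
               max_depth = 2,
               eta = 0.3,                      # learning rate
               lambda = 1,                     # L2 regularization
               nthread = 2)                    # parallel threads

# Built-in k-fold cross-validation
cv <- xgb.cv(params = params, data = dtrain, nrounds = 20, nfold = 4, verbose = 0)

# Fit the final model and predict class probabilities
bst <- xgb.train(params = params, data = dtrain, nrounds = 20, verbose = 0)
prob <- predict(bst, dtrain)
```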
These 10 packages have proven highly effective and helpful for many data science professionals, and for our team in particular. They are used to solve complicated problems in different areas and to answer a myriad of scientific questions.
Of course, this is a subjective list, and there are many other valuable R libraries available.
So, what are your favorite R packages for data science tasks? Share your experience in the comments.