Your Career Platform for Big Data

Be part of the digital revolution in Switzerland

 

Latest Jobs

Astreya Zürich, ZH, Switzerland
19/06/2019
Full time
Job Description
Astreya Partners is looking for an experienced AVM Technician to join our expanding team onsite with one of our global clients! In this role, you will have the opportunity to work with some of the most innovative technologies in the world. We are seeking an AVM Technician who can provide support for corporate AV/VC environments, and who can confidently and quickly analyze problems using technical diagnostic skills in order to support and resolve issues effectively. Technical responsibilities include high-profile event support; audio, video and conferencing technology; projects; maintenance; and support. Occasional travel to our client's various offices is required.

Responsibilities:
- Installation of all AV equipment and displays, including audio and video conferencing equipment, distributed audio, digital signage, control systems and all presentation systems.
- Installation of structured cabling, including pulling, terminating and testing Cat5e/6, video, RGB, HDMI, DVI and audio cables.
- Ensure that SLAs and daily deliverables are met, including ticket updates.
- Upload Crestron control programs to control processors and work with the remote programming team on control system certification.
- Responsible for all inventory, including weekly/monthly stock levels, product returns (RMA) and inventory DOA issues.
- Follow all policies, standards and safety guidelines that the client requires.
- Problem solving and troubleshooting of any issue that may arise during an installation.
- Event support: perform equipment setups and breakdowns, troubleshoot issues, and perform all necessary pre-event checks as required.
- Liaise with cross-functional teams and internal users before and during events.
- Interface with end users and vendors to address requests and requirements.
- Identify and solve issues that impact the client's conference rooms.

Qualifications
Skills required:
- 5+ years of relevant experience.
- Ability to think and work independently.
- Impeccable wiring and termination skills, as well as proper installation techniques for all AV equipment.
- Excellent written and oral communication skills and strong customer service skills.
- Experience reading and understanding architectural, electrical, structural and AV systems drawings.
- Ability to work day, evening or weekend shifts and travel to support a distant event if required.
- Must have a valid driver's license and a safe driving record.
- Sitting, standing, stooping and bending for long or extended periods of time.
- Dexterity of hands and fingers to operate a computer keyboard, mouse and power tools, and to handle other office equipment.
- Ability to climb ladders and scaffolding, or be transported by lifts.
- Ability to drive and operate a vehicle in a safe manner.
- Ability to regularly lift and transport up to 15 pounds, and frequently lift or transport moderately heavy equipment up to 50 pounds.

Skills preferred:
- CTS and/or CTS-I certification
- DMC-T certification

Additional Information
What can Astreya offer you?
- Working with some of the biggest firms in the world as part of the Astreya delivery network
- Employment in the fast-growing IT space, providing you with brilliant career options for years to come
- Introduction to new ways of working and awesome technologies
- Career paths to help you establish where you want to go
- A company-wide mentoring program to advise you along the way
- Online training courses through CBT Nuggets to upskill you
- A performance management system to provide you with meaningful, actionable feedback
- Dedicated management to provide you with a point of leadership and care
- Internal promotion focus: we love to build people from within
- Numerous on-the-job perks
- Peer recognition
- Market-competitive rates and benefits
UNITAR Geneva, Geneve, Switzerland
19/06/2019
Full time
The United Nations Institute for Training and Research (UNITAR) pursues a strong partnership strategy to deliver on its mandate to strengthen the capacities of beneficiaries through training and related activities, with more than two-thirds of training beneficiary outputs delivered in partnership. The Finance and Budget Unit (FBU) is responsible for the financial management of UNITAR and provides financial services and advice to its clients, namely the staff, auditors, collaborators and beneficiaries of UNITAR. In order to manage resources efficiently and better serve its clients, the FBU is rationalizing the development of its financial reporting tools: simplifying and automating existing financial reporting processes, building a project closure KPI dashboard, and creating a payment advisory system using in-house technologies.

Under the supervision of the Chief, Finance and Budget Unit, the incumbent will be required to perform services in the design and development of the applications and in the training of their end users. The incumbent will:

1. Design and develop solutions by:
- Collecting user requirements and designing and building a financial statements automation model using the latest in-house technologies, requiring no ongoing user maintenance and superseding the previous lengthy, manual process.
- Developing a project closure KPI dashboard consisting of tabular views and interactive components that allow the selection of different data views.
- Developing an automated payment advisory system using the latest in-house technologies.
- Producing mockups and prototypes to give clients an idea of the outcomes of the project.
- Working in a collaborative manner to incorporate feedback.

2. Support and provide training on the tools developed by:
- Providing interactive support in UAT (User Acceptance Testing), debugging the models, and supporting the testing and enhancement of the developed solutions.
- Providing guidance on how to operate and maintain the developed tools.
- Providing user support and troubleshooting incidents and problems during the beta release.
- Providing recommendations on the performance, security and user interface of the solutions.

The work implies frequent interaction with the managers and focal points of the Finance and Budget Unit with whom projects are undertaken, in order to discuss the most appropriate solutions for a given set of requirements.

Results Expected
The work of the Business Intelligence Developer directly impacts the delivery of services that satisfy the requirements of the Finance and Budget Unit.

Competencies
Professionalism — Excellent knowledge in the field of Business Intelligence development, particularly of the latest BI technologies; very good knowledge of general IT concepts and techniques. Working experience in the following:
- Proficiency in the development of BI solutions using Power BI
- Front-end development of interfaces using Microsoft technologies
- Experience with the Microsoft Office 365 platform and SharePoint web services/APIs
- Experience in cross-application programming using VBA
- Proficiency in the "M" query language
- Proficiency in DAX and MDX expressions
- Ability to produce mockups and design user interfaces
- Ability to query databases and model data
- Working knowledge of Office.js

Planning & Organizing — Sound planning and organizational skills and the ability to prioritize one's own work, delivering quality results, including when working under pressure.
Communications — Excellent communication skills (spoken, written and presentation), including the ability to draft and edit a variety of written documents, such as proposals for solutions to be developed and documentation of the components and packages used in projects.
Technology Awareness — Fully proficient computer skills, with the ability to use a variety of software and applications.
Teamwork — Strong interpersonal skills; ability to deal effectively with multiple constituencies and to establish and maintain effective working relations in a multi-cultural, multi-ethnic environment with sensitivity and respect for diversity.

Qualifications
Minimum requirements
Education: High school diploma, with the required skills in advanced Microsoft Excel, Power BI tools, Office 365 and VBA developed and demonstrated through professional work experience, project work and/or consultancy assignments.
Experience: Minimum 4-5 years' experience in Business Intelligence development and IT support.
Language: Fluency in oral and written English is required.

Desired requirements
Knowledge of French or another official UN language would be an advantage.

Performance Indicators for Evaluation of Results
- All deliverables submitted within the established timeframe
- Pilot testing and release of the specified tools for use, to the satisfaction of the Unit personnel

Reporting structure
The consultant will report directly to the Chief, Finance and Budget Unit. General conditions of contracts for the services of consultants shall apply.

How to apply
All applicants are strongly encouraged to apply as soon as possible after the vacancy has been posted and well before the deadline stated in the vacancy announcement. Expressions of interest should include a letter of motivation and a Curriculum Vitae and be sent to fbu@unitar.org with the following subject line:
Amaris Basel, BS, Switzerland
19/06/2019
Full time
Amaris is an independent, international Technologies and Management Consulting Group. Created in 2007, Amaris is already established in over 50 countries and supports more than 850 clients worldwide throughout their projects' lifecycles. Our expertise covers five areas of innovation: Business and Management, Information Technologies, Engineering and High Technologies, Telecommunications, and Biotech and Pharma. With more than 65 offices across the world, the Group offers proximity support to its clients in all their locations and many opportunities for international careers for its employees. In 2019, Amaris aims to reach a turnover of 350 million euros and 6,500 employees, and to grow its workforce with an anticipated 2,000 further job openings. We expect to triple our workforce within the next few years and reach a leading international position in independent consulting.

Your Role
For one of our most important clients, operating in the pharma/chemical sector, we are looking for IT consultants/managers who are experts in process optimisation and digitalisation. The selected candidates will join a dynamic and exciting team working on a global process optimisation project within the Innovation & Research division.

You are an outstanding IT professional with a minimum of 5 years' experience in:
- Project Management
- Business Process Reengineering
- Process Optimisation
- Digital Innovation

Experience in data science and artificial intelligence topics is a plus. You are a great team player and an out-of-the-box thinker, accustomed to working under pressure and in a challenging environment, with good problem-solving capabilities. Experience in the pharma/chemical sector and fluency in English complete the profile. Fluency in German and/or French is a great plus.

Workplace: Basel city
Vans Stabio, Ticino, Switzerland
19/06/2019
Full time
Job Description
We're looking for an outstanding Lead Data Scientist to join the VF Central Data Analytics team in Ticino, Switzerland. You may ask, "Just who is VF?" Well, we are the global company behind some of the world's leading lifestyle brands: household names such as Vans, Timberland, The North Face, Napapijri, Eastpak and Kipling. As one of the largest apparel providers in the world, we are passionate about finding great people to join our extended family.

Let's talk about the role...
The Lead Data Scientist will drive the EMEA Direct-to-Consumer Data Science practice, in collaboration with the Global Team. Reporting to the EMEA Data & Analytics Lead, the Lead Data Scientist will be responsible for:
- Applying data modelling and analytical pipelines to large data sets
- Managing the end-to-end lifecycle of advanced data science products
- Proving business impact by leading scalable data initiatives across different domains, engaging different stakeholders and teams to increase efficiency and productivity
- Creating and owning the right set of metrics and audit mechanisms to track the performance of such initiatives
- Dealing with ambiguity, conducting deep-dive analyses of business problems, and formulating conclusions
- Providing leadership for the data science domain, guiding and inspiring the data science team, identifying new analytics opportunities, and engaging with business and technology partners

Analytics tracks: Marketing, Demand Forecasting, Consumer Lifecycle Management, Pricing & Stock Optimization, Channel Performance.

You will make the difference through your:
- MSc or PhD in a quantitative or scientific discipline (Mathematics, Statistics, Physics, Engineering, Computer Science, Econometrics or a strongly related subject)
- 5-7 years of proven experience in data science initiatives in the retail space, with proven experience bringing analytical solutions to life
- Good knowledge of data structures and technologies (relational DBs, file formats, NoSQL DBs, etc.) related to Big Data and cloud computing (Hadoop, MapReduce, Spark)
- Mastery of a few of the following languages: Python, Java, Scala, R, Matlab, Octave, as well as advanced SQL knowledge
- Experience managing a data science team
- Full proficiency in technical and business English

What's in it for you?
Most companies like to say they offer a competitive salary, an amazing bonus and pension scheme as well as staff discounts (by the way, we offer 50%!). We do too, only quite differently, because it's not just our products that set us apart from others. It's our people, and we believe they deserve to be nurtured and looked after. That's why, on top of the usual benefits, we offer much more:
- Career ownership, enabling you to build your knowledge and experience across different brands and even different countries
- A supportive, feedback-based culture where respect and integrity guide us in what we do
- Tailored training: from a thorough induction to ongoing online and face-to-face training, we are committed to helping you grow, both professionally and personally
- An inclusive environment where people of diverse backgrounds, lifestyles and nationalities love working together
- An on-site gym offering health and well-being initiatives
- A subsidised canteen as well as break-out areas offering complimentary hot drinks

DataCareer Blog

Among the variety of open-source relational databases, PostgreSQL is probably one of the most popular thanks to its functional capabilities, which is why it is frequently used wherever databases are involved. In this article, we will go through connecting to and using PostgreSQL from R. R is an open-source language for statistical and graphical data analysis that provides scientists, statisticians, and academics with powerful tools for all kinds of data manipulation; besides, it allows creating and running simulations of real-world data. R is usually used together with the RStudio IDE, so that is what we will use while connecting to and working with PostgreSQL.

PostgreSQL deployment in R

One of the great things about the R language is that it has packages for almost every kind of need. Moreover, the package library is constantly growing, as packages are created and developed by the community. Two main packages can be found in the library for connecting to PostgreSQL from the R environment: RPostgreSQL and RPostgres. Both provide great functionality for database interaction; the difference lies only in the way they are installed.

The RPostgreSQL package is available on CRAN, the Comprehensive R Archive Network, and is installed with the following command run in the IDE:

install.packages('RPostgreSQL')

As for the RPostgres package, it can be installed in two ways: from GitHub or directly from CRAN. To install the package from GitHub, the devtools and remotes packages must be installed first:

install.packages('devtools')
install.packages('remotes')

Then, to install the package, run:

remotes::install_github("r-dbi/RPostgres")

To install the package from CRAN, the usual command is used:

install.packages('RPostgres')

The difference between these two ways is that CRAN stores the latest stable version of a package, while on GitHub users can find the latest development version. In truth, the RPostgreSQL and RPostgres packages do not differ in the way they connect to a PostgreSQL database: they both build on the DBI package, which provides a wide range of methods and classes for establishing connections with databases. Note: we used the RPostgres package for establishing the connection.

Establishing a basic connection with the database using R

The RPostgres package connects with the following command:

con <- dbConnect(RPostgres::Postgres())

With the following steps you can set up the connection to a specific database:

library(DBI)
db <- 'DATABASE'  # provide the name of your db
host_db <- 'HOST'  # i.e. 'ec2-54-83-201-96.compute-1.amazonaws.com'
db_port <- '98939'  # or any other port specified by the DBA
db_user <- 'USERNAME'
db_password <- 'PASSWORD'
con <- dbConnect(RPostgres::Postgres(), dbname = db, host = host_db, port = db_port, user = db_user, password = db_password)

To check that the connection is established, we can run the dbListTables(con) function, which returns the list of tables in our database. As you can see, no tables are stored in our database yet, so now it's time to create one.

Working with the database

As we've already mentioned, R comes with a great pack of built-in example datasets that can be used directly from the IDE without downloading them first. For our examples, we will use the popular "mtcars" dataset, which contains data from the 1974 Motor Trend magazine car road tests. Let's first add it to the database and then check whether it has appeared there. The basic command to add "mtcars" to our database is dbWriteTable(con, "mtcars", mtcars), but we will do a little trick that makes our table a bit more readable: we store the dataset in a data frame, turn the car names (kept as row names in mtcars) into a first column named 'carname', and then remove the initial dataset with the rm(mtcars) command, as the data is now stored in the variable my_data. Using the dbWriteTable method, we can then write our data frame to a PostgreSQL table and check how the table looks, as sketched below.
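A minimal sketch of this step, assuming the con connection set up above; the variable my_data and the table name "cars" are the names the rest of the post refers to:

# Turn the car names (stored as row names in mtcars) into a proper first column
my_data <- data.frame(carname = rownames(mtcars), mtcars, row.names = NULL)

# Remove the initial dataset from the workspace (only needed if a copy of
# mtcars exists there); the data now lives in my_data
rm(mtcars)

# Write the data frame to a PostgreSQL table named "cars"
dbWriteTable(con, "cars", my_data)

# Check that the table has appeared and take a look at its contents
dbListTables(con)
head(dbReadTable(con, "cars"))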
Having a table in the database, we can now explore queries. For working with queries, two basic methods are needed: dbGetQuery and dbSendQuery. The dbGetQuery method executes a query and returns all the results in a data frame. The dbSendQuery method only registers the query; the results then have to be retrieved with dbFetch, which also allows fetching the results in batches by setting its parameters.

A database table should have a primary key, basically a unique identifier for every record in the table. Let's assign the names of the cars in our table as the primary key using the dbGetQuery method:

dbGetQuery(con, 'ALTER TABLE cars ADD CONSTRAINT cars_pk PRIMARY KEY ("carname")')

We have already used the dbReadTable method, but let's return to it briefly to clarify how it works. The dbReadTable method returns an overview of the data stored in the database and does basically the same as dbGetQuery(con, 'SELECT * FROM cars'). It should be noted that after a dbSendQuery request, the dbClearResult method must be called to remove the pending result of the query from the connection. The dbGetQuery method does this by default, so there is no need to call dbClearResult after it executes.

Creating basic queries

Creating queries for a customized data table works basically the same way as in SQL; the only difference is that the results of queries in R are stored in a variable. First, we register the query selecting the needed data from our cars table with dbSendQuery and store the result object in a variable. Then, we retrieve the data with dbFetch into a resulting variable, from which we can create a new table in our database and analyze the output of our query (calling dbClearResult on the result object when we are done). Finally, the connection must be closed with the dbDisconnect(con) method.

Conclusion

In this article, we covered the basics of connecting to and using PostgreSQL from the R environment. Knowing the essentials of SQL syntax, querying and modifying data from R is enough to connect to any standard database. Nevertheless, we suggest reading through the package documentation, which will give you more insight into how to query data from PostgreSQL into the R environment.
What is Exploratory Data Analysis

Exploratory data analysis (EDA) is a powerful tool for a comprehensive study of the available information, providing answers to basic data analysis questions. What distinguishes it from traditional analysis based on testing an a priori hypothesis is that EDA makes it possible to detect, using various methods, all potential systematic correlations in the data. Exploratory data analysis is practically unlimited in time and methods, allowing you to identify curious data fragments and correlations. You are therefore able to examine the information more deeply and accurately, as well as choose a proper model for further work. In the Python environment, there is a wide range of libraries that can not only ease but also streamline the process of exploring a dataset. We will use the Google Play Store Apps dataset and go through the main tasks of exploratory analysis to find out whether there are any trends that can facilitate the process of setting and resolving a business problem.

Data overview

Before we start exploring our data, we must import the dataset and the Python libraries needed for further work. We will use the pandas library, a very powerful tool for comprehensive data analysis.

In [1]: import pandas as pd
In [2]: googleplaystore = pd.read_csv("googleplaystore.csv")

Let's explore the structure of our dataframe by viewing the first and the last 10 rows.

In [3]: googleplaystore.head(10)
Out[3]:
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver
0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up
1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up
2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up
3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up
4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up
5 | Paper flowers instructions | ART_AND_DESIGN | 4.4 | 167 | 5.6M | 50,000+ | Free | 0 | Everyone | Art & Design | March 26, 2017 | 1.0 | 2.3 and up
6 | Smoke Effect Photo Maker - Smoke Editor | ART_AND_DESIGN | 3.8 | 178 | 19M | 50,000+ | Free | 0 | Everyone | Art & Design | April 26, 2018 | 1.1 | 4.0.3 and up
7 | Infinite Painter | ART_AND_DESIGN | 4.1 | 36815 | 29M | 1,000,000+ | Free | 0 | Everyone | Art & Design | June 14, 2018 | 6.1.61.1 | 4.2 and up
8 | Garden Coloring Book | ART_AND_DESIGN | 4.4 | 13791 | 33M | 1,000,000+ | Free | 0 | Everyone | Art & Design | September 20, 2017 | 2.9.2 | 3.0 and up
9 | Kids Paint Free - Drawing Fun | ART_AND_DESIGN | 4.7 | 121 | 3.1M | 10,000+ | Free | 0 | Everyone | Art & Design;Creativity | July 3, 2018 | 2.8 | 4.0.3 and up

In [4]: googleplaystore.tail(10)
Out[4]:
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver
10831 | payermonstationnement.fr | MAPS_AND_NAVIGATION | NaN | 38 | 9.8M | 5,000+ | Free | 0 | Everyone | Maps & Navigation | June 13, 2018 | 2.0.148.0 | 4.0 and up
10832 | FR Tides | WEATHER | 3.8 | 1195 | 582k | 100,000+ | Free | 0 | Everyone | Weather | February 16, 2014 | 6.0 | 2.1 and up
10833 | Chemin (fr) | BOOKS_AND_REFERENCE | 4.8 | 44 | 619k | 1,000+ | Free | 0 | Everyone | Books & Reference | March 23, 2014 | 0.8 | 2.2 and up
10834 | FR Calculator | FAMILY | 4.0 | 7 | 2.6M | 500+ | Free | 0 | Everyone | Education | June 18, 2017 | 1.0.0 | 4.1 and up
10835 | FR Forms | BUSINESS | NaN | 0 | 9.6M | 10+ | Free | 0 | Everyone | Business | September 29, 2016 | 1.1.5 | 4.0 and up
10836 | Sya9a Maroc - FR | FAMILY | 4.5 | 38 | 53M | 5,000+ | Free | 0 | Everyone | Education | July 25, 2017 | 1.48 | 4.1 and up
10837 | Fr. Mike Schmitz Audio Teachings | FAMILY | 5.0 | 4 | 3.6M | 100+ | Free | 0 | Everyone | Education | July 6, 2018 | 1.0 | 4.1 and up
10838 | Parkinson Exercices FR | MEDICAL | NaN | 3 | 9.5M | 1,000+ | Free | 0 | Everyone | Medical | January 20, 2017 | 1.0 | 2.2 and up
10839 | The SCP Foundation DB fr nn5n | BOOKS_AND_REFERENCE | 4.5 | 114 | Varies with device | 1,000+ | Free | 0 | Mature 17+ | Books & Reference | January 19, 2015 | Varies with device | Varies with device
10840 | iHoroscope - 2018 Daily Horoscope & Astrology | LIFESTYLE | 4.5 | 398307 | 19M | 10,000,000+ | Free | 0 | Everyone | Lifestyle | July 25, 2018 | Varies with device | Varies with device

We can see that the googleplaystore dataframe has missing values. For a more complete view of the data, let's do a few more things. Firstly, we will use the describe() pandas method, which gives us a statistical summary of the numerical columns in our dataset. We can also use the info() method to check the data type of each column as well as missing values, and the shape attribute to retrieve the number of rows and columns in the dataframe.
In [5]: googleplaystore.describe()
Out[5]:
Rating: count 9367.000000, mean 4.193338, std 0.537431, min 1.000000, 25% 4.000000, 50% 4.300000, 75% 4.500000, max 19.000000

In [6]: googleplaystore.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
App               10841 non-null object
Category          10841 non-null object
Rating            9367 non-null float64
Reviews           10841 non-null object
Size              10841 non-null object
Installs          10841 non-null object
Type              10840 non-null object
Price             10841 non-null object
Content Rating    10840 non-null object
Genres            10841 non-null object
Last Updated      10841 non-null object
Current Ver       10833 non-null object
Android Ver       10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB

In [7]: googleplaystore.shape
Out[7]: (10841, 13)

In [8]: googleplaystore.dtypes
Out[8]:
App                object
Category           object
Rating             float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

So, what information do we have after these small actions? Firstly, we have a number of apps divided into various categories. Secondly, although columns such as "Reviews" contain numeric data, they are stored as a non-numeric type, which can cause problems during further data processing. We are also interested in the total number of apps and the available categories in the dataset. To get the exact number of apps, we will count all the unique values in the corresponding column.

In [9]: len(googleplaystore["App"].unique())
Out[9]: 9660

In [10]: unique_categories = googleplaystore["Category"].unique()
In [11]: unique_categories
Out[11]:
array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY', 'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION', 'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE', 'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME', 'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL', 'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL', 'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER', 'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION', '1.9'], dtype=object)

Duplicate records removal

Duplicate records usually appear in datasets, and they can degrade the quality and accuracy of the exploration. Such data also clogs the dataset, so we need to get rid of it.

In [14]: googleplaystore.drop_duplicates(keep='first', inplace=True)
In [15]: googleplaystore.shape
Out[15]: (10358, 13)

For removing rows with duplicates from a dataset, pandas has the powerful and customizable method drop_duplicates(), which takes parameters that control how the dataset is cleaned. keep='first' means that the method will drop all duplicates found in the dataset while keeping only the first occurrence of each; inplace=True means that all the manipulations are performed and stored directly in the dataframe we are currently using. As we can see above, our initial googleplaystore dataset contained 10841 rows; after removing duplicates, the number of rows decreased to 10358.

NA analysis

Another common problem of almost every dataset is columns with missing values. We will explore only the most common ways to clean a dataset of missing values. Firstly, let's look at the total number of missing values in every column.
One of the great things about pandas is that it allows users to combine various operations in a single action, which brings great optimization opportunities and makes the code more compact.

In [14]: googleplaystore.isnull().sum().sort_values(ascending=False)
Out[14]:
Rating            1465
Current Ver          8
Android Ver          3
Content Rating       1
Type                 1
Last Updated         0
Genres               0
Price                0
Installs             0
Size                 0
Reviews              0
Category             0
App                  0
dtype: int64

Now, let's get rid of all the rows with missing values. Although some statistical approaches allow us to impute missing data with some value (like the most common value or the mean), today we will work only with cleaned data. The pandas dropna() method also allows users to set parameters for the appropriate processing depending on the expected result. Here we state that every row containing any NA values must be dropped, and that all the changes are stored directly in our dataframe.

In [16]: googleplaystore.dropna(how='any', inplace=True)

Let's now check the shape of the dataframe after all the cleaning manipulations were performed.

In [17]: googleplaystore.shape
Out[17]: (8886, 13)

If we look closer at our dataset and at the result of the dtypes method, we see that columns like "Reviews", "Size", "Price" and "Installs" should definitely hold numeric values. So, let's see what values each column has in order to plan our further manipulations.

In [18]: googleplaystore.Price.unique()
Out[18]: array(['0', '$4.99', '$3.99', '$6.99', '$7.99', '$5.99', '$2.99', '$3.49', '$1.99', '$9.99', '$7.49', '$0.99', '$9.00', '$5.49', '$10.00', '$24.99', '$11.99', '$79.99', '$16.99', '$14.99', '$29.99', '$12.99', '$2.49', '$10.99', '$1.50', '$19.99', '$15.99', '$33.99', '$39.99', '$3.95', '$4.49', '$1.70', '$8.99', '$1.49', '$3.88', '$399.99', '$17.99', '$400.00', '$3.02', '$1.76', '$4.84', '$4.77', '$1.61', '$2.50', '$1.59', '$6.49', '$1.29', '$299.99', '$379.99', '$37.99', '$18.99', '$389.99', '$8.49', '$1.75', '$14.00', '$2.00', '$3.08', '$2.59', '$19.40', '$3.90', '$4.59', '$15.46', '$3.04', '$13.99', '$4.29', '$3.28', '$4.60', '$1.00', '$2.95', '$2.90', '$1.97', '$2.56', '$1.20'], dtype=object)

In [19]: googleplaystore.Installs.unique()
Out[19]: array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+', '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+', '1,000,000,000+', '1,000+', '500,000,000+', '100+', '500+', '10+', '5+', '50+', '1+'], dtype=object)

In [20]: googleplaystore.Size.unique()
Out[20]: array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M', '28M', '12M', '20M', '21M', '37M', '5.5M', '17M', '39M', '31M', '4.2M', '23M', '6.0M', '6.1M', '4.6M', '9.2M', '5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M', '1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k', '3.6M', '5.7M', '8.6M', '2.4M', '27M', ..., '676k', '552k', '582k', '619k'], dtype=object)
(the several hundred remaining distinct values are omitted here for brevity; they all follow the same pattern of megabyte ('M') and kilobyte ('k') sizes)

First of all, let's get rid of the dollar sign in the "Price" column and convert the values to a numeric type.

In [21]: googleplaystore['Price'] = googleplaystore['Price'].apply(lambda x: x.replace('$', '') if '$' in str(x) else x)
googleplaystore['Price'] = googleplaystore['Price'].apply(lambda x: float(x))

Now, we will work with the "Installs" column. We must get rid of the plus sign and the thousands separators and convert the values to numeric.

In [22]: googleplaystore['Installs'] = googleplaystore['Installs'].apply(lambda x: x.replace('+', '') if '+' in str(x) else x)
googleplaystore['Installs'] = googleplaystore['Installs'].apply(lambda x: x.replace(',', '') if ',' in str(x) else x)
googleplaystore['Installs'] = googleplaystore['Installs'].apply(lambda x: int(x))

Also, convert the "Reviews" column to a numeric type.

In [23]: googleplaystore['Reviews'] = googleplaystore['Reviews'].apply(lambda x: int(x))

Finally, let's work with the "Size" column, as it needs a more complex approach. This column contains various kinds of data: among the numeric values, which can be in either Mb or Kb, there are null values and strings. Moreover, we need to deal with the difference between values written in Mb and in Kb.
In [24]: googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: str(x).replace('Varies with device', 'NaN') if 'Varies with device' in str(x) else x)
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: str(x).replace('M', '') if 'M' in str(x) else x)
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: str(x).replace(',', '') if ',' in str(x) else x)
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: float(str(x).replace('k', '')) / 1000 if 'k' in str(x) else x)
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: float(x))

Let's call the describe() method one more time. As we can see, we now have a statistical summary for all the columns that should contain numeric values.

In [25]: googleplaystore.describe()
Out[25]:
        Rating       Reviews       Size         Installs      Price
count   8886.000000  8.886000e+03  7418.000000  8.886000e+03  8886.000000
mean    4.187959     4.730928e+05  22.760829    1.650061e+07  0.963526
std     0.522428     2.906007e+06  23.439210    8.640413e+07  16.194792
min     1.000000     1.000000e+00  0.008500     1.000000e+00  0.000000
25%     4.000000     1.640000e+02  5.100000     1.000000e+04  0.000000
50%     4.300000     4.723000e+03  14.000000    5.000000e+05  0.000000
75%     4.500000     7.131325e+04  33.000000    5.000000e+06  0.000000
max     5.000000     7.815831e+07  100.000000   1.000000e+09  400.000000

Building visualizations

Visualization is probably one of the most useful approaches in data analysis. Not all correlations and dependencies can be seen in tabular data, so various plots and diagrams can help to depict them clearly. Let's go through the different ways we can explore the categories.

Exploring which categories have the largest number of apps

One of the fanciest ways to visualize such data is a word cloud. With a few lines of code, we can create an illustration that shows which categories have the largest number of apps.

In [30]: import matplotlib.pyplot as plt
import wordcloud
from wordcloud import WordCloud
import seaborn as sns
color = sns.color_palette()
%matplotlib inline

In [33]: from plotly import tools
from plotly.offline import iplot, init_notebook_mode
from IPython.display import Image
import plotly.offline as py
import plotly.graph_objs as go
import plotly.io as pio
import numpy as np
py.init_notebook_mode()

In [34]: wc = WordCloud(max_font_size=250, collocations=False, max_words=33, width=1600, height=800, background_color="white").generate(' '.join(googleplaystore['Category']))
plt.figure(figsize=(20,10))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

Exploring app ratings across the top categories

In [35]: groups = googleplaystore.groupby('Category').filter(lambda x: len(x) > 286).reset_index()
array = groups['Rating'].hist(by=groups['Category'], sharex=True, figsize=(20,20))

As we can see, average app ratings differ quite a bit across the categories.

Average rating of all the apps

And what insight do we get if we explore the average rating of all of the apps?

In [36]: avg_rate_data = go.Figure()
avg_rate_data.add_histogram(x=googleplaystore.Rating, xbins={'start': 1, 'size': 0.1, 'end': 6})
iplot(avg_rate_data)

In [38]: img_bytes = pio.to_image(avg_rate_data, format='png', width=1600, height=800, scale=2)
In [39]: Image(img_bytes)

As we can see, most of the apps clearly hold a rating above 4.0! Actually, quite a lot of apps seem to have a 5.0 rating. Let's check how many apps have the highest possible rating.
In [40]: googleplaystore.Rating[googleplaystore['Rating'] == 5].count()
Out[40]: 271

But does any feature of the dataset really affect the apps' rating? Let's try to figure out how size, number of installs, reviews, and price correlate with each other, and then explore the impact of each feature on the rating. First of all, let's build a heatmap. For exploring correlations between features, a heatmap is among the best visual tools: the individual values in the data matrix are represented by different colors, which helps to quickly see which features have the strongest and the weakest dependencies.

In [41]: sns.heatmap(googleplaystore.corr(), annot=True, linewidth=0.5)
Out[41]: <matplotlib.axes._subplots.AxesSubplot at 0x11f75fbe0>

A positive correlation of 0.62 exists between the number of reviews and the number of installations, which means that customers tend to download a given app more if it has been reviewed by a larger number of people. It also means that many of the active users who download an app give feedback on it.

Sizing strategy: how does the size of the app impact its rating?

Despite the fact that modern phones and tablets have enough memory to deal with various kinds of tasks and store gigabytes of data, the size of an app still matters. Let's explore whether this value really affects the app rating or not. To find an answer to this question, we will use a scatterplot, which is definitely the most common and informative way to see how two variables correlate.

In [42]: groups = googleplaystore.groupby('Category').filter(lambda x: len(x) >= 50).reset_index()

In [43]: sns.set_style("whitegrid")
ax = sns.jointplot(googleplaystore['Size'], googleplaystore['Rating'])

/anaconda3/lib/python3.7/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.

As we can see, most of the apps with the highest rating have a size between approximately 20 Mb and 40 Mb.

Pricing: how does price affect app rating?

In [44]: paid_apps = googleplaystore[googleplaystore.Price > 0]
p = sns.jointplot("Price", "Rating", paid_apps)

So, the top-rated apps do not carry big prices: only a few apps have a price higher than $20.

Pricing across categories

In [45]: sns.set_style('whitegrid')
fig, ax = plt.subplots()
fig.set_size_inches(15, 8)
p = sns.stripplot(x="Price", y="Category", data=googleplaystore, jitter=True, linewidth=1)
title = ax.set_title('App pricing trends across categories')

As we can see, there are apps with a price higher than $200! Let's see which categories these apps belong to.

In [46]: googleplaystore[['Category', 'App']][googleplaystore.Price > 200].groupby(["Category"], as_index=False).count()
Out[46]:
Category   App
FAMILY       4
FINANCE      6
LIFESTYLE    5

Price vs. installs: are free apps downloaded more than paid ones?

To visualize the answer we will use a boxplot, so that we can compare the range and distribution of the number of downloads for paid and free apps. Boxplots also help to answer questions like:
- what are the key values (average, median, first quartile, and so on)
- does our data have outliers, and what are their values
- whether our data is symmetric
- how tightly the data is grouped
- whether the data is skewed and, if so, in which direction
In [47]: trace0 = go.Box(y=np.log10(googleplaystore['Installs'][googleplaystore.Type=='Paid']), name='Paid', marker=dict(color='rgb(214, 12, 140)'))
trace1 = go.Box(y=np.log10(googleplaystore['Installs'][googleplaystore.Type=='Free']), name='Free', marker=dict(color='rgb(0, 128, 128)'))
layout = go.Layout(title="Paid apps vs. free apps", yaxis={'title': 'Downloads (log-scaled)'})
data = [trace0, trace1]
iplot({'data': data, 'layout': layout})

As we can see, paid apps are downloaded less frequently than free ones.

Conclusion

Exploratory data analysis is an inherent part of data exploration that helps you gain general knowledge of the dataset you work with, as well as find its basic concepts and outlines to get first insights. In this tutorial we walked through the general approaches to initial data exploration using the app category and rating columns as examples. However, there are a lot of other interesting dependencies and correlations left within the other columns. The dataset we used is available via the following link: https://www.kaggle.com/lava18/google-play-store-apps/activity
Introduction

Exploratory data analysis (EDA) is an approach to data analysis that summarizes the main characteristics of the data. It can be performed using various methods, among which data visualization takes a prominent place. The idea of EDA is to recognize what information the data can give us beyond the formal modeling or hypothesis-testing task. In other words, if initially we have no, or not enough, a priori ideas about the patterns and the nature of the relationships within the data, exploratory data analysis comes to the rescue, allowing us to identify the main tendencies, properties, and nature of the information. In return, based on the information obtained, the researcher is able to evaluate the structure and nature of the available data, which can ease the search for and identification of the questions and the purpose of the data exploration. So, EDA is a crucial step before feature engineering, and it can involve part of the data preprocessing. In this tutorial, we will show you how to perform simple EDA using the Google Play Store Apps Data Set.

To begin with, let's install and load all the necessary libraries that we will need.

# Remove warnings
options(warn=-1)
# Load libraries
require(ggplot2)
require(highcharter)
require(dplyr)
require(tidyverse)
require(corrplot)
require(RColorBrewer)
require(xts)
require(treemap)
require(lubridate)

Data overview

Insights into the Play Store apps can tell developers a lot about the Android market. Each row of the dataset has values for the category, rating, size, and other app characteristics. Here are the columns of our dataset:

- App - name of the application.
- Category - category of the app.
- Rating - application's rating on the Play Store.
- Reviews - number of the app's reviews.
- Size - size of the app.
- Installs - number of installs of the app.
- Type - whether the app is free or paid.
- Price - price of the app (0 if free).
- Content Rating - target audience of the app.
- Genres - genre the app belongs to.
- Last Updated - date the app was last updated.
- Current Ver - current version of the application.
- Android Ver - minimum Android version required to run the app.

Now, let's load the data and view the first rows. For that, we use the head() function:

df <- read.csv("googleplaystore.csv", na.strings = c("NaN","NA",""))
head(df)

## App Category
## 1 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN
## 2 Coloring book moana ART_AND_DESIGN
## 3 U Launcher Lite â\200“ FREE Live Cool Themes, Hide Apps ART_AND_DESIGN
## 4 Sketch - Draw & Paint ART_AND_DESIGN
## 5 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN
## 6 Paper flowers instructions ART_AND_DESIGN
## Rating Reviews Size Installs Type Price Content.Rating
## 1 4.1 159 19M 10,000+ Free 0 Everyone
## 2 3.9 967 14M 500,000+ Free 0 Everyone
## 3 4.7 87510 8.7M 5,000,000+ Free 0 Everyone
## 4 4.5 215644 25M 50,000,000+ Free 0 Teen
## 5 4.3 967 2.8M 100,000+ Free 0 Everyone
## 6 4.4 167 5.6M 50,000+ Free 0 Everyone
## Genres Last.Updated Current.Ver
## 1 Art & Design January 7, 2018 1.0.0
## 2 Art & Design;Pretend Play January 15, 2018 2.0.0
## 3 Art & Design August 1, 2018 1.2.4
## 4 Art & Design June 8, 2018 Varies with device
## 5 Art & Design;Creativity June 20, 2018 1.1
## 6 Art & Design March 26, 2017 1.0
## Android.Ver
## 1 4.0.3 and up
## 2 4.0.3 and up
## 3 4.0.3 and up
## 4 4.2 and up
## 5 4.4 and up
## 6 2.3 and up

It's useful to see the data format before performing the analysis. We can also review the data by column type using the str function:

str(df)

## 'data.frame': 10841 obs. of 13 variables:
## $ App : Factor w/ 9660 levels "- Free Comics - Comic Apps",..: 7229 2563 8998 8113 7294 7125 8171 5589 4948 5826 ...
## $ Category : Factor w/ 34 levels "1.9","ART_AND_DESIGN",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews : Factor w/ 6002 levels "0","1","10","100",..: 1183 5924 5681 1947 5924 1310 1464 3385 816 485 ...
## $ Size : Factor w/ 462 levels "1,000+","1.0M",..: 55 30 368 102 64 222 55 118 146 120 ...
## $ Installs : Factor w/ 22 levels "0","0+","1,000,000,000+",..: 8 20 13 16 11 17 17 4 4 8 ...
## $ Type : Factor w/ 3 levels "0","Free","Paid": 2 2 2 2 2 2 2 2 2 2 ...
## $ Price : Factor w/ 93 levels "$0.99","$1.00",..: 92 92 92 92 92 92 92 92 92 92 ...
## $ Content.Rating: Factor w/ 6 levels "Adults only 18+",..: 2 2 2 5 2 2 2 2 2 2 ...
## $ Genres : Factor w/ 120 levels "Action","Action;Action & Adventure",..: 10 13 10 10 12 10 10 10 10 12 ...
## $ Last.Updated : Factor w/ 1378 levels "1.0.19","April 1, 2016",..: 562 482 117 825 757 901 76 726 1317 670 ...
## $ Current.Ver : Factor w/ 2832 levels "0.0.0.2","0.0.1",..: 120 1019 465 2825 278 114 278 2392 1456 1430 ...
## $ Android.Ver : Factor w/ 33 levels "1.0 and up","1.5 and up",..: 16 16 16 19 21 9 16 19 11 16 ...

As you can see, we get similar information as with the head function, but here we are concentrating on the data types rather than the content. Now, we will use summary(), a function that produces summaries of the results of various model-fitting functions (and of data frames):

summary(df)

## App
## ROBLOX : 9
## CBS Sports App - Scores, News, Stats & Watch Live: 8
## 8 Ball Pool : 7
## Candy Crush Saga : 7
## Duolingo: Learn Languages Free : 7
## ESPN : 7
## (Other) :10796
## Category Rating Reviews
## FAMILY :1972 Min. : 1.000 0 : 596
## GAME :1144 1st Qu.: 4.000 1 : 272
## TOOLS : 843 Median : 4.300 2 : 214
## MEDICAL : 463 Mean : 4.193 3 : 175
## BUSINESS : 460 3rd Qu.: 4.500 4 : 137
## PRODUCTIVITY: 424 Max. :19.000 5 : 108
## (Other) :5535 NA's :1474 (Other):9339
## Size Installs Type Price
## Varies with device:1695 1,000,000+ :1579 0 : 1 0 :10040
## 11M : 198 10,000,000+:1252 Free:10039 $0.99 : 148
## 12M : 196 100,000+ :1169 Paid: 800 $2.99 : 129
## 14M : 194 10,000+ :1054 NA's: 1 $1.99 : 73
## 13M : 191 1,000+ : 907 $4.99 : 72
## 15M : 184 5,000,000+ : 752 $3.99 : 63
## (Other) :8183 (Other) :4128 (Other): 316
## Content.Rating Genres Last.Updated
## Adults only 18+: 3 Tools : 842 August 3, 2018: 326
## Everyone :8714 Entertainment: 623 August 2, 2018: 304
## Everyone 10+ : 414 Education : 549 July 31, 2018 : 294
## Mature 17+ : 499 Medical : 463 August 1, 2018: 285
## Teen :1208 Business : 460 July 30, 2018 : 211
## Unrated : 2 Productivity : 424 July 25, 2018 : 164
## NA's : 1 (Other) :7480 (Other) :9257
## Current.Ver Android.Ver
## Varies with device:1459 4.1 and up :2451
## 1.0 : 809 4.0.3 and up :1501
## 1.1 : 264 4.0 and up :1375
## 1.2 : 178 Varies with device:1362
## 2.0 : 151 4.4 and up : 980
## (Other) :7972 (Other) :3169
## NA's : 8 NA's : 3

NA analysis

After getting acquainted with the dataset, we should analyze it for NA values and duplicates. Detecting and removing such records helps to build a model with better accuracy. First, let's analyze the missing values. We can review the result as a table:

sapply(df, function(x) sum(is.na(x)))

## App Category Rating Reviews Size
## 0 0 1474 0 0
## Installs Type Price Content.Rating Genres
## 0 1 0 1 0
## Last.Updated Current.Ver Android.Ver
## 0 8 3

Or as a chart, as sketched below.
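A minimal ggplot2 sketch of such a chart, built from the counts computed above (the helper objects na_counts and na_df are our own names, not part of the original post):

# Collect the per-column NA counts into a data frame and keep only
# the columns that actually contain missing values
na_counts <- sapply(df, function(x) sum(is.na(x)))
na_df <- data.frame(key = names(na_counts), value = as.numeric(na_counts))
na_df <- na_df[na_df$value > 0, ]

# Horizontal bar chart of the columns with NA values
ggplot(na_df, aes(x = reorder(key, value), y = value)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Columns with NA values", x = NULL, y = "Number of missing values")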
As you can see, there are three columns containing missing values, and the Rating column has by far the largest number of them. Let's remove such values.

df = na.omit(df)

Duplicate records removal

The next step is to check whether there are duplicates. We can check the difference between the total number of rows and the number of distinct rows.

distinct <- nrow(df %>% distinct())
nrow(df) - distinct

## [1] 474

After detecting the duplicates, we need to remove them:

df = df[!duplicated(df), ]

With the data precleaned, we can begin further visual analysis.

Analysis using visualization tools

To start off, we will review the Category column. Let's examine which categories are the most and the least popular:

df %>%
  count(Category, Installs) %>%
  group_by(Category) %>%
  summarize(TotalInstalls = sum(as.numeric(Installs))) %>%
  arrange(-TotalInstalls) %>%
  hchart('scatter', hcaes(x = "Category", y = "TotalInstalls", size = "TotalInstalls", color = "Category")) %>%
  hc_add_theme(hc_theme_538()) %>%
  hc_title(text = "Most popular categories (# of installs)")

(The resulting interactive scatter chart, "Most popular categories (# of installs)", ranks the categories from GAME, SOCIAL, COMMUNICATION and FAMILY at the top down to EDUCATION and HOUSE_AND_HOME at the bottom.)

Here we can see that Game is the most popular category by installs. Interestingly, Education has almost the lowest popularity, and Comics is also near the bottom of the popularity ranking.

Now, we want to see the percentage of apps in each category. The pie chart is not the most widespread type of visual, but when you need to show percentages, it is one of the best options. Let's count the apps in each category and expand our color palette.

freq <- table(df$Category)
fr <- as.data.frame(freq)
fr <- fr %>% arrange(desc(Freq))
coul = brewer.pal(12, "Paired")
# We can add more tones to this palette:
coul = colorRampPalette(coul)(15)
op <- par(cex = 0.5)
pielabels <- sprintf("%s = %3.1f%s", fr$Var1, 100*fr$Freq/sum(fr$Freq), "%")
pie(fr$Freq, labels=NA, clockwise=TRUE, col=coul, border="black", radius=0.5, cex=1)
legend("right", legend=pielabels, bty="n", fill=coul)

We can see that Family now becomes the leader among the categories. Also, Education here has a higher percentage than Comics.

Now, let's look closer at the prices of the apps and review how many free apps are available on the Play Store.

tmp <- df %>%
  count(Type) %>%
  mutate(perc = round((n / sum(n)) * 100)) %>%
  arrange(desc(perc))
hciconarray(tmp$Type, tmp$perc, size = 5) %>%
  hc_title(text="Percentage of paid vs. free apps")

(The resulting icon array, "Percentage of paid vs. free apps", compares the shares of free and paid apps.)

As you can see, 93% of the apps are free.
Let's see the median price in each category:

df %>%
  filter(Type == "Paid") %>%
  group_by(Category) %>%
  summarize(Price = median(as.numeric(Price))) %>%
  arrange(-Price) %>%
  hchart('treemap', hcaes(x = 'Category', value = 'Price', color = 'Price')) %>%
  hc_add_theme(hc_theme_elementary()) %>%
  hc_title(text="Median price per category") %>%
  hc_legend(align = "left", verticalAlign = "top", layout = "vertical", x = 0, y = 100)

(The resulting chart, "Median price per category", is a treemap of the categories sized and colored by median price, with Parenting, Dating and Finance at the top.)

This chart is a treemap. In general, it is used to display a data hierarchy and to summarize data based on two values (size and color). Here we can see that the Parenting category has the highest median price, while Personalization and Social have the lowest.

Now, we will build a correlation heatmap, first performing some data preprocessing.

df <- df %>%
  mutate(
    Installs = gsub("\\+", "", as.character(Installs)),
    Installs = as.numeric(gsub(",", "", Installs)),
    Size = gsub("M", "", Size),
    Size = ifelse(grepl("k", Size), 0, as.numeric(Size)),
    Rating = as.numeric(Rating),
    Reviews = as.numeric(Reviews),
    Price = as.numeric(gsub("\\$", "", as.character(Price)))
  ) %>%
  filter(Type %in% c("Free", "Paid"))

extract = c("Rating","Reviews","Size","Installs","Price")
df.extract = df[extract]
df.extract %>% filter(is.nan(df.extract$Reviews)) %>% filter(is.na(df.extract$Size))

## [1] Rating Reviews Size Installs Price
## <0 rows> (or 0-length row.names)

df.extract = na.omit(df.extract)
cor_matrix = cor(df.extract)
corrplot(cor_matrix, method = "color", order = "AOE", addCoef.col = "grey")

Unfortunately, there are no strong relations between the columns.

Also, let's see the number of installs by content rating.

tmp <- df %>%
  group_by(Content.Rating) %>%
  summarize(Total.Installs = sum(Installs)) %>%
  arrange(-Total.Installs)

highchart() %>%
  hc_chart(type = "funnel") %>%
  hc_add_series_labels_values(labels = tmp$Content.Rating, values = tmp$Total.Installs) %>%
  hc_title(text="Number of Installs by Content Rating") %>%
  hc_add_theme(hc_theme_elementary())

(The resulting funnel chart, "Number of Installs by Content Rating", ranks the ratings Everyone, Teen, Everyone 10+, Mature 17+, Adults only 18+ and Unrated by total installs.)

As you might have guessed, teens take an active part in rating apps on the Play Store. You may notice the hc_add_theme line in the code: it adds a theme to your chart. Highcharter has an extensive list of themes, and you can choose one via this link.

One of the most popular chart types is the time series, which we will explore last. We will also transform our date column type using the lubridate package.
# Get the number of apps by last updated date
tmp <- df %>% count(Last.Updated)
# Transform the date column type from text to date
tmp$Last.Updated <- mdy(tmp$Last.Updated)
# Transform the data into a time series
time_series <- xts(tmp$n, order.by = tmp$Last.Updated)

highchart(type = "stock") %>%
  hc_title(text = "Last updated date") %>%
  hc_subtitle(text = "Number of applications by date of last update") %>%
  hc_add_series(time_series) %>%
  hc_add_theme(hc_theme_economist())

(The resulting stock chart, "Last updated date: Number of applications by date of last update", covers May 21, 2010 to Aug 8, 2018 and offers zoom and range-slider controls.)

Such a visualization is very convenient, as it provides zoom options, a range slider, date filtering, and point hovering. Using this chart, we can see that the number of updates increases over time.

Conclusion

To sum up, exploratory data analysis is a powerful tool for the comprehensive analysis of a dataset. In general, we can divide EDA into the following stages: data overview, duplicate records analysis, NA analysis, and data exploration. Starting with a review of the data structure, columns, contents, and so on, we move forward to estimating and preparing our data for further analysis. Finally, visual data exploration helps to find dependencies, distributions, and more.
View all blog posts