Your Career Platform for Big Data

Be part of the digital revolution in Switzerland


Latest Jobs

Arobase Geneva, Switzerland
Full time
For one of our clients based in Geneva, we are looking for a Data Market Analyst.

You will be responsible for providing analysis and insight that help your division grow its business profitably. You should have business and communication skills and be able to explain complex solutions to people from various functions and hierarchical levels. You are comfortable working in challenging environments and have a contagious passion for data analytics and how it drives change. Your contributions will drive the next generation of decision making and action taking in Global Aftermarket, Marketing and Brand at …. You will also join a network of analysts from multiple divisions, helping you grow your business understanding and analytical skills.

Responsibilities:
- Understand business needs and key decision points
- Deliver insight to support the above decisions
- Provide relevant visualizations to generate insight
- Leverage analytical skills for on-demand analysis or project-based analytics
- Lead on establishing data requirements according to business objectives

Required Skills:
- 3+ years of business analyst experience
- Very strong analytical skills
- Robust SQL knowledge / familiarity with R and/or Python
- Robust visualization skills (Tableau / Power BI)
(Please provide examples of the above as part of your application.)

Desired Skills:
- Ability to work with technical and non-technical stakeholders, and to take and deliver analysis and insight
- Good working knowledge of CRMs, data warehouses and more
- Bachelor's degree in computer science or equivalent

Data-Experienced:
- Subject matter expert on data best practices
- Attention to detail
- Has developed databases from scratch from multiple initial sources with a BI end purpose
- Knowledge of data governance practices and processes

Data systems savvy:
- Hadoop
- AWS
- SQL
- Alteryx
- Others

Conditions: One-year contract mission, with the possibility of being hired directly by the client afterwards.
Hochschule Luzern Luzern, Switzerland
Full time
The Lucerne School of Business (Hochschule Luzern – Wirtschaft) is a Swiss competence center for education, continuing education, and research in business and management.
Atos SE Zürich, Switzerland
Full time
Atos Consulting is a leading Swiss IT consulting company, part of the Atos Group. We work with prestigious international clients in the areas of Industry, Pharma, Insurances, Transport, Trading, Telecommunications, Media and Energy Sector. For our Zurich and Basel team, we are currently looking for a  Big Data/IoT Architect (m/f)   Your responsibilities The Big Data/IoT Architect will be responsible for conceptualizing and guiding the full architectural lifecycle of a Big Data solution, including requirements analysis, governance, capacity requirements, technical architecture design, application design, testing and deployment Provide technical direction in a team that designs and develops path breaking large-scale cluster data processing systems and delivering end-to-end solutions/services to customers Interact with domain experts, solution architects and analytics developers to define data models for streaming input and delivering analytics output Helping CxO-level stakeholders to develop strategies that maximize the value of their data, becoming data-driven organizations Help establish thought leadership in the Big Data space by contributing internal papers, technical commentary to the user community and SME activities on presales strategy Your profile The ideal candidate comes from a hands-on background as either BI DWH Development / Data Engineer (+5 years) and evolved to Solution Architect/Innovation Leader Solid cross-vertical expertise (at least 3 verticals between Pharma, Manufacturing, Telco, Insurance, Finance, Consumer Packaged Goods, Retail, Health & Government) Experienced in designing, conceptualizing and implementing Big Data architectures Proven design and implementation experience in: IoT (e.g. Azure, Amazon, GCP) MS BI Stack (SSAS, SSRS, SSIS, Power BI) Big Data Platforms (e.g. 
Hortonworks, Cloudera) Proven experience in Agile methodology (SCRUM, SAFe,…) Fluent language skills in German and English Your application Are you interested in joining a global and dynamic organization, working on challenging and exciting projects and being supported with training opportunities? Please submit your complete application online. We look forward to hearing from you. For recruitment agencies: Atos Consulting does not accept unsolicited applications for any of the roles that we advertise. Furthermore, please note that no legal obligations or respective compensation would derive from such unsolicited applications. Thank you for your understanding.

DataCareer Blog

What is Exploratory Data Analysis

Exploratory data analysis (EDA) is a powerful approach for a comprehensive study of the available information, providing answers to basic data analysis questions. What distinguishes it from traditional analysis based on testing a priori hypotheses is that EDA makes it possible to detect, by using various methods, potential systematic correlations in the data. Exploratory data analysis is practically unlimited in time and methods, which allows you to identify curious data fragments and correlations. As a result, you can examine the information more deeply and accurately, and choose a proper model for further work.

In the Python environment, there is a wide range of libraries that ease and streamline the process of exploring a dataset. We will use the Google Play Store Apps dataset and go through the main tasks of exploratory analysis to find out whether there are any trends that can help in setting and solving a business problem.

Data overview

Before we start exploring our data, we must import the dataset and the Python libraries needed for further work. We will use the pandas library, a very powerful tool for comprehensive data analysis.

In [1]: import pandas as pd
In [2]: googleplaystore = pd.read_csv("googleplaystore.csv")

Let's explore the structure of our dataframe by viewing the first and the last 10 rows.

In [3]: googleplaystore.head(10)
Out[3]:
 | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver
0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up
1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up
2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up
3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up
4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up
5 | Paper flowers instructions | ART_AND_DESIGN | 4.4 | 167 | 5.6M | 50,000+ | Free | 0 | Everyone | Art & Design | March 26, 2017 | 1.0 | 2.3 and up
6 | Smoke Effect Photo Maker - Smoke Editor | ART_AND_DESIGN | 3.8 | 178 | 19M | 50,000+ | Free | 0 | Everyone | Art & Design | April 26, 2018 | 1.1 | 4.0.3 and up
7 | Infinite Painter | ART_AND_DESIGN | 4.1 | 36815 | 29M | 1,000,000+ | Free | 0 | Everyone | Art & Design | June 14, 2018 | | 4.2 and up
8 | Garden Coloring Book | ART_AND_DESIGN | 4.4 | 13791 | 33M | 1,000,000+ | Free | 0 | Everyone | Art & Design | September 20, 2017 | 2.9.2 | 3.0 and up
9 | Kids Paint Free - Drawing Fun | ART_AND_DESIGN | 4.7 | 121 | 3.1M | 10,000+ | Free | 0 | Everyone | Art & Design;Creativity | July 3, 2018 | 2.8 | 4.0.3 and up

In [4]: googleplaystore.tail(10)
Out[4]:
 | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver
10831 | | MAPS_AND_NAVIGATION | NaN | 38 | 9.8M | 5,000+ | Free | 0 | Everyone | Maps & Navigation | June 13, 2018 | | 4.0 and up
10832 | FR Tides | WEATHER | 3.8 | 1195 | 582k | 100,000+ | Free | 0 | Everyone | Weather | February 16, 2014 | 6.0 | 2.1 and up
10833 | Chemin (fr) | BOOKS_AND_REFERENCE | 4.8 | 44 | 619k | 1,000+ | Free | 0 | Everyone | Books & Reference | March 23, 2014 | 0.8 | 2.2 and up
10834 | FR Calculator | FAMILY | 4.0 | 7 | 2.6M | 500+ | Free | 0 | Everyone | Education | June 18, 2017 | 1.0.0 | 4.1 and up
10835 | FR Forms | BUSINESS | NaN | 0 | 9.6M | 10+ | Free | 0 | Everyone | Business | September 29, 2016 | 1.1.5 | 4.0 and up
10836 | Sya9a Maroc - FR | FAMILY | 4.5 | 38 | 53M | 5,000+ | Free | 0 | Everyone | Education | July 25, 2017 | 1.48 | 4.1 and up
10837 | Fr. Mike Schmitz Audio Teachings | FAMILY | 5.0 | 4 | 3.6M | 100+ | Free | 0 | Everyone | Education | July 6, 2018 | 1.0 | 4.1 and up
10838 | Parkinson Exercices FR | MEDICAL | NaN | 3 | 9.5M | 1,000+ | Free | 0 | Everyone | Medical | January 20, 2017 | 1.0 | 2.2 and up
10839 | The SCP Foundation DB fr nn5n | BOOKS_AND_REFERENCE | 4.5 | 114 | Varies with device | 1,000+ | Free | 0 | Mature 17+ | Books & Reference | January 19, 2015 | Varies with device | Varies with device
10840 | iHoroscope - 2018 Daily Horoscope & Astrology | LIFESTYLE | 4.5 | 398307 | 19M | 10,000,000+ | Free | 0 | Everyone | Lifestyle | July 25, 2018 | Varies with device | Varies with device

We can see that the googleplaystore dataframe has missing values. For a more complete view of the data, let's do a few more things. First, we will use the pandas describe() method, which gives a statistical summary of the numerical columns in our dataset. We can also use the info() method to check the data types in each column as well as missing values, and the shape attribute to retrieve the number of rows and columns in the dataframe.
In [5]: googleplaystore.describe()
Out[5]:
Rating
count 9367.000000
mean 4.193338
std 0.537431
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 19.000000

In [6]: googleplaystore.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
App 10841 non-null object
Category 10841 non-null object
Rating 9367 non-null float64
Reviews 10841 non-null object
Size 10841 non-null object
Installs 10841 non-null object
Type 10840 non-null object
Price 10841 non-null object
Content Rating 10840 non-null object
Genres 10841 non-null object
Last Updated 10841 non-null object
Current Ver 10833 non-null object
Android Ver 10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB

In [7]: googleplaystore.shape
Out[7]: (10841, 13)

In [8]: googleplaystore.dtypes
Out[8]:
App object
Category object
Rating float64
Reviews object
Size object
Installs object
Type object
Price object
Content Rating object
Genres object
Last Updated object
Current Ver object
Android Ver object
dtype: object

So, what information do we have after these small steps? First, we have a number of apps that are divided into various categories. Second, although columns such as "Reviews" contain numeric data, they have a non-numeric type, which can cause problems during further data processing. Note also that the maximum "Rating" is 19, well above the other values in the summary, which suggests at least one malformed record. We are also interested in the total number of apps and available categories in the dataset. To get the exact number of apps, we will count the unique values in the corresponding column.
In [9]: len(googleplaystore["App"].unique())
Out[9]: 9660
In [10]: unique_categories = googleplaystore["Category"].unique()
In [11]: unique_categories
Out[11]: array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY', 'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION', 'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE', 'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME', 'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL', 'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL', 'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER', 'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION', '1.9'], dtype=object)

Duplicate records removal

Datasets often contain duplicate records, which can degrade the quality and accuracy of the exploration. Such data also clogs the dataset, so we need to get rid of it.

In [14]: googleplaystore.drop_duplicates(keep='first', inplace = True)
In [15]: googleplaystore.shape
Out[15]: (10358, 13)

For removing duplicate rows from a dataset, pandas has the powerful and customizable drop_duplicates() method, which takes parameters that control how the dataset is cleaned. keep='first' means that, for each group of duplicate rows, the method keeps the first occurrence and drops the rest (keep=False would instead drop every duplicated row and keep none of them). inplace=True means that the changes are applied directly to the dataframe we are currently using. As we can see above, our initial googleplaystore dataset contained 10841 rows. After removing duplicates, the number of rows decreased to 10358.

NA analysis

Another common problem of almost every dataset is columns with missing values. We will explore only the most common ways to clean a dataset of missing values. First, let's look at the total number of missing values in every column of the dataset.
One of the great things about pandas is that it allows users to chain several operations in a single expression, which makes the code more compact.

In [14]: googleplaystore.isnull().sum().sort_values(ascending=False)
Out[14]:
Rating 1465
Current Ver 8
Android Ver 3
Content Rating 1
Type 1
Last Updated 0
Genres 0
Price 0
Installs 0
Size 0
Reviews 0
Category 0
App 0
dtype: int64

Now, let's get rid of all the rows with missing values. Although some statistical approaches allow us to impute missing data (for example, with the most common value or the mean), today we will work only with complete rows. The pandas dropna() method also lets users set parameters that control how the data is processed. Here we state that every row containing any NA value must be dropped and that the changes are applied directly to our dataframe.

In [16]: googleplaystore.dropna(how ='any', inplace = True)

Let's now check the shape of the dataframe after all the cleaning steps.

In [17]: googleplaystore.shape
Out[17]: (8886, 13)

If we look closer at our dataset and at the result of dtypes, we see that columns like "Reviews", "Size", "Price" and "Installs" should definitely hold numeric values. So, let's see what values every column has in order to plan our further manipulations.
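As a side note on the two cleaning steps above, here is a minimal sketch on a small made-up frame (the `toy` data and its values are assumptions for illustration, not rows from the Google Play dataset). It contrasts `keep='first'` with `keep=False` in `drop_duplicates()` and shows mean imputation with `fillna()` as one possible alternative to `dropna()`.

```python
import pandas as pd

# Hypothetical toy data: one exact duplicate row and one missing rating.
toy = pd.DataFrame({
    "App": ["A", "A", "B", "C"],
    "Rating": [4.0, 4.0, None, 5.0],
})

# keep='first' retains one copy of each duplicated row.
first_kept = toy.drop_duplicates(keep="first")

# keep=False drops every row that has a duplicate, keeping none of them.
none_kept = toy.drop_duplicates(keep=False)

# Dropping rows with any missing value...
dropped = first_kept.dropna(how="any")

# ...versus imputing the missing rating with the column mean.
imputed = first_kept.fillna({"Rating": first_kept["Rating"].mean()})

print(len(toy), len(first_kept), len(none_kept), len(dropped))
print(imputed)
```

Whether to drop or impute depends on the analysis; the tutorial itself continues with dropped rows only.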
In [18]: googleplaystore.Price.unique() Out[18]: array(['0', '$4.99', '$3.99', '$6.99', '$7.99', '$5.99', '$2.99', '$3.49', '$1.99', '$9.99', '$7.49', '$0.99', '$9.00', '$5.49', '$10.00', '$24.99', '$11.99', '$79.99', '$16.99', '$14.99', '$29.99', '$12.99', '$2.49', '$10.99', '$1.50', '$19.99', '$15.99', '$33.99', '$39.99', '$3.95', '$4.49', '$1.70', '$8.99', '$1.49', '$3.88', '$399.99', '$17.99', '$400.00', '$3.02', '$1.76', '$4.84', '$4.77', '$1.61', '$2.50', '$1.59', '$6.49', '$1.29', '$299.99', '$379.99', '$37.99', '$18.99', '$389.99', '$8.49', '$1.75', '$14.00', '$2.00', '$3.08', '$2.59', '$19.40', '$3.90', '$4.59', '$15.46', '$3.04', '$13.99', '$4.29', '$3.28', '$4.60', '$1.00', '$2.95', '$2.90', '$1.97', '$2.56', '$1.20'], dtype=object) In [19]: googleplaystore.Installs.unique() Out[19]: array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+', '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+', '1,000,000,000+', '1,000+', '500,000,000+', '100+', '500+', '10+', '5+', '50+', '1+'], dtype=object) In [20]: googleplaystore.Size.unique() Out[20]: array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M', '28M', '12M', '20M', '21M', '37M', '5.5M', '17M', '39M', '31M', '4.2M', '23M', '6.0M', '6.1M', '4.6M', '9.2M', '5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M', '1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k', '3.6M', '5.7M', '8.6M', '2.4M', '27M', '2.7M', '2.5M', '7.0M', '16M', '3.4M', '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M', '2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M', '7.1M', '22M', '6.4M', '3.2M', '8.2M', '4.9M', '9.5M', '5.0M', '5.9M', '13M', '73M', '6.8M', '3.5M', '4.0M', '2.3M', '2.1M', '42M', '9.1M', '55M', '23k', '7.3M', '6.5M', '1.5M', '7.5M', '51M', '41M', '48M', '8.5M', '46M', '8.3M', '4.3M', '4.7M', '3.3M', '40M', '7.8M', '8.8M', '6.6M', '5.1M', '61M', '66M', '79k', '8.4M', '3.7M', '118k', '44M', '695k', '1.6M', '6.2M', '53M', '1.4M', 
'3.0M', '7.2M', '5.8M', '3.8M', '9.6M', '45M', '63M', '49M', '77M', '4.4M', '70M', '9.3M', '8.1M', '36M', '6.9M', '7.4M', '84M', '97M', '2.0M', '1.9M', '1.8M', '5.3M', '47M', '556k', '526k', '76M', '7.6M', '59M', '9.7M', '78M', '72M', '43M', '7.7M', '6.3M', '334k', '93M', '65M', '79M', '100M', '58M', '50M', '68M', '64M', '34M', '67M', '60M', '94M', '9.9M', '232k', '99M', '624k', '95M', '8.5k', '41k', '292k', '80M', '1.7M', '10.0M', '74M', '62M', '69M', '75M', '98M', '85M', '82M', '96M', '87M', '71M', '86M', '91M', '81M', '92M', '83M', '88M', '704k', '862k', '899k', '378k', '4.8M', '266k', '375k', '1.3M', '975k', '980k', '4.1M', '89M', '696k', '544k', '525k', '920k', '779k', '853k', '720k', '713k', '772k', '318k', '58k', '241k', '196k', '857k', '51k', '953k', '865k', '251k', '930k', '540k', '313k', '746k', '203k', '26k', '314k', '239k', '371k', '220k', '730k', '756k', '91k', '293k', '17k', '74k', '14k', '317k', '78k', '924k', '818k', '81k', '939k', '169k', '45k', '965k', '90M', '545k', '61k', '283k', '655k', '714k', '93k', '872k', '121k', '322k', '976k', '206k', '954k', '444k', '717k', '210k', '609k', '308k', '306k', '175k', '350k', '383k', '454k', '1.0M', '70k', '812k', '442k', '842k', '417k', '412k', '459k', '478k', '335k', '782k', '721k', '430k', '429k', '192k', '460k', '728k', '496k', '816k', '414k', '506k', '887k', '613k', '778k', '683k', '592k', '186k', '840k', '647k', '373k', '437k', '598k', '716k', '585k', '982k', '219k', '55k', '323k', '691k', '511k', '951k', '963k', '25k', '554k', '351k', '27k', '82k', '208k', '551k', '29k', '103k', '116k', '153k', '209k', '499k', '173k', '597k', '809k', '122k', '411k', '400k', '801k', '787k', '50k', '643k', '986k', '516k', '837k', '780k', '20k', '498k', '600k', '656k', '221k', '228k', '176k', '34k', '259k', '164k', '458k', '629k', '28k', '288k', '775k', '785k', '636k', '916k', '994k', '309k', '485k', '914k', '903k', '608k', '500k', '54k', '562k', '847k', '948k', '811k', '270k', '48k', '523k', '784k', '280k', '24k', 
'892k', '154k', '18k', '33k', '860k', '364k', '387k', '626k', '161k', '879k', '39k', '170k', '141k', '160k', '144k', '143k', '190k', '376k', '193k', '473k', '246k', '73k', '253k', '957k', '420k', '72k', '404k', '470k', '226k', '240k', '89k', '234k', '257k', '861k', '467k', '676k', '552k', '582k', '619k'], dtype=object)

First of all, let's get rid of the dollar sign in the "Price" column and convert the values to a numeric type.

In [21]:
googleplaystore['Price'] = googleplaystore['Price'].apply(lambda x: x.replace('$', '') if '$' in str(x) else x)
googleplaystore['Price'] = googleplaystore['Price'].apply(lambda x: float(x))

Now, we will work with the "Installs" column. We must get rid of the plus sign and the thousands separators, and convert the values to numeric.

In [22]:
googleplaystore['Installs'] = googleplaystore['Installs'].apply(lambda x: x.replace('+', '') if '+' in str(x) else x)
googleplaystore['Installs'] = googleplaystore['Installs'].apply(lambda x: x.replace(',', '') if ',' in str(x) else x)
googleplaystore['Installs'] = googleplaystore['Installs'].apply(lambda x: int(x))

Also, convert the "Reviews" column to a numeric type.

In [23]: googleplaystore['Reviews'] = googleplaystore['Reviews'].apply(lambda x: int(x))

Finally, let's work with the "Size" column, as it needs a more complex approach. This column contains various types of data. Besides numeric values, which can be given either in MB or in KB, there are non-numeric strings such as "Varies with device". Moreover, we need to reconcile the values written in MB with those written in KB.
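Before the step-by-step pandas version, the conversion rule itself can be sketched as a plain Python function. The name `parse_size_mb` is hypothetical, and the factor of 1000 between kilobytes and megabytes is an assumption made here for simplicity.

```python
import math


def parse_size_mb(value):
    """Return the size in megabytes, or NaN when no number is given."""
    text = str(value).strip()
    if text.endswith("M"):
        return float(text[:-1])
    if text.endswith("k"):
        # Kilobytes to megabytes, assuming a factor of 1000.
        return float(text[:-1]) / 1000
    # Covers 'Varies with device' and any other non-numeric entry.
    return math.nan


for raw in ["19M", "8.5k", "Varies with device"]:
    print(raw, "->", parse_size_mb(raw))
```

A function like this could be passed to `Series.apply` on the "Size" column in place of several chained lambdas.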
In [24]:
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: str(x).replace('Varies with device', 'NaN') if 'Varies with device' in str(x) else x)
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: str(x).replace('M', '') if 'M' in str(x) else x)
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: str(x).replace(',', '') if ',' in str(x) else x)
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: float(str(x).replace('k', '')) / 1000 if 'k' in str(x) else x)
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: float(x))

Let's call the describe() method one more time. As we can see, we now have a statistical summary for all the columns that contain numeric values.

In [25]: googleplaystore.describe()
Out[25]:
 | Rating | Reviews | Size | Installs | Price
count | 8886.000000 | 8.886000e+03 | 7418.000000 | 8.886000e+03 | 8886.000000
mean | 4.187959 | 4.730928e+05 | 22.760829 | 1.650061e+07 | 0.963526
std | 0.522428 | 2.906007e+06 | 23.439210 | 8.640413e+07 | 16.194792
min | 1.000000 | 1.000000e+00 | 0.008500 | 1.000000e+00 | 0.000000
25% | 4.000000 | 1.640000e+02 | 5.100000 | 1.000000e+04 | 0.000000
50% | 4.300000 | 4.723000e+03 | 14.000000 | 5.000000e+05 | 0.000000
75% | 4.500000 | 7.131325e+04 | 33.000000 | 5.000000e+06 | 0.000000
max | 5.000000 | 7.815831e+07 | 100.000000 | 1.000000e+09 | 400.000000

Building visualizations

Visualization is probably one of the most useful approaches in data analysis. Not all correlations and dependencies can be seen in tabular data, so plots and diagrams help to depict them clearly. Let's go through the different ways we can explore categories.

Exploring which categories have the biggest number of apps

One of the fanciest ways to visualize such data is a word cloud. With a few lines of code, we can create an illustration that shows which categories have the biggest number of apps.
In [30]:
import matplotlib.pyplot as plt
import wordcloud
from wordcloud import WordCloud
import seaborn as sns
color = sns.color_palette()
%matplotlib inline

In [33]:
from plotly import tools
from plotly.offline import iplot, init_notebook_mode
from IPython.display import Image
import plotly.offline as py
import plotly.graph_objs as go
import plotly.io as pio
import numpy as np
py.init_notebook_mode()

In [34]:
wc = WordCloud(max_font_size=250, collocations=False, max_words=33, width=1600, height=800, background_color="white").generate(' '.join(googleplaystore['Category']))
plt.figure(figsize=(20,10))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.tight_layout(pad=0)

Exploring app ratings across top categories

In [35]:
groups = googleplaystore.groupby('Category').filter(lambda x: len(x) > 286).reset_index()
array = groups['Rating'].hist(by=groups['Category'], sharex=True, figsize=(20,20))

As we can see, app ratings are distributed quite differently across the categories.

Average Rating of all the Apps

And what insight do we get if we explore the ratings of all the apps together?

In [36]:
avg_rate_data = go.Figure()
avg_rate_data.add_histogram(
    x = googleplaystore.Rating,
    xbins = {'start': 1, 'size': 0.1, 'end' :6}
)
iplot(avg_rate_data)

In [38]: img_bytes = pio.to_image(avg_rate_data, format='png', width=1600, height=800, scale=2)
In [39]: Image(img_bytes)

As we can see, most of the apps clearly hold a rating above 4.0! Quite a lot of apps also seem to have a 5.0 rating. Let's check how many apps have the highest possible rating.

In [40]: googleplaystore.Rating[googleplaystore['Rating'] == 5 ].count()
Out[40]: 271

But does any feature in the dataset really affect an app's rating?
Let's try to figure out how size, number of installs, reviews, and price correlate with each other, and then explore the impact of each feature on the rating. First of all, let's build a heatmap. For exploring correlations between features, a heatmap is among the best visual tools. The individual values in the data matrix are represented by different colors, which helps to quickly see which features have the strongest and the weakest dependencies.

In [41]: sns.heatmap(googleplaystore.select_dtypes(include='number').corr(), annot=True, linewidth=0.5)
Out[41]: <matplotlib.axes._subplots.AxesSubplot at 0x11f75fbe0>

A positive correlation of 0.62 exists between the number of reviews and the number of installs. This suggests that apps reviewed by a larger number of people also tend to be downloaded more, although a correlation alone does not tell us which one drives the other. It is also consistent with many active users leaving feedback after downloading an app.

Sizing strategy: How does size of the app impact rating?

Although modern phones and tablets have enough memory to deal with various kinds of tasks and store gigabytes of data, the size of an app still matters. Let's explore whether this value really affects app rating or not. To answer this question, we will use a scatter plot, one of the most common and informative ways to see how two variables correlate.

In [42]: groups = googleplaystore.groupby('Category').filter(lambda x: len(x) >= 50).reset_index()
In [43]:
sns.set_style("whitegrid")
ax = sns.jointplot(x=googleplaystore['Size'], y=googleplaystore['Rating'])

As we can see, most of the apps with the highest rating have a size between approximately 20 MB and 40 MB.

Pricing: How does price affect app rating?
In [44]:
paid_apps = googleplaystore[googleplaystore.Price>0]
p = sns.jointplot(x="Price", y="Rating", data=paid_apps)

So, the top-rated apps are not expensive: only a few apps cost more than $20.

Pricing across categories

In [45]:
sns.set_style('whitegrid')
fig, ax = plt.subplots()
fig.set_size_inches(15, 8)
p = sns.stripplot(x="Price", y="Category", data=googleplaystore, jitter=True, linewidth=1)
title = ax.set_title('App pricing trends across categories')

As we can see, there are apps with a price higher than $200! Let's see which categories these apps belong to.

In [46]: googleplaystore[['Category', 'App']][googleplaystore.Price > 200].groupby(["Category"], as_index=False).count()
Out[46]:
 | Category | App
0 | FAMILY | 4
1 | FINANCE | 6
2 | LIFESTYLE | 5

Price vs. installation: are free apps downloaded more than paid?

To answer this question we will use a box plot, so we can compare the range and distribution of the number of downloads for paid and free apps. Box plots also help to answer questions like:
- what are the key values (average, median, first quartile, and so on)
- does our data have outliers, and what are their values
- whether our data is symmetric
- how tightly the data is grouped
- is the data shifted and, if so, in which direction

In [47]:
trace0 = go.Box(
    y=np.log10(googleplaystore['Installs'][googleplaystore.Type=='Paid']),
    name = 'Paid',
    marker = dict(color = 'rgb(214, 12, 140)')
)
trace1 = go.Box(
    y=np.log10(googleplaystore['Installs'][googleplaystore.Type=='Free']),
    name = 'Free',
    marker = dict(color = 'rgb(0, 128, 128)')
)
layout = go.Layout(
    title = "Paid apps Vs free apps",
    yaxis= {'title': 'Downloads (log-scaled)'}
)
data = [trace0, trace1]
iplot({'data': data, 'layout': layout})

As we can see, paid apps are downloaded less frequently than free ones.
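The box plot comparison above can also be checked numerically. The sketch below uses a small made-up frame (the `toy` values are assumptions for illustration, not figures from the dataset) and compares the median number of installs per `Type` with `groupby()`.

```python
import pandas as pd

# Hypothetical sample with the same column names as the cleaned dataset.
toy = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid"],
    "Installs": [1_000_000, 50_000, 10_000, 1_000, 5_000],
})

# Median installs per type; the median is robust to a few very popular apps.
median_installs = toy.groupby("Type")["Installs"].median()
print(median_installs)
```

On the real dataframe, the same call on `googleplaystore` would give the medians that the box plot draws as the middle line of each box.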
Conclusion

Exploratory data analysis is an inherent part of working with data: it helps you get a general understanding of the dataset, find its basic patterns, and gain first insights. In this tutorial we walked through the general approaches to initial data exploration, using the app categories and rating columns as an example. However, there are many other interesting dependencies and correlations left in the other columns. The dataset we used is available via the following link:
In the modern world, the flow of information that reaches a person is daunting. This has led to a rather abrupt change in the basic principles of data perception, and visualization is becoming the main tool for presenting information. With the help of visualization, information is presented to the audience in a more accessible, clear, visual form. A properly chosen visualization method makes it possible to structure large data arrays, schematically depict elements that are insignificant in content, and make information more comprehensible.

One of the most popular languages for data processing and analysis is Python, largely due to the rapid creation and development of libraries that offer very broad possibilities for data processing. The same is true for data visualization libraries. In this article, we will look at the basic tools for visualizing data that are used in the Python environment.

Matplotlib

Matplotlib is perhaps the most widely known Python library for data visualization. Being easy to use, it offers ample opportunities to fine-tune the way data is displayed.

[Figure: Polar area chart]

The library provides the main plot types, including scatter plots, line plots, histograms, bar plots, box plots, and more. It also has fairly extensive documentation, which makes it comfortable to work with even for beginners in data processing and visualization.

[Figure: Multicategorical plot]

One of the main advantages of this library is its well-thought-out hierarchical structure. The highest level is represented by the functional interface called matplotlib.pyplot, which allows users to create complex infographics with just a couple of lines of code by choosing ready-made solutions from the functions offered by the interface.
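As a minimal sketch of the pyplot interface described above, the following draws a labeled line plot in a handful of calls. The data values are made up for illustration, and the non-interactive `Agg` backend and the in-memory buffer are assumptions chosen so the script runs without a display or a file on disk.

```python
import io
import math

import matplotlib

matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt

# Made-up sample data: roughly one period of a sine wave.
x = [i * 0.1 for i in range(63)]
y = [math.sin(v) for v in x]

plt.figure(figsize=(6, 3))
plt.plot(x, y, color="teal", linewidth=2, label="sin(x)")
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.title("Line plot with matplotlib.pyplot")
plt.legend()

# Write the figure as PNG into an in-memory buffer instead of a file.
buf = io.BytesIO()
plt.savefig(buf, format="png")
```

In a notebook, `plt.show()` would typically replace the buffer and display the figure inline.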
[Figure: Histogram]

The convenience of creating visualizations with matplotlib comes not only from its many built-in plotting commands but also from the rich set of options for configuring standard plots. Settings include arbitrary colors, shapes, line or marker types, line thickness, transparency level, font size and type, and so on.

Seaborn

Despite the wide popularity of the Matplotlib library, it has one drawback that can become critical for some users: its API is low-level, so creating truly complex infographics may require writing a lot of boilerplate code.

[Figure: Hexbin plot]

Fortunately, this problem is addressed by the Seaborn library, which is a high-level wrapper over Matplotlib. With its help, users can create colorful, specialized visualizations: heat maps, time series, violin plots, and much more.

[Figure: Seaborn heatmap]

Being highly customizable, Seaborn gives users wide opportunities to add a unique and polished look to their charts in a simple way and with little time cost.

ggplot

Users who have experience with R have probably heard about ggplot2, a powerful data visualization tool within the R programming language environment. This package is recognized as one of the best tools for graphical presentation of information. Fortunately, the extensive capabilities of this library are now available in the Python environment thanks to a port of the package, which is available there under the name ggplot.

[Figure: Box plot]

The process of data visualization has a deep internal structure. In other words, creating a visualization is a clearly structured system, which largely shapes the way you think while building infographics. ggplot teaches the user to think in this structured way, so that while consistently building up commands, the user automatically starts detecting patterns in the data.

[Figure: Scatter plot]

Moreover, the library is very flexible. ggplot provides users with ample opportunities for customizing how data will be displayed and for preprocessing datasets before they are rendered.

Bokeh

Despite the rich potential of the ggplot library, some users may miss interactivity. For those who need interactive data visualization, there is the Bokeh library.

[Figure: Stacked area chart]

Bokeh is an open-source Python library that renders its plots in the browser through a JavaScript client. It allows users to create flexible, powerful and beautiful visualizations for web applications. With its help, users can create both simple bar charts and complex, highly detailed interactive visualizations without writing a single line of JavaScript. Please have a look at this gallery to get an idea of the interactive features of Bokeh.

plotly

For those who need interactive diagrams, we also recommend checking out the plotly library. It is positioned primarily as an online platform on which users can create and publish their own visualizations. However, the library can also be used offline without uploading the visualization to the plotly server.

[Figure: Contour plot]

Because the developers position this library mostly as a standalone product, it is constantly being refined and expanded. It provides users with very broad possibilities for data visualization, whether interactive graphics or contour plots. You can find some examples of Plotly through the link below and have a look at the features of the library.

Conclusion

Over the past few years, data visualization tools available to Python developers have made a significant leap forward. Many powerful packages have appeared and keep expanding, implementing quite complex ways of representing information graphically. This allows users not only to create various infographics but also to make them truly attractive and understandable to the audience.
The more carefully you process your data and go into detail, the more valuable information you can extract from it. Data visualization is an efficient and handy tool for gaining insights from data. Moreover, visualization tools can make the data far more understandable, colorful and pleasant to look at. Because data changes every second, it is important to investigate it carefully and get insights as fast as possible. Data visualization tools offer a wide range of features and additional functions designed to make the visualization process easier for you. With that in mind, we have put together an overview of the most popular and useful libraries for data visualization in R.

Ggplot2

Ggplot2 is a system for creating charts based on the Grammar of Graphics. It has proved to be one of the best R libraries for visualization. Ggplot2 works with both univariate and multivariate numerical and categorical data, which makes it very flexible. Plots are specified at a high level of abstraction, and the package is a complete graphics system. It includes a variety of labels, themes, different coordinate systems, etc. Therefore, you get the opportunity to:

 - control data values with the scales option
 - filter, sort, and summarize datasets
 - create complex plots.

However, some things are not available with ggplot2, such as 3D graphics, graph-theory type graphs, and interactive graphics. Here are several examples of plots made with Ggplot2.

[Figures: Density plot, Boxplot, Scatterplot]

Plotly

Plotly is an online platform for data visualization, available in R and Python. This package creates interactive web-based plots using the plotly.js library. Its advantage is that it can build contour plots, candlestick charts, maps, and 3D charts, which cannot be created with most packages. In addition, it has 30 repositories available. Plotly lets you interact with graphs, change their scale, and pick out the record you need.
The library also supports hover interaction on graphs. Moreover, you can easily add Plotly to knitr/R Markdown documents or Shiny apps. Have a look at several plots and charts created with Plotly.

[Figures: Contour plot, Candlestick chart, 3D scatterplot]

Dygraphs

Dygraphs is an R interface to the dygraphs JavaScript charting library. This library has proved to be fast and flexible, and it makes working with dense data easier. Dygraphs is a useful tool for charting time-series data in R. The benefits of this library include:

 - support for visualizing xts objects
 - graph overlays such as shaded regions, event lines, and point annotations
 - interaction with graphs
 - display of upper/lower bars
 - synchronization and the range selector, and more.

You can also easily add Dygraphs to knitr/R Markdown documents or Shiny apps. Moreover, huge datasets with millions of points don’t affect its speed. You can also use RColorBrewer with Dygraphs to increase the range of colors. Below you can see an example of data visualization with the Dygraphs package.

Leaflet

Leaflet is a well-known package based on a JavaScript library for interactive maps. It is widely used for creating, customizing and designing interactive maps. Besides, Leaflet makes it possible to build mobile-friendly maps. Its abilities include:

 - interaction with maps, including changing their scale
 - map design (Markers, Pop-ups, GeoJSON)
 - easy integration with knitr/R Markdown or Shiny apps
 - work with latitude/longitude columns
 - support of the Shiny logic using map bounds and mouse events
 - visualization of maps in non-spherical Mercator projections.

Rgl

The Rgl package may be a perfect fit for creating interactive 3D plots in R. It offers a variety of 3D shapes, lighting effects, different object types, and even the ability to animate your 3D scene. Rgl contains high-level graphics commands as well as a low-level structure. The plot types include points, lines, segments from z=0, and spheres.
Moreover, with Rgl, you can:

 - interact with graphs
 - apply various decorative functions
 - easily add Rgl to knitr/R Markdown or Shiny apps
 - create new shapes

Conclusion

To sum up, data visualization is more than a charming picture of your data. It's a chance to see what is under the hood of the data. R is a powerful visualization tool. Using R, you can build a variety of charts, from a simple pie chart to more sophisticated ones such as 3D graphs, interactive graphs, maps, etc. Of course, this list is not complete, and there are many other great visualization tools that can bring their own benefits to your data visualization. Nevertheless, we compiled this list from our experience. To summarize: Plotly, Dygraphs, and Leaflet support zooming and panning your graphs. If you are plotting time series, you can filter dates using a range selector. For building 3D models, Rgl is highly suitable. Do your best with handy R visualization tools!