Your Career Platform for Big Data

Be part of the digital revolution in Switzerland

 

Latest Jobs

Fossil Group, Inc Basel, BS, Switzerland
15/07/2019
Full time
At Fossil, we dare to dream, disrupt, and deliver in a better way. Our goal is simple: bring innovation, style, and connectivity to an industry ripe for change. Fossil is on a mission to revamp the way fashion accessories are done. We are committed to creating great watches, jewelry, handbags, small leather goods and wearables by investing in technology and long-term value creation. With our diverse portfolio of proprietary and licensed brands, along with department stores, specialty stores, eCommerce websites, and company-owned and operated retail stores, we are building a leading fashion- and tech-forward accessories company. Are you in?

Make An Impact

For our office in Basel, Switzerland, we need an organized, creative and self-motivated professional. We are currently looking for a Trading/Operations Specialist E-Commerce EMEA (m/f).

As a vital member of the E-Commerce team, you will be responsible for the daily operations and expansion of the Fossil Group brand in EMEA. With impeccable sales judgment and an efficient, data-driven and resolution-focused approach, you develop, manage and maintain the brand's websites and ensure the smooth running of the e-commerce business, troubleshooting where needed. In this role you ensure that all activities (sales, merchandising, customer service, shop management, finance, stock, promotion and warehouse fulfillment) are performed accurately, to a high standard and in a timely manner while maximizing sales and profit.

Your impact:
- Assisting the Head of D2C in developing and delivering the online trading and merchandising strategy and driving the success of the Fossil and Skagen websites across EMEA
- Proactively contributing to the development of day-to-day trading performance, margin and P&L, as well as ranging and promotional activity
- Acting as main lead on all E-Commerce operational and SAP-related issues for EMEA
- Reporting issues through the ticket system and following up on open tickets
- Dealing with E-Commerce-related issues reported by customer care and/or other departments
- Monitoring all shop sites to identify issues that may prevent customers from buying online, and working with the relevant teams to identify necessary changes and improvements
- Creating and commenting on weekly, monthly, quarterly and annual E-Commerce sales reports
- Ensuring that product updates and promotions are delivered within the set time frames on all shops
- Driving ongoing improvement of operational processes (e.g. warehouse, external service providers) to deliver efficiencies, ensuring consistency across the entire operations
- Undertaking ad hoc projects as required in the development of the department objectives

Who you are:
- 2+ years of experience in a relevant commercial function with significant exposure to e-commerce at a trading level
- Significant experience with SAP (required); other systems optional but preferred (e.g. SAP BW, GCP, Tableau, OMS)
- Experience with Salesforce and Google Analytics highly preferred
- Knowledge of stock management, financial procedures and merchandising best practice, with an understanding of user behavior
- Experience in a fast-paced sales environment driving performance to reach targets
- Outstanding communication skills: able to relay complex, technical information in a clear and concise way, and to express it in non-technical terms
- Evidence of proactively having taken initiatives to improve business efficiency or performance
- Ability to use independent judgment and to work across functions and regions
- Business English required; fluency in German, Italian or French an advantage

We are looking for people who embody our core values: Authenticity, we are all in with our unique selves. Everyone is different at Fossil and we love it! Grit, we push through, we bounce back, and we set our sights on the prize & go after it. Curiosity, we ask what if? What's next? Sense of humor, we don't take ourselves too seriously. Yeah, seriously. Making an Impact, we go big. We perform. We make a difference. Life is Short, Work Somewhere Awesome!

Job Status: Full Time
Department: E-Commerce
Fossil Group, Inc Basel, BS, Switzerland
15/07/2019
Full time
At Fossil, we dare to dream, disrupt, and deliver in a better way. Our goal is simple: bring innovation, style, and connectivity to an industry ripe for change. Fossil is on a mission to revamp the way fashion accessories are done. We are committed to creating great watches, jewelry, handbags, small leather goods and wearables by investing in technology and long-term value creation. With our diverse portfolio of proprietary and licensed brands, along with department stores, specialty stores, eCommerce websites, and company-owned and operated retail stores, we are building a leading fashion- and tech-forward accessories company. Are you in?

Make An Impact

For our office in Basel, Switzerland, we need an organized, creative and self-motivated professional. We are currently looking for a Sales Analyst (m/f).

The Sales Analyst is responsible for converting the top-down targets into an operational plan on a channel, subregion or country level. This role provides ad hoc reporting for business gap analysis and gives recommendations on actions. The Sales Analyst works in conjunction with the Commercial Finance Team to develop a best-practice reporting suite with critical data that provides insights to the sales teams, and uses standardized tools to enable the optimum range and stock holding for our B2B points of sale.

Your impact:
- Defines and tracks operating plans & business cases (building blocks) together with the Head of Sales
- Bottom-up planning
- Provides input for the forecasting process
- Short-term projections
- Sales performance analysis on account level
- Ad-hoc sales reporting
- Identifies sell-through opportunities on account level from sell-in / sell-out data
- Responsible for analyzing the standardized monthly sell-in reporting pack by B2B retailer and providing subsequent action points at various levels
- Provides insights to the sales teams

Who you are:
- Analytical
- Experience operating at a senior level
- Strong business acumen
- Business-fluent spoken English is a must; additional languages preferred
- Strong networking and communication skills
- Ability to manage multiple stakeholders
- Comfortable working within a complex matrix organization
- Can demonstrate experience of developing diverse relationships at all levels
- Experience of effectively managing multiple projects
- Well-organized individual who can manage a busy workload while meeting deadlines

We are looking for people who embody our core values: Authenticity, we are all in with our unique selves. Everyone is different at Fossil and we love it! Grit, we push through, we bounce back, and we set our sights on the prize & go after it. Curiosity, we ask what if? What's next? Sense of humor, we don't take ourselves too seriously. Yeah, seriously. Making an Impact, we go big. We perform. We make a difference. Life is Short, Work Somewhere Awesome!

Job Status: Full Time
Department: Sales
CREALOGIX Switzerland
15/07/2019
Full time
The CREALOGIX Group is a globally active, independent Swiss software house and, as a Fintech Top 100 company, one of the market leaders in digital banking. CREALOGIX develops and implements innovative fintech solutions for the digital bank of tomorrow. Founded in 1996, the Group employs around 700 people worldwide. To expand our team in the Data Center business unit at our Coburg location, we are looking for an IT Security Engineer / Big Data Analyst (m/f) to start as soon as possible.

Are you looking for complex challenges in IT and keen to get to know new IT topics? Is monotonous work not for you? Do you enjoy contact with colleagues and customers? Then you have come to the right place. Discover your personal plus at CREALOGIX and apply now!

Your responsibilities:
- Carrying out reactive and preventive measures in the field of technical security
- Advising and supporting on concepts, projects and questions in the area of IT security
- Identifying vulnerabilities in the Security Operations Center and reviewing log data
- Maintaining the security process systems SIEM, PAM and configuration management

Your qualifications:
- Completed degree in computer science or a comparable qualification
- Affinity for security tools and programming languages
- Know-how in the area of cyber threat intelligence
- Knowledge of IT security and networking
- Good knowledge of Linux and Linux hardening
- Knowledge of HSM and PKI
- Experience with big data tools such as Splunk
- Structured, solution-oriented and independent way of working
- Willingness to keep learning about technological developments
- Good German and English skills

We offer flexible working hours, attractive performance-based compensation and a balanced work-life balance. An appealing working environment with modern infrastructure and a good atmosphere will motivate you to actively shape the next stage of CREALOGIX's growth. Any questions? Nadine Bisch will be happy to help on +49 9561 55430. Interested? We look forward to receiving your complete online application!
Dathena Lausanne, VD, Switzerland
15/07/2019
Full time
Dathena is a cybersecurity and data governance startup based in Switzerland, Singapore and Paris. We work in close partnership with PwC, Google and NVIDIA. We are looking for motivated interns who will work with us and our partners to improve our leading technology for managing the confidentiality risk of major banks and Fortune 500 clients and to discover key insights in their data. You will be able to work on challenging big data projects and help us build and improve our products' scalability. If you have fun with us and you like working for a fast-growing company, you will have the opportunity to apply for a permanent position in one of our offices after the internship!

Job Purpose
We are looking for a motivated Scala/Spark developer who will help us improve data ingestion and processing performance by analyzing current Spark applications in order to design and implement efficient architecture solutions. Your focus will be to optimize current implementations of ML algorithms.

Responsibilities
Objectives of the internship:
- Design and implement streaming pipelines for a large application
- Benchmark different solution approaches and analyze performance
- Optimize code and resource usage
- Present solution approaches and architecture choices
- Maintain high performance and data integrity of critical databases (NoSQL & SQL)
- Support new projects/integrations, working with the R&D team

Skills
- Good programming skills: Scala; knowledge of query languages such as SQL and NoSQL
- Nice to have: Kafka, Spark Streaming, HBase
- Software engineering best practices: continuous integration with Git, Jira, Bitbucket
- Data-oriented personality
- Exceptional oral and written communication skills
- Time management
- Interpersonal skills
- Critical thinking
- Presentation skills

Working Conditions
The Scala/Spark Software Engineer Intern will be part of a highly qualified and dynamic team where they will be able to learn and improve. The intern must fully embrace the team spirit of a young and innovative startup and be able to adapt to a multicultural environment. Travel and remote work might be required.

DataCareer Blog

The way other people think about a product or service has a big impact on our everyday decision-making. People used to rely on the opinions of friends and relatives, but the Internet has changed that significantly: today, opinions are collected from people around the world through reviews on e-commerce sites as well as blogs and social networks. To transform the gathered data into useful information about how a product or service is perceived, sentiment analysis is needed.

What is sentiment analysis and why do we need it

Sentiment analysis is the computational exploration of opinions, sentiments, and emotions expressed in textual data. Companies use it increasingly because the extracted information can help monetize products and services. Words express various kinds of sentiment: they can be positive, negative, or have no emotional overtone at all, i.e. be neutral. To analyze the sentiment of a text, we need to understand the polarity of its words in order to classify sentiments into positive, negative, or neutral categories. This goal can be achieved with sentiment lexicons.

Common approaches for classifying sentiment

Sentiment analysis can be done in three ways: using ML algorithms, using dictionaries and lexicons, or combining the two techniques. The ML-based approach has become very popular because it offers broad possibilities for identifying different sentiment expressions in text. For the lexicon-based approach, various dictionaries with polarity scores are available, which help establish the connotation of a word. One advantage of this approach is that no training set is required, so even a small piece of data can be classified successfully. The drawback is that many words are still missing from sentiment lexicons, which somewhat diminishes the quality of classification. Sentiment analysis based on a combination of ML and lexicon-based techniques is less common, but it can achieve considerably better results than either approach used on its own.

Dictionaries are the central part of lexicon-based sentiment analysis. The most popular are AFINN, Bing, and NRC, which can be found and installed from the Python package repository. All of them assign polarity scores that can be positive, negative, or neutral. For Python developers, two sentiment tools are particularly helpful: VADER and TextBlob. VADER is a rule- and lexicon-based tool for sentiment analysis that is adapted to the kind of sentiment found in social media posts; it uses a list of tokens labeled according to their semantic connotation. TextBlob is a useful library for text processing that covers tasks such as phrase extraction, sentiment analysis, and classification.

Things to be done before sentiment analysis

In this tutorial, we will build a lexicon-based sentiment classifier for Donald Trump's tweets with the help of TextBlob, and look at which sentiments generally prevail in them. As with any data exploration, some steps must be taken before the analysis itself: problem statement and data preparation. As the topic of our study is already stated, let's concentrate on data preparation.
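To get a feel for these two tools before working with real data, here is a minimal sketch (assuming the textblob and vaderSentiment packages are installed; the sentence is a made-up example):

from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

text = "The service was great, but the delivery was painfully slow."

# TextBlob: polarity in [-1, 1] and subjectivity in [0, 1]
print(TextBlob(text).sentiment)

# VADER: neg/neu/pos proportions plus a normalized compound score in [-1, 1]
print(SentimentIntensityAnalyzer().polarity_scores(text))

Both tools score whole sentences out of the box, so the choice mostly comes down to whether you need VADER's social-media-specific lexicon or TextBlob's broader text-processing toolkit.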
We will get tweets directly from Twitter. The data arrives in an unordered form, so we need to organize it into a dataframe and clean it by removing links and stopwords.

Building the sentiment classifier

First of all, we have to import the packages needed for the task.

In [41]:
import tweepy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import nltk
import nltk.corpus as corp
from textblob import TextBlob

The next step is to connect our app to Twitter via the Twitter API. Provide the credentials that will be used in our function for connecting and extracting tweets from Donald Trump's account.

In [4]:
CONSUMER_KEY = "Key"
CONSUMER_SECRET = "Secret"
ACCESS_TOKEN = "Token"
ACCESS_SECRET = "Secret"

In [7]:
def twitter_access():
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    api = tweepy.API(auth)
    return api

twitter = twitter_access()

In [9]:
tweets = twitter.user_timeline("RealDonaldTrump", count=600)

This is how our dataset looks (output truncated for readability; the full Status object also repeats the complete user profile several times):

In [81]:
tweets[0]

Out[81]:
Status(_api=<tweepy.api.API object at 0x7f987f0ce240>, _json={'created_at': 'Wed Jun 26 02:34:41 +0000 2019', 'id': 1143709133234954241, 'id_str': '1143709133234954241', 'text': 'Presidential Harassment!', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': []}, 'source': '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>', 'user': {'id': 25073877, 'name': 'Donald J. Trump', 'screen_name': 'realDonaldTrump', 'location': 'Washington, DC', 'followers_count': 61369691, ...}, 'retweet_count': 10387, 'favorite_count': 48141, 'favorited': False, 'retweeted': False, ...}, ...)

Not very informative, huh? Let's make our dataset look more legible.

In [101]:
tweetdata = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=["tweets"])

In [102]:
tweetdata["Created at"] = [tweet.created_at for tweet in tweets]
tweetdata["retweets"] = [tweet.retweet_count for tweet in tweets]
tweetdata["source"] = [tweet.source for tweet in tweets]
tweetdata["favorites"] = [tweet.favorite_count for tweet in tweets]

And this is how it looks now. Much better, isn't it?

In [103]:
tweetdata.head()

Out[103]:
   tweets                                             Created at           retweets  source              favorites
0  Presidential Harassment!                           2019-06-26 02:34:41  10387  Twitter for iPhone  48141
1  Senator Thom Tillis of North Carolina has real...  2019-06-25 22:20:42  11127  Twitter for iPhone  45202
2  Staff Sgt. David Bellavia - today, we honor yo...  2019-06-25 21:38:42  11455  Twitter for iPhone  48278
3  Today, it was my great honor to present the Me...  2019-06-25 20:27:19  10389  Twitter for iPhone  44485
4  ....Martha is strong on Crime and Borders, the...  2019-06-25 19:25:20  9817   Twitter for iPhone  52995

The next step is to clean our dataset of useless words that carry no meaning, and to enrich it so that, alongside the default tweet data, it contains each tweet's connotation (positive, negative, or neutral), sentiment score, and subjectivity.
In [104]:
stopword = corp.stopwords.words('english') + ['rt', 'https', 'co', 'u', 'go']

def clean_tweet(tweet):
    # Lowercase, strip mentions/punctuation/links, then drop stopwords
    tweet = tweet.lower()
    filteredList = []
    global stopword
    tweetList = re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split()
    for i in tweetList:
        if not i in stopword:
            filteredList.append(i)
    return ' '.join(filteredList)

In [105]:
scores = []
status = []
sub = []
fullText = []
for tweet in tweetdata['tweets']:
    analysis = TextBlob(clean_tweet(tweet))
    fullText.extend(analysis.words)
    value = analysis.sentiment.polarity
    subject = analysis.sentiment.subjectivity
    if value > 0:
        sent = 'positive'
    elif value == 0:
        sent = 'neutral'
    else:
        sent = 'negative'
    scores.append(value)
    status.append(sent)
    sub.append(subject)

In [106]:
tweetdata['sentimental_score'] = scores
tweetdata['sentiment_status'] = status
tweetdata['subjectivity'] = sub
tweetdata.drop(tweetdata.columns[2:5], axis=1, inplace=True)

In [107]:
tweetdata.head()

Out[107]:
   tweets                                             Created at           sentimental_score  sentiment_status  subjectivity
0  Presidential Harassment!                           2019-06-26 02:34:41  0.000000  neutral   0.000000
1  Senator Thom Tillis of North Carolina has real...  2019-06-25 22:20:42  0.081481  positive  0.588889
2  Staff Sgt. David Bellavia - today, we honor yo...  2019-06-25 21:38:42  0.333333  positive  1.000000
3  Today, it was my great honor to present the Me...  2019-06-25 20:27:19  0.400000  positive  0.375000
4  ....Martha is strong on Crime and Borders, the...  2019-06-25 19:25:20  0.086667  positive  0.396667

For a better understanding of the obtained results, let's do some visualization.

In [109]:
positive = len(tweetdata[tweetdata['sentiment_status'] == 'positive'])
negative = len(tweetdata[tweetdata['sentiment_status'] == 'negative'])
neutral = len(tweetdata[tweetdata['sentiment_status'] == 'neutral'])

In [110]:
fig, ax = plt.subplots(figsize=(10, 5))
index = range(3)
plt.bar(index[2], positive, color='green', edgecolor='black', width=0.8)
plt.bar(index[0], negative, color='orange', edgecolor='black', width=0.8)
plt.bar(index[1], neutral, color='grey', edgecolor='black', width=0.8)
plt.legend(['Positive', 'Negative', 'Neutral'])
plt.xlabel('Sentiment Status', fontdict={'size': 15})
plt.ylabel('Sentimental Frequency', fontdict={'size': 15})
plt.title("Donald Trump's Twitter sentiment status", fontsize=20)

Conclusion

Sentiment analysis is a great way to explore the emotions and opinions circulating in society. We created a basic sentiment classifier that can be used for analyzing textual data from social networks. Lexicon-based analysis also lets you create your own lexicon dictionaries, so you can fine-tune sentiment scoring to the task, the textual data, and the goal of the analysis.
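As a closing illustration of that last point, here is a minimal sketch of lexicon tuning with VADER (using the vaderSentiment package; the added words and their valence scores are invented for this example). The analyzer's lexicon is a plain Python dict mapping tokens to valence scores, so domain-specific vocabulary can be injected directly:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
sentence = "The trade numbers are tremendous!"
print(analyzer.polarity_scores(sentence))  # score with the stock lexicon

# Add or override entries; VADER valences roughly range from -4 to +4.
# The words and scores below are arbitrary examples, not a recommendation.
analyzer.lexicon.update({'tremendous': 3.0, 'hoax': -2.5})
print(analyzer.polarity_scores(sentence))  # score with the tuned lexicon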
Among the variety of open-source relational databases, PostgreSQL is probably one of the most popular thanks to its functional capabilities, and it is used in nearly every area of work where databases are involved. In this article, we will go through connecting to and using PostgreSQL from R. R is an open-source language for statistical and graphical data analysis, giving scientists, statisticians, and academics powerful tools for various manipulations; it also allows creating and running simulations of real-world data. R usually comes with the RStudio IDE, which is what we will use while connecting to and using PostgreSQL.

PostgreSQL deployment in R

One of the great things about R is that it has numerous packages for almost every kind of need; moreover, the package library is constantly growing, as packages are created and developed by the community. Two main packages can be found in the library for connecting to PostgreSQL from the R environment: RPostgreSQL and RPostgres. Both provide great functionality for database interactions; the difference is only in the way of installation.

The RPostgreSQL package is available on CRAN, the Comprehensive R Archive Network, and is installed with the following command run in the IDE:

install.packages('RPostgreSQL')

The RPostgres package can be installed in two ways: cloning from GitHub or installing directly from CRAN. To install the package from GitHub, the devtools and remotes packages must be installed first:

install.packages('devtools')
install.packages('remotes')

Then, to install the package, run:

remotes::install_github("r-dbi/RPostgres")

To install the package from CRAN, the basic command is:

install.packages('RPostgres')

The difference between the two routes is that CRAN stores the latest stable version of a package, while on GitHub users can find the latest development version. In truth, the RPostgreSQL and RPostgres packages do not differ in the way they connect to a PostgreSQL database: both build on the DBI package, which provides a wide range of methods and classes for establishing connections with databases. Note: we used the RPostgres package for establishing the connection.

Establishing a basic connection with the database using R

The RPostgres package connects with the following command:

con <- dbConnect(RPostgres::Postgres())

With the following steps you can set up the connection to a specific database:

library(DBI)
db <- 'DATABASE'  # provide the name of your db
host_db <- 'HOST'  # e.g. 'ec2-54-83-201-96.compute-1.amazonaws.com'
db_port <- '98939'  # or any other port specified by the DBA
db_user <- 'USERNAME'
db_password <- 'PASSWORD'
con <- dbConnect(RPostgres::Postgres(), dbname = db, host = host_db, port = db_port, user = db_user, password = db_password)

To check that the connection is established, we can run the dbListTables(con) function, which returns the list of tables in our database. As you can see, no tables are stored in our database yet, so now it's time to create one.

Working with the database

As already mentioned, R provides a great pack of simulated datasets that can be used directly from the IDE without downloading them first. For our examples, we will use the popular "mtcars" dataset, which contains data from the 1974 Motor Trend magazine car road tests. Let's first add it to the database and then check whether it has appeared there.
The basic command to add "mtcars" to our database is:

dbWriteTable(con, "mtcars", mtcars)

But we will use a little trick that makes our table a bit more readable: we set the dataset up as a dataframe in R, renamed the first column to 'carname', and then removed the initial dataset with the rm(mtcars) command, as it is now stored in the variable my_data. Using the dbWriteTable method, we can write this dataframe to a PostgreSQL table (named cars below). Then let's check how our table looks.

Having a table in the database, we can now explore queries. For working with queries, two basic methods are needed: dbGetQuery and dbSendQuery. The dbGetQuery method returns all the query results in a dataframe. The dbSendQuery method registers the request for the data, which then has to be retrieved with dbFetch for RPostgres to receive it; dbFetch also allows setting parameters to query your data in batches.

A database table should have a primary key, basically a unique identifier for every record in the table. Let's make the names of the cars in our table the primary key using the dbGetQuery method:

dbGetQuery(con, 'ALTER TABLE cars ADD CONSTRAINT cars_pk PRIMARY KEY ("carname")')

We have already used the dbReadTable method, but let's return to it briefly to clarify how it works. The dbReadTable method returns an overview of the data stored in the database, and basically performs the same function as dbGetQuery(con, 'SELECT * FROM cars'). It should be noted that after a dbSendQuery request, the dbClearResult method must be called to remove any pending queries between the database and the current working environment. The dbGetQuery method does this by default, so there is no need to call dbClearResult after it executes.

Creating basic queries

Creating queries for a customized data table works basically the same way as in SQL; the only difference is that the results of queries in R are stored in a variable. First, we extracted the query with the needed data from our cars table into a new variable. Then, we fetched it into the resulting variable, from which we can create a new table in our database and analyze the output of our query. Finally, the connection must be closed with the dbDisconnect(con) method.

Conclusion

In this article, we covered the basics of connecting to and using PostgreSQL in the R environment. Knowing the essentials of SQL syntax, querying and modifying data in R is enough to connect to any standard database. Nevertheless, we suggest reading through the package documentation, which will give you more insight into how to query data from PostgreSQL into the R environment.
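As a concrete closing illustration of the query workflow described above, here is a minimal sketch; it assumes the con connection and the cars table created earlier, with columns taken from the mtcars dataset:

library(DBI)

# Register the query; no rows are transferred yet
res <- dbSendQuery(con, 'SELECT "carname", mpg, cyl FROM cars WHERE cyl = 6')

# Fetch the result set into a dataframe (dbFetch also accepts an n
# parameter to retrieve the rows in batches)
six_cyl <- dbFetch(res)

# Unlike dbGetQuery, dbSendQuery requires an explicit clean-up
dbClearResult(res)

head(six_cyl)
dbDisconnect(con)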
What is Exploratory Data Analysis

Exploratory data analysis (EDA) is a powerful tool for comprehensively studying the available information and answering basic data analysis questions. What distinguishes it from traditional analysis based on testing an a priori hypothesis is that EDA applies a variety of methods to detect all potential systematic correlations in the data. Exploratory data analysis is practically unlimited in time and methods, letting you identify curious data fragments and correlations, examine the information more deeply and accurately, and choose a proper model for further work. In the Python ecosystem, there is a wide range of libraries that can not only ease but also streamline the process of exploring a dataset. We will use the Google Play Store Apps dataset and go through the main tasks of exploratory analysis to find out whether there are any trends that can facilitate setting and resolving a business problem.

Data overview

Before we start exploring our data, we must import the dataset and the Python libraries needed for further work. We will use the pandas library, a very powerful tool for comprehensive data analysis.

In [1]:
import pandas as pd

In [2]:
googleplaystore = pd.read_csv("googleplaystore.csv")

Let's explore the structure of our dataframe by viewing the first and the last 10 rows.

In [3]:
googleplaystore.head(10)

Out[3]:
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up
1 Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up
3 Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up
4 Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up
5 Paper flowers instructions | ART_AND_DESIGN | 4.4 | 167 | 5.6M | 50,000+ | Free | 0 | Everyone | Art & Design | March 26, 2017 | 1.0 | 2.3 and up
6 Smoke Effect Photo Maker - Smoke Editor | ART_AND_DESIGN | 3.8 | 178 | 19M | 50,000+ | Free | 0 | Everyone | Art & Design | April 26, 2018 | 1.1 | 4.0.3 and up
7 Infinite Painter | ART_AND_DESIGN | 4.1 | 36815 | 29M | 1,000,000+ | Free | 0 | Everyone | Art & Design | June 14, 2018 | 6.1.61.1 | 4.2 and up
8 Garden Coloring Book | ART_AND_DESIGN | 4.4 | 13791 | 33M | 1,000,000+ | Free | 0 | Everyone | Art & Design | September 20, 2017 | 2.9.2 | 3.0 and up
9 Kids Paint Free - Drawing Fun | ART_AND_DESIGN | 4.7 | 121 | 3.1M | 10,000+ | Free | 0 | Everyone | Art & Design;Creativity | July 3, 2018 | 2.8 | 4.0.3 and up

In [4]:
googleplaystore.tail(10)

Out[4]:
App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver
10831 payermonstationnement.fr | MAPS_AND_NAVIGATION | NaN | 38 | 9.8M | 5,000+ | Free | 0 | Everyone | Maps & Navigation | June 13, 2018 | 2.0.148.0 | 4.0 and up
10832 FR Tides | WEATHER | 3.8 | 1195 | 582k | 100,000+ | Free | 0 | Everyone | Weather | February 16, 2014 | 6.0 | 2.1 and up
10833 Chemin (fr) | BOOKS_AND_REFERENCE | 4.8 | 44 | 619k | 1,000+ | Free | 0 | Everyone | Books & Reference | March 23, 2014 | 0.8 | 2.2 and up
10834 FR Calculator | FAMILY | 4.0 | 7 | 2.6M | 500+ | Free | 0 | Everyone | Education | June 18, 2017 | 1.0.0 | 4.1 and up
10835 FR Forms | BUSINESS | NaN | 0 | 9.6M | 10+ | Free | 0 | Everyone | Business | September 29, 2016 | 1.1.5 | 4.0 and up
10836 Sya9a Maroc - FR | FAMILY | 4.5 | 38 | 53M | 5,000+ | Free | 0 | Everyone | Education | July 25, 2017 | 1.48 | 4.1 and up
10837 Fr. Mike Schmitz Audio Teachings | FAMILY | 5.0 | 4 | 3.6M | 100+ | Free | 0 | Everyone | Education | July 6, 2018 | 1.0 | 4.1 and up
10838 Parkinson Exercices FR | MEDICAL | NaN | 3 | 9.5M | 1,000+ | Free | 0 | Everyone | Medical | January 20, 2017 | 1.0 | 2.2 and up
10839 The SCP Foundation DB fr nn5n | BOOKS_AND_REFERENCE | 4.5 | 114 | Varies with device | 1,000+ | Free | 0 | Mature 17+ | Books & Reference | January 19, 2015 | Varies with device | Varies with device
10840 iHoroscope - 2018 Daily Horoscope & Astrology | LIFESTYLE | 4.5 | 398307 | 19M | 10,000,000+ | Free | 0 | Everyone | Lifestyle | July 25, 2018 | Varies with device | Varies with device

We can see that the googleplaystore dataframe has missing values. For a more complete view of the data, let's do a few more things. First, we will use the describe() pandas method, which gives a statistical summary of the numerical columns in our dataset. We can also use the info() method to check the data type of each column along with missing values, and the shape attribute to retrieve the number of rows and columns in the dataframe.
In [5]:
googleplaystore.describe()

Out[5]:
       Rating
count  9367.000000
mean   4.193338
std    0.537431
min    1.000000
25%    4.000000
50%    4.300000
75%    4.500000
max    19.000000

In [6]:
googleplaystore.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
App               10841 non-null object
Category          10841 non-null object
Rating            9367 non-null float64
Reviews           10841 non-null object
Size              10841 non-null object
Installs          10841 non-null object
Type              10840 non-null object
Price             10841 non-null object
Content Rating    10840 non-null object
Genres            10841 non-null object
Last Updated      10841 non-null object
Current Ver       10833 non-null object
Android Ver       10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB

In [7]:
googleplaystore.shape

Out[7]:
(10841, 13)

In [8]:
googleplaystore.dtypes

Out[8]:
App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

So, what information do we have after these small actions? First, we have a number of apps divided into various categories. Second, although columns such as "Reviews" contain numeric data, they have a non-numeric type, which can cause problems during further processing. Note also the suspicious maximum rating of 19.0 in the describe() output: ratings should not exceed 5.0, so at least one corrupted row is present, and the stray '1.9' value among the categories below likely comes from the same malformed row. We are also interested in the total number of apps and the available categories in the dataset. To get the exact number of apps, we will find all the unique values in the corresponding column.

In [9]:
len(googleplaystore["App"].unique())

Out[9]:
9660

In [10]:
unique_categories = googleplaystore["Category"].unique()

In [11]:
unique_categories

Out[11]:
array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY', 'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION', 'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE', 'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME', 'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL', 'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL', 'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER', 'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION', '1.9'], dtype=object)

Duplicate records removal

Duplicate records frequently appear in datasets, and they can degrade the quality and accuracy of exploration while clogging the dataset, so we need to get rid of them.

In [14]:
googleplaystore.drop_duplicates(keep='first', inplace=True)

In [15]:
googleplaystore.shape

Out[15]:
(10358, 13)

For removing duplicate rows, pandas has the powerful and customizable drop_duplicates() method, which takes parameters that control how the cleaning is done. keep='first' means the method keeps the first occurrence of each duplicated row and drops the rest; inplace=True means the changes are applied directly to the dataframe we are currently using. As we can see above, our initial googleplaystore dataset contained 10841 rows; after removing duplicates, the number of rows decreased to 10358.

NA analysis

Another common problem in almost every dataset is columns with missing values. We will explore only the most common ways to clean a dataset of missing values. First, let's look at the total number of missing values in every column.
One of the great things about pandas is that it allows combining various operations in a single action, which brings great optimization opportunities and makes the code more compact.

In [14]:
googleplaystore.isnull().sum().sort_values(ascending=False)

Out[14]:
Rating            1465
Current Ver          8
Android Ver          3
Content Rating       1
Type                 1
Last Updated         0
Genres               0
Price                0
Installs             0
Size                 0
Reviews              0
Category             0
App                  0
dtype: int64

Now, let's get rid of all the rows with missing values. Although some statistical approaches allow imputing missing data with substitute values (such as the most common value or the mean), today we will work only with cleaned data. The pandas dropna() method also lets users set parameters for the appropriate processing depending on the expected result: here we state that every row containing any NA value must be dropped, and that the changes are stored directly in our dataframe.

In [16]:
googleplaystore.dropna(how='any', inplace=True)

Let's now check the shape of the dataframe after all the cleaning manipulations.

In [17]:
googleplaystore.shape

Out[17]:
(8886, 13)

If we look closer at our dataset and the result of the dtypes method, we can see that columns like "Reviews", "Size", "Price" and "Installs" should definitely have numeric values. So, let's see what values every column holds in order to plan our further manipulations.

In [18]:
googleplaystore.Price.unique()

Out[18]:
array(['0', '$4.99', '$3.99', '$6.99', '$7.99', '$5.99', '$2.99', '$3.49', '$1.99', '$9.99', '$7.49', '$0.99', '$9.00', '$5.49', '$10.00', '$24.99', '$11.99', '$79.99', '$16.99', '$14.99', '$29.99', '$12.99', '$2.49', '$10.99', '$1.50', '$19.99', '$15.99', '$33.99', '$39.99', '$3.95', '$4.49', '$1.70', '$8.99', '$1.49', '$3.88', '$399.99', '$17.99', '$400.00', '$3.02', '$1.76', '$4.84', '$4.77', '$1.61', '$2.50', '$1.59', '$6.49', '$1.29', '$299.99', '$379.99', '$37.99', '$18.99', '$389.99', '$8.49', '$1.75', '$14.00', '$2.00', '$3.08', '$2.59', '$19.40', '$3.90', '$4.59', '$15.46', '$3.04', '$13.99', '$4.29', '$3.28', '$4.60', '$1.00', '$2.95', '$2.90', '$1.97', '$2.56', '$1.20'], dtype=object)

In [19]:
googleplaystore.Installs.unique()

Out[19]:
array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+', '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+', '1,000,000,000+', '1,000+', '500,000,000+', '100+', '500+', '10+', '5+', '50+', '1+'], dtype=object)

In [20]:
googleplaystore.Size.unique()

Out[20]:
array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M', '28M', '12M', '20M', '21M', '37M', '5.5M', '17M', '39M', '31M', '4.2M', '23M', '6.0M', '6.1M', '4.6M', '9.2M', '5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M', '1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k', ... (several hundred more distinct values in 'M' and 'k' units) ...], dtype=object)

First of all, let's get rid of the dollar sign in the "Price" column and turn the values into a numeric type.

In [21]:
googleplaystore['Price'] = googleplaystore['Price'].apply(lambda x: x.replace('$', '') if '$' in str(x) else x)
googleplaystore['Price'] = googleplaystore['Price'].apply(lambda x: float(x))

Now we will work with the "Installs" column: we must remove the plus sign and the thousands separators, and convert the values to numeric.

In [22]:
googleplaystore['Installs'] = googleplaystore['Installs'].apply(lambda x: x.replace('+', '') if '+' in str(x) else x)
googleplaystore['Installs'] = googleplaystore['Installs'].apply(lambda x: x.replace(',', '') if ',' in str(x) else x)
googleplaystore['Installs'] = googleplaystore['Installs'].apply(lambda x: int(x))

Also, convert the "Reviews" column to a numeric type.

In [23]:
googleplaystore['Reviews'] = googleplaystore['Reviews'].apply(lambda x: int(x))

Finally, let's work on the "Size" column, as it needs a more complex approach. This column contains various types of data: among the numeric values, which can be in either MB or KB, there are null values and strings. Moreover, we need to reconcile the difference between the values written in MB and KB.
In [24]:
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: str(x).replace('Varies with device', 'NaN') if 'Varies with device' in str(x) else x)
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: str(x).replace('M', '') if 'M' in str(x) else x)
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: str(x).replace(',', '') if ',' in str(x) else x)
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: float(str(x).replace('k', '')) / 1000 if 'k' in str(x) else x)
googleplaystore['Size'] = googleplaystore['Size'].apply(lambda x: float(x))

Let's call the describe() method one more time. As we can see, we now have a statistical summary for all the columns that contain numeric values.

In [25]:
googleplaystore.describe()

Out[25]:
       Rating       Reviews       Size         Installs      Price
count  8886.000000  8.886000e+03  7418.000000  8.886000e+03  8886.000000
mean   4.187959     4.730928e+05  22.760829    1.650061e+07  0.963526
std    0.522428     2.906007e+06  23.439210    8.640413e+07  16.194792
min    1.000000     1.000000e+00  0.008500     1.000000e+00  0.000000
25%    4.000000     1.640000e+02  5.100000     1.000000e+04  0.000000
50%    4.300000     4.723000e+03  14.000000    5.000000e+05  0.000000
75%    4.500000     7.131325e+04  33.000000    5.000000e+06  0.000000
max    5.000000     7.815831e+07  100.000000   1.000000e+09  400.000000

Building visualizations

Visualization is probably one of the most useful approaches in data analysis. Not all correlations and dependencies can be seen from tabular data, and various plots and diagrams help depict them clearly. Let's go through the different ways we can explore the categories.

Exploring which categories have the biggest number of apps

One of the fanciest ways to visualize such data is a word cloud. With a few lines of code, we can create an illustration that shows which categories have the biggest number of apps.

In [30]:
import matplotlib.pyplot as plt
import wordcloud
from wordcloud import WordCloud
import seaborn as sns
color = sns.color_palette()
%matplotlib inline

In [33]:
from plotly import tools
from plotly.offline import iplot, init_notebook_mode
from IPython.display import Image
import plotly.offline as py
import plotly.graph_objs as go
import plotly.io as pio
import numpy as np
py.init_notebook_mode()

In [34]:
wc = WordCloud(max_font_size=250, collocations=False, max_words=33, width=1600, height=800, background_color="white").generate(' '.join(googleplaystore['Category']))
plt.figure(figsize=(20, 10))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

Exploring app ratings across top categories

In [35]:
groups = googleplaystore.groupby('Category').filter(lambda x: len(x) > 286).reset_index()
array = groups['Rating'].hist(by=groups['Category'], sharex=True, figsize=(20, 20))

As we can see, average app ratings differ quite a bit across the categories.

Average rating of all the apps

And what insight will we get if we explore the average rating of all the apps?

In [36]:
avg_rate_data = go.Figure()
avg_rate_data.add_histogram(
    x = googleplaystore.Rating,
    xbins = {'start': 1, 'size': 0.1, 'end': 6}
)
iplot(avg_rate_data)

In [38]:
img_bytes = pio.to_image(avg_rate_data, format='png', width=1600, height=800, scale=2)

In [39]:
Image(img_bytes)

As we can see, most of the apps clearly hold a rating above 4.0! In fact, quite a lot of apps seem to have a 5.0 rating. Let's check how many apps have the highest possible rating.
In [40]:
googleplaystore.Rating[googleplaystore['Rating'] == 5].count()

Out[40]:
271

But does any feature in the dataset really affect the apps' rating? Let's figure out how size, number of installs, reviews, and price correlate with each other, and then explore the impact of each feature on the rating. First of all, let's build a heatmap: for exploring correlations between features, a heatmap is among the best visual tools. The individual values in the data matrix are represented by different colors, which helps to quickly see which features have the strongest and the weakest dependencies.

In [41]:
sns.heatmap(googleplaystore.corr(), annot=True, linewidth=0.5)

A positive correlation of 0.62 exists between the number of reviews and the number of installations, which means that customers tend to download a given app more if it has been reviewed by a larger number of people. This also means that many of the active users who download an app usually give feedback.

Sizing strategy: how does the size of the app impact the rating?

Even though modern phones and tablets have enough memory to handle all kinds of tasks and store gigabytes of data, the size of an app still matters. Let's explore whether this value really affects app ratings or not. To answer this question, we will use a scatterplot, which is definitely the most common and informative way to see how two variables correlate.

In [42]:
groups = googleplaystore.groupby('Category').filter(lambda x: len(x) >= 50).reset_index()

In [43]:
sns.set_style("whitegrid")
ax = sns.jointplot(googleplaystore['Size'], googleplaystore['Rating'])

As we can see, most of the apps with the highest rating have a size between approximately 20 MB and 40 MB.

Pricing: how does price affect app rating?

In [44]:
paid_apps = googleplaystore[googleplaystore.Price > 0]
p = sns.jointplot("Price", "Rating", paid_apps)

So, the top-rated apps do not carry big prices: only a few apps cost more than $20.

Pricing across categories

In [45]:
sns.set_style('whitegrid')
fig, ax = plt.subplots()
fig.set_size_inches(15, 8)
p = sns.stripplot(x="Price", y="Category", data=googleplaystore, jitter=True, linewidth=1)
title = ax.set_title('App pricing trends across categories')

As we can see, there are apps priced higher than $200! Let's see which categories these apps belong to.

In [46]:
googleplaystore[['Category', 'App']][googleplaystore.Price > 200].groupby(["Category"], as_index=False).count()

Out[46]:
   Category   App
0  FAMILY     4
1  FINANCE    6
2  LIFESTYLE  5

Price vs. installations: are free apps downloaded more than paid ones?

To visualize the answer, we will use a boxplot, so we can compare the range and distribution of the number of downloads for paid and free apps. Boxplots also help to answer questions like:
- what are the key values (average, median, first quartile, and so on)
- does the data have outliers, and what are their values
- is the data symmetric
- how tightly is the data grouped
- is the data shifted and, if so, in which direction
In [47]:
trace0 = go.Box(
    y=np.log10(googleplaystore['Installs'][googleplaystore.Type == 'Paid']),
    name = 'Paid',
    marker = dict(color = 'rgb(214, 12, 140)')
)
trace1 = go.Box(
    y=np.log10(googleplaystore['Installs'][googleplaystore.Type == 'Free']),
    name = 'Free',
    marker = dict(color = 'rgb(0, 128, 128)')
)
layout = go.Layout(
    title = "Paid apps Vs free apps",
    yaxis = {'title': 'Downloads (log-scaled)'}
)
data = [trace0, trace1]
iplot({'data': data, 'layout': layout})

As we can see, paid apps are downloaded less frequently than free ones.

Conclusion

Exploratory data analysis is an essential part of working with data: it provides general knowledge about the dataset at hand and surfaces the basic concepts and outlines needed for first insights. In this tutorial we walked through the general approaches to initial data exploration using the app category and rating columns as examples. However, there are a lot of other interesting dependencies and correlations left within the other columns. The dataset we used is available via the following link: https://www.kaggle.com/lava18/google-play-store-apps/activity
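As one possible starting point for that further exploration, here is a short sketch (assuming the cleaned googleplaystore dataframe built above) that ranks categories by their median number of installs; the median is used because, as seen earlier, the Installs column has heavy outliers:

# Median installs per category, most popular first
median_installs = (
    googleplaystore.groupby('Category')['Installs']
    .median()
    .sort_values(ascending=False)
)
print(median_installs.head(10))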
View all blog posts