Your Job Board for Big Data

Be part of the digital revolution in Switzerland

 

DataCareer Blog

Image recognition has been a major challenge in machine learning, and working with large labelled datasets to train your algorithms can be time-consuming. One efficient approach to obtaining such data is to outsource the work to a large crowd of users. Google uses this approach with the game “Quick, Draw!” to create the world’s largest doodling dataset, which has recently been made publicly available. In this game, you are asked to draw a given object in less than 20 seconds while a neural network predicts in real time what the drawing represents. As soon as the prediction is correct, the round ends, so you often can’t finish your drawing, which leaves some sketches looking rather odd (e.g., animals missing limbs). The dataset consists of 50 million drawings spanning 345 categories, including animals, vehicles, instruments and other items. By contrast, the MNIST dataset, also known as the “Hello World” of machine learning, includes no more than 70,000 handwritten digits. Compared with digits, the variability within each category of the “Quick, Draw!” data is much greater: there are many more ways to draw a cat than to write the number 8, say.

A Convolutional Neural Network in Keras Performs Best

In the project summarized below, we compared different machine learning algorithms from scikit-learn and Keras on the task of classifying drawings made in “Quick, Draw!”. Here we focus on the results rather than on the full implementation; you can find our complete Python code on Github. We used a dataset that had already been preprocessed to a uniform 28x28 pixel image size. Other datasets are available too, including one with the original resolution (which varies depending on the device used), timestamps collected during the drawing process, and the country where the drawing was made.

We first tested different binary classification algorithms to distinguish cats from sheep, a task that shouldn’t be too difficult. Figure 1 gives you an idea of what these drawings look like. While over 120,000 images per category are available, we used at most 7,500 per category, as training becomes very time-consuming as the number of samples grows.

Figure 1: Samples of the drawings made by users of the game “Quick, Draw!”

We tested the Random Forest, K-Nearest Neighbors (KNN) and Multi-Layer Perceptron (MLP) classifiers in scikit-learn as well as a Convolutional Neural Network (CNN) in Keras. The performance of these algorithms on a test set is shown in Figure 2. As one might expect, the CNN had the highest accuracy by far (up to 96%), but it also required the longest computing time. The KNN classifier came in second, followed by the MLP and the Random Forest.

Figure 2: Accuracy scores for RF, KNN, MLP and CNN classifiers

Optimizing the Parameters

The most important parameter for the Random Forest classifier is the number of trees, which defaults to 10 in scikit-learn. Increasing this number usually yields higher accuracy, but computing time also grows significantly. For this task, we found 100 trees to be sufficient, as there is hardly any gain in accuracy beyond that. In the K-Nearest Neighbors classifier, the central parameter is K, the number of neighbors. The default is 5, which we found to be optimal in this case. For the Multi-Layer Perceptron (MLP), the structure of the hidden layer(s) is a major point to consider. The default is one layer of 100 nodes. We tried one or two layers of 784 nodes, which slightly increased accuracy at the cost of much higher computing time. We chose two layers of 100 nodes as a compromise between accuracy and run time (adding the second layer did not change the accuracy, but it cut the fitting time in half). The differences between learning rates were very small, with the best results at 0.001.
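As a rough illustration (not our complete pipeline, which is on Github), the three scikit-learn classifiers with the settings discussed above can be set up in a few lines. The sketch below assumes X is an array of flattened 28x28 drawings and y holds the binary labels (e.g., 0 = cat, 1 = sheep).

# Illustrative sketch: fitting the three scikit-learn classifiers with the
# parameter choices discussed above. Assumes X has shape (n_samples, 784)
# and y contains the binary labels.
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    'Random Forest': RandomForestClassifier(n_estimators=100),  # 100 trees instead of the default 10
    'KNN': KNeighborsClassifier(n_neighbors=5),                 # the default K=5 turned out to be optimal
    'MLP': MLPClassifier(hidden_layer_sizes=(100, 100),         # two hidden layers of 100 nodes each
                         learning_rate_init=0.001),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print("%s accuracy: %.3f" % (name, model.score(X_test, y_test)))  # accuracy on the held-out test set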
For the Convolutional Neural Network in Keras (with the TensorFlow backend), we adapted the architecture from a tutorial by Jason Brownlee. In short, this CNN is composed of the following 9 layers:

1) Convolutional layer with 30 feature maps of size 5×5
2) Pooling layer taking the max over 2×2 patches
3) Convolutional layer with 15 feature maps of size 3×3
4) Pooling layer taking the max over 2×2 patches
5) Dropout layer with a probability of 20%
6) Flatten layer
7) Fully connected layer with 128 neurons and rectifier activation
8) Fully connected layer with 50 neurons and rectifier activation
9) Output layer
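The nine layers above translate into a Keras model along the following lines. This is an illustrative sketch rather than our exact implementation; it assumes grayscale 28x28 inputs in channels-last format and one-hot encoded labels.

# One possible Keras implementation of the nine layers listed above (illustrative).
# Assumes inputs of shape (28, 28, 1) and one-hot encoded labels with num_classes columns.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

def build_cnn(num_classes=2):
    model = Sequential()
    model.add(Conv2D(30, (5, 5), activation='relu', input_shape=(28, 28, 1)))  # 1) convolution, 30 maps, 5x5
    model.add(MaxPooling2D(pool_size=(2, 2)))                                  # 2) max pooling over 2x2 patches
    model.add(Conv2D(15, (3, 3), activation='relu'))                           # 3) convolution, 15 maps, 3x3
    model.add(MaxPooling2D(pool_size=(2, 2)))                                  # 4) max pooling over 2x2 patches
    model.add(Dropout(0.2))                                                    # 5) dropout with probability 20%
    model.add(Flatten())                                                       # 6) flatten to a vector
    model.add(Dense(128, activation='relu'))                                   # 7) fully connected, 128 neurons
    model.add(Dense(50, activation='relu'))                                    # 8) fully connected, 50 neurons
    model.add(Dense(num_classes, activation='softmax'))                        # 9) output layer
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model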
Evaluation of the CNN Classifier

Besides the accuracy score, there are other useful ways to evaluate a classifier. We can gain valuable insights by looking at the images that were classified incorrectly. Figure 3 shows some examples for the top-performing CNN. Many of these are hard or impossible to classify even for competent humans, but some of them seem doable.

Figure 3: Examples of images misclassified by the CNN (trained on 15,000 samples)

To obtain the actual number of images classified incorrectly, we can calculate the confusion matrix (Figure 4). In this example, the decision threshold was set to 0.5, meaning that the label with the higher probability was predicted. For other applications, e.g. testing for a disease, a different threshold may be better suited. One metric that considers all possible thresholds is the Area Under the Curve (AUC) score, which is calculated from the Receiver Operating Characteristic (ROC) curve. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings (though it doesn’t show which threshold corresponds to a specific point on the curve). As the name says, the AUC is simply the area under the ROC curve; it is 1 for a perfect classifier and 0.5 for random guessing. The ROC curve for the CNN is shown in Figure 4, and the associated AUC score is a very respectable 0.994.

Figure 4: Confusion matrix and ROC curve of the CNN

Multi-Class Classification

To make things more challenging, we also tested the algorithms on five different classes (dog, octopus, bee, hedgehog, giraffe), using 2,500 images of each class for training. As expected, the ranking was similar to before, but the accuracies were lower: 79% for the Random Forest, 81% for the MLP, 82% for KNN, and 90% for the CNN. Looking at the actual probabilities placed on each class gives a better idea of what the classifier is predicting. Figure 5 shows some examples for the CNN. Most predictions come with a high degree of certainty, which is not surprising given an accuracy of 90%: if the predictions are well calibrated (which they are in this case), the average certainty of the predicted class must be 90% as well.

Figure 5: Some CNN prediction probabilities

Figure 6 shows a few predictions that were incorrect; the true label is shown below each image. While these are rather unconventional drawings, competent humans might well recognize them all, or at least give more accurate probability distributions. There is certainly room for improvement, the easiest route being to simply increase the number of training examples (we only used about 2% of the available images).

Figure 6: Some drawings that were misclassified by the CNN. The true label is shown below the image.

In conclusion, what have we learned from this analysis? If you take a look at the code, you will see that implementing a CNN in Python takes more effort than the regular scikit-learn classifiers, which require just a few lines. The superior accuracy of the CNN makes this investment worthwhile, though. Due to its huge size, the “Quick, Draw!” dataset is very valuable if you’re interested in image recognition and deep learning. We have barely scratched the surface here, and there remains huge potential for further analysis. To check out the code and some additional analyses, visit our Github page. We’re very happy to receive any feedback and suggestions for improvement by email.

A blog post by David Kradolfer
Much has been written on the most popular software and programming languages for Data Science (recall, for instance, the infamous “Python vs R” battle). We approached this question by scraping job ads from Indeed and counting the frequency at which each software tool is mentioned, as a measure of current employer demand. In a recent blog post, we analyzed the Data Science software Swiss employers want job applicants to know (showing that Python has a slight edge in popularity over R). In this post, we look at worldwide job ads and analyze them separately for different job titles. We included 6,000 positions for Data Scientists, 7,500 for Data Analysts, 2,200 for Data Engineers and 1,700 for Machine Learning specialists, as well as a population of tens of thousands of data professionals on LinkedIn.

Our leading questions: Which programming languages, database technologies, machine learning libraries and business analysis tools are most popular? How well does the employers’ demand for software skills match the supply, i.e. the skills data professionals currently possess? And are there software skills in high demand that are rare among data professionals? (Yes!) Check out the results below. A detailed description of our methods can be found in this post.
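The counting step itself is straightforward. As a rough sketch (not our actual scraping pipeline), the share of ads mentioning each tool can be tallied as follows, assuming the scraped ad texts are available as a list of plain strings.

# Illustrative sketch: share of job ads mentioning each tool.
# Assumes `job_ads` is a list of plain-text ad strings already scraped from Indeed.
import re

keywords = ['Python', 'R', 'Java', 'Scala', 'SQL', 'Hadoop', 'Spark', 'SAS', 'Tableau']

def mention_share(job_ads, keywords):
    shares = {}
    for kw in keywords:
        # Word boundaries and case-sensitive matching keep short names like "R"
        # from matching ordinary words.
        pattern = re.compile(r'\b' + re.escape(kw) + r'\b')
        hits = sum(1 for ad in job_ads if pattern.search(ad))
        shares[kw] = 100.0 * hits / len(job_ads)
    return shares

# Example: mention_share(data_scientist_ads, keywords) -> {'Python': 62.3, 'R': 61.8, ...}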
The programming language most in demand: Python

Mentioned in over 60% of current job ads, Python is the top programming language for Data Scientists, Machine Learning specialists and Data Engineers. R is highly popular in Data Science job openings (62%), but less so in Machine Learning and Data Engineering. Java and Scala are important in Data Engineering, where they rank second and third with frequencies of 55% and 35%. In Machine Learning, Java and C/C++/C# are the second most popular languages, at about 40% each. Data Analyst positions require less programming, with Python and R each reaching about 15%.

Figure 1: Most popular programming languages for data-related job descriptions

The leading database technology: SQL

SQL is the database skill most in demand for Data Scientists and Data Analysts, mentioned in 50% of current job ads. For Data Engineering, SQL, Hadoop and Apache Spark are about equally popular, at roughly 60% each. The same is true for Machine Learning, but at a much lower level of about 20%.

Figure 2: Most popular database technologies

Most popular machine learning library: TensorFlow

For Machine Learning jobs, TensorFlow leads the list with 20%, followed by Caffe, scikit-learn and Theano at about 9% each. In Data Science job ads, scikit-learn is at 7% and slightly more frequent than TensorFlow.

Figure 3: Most popular Machine Learning libraries

Business-related software in high demand

Other frequent requirements are commercial software tools for statistics, business intelligence, visualization and enterprise resource planning (ERP). In this space, SAS and Tableau are the top mentions, followed by SPSS.

Figure 4: Software for statistics, business intelligence and enterprise resource planning

Scala programmers are a rare breed among Data Scientists

How well do these employer requirements match the skills Data Scientists actually have? To obtain a rough estimate of the number of professionals knowledgeable in the top-ranked programming languages, we searched LinkedIn for the different languages and filtered the results for people with the job title of Data Scientist. Figure 5 shows this Data Scientist population in relation to the number of Data Science job ads demanding the skill. By a large margin, Python and R turn out to be the most widespread programming skills in the Data Scientist population. Interestingly, MATLAB, C/C#/C++ and Scala are roughly equally frequent in job ads, but the number of professionals possessing these skills varies greatly. Scala is the rarest skill by far, while MATLAB expertise is fairly common, being five times as frequent in the Data Scientist population as Scala. Knowledge of the Apache ecosystem (Hadoop, Spark and Hive) is also still relatively rare among Data Scientists.

Figure 5: Software skills – current supply (proxy: LinkedIn) vs. demand (proxy: Indeed)

In conclusion, which programming languages should you focus on? Python and R remain the top choices, and mastering at least one of them is a must. If you decide to learn or refine additional languages, pick Scala, which many Data Scientists expect to be the language of the future (Jean-François Puget analyzed the impressive rise of Scala in search trends on Indeed). A further language holding significant potential is Julia. Julia is mentioned in only 2% of current Data Science job ads on Indeed, but its trend line is highly promising.

Post by David Kradolfer
Social science researchers collect much of their data through online surveys, and in many cases they offer incentives to the participants. These incentives can take the form of lotteries for more valuable prizes or of individual gift card codes. We do the latter in our studies here at CEPA Labs at Stanford: our survey participants receive a gift card code from Amazon.

However, sending these gift card codes to respondents is challenging. In Qualtrics, our online survey platform, we can embed a code for each potential respondent and then trigger an email with the code attached after their survey is completed. While this is a very convenient feature, it has one substantial drawback: we need to purchase all codes up front, yet many participants may never answer. There is the option of waiting until the survey is closed and then purchasing the codes for all respondents at once. However, respondents tend to become impatient if they do not receive their code in a timely manner and start reaching out to you, and possibly to the IRB office. This creates administrative work and might reduce response rates if potential respondents talk to each other about their experience. Furthermore, we reach most of our participants with text messages and have found from experience that emails often go unnoticed or end up in the spam folder.

Given these problems, we decided to send codes using Python and the Qualtrics API. This way, we can send codes immediately and do not need to purchase all codes up front. We used the Amazon Incentives API, which allows its users to request gift card codes on demand; codes are generated on the fly, and the amount is charged to our account. An alternative would be Giftbit, which even lets you send gift links with a choice of gift card providers. I believe this approach would be useful to many social science researchers.

After I had developed the program, we immediately thought of a second project where we already had enough codes on hand for the anticipated 70% response rate. We stored those codes in a .csv file. In this post, for simplicity, I will describe a Python program that gets the codes from a .csv file and sends them out by email. The other (original) program fetched the codes from the Amazon Incentives API and sent them out via the EZtexting.com API. Both versions can be found on GitHub. It is also possible to send text messages via email.

The program checks Qualtrics continuously for new responses and then sends each new respondent a code. In a loop, it downloads (new) responses, writes their contact information to an SQL database, assigns a code and adds it to the database, and then sends the codes by email. The program is written so that, if it gets interrupted, it can simply be executed again without any problem.

Before I go into detail, here is a quick disclaimer: I am not a professional Python developer. If you see ways to improve the program, please let me know; I am always happy to learn new things. I also will not take any responsibility for the proper functioning of the program. If you decide to use it, it is your responsibility to adapt it to your application and to test it thoroughly. Furthermore, you can find more scripts for working with the Qualtrics API on GitHub. I used Python 2.7.11 (64-bit), and SQLite needs to be installed.

Let’s get started by importing the required packages and setting up some parameters.
The location for the survey download, the SQLite database, the backups, and the file containing the codes need to be specified. All elements of the code that you need to change are marked as placeholders (e.g. '---Your Token---') or obvious dummy values (e.g. 'D:/YOUR PROJECT FOLDER').

import requests
import zipfile
import pandas as pd
import os
import sqlite3
import datetime
import time
import shutil
import smtplib
import sys
from email.MIMEMultipart import MIMEMultipart
from email.MIMEText import MIMEText
from email.Utils import COMMASPACE, formatdate

# Set the path of the main folder
path = 'D:/YOUR PROJECT FOLDER'
os.chdir(path)

# Set the paths for data export and storage
SurveyExportPath = path + '/DownloadFolder'
SQLDatabasePath = path + '/DataBase/'
SQLBackupPath = path + '/DataBase/Archive/'
SQLDatabaseName = 'ResponseTracker.sqlite'

# Set the path for the file that holds the gift card codes
CODEPath = path + '/---Your file with codes---.csv'

Next, I declare which columns of the Qualtrics data will be loaded into memory later. You can adjust the program, for instance, to allow for different gift card amounts, send the codes to phone numbers, check whether the respondent has answered enough questions, or include their name in the email. To do so, you will have to upload that information within the contact lists, embed it within the Qualtrics survey flow, and adjust the columns here to import the columns containing the respective information.

# Columns to include when reading in data from the surveys
pdcolumns = ['ResponseID', 'ExternalDataReference', 'RecipientEmail']

Here, you need to declare a dictionary with the Qualtrics survey IDs and survey names of all surveys that you want to include. The survey IDs are the keys; you can find them in the Qualtrics account settings. Note that it is important to add '.csv' to the name, because the survey data is downloaded from Qualtrics as “SURVEYNAME.csv”.

# List survey IDs and file names in a dictionary. It is possible to include multiple surveys
surveyIdsDic = {'---Your survey ID 1---': '---Your survey name 1---.csv',
                '---Your survey ID 2---': '---Your survey name 2---.csv',
                '---Your survey ID 3---': '---Your survey name 3---.csv'}
surveyIds = surveyIdsDic.keys()

The next variables hold the number of times this script checks for new answers and sends out the codes, the time it waits between iterations in seconds, and the number of iterations it waits before creating a backup of the SQL database. With these settings, the program would run for at least 694 days (not counting the time each iteration itself takes) and create a backup roughly every five hours.

# Number of repetitions for the loop, time to wait between iterations in seconds,
# and number of iterations between database backups
reps = 1000000
waittime = 60
backupn = 300

Next, we need to set up the parameters of the Qualtrics API call. You have to declare your API token, which you can find in the account settings of your Qualtrics account, the format of the downloaded file (.csv), and your data center. The following code declares the API URL and the information to be sent in the API call.

# Setting user parameters for the Qualtrics API
# Add your Qualtrics token and data center
apiToken = "---Your Token---"
fileFormat = "csv"
dataCenter = "---Your data center ---"

# Setting static parameters for the Qualtrics API
baseUrl = "https://" + dataCenter + ".qualtrics.com/API/v3/responseexports/"
headers = {
    "content-type": "application/json",
    "x-api-token": apiToken,
    }

The program defines a function to create the message to the respondent containing the gift code. In this simple version, it only takes the code as a parameter.
It could easily be extended to be personalized with the respondent’s name. The message is in HTML and can be styled with inline CSS.

# Set the email message that sends the code, in HTML
def genMessage(code):
    message = """
    <html>
      <head></head>
      <body>
        <strong>Thank you!</strong>
        Thank you very much for taking the time to complete our survey!
        Please accept the electronic gift certificate code below.
        <span style="font-weight: bold; font-size:1.5em; text-align:center;">""" \
        + code + \
        """</span>
        Thank you again
      </body>
    </html>
    """
    return message

The program then defines a function to send the message to the respondent by email. You need to specify your correspondence email address and your SMTP host. The function could in principle take a list of email addresses, but that is not used in this program.

# Specify your email host and email address
def sendMail(to, subject, text):
    assert type(to) == list
    fro = "---Your email address---"  # add your correspondence email address to send out the code
    msg = MIMEMultipart()
    msg['From'] = fro
    msg['To'] = COMMASPACE.join(to)
    msg['Date'] = formatdate(localtime=True)
    msg['Subject'] = subject
    msg.attach(MIMEText(text, 'html'))
    smtp = smtplib.SMTP('---Your mail host---')  # add your SMTP host
    smtp.sendmail(fro, to, msg.as_string())
    smtp.close()

The next function creates a backup of the SQL database.

def createBackup(path, database):
    # Check if the provided path is valid
    if not os.path.isdir(path):
        raise Exception("Backup directory does not exist: {}".format(path))
    # Define the file name for the backup, including date and time
    backup_file = os.path.join(path, 'backup' + time.strftime("-%Y%m%d-%H%M") + '.sqlite')
    # Lock the database before making a backup
    cur.execute('begin immediate')
    # Make the new backup file
    shutil.copyfile(database, backup_file)
    print ("\nCreating {}...".format(backup_file))
    # Unlock the database
    sqlconn.rollback()

Next, the program sets up the SQL database schema. It connects to the database (creating the file if it does not exist) and creates two tables, “Survey” and “Respondent”, if they do not already exist. “Respondent” tracks the responses and the gift codes; “Survey” tracks the surveys that have been answered. If the database already exists, nothing happens. This SQL code ensures that no respondent is paid twice and that the same gift card code is not sent to more than one respondent by accident: the UNIQUE attribute guarantees that the program only creates a new record if the response ID, the respondent ID, the email address, and the gift card code do not already exist.

# SQL schema
# Database path + file name
database = SQLDatabasePath + SQLDatabaseName

# Connect to SQLite
sqlconn = sqlite3.connect(database)
cur = sqlconn.cursor()

# Execute SQL code to create a new database with a table for respondents and one for surveys
# If the database and these tables already exist, nothing will happen
cur.executescript('''
CREATE TABLE IF NOT EXISTS Survey (
    id      INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    name    TEXT UNIQUE
);
CREATE TABLE IF NOT EXISTS Respondent (
    id              INTEGER NOT NULL PRIMARY KEY AUTOINCREMENT UNIQUE,
    individual_id   INTEGER UNIQUE,
    response_id     TEXT UNIQUE,
    survey_id       INTEGER,
    email           TEXT UNIQUE,
    create_date     DATETIME DEFAULT (DATETIME(CURRENT_TIMESTAMP, 'LOCALTIME')),
    redeem_code     TEXT UNIQUE,
    sentstatus      INTEGER,
    sent_date       DATETIME
)
''')

# Commit (save) the changes to the database
sqlconn.commit()

After all the setup, the actual program starts here. Everything below will be repeated as many times as specified above.
The program prints the date, the time, and the number of the iteration to the console so that we can check that it is running properly. Everything that follows is indented one level, as it runs inside this main loop.

# Everything below will be repeated for the specified number of iterations
for t in xrange(reps):

    # Provide some information in the console
    print "----------"
    print "Iteration:", t, "Date:", datetime.datetime.now().strftime("%m/%d/%y %H:%M:%S")

First, previously downloaded survey data files are deleted. This step is not strictly necessary, as the next download would replace these files. However, surveys often fail to download because of API connection problems, and in that case the old, already processed files would be processed again, which would slow down the program.

    # Delete all files in the extraction path, in case they don't get downloaded in this iteration
    deletelist = [os.path.join(subdir, file) for subdir, dirs, files in os.walk(SurveyExportPath) for file in files]
    for file in deletelist:
        os.remove(file)

The next bit of code downloads the responses for each survey separately, iterating over the surveys. For each survey, it first tries to fetch the ID of the last recorded response from the SQL database. If someone has answered the survey before, this last response ID is passed as a parameter to the Qualtrics API call, so that only responses received afterwards are downloaded. If no one has answered the survey yet, an exception handler sets the parameters such that all responses will be downloaded. The program then initiates the export. Once the export is complete, a zip file with the data is downloaded. The program catches several API connection errors and allows the export to take at most three minutes, so it does not get hung up at this step. If an error occurs, the survey is not downloaded in this iteration, and the program tries again in the next one. Finally, the program unzips the downloaded file.

    # Iterate over survey IDs to download each one separately
    for surveyId in surveyIds:
        survey = surveyIdsDic[surveyId]  # Identify the survey
        try:
            # Fetch the last response id from the database, used to download only new responses
            cur.execute('''SELECT response_id FROM Respondent
                           WHERE id == (SELECT max(id) FROM Respondent
                                        WHERE survey_id == (SELECT id FROM Survey WHERE name == ?))''', (survey,))
            lastResponseId = cur.fetchone()[0]
            # Set the parameters to send to the Qualtrics API
            downloadRequestPayload = '{"format":"' + fileFormat + '","surveyId":"' + surveyId + '","lastResponseId":"' + lastResponseId + '"}'
        # Exception for the case that no one has answered this survey yet
        except (TypeError, sqlconn.OperationalError) as e:
            print e
            # Set the parameters without a last response id (all responses will be downloaded)
            downloadRequestPayload = '{"format":"' + fileFormat + '","surveyId":"' + surveyId + '"}'
        downloadRequestUrl = baseUrl
        try:
            # Connect to the Qualtrics API and send the download request
            downloadRequestResponse = requests.request("POST", downloadRequestUrl, data=downloadRequestPayload, headers=headers)
            progressId = downloadRequestResponse.json()["result"]["id"]
            # Check on the data export progress and wait until the export is ready
            startlooptime = time.time()  # Record the time to make sure the loop doesn't run forever
            requestCheckProgress = 0
            complete = 1
            # As long as the export is not complete, keep checking
            # (status check as in the standard Qualtrics response-export example script)
            while requestCheckProgress < 100:
                requestCheckUrl = baseUrl + progressId
                requestCheckResponse = requests.request("GET", requestCheckUrl, headers=headers)
                requestCheckProgress = requestCheckResponse.json()["result"]["percentComplete"]
                if time.time() - startlooptime > 180:
                    # Abort if the download takes more than 3 minutes
                    print "Download took more than three minutes. Try again next iteration."
                    complete = 0
                    break

            # If the export completed, download and unzip the file
            if complete == 1:
                requestDownloadUrl = baseUrl + progressId + '/file'
                requestDownload = requests.request("GET", requestDownloadUrl, headers=headers, stream=True)
                with open("RequestFile.zip", "wb+") as f:
                    for chunk in requestDownload.iter_content(chunk_size=1024):
                        f.write(chunk)
        except (KeyError, requests.ConnectionError, ValueError, IOError) as e:
            # Something went wrong with the Qualtrics API (retry in the next iteration)
            print "Survey not downloaded: ", e
            continue
        try:
            zipfile.ZipFile("RequestFile.zip").extractall(SurveyExportPath)
        except (zipfile.BadZipfile, IOError) as e:
            print SurveyExportPath
            print "Zipfile not extracted: ", e
            continue

Once the files have been downloaded and unzipped, the program loops over all downloaded files. It loads the data of each file into memory and, if there are new responses in the data, adds them to the SQL database. The IGNORE in the SQL code ensures that a response that has already been recorded is not added again, and that no error is raised. The program saves the changes to the SQL database after each record is created; this reduces speed but avoids problems from any possible connection errors.

    SurveyFileList = [os.path.join(subdir, file) for subdir, dirs, files in os.walk(SurveyExportPath) for file in files]
    for file in SurveyFileList:
        data = pd.read_csv(file, encoding='utf-8-sig', sep=',', usecols=pdcolumns, low_memory=False, error_bad_lines=False)
        data = data.iloc[2:, :].reset_index(drop=True)  # Drop the first two rows (extra Qualtrics header rows)
        survey = file.split('\\')[-1]  # Identify the survey
        if len(data.index) > 0:  # Only if new responses are recorded
            for row in xrange(0, len(data.index)):
                individual_id = data.loc[row, 'ExternalDataReference']
                response_id = data.loc[row, 'ResponseID']
                email = data.loc[row, 'RecipientEmail']
                # Record the survey name
                cur.execute('''INSERT OR IGNORE INTO Survey (name) VALUES (?)''', (survey,))
                # Fetch the survey id to enter in the respondent table
                cur.execute('''SELECT id FROM Survey WHERE name == ?''', (survey,))
                survey_id = cur.fetchone()[0]
                cur.execute('''INSERT OR IGNORE INTO Respondent (email, individual_id, response_id, survey_id)
                               VALUES (?,?,?,?)''', (email, individual_id, response_id, survey_id))
                sqlconn.commit()

After all new responses have been added to the SQL database, the program assigns a gift card code to each new respondent. It first collects all respondents without a gift card code from the SQL database. It then loads the file with the gift card codes into memory and fetches the last gift card code that was assigned to a respondent. It selects the codes that come after this last code and assigns them to the new respondents. If no codes have been assigned before, an error is raised and the exception handler makes sure that the first code is used. The newly assigned codes are added to the SQL database. The program counts the number of codes assigned and logs it to the console after all new respondents have been assigned a code.
    # Select new respondents who need a code
    cur.execute('''SELECT id FROM Respondent WHERE redeem_code IS NULL''')
    NeedGiftCards = cur.fetchall()
    numCodesAssigned = 0  # Count the number of codes assigned
    if len(NeedGiftCards) > 0:  # Only if there are new respondents
        # Import the csv file that holds the codes
        allcodes = pd.read_csv(CODEPath, encoding='utf-8-sig', sep=',', low_memory=False, error_bad_lines=False)
        # Identify the last redeem code used
        try:
            cur.execute('''SELECT redeem_code FROM Respondent
                           WHERE id == (SELECT max(id) FROM Respondent WHERE redeem_code IS NOT NULL)''')
            lastcode = cur.fetchone()[0]
            row = allcodes[allcodes.code == lastcode].index.values[0]  # Get the index value of the last code used
        except TypeError:
            row = -1
        usecodes = allcodes[allcodes.index > row]  # Select all codes after that value
        for needcard in NeedGiftCards:
            row += 1
            # Extract the data
            sqlDB_id = needcard[0]
            redeem_code = usecodes.code[row]
            numCodesAssigned += 1
            # Add the code to the SQL database
            cur.execute('''UPDATE Respondent SET redeem_code = ? WHERE id == ?''', (redeem_code, sqlDB_id))
            sqlconn.commit()
    print 'Number of gift card codes assigned:', numCodesAssigned

Finally, the program fetches all new responses for which the codes have not been sent out yet. For each respondent, it creates a message including the code and sends it via email. It counts the number of emails sent and logs it to the console to keep track.

    # Get all contacts and codes for which the code has not been sent yet
    cur.execute('''SELECT id, email, redeem_code FROM Respondent
                   WHERE redeem_code IS NOT NULL AND sentstatus IS NULL''')
    contacts = cur.fetchall()
    numCodesSent = 0  # Count the number of codes sent
    if len(contacts) > 0:  # Only if there are new respondents
        for contact in contacts:
            sqlDB_id = contact[0]
            email = contact[1]
            code = contact[2]
            message = genMessage(code)
            TOADDR = [email]
            # Send the message
            try:
                sendMail(TOADDR, "Thank you for your participation!", message)
                numCodesSent += 1
                sentstatus = 1
                cur.execute('''UPDATE Respondent
                               SET sentstatus = ?, sent_date = datetime(CURRENT_TIMESTAMP, 'localtime')
                               WHERE id == ?''', (sentstatus, sqlDB_id))
                sqlconn.commit()
            except:
                e = sys.exc_info()[0]
                print "Error:", e
                continue
    print 'Number of codes sent:', numCodesSent

The very last step is to create a backup if it is time for one. After all iterations are done, the program closes the SQL database connection.

    if t % backupn == 0:
        # Create a timestamped copy of the database
        createBackup(SQLBackupPath, database)

    time.sleep(waittime)

# Close the SQL connection
sqlconn.close()

I hope this program can help you with your projects. Feel free to contact me with questions or suggestions.

Guest post by Hans Fricke, Director of Research at CEPA Labs, Stanford University. This article was first published on the CEPA Labs blog.

Latest Jobs

Arobase Geneva, Switzerland
23/10/2017
Full time
Arobase SA selects and recruits candidates in the IT sector, for permanent and temporary positions. It is part of Interiman Group, one of the 4 leading players in this sector in Switzerland, with over 60 specialized agencies. Our driving force is "the passion for your success". Our consultants are specialists in their field; they will highlight your skills, listen to your needs, and find the job that suits you.

For one of our clients based in Geneva, we are looking for a Python/PHP Web Developer.

We are looking for a Back-End Web Developer responsible for managing the interchange of data between the server and the browser. Your primary focus will be the development of all server-side logic, the definition and maintenance of the central database, and ensuring high performance and responsiveness to requests from the front end. You will also help integrate the front-end elements built by your front-end coworkers into the application. A basic understanding of front-end technologies is therefore necessary as well.

Tasks:
- Conceive and build the core logic of the application
- Build reusable code and libraries for future use, following code style conventions
- Optimize the application for maximum speed and scalability
- Implement security and data protection
- Integrate user-facing elements developed by front-end developers with server-side logic
- Design and implement data storage solutions
- Design the database model
- Help set up the system infrastructure

Skill set:
- Proficient knowledge of a back-end programming language, first and foremost Python, preferably with additional knowledge of PHP7 or server-side JavaScript
- Excellent command of back-end frameworks such as Django and Flask
- Creating database schemas that represent and support business processes
- Understanding accessibility and security compliance (depending on the specific project)
- User authentication and authorization between multiple systems, servers, and environments
- Integration of multiple data sources and databases into one system
- Management of the hosting environment, including database administration and scaling an application to support load changes
- Data migration, transformation, and scripting
- Setup and administration of backups
- Outputting data in different formats
- Understanding the differences between delivery platforms such as mobile vs. desktop, and optimizing output for the specific platform
- Implementing automated testing platforms and unit tests
- Proficient understanding of code versioning tools such as Git / GitHub
- Proficient understanding of OWASP security principles
- Understanding of "session management" in a distributed server environment

About the client:
- Startup
- Quality oriented
- Entrepreneurial spirit

Interested in a new challenge? Feel free to apply.

CONTACT
Agency: Arobase SA - Geneva
Address: Place de la Fusterie 9-11, 1204 Genève
Tel: +41 22 552 90 20
Mirai Solutions Zürich, Switzerland
23/10/2017
Full time
As Junior IT Consultant you will contribute to all aspects of client projects and receive the necessary support to meet the demands of daily project work. You identify and concretize business requirements in close cooperation with business users, provide technical specifications, and participate in the design, architecture, development, documentation, testing and operations of customized data analysis and risk quantification applications. You also contribute to internal projects, which include the implementation of prototypes, frameworks and solutions supporting our business development.

Basic qualifications:
- Fluency in English, both written and oral
- Master's degree in Computer Science or a related field (Engineering, Math, ...)
- Strong understanding of functional programming principles and object-oriented design
- Solid understanding of data structures, algorithms, distributed systems and concurrent programming
- Ability to think critically and arrive at elegant solutions to challenging problems
- Proficiency in Scala and/or Java
- Experience in developing in Linux and Windows environments
- Ability, passion and drive to continually improve your knowledge and learn new technologies
- Excellent communication skills
- Team-oriented, reliable and self-driven
- Service orientation with a strong focus on customer satisfaction
- Living in CH

Additional preferred qualifications:
- Knowledge of at least one additional modern programming language such as C# or C/C++
- Knowledge of at least one statistical/technical computing language such as R, Python or MATLAB
- Fluency in German, both written and oral
- Experience with modern web technologies such as JavaScript and HTML5

If you are a fast learner and interested in contributing to a small but strong team in a dynamic environment, please send your application (cover letter, CV, references), indicating your preferred and earliest possible starting date, to:

Mirai Solutions GmbH
Guido Maggio
recruiting@mirai-solutions.com

If you have any additional questions about the job offer or would like more information, please contact us by email.

Mirai Solutions GmbH  |  Tödistrasse 48  |  CH-8002 Zürich  |  info@miraisolutions.com
CK Group Basel, Switzerland
23/10/2017
Full time
Consultant: Jennifer Woolley
Contact: 01438 768 710
Discipline: Clinical Data Management

Jenni Woolley is recruiting for a Trial Data Scientist to join a company in the pharmaceutical industry at their site in Basel, Switzerland, on a 12-month contract.

The main purpose of the role will be to:
- Lead Clinical Development Data Management (CDDM) activities for the assigned study
- Write all study Data Management documents
- Organise, monitor and track the testing of data entry screens, data cleaning/review tools, and their implementation in the production environment
- Organise, monitor and track data cleaning, data review, query management, and database lock; make sure processes are driven in collaboration with key Global Clinical Development stakeholders
- Generate study metrics and status reports
- Ensure SDTM deliverables are created, validated and provided as per agreed timelines
- Perform and/or coordinate QC

Further responsibilities will include:
- Provide CDDM input to the development of the study protocol
- Coordinate the development of the e(CRF) in line with company standards
- Represent CDDM at Clinical Trial Team meetings
- Manage and be accountable for CDDM activities in studies where CDDM is outsourced
- Participate in the development and review of policies, SOPs and associated documents for CDDM
- Forecast study team resource requirements
- Ensure Data Scientists assigned to the study have the required training

In order to be considered for this role, you will be required to have the following qualifications, skills and experience:
- Recognised degree in life sciences, mathematics, statistics, informatics or related disciplines
- Proven experience within the pharmaceutical or biotech industry in Clinical Research and/or Clinical Data Management, with a good level of functional expertise in Data Management
- Good knowledge of international clinical research regulations and requirements
- Experience in clinical trial databases and applications, clinical data flow, data review and CRF design
- Ability to lead and coordinate the activity of Data Scientists allocated to his/her study
- Fluent in written and spoken English

For more information or to apply for this position, please contact Jenni Woolley on 01438 768 710 or email jwoolley@ckclinical.co.uk. Alternatively, please click on the link below to apply online now. Please note that your CV should show exact dates of employment (month and year), and any gaps of a month or more should be explained.

CK Group is an Equal Opportunities employer and welcomes applications from all who meet our selection criteria. If you do not hear back from us within 5 working days of your application for this role, it means that on this occasion you have not been shortlisted for the next stage of the recruitment campaign. Entitlement to work in the EEA is essential. Please quote reference CL40422 in all correspondence.

CK Group - registered company in England & Wales No: 2611749. Registered office address: Brunswick House, 4 The Bridge Business Centre, Beresford Way, Chesterfield, S41 9FG
Arobase Geneva, Switzerland
23/10/2017
Full time
Arobase SA selects and recruits candidates in the IT sector, for fixed and temporary positions. It is part of Interiman Group, one of the 4 leading recruitment companies in Switzerland, with over 60 specialized agencies. Our driving force is "the passion for your success". Our specialized consultants will know how to highlight your core competences, listen to your needs, and find the job that fits your personality.

For one of our clients based in Geneva, we are recruiting a Microsoft BI SSAS Developer.

Within the BI team and under its governance, the BI SSAS Developer's mission is to provide analysis, design and deployment for SSAS solutions, implementing project good practices and quality processes.

Responsibilities:
- Provide analysis, design, implementation and support for BI solutions, in particular SSAS solutions: data quality management, master data management, data modeling, ETL, reporting & dashboards
- Ensure the understanding of business needs and be able to translate business needs into technical solutions
- Carry out developments using the Agile methodology
- Design and build prototypes, ensure industrialization
- Adapt quickly to several functional domains
- Participate in the promotion of BI solutions among the business
- Ensure all tasks are performed in compliance with the security and quality aspects described in the related Corporate Standard Operating Procedures
- Participate in BI support
- Provide timely, complete and accurate responses to management inquiries in a positive and constructive manner
- Complete ad hoc tasks and projects as required by the Company
- Proactively communicate relevant opportunities, risks and issues to management
- Participate in defining and implementing BI project good practices and quality processes

Performance skills:
- Listening and understanding
- Problem solving
- Commitment to tasks
- Analytical skills
- Dealing with ambiguity
- Organization and planning
- Curious and willing to learn
- Team player
- Good communication skills, spoken and written

Qualifications and experience:
- Minimum 3 years of experience in BI development - Required
- Processes, systems and data minded - Preferred
- Experience in Agile methodology - Preferred

Technical:
- Minimum 1 year of experience in SSAS development - Required
- Excellent knowledge of DAX and/or MDX - Required
- Knowledge of SSAS tabular & multidimensional - Required
- Good knowledge of the Microsoft BI suite - Preferred
- Proven experience in data modeling - Required
- Proven experience in data quality or data management - Preferred
- Knowledge of the SQL language - Required
- Knowledge of Informatica PowerCenter is an asset
- Knowledge of any other BI reporting tool appreciated

Languages:
- Fluency in French (oral and written)
- Fluency in English (oral and written)

Interested in a new challenge? Feel free to apply.

CONTACT
Agency: Arobase SA - Geneva
Address: Place de la Fusterie 9-11, 1204 Genève
Tel: +41 22 552 90 20

