-
Accessing Redshift using Python and PySpark
Redshift is a cloud-based data warehousing service provided by Amazon Web Services, while Pandas is a popular data analysis library for the Python programming language, and PySpark is a powerful data processing engine that can handle large-scale data processing tasks. In this article, we will explore how to use Pandas and PySpark to read data from Redshift, enabling us to process and analyze large datasets efficiently. Further, we will also explore how to write data to a Redshift sandbox.
-
Make Your Own Python Package and Share with Others in AWS SageMaker
Packaging in Python is the process of creating distributable packages of code that can be easily installed and used by other developers. These packages can contain modules, functions, classes, and other components that can be shared and reused across different projects.
-
Customer Segmentation
Customer segmentation is a critical component of marketing that helps businesses understand their customers better and tailor their marketing strategies to their specific needs. One popular technique for customer segmentation is k-means clustering, which groups customers based on their similarities in various attributes. In this article, we’ll discuss how you can use k-means clustering to segment your customers and extract valuable insights from your data.
-
Analyzing Website Traffic Using Google Analytics and AWS
Web analytics refers to the collection, measurement, analysis, and reporting of web data to understand and optimize web usage. It involves gathering data on user behavior on websites, such as pageviews, time spent on a page, clickthrough rates, and conversion rates, and analyzing this data to gain insights into user behavior and website performance. These insights can be used to make informed decisions about website design, content, and marketing strategies to improve user engagement, increase traffic, and drive conversions. Web analytics tools, such as Google Analytics, provide a range of metrics and reports to track and analyze website performance. In this blog entry I will explore how to do web analytics using a combination of Google Analytics and AWS (Amazon Web Services).
-
Burnout in Analytics Teams
In today’s fast-paced business environment, analytics teams are playing an increasingly critical role in driving decision-making and business strategy. However, the high pressure and demands placed on analytics teams can lead to burnout, which can negatively impact both individual team members and the overall success of the team.
-
30 Day Daily Map Challenge
The #30DayMapChallenge was a daily social mapping project held in November 2020 that I participated in. I had a good understanding of Python before participating but very little knowledge of working with GIS data. I had plotted some data on maps using Microsoft Excel, where the data was usually in tabular format in an Excel or CSV file. However, I had never heard of shapefiles, DEM (Digital Elevation Models), DSM (Digital Surface Models), or DTM (Digital Terrain Models). Understanding these formats, what they contain, and how they can be processed was the biggest challenge I encountered.
-
30 Day Daily Chart Challenge
During the month of April 2021, I participated in the #30DayChartChallenge, a daily social data project hosted by Cédric Scherer, Maya Gans and Dominic Royé. In this blog post I want to briefly talk about the project and share my experience, what I learned about the data and tools used, and the challenges I faced. However, before I delve into that, here is a collage of my submissions.
-
TidyTuesday Weekly Visualization Challenge
Since October 2020, I have been participating in TidyTuesday, a weekly social data project hosted by the R for Data Science Online Learning Community. This enabled me to quickly up-skill the visualization aspect of my data science journey. I also had the pleasure of engaging with several like-minded people and of curating and learning from their submissions and thought processes. In the following post, I share my experience and learnings.
-
Important GCP Services that you need to Know Now
Google Cloud Platform (GCP) is a cloud computing platform offered by Google. GCP provides a comprehensive set of tools and services for building, deploying, and managing cloud applications. It includes services for compute, storage, networking, machine learning, analytics, and more. Some of the most commonly used GCP services include Compute Engine, Cloud Storage, BigQuery, and Kubernetes Engine.
-
Important Azure Services that you need to Know Now
Azure is a cloud computing platform and set of services offered by Microsoft. It provides a wide range of services such as virtual machines, databases, storage, and networking, among others, that users can access and use to build, deploy, and manage their applications and services. Azure also offers a variety of tools and services to help users with tasks such as data analytics, artificial intelligence, and machine learning. Azure provides a pay-as-you-go pricing model, allowing users to only pay for the services they use.
-
Important AWS Services that you need to Know Now
Amazon Web Services (AWS) is a cloud-based platform that provides a wide range of infrastructure, platform, and software services. It was launched in 2006 and has since become one of the most popular cloud computing platforms in the world, used by individuals, small businesses, and large enterprises alike.
-
Implementing Self Organizing Maps using Python
SOM stands for Self-Organizing Map, which is a type of artificial neural network that is used for unsupervised learning and dimensionality reduction. SOMs are inspired by the structure and function of the human brain, and they can be used to visualize and explore complex, high-dimensional data in a two-dimensional map or grid.
-
Implementing Convolutional Neural Networks using Python
Convolutional Neural Networks (CNNs) are a type of deep neural network commonly used in computer vision tasks such as image classification, object detection, and segmentation. They are able to automatically learn and extract features from images, allowing them to identify patterns and structures in complex visual data.
-
Implementing Recurrent Neural Networks using Python
Recurrent Neural Networks, or RNNs, are a type of artificial neural network designed to process sequential data, such as time series or natural language. While traditional feedforward neural networks process each input independently, RNNs allow past inputs to influence the current output. This is done by introducing a loop within the network, so that the state computed at the previous step is fed back in alongside the next input.
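To make the feedback loop concrete, here is a minimal sketch of a single-unit recurrent cell in plain Python. The function names and the weights `w_x`, `w_h`, and `b` are illustrative assumptions, not taken from the article, and a real RNN would use weight matrices and a deep learning library.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrent step: mix the current input with the previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the cell, carrying the state forward."""
    h = 0.0  # initial state (assumed to be zero for this sketch)
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_x, w_h, b)  # the previous h is fed back in here
        states.append(h)
    return states

# The second input is 0.0, yet the second state is non-zero because the
# state produced by the first input is carried forward through the loop.
states = run_rnn([1.0, 0.0])
```

Running the same cell on `[0.0, 0.0]` yields all-zero states, which shows that the non-zero second state above comes entirely from the earlier input.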
-
Implementing Artificial Neural Networks using Python
Artificial Neural Networks (ANNs) are a type of machine learning model designed to simulate the function of a biological neural network. ANNs are composed of interconnected nodes, or artificial neurons, that process and transmit information to one another. The structure of an ANN consists of an input layer, one or more hidden layers, and an output layer.
-
Overview of Deep Learning Activation Functions
Activation functions are a key component of neural networks in deep learning. They are mathematical functions applied to the output of a neural network layer to determine whether or not a neuron should be activated (i.e., “fired”). This output is then passed to the next layer of the neural network for further processing. There are many different activation functions that can be used in deep learning, including sigmoid, ReLU, and tanh. The choice of activation function can have a significant impact on the performance of a neural network, so it is an important consideration when designing and training a deep learning model.
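As a rough illustration, the three activation functions named above can be written in a few lines of plain Python. The sample `pre_activations` values are made up for this sketch; in practice these functions are applied element-wise to the outputs of a whole layer.

```python
import math

def sigmoid(x: float) -> float:
    """Squash x into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x: float) -> float:
    """Pass positive values through unchanged and clamp negatives to zero."""
    return max(0.0, x)

def tanh(x: float) -> float:
    """Squash x into the open interval (-1, 1)."""
    return math.tanh(x)

# Hypothetical pre-activation values coming out of a layer.
pre_activations = [-2.0, 0.0, 3.0]
activated = {
    "sigmoid": [sigmoid(z) for z in pre_activations],
    "relu": [relu(z) for z in pre_activations],
    "tanh": [tanh(z) for z in pre_activations],
}
```

Comparing the three lists shows the practical difference: ReLU zeroes out the negative value, while sigmoid and tanh map every value into a bounded range.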
-
Overview of Deep Learning Techniques
Deep learning is a subset of machine learning that involves training artificial neural networks to learn and perform complex tasks. While both deep learning and machine learning involve training models on data to make predictions or decisions, deep learning models typically have many layers and are capable of learning increasingly complex representations of data, whereas traditional machine learning models often require feature engineering to create effective representations of data. Additionally, deep learning models are often better suited for tasks such as image recognition, speech recognition, and natural language processing, which require high-dimensional input data and benefit from the ability to learn hierarchical representations of features.
-
Boosting vs Bagging Model Improvement Techniques
In machine learning, there are two popular ensemble techniques for improving the accuracy of models: boosting and bagging. Bagging mainly reduces variance, the tendency of a model to overfit the training data, while boosting mainly reduces bias by combining many weak learners in sequence. While they have similar goals, they differ in their approach and functionality. In this article, we’ll explore the differences between boosting and bagging to help you decide which technique is right for your machine learning project.
-
Implementing XGBoost in Python
XGBoost (Extreme Gradient Boosting) is a popular algorithm for supervised learning problems, including regression, classification, and ranking tasks. In the financial services industry, XGBoost can be used for a variety of regression problems, such as predicting stock prices, credit risk scoring, and forecasting financial time series.
-
Implementing Reinforcement Learning in Python and R
Reinforcement learning is a branch of machine learning that involves training agents to make a sequence of decisions in an environment to maximize a reward function. The agent receives feedback in the form of a reward signal for every action it takes, and its goal is to learn a policy that maximizes the long-term expected reward. In this article, we’ll discuss how to implement reinforcement learning in Python.
-
Implementing Association Rule Learning using APRIORI in Python and R
Association rule learning is a popular technique used in the financial services industry for analyzing customer behavior, identifying patterns, and making data-driven decisions.
-
Implementing K-Means Clustering in Python and R
K-means clustering is a popular unsupervised learning technique used to cluster data points based on their similarity. In this article, we will explore what k-means clustering is, how it works, and how to implement it in Python and R.
-
Implementing Random Forest Classification in Python and R
Random Forest Classification is a machine learning algorithm used for classification tasks. It is an extension of the decision tree algorithm, where multiple decision trees are built and combined to make a more accurate and stable prediction.
-
Implementing Decision Tree Classification in Python and R
Decision tree classification is a widely used machine learning algorithm that is used to predict a categorical output variable based on one or more input variables. The algorithm works by constructing a tree-like model that maps the observations in the input space to the output variable. In this article, we will discuss how to implement decision tree classification in Python and R.
-
Implementing Logistic Regression in Python and R
Logistic regression (also known as the logit model) is a type of statistical analysis. It is often used for predictive analytics and modeling, and extends to applications in machine learning. In this approach, the dependent variable is finite or categorical: either A or B (binary regression) or a range of finite options A, B, C or D (multinomial regression). It is used to understand the relationship between the dependent variable and one or more independent variables by estimating probabilities using a logistic regression equation.
-
Implementing Random Forest Regression in Python and R
Random forest regression is a popular machine learning algorithm used for predicting numerical values. It is a variant of the random forest algorithm and is well-suited for regression problems where the response variable is continuous. In this article, we will learn how to implement random forest regression using Python and R.
-
Support Vector Regression
Support Vector Regression (SVR) is a type of regression algorithm that uses Support Vector Machines (SVM) to perform regression analysis. In contrast to traditional regression algorithms, which aim to minimize the error between the predicted and actual values, SVR aims to fit a “tube” around the data such that the majority of the data points fall within the tube. The goal of SVR is to find a function that is as flat as possible while keeping most points inside the tube; only points that fall outside the tube are penalized.
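The “tube” idea corresponds to what is usually called the epsilon-insensitive loss. The sketch below is only an illustration of that loss, not a full SVR solver; the function name and the default `epsilon` of 0.1 are assumptions made for the example.

```python
def epsilon_insensitive_loss(y_true: float, y_pred: float, epsilon: float = 0.1) -> float:
    """Return zero inside the tube, and the distance beyond it outside."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# A prediction within epsilon of the target costs nothing.
inside = epsilon_insensitive_loss(1.0, 1.05)

# A prediction outside the tube is penalized only for the part beyond epsilon.
outside = epsilon_insensitive_loss(1.0, 1.5)
```

A wider tube (larger `epsilon`) tolerates more error without penalty, which typically gives a simpler fit at the cost of precision.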
-
Implementing Linear Regression in Python and R
Regression is a supervised learning technique to predict the value of a continuous target or dependent variable using a combination of predictor or independent variables. Linear regression is a type of regression where the primary assumption is that the independent and dependent variables have a linear relationship. Linear regression is of two broad types: simple linear regression and multiple linear regression. In simple linear regression there is only one independent variable, whereas multiple linear regression uses two or more independent variables to predict the outcome of a dependent variable. Linear regression also has some modifications such as lasso, ridge, or elastic-net regression. However, in this article we will cover multiple linear regression.
-
An Overview of Machine Learning Techniques
Machine learning is a subfield of artificial intelligence (AI) that allows systems to learn and improve from experience without being explicitly programmed. Essentially, machine learning involves the use of algorithms that can learn from data and improve performance over time. This means that machine learning can be used to identify patterns and make predictions, and can be used in a wide variety of applications, such as image and speech recognition, fraud detection, recommender systems, and many more.
-
A Primer on Chi-squared test
The chi-square test is a statistical hypothesis test that is used to determine whether there is a significant association between two categorical variables. It is widely used in data analysis, particularly in fields such as social sciences, marketing, and biology, to examine relationships between categorical data. In this article, we will discuss the chi-square test, its applications, and how to perform it using Python.
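As a small sketch of the underlying calculation, the test statistic can be computed by hand from a contingency table of observed counts. The function name and the example table are hypothetical; turning the statistic into a p-value additionally requires the chi-square distribution, which libraries such as SciPy provide.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 table: rows are groups, columns are outcomes.
observed = [[10, 20], [20, 10]]
stat = chi_square_statistic(observed)
dof = (len(observed) - 1) * (len(observed[0]) - 1)  # degrees of freedom
```

A table whose rows have identical proportions, such as `[[10, 10], [20, 20]]`, gives a statistic of zero, which is what independence looks like in this calculation.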
-
A Primer on ANOVA
ANOVA (Analysis of Variance) is a statistical method used to analyze and test the differences between the means of three or more groups. ANOVA compares the variation within groups to the variation between groups to determine whether the differences in means are statistically significant or just due to random chance.
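The comparison of between-group and within-group variation can be sketched as a one-way ANOVA F statistic in plain Python. The function name and the sample groups are made up for illustration, and the sketch stops at the statistic rather than computing a p-value.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic: between-group variance divided by within-group variance."""
    k = len(groups)                   # number of groups
    n = sum(len(g) for g in groups)   # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical measurements for three groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_anova_f(groups)
```

A large F value means the group means differ by more than the spread within the groups would suggest.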
-
A Primer on T-tests
T-tests are a class of statistical tests used to determine whether there is a significant difference between the means of two groups of data. T-tests are often used to compare the mean of a sample to a population mean, or to compare the means of two independent samples or two paired samples.
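For the two-independent-samples case, the test statistic can be sketched with the standard library alone. This uses Welch's form, which does not assume equal variances; that choice, the function name, and the sample data are assumptions made for this example.

```python
import math
from statistics import mean, variance

def welch_t_statistic(a, b):
    """Difference in sample means divided by its estimated standard error."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Hypothetical samples from two independent groups.
group_a = [1.0, 2.0, 3.0]
group_b = [2.0, 3.0, 4.0]
t_stat = welch_t_statistic(group_a, group_b)
```

The sign of the statistic shows which group has the larger mean, and its magnitude is compared against the t distribution to judge significance.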
-
Statistical Hypothesis Testing
Hypothesis testing is a statistical method used to determine whether a hypothesis about a population parameter is supported by the data. It is a powerful tool for making decisions based on data, and is widely used in many fields including medicine, social sciences, and business.
-
Statistical Distributions
In this article we will cover some distributions that I have found useful while analysing data. I have split them based on whether they are for a continuous or a discrete random variable. First I give a short theoretical introduction to each distribution and its probability density (or mass) function, and then show how to use Python to represent it graphically.
-
Visualize data using SAS
This is the third of a series of articles that I will write to give a gentle introduction to statistics. In this article we will cover how we can visualize data using various charts and how to read them. I will show how to create these charts using SAS and will include code snippets as well. For a full version of the code visit my GitHub repository.
-
Visualize data using Python
This is the second of a series of articles that I will write to give a gentle introduction to statistics. In this article we will cover how we can visualize data using various charts and how to read them. I will show how to create these charts using Python and will include code snippets as well. For a full version of the code visit my GitHub repository.
-
Describe your data using Python
This is the first of a series of articles that I will write to give a gentle introduction to statistics. In this article we will introduce some basic statistical concepts and learn how to use basic statistics to help you describe your data.
-
An Agile Approach to Analytics
Scrum is an agile framework for software development, but it can also be applied to other types of projects, including analytics. Scrum emphasizes collaboration, continuous improvement, and flexibility. It is designed to help teams work together to deliver high-quality results quickly and efficiently. In this article, we’ll discuss how to use Scrum in analytics teams.
-
Optimizing Retention through Machine Learning
Acquiring a new customer in the financial services sector can be as much as 5 to 25 times more expensive than retaining an existing one. Therefore, preventing customer churn is of paramount importance for the business. Advances in machine learning, the availability of large amounts of customer data, and more sophisticated methods for predicting churn can help devise data-backed strategies to prevent customers from churning.
-
Customer Lifecycle Analytics
How important is it to align your analytics efforts with the customer lifecycle? Imagine you are a credit card department within the consumer banking branch of a large bank. You are sending periodic mailers offering credit cards to your customers. Before sending these mail offers you do a minimal screening, so that you only offer them to customers who have been with the bank for at least 2 years and have maintained a balance above a certain threshold. However, you notice that the acceptance of your mail offers remains low even after a few campaigns. Why do you think that is?
-
An Introduction to GitHub
A three-part article series on version control using Git and GitHub. This is the third article in the series, in which I will give a very brief introduction to GitHub. This will allow most readers to understand enough to utilize it for version control during development.
-
Git Cheatsheet
A three-part article series on version control using Git and GitHub. This is the second article in the series, in which I will share my Git cheatsheet. This will enable the reader to quickly recall important commands to aid development.
-
An Introduction to Git
A three-part article series on version control using Git and GitHub. This is the first article in the series, in which I will give a very brief introduction to Git. This will allow most readers to understand enough to utilize it for version control during development.
-
Introduction to Programming in R
R is a programming language and environment for statistical computing and graphics. It was created in the early 1990s by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. R is now widely used in academia, industry, and government for data analysis, statistical modeling, and data visualization.
-
Introduction to Programming in Markdown
Markdown is a lightweight markup language that is used to format text in a simple and consistent way. It was first created in 2004 by John Gruber and Aaron Swartz as a way to write content for the web that was easy to read and write.
-
Introduction to Programming in Python
Python is a high-level, interpreted programming language that was first released in 1991 by Guido van Rossum. It is a general-purpose language that is designed to be easy to use, with a focus on readability and simplicity. Python is often used for web development, data analysis, artificial intelligence, scientific computing, and other types of software development.
-
Introduction to Programming in Julia
Julia is a high-level, high-performance programming language that was created in 2012 by a team of computer scientists led by Jeff Bezanson, Stefan Karpinski, and Viral Shah. Julia was designed to address the limitations of traditional scientific computing languages, such as MATLAB, Python, and R, while still retaining their ease of use and flexibility.
-
Introduction to Programming in Ruby
Ruby is a high-level, interpreted programming language that was created in the mid-1990s by Yukihiro “Matz” Matsumoto. It is a general-purpose language that is designed to be easy to use and read, with syntax that is similar to natural language. Ruby is often used for web development, as well as for building command-line utilities, desktop applications, and other types of software.
-
Introduction to Programming in C++
C++ is a powerful and popular programming language that was developed in the 1980s as an extension of the C programming language. It is a high-level, object-oriented language that is used to develop a wide range of applications, including operating systems, device drivers, game engines, and more. C++ is also widely used in the field of finance and quantitative analysis, due to its speed and efficiency.
-
Introduction to Programming in Microsoft Excel VBA
Excel VBA, or Visual Basic for Applications, is a programming language that can be used to automate tasks and enhance functionality in Microsoft Excel. VBA is a powerful tool that allows users to write custom macros and functions to automate repetitive tasks, perform complex calculations, and create custom solutions.