-
-
Manage blog comments with Giscus
Giscus is a free comments system that works without your own database. Giscus uses GitHub Discussions to store and load the comments associated with each page, based on a chosen mapping (URL, pathname, title, etc.).
To comment, visitors must authorize the giscus app to post on their behalf using the GitHub OAuth flow. Alternatively, visitors can comment on the GitHub Discussion directly. You can moderate the comments on GitHub.
Prerequisites
Create a GitHub repository
You need a GitHub repository first. If you are going to use GitHub Pages to host your website, you can choose the corresponding repository (i.e., [userID].github.io).
The repository should be public, otherwise visitors will not be able to view the discussion.
Turn on Discussion feature
In your GitHub repository Settings, make sure that General > Features > Discussions feature is enabled.
Activate Giscus API
Follow the steps in Configuration guide. Make sure the verification of your repository is successful.
Then, scroll down the page and choose a Discussion Category. You don't need to touch the other settings.
Update _config.yml
Now you get the giscus script. Copy the four properties (repo, repo ID, category, and category ID) from the generated script.
Paste those values into _config.yml in the root directory:
# External API
giscus_repo: "[ENTER REPO HERE]"
giscus_repoId: "[ENTER REPO ID HERE]"
giscus_category: "[ENTER CATEGORY NAME HERE]"
giscus_categoryId: "[ENTER CATEGORY ID HERE]"
None
· 2024-02-03
-
-
Markdown from A to Z
Headings
To create a heading, add number signs (#) in front of a word or phrase. The number of number signs you use should correspond to the heading level. For example, to create a heading level three (<h3>), use three number signs (e.g., ### My Header).
Markdown | HTML | Rendered Output
# Header 1 | <h1>Header 1</h1> | Header 1
## Header 2 | <h2>Header 2</h2> | Header 2
### Header 3 | <h3>Header 3</h3> | Header 3
Emphasis
You can add emphasis by making text bold or italic.
Bold
To bold text, add two asterisks (e.g., **text** = text) or underscores before and after a word or phrase. To bold the middle of a word for emphasis, add two asterisks without spaces around the letters.
Italic
To italicize text, add one asterisk (e.g., *text* = text) or underscore before and after a word or phrase. To italicize the middle of a word for emphasis, add one asterisk without spaces around the letters.
Blockquotes
To create a blockquote, add a > in front of a paragraph.
> Yongha Kim is the best developer in the world.
>
> Factos 👍👀
Yongha Kim is the best developer in the world.
Factos 👍👀
Lists
You can organize items into ordered and unordered lists.
Ordered Lists
To create an ordered list, add line items with numbers followed by periods. The numbers don’t have to be in numerical order, but the list should start with the number one.
1. First item
2. Second item
3. Third item
4. Fourth item
First item
Second item
Third item
Fourth item
Unordered Lists
To create an unordered list, add dashes (-), asterisks (*), or plus signs (+) in front of line items. Indent one or more items to create a nested list.
* First item
* Second item
* Third item
* Fourth item
First item
Second item
Third item
Fourth item
Code
To denote a word or phrase as code, enclose it in backticks (`).
Markdown | HTML | Rendered Output
At the command prompt, type `nano`. | At the command prompt, type <code>nano</code>. | At the command prompt, type nano.
Escaping Backticks
If the word or phrase you want to denote as code includes one or more backticks, you can escape it by enclosing the word or phrase in double backticks (``).
Markdown | HTML | Rendered Output
``Use `code` in your Markdown file.`` | <code>Use `code` in your Markdown file.</code> | Use `code` in your Markdown file.
Code Blocks
To create a code block that spans multiple lines of code, set the text inside three or more backticks (```) or tildes (~~~).
<html>
<head>
</head>
</html>
def foo():
    a = 1
    for i in [1,2,3]:
        a += i
Horizontal Rules
To create a horizontal rule, use three or more asterisks (***), dashes (---), or underscores (___) on a line by themselves.
***
---
_________________
Links
To create a link, enclose the link text in brackets (e.g., [Blue Archive]) and then follow it immediately with the URL in parentheses (e.g., (https://bluearchive.nexon.com)).
My favorite mobile game is [Blue Archive](https://bluearchive.nexon.com).
The rendered output looks like this:
My favorite mobile game is Blue Archive.
Adding Titles
You can optionally add a title for a link. This will appear as a tooltip when the user hovers over the link. To add a title, enclose it in quotation marks after the URL.
My favorite mobile game is [Blue Archive](https://bluearchive.nexon.com "All senseis are welcome!").
The rendered output looks like this:
My favorite mobile game is Blue Archive.
URLs and Email Addresses
To quickly turn a URL or email address into a link, enclose it in angle brackets.
<https://www.youtube.com/>
<fake@example.com>
The rendered output looks like this:
https://www.youtube.com/
fake@example.com
Images
To add an image, add an exclamation mark (!), followed by alt text in brackets, and the path or URL to the image asset in parentheses. You can optionally add a title in quotation marks after the path or URL.
![Tropical Paradise](/assets/img/example.jpg "Maldives, in October")
The rendered output looks like this:
Linking Images
To add a link to an image, enclose the Markdown for the image in brackets, and then add the link in parentheses.
[![La Mancha](/assets/img/La-Mancha.jpg "La Mancha: Spain, Don Quixote")](https://www.britannica.com/place/La-Mancha)
The rendered output looks like this:
Escaping Characters
To display a literal character that would otherwise be used to format text in a Markdown document, add a backslash (\) in front of the character.
\* Without the backslash, this would be a bullet in an unordered list.
The rendered output looks like this:
* Without the backslash, this would be a bullet in an unordered list.
Characters You Can Escape
You can use a backslash to escape the following characters.
Character | Name
` | backtick
* | asterisk
_ | underscore
{} | curly braces
[] | brackets
<> | angle brackets
() | parentheses
# | pound sign
+ | plus sign
- | minus sign (hyphen)
. | dot
! | exclamation mark
| | pipe
HTML
Many Markdown applications allow you to use HTML tags in Markdown-formatted text. This is helpful if you prefer certain HTML tags to Markdown syntax. For example, some people find it easier to use HTML tags for images. Using HTML is also helpful when you need to change the attributes of an element, like specifying the color of text or changing the width of an image.
To use HTML, place the tags in the text of your Markdown-formatted file.
This **word** is bold. This <span style="font-style: italic;">word</span> is italic.
The rendered output looks like this:
This word is bold. This word is italic.
None
· 2023-09-05
-
Important AWS Services that you need to Know Now
Introduction
Amazon Web Services (AWS) is a cloud-based platform that provides a wide range of infrastructure, platform, and software services. It was launched in 2006 and has since become one of the most popular cloud computing platforms in the world, used by individuals, small businesses, and large enterprises alike.
AWS is known for its flexibility, scalability, and cost-effectiveness, allowing businesses to pay only for the services they use and scale up or down as needed. Its reliability and security features also make it a popular choice for businesses that need to store and process sensitive data.
AWS provides a wide range of services, including compute, storage, databases, analytics, machine learning, Internet of Things (IoT), security, and more. It also offers a variety of deployment models, including public, private, and hybrid clouds, as well as edge computing services that allow computing to be performed closer to the source of data.
Key services
Compute services, such as Amazon Elastic Compute Cloud (EC2) and AWS Lambda
Storage services, such as Amazon Simple Storage Service (S3) and Elastic Block Store (EBS)
Database services, such as Amazon Relational Database Service (RDS) and DynamoDB
Networking services, such as Amazon Virtual Private Cloud (VPC) and Elastic Load Balancing (ELB)
Management and monitoring services, such as AWS CloudFormation and AWS CloudWatch
Security and compliance services, such as AWS Identity and Access Management (IAM) and AWS Key Management Service (KMS)
Analytics and machine learning services, such as Amazon SageMaker and Amazon Redshift
Users can interact with these services using:
SSH (Secure Shell) and AWS CLI (Command Line Interface)
boto3 (a Python library used to interact with AWS resources)
Deep dive into services
EC2
Amazon Elastic Compute Cloud (EC2) is a web service that provides resizable compute capacity in the cloud. EC2 allows users to launch and manage virtual machines, called instances, in the AWS cloud. With EC2, users can select from a variety of instance types optimized for different use cases, and can scale up or down their compute resources as needed. EC2 also allows users to choose from a range of operating systems, including Amazon Linux, Ubuntu, Windows, and others. EC2 instances can be used to run a wide range of applications, from simple web servers to complex, multi-tier applications.
SSH (Secure Shell) is used to establish a secure, encrypted connection with the EC2 instance, allowing you to remotely log in and execute commands on the server. The basic syntax for the SSH command is
ssh [username]@[EC2 instance public DNS]
You will need to replace [username] with the username you created when setting up your EC2 instance, and [EC2 instance public DNS] with the public DNS address for your instance. Once you have established an SSH connection, you can execute commands on the EC2 instance just as you would on a local machine.
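Besides SSH, EC2 instances can also be managed programmatically. Below is a minimal boto3 sketch (one of several possible approaches) that launches, lists, and stops an instance; the AMI ID, key pair name, and region are placeholder values you would replace with your own.
import boto3
# create an EC2 client in your chosen region (placeholder region)
ec2 = boto3.client('ec2', region_name='us-east-1')
# launch a single t2.micro instance (the AMI ID and key pair name are placeholders)
response = ec2.run_instances(
    ImageId='ami-0123456789abcdef0',
    InstanceType='t2.micro',
    KeyName='my-key-pair',
    MinCount=1,
    MaxCount=1
)
instance_id = response['Instances'][0]['InstanceId']
print(f'Launched instance {instance_id}')
# list the instances in the account and their states
for reservation in ec2.describe_instances()['Reservations']:
    for instance in reservation['Instances']:
        print(instance['InstanceId'], instance['State']['Name'])
# stop the instance when you are done with it
ec2.stop_instances(InstanceIds=[instance_id])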
Lambda
AWS Lambda is a serverless compute service provided by Amazon Web Services. With Lambda, you can run your code in response to various events such as changes to data in an Amazon S3 bucket or an update to a DynamoDB table. You upload your code to Lambda, and it takes care of everything required to run and scale your code with high availability. AWS Lambda is an event-driven service, meaning it only runs when an event triggers it.
The AWS Lambda console or the AWS Command Line Interface (CLI) can be used to create, configure, and deploy your Lambda functions. The following commands allow you to perform common tasks such as creating, updating, invoking, and deleting your Lambda functions using the AWS CLI.
aws lambda create-function: Creates a new Lambda function.
aws lambda update-function-code: Uploads new code to an existing Lambda function.
aws lambda invoke: Invokes a Lambda function.
aws lambda delete-function: Deletes a Lambda function.
Boto3 can also be used to create a new Lambda function. The following code uses the boto3.client() method to create a client object for interacting with AWS Lambda. It then reads the code for the Lambda function, which is stored in a ZIP file. Finally, it uses the lambda_client.create_function() method to create the Lambda function, specifying the function name, runtime, IAM role, handler, and code. The response from this method call contains information about the newly created function.
import boto3
# create a client object to interact with AWS Lambda
lambda_client = boto3.client('lambda')
# define the Lambda function's code
code = {'ZipFile': open('lambda_function.zip', 'rb').read()}
# create the Lambda function
response = lambda_client.create_function(
    FunctionName='my-function',
    Runtime='python3.8',
    Role='arn:aws:iam::123456789012:role/lambda-role',
    Handler='lambda_function.handler',
    Code=code
)
print(response)
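Once the function exists, it can also be invoked from Python. The sketch below assumes the 'my-function' name used above and an arbitrary JSON payload.
import json
import boto3
lambda_client = boto3.client('lambda')
# invoke the function synchronously with a small JSON payload
response = lambda_client.invoke(
    FunctionName='my-function',
    InvocationType='RequestResponse',
    Payload=json.dumps({'key': 'value'})
)
# read and print the function's response payload
print(response['Payload'].read().decode('utf-8'))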
S3 (Simple Storage Service)
Amazon S3 (Simple Storage Service) is a cloud storage service offered by Amazon Web Services (AWS). It provides a simple web interface to store and retrieve data from anywhere on the internet. S3 is designed to be highly scalable, durable, and secure, making it a popular choice for data storage, backup and archival, content distribution, and many other use cases. S3 allows you to store and retrieve any amount of data at any time, from anywhere on the web. It also offers different storage classes to help optimize costs based on access frequency and retrieval times.
The following AWS CLI commands can be used (for example, over an SSH session on an EC2 instance) to upload, download, and delete a file in S3:
aws s3 cp /path/to/local/file s3://bucket-name/key-name #code to upload
aws s3 cp s3://bucket-name/key-name /path/to/local/file #code to download
aws s3 rm s3://bucket-name/key-name #code to delete a file
The following boto3 code can be used to read, write, and delete a file in S3. Note that you need to replace your-region, your-access-key, your-secret-key, your-bucket-name, your-file-name, and new-file-name with your own values. Also, make sure that you have the necessary permissions to read, write, and delete objects in your S3 bucket.
import boto3
# set the S3 region and access keys
s3 = boto3.resource('s3', region_name='your-region', aws_access_key_id='your-access-key', aws_secret_access_key='your-secret-key')
# read a file from S3
bucket_name = 'your-bucket-name'
file_name = 'your-file-name'
obj = s3.Object(bucket_name, file_name)
file_content = obj.get()['Body'].read().decode('utf-8')
print(file_content)
# write a file to S3
new_file_name = 'new-file-name'
new_file_content = 'This is the content of the new file.'
obj = s3.Object(bucket_name, new_file_name)
obj.put(Body=new_file_content.encode('utf-8'))
# delete a file from S3
obj = s3.Object(bucket_name, file_name)
obj.delete()
Here’s an example of how to use a KMS key to encrypt and decrypt data in S3 using boto3:
import json
import boto3
# Create a KMS key
kms = boto3.client('kms')
response = kms.create_key()
kms_key_id = response['KeyMetadata']['KeyId']
# Choose an S3 bucket (the bucket must already exist)
bucket_name = 'my-s3-bucket'
# Grant permissions to the KMS key
key_policy = {
    "Version": "2012-10-17",
    "Id": "key-policy",
    "Statement": [
        {
            "Sid": "Enable IAM User Permissions",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:root"},
            "Action": "kms:*",
            "Resource": "*"
        },
        {
            "Sid": "Allow use of the key",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:user/username"},
            "Action": [
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:ReEncrypt*",
                "kms:GenerateDataKey*",
                "kms:DescribeKey"
            ],
            "Resource": "*"
        }
    ]
}
kms.put_key_policy(KeyId=kms_key_id, Policy=json.dumps(key_policy))
# Enable default encryption on the S3 bucket
s3_client = boto3.client('s3')
s3_client.put_bucket_encryption(
    Bucket=bucket_name,
    ServerSideEncryptionConfiguration={
        'Rules': [
            {
                'ApplyServerSideEncryptionByDefault': {
                    'SSEAlgorithm': 'aws:kms',
                    'KMSMasterKeyID': kms_key_id
                }
            }
        ]
    }
)
# Upload an object to S3 with encryption
s3_client.put_object(
    Bucket=bucket_name,
    Key='example_object',
    Body=b'Hello, world!',
    ServerSideEncryption='aws:kms',
    SSEKMSKeyId=kms_key_id
)
# Download an object from S3 with decryption
response = s3_client.get_object(
    Bucket=bucket_name,
    Key='example_object'
)
body = response['Body'].read()
print(body.decode())
Here is example code that uses the boto3 library to connect to an S3 bucket, list the contents of a folder, and download the latest file based on its modified timestamp:
import boto3
from datetime import datetime
# set up S3 client
s3 = boto3.client('s3')
# specify bucket and folder
bucket_name = 'my-bucket'
folder_name = 'my-folder/'
# list all files in the folder
response = s3.list_objects(Bucket=bucket_name, Prefix=folder_name)
# sort the files by last modified time
files = response['Contents']
files = sorted(files, key=lambda k: k['LastModified'], reverse=True)
# download the latest file
latest_file = files[0]['Key']
s3.download_file(bucket_name, latest_file, 'local_filename')
EBS (Elastic Block Storage)
Amazon Elastic Block Store (EBS) is a block-level storage service that provides persistent block storage volumes for use with Amazon EC2 instances. EBS volumes are highly available and reliable storage volumes that can be attached to running instances, allowing you to store persistent data separate from the instance itself. EBS volumes are designed for mission-critical systems, so they are optimized for low-latency and consistent performance. EBS also supports point-in-time snapshots, which can be used for backup and disaster recovery.
Using SSH/AWS CLI to manage Elastic Block Store (EBS) volumes:
aws ec2 create-volume --availability-zone us-east-1a --size 10 --volume-type gp2
aws ec2 attach-volume --volume-id vol-0123456789abcdef --instance-id i-0123456789abcdef --device /dev/sdf
aws ec2 detach-volume --volume-id vol-0123456789abcdef
aws ec2 delete-volume --volume-id vol-0123456789abcdef
Using python/boto3 to manage Elastic Block Store (EBS) volumes:
import boto3
# create an EC2 client object
ec2 = boto3.client('ec2')
# create an EBS volume
response = ec2.create_volume(
    AvailabilityZone='us-west-2a',
    Encrypted=False,
    Size=100,
    VolumeType='gp2'
)
print(response)
# attach a volume
response = ec2.attach_volume(
    Device='/dev/sdf',
    InstanceId='i-0123456789abcdef0',
    VolumeId='vol-0123456789abcdef0'
)
print(response)
# detach a volume
response = ec2.detach_volume(
    VolumeId='vol-0123456789abcdef0'
)
print(response)
# delete a volume
response = ec2.delete_volume(
    VolumeId='vol-0123456789abcdef0'
)
print(response)
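Since EBS supports point-in-time snapshots for backup and disaster recovery, here is a small boto3 sketch that creates a snapshot and restores it into a new volume; the volume ID and Availability Zone are placeholders.
import boto3
ec2 = boto3.client('ec2')
# create a point-in-time snapshot of an existing volume (placeholder volume ID)
snapshot = ec2.create_snapshot(
    VolumeId='vol-0123456789abcdef0',
    Description='Nightly backup'
)
print(snapshot['SnapshotId'])
# later, restore the snapshot into a new volume
restored = ec2.create_volume(
    AvailabilityZone='us-west-2a',
    SnapshotId=snapshot['SnapshotId'],
    VolumeType='gp2'
)
print(restored['VolumeId'])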
RDS (Relational Database Service)
Amazon Relational Database Service (Amazon RDS) is a managed database service offered by Amazon Web Services (AWS) that simplifies the process of setting up, operating, and scaling a relational database in the cloud. It provides cost-efficient, resizable capacity for an industry-standard relational database and manages common database administration tasks, freeing up developers to focus on applications and customers. With Amazon RDS, you can choose from several different database engines, including Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server.
To use AWS RDS with PostgreSQL, follow these steps:
Log in to your AWS console and navigate to the RDS dashboard.
Click the “Create database” button.
Choose “Standard Create” and select “PostgreSQL” as the engine.
Choose the appropriate version of PostgreSQL that you want to use.
Set up the rest of the database settings, such as the instance size, storage, and security group settings.
Click the “Create database” button to create your RDS instance.
Once the instance is created, you can connect to it using a PostgreSQL client, such as pgAdmin or the psql command-line tool.
To connect to the RDS instance using pgAdmin, follow these steps:
Open pgAdmin and right-click on “Servers” in the Object Browser.
Click “Create Server”.
Enter a name for the server and switch to the “Connection” tab.
Enter the following information:
– Host: This is the endpoint for your RDS instance, which you can find in the RDS dashboard.
– Port: This is the port number for your PostgreSQL instance, which is usually 5432.
– Maintenance database: This is the name of the default database that you want to connect to.
– Username: This is the username that you specified when you created the RDS instance.
– Password: This is the password that you specified when you created the RDS instance.
Click “Save” to create the server.
You can now connect to the RDS instance by double-clicking on the server name in the Object Browser.
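If you prefer connecting from Python instead of pgAdmin, a psycopg2 sketch like the one below works the same way; the endpoint, database name, and credentials are placeholders taken from your own RDS instance.
import psycopg2
# connect using the endpoint shown in the RDS dashboard (all values are placeholders)
conn = psycopg2.connect(
    host='my-instance.xxxxxxxx.us-east-1.rds.amazonaws.com',
    port=5432,
    dbname='mydatabase',
    user='myusername',
    password='mypassword'
)
cur = conn.cursor()
cur.execute('SELECT version()')
print(cur.fetchone())
cur.close()
conn.close()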
DynamoDB
Amazon DynamoDB is a fully-managed NoSQL database service that provides fast and predictable performance with seamless scalability. DynamoDB lets users offload the administrative burdens of operating and scaling a distributed database so that they don’t have to worry about hardware provisioning, setup, and configuration, replication, software patching, or cluster scaling. DynamoDB is known for its high performance, ease of use, and flexibility. It supports document and key-value store models, making it suitable for a wide range of use cases, including mobile and web applications, gaming, ad tech, IoT, and more.
Using SSH/AWS CLI to access DynamoDB:
aws dynamodb create-table --table-name <table-name> --attribute-definitions AttributeName=<attribute-name>,AttributeType=S --key-schema AttributeName=<attribute-name>,KeyType=HASH --provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=1
aws dynamodb put-item --table-name <table-name> --item '{"<attribute-name>": {"S": "<attribute-value>"}}'
aws dynamodb get-item --table-name <table-name> --key '{"<attribute-name>": {"S": "<attribute-value>"}}'
aws dynamodb delete-item --table-name <table-name> --key '{"<attribute-name>": {"S": "<attribute-value>"}}'
aws dynamodb delete-table --table-name <table-name>
Using boto3 to work with DynamoDB:
import boto3
# create a DynamoDB resource
dynamodb = boto3.resource('dynamodb')
# create a table
table = dynamodb.create_table(
    TableName='my_table',
    KeySchema=[
        {
            'AttributeName': 'id',
            'KeyType': 'HASH'
        }
    ],
    AttributeDefinitions=[
        {
            'AttributeName': 'id',
            'AttributeType': 'S'
        }
    ],
    ProvisionedThroughput={
        'ReadCapacityUnits': 5,
        'WriteCapacityUnits': 5
    }
)
# wait until the table has been created before using it
table.wait_until_exists()
# put an item into the table
table.put_item(
    Item={
        'id': '1',
        'name': 'Alice',
        'age': 30
    }
)
# get an item from the table
response = table.get_item(
    Key={
        'id': '1'
    }
)
# print the item
item = response['Item']
print(item)
# delete the table
table.delete()
Amazon Virtual Private Cloud (VPC)
It is a service offered by Amazon Web Services (AWS) that enables users to launch Amazon Web Services resources into a virtual network that they define. With Amazon VPC, users can define a virtual network topology including subnets and routing tables, and control network security using firewall rules and access control lists.
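As a rough illustration (not a production network design), the boto3 sketch below creates a VPC with a single subnet; the CIDR blocks are arbitrary placeholders.
import boto3
ec2 = boto3.client('ec2')
# create a VPC with a /16 CIDR block
vpc = ec2.create_vpc(CidrBlock='10.0.0.0/16')
vpc_id = vpc['Vpc']['VpcId']
# add one subnet inside the VPC
subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock='10.0.1.0/24')
print(vpc_id, subnet['Subnet']['SubnetId'])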
AWS Elastic Load Balancer (ELB)
It is a managed load balancing service provided by Amazon Web Services. It automatically distributes incoming application traffic across multiple targets, such as Amazon EC2 instances, containers, and IP addresses, in one or more Availability Zones. ELB allows you to easily scale your application by increasing or decreasing the number of resources, such as EC2 instances, behind the load balancer, and it provides high availability and fault tolerance for your applications. There are three types of load balancers available in AWS: Application Load Balancer, Network Load Balancer, and Classic Load Balancer.
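For example, an Application Load Balancer can be created through the elbv2 API in boto3; in this sketch the subnet and security group IDs are placeholders and the listener/target group setup is omitted.
import boto3
elbv2 = boto3.client('elbv2')
# create an internet-facing Application Load Balancer across two subnets (placeholder IDs)
response = elbv2.create_load_balancer(
    Name='my-load-balancer',
    Subnets=['subnet-0123456789abcdef0', 'subnet-0fedcba9876543210'],
    SecurityGroups=['sg-0123456789abcdef0'],
    Scheme='internet-facing',
    Type='application'
)
print(response['LoadBalancers'][0]['DNSName'])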
AWS CloudFormation
It is a service provided by Amazon Web Services that helps users model and set up their AWS resources. It allows users to create templates of AWS resources in a declarative way, which can then be versioned and managed like any other code. AWS CloudFormation automates the deployment and updates of the AWS resources specified in the templates. This service makes it easier to manage and maintain infrastructure as code and provides a simple way to achieve consistency and repeatability across environments. Users can define the infrastructure for their applications and services, and then AWS CloudFormation takes care of provisioning, updating, and deleting resources, based on the templates defined by the user.
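A minimal sketch of the idea: the boto3 call below creates a stack from an inline template that declares a single S3 bucket; the stack name is a placeholder.
import json
import boto3
cfn = boto3.client('cloudformation')
# a tiny template that declares one S3 bucket
template = {
    'AWSTemplateFormatVersion': '2010-09-09',
    'Resources': {
        'MyBucket': {
            'Type': 'AWS::S3::Bucket'
        }
    }
}
# create the stack from the inline template
response = cfn.create_stack(
    StackName='my-sample-stack',
    TemplateBody=json.dumps(template)
)
print(response['StackId'])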
AWS CloudWatch
It is a monitoring service provided by Amazon Web Services (AWS) that collects, processes, and stores log files and metrics from AWS resources and custom applications. With CloudWatch, users can collect and track metrics, collect and monitor log files, and set alarms. CloudWatch is designed to help users identify and troubleshoot issues, and it can be used to monitor AWS resources such as EC2 instances, RDS instances, and load balancers, as well as custom metrics and logs from any other application running on AWS. CloudWatch can also be used to gain insights into the performance and health of applications and infrastructure, and it integrates with other AWS services such as AWS Lambda, AWS Elastic Beanstalk, and AWS EC2 Container Service.
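As a small illustration, the boto3 sketch below publishes a custom metric and sets an alarm on it; the namespace, metric name, and threshold are arbitrary placeholders.
import boto3
cloudwatch = boto3.client('cloudwatch')
# publish a custom metric data point
cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[{'MetricName': 'PageLoadTime', 'Value': 1.2, 'Unit': 'Seconds'}]
)
# create an alarm that fires when the 5-minute average exceeds 3 seconds
cloudwatch.put_metric_alarm(
    AlarmName='SlowPageLoads',
    Namespace='MyApp',
    MetricName='PageLoadTime',
    Statistic='Average',
    Period=300,
    EvaluationPeriods=1,
    Threshold=3.0,
    ComparisonOperator='GreaterThanThreshold'
)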
AWS Key Management Service (KMS)
It is a managed service that makes it easy for you to create and control the encryption keys used to encrypt your data. KMS is integrated with many other AWS services, such as Amazon S3, Amazon EBS, and Amazon RDS, allowing you to easily protect your data with encryption. With KMS, you can create, manage, and revoke encryption keys, and you can audit key usage to ensure compliance with security best practices. KMS also supports hardware security modules (HSMs) for added security.
import boto3
# create a KMS client
client = boto3.client('kms')
# Encrypt data using a KMS key
response = client.encrypt(
    KeyId='alias/my-key',
    Plaintext=b'My secret data'
)
# Decrypt data using a KMS key
response = client.decrypt(
    CiphertextBlob=response['CiphertextBlob']
)
# Manage encryption keys
# Create a new KMS key
response = client.create_key(
    Description='My encryption key',
    KeyUsage='ENCRYPT_DECRYPT',
    Origin='AWS_KMS'
)
# List all KMS keys in your account
response = client.list_keys()
# Handle errors gracefully when using KMS
import botocore.exceptions
try:
    response = client.encrypt(
        KeyId='alias/my-key',
        Plaintext=b'My secret data'
    )
except botocore.exceptions.ClientError as error:
    print(f'An error occurred: {error}')
Amazon SageMaker
It is a fully-managed service that enables developers and data scientists to easily build, train, and deploy machine learning models at scale. With SageMaker, you can quickly create an end-to-end machine learning workflow that includes data preparation, model training, and deployment. The service offers a variety of built-in algorithms and frameworks, as well as the ability to bring your own custom algorithms and models.
SageMaker provides a range of tools and features to help you manage your machine learning projects. You can use the built-in Jupyter notebooks to explore and visualize your data, and use SageMaker’s automatic model tuning capabilities to find the best hyperparameters for your model. The service also offers integration with other AWS services such as S3, IAM, and CloudWatch, making it easy to build and deploy machine learning models in the cloud.
With SageMaker, you only pay for what you use, with no upfront costs or long-term commitments. The service is designed to scale with your needs, so you can start small and grow your machine learning projects as your data and business needs evolve.
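As a small, hedged example of interacting with SageMaker from Python, the boto3 sketch below lists recent training jobs and describes the latest one; it assumes at least one training job already exists in your account.
import boto3
sagemaker = boto3.client('sagemaker')
# list the most recent training jobs
jobs = sagemaker.list_training_jobs(MaxResults=5, SortBy='CreationTime', SortOrder='Descending')
for job in jobs['TrainingJobSummaries']:
    print(job['TrainingJobName'], job['TrainingJobStatus'])
# describe the latest job in detail (assumes at least one exists)
if jobs['TrainingJobSummaries']:
    latest = jobs['TrainingJobSummaries'][0]['TrainingJobName']
    detail = sagemaker.describe_training_job(TrainingJobName=latest)
    print(detail['TrainingJobStatus'])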
Amazon Redshift
It is a cloud-based data warehousing service provided by Amazon Web Services (AWS). It is a fully managed, petabyte-scale data warehouse that enables businesses to analyze data using existing SQL-based business intelligence tools. Redshift is designed to handle large data sets and to scale up or down as needed, making it a flexible and cost-effective solution for data warehousing.
Using the Redshift Data API (redshift-data client) to query a Redshift cluster:
import json
import boto3
import pandas as pd
# Create a Redshift Data API client
client = boto3.client('redshift-data', region_name='us-west-2')
# Execute a SQL statement against the cluster (the call is asynchronous)
response = client.execute_statement(
    ClusterIdentifier='my_cluster',
    Database='my_database',
    DbUser='my_user',
    Sql='SELECT * FROM my_table'
)
# Fetch the results once the statement has finished
# (use client.describe_statement(Id=...) to poll for completion if needed)
statement_id = response['Id']
result = client.get_statement_result(Id=statement_id)
# Print the results
for row in result['Records']:
    print(row)
# parse the response JSON and create a Pandas DataFrame
# we serialize the records with json.dumps and create a Pandas DataFrame from the resulting string using pd.read_json
df = pd.read_json(json.dumps(result['Records']))
Using psycopg2 to connect to Redshift cluster:
pip install boto3 psycopg2
import boto3
client = boto3.client('redshift')
# create a new Redshift cluster using boto3:
response = client.create_cluster(
    ClusterIdentifier='my-redshift-cluster',
    NodeType='dc2.large',
    MasterUsername='myusername',
    MasterUserPassword='mypassword',
    ClusterSubnetGroupName='my-subnet-group',
    VpcSecurityGroupIds=['my-security-group'],
    ClusterParameterGroupName='default.redshift-1.0',
    NumberOfNodes=2,
    PubliclyAccessible=False,
    Encrypted=True,
    HsmClientCertificateIdentifier='my-hsm-certificate',
    HsmConfigurationIdentifier='my-hsm-config',
    Tags=[{'Key': 'Name', 'Value': 'My Redshift Cluster'}]
)
print(response)
# connect to the Redshift cluster and create a new database
import psycopg2
conn = psycopg2.connect(
    host='my-redshift-cluster.xxxxxxxx.us-west-2.redshift.amazonaws.com',
    port=5439,
    dbname='mydatabase',
    user='myusername',
    password='mypassword'
)
conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction block
cur = conn.cursor()
cur.execute('CREATE DATABASE mynewdatabase')
cur.close()
conn.close()
# Query the Redshift database
import psycopg2
conn = psycopg2.connect(
    host='my-redshift-cluster.xxxxxxxx.us-west-2.redshift.amazonaws.com',
    port=5439,
    dbname='mynewdatabase',
    user='myusername',
    password='mypassword'
)
cur = conn.cursor()
cur.execute('SELECT * FROM mytable')
for row in cur:
    print(row)
cur.close()
conn.close()
# Delete the Redshift cluster using boto3
import boto3
client = boto3.client('redshift')
response = client.delete_cluster(
    ClusterIdentifier='my-redshift-cluster',
    SkipFinalClusterSnapshot=True
)
print(response)
Comments welcome!
Data Science
· 2022-07-02
-
Implementing Self Organizing Maps using Python
What are Self Organizing Maps (SOMs)?
SOM stands for Self-Organizing Map, which is a type of artificial neural network that is used for unsupervised learning and dimensionality reduction. SOMs are inspired by the structure and function of the human brain, and they can be used to visualize and explore complex, high-dimensional data in a two-dimensional map or grid.
SOMs consist of an input layer, a layer of computational nodes, and an output layer. The input layer receives the data, and the computational nodes perform computations on the data. The output layer is the two-dimensional grid of nodes that represents the input data. During training, the nodes in the output layer are adjusted to represent the input data in a way that preserves the topological relationships between the input data points.
SOMs have a wide range of applications, including image processing, data visualization, data clustering, feature extraction, and anomaly detection. They are particularly useful for visualizing and exploring large, complex datasets, as they can reveal patterns and relationships that might not be apparent from the raw data.
Implementation
To implement Self-Organizing Maps (SOM) in Python, you can use the SOMPY library. SOMPY is a Python library for Self Organizing Map (SOM), and it provides an easy-to-use interface to implement SOM in Python.
Here are the steps to implement SOM using SOMPY library in Python:
Install SOMPY library: You can install SOMPY library using pip by running the following command in the terminal:
pip install sompy
Import the SOMPY library: To use the SOMPY library, you need to import it first. You can do this using the following code:
from sompy.sompy import SOMFactory
Load data: You need to load the data you want to cluster using SOM. You can load data from a file or create a numpy array.
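For example (assuming a CSV file that contains only numeric feature columns; the file name is a placeholder), you could load it with pandas:
import pandas as pd
# load the dataset; 'data.csv' is a placeholder file name
df = pd.read_csv('data.csv')
features = df.columns.tolist()
data = df.values  # SOMPY expects a 2-D numpy array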
Create a SOM object: You need to create a SOM object using SOMFactory class. You can set the parameters of SOM object such as the number of nodes, learning rate, and neighborhood function.
som = SOMFactory.build(data, mapsize=[20,20], normalization='var', initialization='pca', component_names=features)
Here, data is the input data you loaded in the previous step, mapsize is the number of nodes in the SOM, normalization is the normalization method, initialization is the initialization method, and component_names is the feature names of the input data.
Train the SOM: You can train the SOM object using the following code:
som.train(n_job=1, verbose=False)
Here, n_job is the number of processors to use, and verbose is the flag to print the training progress.
Plot the SOM: You can visualize the SOM using the following code:
from sompy.visualization.mapview import View2D
from sompy.visualization.bmuhits import BmuHitsView
# View the map
view2D = View2D(10,10,"rand data",text_size=10)
view2D.show(som, col_sz=4, which_dim="all", denormalize=True)
# View the hit map
hits = BmuHitsView(4,4,"Hits Map",text_size=12)
hits.show(som, anotate=True, onlyzeros=False, labelsize=12, cmap="Greys", logaritmic=False)
Here, View2D is used to view the map, and BmuHitsView is used to view the hit map. You can set the number of columns in the map and other parameters to adjust the size and style of the map.
That’s it! These are the basic steps to implement SOM using the SOMPY library in Python. You can customize the SOM object and visualization methods to fit your requirements.
Comments welcome!
Data Science
· 2022-06-04
-
Implementing Convolutional Neural Networks using Python
What are Convolutional Neural Networks (CNNs)?
Convolutional Neural Networks (CNNs) are a type of deep neural network that are commonly used in computer vision tasks such as image classification, object detection, and segmentation. They are able to automatically learn and extract features from images, allowing them to identify patterns and structures in complex visual data.
The key component of a CNN is the convolutional layer, which performs a series of convolutions between the input image and a set of learnable filters. Each filter is designed to detect a specific pattern or feature in the image, such as edges, corners, or textures. The result of the convolution is a feature map that captures the presence and location of the detected feature.
In addition to the convolutional layer, a typical CNN architecture also includes pooling layers, which reduce the spatial resolution of the feature maps while retaining their most important information, and fully connected layers, which combine the extracted features into a final output.
One of the major advantages of CNNs is their ability to learn hierarchical representations of images, where lower-level features such as edges and corners are combined to form higher-level features such as shapes and objects. This makes them highly effective for image classification and object detection tasks, where they can achieve state-of-the-art performance on benchmark datasets.
Implementation
CNNs can be implemented in various deep learning frameworks such as TensorFlow, PyTorch, and Keras. These frameworks provide pre-built layers and functions for building and training CNN models, making it relatively easy to implement even for those with limited programming experience.
Using Tensorflow library
Here’s an example of how to implement a basic convolutional neural network (CNN) using TensorFlow in Python:
import tensorflow as tf
# Define the model architecture
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax')
])
# Compile the model with an optimizer, loss function, and metrics
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Load the training and test data
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()
# Preprocess the data
train_images = train_images.reshape(train_images.shape[0], 28, 28, 1)
train_images = train_images.astype('float32') / 255
train_labels = tf.keras.utils.to_categorical(train_labels, num_classes=10)
test_images = test_images.reshape(test_images.shape[0], 28, 28, 1)
test_images = test_images.astype('float32') / 255
test_labels = tf.keras.utils.to_categorical(test_labels, num_classes=10)
# Train the model
model.fit(train_images, train_labels, batch_size=128, epochs=10, validation_data=(test_images, test_labels))
In this example, we define a simple CNN architecture with one convolutional layer, one max pooling layer, one flattening layer, and one fully connected (dense) layer. We use the MNIST dataset for training and testing the model. We compile the model with the Adam optimizer, categorical cross-entropy loss function, and accuracy metric. Finally, we train the model for 10 epochs and evaluate its performance on the test data.
Using keras library
Here is an example of how to implement a convolutional neural network (CNN) in Keras:
# First, you need to import the required libraries:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
Next, you can define your CNN model using the Sequential API. Here’s an example model:
model = Sequential()
# Add a convolutional layer with 32 filters, a 3x3 kernel size, and ReLU activation
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
# Add a max pooling layer with a 2x2 pool size
model.add(MaxPooling2D(pool_size=(2, 2)))
# Add another convolutional layer with 64 filters and a 3x3 kernel size
model.add(Conv2D(64, (3, 3), activation='relu'))
# Add another max pooling layer
model.add(MaxPooling2D(pool_size=(2, 2)))
# Flatten the output from the previous layer
model.add(Flatten())
# Add a fully connected layer with 128 neurons and ReLU activation
model.add(Dense(128, activation='relu'))
# Add an output layer with 10 neurons (for a 10-class classification problem) and softmax activation
model.add(Dense(10, activation='softmax'))
This CNN model has two convolutional layers with 32 and 64 filters, respectively, each followed by a max pooling layer with a 2x2 pool size. The output from the last max pooling layer is flattened and fed into a fully connected layer with 128 neurons, which is then connected to an output layer with 10 neurons and softmax activation for a 10-class classification problem.
Finally, you can compile and train the model using the compile() and fit() methods, respectively. Here’s an example of compiling and training the model on the MNIST dataset:
# Compile the model with categorical crossentropy loss and Adam optimizer
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Train the model on the MNIST dataset
model.fit(X_train, y_train, batch_size=128, epochs=10, validation_data=(X_test, y_test))
In this example, X_train and y_train are the training data and labels, respectively, and X_test and y_test are the validation data and labels, respectively. The model is compiled with categorical crossentropy loss and Adam optimizer, and trained for 10 epochs with a batch size of 128. The model’s training and validation accuracy are recorded and printed after each epoch.
Using PyTorch library
To implement a Convolutional Neural Network (CNN) in PyTorch, you can follow these steps:
# Import the necessary PyTorch libraries:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
Define the CNN architecture by creating a class that inherits from the nn.Module class:
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3)
        self.pool = nn.MaxPool2d(2, 2)
        self.dropout = nn.Dropout(p=0.5)
        self.fc1 = nn.Linear(64 * 6 * 6, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.dropout(x)
        x = self.pool(F.relu(self.conv2(x)))
        x = self.dropout(x)
        x = x.view(-1, 64 * 6 * 6)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x
Here, we have defined a CNN architecture with two convolutional layers, two max pooling layers, two dropout layers, and two fully connected layers.
# Create the model, then define the loss function and the optimizer:
model = CNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
# Train the model (assumes num_epochs and a train_loader DataLoader are already defined):
for epoch in range(num_epochs):
    for i, data in enumerate(train_loader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
# Evaluate the model (assumes a test_loader DataLoader is already defined):
correct = 0
total = 0
with torch.no_grad():
    for data in test_loader:
        images, labels = data
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print('Accuracy of the network on the 10000 test images: %d %%' % (100 * correct / total))
This is a basic example of how to implement a CNN using PyTorch. Of course, there are many ways to customize the architecture, loss function, optimizer, and training procedure based on your specific needs.
In summary, CNNs are a powerful and widely used tool in computer vision and have led to significant advancements in areas such as image recognition, object detection, and segmentation. With the availability of deep learning frameworks, it has become easier than ever to implement and experiment with CNN models for a wide range of applications.
Comments welcome!
Data Science
· 2022-05-07
-
Implementing Recurrent Neural Networks using Python
What are Recurrent Neural Networks (RNNs)?
Recurrent Neural Networks, or RNNs, are a type of artificial neural network designed to process sequential data, such as time-series or natural language. While traditional neural networks process input data independently of one another, RNNs allow for the input of past data to influence current output. This is done by introducing a loop within the neural network, allowing previous output to be fed back into the input layer.
The ability to process sequential data makes RNNs particularly useful for a variety of tasks. For example, in natural language processing, RNNs can be used to generate text or to predict the next word in a sentence. In speech recognition, RNNs can be used to transcribe audio to text. In financial modeling, RNNs can be used to predict stock prices based on historical data.
The core of an RNN is its hidden state, which is a vector that is updated at each time step. The state vector summarizes information from previous inputs, and is used to predict the output at the current time step. The state vector is updated using a set of weights that are learned during training.
One common issue with RNNs is that the hidden state can become “saturated” and lose information from previous time steps. To address this, several variations of RNNs have been developed, including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs), which can better maintain the memory of the network over longer periods of time.
Implementation
Implementing an RNN in Python can be done using several popular deep learning frameworks, such as TensorFlow, Keras, and PyTorch. These frameworks provide high-level APIs that make it easier to build and train complex neural networks. With the popularity of RNNs increasing, they have become a powerful tool for a variety of applications across many different fields.
Using TensorFlow library
Here is an example of how to implement a simple RNN using TensorFlow:
import tensorflow as tf
import numpy as np
# Define the RNN model
num_inputs = 1
num_neurons = 100
num_outputs = 1
learning_rate = 0.001
X = tf.placeholder(tf.float32, [None, None, num_inputs])
y = tf.placeholder(tf.float32, [None, None, num_outputs])
cell = tf.contrib.rnn.OutputProjectionWrapper(
    tf.contrib.rnn.BasicRNNCell(num_units=num_neurons, activation=tf.nn.relu),
    output_size=num_outputs)
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)
# Define the loss function and optimizer
loss = tf.reduce_mean(tf.square(outputs - y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
train = optimizer.minimize(loss)
# Generate some sample data
t_min, t_max = 0, 30
resolution = 0.1
t = np.linspace(t_min, t_max, int((t_max - t_min) / resolution))
x = np.sin(t)
# Train the model
n_iterations = 500
batch_size = 50
init = tf.global_variables_initializer()
with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        X_batch = x.reshape(-1, batch_size, num_inputs)
        y_batch = x.reshape(-1, batch_size, num_outputs)
        sess.run(train, feed_dict={X: X_batch, y: y_batch})
    # Make some predictions
    X_new = x.reshape(-1, 1, num_inputs)
    y_pred = sess.run(outputs, feed_dict={X: X_new})
This is a simple RNN that is trained on a sine wave and learns to predict the next value in the sequence. You can modify the code to work with your own data and adjust the parameters to improve the accuracy of the model.
Using keras library
Here’s an example code for implementing RNN using Keras in Python:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN
# define the data
X = np.array([[[1], [2], [3], [4], [5]], [[6], [7], [8], [9], [10]]])
y = np.array([[[6], [7], [8], [9], [10]], [[11], [12], [13], [14], [15]]])
# define the model
model = Sequential()
model.add(SimpleRNN(1, input_shape=(5, 1), return_sequences=True))
# compile the model
model.compile(optimizer='adam', loss='mse')
# fit the model
model.fit(X, y, epochs=1000, verbose=0)
# make predictions
predictions = model.predict(X)
print(predictions)
In this example, we define a simple RNN model using Keras to predict the next value in a sequence. We input two sequences, each of length 5, and output two sequences, each of length 5. We define the model using the Sequential class and add a single SimpleRNN layer with a single neuron. We compile the model using the adam optimizer and mean squared error loss function. We then fit the model on the input and output sequences, running for 1000 epochs. Finally, we use the model to make predictions on the input sequences, printing the predictions.
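Because plain RNNs can struggle to retain information over long sequences, Keras also provides LSTM and GRU layers. As a sketch, the same sequence-to-sequence model can be built by swapping in an LSTM layer with the shapes used above:
from keras.models import Sequential
from keras.layers import LSTM
# same setup as the SimpleRNN example, but with an LSTM layer
model = Sequential()
model.add(LSTM(1, input_shape=(5, 1), return_sequences=True))
model.compile(optimizer='adam', loss='mse')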
Using PyTorch library
Here is an example of implementing a Recurrent Neural Network (RNN) in Python using PyTorch:
import torch
import torch.nn as nn
# Define the RNN model
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)
# Set the hyperparameters
input_size = 5
hidden_size = 10
output_size = 2
# Create the RNN model
rnn = RNN(input_size, hidden_size, output_size)
# Define the input and the initial hidden state
input = torch.randn(1, input_size)
hidden = torch.zeros(1, hidden_size)
# Run the RNN model
output, next_hidden = rnn(input, hidden)
This code defines an RNN model using PyTorch’s nn.Module class, which includes an input layer, a hidden layer, and an output layer. The forward method defines how the input is processed through the network, and the initHidden method initializes the hidden state.
To run the RNN model, we first set the hyperparameters such as input_size, hidden_size, and output_size. Then we create an instance of the RNN model and pass in an input tensor and an initial hidden state to the forward method. The output of the RNN model is the output tensor and the next hidden state.
Note that this is just a simple example, and there are many variations of RNNs that can be implemented in PyTorch depending on the specific use case.
Comments welcome!
Data Science
· 2022-04-02
-
Implementing Artificial Neural Networks using Python
What are Artificial Neural Networks (ANNs)?
Artificial Neural Networks (ANNs) are a type of machine learning model that are designed to simulate the function of a biological neural network. ANNs are composed of interconnected nodes or artificial neurons that process and transmit information to one another. The structure of an ANN consists of an input layer, one or more hidden layers, and an output layer.
The input layer is where data is introduced to the network, while the output layer produces the network’s prediction or classification. Hidden layers contain a variable number of artificial neurons, which allow the network to model non-linear relationships in the data. The connections between the neurons in the hidden layers have weights that can be adjusted through training to optimize the performance of the network.
ANNs can be used for a variety of machine learning tasks, including regression, classification, and clustering. For regression, ANNs can be trained to model the relationship between input variables and output variables. In classification, ANNs can be trained to classify input data into different categories. In clustering, ANNs can be used to group similar data points together.
The training process of an ANN involves adjusting the weights of the connections between the neurons to minimize the difference between the predicted output and the actual output. This process involves passing data through the network multiple times and updating the weights based on the difference between the predicted output and the actual output. The goal is to find a set of weights that minimize the error and optimize the performance of the network.
Implementation
Python has several libraries that can be used to implement ANNs, including scikit-learn, TensorFlow, Keras, and PyTorch. These libraries provide high-level abstractions that make it easier to build and train ANNs. In addition, they provide a wide range of pre-built layers and functions that can be used to customize the architecture of the network.
Using scikit-learn library
Here’s an example of how to create a simple ANN using the scikit-learn library:
# Import the necessary libraries
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Generate a random dataset for classification
X, y = make_classification(n_features=4, random_state=0)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
# Create an ANN classifier with one hidden layer
clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=1000, random_state=0)
# Train the classifier on the training set
clf.fit(X_train, y_train)
# Evaluate the classifier on the testing set
score = clf.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(score*100))
In this example, we first import the necessary libraries, generate a random dataset for classification, and split the data into training and testing sets. We then create an ANN classifier with one hidden layer and train it on the training set. Finally, we evaluate the classifier on the testing set and print the accuracy.
This is just a basic example, and there are many ways to customize and optimize your ANN, depending on your specific use case.
Using Tensorflow library
Here’s an example of how to implement an artificial neural network using TensorFlow without Keras:
import tensorflow as tf
import numpy as np
# Define the input data and expected outputs
input_data = np.array([[0,0], [0,1], [1,0], [1,1]], dtype=np.float32)
expected_output = np.array([[0], [1], [1], [0]], dtype=np.float32)
# Define the network architecture
num_input = 2
num_hidden = 2
num_output = 1
learning_rate = 0.1
# Define the weights and biases for the network
weights = {
    'hidden': tf.Variable(tf.random.normal([num_input, num_hidden])),
    'output': tf.Variable(tf.random.normal([num_hidden, num_output]))
}
biases = {
    'hidden': tf.Variable(tf.random.normal([num_hidden])),
    'output': tf.Variable(tf.random.normal([num_output]))
}
# Define the forward propagation step
def neural_network(input_data):
    hidden_layer = tf.add(tf.matmul(input_data, weights['hidden']), biases['hidden'])
    hidden_layer = tf.nn.sigmoid(hidden_layer)
    output_layer = tf.add(tf.matmul(hidden_layer, weights['output']), biases['output'])
    output_layer = tf.nn.sigmoid(output_layer)
    return output_layer
# Define the loss function and optimizer
loss_func = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)
# Define the training loop
num_epochs = 10000
for epoch in range(num_epochs):
    with tf.GradientTape() as tape:
        # Forward propagation
        output = neural_network(input_data)
        loss = loss_func(expected_output, output)
    # Backward propagation and update the weights and biases
    gradients = tape.gradient(loss, [weights['hidden'], weights['output'], biases['hidden'], biases['output']])
    optimizer.apply_gradients(zip(gradients, [weights['hidden'], weights['output'], biases['hidden'], biases['output']]))
    if epoch % 1000 == 0:
        print(f"Epoch {epoch} Loss: {loss.numpy():.4f}")
# Test the network
test_data = np.array([[0,0], [0,1], [1,0], [1,1]], dtype=np.float32)
predictions = neural_network(test_data)
print(predictions)
In this example, we define the architecture of the neural network by specifying the number of input, hidden, and output nodes. We also define the learning rate and the weight and bias variables. The forward propagation step is defined by using the tf.add() and tf.matmul() functions to compute the weighted sum and then applying the sigmoid activation function. The loss function and optimizer are defined using the tf.keras.losses and tf.keras.optimizers modules, respectively. Finally, we train the network by performing forward and backward propagation steps in a loop, and then we test the network using test data.
Using keras library
Keras is a high-level neural network API that can run on top of TensorFlow. It provides a simplified interface for building and training deep learning models. Here is an example of how to implement an Artificial Neural Network (ANN) in Python using Keras:
# Import the necessary libraries
from tensorflow import keras
from tensorflow.keras import layers
# Define the model architecture
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=[X_train.shape[1]]),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
# Compile the model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
# Train the model
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    batch_size=32
)
# Evaluate the model
test_scores = model.evaluate(X_test, y_test, verbose=2)
print(f'Test loss: {test_scores[0]}')
print(f'Test accuracy: {test_scores[1]}')
This example creates a model with 2 hidden layers and 1 output layer. The first 2 hidden layers have 64 nodes each and use the ReLU activation function. The output layer has a single node and uses the sigmoid activation function. The model is trained using the Adam optimizer and binary cross-entropy loss. The accuracy metric is used to evaluate the model.
To use this code, you will need to replace X_train, y_train, X_val, y_val, X_test, and y_test with your own training, validation, and test data.
Using PyTorch library
To implement Artificial Neural Networks (ANN) using PyTorch, you can follow these general steps:
# Import the necessary libraries: PyTorch, NumPy, and Pandas.
import torch
import numpy as np
import pandas as pd
# Load the dataset: You can use Pandas to load the dataset.
data = pd.read_csv('dataset.csv')
# Split the dataset into training and testing sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, :-1], data.iloc[:, -1], test_size=0.2, random_state=0)
# Convert the data into PyTorch tensors:
X_train = torch.from_numpy(np.array(X_train)).float()
y_train = torch.from_numpy(np.array(y_train)).float()
X_test = torch.from_numpy(np.array(X_test)).float()
y_test = torch.from_numpy(np.array(y_test)).float()
# Define the neural network architecture: You can define the neural network using the torch.nn module.
class ANN(torch.nn.Module):
    def __init__(self):
        super(ANN, self).__init__()
        self.fc1 = torch.nn.Linear(8, 16)
        self.fc2 = torch.nn.Linear(16, 8)
        self.fc3 = torch.nn.Linear(8, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))
        return x
model = ANN()
# In this example, we define an ANN with 3 fully connected layers, where the first two layers have a ReLU activation function and the last layer has a sigmoid activation function.
# Define the loss function and optimizer:
loss_fn = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Train the model:
for epoch in range(100):
    y_pred = model(X_train)
    loss = loss_fn(y_pred, y_train.unsqueeze(1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# Test the model:
y_pred_test = model(X_test)
y_pred_test = (y_pred_test > 0.5).float()
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred_test)
# Save the model:
torch.save(model.state_dict(), 'model.pth')
This is a general template for implementing an ANN using PyTorch. You can customize it based on your specific requirements.
In conclusion, ANNs are a powerful machine learning model that can be used to model non-linear relationships in data. The structure of an ANN consists of an input layer, one or more hidden layers, and an output layer. Python has several libraries that can be used to implement ANNs, including TensorFlow, Keras, and PyTorch.
Comments welcome!
Data Science
· 2022-03-05
-
Overview of Deep Learning Activation Functions
What are Activation functions?
Activation functions are a key component of neural networks in deep learning. They are mathematical functions applied to the output of a neural network layer to determine whether or not a neuron should be activated (i.e., “fired”). This output is then passed to the next layer of the neural network for further processing. There are many different activation functions that can be used in deep learning, including sigmoid, ReLU, and tanh. The choice of activation function can have a significant impact on the performance of a neural network, so it is an important consideration when designing and training a deep learning model.
Sigmoid activation function
The sigmoid activation function is one of the most commonly used activation functions in deep learning. It is a mathematical function that maps any input value to a value between 0 and 1, and it takes its name from its characteristic S-shaped curve. The sigmoid function is often used in binary classification problems, where the output is either 0 or 1. It is also used as a base for other, more complex activation functions, such as the hyperbolic tangent and the softmax function.
The formula for the sigmoid activation function is:
f(x) = 1 / (1 + e^-x)
where x is the input to the function, and e is the mathematical constant approximately equal to 2.71828. The output of the sigmoid function ranges between 0 and 1. When x is a large negative number, the output of the function is close to 0; when x is a large positive number, the output is close to 1; and when x is 0, the output is exactly 0.5.
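As a minimal illustration (a NumPy sketch with arbitrary sample inputs), the following code evaluates the sigmoid at a few points and shows the behaviour described above:
import numpy as np
def sigmoid(x):
    # f(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))
for x in [-10, -1, 0, 1, 10]:
    # large negative inputs map close to 0, large positive inputs close to 1, and 0 maps to 0.5
    print(x, round(float(sigmoid(x)), 4))
# prints approximately: -10 0.0, -1 0.2689, 0 0.5, 1 0.7311, 10 1.0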
The sigmoid function is popular in neural networks because it is differentiable, meaning that it can be used in backpropagation to calculate the gradient of the loss function. This is important because deep learning algorithms use gradient descent to optimize the weights of the neural network. The sigmoid function is also a smooth function, which helps in the convergence of the optimization algorithm.
However, the sigmoid function has some limitations. One of the main limitations is that it is prone to the vanishing gradient problem. When the input to the sigmoid function is too large or too small, the gradient of the function approaches zero. This can make it difficult for the algorithm to learn from the data. Another limitation of the sigmoid function is that it is not zero-centered, which can make it difficult to optimize the weights of the neural network.
To overcome these limitations, other activation functions have been developed. One such function is the Rectified Linear Unit (ReLU) function, which is now the most widely used activation function in deep learning. The ReLU function does not saturate for positive inputs, so it largely avoids the vanishing gradient problem (although, like the sigmoid, its output is not zero-centered).
In conclusion, the sigmoid activation function is an important component of deep learning. It is useful in binary classification problems and can serve as a base for other more complex activation functions. However, it has some limitations, which have led to the development of other activation functions. When choosing an activation function, it is important to consider the specific requirements of the problem and the strengths and limitations of the different activation functions.
Rectified Linear Unit (ReLU) activation function
Rectified Linear Unit (ReLU) is a popular activation function used in deep learning, especially for image classification tasks. It is a piecewise linear function that maps any negative input value to zero, and it is defined as f(x) = max(0, x).
The ReLU activation function has become one of the most popular activation functions in deep learning due to its computational efficiency and the fact that it helps to mitigate the vanishing gradient problem that can arise in deep neural networks.
The ReLU function is simple, non-linear, and can be computed very efficiently, making it a good choice for large datasets with many inputs. Its tendency to produce sparse activations (exact zeros for negative inputs) can also act as a mild form of regularization in deep neural networks.
One of the biggest advantages of ReLU is that it is very computationally efficient compared to other activation functions. This is because the function is simple to compute and requires only a single mathematical operation.
The ReLU activation function also helps to mitigate the vanishing gradient problem, which can occur in deep neural networks. When the gradient of the activation function becomes very small, the weights in the network are not updated properly, which can lead to a decline in the performance of the network. ReLU helps to prevent this problem by keeping the gradients from becoming too small.
There are some potential issues with using the ReLU activation function, however. One of the main issues is that ReLU neurons can “die” during training, meaning that they become permanently inactive and stop contributing to the network’s output. This can happen when a large weight update pushes a neuron into a region where its pre-activation is negative for every input, so it always outputs zero and receives no further gradient. This can be addressed through careful initialization of the network’s weights and a suitably small learning rate.
Another issue with ReLU is that it is not centered around zero, which can make it difficult to optimize certain types of networks. This has led to the development of several variations of the ReLU function, including the leaky ReLU and the parametric ReLU, which are designed to address these issues.
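As a small, framework-agnostic sketch (in NumPy, with illustrative inputs), ReLU and the leaky ReLU variant mentioned above can be written as:
import numpy as np
def relu(x):
    # f(x) = max(0, x): negative inputs become 0, positive inputs pass through unchanged
    return np.maximum(0.0, x)
def leaky_relu(x, alpha=0.01):
    # leaky ReLU keeps a small slope (alpha) for negative inputs, which helps avoid "dead" neurons
    return np.where(x > 0, x, alpha * x)
x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))        # [0.  0.  0.  0.5  3. ]
print(leaky_relu(x))  # [-0.03  -0.005  0.  0.5  3. ]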
In conclusion, the ReLU activation function is a powerful and computationally efficient choice for deep learning tasks, especially for image classification. It is effective at mitigating the vanishing gradient problem, which can be a major challenge in deep neural networks. While there are some potential issues with using ReLU, these can be addressed through careful initialization of weights and the use of variations of the function. Overall, ReLU is an excellent choice for deep learning tasks, and it is likely to continue to be a popular activation function in the years to come.
Tanh activation function
The tanh activation function is a popular choice in deep learning, and is used in many different types of neural networks. Tanh stands for hyperbolic tangent, and is a type of activation function that transforms the input of a neuron into an output between -1 and 1. This makes it a useful choice for many different types of neural networks, including those used in image recognition, natural language processing, and more.
The tanh activation function is a smooth, continuous function that is shaped like a sigmoid curve. It is symmetric around the origin, with values ranging from -1 to 1. When the input to a neuron is close to zero, the output of the tanh function is also close to zero. As the input becomes more positive or negative, the output of the function increases or decreases, respectively, until it reaches its maximum value of 1 or -1.
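A short NumPy sketch (with illustrative inputs) shows the output range of tanh and its close relationship to the sigmoid function:
import numpy as np
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.tanh(x))                  # outputs lie between -1 and 1, with tanh(0) = 0
print(2.0 * sigmoid(2.0 * x) - 1)  # same values: tanh(x) = 2 * sigmoid(2x) - 1, i.e. a rescaled, zero-centered sigmoid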
One of the main benefits of the tanh activation function is that it is differentiable, which means it can be used in backpropagation algorithms to update the weights and biases of a neural network during training. This allows the network to learn from data and improve its performance over time.
Another benefit of the tanh activation function is that it is centered around zero, which can improve the convergence of the neural network during training. Because the outputs have roughly zero mean, the activations passed to subsequent layers stay better centered, which tends to produce better-conditioned gradient updates than a function whose outputs are always positive.
However, one drawback of the tanh activation function is that it saturates when the input to a neuron is large in magnitude, which causes the gradients to become very small and slows down the learning process. This is the vanishing gradient problem, and it can be mitigated using techniques such as careful weight initialization and normalizing the inputs to keep activations away from the saturated regions.
In conclusion, the tanh activation function is a useful tool for deep learning, thanks to its smooth, differentiable nature and zero-centered output. While it is prone to saturation and the resulting vanishing gradients, these issues can be mitigated with proper techniques and training procedures. As with any activation function, the choice of tanh should be made based on the specific requirements of the neural network and the nature of the data being processed.
Comments welcome!
Data Science
· 2022-02-05
-
Overview of Deep Learning Techniques
Deep learning is a subset of machine learning that involves training artificial neural networks to learn and perform complex tasks. While both deep learning and machine learning involve training models on data to make predictions or decisions, deep learning models typically have many layers and are capable of learning increasingly complex representations of data, whereas traditional machine learning models often require feature engineering to create effective representations of data. Additionally, deep learning models are often better suited for tasks such as image recognition, speech recognition, and natural language processing, which require high-dimensional input data and benefit from the ability to learn hierarchical representations of features.
Key applications of Deep Learning
Deep learning can be used to solve regression, classification, and clustering problems. For example, convolutional neural networks (CNNs) can be used for image classification tasks, recurrent neural networks (RNNs) can be used for sequence classification tasks, and autoencoders can be used for clustering tasks. Additionally, deep learning models can be used for regression tasks, such as predicting stock prices or housing prices, by training a neural network to predict a continuous value.
Further, Deep learning has many applications in the financial services industry. Here are some examples:
Fraud detection: Deep learning algorithms can be used to detect fraudulent activities such as credit card fraud, money laundering, and identity theft.
Stock price prediction: Deep learning algorithms can be used to analyze large amounts of financial data to predict stock prices and market trends.
Algorithmic trading: Deep learning algorithms can be used to analyze market data and execute trades automatically.
Customer service: Deep learning algorithms can be used to analyze customer data and provide personalized services such as financial advice and investment recommendations.
Risk assessment: Deep learning algorithms can be used to assess the creditworthiness of customers and predict the likelihood of loan defaults.
Cybersecurity: Deep learning algorithms can be used to identify and mitigate cybersecurity threats such as hacking and phishing attacks.
Overall, the use of deep learning in the financial services industry has the potential to increase efficiency, reduce costs, and improve customer satisfaction.
Popular Deep Learning algorithms
There are several popular deep learning algorithms, each designed to solve different types of problems. Some of the most commonly used deep learning algorithms are:
ANNs (Artificial Neural Networks) are a type of machine learning algorithm inspired by the structure and function of the human brain. ANNs are composed of nodes that are interconnected in layers. Each node receives input signals, processes them, and produces an output signal. ANNs are often used for tasks such as classification, regression, pattern recognition, and optimization.
RNNs (Recurrent Neural Networks) are commonly used for sequential data such as natural language processing and time-series data analysis. They use feedback loops to store information from previous inputs, making them well-suited for tasks that involve processing sequential data.
CNNs (Convolutional Neural Networks) are commonly used for image and video recognition tasks. They work by performing convolutions on input images and learning features that can be used to identify objects or patterns within the images.
Autoencoders are a type of neural network commonly used for unsupervised learning, particularly for dimensionality reduction, feature learning, anomaly detection, image compression, and noise reduction. They work by encoding input data into a lower-dimensional representation and then decoding it back to its original form. An autoencoder can also be trained so that similar input data points are mapped to nearby points in the low-dimensional latent space; the latent space can then be used to cluster the inputs based on their proximity. This approach is sometimes referred to as “autoencoder-based clustering” or “deep clustering”.
SOM (Self-Organizing Map) is a type of artificial neural network that can be used for unsupervised learning tasks, such as clustering, visualization, and dimensionality reduction.
Component of Deep Learning algorithms
Hyperparameters in machine learning are settings that cannot be learned from the training data directly but need to be chosen before training. They are typically set by the data scientist or machine learning engineer and control the learning process of the model. Examples of hyperparameters include the learning rate, the regularization parameter, the number of hidden layers, and the number of neurons in each hidden layer. The values of hyperparameters can significantly affect the model’s performance, and finding the optimal values is often done through a trial-and-error or systematic search process.
Activation functions are a key component of neural networks in deep learning. They are mathematical functions applied to the output of a neural network layer to determine whether or not a neuron should be activated (i.e., “fired”). This output is then passed to the next layer of the neural network for further processing. There are many different activation functions that can be used in deep learning, including sigmoid, ReLU, and tanh. The choice of activation function can have a significant impact on the performance of a neural network, so it is an important consideration when designing and training a deep learning model.
The loss function is a measure of how well the model is performing during training. The goal is to minimize the loss function, which is accomplished through optimization.
The optimizer is the method for updating the model’s weights during training in order to minimize the loss function. Popular optimizers include stochastic gradient descent, Adam, and Adagrad.
Regularization is a set of techniques for preventing overfitting, which occurs when the model memorizes the training data instead of generalizing to new data. Popular regularization techniques include L1 and L2 regularization, dropout, and early stopping.
Layers are the basic building blocks of a neural network. Layers transform the input data in some way and pass it to the next layer.
Backpropagation is the algorithm used to calculate the gradients of the loss function with respect to the model’s weights, which is necessary for optimization.
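To show where each of these components appears in practice, here is a minimal, hypothetical Keras sketch; the layer sizes, learning rate, dropout rate, and patience value are arbitrary choices for illustration only:
from tensorflow import keras
from tensorflow.keras import layers
# Layers are the building blocks; Dropout is a regularization layer
model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(10,)),  # activation function: ReLU
    layers.Dropout(0.2),                                     # regularization: dropout
    layers.Dense(1, activation='sigmoid')                    # activation function: sigmoid
])
# Loss function and optimizer; the learning rate is a hyperparameter
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
# Regularization: early stopping halts training once validation loss stops improving
early_stop = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
# Backpropagation computes the gradients and the optimizer applies the weight updates inside fit():
# model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])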
Computational cost of Deep Learning algorithms
Deep learning models, particularly large ones, can be computationally expensive to train and run. The cost of training a deep learning model depends on various factors such as the size of the model, the complexity of the problem, the size of the training data, the number of layers, and the number of parameters.
Training a deep learning model can take hours, days, or even weeks, depending on the size and complexity of the model and the computing resources available. To mitigate this, deep learning engineers often use distributed training, which involves training the model across multiple machines, to reduce the overall training time.
In addition to the cost of training, running a deep learning model in production can also be expensive, particularly if the model requires a lot of computing resources or if it needs to process large amounts of data in real-time. To reduce these costs, engineers often use specialized hardware such as Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs) that are optimized for running deep learning models.
Therefore, it is important to carefully consider the computational costs of deep learning models before deciding to use them, and to ensure that the benefits of using deep learning outweigh the associated costs.
Deep learning has the potential to revolutionize the way we solve complex problems in a variety of fields, from healthcare to finance, to transportation and beyond. With the ability to learn and adapt from vast amounts of data, deep learning models have already achieved remarkable breakthroughs in image and speech recognition, natural language processing, and game playing, just to name a few examples.
However, as with any powerful tool, there are challenges and limitations to consider when working with deep learning models. Issues such as overfitting, interpretability, and computational cost must be carefully addressed to ensure that deep learning solutions are accurate, reliable, and practical.
Despite these challenges, the potential benefits of deep learning are undeniable, and the field is advancing at a rapid pace. As researchers and practitioners continue to push the boundaries of what’s possible, we can expect to see even more exciting breakthroughs and applications of deep learning in the years to come.
Comments welcome!
Data Science
· 2022-01-01
-
Boosting vs Bagging Model Improvement Techniques
In machine learning, there are two popular techniques for improving the accuracy of models: boosting and bagging. Both techniques are used to reduce the variance of a model, which is the tendency to overfit to the training data. While they have similar goals, they differ in their approach and functionality. In this article, we’ll explore the differences between boosting and bagging to help you decide which technique is right for your machine learning project.
Bagging
Bagging, short for bootstrap aggregating, is a technique that involves training multiple models on different random subsets of the training data. The goal of bagging is to reduce the variance of a model by averaging the predictions of multiple models. Each model in the ensemble is trained independently and the final prediction is the average of all models. Bagging can be used with any algorithm, but it is most commonly used with decision trees. The most popular implementation of bagging is the random forest algorithm, which uses an ensemble of decision trees to make predictions.
Boosting
Boosting is a technique that involves training multiple weak models on the same training data sequentially. The goal of boosting is to improve the accuracy of a model by adding new models that focus on the misclassified samples of the previous model. Each model in the ensemble is trained on the same dataset, but with different weights assigned to each sample. The weights are adjusted based on the misclassified samples of the previous model. The final prediction is a weighted average of all models in the ensemble. Boosting is commonly used with decision trees, but it can be used with any algorithm.
Differences between Boosting and Bagging
While boosting and bagging have similar goals, they differ in their approach and functionality. The main differences between these two techniques are:
Approach: Bagging involves training multiple models independently on different random subsets of the training data, while boosting trains multiple models sequentially on the same dataset with different weights assigned to each sample.
Sample Weighting: Bagging assigns equal weight to each sample in the training data, while boosting assigns higher weight to misclassified samples.
Model Selection: In bagging, the final prediction is the average of all models in the ensemble, while in boosting, the final prediction is a weighted average of all models in the ensemble.
Performance: Bagging can reduce the variance of a model and improve its stability, but it may not improve its accuracy. Boosting can improve the accuracy of a model, but it may increase its variance and overfitting.
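To see the two approaches side by side, here is a brief scikit-learn sketch that trains a bagging ensemble and a boosting ensemble on the same synthetic data (the dataset and hyperparameter values are arbitrary and for illustration only); by default both classifiers use decision trees as their base learners:
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Synthetic binary classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Bagging: independent trees trained on bootstrap samples, combined by voting
bagging = BaggingClassifier(n_estimators=100, random_state=42)
# Boosting: trees trained sequentially, each reweighting the samples the previous trees misclassified
boosting = AdaBoostClassifier(n_estimators=100, random_state=42)
for name, model in [('Bagging', bagging), ('Boosting', boosting)]:
    model.fit(X_train, y_train)
    print(name, 'accuracy:', accuracy_score(y_test, model.predict(X_test)))
Which ensemble performs better depends on the data; the point of the sketch is only that bagging trains its trees independently while boosting trains them one after another.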
Conclusion
In conclusion, boosting and bagging are two popular techniques for improving the accuracy of machine learning models. While they have similar goals, they differ in their approach and functionality. Bagging involves training multiple models independently on different subsets of the training data, while boosting trains multiple models sequentially on the same dataset with different weights assigned to each sample. Which technique is right for your machine learning project depends on your specific needs and goals. Bagging can improve model stability, while boosting can improve model accuracy.
Comments welcome!
Data Science
· 2021-12-04
-
Implementing XGBoost in Python
XGBoost (Extreme Gradient Boosting) is a popular algorithm for supervised learning problems, including regression, classification, and ranking tasks. In the financial services industry, XGBoost can be used for a variety of regression problems, such as predicting stock prices, credit risk scoring, and forecasting financial time series.
One advantage of XGBoost is that it can handle missing values and outliers in the data. It can also automatically handle feature selection and feature engineering, which are important steps in preparing data for regression analysis. XGBoost is also highly optimized for performance and can handle large datasets with millions of rows and thousands of features.
Use-case of xgboost for regression
For example, in the stock market, XGBoost can be used to predict the future price of a stock based on historical data. XGBoost can also be used for credit scoring to assess the creditworthiness of borrowers by analyzing various features such as credit history, income, and debt-to-income ratio. In addition, XGBoost can be used for forecasting financial time series, such as predicting the future values of stock market indices or exchange rates.
Use-case of xgboost for classification
One such application is in the classification of credit risk.
Credit risk classification is a fundamental task in the financial industry. The goal is to predict the probability of a borrower defaulting on a loan, based on a variety of factors such as credit score, income, employment status, and loan amount. This information can help banks and financial institutions make informed decisions about lending and managing risk.
XGBoost has been shown to be effective in credit risk classification tasks, achieving high accuracy and predictive power. In a typical use case, the algorithm is trained on historical data, which includes information about borrowers and their credit outcomes. The model is then used to predict the probability of default for new loan applications.
Implementation of XGBoost for regression using Python:
First, we’ll need to install the XGBoost library:
!pip install xgboost
Then, we can import the necessary libraries and load our dataset. In this example, we’ll use the Boston Housing dataset, which ships with older versions of scikit-learn (load_boston was deprecated and has since been removed in recent releases, so you may need to substitute another regression dataset, such as the California Housing data):
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# Load data
boston = load_boston()
X, y = boston.data, boston.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Next, we’ll define our XGBoost model and fit it to the training data:
# Define model
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1,
max_depth=5, alpha=10, n_estimators=10)
# Fit model
xg_reg.fit(X_train, y_train)
We can then use the trained model to make predictions on the test set and evaluate its performance using mean squared error:
# Make predictions on test set
y_pred = xg_reg.predict(X_test)
# Evaluate model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE: %f" % (rmse))
That’s it! We’ve trained an XGBoost model for regression and evaluated its performance on a test set. Note that in practice, you would likely want to tune the hyperparameters of the model using a validation set or cross-validation.
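As a sketch of that tuning step, scikit-learn’s GridSearchCV can cross-validate a small grid of XGBoost hyperparameters, reusing the X_train and y_train arrays from the example above (the grid values here are arbitrary examples):
from sklearn.model_selection import GridSearchCV
# Candidate hyperparameter values to cross-validate (illustrative only)
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1, 0.3],
    'n_estimators': [50, 100]
}
grid_search = GridSearchCV(
    estimator=xgb.XGBRegressor(objective='reg:squarederror'),
    param_grid=param_grid,
    scoring='neg_mean_squared_error',
    cv=5
)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best cross-validated RMSE: %f" % ((-grid_search.best_score_) ** 0.5))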
Implementing XGBoost for binary classification in Python:
In this example, we load the dataset into a Pandas dataframe and split it into training and testing sets using train_test_split from scikit-learn. We then define the XGBoost classifier with hyperparameters such as the number of trees, maximum depth of each tree, learning rate, and fraction of samples and features used in each tree. We train the model on the training data using fit and make predictions on the test data using predict. Finally, we evaluate the performance of the model using accuracy score.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the dataset into a Pandas dataframe
data = pd.read_csv('path/to/dataset.csv')
# Split the data into input features (X) and target variable (y)
X = data.drop('target_variable', axis=1)
y = data['target_variable']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Define the XGBoost classifier with hyperparameters
xgb_model = xgb.XGBClassifier(
    n_estimators=100,            # number of trees
    max_depth=5,                 # maximum depth of each tree
    learning_rate=0.1,           # learning rate
    subsample=0.8,               # fraction of samples used in each tree
    colsample_bytree=0.8,        # fraction of features used in each tree
    objective='binary:logistic', # objective function
    seed=42                      # random seed for reproducibility
)
# Train the XGBoost classifier on the training data
xgb_model.fit(X_train, y_train)
# Make predictions on the test data
y_pred = xgb_model.predict(X_test)
# Evaluate the performance of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
Implement XGBoost for multi-class classification using Python
In this example, we first load a multi-class classification dataset and split it into training and testing sets. We then initialize an XGBoost classifier and fit it on the training data. Finally, we make predictions on the test data and calculate the accuracy of the model. Note that the XGBClassifier class automatically handles multi-class classification problems, so we don’t need to do any additional preprocessing.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
data = pd.read_csv('dataset.csv')
# Separate target variable from features
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# Initialize the XGBoost classifier with default hyperparameters
model = xgb.XGBClassifier()
# Fit the model on the training data
model.fit(X_train, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: {:.2f}%'.format(accuracy * 100))
Overall, XGBoost is a powerful tool for regression in the financial services industry and is widely used by financial institutions and investment firms to make data-driven decisions.
Comments welcome!
Data Science
· 2021-11-06
-
Implementing Reinforcement Learning in Python and R
Reinforcement learning is a branch of machine learning that involves training agents to make a sequence of decisions in an environment to maximize a reward function. The agent receives feedback in the form of a reward signal for every action it takes, and its goal is to learn a policy that maximizes the long-term expected reward. In this article, we’ll discuss how to implement reinforcement learning in Python.
Reinforcement learning can be used in various ways in the financial services industry. Here are a few examples:
Algorithmic trading: Reinforcement learning can be used to create trading algorithms that can learn from market data and make decisions on when to buy, sell, or hold assets.
Portfolio management: Reinforcement learning can be used to optimize portfolios by selecting the most appropriate assets to invest in based on market conditions, past performance, and other factors.
Fraud detection: Reinforcement learning can be used to detect fraudulent transactions by learning from historical data and identifying patterns that indicate fraud.
Risk management: Reinforcement learning can be used to develop risk models that can predict and manage the risk of various financial instruments, such as derivatives.
Credit scoring: Reinforcement learning can be used to create credit scoring models that can learn from borrower behavior and other factors to predict creditworthiness and default risk.
There are several popular Python libraries for implementing reinforcement learning, such as TensorFlow, Keras, PyTorch, and OpenAI Gym. In this tutorial, we’ll use OpenAI Gym to create a simple reinforcement learning environment.
OpenAI Gym provides a collection of pre-built environments for reinforcement learning, such as CartPole and MountainCar. These environments provide a simple interface for creating agents that learn to interact with the environment and maximize the reward.
Let’s start by installing OpenAI Gym:
!pip install gym
Now, let’s create an environment for our agent:
import gym
env = gym.make('CartPole-v0')
This creates an instance of the CartPole environment, which is a classic control problem in reinforcement learning. The goal of the agent is to balance a pole on a cart by applying forces to the cart.
Now, let’s define our agent. We’ll use a Q-learning algorithm to learn a policy that maximizes the long-term expected reward. Q-learning is a simple reinforcement learning algorithm that learns an action-value function, which estimates the expected reward for taking a particular action in a particular state.
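In Q-learning, after taking action a in state s, receiving reward r, and observing the next state s', the Q-value is updated using the rule:
Q(s, a) = Q(s, a) + alpha * (r + gamma * max over a' of Q(s', a') - Q(s, a))
where alpha is the learning rate and gamma is the discount factor that controls how strongly future rewards are valued. This is exactly the update applied inside the training loop below.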
import numpy as np
num_states = env.observation_space.shape[0]
num_actions = env.action_space.n
q_table = np.zeros((num_states, num_actions))
This creates a Q-table, which maps each state-action pair to a Q-value that estimates the expected reward for taking that action in that state. Note that CartPole’s observations are continuous vectors, so in a practical implementation they must first be discretized into integer bins before they can be used to index a table like this.
Now, let’s train our agent. We’ll use a simple epsilon-greedy policy, which selects the action with the highest Q-value with probability 1-epsilon, and a random action with probability epsilon.
epsilon = 0.1
gamma = 0.99
alpha = 0.5
num_episodes = 10000
# note: this uses the classic Gym API, where reset() returns the observation and
# step() returns (next_state, reward, done, info); newer Gym/Gymnasium versions differ
for i in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        if np.random.uniform() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state, :])
        next_state, reward, done, info = env.step(action)
        q_table[state, action] += alpha * (reward + gamma * np.max(q_table[next_state, :]) - q_table[state, action])
        state = next_state
This trains our agent for 10,000 episodes using the Q-learning algorithm. During training, the agent updates the Q-values in the Q-table based on the rewards it receives.
Finally, let’s test our agent:
num_episodes = 100
total_reward = 0
for i in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        action = np.argmax(q_table[state, :])
        next_state, reward, done, info = env.step(action)
        total_reward += reward
        state = next_state
print('Average reward:', total_reward / num_episodes)
This tests our agent by running it for 100 episodes and averaging the rewards. If everything went well, the agent should be able to balance the pole on the cart and achieve a high average reward.
In conclusion, reinforcement learning is a powerful technique for training agents to make a sequence of decisions in an environment to maximize a reward function.
Comments welcome!
Data Science
· 2021-10-02
-
Implementing Association Rule Learning using APRIORI in Python and R
Association rule learning is a popular technique used in the financial services industry for analyzing customer behavior, identifying patterns, and making data-driven decisions.
Examples of association rule learning
Some examples of using association rule learning in the financial services industry are:
Cross-selling: Association rule learning can be used to identify the products that are frequently bought together by customers. This information can be used to create targeted cross-selling strategies and improve sales.
Fraud detection: Association rule learning can help in detecting fraudulent transactions. By analyzing the patterns of transactions, it can identify the transactions that deviate from the normal patterns and flag them for further investigation.
Risk management: Association rule learning can be used to analyze historical data and identify the factors that contributed to the financial risks. Based on these factors, financial institutions can create risk management strategies to mitigate the risks.
Customer segmentation: Association rule learning can help in segmenting customers based on their buying patterns. By analyzing the data, it can identify the groups of customers who share similar characteristics and create targeted marketing strategies.
Market basket analysis: Association rule learning can be used to analyze the purchase patterns of customers and identify the products that are frequently bought together. This information can be used to optimize the inventory management and improve the supply chain efficiency.
Implement Association rule learning (APRIORI algorithm) using Python
In order to use the Apriori algorithm, we need to install the apyori package. You can install the package using the following command:
!pip install apyori
Once you have installed the package, you can use the following code to apply the Apriori algorithm on a dataset:
from apyori import apriori
import pandas as pd
# Load the dataset
dataset = pd.read_csv('path/to/dataset.csv', header=None)
# Convert the dataset to a list of lists
records = []
for i in range(len(dataset)):
    records.append([str(dataset.values[i, j]) for j in range(len(dataset.columns))])
# Run the Apriori algorithm
association_rules = apriori(records, min_support=0.005, min_confidence=0.2, min_lift=3, min_length=2)
# Print the association rules
for rule in association_rules:
    print(rule)
In the code above, we first load the dataset into a Pandas dataframe and convert it into a list of lists. We then apply the Apriori algorithm on the dataset using the apriori() function from the apyori package. The min_support, min_confidence, min_lift, and min_length parameters are used to set the minimum support, confidence, lift, and length of the association rules. Finally, we print the association rules using a loop.
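For reference, the thresholds above correspond to the standard association-rule metrics for a rule A -> B:
Support(A -> B) = P(A and B), the fraction of transactions that contain both A and B.
Confidence(A -> B) = Support(A and B) / Support(A), the fraction of transactions containing A that also contain B.
Lift(A -> B) = Confidence(A -> B) / Support(B), how much more often A and B occur together than would be expected if they were independent.
A lift greater than 1 suggests that the items in A and B are positively associated.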
Implement Association rule learning (APRIORI algorithm) using R
To perform association rule learning using apriori algorithm in R, we first need to install and load the arules package. This package provides various functions to generate and analyze itemsets, as well as mine association rules.
Here’s an example of how to use apriori algorithm in R to generate association rules from a dataset:
# Install and load arules package
install.packages("arules")
library(arules)
# Load dataset
data("Groceries")
# Convert dataset to transactions
transactions <- as(Groceries, "transactions")
# Generate frequent itemsets
frequent_itemsets <- apriori(transactions, parameter = list(support = 0.005, confidence = 0.5))
# Generate association rules
association_rules <- apriori(transactions, parameter = list(support = 0.005, confidence = 0.5),
control = list(verbose = FALSE), appearance = list(rhs = c("whole milk"),
default = "lhs"))
# Inspect frequent itemsets and association rules
inspect(frequent_itemsets)
inspect(association_rules)
In the above example, we first loaded the Groceries dataset from the arules package. We then converted this dataset into a transaction object using the as() function.
Next, we used the apriori() function to generate frequent itemsets and association rules. The support parameter specifies the minimum support for an itemset to be considered frequent, while the confidence parameter specifies the minimum confidence for an association rule to be considered interesting.
We also specified a constraint on the association rules using the appearance parameter. In this case, we only generated association rules with “whole milk” on the right-hand side.
Finally, we used the inspect() function to visualize the frequent itemsets and association rules.
Overall, association rule learning is a powerful technique that can help financial institutions to make data-driven decisions, improve customer satisfaction, and increase revenue.
Comments welcome!
Data Science
· 2021-09-04
-
Implementing K-Means Clustering in Python and R
K-means clustering is a popular unsupervised learning technique used to cluster data points based on their similarity. In this article, we will explore what k-means clustering is, how it works, and how to implement it in Python and R.
What is K-means Clustering?
K-means clustering is a clustering algorithm that partitions n data points into k clusters based on their similarity. It aims to find the optimal center point for each cluster that minimizes the sum of squared distances between each data point and its respective cluster center. The algorithm iteratively assigns each data point to its nearest cluster center and re-computes the center point of each cluster.
How K-means Clustering Works?
K-means clustering follows a simple procedure to partition the data into k clusters. Here are the main steps involved in the k-means clustering algorithm:
Initialization: Choose k random points from the data as the initial cluster centroids.
Assignment: Assign each data point to the nearest cluster centroid based on the Euclidean distance.
Update: Calculate the new cluster centroid for each cluster based on the mean of all data points assigned to it.
Repeat: Repeat steps 2 and 3 until the cluster assignments no longer change or a maximum number of iterations is reached.
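Formally, k-means seeks cluster assignments and centroids that minimize the within-cluster sum of squares (WSS):
WSS = sum over clusters k of ( sum over points x in cluster k of ||x - mu_k||^2 )
where mu_k is the centroid (mean) of cluster k. This is the same quantity that the elbow method below plots against the number of clusters.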
Elbow method to choose the optimal number of clusters
The elbow method is a popular technique for choosing the optimal number of clusters in k-means clustering. It involves plotting the values of the within-cluster sum of squares (WSS) against the number of clusters, and identifying the “elbow” in the curve as the point at which additional clusters no longer provide a significant reduction in WSS.
Here’s how to implement the elbow method for choosing the optimal number of clusters in Python:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Create an array of the WSS values for a range of k values (number of clusters):
wss_values = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)  # X is your feature matrix of shape (n_samples, n_features)
    wss_values.append(kmeans.inertia_)
# Plot the WSS values against the number of clusters:
plt.plot(range(1, 11), wss_values)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WSS')
plt.show()
# Identify the "elbow" in the curve and select the optimal number of clusters
How to Implement K-means Clustering in Python?
Python has many machine learning libraries that provide built-in functions for implementing k-means clustering. Here is a simple example using the scikit-learn library:
from sklearn.cluster import KMeans
import numpy as np
# Generate some random data
data = np.random.rand(100, 2)
# Initialize KMeans object
kmeans = KMeans(n_clusters=2, random_state=0)
# Fit the data to the KMeans object
kmeans.fit(data)
# Print the cluster centers
print(kmeans.cluster_centers_)
In the above code, we first import the KMeans class from the scikit-learn library and generate some random data. We then initialize the KMeans object with the number of clusters and a random state for reproducibility. Finally, we fit the data to the KMeans object and print the resulting cluster centers.
Implementing K-means Clustering in R
To implement k-means clustering in R, we first need to load a dataset. For this example, we will use the iris dataset that comes with R. The iris dataset contains measurements of various attributes of iris flowers, such as sepal length, sepal width, petal length, and petal width. The dataset also includes the species of the flower.
# Load the iris dataset
data(iris)
# Select the columns that we want to cluster
data <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
# Scale the data
scaled_data <- scale(data)
Next, we will use the kmeans function to perform the clustering. We will set the number of clusters to 3 since there are 3 species of iris flowers in the dataset.
# Perform k-means clustering
kmeans_result <- kmeans(scaled_data, centers = 3)
Finally, we can plot the results to visualize the clusters.
# Plot the results
library(ggplot2)
df <- data.frame(scaled_data, cluster = as.factor(kmeans_result$cluster))
ggplot(df, aes(x = Sepal.Length, y = Sepal.Width, color = cluster)) + geom_point()
The resulting plot shows the three clusters that were formed by the algorithm.
Conclusion
K-means clustering is a popular unsupervised learning technique used for clustering data points based on their similarity. In this article, we explored what k-means clustering is, how it works, and how to implement it in Python (using the scikit-learn library) and R. K-means clustering is a powerful tool that has many applications in fields such as data mining, image processing, and natural language processing.
Comments welcome!
Data Science
· 2021-08-07
-
Implementing Random Forest Classification in Python and R
Random Forest Classification is a machine learning algorithm used for classification tasks. It is an extension of the decision tree algorithm, where multiple decision trees are built and combined to make a more accurate and stable prediction.
In a random forest, each decision tree is built using a random subset of the features in the dataset, which helps to reduce overfitting and improve the generalization performance of the model. The final prediction is made by aggregating the predictions of all the decision trees, usually through a voting mechanism.
Advantages of Random Forest Classification
The key advantages of Random Forest Classification are:
It can handle high-dimensional datasets with a large number of features.
It can handle missing data and outliers in the dataset.
It can model non-linear relationships between the input and output variables.
It is relatively easy to interpret the model and understand the importance of each feature in the prediction.
It is a robust and stable model that is less prone to overfitting compared to other classification algorithms.
Random Forest Classification can be implemented in various programming languages, including Python and R. The scikit-learn library in Python and the randomForest package in R are popular tools for building random forest models.
Math behind Random Forest Classification
Random Forest Classification is a machine learning algorithm that is based on the principles of decision trees and ensemble learning. The math behind Random Forest Classification can be broken down into the following steps:
Bootstrapped samples: The Random Forest algorithm creates multiple decision trees by randomly sampling the data with replacement (i.e., bootstrap samples). Each bootstrap sample has the same size as the original dataset, but with some of the data points repeated and others omitted.
Feature subset selection: For each decision tree, a random subset of features is selected to determine the best split at each node of the tree. This process helps to reduce the variance of the model and improve its generalization performance.
Decision tree construction: For each bootstrap sample and feature subset, a decision tree is constructed by recursively splitting the data into smaller subsets based on the selected features. The split is chosen to maximize the information gain, which is a measure of how well the split separates the classes.
Voting: Once all the decision trees have been constructed, their predictions are combined through a voting mechanism. Each decision tree predicts the class label of a test instance, and the final prediction is based on the majority vote of all the decision trees.
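To make the information gain criterion concrete: for a node S whose classes occur with proportions p_1, ..., p_C, the entropy is
Entropy(S) = - sum over classes i of p_i * log2(p_i)
and the information gain of a candidate split is the parent entropy minus the weighted average entropy of the resulting child nodes:
Gain(S, split) = Entropy(S) - sum over children v of (|S_v| / |S|) * Entropy(S_v)
In practice, implementations such as scikit-learn use the closely related Gini impurity by default, but the idea is the same: choose the split that makes the child nodes as pure as possible.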
Implementing Random Forest Classification in Python
To implement Random Forest Classification in Python, we can use the scikit-learn library. Here is an example code snippet:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
# Load the dataset
data = pd.read_csv('path/to/dataset.csv')
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.drop('target', axis=1), data['target'], test_size=0.3, random_state=42)
# Create a Random Forest Classifier with 100 trees
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
# Fit the model on the training data
rfc.fit(X_train, y_train)
# Predict the classes of the testing data
y_pred = rfc.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
In this example, we first load the dataset and split it into training and testing sets using the train_test_split function from scikit-learn. We then create a RandomForestClassifier object with 100 trees and fit the model on the training data using the fit method. We use the predict method to predict the classes of the testing data and calculate the accuracy of the model using the accuracy_score function from scikit-learn.
Note that in this example, we assume that the dataset is stored in a CSV file, where the target variable is in the column named “target”. You will need to adjust the code to match your dataset’s format and feature names.
Implementing Random Forest Classification in R
To implement Random Forest Classification in R, we can use the randomForest package. Here is an example code snippet:
library(randomForest)
# Load the dataset
data <- read.csv('path/to/dataset.csv')
# Split the dataset into training and testing sets
set.seed(42)
train_index <- sample(nrow(data), floor(nrow(data) * 0.7))
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
# Create a Random Forest Classifier with 100 trees
rfc <- randomForest(target ~ ., data=train_data, ntree=100)
# Predict the classes of the testing data
y_pred <- predict(rfc, newdata=test_data)
# Calculate the accuracy of the model
accuracy <- mean(y_pred == test_data$target)
print(paste0("Accuracy: ", round(accuracy * 100, 2), "%"))
In this example, we first load the dataset and split it into training and testing sets using the sample function. We then create a randomForest object with 100 trees and fit the model on the training data using the formula target ~ . to specify that the “target” variable should be predicted using all the other variables in the dataset. We use the predict function to predict the classes of the testing data and calculate the accuracy of the model using the mean function.
Note that in this example, we assume that the dataset is stored in a CSV file, where the target variable is in the column named “target”. You will need to adjust the code to match your dataset’s format and feature names.
Comments welcome!
Data Science
· 2021-07-03
-
Implementing Decision Tree Classification in Python and R
Decision tree classification is a widely used machine learning algorithm that is used to predict a categorical output variable based on one or more input variables. The algorithm works by constructing a tree-like model that maps the observations in the input space to the output variable. In this article, we will discuss how to implement decision tree classification in Python and R.
Implementing Decision tree classification in Python
Step 1: Import the Required Libraries
Before we start coding, we need to import the required libraries for implementing the decision tree classification algorithm in Python. We will be using the scikit-learn library to implement this algorithm. The scikit-learn library is a popular machine learning library in Python that provides various algorithms and tools for machine learning applications.
# import libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
Step 2: Load the Data
The second step is to load the data. In this example, we will be using the iris dataset, which is a popular dataset in machine learning. The iris dataset contains information about the sepal length, sepal width, petal length, and petal width of three different species of iris flowers. The objective is to predict the species of the iris flower based on the input variables.
# load the data
iris = load_iris()
X = iris.data
y = iris.target
Step 3: Split the Data
The third step is to split the data into training and testing datasets. We will be using 70% of the data for training and the remaining 30% for testing.
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
Step 4: Train the Model
The fourth step is to train the decision tree classification model using the training data.
# train the model
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
Step 5: Test the Model
The fifth step is to test the decision tree classification model using the testing data.
# test the model
y_pred = clf.predict(X_test)
Step 6: Evaluate the Model
The final step is to evaluate the performance of the decision tree classification model. We will be using the accuracy score to evaluate the performance of the model.
# evaluate the model
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_test, y_pred))
Implementing Decision tree classification in R
Step 1: Load the Dataset
The first step in implementing decision tree classification is to load the dataset. For this article, we will use the iris dataset, which is a popular dataset in machine learning.
To load the iris dataset, we can use the following code:
data(iris)
This will load the iris dataset into the R environment.
Step 2: Split the Dataset into Training and Test Sets
The next step is to split the dataset into training and test sets. We will use the training set to build the decision tree, and the test set to evaluate its performance.
To split the dataset, we can use the following code:
set.seed(123)
train <- sample(nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train,]
test_data <- iris[-train,]
This code will split the iris dataset into training and test sets. The set.seed function is used to ensure that the split is reproducible. We are using 70% of the data for training and 30% for testing.
Step 3: Build the Decision Tree
The next step is to build the decision tree. We will use the rpart package in R to build the decision tree.
To build the decision tree, we can use the following code:
library(rpart)
fit <- rpart(Species ~ ., data=train_data, method="class")
This code will build the decision tree using the rpart function in R. The formula Species ~ . specifies that we want to predict the Species variable using all the other variables in the dataset. The method=”class” argument specifies that we are building a classification tree.
Step 4: Visualize the Decision Tree
The next step is to visualize the decision tree. We can use the plot function in R to visualize the decision tree.
To visualize the decision tree, we can use the following code:
plot(fit, margin=0.1)
text(fit, use.n=TRUE, all=TRUE, cex=.8)
This code will create a plot of the decision tree. The margin=0.1 argument specifies that we want to add a margin around the plot. The text function is used to add labels to the nodes of the decision tree.
Step 5: Make Predictions on the Test Set
The final step is to make predictions on the test set. We will use the decision tree to make predictions on the test set, and then evaluate its performance.
To make predictions on the test set, we can use the following code:
predictions <- predict(fit, test_data, type="class")
This code will make predictions on the test set using the decision tree. The type=”class” argument specifies that we want to make class predictions.
In conclusion, decision tree classification is a powerful algorithm that can be used to predict a categorical output variable based on one or more input variables. The Python scikit-learn library and R rpart library provide an easy-to-use implementation of this algorithm.
Comments welcome!
Data Science
· 2021-06-05
-
Implementing Logistic Regression in Python and R
Logistic regression is a type of statistical analysis (also known as the logit model). It is often used for predictive analytics and modeling, and it extends to applications in machine learning. In this approach, the dependent variable is finite or categorical: either A or B (binary regression) or a range of finite options A, B, C, or D (multinomial regression). Logistic regression is used to understand the relationship between the dependent variable and one or more independent variables by estimating probabilities with a logistic regression equation.
This type of analysis can help you predict the likelihood of an event happening or a choice being made. For example, you may want to know the likelihood of a visitor choosing an offer made on your website — or not (dependent variable). Your analysis can look at known characteristics of visitors, such as sites they came from, repeat visits to your site, behavior on your site (independent variables). Logistic regression models help you determine a probability of what type of visitors are likely to accept the offer — or not. As a result, you can make better decisions about promoting your offer or make decisions about the offer itself.
Logistic regression formula
Logit(p) = log(p / (1 - p))
where p is the probability of a positive outcome.
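Inverting the logit expresses the probability itself as a sigmoid (logistic) function of a linear combination of the independent variables:
p = 1 / (1 + e^-(b0 + b1*x1 + ... + bn*xn))
where b0 is the intercept and b1, ..., bn are the coefficients estimated from the training data.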
Types of logistic models
Following are some types of predictive models that use logistic analysis.
Generalized linear model
Discrete choice
Multinomial logit
Mixed logit
Probit
Multinomial probit
Ordered logit
Assumptions of logistic regression
Before we apply the logistic regression model, we also need to check if the following assumptions hold true.
The Response Variable is Binary
The Observations are Independent - The easiest way to check this assumption is to create a plot of residuals against time (i.e. the order of the observations) and observe whether or not there is a random pattern. If there is not a random pattern, then this assumption may be violated.
There is No Multicollinearity Among Explanatory Variables - The most common way to detect multicollinearity is by using the variance inflation factor (VIF), which measures the correlation and strength of correlation between the predictor variables in a regression model.
There are No Extreme Outliers - The most common way to test for extreme outliers and influential observations in a dataset is to calculate Cook’s distance for each observation. If there are indeed outliers, you can choose to (1) remove them, (2) replace them with a value like the mean or median, or (3) simply keep them in the model but make a note about this when reporting the regression results.
There is a Linear Relationship Between Explanatory Variables and the Logit of the Response Variable. The easiest way to see if this assumption is met is to use a Box-Tidwell test.
Implementing the model in python and R
Implementing the model consists of the following key steps.
Data pre-processing: This is similar for most ML models, so we tackle this in a separate article and not here
Training the model
Using the model for prediction
Data pre-processing
At this stage we carry out several pre-processing activities, including splitting the data into a training set and a test set. We can usually follow the 80:20 principle, meaning that we use 80% of our data to train the model and the remaining 20% to test it and catch under- or overfitting.
Training the model
We use the generalized linear model to obtain an equation that predicts the dependent variable using independent variables from the training set.
Using python
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
Using R
classifier = glm(formula = Purchased ~ ., family = binomial, data = training_set)
Using the model
Now, we use the obtained equation to predict the dependent variable using the test set independent variables.
Using python
y_pred = classifier.predict(X_test)
Using R
prob_pred = predict(classifier, type = 'response', newdata = test_set[-3])
y_pred = ifelse(prob_pred > 0.5, 1, 0)
Visualizing results
Visualising the outcome of the model through a confusion matrix.
Using python
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
accuracy_score(y_test, y_pred)
Using R
cm = table(test_set[, 3], y_pred > 0.5)
For full implementation, check out my github repository - python and github repository - R.
Comments welcome!
Data Science
· 2021-05-01
-
Implementing Random Forest Regression in Python and R
Random forest regression is a popular machine learning algorithm used for predicting numerical values. It is a variant of the random forest algorithm and is well-suited for regression problems where the response variable is continuous. In this article, we will learn how to implement random forest regression using Python and R.
What is Random Forest Regression?
Random forest regression is an ensemble learning method that builds a collection of decision trees and aggregates their predictions to make a final prediction. Each decision tree is built using a subset of the training data and a subset of the features. Random forest regression uses bagging (bootstrap aggregating) to build each tree and random feature selection to reduce overfitting.
Implementing random forest regression using Python:
Step 1: Import Libraries
We start by importing the necessary libraries. We need the pandas library to load and manipulate the data, and the scikit-learn library for building and evaluating the model.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
Step 2: Load and Prepare the Data
Next, we load the data into a pandas dataframe and prepare it for training. We need to split the data into the independent variables (features) and dependent variable (target) and split the data into training and testing sets.
# load the data into a pandas dataframe
df = pd.read_csv('data.csv')
# split the data into features and target
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Step 3: Train the Model
Next, we create an instance of the RandomForestRegressor class and fit it to the training data.
# create an instance of the random forest regressor class
rf = RandomForestRegressor(n_estimators=100, random_state=0)
# fit the model to the training data
rf.fit(X_train, y_train)
Step 4: Evaluate the Model
Finally, we evaluate the performance of the model using the testing set. We calculate the R-squared score and mean squared error to determine how well the model is performing.
# make predictions using the testing set
y_pred = rf.predict(X_test)
# calculate the R-squared score
r2 = r2_score(y_test, y_pred)
print('R-squared: {:.2f}'.format(r2))
# calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error: {:.2f}'.format(mse))
Step 5: Make Predictions
Once we have trained the model, we can use it to make predictions on new data. We can pass in new data to the predict method to get the predicted values.
# make a prediction for a new sample
new_sample = [[5, 10, 15]]
prediction = rf.predict(new_sample)
print('Prediction: {:.2f}'.format(prediction[0]))
Implementing random forest regression using R:
Step 1: Import Libraries
Let’s start by loading the necessary packages and data for our implementation:
# Load necessary libraries
library(randomForest)
Step 2: Load and Prepare the Data
In this example, we will be using the mtcars dataset, which contains information on various car models, including miles per gallon (mpg), horsepower (hp), and weight.
data(mtcars)
Next, we will split the data into training and testing sets. We will be using 70% of the data for training and 30% for testing.
# Split the data into training and testing sets
set.seed(1234)
train <- sample(nrow(mtcars), 0.7 * nrow(mtcars))
test <- setdiff(seq_len(nrow(mtcars)), train)
Step 3: Train the Model
Now, we can build our random forest regression model using the randomForest function. We will use the mpg column as our response variable and the hp and wt columns as our predictor variables.
# Build the random forest regression model
rf <- randomForest(mpg ~ hp + wt, data = mtcars[train,], ntree = 500)
The ntree parameter specifies the number of trees to include in the model. In this example, we have set ntree to 500.
Step 4: Make Predictions
We can now use the predict function to make predictions on the test data and compare them to the actual values.
# Make predictions on the test data
predictions <- predict(rf, mtcars[test,])
# Calculate the root mean squared error (RMSE)
rmse <- sqrt(mean((predictions - mtcars[test, "mpg"])^2))
print(rmse)
The RMSE value will give us an idea of how accurate our model is. In this example, we obtained an RMSE value of 3.441.
Step 5: Visualize
We can also plot the predicted values against the actual values to visualize the accuracy of our model.
# Plot predicted values against actual values
plot(predictions, mtcars[test, "mpg"],
xlab = "Predicted MPG", ylab = "Actual MPG")
This will produce a scatter plot with the predicted values on the x-axis and the actual values on the y-axis.
Conclusion
In this article, we learned how to implement random forest regression using Python and R. We used the scikit-learn library in Python and randomForest library in R to build and evaluate the model. Random forest regression is a powerful algorithm for predicting continuous values and can be used for a variety of regression problems.
Comments welcome!
Data Science
· 2021-04-03
-
Support Vector Regression
Support Vector Regression (SVR) is a type of regression algorithm that uses Support Vector Machines (SVM) to perform regression analysis. In contrast to traditional regression algorithms, which aim to minimize the error between the predicted and actual values, SVR aims to fit a “tube” around the data such that the majority of the data points fall within the tube. The goal of SVR is to find a function that is as flat as possible while keeping most data points inside the tube.
In SVR, the input data is transformed into a higher-dimensional space, where a linear regression model is applied. The SVM then finds the best fit line for the transformed data, which corresponds to a non-linear fit in the original data space.
Implementing SVR in Python
To implement SVR in Python, we can use the SVR class from the sklearn.svm module in scikit-learn, which is a popular Python machine learning library. Here’s an example code to implement SVR in Python:
from sklearn.svm import SVR
import numpy as np
# Generate some sample data
X = np.sort(5 * np.random.rand(100, 1), axis=0)
y = np.sin(X).ravel()
# Create an SVR object and fit the model to the data
clf = SVR(kernel='rbf', C=1e3, gamma=0.1)
clf.fit(X, y)
# Make some predictions with the trained model
y_pred = clf.predict(X)
# Print the mean squared error of the predictions
mse = np.mean((y_pred - y) ** 2)
print(f"Mean squared error: {mse:.2f}")
In this example, we generate some sample data by randomly selecting 100 points along the sine curve. We then create an SVR object with an RBF kernel and some hyperparameters C and gamma. We fit the model to the sample data and make some predictions with the trained model. Finally, we calculate the mean squared error between the predicted values and the true values.
Note that the hyperparameters C and gamma control the regularization and non-linearity of the SVR model, respectively. These values can be tuned to optimize the performance of the model on a particular dataset. Additionally, scikit-learn provides many other options for configuring and fine-tuning the SVR model.
Implementing SVR in R
In R, we can implement SVR using the e1071 package, which provides the svm function for fitting support vector machines. Here’s an example code to implement SVR in R:
library(e1071)
# Generate some sample data
set.seed(1)
x <- sort(5 * runif(100))
y <- sin(x)
# Fit an SVR model to the data
model <- svm(x, y, kernel = "radial", gamma = 0.1, cost = 1000)
# Make some predictions with the trained model
y_pred <- predict(model, x)
# Print the mean squared error of the predictions
mse <- mean((y_pred - y) ^ 2)
cat(sprintf("Mean squared error: %.2f\n", mse))
In this example, we generate some sample data by randomly selecting 100 points along the sine curve. We then fit an SVR model to the data using the svm function from the e1071 package. We use a radial basis function (RBF) kernel and specify some hyperparameters gamma and cost. We make some predictions with the trained model and calculate the mean squared error between the predicted values and the true values.
Note that the hyperparameters gamma and cost control the non-linearity and regularization of the SVR model, respectively. These values can be tuned to optimize the performance of the model on a particular dataset. Additionally, the scikit-learn (Python) and e1071 (R) packages provide many other options for configuring and fine-tuning the SVM model.
Math behind SVR
The math behind Support Vector Regression (SVR) is based on the same principles as Support Vector Machines (SVM), with some modifications to handle regression tasks. Here is a brief overview of the math behind SVR:
Given a set of training data, SVR first transforms the input data to a high-dimensional feature space using a kernel function. The kernel function computes the similarity between two data points in the original space and maps them to a higher-dimensional space where they can be more easily separated by a linear hyperplane.
The goal of SVR is to find a hyperplane in the feature space that maximally separates the training data while maintaining a margin around it. This is done by solving an optimization problem that involves minimizing the distance between the hyperplane and the training data while maximizing the margin.
In SVR, the margin is defined as a tube around the hyperplane, rather than a margin between two parallel hyperplanes as in SVM. The width of the tube is controlled by two parameters, ε (epsilon) and C. ε defines the width of the tube and C controls the trade-off between the size of the margin and the amount of training data that is allowed to violate it.
The optimization problem in SVR is typically formulated as a quadratic programming problem, which can be solved using numerical optimization techniques.
Once the hyperplane is found, SVR uses it to make predictions for new data points by computing their distance to the hyperplane in the feature space. The distance is transformed back to the original space using the kernel function to obtain the predicted output.
Overall, the math behind SVR involves finding a hyperplane that maximizes the margin around the training data while maintaining a tube around the hyperplane. This is done by transforming the data to a high-dimensional feature space, solving an optimization problem to find the hyperplane, and using the hyperplane to make predictions for new data points.
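To make the role of the tube width concrete, here is a small sketch (assuming scikit-learn and toy data in the same style as the example above) that fits two SVR models with different epsilon values and compares how many support vectors each keeps:
import numpy as np
from sklearn.svm import SVR
# noisy points along a sine curve, similar to the earlier example
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(100, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(100)
for eps in (0.01, 0.3):
    model = SVR(kernel='rbf', C=100, gamma=0.5, epsilon=eps)
    model.fit(X, y)
    # a wider tube (larger epsilon) tolerates more deviation and needs fewer support vectors
    print(f"epsilon={eps}: {len(model.support_)} support vectors")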
Advantages of SVR
Support Vector Regression (SVR) has several advantages over other regression models:
Non-linearity: SVR can model non-linear relationships between the input and output variables, while linear regression models can only model linear relationships.
Robustness to outliers: SVR is less sensitive to outliers in the input data compared to other regression models. This is because the optimization process in SVR only considers data points near the decision boundary, rather than all data points.
Flexibility: SVR allows for the use of different kernel functions, which can be used to model different types of non-linear relationships between the input and output variables.
Regularization: SVR incorporates a regularization term in the objective function, which helps to prevent overfitting and improve the generalization performance of the model.
Efficient memory usage: SVR uses only a subset of the training data (support vectors) to build the decision boundary. This results in a more efficient memory usage, which is particularly useful when dealing with large datasets.
Overall, SVR is a powerful and flexible regression model that can handle a wide range of regression tasks. Its ability to model non-linear relationships, its robustness to outliers, and its efficient memory usage make it a popular choice for many machine learning applications.
Comments welcome!
Data Science
· 2021-03-06
-
Implementing Linear Regression in Python and R
Regression is a supervised learning technique to predict the value of a continuous target or dependent variable using a combination of predictor or independent variables. Linear regression is a type of regression where the primary consideration is that the independent and dependent variables have a linear relationship. Linear regression is of two broad types - simple linear regression and multiple linear regression. In simple linear regression there is only one independent variable. Whereas, multiple linear regression refers to a statistical technique that uses two or more independent variables to predict the outcome of a dependent variable. Linear regression also has some modifications such as lasso, ridge or elastic-net regression. However, in this article we will cover multiple linear regression.
Intuition behind linear regression
Before we begin, let us take a look at the equation of multiple linear regression. Y is the target variable that we are trying to predict. x1, x2, .. , xn are the n predictor variables. b0, b1, .. , bn are the coefficients that the linear regression (OLS - ordinary least squares) model will help us figure out. For example, we can use linear regression to predict a real value, like profit.
Y = b0 + b1*x1 + b2*x2 + .. + bn*xn
profit = b0 + b1*r_n_d_spend + b2*administration + b3*marketing_spend + b4*state
The ordinary least squares method gets the best fitting line by identifying the line that minimizes square of distance between actual and predicted values.
sum ( y_actual - y_hat ) ^ 2 -> minimize
Assumptions of linear regression
Before we apply the linear regression model, we also need to check that the following assumptions hold true (a quick residual-diagnostics sketch follows the list).
Linearity: The relationship between X and the mean of Y is linear
Homoscedasticity: The variance of residual is the same for any value of X
Independence: Observations are independent of each other
Normality: For any fixed value of X, Y is normally distributed
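A minimal sketch for eyeballing linearity, homoscedasticity and normality through residual plots is shown below; it assumes a fitted scikit-learn regressor (as built later in this article), the training data X_train and y_train, and matplotlib, and is illustrative rather than a formal test:
import matplotlib.pyplot as plt
import scipy.stats as stats
# residuals from a fitted model (regressor, X_train, y_train are assumed to exist)
y_fit = regressor.predict(X_train)
residuals = y_train - y_fit
fig, axes = plt.subplots(1, 2, figsize = (10, 4))
# residuals vs fitted values: a random, even scatter supports linearity and homoscedasticity
axes[0].scatter(y_fit, residuals, alpha = 0.6)
axes[0].axhline(0, color = 'red', linestyle = '--')
axes[0].set_xlabel('Fitted values')
axes[0].set_ylabel('Residuals')
# Q-Q plot of residuals: points close to the line support the normality assumption
stats.probplot(residuals, plot = axes[1])
plt.tight_layout()
plt.show()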
Implementing the model in python and R
Implementing the model consists of the following key steps.
Data pre-processing: This is similar for most ML models, so we tackle this in a separate article and not here
Training the model
Using the model for prediction
Data pre-processing
At this stage we do several pre-processing activities, including splitting the data into a training set and a test set. We can usually follow the 80:20 principle, meaning that we use 80% of our data to train the model and the remaining 20% to test it and catch under- or overfitting.
Training the model
We use the ordinary least squares method to obtain an equation that predicts the dependent variable using independent variables from the training set.
Using python
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Using R
regressor = lm(formula = Profit ~ ., data = training_set)
Using the model
Now, we use the obtained equation to predict the dependent variable using the test set independent variables.
Using python
y_pred = regressor.predict(X_test)
Using R
y_pred = predict(regressor, newdata = test_set)
Visualizing results
Visualising actual (x-axis) vs predicted (y-axis) test set values
Using python
plt.scatter(y_test, y_pred)
Using R
ggplot() + geom_point(aes(x = test_set$Profit, y = y_pred))
For full implementation, check out my github repository - python and github repository - R.
Comments welcome!
Data Science
· 2021-02-06
-
An Overview of Machine Learning Techniques
Machine learning is a subfield of artificial intelligence (AI) that allows systems to learn and improve from experience without being explicitly programmed. Essentially, machine learning involves the use of algorithms that can learn from data and improve performance over time. This means that machine learning can be used to identify patterns and make predictions, and can be used in a wide variety of applications, such as image and speech recognition, fraud detection, recommender systems, and many more.
The process of building a machine learning model typically involves several steps, including data cleaning and preprocessing, selecting appropriate features, selecting an appropriate model or algorithm, training the model on a labeled dataset, and then evaluating its performance on a separate test dataset. This process is often iterative, with adjustments made to the model and its parameters until the desired level of performance is achieved.
There are several types of machine learning, including supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning involves training a model on labeled data, meaning that the desired output is already known.
Regression
Regression is used to predict a continuous value, such as a number or a quantity. It is used to model the relationship between a dependent variable (the output) and one or more independent variables (the inputs). Regression is commonly used for tasks such as predicting stock prices, weather forecasting, or predicting sales figures.
Following are some common regression algorithms:
Linear Regression: This is a simple algorithm that models the relationship between a dependent variable and one or more independent variables.
Ridge Regression: This is a type of linear regression that includes a penalty term to prevent overfitting.
Lasso Regression: This is another type of linear regression that includes a penalty term, but it has the added benefit of performing feature selection.
Elastic Net Regression: This algorithm is a combination of Ridge and Lasso regression, allowing for both feature selection and regularization.
Polynomial Regression: This algorithm fits a polynomial equation to the data, allowing for more complex relationships between the dependent and independent variables.
Support Vector Regression: This algorithm models the data by finding a hyperplane that maximizes the margin between the data points.
Decision Tree Regression: This algorithm builds a decision tree based on the data, allowing for nonlinear relationships between the dependent and independent variables.
Random Forest Regression: This is an extension of decision tree regression that builds multiple trees and averages their predictions to improve accuracy.
Gradient Boosting Regression: This is an ensemble method that combines multiple weak regression models to create a strong model.
Classification
Classification, on the other hand, is used to predict a categorical value, such as a label or a class. It is used to identify the class or category to which a given data point belongs based on the features or attributes of that data point. Classification is commonly used for tasks such as image recognition, spam filtering, or predicting whether a customer will churn or not.
Following are some common classification algorithms:
Logistic Regression: Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes).
Support Vector Machines: Support Vector Machines (SVM) are a set of related supervised learning methods that analyze data and recognize patterns, used for classification and regression analysis. SVM works by finding the hyperplane that maximizes the margin between the two classes, and then classifying new data points based on which side of the hyperplane they fall on.
K-Nearest Neighbors: K-Nearest Neighbors (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). KNN is a type of instance-based learning or lazy learning where the function is only approximated locally and all computation is deferred until classification.
Naive Bayes: Naive Bayes is a probabilistic algorithm that makes predictions based on the probability of a certain outcome. It works by calculating the probability of each class given a set of input features, and then choosing the class with the highest probability.
Decision Trees: A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label. Decision trees are popular because they are easy to understand and interpret.
Random Forest: This algorithm works by creating multiple decision trees, each based on a different random subset of the original data. The trees are then combined to make predictions on new data by taking a majority vote. The main advantage of Random Forest is that it can handle both categorical and numerical data, and can also handle missing values. It is known for its high accuracy and is often used in real-world applications such as image classification, fraud detection, and recommendation systems. However, it can be computationally expensive and may overfit if the number of trees is too large.
Unsupervised learning involves training a model on unlabeled data, meaning that the model must identify patterns and relationships on its own.
Clustering
Clustering is a technique used in unsupervised machine learning to group similar data points together based on their attributes or features.
Following are some common clustering algorithms:
K-Means Clustering: This algorithm groups data points into k clusters based on their distance from k centroids. The algorithm iteratively adjusts the centroids to minimize the sum of squared distances between data points and their respective centroids (a short k-means sketch follows this list).
Hierarchical Clustering: This algorithm creates a hierarchy of clusters by either starting with individual data points as clusters and combining them iteratively or starting with all data points as a single cluster and splitting them iteratively.
DBSCAN: This algorithm groups data points together that are closely packed together in high-density regions and separates out data points that are in low-density regions.
Gaussian Mixture Models: This algorithm models data as a combination of multiple Gaussian distributions and groups data points together based on the probabilities of belonging to different distributions.
Spectral Clustering: This algorithm uses graph theory to group data points together based on the similarity of their eigenvectors.
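As an illustration of the clustering idea, here is a minimal k-means sketch using scikit-learn on synthetic data; the dataset and the choice of k=3 are assumptions made for the example:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# synthetic 2-D data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
# fit k-means with k=3 and read back the cluster assignments
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(kmeans.cluster_centers_)   # coordinates of the three learned centroids
print(labels[:10])               # cluster index assigned to the first ten points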
Association rule-based learning
Association rule-based learning algorithms are a type of unsupervised machine learning algorithm that identify interesting relationships, associations, or correlations among different variables in a dataset. These algorithms are commonly used in market basket analysis, where the goal is to identify relationships between items that are frequently purchased together.
Following are some common association rule learning algorithms:
Apriori algorithm: A classic algorithm that discovers frequent itemsets in a dataset and generates association rules based on these itemsets (a small Apriori sketch follows this list).
FP-Growth algorithm: A faster algorithm than Apriori that builds a compact representation of the dataset, known as a frequent pattern (FP) tree, to efficiently mine frequent itemsets and generate association rules.
Eclat algorithm: Another algorithm that mines frequent itemsets in a dataset, but instead of generating association rules, it focuses on finding frequent itemsets that share a common prefix.
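A minimal sketch of Apriori-style frequent-itemset mining using the mlxtend library (mlxtend and the tiny basket data below are assumptions for the example, not part of the original post):
import pandas as pd
from mlxtend.frequent_patterns import apriori
# tiny one-hot encoded market-basket data: one row per transaction, one column per item
baskets = pd.DataFrame({
    'bread':  [True, True, False, True, True],
    'butter': [True, True, False, False, True],
    'jam':    [False, True, True, False, True],
})
# frequent itemsets that appear in at least 40% of the transactions
frequent_itemsets = apriori(baskets, min_support = 0.4, use_colnames = True)
print(frequent_itemsets)
# association rules (e.g. bread -> butter) can then be derived from these itemsets
# with mlxtend.frequent_patterns.association_rules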
Reinforcement learning involves training a model to make decisions based on trial-and-error feedback.
Reinforcement learning is a broader class of problems in which an agent interacts with an environment over a period of time, and the agent’s goal is to learn a policy that maximizes its total reward over the long run.
On the other hand, the multi-armed bandit problem is often considered a simpler version of reinforcement learning. In the multi-armed bandit problem, an agent repeatedly selects an action (often referred to as a “bandit arm”) and receives a reward associated with that action. The agent’s goal is to maximize its total reward over a fixed period of time.
For example, there are a number of slot machines (or “one-armed bandits”) that a player can choose to play. Each slot machine has a different probability of paying out, and the player’s goal is to figure out which slot machine has the highest payout probability in the shortest amount of time.
Following are some common algorithms to solve the multi-armed bandit problem:
The Upper Confidence Bound (UCB) algorithm approaches this problem by keeping track of the average payout for each slot machine, as well as the number of times each machine has been played. It then calculates an upper confidence bound for each machine based on these values, which represents the upper limit of what the true payout probability could be for that machine. The player then chooses the slot machine with the highest upper confidence bound, which balances the desire to play machines that have paid out well in the past with the desire to explore other machines that may have a higher payout probability. Over time, as more data is collected on each machine’s payout probability, the upper confidence bound for each machine becomes narrower and more accurate, leading to better decisions and higher payouts for the player (a minimal UCB sketch is shown after this list).
Thompson sampling is a Bayesian algorithm for decision making under uncertainty. It is a probabilistic algorithm that can be used to solve multi-armed bandit problems. The algorithm works by updating a prior distribution on the unknown parameters of the problem based on the observed data. At each step, the algorithm chooses the action with the highest expected reward, where the expected reward is calculated by averaging over the posterior distribution of the unknown parameters. The algorithm is often used in online advertising, where it can be used to choose the best ad to display to a user based on their past behavior.
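To make the UCB idea concrete, here is a minimal sketch with simulated Bernoulli bandit arms; the payout probabilities are made up for the example:
import math
import random
random.seed(0)
true_probs = [0.2, 0.5, 0.7]          # hidden payout probability of each arm (made up)
counts = [0] * len(true_probs)        # times each arm was played
rewards = [0.0] * len(true_probs)     # total reward collected from each arm
for t in range(1, 1001):
    # play every arm once first, then pick the arm with the highest upper confidence bound
    if 0 in counts:
        arm = counts.index(0)
    else:
        ucb = [rewards[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
               for i in range(len(true_probs))]
        arm = ucb.index(max(ucb))
    reward = 1 if random.random() < true_probs[arm] else 0
    counts[arm] += 1
    rewards[arm] += reward
print(counts)   # the best arm (index 2) should have been played most often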
Overall, machine learning is a powerful tool that has the potential to revolutionize many industries and improve our lives in countless ways. As more data becomes available and computing power continues to increase, we can expect to see even more impressive applications of machine learning in the years to come.
Comments welcome!
Data Science
· 2021-01-02
-
A Premier on Chi-squared test
The chi-square test is a statistical hypothesis test that is used to determine whether there is a significant association between two categorical variables. It is widely used in data analysis, particularly in fields such as social sciences, marketing, and biology, to examine relationships between categorical data. In this article, we will discuss the chi-square test, its applications, and how to perform it using Python.
Understanding the Chi-Square Test
The chi-square test is a non-parametric test that compares the observed frequencies of categorical data with the expected frequencies. The test is based on the chi-square statistic, which is calculated by summing the squared difference between the observed and expected frequencies, divided by the expected frequency, for each category.
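In symbols (a compact restatement of the definition above), with O the observed frequency and E the expected frequency for each category:
chi2 = sum( (O - E)^2 / E )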
The chi-square test is used to test the null hypothesis that there is no significant association between the two variables. If the calculated chi-square value is greater than the critical value, we can reject the null hypothesis and conclude that there is a significant association between the variables.
There are two types of chi-square tests: the chi-square goodness of fit test and the chi-square test of independence. The goodness of fit test is used to test whether the observed data follows a particular distribution, while the test of independence is used to test whether there is a significant association between two categorical variables.
Applications of the Chi-Square Test
The chi-square test is widely used in research and data analysis, with a range of applications across various fields. Some common applications include:
Market research: To determine if there is a significant association between demographic factors and consumer behavior, such as age, gender, and income level.
Biology: To test whether different species of plants or animals are distributed randomly or in patterns in their environment.
Social sciences: To test whether there is a significant relationship between socio-economic status and educational attainment.
Quality control: To test whether a sample of products is defective, based on the number of products that pass or fail inspection.
Performing the Chi-Square Test in Python
Python has several libraries that can be used to perform the chi-square test, including SciPy, Pandas, and StatsModels. Here is an example of how to perform the chi-square test of independence using the chi2_contingency function in the SciPy library:
import scipy.stats as stats
import pandas as pd
# Load data into a Pandas DataFrame
data = pd.read_csv('my_data.csv')
# Create a contingency table
contingency_table = pd.crosstab(data['variable_1'], data['variable_2'])
# Perform the chi-square test of independence
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
# Print the results
print('Chi-square statistic:', chi2)
print('P-value:', p)
In this example, we load data from a CSV file into a Pandas DataFrame, create a contingency table using the crosstab function, and then use the chi2_contingency function to perform the chi-square test of independence. The function returns the chi-square statistic, the p-value, the degrees of freedom, and the expected frequencies.
Conclusion
The chi-square test is a valuable statistical tool for examining the relationship between two categorical variables. By performing the test, we can determine whether there is a significant association between the variables and draw conclusions about the data. With the help of Python and its many data analysis libraries, we can easily perform the chi-square test and gain valuable insights from our data.
Comments welcome!
Data Science
· 2020-12-05
-
A Premier on ANOVA
ANOVA (Analysis of Variance) is a statistical method used to analyze and test the differences between the means of three or more groups. ANOVA compares the variation within groups to the variation between groups to determine whether the differences in means are statistically significant or just due to random chance.
The basic idea behind ANOVA is that if the variation between groups is significantly greater than the variation within groups, then there is evidence to suggest that the means of the groups are different. ANOVA allows us to test the null hypothesis that all of the group means are equal against the alternative hypothesis that at least one group mean is different from the others.
ANOVA is used in a wide range of applications, including biology, social sciences, economics, and engineering. It is often used in experimental research to test the effects of different treatments or interventions on a particular outcome.
There are several types of ANOVA, including one-way ANOVA, which compares the means of three or more groups that are unrelated, and repeated measures ANOVA, which compares the means of three or more groups that are related (i.e., the same group is measured under different conditions). ANOVA can be performed using software such as R, Python, or SPSS. In this article, we will be using Python.
Assumptions of ANOVA
ANOVA (Analysis of Variance) has several assumptions that should be met to ensure the validity and reliability of the test. The main assumptions of ANOVA are:
Normality: The dependent variable should be normally distributed in each group. One way to check this is by examining the distribution of the residuals (the differences between the observed values and the predicted values) for each group.
Homogeneity of variances: The variances of the dependent variable should be equal in each group. This can be checked by examining the variance of the residuals for each group.
Independence: The observations should be independent of each other. This means that there should be no systematic relationship between the observations in one group and the observations in another group.
Random Sampling: The observations should be randomly sampled from each group in the population.
If these assumptions are not met, the results of the ANOVA may not be reliable. In addition, violating these assumptions can lead to a higher probability of type I errors (rejecting the null hypothesis when it is actually true) or type II errors (failing to reject the null hypothesis when it is actually false).
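A quick sketch for checking the normality and equal-variance assumptions with scipy; the three groups below are made-up data for illustration:
import scipy.stats as stats
# three hypothetical groups of observations
group1 = [4.1, 5.2, 5.0, 6.1, 4.5]
group2 = [5.1, 4.8, 4.2, 4.4, 3.6]
group3 = [4.3, 4.7, 6.3, 5.1, 5.5]
# Shapiro-Wilk test of normality for each group (p > 0.05 suggests normality is plausible)
for i, g in enumerate([group1, group2, group3], start=1):
    stat, p = stats.shapiro(g)
    print(f"Group {i} Shapiro-Wilk p-value: {p:.3f}")
# Levene's test for homogeneity of variances across the groups
stat, p = stats.levene(group1, group2, group3)
print(f"Levene's test p-value: {p:.3f}")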
Types of ANOVA tests
One-way ANOVA: This test is used to compare the means of more than two independent groups.
Two-way ANOVA: This test is used to compare the means of two or more independent groups while controlling for one or more other variables.
One-way ANOVA
One-way ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups. It is used to determine whether there are significant differences between the means of the groups based on the variability within each group and the variability between groups. In this article, we will walk through how to perform a one-way ANOVA test using Python.
Performing a one-way ANOVA test in Python:
To perform a one-way ANOVA test in Python, we can use the scipy.stats module. Here’s an example code snippet:
import scipy.stats as stats
import pandas as pd
# Create data
group1 = [1, 2, 3, 4, 5]
group2 = [6, 7, 8, 9, 10]
group3 = [11, 12, 13, 14, 15]
# Combine data into a pandas dataframe
data = pd.DataFrame({'Group1': group1, 'Group2': group2, 'Group3': group3})
# Perform one-way ANOVA test
fvalue, pvalue = stats.f_oneway(data['Group1'], data['Group2'], data['Group3'])
# Print results
print('F-value:', fvalue)
print('P-value:', pvalue)
In this example, we create three groups of data (group1, group2, and group3) and combine them into a pandas dataframe. We then use the f_oneway() function from the scipy.stats module to perform the one-way ANOVA test on the three groups. The output of the test includes the F-value and the p-value.
Interpreting the results:
The F-value is a measure of the variance between the groups compared to the variance within the groups. A higher F-value indicates that there is more variability between the groups and less variability within the groups. The p-value is a measure of the statistical significance of the F-value. A p-value less than 0.05 indicates that there is a statistically significant difference between the means of the groups.
In the example above, the F-value is 50 and the p-value is far below 0.05, which suggests that there is a statistically significant difference between the means of the three groups.
Two-way ANOVA
Two-way ANOVA is a statistical test used to determine the difference in the means of two or more groups. It involves testing the effects of two different factors on a response variable. In this article, we will go over how to perform two-way ANOVA in Python using the statsmodels package.
To illustrate two-way ANOVA in Python, we will use a dataset called ‘PlantGrowth’. It is a dataset of 30 plants, each receiving one of three different treatments (control, trt1, and trt2) and measuring their weight after a set period. We are interested in testing the effects of the treatments and the type of seed on the weight of the plants.
[{'weight': '4.17', 'group': 'ctrl', 'plant': 'plant_1'},
{'weight': '5.58', 'group': 'ctrl', 'plant': 'plant_2'},
{'weight': '5.18', 'group': 'ctrl', 'plant': 'plant_3'},
{'weight': '6.11', 'group': 'ctrl', 'plant': 'plant_4'},
{'weight': '4.50', 'group': 'ctrl', 'plant': 'plant_5'},
{'weight': '4.61', 'group': 'ctrl', 'plant': 'plant_6'},
{'weight': '5.17', 'group': 'ctrl', 'plant': 'plant_7'},
{'weight': '4.53', 'group': 'ctrl', 'plant': 'plant_8'},
{'weight': '5.33', 'group': 'ctrl', 'plant': 'plant_9'},
{'weight': '5.14', 'group': 'trt1', 'plant': 'plant_10'},
{'weight': '4.81', 'group': 'trt1', 'plant': 'plant_11'},
{'weight': '4.17', 'group': 'trt1', 'plant': 'plant_12'},
{'weight': '4.41', 'group': 'trt1', 'plant': 'plant_13'},
{'weight': '3.59', 'group': 'trt1', 'plant': 'plant_14'},
{'weight': '5.87', 'group': 'trt1', 'plant': 'plant_15'},
{'weight': '3.83', 'group': 'trt1', 'plant': 'plant_16'},
{'weight': '6.03', 'group': 'trt1', 'plant': 'plant_17'},
{'weight': '4.89', 'group': 'trt1', 'plant': 'plant_18'},
{'weight': '4.32', 'group': 'trt2', 'plant': 'plant_19'},
{'weight': '4.69', 'group': 'trt2', 'plant': 'plant_20'},
{'weight': '6.31', 'group': 'trt2', 'plant': 'plant_21'},
{'weight': '5.12', 'group': 'trt2', 'plant': 'plant_22'},
{'weight': '5.54', 'group': 'trt2', 'plant': 'plant_23'},
{'weight': '5.50', 'group': 'trt2', 'plant': 'plant_24'},
{'weight': '5.37', 'group': 'trt2', 'plant': 'plant_25'},
{'weight': '5.29', 'group': 'trt2', 'plant': 'plant_26'},
{'weight': '4.92', 'group': 'trt2', 'plant': 'plant_27'}]
Here’s how to perform a two-way ANOVA in Python:
Step 1: Load the required libraries and dataset
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
data = pd.read_csv('PlantGrowth.csv')
Step 2: Create a model formula and fit the model
model = ols('weight ~ C(treatment) + C(seed) + C(treatment):C(seed)', data).fit()
Here, ‘weight’ is the dependent variable, and ‘treatment’ and ‘seed’ are the two independent variables (factors). Note that the sample rows shown above only contain a ‘group’ column, so this formula assumes the CSV also carries ‘treatment’ and ‘seed’ columns; adapt the column names to the factors that actually exist in your data.
Step 3: Perform the two-way ANOVA using anova_lm()
anova_results = anova_lm(model, typ=2)
print(anova_results)
The typ parameter specifies the type of sum of squares to use. Here, we use type 2 sum of squares.
The anova_lm() function returns a table with the results of the ANOVA. The table includes the sum of squares, degrees of freedom, F-value, and p-value for each main effect and interaction effect.
Step 4: Interpret the results
The ANOVA table shows that both the main effects of ‘treatment’ and ‘seed’ are statistically significant, as well as the interaction effect between ‘treatment’ and ‘seed’. This suggests that both the type of treatment and the type of seed have a significant effect on the weight of the plants, and that the effect of the treatment depends on the type of seed.
In conclusion, performing a two-way ANOVA in Python is straightforward using the statsmodels package. It is important to ensure that the assumptions of the ANOVA are met before interpreting the results.
Finally, to close, ANOVA is a powerful statistical technique that can be used to compare the means of two or more groups. Whether you are testing the effectiveness of different treatments, analyzing the impact of a categorical variable, or trying to determine if there are significant differences between groups, ANOVA can help you identify these differences and draw meaningful conclusions. By using Python and its many data analysis libraries, you can easily perform ANOVA and other statistical tests on your data and gain valuable insights that can inform your decisions and actions. With the right approach and tools, ANOVA can be a valuable addition to your statistical toolbox.
Comments welcome!
Data Science
· 2020-11-07
-
A Premier on T-tests
T-tests are a class of statistical tests used to determine whether there is a significant difference between the means of two groups of data. T-tests are often used to compare the means of a sample to the population mean, or to compare the means of two independent samples or two paired samples.
Following are the most common types of t-tests, which we will cover below:
One-sample t-test: This test is used to compare the mean of a single sample to a known or hypothesized population mean.
Independent samples t-test: This test is used to compare the means of two independent groups.
Paired samples t-test: This test is used to compare the means of two dependent (paired) groups.
T-tests have several assumptions that need to be met in order for the test to be valid. The most important assumptions are:
Normality: The data should follow a normal distribution. This means that the sample means should be normally distributed.
Independence: The samples should be independent of each other. This means that the observations in one sample should not be related to the observations in the other sample.
Homogeneity of variances: The variances of the two samples should be approximately equal. This means that the spread of the data should be similar in both groups.
If these assumptions are not met, the results of the t-test may be invalid or misleading. There are also different types of t-tests that make different assumptions. For example, the paired samples t-test assumes that the differences between paired observations are normally distributed, while the independent samples t-test assumes that the two samples have equal variances. It’s important to carefully consider the assumptions of the test and to use caution when interpreting the results.
How to perform T-tests in Python
One-sample t-test
A one-sample t-test is used to compare the mean of a single sample to a known or hypothesized population mean. This test is useful for determining whether a sample differs significantly from the population mean.
To perform a one-sample t-test in Python, you can use the scipy.stats.ttest_1samp function. Here’s an example:
import numpy as np
from scipy.stats import ttest_1samp
# Generate a sample of data
data = np.random.normal(loc=10, scale=2, size=100)
# Set the hypothesized population mean
pop_mean = 9
# Perform the one-sample t-test
t_stat, p_val = ttest_1samp(data, pop_mean)
# Print the results
print("t-statistic: {:.3f}".format(t_stat))
print("p-value: {:.3f}".format(p_val))
In this example, we first generate a sample of data using the numpy.random.normal function, which generates a sample of data from a normal distribution with the specified mean (loc) and standard deviation (scale). We then set the hypothesized population mean to 9.
We then perform the one-sample t-test using the ttest_1samp function, which takes two arguments: the sample data and the hypothesized population mean. The function returns two values: the t-statistic and the p-value.
Finally, we print the results using the print function, formatting the t-statistic and p-value to three decimal places.
If the p-value is less than the significance level (usually 0.05), we can reject the null hypothesis and conclude that the sample mean differs significantly from the population mean. Otherwise, we fail to reject the null hypothesis and conclude that there is not enough evidence to suggest a significant difference between the sample mean and the population mean.
Independent samples t-test
An independent samples t-test is used to compare the means of two independent groups to determine if they are significantly different. This test is used when the two groups being compared are completely independent of each other.
To perform an independent samples t-test in Python, we can use the scipy.stats.ttest_ind function from the SciPy library. Here’s an example:
import numpy as np
from scipy.stats import ttest_ind
# Generate two independent samples of data
sample1 = np.random.normal(loc=10, scale=2, size=100)
sample2 = np.random.normal(loc=12, scale=2, size=100)
# Perform the independent samples t-test
t_stat, p_val = ttest_ind(sample1, sample2)
# Print the results
print("t-statistic: {:.3f}".format(t_stat))
print("p-value: {:.3f}".format(p_val))
In this example, we first generate two independent samples of data using the numpy.random.normal function. We then perform the independent samples t-test using the ttest_ind function, which takes two arguments: the two samples being compared. The function returns two values: the t-statistic and the p-value.
Finally, we print the results using the print function, formatting the t-statistic and p-value to three decimal places.
If the p-value is less than the significance level (usually 0.05), we can reject the null hypothesis and conclude that the means of the two groups are significantly different. Otherwise, we fail to reject the null hypothesis and conclude that there is not enough evidence to suggest a significant difference between the means of the two groups.
Paired samples t-test
A paired samples t-test is a statistical test used to determine whether there is a statistically significant difference between the means of two related groups. In other words, it helps us determine whether the two groups are significantly different from each other or not.
To perform a paired samples t-test in Python, we can use the scipy.stats module, which contains a variety of statistical functions including the ttest_rel() function. This function computes the t-test for two related samples of scores.
Here is an example code snippet for performing a paired samples t-test in Python:
import numpy as np
from scipy.stats import ttest_rel
# Create two related random samples of data
before = np.random.normal(5, 1, 100)
after = before + np.random.normal(1, 0.5, 100)
# Compute the t-test
t_stat, p_val = ttest_rel(before, after)
# Print the results
print("t-statistic: {}".format(t_stat))
print("p-value: {}".format(p_val))
In this example, we first create two related random samples of data using the numpy.random.normal() function. We create the second sample by adding some random noise to the first sample. We then compute the paired samples t-test for these two samples using the ttest_rel() function. The function returns two values: the t-statistic and the p-value.
Finally, we print the results of the test using the print() function. If the p-value is less than the significance level (usually 0.05), we can reject the null hypothesis and conclude that the means of the two related groups are significantly different. Otherwise, we fail to reject the null hypothesis and conclude that there is not enough evidence to suggest a significant difference between the means of the two related groups.
It’s important to note that a paired samples t-test assumes that the differences between the pairs of observations are normally distributed. If this assumption is not met, other tests or transformations may be needed. Additionally, like any statistical test, it’s important to carefully consider the context and limitations of the test and to avoid drawing causal conclusions from statistical associations alone.
To close, T-tests are useful because they provide a simple and easy-to-interpret method for comparing two groups of data. They are widely used in a variety of fields including psychology, medicine, education, and more. However, it’s important to note that t-tests have certain assumptions, such as normality of the data and equal variances, which need to be met for the test to be valid. It’s also important to use caution when interpreting t-test results and to consider the context and limitations of the test.
Comments welcome!
Data Science
· 2020-10-03
-
Statistical Hypothesis Testing
Hypothesis testing is a statistical method used to determine whether a hypothesis about a population parameter is supported by the data. It is a powerful tool for making decisions based on data, and is widely used in many fields including medicine, social sciences, and business.
The basic steps in hypothesis testing are as follows (a minimal end-to-end sketch in Python follows the list):
Formulate the null and alternative hypotheses: The null hypothesis is the statement that the population parameter is equal to a specified value, while the alternative hypothesis is the statement that the population parameter is not equal to the specified value. For example, if you want to test whether the mean height of a population is 65 inches, the null hypothesis would be “the mean height is equal to 65 inches” and the alternative hypothesis would be “the mean height is not equal to 65 inches.”
Choose a level of significance: The level of significance is the probability of rejecting the null hypothesis when it is actually true. Commonly used levels of significance are 0.05 (5%) and 0.01 (1%).
Collect data and calculate test statistic: Next, you need to collect a sample of data and calculate a test statistic, which is a measure of how far the sample data is from what is expected under the null hypothesis. The test statistic will depend on the type of test being used, such as t-test or chi-squared test.
Determine the p-value: The p-value is the probability of obtaining a test statistic as extreme or more extreme than the observed test statistic, assuming the null hypothesis is true. If the p-value is less than the chosen level of significance, then the null hypothesis is rejected and the alternative hypothesis is supported.
Interpret the results: Finally, the results of the hypothesis test need to be interpreted in the context of the problem being studied. If the null hypothesis is rejected, it may be concluded that there is evidence to support the alternative hypothesis. However, if the null hypothesis is not rejected, it cannot be concluded that the null hypothesis is true, only that there is not enough evidence to reject it.
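A minimal sketch tying these steps together, using a one-sample t-test on simulated data (the sample and the 65-inch null value are made up for illustration):
import numpy as np
from scipy.stats import ttest_1samp
# Step 1: H0: the population mean height is 65 inches; H1: it is not 65 inches
# Step 2: choose the significance level
alpha = 0.05
# Step 3: collect a sample (simulated here) and compute the test statistic
rng = np.random.default_rng(42)
sample = rng.normal(loc=66, scale=3, size=40)
t_stat, p_value = ttest_1samp(sample, popmean=65)
# Steps 4 and 5: compare the p-value to alpha and interpret
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject the null hypothesis: the mean appears to differ from 65 inches.")
else:
    print("Fail to reject the null hypothesis: not enough evidence of a difference.")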
Hypothesis testing is a powerful tool for making decisions based on data, but it is important to use it correctly and to interpret the results carefully. When conducting a hypothesis test, it is important to ensure that the assumptions of the test are met, and to choose the appropriate test based on the type of data being analyzed. By following the steps outlined above and taking care to interpret the results correctly, hypothesis testing can be a valuable tool for making evidence-based decisions.
There are many different types of hypothesis tests, each suited to different types of data and research questions. Here are a few of the most common types:
One-sample t-test: This test is used to compare the mean of a single sample to a known or hypothesized population mean.
Independent samples t-test: This test is used to compare the means of two independent groups.
Paired samples t-test: This test is used to compare the means of two dependent (paired) groups.
One-way ANOVA: This test is used to compare the means of more than two independent groups.
Two-way ANOVA: This test is used to compare the means of two or more independent groups while controlling for one or more other variables.
Chi-squared test: This test is used to compare the frequencies of categorical data between two or more groups.
Mann-Whitney U test: This non-parametric test is used to compare the medians of two independent groups when the data are not normally distributed.
Kruskal-Wallis test: This non-parametric test is used to compare the medians of more than two independent groups when the data are not normally distributed.
Wilcoxon signed-rank test: This non-parametric test is used to compare the medians of two dependent groups when the data are not normally distributed.
Friedman test: This non-parametric test is used to compare the medians of more than two dependent groups when the data are not normally distributed.
These are just a few examples of the many types of hypothesis tests that are used in statistical analysis. Choosing the right test for a given research question depends on the type of data being analyzed and the specific hypotheses being tested.
Comments welcome!
Data Science
· 2020-09-05
-
Important GCP Services that you need to Know Now
Introduction
Google Cloud Platform (GCP) is a cloud computing platform offered by Google. GCP provides a comprehensive set of tools and services for building, deploying, and managing cloud applications. It includes services for compute, storage, networking, machine learning, analytics, and more. Some of the most commonly used GCP services include Compute Engine, Cloud Storage, BigQuery, and Kubernetes Engine.
GCP is known for its powerful data analytics and machine learning capabilities. It offers a range of machine learning services that allow users to build, train, and deploy machine learning models at scale. GCP also provides powerful data analytics tools, including BigQuery, which allows users to analyze massive datasets quickly and easily.
GCP is a popular choice for businesses of all sizes, from small startups to large enterprises. It offers flexible pricing options, with pay-as-you-go and monthly subscription plans available. Additionally, GCP offers a range of tools and services to help businesses optimize their cloud costs, including cost management tools and usage analytics.
Some of the most commonly used GCP services are:
Google Compute Engine (GCE) - a virtual machine service for running applications on the cloud.
Google Kubernetes Engine (GKE) - a managed Kubernetes service for container orchestration.
Google Cloud Storage (GCS) - a scalable object storage service for unstructured data.
Google Cloud Bigtable - a NoSQL database service for large, mission-critical applications.
Google Cloud SQL - a fully managed relational database service.
Google Cloud Datastore - a NoSQL document database service for web and mobile applications.
Google Cloud Pub/Sub - a messaging service for real-time data delivery and streaming.
Google Cloud Dataproc - a fully managed cloud service for running Apache Hadoop and Apache Spark workloads.
Google Cloud ML Engine - a managed service for training and deploying machine learning models.
Google Cloud Vision API - an image analysis API that can identify objects, faces, and other visual content.
Google Cloud Speech-to-Text - a speech recognition service that transcribes audio files to text.
Google Cloud Text-to-Speech - a text-to-speech conversion service that creates natural-sounding speech from text input.
How to access GCP services
You can use the Cloud Client Libraries or call the Cloud APIs directly. To use the Cloud Client Libraries, you first need to authenticate your application. You can do this by creating a service account, downloading a JSON file containing your credentials, and setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of that file. Once authenticated, you can import the relevant client library and start using GCP services.
Alternatively, you can call the Cloud APIs directly by making REST requests. To do so, you need to authenticate and authorize your application by creating a service account and generating a private key, which you use to sign your requests via OAuth 2.0. Once authenticated, you can call the relevant API endpoints over HTTP.
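As a small illustration of the client-library route, the sketch below lists the Cloud Storage buckets in a project; it assumes the google-cloud-storage package is installed and that GOOGLE_APPLICATION_CREDENTIALS points at a valid service-account key:
from google.cloud import storage
# the client picks up credentials from GOOGLE_APPLICATION_CREDENTIALS automatically
client = storage.Client()
# list the buckets the service account can see
for bucket in client.list_buckets():
    print(bucket.name)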
Comments welcome!
Data Science
· 2020-09-03
-
Important Azure Services that you need to Know Now
Introduction
Azure is a cloud computing platform and set of services offered by Microsoft. It provides a wide range of services such as virtual machines, databases, storage, and networking, among others, that users can access and use to build, deploy, and manage their applications and services. Azure also offers a variety of tools and services to help users with tasks such as data analytics, artificial intelligence, and machine learning. Azure provides a pay-as-you-go pricing model, allowing users to only pay for the services they use.
Key Services
Azure Virtual Machines: a cloud computing service that allows users to create and manage virtual machines in the cloud.
Azure App Service: a platform as a service (PaaS) offering that allows developers to build, deploy, and scale web and mobile apps.
Azure Functions: a serverless computing service that allows developers to run small pieces of code (functions) in the cloud.
Azure Blob Storage: a cloud storage service that allows users to store and access large amounts of unstructured data.
Azure SQL Database: a fully managed relational database service that allows users to build, deploy, and manage applications with a variety of languages and frameworks.
Azure Active Directory: a cloud-based identity and access management service that provides secure access and single sign-on to various cloud applications.
Azure Cosmos DB: a globally distributed, multi-model database service that allows users to manage and store large volumes of data with low latency and high availability.
Azure Machine Learning: a cloud-based machine learning service that allows users to build, train, and deploy machine learning models at scale.
Azure DevOps: a set of services that provides development teams with everything they need to plan, build, test, and deploy applications.
Azure Kubernetes Service: a fully managed Kubernetes container orchestration service that allows users to deploy and manage containerized applications at scale.
How to access the services
Azure Portal: The Azure Portal is a web-based user interface that provides access to Azure services. Users can log in and manage their resources in the Azure Portal.
Azure CLI: The Azure Command-Line Interface (CLI) is a cross-platform command-line tool that allows you to manage Azure resources.
Azure PowerShell: Azure PowerShell is a command-line tool that allows users to manage Azure resources using Windows PowerShell.
Azure SDKs: Azure provides Software Development Kits (SDKs) for various programming languages, such as .NET, Java, Python, Ruby, and Node.js. These SDKs provide libraries and tools for interacting with Azure services (see the Python sketch after this list).
REST APIs: Azure services can be accessed using REST APIs. Developers can use any programming language that supports HTTP/HTTPS to interact with Azure services.
Azure Functions: Azure Functions is a serverless compute service that allows you to run code on demand. You can use Azure Functions to access Azure services.
Azure Logic Apps: Azure Logic Apps is a cloud-based service that allows you to create workflows that integrate with various Azure services.
Azure DevOps: Azure DevOps is a set of development tools that includes features such as source control, continuous integration, and continuous delivery. Developers can use Azure DevOps to manage and deploy their applications to Azure services.
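As an illustration of the SDK route, here is a minimal Python sketch using the azure-identity and azure-storage-blob packages; the storage account URL is a placeholder, not a value from this article.
# Minimal sketch, assuming azure-identity and azure-storage-blob are installed
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient
credential = DefaultAzureCredential()  # tries environment variables, managed identity, Azure CLI login, etc.
service = BlobServiceClient(account_url="https://<your-storage-account>.blob.core.windows.net",  # placeholder account URL
                            credential=credential)
for container in service.list_containers():  # list the blob containers in the storage account
    print(container.name)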
Comments welcome!
Data Science
· 2020-08-06
-
Statistical Distributions
In this article we will cover some distributions that I have found useful while analysing data. I have split them based on whether they are for a continuous or a discrete random variable. For each distribution I first give a small theoretical introduction and its probability density function, and then show how to use Python to represent it graphically.
Continuous Distributions:
Uniform distribution
Normal Distribution, also known as Gaussian distribution
Standard Normal Distribution - case of normal distribution where loc or mean = 0 and scale or sd = 1
Gamma distribution - exponential, chi-squared, erlang distributions are special cases of the gamma distribution
Erlang distribution - special form of Gamma distribution when a is an integer
Exponential distribution - special form of Gamma distribution with a=1
Lognormal - not covered
Chi-Squared - not covered
Weibull - not covered
t Distribution - not covered
F Distribution - not covered
Discrete Distributions:
Poisson distribution is a limiting case of a binomial distribution under the following conditions: n tends to infinity, p tends to zero and np is finite
Binomial Distribution
Negative Binomial - not covered
Bernoulli Distribution is a special case of the binomial distribution where a single trial is conducted (n=1)
Geometric - not covered
Let’s import some basic libraries that we will be using:
import numpy as np
import pandas as pd
import scipy.stats as spss
import plotly.express as px
import seaborn as sns
Continuous Distributions
Uniform distribution
As the name suggests, in a uniform distribution the probability of all outcomes is the same. The shape of this distribution is a rectangle. Now, let’s plot this using Python. First we will generate an array of random variables using scipy. We will specifically use the scipy.stats.uniform.rvs function with the following three inputs:
size specifies number of random variates
loc corresponds to the lower bound of the distribution
scale corresponds to the width of the distribution, so values are drawn from [loc, loc + scale]
rv_array = spss.uniform.rvs(size=10000, loc = 10, scale=20)
Now we can plot this using the plotly library or the seaborn library. In fact seaborn has a couple of different functions, namely distplot and histplot, both of which can be used to visually inspect the uniform data. Let’s see the examples one by one:
We can directly plot the data from the array:
px.histogram(rv_array) # plotted using plotly express
sns.histplot(rv_array, kde=True) # plotted using seaborn
Or we can convert the array into a dataframe and then plot the dataframe:
rv_df = pd.DataFrame(rv_array, columns=['value_of_random_variable'])
px.histogram(rv_df, x='value_of_random_variable', nbins=20) # plotted using plotly express
sns.histplot(data=rv_df, x='value_of_random_variable', kde=True) # plotted using seaborn
Normal Distribution, also known as Gaussian distribution:
The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena.
The normal distribution is a limiting case of the Poisson distribution as the parameter lambda tends to infinity; it also arises as a limiting case of the binomial distribution when the number of trials is large.
This distribution has a bell-shaped density curve described by its mean and standard deviation. The mean represents the location and the sd represents the spread of the distribution. The curve shows that data near the mean occur more frequently than data far from the mean.
Let’s plot it using seaborn:
rv_array = spss.norm.rvs(size=10000,loc=10,scale=100) # size specifies number of random variates, loc corresponds to mean, scale corresponds to standard deviation
sns.histplot(rv_array, kde=True)
We can add x and y labels, change the number of bins, the color of bars, etc. With distplot (deprecated in newer versions of seaborn in favor of histplot/displot) we can supply additional arguments for adjusting the width of bars, transparency, etc.
ax = sns.distplot(rv_array, bins=100, kde=True, color='cornflowerblue', hist_kws={"linewidth": 15,'alpha':1})
ax.set(xlabel='Normal Distribution', ylabel='Frequency')
Standard Normal Distribution
This is a special case of the normal distribution where mean = 0 and sd = 1
Let’s plot it using seaborn:
rv_array = spss.norm.rvs(size=10000,loc=0,scale=1)
sns.histplot(rv_array, kde=True)
Gamma distribution is a two-parameter family of continuous probability distributions
The exponential, chi-squared, and Erlang distributions are special cases of the gamma distribution
Let’s plot it using seaborn:
rv_array = spss.gamma.rvs(a=5, size=10000) # size specifies number of random variates, a is the shape parameter
sns.distplot(rv_array, kde=True)
Erlang distribution
Special case of Gamma distribution when a is an integer.
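Although the original write-up does not plot this one, a quick sketch using the same spss and sns aliases imported above could look like this (a=5 is an arbitrary integer shape chosen for illustration):
rv_array = spss.erlang.rvs(a=5, scale=1, size=10000) # a is the integer shape parameter, scale is the scale parameter
sns.histplot(rv_array, kde=True)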
Exponential distribution
Special case of Gamma distribution with a=1.
Exponential distribution describes the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate.
Let’s plot it using seaborn:
rv_array = spss.expon.rvs(scale=1,loc=0,size=1000) # size specifies number of random variates, loc shifts the distribution, scale is 1/lambda (which equals the mean)
sns.distplot(rv_array, kde=True)
Discrete Distributions
Binomial Distribution
Distribution where only two outcomes are possible, such as success or failure, gain or loss, win or lose, and the probability of success is the same for all trials. The two outcomes need not be equally likely, and each trial is independent of the others.
The probability of observing k successes in n trials is given by the equation: f(k;n,p) = nCk * (p^k) * ((1-p)^(n-k))
Where, nCk = (n)! / ((k)! * (n-k)!)
n=total number of trials
p=probability of success in each trial
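As a quick numerical check of this formula (using the spss alias from the imports above), we can evaluate the probability of, say, k=8 successes in n=10 trials with p=0.8:
p_8 = spss.binom.pmf(k=8, n=10, p=0.8) # nCk * p^k * (1-p)^(n-k)
print(round(p_8, 4)) # approximately 0.302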
Let’s plot it using seaborn:
rv_array = spss.binom.rvs(n=10,p=0.8,size=10000) # n = number of trials, p = probability of success, size = number of times to repeat the trials
sns.distplot(rv_array, kde=True)
Poisson Distribution
Poisson random variable is typically used to model the number of times an event happened in a time interval. For example, the number of users registered for a web service in an interval can be thought of as a Poisson process. The Poisson distribution is described in terms of the rate (μ) at which the events happen; the average number of events in an interval is designated λ (lambda), which is the event rate multiplied by the length of the interval and is also called the rate parameter (the scipy functions below call it mu).
The probability of observing k events in an interval is given by the equation: P(k events in interval) = e^(-lambda) * (lambda^k / k!)
Poisson distribution is a limiting case of a binomial distribution under the following conditions:
The number of trials is indefinitely large or n tends to infinity
The probability of success for each trial is same and indefinitely small or p tends to zero
np = lambda, is finite.
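As a quick numerical example of the formula above (again using the spss alias), the probability of observing k=2 events when the event rate lambda is 3:
p_2 = spss.poisson.pmf(k=2, mu=3) # e^(-lambda) * lambda^k / k!
print(round(p_2, 4)) # approximately 0.224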
Let’s plot it using seaborn:
rv_array = spss.poisson.rvs(mu=3, size=10000) # mu is the rate parameter (lambda), size specifies number of random variates
sns.distplot(rv_array, kde=True)
Bernoulli distribution
This distribution has only two possible outcomes, 1 (success) and 0 (failure), and a single trial, for example, a coin toss. The random variable X which has a Bernoulli distribution can take value 1 with the probability of success, p, and the value 0 with the probability of failure, q or 1-p. The probabilities of success and failure need not be equal.
Probability mass function of Bernoulli distribution: f(k;p) = (p^k) * ((1-p)^(1-k))
Bernoulli distribution is a special case of the binomial distribution where a single trial is conducted (n=1)
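A one-line check of the probability mass function with the spss alias, using the same p=0.6 that the plot below uses:
print(spss.bernoulli.pmf(k=1, p=0.6), spss.bernoulli.pmf(k=0, p=0.6)) # 0.6 and 0.4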
Let’s plot it using seaborn:
rv_array = spss.bernoulli.rvs(size=10000,p=0.6) # p = probability of success, size = number of times to repeat the trial
sns.distplot(rv_array, kde=True)
Hope you found this summary of distributions useful. I refer to this from time to time to jog my memory on the various distributions.
Comments welcome!
Data Science
· 2020-08-01
-
Visualize data using SAS
This is the third of a series of articles that I will write to give a gentle introduction to statistics. In this article we will cover how we can visualize data using various charts and how to read them. I will show how to create these charts using SAS and will include code snippets as well. For a full version of the code visit my GitHub repository.
SAS has an in-built procedure called sgplot that allows you to create several kinds of plots. Also available is proc univariate which allows you to create histograms and normal probability plots, also known as the QQ plots. In this article we will work with the tips dataset that we also used for our Python demonstration.
Before we start plotting, we need to import the dataset. In SAS we do this using proc import.
proc import datafile='/home/u50248307/data/tips.csv'
out=tips
dbms=csv
replace;
getnames=yes;
run;
Once we have imported the dataset, we can view it using the proc print statement.
proc print data=tips;
run;
Let’s take a quick look at how the tips dataset is structured:
We can further see some summary information on the dataset using proc contents statement.
proc contents data=tips;
run;
You will notice that I am ending all lines with a semicolon. Unlike Python, SAS does not depend on indentation to show the scope of statements. Therefore we use the semicolon, just like in C++, to signify the end of a statement.
Now let’s move to visualizing this data. We will cover the following charts in this article:
Dot plot shows changes between two (or more) points in time or between two (or more) conditions.
proc sgplot data=tips;
title 'Mean of Total bill by Day';
dot day / response=total_bill stat=mean;
xaxis label='Mean of Total Bill';
yaxis label='Day';
run;
proc sgplot data=tips;
title 'Mean of Total bill by Day by Gender';
dot day / response=total_bill group=sex stat=mean;
xaxis label='Mean of Total Bill';
yaxis label='Day';
run;
Bar (horizontal and vertical) chart is used when you want to show a distribution of data points or perform a comparison of metric values across different subgroups of your data.
* horizontal bar chart;
proc sgplot data=tips;
title 'Mean Total bill by Day';
/*hbar day;*/ /*if no other option is specified then it just shows row frequency by cat variable*/
hbar day / response=total_bill stat=mean;
/*hbar day / response=tip stat=mean y2axis;*/
run;
* vertical bar chart;
proc sgplot data=tips;
title 'Mean Total bill by Day';
vbar day / response=total_bill stat=mean;
run;
* XAXISTABLE and YAXISTABLE statements create axis tables which display data values at specific locations along an axis. The only required argument is a list of one or more variables to be displayed;
proc sgplot data=tips;
title 'Mean Total bill by Day: XAXISTABLE and YAXISTABLES Example';
vbar day / response=total_bill stat=mean;
xaxistable tip size / stat=mean position=top;
xaxistable var1 / stat=freq label="N"; /* the var1 variable was chosen arbitrarily in order to obtain frequency counts of the number of records in each category */
run;
Stacked Bar chart is useful when you want to show more than one categorical variable per bar
* Offset Dual Horizontal Bar Plot;
proc sgplot data=tips;
title 'Dual Bar Chart: Mean Total bill and Tip by Day';
hbar day / response=total_bill stat=mean barwidth=0.25 discreteoffset=-0.15;
hbar day / response=tip stat=mean barwidth=0.25 discreteoffset=0.15 y2axis;
run;
* Offset Dual Vertical Bar Plot;
proc sgplot data=tips;
title 'Dual Vertical Bar Chart: Mean Total bill and Tip by Day';
vbar day / response=total_bill stat=mean barwidth=0.25 discreteoffset=-0.15;
vbar day / response=tip stat=mean barwidth=0.25 discreteoffset=0.15 y2axis;
run;
* Stacked Bar Plot;
proc sgplot data=tips;
title 'Stacked Bar Chart with Data Labels';
vbar day / response=total_bill group=sex stat=mean datalabel datalabelattrs=(weight=bold);
xaxis display=(nolabel);
yaxis grid label='total_bill';
run;
Needle plot is similar to a bar plot and a scatter plot; it can be used to plot datasets that have too many data points for a bar plot to be meaningful.
proc sgplot data=tips;
title 'Needle Chart: Total bill by Meal size';
needle x=size y=total_bill ;
run;
proc sgplot data=tips;
title 'Needle Chart: Total bill by Day';
needle x=day y=tip ;
run;
Boxplot (horizontal and vertical) In a box plot, numerical data is divided into quartiles, and a box is drawn between the first and third quartiles, with an additional line drawn along the second quartile to mark the median. In some box plots, the minimums and maximums outside the first and third quartiles are depicted with lines, which are often called whiskers.
* Vertical Box plot;
proc sgplot data=tips;
title 'Vertical Box plot';
vbox total_bill / category=day boxwidth=0.25 discreteoffset=-0.15;
vbox tip / category=day boxwidth=0.25 discreteoffset=0.15 y2axis;
run;
* Horizontal Box plot;
proc sgplot data=tips;
title 'Horizontal Box plot';
hbox total_bill / category=day boxwidth=0.25 discreteoffset=-0.15;
hbox tip / category=day boxwidth=0.25 discreteoffset=0.15 y2axis;
run;
Histogram is a visual representation of the frequency distribution of your data. The frequencies are represented by bars.
proc sgplot data=tips;
title'Histogram using Proc Sgplot';
histogram total_bill;
run;
proc univariate data=tips;
title'Histogram using Proc Univariate';
histogram total_bill;
run;
Probability Plot is a way of visually comparing the data coming from different distributions. It can be of two types - pp plot or qq plot
pp plot (Probability-to-Probability) is a way to visualize how the cumulative distribution functions (CDFs) of two distributions (empirical and theoretical) compare against each other.
qq plot (Quantile-to-Quantile) is used to compare the quantiles of two distributions. The quantiles can be defined as continuous intervals with equal probabilities, or as points that divide the samples in the same way. The distributions may be theoretical or sample distributions from a process, etc.
Normal probability plot is a special case of the qq plot. It is a way of checking whether the dataset is normally distributed or not.
proc univariate data=tips;
title'Normal probability (QQ) plot using Proc Univariate';
probplot total_bill;
run;
Scatter plot shows the relationship between two numerical variables.
proc sgplot data=tips;
title "total bill vs tip by gender";
scatter x=total_bill y=tip / group=sex markerattrs=(symbol=Square size=10px);
/* SYMBOL: Circle, CircleFilled, Square, Star, Plus, X
SIZE: 0.2in, 3mm, 10pt, 5px, 25pct
COLOR: red, blue, lightscreen, aquamarine, CXFFFFFF */
refline 6 / axis=y lineattrs=(color=green thickness=3px pattern=ShortDashDot); /* REFLINE statement adds horizontal or vertical reference lines to a plot. Its unnamed required argument is a numeric variable, value, or list of values. A reference line will be added for each value listed or for each value of the variable specified. */
run;
* Scatter plot with attribute cycling - when multiple lists of attributes are specified on the STYLEATTRS statement (for example, a list of marker shapes and a list of marker colors);
proc sgplot data=tips;
title "total bill vs tip by gender";
styleattrs datasymbols=(SquareFilled CircleFilled) datacontrastcolors=(purple green);
scatter x=total_bill y=tip / group=sex markerattrs=(size=10px);
/* SYMBOL: Circle, CircleFilled, Square, Star, Plus, X
SIZE: 0.2in, 3mm, 10pt, 5px, 25pct
COLOR: red, blue, lightscreen, aquamarine, CXFFFFFF */
run;
/* SYMBOLCHAR statement is used to define a marker symbol from a Unicode value. */
proc sgplot data=tips;
title "Total bill vs Tip by Gender";
scatter x=total_bill y=tip / group=sex markerattrs=(size=40);
symbolchar name=female_sign char="2640"x; /* identifiers “female_sign” and “male_sign” are arbitrary names */
symbolchar name=male_sign char="2642"x;
styleattrs datasymbols=(female_sign male_sign);
run;
/* Using Data as a Symbol Marker */
proc sgplot data=tips;
title "Total bill vs Tip by Gender";
scatter x=total_bill y=tip / group=sex markerchar=sex markercharattrs=(weight=bold size=10pt);
run;
Line plot is used to visualize the value of something over time. VLINE statement is used to create a vertical line chart (which consists of horizontal lines). The endpoints of the line segments are statistics based on a categorical variable as opposed to raw data values.
LOCATION=Specifies whether legend will appear INSIDE or OUTSIDE (default) the axis area.
POSITION=Specifies the position of the legend: TOP, BOTTOM (default), LEFT, RIGHT, TOPLEFT, TOPRIGHT, BOTTOMLEFT, BOTTOMRIGHT
DOWN=Specifies number of rows in legend
ACROSS=Specifies number of columns in legend
TITLEATTRS=Specifies text attributes of legend title
VALUEATTRS=Specifies text attributes of legend values
* Basic line plot;
proc sgplot data=tips;
title 'Line chart showing Average total bill by Day';
vline day / response=total_bill stat=mean markers;
run;
* Line Chart with Dual Axes;
proc sgplot data=tips;
title 'Line Chart with Dual Axes';
vline day / response=total_bill stat=mean markers;
vline day / response=tip stat=mean markers y2axis;
run;
* Line Chart by group with Modifying Line Attributes and Legend;
proc sgplot data=tips;
title 'Line Chart by group with Modifying Line Attributes and Legend';
styleattrs datasymbols=(TriangleFilled CircleFilled) datalinepatterns=(ShortDash LongDash);
vline day / response=total_bill stat=mean markers group=sex lineattrs=(thickness=4px);
keylegend / location=inside position=topleft across=1 titleattrs=(weight=bold size=12pt) valueattrs=(color=green size=12pt);
run;
* XAXIS and YAXIS statements are used to control the features and structure of the X and Y axes, respectively;
proc sgplot data=tips;
title 'Line plot: XAXIS and YAXIS statements';
vline size / response=total_bill stat=mean;
vline size / response=tip stat=mean y2axis;
yaxis min=0 max=40 minor minorcount=9 valueattrs=(style=italic) label='Total Bill ($)';
y2axis offsetmin=0.1 offsetmax=0.1 labelattrs=(color=purple);
/* offsets are proportional to axis length, so between 0 and 1 */
run;
The best way to get better at visualization is through practice. What I have found useful is participating in a weekly visualization challenge called the TidyTuesday!
Comments welcome!
Data Science
· 2020-07-04
-
Visualize data using Python
This is the second of a series of articles that I will write to give a gentle introduction to statistics. In this article we will cover how we can visualize data using various charts and how to read them. I will show how to create these charts using Python and will include code snippets as well. For a full version of the code visit my GitHub repository.
Python has many libraries that allow creating visually appealing charts. In this article we will work with the in-built tips dataset and then plot using the following libraries:
import seaborn as sns
tips = sns.load_dataset("tips") # tips dataset can be loaded from seaborn
sns.get_dataset_names() # to get a list of other available datasets
import plotly.express as px
tips = px.data.tips() # tips dataset can be loaded from plotly
# data_canada = px.data.gapminder().query("country == 'Canada'")
import pandas as pd
tips.to_csv('/Users/vivekparashar/Downloads/tips.csv') # we can save the dataset into a csv and then load it into SAS or R for plotting
import altair as alt
import statsmodels.api as sm
Let’s take a quick look at how the tips dataset is structured:
We will cover the following charts in this article:
Dot plot shows changes between two (or more) points in time or between two (or more) conditions.
# Using plotly library
t = tips.groupby(['day','sex']).mean()[['total_bill']].reset_index()
px.scatter(t, x='day', y='total_bill', color='sex',
title='Average bill by gender by day',
labels={'day':'Day of the week', 'total_bill':'Average Bill in $'})
Bar (horizontal and vertical) chart is used when you want to show a distribution of data points or perform a comparison of metric values across different subgroups of your data.
# Using pandas plot
tips.groupby('sex').mean()['total_bill'].plot(kind='bar')
tips.groupby('sex').mean()['tip'].plot(kind='barh')
# Using plotly
t = tips.groupby(['day','sex']).mean()[['total_bill']].reset_index()
px.bar(t, x='day', y='total_bill') # Using plotly
px.bar(t, x='total_bill', y="day", orientation='h')
Stacked Bar chart is useful when you want to show more than one categorical variable per bar
# using pandas plot; kind='barh' for horizontal plot
# need to unstack one of the levels and fill na values
tips.groupby(['day','sex']).mean()[['total_bill']]\
.unstack('sex').fillna(0)\
.plot(kind='bar', stacked=True)
# Using plotly
t = tips.groupby(['day','sex']).mean()[['total_bill']].reset_index()
px.bar(t, x="day", y="total_bill", color="sex", title="Average bill by Gender and Day") # vertical
px.bar(t, x="total_bill", y="day", color="sex", title="Average bill by Gender and Day", orientation='h') # horizontal
Boxplot (horizontal and vertical) In a box plot, numerical data is divided into quartiles, and a box is drawn between the first and third quartiles, with an additional line drawn along the second quartile to mark the median. In some box plots, the minimums and maximums outside the first and third quartiles are depicted with lines, which are often called whiskers.
# using pandas plot
# specify y=<variable> for a vertical box plot and x=<variable> for a horizontal box plot
tips[['total_bill']].plot(kind='box')
# using plotly
px.box(tips, y='total_bill')
# using seaborn
sns.boxplot(y=tips["total_bill"])
Violin plot is a variation of box plot
# Using seaborn
sns.violinplot(y=tips.total_bill)
sns.violinplot(data=tips, x='day', y='total_bill',
hue='smoker',
palette='muted', split=True,
scale='count', inner='quartile',
order=['Thur','Fri','Sat','Sun'])
sns.catplot(x='sex', y='total_bill',
hue='smoker', col='time',
data=tips, kind='violin', split=True,
height=4, aspect=.7)
Histogram is a visual representation of the frequency distribution of your data. The frequencies are represented by bars.
# using pandas plot
tips.total_bill.plot(kind='hist')
# using plotly
px.histogram(tips, x="total_bill")
# using seaborn
sns.histplot(data=tips, x="total_bill")
# using altair
alt.Chart(tips).mark_bar().encode(alt.X('total_bill:Q', bin=True),y='count()')
Probability Plot is a way of visually comparing the data coming from different distributions. It can be of two types - pp plot or qq plot
pp plot (Probability-to-Probability) is a way to visualize how the cumulative distribution functions (CDFs) of two distributions (empirical and theoretical) compare against each other.
qq plot (Quantile-to-Quantile) is used to compare the quantiles of two distributions. The quantiles can be defined as continuous intervals with equal probabilities, or as points that divide the samples in the same way. The distributions may be theoretical or sample distributions from a process, etc.
Normal probability plot is a special case of the qq plot. It is a way of checking whether the dataset is normally distributed or not.
# using statsmodels
import statsmodels.graphics.gofplots as sm
import numpy as np
sm.ProbPlot(np.array(tips.total_bill)).ppplot(line='s')
sm.ProbPlot(np.array(tips.total_bill)).qqplot(line='s')
Scatter plot shows the relationship between two numerical variables.
# using plotly
px.scatter(tips, x='total_bill', y='tip', color='sex', size='size', hover_data=['day'])
# using pandas plot
tips.plot(x='total_bill', y='tip', kind='scatter')
Reg plot creates a regression line between 2 parameters and helps to visualize their linear relationship
# using seaborn
sns.regplot(x="total_bill", y="tip", data=tips, marker='+')
# for categorical variables we can add jitter to see overlapping points
sns.regplot(x="size", y="total_bill", data=tips, x_jitter=.1)
Line plot is used to visualize the value of something over time
# using pandas plot
tips['total_bill'].plot(kind='line')
# using plotly
px.line(tips, y='total_bill', title='Total bill')
t = tips.groupby('day').sum()[['total_bill']].reset_index()
px.line(t, x='day',y='total_bill', title='Total bill by day')
# using altair
alt.Chart(t).mark_line().encode(x='day', y='total_bill')
# using seaborn
sns.lineplot(data=t, x='day', y='total_bill')
Area plot is like a line chart in terms of how data values are plotted on the chart and connected using line segments. In an area plot, however, the area between the line segments and the x-axis is filled with color.
# using pandas plot
tips.groupby('day').sum()[['total_bill']].plot(kind='area')
# stacked area can be done using pandas.plot as well
t = tips.groupby(['day','sex']).count()[['total_bill']].reset_index()
t_pivoted = t.pivot(index='day', columns='sex', values='total_bill')
t_pivoted.plot.area()
# using plotly
px.area(t, x='day', y='total_bill', color='sex',line_group='sex')
# using altair
alt.Chart(t).mark_area().encode(x='day', y='total_bill')
Pie chart is a circular statistical graphic, which is divided into slices to illustrate numerical proportion. In a pie chart, the arc length of each slice is proportional to the quantity it represents.
# using pandas plot
tips.groupby('sex').count()['tip'].plot(kind='pie')
# using plotly
px.pie(tips, values='tip', names='day')
Sunburst chart is ideal for displaying hierarchical data. Each level of the hierarchy is represented by one ring or circle with the innermost circle as the top of the hierarchy.
px.sunburst(tips, path=['sex', 'day', 'time'], values='total_bill', color='day')
Radar chart is a graphical method of displaying multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point.
# using plotly
t = tips.groupby('day').mean()[['total_bill']].reset_index()
px.line_polar(t, r='total_bill', theta='day', line_close=True)
The best way to get better at visualization is through practice. What I have found useful is participating in a weekly visualization challenge called the TidyTuesday!
Comments welcome!
Data Science
· 2020-06-06
-
Describe your data using Python
This is the first of a series of articles that I will write to give a gentle introduction to statistics. In this article we will introduce some basic statistical concepts and learn how to use basic statistics to help you describe your data.
We will cover the following topics in this article:
The difference between a population and a sample
The difference between Descriptive and Inferential statistics
Different types of variables
Types of descriptive statistics
Normal or Gaussian distribution
The difference between a population and a sample:
Population denotes a large group consisting of elements having at least one common feature; it is the complete set of observations
Sample is a finite subset of the population; it is a subset of observations from a population. We get a sample from the population in either of the following ways
Representative sampling - here the sample’s characteristics are similar to the population characteristics
- A simple random sample is the most common approach to obtain a representative sample (see the sketch after this list)
- A systematic random sample
- A cluster random sample
- A stratified random sample
Convenience sampling - here we collect the sample from a section of the population that is easily available
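As an illustration, drawing a simple random sample (and, for comparison, a stratified random sample) is a one-liner with pandas. This minimal sketch uses the seaborn tips dataset that the later examples in this article also use; the sample sizes are arbitrary, and the grouped sample requires pandas 1.1 or newer.
import seaborn as sns
tips = sns.load_dataset('tips')
simple_random = tips.sample(n=50, random_state=1) # simple random sample of 50 rows
stratified = tips.groupby('day').sample(frac=0.2, random_state=1) # stratified random sample: 20% from each day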
The difference between Descriptive and Inferential statistics:
Descriptive statistics - it’s all about organizing, describing and summarizing data
Exploratory data analysis (EDA)
measures of location - such as Mean, Median, Mode
measures of variability or dispersion - such as Variance, Standard deviation, Range, Inter quartile range (IQR)
Inferential statistics - it’s all about drawing conclusions about a population from analysis of a random sample drawn from the population
Exploratory modelling - how is x related to y?
Predictive modelling - if you know x, can you predict y?
Different types of variables:
Quantitative
Discrete: a variable whose value is obtained by counting. Example, number of students in a class
Continuous: a variable whose value is obtained by measuring. Example, height of all students in a class
Interval: a scale of measurement where data is rank ordered and the spacing between values is meaningful, but there is no true zero (e.g., temperature in Celsius)
Ratio: a scale of measurement with all the properties of an interval scale plus a true zero, so ratios of values are meaningful (e.g., height)
Qualitative or Categorical
Nominal: example gender - female or male
Ordinal: example size - small, medium, or large
Types of descriptive statistics:
Measures of location: mainly measures of central tendency
Mean: sum of all values divided by the number of values
import seaborn as sns
tips = sns.load_dataset('tips')
tips.mean() # shows mean of all numeric variables
Median: middle value in a given sequence of values ordered by rank
tips.median() # shows median of all numeric variables
Mode: most frequent value in a set of values
tips.mode() # shows mode of all variables
Measures of variability, spread or dispersion
Range: Maximum value - Minimum value
bill_range = tips.total_bill.max() - tips.total_bill.min() # range of total_bill (named bill_range to avoid shadowing the Python built-in range)
IQR (Inter quartile range): 75th percentile - 25th percentile
tips.total_bill.quantile(.75) - tips.total_bill.quantile(.25) # IQR
Variance: Measure of variability of data around the mean
tips.total_bill.var() # variance of total_bill variable
Standard deviation: how spread out the data is, i.e. how much variance there is from the mean
tips.total_bill.std() # standard deviation of total_bill variable
Coefficient of variation (C.V.): the standard deviation expressed as a percentage of the mean
cv = lambda x: x.std() / x.mean() * 100
cv(tips.total_bill)
Measures of symmetry and peakedness: Skewness measures symmetry and Kurtosis measures peakedness
Normal or Gaussian distribution
This is one of the most common statistical distributions. The curve of this distribution is shaped like a bell.
The shape of the bell depends on mean and standard deviation of the data
Larger the standard deviation, wider the distribution
A tip to quickly assess normality is to see if mean and median are nearly equal
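A minimal sketch of that quick check on the tips data loaded above (only a rough heuristic, not a formal normality test):
print(tips.total_bill.mean(), tips.total_bill.median()) # roughly 19.8 vs 17.8 for total_bill
# a large gap between mean and median (relative to the spread) suggests skew rather than normality
print((tips.total_bill.mean() - tips.total_bill.median()) / tips.total_bill.std())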
Skewness and Kurtosis
Skewness measures the tendency of data to be more spread out on one side of the mean than the other. The skewness value indicates
Negative value indicates the data is left skewed
Positive value indicates the data is right skewed
Closer to zero for the data to be normally distributed
import scipy.stats as s
s.skew(tips.total_bill, bias=False) #calculate sample skewness
Kurtosis measures the tendency of data to be concentrated around the center or the tails. The kurtosis value indicates
Platykurtic: Negative value indicates lower than normal peakedness
Leptokurtic: Positive value indicates higher than normal peakedness
Mesokurtic: Closer to zero for the data to be normally distributed
import scipy.stats as s
s.kurtosis(tips.total_bill, bias=False) #calculate sample kurtosis
Comments welcome!
Data Science
· 2020-05-02
-
Optimizing Retention through Machine Learning
Acquiring a new customer in the financial services sector can be as much as five to 25 times more expensive than retaining an existing one. Therefore, prevention of customer churn is of paramount importance for the business. Advances in the area of Machine Learning, availability of large amounts of customer data, and more sophisticated methods for predicting churn can help devise a data backed strategy to prevent customers from churning.
Imagine that you are a large bank facing a challenge in this area. You are witnessing an increasing number of customers churning, which has started hitting your profit margin. You establish a team of analysts to review your current customer development and retention program. The analysts quickly uncover that the current program is a patchwork of mostly reactive strategies applied in various silos within the bank. However, the upside is that the bank has already collected rich data on customer interactions that could possibly help get a deeper understanding of reasons for churn.
Based on this initial assessment, the team recommends a data driven retention solution which uses machine learning to identify the reasons for churn and possible measures to prevent it. The solution consists of an array of sub-solutions focused towards specific areas of retention. The first level of sub-solutions consists of insights that can be directly derived from the existing customer data, answering for example the following business questions:
Churn History Analysis: What are characteristics of churning customers? Are there any events that indicate an increased probability for churn, like long periods without contact to the customer, several months of default on a credit product etc.?
Customer Segmentation: Are there groups of customers that have similar behavior and characteristics? Do any of these groups show higher churn rates?
Customer Profitability: How much profit is the business generating by different customers? What are characteristics of profitable customers?
First results can be drawn from these analyses. Additional insights are generated by combining them with data points such as the historical monthly profit that a business loses due to churn. Further, the data can be used for training supervised machine learning models which allow predicting future months or help classify customers for which rich data is not available yet. This is the idea behind the second level of sub-solutions.
Customer Life Time Value: What is the expected profitability for a given customer in the future?
Churn Prediction: Which customers are at risk of churn? For which customers can a quick intervention improve retention?
The early detection of customers at risk of churn is crucial for improving retention. However, not only is it beneficial to know the churn likelihood but also the expected profit loss that is connected with each customer in case of churn. Constant and fast advances in the area of Machine Learning help to improve these results.
Being able to process large amounts of data allows for more customized results that are focused on the individuality of each customer. This is an important point as every customer has different preferences when it comes to contact with the bank, different reactions when it comes to offers and different needs and goals. Combining previously mentioned analyses and a large amount of customer data provides the third level of sub-solutions which allow individualized prescriptive solutions for at-risk customers. The idea behind this prescriptive retention solution is the simulation of alternative paths combined with optimization techniques along different parameters like how many days passed since the last contact of the client with the bank.
The first set of descriptive or diagnostic solutions can be implemented relatively quickly as siloed analytics teams within the bank are already exploring them on their own. The second set of solutions, which is more predictive in nature, could take up to a year to implement. Built atop these, the prescriptive solution utilizes the outcome of previous analyses to suggest improved and individualized retention strategies. As a result the bank can now take different preventive retention measures for each customer.
Comments welcome!
-
Customer Journey Analytics
How important is it to align your analytics efforts with the customer lifecycle? Imagine you are a credit card department within the consumer banking branch of large bank. You are sending periodic mailers offering credit cards to your customers. Before sending these mail offers you do a minimum screening in a way that you only offer these to customers that have been with the bank for at-least 2 years and have maintained a balance above a certain threshold. However, you notice that the acceptance of your mail offers remains low even after a few campaigns. Why do you think is that?
The answer lies in a simple concept, but one that is often overlooked by analytics teams. Are you trying to identify which life stage the customer is in? Are you trying to synchronize your sales effort with the customer lifecycle? What is the customer lifecycle, you ask?
Customer lifecycle can be understood as a framework to track the relationship between a customer and a bank. It starts off with the Acquisition stage where your primary focus is to figure out ways to identify and bring on-board customers with which a mutually beneficial relationship can be created. After this comes the Development stage, where the customer is encouraged to expand his portfolio with your products through cross-sell efforts, etc. Finally, comes the Retention stage where the customer has been with you for more than a decade, so you try to enhance the relationship and monitor customer satisfaction so that the customer can act as a good ambassador for you.
These are the three basic stages: Acquire > Develop > Retain. You could break down these stages further to target any pain points you might be facing in a particular stage. For example, your acquisition through campaigns this year has not been as fruitful as in previous years. So you break down Acquisition into Awareness > Consideration > Purchase to pinpoint the root cause. Data suggests that the advertising budget is the same as in previous years. Marketing campaigns to tip consumers in the consideration stage into the purchase stage are also being sent in a timely manner. However, you are still losing prospective customers in the purchase stage. You sanction a study to identify any changes that might have happened in the way you on-board a customer. Voilà! You identify that the on-boarding form has been appended with two new sections seeking a little more information about the customer before on-boarding. You weigh the necessity of collecting this information during on-boarding and decide to drop the additional sections. A few months later, Acquisition metrics start to return to the ballpark of previous years.
Perhaps the most important aspect in the world of data driven decision making is to align the reporting and analytical efforts with the customer lifecycle. For example, during the acquisition phase your primary aim is to provide the right product just when the prospective customer needs it. This could be achieved through an analysis such as the Best Next Offer, where you use Machine Learning techniques to match your products with profiles of prospects created using demographic, psychographic, and other factors. Similarly, during the Development stage you focus on meticulously reporting and driving cross-sell efforts to increase your product presence in the customer portfolio. Lastly, during the Retention stage your focus should be on minimizing churn through customer satisfaction, and this can be achieved through churn analysis on the quality data you collected in this aspect.
To close I will reemphasize the importance of collecting good data, analytics and aligning it closely with customer lifecycle for optimal data driven decision making.
Comments welcome!
-
An Introduction to GitHub
A three part article series on version control using Git and GitHub. This is the third article in the series in which I will give a very brief introduction to GitHub. This will allow most readers to understand enough to utilize it for version control during development.
What is GitHub?
GitHub is a popular platform for hosting and sharing code repositories, and is widely used for version control and collaborative coding projects. If you’re new to using GitHub for version control, here are some key things to keep in mind:
Create a GitHub account: The first step in using GitHub is to create an account. You can sign up for a free account, which gives you access to public repositories, or a paid account, which gives you access to private repositories and additional features.
Create a new repository: Once you have an account, you can create a new repository by clicking the “New repository” button on your GitHub dashboard. You can choose to make the repository public or private, and can add a README file and other files as needed.
Clone the repository to your local machine: Once you have created a repository on GitHub, you can clone it to your local machine using Git. This allows you to make changes to the code locally, and push those changes back to the remote repository on GitHub.
Make changes and commit them: Once you have cloned the repository to your local machine, you can make changes to the code and commit those changes to Git. Be sure to write clear and descriptive commit messages that explain the changes made.
Push changes to the remote repository: After committing changes to Git, you can push those changes back to the remote repository on GitHub. This allows other team members to see the changes and collaborate on the code.
Use pull requests for code reviews: When working on a team, it’s a good practice to use pull requests to review code changes before merging them into the main branch. This allows other team members to review the code and provide feedback before changes are merged.
Use branches for new features or bug fixes: When working on a new feature or bug fix, it’s important to create a new branch in Git rather than making changes directly to the main branch. This keeps the main branch stable and allows for easier collaboration with other team members.
By keeping these key things in mind when using GitHub for version control, you can help ensure that your codebase is well-organized, well-documented, and easy to collaborate on with other team members.
Components of GitHub
Now, let us explore some of the key components of GitHub.
Repository, branch
Repository is a project’s folder and contains all of the project files (including documentation), and stores each file’s revision history.
Branch is a parallel version of a repository. It is contained within the repository, but does not affect the primary or master branch allowing you to work freely without disrupting the “live” version. When you’ve made the changes you want to make, you can merge your branch back into the master branch to publish your changes.
Commit, revert
Commit, or “revision”, is an individual change to a file (or set of files). When you make a commit to save your work, Git creates a unique ID (a.k.a. the “SHA” or “hash”) that allows you to keep record of the specific changes committed along with who made them and when. Commits usually contain a commit message which is a brief description of what changes were made.
Revert - when you revert a pull request on GitHub, a new pull request is automatically opened, which has one commit that reverts the merge commit from the original merged pull request. In Git, you can revert commits with git revert.
Push, pull, fetch, merge
Push means to send your committed changes to a remote repository on GitHub.com. For instance, if you change something locally, you can push those changes so that others may access them.
Pull refers to when you are fetching in changes and merging them. For instance, if someone has edited the remote file you’re both working on, you’ll want to pull in those changes to your local copy so that it’s up to date. See also fetch.
Pull requests are proposed changes to a repository submitted by a user and accepted or rejected by a repository’s collaborators. Like issues, pull requests each have their own discussion forum.
Fetch - when you use git fetch, you’re adding changes from the remote repository to your local working branch without committing them. Unlike git pull, fetching allows you to review changes before committing them to your local branch.
Merge takes the changes from one branch (in the same repository or from a fork), and applies them into another. This often happens as a “pull request” (which can be thought of as a request to merge), or via the command line. A merge can be done through a pull request via the GitHub.com web interface if there are no conflicting changes, or can always be done via the command line.
Fork, clone, download
Fork is a personal copy of another user’s repository that lives on your account. Forks allow you to freely make changes to a project without affecting the original upstream repository. You can also open a pull request in the upstream repository and keep your fork synced with the latest changes since both repositories are still connected
Clone is a copy of a repository that lives on your computer instead of on a website’s server somewhere, or the act of making that copy. When you make a clone, you can edit the files in your preferred editor and use Git to keep track of your changes without having to be online. The repository you cloned is still connected to the remote version so that you can push your local changes to the remote to keep them synced when you’re online.
Download option allows you to download the project folder as a zip file from GitHub to your local machine. The zip does not include the .git folder, so cloning the repository using its HTTPS URL is a better option if you want to keep the version history.
Comments welcome!
-
-
An Introduction to Git
A three part article series on version control using Git and GitHub. This is the first article in the series in which I will give a very brief introduction to Git. This will allow most readers to understand enough to utilize it for version control during development.
What is Git?
Git is a popular version control system that allows developers to manage and track changes to their code over time. It’s an essential tool for software development teams, as it helps to ensure that changes to code are properly tracked and documented, and makes it easier for developers to collaborate and work together. Here’s an overview of what Git is and how it works.
Git is a distributed version control system, meaning that every developer working on a project has their own copy of the code repository on their local machine. This allows developers to work on their own changes and then merge them back into the main repository when they are ready. Git is also designed to be very fast and efficient, making it ideal for managing large codebases and complex projects.
How does Git work?
Git works by tracking changes to files and directories in a code repository. When a developer makes changes to the code, they create a new “commit” that documents the changes they made. Git stores these commits in a tree-like structure, with each commit representing a snapshot of the code at a particular point in time. This allows developers to easily view the history of changes to the code over time, and to revert to previous versions if necessary.
Git also allows developers to create branches, which are essentially separate versions of the code repository that can be worked on independently. Branches are useful for trying out new features or making experimental changes without affecting the main codebase. Once changes have been tested and reviewed, they can be merged back into the main branch.
Using Git for version control
To use Git for version control, developers typically create a new repository on a Git hosting service such as GitHub, GitLab, or Bitbucket. They then clone the repository onto their local machine and begin making changes to the code. To commit changes, developers use Git commands such as “git add” to add changed files to the commit, and “git commit” to create a new commit with a commit message that describes the changes.
To collaborate with other developers, developers can push their changes to the remote repository and create “pull requests” that allow other developers to review the changes and provide feedback. Once changes have been reviewed and approved, they can be merged back into the main branch.
Basic terminal commands
Terminal (for Unix or Mac) or Command Prompt for Windows allows us to type Git commands and manage project repositories. In this section we will be focusing on terminal commands.
By default we are in the /home/vivek directory. home and mnt folders are in the same directory (usually they are in the highest level directory signified by just a /)
pwd shows the current directory
clear is used to clear the command line
cd + tab key is used to cycle between sub directories in a directory
cd .. is used to move up a directory
cd mnt/ is used to enter the mnt directory. In this directory we can find the windows c drive (basically it is a directory named c)
~ signifies that you are in your home directory
.. is used to move up one directory
/ signifies the highest level directory, you can’t go back further from there
mkdir is used to create a new directory
Directory names are case sensitive
Right click is used to paste an absolute path name in the terminal
ls is used to list all directories and files in a directory
rm -rf is used to remove folders. -r tells the command to remove a directory and its contents recursively and -f forces the removal; by default rm is used to remove a file
git --version is used to see the version of git
touch file_name.txt is used to create a file
Basic Git commands
Git Repository is used to save project files and the information about the changes in the project. Repository can be created locally, or it can be a clone Git repository which is a copy of a remote Git repo.
git init is used to initialize the directory as a git repository. This will create a .git folder in the directory and we can start using git features
git status shows staging area. You will see some files under “Untracked files:” header
git add file-name is used to add a file to staging area. After this you will see the file under the “Changes to be committed:” header
git add . is used to add all files in directory to staging area (. signifies all)
git rm --cached file.txt is used to unstage a file
git rm -f file.txt is used to force remove a file from staging area and also deletes the file from directory (-f signifies force)
git config --global user.email "abc.xyz@email.com"
git config --global user.name "abc.xyz"
git commit --help
git commit -a -m "Initial commit" (-m to include a message; -a to automatically stage files that have been modified and deleted, but new files you have not told Git about are not affected)
git log (if you want to see a shorter version then use git log --oneline)
HEAD usually points to the most recent commit of the branch you are on (e.g., master); it determines what the project directory looks like.
git checkout <commit-id> is used to see the contents of the folder as they looked during that particular commit
git checkout master is used to restore the head to the most recent commit, hence the contents of the project directory are also restored to what they were at the time of the most recent commit
git revert <commit-id> creates a new commit that undoes the changes introduced by that particular commit. The original commit still appears in the log, and the revert itself can be undone by running git revert again
git reset - three kinds - soft (moves HEAD back to an earlier commit but keeps the staging area and working directory; similar in spirit to checkout), mixed (the default; moves HEAD and resets the staging area, but keeps the working directory), and hard (moves HEAD and resets both the staging area and the working directory, discarding changes to tracked files)
touch .gitignore, now open the .gitignore file with notepad and add the names of the files you don’t want to track. # can be used to comment in this file. Usually you create .gitignore when initializing the project. If you have already committed files before adding them to the .gitignore file, then you need to remove them from the cache by using the following series of commands
git rm -r --cached .
git add .
git commit -m "message"
If there is a directory in your project folder and you want to ignore all files in the directory from future commits, you can add “directory-name/*” in the .gitignore file
Git Branches for Error Handling
Let’s say there is an error in one of the files in the project folder
We can create a branch to fix the error while the master repository stays intact
git checkout -b err01 (creates a new branch called err01)
<fix the error in one of the files in the project folder>
git add . (add all changes made to the err01 to the staging area, so they can be committed)
git commit -m "fixed error" (commit all changes made to the err01 branch)
git checkout master (switch back to master branch)
git merge err01 (merge changes made in err01 into the master branch; the merge weaves the err01 changes into the master branch commit timeline)
git push (this will push master branch of project folder to remote repository)
git push origin err01 (this will push err01 branch of project folder to remote repository)
git push origin --delete err01 (we delete the err01 branch as we don’t need it anymore)
git branch -d err01 (local branches can be deleted using -d)
git branch -a (list all branches)
Remote Repositories for Effective Collaboration
First step is to create a new repository on GitHub (don’t add a read-me, gitignore or license). Copy the url of the repository
Create a project folder in your local machine and browse into that folder using bash
git init (you will see that the repository has not been initialized yet; git init is used to create a new repository)
git remote add origin <paste url here>
git remote -v (you will see that the repository has been initialized)
In GitHub website
“Create new file” > README.md
“Create new file” > LICENSE
“Create new file” > .gitignore > in content of that file type /AutoGen to exclude all files that we keep in that folder
pull - go back to bash
git pull origin master (we don’t need to specify origin master if we set master as the tracked branch)
git branch --set-upstream-to=origin/<branch> master
Sometimes you might be prompted for a login at this stage
<make changes to the local repository>
git push -u origin master (push updates to remote repository on GitHub; will ask for username and password)
You can add other developers as collaborators to this repository.
In summary, Git is a powerful tool for version control that allows developers to manage and track changes to code over time. With its distributed architecture, fast performance, and support for branching and merging, Git is an essential tool for software development teams of all sizes.
Comments welcome!
-
Introduction to Programming in R
Quick Introduction to R
R is a programming language and environment for statistical computing and graphics. It was created in the early 1990s by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. R is now widely used in academia, industry, and government for data analysis, statistical modeling, and data visualization.
One of the key features of R is its wide range of statistical and graphical techniques. R provides a vast array of statistical and graphical methods, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and graphical techniques for data visualization. R is also highly extensible and has an active community of users and developers who create and contribute packages that enhance the capabilities of the language.
R is an open-source language, which means that the code is available for free and can be modified and redistributed. This has led to the development of a large and active community of R users and developers. The R community provides a wealth of resources, including documentation, tutorials, and help forums, making it easy for users to get started with the language and to find solutions to their problems.
One of the advantages of R is its integration with other programming languages and data sources. R can read data from a wide range of sources, including text files, spreadsheets, databases, and web services. R can also interact with other programming languages, such as Python, Java, and C++, allowing users to take advantage of the strengths of different languages and libraries.
Another advantage of R is its versatility. R can be used for a wide range of tasks, from data analysis and visualization to machine learning and artificial intelligence. R can also be used in a variety of settings, from research and academia to industry and government.
Most modern programming languages share a similar set of building blocks, for example
Receiving input from the user and showing output to the user
Ability to store values in variables (usually of different kinds such as integers, floating point numbers or characters)
A string of characters where you can store names, addresses, or any other kind of text
Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don’t write the code for that 10 times; you write it just once and tell the computer to loop through it 10 times
Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes, else fails
Put your code in functions
Advanced data types that are formed through a combination of one or more types of basic data types, such as structures or classes
Read files from disk and save files to disk
Ability to comment your code so you can understand it when you revisit it some time later
Let’s dive right in and see how we can do these things in R.
Before we can begin to write a program in R, we need to install R and RStudio. Once both are installed, open RStudio and run the following code to check that everything works:
myString <- "Hello, World!"
print (myString)
1. Receiving input from the user and Showing output to the user
There are several ways in which we can show output to the user. Let’s look at some ways of showing output:
var.1 = c(0,1,2,3)
# Method 1: values of the variables can be printed using print()
print(var.1)
# Output: 0 1 2 3
# Method 2: cat() function combines multiple items into a continuous print output
cat ("var.1 is ", var.1 ,"\n")
# Output: var.1 is 0 1 2 3
2. Ability to store values in variables (usually of different kinds such as integers, floating points or character)
Basic data types: In R, variables are called objects. There are several types of objects; let’s take a look at the important ones:
# Logical
v <- TRUE
print(class(v)) # the class() function can be used to see the data type of the variable
# Numeric
v <- 23.5
print(class(v))
# Integer
v <- 2L
print(class(v))
# Complex
v <- 2+5i
print(class(v))
# Character
v <- "TRUE"
print(class(v))
# Raw
v <- charToRaw("Hello")
print(class(v))
Advanced data types: Much of R’s power comes from the fact that R lets us access some advanced objects other than the basic ones shown earlier. Let’s take a look at some of the advanced types:
# Vectors - When you want to create a vector with more than one element, you should use the c() function, which combines the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
# Get the class of the vector.
print(class(apple))
# Lists - A list is an R-object which can contain many different types of elements inside it like vectors, functions and even another list inside it.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
# Print the list.
print(list1)
# Matrices - A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
# Arrays - While matrices are confined to two dimensions, arrays can be of any number of dimensions. The array function takes a dim attribute which creates the required number of dimension. In the below example we create an array with two elements which are 3x3 matrices each.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
# Factors - Factors are R objects which are created using a vector. A factor stores the vector along with the distinct values of its elements as labels. The labels are always character, irrespective of whether the input vector is numeric, character or Boolean. Factors are useful in statistical modeling. They are created using the factor() function, and the nlevels() function gives the count of levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')
# Create a factor object.
factor_apple <- factor(apple_colors)
# Print the factor.
print(factor_apple)
print(nlevels(factor_apple))
# Data frames are tabular data objects. Unlike a matrix, in a data frame each column can contain a different mode of data: the first column can be numeric while the second column can be character and the third column can be logical. A data frame is a list of vectors of equal length, created using the data.frame() function.
# Create the data frame.
BMI <- data.frame(
gender = c("Male", "Male","Female"),
height = c(152, 171.5, 165),
weight = c(81,93, 78),
Age = c(42,38,26)
)
print(BMI)
3. A string of characters where you can store names, addresses, or any other kind of text
Any value written within a pair of single quote or double quotes in R is treated as a string.
Key idea here is to learn how to manipulate string variables. There are a few common operations that we will focus on:
a. Concatenate strings
# Concatenate strings - paste() syntax: paste(..., sep = " ", collapse = NULL)
str1 <- "Hello"; str2 <- "World"; str3 <- "!"
paste(str1, str2, str3, sep = " ") # "Hello World !"
b. Counting number of characters in a string
# Counting number of characters in a string - nchar() function
test_str <- "Hello World"
nchar(test_str) # 11
c. Changing the case - toupper() & tolower() functions
str = 'apPlE'
toupper(str) # APPLE
tolower(str) # apple
d. Extracting parts of a string - substring() function
# Syntax
substring(x,first,last)
# Example - Extract characters from 5th to 7th position.
result <- substring("Extract", 5, 7)
print(result)
e. Formatting - Numbers and strings can be formatted to a specific style using format() function.
# Syntax
format(x, digits, nsmall, scientific, width, justify = c("left", "right", "centre", "none"))
# Example
# Total number of digits displayed. Last digit rounded off.
result <- format(23.123456789, digits = 9)
print(result)
# Display numbers in scientific notation.
result <- format(c(6, 13.14521), scientific = TRUE)
print(result)
# The minimum number of digits to the right of the decimal point.
result <- format(23.47, nsmall = 5)
print(result)
# Format treats everything as a string.
result <- format(6)
print(result)
# Numbers are padded with blank in the beginning for width.
result <- format(13.7, width = 6)
print(result)
# Left justify strings.
result <- format("Hello", width = 8, justify = "l")
print(result)
# Justify string with center.
result <- format("Hello", width = 8, justify = "c")
print(result)
4. Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Arrays are a series of values of the same type stored together in one variable. Arrays can be one-dimensional or multi-dimensional. An array is created using the array() function. It takes vectors as input and uses the values in the dim parameter to create an array.
For example, if we create an array of dimension (2, 3, 4), it creates 4 rectangular matrices, each with 2 rows and 3 columns.
# dim=c(rows, columns, matrices)
array2 = array(1:12, dim=c(2, 3, 2))
# Naming Columns and Rows
column.names <- c("COL1","COL2","COL3")
row.names <- c("ROW1","ROW2")
matrix.names <- c("Matrix1","Matrix2")
array2 = array(1:12, dim=c(2, 3, 2), dimnames = list(row.names, column.names, matrix.names))
Lets see how we can access array elements:
# dim=c(rows, columns, matrices)
print(array2[2,,2]) # Print the second row of the second matrix of the array.
print(array2[1,3,1]) # Print the element in the 1st row and 3rd column of the 1st matrix.
print(array2[,,2]) # Print the 2nd Matrix.
Since the returned values here are matrices, we can perform matrix operations on them
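For example, continuing with the array2 created above, a couple of standard matrix operations (purely illustrative) look like this:
# The slices above are ordinary matrices, so matrix operations apply
mat1 <- array2[,,1] # 2x3 matrix
mat2 <- array2[,,2] # 2x3 matrix
print(mat1 + mat2) # element-wise addition
print(mat1 %*% t(mat2)) # matrix multiplication: 2x3 times 3x2 gives a 2x2 matrix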
Calculations Across Array Elements (we can use user defined functions as well)
apply()
lapply()
sapply()
tapply()
# apply(X, MARGIN, FUN) - applies FUN over rows (MARGIN=1), columns (MARGIN=2) or both - input to this function is a matrix, array or data frame - output is a vector, list or array
m1 <- matrix(1:10, nrow=5, ncol=2)
apply(m1, 2, sum)
# lapply(X, FUN) - apply to all elements - input to this function is list, vector or df - output is a list
# sapply(X, FUN) - apply to all elements - input to this function is list, vector or df - output is a vector or a matrix
movies <- c("BRAVEHEART","BATMAN","VERTIGO","GANDHI")
lapply(movies, tolower)
sapply(movies, tolower)
# tapply(X, INDEX, FUN = NULL) - applies FUN to each group defined by a factor variable - input to this function is a vector - output is an array
data(iris)
tapply(iris$Sepal.Width, iris$Species, median)
5. Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don’t write the code for that 10 times; you write it just once and tell the computer to loop through it 10 times
R has several looping options (repeat, while and for). There are also options of nesting (single, double, triple, ..) loops.
a. The Repeat loop executes the same code again and again until a stop condition is met:
# Syntax
repeat {
commands
if(condition) {
break
}
}
# Example
v <- c("Hello","loop")
cnt <- 2
repeat {
print(v)
cnt <- cnt+1
if(cnt > 5) {
break
}
}
b. The While loop executes the same code again and again as long as its test condition remains true:
# Syntax
while (test_expression) {
statement
}
# Example
v <- c("Hello","while loop")
cnt <- 2
while (cnt < 7) {
print(v)
cnt = cnt + 1
}
c. The for loop:
# Syntax
for (value in vector) {
statements
}
# Example
v <- LETTERS[1:4]
for ( i in v) {
print(i)
}
R also provides the break and next statements that allow us to alter the loops further. Following is their use:
When the break statement is encountered inside a loop, the loop is immediately terminated and program control resumes at the next statement following the loop.
On encountering next, the R parser skips further evaluation and starts next iteration of the loop.
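As a small illustration (the vector and cutoff below are made up for this sketch), here is how next and break can be used inside a for loop:
# Illustrative example of next and break inside a for loop
for (i in 1:10) {
  if (i %% 2 == 0) {
    next # skip even numbers and move to the next iteration
  }
  if (i > 7) {
    break # terminate the loop once i exceeds 7
  }
  print(i)
}
# Output: 1 3 5 7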
6. Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
R provides if.., if..else.., if..else..if.., and switch options to apply conditional logic. Lets take a look at them:
a. The basic syntax for creating an if statement in R is:
# Syntax
if (test_expression) {
statement
}
# Example
x <- 5
if(x > 0){
print("Positive number")
}
b. The basic syntax for creating an if…else statement in R is:
if (test_expression) {
statement1
} else {
statement2
}
# Example
x <- -5
if(x > 0){
print("Non-negative number")
} else {
print("Negative number")
}
c. The basic syntax for creating an if…else if…else statement in R is:
if (test_expression1) {
statement1
} else if (test_expression2) {
statement2
} else if (test_expression3) {
statement3
} else {
statement4
}
# Example
x <- 0
if (x < 0) {
print("Negative number")
} else if (x > 0) {
print("Positive number")
} else {
print("Zero")
}
d. A switch statement allows a variable to be tested for equality against a list of values. Each value is called a case, and the variable being switched on is checked for each case.
x <- switch(
2,
"first",
"second",
"third",
"fourth"
)
print(x)
7. Put your code in functions
In R a user defined function is created by using the keyword function.
# Syntax
function_name <- function(arg_1, arg_2, ...) {
Function body
}
# Example
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
for(i in 1:a) {
b <- i^2
print(b)
}
}
We can call the function new.function supplying 6 as an argument.
new.function(6)
We can also create functions to which we can pass arguments. These functions can also be defined to use default values for those arguments in case the user does not provide a value. Let’s see how this is done:
new.function <- function(a = 3, b = 6) {
result <- a * b
print(result)
}
Now we can call this with or without passing any values:
# Call the function without giving any argument.
new.function()
# Call the function with giving new values of the argument.
new.function(9,5)
8. Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
A class is the blueprint that helps to create an object; it contains the member variables along with their attributes. R lets you create two types of classes, S3 and S4.
S3 classes: informal classes that let you overload functions (methods are dispatched on the class of the object).
S4 classes: more formal classes that let you restrict the type of data stored in each slot, which makes programs easier to debug.
We will cover S4 classes here (a brief S3 sketch follows the example below for contrast). An S4 class is defined by the setClass() method.
# Defining a class
setClass("emp_info", slots=list(name="character", age="numeric", contact="character"))
emp1 <- new("emp_info",name="vivek", age=30, contact="somewhere on the internet")
# Access elements of a class
emp1@name
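For contrast, here is a minimal S3-style sketch (the class and function names are illustrative); S3 objects are just R objects tagged with a class attribute, and functions are overloaded by defining class-specific methods:
# A minimal S3 sketch (illustrative names)
emp2 <- list(name="vivek", age=30)
class(emp2) <- "emp_info_s3" # tag the object with a class
print.emp_info_s3 <- function(x, ...) cat("Employee:", x$name, "-", x$age, "years old\n") # overload print() for this class
print(emp2) # dispatches to print.emp_info_s3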
9. Read file from a disk and save file to a disk
Let’s see how to read and write a CSV file in an organized way. CSV is the most common file type you will be using for data science; however, R can read several other file types as well.
# read a csv file
data <- read.csv('file.csv')
# write a csv file
write.csv(data, 'file.csv', row.names = FALSE)
10. Ability to comment your code so you can understand it when you revisit it some time later
We can tell R that a line of code is a comment by starting it with a #.
# this is a comment
In summary, R is a powerful and versatile programming language that is widely used for statistical computing and graphics. Its extensive range of statistical and graphical techniques, its open-source nature, and its active community of users and developers make it a valuable tool for data analysis and modeling. Whether you are a researcher, data analyst, or developer, R provides a wide range of tools and resources for working with data and creating meaningful insights.
To close I will emphasize the importance of practicing in learning anything new. Persistence and trying out different combinations of these building blocks for solving easier problems first and more complex ones later on is the only way to become fluent.
Comments welcome!
-
Introduction to Programming in Markdown
Quick Introduction to Markdown
Markdown is a lightweight markup language that is used to format text in a simple and consistent way. It was first created in 2004 by John Gruber and Aaron Swartz as a way to write content for the web that was easy to read and write.
Markdown is designed to be easy to learn and use. It uses simple syntax to format text, making it easy to create headings, lists, links, and other formatting elements. Markdown can be used in a wide variety of contexts, including writing blog posts, creating documentation, and writing code comments.
One of the key features of Markdown is its simplicity. Markdown uses plain text that can be easily read and edited using any text editor. This makes it easy to collaborate on documents and to transfer files between different devices and platforms. Additionally, Markdown is supported by a wide variety of software tools and platforms, including blogging platforms, content management systems, and online forums.
Another important feature of Markdown is its flexibility. Markdown can be customized and extended to support a wide variety of use cases. For example, Markdown supports the creation of tables, code blocks, and mathematical equations. Additionally, there are many third-party tools and libraries that extend the functionality of Markdown, such as Pandoc, which can convert Markdown to other formats like HTML, LaTeX, and PDF.
Markdown is also popular among programmers, as it can be used to create code blocks and inline code snippets. This is particularly useful for writing documentation and sharing code examples. Many code editors also support Markdown, allowing programmers to write and preview Markdown documents without leaving their development environment.
The following table provides a quick overview of frequently used Markdown syntax elements. It does not cover every case, so if you need more information about any of these elements, refer to the reference guides for basic syntax and extended syntax.
Element
Markdown Syntax
Heading
# for H1, ## for H2 and so on
Bold
**bold text**
Italic
*italicized text*
Blockquote
> blockquote
Ordered List
Just add 1., 2. and so on in front of list elements
Unordered List
Just add a - or * in front of list elements
Code
`code`
Horizontal Rule
three or more *, -, or _
Link
[title](https://www.example.com)
Image
![alt text](file-path/image.jpg){:class="img-responsive"}
Now that we have reviewed some of the basic syntax elements, let’s familiarize ourselves with some advanced syntax elements.
Element
Markdown Syntax
Table
| for vertical lines and - for horizontal lines
Code Block
``` code ```
Footnote
[^1]: This is the footnote.
Heading ID
### Heading {#custom-id}
Strikethrough
~~The world is flat.~~
URL
https://www.markdownguide.org
Email
fake@example.com
Escape character
\
Markdown also offers syntax highlighting for various programming languages when we specify a code block. Most of the time all we need to do is mention the name of the programming language right after the opening ```, like ```python (a short example follows the table below). The following is a curated list of supported programming languages:
Language
Supported file types
bash
'*.sh', '*.ksh', '*.bash', '*.ebuild', '*.eclass'
bat
'*.bat', '*.cmd'
c
'*.c', '*.h'
cpp
'*.cpp', '*.hpp', '*.c++', '*.h++', '*.cc', '*.hh', '*.cxx', '*.hxx', '*.pde'
csharp
'*.cs'
css
'*.css'
fortran
'*.f', '*.f90'
go
'*.go'
html
'*.html', '*.htm', '*.xhtml', '*.xslt'
java
'*.java'
js
'*.js'
markdown
'*.md'
perl
'*.pl', '*.pm'
php
'*.php', '*.php(345)'
postscript
'*.ps', '*.eps'
python
'*.py', '*.pyw', '*.sc', 'SConstruct', 'SConscript', '*.tac'
rb or ruby
'*.rb', '*.rbw', 'Rakefile', '*.rake', '*.gemspec', '*.rbx', '*.duby'
sql
'*.sql'
vbnet
'*.vb', '*.bas'
xml
'*.xml', '*.xsl', '*.rss', '*.xslt', '*.xsd', '*.wsdl'
yaml
'*.yaml', '*.yml'
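For example, a fenced code block with Python syntax highlighting would be written like this (the print line is just a placeholder):
```python
print("Hello, World!")
```
When rendered, the block is displayed with Python keyword and string highlighting.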
A great heading (h1)
Another great heading (h2)
Some great subheading (h3)
You might want a sub-subheading (h4)
Could be a smaller sub-heading, pacman (h5)
Small yet significant sub-heading (h6)
Code box
<html>
<head>
</head>
<body>
<p>Hello, World!</p>
</body>
</html>
List
First item, yo
Second item, dawg
Third item, what what?!
Fourth item, fo sheezy my neezy
Numbered list
First item, yo
Second item, dawg
Third item, what what?!
Fourth item, fo sheezy my neezy
Comments
{% comment %}
Might you have an include in your theme? Why not try it here!
{% include my-themes-great-include.html %}
{% endcomment %}
Tables
Title 1
Title 2
Title 3
Title 4
lorem
lorem ipsum
lorem ipsum dolor
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
Title 1
Title 2
Title 3
Title 4
lorem
lorem ipsum
lorem ipsum dolor
lorem ipsum dolor sit
lorem ipsum dolor sit amet
lorem ipsum dolor sit amet consectetur
lorem ipsum dolor sit amet
lorem ipsum dolor sit
lorem ipsum dolor
lorem ipsum
lorem
lorem ipsum
lorem ipsum dolor
lorem ipsum dolor sit
lorem ipsum dolor sit amet
lorem ipsum dolor sit amet consectetur
In summary, Markdown is a simple and flexible markup language that is widely used for formatting text on the web. Its simplicity and ease of use make it an attractive option for writing and sharing documents, and its flexibility allows it to be customized and extended to support a wide variety of use cases. Whether writing blog posts, creating documentation, or sharing code examples, Markdown is a valuable tool for anyone who wants to format text in a consistent and easy-to-read way.
For a more complete list consider visiting Codebase.
By the way this page was written using markdown and rendered to HTML using Jekyll.
Comments welcome!
-
Introduction to Programming in Python
Quick Introduction to Python
Python is a high-level, interpreted programming language that was first released in 1991 by Guido van Rossum. It is a general-purpose language that is designed to be easy to use, with a focus on readability and simplicity. Python is often used for web development, data analysis, artificial intelligence, scientific computing, and other types of software development.
One of the key features of Python is its ease of use. Python’s syntax is designed to be simple and intuitive, making it accessible to both beginner and experienced programmers. Python is also an interpreted language, meaning that it does not require compilation, which makes it easy to write and test code quickly.
Another important feature of Python is its support for object-oriented programming. Python allows users to create classes and objects, and to define methods on those objects. This makes it a powerful tool for building complex software systems.
Python also includes a large and growing library of built-in modules and packages. These modules provide a wide range of functionality, from working with strings, arrays, and dictionaries to working with databases, web frameworks, and machine learning tools. Python’s open-source ecosystem is one of its biggest strengths, as it allows developers to easily access and integrate with a wide range of third-party libraries and tools.
One of the most popular web development frameworks built in Python is Django. Django is a full-stack web framework that provides a set of conventions and tools for building web applications quickly and easily. With its focus on developer productivity, Django has become a popular choice for startups, small businesses, and large enterprises.
Python’s popularity has also been driven by its use in data analysis and scientific computing. With packages like NumPy, Pandas, and Matplotlib, Python has become a leading language for data analysis and visualization. In recent years, Python has also become a popular language for artificial intelligence and machine learning, with packages like TensorFlow, PyTorch, and Scikit-learn providing powerful tools for building machine learning models.
Most modern programming languages share a similar set of building blocks, for example
Receiving input from the user and showing output to the user
Ability to store values in variables (usually of different kinds such as integers, floating point numbers or characters)
A string of characters where you can store names, addresses, or any other kind of text
Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don’t write the code for that 10 times; you write it just once and tell the computer to loop through it 10 times
Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes, else fails
Put your code in functions
Advanced data types that are formed through a combination of one or more types of basic data types, such as structures or classes
Read files from disk and save files to disk
Ability to comment your code so you can understand it when you revisit it some time later
Let’s dive right in and see how we can do these things in Python.
0. How to install Python on your desktop?
Before we can begin to write a program in Python, we need to install Anaconda. This will install the Anaconda data science environment and Spyder IDE for coding in Python. Once done, go ahead and open Spyder and try out the following code to see if everything is in order.
myString = "Hello, World!"
print (myString)
1. Receiving input from the user and Showing output to the user
There are several ways in which we can show output to the user. Let’s look at some ways of showing output:
name = input('please enter your name') # receiving character input
print("hello ", name, ",how are you?") # showing character output
age = input('please enter your age') # receiving numeric input
print('so you are', age, 'years old.')
2. Ability to store values in variables (usually of different kinds such as integers, floating points or character)
Python is dynamically typed - you don’t need to declare a variable’s data type before using it. This can sometimes cause unexpected problems, for example if a user enters text where you expect a number. To avoid this kind of problem, type() can be used to check what a variable currently holds. Alternatively, you can “define the variable” by assigning it an initial value (like age=20).
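For instance, a minimal sketch of such a check (the prompt and conversion below are just illustrative):
age = input('please enter your age') # input() always returns a string, even if the user types a number
print(type(age)) # <class 'str'>
age = int(age) # convert the text to an integer before doing arithmetic with it
print(type(age)) # <class 'int'>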
Basic data types: In Python we have several types of objects, lets take a look at the important ones:
# Boolean / Logical
v = True
print(type(v)) # the type() function can be used to see the data type of the variable
# Numeric (float)
v = 23.5
print(type(v))
# Integer
v = 2
print(type(v))
# Complex
v = 2+5j
print(type(v))
# Character (string)
v = "TRUE"
print(type(v))
# Some common number functions:
hex(1) # hexadecimal representation of numbers
bin(1) # binary representation of numbers
2**3 # 2^3, 2 to the power 3
pow(2,3) # 2**3
pow(2,3,4) # 2**3 % 4
abs(-2.33)
round(3.14)
round(3.14159,2) # only till 2 decimal places
import math
sq_rt = math.sqrt(16) # returns the square root of its argument, here 4.0
Advanced data types: Much of Python’s power comes from the fact that it lets us access some advanced variable types other than the basic ones shown earlier. Let’s take a look at some of the advanced variable types:
# Lists - A list can contain many different types of elements inside it such as character, numeric, etc. and even another list inside it.
# Create a list through enumeration.
a=[] # with this we initialize a list element
a=range(1,10) # with this we insert a range of values from 1-10 in the list
print(list(a)) # to show the list as a list, we need to tell the print function that we are passing it a list
# Output: [1, 2, 3, 4, 5, 6, 7, 8, 9] # 10 is excluded because upper bound is excluded in python
# we can have mixed data types in a list
b=[1,2,3,'vivek',True,4,5]
print(list(b))
# index of list start with 0, 1, 2 ..
# so vivek is present at index 3
print(b[3])
# slicing - [start:stop:step]
a[1:6:2] # starts at index 1, goes up to (but not including) index 6, and selects every second element
# reversing a list
b[::-1] # reverses the mixed list defined above; this would take a lot more effort to do in C++!
# tuples - immutable list, cant be changed
t = (1,2,3) # use () instead of []
# dict - d = {'key':'value', ..} is an unordered mutable key:value pairs {"name":"frankie","age":33}
# Dictionary is quite useful in matrix indexing
import numpy as np # numpy provides the array type used below
m=np.array([[1,2,3],[4,5,6],[7,8,9]])
col_names={'age':0, 'weight':1, 'height':2}
row_names={'aa':0, 'cc':1, 'bb':2}
# now we can get the weight of 'cc' using either actual indexes or dict-based indexes
m[1,1] # 5
m[row_names['cc'],col_names['weight']] # 5
# set - s = set(['a','b','c']) - an unordered collection of unique objects
# When Python shows the output it looks like a dictionary, e.g. {"a","b"}, but it is not one, because it doesn't have key:value pairs
set([1,1,2,3]) # output: {1,2,3} , List can be passed to set()
set("Mississippi") # output: {'M', 'i', 'p', 's'} , Even strings can be passed to set
# Matrices - A matrix is a two-dimensional rectangular data set. It can be created using numpy's array() function.
# Create a matrix
import numpy as np # we need to import the numpy library which provides tools for numerical computing.
m=np.array([[1,2,3],[4,5,6],[7,8,9]])
print(type(m))
# Arrays - while matrices are confined to two dimensions, arrays can be of any number of dimensions.
# Create an array.
import numpy as np # we need to import the numpy library which provides tools for numerical computing.
a=np.array([1,2,3]) # this is a 1-dimensional array
print(type(a))
# Convert a list to an array
a=[1,2,3,4]
a=np.array(a) # array([1, 2, 3, 4])
# DataFrame - this is an advanced object that can be used by installing the pandas library. If you are familiar with R, this is similar to data.frame. If you are familiar with Excel, you can think of a dataframe as a table with rows and columns, where rows and columns can potentially have names/labels. You can access data within the dataframe using row/column numbers (indexing starts from 0) or their labels.
import pandas as pd
# From dict
pd.DataFrame({'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']})
# from list
pd.DataFrame(['orange', 'mango', 'grapes', 'apple'], index=['a', 'b', 'c', 'd'], columns =['Fruits'])
# from list of lists
pd.DataFrame([['orange','tomato'],['mango','potato'],['grapes','onion'],['apple','chilly']], index=['a', 'b', 'c', 'd'], columns =['Fruits', 'Vegetables'])
# from multiple lists
pd.DataFrame(
list(zip(['orange', 'mango', 'grapes', 'apple'],
['tomato', 'potato', 'onion', 'chilly']))
, index=['a', 'b', 'c', 'd']
, columns =['Fruits', 'Vegetables'])
3. A string of characters where you can store names, addresses, or any other kind of text
Any value written within a pair of single quote or double quotes in Python is treated as a string.
Key idea here is to learn how to manipulate string variables
There are a few common operations that we will focus on:
a. Concatenate strings
# Concatenate strings using the + operator
str1 = "Hello,"; str2 = "World"; str3 = "!"
str1 + " " + str2 + str3 # 'Hello, World!'
b. Counting number of characters in a string
# Counting number of characters in a string
str1 = "vivek"
len(str1)
c. Changing the case - upper() & lower() methods
str1.upper() # convert string to upper case (.lower() for lower case)
str1.isupper(), str1.islower() # check if a string or a character is upper or lower
d. Splitting a string
s = "vivek"
s.split('e') # ['viv', 'k'] - returns the list of substrings around 'e'; if there are multiple e's, the split happens at every e
e. Reversing a string (a quick first step for a palindrome check)
str1 = "vivek"
str1[::-1] # 'keviv'
4. Some advanced data types such as lists which can store a series of regular variables (such as a series of integers)
Lists are a series of variables stored together in one variable. Lists can be one-dimensional or multi-dimensional. A list is created using square brackets [] or the list() function, and it can hold variables of any type (even other lists). A list is different from a string because its elements can be mutated/changed.
# Defining
L=[0,0,0] # [0, 0, 0]
L1=[0]*3 #shorthand way of defining a list with repeated elements
# Supports indexing and slicing
L1=['one', 'two', 'three']
L1[0] # 'one'
L1[1:2] # ['two'], upper bound is excluded
L1[1:3] # ['two', 'three']
# Indexing nested lists
L1 = ['one', 'two', ['three', 'four'], 'five']
L1[2][0] # 'three'
# Elements can be added
L1.append('six')
# Elements can be removed
L1.pop() # last element gets popped, we can save it in a variable also
# Sort
L1.sort() # sorts the list in-place, the actual list gets sorted
sorted(L1) # returns a sorted copy of the L1 list, without modifying L1
# Reverse
L1=['c','a','b']
L1.reverse() # reverses the list in-place, the actual list gets reversed
# Multi dimentional list indexing
L1=[[1,2,3],[4,5,6],[7,8,9]]
L1[0][:] # returns first row
5. Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don’t write the code for that 10 times; you write it just once and tell the computer to loop through it 10 times
Python has several looping options such as ‘for’ and ‘while’. There are also options of nesting (single, double, triple, ..) loops.
a. The While loop executes the same code again and again until a stop condition is met:
# Syntax
while test:
code statements
else:
final code statements
# Example
x = 0
while x < 10:
print('x is currently: ',x)
print(' x is still less than 10, adding 1 to x')
x+=1
b. The for loop: acts as an iterator in Python; it goes through items that are in a sequence or any other iterable item. Objects that we’ve learned about that we can iterate over include strings, lists, tuples, and even built-in iterables for dictionaries, such as keys or values.
# Syntax
for item in object:
statements to do stuff
# Example
list1 = [1,2,3,4,5,6,7,8,9,10]
for num in list1:
print(num)
Python also provides the break, continue and pass statements that allow us to alter the loops further. Following is their use:
break: Breaks out of the current closest enclosing loop.
continue: Goes to the top of the closest enclosing loop.
pass: Does nothing at all.
# Thinking about break and continue statements, the general format of the while loop looks like this:
while test:
    code statements
    if test:
        break
    if test:
        continue
else:
    code statements # the else block runs only if the loop finishes without hitting break
break and continue statements can appear anywhere inside the loop’s body, but we will usually put them further nested in conjunction with an if statement to perform an action based on some condition.
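As a concrete sketch (the numbers here are only illustrative), break and continue nested under if statements inside a for loop look like this:
# Illustrative example: skip even numbers, stop entirely at 7
for num in range(1, 10):
    if num % 2 == 0:
        continue # go back to the top of the loop for the next number
    if num == 7:
        break # exit the loop completely
    print(num)
# Output: 1 3 5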
6. Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Python provides if.., if..else.., and if..else..if.. statements to apply conditional logic. Lets take a look at them:
a. The basic syntax for creating an if statement is:
if False:
print('It was not true!')
b. The basic syntax for creating an if…else statement is:
x = False
if x:
print('x was True!')
else:
print('I will be printed in any case where x is not true')
c. The basic syntax for creating an if…else if…else statement is:
loc = 'Bank'
if loc == 'Auto Shop':
print('Welcome to the Auto Shop!')
elif loc == 'Bank':
print('Welcome to the bank!')
else:
print('Where are you?')
7. Put your code in functions
Functions allow us to create a block of code that can be executed many times without needing to write it again.
# Syntax
def name_of_function(argument_name='default value'): # snake casing for the name: all lower case letters with underscores
    '''
    what the function does (docstring)
    '''
    print('hello', argument_name)
    print(f'hello {argument_name}') # both print statements do the same thing
# Example
def add_function(a=0,b=0):
return a+b
We can call the function in the following two ways:
# option 1
add_function(2,3)
# option 2
c=add_function(3,4)
*args and **kwargs stand for arguments and keyword arguments and allow us to extend the functionality of functions.
*args lets a function take an arbitrary number of arguments. All arguments are received as a tuple, example - (a,b,c,..). args can be renamed to something else, what really matters is *.
def myfunc(*args):
return args
'''
myfunc(1,2,3,4,5,6,7,8,9)
Out[30]: (1, 2, 3, 4, 5, 6, 7, 8, 9)
'''
**kwargs lets the function take an arbitrary number of keyword arguments. All arguments are received as a dictionary of key,value pairs. kwargs can be renamed to something else, what really matters is **.
def myfunc(**kwargs):
print(kwargs)
'''
myfunc(name='vivek', age=34, height=186)
{'name': 'vivek', 'age': 34, 'height': 186}
'''
8. Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Python allows users to create classes. These can be a combination of variables and functions that operate on those variables. Let’s take a look at how we can define and use them.
# Define a class
class Person:
"This is a person class"
age = 10
def greet(self):
print('Hello')
# Using class
print(Person.age) # Output: 10
print(Person.greet) # Output: <function Person.greet>
print(Person.__doc__) # Output: 'This is a person class'
# Creating an object of the class and using that
vivek = Person() # create a new object of Person class
print(vivek.greet) # Output: <bound method Person.greet of <__main__.Person object>>
vivek.greet() # Calling object's greet() method; Output: Hello
9. Read file from a disk and save file to a disk
Let’s see how to read and write a CSV file in an organized way. CSV is the most common file type you will be using for data science; however, Python can read several other file types, and data directly from websites, as well.
import pandas
# read a csv using the pandas package
df = pandas.read_csv('student_data.csv')
print(df)
# write data to a csv using pandas package
df.to_csv('student_data_copy.csv')
10. Ability to comment your code so you can understand it when you revisit it some time later
We can tell Python that a line of code is a comment by starting it with a #.
# this is a comment
We can tell that a multi-line block of text is a comment by enclosing it in triple inverted single quotes.
'''
this
is
a
comment
block
'''
Overall, Python is a versatile and powerful programming language that is well-suited for a wide range of programming tasks. With its emphasis on simplicity, object-oriented design, and a large and growing ecosystem of third-party libraries and tools, Python is a valuable tool for both beginner and experienced programmers. Whether building web applications, analyzing data, or working on artificial intelligence projects, Python provides a fast, flexible, and enjoyable development experience.
To close I will emphasize the importance of practicing in learning anything new. Persistence and trying out different combinations of these building blocks for solving easier problems first and more complex ones later on is the only way to become fluent.
Comments welcome!
-
Introduction to Programming in Julia
Quick Introduction to Julia
Julia is a high-level, high-performance programming language that was created in 2012 by a team of computer scientists led by Jeff Bezanson, Stefan Karpinski, and Viral Shah. Julia was designed to address the limitations of traditional scientific computing languages, such as MATLAB, Python, and R, while still retaining their ease of use and flexibility.
One of the key features of Julia is its performance. Julia is designed to be fast, with execution speeds comparable to those of compiled languages such as C and Fortran. This is achieved through a combination of just-in-time (JIT) compilation, which compiles code on the fly as it is executed, and type inference, which allows Julia to determine the data types of variables at runtime.
Another important feature of Julia is its support for multiple dispatch. Multiple dispatch allows Julia to select the appropriate method to use based on the types of the arguments being passed to a function. This makes Julia a flexible and expressive language that can be easily extended and customized to fit a wide range of programming tasks.
Julia also includes a number of built-in data structures and libraries that make it easy to work with arrays, matrices, and other scientific computing tools. These include tools for linear algebra, statistics, optimization, and machine learning, as well as support for distributed computing and parallelism.
In addition to its scientific computing features, Julia also includes support for general-purpose programming tasks, such as web development, database access, and file I/O. Julia’s growing package ecosystem provides a wide range of libraries and tools for these tasks, making it a versatile language that can be used for a variety of programming tasks.
One of the key benefits of Julia is its community. Julia has a rapidly growing community of developers and users who are actively contributing to the language and its ecosystem. This community has created a large number of high-quality packages, as well as a number of online resources and forums for learning and discussing the language.
Most modern programming languages share a similar set of building blocks, for example
Receiving input from the user and showing output to the user
Ability to store values in variables (usually of different kinds such as integers, floating point numbers or characters)
A string of characters where you can store names, addresses, or any other kind of text
Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don’t write the code for that 10 times; you write it just once and tell the computer to loop through it 10 times
Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes, else fails
Put your code in functions
Advanced data types that are formed through a combination of one or more types of basic data types, such as structures or classes
Read files from disk and save files to disk
Ability to comment your code so you can understand it when you revisit it some time later
Let’s dive right in and see how we can do these things in Julia.
0. How to install Julia on your desktop?
Before we can begin to write a program in Julia, we need to install Julia. Next you can install VSCode. Now launch VSCode and install the Julia (by julialang) extension. Now you can create a new test.jl file, add the following code, and see if it runs.
4+2; # If you don't want to see the result of the expression printed, use a semicolon at the end of the expression
ans; # the value of the last expression you typed on the REPL, it's stored within the variable ans
Before we dive in, chaining functions is possible in Julia, like so:
1:10 |> collect
1. Receiving input from the user and Showing output to the user
There are several ways in which we can show output to the user. Let’s look at some ways of showing output:
# receiving input from user
name = readline(stdin)
# showing output to user
println("you name is ", name)
2. Ability to store values in variables (usually of different kinds such as integers, floating points or character)
Names of variables are in lower case. Word separation can be indicated by underscores.
Julia has several types of variables broadly classified into Concrete and abstract types. The types that can have subtypes (e.g. Any, Number) are called abstract types. The types that can have instances are called concrete types. These types cannot have any subtypes.
Concrete types can be further divided into primitive (or basic), and complex (or composite). Let’s take a deeper look:
# Primitive types
## the basic integer and float types (signed and unsigned): Int8, UInt8, Int16, UInt16, Int32, UInt32, Int64, UInt64, Int128, UInt128, Float16, Float32, and Float64
a = 10
## more advanced numeric types: BigFloat, BigInt
a = BigInt(2)^200
## Boolean and character types: Bool and Char
selected = true
## Text string types: String
name = "vivek"
# Composite type
## Rational, used to represent fractions. It is composed of two pieces, a numerator and a denominator, both integers (of type Int)
666//444 # To make rational numbers, use two slashes (//)
Some advanced data types include dictionaries and sets. Sets are similar to arrays, with the difference that they don’t allow duplicate elements.
dict = Dict("a" => 1, "b" => 2, "c" => 3)
dict = Dict{String,Integer}("a"=>1, "b" => 2) # If you know the types of the keys and values in advance, you can specify them after the Dict keyword, in curly braces
# looking things up
dict["a"]
values(dict) # to retrieve all values
keys(dict) # to retrieve all keys
# these can be useful for iterating over a dictionary
for k in keys(dict) println(k) end
for (key, value) in dict println(key, " => ", value) end
merge(dict, Dict("z" => 26)) # merge() merges two dictionaries (the second one here is just an illustrative literal)
findmin(dict) # find the minimum value in a dictionary, and return the value and its key
filter(p -> p.second > 1, dict) # filter() keeps only the pairs that satisfy the predicate (here: value greater than 1)
# sort dict - you can use the SortedDict data type from the DataStructures.jl package
using Pkg; Pkg.add("DataStructures") # install the package once
import DataStructures
dict = DataStructures.SortedDict("b" => 2, "c" => 3, "d" => 4, "e" => 5, "f" => 6)
# Sets - A set is a collection of elements, just like an array or dictionary, with no duplicated elements.
colors = Set{String}(["red","green","blue","yellow"])
push!(colors, "black") # You can use push!() to add elements to a set
rainbow = Set(["red","orange","yellow","green","blue","indigo","violet"]) # a second set, defined here so the examples below run
union(colors, rainbow) # The union of two sets is the set of everything that is in one or the other set
intersect(colors, rainbow) # The intersection of two sets is the set that contains every element that belongs to both sets
setdiff(colors, rainbow) # The difference between two sets is the set of elements that are in the first set, but not in the second
We will discuss abstract data types in section 8 below.
3. A string of characters where you can store names, addresses, or any other kind of text
Any value written within a pair of double quotes in Julia is treated as a string.
"this is a string"
# double quotes and dollar signs need to be preceded (escaped) with a backslash
"""this is "a" string with double quotes""" # triple double quotes can be used to store strings with double quotes in them
Julia also allows the user to indicate special strings.
# special strings
r" " indicates a regular expression
v" " indicates a version string
b" " indicates a byte literal
raw" " indicates a raw string that doesn't do interpolation
Key idea here is to learn how to manipulate string variables. There are a few common operations that we will focus on:
a. Concatenate strings
# Concatenate strings
s1 = "Hello"; s2 = "Julia"
string(s1, ", ", s2) # "Hello, Julia" - string() concatenates its arguments; the * operator also works
join([s1, s2], ", ") # join() concatenates the elements of an array, with an optional separator
b. Counting number of characters in a string
# Counting number of characters in a string
length(str) # to find the length of a string
lastindex(str) # to find index of last char of string
c. Changing the case - uppercase() & lowercase() functions
uppercase(s)
d. Splitting a string
split("You know my methods, Watson.") # by default splits on space
split("You know my methods, Watson.", 'W') # splits on the char W
# If you want to split a string into separate single-character strings, use the empty string ("")
split("You know my methods, Watson.", r"a|e|i|o|u"; keepempty=false) # splits the string on any character that matches a vowel
# keepempty=false makes sure that empty strings are not returned
e. String interpolation
# string interpolation - use the results of Julia expressions inside strings.
x = 42
"The value of x is $(x)." # "The value of x is 42."
f. Iterate over a string
for char in s # iterate through a string
print(char, "_")
end
g. Get index of all characters in a string
for i in eachindex(str)
@show str[i]
end
h. Converting between numbers and strings
a = BigInt(2)^200
a=string(a) # convert number to string
parse(BigInt, a) # convert strings to numbers
i. Finding and replacing things inside strings
s = "My dear Frodo";
in('M', s) # true
occursin("Fro", s) # true
findfirst("My", s) # 1:2
replace(s, "Frodo" => "Frodo Baggins")
There are a lot of other functions as well:
length(str) - length of string
sizeof(str) - length/size
startswith(strA, strB) - does strA start with strB?
endswith(strA, strB) - does strA end with strB?
occursin(strA, strB) - does strA occur in strB?
all(isletter, str) - is str entirely letters?
all(isnumeric, str) - is str entirely number characters?
isascii(str) - is str ASCII?
all(iscntrl, str) - is str entirely control characters?
all(isdigit, str) - is str 0-9?
all(ispunct, str) - does str consist of punctuation?
all(isspace, str) - is str whitespace characters?
all(isuppercase, str) - is str uppercase?
all(islowercase, str) - is str entirely lowercase?
all(isxdigit, str) - is str entirely hexadecimal digits?
uppercase(str) - return a copy of str converted to uppercase
lowercase(str) - return a copy of str converted to lowercase
titlecase(str) - return copy of str with the first character of each word converted to uppercase
uppercasefirst(str) - return copy of str with first character converted to uppercase
lowercasefirst(str) - return copy of str with first character converted to lowercase
chop(str) - return a copy with the last character removed
chomp(str) - return a copy with the last character removed only if it's a newline
4. Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Arrays can be one-dimensional or multi-dimensional. An array is created using square brackets, the Array constructor, or several other methods. Arrays support a lot of functionality within Julia, so I have covered them in more detail in a separate array-specific article. For now let’s check out the key functionality.
# Defining
# Creating arrays by initializing
arr_Int64 = [1, 2, 3, 4, 5]
# Creating empty arrays
b = Int64[]
# Creating 2-d arrays
arr_2d = [1 2 3 4] # If you leave out the commas when defining an array, you can create 2D arrays quickly. Here's a single row, multi-column array:
arr_2d = [1 2 3 4 ; 5 6 7 8] # you can add another row using ;
# Creating arrays using range objects
a = 1:10 # creates a range variable with 10 elements from 1 to 10
collect(a) # collect displays a range variable
[a...] # instead of collect, you could use the ellipsis (...) operator (three periods) after the last element
range(1, length=12, stop=100) # Julia calculates the missing pieces for you by combining the values for the keywords step(), length(), and stop()
# Using comprehensions and generators to create arrays
[n^2 for n in 1:5] # a 1-d array
[r * c for r in 1:5, c in 1:5] # a 2-d array
# Reshape an array to create a multi-dimentional array
reshape([1, 2, 3, 4, 5, 6, 7, 8], 2, 4) # create a simple array and then change its shape
# Supports indexing and slicing
# 1-d
a[5] # 5th element
a[end] # last element
a[end-1] # second last element
# 2-d
a = [[1, 2] [3,4]]
a[2,2] # element at row-2 x col-2
a[:,2] # all elements of col-2
getindex(a, 2,2) # same as a[2,2]
# Elements can be added
a = Array[[1, 2], [3,4]]
push!(a, [5,6]) # The push!() function pushes another item onto the back of an array
pushfirst!(a, 0) # To add an item at the front
# splice!() can also insert elements into an array at a given index:
splice!(a, 4:5, 4:6) # insert, at position 4:5, the range of numbers 4:6
L = ['a','b','f']; splice!(L, 3:2, ['c','d','e']) # insert c, d, e between b and f
# Elements can be removed
splice!(a,5); # If you don't supply a replacement, splice!() removes elements and moves the rest of them along
pop!(a) # To remove the last item
popfirst!(a)
# Elementwise and vectorized operations
a / 100 # every element of the new array is the original divided by 100. These operations operate elementwise
n1 = 1:6;
n2 = 2:7;
n1 .* n2; # if two arrays are to be multiplied then we just add a . before the mathematical operator to signify elementwise
# the first element of the result is what you get by multiplying the first elements of the two arrays, and so on
# How function works on individual variables
f(a, b) = a * b
a=10;b=20;print(f(a,b))
# How function can be applied elementwise to arrays
n1 = 1:6;
n2 = 2:7;
print(f.(n1, n2))
5. Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don’t write the code for that 10 times; you write it just once and tell the computer to loop through it 10 times
Julia has several looping options such as ‘for’ and ‘while’. There are also options of nesting (single, double, triple, ..) loops.
a. The While loop executes the same code again and again until a stop condition is met:
# while end - iterative conditional evaluation
x=0
while x < 4
println(x)
global x += 1
end
b. The for loop acts as an iterator in Julia; it goes through items that are in a sequence or any other iterable item. Objects we can iterate over include strings, arrays, tuples, and even dictionaries and sets.
# for end - iterative evaluation
# note: z below is local to the loop body; use the global keyword if you need a variable that outlasts the loop
for i in 1:10
z = i
println("z is $z")
end
# Some sample for loop statements for different data types
for color in ["red", "green", "blue"] # an array
for letter in "julia" # a string
for element in (1, 2, 4, 8, 16, 32) # a tuple
for i in Dict("A"=>1, "B"=>2) # a dictionary
for i in Set(["a", "e", "a", "e", "i", "o", "i", "o", "u"])
Julia also provides the break and continue statements that allow us to alter the loops further. Following is their use:
break: Breaks out of the current closest enclosing loop.
continue: Goes to the top of the closest enclosing loop.
# Example with break statement
x=0
while true
println(x)
x += 1
x >= 4 && break # breaks out of the loop
end
break and continue statements can appear anywhere inside the loop’s body, but we will usually put them further nested in conjunction with an if statement to perform an action based on some condition.
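For completeness, a small sketch of continue (the range and condition are just illustrative):
# Example with continue statement
for i in 1:6
    iseven(i) && continue # skip even numbers and jump to the next iteration
    println(i) # prints 1, 3, 5
end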
Following are some other looping options:
# list comprehensions
[i^2 for i in 1:10]
[(r,c) for r in 1:5, c in 1:2] # two iterators in a comprehension
# Generator expressions - generator expressions can be used to produce values from iterating a variable
sum(x^2 for x in 1:10)
# Enumerating arrays
m = rand(0:9, 3, 3)
[i for i in enumerate(m)]
# Zipping arrays
for i in zip(0:10, 100:110, 200:210)
println(i)
end
# Iterable objects
ro = 0:2:100
[i for i in ro]
6. Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Julia provides several options to apply conditional logic. Lets take a look at them:
a. ternary and compound expressions:
x = 1
x > 3 ? "yes" : "no"
b. Boolean switching expressions:
isodd(1000003) && @warn("That's odd!")
isodd(1000004) || @warn("That's odd!")
c. if elseif else end - conditional evaluation:
name = "Julia"
if name == "Julia"
println("I like Julia")
elseif name == "Python"
println("I like Python.")
println("But I prefer Julia.")
else
println("I don't know what I like")
end
d. Error handling using try.. catch. This allows the code to keep executing even if an error occurs, which would usually halt the program.
# try catch error throw exception handling
try
<statement-that-might-cause-an-error>;
catch e # error gets caught if it happens
println("caught an error: $e") # show the error if you want to
end
println("but we can continue with execution...")
# Example 1 - error doesn't occur
try
a=10 # no error
catch e
print(e)
end
# Example 2 - error occurs
try
la-la-la # undefined variable error
catch e
print(e)
end
7. Put your code in functions
Functions allow us to create a block of code that can be executed many times without needing to write it again.
Julia has something called a single expression function. These are usually defined in one line like so:
# Single expression functions
f(x) = x * x
g(x, y) = sqrt(x^2 + y^2)
Functions with multiple expressions are also supported and can be defined using the function keyword:
# Syntax
# Functions with multiple expressions
function say_hello(name)
println("hello ", name)
end
say_hello("vivek")
Additionally, functions can be programmed to return a single value or multiple values using the return keyword.
# define function which returns a value
function add_numbers(a,b)
return a+b
end
# call the function
add_numbers(2,3)
# define function which returns multiple values
function add_multiply_numbers(a, b=10) # we can supply default values as well
return(a+b, a*b)
end
# call the function
add_multiply_numbers(2,3)
add_multiply_numbers(2)
args… lets a function take an arbitrary number of arguments. A for loop can be used to iterate over these arguments.
function show_args(args...)
for arg in args
println(arg," ")
end
end
show_args(10,20,25,35,50)
Julia also supports anonymous functions, with no name.
map((x,y,z) -> x + y + z, [1,2,3], [4, 5, 6], [7, 8, 9])
Map and reduce can also be used to apply functions to arrays.
Map - If you already have a function and an array, you can call the function for each element of the array by using map()
a=1:10;
map(sin, a) # map() returns a new array but if you call map!() , you modify the contents of the original array
The map() function collects the results of some function working on each and every element of an iterable object, such as an array of numbers.
map(+, 1:10)
The reduce() function does a similar job, but after every element has been seen and processed by the function, only one is left. The function should take two arguments and return one.
reduce(+, 1:10)
8. Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Julia allows users to create user-defined types using abstract type (which are abstract) or mutable struct (which are concrete). Let’s take a look at both.
Abstract type
abstract type MyAbstractType end # By default, the type you create is a direct subtype of Any
abstract type MyAbstractType2 <: Number end # the new abstract type is a subtype of Number
Concrete type using mutable struct
# define the data type
mutable struct student <: Any
name
age::Int
end
# initialize a variable of that data type
x=student("vivek", 30)
# use the variable
x.name
x.age
9. Read file from a disk and save file to a disk
Let’s see how to read a file in an organized way.
f = open("sherlock-holmes.txt") # To read text from a file, first obtain a file handle:
close(f) # When you've finished with the file, you should close the connection
If you use the following technique then you don’t need to close the file explicitly. The open file is automatically closed when this block finishes.
open("sherlock-holmes.txt") do file
# do stuff with the open file
end
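For example, a small sketch that reads the file and counts its lines (the filename is the same illustrative one used above):
open("sherlock-holmes.txt") do file
    lines = readlines(file) # read all lines into an array of strings
    println("number of lines: ", length(lines))
end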
10. Ability to comment your code so you can understand it when you revisit it some time later
We can tell Julia that a line of code is a comment by starting it with a #.
# this is a comment
Overall, Julia is a powerful and flexible programming language that is well-suited for scientific computing and other high-performance tasks. With its emphasis on performance, multiple dispatch, and a growing ecosystem of packages and tools, Julia is a valuable tool for researchers, data scientists, and other professionals who need a fast, flexible, and expressive language for their work.
To close I will emphasize the importance of practicing in learning anything new. Persistence and trying out different combinations of these building blocks for solving easier problems first and more complex ones later on is the only way to become fluent.
Comments welcome!
-
Perspective: A Lesson from The Kite Runner
Have you ever looked back on a moment in your life and realized you saw it completely differently at the time? Our perspective shapes the way we understand events, people, and even ourselves. Khaled Hosseini’s The Kite Runner masterfully explores the power of perspective through its protagonist, Amir, and his journey of redemption. The novel provides several poignant moments where a shift in perspective redefines reality, reminding us of the importance of seeing beyond our own biases and assumptions.
A Child’s Perspective: The Privilege of Innocence
In the beginning, Amir enjoys a privileged life in Kabul, unaware of the deep societal divides that separate him from Hassan, his Hazara servant and best friend. To Amir, their friendship is pure and unaffected by status. However, Hassan, though younger, understands the weight of their differences. One of the most heartbreaking moments occurs when Amir fails to stand up for Hassan in the alley. From Amir’s limited perspective, his silence is self-preservation, but with time, he realizes it was cowardice—a realization that haunts him into adulthood.
“I ran because I was a coward. I was afraid of Assef and what he would do to me.” This self-awareness only develops later, demonstrating how perspective matures with experience.
The Father-Son Lens: Misunderstood Love
Baba, Amir’s father, is another character whose perspective is misunderstood. Amir believes Baba favors strength and physical courage over intellect, leading to deep insecurities. However, as the novel unfolds, Amir learns of Baba’s sacrifices and hidden struggles—his illegitimate son, his moral dilemmas, and the burden of expectations.
A key moment of realization comes when Baba tells Amir, “There is only one sin, only one. And that is theft… When you tell a lie, you steal someone’s right to the truth.” This lesson, initially abstract to Amir, takes on a new meaning as he matures and understands the gravity of deception—not just in others, but within himself.
Redemption and a Shift in Perspective
Perspective is often best understood in hindsight. Amir’s journey to atone for his past mistakes brings him back to Afghanistan, where he sees his homeland through the eyes of suffering. The Taliban’s rule has reshaped the Kabul of his childhood into an unrecognizable and brutal landscape. His perception of Hassan also shifts dramatically when he discovers the truth about their relationship—that they were brothers.
His final act—rescuing Sohrab—is not just a physical redemption but a transformation of his worldview. He finally understands what it means to be truly selfless, to take action rather than remain passive.
Final Thoughts: Expanding Our Own Perspective
Amir’s journey reminds us that perspective is ever-changing, molded by experience, knowledge, and time. Whether in literature or in life, understanding different perspectives fosters empathy and growth. Just like Amir, we must be willing to look beyond our immediate view and challenge our own biases.
After all, true transformation begins when we allow ourselves to see the world through another’s eyes. How has a shift in perspective changed the way you see a person or situation in your own life?
-
Introduction to Programming in Ruby
Quick Introduction to Ruby
Ruby is a high-level, interpreted programming language that was created in the mid-1990s by Yukihiro “Matz” Matsumoto. It is a general-purpose language that is designed to be easy to use and read, with syntax that is similar to natural language. Ruby is often used for web development, as well as for building command-line utilities, desktop applications, and other types of software.
One of the key features of Ruby is its emphasis on programmer productivity and ease of use. Ruby’s syntax is designed to be intuitive and easy to read, making it accessible to both beginner and experienced programmers. Ruby also includes a number of built-in features and libraries that make it easy to accomplish common programming tasks, such as working with strings, arrays, and hashes.
Another important feature of Ruby is its object-oriented programming model. Everything in Ruby is an object, and methods can be defined on objects to add functionality. Ruby also includes support for inheritance, encapsulation, and polymorphism, which makes it a powerful tool for building complex software systems.
Ruby is also known for its extensive library of open-source gems, which are pre-built packages of code that can be easily integrated into Ruby projects. These gems provide a wide range of functionality, from database access to web development frameworks, and can save developers a significant amount of time and effort in building software.
One of the most popular web development frameworks built in Ruby is Ruby on Rails. Rails is a full-stack web framework that provides a set of conventions and tools for building web applications quickly and easily. With its focus on developer productivity, Rails has become a popular choice for startups and small businesses, as well as for larger enterprises.
Most modern programming languages have a set of similar building blocks, for example:
Receiving input from the user and Showing output to the user
Ability to store values in variables (usually of different kinds such as integers, floating points or character)
A string of characters where you can store names, addresses, or any other kind of text
Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don't write that code 10 times; you write it once and tell the computer to loop through it 10 times
Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Put your code in functions
Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Read file from a disk and save file to a disk
Ability to comment your code so you can understand it when you revisit it some time later
Let's dive right in and see how we can do these things in Ruby.
0. How to install Ruby on your desktop?
Before we can begin writing programs in Ruby, we need to set up our ruby environment.
You can install Ruby from here ruby-lang.org.
Additionally, you need to install an IDE to write and execute Ruby code. My personal favorite is code.visualstudio.com.
Lastly, you will also need to install the following extensions within VSCode: Ruby (Peng Lv) and Code Runner (Jun Han).
Now, let's write a simple program that prints out hello world for the user to see:
print 'Hello World !!!'
1. Receiving input from the user and Showing output to the user
There are several ways in which we can show output to the user. Let’s look at some ways of showing output:
#Method 1:
print 'Hello World !!!'
#Method 2:
p 'Hello World !!!'
#Method 3:
puts 'Hello World !!!'
#Method 4: Showing data stored in variables to user
my_name = "Vivek"
puts "Hello #{my_name}"
#Method 5: Showing multiple variables using same puts statement
aString = "I'm a string!"
aBoolean = true
aNumber = 42
puts "string: #{aString} \nboolean: #{aBoolean} \nnumber: #{aNumber}"
2. Ability to store values in variables (usually of different kinds such as integers, floating points or character)
There are three main types of variable:
Strings (a collection of symbols inside speech marks)
Booleans (true or false)
Numbers (numeric values)
Following are some examples:
aString = "I'm a string!"
aBoolean = true
aNumber = 42
puts "string: #{aString} \nboolean: #{aBoolean} \nnumber: #{aNumber}"
Performing basic math on numeric variables. There are 6 types of basic operations: addition, subtraction, multiplication, division, modulo and exponent.
a = 5
b = 2
puts "sum: #{a+b}\
\ndifference: #{a-b}
\nmultiplication: #{a*b}
\ndivision: #{a/b}
\nmodulo: #{a%b}
\nexponent: #{a**b}"
3. A string of characters where you can store names, addresses, or any other kind of text
You can use single quotes or double quotes for strings - either one is acceptable.
myFirstString = 'I am a string!' #single quotes
mySecondString = "Me too!" #double quotes
There are a few common operations that we will focus on:
"Hi!".length #is 3
"Hi!".reverse #is !iH
"Hi!".upcase #is HI!
"Hi!".downcase #is hi!
# You can also use many methods at once. They are solved from left to right.
"Hi!".downcase.reverse #is !ih
# If you want to check if one string contains another string, you can use .include?.
"Happy Birthday!".include?("Happy")
4. Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Arrays allow you to group multiple values together in a list. Each value in an array is referred to as an “element”.
a. Defining an array:
myArray = [] # an empty array
myOtherArray = [1, 2, 3] # an array with three elements
b. Accessing array elements:
# In order to add to or change elements in an array, you can refer to an element by number.
myOtherArray[3] = 4
Ruby has another advanced data type called Hash, which is similar to a Python dictionary. Just like arrays, hashes allow you to store multiple values together. However, while arrays store values with a numerical index, hashes store information using key-value pairs. Each piece of information in the hash has a unique label, and you can use that label to access the value.
a. To create a hash, use Hash.new, or myHash={}. For example:
myHash=Hash.new()
myHash["Key"]="value"
myHash["Key2"]="value2"
# or
myHash={
"Key" => "value",
"Key2" => "value2"
}
b. To access elements of a hash:
puts myHash["Key"] # puts value
Instead of using a string as a key, you can also use a symbol, like this:
a. To create a hash with symbol keys, use Hash.new or myHash={}. For example:
myHash=Hash.new()
myHash[:Key]="value"
myHash[:Key2]="value2"
# or
myHash={
Key: "value",
Key2: "value2",
}
b. To access elements of a hash:
puts myHash[:Key] # puts "value"
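Once a hash is populated, you can walk over all of its key-value pairs with each. A small sketch using the hash defined above:
# iterate over all key-value pairs of the hash
myHash.each do |key, value|
  puts "#{key}: #{value}"
end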
5. Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don't write that code 10 times; you write it once and tell the computer to loop through it 10 times
Ruby has several looping options (for, while, and until). There are also options for nesting loops (single, double, triple, and so on).
a. For loop executes code once for each element in expression. Following example shows how a for loop works:
# Syntax
for variable [, variable ...] in expression [do]
code
end
# Example
for i in 0..5
puts "Value of local variable is #{i}"
end
b. While loop executes code while conditional is true. A while loop's conditional is separated from code by the reserved word do, a newline, backslash \, or a semicolon ;. Following example shows how a while loop works:
# Syntax
while conditional [do]
code
end
# Example
a=1
b=5
while a<=b
puts "run #{a}"
a=a+1
end
# Ruby while modifier - Executes code while conditional is true.
code while condition
# or
begin # If a while modifier follows a begin statement with no rescue or ensure clauses, code is executed once before conditional is evaluated.
code
end while conditional
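To make the modifier form concrete, here is a minimal sketch (the variable name counter and the limit are just illustrative):
# while modifier: the statement keeps running while the condition is true
counter = 0
counter += 1 while counter < 5
puts counter # prints 5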
c. Until loop executes code while conditional is false. An until statement's conditional is separated from code by the reserved word do, a newline, or a semicolon. Following example shows how an until loop works:
# Syntax
until conditional [do]
code
end
# Example
$i = 0
$num = 5
until $i > $num do
puts("Inside the loop i = #$i" )
$i +=1;
end
# Ruby until modifier - Executes code while conditional is false.
code until conditional
# or
begin # If an until modifier follows a begin statement with no rescue or ensure clauses, code is executed once before conditional is evaluated.
code
end until conditional
d. Ruby also offers following keywords that can modify the behavior of the above loops:
# break - Terminates the most internal loop. Terminates a method with an associated block if called within the block (with the method returning nil).
# next - Jumps to the next iteration of the most internal loop. Terminates execution of a block if called within a block (with yield or call returning nil).
# redo - Restarts this iteration of the most internal loop, without checking loop condition. Restarts yield or call if called within a block.
# retry - If retry appears in rescue clause of begin expression, restart from the beginning of the begin body.
# retry - If retry appears in the iterator, the block, or the body of the for expression, restarts the invocation of the iterator call. Arguments to the iterator are re-evaluated.
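A short sketch showing break and next inside a while loop (the specific numbers are just illustrative):
i = 0
while i < 10
  i += 1
  next if i == 3   # skip the rest of this iteration when i is 3
  break if i == 6  # leave the loop entirely when i reaches 6
  puts i
end
# prints 1, 2, 4 and 5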
6. Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Conditionals are used to add branching logic to your programs; they allow you to include complex behaviour that only occurs under specific conditions.
a. If - if condition is an expression that can be checked for truth. If the expression evaluates to true, then the code within the block is executed.
if condition
something to be done
end
# Ruby if modifier - executes code if the conditional is true.
code if condition
Following is an actual example of an if statement with both an elsif and an else.
booleanOne = true
randomCode = "Hi!"
if booleanOne
puts "I will be printed!"
elsif randomCode.length>=1
puts "Even though the above code is true, I won't be executed because the earlier if statement was true!"
else
puts "I won't be printed because the if statement was executed!"
end
b. If Else - You can combine if with the keyword else. This lets you execute one block of code if the condition is true, and a different block if it is false. The else block will only be executed if the if block doesn’t run, so they will never both be executed.
if condition
something to be done
else
something to be done if the condition evaluates to false
end
c. Elsif - When you want more than two options, you can use elsif. This allows you to add more conditions to be checked. Still only one of the code blocks will be run, because the statement only executes the code in the first applicable block; once a condition has been satisfied, the whole statement ends. Here is the if/elsif/else statement syntax:
if condition
something to be done
elsif different condition
something else to be done
else
another different thing to be done
end
d. Unless - Executes code if conditional is false. If the conditional is true, code specified in the else clause is executed.
unless condition
# thing to be done if the condition is false
else
# else is optional
# thing to be done if the condition is true
end
# Ruby unless modifier - Executes code if conditional is false.
code unless conditional
e. Case - this is basically the same as an if-elsif-else statement, but with clearer syntax.
# case statement syntax
case expr0
when expr1, expr2
stmt1
when expr3, expr4
stmt2
else
stmt3
end
# is basically similar to the following −
if expr1 === expr0 || expr2 === expr0
stmt1
elsif expr3 === expr0 || expr4 === expr0
stmt2
else
stmt3
end
Example of case statement
$age = 5
case $age
when 0 .. 2
puts "i will not be printed"
when 3 .. 6
puts "i will be printed"
when 7 .. 12
puts "i will not be printed"
when 13 .. 18
puts "youth"
else
puts "i will not be printed"
end
7. Put your code in functions
a. In Ruby we call functions methods. Methods are reusable sections of code that perform specific tasks in our program. Using methods means that we can write simpler, more easily readable code.
# syntax
def methodname
# method code here
end
b. Methods can also be defined to accept and process any parameters that are passed to them:
# Methods With Parameters
def laugh(number)
puts "haha " * number
end
c. We can call methods using the name of the method and specify the parameters within parentheses or without them:
# Using method - calling method as follows prints "haha" 5 times on the screen
laugh(5)
# You can also call laugh without parentheses
laugh 5
d. We can set default values for the parameters, which will be used if the method is called without passing the corresponding arguments:
def method_name (var1 = value1, var2 = value2)
expr..
end
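To make this concrete, here is a minimal sketch of a method with a default parameter value (the method name greet and the values are just illustrative):
def greet(name = "friend")
  puts "Hello, #{name}!"
end
greet           # prints "Hello, friend!"
greet("Vivek")  # prints "Hello, Vivek!"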
e. We can also return values. The return statement in Ruby is used to return one or more values from a Ruby method.
return
# or
return 12
# or
return 1,2,3
f. We can also define methods with variable number of parameters, like so:
# Variable number of parameters
def sample (*test)
puts "The number of parameters is #{test.length}"
for i in 0...test.length
puts "The parameters are #{test[i]}"
end
end
sample "Zara", "6", "F"
sample "Mac", "36", "M", "MCA"
8. Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Ruby allows users to create classes. These can be a combination of variables and functions that operate on those variables. Let's take a look at how we can define and use them.
# Define a class
class Employee # class names in Ruby must start with an uppercase letter
@@no_of_customers = 0
def initialize(id, name, addr)
@cust_id = id
@cust_name = name
@cust_addr = addr
end
end
# Creating an object of the class and using that
cust1 = Employee.new("1", "Vivek", "Somewhere on the Internet")
9. Read file from a disk and save file to a disk
Let's see how to read and parse a CSV file in an organized way. CSV is the most common file type you will use for data science; however, Ruby can read several other file types as well.
require 'csv'
# read a csv
CSV.read("file.csv")
# parse a string of text which is in csv format
CSV.parse("1,penny\n2,nickel\n3,dime")
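Writing a CSV goes through the same csv library: CSV.open with mode "w" yields a writer that rows can be pushed into. A minimal sketch (the file name output.csv and the rows are just illustrative):
require 'csv'
# write rows to a csv file
CSV.open("output.csv", "w") do |csv|
  csv << ["id", "coin"]
  csv << [1, "penny"]
  csv << [2, "nickel"]
end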
10. Ability to comment your code so you can understand it when you revisit it some time later
a. We can tell Ruby that a line of code is a comment by starting it with #.
#this is a comment
b. We can also specify a comment block, like so:
=begin
There are three main types of variable:
1. Strings (a collection of symbols inside speech marks)
2. Booleans (true or false)
3. Numbers (numeric values)
=end
Overall, Ruby is a powerful and flexible programming language that is well-suited for a wide range of programming tasks. With its focus on ease of use, object-oriented design, and extensive library of gems, Ruby is a valuable tool for both beginner and experienced programmers. Whether building web applications, desktop utilities, or other types of software, Ruby provides a fast, flexible, and enjoyable development experience.
To close I will emphasize the importance of practicing in learning anything new. Persistence and trying out different combinations of these building blocks for solving easier problems first and more complex ones later on is the only way to become fluent.
Comments welcome!
-
Introduction to Programming in C++
Quick Introduction to C++
C++ is a powerful and popular programming language that was developed in the 1980s as an extension of the C programming language. It is a high-level, object-oriented language that is used to develop a wide range of applications, including operating systems, device drivers, game engines, and more. C++ is also widely used in the field of finance and quantitative analysis, due to its speed and efficiency.
One of the key features of C++ is its ability to directly manipulate memory, allowing for low-level control over the hardware. C++ is also known for its efficiency and speed, making it a popular choice for developing applications that require high performance, such as video games and real-time systems.
Another key feature of C++ is its support for object-oriented programming (OOP). This allows programmers to define their own classes and objects, and to encapsulate data and functionality within those objects. OOP allows for code reusability, modularity, and flexibility, making it a popular paradigm in software development.
C++ is also known for its support for templates and generic programming. Templates allow programmers to write generic code that can work with different data types, without having to write separate code for each type. This can greatly simplify code development and maintenance, and can make C++ code more efficient and easier to read.
Most modern programming languages have a set of similar building blocks, for example:
Receiving input from the user and Showing output to the user
Ability to store values in variables (usually of different kinds such as integers, floating points or character)
A string of characters where you can store names, addresses, or any other kind of text
Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don't write that code 10 times; you write it once and tell the computer to loop through it 10 times
Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Put your code in functions
Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Read file from a disk and save file to a disk
Ability to comment your code so you can understand it when you revisit it some time later
Let's dive right in and see how we can do these things in C++.
0. How to install C++ on your desktop?
Before we can begin to write a program in C++, we need to install Dev-C++. Once done, go ahead and open the IDE and try out the following code to see if everything is in order.
#include <iostream>
using namespace std;
int main() {
cout << "Hello World!";
return 0;
}
As you noticed, unlike languages such as Python, R or Ruby, it takes more than a few statements just to display basic text to the user in C++. In the next section we will try to dismantle this code and understand the various components. Let's, however, cover a few important points first:
In C++ we need to end each line of code with a semi-colon ;
The scope of statements is defined using curly brackets {}, unlike Python where the scope is defined through indentation
All statements need to be within a function. Here we have included the statements in the main() function, which is the first function that is executed during a compiler call. All other functions will be called from within this function.
1. Receiving input from the user and Showing output to the user
Following program shows output to the user. The include statement is used to call the iostream header file, which is similar to a Python library. This header file provides information on basic programming routines, including input and output constructs. Next is int main(), which says that the main function will return an integer after execution. Within the main function we use cout << to show the text to the user. The text is enclosed in double quotes "text". endl after the text tells the compiler to insert a new line in the output window. Finally we return 0, as the main function is supposed to return an integer. 0 signifies that everything was in order during the execution of the function.
#include <iostream>
using namespace std;
int main() {
cout << "This is some text." << endl;
return 0;
}
We can modify this program to accept input from the user. The cin >> statement allows us to receive input. The variable in which we store the received input needs to be defined beforehand.
#include <iostream>
using namespace std;
int main() {
int age_ = 0;
cout << "What is your age?";
cin>>age_;
cout << "So your age is: " << age_;
return 0;
}
2. Ability to store values in variables (usually of different kinds such as integers, floating points or character)
C++ is statically typed: you need to declare a variable's name and data type before using it.
Basic data types: In C++ we have several types of variables; let's take a look at the important ones:
// Integer
int numberCats=5;
long int bigNumber=50000; //long int can be used for storing large values
// Floating point numbers. These are numbers with significant digits after the decimal
float pi=3.1415926535; //pi=22/7
// Double
double dValue=3.1415926535; //for more significant digits we need to use a variable type other than float
long double ldValue=3.1415926535;
// Boolean
bool bval=true; //boolean type is true or false; c++ uses 1 for true and 0 for false when outputting
// Character
char cval=55, cval2='7'; //takes exactly 1 byte of computer memory, char represents single characters from the ascii character set, 55 is the ascii code for 7, this is not the number 7 but the character 7
// String
string myname;
3. A string of characters where you can store names, addresses, or any other kind of text
A string in C++ can be defined using the string keyword. It can be assigned using input from the user or by providing text within double quotes "text".
string yourName;
cout << "\n\nwhat is your name? ";
cin >> yourName;
cout <<"\nnice to meet you "<<yourName<<endl<<endl;
4. Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Arrays are a series of variables stored together in one variable. Arrays can be one-dimensional or multi-dimensional.
One-dimensional arrays:
// Defining
int ar[3];
// Initializing the array
ar[0]=10;
ar[1]=20;
ar[2]=30;
// Supports indexing
cout<<ar[0]; // this will output the value stored at index 0, which is 10
Multi-dimensional arrays:
// Defining and initializing the array
int mar[3][2]={ //multi-dim array
{34,188},
{29,165},
{29,160}
};
// Supports indexing
cout<<mar[0][0]; // this will output the value stored at row index 0 x column index 0, which is 34
Loops can be used to iterate over one-dimensional or multi-dimensional arrays. We will take a closer look at this in the next section.
5. Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don't write that code 10 times; you write it once and tell the computer to loop through it 10 times
C++ has several looping options, such as 'for', 'while' and 'do while'. There are also options for nesting loops (single, double, triple, and so on).
a. The for loop
// Syntax
for (int i=0;i<10;i++){
statements to do stuff
}
// iterate over elements of a one-dimensional array
// practice - create an array with a table of 12
int t12[10];
for (int i=0;i<10;i++){
t12[i]=12*(i+1);
}
// iterate over elements of a two-dimensional array (concept of nesting - we will enclose a for loop within another for loop)
int mar[3][2]={
{34,188},
{29,165},
{29,160}
}; //multi-dim array
cout<<"\nthis is a multi dimentional array: ";
for (int i=0;i<3;i++){ //3 rows in the array
cout<<"\nrow "<<i+1<<": ";
for (int j=0;j<2;j++){ //2 columns in the array
cout<<"col "<<j+1<<": "<<mar[i][j]<<", ";
}
}
b. The While loop executes the same code again and again until a stop condition is met:
// Syntax
int i=0;
while (i<10){
code statements;
i+=1;
}
// Example
int i=1;
cout<<"\n\nwhile loop - first 10 natural numbers"<<endl;
while (i<=10){
cout<<i<<", ";
i+=1; //same as i=i+1
}
c. The Do-While loop executes the same code again and again until a stop condition is met. The difference from the while loop is that in a do-while loop the content of the loop is executed at least once before the condition is checked.
// Syntax
int i=0;
do{
code statements;
i+=1;
}while (i<10);
// Example
//for example if you want the user to enter the password again and again until they enter the correct password
cout<<"\n\ndo-while loop\n";
i=1;
string pass="pass", pass2;
do{
if(i!=1){
cout<<"\naccess denied, try again";
}
cout<<"\nenter your password?";
cin>>pass2;
i=0;
}while(pass2 != pass);
cout<<"\npassword accepted\n\n";
C++ also provides the break and continue statements that allow us to alter the loops further. Following is their use:
break jumps immediately out of the loop. It is mostly used in while loops but can also be used in for loops.
// break statement example
cout<<"\nbreak statement\n";
for(int f=1;f<11;f++){
if(f==5){
break; //we break out of the loop when f==5, and dont execute the loop for f>=5
}
cout<<f<<", ";
}
continue is similar to break, but it only skips the rest of the current iteration and continues with the next one.
// continue statement example
cout<<"\nbreak statement\n";
for(int f=1;f<11;f++){
if(f==5){
continue;
}
cout<<f<<", "; //this statement not executed for f==5
}
6. Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
C++ provides if.., if..else.., and switch statements to apply conditional logic. Let's take a look at them:
a. The basic syntax for creating an if statement is:
/////////// IF STATEMENT ////////////
string pass="password",pass2;
cout<<"\n\n--if statement capability--\n";
cout<<"\nenter password:";
cin>>pass2;
if (pass==pass2){
cout<<"\npassword matches! you can enter!!";
} else{
cout<<"\npassword doesnt match! begone!!";
}
b. The basic syntax for creating an if…else statement is:
/////////// IF-ELSE STATEMENT ////////////
int menuChoice=5;
cout<<"\n\n--if-else statement capability--\n";
cout<<"\n1.\tadd record";
cout<<"\n2.\tdelete record";
cout<<"\n3.\texit";
cout<<"\nwhat do you want to do?";
cin>>menuChoice;
if (menuChoice==1){
cout<<"\nlets add some records!!";
} else if (menuChoice==2){
cout<<"\nlets delete some records!!";
} else{
cout<<"\nexiting! good-bye!!";
}
c. The basic syntax for creating a switch statement is:
/////////// SWITCH STATEMENT ////////////
int menuChoice2=5;
cout<<"\n\n--switch statement capability--\n";
cout<<"\n1.\tadd record";
cout<<"\n2.\tdelete record";
cout<<"\n3.\texit";
cout<<"\nwhat do you want to do?";
cin>>menuChoice2;
switch(menuChoice2){
case 1:
cout<<"\nlets add some records!!";
break;
case 2:
cout<<"\nlets delete some records!!";
break;
case 3:
cout<<"\nexiting! good-bye!!";
break;
default:
cout<<"\n!!!!error!!!!";
}
7. Put your code in functions
Functions allow us to create a block of code that can be executed many times without needing to write it again.
// Following is an example case where we define a function that shows a menu to the user
int sub_menu(int choice) {
switch(choice){
case 1:
cout<<"\nLets add a new record";
break;
case 2:
cout<<"\nLets view an existing record";
break;
case 3:
cout<<"\nLets delete an existing record";
break;
default:
cout<<"\nExiting! Goodbye!!";
}
return 0;
}
We can call the function by its name:
// let's say we are writing main() and we want to call the function
// lines-of-code
sub_menu(2); // pass the user's menu choice (2 here is just an example) as the argument
// lines-of-code
8. Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
C++ allows users to create classes. These can be a combination of variables and functions that operate on those variables. Let's take a look at how we can define and use them.
// Create a Car class with some attributes
class Car {
public:
string brand;
string model;
int year;
};
// Create an object of Car
Car carObj1;
carObj1.brand = "Mahindra";
carObj1.model = "Scorpio";
carObj1.year = 2020;
// Using the object
cout << carObj1.brand << " " << carObj1.model << " " << carObj1.year << "\n";
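The example above only stores attributes; a class can also bundle in the functions that operate on those attributes, as described above. A minimal sketch extending the same Car idea with a constructor and a member function (the function name describe is just illustrative):
#include <iostream>
#include <string>
using namespace std;
class Car {
  public:
    string brand;
    string model;
    int year;
    // constructor to initialize the attributes
    Car(string b, string m, int y) {
        brand = b;
        model = m;
        year = y;
    }
    // member function that operates on the attributes
    void describe() {
        cout << brand << " " << model << " " << year << "\n";
    }
};
int main() {
    Car carObj1("Mahindra", "Scorpio", 2020);
    carObj1.describe();
    return 0;
}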
9. Read file from a disk and save file to a disk
Let's see how to read and write a text file in an organized way. We use the fstream header file to import the functions necessary to read/write files.
#include <fstream>
// read a text file
string line;
ifstream myfile ("file.txt");
if (myfile.is_open())
{
while ( getline (myfile,line) )
{
cout << line << '\n';
}
myfile.close();
}
else cout << "Unable to open file";
// write a text file
ofstream myfile ("file.txt");
if (myfile.is_open())
{
myfile << "This is a line.\n";
myfile << "This is another line.\n";
myfile.close();
}
else cout << "Unable to open file";
10. Ability to comment your code so you can understand it when you revisit it some time later
We can tell C++ that a line of code is a comment as follows.
// this is a comment
We can mark a multi-line block as a comment as follows.
/*
this
is
a
comment
block
*/
While C++ can be a powerful tool, it can also be complex and difficult to learn, especially for beginners. The language has a steep learning curve, and requires a solid understanding of programming concepts such as pointers, memory management, and OOP. However, with the right resources and dedication, C++ can be a rewarding and powerful tool for software development.
Overall, C++ is a popular and powerful programming language that is used in a wide range of applications, from operating systems to video games. Its efficiency, speed, and support for OOP and generic programming make it a versatile and powerful tool for software developers.
To close I will emphasize the importance of practicing in learning anything new. Persistence and trying out different combinations of these building blocks for solving easier problems first and more complex ones later on is the only way to become fluent.
Comments welcome!
-
Introduction to Programming in Microsoft Excel VBA
What is MS Excel VBA?
Excel VBA, or Visual Basic for Applications, is a programming language that can be used to automate tasks and enhance functionality in Microsoft Excel. VBA is a powerful tool that allows users to write custom macros and functions to automate repetitive tasks, perform complex calculations, and create custom solutions.
VBA is a type of Visual Basic, which is an object-oriented programming language developed by Microsoft. VBA is integrated directly into Excel, making it easy to access and use. VBA code is stored in modules, which can be accessed through the Visual Basic Editor in Excel. In the Editor, users can write, edit, and run VBA code, as well as debug their code to identify and fix any errors.
One of the key advantages of VBA is that it allows users to automate repetitive tasks that would otherwise be time-consuming to perform manually. For example, users can write a VBA macro to format data, generate reports, or update data in bulk. VBA can also be used to perform complex calculations, create custom user interfaces, and interact with other applications.
To get started with VBA, users should have a basic understanding of programming concepts and syntax. The VBA language is based on Visual Basic, so many programming concepts, such as variables, loops, and conditional statements, are similar to other programming languages. Excel also provides many built-in functions and objects that can be used in VBA code, making it easy to access and manipulate data in a spreadsheet.
Most modern programming languages have a set of similar building blocks, for example:
Receiving input from the user and Showing output to the user
Ability to store values in variables (usually of different kinds such as integers, floating points or character)
A string of characters where you can store names, addresses, or any other kind of text
Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don't write that code 10 times; you write it once and tell the computer to loop through it 10 times
Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Put your code in functions
Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Read file from a disk and save file to a disk
Ability to comment your code so you can understand it when you revisit it some time later
Let's dive right in and see how we can do these things in VBA.
0. Enable VBA in your Excel file
Before we can begin to write a program in VBA, also known as a macro, we need to enable the Developer tab. You can do this by going to File > Options > Customise Ribbon. Once the Developer tab is available, go there and choose the leftmost option, which says Visual Basic. Now you will see a panel on the left where you can double-click on the name of the sheet you are working on. This will open an empty code window. Here, write the following code and save the file as a macro-enabled workbook (extension .xlsm).
Sub simple_hello()
Range("A2").Value = "Hello World!"
End Sub
Close the file, then open it back again and choose the option (if shown) to enable macros. Now go to the Developer tab again and this time select the second option, called Macros. Here you should see the macro that you just created. Select it and hit Run!
1. Receiving input from the user and Showing output to the user
There are several ways in which a macro can show output to the user. Let’s look at some ways of showing output:
'Method 1:
Range("A2").Value = "Hello"
'Method 2:
Worksheets("Sheet1").Range("B2").Value = "Hello"
'Method 3:
Worksheets(1).Range("C2").Value = "Hello"
'Method 4:
MsgBox "I added Hello in cell A2, B2 and C2"
'Method 5:
MsgBox "Hello " & Range("C5").Value & vbNewLine & "So you are " & Range("C6") & " years old!"
2. Ability to store values in variables (usually of different kinds such as integers, floating points or character)
VBA allows 4 key types of variables: Integer, String, Double and Boolean
Integer is good for storing most numeric values, String is for character input, Double is for numbers with decimals, and Boolean is for a true/false type of data. Here are some examples:
'Integer:
Dim x As Integer
x = 6
Range("A1").Value = x
'String:
Dim book As String
book = "bible"
Range("A1").Value = book
'Double:
Dim x As Double
x = 5.5
MsgBox "value is " & x
'Boolean:
Dim continue As Boolean
continue = True
If continue = True Then MsgBox "Boolean variables are cool"
3. A string of characters where you can store names, addresses, or any other kind of text
Key idea here is to learn how to manipulate string variables. There are a few common operations that we will focus on:
a. Joining strings
'Join Strings
Dim text1 As String, text2 As String
text1 = "Hi"
text2 = "Tim"
MsgBox text1 & " " & text2
b. Left/right or middle functions - To extract the leftmost/rightmost or middle characters from a string.
Dim text As String
text = "example text"
MsgBox Left(text, 4)
'Just as with Left, we can also extract a substring from the right or middle
MsgBox Right("example text", 2)
MsgBox Mid("example text", 9, 2)
c. To get the length of a string, use Len.
MsgBox Len("example text")
d. To find the position of a substring in a string, use Instr.
MsgBox InStr("example text", "am")
4. Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Arrays are a series of values of a similar type stored together in one variable. Arrays can be one-dimensional or multi-dimensional.
a. Following example shows how a one-dimensional array works:
Dim Films(1 To 5) As String
Films(1) = "Lord of the Rings"
Films(2) = "Speed"
Films(3) = "Star Wars"
Films(4) = "The Godfather"
Films(5) = "Pulp Fiction"
MsgBox Films(4)
b. Following example shows how a two-dimensional array works:
Dim Films(1 To 5, 1 To 2) As String
Dim i As Integer, j As Integer
For i = 1 To 5
For j = 1 To 2
Films(i, j) = Cells(i, j).Value
Next j
Next i
MsgBox Films(4, 2)
5. Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don't write that code 10 times; you write it once and tell the computer to loop through it 10 times
VBA has several looping options (for, do-while, do-until). There are also options for nesting loops (single, double, triple, and so on).
a. Following example shows how a simple/single for loop works:
Dim i As Integer
For i = 1 To 6
Cells(i, 1).Value = 100
Next i
b. Following example shows how a double for loop works:
Dim i As Integer, j As Integer
For i = 1 To 6
For j = 1 To 2
Cells(i, j).Value = 100
Next j
Next i
c. Following example shows how a triple for loop works:
Dim c As Integer, i As Integer, j As Integer
For c = 1 To 3
For i = 1 To 6
For j = 1 To 2
Worksheets(c).Cells(i, j).Value = 100
Next j
Next i
Next c
VBA also has a do-while loop. Following example shows how it works:
Dim i As Integer
i = 1
Do While i < 6
Cells(i, 1).Value = 20
i = i + 1
Loop
VBA also has a do-until loop. Following example shows how it works:
Dim i As Integer
i = 1
Do Until i > 6
Cells(i, 1).Value = 20
i = i + 1
Loop
6. Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
a. If Then Statement - VBA has the option of an if statement, which executes a piece of code only if a specified condition is met.
Dim score As Integer, result As String
score = Range("A1").Value
If score >= 60 Then result = "pass"
Range("B1").Value = result
b. If Else Statement - VBA has the option of an if-else statement, which executes one piece of code if a specified condition is met, and another piece of code if it is not.
Dim score As Integer, result As String
score = Range("A1").Value
If score >= 60 Then
result = "pass"
Else
result = "fail"
End If
Range("B1").Value = result
c. Select Case Statement - Instead of multiple If Then statements, you can use a Select Case structure, which checks a value against a series of cases and runs the code for the first case that matches.
'Select Case
'First, declare two variables. One variable of type Integer named score and one variable of type String named result
Dim score As Integer, result As String
'We initialize the variable score with the value of cell A1
score = Range("A1").Value
'Add the Select Case structure
Select Case score
Case Is >= 80
result = "very good"
Case Is >= 70
result = "good"
Case Is >= 60
result = "sufficient"
Case Else
result = "insufficient"
End Select
'Write the value of the variable result to cell B1
Range("B1").Value = result
7. Put your code in functions
VBA allows us to specify a function or a sub. The difference between the two is that a function allows us to return a value whereas a sub does not.
a. Function - If you want Excel VBA to perform a task that returns a result, you can use a function. Place a function into a module (In the Visual Basic Editor, click Insert, Module). For example, the function with name Area.
'Explanation: This function has two arguments (of type Double) and a return type (the part after As also of type Double). You can use the name of the function (Area) in your code to indicate which result you want to return (here x * y).
Function Area(x As Double, y As Double) As Double
Area = x * y
End Function
'Explanation: The function returns a value so you have to 'catch' this value in your code. You can use another variable (z) for this. Next, you can add another value to this variable (if you want). Finally, display the value using a MsgBox.
Dim z As Double
z = Area(3, 5) + 2
MsgBox z
b. Sub - If you want Excel VBA to perform some actions, you can use a sub.
Place a sub into a module (In the Visual Basic Editor, click Insert, Module). For example, the sub with name Area.
Sub Area(x As Double, y As Double)
MsgBox x * y
End Sub
'Explanation: This sub has two arguments (of type Double). It does not have a return type! You can refer to this sub (call the sub) from somewhere else in your code by simply using the name of the sub and giving a value for each argument.
'Call it using Area 3, 5
8. Advanced data types that are formed through a combinaiton of one or more types of basic data types such as structures or classes
VBA class modules allow us to create our own objects, combining data (properties) with the procedures that operate on that data (methods). Objects created from a class behave as independent units while sharing the behaviour defined in the class. A detailed example of how to do this is out of the scope of this article.
9. Read file from a disk and save file to a disk - Out of scope of this article.
10. Ability to comment your code so you can understand it when you revisit it some time later
We can tell VBA that a line of code is a comment by starting it with a single apostrophe (').
'this is a comment
Overall, Excel VBA is a powerful tool that can help users automate tasks, improve productivity, and enhance the functionality of Microsoft Excel. With its flexibility and ease of use, VBA is a valuable tool for users of all skill levels, from beginners to advanced programmers.
To close I will emphasize the importance of practicing in learning anything new. Persistence and trying out different combinations of these building blocks for solving easier problems first and more complex ones later on is the only way to become fluent.
Comments welcome!