-
Make Your Own Python Package and Share with Others in AWS SageMaker
Packaging in Python is the process of creating distributable packages of code that can be easily installed and used by other developers. These packages can contain modules, functions, classes, and other components that can be shared and reused across different projects.
Advantages of packaging functions
Modularity: By packaging related functions together, you can organize your code into reusable modules that can be used across multiple projects. This can help to avoid duplicating code and makes it easier to maintain and update your code.
Encapsulation: Packaging functions in a module can help to encapsulate the implementation details of the functions, making it easier to use them without worrying about the underlying implementation.
Code reusability: If you have a set of functions that perform a specific task, packaging them into a module can make it easy to reuse those functions in other projects without having to rewrite them.
Code organization: Packaging functions into a module can help to keep your code organized and easy to navigate. This can be especially helpful for larger projects with many different functions.
Easy installation: By packaging functions into a module, you can make it easy for others to install and use your code by simply installing the module using pip. This can make it easier to share your code with others and collaborate on projects.
Process of creating a package
Here’s an example of a Python package with functions to add and subtract two numbers, and how to make it available in Amazon SageMaker.
Create package directory
First, create a new directory for your package and navigate to it in your terminal or command prompt.
Then, create a new file called __init__.py in your package directory. This file marks the directory as a Python package; it can be left empty.
Then, create a new file called vp_math.py in your package directory. This file will contain the functions to add and subtract two numbers. Your directory structure should look something like this:
vp_package/
    __init__.py
    vp_math.py
Adding code
Now you can add the following code to your vp_math.py file:
def add(x, y):
    return x + y

def subtract(x, y):
    return x - y
Create a setup.py file
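Now, create a setup.py file in the directory that contains vp_package/ (one level above the package directory, so that setuptools can find the package named in packages), with the following code: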
from setuptools import setup

setup(
    name='vp_package',
    version='0.0.1',
    description='A basic package with functions to add and subtract two numbers',
    packages=['vp_package']
)
Build and package
Now, build the distribution file by running the following command in your terminal or command prompt, from the directory that contains setup.py:
python setup.py sdist
This will create a .tar.gz file in a newly created dist directory.
Upload the package to a shared place
Upload the package to an S3 bucket that SageMaker can access. For this, you can use the AWS CLI, for example:
aws s3 cp dist/vp_package-0.0.1.tar.gz s3://my-bucket/
Make sure to replace my-bucket with the name of the S3 bucket you want to use.
Download and make package available in SageMaker
You can import a Python package from an S3 bucket by downloading the package to your local machine or server, and then adding the downloaded package to your Python’s sys.path. Here are the steps to import a Python package from S3.
# Download the package from S3 to current directory
aws s3 cp s3://my-bucket/vp_package-0.0.1.tar.gz .
# Extract the package; the archive unpacks into a directory named "vp_package-0.0.1"
tar -xzf vp_package-0.0.1.tar.gz
# We can also do the above step in a Jupyter Notebook by using the ! operator
!tar -xzf vp_package-0.0.1.tar.gz
# Add the extracted directory (the one that contains vp_package/) to sys.path to make the package importable
import sys
sys.path.append("vp_package-0.0.1")
# Now, you can import the package and use it as usual
import vp_package
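The extract-and-import step can also be sketched entirely in Python. The snippet below is a minimal illustration, not the exact SageMaker workflow: it builds a throwaway archive in a temporary directory that mimics the layout of the sdist (an assumption standing in for the file downloaded from S3), extracts it with the standard tarfile module, and adds the extracted directory to sys.path.

```python
import importlib
import os
import sys
import tarfile
import tempfile

# Build a throwaway archive that mimics the sdist layout:
# vp_package-0.0.1/vp_package/{__init__.py, vp_math.py}
# (stand-in for the file that would be downloaded from S3)
work_dir = tempfile.mkdtemp()
src_root = os.path.join(work_dir, "src", "vp_package-0.0.1")
pkg_dir = os.path.join(src_root, "vp_package")
os.makedirs(pkg_dir)
open(os.path.join(pkg_dir, "__init__.py"), "w").close()
with open(os.path.join(pkg_dir, "vp_math.py"), "w") as f:
    f.write("def add(x, y):\n    return x + y\n\n"
            "def subtract(x, y):\n    return x - y\n")

archive = os.path.join(work_dir, "vp_package-0.0.1.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(src_root, arcname="vp_package-0.0.1")

# Extract the archive; this creates <extract_dir>/vp_package-0.0.1/
extract_dir = os.path.join(work_dir, "extracted")
os.makedirs(extract_dir)
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(extract_dir)

# sys.path must contain the directory that holds vp_package/,
# not the vp_package/ directory itself.
sys.path.append(os.path.join(extract_dir, "vp_package-0.0.1"))
importlib.invalidate_caches()

from vp_package.vp_math import add, subtract

result_add = add(2, 3)
result_subtract = subtract(5, 2)
print(result_add, result_subtract)
```

The key design point is which directory goes on sys.path: Python looks for a vp_package/ folder inside each sys.path entry, so the entry has to be the parent of the package.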
How to use the package
In SageMaker, create a new Jupyter Notebook and include the following code:
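!pip install ./vp_package-0.0.1.tar.gz --target /home/ec2-user/SageMaker/vp_package
import sys
sys.path.append('/home/ec2-user/SageMaker/vp_package') # make the --target directory importable
from vp_package.vp_math import add, subtract
print(add(2, 3)) # Output: 5
print(subtract(5, 2)) # Output: 3
The !pip install command installs your package into the SageMaker notebook instance from the .tar.gz file downloaded from S3 (the package is not published on PyPI, so pip needs the file path rather than the package name). The --target argument tells pip where to install the package, in this case the /home/ec2-user/SageMaker/vp_package directory, which is then added to sys.path so that Python can find it.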
This should import the functions from your package and allow you to use them in SageMaker.
Note that in a production environment, you would want to host your package on a PyPI server or a private package repository instead of manually uploading it to S3.
Comments welcome!
-
A Primer on the Chi-Squared Test
The chi-square test is a statistical hypothesis test that is used to determine whether there is a significant association between two categorical variables. It is widely used in data analysis, particularly in fields such as social sciences, marketing, and biology, to examine relationships between categorical data. In this article, we will discuss the chi-square test, its applications, and how to perform it using Python.
Understanding the Chi-Square Test
The chi-square test is a non-parametric test that compares the observed frequencies of categorical data with the expected frequencies. The test is based on the chi-square statistic, which is calculated by summing the squared difference between the observed and expected frequencies, divided by the expected frequency, for each category.
The chi-square test is used to test the null hypothesis that there is no significant association between the two variables. If the calculated chi-square value is greater than the critical value, we can reject the null hypothesis and conclude that there is a significant association between the variables.
There are two types of chi-square tests: the chi-square goodness of fit test and the chi-square test of independence. The goodness of fit test is used to test whether the observed data follows a particular distribution, while the test of independence is used to test whether there is a significant association between two categorical variables.
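As a small worked example of the statistic described above, the sketch below computes the chi-square value for a goodness of fit test by hand and compares it with scipy.stats.chisquare. The die-roll counts are made-up illustrative data, not results from any study.

```python
import numpy as np
from scipy.stats import chisquare

# Hypothetical counts from 60 rolls of a die (illustrative data only)
observed = np.array([8, 9, 12, 11, 6, 14])
# Under the null hypothesis of a fair die, each face is expected 10 times
expected = np.full(6, observed.sum() / 6)

# Chi-square statistic: sum over categories of (O - E)^2 / E
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# The same test using SciPy
chi2_scipy, p_value = chisquare(observed, f_exp=expected)

print("Manual chi-square:", chi2_manual)
print("SciPy chi-square:", chi2_scipy)
print("P-value:", p_value)
```

Here the squared differences are 4, 1, 4, 1, 16, and 16, so the statistic is 42 / 10 = 4.2 with 5 degrees of freedom.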
Applications of the Chi-Square Test
The chi-square test is widely used in research and data analysis, with a range of applications across various fields. Some common applications include:
Market research: To determine if there is a significant association between demographic factors and consumer behavior, such as age, gender, and income level.
Biology: To test whether different species of plants or animals are distributed randomly or in patterns in their environment.
Social sciences: To test whether there is a significant relationship between socio-economic status and educational attainment.
Quality control: To test whether the proportion of defective products differs between groups such as batches or production lines, based on the number of products that pass or fail inspection.
Performing the Chi-Square Test in Python
Python has several libraries that can be used to perform the chi-square test, including SciPy and StatsModels, while Pandas is convenient for building the contingency table. Here is an example of how to perform the chi-square test of independence using the chi2_contingency function in the SciPy library:
import scipy.stats as stats
import pandas as pd
# Load data into a Pandas DataFrame
data = pd.read_csv('my_data.csv')
# Create a contingency table
contingency_table = pd.crosstab(data['variable_1'], data['variable_2'])
# Perform the chi-square test of independence
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
# Print the results
print('Chi-square statistic:', chi2)
print('P-value:', p)
In this example, we load data from a CSV file into a Pandas DataFrame, create a contingency table using the crosstab function, and then use the chi2_contingency function to perform the chi-square test of independence. The function returns the chi-square statistic, the p-value, the degrees of freedom, and the expected frequencies.
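Because my_data.csv is only a placeholder, here is a self-contained variant of the same steps that runs without any file. The column names gender and preference and the counts are hypothetical, chosen only so the numbers are easy to follow.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical survey data: 100 respondents, two categorical columns
data = pd.DataFrame({
    "gender": ["F"] * 50 + ["M"] * 50,
    "preference": ["A"] * 30 + ["B"] * 20 + ["A"] * 20 + ["B"] * 30,
})

# Contingency table: rows are gender, columns are preference
table = pd.crosstab(data["gender"], data["preference"])

# For a 2x2 table, chi2_contingency applies Yates' continuity
# correction by default (correction=True)
chi2, p, dof, expected = chi2_contingency(table)

print(table)
print("Chi-square statistic:", chi2)
print("P-value:", p)
print("Degrees of freedom:", dof)
print("Expected frequencies:")
print(expected)
```

Every expected count is 25 here, and with the continuity correction each cell contributes (5 - 0.5)^2 / 25, giving a statistic of 3.24. Passing correction=False would return the uncorrected value instead.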
Conclusion
The chi-square test is a valuable statistical tool for examining the relationship between two categorical variables. By performing the test, we can determine whether there is a significant association between the variables and draw conclusions about the data. With the help of Python and its many data analysis libraries, we can easily perform the chi-square test and gain valuable insights from our data.
Comments welcome!
-
A Primer on ANOVA
ANOVA (Analysis of Variance) is a statistical method used to analyze and test the differences between the means of three or more groups. ANOVA compares the variation within groups to the variation between groups to determine whether the differences in means are statistically significant or just due to random chance.
The basic idea behind ANOVA is that if the variation between groups is significantly greater than the variation within groups, then there is evidence to suggest that the means of the groups are different. ANOVA allows us to test the null hypothesis that all of the group means are equal against the alternative hypothesis that at least one group mean is different from the others.
ANOVA is used in a wide range of applications, including biology, social sciences, economics, and engineering. It is often used in experimental research to test the effects of different treatments or interventions on a particular outcome.
There are several types of ANOVA, including one-way ANOVA, which compares the means of three or more groups that are unrelated, and repeated measures ANOVA, which compares the means of three or more groups that are related (i.e., the same group is measured under different conditions). ANOVA can be performed using software such as R, Python, or SPSS. In this article, we will be using Python.
Assumptions of ANOVA
ANOVA (Analysis of Variance) has several assumptions that should be met to ensure the validity and reliability of the test. The main assumptions of ANOVA are:
Normality: The dependent variable should be normally distributed in each group. One way to check this is by examining the distribution of the residuals (the differences between the observed values and the predicted values) for each group.
Homogeneity of variances: The variances of the dependent variable should be equal in each group. This can be checked by examining the variance of the residuals for each group.
Independence: The observations should be independent of each other. This means that there should be no systematic relationship between the observations in one group and the observations in another group.
Random Sampling: The observations should be randomly sampled from each group in the population.
If these assumptions are not met, the results of the ANOVA may not be reliable. In addition, violating these assumptions can lead to a higher probability of type I errors (rejecting the null hypothesis when it is actually true) or type II errors (failing to reject the null hypothesis when it is actually false).
Types of ANOVA tests
One-way ANOVA: This test is used to compare the means of more than two independent groups.
Two-way ANOVA: This test is used to examine how two categorical factors, and the interaction between them, affect the mean of a dependent variable.
One-way ANOVA
One-way ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups. It is used to determine whether there are significant differences between the means of the groups based on the variability within each group and the variability between groups. In this article, we will walk through how to perform a one-way ANOVA test using Python.
Performing a one-way ANOVA test in Python:
To perform a one-way ANOVA test in Python, we can use the scipy.stats module. Here’s an example code snippet:
import scipy.stats as stats
import pandas as pd
# Create data
group1 = [1, 2, 3, 4, 5]
group2 = [6, 7, 8, 9, 10]
group3 = [11, 12, 13, 14, 15]
# Combine data into a pandas dataframe
data = pd.DataFrame({'Group1': group1, 'Group2': group2, 'Group3': group3})
# Perform one-way ANOVA test
fvalue, pvalue = stats.f_oneway(data['Group1'], data['Group2'], data['Group3'])
# Print results
print('F-value:', fvalue)
print('P-value:', pvalue)
In this example, we create three groups of data (group1, group2, and group3) and combine them into a pandas dataframe. We then use the f_oneway() function from the scipy.stats module to perform the one-way ANOVA test on the three groups. The output of the test includes the F-value and the p-value.
Interpreting the results:
The F-value is a measure of the variance between the groups compared to the variance within the groups. A higher F-value indicates that there is more variability between the groups and less variability within the groups. The p-value is a measure of the statistical significance of the F-value. A p-value less than 0.05 indicates that there is a statistically significant difference between the means of the groups.
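To make the definition of the F-value concrete, the following sketch computes it by hand for the three example groups, as the ratio of the between-group mean square to the within-group mean square, and checks the result against f_oneway.

```python
import numpy as np
from scipy.stats import f_oneway

groups = [
    np.array([1, 2, 3, 4, 5]),
    np.array([6, 7, 8, 9, 10]),
    np.array([11, 12, 13, 14, 15]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Between-group sum of squares: group size times squared distance
# of each group mean from the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: squared distance of each value
# from its own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)       # 250 / 2 = 125
ms_within = ss_within / (n_total - k)   # 30 / 12 = 2.5
f_manual = ms_between / ms_within       # 125 / 2.5 = 50

f_scipy, p_value = f_oneway(*groups)
print("Manual F:", f_manual)
print("SciPy F:", f_scipy)
print("P-value:", p_value)
```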
In the example above, the F-value is 50 (a between-group mean square of 125 divided by a within-group mean square of 2.5) and the p-value is far below 0.05, which suggests that there is a statistically significant difference between the means of the three groups.
Two-way ANOVA
Two-way ANOVA is a statistical test that examines how two different categorical factors, and the interaction between them, affect the mean of a response variable. In this article, we will go over how to perform two-way ANOVA in Python using the statsmodels package.
To illustrate two-way ANOVA in Python, we will use a dataset called ‘PlantGrowth’. It is a dataset of 30 plants, each receiving one of three treatments (ctrl, trt1, and trt2), with the weight of each plant measured after a set period. We are interested in testing the effects of the treatment and the type of seed on the weight of the plants. An excerpt of the records is shown below:
[{'weight': '4.17', 'group': 'ctrl', 'plant': 'plant_1'},
{'weight': '5.58', 'group': 'ctrl', 'plant': 'plant_2'},
{'weight': '5.18', 'group': 'ctrl', 'plant': 'plant_3'},
{'weight': '6.11', 'group': 'ctrl', 'plant': 'plant_4'},
{'weight': '4.50', 'group': 'ctrl', 'plant': 'plant_5'},
{'weight': '4.61', 'group': 'ctrl', 'plant': 'plant_6'},
{'weight': '5.17', 'group': 'ctrl', 'plant': 'plant_7'},
{'weight': '4.53', 'group': 'ctrl', 'plant': 'plant_8'},
{'weight': '5.33', 'group': 'ctrl', 'plant': 'plant_9'},
{'weight': '5.14', 'group': 'trt1', 'plant': 'plant_10'},
{'weight': '4.81', 'group': 'trt1', 'plant': 'plant_11'},
{'weight': '4.17', 'group': 'trt1', 'plant': 'plant_12'},
{'weight': '4.41', 'group': 'trt1', 'plant': 'plant_13'},
{'weight': '3.59', 'group': 'trt1', 'plant': 'plant_14'},
{'weight': '5.87', 'group': 'trt1', 'plant': 'plant_15'},
{'weight': '3.83', 'group': 'trt1', 'plant': 'plant_16'},
{'weight': '6.03', 'group': 'trt1', 'plant': 'plant_17'},
{'weight': '4.89', 'group': 'trt1', 'plant': 'plant_18'},
{'weight': '4.32', 'group': 'trt2', 'plant': 'plant_19'},
{'weight': '4.69', 'group': 'trt2', 'plant': 'plant_20'},
{'weight': '6.31', 'group': 'trt2', 'plant': 'plant_21'},
{'weight': '5.12', 'group': 'trt2', 'plant': 'plant_22'},
{'weight': '5.54', 'group': 'trt2', 'plant': 'plant_23'},
{'weight': '5.50', 'group': 'trt2', 'plant': 'plant_24'},
{'weight': '5.37', 'group': 'trt2', 'plant': 'plant_25'},
{'weight': '5.29', 'group': 'trt2', 'plant': 'plant_26'},
{'weight': '4.92', 'group': 'trt2', 'plant': 'plant_27'}]
Here’s how to perform a two-way ANOVA in Python:
Step 1: Load the required libraries and dataset
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
data = pd.read_csv('PlantGrowth.csv')
Step 2: Create a model formula and fit the model
model = ols('weight ~ C(treatment) + C(seed) + C(treatment):C(seed)', data).fit()
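Here, ‘weight’ is the dependent variable, and ‘treatment’ and ‘seed’ are the two independent variables (factors). The formula assumes that the CSV file contains columns with exactly these names. The records listed earlier label the treatment column ‘group’ and contain no seed column, so the file must be prepared accordingly.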
Step 3: Perform the two-way ANOVA using anova_lm()
anova_results = anova_lm(model, typ=2)
print(anova_results)
The typ parameter specifies the type of sum of squares to use. Here, we use type 2 sum of squares.
The anova_lm() function returns a table with the results of the ANOVA. The table includes the sum of squares, degrees of freedom, F-value, and p-value for each main effect and interaction effect.
Step 4: Interpret the results
If the ANOVA table shows p-values below the significance level (for example, 0.05) for both main effects, ‘treatment’ and ‘seed’, and for the interaction between them, this suggests that both the type of treatment and the type of seed have a significant effect on the weight of the plants, and that the effect of the treatment depends on the type of seed.
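The steps above can be tried end to end without a CSV file. The sketch below uses synthetic data generated with NumPy; the column names treatment and seed, the seed types A and B, and the effect sizes are all assumptions made for illustration and are not part of the PlantGrowth dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Synthetic, balanced design: 3 treatments x 2 seed types x 10 plants
rng = np.random.default_rng(0)
rows = []
for treatment in ["ctrl", "trt1", "trt2"]:
    for seed in ["A", "B"]:
        # Hypothetical effects: trt2 adds 0.5, seed B adds 0.3
        base = 5.0
        base += 0.5 if treatment == "trt2" else 0.0
        base += 0.3 if seed == "B" else 0.0
        for weight in rng.normal(loc=base, scale=0.5, size=10):
            rows.append({"weight": weight, "treatment": treatment, "seed": seed})
data = pd.DataFrame(rows)

# Two main effects plus their interaction
model = ols("weight ~ C(treatment) + C(seed) + C(treatment):C(seed)", data=data).fit()

# Type 2 sums of squares, as in the steps above
anova_results = anova_lm(model, typ=2)
print(anova_results)
```

The resulting table has one row per main effect, one for the interaction, and one for the residual, with columns for the sum of squares, degrees of freedom, F-value, and p-value.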
In conclusion, performing a two-way ANOVA in Python is straightforward using the statsmodels package. It is important to ensure that the assumptions of the ANOVA are met before interpreting the results.
Finally, to close, ANOVA is a powerful statistical technique that can be used to compare the means of two or more groups. Whether you are testing the effectiveness of different treatments, analyzing the impact of a categorical variable, or trying to determine if there are significant differences between groups, ANOVA can help you identify these differences and draw meaningful conclusions. By using Python and its many data analysis libraries, you can easily perform ANOVA and other statistical tests on your data and gain valuable insights that can inform your decisions and actions. With the right approach and tools, ANOVA can be a valuable addition to your statistical toolbox.
Comments welcome!
-
A Primer on T-tests
T-tests are a class of statistical tests used to determine whether there is a significant difference between means. T-tests are often used to compare the mean of a sample to a population mean, or to compare the means of two independent samples or two paired samples.
The following are the most common types of t-tests, all of which we will cover:
One-sample t-test: This test is used to compare the mean of a single sample to a known or hypothesized population mean.
Independent samples t-test: This test is used to compare the means of two independent groups.
Paired samples t-test: This test is used to compare the means of two dependent (paired) groups.
T-tests have several assumptions that need to be met in order for the test to be valid. The most important assumptions are:
Normality: The data should be approximately normally distributed. For large samples the test is fairly robust to departures from normality, because the sample means are then approximately normally distributed.
Independence: The samples should be independent of each other. This means that the observations in one sample should not be related to the observations in the other sample.
Homogeneity of variances: The variances of the two samples should be approximately equal. This means that the spread of the data should be similar in both groups.
If these assumptions are not met, the results of the t-test may be invalid or misleading. Different types of t-tests also make different assumptions. For example, the paired samples t-test assumes that the differences between paired observations are normally distributed, while the standard (Student’s) independent samples t-test assumes that the two samples have equal variances. It’s important to carefully consider the assumptions of the test and to use caution when interpreting the results.
How to perform T-tests in Python
One-sample t-test
A one-sample t-test is used to compare the mean of a single sample to a known or hypothesized population mean. This test is useful for determining whether a sample differs significantly from the population mean.
To perform a one-sample t-test in Python, you can use the scipy.stats.ttest_1samp function. Here’s an example:
import numpy as np
from scipy.stats import ttest_1samp
# Generate a sample of data
data = np.random.normal(loc=10, scale=2, size=100)
# Set the hypothesized population mean
pop_mean = 9
# Perform the one-sample t-test
t_stat, p_val = ttest_1samp(data, pop_mean)
# Print the results
print("t-statistic: {:.3f}".format(t_stat))
print("p-value: {:.3f}".format(p_val))
In this example, we first generate a sample of data using the numpy.random.normal function, which generates a sample of data from a normal distribution with the specified mean (loc) and standard deviation (scale). We then set the hypothesized population mean to 9.
We then perform the one-sample t-test using the ttest_1samp function, which takes two arguments: the sample data and the hypothesized population mean. The function returns two values: the t-statistic and the p-value.
Finally, we print the results using the print function, formatting the t-statistic and p-value to three decimal places.
If the p-value is less than the significance level (usually 0.05), we can reject the null hypothesis and conclude that the sample mean differs significantly from the population mean. Otherwise, we fail to reject the null hypothesis and conclude that there is not enough evidence to suggest a significant difference between the sample mean and the population mean.
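To show what ttest_1samp computes, the sketch below calculates the t-statistic by hand as the difference between the sample mean and the hypothesized mean, divided by the standard error of the mean, and compares it with SciPy's result. A fixed random seed is used (an addition for reproducibility) so the two values can be compared on the same data.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Reproducible sample from a normal distribution
rng = np.random.default_rng(42)
data = rng.normal(loc=10, scale=2, size=100)
pop_mean = 9

n = data.size
sample_mean = data.mean()
# ddof=1 gives the sample standard deviation (divides by n - 1)
sample_sd = data.std(ddof=1)

# t = (sample mean - hypothesized mean) / (sd / sqrt(n))
t_manual = (sample_mean - pop_mean) / (sample_sd / np.sqrt(n))

t_scipy, p_val = ttest_1samp(data, pop_mean)
print("Manual t-statistic: {:.3f}".format(t_manual))
print("SciPy t-statistic: {:.3f}".format(t_scipy))
print("p-value: {:.3f}".format(p_val))
```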
Independent samples t-test
An independent samples t-test is used to compare the means of two independent groups to determine if they are significantly different. This test is used when the two groups being compared are completely independent of each other.
To perform an independent samples t-test in Python, we can use the scipy.stats.ttest_ind function from the SciPy library. Here’s an example:
import numpy as np
from scipy.stats import ttest_ind
# Generate two independent samples of data
sample1 = np.random.normal(loc=10, scale=2, size=100)
sample2 = np.random.normal(loc=12, scale=2, size=100)
# Perform the independent samples t-test
t_stat, p_val = ttest_ind(sample1, sample2)
# Print the results
print("t-statistic: {:.3f}".format(t_stat))
print("p-value: {:.3f}".format(p_val))
In this example, we first generate two independent samples of data using the numpy.random.normal function. We then perform the independent samples t-test using the ttest_ind function, which takes two arguments: the two samples being compared. The function returns two values: the t-statistic and the p-value.
Finally, we print the results using the print function, formatting the t-statistic and p-value to three decimal places.
If the p-value is less than the significance level (usually 0.05), we can reject the null hypothesis and conclude that the means of the two groups are significantly different. Otherwise, we fail to reject the null hypothesis and conclude that there is not enough evidence to suggest a significant difference between the means of the two groups.
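When the equal-variance assumption is doubtful, ttest_ind can run Welch's t-test instead by passing equal_var=False. The sketch below uses two made-up samples with deliberately different spreads and sizes, and also computes the Welch statistic by hand for comparison.

```python
import numpy as np
from scipy.stats import ttest_ind

# Two independent samples with unequal variances and sizes (illustrative)
rng = np.random.default_rng(0)
sample1 = rng.normal(loc=10, scale=1, size=50)
sample2 = rng.normal(loc=12, scale=4, size=80)

# Student's t-test (assumes equal variances; the default)
t_student, p_student = ttest_ind(sample1, sample2)

# Welch's t-test (does not assume equal variances)
t_welch, p_welch = ttest_ind(sample1, sample2, equal_var=False)

# Welch statistic by hand: difference in means divided by
# sqrt(s1^2 / n1 + s2^2 / n2)
se = np.sqrt(sample1.var(ddof=1) / sample1.size + sample2.var(ddof=1) / sample2.size)
t_manual = (sample1.mean() - sample2.mean()) / se

print("Student t: {:.3f}, p: {:.3g}".format(t_student, p_student))
print("Welch t:   {:.3f}, p: {:.3g}".format(t_welch, p_welch))
```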
Paired samples t-test
A paired samples t-test is a statistical test used to determine whether there is a statistically significant difference between the means of two related groups. In other words, it helps us determine whether the two groups are significantly different from each other or not.
To perform a paired samples t-test in Python, we can use the scipy.stats module, which contains a variety of statistical functions including the ttest_rel() function. This function computes the t-test for two related samples of scores.
Here is an example code snippet for performing a paired samples t-test in Python:
import numpy as np
from scipy.stats import ttest_rel
# Create two related random samples of data
before = np.random.normal(5, 1, 100)
after = before + np.random.normal(1, 0.5, 100)
# Compute the t-test
t_stat, p_val = ttest_rel(before, after)
# Print the results
print("t-statistic: {}".format(t_stat))
print("p-value: {}".format(p_val))
In this example, we first create two related random samples of data using the numpy.random.normal() function. We create the second sample by adding some random noise to the first sample. We then compute the paired samples t-test for these two samples using the ttest_rel() function. The function returns two values: the t-statistic and the p-value.
Finally, we print the results of the test using the print() function. If the p-value is less than the significance level (usually 0.05), we can reject the null hypothesis and conclude that the means of the two related groups are significantly different. Otherwise, we fail to reject the null hypothesis and conclude that there is not enough evidence to suggest a significant difference between the means of the two related groups.
It’s important to note that a paired samples t-test assumes that the differences between the pairs of observations are normally distributed. If this assumption is not met, other tests or transformations may be needed. Additionally, like any statistical test, it’s important to carefully consider the context and limitations of the test and to avoid drawing causal conclusions from statistical associations alone.
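The role of the differences becomes clearer once you see that a paired samples t-test is equivalent to a one-sample t-test on the pairwise differences against a mean of zero. The following sketch, with a fixed random seed added for reproducibility, checks this equivalence numerically.

```python
import numpy as np
from scipy.stats import ttest_1samp, ttest_rel

# Two related samples: "after" is "before" plus a noisy shift
rng = np.random.default_rng(1)
before = rng.normal(5, 1, 100)
after = before + rng.normal(1, 0.5, 100)

# Paired samples t-test
t_rel, p_rel = ttest_rel(before, after)

# One-sample t-test on the differences, testing against a mean of 0
differences = before - after
t_diff, p_diff = ttest_1samp(differences, 0)

print("Paired t-test:       t = {:.3f}, p = {:.3g}".format(t_rel, p_rel))
print("One-sample on diffs: t = {:.3f}, p = {:.3g}".format(t_diff, p_diff))
```

Because only the differences enter the calculation, it is their distribution, not that of the raw measurements, that needs to be approximately normal.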
To close, T-tests are useful because they provide a simple and easy-to-interpret method for comparing two groups of data. They are widely used in a variety of fields including psychology, medicine, education, and more. However, it’s important to note that t-tests have certain assumptions, such as normality of the data and equal variances, which need to be met for the test to be valid. It’s also important to use caution when interpreting t-test results and to consider the context and limitations of the test.
Comments welcome!
-
Statistical Hypothesis Testing
Hypothesis testing is a statistical method used to determine whether a hypothesis about a population parameter is supported by the data. It is a powerful tool for making decisions based on data, and is widely used in many fields including medicine, social sciences, and business.
The basic steps in hypothesis testing are as follows:
Formulate the null and alternative hypotheses: The null hypothesis is the statement that the population parameter is equal to a specified value, while the alternative hypothesis is the statement that the population parameter is not equal to the specified value. For example, if you want to test whether the mean height of a population is 65 inches, the null hypothesis would be “the mean height is equal to 65 inches” and the alternative hypothesis would be “the mean height is not equal to 65 inches.”
Choose a level of significance: The level of significance is the probability of rejecting the null hypothesis when it is actually true. Commonly used levels of significance are 0.05 (5%) and 0.01 (1%).
Collect data and calculate test statistic: Next, you need to collect a sample of data and calculate a test statistic, which is a measure of how far the sample data is from what is expected under the null hypothesis. The test statistic will depend on the type of test being used, such as t-test or chi-squared test.
Determine the p-value: The p-value is the probability of obtaining a test statistic as extreme or more extreme than the observed test statistic, assuming the null hypothesis is true. If the p-value is less than the chosen level of significance, then the null hypothesis is rejected and the alternative hypothesis is supported.
Interpret the results: Finally, the results of the hypothesis test need to be interpreted in the context of the problem being studied. If the null hypothesis is rejected, it may be concluded that there is evidence to support the alternative hypothesis. However, if the null hypothesis is not rejected, it cannot be concluded that the null hypothesis is true, only that there is not enough evidence to reject it.
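The five steps above can be sketched in a few lines of Python using the height example. The sample is simulated, so the true mean of 66 inches, the standard deviation of 3, and the sample size of 40 are assumptions made purely for illustration.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Step 1: formulate the hypotheses
# H0: the mean height is equal to 65 inches
# H1: the mean height is not equal to 65 inches
hypothesized_mean = 65

# Step 2: choose a level of significance
alpha = 0.05

# Step 3: collect data (simulated here) and calculate the test statistic
rng = np.random.default_rng(7)
heights = rng.normal(loc=66, scale=3, size=40)
t_stat, p_value = ttest_1samp(heights, hypothesized_mean)

# Step 4: determine the p-value and compare it with alpha
reject_null = p_value < alpha

# Step 5: interpret the result in context
print("t-statistic: {:.3f}, p-value: {:.4f}".format(t_stat, p_value))
if reject_null:
    print("Reject H0: the data suggest the mean height differs from 65 inches.")
else:
    print("Fail to reject H0: not enough evidence that the mean differs from 65 inches.")
```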
Hypothesis testing is a powerful tool for making decisions based on data, but it is important to use it correctly and to interpret the results carefully. When conducting a hypothesis test, it is important to ensure that the assumptions of the test are met, and to choose the appropriate test based on the type of data being analyzed. By following the steps outlined above and taking care to interpret the results correctly, hypothesis testing can be a valuable tool for making evidence-based decisions.
There are many different types of hypothesis tests, each suited to different types of data and research questions. Here are a few of the most common types:
One-sample t-test: This test is used to compare the mean of a single sample to a known or hypothesized population mean.
Independent samples t-test: This test is used to compare the means of two independent groups.
Paired samples t-test: This test is used to compare the means of two dependent (paired) groups.
One-way ANOVA: This test is used to compare the means of more than two independent groups.
Two-way ANOVA: This test is used to examine how two categorical factors, and the interaction between them, affect the mean of a dependent variable.
Chi-squared test: This test is used to compare the frequencies of categorical data between two or more groups.
Mann-Whitney U test: This non-parametric test is used to compare the medians of two independent groups when the data are not normally distributed.
Kruskal-Wallis test: This non-parametric test is used to compare the medians of more than two independent groups when the data are not normally distributed.
Wilcoxon signed-rank test: This non-parametric test is used to compare the medians of two dependent groups when the data are not normally distributed.
Friedman test: This non-parametric test is used to compare the medians of more than two dependent groups when the data are not normally distributed.
These are just a few examples of the many types of hypothesis tests that are used in statistical analysis. Choosing the right test for a given research question depends on the type of data being analyzed and the specific hypotheses being tested.
Comments welcome!
-
Statistical Distributions
In this article we will cover some distributions that I have found useful while analysing data. I have split them based on whether they are for a continuous or a discrete random variable. For each one, I first give a small theoretical introduction to the distribution and then show how to use Python to represent it graphically.
Continuous Distributions:
Uniform distribution
Normal Distribution, also known as Gaussian distribution
Standard Normal Distribution - case of normal distribution where loc or mean = 0 and scale or sd = 1
Gamma distribution - exponential, chi-squared, erlang distributions are special cases of the gamma distribution
Erlang distribution - special form of Gamma distribution when the shape parameter a is a positive integer
Exponential distribution - special form of Gamma distribution with a=1
Lognormal - not covered
Chi-Squared - not covered
Weibull - not covered
t Distribution - not covered
F Distribution - not covered
Discrete Distributions:
Poisson distribution is a limiting case of a binomial distribution under the following conditions: n tends to infinity, p tends to zero and np is finite
Binomial Distribution
Negative Binomial - not covered
Bernoulli Distribution is a special case of the binomial distribution where a single trial is conducted (n=1)
Geometric - not covered
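The Poisson limit mentioned in the list above can be checked numerically. In the sketch below, np is held fixed at 3 (an arbitrary illustrative value) while n grows, and the largest gap between the binomial and Poisson probability mass functions is printed for each n.

```python
import numpy as np
from scipy.stats import binom, poisson

lam = 3.0              # fixed value of n * p (illustrative choice)
k = np.arange(0, 15)   # outcomes at which the pmfs are compared

max_diffs = []
for n in (10, 100, 10000):
    p = lam / n        # p shrinks as n grows, so n * p stays at lam
    diff = np.abs(binom.pmf(k, n, p) - poisson.pmf(k, lam)).max()
    max_diffs.append(diff)
    print("n = {:>5}: largest pmf difference = {:.6f}".format(n, diff))
```

The gap shrinks as n increases, which is what it means for the Poisson distribution to be a limiting case of the binomial.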
Let’s import some basic libraries that we will be using:
import numpy as np
import pandas as pd
import scipy.stats as spss
import plotly.express as px
import seaborn as sns
Continuous Distributions
Uniform distribution
As the name suggests, in a uniform distribution the probability of all outcomes is the same. The shape of this distribution is a rectangle. Now, let’s plot this using Python. First we will generate an array of random variates using scipy. We will specifically use the scipy.stats.uniform.rvs function with the following three inputs:
size specifies the number of random variates
loc corresponds to the lower bound of the distribution
scale corresponds to the width of the distribution, so values fall between loc and loc + scale
rv_array = spss.uniform.rvs(size=10000, loc = 10, scale=20)
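A quick way to see what loc and scale mean for scipy.stats.uniform is to inspect the generated values. In this sketch, loc is the lower bound and scale is the width, so with loc=10 and scale=20 every value lies between 10 and 30. The random_state argument is added here only to make the run reproducible.

```python
from scipy.stats import uniform

loc, scale = 10, 20

# Draw samples; random_state fixes the seed for reproducibility
samples = uniform.rvs(size=10000, loc=loc, scale=scale, random_state=0)

# A "frozen" distribution object exposes the theoretical moments
dist = uniform(loc=loc, scale=scale)

print("Smallest sample:", samples.min())   # never below loc
print("Largest sample:", samples.max())    # never above loc + scale
print("Theoretical mean:", dist.mean())    # loc + scale / 2
print("Theoretical sd:", dist.std())       # scale / sqrt(12)
```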
Now we can plot this using the plotly library or the seaborn library. In fact, seaborn has a couple of different functions, namely distplot and histplot, both of which can be used to view the uniform data visually; note that distplot is deprecated in recent seaborn releases in favor of histplot and displot. Let’s see the examples one by one:
We can directly plot the data from the array:
px.histogram(rv_array) # plotted using plotly express
sns.histplot(rv_array, kde=True) # plotted using seaborn
Or we can convert array into a dataframe and then plot the data frame:
rv_df = pd.DataFrame(rv_array, columns=['value_of_random_variable'])
px.histogram(rv_df, x='value_of_random_variable', nbins=20) # plotted using plotly express
sns.histplot(data=rv_df, x='value_of_random_variable', kde=True) # plotted using seaborn
Normal Distribution, also known as Gaussian distribution:
The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena.
The normal distribution is a limiting case of the Poisson distribution as the parameter lambda tends to infinity. Additionally, since the Poisson distribution is itself a limiting case of the binomial distribution, the normal distribution can also be seen as a limiting case of the binomial distribution.
This distribution has a bell-shaped density curve described by its mean and standard deviation. The mean represents the location and the sd represents the spread of the distribution. The curve shows that data near the mean occur more frequently than data far from the mean.
Let's plot it using seaborn:
rv_array = spss.norm.rvs(size=10000,loc=10,scale=100) # size specifies number of random variates, loc corresponds to mean, scale corresponds to standard deviation
sns.histplot(rv_array, kde=True)
We can add x and y labels, change the number of bins, color of bars, etc. With distplot we can supply additional arguments for adjusting width of bars, transparency, etc.
ax = sns.distplot(rv_array, bins=100, kde=True, color='cornflowerblue', hist_kws={"linewidth": 15,'alpha':1})
ax.set(xlabel='Normal Distribution', ylabel='Frequency')
Standard Normal Distribution
This is a special case of the normal distribution where mean = 0 and sd = 1
Let's plot it using seaborn:
rv_array = spss.norm.rvs(size=10000,loc=0,scale=1)
sns.histplot(rv_array, kde=True)
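To see how any normal sample relates to the standard normal, here is a small sketch that standardizes a sample by subtracting its mean and dividing by its standard deviation. The names x and z are just for illustration and are not used elsewhere in this article.

```python
import scipy.stats as spss

# A normal sample with mean 10 and standard deviation 100.
x = spss.norm.rvs(size=10000, loc=10, scale=100, random_state=0)

# Standardize: subtract the sample mean, divide by the sample sd.
z = (x - x.mean()) / x.std()

# After standardizing, the mean is about 0 and the sd is about 1.
print(z.mean(), z.std())
```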
Gamma distribution is a two-parameter family of continuous probability distributions
Exponential, chi-squared, and Erlang distributions are special cases of the gamma distribution
Let's plot it using seaborn:
rv_array = spss.gamma.rvs(a=5, size=10000) # size specifies number of random variates, a is the shape parameter
sns.distplot(rv_array, kde=True)
Erlang distribution
Special case of Gamma distribution when a is an integer.
Exponential distribution
Special case of Gamma distribution with a=1.
Exponential distribution describes the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate.
Let's plot it using seaborn:
rv_array = spss.expon.rvs(scale=1,loc=0,size=1000) # size specifies number of random variates, loc shifts the distribution, scale = 1/lambda is the mean when loc=0
sns.distplot(rv_array, kde=True)
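The special-case relationships can be checked numerically. The sketch below compares scipy's gamma density with a=1 against the exponential density, and the gamma density with an integer shape against scipy.stats.erlang. The grid x and the shape value 3 are arbitrary choices of mine.

```python
import numpy as np
import scipy.stats as spss

# Points at which to evaluate the densities.
x = np.linspace(0.1, 5, 50)

# Gamma with shape a=1 versus the exponential distribution.
gamma_a1 = spss.gamma.pdf(x, a=1)
expon_pdf = spss.expon.pdf(x)
print(np.allclose(gamma_a1, expon_pdf))

# Gamma with an integer shape versus the Erlang distribution.
gamma_a3 = spss.gamma.pdf(x, a=3)
erlang_a3 = spss.erlang.pdf(x, a=3)
print(np.allclose(gamma_a3, erlang_a3))
```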
Discrete Distributions
Binomial Distribution
A distribution where each trial has only two possible outcomes, such as success or failure, gain or loss, win or lose. The probability of success is the same for every trial, the two outcomes need not be equally likely, and the trials are independent of each other.
The probability of observing k successes in n trials is given by the equation: f(k;n,p) = nCk * (p^k) * ((1-p)^(n-k))
Where, nCk = (n)! / ((k)! * (n-k)!)
n=total number of trials
p=probability of success in each trial
Let's plot it using seaborn:
rv_array = spss.binom.rvs(n=10,p=0.8,size=10000) # n = number of trials, p = probability of success, size = number of times to repeat the trials
sns.distplot(rv_array, kde=True)
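To connect the formula to the library, here is a minimal sketch that evaluates f(k;n,p) by hand and compares it with scipy.stats.binom.pmf. The helper name binom_pmf is my own and is not a scipy function.

```python
from math import comb

import numpy as np
import scipy.stats as spss

n, p = 10, 0.8

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials (formula above)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

k_values = np.arange(n + 1)
manual = np.array([binom_pmf(int(k), n, p) for k in k_values])
library = spss.binom.pmf(k_values, n, p)

print(np.allclose(manual, library))  # the two calculations should agree
print(manual.sum())                  # probabilities over k = 0..n sum to 1
```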
Poisson Distribution
A Poisson random variable is typically used to model the number of times an event happens in a time interval. For example, the number of users registering for a web service in an interval can be thought of as a Poisson process. The Poisson distribution is described by a single parameter: the average number of events in an interval, designated λ (lambda) and also called the event rate or rate parameter. scipy calls this parameter mu.
The probability of observing k events in an interval is given by the equation: P(k events in interval) = e^(-lambda) * (lambda^k / k!)
Poisson distribution is a limiting case of a binomial distribution under the following conditions:
The number of trials is indefinitely large or n tends to infinity
The probability of success for each trial is same and indefinitely small or p tends to zero
np = lambda, is finite.
Let's plot it using seaborn:
rv_array = spss.poisson.rvs(mu=3, size=10000) # size specifies number of random variates, mu is the rate parameter lambda (the average number of events per interval)
sns.distplot(rv_array, kde=True)
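The limiting relationship with the binomial distribution can be illustrated numerically. This sketch holds np = lambda fixed at 3 and lets n grow, then measures the largest gap between the binomial and Poisson probabilities. The values of n and the range of k are arbitrary choices of mine.

```python
import numpy as np
import scipy.stats as spss

lam = 3                      # np is held fixed at this value
k = np.arange(0, 15)         # event counts to compare
poisson_pmf = spss.poisson.pmf(k, lam)

gaps = {}
for n in (10, 100, 10000):
    # p shrinks as n grows so that n * p stays equal to lam.
    binom_pmf = spss.binom.pmf(k, n, lam / n)
    gaps[n] = np.max(np.abs(binom_pmf - poisson_pmf))
    print(n, gaps[n])        # the gap shrinks as n increases
```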
Bernoulli distribution
This distribution has only two possible outcomes, 1 (success) and 0 (failure), and a single trial, for example, a coin toss. The random variable X which has a Bernoulli distribution can take value 1 with the probability of success, p, and the value 0 with the probability of failure, q or 1-p. The probabilities of success and failure need not be equally likely.
Probability mass function of Bernoulli distribution: f(k;p) = (p^k) * ((1-p)^(1-k))
Bernoulli distribution is a special case of the binomial distribution where a single trial is conducted (n=1)
Let's plot it using seaborn:
rv_array = spss.bernoulli.rvs(size=10000,p=0.6) # p = probability of success, size = number of times to repeat the trial
sns.distplot(rv_array, kde=True)
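As a small check of the n=1 relationship, the sketch below evaluates the Bernoulli probability mass function three ways: with scipy.stats.bernoulli, with scipy.stats.binom using n=1, and directly from the formula. The variable names are mine.

```python
import numpy as np
import scipy.stats as spss

p = 0.6
k = np.array([0, 1])                      # the only two outcomes

bern = spss.bernoulli.pmf(k, p)           # Bernoulli distribution
binom_n1 = spss.binom.pmf(k, 1, p)        # binomial with a single trial
formula = p**k * (1 - p)**(1 - k)         # f(k;p) written out

print(bern, binom_n1, formula)            # all three should match
```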
Hope you found this summary of distributions useful. I refer to this from time to time to jog my memory on the various distributions.
Comments welcome!
-
Visualize data using SAS
This is the third of a series of articles that I will write to give a gentle introduction to statistics. In this article we will cover how we can visualize data using various charts and how to read them. I will show how to create these charts using SAS and will include code snippets as well. For a full version of the code visit my GitHub repository.
SAS has an in-built procedure called sgplot that allows you to create several kinds of plots. Also available is proc univariate which allows you to create histograms and normal probability plots, also known as the QQ plots. In this article we will work with the tips dataset that we also used for our Python demonstration.
Before we start plotting, we need to import the dataset. In SAS we can do this using proc import.
proc import datafile='/home/u50248307/data/tips.csv'
out=tips
dbms=csv
replace;
getnames=yes;
run;
Once we have imported the dataset, we can view it using the proc print statement.
proc print data=tips;
run;
Let's take a quick look at how the tips dataset is structured:
We can further see some summary information on the dataset using proc contents statement.
proc contents data=tips;
run;
You will notice that I am ending all lines with a semicolon. Unlike Python, SAS does not depend on indentation to show scope of statements. Therefore we use the semicolon, just like we do in C++ to signify end of a statement.
Now let's move to visualizing this data. We will cover the following charts in this article:
Dot plot shows changes between two (or more) points in time or between two (or more) conditions.
proc sgplot data=tips;
title 'Mean of Total bill by Day';
dot day / response=total_bill stat=mean;
xaxis label='Mean of Total Bill';
yaxis label='Day';
run;
proc sgplot data=tips;
title 'Mean of Total bill by Day by Gender';
dot day / response=total_bill group=sex stat=mean;
xaxis label='Mean of Total Bill';
yaxis label='Day';
run;
Bar (horizontal and vertical) chart is used when you want to show a distribution of data points or perform a comparison of metric values across different subgroups of your data.
/* horizontal bar chart */
proc sgplot data=tips;
title 'Mean Total bill by Day';
/*hbar day;*/ /*if no other option is specified then it just shows row frequency by cat variable*/
hbar day / response=total_bill stat=mean;
/*hbar day / response=tip stat=mean y2axis;*/
run;
/* vertical bar chart */
proc sgplot data=tips;
title 'Mean Total bill by Day';
vbar day / response=total_bill stat=mean;
run;
/* XAXISTABLE and YAXISTABLE statements create axis tables which display data values at specific locations along an axis. The only required argument is a list of one or more variables to be displayed. */
proc sgplot data=tips;
title 'Mean Total bill by Day: XAXISTABLE and YAXISTABLES Example';
vbar day / response=total_bill stat=mean;
xaxistable tip size / stat=mean position=top;
xaxistable total_bill / stat=freq label="N"; /* the total_bill variable was chosen arbitrarily in order to obtain frequency counts of the number of records in each category */
run;
Stacked Bar chart is useful when you want to show more than one categorical variable per bar
/* Offset Dual Horizontal Bar Plot */
proc sgplot data=tips;
title 'Dual Bar Chart: Mean Total bill and Tip by Day';
hbar day / response=total_bill stat=mean barwidth=0.25 discreteoffset=-0.15;
hbar day / response=tip stat=mean barwidth=0.25 discreteoffset=0.15 y2axis;
run;
/* Offset Dual Vertical Bar Plot */
proc sgplot data=tips;
title 'Dual Vertical Bar Chart: Mean Total bill and Tip by Day';
vbar day / response=total_bill stat=mean barwidth=0.25 discreteoffset=-0.15;
vbar day / response=tip stat=mean barwidth=0.25 discreteoffset=0.15 y2axis;
run;
/* Stacked Bar Plot */
proc sgplot data=tips;
title 'Stacked Bar Chart with Data Labels';
vbar day / response=total_bill group=sex stat=mean datalabel datalabelattrs=(weight=bold);
xaxis display=(nolabel);
yaxis grid label='total_bill';
run;
Needle plot is similar to both a bar plot and a scatter plot: each observation is drawn as a vertical line from the baseline to the data point. It can be used to plot datasets that have too many data points for a bar plot to be meaningful.
proc sgplot data=tips;
title 'Needle Chart: Total bill by Meal size';
needle x=size y=total_bill ;
run;
proc sgplot data=tips;
title 'Needle Chart: Tip by Day';
needle x=day y=tip ;
run;
Boxplot (horizontal and vertical) In a box plot, numerical data is divided into quartiles, and a box is drawn between the first and third quartiles, with an additional line drawn along the second quartile to mark the median. In some box plots, the minimums and maximums outside the first and third quartiles are depicted with lines, which are often called whiskers.
/* Vertical Box plot */
proc sgplot data=tips;
title 'Vertical Box plot';
vbox total_bill / category=day boxwidth=0.25 discreteoffset=-0.15;
vbox tip / category=day boxwidth=0.25 discreteoffset=0.15 y2axis;
run;
/* Horizontal Box plot */
proc sgplot data=tips;
title 'Horizontal Box plot';
hbox total_bill / category=day boxwidth=0.25 discreteoffset=-0.15;
hbox tip / category=day boxwidth=0.25 discreteoffset=0.15 y2axis;
run;
Histogram is a visual representation of the frequency distribution of your data. The frequencies are represented by bars.
proc sgplot data=tips;
title'Histogram using Proc Sgplot';
histogram total_bill;
run;
proc univariate data=tips;
title'Histogram using Proc Univariate';
histogram total_bill;
run;
Probability Plot is a way of visually comparing data coming from different distributions. It can be of two types - pp plot or qq plot
pp plot (Probability-to-Probability) plots the cumulative distribution functions (CDFs) of two distributions (empirical and theoretical) against each other.
qq plot (Quantile-to-Quantile) is used to compare the quantiles of two distributions. Quantiles are cut points that divide a distribution into intervals of equal probability, or that divide a sample in the same way. The distributions may be theoretical or sample distributions from a process, etc.
Normal probability plot is a special case of the qq plot. It is a way of checking whether the dataset is normally distributed or not
proc univariate data=tips;
title'Normal probability (QQ) plot using Proc Univariate';
probplot total_bill;
run;
Scatter plot shows the relationship between two numerical variables.
proc sgplot data=tips;
title "total bill vs tip by gender";
scatter x=total_bill y=tip / group=sex markerattrs=(symbol=Square size=10px);
/* SYMBOL: Circle, CircleFilled, Square, Star, Plus, X
SIZE: 0.2in, 3mm, 10pt, 5px, 25pct
COLOR: red, blue, lightscreen, aquamarine, CXFFFFFF */
refline 6 / axis=y lineattrs=(color=green thickness=3px pattern=ShortDashDot); /* REFLINE statement adds horizontal or vertical reference lines to a plot. Its unnamed required argument is a numeric variable, value, or list of values. A reference line will be added for each value listed or for each value of the variable specified. */
run;
/* Scatter plot with attribute cycling - when multiple lists of attributes are specified on the STYLEATTRS statement (for example, a list of marker shapes and a list of marker colors) */
proc sgplot data=tips;
title "total bill vs tip by gender";
styleattrs datasymbols=(SquareFilled CircleFilled) datacontrastcolors=(purple green);
scatter x=total_bill y=tip / group=sex markerattrs=(size=10px);
/* SYMBOL: Circle, CircleFilled, Square, Star, Plus, X
SIZE: 0.2in, 3mm, 10pt, 5px, 25pct
COLOR: red, blue, lightscreen, aquamarine, CXFFFFFF */
run;
/* SYMBOLCHAR statement is used to define a marker symbol from a Unicode value. */
proc sgplot data=tips;
title "Total bill vs Tip by Gender";
scatter x=total_bill y=tip / group=sex markerattrs=(size=40);
symbolchar name=female_sign char="2640"x; /* identifiers “female_sign” and “male_sign” are arbitrary names */
symbolchar name=male_sign char="2642"x;
styleattrs datasymbols=(female_sign male_sign);
run;
/* Using Data as a Symbol Marker */
proc sgplot data=tips;
title "Total bill vs Tip by Gender";
scatter x=total_bill y=tip / group=sex markerchar=sex markercharattrs=(weight=bold size=10pt);
run;
Line plot is used to visualize the value of something over time. VLINE statement is used to create a vertical line chart (which consists of horizontal lines). The endpoints of the line segments are statistics based on a categorical variable as opposed to raw data values.
The KEYLEGEND statement, used in one of the examples in this section, accepts the following options:
LOCATION= specifies whether the legend will appear INSIDE or OUTSIDE (default) the axis area
POSITION= specifies the position of the legend: TOP, BOTTOM (default), LEFT, RIGHT, TOPLEFT, TOPRIGHT, BOTTOMLEFT, BOTTOMRIGHT
DOWN= specifies the number of rows in the legend
ACROSS= specifies the number of columns in the legend
TITLEATTRS= specifies text attributes of the legend title
VALUEATTRS= specifies text attributes of the legend values
/* Basic line plot */
proc sgplot data=tips;
title 'Line chart showing Average total bill by Day';
vline day / response=total_bill stat=mean markers;
run;
/* Line Chart with Dual Axes */
proc sgplot data=tips;
title 'Line Chart with Dual Axes';
vline day / response=total_bill stat=mean markers;
vline day / response=tip stat=mean markers y2axis;
run;
/* Line Chart by group with Modifying Line Attributes and Legend */
proc sgplot data=tips;
title 'Line Chart by group with Modifying Line Attributes and Legend';
styleattrs datasymbols=(TriangleFilled CircleFilled) datalinepatterns=(ShortDash LongDash);
vline day / response=total_bill stat=mean markers group=sex lineattrs=(thickness=4px);
keylegend / location=inside position=topleft across=1 titleattrs=(weight=bold size=12pt) valueattrs=(color=green size=12pt);
run;
/* XAXIS and YAXIS statements are used to control the features and structure of the X and Y axes, respectively */
proc sgplot data=tips;
title 'Line plot: XAXIS and YAXIS statements';
vline size / response=total_bill stat=mean;
vline size / response=tip stat=mean y2axis;
yaxis min=0 max=40 minor minorcount=9 valueattrs=(style=italic) label='Total Bill ($)';
y2axis offsetmin=0.1 offsetmax=0.1 labelattrs=(color=purple);
/* offsets are proportional to axis length, so between 0 and 1 */
run;
The best way to get better at visualization is through practice. What I have found useful is participating in a weekly visualization challenge called the TidyTuesday!
Comments welcome!
-
Visualize data using Python
This is the second of a series of articles that I will write to give a gentle introduction to statistics. In this article we will cover how we can visualize data using various charts and how to read them. I will show how to create these charts using Python and will include code snippets as well. For a full version of the code visit my GitHub repository.
Python has many libraries that allow creating visually appealing charts. In this article we will work with the in-built tips dataset and then plot using the following libraries:
import seaborn as sns
tips = sns.load_dataset("tips") # tips dataset can be loaded from seaborn
sns.get_dataset_names() # to get a list of other available datasets
import plotly.express as px
tips = px.data.tips() # tips dataset can be loaded from plotly
# data_canada = px.data.gapminder().query("country == 'Canada'")
import pandas as pd
tips.to_csv('/Users/vivekparashar/Downloads/tips.csv') # we can save the dataset into a csv and then load it into SAS or R for plotting
import altair as alt
import statsmodels.api as sm
Let's take a quick look at how the tips dataset is structured:
We will cover the following charts in this article:
Dot plot shows changes between two (or more) points in time or between two (or more) conditions.
# Using plotly library
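t = tips.groupby(['day','sex'])['total_bill'].mean().reset_index() # select the numeric column before averaging; recent pandas versions raise an error when mean() meets text columns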
px.scatter(t, x='day', y='total_bill', color='sex',
title='Average bill by gender by day',
labels={'day':'Day of the week', 'total_bill':'Average Bill in $'})
Bar (horizontal and vertical) chart is used when you want to show a distribution of data points or perform a comparison of metric values across different subgroups of your data.
# Using pandas plot
tips.groupby('sex')['total_bill'].mean().plot(kind='bar')
tips.groupby('sex')['tip'].mean().plot(kind='barh')
# Using plotly
t = tips.groupby(['day','sex'])['total_bill'].mean().reset_index()
px.bar(t, x='day', y='total_bill') # Using plotly
px.bar(t, x='total_bill', y="day", orientation='h')
Stacked Bar chart is useful when you want to show more than one categorical variable per bar
# using pandas plot; kind='barh' for horizontal plot
# need to unstack one of the levels and fill na values
tips.groupby(['day','sex'])['total_bill'].mean()\
.unstack('sex').fillna(0)\
.plot(kind='bar', stacked=True)
# Using plotly
t = tips.groupby(['day','sex'])['total_bill'].mean().reset_index()
px.bar(t, x="day", y="total_bill", color="sex", title="Average bill by Gender and Day") # vertical
px.bar(t, x="total_bill", y="day", color="sex", title="Average bill by Gender and Day", orientation='h') # horizontal
Boxplot (horizontal and vertical) In a box plot, numerical data is divided into quartiles, and a box is drawn between the first and third quartiles, with an additional line drawn along the second quartile to mark the median. In some box plots, the minimums and maximums outside the first and third quartiles are depicted with lines, which are often called whiskers.
# using pandas plot
# in plotly and seaborn, pass y=variable for a vertical box plot or x=variable for a horizontal one
tips[['total_bill']].plot(kind='box')
# using plotly
px.box(tips, y='total_bill')
# using seaborn
sns.boxplot(y=tips["total_bill"])
Violin plot is a variation of box plot that also shows a kernel density estimate of the data
# Using seaborn
sns.violinplot(y=tips.total_bill)
sns.violinplot(data=tips, x='day', y='total_bill',
hue='smoker',
palette='muted', split=True,
scale='count', inner='quartile',
order=['Thur','Fri','Sat','Sun'])
sns.catplot(x='sex', y='total_bill',
hue='smoker', col='time',
data=tips, kind='violin', split=True,
height=4, aspect=.7)
Histogram is a visual representation of the frequency distribution of your data. The frequencies are represented by bars.
# using pandas plot
tips.total_bill.plot(kind='hist')
# using plotly
px.histogram(tips, x="total_bill")
# using seaborn
sns.histplot(data=tips, x="total_bill")
# using altair
alt.Chart(tips).mark_bar().encode(alt.X('total_bill:Q', bin=True),y='count()')
Probability Plot is a way of visually comparing data coming from different distributions. It can be of two types - pp plot or qq plot
pp plot (Probability-to-Probability) plots the cumulative distribution functions (CDFs) of two distributions (empirical and theoretical) against each other.
qq plot (Quantile-to-Quantile) is used to compare the quantiles of two distributions. Quantiles are cut points that divide a distribution into intervals of equal probability, or that divide a sample in the same way. The distributions may be theoretical or sample distributions from a process, etc.
Normal probability plot is a special case of the qq plot. It is a way of checking whether the dataset is normally distributed or not
# using statsmodels
import statsmodels.graphics.gofplots as sm
import numpy as np
sm.ProbPlot(np.array(tips.total_bill)).ppplot(line='s')
sm.ProbPlot(np.array(tips.total_bill)).qqplot(line='s')
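If you want a number to go with the picture, here is a minimal sketch using scipy.stats.probplot, which is a different function from the statsmodels one above. It returns the correlation r between the sample quantiles and the theoretical normal quantiles. To keep the sketch self-contained I use two synthetic samples instead of the tips data; the sample sizes and parameters are arbitrary.

```python
import numpy as np
import scipy.stats as spss

rng = np.random.default_rng(0)

# Synthetic stand-ins for real data: one normal, one right-skewed.
normal_sample = rng.normal(loc=20, scale=5, size=500)
skewed_sample = rng.exponential(scale=5, size=500)

# probplot returns the qq points and a least-squares fit through them.
(_, _), (_, _, r_normal) = spss.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = spss.probplot(skewed_sample, dist="norm")

# The closer r is to 1, the closer the qq points are to a straight line.
print(r_normal, r_skewed)
```

A straight qq plot (r near 1) is consistent with normally distributed data, while the skewed sample bends away from the line and gives a lower r.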
Scatter plot shows the relationship between two numerical variables.
# using plotly
px.scatter(tips, x='total_bill', y='tip', color='sex', size='size', hover_data=['day'])
# using pandas plot
tips.plot(x='total_bill', y='tip', kind='scatter')
Reg plot creates a regression line between 2 parameters and helps to visualize their linear relationships
# using seaborn
sns.regplot(x="total_bill", y="tip", data=tips, marker='+')
# for categorical variables we can add jitter to see overlapping points
sns.regplot(x="size", y="total_bill", data=tips, x_jitter=.1)
Line plot is used to visualize the value of something over time
# using pandas plot
tips['total_bill'].plot(kind='line')
# using plotly
px.line(tips, y='total_bill', title='Total bill')
t = tips.groupby('day').sum()[['total_bill']].reset_index()
px.line(t, x='day',y='total_bill', title='Total bill by day')
# using altair
alt.Chart(t).mark_line().encode(x='day', y='total_bill')
# using seaborn
sns.lineplot(data=t, x='day', y='total_bill')
Area plot is like a line chart in terms of how data values are plotted on the chart and connected using line segments. In an area plot, however, the area between the line segments and the x-axis is filled with color.
# using pandas plot
tips.groupby('day').sum()[['total_bill']].plot(kind='area')
# stacked area can be done using pandas.plot as well
t = tips.groupby(['day','sex']).count()[['total_bill']].reset_index()
t_pivoted = t.pivot(index='day', columns='sex', values='total_bill')
t_pivoted.plot.area()
# using plotly
px.area(t, x='day', y='total_bill', color='sex',line_group='sex')
# using altair
alt.Chart(t).mark_area().encode(x='day', y='total_bill')
Pie chart is a circular statistical graphic, which is divided into slices to illustrate numerical proportion. In a pie chart, the arc length of each slice is proportional to the quantity it represents.
# using pandas plot
tips.groupby('sex').count()['tip'].plot(kind='pie')
# using plotly
px.pie(tips, values='tip', names='day')
Sunburst chart is ideal for displaying hierarchical data. Each level of the hierarchy is represented by one ring or circle with the innermost circle as the top of the hierarchy.
px.sunburst(tips, path=['sex', 'day', 'time'], values='total_bill', color='day')
Radar chart is a graphical method of displaying multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point.
# using plotly
t = tips.groupby('day')['total_bill'].mean().reset_index()
px.line_polar(t, r='total_bill', theta='day', line_close=True)
The best way to get better at visualization is through practice. What I have found useful is participating in a weekly visualization challenge called the TidyTuesday!
Comments welcome!
-
Describe your data using Python
This is the first of a series of articles that I will write to give a gentle introduction to statistics. In this article we will introduce some basic statistical concepts and learn how to use basic statistics to help you describe your data.
We will cover the following topics in this article:
The difference between a population and a sample
The difference between Descriptive and Inferential statistics
Different types of variables
Types of descriptive statistics
Normal or Gaussian distribution
The difference between a population and a sample:
Population denotes a large group consisting of elements having at least one common feature; it is the complete set of observations
Sample is a finite subset of the population; it is a subset of observations from a population. We get a sample from the population in either of the following ways
Representative sampling - here the sample’s characteristics are similar to the population characteristics
- A simple random sample is the most common approach to obtain a representative sample
- A systematic random sample
- A cluster random sample
- A stratified random sample
Convenience sampling - here we collect sample from section of population that is easily available
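The representative sampling approaches listed above can be sketched with pandas. This is only an illustration on a made-up population of 1000 rows split into four equal groups; the column names id and group, the sample sizes, and the step of 10 are all assumptions of mine.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A made-up population: 1000 members in four groups of 250.
population = pd.DataFrame({
    "id": np.arange(1000),
    "group": np.repeat(["a", "b", "c", "d"], 250),
})

# Simple random sample: every member has the same chance of selection.
simple = population.sample(n=100, random_state=0)

# Systematic random sample: random start, then every 10th member.
step = 10
start = int(rng.integers(step))
systematic = population.iloc[start::step]

# Stratified random sample: 10% drawn at random from each group.
stratified = population.groupby("group").sample(frac=0.1, random_state=0)

# Cluster random sample: pick two whole groups at random.
chosen = rng.choice(population["group"].unique(), size=2, replace=False)
cluster = population[population["group"].isin(chosen)]

print(len(simple), len(systematic), len(stratified), len(cluster))
```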
The difference between Descriptive and Inferential statistics:
Descriptive statistics - it's all about organizing, describing and summarizing data
Exploratory data analysis (EDA)
measures of location - such as Mean, Median, Mode
measures of variability or dispersion - such as Variance, Standard deviation, Range, Inter quartile range (IQR)
Inferential statistics - it's all about drawing conclusions about a population from analysis of a random sample drawn from the population
Exploratory modelling - how is x related to y?
Predictive modelling - if you know x, can you predict y?
Different types of variables:
Quantitative
Discrete: a variable whose value is obtained by counting. Example, number of students in a class
Continuous: a variable whose value is obtained by measuring. Example, height of all students in a class
Interval: a scale of measurement where values are ordered and equally spaced, but there is no true zero. Example, temperature in Celsius
Ratio: a scale of measurement where values are ordered, equally spaced, and have a true zero. Example, weight
Qualitative or Categorical
Nominal: example gender - female or male
Ordinal: example size - small, medium, or large
Types of descriptive statistics:
Measures of location: mainly measures of central tendency
Mean: sum of all values divided by the number of values
import seaborn as sns
tips = sns.load_dataset('tips')
tips.mean(numeric_only=True) # shows mean of all numeric variables
Median: middle value in a given sequence of values ordered by rank
tips.median(numeric_only=True) # shows median of all numeric variables
Mode: most frequent value in a set of values
tips.mode() # shows mode of all variables
Measures of variability, spread or dispersion
Range: Maximum value - Minimum value
bill_range = tips.total_bill.max() - tips.total_bill.min() # range (named bill_range to avoid shadowing Python's built-in range)
IQR (Inter quartile range): 75th percentile - 25th percentile
tips.total_bill.quantile(.75) - tips.total_bill.quantile(.25) # IQR
Variance: Measure of variability of data around the mean
tips.total_bill.var() # variance of total_bill variable
Standard deviation: how spread out the data is, i.e. how much variance there is from the mean
tips.total_bill.std() # standard deviation of total_bill variable
Coefficient of variation (C.V.): measure of standard deviation expressed as a percentage of the mean
cv = lambda x: x.std() / x.mean() * 100
cv(tips.total_bill)
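To make these measures concrete, here is a small worked sketch on a hand-made list of eight numbers, so each result can be verified by hand. The values are made up. Note that pandas computes the sample variance and standard deviation (dividing by n-1) by default; passing ddof=0 gives the population version.

```python
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

mean = values.mean()                      # 40 / 8
median = values.median()                  # average of the two middle values
mode = values.mode().tolist()             # most frequent value(s)
value_range = values.max() - values.min()
iqr = values.quantile(0.75) - values.quantile(0.25)

sample_var = values.var()                 # divides by n - 1 (pandas default)
population_var = values.var(ddof=0)       # divides by n
cv = values.std() / values.mean() * 100   # standard deviation as % of mean

print(mean, median, mode, value_range, iqr)
print(sample_var, population_var, cv)
```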
Measures of symmetry and peakedness: Skewness measures symmetry and Kurtosis measures peakedness
Normal or Gaussian distribution
This is one of the most common statistical distributions. The curve of this distribution is shaped like a bell.
The shape of the bell depends on mean and standard deviation of the data
Larger the standard deviation, wider the distribution
A tip to quickly assess normality is to see if mean and median are nearly equal
Skewness and Kurtosis
Skewness measures the tendency of data to be more spread out on one side of the mean than the other. Skewness value indicates
Negative value indicates the data is left skewed
Positive value indicates the data is right skewed
Value close to zero indicates the data is roughly symmetric, as expected for normally distributed data
import scipy.stats as s
s.skew(tips.total_bill, bias=False) #calculate sample skewness
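Kurtosis measures the tendency of data to be concentrated around the center or in the tails. The value below is the excess kurtosis, i.e. kurtosis relative to a normal distribution, which is what scipy returns by default. Kurtosis value indicates
Platykurtic: Negative value indicates lower than normal peakedness
Leptokurtic: Positive value indicates higher than normal peakedness
Mesokurtic: Value close to zero indicates peakedness similar to a normal distribution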
import scipy.stats as s
s.kurtosis(tips.total_bill, bias=False) #calculate sample kurtosis
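To get a feel for how these values behave, here is a minimal sketch on two synthetic samples: one drawn from a normal distribution and one from an exponential distribution, which has a long right tail. The sample size and seed are arbitrary choices of mine.

```python
import numpy as np
import scipy.stats as s

rng = np.random.default_rng(0)

symmetric = rng.normal(size=10000)          # bell-shaped sample
right_skewed = rng.exponential(size=10000)  # long tail to the right

skew_sym = s.skew(symmetric, bias=False)
skew_right = s.skew(right_skewed, bias=False)
kurt_sym = s.kurtosis(symmetric, bias=False)
kurt_right = s.kurtosis(right_skewed, bias=False)

# Expect values near zero for the normal sample and clearly
# positive values for the right-skewed sample.
print(skew_sym, kurt_sym)
print(skew_right, kurt_right)
```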
Comments welcome!
-
An Introduction to GitHub
A three part article series on version control using Git and GitHub. This is the third article in the series in which I will give a very brief introduction to GitHub. This will allow most readers to understand enough to utilize it for version control during development.
What is GitHub?
GitHub is a popular platform for hosting and sharing code repositories, and is widely used for version control and collaborative coding projects. If you’re new to using GitHub for version control, here are some key things to keep in mind:
Create a GitHub account: The first step in using GitHub is to create an account. You can sign up for a free account, which gives you access to both public and private repositories, or a paid plan, which gives you access to additional features.
Create a new repository: Once you have an account, you can create a new repository by clicking the “New repository” button on your GitHub dashboard. You can choose to make the repository public or private, and can add a README file and other files as needed.
Clone the repository to your local machine: Once you have created a repository on GitHub, you can clone it to your local machine using Git. This allows you to make changes to the code locally, and push those changes back to the remote repository on GitHub.
Make changes and commit them: Once you have cloned the repository to your local machine, you can make changes to the code and commit those changes to Git. Be sure to write clear and descriptive commit messages that explain the changes made.
Push changes to the remote repository: After committing changes to Git, you can push those changes back to the remote repository on GitHub. This allows other team members to see the changes and collaborate on the code.
Use pull requests for code reviews: When working on a team, it’s a good practice to use pull requests to review code changes before merging them into the main branch. This allows other team members to review the code and provide feedback before changes are merged.
Use branches for new features or bug fixes: When working on a new feature or bug fix, it’s important to create a new branch in Git rather than making changes directly to the main branch. This keeps the main branch stable and allows for easier collaboration with other team members.
By keeping these key things in mind when using GitHub for version control, you can help ensure that your codebase is well-organized, well-documented, and easy to collaborate on with other team members.
Components of GitHub
Now, let us explore some of the key components of GitHub.
Repository, branch
Repository is a project’s folder and contains all of the project files (including documentation), and stores each file’s revision history.
Branch is a parallel version of a repository. It is contained within the repository, but does not affect the primary or master branch allowing you to work freely without disrupting the “live” version. When you’ve made the changes you want to make, you can merge your branch back into the master branch to publish your changes.
Commit, revert
Commit, or “revision”, is an individual change to a file (or set of files). When you make a commit to save your work, Git creates a unique ID (a.k.a. the “SHA” or “hash”) that allows you to keep record of the specific changes committed along with who made them and when. Commits usually contain a commit message which is a brief description of what changes were made.
Revert - when you revert a pull request on GitHub, a new pull request is automatically opened, which has one commit that reverts the merge commit from the original merged pull request. In Git, you can revert commits with git revert.
Push, pull, fetch, merge
Push means to send your committed changes to a remote repository on GitHub.com. For instance, if you change something locally, you can push those changes so that others may access them.
Pull refers to when you are fetching in changes and merging them. For instance, if someone has edited the remote file you’re both working on, you’ll want to pull in those changes to your local copy so that it’s up to date. See also fetch.
Pull requests are proposed changes to a repository submitted by a user and accepted or rejected by a repository’s collaborators. Like issues, pull requests each have their own discussion forum.
Fetch - when you use git fetch, you’re downloading changes from the remote repository into your local copy of the remote branches (for example origin/master) without merging them into your working branch. Unlike git pull, fetching allows you to review changes before merging them into your local branch.
Merge takes the changes from one branch (in the same repository or from a fork), and applies them into another. This often happens as a “pull request” (which can be thought of as a request to merge), or via the command line. A merge can be done through a pull request via the GitHub.com web interface if there are no conflicting changes, or can always be done via the command line.
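The difference between fetch and pull is easier to see when you run both. Here is a minimal sketch that uses a bare repository on the local disk as a stand-in for GitHub. The folder names (remote.git, mine, theirs), the file notes.txt and the identity values are placeholders chosen for this illustration only.

```shell
set -eu                                     # stop at the first failing command
work="$(mktemp -d)"                         # throwaway folder for the whole demo

# A bare repository plays the role of the remote on GitHub
git init -q --bare "$work/remote.git"
git --git-dir="$work/remote.git" symbolic-ref HEAD refs/heads/master

# "mine": create the first commit and push it
git init -q "$work/mine"
cd "$work/mine"
git symbolic-ref HEAD refs/heads/master     # make sure the branch is called master
git config user.email "abc.xyz@email.com"   # placeholder identity, local to this repo
git config user.name "abc.xyz"
git remote add origin "$work/remote.git"
echo "v1" > notes.txt
git add notes.txt
git commit -q -m "First version"
git push -q -u origin master

# "theirs": a colleague clones, then I push a second commit
git clone -q "$work/remote.git" "$work/theirs"
echo "v2" >> notes.txt
git commit -q -a -m "Second version"
git push -q origin master

# Back in the colleague's clone
cd "$work/theirs"
git fetch -q origin                         # download only, local files untouched
git log --oneline master..origin/master     # review what is waiting to be merged
before="$(cat notes.txt)"                   # still the old content
git pull -q --ff-only origin master         # fetch + merge in one step
after="$(cat notes.txt)"                    # now includes the second line
echo "before pull: $before"
echo "after pull: $after"
```

After git fetch the new commit is visible under origin/master, but notes.txt only changes once git pull merges it.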
Fork, clone, download
Fork is a personal copy of another user’s repository that lives on your account. Forks allow you to freely make changes to a project without affecting the original upstream repository. You can also open a pull request in the upstream repository and keep your fork synced with the latest changes since both repositories are still connected.
Clone is a copy of a repository that lives on your computer instead of on a website’s server somewhere, or the act of making that copy. When you make a clone, you can edit the files in your preferred editor and use Git to keep track of your changes without having to be online. The repository you cloned is still connected to the remote version so that you can push your local changes to the remote to keep them synced when you’re online.
The Download option lets you download the project folder as a zip file from GitHub to your local machine. The zip file does not include the .git folder (and therefore no revision history), so cloning the repository with its HTTPS link is usually the better option.
Comments welcome!
-
-
An Introduction to Git
A three part article series on version control using Git and GitHub. This is the first article in the series in which I will give a very brief introduction to Git. This will allow most readers to understand enough to utilize it for version control during development.
What is Git?
Git is a popular version control system that allows developers to manage and track changes to their code over time. It’s an essential tool for software development teams, as it helps to ensure that changes to code are properly tracked and documented, and makes it easier for developers to collaborate and work together. Here’s an overview of what Git is and how it works.
Git is a distributed version control system, meaning that every developer working on a project has their own copy of the code repository on their local machine. This allows developers to work on their own changes and then merge them back into the main repository when they are ready. Git is also designed to be very fast and efficient, making it ideal for managing large codebases and complex projects.
How does Git work?
Git works by tracking changes to files and directories in a code repository. When a developer makes changes to the code, they create a new “commit” that documents the changes they made. Git stores these commits in a tree-like structure, with each commit representing a snapshot of the code at a particular point in time. This allows developers to easily view the history of changes to the code over time, and to revert to previous versions if necessary.
Git also allows developers to create branches, which are essentially separate versions of the code repository that can be worked on independently. Branches are useful for trying out new features or making experimental changes without affecting the main codebase. Once changes have been tested and reviewed, they can be merged back into the main branch.
Using Git for version control
To use Git for version control, developers typically create a new repository on a Git hosting service such as GitHub, GitLab, or Bitbucket. They then clone the repository onto their local machine and begin making changes to the code. To commit changes, developers use Git commands such as “git add” to add changed files to the commit, and “git commit” to create a new commit with a commit message that describes the changes.
To collaborate with other developers, developers can push their changes to the remote repository and create “pull requests” that allow other developers to review the changes and provide feedback. Once changes have been reviewed and approved, they can be merged back into the main branch.
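The add and commit steps described above can be sketched as follows. This is only an illustration under a few assumptions: it works in a throwaway folder, app.py is a hypothetical file, and the identity values are placeholders.

```shell
set -eu                                     # stop at the first failing command
repo="$(mktemp -d)"                         # throwaway project folder
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master     # make sure the first branch is called master
git config user.email "abc.xyz@email.com"   # placeholder identity, local to this repo
git config user.name "abc.xyz"

echo "print('hello')" > app.py              # hypothetical file
git add app.py                              # add the changed file to the commit
git commit -q -m "Add app.py"               # create the commit with a message

echo "print('hello again')" >> app.py
git add app.py
git commit -q -m "Print a second line"

git log --oneline                           # one line per commit, newest first
```

Each git commit records a snapshot, so the log at the end lists two commits.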
Basic terminal commands
Terminal (for Unix or Mac) or Command Prompt for Windows allows us to type Git commands and manage project repositories. In this section we will be focusing on terminal commands.
By default we are in the /home/vivek directory. home and mnt folders are in the same directory (usually they are in the highest level directory signified by just a /)
pwd shows the current directory
clear is used to clear the command line
cd + tab key is used to cycle between sub directories in a directory
cd .. is used to move up a directory
cd mnt/ is used to enter the mnt directory. In this directory we can find the windows c drive (basically it is a directory named c)
~ signifies that you are in your home directory
.. is used to move up one directory
/ signifies the highest level directory, you can’t go up from there
mkdir is used to create a new directory
Directory names are case sensitive
Right click is used to paste an absolute path name in the terminal
ls is used to list all directories and files in a directory
rm -rf is used to remove folders. -r tells rm to remove a directory and everything inside it (by default rm only removes files), and -f forces the removal without asking for confirmation
git --version is used to see the version of git
touch file_name.txt is used to create a file
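Here is a short session that strings several of these commands together. It is only a sketch: it runs inside a temporary folder so that rm -rf cannot touch anything important, and the names projects and notes.txt are made up for the example.

```shell
set -eu                   # stop at the first failing command
base="$(mktemp -d)"       # temporary folder to play in
cd "$base"
pwd                       # show the current directory
mkdir projects            # create a new directory
cd projects               # enter it
touch notes.txt           # create an empty file
ls                        # list the contents: notes.txt
cd ..                     # move up one directory
rm -rf projects           # remove the directory and everything inside it
ls                        # nothing left to list
```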
Basic Git commands
A Git repository stores the project files along with the history of the changes made to them. A repository can be created locally, or it can be a clone, that is, a copy of a remote Git repository.
git init is used to initialize the directory as a git repository. This will create a .git folder in the directory and we can start using git features
git status shows the state of the project directory and the staging area. In a new repository you will see your files under the “Untracked files:” header
git add file-name is used to add a file to staging area. After this you will see the file under the “Changes to be committed:” header
git add . is used to add all files in directory to staging area (. signifies all)
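git rm --cached file.txt is used to unstage a file: it removes the file from the staging area (so Git stops tracking it) but keeps the file in the directory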
git rm -f file.txt is used to force remove a file from staging area and also deletes the file from directory (-f signifies force)
git config --global user.email "abc.xyz@email.com" (sets the email address recorded in your commits)
git config --global user.name "abc.xyz" (sets the author name recorded in your commits)
git commit --help (opens the documentation for git commit)
git commit -a -m "Initial commit" (-m to include a message; -a to automatically stage files that have been modified and deleted, but new files you have not told Git about are not affected)
git log (if you want to see a shorter version then use git log --oneline)
HEAD usually points to the tip of master (the most recent commit). The project directory reflects whatever commit HEAD points to.
git checkout <commit-id> is used to see the contents of the folder as they looked during that particular commit
git checkout master is used to restore the head to the most recent commit, hence the contents of the project directory are also restored to what they were at the time of the most recent commit
git revert <commit-id> is used to undo the changes introduced by that particular commit by creating a new commit that reverses them. The original commit still appears in the log, and we can bring its changes back by reverting the revert commit
git reset - three kinds, all of which move the current branch back to an earlier commit: soft (moves only the branch pointer; the staging area and the files in the project directory are left as they are), mixed (the default; also resets the staging area, but the files in the project directory are left as they are, so no work is lost) and hard (also resets the files in the project directory to match that commit, so uncommitted changes are discarded)
touch .gitignore, now open the .gitignore file with notepad and add the names of the files you don’t want to track in that. # can be used to comment in this file. Usually you create .gitignore during initializing the project. If you have committed files already before adding them into the .gitignore file, then you need to remove them from cache by using the following series of commands
git rm -r --cached .
git add .
git commit -m "message"
If there is a directory in your project folder and you want to ignore all files in the directory from future commits, you can add “directory-name/*” in the .gitignore file
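To see the .gitignore clean-up in action, here is a small sketch. The folder AutoGen and the files inside the project are hypothetical, and the identity values are placeholders; the three commands in the middle are the ones listed above.

```shell
set -eu                                     # stop at the first failing command
repo="$(mktemp -d)"                         # throwaway project folder
cd "$repo"
git init -q
git config user.email "abc.xyz@email.com"   # placeholder identity, local to this repo
git config user.name "abc.xyz"

mkdir AutoGen                               # hypothetical folder of generated files
echo "generated" > AutoGen/output.log
echo "real work" > main.txt
git add .
git commit -q -m "Initial commit"           # AutoGen was committed by mistake

echo "AutoGen/*" > .gitignore               # ignore everything inside AutoGen
git rm -r -q --cached .                     # empty the staging area, files stay on disk
git add .                                   # re-add everything, .gitignore is now respected
git commit -q -m "Stop tracking AutoGen"

git ls-files                                # tracked files: .gitignore and main.txt
ls AutoGen                                  # output.log is still on disk
```

The file in AutoGen is no longer tracked, but it has not been deleted from the project folder.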
Git Branches for Error Handling
Let’s say there is an error in one of the files in the project folder
We can create a branch to fix the error while the master branch stays intact
git checkout -b err01 (creates a new branch called err01)
<fix the error in one of the files in the project folder>
git add . (add all changes made to the err01 to the staging area, so they can be committed)
git commit -m 'fixed error' (commit all changes made to err01 branch)
git checkout master (switch back to master branch)
git merge err01 (merge changes made in err01 into the master branch; merging brings in every commit on err01 that master does not already have and weaves them into the master branch commit timeline)
git push (this will push master branch of project folder to remote repository)
git push origin err01 (this will push err01 branch of project folder to remote repository)
git push origin --delete err01 (we delete the err01 branch as we don’t need it anymore)
git branch -d err01 (local branches can be deleted using -d)
git branch -a (list all branches)
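The local part of this workflow can be tried out end to end. The sketch below leaves out the push commands because it has no remote, and it assumes a made-up file calc.txt and placeholder identity values. It makes two commits on err01 to show that a merge carries over every commit from the branch.

```shell
set -eu                                     # stop at the first failing command
repo="$(mktemp -d)"                         # throwaway project folder
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master     # make sure the first branch is called master
git config user.email "abc.xyz@email.com"   # placeholder identity, local to this repo
git config user.name "abc.xyz"

echo "total = 1 + 1" > calc.txt             # hypothetical file containing the error
git add .
git commit -q -m "Initial commit"

git checkout -q -b err01                    # create the branch and switch to it
echo "total = 1 + 2" > calc.txt             # fix the error
git add .
git commit -q -m "fixed error"
echo "# checked" >> calc.txt
git commit -q -a -m "added a note"          # a second commit on err01

git checkout -q master                      # switch back to master
git merge -q err01                          # both err01 commits are merged
git branch -d err01                         # the branch is no longer needed
git log --oneline                           # three commits on master
git branch -a                               # only master is left
```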
Remote Repositories for Effective Collaboration
First step is to create a new repository on GitHub (don’t add a read-me, gitignore or license). Copy the url of the repository
Create a project folder in your local machine and browse into that folder using bash
git init (before this step, git status reports that the folder is not a Git repository; git init creates a new repository)
git remote add origin <paste url here>
git remote -v (you will see that origin now points to the GitHub repository)
In GitHub website
“Create new file” > README.md
“Create new file” > LICENSE
“Create new file” > .gitignore > in content of that file type /AutoGen to exclude all files that we keep in that folder
pull - go back to bash
git pull origin master (we don’t need to specify origin master if we set master as the tracked branch)
git branch --set-upstream-to=origin/<branch> master
Sometimes you might be prompted for a login at this stage
<make changes to the local repository>
git push -u origin master (push updates to the remote repository on GitHub; you will be asked for your credentials, and over HTTPS GitHub expects a personal access token in place of the account password)
You can add other developers as collaborators to this repository.
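If you want to rehearse these steps without touching GitHub, a bare repository on your own disk can play the role of the remote. The sketch below makes that assumption: remote.git stands in for the GitHub URL, and the folder name project, the README content and the identity values are placeholders.

```shell
set -eu                                     # stop at the first failing command
work="$(mktemp -d)"                         # throwaway folder for the demo
git init -q --bare "$work/remote.git"       # stand-in for the empty GitHub repository

mkdir "$work/project"                       # the local project folder
cd "$work/project"
git init -q
git symbolic-ref HEAD refs/heads/master     # make sure the first branch is called master
git config user.email "abc.xyz@email.com"   # placeholder identity, local to this repo
git config user.name "abc.xyz"

git remote add origin "$work/remote.git"    # with GitHub, paste the copied url here
git remote -v                               # origin is listed for fetch and push

echo "# My project" > README.md
git add .
git commit -q -m "Initial commit"
git push -q -u origin master                # -u sets origin/master as the tracked branch

git rev-parse --abbrev-ref "master@{upstream}"   # prints origin/master
```

Because -u recorded origin/master as the tracked branch, later git pull and git push calls in this folder work without naming the remote and branch.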
In summary, Git is a powerful tool for version control that allows developers to manage and track changes to code over time. With its distributed architecture, fast performance, and support for branching and merging, Git is an essential tool for software development teams of all sizes.
Comments welcome!
-
Introduction to Programming in R
Quick Introduction
R is a programming language and environment for statistical computing and graphics. It was created in the early 1990s by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. R is now widely used in academia, industry, and government for data analysis, statistical modeling, and data visualization.
One of the key features of R is its wide range of statistical and graphical techniques. R provides a vast array of statistical and graphical methods, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and graphical techniques for data visualization. R is also highly extensible and has an active community of users and developers who create and contribute packages that enhance the capabilities of the language.
R is an open-source language, which means that the code is available for free and can be modified and redistributed. This has led to the development of a large and active community of R users and developers. The R community provides a wealth of resources, including documentation, tutorials, and help forums, making it easy for users to get started with the language and to find solutions to their problems.
One of the advantages of R is its integration with other programming languages and data sources. R can read data from a wide range of sources, including text files, spreadsheets, databases, and web services. R can also interact with other programming languages, such as Python, Java, and C++, allowing users to take advantage of the strengths of different languages and libraries.
Another advantage of R is its versatility. R can be used for a wide range of tasks, from data analysis and visualization to machine learning and artificial intelligence. R can also be used in a variety of settings, from research and academia to industry and government.
Most modern programming languages have a set of similar building blocks, for example
Receiving input from the user and Showing output to the user
Ability to store values in variables (usually of different kinds such as integers, floating points or character)
A string of characters where you can store names, addresses, or any other kind of text
Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Ability to loop your code in the sense that if you want to receive 10 names from a user, you will not write the code for that 10 times, but just once and tell the computer to loop through it 10 times
Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Put your code in functions
Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Read file from a disk and save file to a disk
Ability to comment your code so you can understand it when you revisit it some time later
Let’s dive right in and see how we can do these things in R.
Before we can begin to write a program in R, we need to install R and RStudio.
myString <- "Hello, World!"
print (myString)
1. Receiving input from the user and Showing output to the user
There are several ways in which we can show output to the user. Let’s look at some ways of showing output:
var.1 = c(0,1,2,3)
# Method 1: values of the variables can be printed using print()
print(var.1)
# Output: [1] 0 1 2 3
# Method 2: the cat() function combines multiple items into a continuous print output
cat("var.1 is", var.1, "\n")
# Output: var.1 is 0 1 2 3
2. Ability to store values in variables (usually of different kinds such as integers, floating points or character)
Basic data types: In R, variables are called objects. There are several types of objects, let’s take a look at the important ones:
# Logical
v <- TRUE
print(class(v)) # the class() function can be used to see the data type of the variable
# Numeric
v <- 23.5
print(class(v))
# Integer
v <- 2L
print(class(v))
# Complex
v <- 2+5i
print(class(v))
# Character
v <- "TRUE"
print(class(v))
# Raw
v <- charToRaw("Hello")
print(class(v))
Advanced data types: Much of R’s power comes from the fact that R gives us some advanced objects in addition to the basic ones shown earlier. Let’s take a look at some of them:
# Vectors - When you want to create vector with more than one element, you should use c() function which means to combine the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
# Get the class of the vector.
print(class(apple))
# Lists - A list is an R-object which can contain many different types of elements inside it like vectors, functions and even another list inside it.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
# Print the list.
print(list1)
# Matrices - A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
# Arrays - While matrices are confined to two dimensions, arrays can be of any number of dimensions. The array function takes a dim attribute which creates the required number of dimension. In the below example we create an array with two elements which are 3x3 matrices each.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
# Factors - Factors are R objects which are created using a vector. A factor stores the vector along with the distinct values of its elements as labels. The labels are always character, irrespective of whether the input vector is numeric, character or Boolean. They are useful in statistical modeling. Factors are created using the factor() function. The nlevels() function gives the count of levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')
# Create a factor object.
factor_apple <- factor(apple_colors)
# Print the factor.
print(factor_apple)
print(nlevels(factor_apple))
# Data frames are tabular data objects. Unlike in a matrix, each column of a data frame can contain a different mode of data. The first column can be numeric while the second column can be character and the third column can be logical. A data frame is a list of vectors of equal length. Data frames are created using the data.frame() function.
# Create the data frame.
BMI <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(152, 171.5, 165),
  weight = c(81, 93, 78),
  Age = c(42, 38, 26)
)
print(BMI)
3. A string of characters where you can store names, addresses, or any other kind of text
Any value written within a pair of single quote or double quotes in R is treated as a string.
Key idea here is to learn how to manipulate string variables. There are a few common operations that we will focus on:
a. Concatenate strings
# Concatenate strings
paste(str1, str2, str3, ... , sep = " ", collapse = NULL)
b. Counting number of characters in a string
# Counting number of characters in a string - nchar() function
nchar(test_str)
c. Changing the case - toupper() & tolower() functions
str = 'apPlE'
toupper(str) # APPLE
tolower(str) # apple
d. Extracting parts of a string - substring() function
# Syntax
substring(x,first,last)
# Example - Extract characters from 5th to 7th position.
result <- substring("Extract", 5, 7)
print(result)
e. Formatting - Numbers and strings can be formatted to a specific style using format() function.
# Syntax
format(x, digits, nsmall, scientific, width, justify = c("left", "right", "centre", "none"))
# Example
# Total number of digits displayed. Last digit rounded off.
result <- format(23.123456789, digits = 9)
print(result)
# Display numbers in scientific notation.
result <- format(c(6, 13.14521), scientific = TRUE)
print(result)
# The minimum number of digits to the right of the decimal point.
result <- format(23.47, nsmall = 5)
print(result)
# Format treats everything as a string.
result <- format(6)
print(result)
# Numbers are padded with blank in the beginning for width.
result <- format(13.7, width = 6)
print(result)
# Left justify strings.
result <- format("Hello", width = 8, justify = "l")
print(result)
# Justify string with center.
result <- format("Hello", width = 8, justify = "c")
print(result)
4. Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Arrays are a series of similar type of data stored together in one variable. Arrays can be one-dimensional or multi-dimensional. An array is created using the array() function. It takes vectors as input and uses the values in the dim parameter to create an array.
For example − If we create an array of dimension (2, 3, 4) then it creates 4 rectangular matrices each with 2 rows and 3 columns.
# dim=c(rows, columns, matrices)
array2 = array(1:12, dim=c(2, 3, 2))
# Naming Columns and Rows
column.names <- c("COL1","COL2","COL3")
row.names <- c("ROW1","ROW2")
matrix.names <- c("Matrix1","Matrix2")
array2 = array(1:12, dim=c(2, 3, 2), dimnames = list(row.names, column.names, matrix.names))
Let’s see how we can access array elements:
# dim=c(rows, columns, matrices)
print(array2[2,,2]) # Print the second row of the second matrix of the array.
print(array2[1,3,1]) # Print the element in the 1st row and 3rd column of the 1st matrix.
print(array2[,,2]) # Print the 2nd Matrix.
Since the returned values here are matrices, we can perform matrix operations on them
Calculations Across Array Elements (we can use user defined functions as well)
apply()
lapply()
sapply()
tapply()
# apply(X, MARGIN, FUN) - apply to rows (MARGIN = 1), columns (MARGIN = 2) or both - input to this function is a matrix or an array (a data frame is converted to a matrix) - output is a vector, list or array
m1 <- matrix(1:10, nrow = 5, ncol = 2)
apply(m1, 2, sum)
# lapply(X, FUN) - apply to all elements - input to this function is list, vector or df - output is a list
# sapply(X, FUN) - apply to all elements - input to this function is list, vector or df - output is a vector or a matrix
movies <- c("BRAVEHEART","BATMAN","VERTIGO","GANDHI")
lapply(movies, tolower)
sapply(movies, tolower)
# tapply(X, INDEX, FUN = NULL) - apply to each group of a vector, where the groups are defined by a factor - input to this function is a vector - output is an array
data(iris)
tapply(iris$Sepal.Width, iris$Species, median)
5. Ability to loop your code in the sense that if you want to receive 10 names from a user, you will not write the code for that 10 times, but just once and tell the computer to loop through it 10 times
R has several looping options (repeat, while and for). There are also options of nesting (single, double, triple, ..) loops.
a. The Repeat loop executes the same code again and again until a stop condition is met:
# Syntax
repeat {
  commands
  if (condition) {
    break
  }
}
# Example
v <- c("Hello", "loop")
cnt <- 2
repeat {
  print(v)
  cnt <- cnt + 1
  if (cnt > 5) {
    break
  }
}
b. The While loop executes the same code again and again as long as the test condition remains true:
# Syntax
while (test_expression) {
  statement
}
# Example
v <- c("Hello", "while loop")
cnt <- 2
while (cnt < 7) {
  print(v)
  cnt <- cnt + 1
}
c. The for loop:
# Syntax
for (value in vector) {
  statements
}
# Example
v <- LETTERS[1:4]
for (i in v) {
  print(i)
}
R also provides the break and next statements that allow us to alter the loops further. Following is their use:
When the break statement is encountered inside a loop, the loop is immediately terminated and program control resumes at the next statement following the loop.
On encountering next, R skips the rest of the current iteration and starts the next iteration of the loop.
6. Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
R provides if.., if..else.., if..else..if.., and switch options to apply conditional logic. Let’s take a look at them:
a. The basic syntax for creating an if statement in R is:
# Syntax
if (test_expression) {
  statement
}
# Example
x <- 5
if (x > 0) {
  print("Positive number")
}
b. The basic syntax for creating an if…else statement in R is:
if (test_expression) {
  statement1
} else {
  statement2
}
# Example
x <- -5
if (x >= 0) {
  print("Non-negative number")
} else {
  print("Negative number")
}
c. The basic syntax for creating an if…else if…else statement in R is:
if (test_expression1) {
  statement1
} else if (test_expression2) {
  statement2
} else if (test_expression3) {
  statement3
} else {
  statement4
}
# Example
x <- 0
if (x < 0) {
  print("Negative number")
} else if (x > 0) {
  print("Positive number")
} else {
  print("Zero")
}
d. A switch statement allows a variable to be tested for equality against a list of values. Each value is called a case, and the variable being switched on is checked for each case.
x <- switch(
  2,
  "first",
  "second",
  "third",
  "fourth"
)
print(x)
7. Put your code in functions
In R a user defined function is created by using the keyword function.
# Syntax
function_name <- function(arg_1, arg_2, ...) {
  Function body
}
# Example
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
  for (i in 1:a) {
    b <- i^2
    print(b)
  }
}
We can call the function new.function supplying 6 as an argument.
new.function(6)
We can also create functions that take more than one argument. These functions can be defined with default values for their arguments, which are used in case the user does not provide a value. Let’s see how this is done:
new.function <- function(a = 3, b = 6) {
  result <- a * b
  print(result)
}
Now we can call this with or without passing any values:
# Call the function without giving any argument.
new.function()
# Call the function with giving new values of the argument.
new.function(9,5)
8. Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
A class is a blueprint for creating objects: it defines the member variables (attributes) that each object holds. Two class systems commonly used in R are S3 and S4.
S3 Classes: These are informal and let you overload functions.
S4 Classes: These are formally defined with typed slots, which limits the data an object can hold and makes programs easier to debug.
We will cover S4 classes here. An S4 class is defined with the setClass() function.
# Defining a class
setClass("emp_info", slots=list(name="character", age="numeric", contact="character"))
emp1 <- new("emp_info", name = "vivek", age = 30, contact = "somewhere on the internet")
# Access elements of a class
emp1@name
9. Read file from a disk and save file to a disk
Let’s see how to read and write CSV files. CSV is the most common file type you will be using for data science, however R can read several other file types as well.
# read a csv file
data <- read.csv('file.csv')
# write a csv file
write.csv(data, 'file.csv', row.names = FALSE)
10. Ability to comment your code so you can understand it when you revisit it some time later
We can tell R that a line of code is a comment by starting it with a #.
# this is a comment
In summary, R is a powerful and versatile programming language that is widely used for statistical computing and graphics. Its extensive range of statistical and graphical techniques, its open-source nature, and its active community of users and developers make it a valuable tool for data analysis and modeling. Whether you are a researcher, data analyst, or developer, R provides a wide range of tools and resources for working with data and creating meaningful insights.
To close I will emphasize the importance of practicing in learning anything new. Persistence and trying out different combinations of these building blocks for solving easier problems first and more complex ones later on is the only way to become fluent.
Comments welcome!
-
Introduction to Programming in Markdown
Quick Introduction
Markdown is a lightweight markup language that is used to format text in a simple and consistent way. It was first created in 2004 by John Gruber and Aaron Swartz as a way to write content for the web that was easy to read and write.
Markdown is designed to be easy to learn and use. It uses simple syntax to format text, making it easy to create headings, lists, links, and other formatting elements. Markdown can be used in a wide variety of contexts, including writing blog posts, creating documentation, and writing code comments.
One of the key features of Markdown is its simplicity. Markdown uses plain text that can be easily read and edited using any text editor. This makes it easy to collaborate on documents and to transfer files between different devices and platforms. Additionally, Markdown is supported by a wide variety of software tools and platforms, including blogging platforms, content management systems, and online forums.
Another important feature of Markdown is its flexibility. Markdown can be customized and extended to support a wide variety of use cases. For example, Markdown supports the creation of tables, code blocks, and mathematical equations. Additionally, there are many third-party tools and libraries that extend the functionality of Markdown, such as Pandoc, which can convert Markdown to other formats like HTML, LaTeX, and PDF.
Markdown is also popular among programmers, as it can be used to create code blocks and inline code snippets. This is particularly useful for writing documentation and sharing code examples. Many code editors also support Markdown, allowing programmers to write and preview Markdown documents without leaving their development environment.
The following table provides a quick overview of frequently used Markdown syntax elements. It does not cover every case, so if you need more information about any of these elements, refer to the reference guides for basic syntax and extended syntax.
| Element | Markdown Syntax |
| --- | --- |
| Heading | # for H1, ## for H2 and so on |
| Bold | **bold text** |
| Italic | *italicized text* |
| Blockquote | > blockquote |
| Ordered List | Just add 1., 2. and so on in front of list elements |
| Unordered List | Just add a - or * in front of list elements |
| Code | `code` |
| Horizontal Rule | three or more *, -, or _ |
| Link | [title](https://www.example.com) |
| Image | ![alt text](image.jpg){:class="img-responsive"} |
Now that we have reviewed some of the basic syntax elements, let’s familiarize ourselves with some advanced syntax elements.
| Element | Markdown Syntax |
| --- | --- |
| Table | \| for vertical lines and - for horizontal lines |
| Code Block | ``` code ``` |
| Footnote | [^1]: This is the footnote. |
| Heading ID | ### Heading {#custom-id} |
| Strikethrough | ~~The world is flat.~~ |
| URL | https://www.markdownguide.org |
| Email | fake@example.com |
| Escape character | \ |
Markdown also offers syntax highlighting for various programming languages when we specify a code block. Most of the time all we need to do is just mention the name of the programming language after the opening ```, like ```python. Following is a curated list of supported programming languages:
| Language | Supported file types |
| --- | --- |
| bash | '*.sh', '*.ksh', '*.bash', '*.ebuild', '*.eclass' |
| bat | '*.bat', '*.cmd' |
| c | '*.c', '*.h' |
| cpp | '*.cpp', '*.hpp', '*.c++', '*.h++', '*.cc', '*.hh', '*.cxx', '*.hxx', '*.pde' |
| csharp | '*.cs' |
| css | '*.css' |
| fortran | '*.f', '*.f90' |
| go | '*.go' |
| html | '*.html', '*.htm', '*.xhtml', '*.xslt' |
| java | '*.java' |
| js | '*.js' |
| markdown | '*.md' |
| perl | '*.pl', '*.pm' |
| php | '*.php', '*.php(345)' |
| postscript | '*.ps', '*.eps' |
| python | '*.py', '*.pyw', '*.sc', 'SConstruct', 'SConscript', '*.tac' |
| rb or ruby | '*.rb', '*.rbw', 'Rakefile', '*.rake', '*.gemspec', '*.rbx', '*.duby' |
| sql | '*.sql' |
| vbnet | '*.vb', '*.bas' |
| xml | '*.xml', '*.xsl', '*.rss', '*.xslt', '*.xsd', '*.wsdl' |
| yaml | '*.yaml', '*.yml' |
A great heading (h1)
Another great heading (h2)
Some great subheading (h3)
You might want a sub-subheading (h4)
Could be a smaller sub-heading, pacman (h5)
Small yet significant sub-heading (h6)
Code box
<html>
<head>
</head>
<body>
<p>Hello, World!</p>
</body>
</html>
List
- First item, yo
- Second item, dawg
- Third item, what what?!
- Fourth item, fo sheezy my neezy
Numbered list
1. First item, yo
2. Second item, dawg
3. Third item, what what?!
4. Fourth item, fo sheezy my neezy
Comments
{% comment %}
Might you have an include in your theme? Why not try it here!
{% include my-themes-great-include.html %}
{% endcomment %}
Tables
Title 1
Title 2
Title 3
Title 4
lorem
lorem ipsum
lorem ipsum dolor
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
Title 1
Title 2
Title 3
Title 4
lorem
lorem ipsum
lorem ipsum dolor
lorem ipsum dolor sit
lorem ipsum dolor sit amet
lorem ipsum dolor sit amet consectetur
lorem ipsum dolor sit amet
lorem ipsum dolor sit
lorem ipsum dolor
lorem ipsum
lorem
lorem ipsum
lorem ipsum dolor
lorem ipsum dolor sit
lorem ipsum dolor sit amet
lorem ipsum dolor sit amet consectetur
In summary, Markdown is a simple and flexible markup language that is widely used for formatting text on the web. Its simplicity and ease of use make it an attractive option for writing and sharing documents, and its flexibility allows it to be customized and extended to support a wide variety of use cases. Whether writing blog posts, creating documentation, or sharing code examples, Markdown is a valuable tool for anyone who wants to format text in a consistent and easy-to-read way.
For a more complete list consider visiting Codebase.
By the way this page was written using markdown and rendered to HTML using Jekyll.
Comments welcome!
-
Introduction to Programming in Python
Quick Introduction
Python is a high-level, interpreted programming language that was first released in 1991 by Guido van Rossum. It is a general-purpose language that is designed to be easy to use, with a focus on readability and simplicity. Python is often used for web development, data analysis, artificial intelligence, scientific computing, and other types of software development.
One of the key features of Python is its ease of use. Python’s syntax is designed to be simple and intuitive, making it accessible to both beginner and experienced programmers. Python is also an interpreted language, meaning that it does not require a separate compilation step before a program is run, which makes it easy to write and test code quickly.
Another important feature of Python is its support for object-oriented programming. Python allows users to create classes and objects, and to define methods on those objects. This makes it a powerful tool for building complex software systems.
Python also includes a large and growing library of built-in modules and packages. These modules provide a wide range of functionality, from working with strings, arrays, and dictionaries to working with databases, web frameworks, and machine learning tools. Python’s open-source ecosystem is one of its biggest strengths, as it allows developers to easily access and integrate with a wide range of third-party libraries and tools.
One of the most popular web development frameworks built in Python is Django. Django is a full-stack web framework that provides a set of conventions and tools for building web applications quickly and easily. With its focus on developer productivity, Django has become a popular choice for startups, small businesses, and large enterprises.
Python’s popularity has also been driven by its use in data analysis and scientific computing. With packages like NumPy, Pandas, and Matplotlib, Python has become a leading language for data analysis and visualization. In recent years, Python has also become a popular language for artificial intelligence and machine learning, with packages like TensorFlow, PyTorch, and Scikit-learn providing powerful tools for building machine learning models.
Most modern programming languages have a set of similar building blocks, for example
Receiving input from the user and Showing output to the user
Ability to store values in variables (usually of different kinds such as integers, floating points or character)
A string of characters where you can store names, addresses, or any other kind of text
Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Ability to loop your code in the sense that if you want to receive 10 names from a user, you will not write the code for that 10 times, but just once and tell the computer to loop through it 10 times
Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Put your code in functions
Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Read file from a disk and save file to a disk
Ability to comment your code so you can understand it when you revisit it some time later
Let's dive right in and see how we can do these things in Python.
0. How to install Python on your desktop?
Before we can begin to write a program in Python, we need to install Anaconda. This will install the Anaconda data science environment and Spyder IDE for coding in Python. Once done, go ahead and open Spyder and try out the following code to see if everything is in order.
myString = "Hello, World!"
print (myString)
1. Receiving input from the user and Showing output to the user
Python uses the input() function to receive input from the user and the print() function to show output. Let’s look at both:
name = input('Please enter your name: ') # input() shows the prompt and returns what the user typed as a string
print(f"hello {name}, how are you?") # showing character output
age = input('Please enter your age: ') # input() returns a string even when the user types digits
print('so you are', age, 'years old.') # print() separates its arguments with a space
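Because input() always returns text, a numeric answer has to be converted before it can be used in arithmetic. The following is a minimal sketch of one way to do that; parse_age is a hypothetical helper name introduced here for illustration, and the text is passed in directly so the example runs without a keyboard.

```python
def parse_age(text):
    """Convert user-supplied text to an int; return None if it is not a whole number."""
    try:
        return int(text.strip())
    except ValueError:
        # int() raises ValueError for text such as "thirty" or "3.5"
        return None

# In a real program `text` would come from input('Please enter your age: ')
print(parse_age("34"))      # 34
print(parse_age("thirty"))  # None
```

Returning None instead of raising keeps the calling code simple: it can ask the user again whenever the result is None.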
2. Ability to store values in variables (usually of different kinds such as integers, floating points or character)
Python is dynamically typed: you don’t need to declare a variable’s data type before using it. This can sometimes cause unexpected problems, for example if a user enters a character where you expect a number. To guard against this kind of problem, type() can be used to check what a variable currently holds. A variable is created simply by assigning it an initial value (like age=20).
Basic data types: In Python we have several types of objects. Let's take a look at the important ones:
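To make the idea of dynamic typing concrete, here is a small sketch. It shows the same name referring to values of different types, and a hypothetical helper, is_number, that uses isinstance() to check a value before doing arithmetic with it.

```python
value = 20            # value currently refers to an int
print(type(value))    # <class 'int'>
value = "twenty"      # the same name can later refer to a str
print(type(value))    # <class 'str'>

def is_number(x):
    """Return True for int and float values, but not for booleans or strings."""
    # bool is a subclass of int in Python, so it is excluded explicitly
    return isinstance(x, (int, float)) and not isinstance(x, bool)

print(is_number(3.5))   # True
print(is_number("3"))   # False
```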
# Boolean / Logical
v = True
print(type(v)) # the type() function shows the data type of a variable: <class 'bool'>
# Float
v = 23.5
print(type(v)) # <class 'float'>
# Integer
v = 2
print(type(v)) # <class 'int'>
# Complex
v = 2+5j # Python writes the imaginary unit as j
print(type(v)) # <class 'complex'>
# String
v = "TRUE"
print(type(v)) # <class 'str'>
# Some common number functions:
hex(1) # hexadecimal representation of numbers: '0x1'
bin(1) # binary representation of numbers: '0b1'
2**3 # 2^3, 2 to the power 3: 8
pow(2,3) # same as 2**3
pow(2,3,4) # (2**3) % 4: 0
abs(-2.33) # 2.33
round(3.14) # 3
round(3.14159,2) # rounded to 2 decimal places: 3.14
import math
sq_rt = math.sqrt(16) # returns the square root as a float: 4.0
Advanced data types: Much of Python’s power comes from the fact that it gives us access to some advanced variable types beyond the basic ones shown earlier. Let's take a look at some of them:
# Lists - A list can contain many different types of elements such as strings, numbers, etc. and even another list.
# Create a list from a range.
a = list(range(1, 10)) # range(1, 10) produces the values 1 to 9; list() turns them into a list
print(a)
# Output: [1, 2, 3, 4, 5, 6, 7, 8, 9] # 10 is excluded because the upper bound is excluded in Python
# we can have mixed data types in a list
b=[1,2,3,'vivek',True,4,5]
print(b)
# list indexes start at 0, 1, 2 ..
# so vivek is present at index 3
print(b[3])
# slicing - [start:stop:step]
a[1:6:2] # starts at index 1, stops before index 6 and selects every second element: [2, 4, 6]
# reversing a list
a[::-1] # this would take a lot more effort to do in C++!
# tuples - immutable lists, can't be changed
t = (1,2,3) # use () instead of []
# dict - d = {'key':'value', ..} is a mutable collection of key:value pairs, e.g. {"name":"frankie","age":33}
# (since Python 3.7 a dict remembers the order in which its keys were inserted)
# A dictionary is quite useful for labelling matrix indexes
import numpy as np
m=np.array([[1,2,3],[4,5,6],[7,8,9]])
col_names={'age':0, 'weight':1, 'height':2}
row_names={'aa':0, 'cc':1, 'bb':2}
# now we can get the weight of 'cc' using actual indexes or dict indexes
m[1,1] # 5
m[row_names['cc'],col_names['weight']] # 5
# set - s = {'a','b','c',..} or set(iterable) - unordered collection of unique objects
# It looks like a dictionary {"a","b"} when Python shows output, but it is not because it doesn’t have key:value pairs
set([1,1,2,3]) # output: {1, 2, 3} , a list can be passed to set()
set("Mississippi") # output: {'M', 'i', 'p', 's'} (display order may vary) , even strings can be passed to set()
# Matrices - A matrix is a two-dimensional rectangular data set. It can be created using the np.array() function.
# Create a matrix
import numpy as np # we need to import the numpy library, which provides tools for numerical computing.
m=np.array([[1,2,3],[4,5,6],[7,8,9]])
print(type(m)) # <class 'numpy.ndarray'>
# Arrays - while matrices are confined to two dimensions, arrays can have any number of dimensions.
# Create an array.
a=np.array([1,2,3]) # this is a 1-dimensional array
print(type(a))
# Convert a list to an array
a=[1,2,3,4]
a=np.array(a) # array([1, 2, 3, 4])
# DataFrame - this is an advanced object that can be used by installing the pandas library.
# If you are familiar with R, this is similar to data.frame. If you are familiar with Excel, you can think of a
# DataFrame as a table with rows and columns, where rows and columns can potentially have names/labels.
# You can access data within the DataFrame using row/column numbers (indexing starts from 0) or their labels.
import pandas as pd
# From dict
pd.DataFrame({'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']})
# from list
pd.DataFrame(['orange', 'mango', 'grapes', 'apple'], index=['a', 'b', 'c', 'd'], columns=['Fruits'])
# from list of lists
pd.DataFrame([['orange','tomato'],['mango','potato'],['grapes','onion'],['apple','chilly']], index=['a', 'b', 'c', 'd'], columns=['Fruits', 'Vegetables'])
# from multiple lists
pd.DataFrame(
    list(zip(['orange', 'mango', 'grapes', 'apple'],
             ['tomato', 'potato', 'onion', 'chilly'])),
    index=['a', 'b', 'c', 'd'],
    columns=['Fruits', 'Vegetables'])
3. A string of characters where you can store names, addresses, or any other kind of text
Any value written within a pair of single or double quotes in Python is treated as a string.
The key idea here is to learn how to manipulate string variables.
There are a few common operations that we will focus on:
a. Concatenate strings
# Concatenate strings
str1, str2, str3 = "vivek", "likes", "python"
str1 + " " + str2 + " " + str3 # 'vivek likes python'
b. Counting number of characters in a string
# Counting number of characters in a string
str1 = "vivek"
len(str1) # 5
c. Changing the case - upper() & lower() methods
str1.upper() # convert string to upper case (.lower() for lower case)
str1.isupper(), str1.islower() # check if a string or a character is upper or lower
d. Splitting a string
str1.split('e') # ['viv', 'k']: returns the list of strings before and after e. If there are multiple e's, the split happens at every e
e. Reversing a string (a palindrome reads the same when reversed)
str1 = "vivek"
str1[::-1] # 'keviv'
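The [::-1] slice gives a compact way to test whether a string is a palindrome: reverse it and compare it with the original. A minimal sketch follows; is_palindrome is a hypothetical helper name, and ignoring letter case is an assumption made for this example.

```python
def is_palindrome(text):
    """Return True if text reads the same forwards and backwards, ignoring case."""
    normalized = text.lower()
    return normalized == normalized[::-1]

print(is_palindrome("Level"))  # True
print(is_palindrome("vivek"))  # False
```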
4. Some advanced data types such as lists which can store a series of regular variables (such as a series of integers)
Lists are a series of values stored together in one variable. Lists can be one-dimensional or multi-dimensional. A list is created using square brackets or the list() function. It can hold any kind of value (even other lists). A list is different from a string because its elements can be mutated/changed.
# Defining
L=[0,0,0] # [0, 0, 0]
L1=[0]*3 # shorthand way of defining a list with repeated elements
# Supports indexing and slicing
L1=['one', 'two', 'three']
L1[0] # 'one'
L1[1:2] # ['two'], upper bound is excluded
L1[1:3] # ['two', 'three']
# Indexing nested lists
L1 = ['one', 'two', ['three', 'four'], 'five']
L1[2][0] # 'three'
# Elements can be added
L1.append('six')
# Elements can be removed
L1.pop() # last element gets popped, we can save it in a variable also
# Sort (all elements must be comparable with each other, so we use a list of strings only)
L1=['one', 'two', 'three']
L1.sort() # sorts the list in-place, the actual list gets sorted
sorted(L1) # returns a new sorted list and leaves L1 unchanged
# Reverse
L1=['c','a','b']
L1.reverse() # reverses the list in-place, the actual list gets reversed
# Multi-dimensional list indexing
L1=[[1,2,3],[4,5,6],[7,8,9]]
L1[0][:] # returns first row
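A nested list is indexed row first, so there is no direct slice that returns a column. One common way around this, sketched below with a made-up 3x3 grid, is a list comprehension that picks the same position out of every row.

```python
grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

first_row = grid[0]                                 # [1, 2, 3]
first_column = [row[0] for row in grid]             # [1, 4, 7]
diagonal = [grid[i][i] for i in range(len(grid))]   # [1, 5, 9]

print(first_row, first_column, diagonal)
```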
5. Ability to loop your code in the sense that if you want to receive 10 names from a user, you will not write the code for that 10 times, but just once and tell the computer to loop through it 10 times
Python has several looping options such as ‘for’ and ‘while’. There are also options of nesting (single, double, triple, ..) loops.
a. The While loop executes the same code again and again until a stop condition is met:
# Syntax
while test:
    code statements
else:
    final code statements
# Example
x = 0
while x < 10:
    print('x is currently: ',x)
    print(' x is still less than 10, adding 1 to x')
    x+=1
b. The for loop: acts as an iterator in Python; it goes through items that are in a sequence or any other iterable item. Objects that we’ve learned about that we can iterate over include strings, lists, tuples, and even built-in iterables for dictionaries, such as keys or values.
# Syntax
for item in object:
    statements to do stuff
# Example
list1 = [1,2,3,4,5,6,7,8,9,10]
for num in list1:
    print(num)
Python also provides the break, continue and pass statements that allow us to alter the loops further. Following is their use:
break: Breaks out of the current closest enclosing loop.
continue: Goes to the top of the closest enclosing loop.
pass: Does nothing at all.
# Thinking about break and continue statements, the general format of the while loop looks like this:
while test:
    code statement
    if test:
        break
    if test:
        continue
else:
    final code statements # runs only if the loop ended without hitting break
break and continue statements can appear anywhere inside the loop’s body, but we will usually put them further nested in conjunction with an if statement to perform an action based on some condition.
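As a concrete illustration of break, continue and the loop's else clause, here is a small sketch. The list of numbers and the rules (skip negatives, stop at the first zero) are made up for this example.

```python
numbers = [3, 8, -1, 5, 0, 12, 7]
collected = []

for n in numbers:
    if n < 0:
        continue      # skip negative values and move on to the next item
    if n == 0:
        break         # stop the loop entirely at the first zero
    collected.append(n)
else:
    # the else block runs only if the loop finished without hitting break
    print("no zero found")

print(collected)  # [3, 8, 5]
```

Here the loop stops at 0, so 12 and 7 are never visited and the else block is skipped.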
6. Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Python provides if, if..else, and if..elif..else statements to apply conditional logic. Let's take a look at them:
a. The basic syntax for creating an if statement is:
if True:
    print('It was true!')
b. The basic syntax for creating an if…else statement is:
x = False
if x:
    print('x was True!')
else:
    print('I will be printed in any case where x is not true')
c. The basic syntax for creating an if…elif…else statement is:
loc = 'Bank'
if loc == 'Auto Shop':
    print('Welcome to the Auto Shop!')
elif loc == 'Bank':
    print('Welcome to the bank!')
else:
    print('Where are you?')
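The pass/fail rule from the heading of this section can be written with the same if…else structure. This is a minimal sketch; the function name grade and the pass_mark parameter are hypothetical names chosen for illustration.

```python
def grade(marks, pass_mark=40):
    """Return 'pass' when marks are more than pass_mark, otherwise 'fail'."""
    if marks > pass_mark:
        return "pass"
    else:
        return "fail"

print(grade(55))  # pass
print(grade(40))  # fail, because 40 is not more than 40
```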
7. Put your code in functions
Functions allow us to create a block of code that can be executed many times without needing to write it again.
# Syntax
def name_of_function(argument_name='default value'): # snake case for the name: all lower case letters with underscores
    '''
    what the function does
    '''
    print('hello',argument_name)
    print(f'hello {argument_name}') # both prints do the same thing
# Example
def add_function(a=0,b=0):
    return a+b
We can call the function in the following two ways:
# option 1
add_function(2,3)
# option 2
c=add_function(3,4)
*args and **kwargs stand for arguments and keyword arguments and allow us to extend the functionality of functions.
*args lets a function take an arbitrary number of arguments. All arguments are received as a tuple, example - (a,b,c,..). args can be renamed to something else, what really matters is *.
def myfunc(*args):
    return args
'''
myfunc(1,2,3,4,5,6,7,8,9)
Out[30]: (1, 2, 3, 4, 5, 6, 7, 8, 9)
'''
**kwargs lets the function take an arbitrary number of keyword arguments. All arguments are received as a dictionary of key,value pairs. kwargs can be renamed to something else, what really matters is **.
def myfunc(**kwargs):
    print(kwargs)
'''
myfunc(name='vivek', age=34, height=186)
{'name': 'vivek', 'age': 34, 'height': 186}
'''
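Both forms can be combined in one function, as long as *args comes before **kwargs in the definition. The sketch below uses a hypothetical function, describe, to show what each one receives.

```python
def describe(*args, **kwargs):
    """Return the sum of the positional numbers and the sorted keyword names."""
    total = sum(args)        # args is a tuple of the positional arguments
    names = sorted(kwargs)   # kwargs is a dict; sorted() returns its keys in order
    return total, names

print(describe(1, 2, 3, name='vivek', age=34))  # (6, ['age', 'name'])
```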
8. Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Python allows users to create classes. These can be a combination of variables and functions that operate on those variables. Let's take a look at how we can define and use them.
# Define a class
class Person:
    "This is a person class"
    age = 10
    def greet(self):
        print('Hello')
# Using class
print(Person.age) # Output: 10
print(Person.greet) # Output: <function Person.greet at 0x...>
print(Person.__doc__) # Output: This is a person class
# Creating an object of the class and using that
vivek = Person() # create a new object of Person class
print(vivek.greet) # Output: <bound method Person.greet of <__main__.Person object at 0x...>>
vivek.greet() # Calling object's greet() method; Output: Hello
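In the class above, age is shared by every Person. To give each object its own data, a class usually defines an __init__ method that stores values on self. The following is a minimal sketch with made-up attribute values.

```python
class Person:
    """A person with a name and an age."""

    def __init__(self, name, age):
        # instance attributes: every object gets its own copy
        self.name = name
        self.age = age

    def greet(self):
        return f"Hello, my name is {self.name}"

vivek = Person("vivek", 34)
print(vivek.greet())  # Hello, my name is vivek
print(vivek.age)      # 34
```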
9. Read file from a disk and save file to a disk
Let's see how to read and write a CSV file in an organized way. CSV is the most common file type you will be using for data science; however, Python can read several other file types and data directly from websites as well.
import pandas
# read a csv using the pandas package
df = pandas.read_csv('student_data.csv')
print(df)
# write data to a csv using pandas package
df.to_csv('student_data_copy.csv')
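If pandas is not available, the csv module from the standard library can do the same round trip. The sketch below is only an illustration: the student names and marks are made-up sample data, and the file is written to a temporary folder so that no real file is touched.

```python
import csv
import os
import tempfile

rows = [{"name": "asha", "marks": "72"}, {"name": "ravi", "marks": "38"}]

# write the rows to a CSV file in a temporary folder
path = os.path.join(tempfile.mkdtemp(), "student_data.csv")
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "marks"])
    writer.writeheader()
    writer.writerows(rows)

# read the file back; every value comes back as a string
with open(path, newline="") as f:
    loaded = list(csv.DictReader(f))

print(loaded)
```

Note that the csv module does not convert numbers for you, which is one reason pandas is more convenient for data analysis.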
10. Ability to comment your code so you can understand it when you revisit it some time later
We can tell Python that a line of code is a comment by starting it with a #.
# this is a comment
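We can mark a multi-line block of text as a comment by enclosing it in triple quotes (''' or """). Strictly speaking this creates a string that Python evaluates and then discards, but it is commonly used as a block comment.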
'''
this
is
a
comment
block
'''
Overall, Python is a versatile and powerful programming language that is well-suited for a wide range of programming tasks. With its emphasis on simplicity, object-oriented design, and a large and growing ecosystem of third-party libraries and tools, Python is a valuable tool for both beginner and experienced programmers. Whether building web applications, analyzing data, or working on artificial intelligence projects, Python provides a fast, flexible, and enjoyable development experience.
To close I will emphasize the importance of practicing in learning anything new. Persistence and trying out different combinations of these building blocks for solving easier problems first and more complex ones later on is the only way to become fluent.
Comments welcome!
-
Introduction to Programming in Julia
Quick Introduction
Julia is a high-level, high-performance programming language that was created in 2012 by a team of computer scientists led by Jeff Bezanson, Stefan Karpinski, and Viral Shah. Julia was designed to address the limitations of traditional scientific computing languages, such as MATLAB, Python, and R, while still retaining their ease of use and flexibility.
One of the key features of Julia is its performance. Julia is designed to be fast, with execution speeds comparable to those of compiled languages such as C and Fortran. This is achieved through a combination of just-in-time (JIT) compilation, which compiles code on the fly as it is executed, and type inference, which allows Julia to work out the data types of variables without the programmer having to annotate them.
Another important feature of Julia is its support for multiple dispatch. Multiple dispatch allows Julia to select the appropriate method to use based on the types of the arguments being passed to a function. This makes Julia a flexible and expressive language that can be easily extended and customized to fit a wide range of programming tasks.
Julia also includes a number of built-in data structures and libraries that make it easy to work with arrays, matrices, and other scientific computing tools. These include tools for linear algebra, statistics, optimization, and machine learning, as well as support for distributed computing and parallelism.
In addition to its scientific computing features, Julia also includes support for general-purpose programming tasks, such as web development, database access, and file I/O. Julia’s growing package ecosystem provides a wide range of libraries and tools for these tasks, making it a versatile language that can be used for a variety of programming tasks.
One of the key benefits of Julia is its community. Julia has a rapidly growing community of developers and users who are actively contributing to the language and its ecosystem. This community has created a large number of high-quality packages, as well as a number of online resources and forums for learning and discussing the language.
Most modern programming languages have a set of similar building blocks, for example
Receiving input from the user and Showing output to the user
Ability to store values in variables (usually of different kinds such as integers, floating points or character)
A string of characters where you can store names, addresses, or any other kind of text
Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Ability to loop your code in the sense that if you want to receive 10 names from a user, you will not write the code for that 10 times, but just once and tell the computer to loop through it 10 times
Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Put your code in functions
Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Read file from a disk and save file to a disk
Ability to comment your code so you can understand it when you revisit it some time later
Let's dive right in and see how we can do these things in Julia.
0. How to install Julia on your desktop?
Before we can begin to write a program in Julia, we need to install Julia. Next you can install VSCode. Now launch VSCode and install the Julia (by julialang) extension. Now you can create a new test.jl file, add the following code and see if it runs.
4+2; # If you don't want to see the result of the expression printed, use a semicolon at the end of the expression
ans; # the value of the last expression you typed on the REPL, it's stored within the variable ans
Before we dive in, chaining functions is possible in Julia, like so:
1:10 |> collect
1. Receiving input from the user and Showing output to the user
There are several ways in which we can show output to the user. Let’s look at some ways of showing output:
# receiving input from user
name = readline(stdin)
# showing output to user
println("your name is ", name)
2. Ability to store values in variables (usually of different kinds such as integers, floating points or character)
Names of variables are in lower case. Word separation can be indicated by underscores.
Julia has several types of variables broadly classified into Concrete and abstract types. The types that can have subtypes (e.g. Any, Number) are called abstract types. The types that can have instances are called concrete types. These types cannot have any subtypes.
Concrete types can be further divided into primitive (or basic), and complex (or composite). Let’s take a deeper look:
# Primitive types
## the basic integer and float types (signed and unsigned): Int8, UInt8, Int16, UInt16, Int32, UInt32, Int64, UInt64, Int128, UInt128, Float16, Float32, and Float64
a = 10
## more advanced numeric types: BigFloat, BigInt
a = BigInt(2)^200
## Boolean and character types: Bool and Char
selected = true
## Text string types: String
name = "vivek"
# Composite type
## Rational, used to represent fractions. It is composed of two pieces, a numerator and a denominator, both integers (of type Int)
666//444 # To make rational numbers, use two slashes (//)
Some advanced data types include dictionaries and sets. Sets are similar to arrays with the difference that they don't allow element duplication.
dict = Dict("a" => 1, "b" => 2, "c" => 3)
dict = Dict{String,Integer}("a"=>1, "b" => 2) # If you know the types of the keys and values in advance, you can specify them after the Dict keyword, in curly braces
# looking things up
dict["a"]
values(dict) # to retrieve all values
keys(dict) # to retrieve all keys
# these can be useful for iterating
for k in keys(dict)
    println(k)
end
for (key, value) in dict
    println(key, " => ", value)
end
d1 = Dict("a" => 1, "b" => 2); d2 = Dict("c" => 3) # two small dictionaries for the examples below
merge(d1, d2) # merge() function which can merge two dictionaries
findmin(d1) # find the minimum value in a dictionary, and return the value, and its key
filter(p -> p.second == 1, d1) # keep only the key => value pairs whose value is 1
# sort dict - you can use the SortedDict data type from the DataStructures.jl package
import Pkg
Pkg.add("DataStructures")
import DataStructures
dict = DataStructures.SortedDict("b" => 2, "c" => 3, "d" => 4, "e" => 5, "f" => 6)
# Sets - A set is a collection of elements, just like an array or dictionary, with no duplicated elements.
colors = Set{String}(["red","green","blue","yellow"])
push!(colors, "black") # You can use push!() to add elements to a set
rainbow = Set(["red","orange","yellow","green","blue","indigo","violet"]) # a second set to compare against
union(colors, rainbow) # The union of two sets is the set of everything that is in one or the other sets
intersect(colors, rainbow) # The intersection of two sets is the set that contains every element that belongs to both sets
setdiff(colors, rainbow) # The difference between two sets is the set of elements that are in the first set, but not in the second
We will discuss abstract data types in section 8 below.
3. A string of characters where you can store names, addresses, or any other kind of text
Any value written within a pair of double quotes in Julia is treated as a string.
"this is a string"
# double quotes and dollar signs need to be preceded (escaped) with a backslash
"""this is "a" string with double quotes""" # triple double quotes can be used to store strings with double quotes in them
Julia also allows the user to indicate special strings.
# special strings
r" " indicates a regular expression
v" " indicates a version string
b" " indicates a byte literal
raw" " indicates a raw string that doesn't do interpolation
Key idea here is to learn how to manipulate string variables. There are a few common operations that we will focus on:
a. Concatenate strings
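# Concatenate strings
"Hello" * ", " * "World" # the * operator concatenates strings; string("Hello", ", ", "World") does the same
s = "You know my methods, Watson."
join(split(s, r"a|e|i|o|u"; keepempty=false), "aiou") # You can join the elements of a split string in array form using join()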
b. Counting number of characters in a string
# Counting number of characters in a string
length(str) # to find the length of a string
lastindex(str) # to find index of last char of string
c. Changing the case - uppercase() & lowercase() functions
uppercase(s) # convert string to upper case (lowercase(s) for lower case)
d. Splitting a string
split("You know my methods, Watson.") # by default splits on whitespace
split("You know my methods, Watson.", 'W') # splits on the char W
split("Watson", "") # If you want to split a string into separate single-character strings, use the empty string ("")
split("You know my methods, Watson.", r"a|e|i|o|u"; keepempty=false) # splits the string wherever one of the vowels matches
# keepempty=false makes sure that empty strings are not returned
e. String interpolation
# string interpolation - use the results of Julia expressions inside strings.
x = 42
"The value of x is $(x)." # "The value of x is 42."
f. Iterate over a string
for char in s # iterate through a string
    print(char, "_")
end
g. Get index of all characters in a string
for i in eachindex(str)
    @show str[i]
end
h. Converting between numbers and strings
a = BigInt(2)^200
a=string(a) # convert number to string
parse(BigInt, a) # convert strings to numbers
i. Finding and replacing things inside strings
s = "My dear Frodo";
in('M', s) # true
occursin("Fro", s) # true
findfirst("My", s) # 1:2
replace(s, "Frodo" => "Frodo Baggins")
There are a lot of other functions as well:
length(str) - length of string (number of characters)
sizeof(str) - size of string in bytes
startswith(strA, strB) - does strA start with strB?
endswith(strA, strB) - does strA end with strB?
occursin(strA, strB) - does strA occur in strB?
all(isletter, str) - is str entirely letters?
all(isnumeric, str) - is str entirely number characters?
isascii(str) - is str ASCII?
all(iscntrl, str) - is str entirely control characters?
all(isdigit, str) - is str 0-9?
all(ispunct, str) - does str consist of punctuation?
all(isspace, str) - is str whitespace characters?
all(isuppercase, str) - is str uppercase?
all(islowercase, str) - is str entirely lowercase?
all(isxdigit, str) - is str entirely hexadecimal digits?
uppercase(str) - return a copy of str converted to uppercase
lowercase(str) - return a copy of str converted to lowercase
titlecase(str) - return copy of str with the first character of each word converted to uppercase
uppercasefirst(str) - return copy of str with first character converted to uppercase
lowercasefirst(str) - return copy of str with first character converted to lowercase
chop(str) - return a copy with the last character removed
chomp(str) - return a copy with the last character removed only if it's a newline
4. Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Arrays can be one-dimensional or multi-dimensional. An array is created using square brackets, the Array constructor or several other methods. Arrays support a lot of functionality within Julia, so I have covered them in more detail in this array-specific article. For now let's check out the key functionality.
# Defining
# Creating arrays by initializing
arr_Int64 = [1, 2, 3, 4, 5]
# Creating empty arrays
b = Int64[]
# Creating 2-d arrays
arr_2d = [1 2 3 4] # If you leave out the commas when defining an array, you can create 2D arrays quickly. Here's a single row, multi-column array:
arr_2d = [1 2 3 4 ; 5 6 7 8] # you can add another row using ;
# Creating arrays using range objects
a = 1:10 # creates a range with 10 elements from 1 to 10
collect(a) # collect turns a range into an array of its elements
[a...] # instead of collect, you could use the ellipsis (...) operator (three periods) after the last element
range(1, length=12, stop=100) # Julia calculates the missing pieces for you by combining the values for the keywords step, length, and stop
# Using comprehensions and generators to create arrays
[n^2 for n in 1:5] # a 1-d array
[r * c for r in 1:5, c in 1:5] # a 2-d array
# Reshape an array to create a multi-dimensional array
reshape([1, 2, 3, 4, 5, 6, 7, 8], 2, 4) # create a simple array and then change its shape
# Supports indexing and slicing
# 1-d
a[5] # 5th element
a[end] # last element
a[end-1] # second last element
# 2-d
a = [[1, 2] [3,4]]
a[2,2] # element at row-2 x col-2
a[:,2] # all elements of col-2
getindex(a, 2,2) # same as a[2,2]
# Elements can be added
a = [1, 2, 3]
push!(a, 4) # The push!() function pushes another item onto the back of an array
pushfirst!(a, 0) # To add an item at the front
# splice!() inserts or replaces elements in an array at a given index or range
splice!(a, 4:5, 4:6) # replace the elements at positions 4:5 with the range of numbers 4:6
L = ['a','b','f']; splice!(L, 3:2, ['c','d','e']) # insert c, d, e between b and f
# Elements can be removed
splice!(a,5); # If you don't supply a replacement, splice!() removes the element and moves the rest of them along
pop!(a) # To remove the last item
popfirst!(a) # To remove the first item
# Elementwise and vectorized operations
a / 100 # every element of the new array is the original divided by 100. These operations operate elementwise
n1 = 1:6;
n2 = 2:7;
n1 .* n2; # if two arrays are to be multiplied then we just add a . before the mathematical operator to signify elementwise
# the first element of the result is what you get by multiplying the first elements of the two arrays, and so on
# How function works on individual variables
f(a, b) = a * b
a=10;b=20;print(f(a,b))
# How function can be applied elementwise to arrays
n1 = 1:6;
n2 = 2:7;
print(f.(n1, n2))
5. Ability to loop your code in the sense that if you want to receive 10 names from a user, you will not write the code for that 10 times, but just once and tell the computer to loop through it 10 times
Julia has several looping options such as ‘for’ and ‘while’. There are also options of nesting (single, double, triple, ..) loops.
a. The While loop executes the same code again and again until a stop condition is met:
# while end - iterative conditional evaluation
x=0
while x < 4
    println(x)
    global x += 1
end
b. The for loop: acts as an iterator in Julia; it goes through items that are in a sequence or any other iterable item. Objects that we’ve learned about that we can iterate over include strings, arrays, tuples, sets, and dictionaries (including their keys or values).
# for end - iterative evaluation
# variables assigned inside the loop (like z) are local to it; use the global keyword to define a variable that outlasts the loop
for i in 1:10
    z = i
    println("z is $z")
end
# Some sample for loop headers for different data types (each still needs a body and a closing end)
for color in ["red", "green", "blue"] # an array
for letter in "julia" # a string
for element in (1, 2, 4, 8, 16, 32) # a tuple
for i in Dict("A"=>1, "B"=>2) # a dictionary
for i in Set(["a", "e", "a", "e", "i", "o", "i", "o", "u"]) # a set
Julia also provides the break and continue statements that allow us to alter the loops further. Following is their use:
break: Breaks out of the current closest enclosing loop.
continue: Goes to the top of the closest enclosing loop.
# Example with break statement
x=0
while true
    println(x)
    global x += 1 # global is needed in a script because x is defined outside the loop
    x >= 4 && break # breaks out of the loop
end
break and continue statements can appear anywhere inside the loop’s body, but we will usually put them further nested in conjunction with an if statement to perform an action based on some condition.
Following are some other looping options:
# list comprehensions
[i^2 for i in 1:10]
[(r,c) for r in 1:5, c in 1:2] # two iterators in a comprehension
# Generator expressions - generator expressions can be used to produce values from iterating a variable
sum(x^2 for x in 1:10)
# Enumerating arrays
m = rand(0:9, 3, 3)
[i for i in enumerate(m)]
# Zipping arrays
for i in zip(0:10, 100:110, 200:210)
println(i)
end
# Iterable objects
ro = 0:2:100
[i for i in ro]
6. Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Julia provides several options to apply conditional logic. Let's take a look at them:
a. ternary and compound expressions:
x = 1
x > 3 ? "yes" : "no"
b. Boolean switching expressions:
isodd(1000003) && @warn("That's odd!")
isodd(1000004) || @warn("That's even!")
c. if elseif else end - conditional evaluation:
name = "Julia"
if name == "Julia"
println("I like Julia")
elseif name == "Python"
println("I like Python.")
println("But I prefer Julia.")
else
println("I don't know what I like")
end
d. Error handling using try...catch. This allows the code to keep executing even if an error occurs, which would usually halt the program.
# try catch error throw exception handling
try
<statement-that-might-cause-an-error>;
catch e # error gets caught if it happens
println("caught an error: $e") # show the error if you want to
end
println("but we can continue with execution...")
# Example 1 - error doesnt occur
try
a=10 # no error
catch e
print(e)
end
# Example 2 - error occurs
try
la-la-la # undefined variable error
catch e
print(e)
end
7. Put your code in functions
Functions allow us to create a block of code that can be executed many times without needing to write it again.
Julia has something called a single expression function. These are usually defined in one line like so:
# Single expression functions
f(x) = x * x
g(x, y) = sqrt(x^2 + y^2)
Functions with multiple expressions are also supported and can be defined using the function keyword:
# Syntax
# Functions with multiple expressions
function say_hello(name)
println("hello ", name)
end
say_hello("vivek")
Additionally, functions can be programmed to return a single value or multiple values using the return keyword.
# define function which returns a value
function add_numbers(a,b)
return a+b
end
# call the function
add_numbers(2,3)
# define function which returns multiple values
function add_multiply_numbers(a, b=10) # we can supply default values as well
return(a+b, a*b)
end
# call the function
add_multiply_numbers(2,3)
add_multiply_numbers(2)
args... lets a function take an arbitrary number of arguments. A for loop can be used to iterate over these arguments.
function show_args(args...)
for arg in args
println(arg," ")
end
end
show_args(10,20,25,35,50)
Julia also supports anonymous functions, with no name.
map((x,y,z) -> x + y + z, [1,2,3], [4, 5, 6], [7, 8, 9])
Map and reduce can also be used to apply functions to arrays.
Map - If you already have a function and an array, you can call the function for each element of the array by using map()
a=1:10;
map(sin, a) # map() returns a new array but if you call map!() , you modify the contents of the original array
The map() function collects the results of some function working on each and every element of an iterable object, such as an array of numbers.
map(+, 1:10)
The reduce() function does a similar job, but after every element has been seen and processed by the function, only one is left. The function should take two arguments and return one.
reduce(+, 1:10)
8. Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Julia allows users to create user-defined types using abstract type (which cannot be instantiated) or mutable struct (which is concrete). Let's take a look at both.
Abstract type
abstract type MyAbstractType end # By default, the type you create is a direct subtype of Any
abstract type MyAbstractType2 <: Number end # the new abstract type is a subtype of Number
Concrete type using mutable struct
# define the data type
mutable struct student <: Any
name
age::Int
end
# initialize a variable of that data type
x=student("vivek", 30)
# use the variable
x.name
x.age
9. Read file from a disk and save file to a disk
Let's see how to read a file in an organized way.
f = open("sherlock-holmes.txt") # To read text from a file, first obtain a file handle:
close(f) # When you've finished with the file, you should close the connection
If you use the following technique then you don't need to close the file. The open file is automatically closed when this block finishes.
open("sherlock-holmes.txt") do file
# do stuff with the open file
end
10. Ability to comment your code so you can understand it when you revisit it some time later
We can tell Julia that a line of code is a comment by starting it with a #.
# this is a comment
Overall, Julia is a powerful and flexible programming language that is well-suited for scientific computing and other high-performance tasks. With its emphasis on performance, multiple dispatch, and a growing ecosystem of packages and tools, Julia is a valuable tool for researchers, data scientists, and other professionals who need a fast, flexible, and expressive language for their work.
To close I will emphasize the importance of practicing in learning anything new. Persistence and trying out different combinations of these building blocks for solving easier problems first and more complex ones later on is the only way to become fluent.
Comments welcome!
-
Introduction to Programming in Ruby
Quick Introduction
Ruby is a high-level, interpreted programming language that was created in the mid-1990s by Yukihiro “Matz” Matsumoto. It is a general-purpose language that is designed to be easy to use and read, with syntax that is similar to natural language. Ruby is often used for web development, as well as for building command-line utilities, desktop applications, and other types of software.
One of the key features of Ruby is its emphasis on programmer productivity and ease of use. Ruby’s syntax is designed to be intuitive and easy to read, making it accessible to both beginner and experienced programmers. Ruby also includes a number of built-in features and libraries that make it easy to accomplish common programming tasks, such as working with strings, arrays, and hashes.
Another important feature of Ruby is its object-oriented programming model. Everything in Ruby is an object, and methods can be defined on objects to add functionality. Ruby also includes support for inheritance, encapsulation, and polymorphism, which makes it a powerful tool for building complex software systems.
Ruby is also known for its extensive library of open-source gems, which are pre-built packages of code that can be easily integrated into Ruby projects. These gems provide a wide range of functionality, from database access to web development frameworks, and can save developers a significant amount of time and effort in building software.
One of the most popular web development frameworks built in Ruby is Ruby on Rails. Rails is a full-stack web framework that provides a set of conventions and tools for building web applications quickly and easily. With its focus on developer productivity, Rails has become a popular choice for startups and small businesses, as well as for larger enterprises.
Most modern programming languages have a set of similar building blocks, for example:
Receiving input from the user and Showing output to the user
Ability to store values in variables (usually of different kinds such as integers, floating points or character)
A string of characters where you can store names, addresses, or any other kind of text
Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Ability to loop your code: if you want to receive 10 names from a user, you do not write the code for that 10 times, but just once, and tell the computer to loop through it 10 times
Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Put your code in functions
Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Read file from a disk and save file to a disk
Ability to comment your code so you can understand it when you revisit it some time later
Let's dive right in and see how we can do these things in Ruby.
0. How to install Ruby on your desktop?
Before we can begin writing programs in Ruby, we need to set up our Ruby environment.
You can install Ruby from here ruby-lang.org.
Additionally, you need to install an IDE to write and execute Ruby code. My personal favorite is code.visualstudio.com.
Lastly, you will also need to install the following extensions within VSCode: Ruby (Peng Lv) and Code Runner (Jun Han).
Now, let's write a simple program that prints out Hello World for the user to see:
print 'Hello World !!!'
1. Receiving input from the user and Showing output to the user
There are several ways in which we can show output to the user. Let’s look at some ways of showing output:
#Method 1:
print 'Hello World !!!'
#Method 2:
p 'Hello World !!!'
#Method 3:
puts 'Hello World !!!'
#Method 4: Showing data stored in variables to user
my_name = "Vivek"
puts "Hello #{my_name}"
#Method 5: Showing multiple variables using same puts statement
aString = "I'm a string!"
aBoolean = true
aNumber = 42
puts "string: #{aString} \nboolean: #{aBoolean} \nnumber: #{aNumber}"
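The heading above also mentions input, which the examples so far do not show. The following is a minimal sketch, assuming a hypothetical helper named greet_user: it reads one line with gets, removes the trailing newline with chomp, and returns a greeting. The io parameter is an assumption added here so that the same code can read either from the keyboard ($stdin) or from a prepared string.

```ruby
require 'stringio'

# Hypothetical helper (not part of the original examples): reads one line
# from the given IO object and builds a greeting from it.
def greet_user(io = $stdin)
  name = io.gets.to_s.chomp   # gets keeps the trailing newline; chomp removes it
  "Hello #{name}"
end

# Simulate a user typing "Vivek" and pressing Enter
fake_input = StringIO.new("Vivek\n")
greeting = greet_user(fake_input)
puts greeting
```

In an interactive script you would call greet_user with no argument, and it would wait for the user to type a line.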
2. Ability to store values in variables (usually of different kinds such as integers, floating points or character)
There are three main types of variable:
Strings (a collection of characters inside quotation marks)
Booleans (true or false)
Numbers (numeric values)
Following are some examples:
aString = "I'm a string!"
aBoolean = true
aNumber = 42
puts "string: #{aString} \nboolean: #{aBoolean} \nnumber: #{aNumber}"
Performing basic math on numeric variables. There are 6 types of basic operations: addition, subtraction, multiplication, division, modulo and exponent.
a = 5
b = 2
puts "sum: #{a+b}\n" \
"difference: #{a-b}\n" \
"multiplication: #{a*b}\n" \
"division: #{a/b}\n" \
"modulo: #{a%b}\n" \
"exponent: #{a**b}"
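One detail worth seeing in isolation is that dividing two integers gives an integer result. The sketch below reuses a = 5 and b = 2 from the example above; the other variable names are only illustrative.

```ruby
a = 5
b = 2

int_quotient   = a / b              # both operands are integers, so the result is rounded down
float_quotient = a / b.to_f         # converting one operand to a float keeps the fraction
quotient, remainder = a.divmod(b)   # quotient and remainder in a single call

puts "integer division: #{int_quotient}"
puts "float division: #{float_quotient}"
puts "divmod: #{quotient} remainder #{remainder}"
```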
3. A string of characters where you can store names, addresses, or any other kind of text
You can use single quotes or double quotes for strings. Both are acceptable, but only double-quoted strings support interpolation with #{...} and escape sequences such as \n.
myFirstString = 'I am a string!' #single quotes
mySecondString = "Me too!" #double quotes
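The two quote styles differ in how they treat interpolation and escape sequences. A small sketch of that difference, with variable names chosen only for illustration:

```ruby
name = "Vivek"

double_quoted = "Hello #{name}\n"   # interpolates name and turns \n into a newline
single_quoted = 'Hello #{name}\n'   # keeps every character exactly as typed

puts double_quoted.length
puts single_quoted.length
puts single_quoted
```

The single-quoted version is longer because the text #{name} and the two characters \ and n are stored literally.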
There are a few common operations that we will focus on:
"Hi!".length #is 3
"Hi!".reverse #is !iH
"Hi!".upcase #is HI!
"Hi!".downcase #is hi!
# You can also use many methods at once. They are solved from left to right.
"Hi!".downcase.reverse #is !ih
# If you want to check if one string contains another string, you can use .include?.
"Happy Birthday!".include?("Happy")
4. Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Arrays allow you to group multiple values together in a list. Each value in an array is referred to as an “element”.
a. Defining an array:
myArray = [] # an empty array
myOtherArray = [1, 2, 3] # an array with three elements
b. Accessing array elements:
# In order to add to or change elements in an array, you can refer to an element by number.
myOtherArray[3] = 4
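Reading elements works through the same index notation. The sketch below uses a hypothetical array called numbers to show zero-based indexing, negative indexes, appending, and what happens when you assign past the end of an array.

```ruby
numbers = [1, 2, 3]

first = numbers[0]    # indexes start at 0
last  = numbers[-1]   # negative indexes count from the end
numbers << 4          # append one element to the end
numbers[5] = 6        # assigning past the end fills the gap with nil

puts numbers.inspect
puts numbers.length
```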
Ruby has another advanced data type called Hash, which is similar to a Python dictionary. Just like arrays, hashes allow you to store multiple values together. However, while arrays store values with a numerical index, hashes store information using key-value pairs. Each piece of information in the hash has a unique label, and you can use that label to access the value.
a. To create a hash, use Hash.new, or myHash={}. For example:
myHash=Hash.new()
myHash["Key"]="value"
myHash["Key2"]="value2"
# or
myHash={
"Key" => "value",
"Key2" => "value2"
}
b. To access elements of a hash:
puts myHash["Key"] # puts value
Instead of using a string as a key, you can also use a symbol, like this:
a. To create a hash, use Hash.new, or myHash={}. For example:
myHash=Hash.new()
myHash[:Key]="value"
myHash[:Key2]="value2"
# or
myHash={
Key: "value",
Key2: "value2",
}
b. To access elements of a hash:
puts myHash[:Key] # puts "value"
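Beyond creating and reading a hash, it is common to add keys, handle missing keys, and loop over all pairs. A minimal sketch, using a made-up prices hash:

```ruby
prices = { apple: 30, banana: 10 }

prices[:cherry] = 50                 # add a new key-value pair
missing  = prices[:mango]            # unknown keys return nil
fallback = prices.fetch(:mango, 0)   # fetch lets you supply a default instead

labels = []
prices.each do |fruit, price|        # each yields one key and its value at a time
  labels << "#{fruit}: #{price}"
end
puts labels
```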
5. Ability to loop your code: if you want to receive 10 names from a user, you do not write the code for that 10 times, but just once, and tell the computer to loop through it 10 times
Ruby has several looping options (For, While, and Until). There are options of nesting (single, double, triple, ..) loops as well.
a. For loop executes code once for each element in expression. Following example shows how a for loop works:
# Syntax
for variable [, variable ...] in expression [do]
code
end
# Example
for i in 0..5
puts "Value of local variable is #{i}"
end
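Besides for, Ruby code very often loops with iterator methods such as each, times and step, which take a block. This is a short sketch of those alternatives; the variable names are illustrative only.

```ruby
squares = []
(0..5).each do |i|                 # same range as the for example above
  squares << i * i
end

stars = ""
3.times { stars += "*" }           # run a block a fixed number of times

evens = []
0.step(10, 2) { |n| evens << n }   # count from 0 to 10 in steps of 2

puts squares.inspect
puts stars
puts evens.inspect
```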
b. While loop executes code while conditional is true. A while loop’s conditional is separated from code by the reserved word do, a newline, backslash \, or a semicolon ;. Following example shows how a while loop works:
# Syntax
while conditional [do]
code
end
# Example
a=1
b=5
while a<=b
puts "run #{a}"
a=a+1
end
# Ruby while modifier - Executes code while conditional is true.
code while condition
# or
begin # If a while modifier follows a begin statement with no rescue or ensure clauses, code is executed once before conditional is evaluated.
code
end while conditional
c. Until loop executes code while conditional is false. An until statement’s conditional is separated from code by the reserved word do, a newline, or a semicolon. Following example shows how an until loop works:
# Syntax
until conditional [do]
code
end
# Example
$i = 0
$num = 5
until $i > $num do
puts("Inside the loop i = #$i" )
$i +=1;
end
# Ruby until modifier - Executes code while conditional is false.
code until conditional
# or
begin # If an until modifier follows a begin statement with no rescue or ensure clauses, code is executed once before conditional is evaluated.
code
end until conditional
d. Ruby also offers following keywords that can modify the behavior of the above loops:
# break - Terminates the most internal loop. Terminates a method with an associated block if called within the block (with the method returning nil).
# next - Jumps to the next iteration of the most internal loop. Terminates execution of a block if called within a block (with yield or call returning nil).
# redo - Restarts this iteration of the most internal loop, without checking loop condition. Restarts yield or call if called within a block.
# retry - If retry appears in rescue clause of begin expression, restart from the beginning of the begin body.
# retry - In Ruby 1.9 and later, retry is only valid inside a rescue clause. Older versions also allowed it inside an iterator, a block, or the body of a for expression, where it restarted the iterator call and re-evaluated its arguments.
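The keywords are easier to follow in a running example. The sketch below shows next and break inside a loop, and retry inside a rescue clause; the failure on the first two attempts is simulated for illustration.

```ruby
collected = []
(1..10).each do |n|
  next if n.odd?    # skip the rest of this iteration for odd numbers
  break if n > 6    # leave the loop entirely once n passes 6
  collected << n
end
puts collected.inspect

# retry re-runs the begin block from inside a rescue clause
attempts = 0
begin
  attempts += 1
  raise "temporary failure" if attempts < 3   # simulated error on the first two attempts
rescue RuntimeError
  retry             # jump back to begin and try again
end
puts attempts
```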
6. Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Conditionals are used to add branching logic to your programs; they allow you to include complex behaviour that only occurs under specific conditions.
a. If - if condition is an expression that can be checked for truth. If the expression evaluates to true, then the code within the block is executed.
if condition
something to be done
end
# Ruby if modifier - executes code if the conditional is true.
code if condition
Following is an actual example of an if statement with both an elsif and an else.
booleanOne = true
randomCode = "Hi!"
if booleanOne
puts "I will be printed!"
elsif randomCode.length>=1
puts "Even though the above code is true, I won't be executed because the earlier if statement was true!"
else
puts "I won't be printed because the if statement was executed!"
end
b. If Else - You can combine if with the keyword else. This lets you execute one block of code if the condition is true, and a different block if it is false. The else block will only be executed if the if block doesn’t run, so they will never both be executed.
if condition
something to be done
else
something to be done if the condition evaluates to false
end
c. Elsif - When you want more than two options, you can use elsif. This allows you to add more conditions to be checked. Still only one of the code blocks will be run, because the statement only executes the code in the first applicable block; once a condition has been satisfied, the whole statement ends. Here is the if/elsif/else statement syntax:
if condition
something to be done
elsif different condition
something else to be done
else
another different thing to be done
end
d. Unless - Executes code if conditional is false. If the conditional is true, code specified in the else clause is executed.
unless condition
# thing to be done if the condition is false
else
# else is optional
# thing to be done if the condition is true
end
# Ruby unless modifier - Executes code if conditional is false.
code unless conditional
e. Case - this is basically the same as an if-elsif-else statement, but with clearer syntax.
# case statement syntax
case expr0
when expr1, expr2
stmt1
when expr3, expr4
stmt2
else
stmt3
end
# is basically similar to the following −
if expr1 === expr0 || expr2 === expr0
stmt1
elsif expr3 === expr0 || expr4 === expr0
stmt2
else
stmt3
end
Example of case statement
$age = 5
case $age
when 0 .. 2
puts "i will not be printed"
when 3 .. 6
puts "i will be printed"
when 7 .. 12
puts "i will not be printed"
when 13 .. 18
puts "youth"
else
puts "i will not be printed"
end
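A case statement is also an expression, so its result can be returned from a method or stored in a variable. The sketch below assumes a hypothetical helper named age_group with made-up labels; each when clause is compared using ===, which for a range means "is the value inside this range".

```ruby
# Hypothetical helper: maps an age to a label using ranges in a case statement
def age_group(age)
  case age
  when 0..2   then "baby"
  when 3..12  then "child"
  when 13..18 then "youth"
  else             "adult"
  end
end

puts age_group(5)
puts age_group(15)
```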
7. Put your code in functions
a. In Ruby we call functions methods. Methods are reusable sections of code that perform specific tasks in our program. Using methods means that we can write simpler, more easily readable code.
# syntax
def methodname
# method code here
end
b. Methods can also be defined to accept and process any parameters that are passed to them:
# Methods With Parameters
def laugh(number)
puts "haha " * number
end
c. We can call methods using the name of the method and specify the parameters within parentheses or without them:
# Using method - calling method as follows prints "haha " 5 times on the screen
laugh(5)
# You can also call laugh without parentheses
laugh 5
d. We can set default values for the parameters, which will be used if method is called without passing the required parameters
def method_name(var1 = value1, var2 = value2)
expr..
end
e. We can also return values. return statement in ruby is used to return one or more values from a Ruby Method.
return
# or
return 12
# or
return 1,2,3
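Default values and multiple return values can be combined in one method. A minimal sketch, using a hypothetical method named add_and_multiply:

```ruby
# Hypothetical method: returns two values at once
def add_and_multiply(a, b = 10)   # b falls back to 10 when it is omitted
  return a + b, a * b             # multiple values come back as one array
end

sum, product = add_and_multiply(2, 3)   # split the returned array into two variables
pair = add_and_multiply(2)              # uses the default b = 10

puts sum
puts product
puts pair.inspect
```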
f. We can also define methods with variable number of parameters, like so:
# Variable Number of Parameters
def sample(*test)
puts "The number of parameters is #{test.length}"
for i in 0...test.length
puts "The parameters are #{test[i]}"
end
end
sample "Zara", "6", "F"
sample "Mac", "36", "M", "MCA"
8. Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Ruby allows users to create classes. These can be a combination of variables and functions that operate on those variables. Let's take a look at how we can define and use them.
# Define a class (class names must start with a capital letter)
class Employee
@@no_of_customers = 0
def initialize(id, name, addr)
@cust_id = id
@cust_name = name
@cust_addr = addr
end
end
# Creating an object of the class and using that
cust1 = Employee.new("1", "Vivek", "Somewhere on the, Internet")
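To actually read the stored values back, a class needs reader methods. The following sketch assumes a hypothetical Customer class; attr_reader, the count class method and to_s are additions for illustration and are not part of the example above.

```ruby
class Customer
  attr_reader :name, :address    # generate getter methods for these fields

  @@count = 0                    # class variable shared by all instances

  def initialize(id, name, address)
    @id = id
    @name = name
    @address = address
    @@count += 1                 # count every object that gets created
  end

  def self.count                 # class method: called on Customer itself
    @@count
  end

  def to_s                       # used by puts and by string interpolation
    "Customer #{@id}: #{@name}"
  end
end

c1 = Customer.new(1, "Vivek", "Somewhere on the Internet")
c2 = Customer.new(2, "Asha", "Elsewhere")
puts c1
puts c2.name
puts Customer.count
```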
9. Read file from a disk and save file to a disk
Let's see how to read and parse CSV in an organized way. CSV is the most common file type you will be using for data science, however Ruby can read several other file types as well.
require 'csv'
# read a csv
CSV.read("file.csv")
# parse a string of text which is in csv format
CSV.parse("1,penny\n2,nickel\n3,dime")
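The heading also mentions saving a file. The sketch below writes a plain text file and reads it back using File.open and File.foreach. The file name notes.txt and the temporary folder are assumptions made so that the example cleans up after itself.

```ruby
require 'tmpdir'

lines_read = []
Dir.mktmpdir do |dir|                  # temporary folder, removed when the block ends
  path = File.join(dir, "notes.txt")   # hypothetical file name

  File.open(path, "w") do |file|       # "w" creates the file or overwrites it
    file.puts "first line"
    file.puts "second line"
  end                                  # the file is closed automatically here

  File.foreach(path) do |line|         # read the file back one line at a time
    lines_read << line.chomp
  end
end
puts lines_read.inspect
```

Passing a block to File.open means there is no need to call close yourself.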
10. Ability to comment your code so you can understand it when you revisit it some time later
a. We can tell Ruby that a line of code is a comment by starting it with #.
#this is a comment
b. We can also specify a comment block, like so:
=begin
There are three main types of variable:
1. Strings (a collection of symbols inside speech marks)
2. Booleans (true or false)
3. Numbers (numeric values)
=end
Overall, Ruby is a powerful and flexible programming language that is well-suited for a wide range of programming tasks. With its focus on ease of use, object-oriented design, and extensive library of gems, Ruby is a valuable tool for both beginner and experienced programmers. Whether building web applications, desktop utilities, or other types of software, Ruby provides a fast, flexible, and enjoyable development experience.
To close I will emphasize the importance of practicing in learning anything new. Persistence and trying out different combinations of these building blocks for solving easier problems first and more complex ones later on is the only way to become fluent.
Comments welcome!
-
Introduction to Programming in C++
Quick Introduction
C++ is a powerful and popular programming language that was developed in the 1980s as an extension of the C programming language. It is a high-level, object-oriented language that is used to develop a wide range of applications, including operating systems, device drivers, game engines, and more. C++ is also widely used in the field of finance and quantitative analysis, due to its speed and efficiency.
One of the key features of C++ is its ability to directly manipulate memory, allowing for low-level control over the hardware. C++ is also known for its efficiency and speed, making it a popular choice for developing applications that require high performance, such as video games and real-time systems.
Another key feature of C++ is its support for object-oriented programming (OOP). This allows programmers to define their own classes and objects, and to encapsulate data and functionality within those objects. OOP allows for code reusability, modularity, and flexibility, making it a popular paradigm in software development.
C++ is also known for its support for templates and generic programming. Templates allow programmers to write generic code that can work with different data types, without having to write separate code for each type. This can greatly simplify code development and maintenance, and can make C++ code more efficient and easier to read.
Most modern programming languages have a set of similar building blocks, for example:
Receiving input from the user and Showing output to the user
Ability to store values in variables (usually of different kinds such as integers, floating points or character)
A string of characters where you can store names, addresses, or any other kind of text
Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Ability to loop your code: if you want to receive 10 names from a user, you do not write the code for that 10 times, but just once, and tell the computer to loop through it 10 times
Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Put your code in functions
Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Read file from a disk and save file to a disk
Ability to comment your code so you can understand it when you revisit it some time later
Let's dive right in and see how we can do these things in C++.
0. How to install C++ on your desktop?
Before we can begin to write a program in C++, we need to install Dev-C++. Once done, go ahead and open the IDE and try out the following code to see if everything is in order.
#include <iostream>
using namespace std;
int main() {
cout << "Hello World!";
return 0;
}
As you noticed, unlike languages such as Python, R or Ruby, it takes more than a few statements just to display basic text to the user in C++. In the next section we will dismantle this code and understand its components. First, let's cover a few important points:
In C++ we need to end each statement with a semicolon ;
The scope of statements is defined using curly brackets {}, unlike Python where the scope is defined through indentation
All statements need to be within a function. Here we have included the statements in the main() function, which is the first function executed when the program runs. All other functions are called, directly or indirectly, from within this function.
1. Receiving input from the user and Showing output to the user
The following program shows output to the user. The #include statement pulls in the iostream header file, which plays a role similar to importing a Python library. This header file declares the basic input and output facilities. Next comes int main(), which says that the main function will return an integer after execution. Within the main function we use cout << to show the text to the user. The text is enclosed in double quotes “text”. endl after the text inserts a new line in the output and flushes the output buffer. Finally we return 0, as the main function is supposed to return an integer; 0 signifies that everything was in order during the execution of the program.
#include <iostream>
using namespace std;
int main() {
cout << "This is some text." << endl;
return 0;
}
We can modify this program to accept input from the user. The cin >> statement allows us to receive input. The variable in which we store the received input needs to be defined beforehand.
#include <iostream>
using namespace std;
int main() {
int age_ = 0;
cout << "What is your age?";
cin>>age_;
cout << "So your age is: " << age_;
return 0;
}
2. Ability to store values in variables (usually of different kinds such as integers, floating points or character)
C++ is statically typed: you need to declare a variable’s data type and name before using it.
Basic data types: In C++ we have several types of variables; let's take a look at the important ones:
// Integer
int numberCats=5;
long int numberStars=5; //long int can be used for storing larger values (each variable needs its own name)
// Floating point numbers. These are numbers with digits after the decimal point
float pi=3.1415926535; //float keeps only about 7 significant digits, so pi is stored approximately
// Double
double dValue=3.1415926535; //for more significant digits we need to use other variable type than float
long double ldValue=3.1415926535;
// Boolean
bool bval=true; //boolean type is true or false; c++ uses 1 for true and 0 for false when outputting
// Character
char cval=55, cval2='7'; //takes exactly 1 byte of computer memory, char represents single characters from the ascii character set, 55 is the ascii code for 7, this is not the number 7 but the character 7
// String
string myname;
3. A string of characters where you can store names, addresses, or any other kind of text
A string in C++ can be defined using the string type from the <string> header. It can be assigned using input from the user or by providing text within double quotes “text”. Note that cin >> reads only up to the first whitespace, so it captures a single word.
string yourName;
cout << "\n\nwhat is your name? ";
cin >> yourName;
cout <<"\nnice to meet you "<<yourName<<endl<<endl;
4. Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Arrays are a series of variables of the same type stored together under one name. Arrays can be one-dimensional or multi-dimensional.
One-dimensional arrays:
// Defining
int ar[3];
// Initializing the array
ar[0]=10;
ar[1]=20;
ar[2]=30;
// Supports indexing
cout<<ar[0]; // this will output the value stored at index 0, which is 10
Multi-dimensional arrays:
// Defining and initializing the array in one statement
int mar[3][2]={ //multi-dim array with 3 rows and 2 columns
{34,188},
{29,165},
{29,160}
};
// Supports indexing
cout<<mar[0][0]; // this will output the value stored at row index 0 x column index 0, which is 34
Loops can be used to iterate over one-dimensional or multi-dimensional arrays. We will take a closer look at this in the next section.
5. Ability to loop your code: if you want to receive 10 names from a user, you do not write the code for that 10 times, but just once, and tell the computer to loop through it 10 times
C++ has several looping options such as ‘for’, ‘while’ and ‘do while’. There are also options of nesting (single, double, triple, ..) loops.
a. The for loop
// Syntax
for (int i=0;i<10;i++){
// statements to do stuff
}
// iterate over elements of one-dimensional array
// practice - create an array with a table of 12
int t12[10];
for (int i=0;i<10;i++){
t12[i]=12*(i+1);
}
// iterate over elements of two-dimensional array (concept of nesting - we will enclose a for loop within another for loop)
int mar[3][2]={
{34,188},
{29,165},
{29,160}
}; //multi-dim array
cout<<"\nthis is a multi-dimensional array: ";
for (int i=0;i<3;i++){ //3 rows in the array
cout<<"\nrow "<<i+1<<": ";
for (int j=0;j<2;j++){ //2 columns in the array
cout<<"col "<<j+1<<": "<<mar[i][j]<<", ";
}
}
b. The While loop executes the same code again and again until a stop condition is met:
// Syntax
int i=0;
while (i<10){
// code statements
i+=1;
}
// Example
int i=1;
cout<<"\n\nwhile loop - first 10 natural numbers"<<endl;
while (i<=10){
cout<<i<<", ";
i+=1; //same as i=i+1 or i++
}
c. The Do-While loop executes the same code again and again until a stop condition is met. The difference from the while loop is that in a do-while loop the body of the loop is executed at least once before the condition is checked.
// Syntax
int i=0;
do{
// code statements
i+=1;
}while (i<10);
// Example
//for example if you want the user to enter the password again and again until they enter the correct password
cout<<"\n\ndo-while loop\n";
i=1;
string pass="pass", pass2;
do{
if(i!=1){
cout<<"\naccess denied, try again";
}
cout<<"\nenter your password?";
cin>>pass2;
i=0;
}while(pass2 != pass);
cout<<"\npassword accepted\n\n";
C++ also provides the break and continue statements that allow us to alter the loops further. Following is their use:
break jumps immediately out of the loop. It is mostly used in while loops but can also be used in for loops
// break statement example
cout<<"\nbreak statement\n";
for(int f=1;f<11;f++){
if(f==5){
break; //we break out of the loop when f==5, so the loop body does not run for f>=5
}
cout<<f<<", ";
}
continue is similar to break, but it only skips the rest of the current iteration; the loop then carries on with the next iterations
// continue statement example
cout<<"\ncontinue statement\n";
for(int f=1;f<11;f++){
if(f==5){
continue;
}
cout<<f<<", "; //this statement not executed for f==5
}
6. Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
C++ provides if, if..else, and switch statements to apply conditional logic. Let's take a look at them:
a. The basic syntax for creating an if statement (here with an else branch) is:
/////////// IF STATEMENT ////////////
string pass="password",pass2;
cout<<"\n\n--if statement capability--\n";
cout<<"\nenter password:";
cin>>pass2;
if (pass==pass2){
cout<<"\npassword matches! you can enter!!";
} else{
cout<<"\npassword doesnt match! begone!!";
}
b. The basic syntax for creating an if…else if…else statement is:
/////////// IF-ELSE STATEMENT ////////////
int menuChoice=5;
cout<<"\n\n--if-else statement capability--\n";
cout<<"\n1.\tadd record";
cout<<"\n2.\tdelete record";
cout<<"\n3.\texit";
cout<<"\nwhat do you want to do?";
cin>>menuChoice;
if (menuChoice==1){
cout<<"\nlets add some records!!";
} else if (menuChoice==2){
cout<<"\nlets delete some records!!";
} else{
cout<<"\nexiting! good-bye!!";
}
c. The basic syntax for creating a switch statement is:
/////////// SWITCH STATEMENT ////////////
int menuChoice2=5;
cout<<"\n\n--switch statement capability--\n";
cout<<"\n1.\tadd record";
cout<<"\n2.\tdelete record";
cout<<"\n3.\texit";
cout<<"\nwhat do you want to do?";
cin>>menuChoice2;
switch(menuChoice2){
case 1:
cout<<"\nlets add some records!!";
break;
case 2:
cout<<"\nlets delete some records!!";
break;
case 3:
cout<<"\nexiting! good-bye!!";
break;
default:
cout<<"\n!!!!error!!!!";
}
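The marks example from the heading can be sketched the same way. Both function names below (exam_result and menu_action) are assumptions made for this illustration; the pass mark of 40 comes from the heading.

```cpp
#include <string>
using namespace std;

// Hypothetical helper: a student passes when marks are more than 40.
string exam_result(int marks) {
    if (marks > 40) {
        return "pass";
    } else {
        return "fail";
    }
}

// Hypothetical helper: the menu from the switch example, returning the action as text.
string menu_action(int choice) {
    switch (choice) {
        case 1:  return "add record";
        case 2:  return "delete record";
        case 3:  return "exit";
        default: return "error";   // any choice other than 1, 2 or 3
    }
}
```

Because every case returns straight away, no break statements are needed in this version of the switch.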
7. Put your code in functions
Functions allow us to create a block of code that can be executed many times without needing to write it again.
// Following is an example case where we define a function that shows a menu to the user
int sub_menu(int choice) {
switch(choice){
case 1:
cout<<"\nLets add a new record";
break;
case 2:
cout<<"\nLets view an existing record";
break;
case 3:
cout<<"\nLets delete an existing record";
break;
default:
cout<<"\nExiting! Goodbye!!";
}
return 0;
}
We can call the function by its name:
// let's say we are writing main() and we want to call the function
// lines-of-code
sub_menu(2); // pass the user's choice as the argument, here 2
// lines-of-code
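A function can also hand a result back to the caller. The sketch below assumes two made-up helpers, rectangle_area and total_area_of_two, to show one function reusing another.

```cpp
// Hypothetical helper: returns the area of a rectangle.
int rectangle_area(int width, int height) {
    return width * height;
}

// Hypothetical helper: reuses rectangle_area instead of repeating the formula.
int total_area_of_two(int w1, int h1, int w2, int h2) {
    return rectangle_area(w1, h1) + rectangle_area(w2, h2);
}
```

Inside main(), writing int area = rectangle_area(3, 5); stores 15 in area.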
8. Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
C++ allows users to create classes. A class can combine variables with functions that operate on those variables. Let's take a look at how we can define and use them.
// Create a Car class with some attributes
class Car {
public:
string brand;
string model;
int year;
};
// Create an object of Car
Car carObj1;
carObj1.brand = "Mahindra";
carObj1.model = "Scorpio";
carObj1.year = 2020;
// Using the object
cout << carObj1.brand << " " << carObj1.model << " " << carObj1.year << "\n";
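The class above holds only variables. As a rough sketch of the "functions that operate on those variables" part, the same Car class could carry two member functions. The names age and description are assumptions made for this illustration.

```cpp
#include <string>
using namespace std;

class Car {
public:
    string brand;
    string model;
    int year;

    // Hypothetical member function: age of the car in a given year.
    int age(int currentYear) const {
        return currentYear - year;
    }

    // Hypothetical member function: one-line description built from the attributes.
    string description() const {
        return brand + " " + model + " " + to_string(year);
    }
};
```

With the carObj1 values used earlier, carObj1.description() gives "Mahindra Scorpio 2020" and carObj1.age(2024) gives 4.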
9. Read file from a disk and save file to a disk
Let's see how to read and write a text file in an organized way. We include the fstream header, which provides the classes needed to read and write files.
#include <fstream>
// read a text file
string line;
ifstream myfile ("file.txt");
if (myfile.is_open())
{
while ( getline (myfile,line) )
{
cout << line << '\n';
}
myfile.close();
}
else cout << "Unable to open file";
// write a text file (a different variable name is used so it does not clash with the ifstream above)
ofstream outfile ("file.txt");
if (outfile.is_open())
{
outfile << "This is a line.\n";
outfile << "This is another line.\n";
outfile.close();
}
else cout << "Unable to open file";
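The two snippets can also be wrapped into reusable functions. This is only a sketch: write_lines and read_lines are hypothetical names, and a std::vector is assumed as a convenient way to hold the lines.

```cpp
#include <fstream>
#include <string>
#include <vector>
using namespace std;

// Hypothetical helper: writes each string as one line; returns false if the file cannot be opened.
bool write_lines(const string& path, const vector<string>& lines) {
    ofstream out(path);
    if (!out.is_open()) {
        return false;
    }
    for (const string& line : lines) {
        out << line << '\n';
    }
    return true;    // the file is closed automatically when out goes out of scope
}

// Hypothetical helper: reads a file line by line; returns an empty vector if it cannot be opened.
vector<string> read_lines(const string& path) {
    vector<string> lines;
    ifstream in(path);
    string line;
    while (getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}
```

Writing two lines with write_lines and then calling read_lines on the same path should give back the same two strings.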
10. Ability to comment your code so you can understand it when you revisit it some time later
We can mark a single line as a comment by starting it with two forward slashes.
// this is a comment
We can mark a multi-line block of text as a comment by wrapping it in /* and */.
/*
this
is
a
comment
block
*/
While C++ can be a powerful tool, it can also be complex and difficult to learn, especially for beginners. The language has a steep learning curve, and requires a solid understanding of programming concepts such as pointers, memory management, and OOP. However, with the right resources and dedication, C++ can be a rewarding and powerful tool for software development.
Overall, C++ is a popular and powerful programming language that is used in a wide range of applications, from operating systems to video games. Its efficiency, speed, and support for OOP and generic programming make it a versatile and powerful tool for software developers.
To close, I will emphasize the importance of practice when learning anything new. Persistence, and trying out different combinations of these building blocks to solve easier problems first and more complex ones later, is the only way to become fluent.
Comments welcome!
-
Introduction to Programming in Microsoft Excel VBA
Quick Introduction
Excel VBA, or Visual Basic for Applications, is a programming language that can be used to automate tasks and enhance functionality in Microsoft Excel. VBA is a powerful tool that allows users to write custom macros and functions to automate repetitive tasks, perform complex calculations, and create custom solutions.
VBA is a type of Visual Basic, which is an object-oriented programming language developed by Microsoft. VBA is integrated directly into Excel, making it easy to access and use. VBA code is stored in modules, which can be accessed through the Visual Basic Editor in Excel. In the Editor, users can write, edit, and run VBA code, as well as debug their code to identify and fix any errors.
One of the key advantages of VBA is that it allows users to automate repetitive tasks that would otherwise be time-consuming to perform manually. For example, users can write a VBA macro to format data, generate reports, or update data in bulk. VBA can also be used to perform complex calculations, create custom user interfaces, and interact with other applications.
To get started with VBA, users should have a basic understanding of programming concepts and syntax. The VBA language is based on Visual Basic, so many programming concepts, such as variables, loops, and conditional statements, are similar to other programming languages. Excel also provides many built-in functions and objects that can be used in VBA code, making it easy to access and manipulate data in a spreadsheet.
Most modern programming languages have a set of similar building blocks, for example:
Receiving input from the user and Showing output to the user
Ability to store values in variables (usually of different kinds such as integers, floating points or character)
A string of characters where you can store names, addresses, or any other kind of text
Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Ability to loop your code, in the sense that if you want to receive 10 names from a user, you will not write the code for that 10 times, but just once, and tell the computer to loop through it 10 times
Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Put your code in functions
Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Read file from a disk and save file to a disk
Ability to comment your code so you can understand it when you revisit it some time later
Let's dive right in and see how we can do these things in VBA.
0. Enable VBA in your Excel file
Before we can begin to write a program in VBA, also known as a macro, we need to enable the Developer tab. You can do this by going to File > Options > Customize Ribbon and ticking Developer. Once the Developer tab is available, go there and choose the leftmost option, which says Visual Basic. Now you will see a panel on the left where you can double-click on the name of the sheet you are working on. This will open an empty code window. Write the following code there and save the file as a macro-enabled workbook (the extension will be .xlsm).
Sub simple_hello()
Range("A2").Value = "Hello World!"
End Sub
Close the file, then open it again and choose the option (if shown) to enable macros. Now go to the Developer tab again and this time select the second option, called Macros. Here you should see the macro that you just created. Select it and hit Run!
1. Receiving input from the user and Showing output to the user
There are several ways in which a macro can show output to the user. Let’s look at some ways of showing output:
'Method 1:
Range("A2").Value = "Hello"
'Method 2:
Worksheets("Sheet1").Range("B2").Value = "Hello"
'Method 3:
Worksheets(1).Range("C2").Value = "Hello"
'Method 4:
MsgBox "I added Hello in cell A2, B2 and C2"
'Method 5:
MsgBox "Hello " & Range("C5").Value & vbNewLine & "So you are " & Range("C6") & " years old!"
2. Ability to store values in variables (usually of different kinds such as integers, floating points or character)
VBA has many variable types; four key ones are Integer, String, Double and Boolean.
Integer is for whole numbers, Double is for numbers with decimals, String is for text and Boolean is for True/False (yes/no) type of data. Here are some examples:
'Integer:
Dim x As Integer
x = 6
Range("A1").Value = x
'String:
Dim book As String
book = "bible"
Range("A1").Value = book
'Double:
Dim x As Double
x = 5.5
MsgBox "value is " & x
'Boolean:
Dim continue As Boolean
continue = True
If continue = True Then MsgBox "Boolean variables are cool"
3. A string of characters where you can store names, addresses, or any other kind of text
Key idea here is to learn how to manipulate string variables. There are a few common operations that we will focus on:
a. Joining strings
'Join Strings
Dim text1 As String, text2 As String
text1 = "Hi"
text2 = "Tim"
MsgBox text1 & " " & text2
b. Left/right or middle functions - To extract the leftmost/rightmost or middle characters from a string.
Dim text As String
text = "example text"
MsgBox Left(text, 4)
'Just as with Left, we can also extract a substring from the right or the middle
MsgBox Right("example text", 2)
MsgBox Mid("example text", 9, 2)
c. To get the length of a string, use Len.
MsgBox Len("example text")
d. To find the position of a substring in a string, use InStr.
MsgBox InStr("example text", "am")
4. Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Arrays are a series of values of the same type stored together in one variable. Arrays can be one-dimensional or multi-dimensional.
a. Following example shows how a one-dimensional array works:
Dim Films(1 To 5) As String
Films(1) = "Lord of the Rings"
Films(2) = "Speed"
Films(3) = "Star Wars"
Films(4) = "The Godfather"
Films(5) = "Pulp Fiction"
MsgBox Films(4)
b. Following example shows how a two-dimensional array works:
Dim Films(1 To 5, 1 To 2) As String
Dim i As Integer, j As Integer
For i = 1 To 5
For j = 1 To 2
Films(i, j) = Cells(i, j).Value
Next j
Next i
MsgBox Films(4, 2)
5. Ability to loop your code, in the sense that if you want to receive 10 names from a user, you will not write the code for that 10 times, but just once, and tell the computer to loop through it 10 times
VBA has several looping options (for, do-while, do-until). There are options of nesting (single, double, triple, ..) loops.
a. Following example shows how a simple/single for loop works:
Dim i As Integer
For i = 1 To 6
Cells(i, 1).Value = 100
Next i
b. Following example shows how a double for loop works:
Dim i As Integer, j As Integer
For i = 1 To 6
For j = 1 To 2
Cells(i, j).Value = 100
Next j
Next i
c. Following example shows how a triple for loop works:
Dim c As Integer, i As Integer, j As Integer
For c = 1 To 3
For i = 1 To 6
For j = 1 To 2
Worksheets(c).Cells(i, j).Value = 100
Next j
Next i
Next c
VBA also has a do-while loop. Following example shows how it works:
Dim i As Integer
i = 1
Do While i < 6
Cells(i, 1).Value = 20
i = i + 1
Loop
VBA also has a do-until loop. Following example shows how it works:
Dim i As Integer
i = 1
Do Until i > 6
Cells(i, 1).Value = 20
i = i + 1
Loop
6. Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
a. If Then Statement - VBA has the option of an if statement, which executes a piece of code only if a specified condition is met.
Dim score As Integer, result As String
score = Range("A1").Value
If score >= 60 Then result = "pass"
Range("B1").Value = result
b. If Else Statement - VBA has the option of an if-else statement, which executes a piece of code only if a specified condition is met, if not then it executes another piece of code.
Dim score As Integer, result As String
score = Range("A1").Value
If score >= 60 Then
result = "pass"
Else
result = "fail"
End If
Range("B1").Value = result
c. Select Case Statement - VBA has the option of a Select Case statement, which checks one value against several conditions in turn and executes the code of the first case that matches.
'Select Case
'First, declare two variables. One variable of type Integer named score and one variable of type String named result
Dim score As Integer, result As String
'We initialize the variable score with the value of cell A1
score = Range("A1").Value
'Add the Select Case structure
Select Case score
Case Is >= 80
result = "very good"
Case Is >= 70
result = "good"
Case Is >= 60
result = "sufficient"
Case Else
result = "insufficient"
End Select
'Write the value of the variable result to cell B1
Range("B1").Value = result
7. Put your code in functions
VBA allows us to specify a function or a sub. The difference between the two is that a function allows us to return a value whereas a sub does not.
a. Function - If you want Excel VBA to perform a task that returns a result, you can use a function. Place a function into a module (In the Visual Basic Editor, click Insert, Module). For example, the function with name Area.
'Explanation: This function has two arguments (of type Double) and a return type (the part after As also of type Double). You can use the name of the function (Area) in your code to indicate which result you want to return (here x * y).
Function Area(x As Double, y As Double) As Double
Area = x * y
End Function
'Explanation: The function returns a value so you have to 'catch' this value in your code. You can use another variable (z) for this. Next, you can add another value to this variable (if you want). Finally, display the value using a MsgBox.
Dim z As Double
z = Area(3, 5) + 2
MsgBox z
b. Sub - If you want Excel VBA to perform some actions, you can use a sub.
Place a sub into a module (In the Visual Basic Editor, click Insert, Module). For example, the sub with name Area.
Sub Area(x As Double, y As Double)
MsgBox x * y
End Sub
'Explanation: This sub has two arguments (of type Double). It does not have a return type! You can refer to this sub (call the sub) from somewhere else in your code by simply using the name of the sub and giving a value for each argument.
'Call it using Area 3, 5
8. Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
A VBA class module allows us to define our own object type with its own properties (data) and methods (procedures). Each object created from a class is independent of the others, but they all share the same definition. A detailed example of how to do this is out of the scope of this article.
9. Read file from a disk and save file to a disk
This is out of the scope of this article.
10. Ability to comment your code so you can understand it when you revisit it some time later
We can tell VBA that a line is a comment by starting it with a single quote (apostrophe).
'this is a comment
Overall, Excel VBA is a powerful tool that can help users automate tasks, improve productivity, and enhance the functionality of Microsoft Excel. With its flexibility and ease of use, VBA is a valuable tool for users of all skill levels, from beginners to advanced programmers.
To close, I will emphasize the importance of practice when learning anything new. Persistence, and trying out different combinations of these building blocks to solve easier problems first and more complex ones later, is the only way to become fluent.
Comments welcome!