-
-
Manage blog comments with Giscus
Giscus is a free comments system that works without your own database. Giscus uses GitHub Discussions to store and load the comments associated with each page, based on a chosen mapping (URL, pathname, title, etc.).
To comment, visitors must authorize the giscus app to post on their behalf using the GitHub OAuth flow. Alternatively, visitors can comment on the GitHub Discussion directly. You can moderate the comments on GitHub.
Prerequisites
Create a GitHub repository
You need a GitHub repository first. If you are going to use GitHub Pages to host your website, you can choose the corresponding repository (i.e., [userID].github.io).
The repository should be public, otherwise visitors will not be able to view the discussion.
Turn on Discussion feature
In your GitHub repository Settings, make sure that General > Features > Discussions feature is enabled.
Activate Giscus API
Follow the steps in Configuration guide. Make sure the verification of your repository is successful.
Then, scroll down the page and choose a Discussion Category. You don't need to touch the other settings.
Update _config.yml
Now you get the giscus script. Copy the four properties (repo, repo ID, category, and category ID) from the generated script.
Paste those values into _config.yml in the root directory:
# External API
giscus_repo: "[ENTER REPO HERE]"
giscus_repoId: "[ENTER REPO ID HERE]"
giscus_category: "[ENTER CATEGORY NAME HERE]"
giscus_categoryId: "[ENTER CATEGORY ID HERE]"
None
· 2024-02-03
-
-
Markdown from A to Z
Headings
To create a heading, add number signs (#) in front of a word or phrase. The number of number signs you use should correspond to the heading level. For example, to create a heading level three (<h3>), use three number signs (e.g., ### My Header).
Markdown | HTML | Rendered Output
# Header 1 | <h1>Header 1</h1> | Header 1
## Header 2 | <h2>Header 2</h2> | Header 2
### Header 3 | <h3>Header 3</h3> | Header 3
Emphasis
You can add emphasis by making text bold or italic.
Bold
To bold text, add two asterisks (e.g., **text** = text) or underscores before and after a word or phrase. To bold the middle of a word for emphasis, add two asterisks without spaces around the letters.
Italic
To italicize text, add one asterisk (e.g., *text* = text) or underscore before and after a word or phrase. To italicize the middle of a word for emphasis, add one asterisk without spaces around the letters.
Blockquotes
To create a blockquote, add a > in front of a paragraph.
> Yongha Kim is the best developer in the world.
>
> Factos 👍👀
Yongha Kim is the best developer in the world.
Factos 👍👀
Lists
You can organize items into ordered and unordered lists.
Ordered Lists
To create an ordered list, add line items with numbers followed by periods. The numbers don’t have to be in numerical order, but the list should start with the number one.
1. First item
2. Second item
3. Third item
4. Fourth item
First item
Second item
Third item
Fourth item
Unordered Lists
To create an unordered list, add dashes (-), asterisks (*), or plus signs (+) in front of line items. Indent one or more items to create a nested list.
* First item
* Second item
* Third item
* Fourth item
First item
Second item
Third item
Fourth item
Code
To denote a word or phrase as code, enclose it in backticks (`).
Markdown | HTML | Rendered Output
At the command prompt, type `nano`. | At the command prompt, type <code>nano</code>. | At the command prompt, type nano.
Escaping Backticks
If the word or phrase you want to denote as code includes one or more backticks, you can escape it by enclosing the word or phrase in double backticks (``).
Markdown | HTML | Rendered Output
``Use `code` in your Markdown file.`` | <code>Use `code` in your Markdown file.</code> | Use `code` in your Markdown file.
Code Blocks
To create a code block that spans multiple lines of code, set the text inside three or more backticks (```) or tildes (~~~).
<html>
<head>
</head>
</html>
def foo():
    a = 1
    for i in [1,2,3]:
        a += i
Horizontal Rules
To create a horizontal rule, use three or more asterisks (***), dashes (---), or underscores (___) on a line by themselves.
***
---
_________________
Links
To create a link, enclose the link text in brackets (e.g., [Blue Archive]) and then follow it immediately with the URL in parentheses (e.g., (https://bluearchive.nexon.com)).
My favorite mobile game is [Blue Archive](https://bluearchive.nexon.com).
The rendered output looks like this:
My favorite mobile game is Blue Archive.
Adding Titles
You can optionally add a title for a link. This will appear as a tooltip when the user hovers over the link. To add a title, enclose it in quotation marks after the URL.
My favorite mobile game is [Blue Archive](https://bluearchive.nexon.com "All senseis are welcome!").
The rendered output looks like this:
My favorite mobile game is Blue Archive.
URLs and Email Addresses
To quickly turn a URL or email address into a link, enclose it in angle brackets.
<https://www.youtube.com/>
<fake@example.com>
The rendered output looks like this:
https://www.youtube.com/
fake@example.com
Images
To add an image, add an exclamation mark (!), followed by alt text in brackets, and the path or URL to the image asset in parentheses. You can optionally add a title in quotation marks after the path or URL.
![Tropical Paradise](/assets/img/example.jpg "Maldives, in October")
The rendered output looks like this:
Linking Images
To add a link to an image, enclose the Markdown for the image in brackets, and then add the link in parentheses.
[![La Mancha](/assets/img/La-Mancha.jpg "La Mancha: Spain, Don Quixote")](https://www.britannica.com/place/La-Mancha)
The rendered output looks like this:
Escaping Characters
To display a literal character that would otherwise be used to format text in a Markdown document, add a backslash (\) in front of the character.
\* Without the backslash, this would be a bullet in an unordered list.
The rendered output looks like this:
* Without the backslash, this would be a bullet in an unordered list.
Characters You Can Escape
You can use a backslash to escape the following characters.
Character | Name
` | backtick
* | asterisk
_ | underscore
{} | curly braces
[] | brackets
<> | angle brackets
() | parentheses
# | pound sign
+ | plus sign
- | minus sign (hyphen)
. | dot
! | exclamation mark
| | pipe
HTML
Many Markdown applications allow you to use HTML tags in Markdown-formatted text. This is helpful if you prefer certain HTML tags to Markdown syntax. For example, some people find it easier to use HTML tags for images. Using HTML is also helpful when you need to change the attributes of an element, like specifying the color of text or changing the width of an image.
To use HTML, place the tags in the text of your Markdown-formatted file.
This **word** is bold. This <span style="font-style: italic;">word</span> is italic.
The rendered output looks like this:
This word is bold. This word is italic.
None
· 2023-09-05
-
Important AWS Services that you need to Know Now
Introduction
Amazon Web Services (AWS) is a cloud-based platform that provides a wide range of infrastructure, platform, and software services. It was launched in 2006 and has since become one of the most popular cloud computing platforms in the world, used by individuals, small businesses, and large enterprises alike.
AWS is known for its flexibility, scalability, and cost-effectiveness, allowing businesses to pay only for the services they use and scale up or down as needed. Its reliability and security features also make it a popular choice for businesses that need to store and process sensitive data.
AWS provides a wide range of services, including compute, storage, databases, analytics, machine learning, Internet of Things (IoT), security, and more. It also offers a variety of deployment models, including public, private, and hybrid clouds, as well as edge computing services that allow computing to be performed closer to the source of data.
Key services
Compute services, such as Amazon Elastic Compute Cloud (EC2) and AWS Lambda
Storage services, such as Amazon Simple Storage Service (S3) and Elastic Block Store (EBS)
Database services, such as Amazon Relational Database Service (RDS) and DynamoDB
Networking services, such as Amazon Virtual Private Cloud (VPC) and Elastic Load Balancing (ELB)
Management and monitoring services, such as AWS CloudFormation and AWS CloudWatch
Security and compliance services, such as AWS Identity and Access Management (IAM) and AWS Key Management Service (KMS)
Analytics and machine learning services, such as Amazon SageMaker and Amazon Redshift
Users can interact with these services using:
SSH (Secure Shell) and AWS CLI (Command Line Interface)
boto3 (a Python library used to interact with AWS resources)
Deep dive into services
EC2
Amazon Elastic Compute Cloud (EC2) is a web service that provides resizable compute capacity in the cloud. EC2 allows users to launch and manage virtual machines, called instances, in the AWS cloud. With EC2, users can select from a variety of instance types optimized for different use cases, and can scale up or down their compute resources as needed. EC2 also allows users to choose from a range of operating systems, including Amazon Linux, Ubuntu, Windows, and others. EC2 instances can be used to run a wide range of applications, from simple web servers to complex, multi-tier applications.
SSH (Secure Shell) is used to establish a secure, encrypted connection with the EC2 instance, allowing you to remotely log in and execute commands on the server. The basic syntax for the SSH command is
ssh [username]@[EC2 instance public DNS]
You will need to replace [username] with the username you created when setting up your EC2 instance, and [EC2 instance public DNS] with the public DNS address for your instance. Once you have established an SSH connection, you can execute commands on the EC2 instance just as you would on a local machine.
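Besides SSH, EC2 instances can also be managed programmatically. Below is a minimal boto3 sketch (one of several possible approaches) that launches, lists, and stops an instance; the AMI ID, key pair name, and region are placeholder values you would replace with your own.
import boto3
# create an EC2 client in your chosen region (placeholder region)
ec2 = boto3.client('ec2', region_name='us-east-1')
# launch a single t2.micro instance (the AMI ID and key pair name are placeholders)
response = ec2.run_instances(
    ImageId='ami-0123456789abcdef0',
    InstanceType='t2.micro',
    KeyName='my-key-pair',
    MinCount=1,
    MaxCount=1
)
instance_id = response['Instances'][0]['InstanceId']
print(f'Launched instance {instance_id}')
# list the instances in the account and their states
for reservation in ec2.describe_instances()['Reservations']:
    for instance in reservation['Instances']:
        print(instance['InstanceId'], instance['State']['Name'])
# stop the instance when you are done with it
ec2.stop_instances(InstanceIds=[instance_id])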
Lambda
AWS Lambda is a serverless compute service provided by Amazon Web Services. With Lambda, you can run your code in response to various events such as changes to data in an Amazon S3 bucket or an update to a DynamoDB table. You upload your code to Lambda, and it takes care of everything required to run and scale your code with high availability. AWS Lambda is an event-driven service, meaning it only runs when an event triggers it.
The AWS Lambda console or the AWS Command Line Interface (CLI) can be used to create, configure, and deploy your Lambda functions. The following commands allow you to perform common tasks such as creating, updating, invoking, and deleting your Lambda functions using the AWS CLI.
aws lambda create-function: Creates a new Lambda function.
aws lambda update-function-code: Uploads new code to an existing Lambda function.
aws lambda invoke: Invokes a Lambda function.
aws lambda delete-function: Deletes a Lambda function.
Boto3 can also be used to create a new Lambda function. The following code uses the boto3.client() method to create a client object for interacting with AWS Lambda. It then reads the code for the Lambda function, which is stored in a ZIP file. Finally, it uses the lambda_client.create_function() method to create the Lambda function, specifying the function name, runtime, IAM role, handler, and code. The response from this method call contains information about the newly created function.
import boto3
# create a client object to interact with AWS Lambda
lambda_client = boto3.client('lambda')
# define the Lambda function's code
code = {'ZipFile': open('lambda_function.zip', 'rb').read()}
# create the Lambda function
response = lambda_client.create_function(
    FunctionName='my-function',
    Runtime='python3.8',
    Role='arn:aws:iam::123456789012:role/lambda-role',
    Handler='lambda_function.handler',
    Code=code
)
print(response)
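Once the function exists, it can also be invoked from Python. The sketch below assumes the 'my-function' name used above and an arbitrary JSON payload.
import json
import boto3
lambda_client = boto3.client('lambda')
# invoke the function synchronously with a small JSON payload
response = lambda_client.invoke(
    FunctionName='my-function',
    InvocationType='RequestResponse',
    Payload=json.dumps({'key': 'value'})
)
# read and print the function's response payload
print(response['Payload'].read().decode('utf-8'))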
S3 (Simple Storage Service)
Amazon S3 (Simple Storage Service) is a cloud storage service offered by Amazon Web Services (AWS). It provides a simple web interface to store and retrieve data from anywhere on the internet. S3 is designed to be highly scalable, durable, and secure, making it a popular choice for data storage, backup and archival, content distribution, and many other use cases. S3 allows you to store and retrieve any amount of data at any time, from anywhere on the web. It also offers different storage classes to help optimize costs based on access frequency and retrieval times.
The following AWS CLI commands can be used (for example, over an SSH session on an EC2 instance) to upload, download, and delete a file in S3:
aws s3 cp /path/to/local/file s3://bucket-name/key-name #code to upload
aws s3 cp s3://bucket-name/key-name /path/to/local/file #code to download
aws s3 rm s3://bucket-name/key-name #code to delete a file
The following boto3 code can be used to read, write, and delete a file in S3. Note that you need to replace your-region, your-access-key, your-secret-key, your-bucket-name, your-file-name, and new-file-name with your own values. Also, make sure that you have the necessary permissions to read, write, and delete objects in your S3 bucket.
import boto3
# set the S3 region and access keys
s3 = boto3.resource('s3', region_name='your-region', aws_access_key_id='your-access-key', aws_secret_access_key='your-secret-key')
# read a file from S3
bucket_name = 'your-bucket-name'
file_name = 'your-file-name'
obj = s3.Object(bucket_name, file_name)
file_content = obj.get()['Body'].read().decode('utf-8')
print(file_content)
# write a file to S3
new_file_name = 'new-file-name'
new_file_content = 'This is the content of the new file.'
obj = s3.Object(bucket_name, new_file_name)
obj.put(Body=new_file_content.encode('utf-8'))
# delete a file from S3
obj = s3.Object(bucket_name, file_name)
obj.delete()
Here’s an example of how to use a KMS key to encrypt and decrypt data in S3 using boto3:
import json
import boto3
# Create a KMS key
kms = boto3.client('kms')
response = kms.create_key()
kms_key_id = response['KeyMetadata']['KeyId']
# Choose an S3 bucket (the bucket must already exist)
bucket_name = 'my-s3-bucket'
# Grant permissions to the KMS key
key_policy = {
    "Version": "2012-10-17",
    "Id": "key-policy",
    "Statement": [
        {
            "Sid": "Enable IAM User Permissions",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:root"},
            "Action": "kms:*",
            "Resource": "*"
        },
        {
            "Sid": "Allow use of the key",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:user/username"},
            "Action": [
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:ReEncrypt*",
                "kms:GenerateDataKey*",
                "kms:DescribeKey"
            ],
            "Resource": "*"
        }
    ]
}
kms.put_key_policy(KeyId=kms_key_id, Policy=json.dumps(key_policy))
# Enable default encryption on the S3 bucket
s3_client = boto3.client('s3')
s3_client.put_bucket_encryption(
    Bucket=bucket_name,
    ServerSideEncryptionConfiguration={
        'Rules': [
            {
                'ApplyServerSideEncryptionByDefault': {
                    'SSEAlgorithm': 'aws:kms',
                    'KMSMasterKeyID': kms_key_id
                }
            }
        ]
    }
)
# Upload an object to S3 with encryption
s3_client.put_object(
    Bucket=bucket_name,
    Key='example_object',
    Body=b'Hello, world!',
    ServerSideEncryption='aws:kms',
    SSEKMSKeyId=kms_key_id
)
# Download an object from S3 with decryption
response = s3_client.get_object(
    Bucket=bucket_name,
    Key='example_object'
)
body = response['Body'].read()
print(body.decode())
Here is example code that uses the boto3 library to connect to an S3 bucket, list the contents of a folder, and download the latest file based on its modified timestamp:
import boto3
from datetime import datetime
# set up S3 client
s3 = boto3.client('s3')
# specify bucket and folder
bucket_name = 'my-bucket'
folder_name = 'my-folder/'
# list all files in the folder
response = s3.list_objects(Bucket=bucket_name, Prefix=folder_name)
# sort the files by last modified time
files = response['Contents']
files = sorted(files, key=lambda k: k['LastModified'], reverse=True)
# download the latest file
latest_file = files[0]['Key']
s3.download_file(bucket_name, latest_file, 'local_filename')
EBS (Elastic Block Storage)
Amazon Elastic Block Store (EBS) is a block-level storage service that provides persistent block storage volumes for use with Amazon EC2 instances. EBS volumes are highly available and reliable storage volumes that can be attached to running instances, allowing you to store persistent data separate from the instance itself. EBS volumes are designed for mission-critical systems, so they are optimized for low-latency and consistent performance. EBS also supports point-in-time snapshots, which can be used for backup and disaster recovery.
Using SSH/AWS CLI to manage Elastic Block Store (EBS) volumes:
aws ec2 create-volume --availability-zone us-east-1a --size 10 --volume-type gp2
aws ec2 attach-volume --volume-id vol-0123456789abcdef --instance-id i-0123456789abcdef --device /dev/sdf
aws ec2 detach-volume --volume-id vol-0123456789abcdef
aws ec2 delete-volume --volume-id vol-0123456789abcdef
Using python/boto3 to manage Elastic Block Store (EBS) volumes:
import boto3
# create an EC2 client object
ec2 = boto3.client('ec2')
# create an EBS volume
response = ec2.create_volume(
    AvailabilityZone='us-west-2a',
    Encrypted=False,
    Size=100,
    VolumeType='gp2'
)
print(response)
# attach a volume
response = ec2.attach_volume(
    Device='/dev/sdf',
    InstanceId='i-0123456789abcdef0',
    VolumeId='vol-0123456789abcdef0'
)
print(response)
# detach a volume
response = ec2.detach_volume(
    VolumeId='vol-0123456789abcdef0'
)
print(response)
# delete a volume
response = ec2.delete_volume(
    VolumeId='vol-0123456789abcdef0'
)
print(response)
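Since EBS supports point-in-time snapshots for backup and disaster recovery, here is a small boto3 sketch that creates a snapshot and restores it into a new volume; the volume ID and Availability Zone are placeholders.
import boto3
ec2 = boto3.client('ec2')
# create a point-in-time snapshot of an existing volume (placeholder volume ID)
snapshot = ec2.create_snapshot(
    VolumeId='vol-0123456789abcdef0',
    Description='Nightly backup'
)
print(snapshot['SnapshotId'])
# later, restore the snapshot into a new volume
restored = ec2.create_volume(
    AvailabilityZone='us-west-2a',
    SnapshotId=snapshot['SnapshotId'],
    VolumeType='gp2'
)
print(restored['VolumeId'])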
RDS (Relational Database Service)
Amazon Relational Database Service (Amazon RDS) is a managed database service offered by Amazon Web Services (AWS) that simplifies the process of setting up, operating, and scaling a relational database in the cloud. It provides cost-efficient, resizable capacity for an industry-standard relational database and manages common database administration tasks, freeing up developers to focus on applications and customers. With Amazon RDS, you can choose from several different database engines, including Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server.
To use AWS RDS with PostgreSQL, follow these steps:
Log in to your AWS console and navigate to the RDS dashboard.
Click the “Create database” button.
Choose “Standard Create” and select “PostgreSQL” as the engine.
Choose the appropriate version of PostgreSQL that you want to use.
Set up the rest of the database settings, such as the instance size, storage, and security group settings.
Click the “Create database” button to create your RDS instance.
Once the instance is created, you can connect to it using a PostgreSQL client, such as pgAdmin or the psql command-line tool.
To connect to the RDS instance using pgAdmin, follow these steps:
Open pgAdmin and right-click on “Servers” in the Object Browser.
Click “Create Server”.
Enter a name for the server and switch to the “Connection” tab.
Enter the following information:
– Host: This is the endpoint for your RDS instance, which you can find in the RDS dashboard.
– Port: This is the port number for your PostgreSQL instance, which is usually 5432.
– Maintenance database: This is the name of the default database that you want to connect to.
– Username: This is the username that you specified when you created the RDS instance.
– Password: This is the password that you specified when you created the RDS instance.
Click “Save” to create the server.
You can now connect to the RDS instance by double-clicking on the server name in the Object Browser.
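If you prefer connecting from Python instead of pgAdmin, a psycopg2 sketch like the one below works the same way; the endpoint, database name, and credentials are placeholders taken from your own RDS instance.
import psycopg2
# connect using the endpoint shown in the RDS dashboard (all values are placeholders)
conn = psycopg2.connect(
    host='my-instance.xxxxxxxx.us-east-1.rds.amazonaws.com',
    port=5432,
    dbname='mydatabase',
    user='myusername',
    password='mypassword'
)
cur = conn.cursor()
cur.execute('SELECT version()')
print(cur.fetchone())
cur.close()
conn.close()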
DynamoDB
Amazon DynamoDB is a fully-managed NoSQL database service that provides fast and predictable performance with seamless scalability. DynamoDB lets users offload the administrative burdens of operating and scaling a distributed database so that they don’t have to worry about hardware provisioning, setup, and configuration, replication, software patching, or cluster scaling. DynamoDB is known for its high performance, ease of use, and flexibility. It supports document and key-value store models, making it suitable for a wide range of use cases, including mobile and web applications, gaming, ad tech, IoT, and more.
Using SSH/AWS CLI to access DynamoDB:
aws dynamodb create-table --table-name <table-name> --attribute-definitions AttributeName=<attribute-name>,AttributeType=S --key-schema AttributeName=<attribute-name>,KeyType=HASH --provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=1
aws dynamodb put-item --table-name <table-name> --item '{"<attribute-name>": {"S": "<attribute-value>"}}'
aws dynamodb get-item --table-name <table-name> --key '{"<attribute-name>": {"S": "<attribute-value>"}}'
aws dynamodb delete-item --table-name <table-name> --key '{"<attribute-name>": {"S": "<attribute-value>"}}'
aws dynamodb delete-table --table-name <table-name>
Using boto3 to work with DynamoDB:
import boto3
# create a DynamoDB resource
dynamodb = boto3.resource('dynamodb')
# create a table
table = dynamodb.create_table(
    TableName='my_table',
    KeySchema=[
        {
            'AttributeName': 'id',
            'KeyType': 'HASH'
        }
    ],
    AttributeDefinitions=[
        {
            'AttributeName': 'id',
            'AttributeType': 'S'
        }
    ],
    ProvisionedThroughput={
        'ReadCapacityUnits': 5,
        'WriteCapacityUnits': 5
    }
)
# wait until the table has been created before using it
table.wait_until_exists()
# put an item into the table
table.put_item(
    Item={
        'id': '1',
        'name': 'Alice',
        'age': 30
    }
)
# get an item from the table
response = table.get_item(
    Key={
        'id': '1'
    }
)
# print the item
item = response['Item']
print(item)
# delete the table
table.delete()
Amazon Virtual Private Cloud (VPC)
It is a service offered by Amazon Web Services (AWS) that enables users to launch Amazon Web Services resources into a virtual network that they define. With Amazon VPC, users can define a virtual network topology including subnets and routing tables, and control network security using firewall rules and access control lists.
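As a rough illustration (not a production network design), the boto3 sketch below creates a VPC with a single subnet; the CIDR blocks are arbitrary placeholders.
import boto3
ec2 = boto3.client('ec2')
# create a VPC with a /16 CIDR block
vpc = ec2.create_vpc(CidrBlock='10.0.0.0/16')
vpc_id = vpc['Vpc']['VpcId']
# add one subnet inside the VPC
subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock='10.0.1.0/24')
print(vpc_id, subnet['Subnet']['SubnetId'])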
AWS Elastic Load Balancer (ELB)
It is a managed load balancing service provided by Amazon Web Services. It automatically distributes incoming application traffic across multiple targets, such as Amazon EC2 instances, containers, and IP addresses, in one or more Availability Zones. ELB allows you to easily scale your application by increasing or decreasing the number of resources, such as EC2 instances, behind the load balancer, and it provides high availability and fault tolerance for your applications. There are three types of load balancers available in AWS: Application Load Balancer, Network Load Balancer, and Classic Load Balancer.
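For example, an Application Load Balancer can be created through the elbv2 API in boto3; in this sketch the subnet and security group IDs are placeholders and the listener/target group setup is omitted.
import boto3
elbv2 = boto3.client('elbv2')
# create an internet-facing Application Load Balancer across two subnets (placeholder IDs)
response = elbv2.create_load_balancer(
    Name='my-load-balancer',
    Subnets=['subnet-0123456789abcdef0', 'subnet-0fedcba9876543210'],
    SecurityGroups=['sg-0123456789abcdef0'],
    Scheme='internet-facing',
    Type='application'
)
print(response['LoadBalancers'][0]['DNSName'])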
AWS CloudFormation
It is a service provided by Amazon Web Services that helps users model and set up their AWS resources. It allows users to create templates of AWS resources in a declarative way, which can then be versioned and managed like any other code. AWS CloudFormation automates the deployment and updates of the AWS resources specified in the templates. This service makes it easier to manage and maintain infrastructure as code and provides a simple way to achieve consistency and repeatability across environments. Users can define the infrastructure for their applications and services, and then AWS CloudFormation takes care of provisioning, updating, and deleting resources, based on the templates defined by the user.
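A minimal sketch of the idea: the boto3 call below creates a stack from an inline template that declares a single S3 bucket; the stack name is a placeholder.
import json
import boto3
cfn = boto3.client('cloudformation')
# a tiny template that declares one S3 bucket
template = {
    'AWSTemplateFormatVersion': '2010-09-09',
    'Resources': {
        'MyBucket': {
            'Type': 'AWS::S3::Bucket'
        }
    }
}
# create the stack from the inline template
response = cfn.create_stack(
    StackName='my-sample-stack',
    TemplateBody=json.dumps(template)
)
print(response['StackId'])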
AWS CloudWatch
It is a monitoring service provided by Amazon Web Services (AWS) that collects, processes, and stores log files and metrics from AWS resources and custom applications. With CloudWatch, users can collect and track metrics, collect and monitor log files, and set alarms. CloudWatch is designed to help users identify and troubleshoot issues, and it can be used to monitor AWS resources such as EC2 instances, RDS instances, and load balancers, as well as custom metrics and logs from any other application running on AWS. CloudWatch can also be used to gain insights into the performance and health of applications and infrastructure, and it integrates with other AWS services such as AWS Lambda, AWS Elastic Beanstalk, and AWS EC2 Container Service.
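As a small illustration, the boto3 sketch below publishes a custom metric and sets an alarm on it; the namespace, metric name, and threshold are arbitrary placeholders.
import boto3
cloudwatch = boto3.client('cloudwatch')
# publish a custom metric data point
cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[{'MetricName': 'PageLoadTime', 'Value': 1.2, 'Unit': 'Seconds'}]
)
# create an alarm that fires when the 5-minute average exceeds 3 seconds
cloudwatch.put_metric_alarm(
    AlarmName='SlowPageLoads',
    Namespace='MyApp',
    MetricName='PageLoadTime',
    Statistic='Average',
    Period=300,
    EvaluationPeriods=1,
    Threshold=3.0,
    ComparisonOperator='GreaterThanThreshold'
)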
AWS Key Management Service (KMS)
It is a managed service that makes it easy for you to create and control the encryption keys used to encrypt your data. KMS is integrated with many other AWS services, such as Amazon S3, Amazon EBS, and Amazon RDS, allowing you to easily protect your data with encryption. With KMS, you can create, manage, and revoke encryption keys, and you can audit key usage to ensure compliance with security best practices. KMS also supports hardware security modules (HSMs) for added security.
import boto3
# create a KMS client
client = boto3.client('kms')
# Encrypt data using a KMS key
response = client.encrypt(
    KeyId='alias/my-key',
    Plaintext=b'My secret data'
)
# Decrypt data using a KMS key
response = client.decrypt(
    CiphertextBlob=response['CiphertextBlob']
)
# Manage encryption keys
# Create a new KMS key
response = client.create_key(
    Description='My encryption key',
    KeyUsage='ENCRYPT_DECRYPT',
    Origin='AWS_KMS'
)
# List all KMS keys in your account
response = client.list_keys()
# Handle errors gracefully when using KMS
import botocore.exceptions
try:
    response = client.encrypt(
        KeyId='alias/my-key',
        Plaintext=b'My secret data'
    )
except botocore.exceptions.ClientError as error:
    print(f'An error occurred: {error}')
Amazon SageMaker
It is a fully-managed service that enables developers and data scientists to easily build, train, and deploy machine learning models at scale. With SageMaker, you can quickly create an end-to-end machine learning workflow that includes data preparation, model training, and deployment. The service offers a variety of built-in algorithms and frameworks, as well as the ability to bring your own custom algorithms and models.
SageMaker provides a range of tools and features to help you manage your machine learning projects. You can use the built-in Jupyter notebooks to explore and visualize your data, and use SageMaker’s automatic model tuning capabilities to find the best hyperparameters for your model. The service also offers integration with other AWS services such as S3, IAM, and CloudWatch, making it easy to build and deploy machine learning models in the cloud.
With SageMaker, you only pay for what you use, with no upfront costs or long-term commitments. The service is designed to scale with your needs, so you can start small and grow your machine learning projects as your data and business needs evolve.
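As a small, hedged example of interacting with SageMaker from Python, the boto3 sketch below lists recent training jobs and describes the latest one; it assumes at least one training job already exists in your account.
import boto3
sagemaker = boto3.client('sagemaker')
# list the most recent training jobs
jobs = sagemaker.list_training_jobs(MaxResults=5, SortBy='CreationTime', SortOrder='Descending')
for job in jobs['TrainingJobSummaries']:
    print(job['TrainingJobName'], job['TrainingJobStatus'])
# describe the latest job in detail (assumes at least one exists)
if jobs['TrainingJobSummaries']:
    latest = jobs['TrainingJobSummaries'][0]['TrainingJobName']
    detail = sagemaker.describe_training_job(TrainingJobName=latest)
    print(detail['TrainingJobStatus'])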
Amazon Redshift
It is a cloud-based data warehousing service provided by Amazon Web Services (AWS). It is a fully managed, petabyte-scale data warehouse that enables businesses to analyze data using existing SQL-based business intelligence tools. Redshift is designed to handle large data sets and to scale up or down as needed, making it a flexible and cost-effective solution for data warehousing.
Using the Redshift Data API (redshift-data client) to query a Redshift cluster:
import json
import boto3
import pandas as pd
# Create a Redshift Data API client
client = boto3.client('redshift-data', region_name='us-west-2')
# Execute a SQL statement against the cluster (the call is asynchronous)
response = client.execute_statement(
    ClusterIdentifier='my_cluster',
    Database='my_database',
    DbUser='my_user',
    Sql='SELECT * FROM my_table'
)
# Fetch the results once the statement has finished
# (use client.describe_statement(Id=...) to poll for completion if needed)
statement_id = response['Id']
result = client.get_statement_result(Id=statement_id)
# Print the results
for row in result['Records']:
    print(row)
# parse the response JSON and create a Pandas DataFrame
# we serialize the records with json.dumps and create a Pandas DataFrame from the resulting string using pd.read_json
df = pd.read_json(json.dumps(result['Records']))
Using psycopg2 to connect to Redshift cluster:
pip install boto3 psycopg2
import boto3
client = boto3.client('redshift')
# create a new Redshift cluster using boto3:
response = client.create_cluster(
    ClusterIdentifier='my-redshift-cluster',
    NodeType='dc2.large',
    MasterUsername='myusername',
    MasterUserPassword='mypassword',
    ClusterSubnetGroupName='my-subnet-group',
    VpcSecurityGroupIds=['my-security-group'],
    ClusterParameterGroupName='default.redshift-1.0',
    NumberOfNodes=2,
    PubliclyAccessible=False,
    Encrypted=True,
    HsmClientCertificateIdentifier='my-hsm-certificate',
    HsmConfigurationIdentifier='my-hsm-config',
    Tags=[{'Key': 'Name', 'Value': 'My Redshift Cluster'}]
)
print(response)
# connect to the Redshift cluster and create a new database
import psycopg2
conn = psycopg2.connect(
    host='my-redshift-cluster.xxxxxxxx.us-west-2.redshift.amazonaws.com',
    port=5439,
    dbname='mydatabase',
    user='myusername',
    password='mypassword'
)
conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction block
cur = conn.cursor()
cur.execute('CREATE DATABASE mynewdatabase')
cur.close()
conn.close()
# Query the Redshift database
import psycopg2
conn = psycopg2.connect(
    host='my-redshift-cluster.xxxxxxxx.us-west-2.redshift.amazonaws.com',
    port=5439,
    dbname='mynewdatabase',
    user='myusername',
    password='mypassword'
)
cur = conn.cursor()
cur.execute('SELECT * FROM mytable')
for row in cur:
    print(row)
cur.close()
conn.close()
# Delete the Redshift cluster using boto3
import boto3
client = boto3.client('redshift')
response = client.delete_cluster(
    ClusterIdentifier='my-redshift-cluster',
    SkipFinalClusterSnapshot=True
)
print(response)
Comments welcome!
Data Science
· 2022-07-02
-
Implementing Self Organizing Maps using Python
What are Self Organizing Maps (SOMs)?
SOM stands for Self-Organizing Map, which is a type of artificial neural network that is used for unsupervised learning and dimensionality reduction. SOMs are inspired by the structure and function of the human brain, and they can be used to visualize and explore complex, high-dimensional data in a two-dimensional map or grid.
SOMs consist of an input layer, a layer of computational nodes, and an output layer. The input layer receives the data, and the computational nodes perform computations on the data. The output layer is the two-dimensional grid of nodes that represents the input data. During training, the nodes in the output layer are adjusted to represent the input data in a way that preserves the topological relationships between the input data points.
SOMs have a wide range of applications, including image processing, data visualization, data clustering, feature extraction, and anomaly detection. They are particularly useful for visualizing and exploring large, complex datasets, as they can reveal patterns and relationships that might not be apparent from the raw data.
Implementation
To implement Self-Organizing Maps (SOM) in Python, you can use the SOMPY library. SOMPY is a Python library for Self Organizing Map (SOM), and it provides an easy-to-use interface to implement SOM in Python.
Here are the steps to implement SOM using SOMPY library in Python:
Install SOMPY library: You can install SOMPY library using pip by running the following command in the terminal:
pip install sompy
Import the SOMPY library: To use the SOMPY library, you need to import it first. You can do this using the following code:
from sompy.sompy import SOMFactory
Load data: You need to load the data you want to cluster using SOM. You can load data from a file or create a numpy array.
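For example (assuming a CSV file that contains only numeric feature columns; the file name is a placeholder), you could load it with pandas:
import pandas as pd
# load the dataset; 'data.csv' is a placeholder file name
df = pd.read_csv('data.csv')
features = df.columns.tolist()
data = df.values  # SOMPY expects a 2-D numpy array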
Create a SOM object: You need to create a SOM object using SOMFactory class. You can set the parameters of SOM object such as the number of nodes, learning rate, and neighborhood function.
som = SOMFactory.build(data, mapsize=[20,20], normalization='var', initialization='pca', component_names=features)
Here, data is the input data you loaded in the previous step, mapsize is the number of nodes in the SOM, normalization is the normalization method, initialization is the initialization method, and component_names is the feature names of the input data.
Train the SOM: You can train the SOM object using the following code:
som.train(n_job=1, verbose=False)
Here, n_job is the number of processors to use, and verbose is the flag to print the training progress.
Plot the SOM: You can visualize the SOM using the following code:
from sompy.visualization.mapview import View2D
from sompy.visualization.bmuhits import BmuHitsView
# View the map
view2D = View2D(10,10,"rand data",text_size=10)
view2D.show(som, col_sz=4, which_dim="all", denormalize=True)
# View the hit map
hits = BmuHitsView(4,4,"Hits Map",text_size=12)
hits.show(som, anotate=True, onlyzeros=False, labelsize=12, cmap="Greys", logaritmic=False)
Here, View2D is used to view the map, and BmuHitsView is used to view the hit map. You can set the number of columns in the map and other parameters to adjust the size and style of the map.
That’s it! These are the basic steps to implement SOM using the SOMPY library in Python. You can customize the SOM object and visualization methods to fit your requirements.
Comments welcome!
Data Science
· 2022-06-04
-
Implementing Convolutional Neural Networks using Python
What are Convolutional Neural Networks (CNNs)?
Convolutional Neural Networks (CNNs) are a type of deep neural network that are commonly used in computer vision tasks such as image classification, object detection, and segmentation. They are able to automatically learn and extract features from images, allowing them to identify patterns and structures in complex visual data.
The key component of a CNN is the convolutional layer, which performs a series of convolutions between the input image and a set of learnable filters. Each filter is designed to detect a specific pattern or feature in the image, such as edges, corners, or textures. The result of the convolution is a feature map that captures the presence and location of the detected feature.
In addition to the convolutional layer, a typical CNN architecture also includes pooling layers, which reduce the spatial resolution of the feature maps while retaining their most important information, and fully connected layers, which combine the extracted features into a final output.
One of the major advantages of CNNs is their ability to learn hierarchical representations of images, where lower-level features such as edges and corners are combined to form higher-level features such as shapes and objects. This makes them highly effective for image classification and object detection tasks, where they can achieve state-of-the-art performance on benchmark datasets.
Implementation
CNNs can be implemented in various deep learning frameworks such as TensorFlow, PyTorch, and Keras. These frameworks provide pre-built layers and functions for building and training CNN models, making it relatively easy to implement even for those with limited programming experience.
Using Tensorflow library
Here’s an example of how to implement a basic convolutional neural network (CNN) using TensorFlow in Python:
import tensorflow as tf
# Define the model architecture
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation='softmax')
])
# Compile the model with an optimizer, loss function, and metrics
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Load the training and test data
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()
# Preprocess the data
train_images = train_images.reshape(train_images.shape[0], 28, 28, 1)
train_images = train_images.astype('float32') / 255
train_labels = tf.keras.utils.to_categorical(train_labels, num_classes=10)
test_images = test_images.reshape(test_images.shape[0], 28, 28, 1)
test_images = test_images.astype('float32') / 255
test_labels = tf.keras.utils.to_categorical(test_labels, num_classes=10)
# Train the model
model.fit(train_images, train_labels, batch_size=128, epochs=10, validation_data=(test_images, test_labels))
In this example, we define a simple CNN architecture with one convolutional layer, one max pooling layer, one flattening layer, and one fully connected (dense) layer. We use the MNIST dataset for training and testing the model. We compile the model with the Adam optimizer, categorical cross-entropy loss function, and accuracy metric. Finally, we train the model for 10 epochs and evaluate its performance on the test data.
Using keras library
Here is an example of how to implement a convolutional neural network (CNN) in Keras:
# First, you need to import the required libraries:
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
Next, you can define your CNN model using the Sequential API. Here’s an example model:
model = Sequential()
# Add a convolutional layer with 32 filters, a 3x3 kernel size, and ReLU activation
model.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
# Add a max pooling layer with a 2x2 pool size
model.add(MaxPooling2D(pool_size=(2, 2)))
# Add another convolutional layer with 64 filters and a 3x3 kernel size
model.add(Conv2D(64, (3, 3), activation='relu'))
# Add another max pooling layer
model.add(MaxPooling2D(pool_size=(2, 2)))
# Flatten the output from the previous layer
model.add(Flatten())
# Add a fully connected layer with 128 neurons and ReLU activation
model.add(Dense(128, activation='relu'))
# Add an output layer with 10 neurons (for a 10-class classification problem) and softmax activation
model.add(Dense(10, activation='softmax'))
This CNN model has two convolutional layers with 32 and 64 filters, respectively, each followed by a max pooling layer with a 2x2 pool size. The output from the last max pooling layer is flattened and fed into a fully connected layer with 128 neurons, which is then connected to an output layer with 10 neurons and softmax activation for a 10-class classification problem.
Finally, you can compile and train the model using the compile() and fit() methods, respectively. Here’s an example of compiling and training the model on the MNIST dataset:
# Compile the model with categorical crossentropy loss and Adam optimizer
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# Train the model on the MNIST dataset
model.fit(X_train, y_train, batch_size=128, epochs=10, validation_data=(X_test, y_test))
In this example, X_train and y_train are the training data and labels, respectively, and X_test and y_test are the validation data and labels, respectively. The model is compiled with categorical crossentropy loss and Adam optimizer, and trained for 10 epochs with a batch size of 128. The model’s training and validation accuracy are recorded and printed after each epoch.
Using PyTorch library
To implement a Convolutional Neural Network (CNN) in PyTorch, you can follow these steps:
# Import the necessary PyTorch libraries:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
Define the CNN architecture by creating a class that inherits from the nn.Module class:
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3)
        self.pool = nn.MaxPool2d(2, 2)
        self.dropout = nn.Dropout(p=0.5)
        self.fc1 = nn.Linear(64 * 6 * 6, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.dropout(x)
        x = self.pool(F.relu(self.conv2(x)))
        x = self.dropout(x)
        x = x.view(-1, 64 * 6 * 6)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x
Here, we have defined a CNN architecture with two convolutional layers, two max pooling layers, two dropout layers, and two fully connected layers.
# Create the model, then define the loss function and the optimizer:
model = CNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
# Train the model (assumes num_epochs and a train_loader DataLoader are already defined):
for epoch in range(num_epochs):
    for i, data in enumerate(train_loader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
# Evaluate the model (assumes a test_loader DataLoader is already defined):
correct = 0
total = 0
with torch.no_grad():
    for data in test_loader:
        images, labels = data
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print('Accuracy of the network on the 10000 test images: %d %%' % (100 * correct / total))
This is a basic example of how to implement a CNN using PyTorch. Of course, there are many ways to customize the architecture, loss function, optimizer, and training procedure based on your specific needs.
In summary, CNNs are a powerful and widely used tool in computer vision and have led to significant advancements in areas such as image recognition, object detection, and segmentation. With the availability of deep learning frameworks, it has become easier than ever to implement and experiment with CNN models for a wide range of applications.
Comments welcome!
Data Science
· 2022-05-07
-
Implementing Recurrent Neural Networks using Python
What are Recurrent Neural Networks (RNNs)?
Recurrent Neural Networks, or RNNs, are a type of artificial neural network designed to process sequential data, such as time-series or natural language. While traditional neural networks process input data independently of one another, RNNs allow for the input of past data to influence current output. This is done by introducing a loop within the neural network, allowing previous output to be fed back into the input layer.
The ability to process sequential data makes RNNs particularly useful for a variety of tasks. For example, in natural language processing, RNNs can be used to generate text or to predict the next word in a sentence. In speech recognition, RNNs can be used to transcribe audio to text. In financial modeling, RNNs can be used to predict stock prices based on historical data.
The core of an RNN is its hidden state, which is a vector that is updated at each time step. The state vector summarizes information from previous inputs, and is used to predict the output at the current time step. The state vector is updated using a set of weights that are learned during training.
One common issue with RNNs is that the hidden state can become “saturated” and lose information from previous time steps. To address this, several variations of RNNs have been developed, including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs), which can better maintain the memory of the network over longer periods of time.
Implementation
Implementing an RNN in Python can be done using several popular deep learning frameworks, such as TensorFlow, Keras, and PyTorch. These frameworks provide high-level APIs that make it easier to build and train complex neural networks. With the popularity of RNNs increasing, they have become a powerful tool for a variety of applications across many different fields.
Using TensorFlow library
Here is an example of how to implement a simple RNN using TensorFlow:
import tensorflow as tf
import numpy as np
# Define the RNN model
num_inputs = 1
num_neurons = 100
num_outputs = 1
learning_rate = 0.001
X = tf.placeholder(tf.float32, [None, None, num_inputs])
y = tf.placeholder(tf.float32, [None, None, num_outputs])
cell = tf.contrib.rnn.OutputProjectionWrapper(
    tf.contrib.rnn.BasicRNNCell(num_units=num_neurons, activation=tf.nn.relu),
    output_size=num_outputs)
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)
# Define the loss function and optimizer
loss = tf.reduce_mean(tf.square(outputs - y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
train = optimizer.minimize(loss)
# Generate some sample data
t_min, t_max = 0, 30
resolution = 0.1
t = np.linspace(t_min, t_max, int((t_max - t_min) / resolution))
x = np.sin(t)
# Train the model
n_iterations = 500
batch_size = 50
init = tf.global_variables_initializer()
with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        X_batch = x.reshape(-1, batch_size, num_inputs)
        y_batch = x.reshape(-1, batch_size, num_outputs)
        sess.run(train, feed_dict={X: X_batch, y: y_batch})
    # Make some predictions
    X_new = x.reshape(-1, 1, num_inputs)
    y_pred = sess.run(outputs, feed_dict={X: X_new})
This is a simple RNN that is trained on a sine wave and learns to predict the next value in the sequence. You can modify the code to work with your own data and adjust the parameters to improve the accuracy of the model.
Using keras library
Here’s an example code for implementing RNN using Keras in Python:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN
# define the data
X = np.array([[[1], [2], [3], [4], [5]], [[6], [7], [8], [9], [10]]])
y = np.array([[[6], [7], [8], [9], [10]], [[11], [12], [13], [14], [15]]])
# define the model
model = Sequential()
model.add(SimpleRNN(1, input_shape=(5, 1), return_sequences=True))
# compile the model
model.compile(optimizer='adam', loss='mse')
# fit the model
model.fit(X, y, epochs=1000, verbose=0)
# make predictions
predictions = model.predict(X)
print(predictions)
In this example, we define a simple RNN model using Keras to predict the next value in a sequence. We input two sequences, each of length 5, and output two sequences, each of length 5. We define the model using the Sequential class and add a single SimpleRNN layer with a single neuron. We compile the model using the adam optimizer and mean squared error loss function. We then fit the model on the input and output sequences, running for 1000 epochs. Finally, we use the model to make predictions on the input sequences, printing the predictions.
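Because plain RNNs can struggle to retain information over long sequences, Keras also provides LSTM and GRU layers. As a sketch, the same sequence-to-sequence model can be built by swapping in an LSTM layer with the shapes used above:
from keras.models import Sequential
from keras.layers import LSTM
# same setup as the SimpleRNN example, but with an LSTM layer
model = Sequential()
model.add(LSTM(1, input_shape=(5, 1), return_sequences=True))
model.compile(optimizer='adam', loss='mse')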
Using PyTorch library
Here is an example of implementing a Recurrent Neural Network (RNN) in Python using PyTorch:
import torch
import torch.nn as nn
# Define the RNN model
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)
# Set the hyperparameters
input_size = 5
hidden_size = 10
output_size = 2
# Create the RNN model
rnn = RNN(input_size, hidden_size, output_size)
# Define the input and the initial hidden state
input = torch.randn(1, input_size)
hidden = torch.zeros(1, hidden_size)
# Run the RNN model
output, next_hidden = rnn(input, hidden)
This code defines an RNN model using PyTorch’s nn.Module class, which includes an input layer, a hidden layer, and an output layer. The forward method defines how the input is processed through the network, and the initHidden method initializes the hidden state.
To run the RNN model, we first set the hyperparameters such as input_size, hidden_size, and output_size. Then we create an instance of the RNN model and pass in an input tensor and an initial hidden state to the forward method. The output of the RNN model is the output tensor and the next hidden state.
Note that this is just a simple example, and there are many variations of RNNs that can be implemented in PyTorch depending on the specific use case.
Comments welcome!
Data Science
· 2022-04-02
-
Implementing Artificial Neural Networks using Python
What are Artificial Neural Networks (ANNs)?
Artificial Neural Networks (ANNs) are a type of machine learning model that are designed to simulate the function of a biological neural network. ANNs are composed of interconnected nodes or artificial neurons that process and transmit information to one another. The structure of an ANN consists of an input layer, one or more hidden layers, and an output layer.
The input layer is where data is introduced to the network, while the output layer produces the network’s prediction or classification. Hidden layers contain a variable number of artificial neurons, which allow the network to model non-linear relationships in the data. The connections between the neurons in the hidden layers have weights that can be adjusted through training to optimize the performance of the network.
ANNs can be used for a variety of machine learning tasks, including regression, classification, and clustering. For regression, ANNs can be trained to model the relationship between input variables and output variables. In classification, ANNs can be trained to classify input data into different categories. In clustering, ANNs can be used to group similar data points together.
The training process of an ANN involves adjusting the weights of the connections between the neurons to minimize the difference between the predicted output and the actual output. This process involves passing data through the network multiple times and updating the weights based on the difference between the predicted output and the actual output. The goal is to find a set of weights that minimize the error and optimize the performance of the network.
Implementation
Python has several libraries that can be used to implement ANNs, including scikit-learn, TensorFlow, Keras, and PyTorch. These libraries provide high-level abstractions that make it easier to build and train ANNs. In addition, they provide a wide range of pre-built layers and functions that can be used to customize the architecture of the network.
Using scikit-learn library
Here’s an example of how to create a simple ANN using the scikit-learn library:
# Import the necessary libraries
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Generate a random dataset for classification
X, y = make_classification(n_features=4, random_state=0)
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
# Create an ANN classifier with one hidden layer
clf = MLPClassifier(hidden_layer_sizes=(5,), max_iter=1000, random_state=0)
# Train the classifier on the training set
clf.fit(X_train, y_train)
# Evaluate the classifier on the testing set
score = clf.score(X_test, y_test)
print("Accuracy: {:.2f}%".format(score*100))
In this example, we first import the necessary libraries, generate a random dataset for classification, and split the data into training and testing sets. We then create an ANN classifier with one hidden layer and train it on the training set. Finally, we evaluate the classifier on the testing set and print the accuracy.
This is just a basic example, and there are many ways to customize and optimize your ANN, depending on your specific use case.
Using Tensorflow library
Here’s an example of how to implement an artificial neural network using TensorFlow without Keras:
import tensorflow as tf
import numpy as np
# Define the input data and expected outputs
input_data = np.array([[0,0], [0,1], [1,0], [1,1]], dtype=np.float32)
expected_output = np.array([[0], [1], [1], [0]], dtype=np.float32)
# Define the network architecture
num_input = 2
num_hidden = 2
num_output = 1
learning_rate = 0.1
# Define the weights and biases for the network
weights = {
    'hidden': tf.Variable(tf.random.normal([num_input, num_hidden])),
    'output': tf.Variable(tf.random.normal([num_hidden, num_output]))
}
biases = {
    'hidden': tf.Variable(tf.random.normal([num_hidden])),
    'output': tf.Variable(tf.random.normal([num_output]))
}
# Define the forward propagation step
def neural_network(input_data):
    hidden_layer = tf.add(tf.matmul(input_data, weights['hidden']), biases['hidden'])
    hidden_layer = tf.nn.sigmoid(hidden_layer)
    output_layer = tf.add(tf.matmul(hidden_layer, weights['output']), biases['output'])
    output_layer = tf.nn.sigmoid(output_layer)
    return output_layer
# Define the loss function and optimizer
loss_func = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)
# Define the training loop
num_epochs = 10000
for epoch in range(num_epochs):
    with tf.GradientTape() as tape:
        # Forward propagation
        output = neural_network(input_data)
        loss = loss_func(expected_output, output)
    # Backward propagation and update the weights and biases
    gradients = tape.gradient(loss, [weights['hidden'], weights['output'], biases['hidden'], biases['output']])
    optimizer.apply_gradients(zip(gradients, [weights['hidden'], weights['output'], biases['hidden'], biases['output']]))
    if epoch % 1000 == 0:
        print(f"Epoch {epoch} Loss: {loss.numpy():.4f}")
# Test the network
test_data = np.array([[0,0], [0,1], [1,0], [1,1]], dtype=np.float32)
predictions = neural_network(test_data)
print(predictions)
In this example, we define the architecture of the neural network by specifying the number of input, hidden, and output nodes. We also define the learning rate and the weight and bias variables. The forward propagation step is defined by using the tf.add() and tf.matmul() functions to compute the weighted sum and then applying the sigmoid activation function. The loss function and optimizer are defined using the tf.keras.losses and tf.keras.optimizers modules, respectively. Finally, we train the network by performing forward and backward propagation steps in a loop, and then we test the network using test data.
Using keras library
Keras is a high-level neural network API that can run on top of TensorFlow. It provides a simplified interface for building and training deep learning models. Here is an example of how to implement an Artificial Neural Network (ANN) in Python using Keras:
# Import the necessary libraries
from tensorflow import keras
from tensorflow.keras import layers
# Define the model architecture
model = keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=[X_train.shape[1]]),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
# Compile the model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
# Train the model
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    batch_size=32
)
# Evaluate the model
test_scores = model.evaluate(X_test, y_test, verbose=2)
print(f'Test loss: {test_scores[0]}')
print(f'Test accuracy: {test_scores[1]}')
This example creates a model with 2 hidden layers and 1 output layer. The first 2 hidden layers have 64 nodes each and use the ReLU activation function. The output layer has a single node and uses the sigmoid activation function. The model is trained using the Adam optimizer and binary cross-entropy loss. The accuracy metric is used to evaluate the model.
To use this code, you will need to replace X_train, y_train, X_val, y_val, X_test, and y_test with your own training, validation, and test data.
Using PyTorch library
To implement Artificial Neural Networks (ANN) using PyTorch, you can follow these general steps:
# Import the necessary libraries: PyTorch, NumPy, and Pandas.
import torch
import numpy as np
import pandas as pd
# Load the dataset: You can use Pandas to load the dataset.
data = pd.read_csv('dataset.csv')
# Split the dataset into training and testing sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data.iloc[:, :-1], data.iloc[:, -1], test_size=0.2, random_state=0)
# Convert the data into PyTorch tensors:
X_train = torch.from_numpy(np.array(X_train)).float()
y_train = torch.from_numpy(np.array(y_train)).float()
X_test = torch.from_numpy(np.array(X_test)).float()
y_test = torch.from_numpy(np.array(y_test)).float()
# Define the neural network architecture: You can define the neural network using the torch.nn module.
class ANN(torch.nn.Module):
    def __init__(self):
        super(ANN, self).__init__()
        self.fc1 = torch.nn.Linear(8, 16)
        self.fc2 = torch.nn.Linear(16, 8)
        self.fc3 = torch.nn.Linear(8, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.sigmoid(self.fc3(x))
        return x
model = ANN()
# In this example, we define an ANN with 3 fully connected layers, where the first two layers have a ReLU activation function and the last layer has a sigmoid activation function.
# Define the loss function and optimizer:
loss_fn = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
# Train the model:
for epoch in range(100):
    y_pred = model(X_train)
    loss = loss_fn(y_pred, y_train.unsqueeze(1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
# Test the model:
y_pred_test = model(X_test)
y_pred_test = (y_pred_test > 0.5).float()
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred_test)
# Save the model:
torch.save(model.state_dict(), 'model.pth')
This is a general template for implementing an ANN using PyTorch. You can customize it based on your specific requirements.
In conclusion, ANNs are a powerful machine learning model that can be used to model non-linear relationships in data. The structure of an ANN consists of an input layer, one or more hidden layers, and an output layer. Python has several libraries that can be used to implement ANNs, including TensorFlow, Keras, and PyTorch.
Comments welcome!
Data Science
· 2022-03-05
-
Overview of Deep Learning Activation Functions
What are Activation functions?
Activation functions are a key component of neural networks in deep learning. They are mathematical functions applied to the output of a neural network layer to determine whether or not a neuron should be activated (i.e., “fired”). This output is then passed to the next layer of the neural network for further processing. There are many different activation functions that can be used in deep learning, including sigmoid, ReLU, and tanh. The choice of activation function can have a significant impact on the performance of a neural network, so it is an important consideration when designing and training a deep learning model.
Sigmoid activation function
The sigmoid activation function is one of the most commonly used activation functions in deep learning. It is a mathematical function that maps any input value to a value between 0 and 1, and it takes its name from its characteristic S-shaped curve. The sigmoid function is often used in binary classification problems, where the output is either 0 or 1. It is also used as a base for other, more complex activation functions, such as the hyperbolic tangent and the softmax function.
The formula for the sigmoid activation function is:
f(x) = 1 / (1 + e^-x)
where x is the input to the function, and e is the mathematical constant approximately equal to 2.71828. The output of the sigmoid function ranges between 0 and 1. When x is a large negative number, the output of the function is close to 0; when x is a large positive number, the output is close to 1; and when x is 0, the output is exactly 0.5.
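As a minimal illustration (a NumPy sketch with arbitrary sample inputs), the following code evaluates the sigmoid at a few points and shows the behaviour described above:
import numpy as np
def sigmoid(x):
    # f(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))
for x in [-10, -1, 0, 1, 10]:
    # large negative inputs map close to 0, large positive inputs close to 1, and 0 maps to 0.5
    print(x, round(float(sigmoid(x)), 4))
# prints approximately: -10 0.0, -1 0.2689, 0 0.5, 1 0.7311, 10 1.0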
The sigmoid function is popular in neural networks because it is differentiable, meaning that it can be used in backpropagation to calculate the gradient of the loss function. This is important because deep learning algorithms use gradient descent to optimize the weights of the neural network. The sigmoid function is also a smooth function, which helps in the convergence of the optimization algorithm.
However, the sigmoid function has some limitations. One of the main limitations is that it is prone to the vanishing gradient problem. When the input to the sigmoid function is too large or too small, the gradient of the function approaches zero. This can make it difficult for the algorithm to learn from the data. Another limitation of the sigmoid function is that it is not zero-centered, which can make it difficult to optimize the weights of the neural network.
To overcome these limitations, other activation functions have been developed. One such function is the Rectified Linear Unit (ReLU) function, which is now the most widely used activation function in deep learning. The ReLU function does not saturate for positive inputs, so it largely avoids the vanishing gradient problem (although, like the sigmoid, its output is not zero-centered).
In conclusion, the sigmoid activation function is an important component of deep learning. It is useful in binary classification problems and can serve as a base for other more complex activation functions. However, it has some limitations, which have led to the development of other activation functions. When choosing an activation function, it is important to consider the specific requirements of the problem and the strengths and limitations of the different activation functions.
Rectified Linear Unit (ReLU) activation function
Rectified Linear Unit (ReLU) is a popular activation function used in deep learning, especially for image classification tasks. It is a piecewise linear function that maps any negative input value to zero, and it is defined as f(x) = max(0, x).
The ReLU activation function has become one of the most popular activation functions in deep learning due to its computational efficiency and the fact that it helps to mitigate the vanishing gradient problem that can arise in deep neural networks.
The ReLU function is simple, non-linear, and can be computed very efficiently, making it a good choice for large datasets with many inputs. Its tendency to produce sparse activations (exact zeros for negative inputs) can also act as a mild form of regularization in deep neural networks.
One of the biggest advantages of ReLU is that it is very computationally efficient compared to other activation functions. This is because the function is simple to compute and requires only a single mathematical operation.
The ReLU activation function also helps to mitigate the vanishing gradient problem, which can occur in deep neural networks. When the gradient of the activation function becomes very small, the weights in the network are not updated properly, which can lead to a decline in the performance of the network. ReLU helps to prevent this problem by keeping the gradients from becoming too small.
There are some potential issues with using the ReLU activation function, however. One of the main issues is that ReLU neurons can “die” during training, meaning that they become permanently inactive and stop contributing to the network’s output. This can happen when a large weight update pushes a neuron into a region where its pre-activation is negative for every input, so it always outputs zero and receives no further gradient. This can be addressed through careful initialization of the network’s weights and a suitably small learning rate.
Another issue with ReLU is that it is not centered around zero, which can make it difficult to optimize certain types of networks. This has led to the development of several variations of the ReLU function, including the leaky ReLU and the parametric ReLU, which are designed to address these issues.
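As a small, framework-agnostic sketch (in NumPy, with illustrative inputs), ReLU and the leaky ReLU variant mentioned above can be written as:
import numpy as np
def relu(x):
    # f(x) = max(0, x): negative inputs become 0, positive inputs pass through unchanged
    return np.maximum(0.0, x)
def leaky_relu(x, alpha=0.01):
    # leaky ReLU keeps a small slope (alpha) for negative inputs, which helps avoid "dead" neurons
    return np.where(x > 0, x, alpha * x)
x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))        # [0.  0.  0.  0.5  3. ]
print(leaky_relu(x))  # [-0.03  -0.005  0.  0.5  3. ]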
In conclusion, the ReLU activation function is a powerful and computationally efficient choice for deep learning tasks, especially for image classification. It is effective at mitigating the vanishing gradient problem, which can be a major challenge in deep neural networks. While there are some potential issues with using ReLU, these can be addressed through careful initialization of weights and the use of variations of the function. Overall, ReLU is an excellent choice for deep learning tasks, and it is likely to continue to be a popular activation function in the years to come.
Tanh activation function
The tanh activation function is a popular choice in deep learning, and is used in many different types of neural networks. Tanh stands for hyperbolic tangent, and is a type of activation function that transforms the input of a neuron into an output between -1 and 1. This makes it a useful choice for many different types of neural networks, including those used in image recognition, natural language processing, and more.
The tanh activation function is a smooth, continuous function that is shaped like a sigmoid curve. It is symmetric around the origin, with values ranging from -1 to 1. When the input to a neuron is close to zero, the output of the tanh function is also close to zero. As the input becomes more positive or negative, the output of the function increases or decreases, respectively, until it reaches its maximum value of 1 or -1.
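A short NumPy sketch (with illustrative inputs) shows the output range of tanh and its close relationship to the sigmoid function:
import numpy as np
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.tanh(x))                  # outputs lie between -1 and 1, with tanh(0) = 0
print(2.0 * sigmoid(2.0 * x) - 1)  # same values: tanh(x) = 2 * sigmoid(2x) - 1, i.e. a rescaled, zero-centered sigmoid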
One of the main benefits of the tanh activation function is that it is differentiable, which means it can be used in backpropagation algorithms to update the weights and biases of a neural network during training. This allows the network to learn from data and improve its performance over time.
Another benefit of the tanh activation function is that it is centered around zero, which can improve the convergence of the neural network during training. Because the outputs have roughly zero mean, the activations passed to subsequent layers stay better centered, which tends to produce better-conditioned gradient updates than a function whose outputs are always positive.
However, one drawback of the tanh activation function is that it saturates when the input to a neuron is large in magnitude, which causes the gradients to become very small and slows down the learning process. This is the vanishing gradient problem, and it can be mitigated using techniques such as careful weight initialization and normalizing the inputs to keep activations away from the saturated regions.
In conclusion, the tanh activation function is a useful tool for deep learning, thanks to its smooth, differentiable nature and zero-centered output. While it is prone to saturation and the resulting vanishing gradients, these issues can be mitigated with proper techniques and training procedures. As with any activation function, the choice of tanh should be made based on the specific requirements of the neural network and the nature of the data being processed.
Comments welcome!
Data Science
· 2022-02-05
-
Overview of Deep Learning Techniques
Deep learning is a subset of machine learning that involves training artificial neural networks to learn and perform complex tasks. While both deep learning and machine learning involve training models on data to make predictions or decisions, deep learning models typically have many layers and are capable of learning increasingly complex representations of data, whereas traditional machine learning models often require feature engineering to create effective representations of data. Additionally, deep learning models are often better suited for tasks such as image recognition, speech recognition, and natural language processing, which require high-dimensional input data and benefit from the ability to learn hierarchical representations of features.
Key applications of Deep Learning
Deep learning can be used to solve regression, classification, and clustering problems. For example, convolutional neural networks (CNNs) can be used for image classification tasks, recurrent neural networks (RNNs) can be used for sequence classification tasks, and autoencoders can be used for clustering tasks. Additionally, deep learning models can be used for regression tasks, such as predicting stock prices or housing prices, by training a neural network to predict a continuous value.
Further, Deep learning has many applications in the financial services industry. Here are some examples:
Fraud detection: Deep learning algorithms can be used to detect fraudulent activities such as credit card fraud, money laundering, and identity theft.
Stock price prediction: Deep learning algorithms can be used to analyze large amounts of financial data to predict stock prices and market trends.
Algorithmic trading: Deep learning algorithms can be used to analyze market data and execute trades automatically.
Customer service: Deep learning algorithms can be used to analyze customer data and provide personalized services such as financial advice and investment recommendations.
Risk assessment: Deep learning algorithms can be used to assess the creditworthiness of customers and predict the likelihood of loan defaults.
Cybersecurity: Deep learning algorithms can be used to identify and mitigate cybersecurity threats such as hacking and phishing attacks.
Overall, the use of deep learning in the financial services industry has the potential to increase efficiency, reduce costs, and improve customer satisfaction.
Popular Deep Learning algorithms
There are several popular deep learning algorithms, each designed to solve different types of problems. Some of the most commonly used deep learning algorithms are:
ANNs (Artificial Neural Networks) are a type of machine learning algorithm inspired by the structure and function of the human brain. ANNs are composed of nodes that are interconnected in layers. Each node receives input signals, processes them, and produces an output signal. ANNs are often used for tasks such as classification, regression, pattern recognition, and optimization.
RNNs (Recurrent Neural Networks) are commonly used for sequential data such as natural language processing and time-series data analysis. They use feedback loops to store information from previous inputs, making them well-suited for tasks that involve processing sequential data.
CNNs (Convolutional Neural Networks) are commonly used for image and video recognition tasks. They work by performing convolutions on input images and learning features that can be used to identify objects or patterns within the images.
Autoencoders are a type of neural network commonly used for unsupervised learning, particularly for dimensionality reduction, feature learning, anomaly detection, image compression, and noise reduction. They work by encoding input data into a lower-dimensional representation and then decoding it back to its original form. An autoencoder can also be trained so that similar input data points are mapped to nearby points in the low-dimensional latent space; the latent space can then be used to cluster the inputs based on their proximity. This approach is sometimes referred to as “autoencoder-based clustering” or “deep clustering”.
SOM (Self-Organizing Map) is a type of artificial neural network that can be used for unsupervised learning tasks, such as clustering, visualization, and dimensionality reduction.
Component of Deep Learning algorithms
Hyperparameters in machine learning are settings that cannot be learned from the training data directly but need to be chosen before training. They are typically set by the data scientist or machine learning engineer and control the learning process of the model. Examples of hyperparameters include the learning rate, the regularization parameter, the number of hidden layers, and the number of neurons in each hidden layer. The values of hyperparameters can significantly affect the model’s performance, and finding the optimal values is often done through a trial-and-error or systematic search process.
Activation functions are a key component of neural networks in deep learning. They are mathematical functions applied to the output of a neural network layer to determine whether or not a neuron should be activated (i.e., “fired”). This output is then passed to the next layer of the neural network for further processing. There are many different activation functions that can be used in deep learning, including sigmoid, ReLU, and tanh. The choice of activation function can have a significant impact on the performance of a neural network, so it is an important consideration when designing and training a deep learning model.
The loss function is a measure of how well the model is performing during training. The goal is to minimize the loss function, which is accomplished through optimization.
The optimizer is the method for updating the model’s weights during training in order to minimize the loss function. Popular optimizers include stochastic gradient descent, Adam, and Adagrad.
Regularization is a set of techniques for preventing overfitting, which occurs when the model memorizes the training data instead of generalizing to new data. Popular regularization techniques include L1 and L2 regularization, dropout, and early stopping.
Layers are the basic building blocks of a neural network. Layers transform the input data in some way and pass it to the next layer.
Backpropagation is the algorithm used to calculate the gradients of the loss function with respect to the model’s weights, which is necessary for optimization.
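To show where each of these components appears in practice, here is a minimal, hypothetical Keras sketch; the layer sizes, learning rate, dropout rate, and patience value are arbitrary choices for illustration only:
from tensorflow import keras
from tensorflow.keras import layers
# Layers are the building blocks; Dropout is a regularization layer
model = keras.Sequential([
    layers.Dense(32, activation='relu', input_shape=(10,)),  # activation function: ReLU
    layers.Dropout(0.2),                                     # regularization: dropout
    layers.Dense(1, activation='sigmoid')                    # activation function: sigmoid
])
# Loss function and optimizer; the learning rate is a hyperparameter
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
              loss='binary_crossentropy',
              metrics=['accuracy'])
# Regularization: early stopping halts training once validation loss stops improving
early_stop = keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True)
# Backpropagation computes the gradients and the optimizer applies the weight updates inside fit():
# model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])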
Computational cost of Deep Learning algorithms
Deep learning models, particularly large ones, can be computationally expensive to train and run. The cost of training a deep learning model depends on various factors such as the size of the model, the complexity of the problem, the size of the training data, the number of layers, and the number of parameters.
Training a deep learning model can take hours, days, or even weeks, depending on the size and complexity of the model and the computing resources available. To mitigate this, deep learning engineers often use distributed training, which involves training the model across multiple machines, to reduce the overall training time.
In addition to the cost of training, running a deep learning model in production can also be expensive, particularly if the model requires a lot of computing resources or if it needs to process large amounts of data in real-time. To reduce these costs, engineers often use specialized hardware such as Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs) that are optimized for running deep learning models.
Therefore, it is important to carefully consider the computational costs of deep learning models before deciding to use them, and to ensure that the benefits of using deep learning outweigh the associated costs.
Deep learning has the potential to revolutionize the way we solve complex problems in a variety of fields, from healthcare to finance, to transportation and beyond. With the ability to learn and adapt from vast amounts of data, deep learning models have already achieved remarkable breakthroughs in image and speech recognition, natural language processing, and game playing, just to name a few examples.
However, as with any powerful tool, there are challenges and limitations to consider when working with deep learning models. Issues such as overfitting, interpretability, and computational cost must be carefully addressed to ensure that deep learning solutions are accurate, reliable, and practical.
Despite these challenges, the potential benefits of deep learning are undeniable, and the field is advancing at a rapid pace. As researchers and practitioners continue to push the boundaries of what’s possible, we can expect to see even more exciting breakthroughs and applications of deep learning in the years to come.
Comments welcome!
Data Science
· 2022-01-01
-
Boosting vs Bagging Model Improvement Techniques
In machine learning, there are two popular techniques for improving the accuracy of models: boosting and bagging. Both techniques are used to reduce the variance of a model, which is the tendency to overfit to the training data. While they have similar goals, they differ in their approach and functionality. In this article, we’ll explore the differences between boosting and bagging to help you decide which technique is right for your machine learning project.
Bagging
Bagging, short for bootstrap aggregating, is a technique that involves training multiple models on different random subsets of the training data. The goal of bagging is to reduce the variance of a model by averaging the predictions of multiple models. Each model in the ensemble is trained independently and the final prediction is the average of all models. Bagging can be used with any algorithm, but it is most commonly used with decision trees. The most popular implementation of bagging is the random forest algorithm, which uses an ensemble of decision trees to make predictions.
Boosting
Boosting is a technique that involves training multiple weak models on the same training data sequentially. The goal of boosting is to improve the accuracy of a model by adding new models that focus on the misclassified samples of the previous model. Each model in the ensemble is trained on the same dataset, but with different weights assigned to each sample. The weights are adjusted based on the misclassified samples of the previous model. The final prediction is a weighted average of all models in the ensemble. Boosting is commonly used with decision trees, but it can be used with any algorithm.
Differences between Boosting and Bagging
While boosting and bagging have similar goals, they differ in their approach and functionality. The main differences between these two techniques are:
Approach: Bagging involves training multiple models independently on different random subsets of the training data, while boosting trains multiple models sequentially on the same dataset with different weights assigned to each sample.
Sample Weighting: Bagging assigns equal weight to each sample in the training data, while boosting assigns higher weight to misclassified samples.
Model Selection: In bagging, the final prediction is the average of all models in the ensemble, while in boosting, the final prediction is a weighted average of all models in the ensemble.
Performance: Bagging can reduce the variance of a model and improve its stability, but it may not improve its accuracy. Boosting can improve the accuracy of a model, but it may increase its variance and overfitting.
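To see the two approaches side by side, here is a brief scikit-learn sketch that trains a bagging ensemble and a boosting ensemble on the same synthetic data (the dataset and hyperparameter values are arbitrary and for illustration only); by default both classifiers use decision trees as their base learners:
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Synthetic binary classification data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Bagging: independent trees trained on bootstrap samples, combined by voting
bagging = BaggingClassifier(n_estimators=100, random_state=42)
# Boosting: trees trained sequentially, each reweighting the samples the previous trees misclassified
boosting = AdaBoostClassifier(n_estimators=100, random_state=42)
for name, model in [('Bagging', bagging), ('Boosting', boosting)]:
    model.fit(X_train, y_train)
    print(name, 'accuracy:', accuracy_score(y_test, model.predict(X_test)))
Which ensemble performs better depends on the data; the point of the sketch is only that bagging trains its trees independently while boosting trains them one after another.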
Conclusion
In conclusion, boosting and bagging are two popular techniques for improving the accuracy of machine learning models. While they have similar goals, they differ in their approach and functionality. Bagging involves training multiple models independently on different subsets of the training data, while boosting trains multiple models sequentially on the same dataset with different weights assigned to each sample. Which technique is right for your machine learning project depends on your specific needs and goals. Bagging can improve model stability, while boosting can improve model accuracy.
Comments welcome!
Data Science
· 2021-12-04
-
Implementing XGBoost in Python
XGBoost (Extreme Gradient Boosting) is a popular algorithm for supervised learning problems, including regression, classification, and ranking tasks. In the financial services industry, XGBoost can be used for a variety of regression problems, such as predicting stock prices, credit risk scoring, and forecasting financial time series.
One advantage of XGBoost is that it can handle missing values and outliers in the data. It can also automatically handle feature selection and feature engineering, which are important steps in preparing data for regression analysis. XGBoost is also highly optimized for performance and can handle large datasets with millions of rows and thousands of features.
Use-case of xgboost for regression
For example, in the stock market, XGBoost can be used to predict the future price of a stock based on historical data. XGBoost can also be used for credit scoring to assess the creditworthiness of borrowers by analyzing various features such as credit history, income, and debt-to-income ratio. In addition, XGBoost can be used for forecasting financial time series, such as predicting the future values of stock market indices or exchange rates.
Use-case of xgboost for classification
One such application is in the classification of credit risk.
Credit risk classification is a fundamental task in the financial industry. The goal is to predict the probability of a borrower defaulting on a loan, based on a variety of factors such as credit score, income, employment status, and loan amount. This information can help banks and financial institutions make informed decisions about lending and managing risk.
XGBoost has been shown to be effective in credit risk classification tasks, achieving high accuracy and predictive power. In a typical use case, the algorithm is trained on historical data, which includes information about borrowers and their credit outcomes. The model is then used to predict the probability of default for new loan applications.
Implementation of XGBoost for regression using Python:
First, we’ll need to install the XGBoost library:
!pip install xgboost
Then, we can import the necessary libraries and load our dataset. In this example, we’ll use the Boston Housing dataset, which ships with older versions of scikit-learn (load_boston was deprecated and has since been removed in recent releases, so you may need to substitute another regression dataset, such as the California Housing data):
import numpy as np
import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
# Load data
boston = load_boston()
X, y = boston.data, boston.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Next, we’ll define our XGBoost model and fit it to the training data:
# Define model
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1,
max_depth=5, alpha=10, n_estimators=10)
# Fit model
xg_reg.fit(X_train, y_train)
We can then use the trained model to make predictions on the test set and evaluate its performance using mean squared error:
# Make predictions on test set
y_pred = xg_reg.predict(X_test)
# Evaluate model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE: %f" % (rmse))
That’s it! We’ve trained an XGBoost model for regression and evaluated its performance on a test set. Note that in practice, you would likely want to tune the hyperparameters of the model using a validation set or cross-validation.
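As a sketch of that tuning step, scikit-learn’s GridSearchCV can cross-validate a small grid of XGBoost hyperparameters, reusing the X_train and y_train arrays from the example above (the grid values here are arbitrary examples):
from sklearn.model_selection import GridSearchCV
# Candidate hyperparameter values to cross-validate (illustrative only)
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1, 0.3],
    'n_estimators': [50, 100]
}
grid_search = GridSearchCV(
    estimator=xgb.XGBRegressor(objective='reg:squarederror'),
    param_grid=param_grid,
    scoring='neg_mean_squared_error',
    cv=5
)
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best cross-validated RMSE: %f" % ((-grid_search.best_score_) ** 0.5))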
Implementing XGBoost for binary classification in Python:
In this example, we load the dataset into a Pandas dataframe and split it into training and testing sets using train_test_split from scikit-learn. We then define the XGBoost classifier with hyperparameters such as the number of trees, maximum depth of each tree, learning rate, and fraction of samples and features used in each tree. We train the model on the training data using fit and make predictions on the test data using predict. Finally, we evaluate the performance of the model using accuracy score.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load the dataset into a Pandas dataframe
data = pd.read_csv('path/to/dataset.csv')
# Split the data into input features (X) and target variable (y)
X = data.drop('target_variable', axis=1)
y = data['target_variable']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Define the XGBoost classifier with hyperparameters
xgb_model = xgb.XGBClassifier(
    n_estimators=100,            # number of trees
    max_depth=5,                 # maximum depth of each tree
    learning_rate=0.1,           # learning rate
    subsample=0.8,               # fraction of samples used in each tree
    colsample_bytree=0.8,        # fraction of features used in each tree
    objective='binary:logistic', # objective function
    seed=42                      # random seed for reproducibility
)
# Train the XGBoost classifier on the training data
xgb_model.fit(X_train, y_train)
# Make predictions on the test data
y_pred = xgb_model.predict(X_test)
# Evaluate the performance of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
Implement XGBoost for multi-class classification using Python
In this example, we first load a multi-class classification dataset and split it into training and testing sets. We then initialize an XGBoost classifier and fit it on the training data. Finally, we make predictions on the test data and calculate the accuracy of the model. Note that the XGBClassifier class automatically handles multi-class classification problems, so we don’t need to do any additional preprocessing.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
data = pd.read_csv('dataset.csv')
# Separate target variable from features
X = data.iloc[:, :-1]
y = data.iloc[:, -1]
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# Initialize the XGBoost classifier with default hyperparameters
model = xgb.XGBClassifier()
# Fit the model on the training data
model.fit(X_train, y_train)
# Make predictions on the test data
y_pred = model.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy: {:.2f}%'.format(accuracy * 100))
Overall, XGBoost is a powerful tool for regression in the financial services industry and is widely used by financial institutions and investment firms to make data-driven decisions.
Comments welcome!
Data Science
· 2021-11-06
-
Implementing Reinforcement Learning in Python and R
Reinforcement learning is a branch of machine learning that involves training agents to make a sequence of decisions in an environment to maximize a reward function. The agent receives feedback in the form of a reward signal for every action it takes, and its goal is to learn a policy that maximizes the long-term expected reward. In this article, we’ll discuss how to implement reinforcement learning in Python.
Reinforcement learning can be used in various ways in the financial services industry. Here are a few examples:
Algorithmic trading: Reinforcement learning can be used to create trading algorithms that can learn from market data and make decisions on when to buy, sell, or hold assets.
Portfolio management: Reinforcement learning can be used to optimize portfolios by selecting the most appropriate assets to invest in based on market conditions, past performance, and other factors.
Fraud detection: Reinforcement learning can be used to detect fraudulent transactions by learning from historical data and identifying patterns that indicate fraud.
Risk management: Reinforcement learning can be used to develop risk models that can predict and manage the risk of various financial instruments, such as derivatives.
Credit scoring: Reinforcement learning can be used to create credit scoring models that can learn from borrower behavior and other factors to predict creditworthiness and default risk.
There are several popular Python libraries for implementing reinforcement learning, such as TensorFlow, Keras, PyTorch, and OpenAI Gym. In this tutorial, we’ll use OpenAI Gym to create a simple reinforcement learning environment.
OpenAI Gym provides a collection of pre-built environments for reinforcement learning, such as CartPole and MountainCar. These environments provide a simple interface for creating agents that learn to interact with the environment and maximize the reward.
Let’s start by installing OpenAI Gym:
!pip install gym
Now, let’s create an environment for our agent:
import gym
env = gym.make('CartPole-v0')
This creates an instance of the CartPole environment, which is a classic control problem in reinforcement learning. The goal of the agent is to balance a pole on a cart by applying forces to the cart.
Now, let’s define our agent. We’ll use a Q-learning algorithm to learn a policy that maximizes the long-term expected reward. Q-learning is a simple reinforcement learning algorithm that learns an action-value function, which estimates the expected reward for taking a particular action in a particular state.
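In Q-learning, after taking action a in state s, receiving reward r, and observing the next state s', the Q-value is updated using the rule:
Q(s, a) = Q(s, a) + alpha * (r + gamma * max over a' of Q(s', a') - Q(s, a))
where alpha is the learning rate and gamma is the discount factor that controls how strongly future rewards are valued. This is exactly the update applied inside the training loop below.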
import numpy as np
num_states = env.observation_space.shape[0]
num_actions = env.action_space.n
q_table = np.zeros((num_states, num_actions))
This creates a Q-table, which maps each state-action pair to a Q-value that estimates the expected reward for taking that action in that state. Note that CartPole’s observations are continuous vectors, so in a practical implementation they must first be discretized into integer bins before they can be used to index a table like this.
Now, let’s train our agent. We’ll use a simple epsilon-greedy policy, which selects the action with the highest Q-value with probability 1-epsilon, and a random action with probability epsilon.
epsilon = 0.1
gamma = 0.99
alpha = 0.5
num_episodes = 10000
# note: this uses the classic Gym API, where reset() returns the observation and
# step() returns (next_state, reward, done, info); newer Gym/Gymnasium versions differ
for i in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        if np.random.uniform() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state, :])
        next_state, reward, done, info = env.step(action)
        q_table[state, action] += alpha * (reward + gamma * np.max(q_table[next_state, :]) - q_table[state, action])
        state = next_state
This trains our agent for 10,000 episodes using the Q-learning algorithm. During training, the agent updates the Q-values in the Q-table based on the rewards it receives.
Finally, let’s test our agent:
num_episodes = 100
total_reward = 0
for i in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        action = np.argmax(q_table[state, :])
        next_state, reward, done, info = env.step(action)
        total_reward += reward
        state = next_state
print('Average reward:', total_reward / num_episodes)
This tests our agent by running it for 100 episodes and averaging the rewards. If everything went well, the agent should be able to balance the pole on the cart and achieve a high average reward.
In conclusion, reinforcement learning is a powerful technique for training agents to make a sequence of decisions in an environment to maximize a reward function.
Comments welcome!
Data Science
· 2021-10-02
-
Implementing Association Rule Learning using APRIORI in Python and R
Association rule learning is a popular technique used in the financial services industry for analyzing customer behavior, identifying patterns, and making data-driven decisions.
Examples of association rule learning
Some examples of using association rule learning in the financial services industry are:
Cross-selling: Association rule learning can be used to identify the products that are frequently bought together by customers. This information can be used to create targeted cross-selling strategies and improve sales.
Fraud detection: Association rule learning can help in detecting fraudulent transactions. By analyzing the patterns of transactions, it can identify the transactions that deviate from the normal patterns and flag them for further investigation.
Risk management: Association rule learning can be used to analyze historical data and identify the factors that contributed to the financial risks. Based on these factors, financial institutions can create risk management strategies to mitigate the risks.
Customer segmentation: Association rule learning can help in segmenting customers based on their buying patterns. By analyzing the data, it can identify the groups of customers who share similar characteristics and create targeted marketing strategies.
Market basket analysis: Association rule learning can be used to analyze the purchase patterns of customers and identify the products that are frequently bought together. This information can be used to optimize the inventory management and improve the supply chain efficiency.
Implement Association rule learning (APRIORI algorithm) using Python
In order to use the Apriori algorithm, we need to install the apyori package. You can install the package using the following command:
!pip install apyori
Once you have installed the package, you can use the following code to apply the Apriori algorithm on a dataset:
from apyori import apriori
import pandas as pd
# Load the dataset
dataset = pd.read_csv('path/to/dataset.csv', header=None)
# Convert the dataset to a list of lists
records = []
for i in range(len(dataset)):
    records.append([str(dataset.values[i, j]) for j in range(len(dataset.columns))])
# Run the Apriori algorithm
association_rules = apriori(records, min_support=0.005, min_confidence=0.2, min_lift=3, min_length=2)
# Print the association rules
for rule in association_rules:
    print(rule)
In the code above, we first load the dataset into a Pandas dataframe and convert it into a list of lists. We then apply the Apriori algorithm on the dataset using the apriori() function from the apyori package. The min_support, min_confidence, min_lift, and min_length parameters are used to set the minimum support, confidence, lift, and length of the association rules. Finally, we print the association rules using a loop.
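For reference, the thresholds above correspond to the standard association-rule metrics for a rule A -> B:
Support(A -> B) = P(A and B), the fraction of transactions that contain both A and B.
Confidence(A -> B) = Support(A and B) / Support(A), the fraction of transactions containing A that also contain B.
Lift(A -> B) = Confidence(A -> B) / Support(B), how much more often A and B occur together than would be expected if they were independent.
A lift greater than 1 suggests that the items in A and B are positively associated.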
Implement Association rule learning (APRIORI algorithm) using R
To perform association rule learning using apriori algorithm in R, we first need to install and load the arules package. This package provides various functions to generate and analyze itemsets, as well as mine association rules.
Here’s an example of how to use apriori algorithm in R to generate association rules from a dataset:
# Install and load arules package
install.packages("arules")
library(arules)
# Load dataset
data("Groceries")
# Convert dataset to transactions
transactions <- as(Groceries, "transactions")
# Generate frequent itemsets
frequent_itemsets <- apriori(transactions, parameter = list(support = 0.005, confidence = 0.5))
# Generate association rules
association_rules <- apriori(transactions, parameter = list(support = 0.005, confidence = 0.5),
control = list(verbose = FALSE), appearance = list(rhs = c("whole milk"),
default = "lhs"))
# Inspect frequent itemsets and association rules
inspect(frequent_itemsets)
inspect(association_rules)
In the above example, we first loaded the Groceries dataset from the arules package. We then converted this dataset into a transaction object using the as() function.
Next, we used the apriori() function to generate frequent itemsets and association rules. The support parameter specifies the minimum support for an itemset to be considered frequent, while the confidence parameter specifies the minimum confidence for an association rule to be considered interesting.
We also specified a constraint on the association rules using the appearance parameter. In this case, we only generated association rules with “whole milk” on the right-hand side.
Finally, we used the inspect() function to visualize the frequent itemsets and association rules.
Overall, association rule learning is a powerful technique that can help financial institutions to make data-driven decisions, improve customer satisfaction, and increase revenue.
Comments welcome!
Data Science
· 2021-09-04
-
Implementing K-Means Clustering in Python and R
K-means clustering is a popular unsupervised learning technique used to cluster data points based on their similarity. In this article, we will explore what k-means clustering is, how it works, and how to implement it in Python and R.
What is K-means Clustering?
K-means clustering is a clustering algorithm that partitions n data points into k clusters based on their similarity. It aims to find the optimal center point for each cluster that minimizes the sum of squared distances between each data point and its respective cluster center. The algorithm iteratively assigns each data point to its nearest cluster center and re-computes the center point of each cluster.
How K-means Clustering Works?
K-means clustering follows a simple procedure to partition the data into k clusters. Here are the main steps involved in the k-means clustering algorithm:
Initialization: Choose k random points from the data as the initial cluster centroids.
Assignment: Assign each data point to the nearest cluster centroid based on the Euclidean distance.
Update: Calculate the new cluster centroid for each cluster based on the mean of all data points assigned to it.
Repeat: Repeat steps 2 and 3 until the cluster assignments no longer change or a maximum number of iterations is reached.
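Formally, k-means seeks cluster assignments and centroids that minimize the within-cluster sum of squares (WSS):
WSS = sum over clusters k of ( sum over points x in cluster k of ||x - mu_k||^2 )
where mu_k is the centroid (mean) of cluster k. This is the same quantity that the elbow method below plots against the number of clusters.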
Elbow method to choose the optimal number of clusters
The elbow method is a popular technique for choosing the optimal number of clusters in k-means clustering. It involves plotting the values of the within-cluster sum of squares (WSS) against the number of clusters, and identifying the “elbow” in the curve as the point at which additional clusters no longer provide a significant reduction in WSS.
Here’s how to implement the elbow method for choosing the optimal number of clusters in Python:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Create an array of the WSS values for a range of k values (number of clusters):
wss_values = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)  # X is your feature matrix of shape (n_samples, n_features)
    wss_values.append(kmeans.inertia_)
# Plot the WSS values against the number of clusters:
plt.plot(range(1, 11), wss_values)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WSS')
plt.show()
# Identify the "elbow" in the curve and select the optimal number of clusters
How to Implement K-means Clustering in Python?
Python has many machine learning libraries that provide built-in functions for implementing k-means clustering. Here is a simple example using the scikit-learn library:
from sklearn.cluster import KMeans
import numpy as np
# Generate some random data
data = np.random.rand(100, 2)
# Initialize KMeans object
kmeans = KMeans(n_clusters=2, random_state=0)
# Fit the data to the KMeans object
kmeans.fit(data)
# Print the cluster centers
print(kmeans.cluster_centers_)
In the above code, we first import the KMeans class from the scikit-learn library and generate some random data. We then initialize the KMeans object with the number of clusters and a random state for reproducibility. Finally, we fit the data to the KMeans object and print the resulting cluster centers.
Implementing K-means Clustering in R
To implement k-means clustering in R, we first need to load a dataset. For this example, we will use the iris dataset that comes with R. The iris dataset contains measurements of various attributes of iris flowers, such as sepal length, sepal width, petal length, and petal width. The dataset also includes the species of the flower.
# Load the iris dataset
data(iris)
# Select the columns that we want to cluster
data <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
# Scale the data
scaled_data <- scale(data)
Next, we will use the kmeans function to perform the clustering. We will set the number of clusters to 3 since there are 3 species of iris flowers in the dataset.
# Perform k-means clustering
kmeans_result <- kmeans(scaled_data, centers = 3)
Finally, we can plot the results to visualize the clusters.
# Plot the results
library(ggplot2)
df <- data.frame(scaled_data, cluster = as.factor(kmeans_result$cluster))
ggplot(df, aes(x = Sepal.Length, y = Sepal.Width, color = cluster)) + geom_point()
The resulting plot shows the three clusters that were formed by the algorithm.
Conclusion
K-means clustering is a popular unsupervised learning technique used for clustering data points based on their similarity. In this article, we explored what k-means clustering is, how it works, and how to implement it in Python (using the scikit-learn library) and R. K-means clustering is a powerful tool that has many applications in fields such as data mining, image processing, and natural language processing.
Comments welcome!
Data Science
· 2021-08-07
-
Implementing Random Forest Classification in Python and R
Random Forest Classification is a machine learning algorithm used for classification tasks. It is an extension of the decision tree algorithm, where multiple decision trees are built and combined to make a more accurate and stable prediction.
In a random forest, each decision tree is built using a random subset of the features in the dataset, which helps to reduce overfitting and improve the generalization performance of the model. The final prediction is made by aggregating the predictions of all the decision trees, usually through a voting mechanism.
Advantages of Random Forest Classification
The key advantages of Random Forest Classification are:
It can handle high-dimensional datasets with a large number of features.
It can handle missing data and outliers in the dataset.
It can model non-linear relationships between the input and output variables.
It is relatively easy to interpret the model and understand the importance of each feature in the prediction.
It is a robust and stable model that is less prone to overfitting compared to other classification algorithms.
Random Forest Classification can be implemented in various programming languages, including Python and R. The scikit-learn library in Python and the randomForest package in R are popular tools for building random forest models.
Math behind Random Forest Classification
Random Forest Classification is a machine learning algorithm that is based on the principles of decision trees and ensemble learning. The math behind Random Forest Classification can be broken down into the following steps:
Bootstrapped samples: The Random Forest algorithm creates multiple decision trees by randomly sampling the data with replacement (i.e., bootstrap samples). Each bootstrap sample has the same size as the original dataset, but with some of the data points repeated and others omitted.
Feature subset selection: For each decision tree, a random subset of features is selected to determine the best split at each node of the tree. This process helps to reduce the variance of the model and improve its generalization performance.
Decision tree construction: For each bootstrap sample and feature subset, a decision tree is constructed by recursively splitting the data into smaller subsets based on the selected features. The split is chosen to maximize the information gain, which is a measure of how well the split separates the classes.
Voting: Once all the decision trees have been constructed, their predictions are combined through a voting mechanism. Each decision tree predicts the class label of a test instance, and the final prediction is based on the majority vote of all the decision trees.
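To make the information gain criterion concrete: for a node S whose classes occur with proportions p_1, ..., p_C, the entropy is
Entropy(S) = - sum over classes i of p_i * log2(p_i)
and the information gain of a candidate split is the parent entropy minus the weighted average entropy of the resulting child nodes:
Gain(S, split) = Entropy(S) - sum over children v of (|S_v| / |S|) * Entropy(S_v)
In practice, implementations such as scikit-learn use the closely related Gini impurity by default, but the idea is the same: choose the split that makes the child nodes as pure as possible.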
Implementing Random Forest Classification in Python
To implement Random Forest Classification in Python, we can use the scikit-learn library. Here is an example code snippet:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
# Load the dataset
data = pd.read_csv('path/to/dataset.csv')
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.drop('target', axis=1), data['target'], test_size=0.3, random_state=42)
# Create a Random Forest Classifier with 100 trees
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
# Fit the model on the training data
rfc.fit(X_train, y_train)
# Predict the classes of the testing data
y_pred = rfc.predict(X_test)
# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
In this example, we first load the dataset and split it into training and testing sets using the train_test_split function from scikit-learn. We then create a RandomForestClassifier object with 100 trees and fit the model on the training data using the fit method. We use the predict method to predict the classes of the testing data and calculate the accuracy of the model using the accuracy_score function from scikit-learn.
Note that in this example, we assume that the dataset is stored in a CSV file, where the target variable is in the column named “target”. You will need to adjust the code to match your dataset’s format and feature names.
Implementing Random Forest Classification in R
To implement Random Forest Classification in R, we can use the randomForest package. Here is an example code snippet:
library(randomForest)
# Load the dataset
data <- read.csv('path/to/dataset.csv')
# Split the dataset into training and testing sets
set.seed(42)
train_index <- sample(nrow(data), floor(nrow(data) * 0.7))
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
# Create a Random Forest Classifier with 100 trees
rfc <- randomForest(target ~ ., data=train_data, ntree=100)
# Predict the classes of the testing data
y_pred <- predict(rfc, newdata=test_data)
# Calculate the accuracy of the model
accuracy <- mean(y_pred == test_data$target)
print(paste0("Accuracy: ", round(accuracy * 100, 2), "%"))
In this example, we first load the dataset and split it into training and testing sets using the sample function. We then create a randomForest object with 100 trees and fit the model on the training data using the formula target ~ . to specify that the “target” variable should be predicted using all the other variables in the dataset. We use the predict function to predict the classes of the testing data and calculate the accuracy of the model using the mean function.
Note that in this example, we assume that the dataset is stored in a CSV file, where the target variable is in the column named “target”. You will need to adjust the code to match your dataset’s format and feature names.
Comments welcome!
Data Science
· 2021-07-03
-
Implementing Decision Tree Classification in Python and R
Decision tree classification is a widely used machine learning algorithm that is used to predict a categorical output variable based on one or more input variables. The algorithm works by constructing a tree-like model that maps the observations in the input space to the output variable. In this article, we will discuss how to implement decision tree classification in Python and R.
Implementing Decision tree classification in Python
Step 1: Import the Required Libraries
Before we start coding, we need to import the required libraries for implementing the decision tree classification algorithm in Python. We will be using the scikit-learn library to implement this algorithm. The scikit-learn library is a popular machine learning library in Python that provides various algorithms and tools for machine learning applications.
# import libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
Step 2: Load the Data
The second step is to load the data. In this example, we will be using the iris dataset, which is a popular dataset in machine learning. The iris dataset contains information about the sepal length, sepal width, petal length, and petal width of three different species of iris flowers. The objective is to predict the species of the iris flower based on the input variables.
# load the data
iris = load_iris()
X = iris.data
y = iris.target
Step 3: Split the Data
The third step is to split the data into training and testing datasets. We will be using 70% of the data for training and the remaining 30% for testing.
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
Step 4: Train the Model
The fourth step is to train the decision tree classification model using the training data.
# train the model
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
Step 5: Test the Model
The fifth step is to test the decision tree classification model using the testing data.
# test the model
y_pred = clf.predict(X_test)
Step 6: Evaluate the Model
The final step is to evaluate the performance of the decision tree classification model. We will be using the accuracy score to evaluate the performance of the model.
# evaluate the model
from sklearn.metrics import accuracy_score
print("Accuracy:", accuracy_score(y_test, y_pred))
Implementing Decision tree classification in R
Step 1: Load the Dataset
The first step in implementing decision tree classification is to load the dataset. For this article, we will use the iris dataset, which is a popular dataset in machine learning.
To load the iris dataset, we can use the following code:
data(iris)
This will load the iris dataset into the R environment.
Step 2: Split the Dataset into Training and Test Sets
The next step is to split the dataset into training and test sets. We will use the training set to build the decision tree, and the test set to evaluate its performance.
To split the dataset, we can use the following code:
set.seed(123)
train <- sample(nrow(iris), 0.7 * nrow(iris))
train_data <- iris[train,]
test_data <- iris[-train,]
This code will split the iris dataset into training and test sets. The set.seed function is used to ensure that the split is reproducible. We are using 70% of the data for training and 30% for testing.
Step 3: Build the Decision Tree
The next step is to build the decision tree. We will use the rpart package in R to build the decision tree.
To build the decision tree, we can use the following code:
library(rpart)
fit <- rpart(Species ~ ., data=train_data, method="class")
This code will build the decision tree using the rpart function in R. The formula Species ~ . specifies that we want to predict the Species variable using all the other variables in the dataset. The method=”class” argument specifies that we are building a classification tree.
Step 4: Visualize the Decision Tree
The next step is to visualize the decision tree. We can use the plot function in R to visualize the decision tree.
To visualize the decision tree, we can use the following code:
plot(fit, margin=0.1)
text(fit, use.n=TRUE, all=TRUE, cex=.8)
This code will create a plot of the decision tree. The margin=0.1 argument specifies that we want to add a margin around the plot. The text function is used to add labels to the nodes of the decision tree.
Step 5: Make Predictions on the Test Set
The final step is to make predictions on the test set. We will use the decision tree to make predictions on the test set, and then evaluate its performance.
To make predictions on the test set, we can use the following code:
predictions <- predict(fit, test_data, type="class")
This code will make predictions on the test set using the decision tree. The type=”class” argument specifies that we want to make class predictions.
In conclusion, decision tree classification is a powerful algorithm that can be used to predict a categorical output variable based on one or more input variables. The Python scikit-learn library and R rpart library provide an easy-to-use implementation of this algorithm.
Comments welcome!
Data Science
· 2021-06-05
-
Implementing Logistic Regression in Python and R
Logistic regression is a type of statistical analysis (also known as the logit model). It is often used for predictive analytics and modeling, and it extends to applications in machine learning. In this approach, the dependent variable is finite or categorical: either A or B (binary regression) or a range of finite options A, B, C, or D (multinomial regression). Logistic regression is used to understand the relationship between the dependent variable and one or more independent variables by estimating probabilities with a logistic regression equation.
This type of analysis can help you predict the likelihood of an event happening or a choice being made. For example, you may want to know the likelihood of a visitor choosing an offer made on your website — or not (dependent variable). Your analysis can look at known characteristics of visitors, such as sites they came from, repeat visits to your site, behavior on your site (independent variables). Logistic regression models help you determine a probability of what type of visitors are likely to accept the offer — or not. As a result, you can make better decisions about promoting your offer or make decisions about the offer itself.
Logistic regression formula
Logit(p) = log(p / (1 - p))
where p is the probability of a positive outcome.
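Inverting the logit expresses the probability itself as a sigmoid (logistic) function of a linear combination of the independent variables:
p = 1 / (1 + e^-(b0 + b1*x1 + ... + bn*xn))
where b0 is the intercept and b1, ..., bn are the coefficients estimated from the training data.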
Types of logistic models
Following are some types of predictive models that use logistic analysis.
Generalized linear model
Discrete choice
Multinomial logit
Mixed logit
Probit
Multinomial probit
Ordered logit
Assumptions of logistic regression
Before we apply the logistic regression model, we also need to check if the following assumptions hold true.
The Response Variable is Binary
The Observations are Independent - The easiest way to check this assumption is to create a plot of residuals against time (i.e. the order of the observations) and observe whether or not there is a random pattern. If there is not a random pattern, then this assumption may be violated.
There is No Multicollinearity Among Explanatory Variables - The most common way to detect multicollinearity is by using the variance inflation factor (VIF), which measures the correlation and strength of correlation between the predictor variables in a regression model.
There are No Extreme Outliers - The most common way to test for extreme outliers and influential observations in a dataset is to calculate Cook’s distance for each observation. If there are indeed outliers, you can choose to (1) remove them, (2) replace them with a value like the mean or median, or (3) simply keep them in the model but make a note about this when reporting the regression results.
There is a Linear Relationship Between Explanatory Variables and the Logit of the Response Variable. The easiest way to see if this assumption is met is to use a Box-Tidwell test.
Implementing the model in python and R
Implementing the model consists of the following key steps.
Data pre-processing: This is similar for most ML models, so we tackle this in a separate article and not here
Training the model
Using the model for prediction
Data pre-processing
At this stage we carry out several pre-processing activities, including splitting the data into a training set and a test set. We can usually follow the 80:20 principle, meaning that we use 80% of our data to train the model and the remaining 20% to test it and catch under- or overfitting.
Training the model
We use the generalized linear model to obtain an equation that predicts the dependent variable using independent variables from the training set.
Using python
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
Using R
classifier = glm(formula = Purchased ~ ., family = binomial, data = training_set)
Using the model
Now, we use the obtained equation to predict the dependent variable using the test set independent variables.
Using python
y_pred = classifier.predict(X_test)
Using R
prob_pred = predict(classifier, type = 'response', newdata = test_set[-3])
y_pred = ifelse(prob_pred > 0.5, 1, 0)
Visualizing results
Visualising the outcome of the model through a confusion matrix.
Using python
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
accuracy_score(y_test, y_pred)
Using R
cm = table(test_set[, 3], y_pred > 0.5)
For full implementation, check out my github repository - python and github repository - R.
Comments welcome!
Data Science
· 2021-05-01
-
Implementing Random Forest Regression in Python and R
Random forest regression is a popular machine learning algorithm used for predicting numerical values. It is a variant of the random forest algorithm and is well-suited for regression problems where the response variable is continuous. In this article, we will learn how to implement random forest regression using Python and R.
What is Random Forest Regression?
Random forest regression is an ensemble learning method that builds a collection of decision trees and aggregates their predictions to make a final prediction. Each decision tree is built using a subset of the training data and a subset of the features. Random forest regression uses bagging (bootstrap aggregating) to build each tree and random feature selection to reduce overfitting.
Implementing random forest regression using Python:
Step 1: Import Libraries
We start by importing the necessary libraries. We need the pandas library to load and manipulate the data, and the scikit-learn library for building and evaluating the model.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
Step 2: Load and Prepare the Data
Next, we load the data into a pandas dataframe and prepare it for training. We need to split the data into the independent variables (features) and dependent variable (target) and split the data into training and testing sets.
# load the data into a pandas dataframe
df = pd.read_csv('data.csv')
# split the data into features and target
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Step 3: Train the Model
Next, we create an instance of the RandomForestRegressor class and fit it to the training data.
# create an instance of the random forest regressor class
rf = RandomForestRegressor(n_estimators=100, random_state=0)
# fit the model to the training data
rf.fit(X_train, y_train)
Step 4: Evaluate the Model
Finally, we evaluate the performance of the model using the testing set. We calculate the R-squared score and mean squared error to determine how well the model is performing.
# make predictions using the testing set
y_pred = rf.predict(X_test)
# calculate the R-squared score
r2 = r2_score(y_test, y_pred)
print('R-squared: {:.2f}'.format(r2))
# calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error: {:.2f}'.format(mse))
Step 5: Make Predictions
Once we have trained the model, we can use it to make predictions on new data. We can pass in new data to the predict method to get the predicted values.
# make a prediction for a new sample
new_sample = [[5, 10, 15]]
prediction = rf.predict(new_sample)
print('Prediction: {:.2f}'.format(prediction[0]))
Implementing random forest regression using R:
Step 1: Import Libraries
Let’s start by loading the necessary packages and data for our implementation:
# Load necessary libraries
library(randomForest)
Step 2: Load and Prepare the Data
In this example, we will be using the mtcars dataset, which contains information on various car models, including miles per gallon (mpg), horsepower (hp), and weight.
data(mtcars)
Next, we will split the data into training and testing sets. We will be using 70% of the data for training and 30% for testing.
# Split the data into training and testing sets
set.seed(1234)
train <- sample(nrow(mtcars), 0.7 * nrow(mtcars))
test <- setdiff(seq_len(nrow(mtcars)), train)
Step 3: Train the Model
Now, we can build our random forest regression model using the randomForest function. We will use the mpg column as our response variable and the hp and wt columns as our predictor variables.
# Build the random forest regression model
rf <- randomForest(mpg ~ hp + wt, data = mtcars[train,], ntree = 500)
The ntree parameter specifies the number of trees to include in the model. In this example, we have set ntree to 500.
Step 4: Make Predictions
We can now use the predict function to make predictions on the test data and compare them to the actual values.
# Make predictions on the test data
predictions <- predict(rf, mtcars[test,])
# Calculate the root mean squared error (RMSE)
rmse <- sqrt(mean((predictions - mtcars[test, "mpg"])^2))
print(rmse)
The RMSE value will give us an idea of how accurate our model is. In this example, we obtained an RMSE value of 3.441.
Step 5: Visualize
We can also plot the predicted values against the actual values to visualize the accuracy of our model.
# Plot predicted values against actual values
plot(predictions, mtcars[test, "mpg"],
xlab = "Predicted MPG", ylab = "Actual MPG")
This will produce a scatter plot with the predicted values on the x-axis and the actual values on the y-axis.
Conclusion
In this article, we learned how to implement random forest regression using Python and R. We used the scikit-learn library in Python and randomForest library in R to build and evaluate the model. Random forest regression is a powerful algorithm for predicting continuous values and can be used for a variety of regression problems.
Comments welcome!
Data Science
· 2021-04-03
-
Support Vector Regression
Support Vector Regression (SVR) is a type of regression algorithm that uses Support Vector Machines (SVM) to perform regression analysis. In contrast to traditional regression algorithms, which aim to minimize the error between the predicted and actual values, SVR aims to fit a “tube” around the data such that the majority of the data points fall within the tube. The goal of SVR is to find a function that is as flat as possible while keeping most data points inside the tube.
In SVR, the input data is transformed into a higher-dimensional space, where a linear regression model is applied. The SVM then finds the best fit line for the transformed data, which corresponds to a non-linear fit in the original data space.
Implementing SVR in Python
To implement SVR in Python, we can use the SVR class from the sklearn.svm module in scikit-learn, which is a popular Python machine learning library. Here’s an example code to implement SVR in Python:
from sklearn.svm import SVR
import numpy as np
# Generate some sample data
X = np.sort(5 * np.random.rand(100, 1), axis=0)
y = np.sin(X).ravel()
# Create an SVR object and fit the model to the data
clf = SVR(kernel='rbf', C=1e3, gamma=0.1)
clf.fit(X, y)
# Make some predictions with the trained model
y_pred = clf.predict(X)
# Print the mean squared error of the predictions
mse = np.mean((y_pred - y) ** 2)
print(f"Mean squared error: {mse:.2f}")
In this example, we generate some sample data by randomly selecting 100 points along the sine curve. We then create an SVR object with an RBF kernel and some hyperparameters C and gamma. We fit the model to the sample data and make some predictions with the trained model. Finally, we calculate the mean squared error between the predicted values and the true values.
Note that the hyperparameters C and gamma control the regularization and non-linearity of the SVR model, respectively. These values can be tuned to optimize the performance of the model on a particular dataset. Additionally, scikit-learn provides many other options for configuring and fine-tuning the SVR model.
Implementing SVR in R
In R, we can implement SVR using the e1071 package, which provides the svm function for fitting support vector machines. Here’s an example code to implement SVR in R:
library(e1071)
# Generate some sample data
set.seed(1)
x <- sort(5 * runif(100))
y <- sin(x)
# Fit an SVR model to the data
model <- svm(x, y, kernel = "radial", gamma = 0.1, cost = 1000)
# Make some predictions with the trained model
y_pred <- predict(model, x)
# Print the mean squared error of the predictions
mse <- mean((y_pred - y) ^ 2)
cat(sprintf("Mean squared error: %.2f\n", mse))
In this example, we generate some sample data by randomly selecting 100 points along the sine curve. We then fit an SVR model to the data using the svm function from the e1071 package. We use a radial basis function (RBF) kernel and specify some hyperparameters gamma and cost. We make some predictions with the trained model and calculate the mean squared error between the predicted values and the true values.
Note that the hyperparameters gamma and cost control the non-linearity and regularization of the SVR model, respectively. These values can be tuned to optimize the performance of the model on a particular dataset. Additionally, the scikit-learn (Python) and e1071 (R) packages provide many other options for configuring and fine-tuning the SVM model.
Math behind SVR
The math behind Support Vector Regression (SVR) is based on the same principles as Support Vector Machines (SVM), with some modifications to handle regression tasks. Here is a brief overview of the math behind SVR:
Given a set of training data, SVR first transforms the input data to a high-dimensional feature space using a kernel function. The kernel function computes the similarity between two data points in the original space and maps them to a higher-dimensional space where they can be more easily separated by a linear hyperplane.
The goal of SVR is to find a hyperplane in the feature space that maximally separates the training data while maintaining a margin around it. This is done by solving an optimization problem that involves minimizing the distance between the hyperplane and the training data while maximizing the margin.
In SVR, the margin is defined as a tube around the hyperplane, rather than a margin between two parallel hyperplanes as in SVM. The width of the tube is controlled by two parameters, ε (epsilon) and C. ε defines the width of the tube and C controls the trade-off between the size of the margin and the amount of training data that is allowed to violate it.
The optimization problem in SVR is typically formulated as a quadratic programming problem, which can be solved using numerical optimization techniques.
Once the hyperplane is found, SVR uses it to make predictions for new data points by computing their distance to the hyperplane in the feature space. The distance is transformed back to the original space using the kernel function to obtain the predicted output.
Overall, the math behind SVR involves finding a hyperplane that maximizes the margin around the training data while maintaining a tube around the hyperplane. This is done by transforming the data to a high-dimensional feature space, solving an optimization problem to find the hyperplane, and using the hyperplane to make predictions for new data points.
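To make the role of the tube width concrete, here is a small sketch (assuming scikit-learn and toy data in the same style as the example above) that fits two SVR models with different epsilon values and compares how many support vectors each keeps:
import numpy as np
from sklearn.svm import SVR
# noisy points along a sine curve, similar to the earlier example
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(100, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(100)
for eps in (0.01, 0.3):
    model = SVR(kernel='rbf', C=100, gamma=0.5, epsilon=eps)
    model.fit(X, y)
    # a wider tube (larger epsilon) tolerates more deviation and needs fewer support vectors
    print(f"epsilon={eps}: {len(model.support_)} support vectors")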
Advantages of SVR
Support Vector Regression (SVR) has several advantages over other regression models:
Non-linearity: SVR can model non-linear relationships between the input and output variables, while linear regression models can only model linear relationships.
Robustness to outliers: SVR is less sensitive to outliers in the input data compared to other regression models. This is because the optimization process in SVR only considers data points near the decision boundary, rather than all data points.
Flexibility: SVR allows for the use of different kernel functions, which can be used to model different types of non-linear relationships between the input and output variables.
Regularization: SVR incorporates a regularization term in the objective function, which helps to prevent overfitting and improve the generalization performance of the model.
Efficient memory usage: SVR uses only a subset of the training data (support vectors) to build the decision boundary. This results in a more efficient memory usage, which is particularly useful when dealing with large datasets.
Overall, SVR is a powerful and flexible regression model that can handle a wide range of regression tasks. Its ability to model non-linear relationships, its robustness to outliers, and its efficient memory usage make it a popular choice for many machine learning applications.
Comments welcome!
Data Science
· 2021-03-06
-
Implementing Linear Regression in Python and R
Regression is a supervised learning technique to predict the value of a continuous target or dependent variable using a combination of predictor or independent variables. Linear regression is a type of regression where the primary consideration is that the independent and dependent variables have a linear relationship. Linear regression is of two broad types - simple linear regression and multiple linear regression. In simple linear regression there is only one independent variable. Whereas, multiple linear regression refers to a statistical technique that uses two or more independent variables to predict the outcome of a dependent variable. Linear regression also has some modifications such as lasso, ridge or elastic-net regression. However, in this article we will cover multiple linear regression.
Intuition behind linear regression
Before we begin, let us take a look at the equation of multiple linear regression. Y is the target variable that we are trying to predict. x1, x2, .. , xn are the n predictor variables. b0, b1, .. , bn are the coefficients that the linear regression (OLS - ordinary least squares) model will help us figure out. For example, we can use linear regression to predict a real value, like profit.
Y = b0 + b1*x1 + b2*x2 + .. + bn*xn
profit = b0 + b1*r_n_d_spend + b2*administration + b3*marketing_spend + b4*state
The ordinary least squares method gets the best fitting line by identifying the line that minimizes square of distance between actual and predicted values.
sum ( y_actual - y_hat ) ^ 2 -> minimize
Assumptions of linear regression
Before we apply the linear regression model, we also need to check that the following assumptions hold true (a quick residual-diagnostics sketch follows the list).
Linearity: The relationship between X and the mean of Y is linear
Homoscedasticity: The variance of residual is the same for any value of X
Independence: Observations are independent of each other
Normality: For any fixed value of X, Y is normally distributed
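A minimal sketch for eyeballing linearity, homoscedasticity and normality through residual plots is shown below; it assumes a fitted scikit-learn regressor (as built later in this article), the training data X_train and y_train, and matplotlib, and is illustrative rather than a formal test:
import matplotlib.pyplot as plt
import scipy.stats as stats
# residuals from a fitted model (regressor, X_train, y_train are assumed to exist)
y_fit = regressor.predict(X_train)
residuals = y_train - y_fit
fig, axes = plt.subplots(1, 2, figsize = (10, 4))
# residuals vs fitted values: a random, even scatter supports linearity and homoscedasticity
axes[0].scatter(y_fit, residuals, alpha = 0.6)
axes[0].axhline(0, color = 'red', linestyle = '--')
axes[0].set_xlabel('Fitted values')
axes[0].set_ylabel('Residuals')
# Q-Q plot of residuals: points close to the line support the normality assumption
stats.probplot(residuals, plot = axes[1])
plt.tight_layout()
plt.show()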
Implementing the model in python and R
Implementing the model consists of the following key steps.
Data pre-processing: This is similar for most ML models, so we tackle this in a separate article and not here
Training the model
Using the model for prediction
Data pre-processing
At this stage we do several pre-processing activities, including splitting the data into a training set and a test set. We can usually follow the 80:20 principle, meaning that we use 80% of our data to train the model and the remaining 20% to test it and catch under- or overfitting.
Training the model
We use the ordinary least squares method to obtain an equation that predicts the dependent variable using independent variables from the training set.
Using python
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Using R
regressor = lm(formula = Profit ~ ., data = training_set)
Using the model
Now, we use the obtained equation to predict the dependent variable using the test set independent variables.
Using python
y_pred = regressor.predict(X_test)
Using R
y_pred = predict(regressor, newdata = test_set)
Visualizing results
Visualising actual (x-axis) vs predicted (y-axis) test set values
Using python
plt.scatter(y_test, y_pred)
Using R
ggplot() + geom_point(aes(x = test_set$Profit, y = y_pred))
For full implementation, check out my github repository - python and github repository - R.
Comments welcome!
Data Science
· 2021-02-06
-
An Overview of Machine Learning Techniques
Machine learning is a subfield of artificial intelligence (AI) that allows systems to learn and improve from experience without being explicitly programmed. Essentially, machine learning involves the use of algorithms that can learn from data and improve performance over time. This means that machine learning can be used to identify patterns and make predictions, and can be used in a wide variety of applications, such as image and speech recognition, fraud detection, recommender systems, and many more.
The process of building a machine learning model typically involves several steps, including data cleaning and preprocessing, selecting appropriate features, selecting an appropriate model or algorithm, training the model on a labeled dataset, and then evaluating its performance on a separate test dataset. This process is often iterative, with adjustments made to the model and its parameters until the desired level of performance is achieved.
There are several types of machine learning, including supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning involves training a model on labeled data, meaning that the desired output is already known.
Regression
Regression is used to predict a continuous value, such as a number or a quantity. It is used to model the relationship between a dependent variable (the output) and one or more independent variables (the inputs). Regression is commonly used for tasks such as predicting stock prices, weather forecasting, or predicting sales figures.
Following are some common regression algorithms:
Linear Regression: This is a simple algorithm that models the relationship between a dependent variable and one or more independent variables.
Ridge Regression: This is a type of linear regression that includes a penalty term to prevent overfitting.
Lasso Regression: This is another type of linear regression that includes a penalty term, but it has the added benefit of performing feature selection.
Elastic Net Regression: This algorithm is a combination of Ridge and Lasso regression, allowing for both feature selection and regularization.
Polynomial Regression: This algorithm fits a polynomial equation to the data, allowing for more complex relationships between the dependent and independent variables.
Support Vector Regression: This algorithm models the data by finding a hyperplane that maximizes the margin between the data points.
Decision Tree Regression: This algorithm builds a decision tree based on the data, allowing for nonlinear relationships between the dependent and independent variables.
Random Forest Regression: This is an extension of decision tree regression that builds multiple trees and averages their predictions to improve accuracy.
Gradient Boosting Regression: This is an ensemble method that combines multiple weak regression models to create a strong model.
Classification
Classification, on the other hand, is used to predict a categorical value, such as a label or a class. It is used to identify the class or category to which a given data point belongs based on the features or attributes of that data point. Classification is commonly used for tasks such as image recognition, spam filtering, or predicting whether a customer will churn or not.
Following are some common classification algorithms:
Logistic Regression: Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes).
Support Vector Machines: Support Vector Machines (SVM) are a set of related supervised learning methods that analyze data and recognize patterns, used for classification and regression analysis. SVM works by finding the hyperplane that maximizes the margin between the two classes, and then classifying new data points based on which side of the hyperplane they fall on.
K-Nearest Neighbors: K-Nearest Neighbors (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). KNN is a type of instance-based learning or lazy learning where the function is only approximated locally and all computation is deferred until classification.
Naive Bayes: Naive Bayes is a probabilistic algorithm that makes predictions based on the probability of a certain outcome. It works by calculating the probability of each class given a set of input features, and then choosing the class with the highest probability.
Decision Trees: A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute, each branch represents the outcome of the test, and each leaf node represents a class label. Decision trees are popular because they are easy to understand and interpret.
Random Forest: This algorithm works by creating multiple decision trees, each based on a different random subset of the original data. The trees are then combined to make predictions on new data by taking a majority vote. The main advantage of Random Forest is that it can handle both categorical and numerical data, and can also handle missing values. It is known for its high accuracy and is often used in real-world applications such as image classification, fraud detection, and recommendation systems. However, it can be computationally expensive and may overfit if the number of trees is too large.
Unsupervised learning involves training a model on unlabeled data, meaning that the model must identify patterns and relationships on its own.
Clustering
Clustering is a technique used in unsupervised machine learning to group similar data points together based on their attributes or features.
Following are some common clustering algorithms:
K-Means Clustering: This algorithm groups data points into k clusters based on their distance from k centroids. The algorithm iteratively adjusts the centroids to minimize the sum of squared distances between data points and their respective centroids (a short k-means sketch follows this list).
Hierarchical Clustering: This algorithm creates a hierarchy of clusters by either starting with individual data points as clusters and combining them iteratively or starting with all data points as a single cluster and splitting them iteratively.
DBSCAN: This algorithm groups data points together that are closely packed together in high-density regions and separates out data points that are in low-density regions.
Gaussian Mixture Models: This algorithm models data as a combination of multiple Gaussian distributions and groups data points together based on the probabilities of belonging to different distributions.
Spectral Clustering: This algorithm uses graph theory to group data points together based on the similarity of their eigenvectors.
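As an illustration of the clustering idea, here is a minimal k-means sketch using scikit-learn on synthetic data; the dataset and the choice of k=3 are assumptions made for the example:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# synthetic 2-D data with three natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
# fit k-means with k=3 and read back the cluster assignments
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(kmeans.cluster_centers_)   # coordinates of the three learned centroids
print(labels[:10])               # cluster index assigned to the first ten points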
Association rule-based learning
Association rule-based learning algorithms are a type of unsupervised machine learning algorithm that identify interesting relationships, associations, or correlations among different variables in a dataset. These algorithms are commonly used in market basket analysis, where the goal is to identify relationships between items that are frequently purchased together.
Following are some common association rule learning algorithms:
Apriori algorithm: A classic algorithm that discovers frequent itemsets in a dataset and generates association rules based on these itemsets (a small Apriori sketch follows this list).
FP-Growth algorithm: A faster algorithm than Apriori that builds a compact representation of the dataset, known as a frequent pattern (FP) tree, to efficiently mine frequent itemsets and generate association rules.
Eclat algorithm: Another algorithm that mines frequent itemsets in a dataset, but instead of generating association rules, it focuses on finding frequent itemsets that share a common prefix.
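A minimal sketch of Apriori-style frequent-itemset mining using the mlxtend library (mlxtend and the tiny basket data below are assumptions for the example, not part of the original post):
import pandas as pd
from mlxtend.frequent_patterns import apriori
# tiny one-hot encoded market-basket data: one row per transaction, one column per item
baskets = pd.DataFrame({
    'bread':  [True, True, False, True, True],
    'butter': [True, True, False, False, True],
    'jam':    [False, True, True, False, True],
})
# frequent itemsets that appear in at least 40% of the transactions
frequent_itemsets = apriori(baskets, min_support = 0.4, use_colnames = True)
print(frequent_itemsets)
# association rules (e.g. bread -> butter) can then be derived from these itemsets
# with mlxtend.frequent_patterns.association_rules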
Reinforcement learning involves training a model to make decisions based on trial-and-error feedback.
Reinforcement learning is a broader class of problems in which an agent interacts with an environment over a period of time, and the agent’s goal is to learn a policy that maximizes its total reward over the long run.
On the other hand, the multi-armed bandit problem is often considered a simpler version of reinforcement learning. In the multi-armed bandit problem, an agent repeatedly selects an action (often referred to as a “bandit arm”) and receives a reward associated with that action. The agent’s goal is to maximize its total reward over a fixed period of time.
For example, there are a number of slot machines (or “one-armed bandits”) that a player can choose to play. Each slot machine has a different probability of paying out, and the player’s goal is to figure out which slot machine has the highest payout probability in the shortest amount of time.
Following are some common algorithms to solve the multi-armed bandit problem:
The Upper Confidence Bound (UCB) algorithm approaches this problem by keeping track of the average payout for each slot machine, as well as the number of times each machine has been played. It then calculates an upper confidence bound for each machine based on these values, which represents the upper limit of what the true payout probability could be for that machine. The player then chooses the slot machine with the highest upper confidence bound, which balances the desire to play machines that have paid out well in the past with the desire to explore other machines that may have a higher payout probability. Over time, as more data is collected on each machine’s payout probability, the upper confidence bound for each machine becomes narrower and more accurate, leading to better decisions and higher payouts for the player (a minimal UCB sketch is shown after this list).
Thompson sampling is a Bayesian algorithm for decision making under uncertainty. It is a probabilistic algorithm that can be used to solve multi-armed bandit problems. The algorithm works by updating a prior distribution on the unknown parameters of the problem based on the observed data. At each step, the algorithm chooses the action with the highest expected reward, where the expected reward is calculated by averaging over the posterior distribution of the unknown parameters. The algorithm is often used in online advertising, where it can be used to choose the best ad to display to a user based on their past behavior.
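To make the UCB idea concrete, here is a minimal sketch with simulated Bernoulli bandit arms; the payout probabilities are made up for the example:
import math
import random
random.seed(0)
true_probs = [0.2, 0.5, 0.7]          # hidden payout probability of each arm (made up)
counts = [0] * len(true_probs)        # times each arm was played
rewards = [0.0] * len(true_probs)     # total reward collected from each arm
for t in range(1, 1001):
    # play every arm once first, then pick the arm with the highest upper confidence bound
    if 0 in counts:
        arm = counts.index(0)
    else:
        ucb = [rewards[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
               for i in range(len(true_probs))]
        arm = ucb.index(max(ucb))
    reward = 1 if random.random() < true_probs[arm] else 0
    counts[arm] += 1
    rewards[arm] += reward
print(counts)   # the best arm (index 2) should have been played most often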
Overall, machine learning is a powerful tool that has the potential to revolutionize many industries and improve our lives in countless ways. As more data becomes available and computing power continues to increase, we can expect to see even more impressive applications of machine learning in the years to come.
Comments welcome!
Data Science
· 2021-01-02
-
A Premier on Chi-squared test
The chi-square test is a statistical hypothesis test that is used to determine whether there is a significant association between two categorical variables. It is widely used in data analysis, particularly in fields such as social sciences, marketing, and biology, to examine relationships between categorical data. In this article, we will discuss the chi-square test, its applications, and how to perform it using Python.
Understanding the Chi-Square Test
The chi-square test is a non-parametric test that compares the observed frequencies of categorical data with the expected frequencies. The test is based on the chi-square statistic, which is calculated by summing the squared difference between the observed and expected frequencies, divided by the expected frequency, for each category.
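In symbols (a compact restatement of the definition above), with O the observed frequency and E the expected frequency for each category:
chi2 = sum( (O - E)^2 / E )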
The chi-square test is used to test the null hypothesis that there is no significant association between the two variables. If the calculated chi-square value is greater than the critical value, we can reject the null hypothesis and conclude that there is a significant association between the variables.
There are two types of chi-square tests: the chi-square goodness of fit test and the chi-square test of independence. The goodness of fit test is used to test whether the observed data follows a particular distribution, while the test of independence is used to test whether there is a significant association between two categorical variables.
Applications of the Chi-Square Test
The chi-square test is widely used in research and data analysis, with a range of applications across various fields. Some common applications include:
Market research: To determine if there is a significant association between demographic factors and consumer behavior, such as age, gender, and income level.
Biology: To test whether different species of plants or animals are distributed randomly or in patterns in their environment.
Social sciences: To test whether there is a significant relationship between socio-economic status and educational attainment.
Quality control: To test whether a sample of products is defective, based on the number of products that pass or fail inspection.
Performing the Chi-Square Test in Python
Python has several libraries that can be used to perform the chi-square test, including SciPy, Pandas, and StatsModels. Here is an example of how to perform the chi-square test of independence using the chi2_contingency function in the SciPy library:
import scipy.stats as stats
import pandas as pd
# Load data into a Pandas DataFrame
data = pd.read_csv('my_data.csv')
# Create a contingency table
contingency_table = pd.crosstab(data['variable_1'], data['variable_2'])
# Perform the chi-square test of independence
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
# Print the results
print('Chi-square statistic:', chi2)
print('P-value:', p)
In this example, we load data from a CSV file into a Pandas DataFrame, create a contingency table using the crosstab function, and then use the chi2_contingency function to perform the chi-square test of independence. The function returns the chi-square statistic, the p-value, the degrees of freedom, and the expected frequencies.
Conclusion
The chi-square test is a valuable statistical tool for examining the relationship between two categorical variables. By performing the test, we can determine whether there is a significant association between the variables and draw conclusions about the data. With the help of Python and its many data analysis libraries, we can easily perform the chi-square test and gain valuable insights from our data.
Comments welcome!
Data Science
· 2020-12-05
-
A Premier on ANOVA
ANOVA (Analysis of Variance) is a statistical method used to analyze and test the differences between the means of three or more groups. ANOVA compares the variation within groups to the variation between groups to determine whether the differences in means are statistically significant or just due to random chance.
The basic idea behind ANOVA is that if the variation between groups is significantly greater than the variation within groups, then there is evidence to suggest that the means of the groups are different. ANOVA allows us to test the null hypothesis that all of the group means are equal against the alternative hypothesis that at least one group mean is different from the others.
ANOVA is used in a wide range of applications, including biology, social sciences, economics, and engineering. It is often used in experimental research to test the effects of different treatments or interventions on a particular outcome.
There are several types of ANOVA, including one-way ANOVA, which compares the means of three or more groups that are unrelated, and repeated measures ANOVA, which compares the means of three or more groups that are related (i.e., the same group is measured under different conditions). ANOVA can be performed using software such as R, Python, or SPSS. In this article, we will be using Python.
Assumptions of ANOVA
ANOVA (Analysis of Variance) has several assumptions that should be met to ensure the validity and reliability of the test. The main assumptions of ANOVA are:
Normality: The dependent variable should be normally distributed in each group. One way to check this is by examining the distribution of the residuals (the differences between the observed values and the predicted values) for each group.
Homogeneity of variances: The variances of the dependent variable should be equal in each group. This can be checked by examining the variance of the residuals for each group.
Independence: The observations should be independent of each other. This means that there should be no systematic relationship between the observations in one group and the observations in another group.
Random Sampling: The observations should be randomly sampled from each group in the population.
If these assumptions are not met, the results of the ANOVA may not be reliable. In addition, violating these assumptions can lead to a higher probability of type I errors (rejecting the null hypothesis when it is actually true) or type II errors (failing to reject the null hypothesis when it is actually false).
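A quick sketch for checking the normality and equal-variance assumptions with scipy; the three groups below are made-up data for illustration:
import scipy.stats as stats
# three hypothetical groups of observations
group1 = [4.1, 5.2, 5.0, 6.1, 4.5]
group2 = [5.1, 4.8, 4.2, 4.4, 3.6]
group3 = [4.3, 4.7, 6.3, 5.1, 5.5]
# Shapiro-Wilk test of normality for each group (p > 0.05 suggests normality is plausible)
for i, g in enumerate([group1, group2, group3], start=1):
    stat, p = stats.shapiro(g)
    print(f"Group {i} Shapiro-Wilk p-value: {p:.3f}")
# Levene's test for homogeneity of variances across the groups
stat, p = stats.levene(group1, group2, group3)
print(f"Levene's test p-value: {p:.3f}")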
Types of ANOVA tests
One-way ANOVA: This test is used to compare the means of more than two independent groups.
Two-way ANOVA: This test is used to compare the means of two or more independent groups while controlling for one or more other variables.
One-way ANOVA
One-way ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups. It is used to determine whether there are significant differences between the means of the groups based on the variability within each group and the variability between groups. In this article, we will walk through how to perform a one-way ANOVA test using Python.
Performing a one-way ANOVA test in Python:
To perform a one-way ANOVA test in Python, we can use the scipy.stats module. Here’s an example code snippet:
import scipy.stats as stats
import pandas as pd
# Create data
group1 = [1, 2, 3, 4, 5]
group2 = [6, 7, 8, 9, 10]
group3 = [11, 12, 13, 14, 15]
# Combine data into a pandas dataframe
data = pd.DataFrame({'Group1': group1, 'Group2': group2, 'Group3': group3})
# Perform one-way ANOVA test
fvalue, pvalue = stats.f_oneway(data['Group1'], data['Group2'], data['Group3'])
# Print results
print('F-value:', fvalue)
print('P-value:', pvalue)
In this example, we create three groups of data (group1, group2, and group3) and combine them into a pandas dataframe. We then use the f_oneway() function from the scipy.stats module to perform the one-way ANOVA test on the three groups. The output of the test includes the F-value and the p-value.
Interpreting the results:
The F-value is a measure of the variance between the groups compared to the variance within the groups. A higher F-value indicates that there is more variability between the groups and less variability within the groups. The p-value is a measure of the statistical significance of the F-value. A p-value less than 0.05 indicates that there is a statistically significant difference between the means of the groups.
In the example above, the F-value is 50 and the p-value is far below 0.05, which suggests that there is a statistically significant difference between the means of the three groups.
Two-way ANOVA
Two-way ANOVA is a statistical test used to determine the difference in the means of two or more groups. It involves testing the effects of two different factors on a response variable. In this article, we will go over how to perform two-way ANOVA in Python using the statsmodels package.
To illustrate two-way ANOVA in Python, we will use a dataset called ‘PlantGrowth’. It is a dataset of 30 plants, each receiving one of three different treatments (control, trt1, and trt2) and measuring their weight after a set period. We are interested in testing the effects of the treatments and the type of seed on the weight of the plants.
[{'weight': '4.17', 'group': 'ctrl', 'plant': 'plant_1'},
{'weight': '5.58', 'group': 'ctrl', 'plant': 'plant_2'},
{'weight': '5.18', 'group': 'ctrl', 'plant': 'plant_3'},
{'weight': '6.11', 'group': 'ctrl', 'plant': 'plant_4'},
{'weight': '4.50', 'group': 'ctrl', 'plant': 'plant_5'},
{'weight': '4.61', 'group': 'ctrl', 'plant': 'plant_6'},
{'weight': '5.17', 'group': 'ctrl', 'plant': 'plant_7'},
{'weight': '4.53', 'group': 'ctrl', 'plant': 'plant_8'},
{'weight': '5.33', 'group': 'ctrl', 'plant': 'plant_9'},
{'weight': '5.14', 'group': 'trt1', 'plant': 'plant_10'},
{'weight': '4.81', 'group': 'trt1', 'plant': 'plant_11'},
{'weight': '4.17', 'group': 'trt1', 'plant': 'plant_12'},
{'weight': '4.41', 'group': 'trt1', 'plant': 'plant_13'},
{'weight': '3.59', 'group': 'trt1', 'plant': 'plant_14'},
{'weight': '5.87', 'group': 'trt1', 'plant': 'plant_15'},
{'weight': '3.83', 'group': 'trt1', 'plant': 'plant_16'},
{'weight': '6.03', 'group': 'trt1', 'plant': 'plant_17'},
{'weight': '4.89', 'group': 'trt1', 'plant': 'plant_18'},
{'weight': '4.32', 'group': 'trt2', 'plant': 'plant_19'},
{'weight': '4.69', 'group': 'trt2', 'plant': 'plant_20'},
{'weight': '6.31', 'group': 'trt2', 'plant': 'plant_21'},
{'weight': '5.12', 'group': 'trt2', 'plant': 'plant_22'},
{'weight': '5.54', 'group': 'trt2', 'plant': 'plant_23'},
{'weight': '5.50', 'group': 'trt2', 'plant': 'plant_24'},
{'weight': '5.37', 'group': 'trt2', 'plant': 'plant_25'},
{'weight': '5.29', 'group': 'trt2', 'plant': 'plant_26'},
{'weight': '4.92', 'group': 'trt2', 'plant': 'plant_27'}]
Here’s how to perform a two-way ANOVA in Python:
Step 1: Load the required libraries and dataset
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
data = pd.read_csv('PlantGrowth.csv')
Step 2: Create a model formula and fit the model
model = ols('weight ~ C(treatment) + C(seed) + C(treatment):C(seed)', data).fit()
Here, ‘weight’ is the dependent variable, and ‘treatment’ and ‘seed’ are the two independent variables (factors). Note that the sample rows shown above only contain a ‘group’ column, so this formula assumes the CSV also carries ‘treatment’ and ‘seed’ columns; adapt the column names to the factors that actually exist in your data.
Step 3: Perform the two-way ANOVA using anova_lm()
anova_results = anova_lm(model, typ=2)
print(anova_results)
The typ parameter specifies the type of sum of squares to use. Here, we use type 2 sum of squares.
The anova_lm() function returns a table with the results of the ANOVA. The table includes the sum of squares, degrees of freedom, F-value, and p-value for each main effect and interaction effect.
Step 4: Interpret the results
The ANOVA table shows that both the main effects of ‘treatment’ and ‘seed’ are statistically significant, as well as the interaction effect between ‘treatment’ and ‘seed’. This suggests that both the type of treatment and the type of seed have a significant effect on the weight of the plants, and that the effect of the treatment depends on the type of seed.
In conclusion, performing a two-way ANOVA in Python is straightforward using the statsmodels package. It is important to ensure that the assumptions of the ANOVA are met before interpreting the results.
Finally, to close, ANOVA is a powerful statistical technique that can be used to compare the means of two or more groups. Whether you are testing the effectiveness of different treatments, analyzing the impact of a categorical variable, or trying to determine if there are significant differences between groups, ANOVA can help you identify these differences and draw meaningful conclusions. By using Python and its many data analysis libraries, you can easily perform ANOVA and other statistical tests on your data and gain valuable insights that can inform your decisions and actions. With the right approach and tools, ANOVA can be a valuable addition to your statistical toolbox.
Comments welcome!
Data Science
· 2020-11-07
-
A Premier on T-tests
T-tests are a class of statistical tests used to determine whether there is a significant difference between the means of two groups of data. T-tests are often used to compare the means of a sample to the population mean, or to compare the means of two independent samples or two paired samples.
Following are the most common types of t-tests, which we will cover below:
One-sample t-test: This test is used to compare the mean of a single sample to a known or hypothesized population mean.
Independent samples t-test: This test is used to compare the means of two independent groups.
Paired samples t-test: This test is used to compare the means of two dependent (paired) groups.
T-tests have several assumptions that need to be met in order for the test to be valid. The most important assumptions are:
Normality: The data should follow a normal distribution. This means that the sample means should be normally distributed.
Independence: The samples should be independent of each other. This means that the observations in one sample should not be related to the observations in the other sample.
Homogeneity of variances: The variances of the two samples should be approximately equal. This means that the spread of the data should be similar in both groups.
If these assumptions are not met, the results of the t-test may be invalid or misleading. There are also different types of t-tests that make different assumptions. For example, the paired samples t-test assumes that the differences between paired observations are normally distributed, while the independent samples t-test assumes that the two samples have equal variances. It’s important to carefully consider the assumptions of the test and to use caution when interpreting the results.
How to perform T-tests in Python
One-sample t-test
A one-sample t-test is used to compare the mean of a single sample to a known or hypothesized population mean. This test is useful for determining whether a sample differs significantly from the population mean.
To perform a one-sample t-test in Python, you can use the scipy.stats.ttest_1samp function. Here’s an example:
import numpy as np
from scipy.stats import ttest_1samp
# Generate a sample of data
data = np.random.normal(loc=10, scale=2, size=100)
# Set the hypothesized population mean
pop_mean = 9
# Perform the one-sample t-test
t_stat, p_val = ttest_1samp(data, pop_mean)
# Print the results
print("t-statistic: {:.3f}".format(t_stat))
print("p-value: {:.3f}".format(p_val))
In this example, we first generate a sample of data using the numpy.random.normal function, which generates a sample of data from a normal distribution with the specified mean (loc) and standard deviation (scale). We then set the hypothesized population mean to 9.
We then perform the one-sample t-test using the ttest_1samp function, which takes two arguments: the sample data and the hypothesized population mean. The function returns two values: the t-statistic and the p-value.
Finally, we print the results using the print function, formatting the t-statistic and p-value to three decimal places.
If the p-value is less than the significance level (usually 0.05), we can reject the null hypothesis and conclude that the sample mean differs significantly from the population mean. Otherwise, we fail to reject the null hypothesis and conclude that there is not enough evidence to suggest a significant difference between the sample mean and the population mean.
Independent samples t-test
An independent samples t-test is used to compare the means of two independent groups to determine if they are significantly different. This test is used when the two groups being compared are completely independent of each other.
To perform an independent samples t-test in Python, we can use the scipy.stats.ttest_ind function from the SciPy library. Here’s an example:
import numpy as np
from scipy.stats import ttest_ind
# Generate two independent samples of data
sample1 = np.random.normal(loc=10, scale=2, size=100)
sample2 = np.random.normal(loc=12, scale=2, size=100)
# Perform the independent samples t-test
t_stat, p_val = ttest_ind(sample1, sample2)
# Print the results
print("t-statistic: {:.3f}".format(t_stat))
print("p-value: {:.3f}".format(p_val))
In this example, we first generate two independent samples of data using the numpy.random.normal function. We then perform the independent samples t-test using the ttest_ind function, which takes two arguments: the two samples being compared. The function returns two values: the t-statistic and the p-value.
Finally, we print the results using the print function, formatting the t-statistic and p-value to three decimal places.
If the p-value is less than the significance level (usually 0.05), we can reject the null hypothesis and conclude that the means of the two groups are significantly different. Otherwise, we fail to reject the null hypothesis and conclude that there is not enough evidence to suggest a significant difference between the means of the two groups.
Paired samples t-test
A paired samples t-test is a statistical test used to determine whether there is a statistically significant difference between the means of two related groups. In other words, it helps us determine whether the two groups are significantly different from each other or not.
To perform a paired samples t-test in Python, we can use the scipy.stats module, which contains a variety of statistical functions including the ttest_rel() function. This function computes the t-test for two related samples of scores.
Here is an example code snippet for performing a paired samples t-test in Python:
import numpy as np
from scipy.stats import ttest_rel
# Create two related random samples of data
before = np.random.normal(5, 1, 100)
after = before + np.random.normal(1, 0.5, 100)
# Compute the t-test
t_stat, p_val = ttest_rel(before, after)
# Print the results
print("t-statistic: {}".format(t_stat))
print("p-value: {}".format(p_val))
In this example, we first create two related random samples of data using the numpy.random.normal() function. We create the second sample by adding some random noise to the first sample. We then compute the paired samples t-test for these two samples using the ttest_rel() function. The function returns two values: the t-statistic and the p-value.
Finally, we print the results of the test using the print() function. If the p-value is less than the significance level (usually 0.05), we can reject the null hypothesis and conclude that the means of the two related groups are significantly different. Otherwise, we fail to reject the null hypothesis and conclude that there is not enough evidence to suggest a significant difference between the means of the two related groups.
It’s important to note that a paired samples t-test assumes that the differences between the pairs of observations are normally distributed. If this assumption is not met, other tests or transformations may be needed. Additionally, like any statistical test, it’s important to carefully consider the context and limitations of the test and to avoid drawing causal conclusions from statistical associations alone.
To close, T-tests are useful because they provide a simple and easy-to-interpret method for comparing two groups of data. They are widely used in a variety of fields including psychology, medicine, education, and more. However, it’s important to note that t-tests have certain assumptions, such as normality of the data and equal variances, which need to be met for the test to be valid. It’s also important to use caution when interpreting t-test results and to consider the context and limitations of the test.
Comments welcome!
Data Science
· 2020-10-03
-
Statistical Hypothesis Testing
Hypothesis testing is a statistical method used to determine whether a hypothesis about a population parameter is supported by the data. It is a powerful tool for making decisions based on data, and is widely used in many fields including medicine, social sciences, and business.
The basic steps in hypothesis testing are as follows (a minimal end-to-end sketch in Python follows the list):
Formulate the null and alternative hypotheses: The null hypothesis is the statement that the population parameter is equal to a specified value, while the alternative hypothesis is the statement that the population parameter is not equal to the specified value. For example, if you want to test whether the mean height of a population is 65 inches, the null hypothesis would be “the mean height is equal to 65 inches” and the alternative hypothesis would be “the mean height is not equal to 65 inches.”
Choose a level of significance: The level of significance is the probability of rejecting the null hypothesis when it is actually true. Commonly used levels of significance are 0.05 (5%) and 0.01 (1%).
Collect data and calculate test statistic: Next, you need to collect a sample of data and calculate a test statistic, which is a measure of how far the sample data is from what is expected under the null hypothesis. The test statistic will depend on the type of test being used, such as t-test or chi-squared test.
Determine the p-value: The p-value is the probability of obtaining a test statistic as extreme or more extreme than the observed test statistic, assuming the null hypothesis is true. If the p-value is less than the chosen level of significance, then the null hypothesis is rejected and the alternative hypothesis is supported.
Interpret the results: Finally, the results of the hypothesis test need to be interpreted in the context of the problem being studied. If the null hypothesis is rejected, it may be concluded that there is evidence to support the alternative hypothesis. However, if the null hypothesis is not rejected, it cannot be concluded that the null hypothesis is true, only that there is not enough evidence to reject it.
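A minimal sketch tying these steps together, using a one-sample t-test on simulated data (the sample and the 65-inch null value are made up for illustration):
import numpy as np
from scipy.stats import ttest_1samp
# Step 1: H0: the population mean height is 65 inches; H1: it is not 65 inches
# Step 2: choose the significance level
alpha = 0.05
# Step 3: collect a sample (simulated here) and compute the test statistic
rng = np.random.default_rng(42)
sample = rng.normal(loc=66, scale=3, size=40)
t_stat, p_value = ttest_1samp(sample, popmean=65)
# Steps 4 and 5: compare the p-value to alpha and interpret
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject the null hypothesis: the mean appears to differ from 65 inches.")
else:
    print("Fail to reject the null hypothesis: not enough evidence of a difference.")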
Hypothesis testing is a powerful tool for making decisions based on data, but it is important to use it correctly and to interpret the results carefully. When conducting a hypothesis test, it is important to ensure that the assumptions of the test are met, and to choose the appropriate test based on the type of data being analyzed. By following the steps outlined above and taking care to interpret the results correctly, hypothesis testing can be a valuable tool for making evidence-based decisions.
There are many different types of hypothesis tests, each suited to different types of data and research questions. Here are a few of the most common types:
One-sample t-test: This test is used to compare the mean of a single sample to a known or hypothesized population mean.
Independent samples t-test: This test is used to compare the means of two independent groups.
Paired samples t-test: This test is used to compare the means of two dependent (paired) groups.
One-way ANOVA: This test is used to compare the means of more than two independent groups.
Two-way ANOVA: This test is used to compare the means of two or more independent groups while controlling for one or more other variables.
Chi-squared test: This test is used to compare the frequencies of categorical data between two or more groups.
Mann-Whitney U test: This non-parametric test is used to compare the medians of two independent groups when the data are not normally distributed.
Kruskal-Wallis test: This non-parametric test is used to compare the medians of more than two independent groups when the data are not normally distributed.
Wilcoxon signed-rank test: This non-parametric test is used to compare the medians of two dependent groups when the data are not normally distributed.
Friedman test: This non-parametric test is used to compare the medians of more than two dependent groups when the data are not normally distributed.
These are just a few examples of the many types of hypothesis tests that are used in statistical analysis. Choosing the right test for a given research question depends on the type of data being analyzed and the specific hypotheses being tested.
Comments welcome!
Data Science
· 2020-09-05
-
Important GCP Services that you need to Know Now
Introduction
Google Cloud Platform (GCP) is a cloud computing platform offered by Google. GCP provides a comprehensive set of tools and services for building, deploying, and managing cloud applications. It includes services for compute, storage, networking, machine learning, analytics, and more. Some of the most commonly used GCP services include Compute Engine, Cloud Storage, BigQuery, and Kubernetes Engine.
GCP is known for its powerful data analytics and machine learning capabilities. It offers a range of machine learning services that allow users to build, train, and deploy machine learning models at scale. GCP also provides powerful data analytics tools, including BigQuery, which allows users to analyze massive datasets quickly and easily.
GCP is a popular choice for businesses of all sizes, from small startups to large enterprises. It offers flexible pricing options, with pay-as-you-go and monthly subscription plans available. Additionally, GCP offers a range of tools and services to help businesses optimize their cloud costs, including cost management tools and usage analytics.
Some of the most commonly used GCP services are:
Google Compute Engine (GCE) - a virtual machine service for running applications on the cloud.
Google Kubernetes Engine (GKE) - a managed Kubernetes service for container orchestration.
Google Cloud Storage (GCS) - a scalable object storage service for unstructured data.
Google Cloud Bigtable - a NoSQL database service for large, mission-critical applications.
Google Cloud SQL - a fully managed relational database service.
Google Cloud Datastore - a NoSQL document database service for web and mobile applications.
Google Cloud Pub/Sub - a messaging service for real-time data delivery and streaming.
Google Cloud Dataproc - a fully managed cloud service for running Apache Hadoop and Apache Spark workloads.
Google Cloud ML Engine - a managed service for training and deploying machine learning models.
Google Cloud Vision API - an image analysis API that can identify objects, faces, and other visual content.
Google Cloud Speech-to-Text - a speech recognition service that transcribes audio files to text.
Google Cloud Text-to-Speech - a text-to-speech conversion service that creates natural-sounding speech from text input.
How to access GCP services
You can use the Cloud Client Libraries or call the Cloud APIs directly. To use the Cloud Client Libraries, you first need to authenticate your application. You can do this by creating a service account, downloading a JSON file containing your credentials, and setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of that file. Once authenticated, you can import the relevant client library and start using GCP services.
Alternatively, you can call the Cloud APIs directly by making REST requests. To do so, you need to authenticate and authorize your application by creating a service account and generating a private key, which you use to sign your requests via OAuth 2.0. Once authenticated, you can call the relevant API endpoints over HTTP.
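As a small illustration of the client-library route, the sketch below lists the Cloud Storage buckets in a project; it assumes the google-cloud-storage package is installed and that GOOGLE_APPLICATION_CREDENTIALS points at a valid service-account key:
from google.cloud import storage
# the client picks up credentials from GOOGLE_APPLICATION_CREDENTIALS automatically
client = storage.Client()
# list the buckets the service account can see
for bucket in client.list_buckets():
    print(bucket.name)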
Comments welcome!
Data Science
· 2020-09-03
-
Important Azure Services that you need to Know Now
Introduction
Azure is a cloud computing platform and set of services offered by Microsoft. It provides a wide range of services such as virtual machines, databases, storage, and networking, among others, that users can access and use to build, deploy, and manage their applications and services. Azure also offers a variety of tools and services to help users with tasks such as data analytics, artificial intelligence, and machine learning. Azure provides a pay-as-you-go pricing model, allowing users to only pay for the services they use.
Key Services
Azure Virtual Machines: a cloud computing service that allows users to create and manage virtual machines in the cloud.
Azure App Service: a platform as a service (PaaS) offering that allows developers to build, deploy, and scale web and mobile apps.
Azure Functions: a serverless computing service that allows developers to run small pieces of code (functions) in the cloud.
Azure Blob Storage: a cloud storage service that allows users to store and access large amounts of unstructured data.
Azure SQL Database: a fully managed relational database service that allows users to build, deploy, and manage applications with a variety of languages and frameworks.
Azure Active Directory: a cloud-based identity and access management service that provides secure access and single sign-on to various cloud applications.
Azure Cosmos DB: a globally distributed, multi-model database service that allows users to manage and store large volumes of data with low latency and high availability.
Azure Machine Learning: a cloud-based machine learning service that allows users to build, train, and deploy machine learning models at scale.
Azure DevOps: a set of services that provides development teams with everything they need to plan, build, test, and deploy applications.
Azure Kubernetes Service: a fully managed Kubernetes container orchestration service that allows users to deploy and manage containerized applications at scale.
How to access the services
Azure Portal: The Azure Portal is a web-based user interface that provides access to Azure services. Users can log in and manage their resources in the Azure Portal.
Azure CLI: The Azure Command-Line Interface (CLI) is a cross-platform command-line tool that allows you to manage Azure resources.
Azure PowerShell: Azure PowerShell is a command-line tool that allows users to manage Azure resources using Windows PowerShell.
Azure SDKs: Azure provides Software Development Kits (SDKs) for various programming languages, such as .NET, Java, Python, Ruby, and Node.js. These SDKs provide libraries and tools for interacting with Azure services (see the Python sketch after this list).
REST APIs: Azure services can be accessed using REST APIs. Developers can use any programming language that supports HTTP/HTTPS to interact with Azure services.
Azure Functions: Azure Functions is a serverless compute service that allows you to run code on demand. You can use Azure Functions to access Azure services.
Azure Logic Apps: Azure Logic Apps is a cloud-based service that allows you to create workflows that integrate with various Azure services.
Azure DevOps: Azure DevOps is a set of development tools that includes features such as source control, continuous integration, and continuous delivery. Developers can use Azure DevOps to manage and deploy their applications to Azure services.
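As an illustration of the SDK route, here is a minimal Python sketch using the azure-identity and azure-storage-blob packages; the storage account URL is a placeholder, not a value from this article.
# Minimal sketch, assuming azure-identity and azure-storage-blob are installed
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient
credential = DefaultAzureCredential()  # tries environment variables, managed identity, Azure CLI login, etc.
service = BlobServiceClient(account_url="https://<your-storage-account>.blob.core.windows.net",  # placeholder account URL
                            credential=credential)
for container in service.list_containers():  # list the blob containers in the storage account
    print(container.name)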
Comments welcome!
Data Science
· 2020-08-06
-
Statistical Distributions
In this article we will cover some distributions that I have found useful while analysing data. I have split them based on whether they are for a continuous or a discrete random variable. For each distribution I first give a small theoretical introduction and its probability density function, and then show how to use Python to represent it graphically.
Continuous Distributions:
Uniform distribution
Normal Distribution, also known as Gaussian distribution
Standard Normal Distribution - case of normal distribution where loc or mean = 0 and scale or sd = 1
Gamma distribution - exponential, chi-squared, erlang distributions are special cases of the gamma distribution
Erlang distribution - special form of Gamma distribution when a is an integer
Exponential distribution - special form of Gamma distribution with a=1
Lognormal - not covered
Chi-Squared - not covered
Weibull - not covered
t Distribution - not covered
F Distribution - not covered
Discrete Distributions:
Poisson distribution is a limiting case of a binomial distribution under the following conditions: n tends to infinity, p tends to zero and np is finite
Binomial Distribution
Negative Binomial - not covered
Bernoulli Distribution is a special case of the binomial distribution where a single trial is conducted (n=1)
Geometric - not covered
Let’s import some basic libraries that we will be using:
import numpy as np
import pandas as pd
import scipy.stats as spss
import plotly.express as px
import seaborn as sns
Continuous Distributions
Uniform distribution
As the name suggests, in a uniform distribution the probability of all outcomes is the same. The shape of this distribution is a rectangle. Now, let’s plot this using Python. First we will generate an array of random variables using scipy. We will specifically use the scipy.stats.uniform.rvs function with the following three inputs:
size specifies number of random variates
loc corresponds to the lower bound of the distribution
scale corresponds to the width of the distribution, so values are drawn from [loc, loc + scale]
rv_array = spss.uniform.rvs(size=10000, loc = 10, scale=20)
Now we can plot this using the plotly library or the seaborn library. In fact seaborn has a couple of different functions, namely distplot and histplot, both of which can be used to visually inspect the uniform data. Let’s see the examples one by one:
We can directly plot the data from the array:
px.histogram(rv_array) # plotted using plotly express
sns.histplot(rv_array, kde=True) # plotted using seaborn
Or we can convert the array into a dataframe and then plot the dataframe:
rv_df = pd.DataFrame(rv_array, columns=['value_of_random_variable'])
px.histogram(rv_df, x='value_of_random_variable', nbins=20) # plotted using plotly express
sns.histplot(data=rv_df, x='value_of_random_variable', kde=True) # plotted using seaborn
Normal Distribution, also known as Gaussian distribution:
The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena.
The normal distribution is a limiting case of the Poisson distribution as the parameter lambda tends to infinity; it also arises as a limiting case of the binomial distribution when the number of trials is large.
This distribution has a bell-shaped density curve described by its mean and standard deviation. The mean represents the location and the sd represents the spread of the distribution. The curve shows that data near the mean occur more frequently than data far from the mean.
Let’s plot it using seaborn:
rv_array = spss.norm.rvs(size=10000,loc=10,scale=100) # size specifies number of random variates, loc corresponds to mean, scale corresponds to standard deviation
sns.histplot(rv_array, kde=True)
We can add x and y labels, change the number of bins, the color of bars, etc. With distplot (deprecated in newer versions of seaborn in favor of histplot/displot) we can supply additional arguments for adjusting the width of bars, transparency, etc.
ax = sns.distplot(rv_array, bins=100, kde=True, color='cornflowerblue', hist_kws={"linewidth": 15,'alpha':1})
ax.set(xlabel='Normal Distribution', ylabel='Frequency')
Standard Normal Distribution
This is a special case of the normal distribution where mean = 0 and sd = 1
Let’s plot it using seaborn:
rv_array = spss.norm.rvs(size=10000,loc=0,scale=1)
sns.histplot(rv_array, kde=True)
Gamma distribution is a two-parameter family of continuous probability distributions
The exponential, chi-squared, and Erlang distributions are special cases of the gamma distribution
Let’s plot it using seaborn:
rv_array = spss.gamma.rvs(a=5, size=10000) # size specifies number of random variates, a is the shape parameter
sns.distplot(rv_array, kde=True)
Erlang distribution
Special case of Gamma distribution when a is an integer.
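Although the original write-up does not plot this one, a quick sketch using the same spss and sns aliases imported above could look like this (a=5 is an arbitrary integer shape chosen for illustration):
rv_array = spss.erlang.rvs(a=5, scale=1, size=10000) # a is the integer shape parameter, scale is the scale parameter
sns.histplot(rv_array, kde=True)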
Exponential distribution
Special case of Gamma distribution with a=1.
Exponential distribution describes the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate.
Let’s plot it using seaborn:
rv_array = spss.expon.rvs(scale=1,loc=0,size=1000) # size specifies number of random variates, loc shifts the distribution, scale is 1/lambda (which equals the mean)
sns.distplot(rv_array, kde=True)
Discrete Distributions
Binomial Distribution
Distribution where only two outcomes are possible, such as success or failure, gain or loss, win or lose, and the probability of success is the same for all trials. The two outcomes need not be equally likely, and each trial is independent of the others.
The probability of observing k successes in n trials is given by the equation: f(k;n,p) = nCk * (p^k) * ((1-p)^(n-k))
Where, nCk = (n)! / ((k)! * (n-k)!)
n=total number of trials
p=probability of success in each trial
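As a quick numerical check of this formula (using the spss alias from the imports above), we can evaluate the probability of, say, k=8 successes in n=10 trials with p=0.8:
p_8 = spss.binom.pmf(k=8, n=10, p=0.8) # nCk * p^k * (1-p)^(n-k)
print(round(p_8, 4)) # approximately 0.302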
Let’s plot it using seaborn:
rv_array = spss.binom.rvs(n=10,p=0.8,size=10000) # n = number of trials, p = probability of success, size = number of times to repeat the trials
sns.distplot(rv_array, kde=True)
Poisson Distribution
Poisson random variable is typically used to model the number of times an event happened in a time interval. For example, the number of users registered for a web service in an interval can be thought of as a Poisson process. The Poisson distribution is described in terms of the rate (μ) at which the events happen; the average number of events in an interval is designated λ (lambda), which is the event rate multiplied by the length of the interval and is also called the rate parameter (the scipy functions below call it mu).
The probability of observing k events in an interval is given by the equation: P(k events in interval) = e^(-lambda) * (lambda^k / k!)
Poisson distribution is a limiting case of a binomial distribution under the following conditions:
The number of trials is indefinitely large or n tends to infinity
The probability of success for each trial is same and indefinitely small or p tends to zero
np = lambda, is finite.
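As a quick numerical example of the formula above (again using the spss alias), the probability of observing k=2 events when the event rate lambda is 3:
p_2 = spss.poisson.pmf(k=2, mu=3) # e^(-lambda) * lambda^k / k!
print(round(p_2, 4)) # approximately 0.224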
Let’s plot it using seaborn:
rv_array = spss.poisson.rvs(mu=3, size=10000) # mu is the rate parameter (lambda), size specifies number of random variates
sns.distplot(rv_array, kde=True)
Bernoulli distribution
This distribution has only two possible outcomes, 1 (success) and 0 (failure), and a single trial, for example, a coin toss. The random variable X which has a Bernoulli distribution can take value 1 with the probability of success, p, and the value 0 with the probability of failure, q or 1-p. The probabilities of success and failure need not be equal.
Probability mass function of Bernoulli distribution: f(k;p) = (p^k) * ((1-p)^(1-k))
Bernoulli distribution is a special case of the binomial distribution where a single trial is conducted (n=1)
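A one-line check of the probability mass function with the spss alias, using the same p=0.6 that the plot below uses:
print(spss.bernoulli.pmf(k=1, p=0.6), spss.bernoulli.pmf(k=0, p=0.6)) # 0.6 and 0.4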
Let’s plot it using seaborn:
rv_array = spss.bernoulli.rvs(size=10000,p=0.6) # p = probability of success, size = number of times to repeat the trial
sns.distplot(rv_array, kde=True)
Hope you found this summary of distributions useful. I refer to this from time to time to jog my memory on the various distributions.
Comments welcome!
Data Science
· 2020-08-01
-
Visualize data using SAS
This is the third of a series of articles that I will write to give a gentle introduction to statistics. In this article we will cover how we can visualize data using various charts and how to read them. I will show how to create these charts using SAS and will include code snippets as well. For a full version of the code visit my GitHub repository.
SAS has an in-built procedure called sgplot that allows you to create several kinds of plots. Also available is proc univariate which allows you to create histograms and normal probability plots, also known as the QQ plots. In this article we will work with the tips dataset that we also used for our Python demonstration.
Before we start plotting, we need to import the dataset. In SAS we do this using proc import.
proc import datafile='/home/u50248307/data/tips.csv'
out=tips
dbms=csv
replace;
getnames=yes;
run;
Once we have imported the dataset, we can view it using the proc print statement.
proc print data=tips;
run;
Let’s take a quick look at how the tips dataset is structured:
We can further see some summary information on the dataset using proc contents statement.
proc contents data=tips;
run;
You will notice that I am ending all lines with a semicolon. Unlike Python, SAS does not depend on indentation to show the scope of statements. Therefore we use the semicolon, just like in C++, to signify the end of a statement.
Now let’s move to visualizing this data. We will cover the following charts in this article:
Dot plot shows changes between two (or more) points in time or between two (or more) conditions.
proc sgplot data=tips;
title 'Mean of Total bill by Day';
dot day / response=total_bill stat=mean;
xaxis label='Mean of Total Bill';
yaxis label='Day';
run;
proc sgplot data=tips;
title 'Mean of Total bill by Day by Gender';
dot day / response=total_bill group=sex stat=mean;
xaxis label='Mean of Total Bill';
yaxis label='Day';
run;
Bar (horizontal and vertical) chart is used when you want to show a distribution of data points or perform a comparison of metric values across different subgroups of your data.
* horizontal bar chart;
proc sgplot data=tips;
title 'Mean Total bill by Day';
/*hbar day;*/ /*if no other option is specified then it just shows row frequency by cat variable*/
hbar day / response=total_bill stat=mean;
/*hbar day / response=tip stat=mean y2axis;*/
run;
* vertical bar chart;
proc sgplot data=tips;
title 'Mean Total bill by Day';
vbar day / response=total_bill stat=mean;
run;
* XAXISTABLE and YAXISTABLE statements create axis tables which display data values at specific locations along an axis. The only required argument is a list of one or more variables to be displayed;
proc sgplot data=tips;
title 'Mean Total bill by Day: XAXISTABLE and YAXISTABLES Example';
vbar day / response=total_bill stat=mean;
xaxistable tip size / stat=mean position=top;
xaxistable var1 / stat=freq label="N"; /* the var1 variable was chosen arbitrarily in order to obtain frequency counts of the number of records in each category */
run;
Stacked Bar chart is useful when you want to show more than one categorical variable per bar
* Offset Dual Horizontal Bar Plot;
proc sgplot data=tips;
title 'Dual Bar Chart: Mean Total bill and Tip by Day';
hbar day / response=total_bill stat=mean barwidth=0.25 discreteoffset=-0.15;
hbar day / response=tip stat=mean barwidth=0.25 discreteoffset=0.15 y2axis;
run;
* Offset Dual Vertical Bar Plot;
proc sgplot data=tips;
title 'Dual Vertical Bar Chart: Mean Total bill and Tip by Day';
vbar day / response=total_bill stat=mean barwidth=0.25 discreteoffset=-0.15;
vbar day / response=tip stat=mean barwidth=0.25 discreteoffset=0.15 y2axis;
run;
* Stacked Bar Plot;
proc sgplot data=tips;
title 'Stacked Bar Chart with Data Labels';
vbar day / response=total_bill group=sex stat=mean datalabel datalabelattrs=(weight=bold);
xaxis display=(nolabel);
yaxis grid label='total_bill';
run;
Needle plot is similar to a bar plot and a scatter plot; it can be used to plot datasets that have too many data points for a bar plot to be meaningful.
proc sgplot data=tips;
title 'Needle Chart: Total bill by Meal size';
needle x=size y=total_bill ;
run;
proc sgplot data=tips;
title 'Needle Chart: Total bill by Day';
needle x=day y=tip ;
run;
Boxplot (horizontal and vertical) In a box plot, numerical data is divided into quartiles, and a box is drawn between the first and third quartiles, with an additional line drawn along the second quartile to mark the median. In some box plots, the minimums and maximums outside the first and third quartiles are depicted with lines, which are often called whiskers.
* Vertical Box plot;
proc sgplot data=tips;
title 'Vertical Box plot';
vbox total_bill / category=day boxwidth=0.25 discreteoffset=-0.15;
vbox tip / category=day boxwidth=0.25 discreteoffset=0.15 y2axis;
run;
* Horizontal Box plot;
proc sgplot data=tips;
title 'Horizontal Box plot';
hbox total_bill / category=day boxwidth=0.25 discreteoffset=-0.15;
hbox tip / category=day boxwidth=0.25 discreteoffset=0.15 y2axis;
run;
Histogram is a visual representation of the frequency distribution of your data. The frequencies are represented by bars.
proc sgplot data=tips;
title'Histogram using Proc Sgplot';
histogram total_bill;
run;
proc univariate data=tips;
title'Histogram using Proc Univariate';
histogram total_bill;
run;
Probability Plot is a way of visually comparing the data coming from different distributions. It can be of two types - pp plot or qq plot
pp plot (Probability-to-Probability) is a way to visualize how the cumulative distribution functions (CDFs) of two distributions (empirical and theoretical) compare against each other.
qq plot (Quantile-to-Quantile) is used to compare the quantiles of two distributions. The quantiles can be defined as continuous intervals with equal probabilities, or as points that divide the samples in the same way. The distributions may be theoretical or sample distributions from a process, etc.
Normal probability plot is a special case of the qq plot. It is a way of checking whether the dataset is normally distributed or not.
proc univariate data=tips;
title'Normal probability (QQ) plot using Proc Univariate';
probplot total_bill;
run;
Scatter plot shows the relationship between two numerical variables.
proc sgplot data=tips;
title "total bill vs tip by gender";
scatter x=total_bill y=tip / group=sex markerattrs=(symbol=Square size=10px);
/* SYMBOL: Circle, CircleFilled, Square, Star, Plus, X
SIZE: 0.2in, 3mm, 10pt, 5px, 25pct
COLOR: red, blue, lightscreen, aquamarine, CXFFFFFF */
refline 6 / axis=y lineattrs=(color=green thickness=3px pattern=ShortDashDot); /* REFLINE statement adds horizontal or vertical reference lines to a plot. Its unnamed required argument is a numeric variable, value, or list of values. A reference line will be added for each value listed or for each value of the variable specified. */
run;
* Scatter plot with attribute cycling - when multiple lists of attributes are specified on the STYLEATTRS statement (for example, a list of marker shapes and a list of marker colors);
proc sgplot data=tips;
title "total bill vs tip by gender";
styleattrs datasymbols=(SquareFilled CircleFilled) datacontrastcolors=(purple green);
scatter x=total_bill y=tip / group=sex markerattrs=(size=10px);
/* SYMBOL: Circle, CircleFilled, Square, Star, Plus, X
SIZE: 0.2in, 3mm, 10pt, 5px, 25pct
COLOR: red, blue, lightscreen, aquamarine, CXFFFFFF */
run;
/* SYMBOLCHAR statement is used to define a marker symbol from a Unicode value. */
proc sgplot data=tips;
title "Total bill vs Tip by Gender";
scatter x=total_bill y=tip / group=sex markerattrs=(size=40);
symbolchar name=female_sign char="2640"x; /* identifiers “female_sign” and “male_sign” are arbitrary names */
symbolchar name=male_sign char="2642"x;
styleattrs datasymbols=(female_sign male_sign);
run;
/* Using Data as a Symbol Marker */
proc sgplot data=tips;
title "Total bill vs Tip by Gender";
scatter x=total_bill y=tip / group=sex markerchar=sex markercharattrs=(weight=bold size=10pt);
run;
Line plot is used to visualize the value of something over time. VLINE statement is used to create a vertical line chart (which consists of horizontal lines). The endpoints of the line segments are statistics based on a categorical variable as opposed to raw data values.
LOCATION=Specifies whether legend will appear INSIDE or OUTSIDE (default) the axis area.
POSITION=Specifies the position of the legend: TOP, BOTTOM (default), LEFT, RIGHT, TOPLEFT, TOPRIGHT, BOTTOMLEFT, BOTTOMRIGHT
DOWN=Specifies number of rows in legend
ACROSS=Specifies number of columns in legend
TITLEATTRS=Specifies text attributes of legend title
VALUEATTRS=Specifies text attributes of legend values
* Basic line plot;
proc sgplot data=tips;
title 'Line chart showing Average total bill by Day';
vline day / response=total_bill stat=mean markers;
run;
* Line Chart with Dual Axes;
proc sgplot data=tips;
title 'Line Chart with Dual Axes';
vline day / response=total_bill stat=mean markers;
vline day / response=tip stat=mean markers y2axis;
run;
* Line Chart by group with Modifying Line Attributes and Legend;
proc sgplot data=tips;
title 'Line Chart by group with Modifying Line Attributes and Legend';
styleattrs datasymbols=(TriangleFilled CircleFilled) datalinepatterns=(ShortDash LongDash);
vline day / response=total_bill stat=mean markers group=sex lineattrs=(thickness=4px);
keylegend / location=inside position=topleft across=1 titleattrs=(weight=bold size=12pt) valueattrs=(color=green size=12pt);
run;
* XAXIS and YAXIS statements are used to control the features and structure of the X and Y axes, respectively;
proc sgplot data=tips;
title 'Line plot: XAXIS and YAXIS statements';
vline size / response=total_bill stat=mean;
vline size / response=tip stat=mean y2axis;
yaxis min=0 max=40 minor minorcount=9 valueattrs=(style=italic) label='Total Bill ($)';
y2axis offsetmin=0.1 offsetmax=0.1 labelattrs=(color=purple);
/* offsets are proportional to axis length, so between 0 and 1 */
run;
The best way to get better at visualization is through practice. What I have found useful is participating in a weekly visualization challenge called the TidyTuesday!
Comments welcome!
Data Science
· 2020-07-04
-
Visualize data using Python
This is the second of a series of articles that I will write to give a gentle introduction to statistics. In this article we will cover how we can visualize data using various charts and how to read them. I will show how to create these charts using Python and will include code snippets as well. For a full version of the code visit my GitHub repository.
Python has many libraries that allow creating visually appealing charts. In this article we will work with the in-built tips dataset and then plot using the following libraries:
import seaborn as sns
tips = sns.load_dataset("tips") # tips dataset can be loaded from seaborn
sns.get_dataset_names() # to get a list of other available datasets
import plotly.express as px
tips = px.data.tips() # tips dataset can be loaded from plotly
# data_canada = px.data.gapminder().query("country == 'Canada'")
import pandas as pd
tips.to_csv('/Users/vivekparashar/Downloads/tips.csv') # we can save the dataset into a csv and then load it into SAS or R for plotting
import altair as alt
import statsmodels.api as sm
Let’s take a quick look at how the tips dataset is structured:
We will cover the following charts in this article:
Dot plot shows changes between two (or more) points in time or between two (or more) conditions.
# Using plotly library
t = tips.groupby(['day','sex']).mean()[['total_bill']].reset_index()
px.scatter(t, x='day', y='total_bill', color='sex',
title='Average bill by gender by day',
labels={'day':'Day of the week', 'total_bill':'Average Bill in $'})
Bar (horizontal and vertical) chart is used when you want to show a distribution of data points or perform a comparison of metric values across different subgroups of your data.
# Using pandas plot
tips.groupby('sex').mean()['total_bill'].plot(kind='bar')
tips.groupby('sex').mean()['tip'].plot(kind='barh')
# Using plotly
t = tips.groupby(['day','sex']).mean()[['total_bill']].reset_index()
px.bar(t, x='day', y='total_bill') # Using plotly
px.bar(t, x='total_bill', y="day", orientation='h')
Stacked Bar chart is useful when you want to show more than one categorical variable per bar
# using pandas plot; kind='barh' for horizontal plot
# need to unstack one of the levels and fill na values
tips.groupby(['day','sex']).mean()[['total_bill']]\
.unstack('sex').fillna(0)\
.plot(kind='bar', stacked=True)
# Using plotly
t = tips.groupby(['day','sex']).mean()[['total_bill']].reset_index()
px.bar(t, x="day", y="total_bill", color="sex", title="Average bill by Gender and Day") # vertical
px.bar(t, x="total_bill", y="day", color="sex", title="Average bill by Gender and Day", orientation='h') # horizontal
Boxplot (horizontal and vertical) In a box plot, numerical data is divided into quartiles, and a box is drawn between the first and third quartiles, with an additional line drawn along the second quartile to mark the median. In some box plots, the minimums and maximums outside the first and third quartiles are depicted with lines, which are often called whiskers.
# using pandas plot
# specify y=<variable> for a vertical box plot and x=<variable> for a horizontal box plot
tips[['total_bill']].plot(kind='box')
# using plotly
px.box(tips, y='total_bill')
# using seaborn
sns.boxplot(y=tips["total_bill"])
Violin plot is a variation of box plot
# Using seaborn
sns.violinplot(y=tips.total_bill)
sns.violinplot(data=tips, x='day', y='total_bill',
hue='smoker',
palette='muted', split=True,
scale='count', inner='quartile',
order=['Thur','Fri','Sat','Sun'])
sns.catplot(x='sex', y='total_bill',
hue='smoker', col='time',
data=tips, kind='violin', split=True,
height=4, aspect=.7)
Histogram is a visual representation of the frequency distribution of your data. The frequencies are represented by bars.
# using pandas plot
tips.total_bill.plot(kind='hist')
# using plotly
px.histogram(tips, x="total_bill")
# using seaborn
sns.histplot(data=tips, x="total_bill")
# using altair
alt.Chart(tips).mark_bar().encode(alt.X('total_bill:Q', bin=True),y='count()')
Probability Plot is a way of visually comparing the data coming from different distributions. It can be of two types - pp plot or qq plot
pp plot (Probability-to-Probability) is a way to visualize how the cumulative distribution functions (CDFs) of two distributions (empirical and theoretical) compare against each other.
qq plot (Quantile-to-Quantile) is used to compare the quantiles of two distributions. The quantiles can be defined as continuous intervals with equal probabilities, or as points that divide the samples in the same way. The distributions may be theoretical or sample distributions from a process, etc.
Normal probability plot is a special case of the qq plot. It is a way of checking whether the dataset is normally distributed or not.
# using statsmodels
import statsmodels.graphics.gofplots as sm
import numpy as np
sm.ProbPlot(np.array(tips.total_bill)).ppplot(line='s')
sm.ProbPlot(np.array(tips.total_bill)).qqplot(line='s')
Scatter plot shows the relationship between two numerical variables.
# using plotly
px.scatter(tips, x='total_bill', y='tip', color='sex', size='size', hover_data=['day'])
# using pandas plot
tips.plot(x='total_bill', y='tip', kind='scatter')
Reg plot creates a regression line between 2 parameters and helps to visualize their linear relationship
# using seaborn
sns.regplot(x="total_bill", y="tip", data=tips, marker='+')
# for categorical variables we can add jitter to see overlapping points
sns.regplot(x="size", y="total_bill", data=tips, x_jitter=.1)
Line plot is used to visualize the value of something over time
# using pandas plot
tips['total_bill'].plot(kind='line')
# using plotly
px.line(tips, y='total_bill', title='Total bill')
t = tips.groupby('day').sum()[['total_bill']].reset_index()
px.line(t, x='day',y='total_bill', title='Total bill by day')
# using altair
alt.Chart(t).mark_line().encode(x='day', y='total_bill')
# using seaborn
sns.lineplot(data=t, x='day', y='total_bill')
Area plot is like a line chart in terms of how data values are plotted on the chart and connected using line segments. In an area plot, however, the area between the line segments and the x-axis is filled with color.
# using pandas plot
tips.groupby('day').sum()[['total_bill']].plot(kind='area')
# stacked area can be done using pandas.plot as well
t = tips.groupby(['day','sex']).count()[['total_bill']].reset_index()
t_pivoted = t.pivot(index='day', columns='sex', values='total_bill')
t_pivoted.plot.area()
# using plotly
px.area(t, x='day', y='total_bill', color='sex',line_group='sex')
# using altair
alt.Chart(t).mark_area().encode(x='day', y='total_bill')
Pie chart is a circular statistical graphic, which is divided into slices to illustrate numerical proportion. In a pie chart, the arc length of each slice is proportional to the quantity it represents.
# using pandas plot
tips.groupby('sex').count()['tip'].plot(kind='pie')
# using plotly
px.pie(tips, values='tip', names='day')
Sunburst chart is ideal for displaying hierarchical data. Each level of the hierarchy is represented by one ring or circle with the innermost circle as the top of the hierarchy.
px.sunburst(tips, path=['sex', 'day', 'time'], values='total_bill', color='day')
Radar chart is a graphical method of displaying multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point.
# using plotly
t = tips.groupby('day').mean()[['total_bill']].reset_index()
px.line_polar(t, r='total_bill', theta='day', line_close=True)
The best way to get better at visualization is through practice. What I have found useful is participating in a weekly visualization challenge called the TidyTuesday!
Comments welcome!
Data Science
· 2020-06-06
-
Describe your data using Python
This is the first of a series of articles that I will write to give a gentle introduction to statistics. In this article we will introduce some basic statistical concepts and learn how to use basic statistics to help you describe your data.
We will cover the following topics in this article:
The difference between a population and a sample
The difference between Descriptive and Inferential statistics
Different types of variables
Types of descriptive statistics
Normal or Gaussian distribution
The difference between a population and a sample:
Population denotes a large group consisting of elements having at least one common feature; it is the complete set of observations
Sample is a finite subset of the population; it is a subset of observations from a population. We get a sample from the population in either of the following ways
Representative sampling - here the sample’s characteristics are similar to the population characteristics
- A simple random sample is the most common approach to obtain a representative sample (see the sketch after this list)
- A systematic random sample
- A cluster random sample
- A stratified random sample
Convenience sampling - here we collect the sample from a section of the population that is easily available
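As an illustration, drawing a simple random sample (and, for comparison, a stratified random sample) is a one-liner with pandas. This minimal sketch uses the seaborn tips dataset that the later examples in this article also use; the sample sizes are arbitrary, and the grouped sample requires pandas 1.1 or newer.
import seaborn as sns
tips = sns.load_dataset('tips')
simple_random = tips.sample(n=50, random_state=1) # simple random sample of 50 rows
stratified = tips.groupby('day').sample(frac=0.2, random_state=1) # stratified random sample: 20% from each day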
The difference between Descriptive and Inferential statistics:
Descriptive statistics - it’s all about organizing, describing and summarizing data
Exploratory data analysis (EDA)
measures of location - such as Mean, Median, Mode
measures of variability or dispersion - such as Variance, Standard deviation, Range, Inter quartile range (IQR)
Inferential statistics - it’s all about drawing conclusions about a population from analysis of a random sample drawn from the population
Exploratory modelling - how is x related to y?
Predictive modelling - if you know x, can you predict y?
Different types of variables:
Quantitative
Discrete: a variable whose value is obtained by counting. Example, number of students in a class
Continuous: a variable whose value is obtained by measuring. Example, height of all students in a class
Interval: a scale of measurement where data is rank ordered and the spacing between values is meaningful, but there is no true zero (e.g., temperature in Celsius)
Ratio: a scale of measurement with all the properties of an interval scale plus a true zero, so ratios of values are meaningful (e.g., height)
Qualitative or Categorical
Nominal: example gender - female or male
Ordinal: example size - small, medium, or large
Types of descriptive statistics:
Measures of location: mainly measures of central tendency
Mean: sum of all values divided by the number of values
import seaborn as sns
tips = sns.load_dataset('tips')
tips.mean() # shows mean of all numeric variables
Median: middle value in a given sequence of values ordered by rank
tips.median() # shows median of all numeric variables
Mode: most frequent value in a set of values
tips.mode() # shows mode of all variables
Measures of variability, spread or dispersion
Range: Maximum value - Minimum value
bill_range = tips.total_bill.max() - tips.total_bill.min() # range of total_bill (named bill_range to avoid shadowing the Python built-in range)
IQR (Inter quartile range): 75th percentile - 25th percentile
tips.total_bill.quantile(.75) - tips.total_bill.quantile(.25) # IQR
Variance: Measure of variability of data around the mean
tips.total_bill.var() # variance of total_bill variable
Standard deviation: how spread out the data is, i.e. how much variance there is from the mean
tips.total_bill.std() # standard deviation of total_bill variable
Coefficient of variation (C.V.): the standard deviation expressed as a percentage of the mean
cv = lambda x: x.std() / x.mean() * 100
cv(tips.total_bill)
Measures of symmetry and peakedness: Skewness measures symmetry and Kurtosis measures peakedness
Normal or Gaussian distribution
This is one of the most common statistical distributions. The curve of this distribution is shaped like a bell.
The shape of the bell depends on mean and standard deviation of the data
Larger the standard deviation, wider the distribution
A tip to quickly assess normality is to see if mean and median are nearly equal
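A minimal sketch of that quick check on the tips data loaded above (only a rough heuristic, not a formal normality test):
print(tips.total_bill.mean(), tips.total_bill.median()) # roughly 19.8 vs 17.8 for total_bill
# a large gap between mean and median (relative to the spread) suggests skew rather than normality
print((tips.total_bill.mean() - tips.total_bill.median()) / tips.total_bill.std())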
Skewness and Kurtosis
Skewness measures the tendency of data to be more spread out on one side of the mean than the other. The skewness value indicates
Negative value indicates the data is left skewed
Positive value indicates the data is right skewed
Closer to zero for the data to be normally distributed
import scipy.stats as s
s.skew(tips.total_bill, bias=False) #calculate sample skewness
Kurtosis measures the tendency of data to be concentrated around the center or the tails. The kurtosis value indicates
Platykurtic: Negative value indicates lower than normal peakedness
Leptokurtic: Positive value indicates higher than normal peakedness
Mesokurtic: Closer to zero for the data to be normally distributed
import scipy.stats as s
s.kurtosis(tips.total_bill, bias=False) #calculate sample kurtosis
Comments welcome!
Data Science
· 2020-05-02
-
Optimizing Retention through Machine Learning
Acquiring a new customer in the financial services sector can be as much as five to 25 times more expensive than retaining an existing one. Therefore, prevention of customer churn is of paramount importance for the business. Advances in the area of Machine Learning, availability of large amounts of customer data, and more sophisticated methods for predicting churn can help devise a data backed strategy to prevent customers from churning.
Imagine that you are a large bank facing a challenge in this area. You are witnessing an increasing number of customers churning, which has started hitting your profit margin. You establish a team of analysts to review your current customer development and retention program. The analysts quickly uncover that the current program is a patchwork of mostly reactive strategies applied in various silos within the bank. However, the upside is that the bank has already collected rich data on customer interactions that could possibly help get a deeper understanding of reasons for churn.
Based on this initial assessment, the team recommends a data driven retention solution which uses machine learning to identify the reasons for churn and possible measures to prevent it. The solution consists of an array of sub-solutions focused towards specific areas of retention. The first level of sub-solutions consists of insights that can be directly derived from the existing customer data, answering for example the following business questions:
Churn History Analysis: What are characteristics of churning customers? Are there any events that indicate an increased probability for churn, like long periods without contact to the customer, several months of default on a credit product etc.?
Customer Segmentation: Are there groups of customers that have similar behavior and characteristics? Do any of these groups show higher churn rates?
Customer Profitability: How much profit is the business generating by different customers? What are characteristics of profitable customers?
First results can be drawn from these analyses. Additional insights are generated by combining them with data points such as the historical monthly profit that a business loses due to churn. Further, the data can be used for training supervised machine learning models which allow predicting future months or help classify customers for which rich data is not available yet. This is the idea behind the second level of sub-solutions.
Customer Life Time Value: What is the expected profitability for a given customer in the future?
Churn Prediction: Which customers are at risk of churn? For which customers can a quick intervention improve retention?
The early detection of customers at risk of churn is crucial for improving retention. However, not only is it beneficial to know the churn likelihood but also the expected profit loss that is connected with each customer in case of churn. Constant and fast advances in the area of Machine Learning help to improve these results.
Being able to process large amounts of data allows for more customized results that are focused on the individuality of each customer. This is an important point as every customer has different preferences when it comes to contact with the bank, different reactions when it comes to offers and different needs and goals. Combining previously mentioned analyses and a large amount of customer data provides the third level of sub-solutions which allow individualized prescriptive solutions for at-risk customers. The idea behind this prescriptive retention solution is the simulation of alternative paths combined with optimization techniques along different parameters like how many days passed since the last contact of the client with the bank.
The first set of descriptive or diagnostic solutions can be implemented relatively quickly as siloed analytics teams within the bank are already exploring them on their own. The second set of solutions, which is more predictive in nature, could take up to a year to implement. Built atop these, the prescriptive solution utilizes the outcome of previous analyses to suggest improved and individualized retention strategies. As a result the bank can now take different preventive retention measures for each customer.
Comments welcome!
-
Customer Journey Analytics
How important is it to align your analytics efforts with the customer lifecycle? Imagine you are a credit card department within the consumer banking branch of large bank. You are sending periodic mailers offering credit cards to your customers. Before sending these mail offers you do a minimum screening in a way that you only offer these to customers that have been with the bank for at-least 2 years and have maintained a balance above a certain threshold. However, you notice that the acceptance of your mail offers remains low even after a few campaigns. Why do you think is that?
The answer lies in a simple concept, but one that is often overlooked by analytics teams. Are you trying to identify which life stage the customer is in? Are you trying to synchronize your sales effort with the customer lifecycle? What is the customer lifecycle, you ask?
Customer lifecycle can be understood as a framework to track the relationship between a customer and a bank. It starts off with the Acquisition stage where your primary focus is to figure out ways to identify and bring on-board customers with which a mutually beneficial relationship can be created. After this comes the Development stage, where the customer is encouraged to expand his portfolio with your products through cross-sell efforts, etc. Finally, comes the Retention stage where the customer has been with you for more than a decade, so you try to enhance the relationship and monitor customer satisfaction so that the customer can act as a good ambassador for you.
These are the three basic stages: Acquire > Develop > Retain. You could break down these stages further to target any pain points you might be facing in a particular stage. For example, your acquisition through campaigns this year has not been as fruitful as in previous years. So you break down Acquisition into Awareness > Consideration > Purchase to pinpoint the root cause. Data suggests that the advertising budget is the same as in previous years. Marketing campaigns to tip consumers in the consideration stage into the purchase stage are also being sent in a timely manner. However, you are still losing prospective customers in the purchase stage. You sanction a study to identify any changes that might have happened in the way you on-board a customer. Voilà! You identify that the on-boarding form has been appended with two new sections seeking a little more information about the customer before on-boarding. You weigh the necessity of collecting this information during on-boarding and decide to drop the additional sections. A few months later, Acquisition metrics start to return to the ballpark of previous years.
Perhaps the most important aspect in the world of data driven decision making is to align the reporting and analytical efforts with the customer lifecycle. For example, during the acquisition phase your primary aim is to provide the right product just when the prospective customer needs it. This could be achieved through an analysis such as the Best Next Offer, where you use Machine Learning techniques to match your products with profiles of prospects created using demographic, psychographic, and other factors. Similarly, during the Development stage you focus on meticulously reporting and driving cross-sell efforts to increase your product presence in the customer portfolio. Lastly, during the Retention stage your focus should be on minimizing churn through customer satisfaction, and this can be achieved through churn analysis on the quality data you collected in this aspect.
To close I will reemphasize the importance of collecting good data, analytics and aligning it closely with customer lifecycle for optimal data driven decision making.
Comments welcome!
-
An Introduction to GitHub
A three part article series on version control using Git and GitHub. This is the third article in the series in which I will give a very brief introduction to GitHub. This will allow most readers to understand enough to utilize it for version control during development.
What is GitHub?
GitHub is a popular platform for hosting and sharing code repositories, and is widely used for version control and collaborative coding projects. If you’re new to using GitHub for version control, here are some key things to keep in mind:
Create a GitHub account: The first step in using GitHub is to create an account. You can sign up for a free account, which gives you access to public repositories, or a paid account, which gives you access to private repositories and additional features.
Create a new repository: Once you have an account, you can create a new repository by clicking the “New repository” button on your GitHub dashboard. You can choose to make the repository public or private, and can add a README file and other files as needed.
Clone the repository to your local machine: Once you have created a repository on GitHub, you can clone it to your local machine using Git. This allows you to make changes to the code locally, and push those changes back to the remote repository on GitHub.
Make changes and commit them: Once you have cloned the repository to your local machine, you can make changes to the code and commit those changes to Git. Be sure to write clear and descriptive commit messages that explain the changes made.
Push changes to the remote repository: After committing changes to Git, you can push those changes back to the remote repository on GitHub. This allows other team members to see the changes and collaborate on the code.
Use pull requests for code reviews: When working on a team, it’s a good practice to use pull requests to review code changes before merging them into the main branch. This allows other team members to review the code and provide feedback before changes are merged.
Use branches for new features or bug fixes: When working on a new feature or bug fix, it’s important to create a new branch in Git rather than making changes directly to the main branch. This keeps the main branch stable and allows for easier collaboration with other team members.
By keeping these key things in mind when using GitHub for version control, you can help ensure that your codebase is well-organized, well-documented, and easy to collaborate on with other team members.
Components of GitHub
Now, let us explore some of the key components of GitHub.
Repository, branch
Repository is a project’s folder and contains all of the project files (including documentation), and stores each file’s revision history.
Branch is a parallel version of a repository. It is contained within the repository, but does not affect the primary or master branch allowing you to work freely without disrupting the “live” version. When you’ve made the changes you want to make, you can merge your branch back into the master branch to publish your changes.
Commit, revert
Commit, or “revision”, is an individual change to a file (or set of files). When you make a commit to save your work, Git creates a unique ID (a.k.a. the “SHA” or “hash”) that allows you to keep record of the specific changes committed along with who made them and when. Commits usually contain a commit message which is a brief description of what changes were made.
Revert - when you revert a pull request on GitHub, a new pull request is automatically opened, which has one commit that reverts the merge commit from the original merged pull request. In Git, you can revert commits with git revert.
Push, pull, fetch, merge
Push means to send your committed changes to a remote repository on GitHub.com. For instance, if you change something locally, you can push those changes so that others may access them.
Pull refers to when you are fetching in changes and merging them. For instance, if someone has edited the remote file you’re both working on, you’ll want to pull in those changes to your local copy so that it’s up to date. See also fetch.
Pull requests are proposed changes to a repository submitted by a user and accepted or rejected by a repository’s collaborators. Like issues, pull requests each have their own discussion forum.
Fetch - when you use git fetch, you’re adding changes from the remote repository to your local working branch without committing them. Unlike git pull, fetching allows you to review changes before committing them to your local branch.
Merge takes the changes from one branch (in the same repository or from a fork), and applies them into another. This often happens as a “pull request” (which can be thought of as a request to merge), or via the command line. A merge can be done through a pull request via the GitHub.com web interface if there are no conflicting changes, or can always be done via the command line.
Fork, clone, download
Fork is a personal copy of another user’s repository that lives on your account. Forks allow you to freely make changes to a project without affecting the original upstream repository. You can also open a pull request in the upstream repository and keep your fork synced with the latest changes since both repositories are still connected
Clone is a copy of a repository that lives on your computer instead of on a website’s server somewhere, or the act of making that copy. When you make a clone, you can edit the files in your preferred editor and use Git to keep track of your changes without having to be online. The repository you cloned is still connected to the remote version so that you can push your local changes to the remote to keep them synced when you’re online.
Download option allows you to download the project folder as a zip file from GitHub to your local machine. The zip does not include the .git folder, so cloning the repository using its HTTPS URL is a better option if you want to keep the version history.
Comments welcome!
-
-
An Introduction to Git
A three part article series on version control using Git and GitHub. This is the first article in the series in which I will give a very brief introduction to Git. This will allow most readers to understand enough to utilize it for version control during development.
What is Git?
Git is a popular version control system that allows developers to manage and track changes to their code over time. It’s an essential tool for software development teams, as it helps to ensure that changes to code are properly tracked and documented, and makes it easier for developers to collaborate and work together. Here’s an overview of what Git is and how it works.
Git is a distributed version control system, meaning that every developer working on a project has their own copy of the code repository on their local machine. This allows developers to work on their own changes and then merge them back into the main repository when they are ready. Git is also designed to be very fast and efficient, making it ideal for managing large codebases and complex projects.
How does Git work?
Git works by tracking changes to files and directories in a code repository. When a developer makes changes to the code, they create a new “commit” that documents the changes they made. Git stores these commits in a tree-like structure, with each commit representing a snapshot of the code at a particular point in time. This allows developers to easily view the history of changes to the code over time, and to revert to previous versions if necessary.
Git also allows developers to create branches, which are essentially separate versions of the code repository that can be worked on independently. Branches are useful for trying out new features or making experimental changes without affecting the main codebase. Once changes have been tested and reviewed, they can be merged back into the main branch.
Using Git for version control
To use Git for version control, developers typically create a new repository on a Git hosting service such as GitHub, GitLab, or Bitbucket. They then clone the repository onto their local machine and begin making changes to the code. To commit changes, developers use Git commands such as “git add” to add changed files to the commit, and “git commit” to create a new commit with a commit message that describes the changes.
To collaborate with other developers, developers can push their changes to the remote repository and create “pull requests” that allow other developers to review the changes and provide feedback. Once changes have been reviewed and approved, they can be merged back into the main branch.
Basic terminal commands
Terminal (for Unix or Mac) or Command Prompt for Windows allows us to type Git commands and manage project repositories. In this section we will be focusing on terminal commands.
By default we are in the /home/vivek directory. home and mnt folders are in the same directory (usually they are in the highest level directory signified by just a /)
pwd shows the current directory
clear is used to clear the command line
cd + tab key is used to cycle between sub directories in a directory
cd .. is used to move up a directory
cd mnt/ is used to enter the mnt directory. In this directory we can find the windows c drive (basically it is a directory named c)
~ signifies that you are in your home directory
.. is used to move up one directory
/ signifies the highest level directory, you can’t go back further from there
mkdir is used to create a new directory
Directory names are case sensitive
Right click is used to paste an absolute path name in the terminal
ls is used to list all directories and files in a directory
rm -rf is used to remove folders. -r tells the command to remove a directory and its contents recursively and -f forces the removal; by default rm is used to remove a file
git --version is used to see the version of git
touch file_name.txt is used to create a file
Basic Git commands
Git Repository is used to save project files and the information about the changes in the project. Repository can be created locally, or it can be a clone Git repository which is a copy of a remote Git repo.
git init is used to initialize the directory as a git repository. This will create a .git folder in the directory and we can start using git features
git status shows staging area. You will see some files under “Untracked files:” header
git add file-name is used to add a file to staging area. After this you will see the file under the “Changes to be committed:” header
git add . is used to add all files in directory to staging area (. signifies all)
git rm --cached file.txt is used to unstage a file
git rm -f file.txt is used to force remove a file from staging area and also deletes the file from directory (-f signifies force)
git config --global user.email "abc.xyz@email.com"
git config --global user.name "abc.xyz"
git commit --help
git commit -a -m "Initial commit" (-m to include a message; -a to automatically stage files that have been modified and deleted, but new files you have not told Git about are not affected)
git log (if you want to see a shorter version then use git log --oneline)
HEAD usually points to the most recent commit of the branch you are on (e.g., master); it determines what the project directory looks like.
git checkout <commit-id> is used to see the contents of the folder as they looked during that particular commit
git checkout master is used to restore the head to the most recent commit, hence the contents of the project directory are also restored to what they were at the time of the most recent commit
git revert <commit-id> creates a new commit that undoes the changes introduced by that particular commit. The original commit still appears in the log, and the revert itself can be undone by running git revert again
git reset - three kinds - soft (moves HEAD back to an earlier commit but keeps the staging area and working directory; similar in spirit to checkout), mixed (the default; moves HEAD and resets the staging area, but keeps the working directory), and hard (moves HEAD and resets both the staging area and the working directory, discarding changes to tracked files)
touch .gitignore, now open the .gitignore file with notepad and add the names of the files you don’t want to track. # can be used to comment in this file. Usually you create .gitignore when initializing the project. If you have already committed files before adding them to the .gitignore file, then you need to remove them from the cache by using the following series of commands
git rm -r --cached .
git add .
git commit -m "message"
If there is a directory in your project folder and you want to ignore all files in the directory from future commits, you can add “directory-name/*” in the .gitignore file
Git Branches for Error Handling
Let’s say there is an error in one of the files in the project folder
We can create a branch to fix the error while the master repository stays intact
git checkout -b err01 (creates a new branch called err01)
<fix the error in one of the files in the project folder>
git add . (add all changes made to the err01 to the staging area, so they can be committed)
git commit -m "fixed error" (commit all changes made to the err01 branch)
git checkout master (switch back to master branch)
git merge err01 (merge changes made in err01 into the master branch; the merge weaves the err01 changes into the master branch commit timeline)
git push (this will push master branch of project folder to remote repository)
git push origin err01 (this will push err01 branch of project folder to remote repository)
git push origin --delete err01 (we delete the err01 branch as we don’t need it anymore)
git branch -d err01 (local branches can be deleted using -d)
git branch -a (list all branches)
Remote Repositories for Effective Collaboration
First step is to create a new repository on GitHub (don’t add a read-me, gitignore or license). Copy the url of the repository
Create a project folder in your local machine and browse into that folder using bash
git init (you will see that the repository has not been initialized yet; git init is used to create a new repository)
git remote add origin <paste url here>
git remote -v (you will see that the repository has been initialized)
In GitHub website
“Create new file” > README.md
“Create new file” > LICENSE
“Create new file” > .gitignore > in content of that file type /AutoGen to exclude all files that we keep in that folder
pull - go back to bash
git pull origin master (we don’t need to specify origin master if we set master as the tracked branch)
git branch --set-upstream-to=origin/<branch> master
Sometimes you might be prompted for a login at this stage
<make changes to the local repository>
git push -u origin master (push updates to remote repository on GitHub; will ask for username and password)
You can add other developers as collaborators to this repository.
In summary, Git is a powerful tool for version control that allows developers to manage and track changes to code over time. With its distributed architecture, fast performance, and support for branching and merging, Git is an essential tool for software development teams of all sizes.
Comments welcome!
-
Introduction to Programming in R
Quick Introduction to R
R is a programming language and environment for statistical computing and graphics. It was created in the early 1990s by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. R is now widely used in academia, industry, and government for data analysis, statistical modeling, and data visualization.
One of the key features of R is its wide range of statistical and graphical techniques. R provides a vast array of statistical and graphical methods, including linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, and graphical techniques for data visualization. R is also highly extensible and has an active community of users and developers who create and contribute packages that enhance the capabilities of the language.
R is an open-source language, which means that the code is available for free and can be modified and redistributed. This has led to the development of a large and active community of R users and developers. The R community provides a wealth of resources, including documentation, tutorials, and help forums, making it easy for users to get started with the language and to find solutions to their problems.
One of the advantages of R is its integration with other programming languages and data sources. R can read data from a wide range of sources, including text files, spreadsheets, databases, and web services. R can also interact with other programming languages, such as Python, Java, and C++, allowing users to take advantage of the strengths of different languages and libraries.
Another advantage of R is its versatility. R can be used for a wide range of tasks, from data analysis and visualization to machine learning and artificial intelligence. R can also be used in a variety of settings, from research and academia to industry and government.
Most modern programming languages share a similar set of building blocks, for example
Receiving input from the user and showing output to the user
Ability to store values in variables (usually of different kinds such as integers, floating point numbers or characters)
A string of characters where you can store names, addresses, or any other kind of text
Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don’t write the code for that 10 times; you write it just once and tell the computer to loop through it 10 times
Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes, else fails
Put your code in functions
Advanced data types that are formed through a combination of one or more types of basic data types, such as structures or classes
Read files from disk and save files to disk
Ability to comment your code so you can understand it when you revisit it some time later
Let’s dive right in and see how we can do these things in R.
Before we can begin to write a program in R, we need to install R and RStudio. Once both are installed, open RStudio and run the following code to check that everything works:
myString <- "Hello, World!"
print (myString)
1. Receiving input from the user and Showing output to the user
There are several ways in which we can show output to the user. Let’s look at some ways of showing output:
var.1 = c(0,1,2,3)
# Method 1: values of the variables can be printed using print()
print(var.1)
# Output: 0 1 2 3
# Method 2: cat() function combines multiple items into a continuous print output
cat ("var.1 is ", var.1 ,"\n")
# Output: var.1 is 0 1 2 3
2. Ability to store values in variables (usually of different kinds such as integers, floating points or character)
Basic data types: In R, variables are called objects. There are several types of objects; let’s take a look at the important ones:
# Logical
v <- TRUE
print(class(v)) # the class() function can be used to see the data type of the variable
# Numeric
v <- 23.5
print(class(v))
# Integer
v <- 2L
print(class(v))
# Complex
v <- 2+5i
print(class(v))
# Character
v <- "TRUE"
print(class(v))
# Raw
v <- charToRaw("Hello")
print(class(v))
Advanced data types: Much of R’s power comes from the fact that R lets us access some advanced objects other than the basic ones shown earlier. Let’s take a look at some of the advanced types:
# Vectors - When you want to create a vector with more than one element, you should use the c() function, which combines the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
# Get the class of the vector.
print(class(apple))
# Lists - A list is an R-object which can contain many different types of elements inside it like vectors, functions and even another list inside it.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
# Print the list.
print(list1)
# Matrices - A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
# Arrays - While matrices are confined to two dimensions, arrays can be of any number of dimensions. The array function takes a dim attribute which creates the required number of dimension. In the below example we create an array with two elements which are 3x3 matrices each.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
# Factors - Factors are R objects which are created using a vector. A factor stores the vector along with the distinct values of its elements as labels. The labels are always character, irrespective of whether the input vector is numeric, character or Boolean. Factors are useful in statistical modeling. They are created using the factor() function, and the nlevels() function gives the count of levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')
# Create a factor object.
factor_apple <- factor(apple_colors)
# Print the factor.
print(factor_apple)
print(nlevels(factor_apple))
# Data frames are tabular data objects. Unlike a matrix, in a data frame each column can contain a different mode of data: the first column can be numeric while the second column can be character and the third column can be logical. A data frame is a list of vectors of equal length, created using the data.frame() function.
# Create the data frame.
BMI <- data.frame(
gender = c("Male", "Male","Female"),
height = c(152, 171.5, 165),
weight = c(81,93, 78),
Age = c(42,38,26)
)
print(BMI)
3. A string of characters where you can store names, addresses, or any other kind of text
Any value written within a pair of single quote or double quotes in R is treated as a string.
Key idea here is to learn how to manipulate string variables. There are a few common operations that we will focus on:
a. Concatenate strings
# Concatenate strings - paste() syntax: paste(..., sep = " ", collapse = NULL)
str1 <- "Hello"; str2 <- "World"; str3 <- "!"
paste(str1, str2, str3, sep = " ") # "Hello World !"
b. Counting number of characters in a string
# Counting number of characters in a string - nchar() function
test_str <- "Hello World"
nchar(test_str) # 11
c. Changing the case - toupper() & tolower() functions
str = 'apPlE'
toupper(str) # APPLE
tolower(str) # apple
d. Extracting parts of a string - substring() function
# Syntax
substring(x,first,last)
# Example - Extract characters from 5th to 7th position.
result <- substring("Extract", 5, 7)
print(result)
e. Formatting - Numbers and strings can be formatted to a specific style using format() function.
# Syntax
format(x, digits, nsmall, scientific, width, justify = c("left", "right", "centre", "none"))
# Example
# Total number of digits displayed. Last digit rounded off.
result <- format(23.123456789, digits = 9)
print(result)
# Display numbers in scientific notation.
result <- format(c(6, 13.14521), scientific = TRUE)
print(result)
# The minimum number of digits to the right of the decimal point.
result <- format(23.47, nsmall = 5)
print(result)
# Format treats everything as a string.
result <- format(6)
print(result)
# Numbers are padded with blank in the beginning for width.
result <- format(13.7, width = 6)
print(result)
# Left justify strings.
result <- format("Hello", width = 8, justify = "l")
print(result)
# Justify string with center.
result <- format("Hello", width = 8, justify = "c")
print(result)
4. Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Arrays are a series of values of the same type stored together in one variable. Arrays can be one-dimensional or multi-dimensional. An array is created using the array() function. It takes vectors as input and uses the values in the dim parameter to create an array.
For example, if we create an array of dimension (2, 3, 4), it creates 4 rectangular matrices, each with 2 rows and 3 columns.
# dim=c(rows, columns, matrices)
array2 = array(1:12, dim=c(2, 3, 2))
# Naming Columns and Rows
column.names <- c("COL1","COL2","COL3")
row.names <- c("ROW1","ROW2")
matrix.names <- c("Matrix1","Matrix2")
array2 = array(1:12, dim=c(2, 3, 2), dimnames = list(row.names, column.names, matrix.names))
Lets see how we can access array elements:
# dim=c(rows, columns, matrices)
print(array2[2,,2]) # Print the second row of the second matrix of the array.
print(array2[1,3,1]) # Print the element in the 1st row and 3rd column of the 1st matrix.
print(array2[,,2]) # Print the 2nd Matrix.
Since the returned values here are matrices, we can perform matrix operations on them
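For example, continuing with the array2 created above, a couple of standard matrix operations (purely illustrative) look like this:
# The slices above are ordinary matrices, so matrix operations apply
mat1 <- array2[,,1] # 2x3 matrix
mat2 <- array2[,,2] # 2x3 matrix
print(mat1 + mat2) # element-wise addition
print(mat1 %*% t(mat2)) # matrix multiplication: 2x3 times 3x2 gives a 2x2 matrix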
Calculations Across Array Elements (we can use user defined functions as well)
apply()
lapply()
sapply()
tapply()
# apply(X, MARGIN, FUN) - applies FUN over rows (MARGIN=1), columns (MARGIN=2) or both - input to this function is a matrix, array or data frame - output is a vector, list or array
m1 <- matrix(1:10, nrow=5, ncol=2)
apply(m1, 2, sum)
# lapply(X, FUN) - apply to all elements - input to this function is list, vector or df - output is a list
# sapply(X, FUN) - apply to all elements - input to this function is list, vector or df - output is a vector or a matrix
movies <- c("BRAVEHEART","BATMAN","VERTIGO","GANDHI")
lapply(movies, tolower)
sapply(movies, tolower)
# tapply(X, INDEX, FUN = NULL) - applies FUN to each group defined by a factor variable - input to this function is a vector - output is an array
data(iris)
tapply(iris$Sepal.Width, iris$Species, median)
5. Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don’t write the code for that 10 times; you write it just once and tell the computer to loop through it 10 times
R has several looping options (repeat, while and for). There are also options of nesting (single, double, triple, ..) loops.
a. The Repeat loop executes the same code again and again until a stop condition is met:
# Syntax
repeat {
commands
if(condition) {
break
}
}
# Example
v <- c("Hello","loop")
cnt <- 2
repeat {
print(v)
cnt <- cnt+1
if(cnt > 5) {
break
}
}
b. The While loop executes the same code again and again as long as its test condition remains true:
# Syntax
while (test_expression) {
statement
}
# Example
v <- c("Hello","while loop")
cnt <- 2
while (cnt < 7) {
print(v)
cnt = cnt + 1
}
c. The for loop:
# Syntax
for (value in vector) {
statements
}
# Example
v <- LETTERS[1:4]
for ( i in v) {
print(i)
}
R also provides the break and next statements that allow us to alter the loops further. Following is their use:
When the break statement is encountered inside a loop, the loop is immediately terminated and program control resumes at the next statement following the loop.
On encountering next, the R parser skips further evaluation and starts next iteration of the loop.
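As a small illustration (the vector and cutoff below are made up for this sketch), here is how next and break can be used inside a for loop:
# Illustrative example of next and break inside a for loop
for (i in 1:10) {
  if (i %% 2 == 0) {
    next # skip even numbers and move to the next iteration
  }
  if (i > 7) {
    break # terminate the loop once i exceeds 7
  }
  print(i)
}
# Output: 1 3 5 7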
6. Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
R provides if.., if..else.., if..else..if.., and switch options to apply conditional logic. Lets take a look at them:
a. The basic syntax for creating an if statement in R is:
# Syntax
if (test_expression) {
statement
}
# Example
x <- 5
if(x > 0){
print("Positive number")
}
b. The basic syntax for creating an if…else statement in R is:
if (test_expression) {
statement1
} else {
statement2
}
# Example
x <- -5
if(x > 0){
print("Non-negative number")
} else {
print("Negative number")
}
c. The basic syntax for creating an if…else if…else statement in R is:
if (test_expression1) {
statement1
} else if (test_expression2) {
statement2
} else if (test_expression3) {
statement3
} else {
statement4
}
# Example
x <- 0
if (x < 0) {
print("Negative number")
} else if (x > 0) {
print("Positive number")
} else {
print("Zero")
}
d. A switch statement allows a variable to be tested for equality against a list of values. Each value is called a case, and the variable being switched on is checked for each case.
x <- switch(
2,
"first",
"second",
"third",
"fourth"
)
print(x)
7. Put your code in functions
In R a user defined function is created by using the keyword function.
# Syntax
function_name <- function(arg_1, arg_2, ...) {
Function body
}
# Example
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
for(i in 1:a) {
b <- i^2
print(b)
}
}
We can call the function new.function supplying 6 as an argument.
new.function(6)
We can also create functions to which we can pass arguments. These functions can also be defined to use default values for those arguments in case the user does not provide a value. Let’s see how this is done:
new.function <- function(a = 3, b = 6) {
result <- a * b
print(result)
}
Now we can call this with or without passing any values:
# Call the function without giving any argument.
new.function()
# Call the function with giving new values of the argument.
new.function(9,5)
8. Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
A class is the blueprint that helps to create an object; it contains the member variables along with their attributes. R lets you create two types of classes, S3 and S4.
S3 classes: informal classes that let you overload functions (methods are dispatched on the class of the object).
S4 classes: more formal classes that let you restrict the type of data stored in each slot, which makes programs easier to debug.
We will cover S4 classes here (a brief S3 sketch follows the example below for contrast). An S4 class is defined by the setClass() method.
# Defining a class
setClass("emp_info", slots=list(name="character", age="numeric", contact="character"))
emp1 <- new("emp_info",name="vivek", age=30, contact="somewhere on the internet")
# Access elements of a class
emp1@name
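For contrast, here is a minimal S3-style sketch (the class and function names are illustrative); S3 objects are just R objects tagged with a class attribute, and functions are overloaded by defining class-specific methods:
# A minimal S3 sketch (illustrative names)
emp2 <- list(name="vivek", age=30)
class(emp2) <- "emp_info_s3" # tag the object with a class
print.emp_info_s3 <- function(x, ...) cat("Employee:", x$name, "-", x$age, "years old\n") # overload print() for this class
print(emp2) # dispatches to print.emp_info_s3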
9. Read file from a disk and save file to a disk
Let’s see how to read and write a CSV file in an organized way. CSV is the most common file type you will be using for data science; however, R can read several other file types as well.
# read a csv file
data <- read.csv('file.csv')
# write a csv file
write.csv(data, 'file.csv', row.names = FALSE)
10. Ability to comment your code so you can understand it when you revisit it some time later
We can tell R that a line of code is a comment by starting it with a #.
# this is a comment
In summary, R is a powerful and versatile programming language that is widely used for statistical computing and graphics. Its extensive range of statistical and graphical techniques, its open-source nature, and its active community of users and developers make it a valuable tool for data analysis and modeling. Whether you are a researcher, data analyst, or developer, R provides a wide range of tools and resources for working with data and creating meaningful insights.
To close I will emphasize the importance of practicing in learning anything new. Persistence and trying out different combinations of these building blocks for solving easier problems first and more complex ones later on is the only way to become fluent.
Comments welcome!
-
Introduction to Programming in Markdown
Quick Introduction to Markdown
Markdown is a lightweight markup language that is used to format text in a simple and consistent way. It was first created in 2004 by John Gruber and Aaron Swartz as a way to write content for the web that was easy to read and write.
Markdown is designed to be easy to learn and use. It uses simple syntax to format text, making it easy to create headings, lists, links, and other formatting elements. Markdown can be used in a wide variety of contexts, including writing blog posts, creating documentation, and writing code comments.
One of the key features of Markdown is its simplicity. Markdown uses plain text that can be easily read and edited using any text editor. This makes it easy to collaborate on documents and to transfer files between different devices and platforms. Additionally, Markdown is supported by a wide variety of software tools and platforms, including blogging platforms, content management systems, and online forums.
Another important feature of Markdown is its flexibility. Markdown can be customized and extended to support a wide variety of use cases. For example, Markdown supports the creation of tables, code blocks, and mathematical equations. Additionally, there are many third-party tools and libraries that extend the functionality of Markdown, such as Pandoc, which can convert Markdown to other formats like HTML, LaTeX, and PDF.
Markdown is also popular among programmers, as it can be used to create code blocks and inline code snippets. This is particularly useful for writing documentation and sharing code examples. Many code editors also support Markdown, allowing programmers to write and preview Markdown documents without leaving their development environment.
The following table provides a quick overview of frequently used Markdown syntax elements. It does not cover every case, so if you need more information about any of these elements, refer to the reference guides for basic syntax and extended syntax.
Element
Markdown Syntax
Heading
# for H1, ## for H2 and so on
Bold
**bold text**
Italic
*italicized text*
Blockquote
> blockquote
Ordered List
Just add 1., 2. and so on in front of list elements
Unordered List
Just add a - or * in front of list elements
Code
`code`
Horizontal Rule
three or more *, -, or _
Link
[title](https://www.example.com)
Image
![alt text](file-path/image.jpg){:class="img-responsive"}
Now that we have reviewed some of the basic syntax elements, let’s familiarize ourselves with some advanced syntax elements.
Element
Markdown Syntax
Table
| for vertical lines and - for horizontal lines
Code Block
``` code ```
Footnote
[^1]: This is the footnote.
Heading ID
### Heading {#custom-id}
Strikethrough
~~The world is flat.~~
URL
https://www.markdownguide.org
Email
fake@example.com
Escape character
\
Markdown also offers syntax highlighting for various programming languages when we specify a code block. Most of the time all we need to do is mention the name of the programming language right after the opening ```, like ```python (a short example follows the table below). The following is a curated list of supported programming languages:
Language
Supported file types
bash
'*.sh', '*.ksh', '*.bash', '*.ebuild', '*.eclass'
bat
'*.bat', '*.cmd'
c
'*.c', '*.h'
cpp
'*.cpp', '*.hpp', '*.c++', '*.h++', '*.cc', '*.hh', '*.cxx', '*.hxx', '*.pde'
csharp
'*.cs'
css
'*.css'
fortran
'*.f', '*.f90'
go
'*.go'
html
'*.html', '*.htm', '*.xhtml', '*.xslt'
java
'*.java'
js
'*.js'
markdown
'*.md'
perl
'*.pl', '*.pm'
php
'*.php', '*.php(345)'
postscript
'*.ps', '*.eps'
python
'*.py', '*.pyw', '*.sc', 'SConstruct', 'SConscript', '*.tac'
rb or ruby
'*.rb', '*.rbw', 'Rakefile', '*.rake', '*.gemspec', '*.rbx', '*.duby'
sql
'*.sql'
vbnet
'*.vb', '*.bas'
xml
'*.xml', '*.xsl', '*.rss', '*.xslt', '*.xsd', '*.wsdl'
yaml
'*.yaml', '*.yml'
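For example, a fenced code block with Python syntax highlighting would be written like this (the print line is just a placeholder):
```python
print("Hello, World!")
```
When rendered, the block is displayed with Python keyword and string highlighting.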
A great heading (h1)
Another great heading (h2)
Some great subheading (h3)
You might want a sub-subheading (h4)
Could be a smaller sub-heading, pacman (h5)
Small yet significant sub-heading (h6)
Code box
<html>
<head>
</head>
<body>
<p>Hello, World!</p>
</body>
</html>
List
First item, yo
Second item, dawg
Third item, what what?!
Fourth item, fo sheezy my neezy
Numbered list
First item, yo
Second item, dawg
Third item, what what?!
Fourth item, fo sheezy my neezy
Comments
{% comment %}
Might you have an include in your theme? Why not try it here!
{% include my-themes-great-include.html %}
{% endcomment %}
Tables
Title 1
Title 2
Title 3
Title 4
lorem
lorem ipsum
lorem ipsum dolor
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
lorem ipsum dolor sit
Title 1
Title 2
Title 3
Title 4
lorem
lorem ipsum
lorem ipsum dolor
lorem ipsum dolor sit
lorem ipsum dolor sit amet
lorem ipsum dolor sit amet consectetur
lorem ipsum dolor sit amet
lorem ipsum dolor sit
lorem ipsum dolor
lorem ipsum
lorem
lorem ipsum
lorem ipsum dolor
lorem ipsum dolor sit
lorem ipsum dolor sit amet
lorem ipsum dolor sit amet consectetur
In summary, Markdown is a simple and flexible markup language that is widely used for formatting text on the web. Its simplicity and ease of use make it an attractive option for writing and sharing documents, and its flexibility allows it to be customized and extended to support a wide variety of use cases. Whether writing blog posts, creating documentation, or sharing code examples, Markdown is a valuable tool for anyone who wants to format text in a consistent and easy-to-read way.
For a more complete list consider visiting Codebase.
By the way this page was written using markdown and rendered to HTML using Jekyll.
Comments welcome!
-
Introduction to Programming in Python
Quick Introduction to Python
Python is a high-level, interpreted programming language that was first released in 1991 by Guido van Rossum. It is a general-purpose language that is designed to be easy to use, with a focus on readability and simplicity. Python is often used for web development, data analysis, artificial intelligence, scientific computing, and other types of software development.
One of the key features of Python is its ease of use. Python’s syntax is designed to be simple and intuitive, making it accessible to both beginner and experienced programmers. Python is also an interpreted language, meaning that it does not require compilation, which makes it easy to write and test code quickly.
Another important feature of Python is its support for object-oriented programming. Python allows users to create classes and objects, and to define methods on those objects. This makes it a powerful tool for building complex software systems.
Python also includes a large and growing library of built-in modules and packages. These modules provide a wide range of functionality, from working with strings, arrays, and dictionaries to working with databases, web frameworks, and machine learning tools. Python’s open-source ecosystem is one of its biggest strengths, as it allows developers to easily access and integrate with a wide range of third-party libraries and tools.
One of the most popular web development frameworks built in Python is Django. Django is a full-stack web framework that provides a set of conventions and tools for building web applications quickly and easily. With its focus on developer productivity, Django has become a popular choice for startups, small businesses, and large enterprises.
Python’s popularity has also been driven by its use in data analysis and scientific computing. With packages like NumPy, Pandas, and Matplotlib, Python has become a leading language for data analysis and visualization. In recent years, Python has also become a popular language for artificial intelligence and machine learning, with packages like TensorFlow, PyTorch, and Scikit-learn providing powerful tools for building machine learning models.
Most modern programming languages share a similar set of building blocks, for example
Receiving input from the user and showing output to the user
Ability to store values in variables (usually of different kinds such as integers, floating point numbers or characters)
A string of characters where you can store names, addresses, or any other kind of text
Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don’t write the code for that 10 times; you write it just once and tell the computer to loop through it 10 times
Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes, else fails
Put your code in functions
Advanced data types that are formed through a combination of one or more types of basic data types, such as structures or classes
Read files from disk and save files to disk
Ability to comment your code so you can understand it when you revisit it some time later
Let’s dive right in and see how we can do these things in Python.
0. How to install Python on your desktop?
Before we can begin to write a program in Python, we need to install Anaconda. This will install the Anaconda data science environment and Spyder IDE for coding in Python. Once done, go ahead and open Spyder and try out the following code to see if everything is in order.
myString = "Hello, World!"
print (myString)
1. Receiving input from the user and Showing output to the user
There are several ways in which we can show output to the user. Let’s look at some ways of showing output:
name = input('please enter your name') # receiving character input
print("hello ", name, ",how are you?") # showing character output
age = input('please enter your age') # receiving numeric input
print('so you are', age, 'years old.')
2. Ability to store values in variables (usually of different kinds such as integers, floating points or character)
Python is dynamically typed - you don’t need to declare a variable’s data type before using it. This can sometimes cause unexpected problems, for example if a user enters text where you expect a number. To avoid this kind of problem, type() can be used to check what a variable currently holds. Alternatively, you can “define the variable” by assigning it an initial value (like age=20).
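For instance, a minimal sketch of such a check (the prompt and conversion below are just illustrative):
age = input('please enter your age') # input() always returns a string, even if the user types a number
print(type(age)) # <class 'str'>
age = int(age) # convert the text to an integer before doing arithmetic with it
print(type(age)) # <class 'int'>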
Basic data types: In Python we have several types of objects, lets take a look at the important ones:
# Boolean / Logical
v = True
print(type(v)) # the type() function can be used to see the data type of the variable
# Numeric (float)
v = 23.5
print(type(v))
# Integer
v = 2
print(type(v))
# Complex
v = 2+5j
print(type(v))
# Character (string)
v = "TRUE"
print(type(v))
# Some common number functions:
hex(1) # hexadecimal representation of numbers
bin(1) # binary representation of numbers
2**3 # 2^3, 2 to the power 3
pow(2,3) # 2**3
pow(2,3,4) # 2**3 % 4
abs(-2.33)
round(3.14)
round(3.14159,2) # only till 2 decimal places
import math
sq_rt = math.sqrt(16) # returns the square root of its argument, here 4.0
Advanced data types: Much of Python’s power comes from the fact that it lets us access some advanced variable types other than the basic ones shown earlier. Let’s take a look at some of the advanced variable types:
# Lists - A list can contain many different types of elements inside it such as character, numeric, etc. and even another list inside it.
# Create a list through enumeration.
a=[] # with this we initialize a list element
a=range(1,10) # with this we insert a range of values from 1-10 in the list
print(list(a)) # to show the list as a list, we need to tell the print function that we are passing it a list
# Output: [1, 2, 3, 4, 5, 6, 7, 8, 9] # 10 is excluded because upper bound is excluded in python
# we can have mixed data types in a list
b=[1,2,3,'vivek',True,4,5]
print(list(b))
# index of list start with 0, 1, 2 ..
# so vivek is present at index 3
print(b[3])
# slicing - [start:stop:step]
a[1:6:2] # starts at index 1, goes up to (but not including) index 6, and selects every second element
# reversing a list
b[::-1] # reverses the mixed list defined above; this would take a lot more effort to do in C++!
# tuples - immutable list, cant be changed
t = (1,2,3) # use () instead of []
# dict - d = {'key':'value', ..} is an unordered mutable key:value pairs {"name":"frankie","age":33}
# Dictionary is quite useful in matrix indexing
import numpy as np # numpy provides the array type used below
m=np.array([[1,2,3],[4,5,6],[7,8,9]])
col_names={'age':0, 'weight':1, 'height':2}
row_names={'aa':0, 'cc':1, 'bb':2}
# now we can get the weight of 'cc' using either actual indexes or dict-based indexes
m[1,1] # 5
m[row_names['cc'],col_names['weight']] # 5
# set - s = set(['a','b','c']) - an unordered collection of unique objects
# When Python shows the output it looks like a dictionary, e.g. {"a","b"}, but it is not one, because it doesn't have key:value pairs
set([1,1,2,3]) # output: {1,2,3} , List can be passed to set()
set("Mississippi") # output: {'M', 'i', 'p', 's'} , Even strings can be passed to set
# Matrices - A matrix is a two-dimensional rectangular data set. It can be created using numpy's array() function.
# Create a matrix
import numpy as np # we need to import the numpy library which provides tools for numerical computing.
m=np.array([[1,2,3],[4,5,6],[7,8,9]])
print(type(m))
# Arrays - while matrices are confined to two dimensions, arrays can be of any number of dimensions.
# Create an array.
import numpy as np # we need to import the numpy library which provides tools for numerical computing.
a=np.array([1,2,3]) # this is a 1-dimensional array
print(type(a))
# Convert a list to an array
a=[1,2,3,4]
a=np.array(a) # array([1, 2, 3, 4])
# DataFrame - this is an advanced object that can be used by installing the pandas library. If you are familiar with R, this is similar to data.frame. If you are familiar with Excel, you can think of a dataframe as a table with rows and columns, where rows and columns can potentially have names/labels. You can access data within the dataframe using row/column numbers (indexing starts from 0) or their labels.
import pandas as pd
# From dict
pd.DataFrame({'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']})
# from list
pd.DataFrame(['orange', 'mango', 'grapes', 'apple'], index=['a', 'b', 'c', 'd'], columns =['Fruits'])
# from list of lists
pd.DataFrame([['orange','tomato'],['mango','potato'],['grapes','onion'],['apple','chilly']], index=['a', 'b', 'c', 'd'], columns =['Fruits', 'Vegetables'])
# from multiple lists
pd.DataFrame(
list(zip(['orange', 'mango', 'grapes', 'apple'],
['tomato', 'potato', 'onion', 'chilly']))
, index=['a', 'b', 'c', 'd']
, columns =['Fruits', 'Vegetables'])
3. A string of characters where you can store names, addresses, or any other kind of text
Any value written within a pair of single quote or double quotes in Python is treated as a string.
Key idea here is to learn how to manipulate string variables
There are a few common operations that we will focus on:
a. Concatenate strings
# Concatenate strings using the + operator
str1 = "Hello,"; str2 = "World"; str3 = "!"
str1 + " " + str2 + str3 # 'Hello, World!'
b. Counting number of characters in a string
# Counting number of characters in a string
str1 = "vivek"
len(str1)
c. Changing the case - upper() & lower() methods
str1.upper() # convert string to upper case (.lower() for lower case)
str1.isupper(), str1.islower() # check if a string or a character is upper or lower
d. Splitting a string
s = "vivek"
s.split('e') # ['viv', 'k'] - returns the list of substrings around 'e'; if there are multiple e's, the split happens at every e
e. Reversing a string (a quick first step for a palindrome check)
str1 = "vivek"
str1[::-1] # 'keviv'
4. Some advanced data types such as lists which can store a series of regular variables (such as a series of integers)
Lists are a series of variables stored together in one variable. Lists can be one-dimensional or multi-dimensional. A list is created using square brackets [] or the list() function, and it can hold variables of any type (even other lists). A list is different from a string because its elements can be mutated/changed.
# Defining
L=[0,0,0] # [0, 0, 0]
L1=[0]*3 #shorthand way of defining a list with repeated elements
# Supports indexing and slicing
L1=['one', 'two', 'three']
L1[0] # 'one'
L1[1:2] # ['two'], upper bound is excluded
L1[1:3] # ['two', 'three']
# Indexing nested lists
L1 = ['one', 'two', ['three', 'four'], 'five']
L1[2][0] # 'three'
# Elements can be added
L1.append('six')
# Elements can be removed
L1.pop() # last element gets popped, we can save it in a variable also
# Sort
L1.sort() # sorts the list in-place, the actual list gets sorted
sorted(L1) # returns a sorted copy of the L1 list, without modifying L1
# Reverse
L1=['c','a','b']
L1.reverse() # reverses the list in-place, the actual list gets reversed
# Multi dimentional list indexing
L1=[[1,2,3],[4,5,6],[7,8,9]]
L1[0][:] # returns first row
5. Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don’t write the code for that 10 times; you write it just once and tell the computer to loop through it 10 times
Python has several looping options such as ‘for’ and ‘while’. There are also options of nesting (single, double, triple, ..) loops.
a. The While loop executes the same code again and again until a stop condition is met:
# Syntax
while test:
code statements
else:
final code statements
# Example
x = 0
while x < 10:
print('x is currently: ',x)
print(' x is still less than 10, adding 1 to x')
x+=1
b. The for loop: acts as an iterator in Python; it goes through items that are in a sequence or any other iterable item. Objects that we’ve learned about that we can iterate over include strings, lists, tuples, and even built-in iterables for dictionaries, such as keys or values.
# Syntax
for item in object:
statements to do stuff
# Example
list1 = [1,2,3,4,5,6,7,8,9,10]
for num in list1:
print(num)
Python also provides the break, continue and pass statements that allow us to alter the loops further. Following is their use:
break: Breaks out of the current closest enclosing loop.
continue: Goes to the top of the closest enclosing loop.
pass: Does nothing at all.
# Thinking about break and continue statements, the general format of the while loop looks like this:
while test:
    code statements
    if test:
        break
    if test:
        continue
else:
    code statements # the else block runs only if the loop finishes without hitting break
break and continue statements can appear anywhere inside the loop’s body, but we will usually put them further nested in conjunction with an if statement to perform an action based on some condition.
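As a concrete sketch (the numbers here are only illustrative), break and continue nested under if statements inside a for loop look like this:
# Illustrative example: skip even numbers, stop entirely at 7
for num in range(1, 10):
    if num % 2 == 0:
        continue # go back to the top of the loop for the next number
    if num == 7:
        break # exit the loop completely
    print(num)
# Output: 1 3 5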
6. Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Python provides if.., if..else.., and if..else..if.. statements to apply conditional logic. Lets take a look at them:
a. The basic syntax for creating an if statement is:
if False:
print('It was not true!')
b. The basic syntax for creating an if…else statement is:
x = False
if x:
print('x was True!')
else:
print('I will be printed in any case where x is not true')
c. The basic syntax for creating an if…else if…else statement is:
loc = 'Bank'
if loc == 'Auto Shop':
print('Welcome to the Auto Shop!')
elif loc == 'Bank':
print('Welcome to the bank!')
else:
print('Where are you?')
7. Put your code in functions
Functions allow us to create a block of code that can be executed many times without needing to write it again.
# Syntax
def name_of_function(argument_name='default value'): # snake casing for the name: all lower case letters with underscores
    '''
    what the function does (docstring)
    '''
    print('hello', argument_name)
    print(f'hello {argument_name}') # both print statements do the same thing
# Example
def add_function(a=0,b=0):
return a+b
We can call the function in the following two ways:
# option 1
add_function(2,3)
# option 2
c=add_function(3,4)
*args and **kwargs stand for arguments and keyword arguments and allow us to extend the functionality of functions.
*args lets a function take an arbitrary number of arguments. All arguments are received as a tuple, example - (a,b,c,..). args can be renamed to something else, what really matters is *.
def myfunc(*args):
return args
'''
myfunc(1,2,3,4,5,6,7,8,9)
Out[30]: (1, 2, 3, 4, 5, 6, 7, 8, 9)
'''
**kwargs lets the function take an arbitrary number of keyword arguments. All arguments are received as a dictionary of key,value pairs. kwargs can be renamed to something else, what really matters is **.
def myfunc(**kwargs):
print(kwargs)
'''
myfunc(name='vivek', age=34, height=186)
{'name': 'vivek', 'age': 34, 'height': 186}
'''
8. Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Python allows users to create classes. These can be a combination of variables and functions that operate on those variables. Let’s take a look at how we can define and use them.
# Define a class
class Person:
"This is a person class"
age = 10
def greet(self):
print('Hello')
# Using class
print(Person.age) # Output: 10
print(Person.greet) # Output: <function Person.greet>
print(Person.__doc__) # Output: 'This is a person class'
# Creating an object of the class and using that
vivek = Person() # create a new object of Person class
print(vivek.greet) # Output: <bound method Person.greet of <__main__.Person object>>
vivek.greet() # Calling object's greet() method; Output: Hello
9. Read file from a disk and save file to a disk
Let’s see how to read and write a CSV file in an organized way. CSV is the most common file type you will be using for data science; however, Python can read several other file types, and data directly from websites, as well.
import pandas
# read a csv using the pandas package
df = pandas.read_csv('student_data.csv')
print(df)
# write data to a csv using pandas package
df.to_csv('student_data_copy.csv')
10. Ability to comment your code so you can understand it when you revisit it some time later
We can tell Python that a line of code is a comment by starting it with a #.
# this is a comment
We can tell that a multi-line block of text is a comment by enclosing it in triple inverted single quotes.
'''
this
is
a
comment
block
'''
Overall, Python is a versatile and powerful programming language that is well-suited for a wide range of programming tasks. With its emphasis on simplicity, object-oriented design, and a large and growing ecosystem of third-party libraries and tools, Python is a valuable tool for both beginner and experienced programmers. Whether building web applications, analyzing data, or working on artificial intelligence projects, Python provides a fast, flexible, and enjoyable development experience.
To close I will emphasize the importance of practicing in learning anything new. Persistence and trying out different combinations of these building blocks for solving easier problems first and more complex ones later on is the only way to become fluent.
Comments welcome!
-
Introduction to Programming in Julia
Quick Introduction to Julia
Julia is a high-level, high-performance programming language that was created in 2012 by a team of computer scientists led by Jeff Bezanson, Stefan Karpinski, and Viral Shah. Julia was designed to address the limitations of traditional scientific computing languages, such as MATLAB, Python, and R, while still retaining their ease of use and flexibility.
One of the key features of Julia is its performance. Julia is designed to be fast, with execution speeds comparable to those of compiled languages such as C and Fortran. This is achieved through a combination of just-in-time (JIT) compilation, which compiles code on the fly as it is executed, and type inference, which allows Julia to determine the data types of variables at runtime.
Another important feature of Julia is its support for multiple dispatch. Multiple dispatch allows Julia to select the appropriate method to use based on the types of the arguments being passed to a function. This makes Julia a flexible and expressive language that can be easily extended and customized to fit a wide range of programming tasks.
Julia also includes a number of built-in data structures and libraries that make it easy to work with arrays, matrices, and other scientific computing tools. These include tools for linear algebra, statistics, optimization, and machine learning, as well as support for distributed computing and parallelism.
In addition to its scientific computing features, Julia also includes support for general-purpose programming tasks, such as web development, database access, and file I/O. Julia’s growing package ecosystem provides a wide range of libraries and tools for these tasks, making it a versatile language that can be used for a variety of programming tasks.
One of the key benefits of Julia is its community. Julia has a rapidly growing community of developers and users who are actively contributing to the language and its ecosystem. This community has created a large number of high-quality packages, as well as a number of online resources and forums for learning and discussing the language.
Most modern programming languages share a similar set of building blocks, for example
Receiving input from the user and showing output to the user
Ability to store values in variables (usually of different kinds such as integers, floating point numbers or characters)
A string of characters where you can store names, addresses, or any other kind of text
Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don’t write the code for that 10 times; you write it just once and tell the computer to loop through it 10 times
Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes, else fails
Put your code in functions
Advanced data types that are formed through a combination of one or more types of basic data types, such as structures or classes
Read files from disk and save files to disk
Ability to comment your code so you can understand it when you revisit it some time later
Let’s dive right in and see how we can do these things in Julia.
0. How to install Julia on your desktop?
Before we can begin to write a program in Julia, we need to install Julia. Next you can install VSCode. Now launch VSCode and install the Julia (by julialang) extension. Now you can create a new test.jl file, add the following code, and see if it runs.
4+2; # If you don't want to see the result of the expression printed, use a semicolon at the end of the expression
ans; # the value of the last expression you typed on the REPL, it's stored within the variable ans
Before we dive in, chaining functions is possible in Julia, like so:
1:10 |> collect
1. Receiving input from the user and Showing output to the user
There are several ways in which we can show output to the user. Let’s look at some ways of showing output:
# receiving input from user
name = readline(stdin)
# showing output to user
println("you name is ", name)
2. Ability to store values in variables (usually of different kinds such as integers, floating points or character)
Names of variables are in lower case. Word separation can be indicated by underscores.
Julia has several types of variables broadly classified into Concrete and abstract types. The types that can have subtypes (e.g. Any, Number) are called abstract types. The types that can have instances are called concrete types. These types cannot have any subtypes.
Concrete types can be further divided into primitive (or basic), and complex (or composite). Let’s take a deeper look:
# Primitive types
## the basic integer and float types (signed and unsigned): Int8, UInt8, Int16, UInt16, Int32, UInt32, Int64, UInt64, Int128, UInt128, Float16, Float32, and Float64
a = 10
## more advanced numeric types: BigFloat, BigInt
a = BigInt(2)^200
## Boolean and character types: Bool and Char
selected = true
## Text string types: String
name = "vivek"
# Composite type
## Rational, used to represent fractions. It is composed of two pieces, a numerator and a denominator, both integers (of type Int)
666//444 # To make rational numbers, use two slashes (//)
Some advanced data types include dictionaries and sets. Sets are similar to arrays, with the difference that they don’t allow duplicate elements.
dict = Dict("a" => 1, "b" => 2, "c" => 3)
dict = Dict{String,Integer}("a"=>1, "b" => 2) # If you know the types of the keys and values in advance, you can specify them after the Dict keyword, in curly braces
# looking things up
dict["a"]
values(dict) # to retrieve all values
keys(dict) # to retrieve all keys
# these can be useful for iterating over a dictionary
for k in keys(dict) println(k) end
for (key, value) in dict println(key, " => ", value) end
merge(dict, Dict("z" => 26)) # merge() merges two dictionaries (the second one here is just an illustrative literal)
findmin(dict) # find the minimum value in a dictionary, and return the value and its key
filter(p -> p.second > 1, dict) # filter() keeps only the pairs that satisfy the predicate (here: value greater than 1)
# sort dict - you can use the SortedDict data type from the DataStructures.jl package
using Pkg; Pkg.add("DataStructures") # install the package once
import DataStructures
dict = DataStructures.SortedDict("b" => 2, "c" => 3, "d" => 4, "e" => 5, "f" => 6)
# Sets - A set is a collection of elements, just like an array or dictionary, with no duplicated elements.
colors = Set{String}(["red","green","blue","yellow"])
push!(colors, "black") # You can use push!() to add elements to a set
rainbow = Set(["red","orange","yellow","green","blue","indigo","violet"]) # a second set, defined here so the examples below run
union(colors, rainbow) # The union of two sets is the set of everything that is in one or the other set
intersect(colors, rainbow) # The intersection of two sets is the set that contains every element that belongs to both sets
setdiff(colors, rainbow) # The difference between two sets is the set of elements that are in the first set, but not in the second
We will discuss abstract data types in section 8 below.
3. A string of characters where you can store names, addresses, or any other kind of text
Any value written within a pair of double quotes in Julia is treated as a string.
"this is a string"
# double quotes and dollar signs need to be preceded (escaped) with a backslash
"""this is "a" string with double quotes""" # triple double quotes can be used to store strings with double quotes in them
Julia also allows the user to indicate special strings.
# special strings
r" " indicates a regular expression
v" " indicates a version string
b" " indicates a byte literal
raw" " indicates a raw string that doesn't do interpolation
Key idea here is to learn how to manipulate string variables. There are a few common operations that we will focus on:
a. Concatenate strings
# Concatenate strings
s1 = "Hello"; s2 = "Julia"
string(s1, ", ", s2) # "Hello, Julia" - string() concatenates its arguments; the * operator also works
join([s1, s2], ", ") # join() concatenates the elements of an array, with an optional separator
b. Counting number of characters in a string
# Counting number of characters in a string
length(str) # to find the length of a string
lastindex(str) # to find index of last char of string
c. Changing the case - uppercase() & lowercase() functions
uppercase(s)
d. Splitting a string
split("You know my methods, Watson.") # by default splits on space
split("You know my methods, Watson.", 'W') # splits on the char W
# If you want to split a string into separate single-character strings, use the empty string ("")
split("You know my methods, Watson.", r"a|e|i|o|u"; keepempty=false) # splits the string on any character that matches a vowel
# keepempty=false makes sure that empty strings are not returned
e. String interpolation
# string interpolation - use the results of Julia expressions inside strings.
x = 42
"The value of x is $(x)." # "The value of x is 42."
f. Iterate over a string
for char in s # iterate through a string
print(char, "_")
end
g. Get index of all characters in a string
for i in eachindex(str)
@show str[i]
end
h. Converting between numbers and strings
a = BigInt(2)^200
a=string(a) # convert number to string
parse(BigInt, a) # convert strings to numbers
i. Finding and replacing things inside strings
s = "My dear Frodo";
in('M', s) # true
occursin("Fro", s) # true
findfirst("My", s) # 1:2
replace(s, "Frodo" => "Frodo Baggins")
There are a lot of other functions as well:
length(str) - length of string
sizeof(str) - length/size
startswith(strA, strB) - does strA start with strB?
endswith(strA, strB) - does strA end with strB?
occursin(strA, strB) - does strA occur in strB?
all(isletter, str) - is str entirely letters?
all(isnumeric, str) - is str entirely number characters?
isascii(str) - is str ASCII?
all(iscntrl, str) - is str entirely control characters?
all(isdigit, str) - is str 0-9?
all(ispunct, str) - does str consist of punctuation?
all(isspace, str) - is str whitespace characters?
all(isuppercase, str) - is str uppercase?
all(islowercase, str) - is str entirely lowercase?
all(isxdigit, str) - is str entirely hexadecimal digits?
uppercase(str) - return a copy of str converted to uppercase
lowercase(str) - return a copy of str converted to lowercase
titlecase(str) - return copy of str with the first character of each word converted to uppercase
uppercasefirst(str) - return copy of str with first character converted to uppercase
lowercasefirst(str) - return copy of str with first character converted to lowercase
chop(str) - return a copy with the last character removed
chomp(str) - return a copy with the last character removed only if it's a newline
4. Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Arrays can be one-dimensional or multi-dimensional. An array is created using square brackets, the Array constructor, or several other methods. Arrays support a lot of functionality within Julia, so I have covered them in more detail in a separate array-specific article. For now let’s check out the key functionality.
# Defining
# Creating arrays by initializing
arr_Int64 = [1, 2, 3, 4, 5]
# Creating empty arrays
b = Int64[]
# Creating 2-d arrays
arr_2d = [1 2 3 4] # If you leave out the commas when defining an array, you can create 2D arrays quickly. Here's a single row, multi-column array:
arr_2d = [1 2 3 4 ; 5 6 7 8] # you can add another row using ;
# Creating arrays using range objects
a = 1:10 # creates a range variable with 10 elements from 1 to 10
collect(a) # collect displays a range variable
[a...] # instead of collect, you could use the ellipsis (...) operator (three periods) after the last element
range(1, length=12, stop=100) # Julia calculates the missing pieces for you by combining the values for the keywords step(), length(), and stop()
# Using comprehensions and generators to create arrays
[n^2 for n in 1:5] # a 1-d array
[r * c for r in 1:5, c in 1:5] # a 2-d array
# Reshape an array to create a multi-dimentional array
reshape([1, 2, 3, 4, 5, 6, 7, 8], 2, 4) # create a simple array and then change its shape
# Supports indexing and slicing
# 1-d
a[5] # 5th element
a[end] # last element
a[end-1] # second last element
# 2-d
a = [[1, 2] [3,4]]
a[2,2] # element at row-2 x col-2
a[:,2] # all elements of col-2
getindex(a, 2,2) # same as a[2,2]
# Elements can be added
a = Array[[1, 2], [3,4]]
push!(a, [5,6]) # The push!() function pushes another item onto the back of an array
pushfirst!(a, 0) # To add an item at the front
# splice!() can also insert elements into an array at a given index:
splice!(a, 4:5, 4:6) # insert, at position 4:5, the range of numbers 4:6
L = ['a','b','f']; splice!(L, 3:2, ['c','d','e']) # insert c, d, e between b and f
# Elements can be removed
splice!(a,5); # If you don't supply a replacement, splice!() removes elements and moves the rest of them along
pop!(a) # To remove the last item
popfirst!(a)
# Elementwise and vectorized operations
a / 100 # every element of the new array is the original divided by 100. These operations operate elementwise
n1 = 1:6;
n2 = 2:7;
n1 .* n2; # if two arrays are to be multiplied then we just add a . before the mathematical operator to signify elementwise
# the first element of the result is what you get by multiplying the first elements of the two arrays, and so on
# How function works on individual variables
f(a, b) = a * b
a=10;b=20;print(f(a,b))
# How function can be applied elementwise to arrays
n1 = 1:6;
n2 = 2:7;
print(f.(n1, n2))
5. Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don’t write the code for that 10 times; you write it just once and tell the computer to loop through it 10 times
Julia has several looping options such as ‘for’ and ‘while’. There are also options of nesting (single, double, triple, ..) loops.
a. The While loop executes the same code again and again until a stop condition is met:
# while end - iterative conditional evaluation
x=0
while x < 4
println(x)
global x += 1
end
b. The for loop acts as an iterator in Julia; it goes through items that are in a sequence or any other iterable item. Objects we can iterate over include strings, arrays, tuples, and even dictionaries and sets.
# for end - iterative evaluation
# note: z below is local to the loop body; use the global keyword if you need a variable that outlasts the loop
for i in 1:10
z = i
println("z is $z")
end
# Some sample for loop statements for different data types
for color in ["red", "green", "blue"] # an array
for letter in "julia" # a string
for element in (1, 2, 4, 8, 16, 32) # a tuple
for i in Dict("A"=>1, "B"=>2) # a dictionary
for i in Set(["a", "e", "a", "e", "i", "o", "i", "o", "u"])
Julia also provides the break and continue statements that allow us to alter the loops further. Following is their use:
break: Breaks out of the current closest enclosing loop.
continue: Goes to the top of the closest enclosing loop.
# Example with break statement
x=0
while true
println(x)
x += 1
x >= 4 && break # breaks out of the loop
end
break and continue statements can appear anywhere inside the loop’s body, but we will usually put them further nested in conjunction with an if statement to perform an action based on some condition.
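For completeness, a small sketch of continue (the range and condition are just illustrative):
# Example with continue statement
for i in 1:6
    iseven(i) && continue # skip even numbers and jump to the next iteration
    println(i) # prints 1, 3, 5
end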
Following are some other looping options:
# list comprehensions
[i^2 for i in 1:10]
[(r,c) for r in 1:5, c in 1:2] # two iterators in a comprehension
# Generator expressions - generator expressions can be used to produce values from iterating a variable
sum(x^2 for x in 1:10)
# Enumerating arrays
m = rand(0:9, 3, 3)
[i for i in enumerate(m)]
# Zipping arrays
for i in zip(0:10, 100:110, 200:210)
println(i)
end
# Iterable objects
ro = 0:2:100
[i for i in ro]
6. Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Julia provides several options to apply conditional logic. Lets take a look at them:
a. ternary and compound expressions:
x = 1
x > 3 ? "yes" : "no"
b. Boolean switching expressions:
isodd(1000003) && @warn("That's odd!")
isodd(1000004) || @warn("That's odd!")
c. if elseif else end - conditional evaluation:
name = "Julia"
if name == "Julia"
println("I like Julia")
elseif name == "Python"
println("I like Python.")
println("But I prefer Julia.")
else
println("I don't know what I like")
end
d. Error handling using try.. catch. This allows the code to keep executing even if an error occurs, which would usually halt the program.
# try catch error throw exception handling
try
<statement-that-might-cause-an-error>;
catch e # error gets caught if it happens
println("caught an error: $e") # show the error if you want to
end
println("but we can continue with execution...")
# Example 1 - error doesn't occur
try
a=10 # no error
catch e
print(e)
end
# Example 2 - error occurs
try
la-la-la # undefined variable error
catch e
print(e)
end
7. Put your code in functions
Functions allow us to create a block of code that can be executed many times without needing to write it again.
Julia has something called a single expression function. These are usually defined in one line like so:
# Single expression functions
f(x) = x * x
g(x, y) = sqrt(x^2 + y^2)
Functions with multiple expressions are also supported and can be defined using the function keyword:
# Syntax
# Functions with multiple expressions
function say_hello(name)
println("hello ", name)
end
say_hello("vivek")
Additionally, functions can be programmed to return a single value or multiple values using the return keyword.
# define function which returns a value
function add_numbers(a,b)
return a+b
end
# call the function
add_numbers(2,3)
# define function which returns multiple values
function add_multiply_numbers(a, b=10) # we can supply default values as well
return(a+b, a*b)
end
# call the function
add_multiply_numbers(2,3)
add_multiply_numbers(2)
args… lets a function take an arbitrary number of arguments. A for loop can be used to iterate over these arguments.
function show_args(args...)
for arg in args
println(arg," ")
end
end
show_args(10,20,25,35,50)
Julia also supports anonymous functions, with no name.
map((x,y,z) -> x + y + z, [1,2,3], [4, 5, 6], [7, 8, 9])
Map and reduce can also be used to apply functions to arrays.
Map - If you already have a function and an array, you can call the function for each element of the array by using map()
a=1:10;
map(sin, a) # map() returns a new array but if you call map!() , you modify the contents of the original array
The map() function collects the results of some function working on each and every element of an iterable object, such as an array of numbers.
map(+, 1:10)
The reduce() function does a similar job, but after every element has been seen and processed by the function, only one is left. The function should take two arguments and return one.
reduce(+, 1:10)
8. Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Julia allows users to create user-defined types using abstract type (which are abstract) or mutable struct (which are concrete). Let’s take a look at both.
Abstract type
abstract type MyAbstractType end # By default, the type you create is a direct subtype of Any
abstract type MyAbstractType2 <: Number end # the new abstract type is a subtype of Number
Concrete type using mutable struct
# define the data type
mutable struct student <: Any
name
age::Int
end
# initialize a variable of that data type
x=student("vivek", 30)
# use the variable
x.name
x.age
9. Read file from a disk and save file to a disk
Let’s see how to read a file in an organized way.
f = open("sherlock-holmes.txt") # To read text from a file, first obtain a file handle:
close(f) # When you've finished with the file, you should close the connection
If you use the following technique then you don’t need to close the file explicitly. The open file is automatically closed when this block finishes.
open("sherlock-holmes.txt") do file
# do stuff with the open file
end
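For example, a small sketch that reads the file and counts its lines (the filename is the same illustrative one used above):
open("sherlock-holmes.txt") do file
    lines = readlines(file) # read all lines into an array of strings
    println("number of lines: ", length(lines))
end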
10. Ability to comment your code so you can understand it when you revisit it some time later
We can tell Julia that a line of code is a comment by starting it with a #.
# this is a comment
Overall, Julia is a powerful and flexible programming language that is well-suited for scientific computing and other high-performance tasks. With its emphasis on performance, multiple dispatch, and a growing ecosystem of packages and tools, Julia is a valuable tool for researchers, data scientists, and other professionals who need a fast, flexible, and expressive language for their work.
To close I will emphasize the importance of practicing in learning anything new. Persistence and trying out different combinations of these building blocks for solving easier problems first and more complex ones later on is the only way to become fluent.
Comments welcome!
-
Perspective: A Lesson from The Kite Runner
Have you ever looked back on a moment in your life and realized you saw it completely differently at the time? Our perspective shapes the way we understand events, people, and even ourselves. Khaled Hosseini’s The Kite Runner masterfully explores the power of perspective through its protagonist, Amir, and his journey of redemption. The novel provides several poignant moments where a shift in perspective redefines reality, reminding us of the importance of seeing beyond our own biases and assumptions.
A Child’s Perspective: The Privilege of Innocence
In the beginning, Amir enjoys a privileged life in Kabul, unaware of the deep societal divides that separate him from Hassan, his Hazara servant and best friend. To Amir, their friendship is pure and unaffected by status. However, Hassan, though younger, understands the weight of their differences. One of the most heartbreaking moments occurs when Amir fails to stand up for Hassan in the alley. From Amir’s limited perspective, his silence is self-preservation, but with time, he realizes it was cowardice—a realization that haunts him into adulthood.
“I ran because I was a coward. I was afraid of Assef and what he would do to me.” This self-awareness only develops later, demonstrating how perspective matures with experience.
The Father-Son Lens: Misunderstood Love
Baba, Amir’s father, is another character whose perspective is misunderstood. Amir believes Baba favors strength and physical courage over intellect, leading to deep insecurities. However, as the novel unfolds, Amir learns of Baba’s sacrifices and hidden struggles—his illegitimate son, his moral dilemmas, and the burden of expectations.
A key moment of realization comes when Baba tells Amir, “There is only one sin, only one. And that is theft… When you tell a lie, you steal someone’s right to the truth.” This lesson, initially abstract to Amir, takes on a new meaning as he matures and understands the gravity of deception—not just in others, but within himself.
Redemption and a Shift in Perspective
Perspective is often best understood in hindsight. Amir’s journey to atone for his past mistakes brings him back to Afghanistan, where he sees his homeland through the eyes of suffering. The Taliban’s rule has reshaped the Kabul of his childhood into an unrecognizable and brutal landscape. His perception of Hassan also shifts dramatically when he discovers the truth about their relationship—that they were brothers.
His final act—rescuing Sohrab—is not just a physical redemption but a transformation of his worldview. He finally understands what it means to be truly selfless, to take action rather than remain passive.
Final Thoughts: Expanding Our Own Perspective
Amir’s journey reminds us that perspective is ever-changing, molded by experience, knowledge, and time. Whether in literature or in life, understanding different perspectives fosters empathy and growth. Just like Amir, we must be willing to look beyond our immediate view and challenge our own biases.
After all, true transformation begins when we allow ourselves to see the world through another’s eyes. How has a shift in perspective changed the way you see a person or situation in your own life?
-
Introduction to Programming in Ruby
Quick Introduction to Ruby
Ruby is a high-level, interpreted programming language that was created in the mid-1990s by Yukihiro “Matz” Matsumoto. It is a general-purpose language that is designed to be easy to use and read, with syntax that is similar to natural language. Ruby is often used for web development, as well as for building command-line utilities, desktop applications, and other types of software.
One of the key features of Ruby is its emphasis on programmer productivity and ease of use. Ruby’s syntax is designed to be intuitive and easy to read, making it accessible to both beginner and experienced programmers. Ruby also includes a number of built-in features and libraries that make it easy to accomplish common programming tasks, such as working with strings, arrays, and hashes.
Another important feature of Ruby is its object-oriented programming model. Everything in Ruby is an object, and methods can be defined on objects to add functionality. Ruby also includes support for inheritance, encapsulation, and polymorphism, which makes it a powerful tool for building complex software systems.
Ruby is also known for its extensive library of open-source gems, which are pre-built packages of code that can be easily integrated into Ruby projects. These gems provide a wide range of functionality, from database access to web development frameworks, and can save developers a significant amount of time and effort in building software.
One of the most popular web development frameworks built in Ruby is Ruby on Rails. Rails is a full-stack web framework that provides a set of conventions and tools for building web applications quickly and easily. With its focus on developer productivity, Rails has become a popular choice for startups and small businesses, as well as for larger enterprises.
Most modern programming languages have a set of similar building blocks, for example:
Receiving input from the user and Showing output to the user
Ability to store values in variables (usually of different kinds such as integers, floating points or character)
A string of characters where you can store names, addresses, or any other kind of text
Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don't write that code 10 times; you write it once and tell the computer to loop through it 10 times
Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Put your code in functions
Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Read file from a disk and save file to a disk
Ability to comment your code so you can understand it when you revisit it some time later
Let's dive right in and see how we can do these things in Ruby.
0. How to install Ruby on your desktop?
Before we can begin writing programs in Ruby, we need to set up our ruby environment.
You can install Ruby from here ruby-lang.org.
Additionally, you need to install an IDE to write and execute Ruby code. My personal favorite is code.visualstudio.com.
Lastly, you will also need to install the following extensions within VSCode: Ruby (Peng Lv) and Code Runner (Jun Han).
Now, let's write a simple program that prints out hello world for the user to see:
print 'Hello World !!!'
1. Receiving input from the user and Showing output to the user
There are several ways in which we can show output to the user. Let’s look at some ways of showing output:
#Method 1:
print 'Hello World !!!'
#Method 2:
p 'Hello World !!!'
#Method 3:
puts 'Hello World !!!'
#Method 4: Showing data stored in variables to user
my_name = "Vivek"
puts "Hello #{my_name}"
#Method 5: Showing multiple variables using same puts statement
aString = "I'm a string!"
aBoolean = true
aNumber = 42
puts "string: #{aString} \nboolean: #{aBoolean} \nnumber: #{aNumber}"
2. Ability to store values in variables (usually of different kinds such as integers, floating points or character)
There are three main types of variable:
Strings (a collection of symbols inside speech marks)
Booleans (true or false)
Numbers (numeric values)
Following are some examples:
aString = "I'm a string!"
aBoolean = true
aNumber = 42
puts "string: #{aString} \nboolean: #{aBoolean} \nnumber: #{aNumber}"
Performing basic math on numeric variables. There are 6 types of basic operations: addition, subtraction, multiplication, division, modulo and exponent.
a = 5
b = 2
puts "sum: #{a+b}\
\ndifference: #{a-b}
\nmultiplication: #{a*b}
\ndivision: #{a/b}
\nmodulo: #{a%b}
\nexponent: #{a**b}"
3. A string of characters where you can store names, addresses, or any other kind of text
You can use single quotes or double quotes for strings - either one is acceptable.
myFirstString = 'I am a string!' #single quotes
mySecondString = "Me too!" #double quotes
There are a few common operations that we will focus on:
"Hi!".length #is 3
"Hi!".reverse #is !iH
"Hi!".upcase #is HI!
"Hi!".downcase #is hi!
# You can also use many methods at once. They are solved from left to right.
"Hi!".downcase.reverse #is !ih
# If you want to check if one string contains another string, you can use .include?.
"Happy Birthday!".include?("Happy")
4. Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Arrays allow you to group multiple values together in a list. Each value in an array is referred to as an “element”.
a. Defining an array:
myArray = [] # an empty array
myOtherArray = [1, 2, 3] # an array with three elements
b. Accessing array elements:
# In order to add to or change elements in an array, you can refer to an element by number.
myOtherArray[3] = 4
Ruby has another advanced data type called Hash, which is similar to a Python dictionary. Just like arrays, hashes allow you to store multiple values together. However, while arrays store values with a numerical index, hashes store information using key-value pairs. Each piece of information in the hash has a unique label, and you can use that label to access the value.
a. To create a hash, use Hash.new, or myHash={}. For example:
myHash=Hash.new()
myHash["Key"]="value"
myHash["Key2"]="value2"
# or
myHash={
"Key" => "value",
"Key2" => "value2"
}
b. To access elements of a hash:
puts myHash["Key"] # puts value
Instead of using a string as a key, you can also use a symbol, like this:
a. To create a hash with symbol keys, use Hash.new or myHash={}. For example:
myHash=Hash.new()
myHash[:Key]="value"
myHash[:Key2]="value2"
# or
myHash={
Key: "value",
Key2: "value2",
}
b. To access elements of a hash:
puts myHash[:Key] # puts "value"
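Once a hash is populated, you can walk over all of its key-value pairs with each. A small sketch using the hash defined above:
# iterate over all key-value pairs of the hash
myHash.each do |key, value|
  puts "#{key}: #{value}"
end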
5. Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don't write that code 10 times; you write it once and tell the computer to loop through it 10 times
Ruby has several looping options (for, while, and until). There are also options for nesting loops (single, double, triple, and so on).
a. For loop executes code once for each element in expression. Following example shows how a for loop works:
# Syntax
for variable [, variable ...] in expression [do]
code
end
# Example
for i in 0..5
puts "Value of local variable is #{i}"
end
b. While loop executes code while conditional is true. A while loop's conditional is separated from code by the reserved word do, a newline, backslash \, or a semicolon ;. Following example shows how a while loop works:
# Syntax
while conditional [do]
code
end
# Example
a=1
b=5
while a<=b
puts "run #{a}"
a=a+1
end
# Ruby while modifier - Executes code while conditional is true.
code while condition
# or
begin # If a while modifier follows a begin statement with no rescue or ensure clauses, code is executed once before conditional is evaluated.
code
end while conditional
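To make the modifier form concrete, here is a minimal sketch (the variable name counter and the limit are just illustrative):
# while modifier: the statement keeps running while the condition is true
counter = 0
counter += 1 while counter < 5
puts counter # prints 5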
c. Until loop executes code while conditional is false. An until statement's conditional is separated from code by the reserved word do, a newline, or a semicolon. Following example shows how an until loop works:
# Syntax
until conditional [do]
code
end
# Example
$i = 0
$num = 5
until $i > $num do
puts("Inside the loop i = #$i" )
$i +=1;
end
# Ruby until modifier - Executes code while conditional is false.
code until conditional
# or
begin # If an until modifier follows a begin statement with no rescue or ensure clauses, code is executed once before conditional is evaluated.
code
end until conditional
d. Ruby also offers following keywords that can modify the behavior of the above loops:
# break - Terminates the most internal loop. Terminates a method with an associated block if called within the block (with the method returning nil).
# next - Jumps to the next iteration of the most internal loop. Terminates execution of a block if called within a block (with yield or call returning nil).
# redo - Restarts this iteration of the most internal loop, without checking loop condition. Restarts yield or call if called within a block.
# retry - If retry appears in rescue clause of begin expression, restart from the beginning of the begin body.
# retry - If retry appears in the iterator, the block, or the body of the for expression, restarts the invocation of the iterator call. Arguments to the iterator are re-evaluated.
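A short sketch showing break and next inside a while loop (the specific numbers are just illustrative):
i = 0
while i < 10
  i += 1
  next if i == 3   # skip the rest of this iteration when i is 3
  break if i == 6  # leave the loop entirely when i reaches 6
  puts i
end
# prints 1, 2, 4 and 5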
6. Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Conditionals are used to add branching logic to your programs; they allow you to include complex behaviour that only occurs under specific conditions.
a. If - if condition is an expression that can be checked for truth. If the expression evaluates to true, then the code within the block is executed.
if condition
something to be done
end
# Ruby if modifier - executes code if the conditional is true.
code if condition
Following is an actual example of an if statement with both an elsif and an else.
booleanOne = true
randomCode = "Hi!"
if booleanOne
puts "I will be printed!"
elsif randomCode.length>=1
puts "Even though the above code is true, I won't be executed because the earlier if statement was true!"
else
puts "I won't be printed because the if statement was executed!"
end
b. If Else - You can combine if with the keyword else. This lets you execute one block of code if the condition is true, and a different block if it is false. The else block will only be executed if the if block doesn’t run, so they will never both be executed.
if condition
something to be done
else
something to be done if the condition evaluates to false
end
c. Elsif - When you want more than two options, you can use elsif. This allows you to add more conditions to be checked. Still only one of the code blocks will be run, because the statement only executes the code in the first applicable block; once a condition has been satisfied, the whole statement ends. Here is the if/elsif/else statement syntax:
if condition
something to be done
elsif different condition
something else to be done
else
another different thing to be done
end
d. Unless - Executes code if conditional is false. If the conditional is true, code specified in the else clause is executed.
unless condition
# thing to be done if the condition is false
else
# else is optional
# thing to be done if the condition is true
end
# Ruby unless modifier - Executes code if conditional is false.
code unless conditional
e. Case - this is basically the same as an if-elsif-else statement, but with clearer syntax.
# case statement syntax
case expr0
when expr1, expr2
stmt1
when expr3, expr4
stmt2
else
stmt3
end
# is basically similar to the following −
if expr1 === expr0 || expr2 === expr0
stmt1
elsif expr3 === expr0 || expr4 === expr0
stmt2
else
stmt3
end
Example of case statement
$age = 5
case $age
when 0 .. 2
puts "i will not be printed"
when 3 .. 6
puts "i will be printed"
when 7 .. 12
puts "i will not be printed"
when 13 .. 18
puts "youth"
else
puts "i will not be printed"
end
7. Put your code in functions
a. In Ruby we call functions methods. Methods are reusable sections of code that perform specific tasks in our program. Using methods means that we can write simpler, more easily readable code.
# syntax
def methodname
# method code here
end
b. Methods can also be defined to accept and process any parameters that are passed to them:
# Methods With Parameters
def laugh(number)
puts "haha " * number
end
c. We can call methods using the name of the method and specify the parameters within parentheses or without them:
# Using method - calling method as follows prints "haha" 5 times on the screen
laugh(5)
# You can also call laugh without parentheses
laugh 5
d. We can set default values for the parameters, which will be used if the method is called without passing the corresponding arguments:
def method_name (var1 = value1, var2 = value2)
expr..
end
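To make this concrete, here is a minimal sketch of a method with a default parameter value (the method name greet and the values are just illustrative):
def greet(name = "friend")
  puts "Hello, #{name}!"
end
greet           # prints "Hello, friend!"
greet("Vivek")  # prints "Hello, Vivek!"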
e. We can also return values. The return statement in Ruby is used to return one or more values from a Ruby method.
return
# or
return 12
# or
return 1,2,3
f. We can also define methods with variable number of parameters, like so:
# Variable number of parameters
def sample (*test)
puts "The number of parameters is #{test.length}"
for i in 0...test.length
puts "The parameters are #{test[i]}"
end
end
sample "Zara", "6", "F"
sample "Mac", "36", "M", "MCA"
8. Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Ruby allows users to create classes. These can be a combination of variables and functions that operate on those variables. Let's take a look at how we can define and use them.
# Define a class
class Employee # class names in Ruby must start with an uppercase letter
@@no_of_customers = 0
def initialize(id, name, addr)
@cust_id = id
@cust_name = name
@cust_addr = addr
end
end
# Creating an object of the class and using that
cust1 = Employee.new("1", "Vivek", "Somewhere on the Internet")
9. Read file from a disk and save file to a disk
Let's see how to read and parse a CSV file in an organized way. CSV is the most common file type you will use for data science; however, Ruby can read several other file types as well.
require 'csv'
# read a csv
CSV.read("file.csv")
# parse a string of text which is in csv format
CSV.parse("1,penny\n2,nickel\n3,dime")
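Writing a CSV goes through the same csv library: CSV.open with mode "w" yields a writer that rows can be pushed into. A minimal sketch (the file name output.csv and the rows are just illustrative):
require 'csv'
# write rows to a csv file
CSV.open("output.csv", "w") do |csv|
  csv << ["id", "coin"]
  csv << [1, "penny"]
  csv << [2, "nickel"]
end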
10. Ability to comment your code so you can understand it when you revisit it some time later
a. We can tell Ruby that a line of code is a comment by starting it with #.
#this is a comment
b. We can also specify a comment block, like so:
=begin
There are three main types of variable:
1. Strings (a collection of symbols inside speech marks)
2. Booleans (true or false)
3. Numbers (numeric values)
=end
Overall, Ruby is a powerful and flexible programming language that is well-suited for a wide range of programming tasks. With its focus on ease of use, object-oriented design, and extensive library of gems, Ruby is a valuable tool for both beginner and experienced programmers. Whether building web applications, desktop utilities, or other types of software, Ruby provides a fast, flexible, and enjoyable development experience.
To close I will emphasize the importance of practicing in learning anything new. Persistence and trying out different combinations of these building blocks for solving easier problems first and more complex ones later on is the only way to become fluent.
Comments welcome!
-
Introduction to Programming in C++
Quick Introduction to C++
C++ is a powerful and popular programming language that was developed in the 1980s as an extension of the C programming language. It is a high-level, object-oriented language that is used to develop a wide range of applications, including operating systems, device drivers, game engines, and more. C++ is also widely used in the field of finance and quantitative analysis, due to its speed and efficiency.
One of the key features of C++ is its ability to directly manipulate memory, allowing for low-level control over the hardware. C++ is also known for its efficiency and speed, making it a popular choice for developing applications that require high performance, such as video games and real-time systems.
Another key feature of C++ is its support for object-oriented programming (OOP). This allows programmers to define their own classes and objects, and to encapsulate data and functionality within those objects. OOP allows for code reusability, modularity, and flexibility, making it a popular paradigm in software development.
C++ is also known for its support for templates and generic programming. Templates allow programmers to write generic code that can work with different data types, without having to write separate code for each type. This can greatly simplify code development and maintenance, and can make C++ code more efficient and easier to read.
Most modern programming languages have a set of similar building blocks, for example:
Receiving input from the user and Showing output to the user
Ability to store values in variables (usually of different kinds such as integers, floating points or character)
A string of characters where you can store names, addresses, or any other kind of text
Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don't write that code 10 times; you write it once and tell the computer to loop through it 10 times
Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Put your code in functions
Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Read file from a disk and save file to a disk
Ability to comment your code so you can understand it when you revisit it some time later
Let's dive right in and see how we can do these things in C++.
0. How to install C++ on your desktop?
Before we can begin to write a program in C++, we need to install Dev-C++. Once done, go ahead and open the IDE and try out the following code to see if everything is in order.
#include <iostream>
using namespace std;
int main() {
cout << "Hello World!";
return 0;
}
As you noticed, unlike languages such as Python, R or Ruby, it takes more than a few statements just to display basic text to the user in C++. In the next section we will try to dismantle this code and understand the various components. Let's, however, cover a few important points first:
In C++ we need to end each line of code with a semi-colon ;
The scope of statements is defined using curly brackets {}, unlike Python where the scope is defined through indentation
All statements need to be within a function. Here we have included the statements in the main() function, which is the first function that is executed during a compiler call. All other functions will be called from within this function.
1. Receiving input from the user and Showing output to the user
Following program shows output to the user. The include statement is used to call the iostream header file, which is similar to a Python library. This header file provides information on basic programming routines, including input and output constructs. Next is int main(), which says that the main function will return an integer after execution. Within the main function we use cout << to show the text to the user. The text is enclosed in double quotes "text". endl after the text tells the compiler to insert a new line in the output window. Finally we return 0, as the main function is supposed to return an integer. 0 signifies that everything was in order during the execution of the function.
#include <iostream>
using namespace std;
int main() {
cout << "This is some text." << endl;
return 0;
}
We can modify this program to accept input from the user. The cin >> statement allows us to receive input. The variable in which we store the received input needs to be defined beforehand.
#include <iostream>
using namespace std;
int main() {
int age_ = 0;
cout << "What is your age?";
cin>>age_;
cout << "So your age is: " << age_;
return 0;
}
2. Ability to store values in variables (usually of different kinds such as integers, floating points or character)
C++ is statically typed: you need to declare a variable's name and data type before using it.
Basic data types: In C++ we have several types of variables; let's take a look at the important ones:
// Integer
int numberCats=5;
long int bigNumber=50000; //long int can be used for storing large values
// Floating point numbers. These are numbers with significant digits after the decimal
float pi=3.1415926535; //pi=22/7
// Double
double dValue=3.1415926535; //for more significant digits we need to use a variable type other than float
long double ldValue=3.1415926535;
// Boolean
bool bval=true; //boolean type is true or false; c++ uses 1 for true and 0 for false when outputting
// Character
char cval=55, cval2='7'; //takes exactly 1 byte of computer memory, char represents single characters from the ascii character set, 55 is the ascii code for 7, this is not the number 7 but the character 7
// String
string myname;
3. A string of characters where you can store names, addresses, or any other kind of text
A string in C++ can be defined using the string keyword. It can be assigned using input from the user or by providing text within double quotes "text".
string yourName;
cout << "\n\nwhat is your name? ";
cin >> yourName;
cout <<"\nnice to meet you "<<yourName<<endl<<endl;
4. Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Arrays are a series of variables stored together in one variable. Arrays can be one-dimensional or multi-dimensional.
One-dimensional arrays:
// Defining
int ar[3];
// Initializing the array
ar[0]=10;
ar[1]=20;
ar[2]=30;
// Supports indexing
cout<<ar[0]; // this will output the value stored at index 0, which is 10
Multi-dimensional arrays:
// Defining and initializing the array
int mar[3][2]={ //multi-dim array
{34,188},
{29,165},
{29,160}
};
// Supports indexing
cout<<mar[0][0]; // this will output the value stored at row index 0 x column index 0, which is 34
Loops can be used to iterate over one-dimensional or multi-dimensional arrays. We will take a closer look at this in the next section.
5. Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don't write that code 10 times; you write it once and tell the computer to loop through it 10 times
C++ has several looping options, such as 'for', 'while' and 'do while'. There are also options for nesting loops (single, double, triple, and so on).
a. The for loop
// Syntax
for (int i=0;i<10;i++){
statements to do stuff
}
// iterate over elements of a one-dimensional array
// practice - create an array with a table of 12
int t12[10];
for (int i=0;i<10;i++){
t12[i]=12*(i+1);
}
// iterate over elements of a two-dimensional array (concept of nesting - we will enclose a for loop within another for loop)
int mar[3][2]={
{34,188},
{29,165},
{29,160}
}; //multi-dim array
cout<<"\nthis is a multi dimentional array: ";
for (int i=0;i<3;i++){ //3 rows in the array
cout<<"\nrow "<<i+1<<": ";
for (int j=0;j<2;j++){ //2 columns in the array
cout<<"col "<<j+1<<": "<<mar[i][j]<<", ";
}
}
b. The While loop executes the same code again and again until a stop condition is met:
// Syntax
int i=0;
while (i<10){
code statements;
i+=1;
}
// Example
int i=1;
cout<<"\n\nwhile loop - first 10 natural numbers"<<endl;
while (i<=10){
cout<<i<<", ";
i+=1; //same as i=i+1
}
c. The Do-While loop executes the same code again and again until a stop condition is met. The difference from the while loop is that in a do-while loop the content of the loop is executed at least once before the condition is checked.
// Syntax
int i=0;
do{
code statements;
i+=1;
}while (i<10);
// Example
//for example if you want the user to enter the password again and again until they enter the correct password
cout<<"\n\ndo-while loop\n";
i=1;
string pass="pass", pass2;
do{
if(i!=1){
cout<<"\naccess denied, try again";
}
cout<<"\nenter your password?";
cin>>pass2;
i=0;
}while(pass2 != pass);
cout<<"\npassword accepted\n\n";
C++ also provides the break and continue statements that allow us to alter the loops further. Following is their use:
break jumps immediately out of the loop. It is mostly used in while loops but can also be used in for loops.
// break statement example
cout<<"\nbreak statement\n";
for(int f=1;f<11;f++){
if(f==5){
break; //we break out of the loop when f==5, and dont execute the loop for f>=5
}
cout<<f<<", ";
}
continue is similar to break, but it only skips the rest of the current iteration and continues with the next one.
// continue statement example
cout<<"\nbreak statement\n";
for(int f=1;f<11;f++){
if(f==5){
continue;
}
cout<<f<<", "; //this statement not executed for f==5
}
6. Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
C++ provides if.., if..else.., and switch statements to apply conditional logic. Let's take a look at them:
a. The basic syntax for creating an if statement is:
/////////// IF STATEMENT ////////////
string pass="password",pass2;
cout<<"\n\n--if statement capability--\n";
cout<<"\nenter password:";
cin>>pass2;
if (pass==pass2){
cout<<"\npassword matches! you can enter!!";
} else{
cout<<"\npassword doesnt match! begone!!";
}
b. The basic syntax for creating an if…else statement is:
/////////// IF-ELSE STATEMENT ////////////
int menuChoice=5;
cout<<"\n\n--if-else statement capability--\n";
cout<<"\n1.\tadd record";
cout<<"\n2.\tdelete record";
cout<<"\n3.\texit";
cout<<"\nwhat do you want to do?";
cin>>menuChoice;
if (menuChoice==1){
cout<<"\nlets add some records!!";
} else if (menuChoice==2){
cout<<"\nlets delete some records!!";
} else{
cout<<"\nexiting! good-bye!!";
}
c. The basic syntax for creating a switch statement is:
/////////// SWITCH STATEMENT ////////////
int menuChoice2=5;
cout<<"\n\n--switch statement capability--\n";
cout<<"\n1.\tadd record";
cout<<"\n2.\tdelete record";
cout<<"\n3.\texit";
cout<<"\nwhat do you want to do?";
cin>>menuChoice2;
switch(menuChoice2){
case 1:
cout<<"\nlets add some records!!";
break;
case 2:
cout<<"\nlets delete some records!!";
break;
case 3:
cout<<"\nexiting! good-bye!!";
break;
default:
cout<<"\n!!!!error!!!!";
}
7. Put your code in functions
Functions allow us to create a block of code that can be executed many times without needing to write it again.
// Following is an example case where we define a function that shows a menu to the user
int sub_menu(int choice) {
switch(choice){
case 1:
cout<<"\nLets add a new record";
break;
case 2:
cout<<"\nLets view an existing record";
break;
case 3:
cout<<"\nLets delete an existing record";
break;
default:
cout<<"\nExiting! Goodbye!!";
}
return 0;
}
We can call the function by its name:
// let's say we are writing main() and we want to call the function
// lines-of-code
sub_menu(2); // pass the user's menu choice (2 here is just an example) as the argument
// lines-of-code
8. Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
C++ allows users to create classes. These can be a combination of variables and functions that operate on those variables. Let's take a look at how we can define and use them.
// Create a Car class with some attributes
class Car {
public:
string brand;
string model;
int year;
};
// Create an object of Car
Car carObj1;
carObj1.brand = "Mahindra";
carObj1.model = "Scorpio";
carObj1.year = 2020;
// Using the object
cout << carObj1.brand << " " << carObj1.model << " " << carObj1.year << "\n";
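The example above only stores attributes; a class can also bundle in the functions that operate on those attributes, as described above. A minimal sketch extending the same Car idea with a constructor and a member function (the function name describe is just illustrative):
#include <iostream>
#include <string>
using namespace std;
class Car {
  public:
    string brand;
    string model;
    int year;
    // constructor to initialize the attributes
    Car(string b, string m, int y) {
        brand = b;
        model = m;
        year = y;
    }
    // member function that operates on the attributes
    void describe() {
        cout << brand << " " << model << " " << year << "\n";
    }
};
int main() {
    Car carObj1("Mahindra", "Scorpio", 2020);
    carObj1.describe();
    return 0;
}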
9. Read file from a disk and save file to a disk
Let's see how to read and write a text file in an organized way. We use the fstream header file to import the functions necessary to read/write files.
#include <fstream>
// read a text file
string line;
ifstream myfile ("file.txt");
if (myfile.is_open())
{
while ( getline (myfile,line) )
{
cout << line << '\n';
}
myfile.close();
}
else cout << "Unable to open file";
// write a text file
ofstream myfile ("file.txt");
if (myfile.is_open())
{
myfile << "This is a line.\n";
myfile << "This is another line.\n";
myfile.close();
}
else cout << "Unable to open file";
10. Ability to comment your code so you can understand it when you revisit it some time later
We can tell C++ that a line of code is a comment as follows.
// this is a comment
We can mark a multi-line block as a comment as follows.
/*
this
is
a
comment
block
*/
While C++ can be a powerful tool, it can also be complex and difficult to learn, especially for beginners. The language has a steep learning curve, and requires a solid understanding of programming concepts such as pointers, memory management, and OOP. However, with the right resources and dedication, C++ can be a rewarding and powerful tool for software development.
Overall, C++ is a popular and powerful programming language that is used in a wide range of applications, from operating systems to video games. Its efficiency, speed, and support for OOP and generic programming make it a versatile and powerful tool for software developers.
To close I will emphasize the importance of practicing in learning anything new. Persistence and trying out different combinations of these building blocks for solving easier problems first and more complex ones later on is the only way to become fluent.
Comments welcome!
-
Introduction to Programming in Microsoft Excel VBA
What is MS Excel VBA?
Excel VBA, or Visual Basic for Applications, is a programming language that can be used to automate tasks and enhance functionality in Microsoft Excel. VBA is a powerful tool that allows users to write custom macros and functions to automate repetitive tasks, perform complex calculations, and create custom solutions.
VBA is a type of Visual Basic, which is an object-oriented programming language developed by Microsoft. VBA is integrated directly into Excel, making it easy to access and use. VBA code is stored in modules, which can be accessed through the Visual Basic Editor in Excel. In the Editor, users can write, edit, and run VBA code, as well as debug their code to identify and fix any errors.
One of the key advantages of VBA is that it allows users to automate repetitive tasks that would otherwise be time-consuming to perform manually. For example, users can write a VBA macro to format data, generate reports, or update data in bulk. VBA can also be used to perform complex calculations, create custom user interfaces, and interact with other applications.
To get started with VBA, users should have a basic understanding of programming concepts and syntax. The VBA language is based on Visual Basic, so many programming concepts, such as variables, loops, and conditional statements, are similar to other programming languages. Excel also provides many built-in functions and objects that can be used in VBA code, making it easy to access and manipulate data in a spreadsheet.
Most modern programming languages have a set of similar building blocks, for example:
Receiving input from the user and Showing output to the user
Ability to store values in variables (usually of different kinds such as integers, floating points or character)
A string of characters where you can store names, addresses, or any other kind of text
Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don't write that code 10 times; you write it once and tell the computer to loop through it 10 times
Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
Put your code in functions
Advanced data types that are formed through a combination of one or more types of basic data types such as structures or classes
Read file from a disk and save file to a disk
Ability to comment your code so you can understand it when you revisit it some time later
Let's dive right in and see how we can do these things in VBA.
0. Enable VBA in your Excel file
Before we can begin to write a program in VBA, also known as a macro, we need to enable the Developer tab. You can do this by going to File > Options > Customise Ribbon. Once the Developer tab is available, go there and choose the leftmost option, which says Visual Basic. Now you will see a panel on the left where you can double-click on the name of the sheet you are working on. This will open an empty code window. Here, write the following code and save the file as a macro-enabled workbook (extension .xlsm).
Sub simple_hello()
Range("A2").Value = "Hello World!"
End Sub
Close the file, then open it back again and choose the option (if shown) to enable macros. Now go to the Developer tab again and this time select the second option, called Macros. Here you should see the macro that you just created. Select it and hit Run!
1. Receiving input from the user and Showing output to the user
There are several ways in which a macro can show output to the user. Let’s look at some ways of showing output:
'Method 1:
Range("A2").Value = "Hello"
'Method 2:
Worksheets("Sheet1").Range("B2").Value = "Hello"
'Method 3:
Worksheets(1).Range("C2").Value = "Hello"
'Method 4:
MsgBox "I added Hello in cell A2, B2 and C2"
'Method 5:
MsgBox "Hello " & Range("C5").Value & vbNewLine & "So you are " & Range("C6") & " years old!"
2. Ability to store values in variables (usually of different kinds such as integers, floating points or character)
VBA allows 4 key types of variables: Integer, String, Double and Boolean
Integer is good for storing most numeric values, String is for character input, Double is for numbers with decimals, and Boolean is for a true/false type of data. Here are some examples:
'Integer:
Dim x As Integer
x = 6
Range("A1").Value = x
'String:
Dim book As String
book = "bible"
Range("A1").Value = book
'Double:
Dim x As Double
x = 5.5
MsgBox "value is " & x
'Boolean:
Dim continue As Boolean
continue = True
If continue = True Then MsgBox "Boolean variables are cool"
3. A string of characters where you can store names, addresses, or any other kind of text
Key idea here is to learn how to manipulate string variables. There are a few common operations that we will focus on:
a. Joining strings
'Join Strings
Dim text1 As String, text2 As String
text1 = "Hi"
text2 = "Tim"
MsgBox text1 & " " & text2
b. Left/right or middle functions - To extract the leftmost/rightmost or middle characters from a string.
Dim text As String
text = "example text"
MsgBox Left(text, 4)
'Just as with Left, we can also extract a substring from the right or middle
MsgBox Right("example text", 2)
MsgBox Mid("example text", 9, 2)
c. To get the length of a string, use Len.
MsgBox Len("example text")
d. To find the position of a substring in a string, use Instr.
MsgBox InStr("example text", "am")
4. Some advanced data types such as arrays which can store a series of regular variables (such as a series of integers)
Arrays are a series of values of a similar type stored together in one variable. Arrays can be one-dimensional or multi-dimensional.
a. Following example shows how a one-dimensional array works:
Dim Films(1 To 5) As String
Films(1) = "Lord of the Rings"
Films(2) = "Speed"
Films(3) = "Star Wars"
Films(4) = "The Godfather"
Films(5) = "Pulp Fiction"
MsgBox Films(4)
b. Following example shows how a two-dimensional array works:
Dim Films(1 To 5, 1 To 2) As String
Dim i As Integer, j As Integer
For i = 1 To 5
For j = 1 To 2
Films(i, j) = Cells(i, j).Value
Next j
Next i
MsgBox Films(4, 2)
5. Ability to loop your code, in the sense that if you want to receive 10 names from a user, you don't write that code 10 times; you write it once and tell the computer to loop through it 10 times
VBA has several looping options (for, do-while, do-until). There are also options for nesting loops (single, double, triple, and so on).
a. Following example shows how a simple/single for loop works:
Dim i As Integer
For i = 1 To 6
Cells(i, 1).Value = 100
Next i
b. Following example shows how a double for loop works:
Dim i As Integer, j As Integer
For i = 1 To 6
For j = 1 To 2
Cells(i, j).Value = 100
Next j
Next i
c. Following example shows how a triple for loop works:
Dim c As Integer, i As Integer, j As Integer
For c = 1 To 3
For i = 1 To 6
For j = 1 To 2
Worksheets(c).Cells(i, j).Value = 100
Next j
Next i
Next c
VBA also has a do-while loop. Following example shows how it works:
Dim i As Integer
i = 1
Do While i < 6
Cells(i, 1).Value = 20
i = i + 1
Loop
VBA also has a do-until loop. Following example shows how it works:
Dim i As Integer
i = 1
Do Until i > 6
Cells(i, 1).Value = 20
i = i + 1
Loop
6. Ability to execute statements of code conditionally, for example if marks are more than 40 then the student passes else fails
a. If Then Statement - VBA has the option of an if statement, which executes a piece of code only if a specified condition is met.
Dim score As Integer, result As String
score = Range("A1").Value
If score >= 60 Then result = "pass"
Range("B1").Value = result
b. If Else Statement - VBA has the option of an if-else statement, which executes one piece of code if a specified condition is met, and another piece of code if it is not.
Dim score As Integer, result As String
score = Range("A1").Value
If score >= 60 Then
result = "pass"
Else
result = "fail"
End If
Range("B1").Value = result
c. Select Case Statement - Instead of multiple If Then statements, you can use a Select Case structure, which checks a value against a series of cases and runs the code for the first case that matches.
'Select Case
'First, declare two variables. One variable of type Integer named score and one variable of type String named result
Dim score As Integer, result As String
'We initialize the variable score with the value of cell A1
score = Range("A1").Value
'Add the Select Case structure
Select Case score
Case Is >= 80
result = "very good"
Case Is >= 70
result = "good"
Case Is >= 60
result = "sufficient"
Case Else
result = "insufficient"
End Select
'Write the value of the variable result to cell B1
Range("B1").Value = result
7. Put your code in functions
VBA allows us to specify a function or a sub. The difference between the two is that a function allows us to return a value whereas a sub does not.
a. Function - If you want Excel VBA to perform a task that returns a result, you can use a function. Place a function into a module (In the Visual Basic Editor, click Insert, Module). For example, the function with name Area.
'Explanation: This function has two arguments (of type Double) and a return type (the part after As also of type Double). You can use the name of the function (Area) in your code to indicate which result you want to return (here x * y).
Function Area(x As Double, y As Double) As Double
Area = x * y
End Function
'Explanation: The function returns a value so you have to 'catch' this value in your code. You can use another variable (z) for this. Next, you can add another value to this variable (if you want). Finally, display the value using a MsgBox.
Dim z As Double
z = Area(3, 5) + 2
MsgBox z
b. Sub - If you want Excel VBA to perform some actions, you can use a sub.
Place a sub into a module (In the Visual Basic Editor, click Insert, Module). For example, the sub with name Area.
Sub Area(x As Double, y As Double)
MsgBox x * y
End Sub
'Explanation: This sub has two arguments (of type Double). It does not have a return type! You can refer to this sub (call the sub) from somewhere else in your code by simply using the name of the sub and giving a value for each argument.
'Call it using Area 3, 5
8. Advanced data types that are formed through a combinaiton of one or more types of basic data types such as structures or classes
VBA class modules allow us to create our own objects, combining data (properties) with the procedures that operate on that data (methods). Objects created from a class behave as independent units while sharing the behaviour defined in the class. A detailed example of how to do this is out of the scope of this article.
9. Read file from a disk and save file to a disk - Out of scope of this article.
10. Ability to comment your code so you can understand it when you revisit it some time later
We can tell VBA that a line of code is a comment by starting it with a single apostrophe (').
'this is a comment
Overall, Excel VBA is a powerful tool that can help users automate tasks, improve productivity, and enhance the functionality of Microsoft Excel. With its flexibility and ease of use, VBA is a valuable tool for users of all skill levels, from beginners to advanced programmers.
To close I will emphasize the importance of practicing in learning anything new. Persistence and trying out different combinations of these building blocks for solving easier problems first and more complex ones later on is the only way to become fluent.
Comments welcome!