-
30 Day Daily Map Challenge
The #30DayMapChallenge was a daily social mapping project held in November 2020 that I participated in. I had a good understanding of Python before participating but very little knowledge of working with GIS data. I had plotted some data on maps using Microsoft Excel, where the data was usually in tabular format in an Excel or CSV file. However, I had never heard of shapefiles, DEMs (Digital Elevation Models), DSMs (Digital Surface Models), or DTMs (Digital Terrain Models). Understanding these formats, what they contain, and how they can be processed was the biggest challenge I encountered.
GeoPandas is a Python library for working with geospatial data, built on top of the Pandas library. It extends the Pandas data analysis library to enable spatial operations on geometric types. GeoPandas provides data structures to represent geometric objects (points, lines, polygons) and allows for spatial operations on them, such as intersection and distance. GeoPandas also provides tools for reading and writing geospatial data in various formats and for visualizing geospatial data.
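As a quick illustration of those spatial operations, here is a minimal sketch with made-up coordinates (two points checked against a polygon, then measured against another point):

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# a GeoSeries of two points and a polygon, with made-up coordinates
points = gpd.GeoSeries([Point(1, 1), Point(5, 5)])
square = Polygon([(0, 0), (3, 0), (3, 3), (0, 3)])

print(points.within(square))         # which points fall inside the polygon
print(points.distance(Point(0, 0)))  # distance from each point to the origin
```

The same `within` and `distance` calls work on full GeoDataFrames loaded from real files.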
All Submissions - HD
How to install GeoPandas?
GeoPandas makes it a breeze to deal with these file formats; the challenge, however, is installing it. GeoPandas often runs into dependency conflicts during installation, and the best approach I found was to first install Anaconda Python, then create a new environment and install GeoPandas in it.
conda create -n geopandas_env  # create a new environment
conda activate geopandas_env  # activate this environment
conda config --env --add channels conda-forge  # add the conda-forge channel to your environment
conda config --env --set channel_priority strict  # prioritize conda-forge so dependent packages won't conflict
conda install python=3 geopandas  # install GeoPandas using conda
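Once the environment is set up, a quick import check confirms the install worked:

```python
# sanity check that the new environment picked up GeoPandas
import geopandas

print(geopandas.__version__)
```

If this prints a version number without errors, the installation succeeded.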
Basic steps to use Geopandas
Now that we have GeoPandas installed, let's move on to the interesting part. Here are some basic steps for using GeoPandas:
# importing the Geopandas library
import geopandas as gpd
# Read data: Load the data from the shapefile or other geospatial data sources. You can use the following command to read shapefiles:
data = gpd.read_file('/path/to/shapefile.shp')
# Explore the data: Geopandas data is stored in a pandas dataframe. You can explore the data using the same commands you would use for any pandas dataframe, for example:
data.head()
data.info()
data.describe()
# Plotting: Geopandas provides some basic plotting functionality. You can plot the data using the following command:
data.plot()
# Geospatial operations: You can perform a wide variety of geospatial operations on the data using Geopandas. Here are some examples:
# Subset data based on spatial location (xmin, ymin, xmax, ymax hold your bounding-box coordinates):
bbox = (xmin, ymin, xmax, ymax)
subset_data = data.cx[bbox[0]:bbox[2], bbox[1]:bbox[3]]
# Buffering points:
data['buffered'] = data.buffer(distance)
# Overlaying data:
overlay_data = gpd.overlay(data1, data2, how='intersection')
# Spatial joins:
join_data = gpd.sjoin(data1, data2, how='inner', predicate='intersects')  # 'predicate' replaces the deprecated 'op' argument in recent GeoPandas
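To see these operations end to end without needing a shapefile, here is a small self-contained sketch; the station and city names and coordinates are made up for illustration:

```python
import geopandas as gpd
from shapely.geometry import Point

# two tiny GeoDataFrames built in memory (stand-ins for real shapefiles)
stations = gpd.GeoDataFrame({"name": ["A", "B"]},
                            geometry=[Point(0, 0), Point(10, 10)])
cities = gpd.GeoDataFrame({"city": ["X"]},
                          geometry=[Point(0.5, 0.5)])

# buffer each station by 2 units, then spatially join cities inside a buffer
zones = stations.set_geometry(stations.buffer(2))
joined = gpd.sjoin(cities, zones, how='inner', predicate='intersects')
print(joined[["city", "name"]])  # city X falls inside station A's buffer
```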
How to create a map in GeoPandas
Here’s example code to create a map of India and plot its states on it using GeoPandas. We first load the shapefiles for India and the state boundaries using GeoPandas’ read_file function. We then create a figure and axis object using Matplotlib’s subplots function and plot the India map on the axis object. Finally, we plot the state boundaries on top of the map and add a title using Matplotlib’s set_title function. The resulting plot shows the map of India with the state boundaries on top of it.
import geopandas as gpd
import matplotlib.pyplot as plt
# Load the shapefile for India and the state boundaries
india = gpd.read_file('path/to/India.shp')
states = gpd.read_file('path/to/India_States.shp')
# Create a figure and axis object
fig, ax = plt.subplots(figsize=(10, 10))
# Plot the India map
india.plot(ax=ax, color='lightgrey', edgecolor='white')
# Plot the state boundaries on top of the map
states.plot(ax=ax, color='none', edgecolor='black', linewidth=0.5)
# Add a title to the plot
ax.set_title('Map of Indian States')
# Show the plot
plt.show()
Where can you download geospatial data from?
There are many places where you can download geospatial data from, depending on your needs. Some common sources of geospatial data include:
USGS EarthExplorer: The USGS EarthExplorer is a great source for downloading aerial and satellite imagery, as well as elevation data.
Natural Earth: Natural Earth is a public domain map dataset available at a variety of scales, from 1:10m to 1:250m, which can be used for web mapping and GIS projects.
OpenStreetMap: OpenStreetMap is a community-driven map dataset that can be used to create custom maps and for geocoding applications.
NASA Earthdata: NASA Earthdata provides access to a wide range of earth science data, including satellite imagery and climate data.
Global Administrative Areas (GADM): GADM provides administrative boundaries for countries, including states and counties.
WorldClim: WorldClim provides global climate data at a range of spatial resolutions.
National Oceanic and Atmospheric Administration (NOAA): NOAA provides a range of geospatial data, including marine and coastal data, as well as weather and climate data.
Best practices for working with geospatial data
Working with geospatial data can be challenging and requires some best practices to ensure the accuracy and reliability of your results. Here are some best practices for working with geospatial data:
Use projections: A projection is a mathematical method for representing the curved surface of the Earth on a flat map. Different projections are suitable for different regions and uses. Be sure to use an appropriate projection for your data.
Check data quality: Before using any geospatial data, it’s important to check for completeness, consistency, and accuracy. You can do this by examining the metadata, visualizing the data, and performing data validation.
Manage data volumes: Geospatial data can be extremely large, especially if you are working with high-resolution imagery or vector data. Be sure to manage your data volumes carefully, using techniques such as data compression, subsampling, and tiling.
Choose the right software: There are many software options available for working with geospatial data, including open source tools like QGIS and proprietary tools like ArcGIS. Choose a tool that is suitable for your data and your workflow.
Document your workflow: Document your workflow and the decisions you make as you work with your geospatial data. This can help you reproduce your results and share your findings with others.
Use cloud-based services: Cloud-based services can provide easy access to geospatial data and processing power, especially for large data volumes or complex analysis workflows.
Be mindful of licensing: Many geospatial datasets have specific licensing requirements, and it’s important to be aware of these before using the data. Be sure to read and follow the license terms carefully.
Seek out community resources: The geospatial community is large and active, with many resources available online. These can include tutorials, forums, and code repositories. Seek out these resources to learn from others and to share your own experiences.
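As an example of the projections point above, here is a minimal sketch of reprojecting with GeoPandas' to_crs; the single city and its coordinates are just an illustration:

```python
import geopandas as gpd
from shapely.geometry import Point

# one point in geographic coordinates (longitude/latitude, EPSG:4326)
gdf = gpd.GeoDataFrame({"city": ["Delhi"]},
                       geometry=[Point(77.2090, 28.6139)],
                       crs="EPSG:4326")

# reproject to Web Mercator (EPSG:3857), whose units are metres
projected = gdf.to_crs(epsg=3857)
print(projected.crs)                 # now EPSG:3857
print(projected.geometry.iloc[0].x)  # x is now in metres, not degrees
```

Choosing a projected CRS like this matters whenever you compute buffers, areas, or distances, which are meaningless in raw degrees.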
On a closing note, geospatial data is an exciting and rapidly evolving field, and working with it can be a lot of fun. Participating in challenges like the #30DayMapChallenge is a great way to improve your skills and learn new things.
Comments welcome!
-
30 Day Daily Chart Challenge
During the month of April 2021, I participated in the #30DayChartChallenge, a daily social data project hosted by Cédric Scherer, Maya Gans and Dominic Royé. In this blog post I want to briefly talk about the project and share my experience, my learnings in terms of data and tools used, and the challenges I faced. Before I delve into that, here is a collage of my submissions.
HD Version:
The ask
The project was about plotting a chart a day using any tool and any relevant dataset. Each day had a different theme: there were five main themes with sub-themes within them, as shown below. For example, for day 1 (part-to-whole comparison), one could plot the number of Indian Netflix viewers out of the total number of Netflix viewers across the world. For day 15 (multivariate relationships), one could plot the relationship between technology indicators and stock market performance.
Tools used
I mainly used Python for plotting, with several visualization libraries such as Plotly, Altair, Seaborn, Matplotlib, and pandas.plot. I wanted to use Bokeh and plotnine for some later submissions but did not have time alongside my job, so I ended up reusing some of the code from my initial submissions. Overall, I found that Plotly was best for customization, while Seaborn and Altair let you create neat charts very quickly.
Additionally, one of the most useful libraries I worked with was dataprep. It is a fast and easy exploratory data analysis tool that lets you understand a Pandas/Dask DataFrame with a few lines of code in seconds. I was able to quickly slice and dice data in a visual manner using the create_report function, which let me quickly decide whether a particular dataset was relevant to the topic of the day. The following code snippet shows how:
from dataprep.datasets import load_dataset
from dataprep.eda import create_report
df = load_dataset("titanic")
create_report(df).show_browser()
Few of my submissions
For day 3, I found an interesting dataset on OWID about the number of working hours. In general, working hours in most countries have gone down over time, with a few exceptions. I used Plotly to create a line chart showing this; the y-axis shows average working hours per worker over a full year. I used a for loop to add each line to the plot with a separate color.
for i in range(0, y_axis_levels):
    fig.add_trace(go.Scatter(x=x_data[i], y=y_data[i], mode='lines',
                             name=y_axis_labels[i],
                             line=dict(color=colors[i], width=line_size[i]),
                             connectgaps=True,
                             ))
For day 5, I found an interesting dataset on OWID about energy use per person by country. In general, energy use has been increasing dramatically, with some of the most developed countries having the highest energy use per capita. I tried to capture this across time using a slope chart plotted with Plotly, using a for loop to plot a separate line for each country and adding vertical lines to mark the two points in time.
# lines by country
for x_val, y_val, cat_val in zip(year1val, year2val, cat1val):
    fig.add_trace(go.Scatter(x=[year1, year2], y=[x_val, y_val], mode='lines+markers+text',
                             text=[cat_val, cat_val], textposition=['middle left', 'middle right']))
# vertical lines representing time
fig.add_shape(type="line", x0=year1, y0=vp_min, x1=year1, y1=vp_max, line=dict(color="Grey", width=2))
fig.add_shape(type="line", x0=year2, y0=vp_min, x1=year2, y1=vp_max, line=dict(color="Grey", width=2))
For day 10, I focused on showing the meat consumption trend using Altair. The library makes it really simple to plot shapes on a chart. Following is partial code and the output.
chart = alt.Chart(source).mark_point(filled=True, opacity=1, size=100).encode(
    alt.X('x:O', axis=None),
    alt.Y('animal:O', axis=None),
    alt.Row('year:N', header=alt.Header(title='')),
    alt.Shape('animal:N', legend=None, scale=shape_scale),
    alt.Color('animal:N', legend=None, scale=color_scale),
)
For day 11, I decided to plot the number of COVID-19 cases over a period of one year in Chile, using Matplotlib. Following is partial code and the output.
angles0 = np.random.normal(loc=0, scale=1, size=10000)
angles1 = np.random.uniform(0, 2*np.pi, size=1000)
# Construct figure and axis to plot on
fig, ax = plt.subplots(1, 2, subplot_kw=dict(projection='polar'))
# Visualise by area of bins
circular_hist(ax[0], angles0)
# Visualise by radius of bins
circular_hist(ax[1], angles1, offset=np.pi/2, density=False)
Day 16 was about trees, so I used the Kaggle dataset on mushrooms, ran a decision tree model with default parameter values to classify mushrooms as poisonous or not, and plotted the resulting tree.
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Fit the classifier with default hyper-parameters
clf = DecisionTreeClassifier(random_state=1234)
model = clf.fit(X, y)

# Print text representation
text_representation = tree.export_text(clf)
print(text_representation)

# Plot tree with plot_tree
fig = plt.figure(figsize=(25, 20), facecolor='white')
_ = tree.plot_tree(clf,
                   feature_names=feature_names1,
                   class_names=df['class'].unique(),
                   filled=True)
fig.savefig("decision_tree.png")
Day 24 was limiting in that you could only plot a monochrome chart. I found an interesting dataset on US post offices showing how many post offices were established and discontinued by year. I plotted the established count on the +y axis and the discontinued count on the -y axis.
from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(rows=2, cols=1,
                    shared_xaxes=True, vertical_spacing=0.02)
fig.add_trace(
    go.Bar(x=list(e.established), y=list(e.state), name='Established', marker_color=chosen_color),
    row=1, col=1
)
fig.add_trace(
    go.Bar(x=list(d.discontinued), y=list(d.state2), name='Discontinued', marker_color=chosen_color),
    row=2, col=1
)
The challenge
The biggest challenge was certainly finding the right data for the kind of chart to be plotted on a particular day. Following are some of the websites I used for getting the data: data.world, Kaggle, TidyTuesday, MakeoverMonday, Eurostat, UN Stats, WHO, and OECD Stats.
For complete version of the code snapshots shown in this blog visit my GitHub.
Comments welcome!
-
TidyTuesday Weekly Visualization Challenge
Since October 2020, I have been participating in TidyTuesday, a weekly social data project hosted by the R for Data Science Online Learning Community. This enabled me to quickly up-skill the visualization aspect of my data science journey. I also had the pleasure of engaging with several like-minded people and curating and learning from their submissions and thought processes. In the following post, I share my experience and learnings.
Packages available for plotting
There are many Python packages available for plotting, each with its own strengths and weaknesses. Here are some of the most popular ones:
Matplotlib: This is the most widely used library for creating static, interactive, and animated visualizations in Python. It is highly customizable and can create a wide variety of charts and graphs.
Seaborn: Seaborn is built on top of Matplotlib and provides a high-level interface for creating statistical visualizations. It is particularly good for creating heatmaps, violin plots, and other complex visualizations.
Plotly: This library provides interactive, web-based visualizations that can be embedded in websites or Jupyter notebooks. It supports a wide variety of chart types, including scatter plots, line charts, and 3D surface plots.
Bokeh: Similar to Plotly, Bokeh provides interactive visualizations that can be embedded in web applications. It is particularly good for creating interactive scatter plots and geographical maps.
plotnine: This library is based on the popular R package ggplot2 and provides a similar grammar for creating plots in Python. It is particularly good for creating aesthetically pleasing plots with minimal code.
Altair: Altair is a declarative visualization library that allows you to create complex visualizations with minimal code. It uses a concise syntax that is based on the Vega-Lite visualization grammar.
Here is a snapshot of the visualizations I created on a dataset from Water Point Data Exchange.
Considerations for plotting
While Python offers a wide variety of plotting libraries, there are still some challenges and considerations to keep in mind when plotting data. Here are a few:
Data formatting: Before you can plot your data, you may need to format it properly. This can involve converting data types, cleaning up missing values, and transforming data into the right shape for a particular plot. This can be time-consuming and may require some knowledge of data manipulation in Python.
Choosing the right plot: There are many types of plots to choose from, each with their own strengths and weaknesses. Choosing the right plot for your data can be challenging, especially if you are not familiar with the different types of plots available or if you are trying to convey complex relationships.
Customizing the plot: Once you have chosen a plot, you may need to customize it to suit your needs. This can involve changing the colors, labels, and other visual properties of the plot. Customizing the plot can be time-consuming and may require some knowledge of the plotting library you are using.
Performance: Depending on the size of your dataset and the complexity of your plot, generating a plot can be computationally intensive. This can lead to slow plot generation times, especially if you are working with large datasets or are trying to create interactive plots.
Accessibility: When creating plots, it is important to consider accessibility for people with visual impairments. This can involve using high-contrast colors, providing alternative text descriptions, and ensuring that the plot is readable when printed in black and white.
Reproducibility: If you are creating plots for research purposes, it is important to ensure that your plots are reproducible. This involves documenting the code and data used to create the plot, so that others can recreate the same plot in the future.
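The data-formatting point above is usually the first hurdle; a typical pandas cleanup, shown here on a made-up toy frame, might look like:

```python
import pandas as pd

# toy raw data: a missing date and numbers stored as strings
raw = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-02", None],
    "viewers": ["10", "12", "9"],
})

# drop rows with missing dates, then convert types for plotting
clean = raw.dropna(subset=["date"]).copy()
clean["date"] = pd.to_datetime(clean["date"])
clean["viewers"] = clean["viewers"].astype(int)
print(clean.dtypes)
```

Only after steps like these will most plotting libraries treat the date axis and numeric axis correctly.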
Here is a snapshot of the visualizations I created on a dataset that consists of tv shows and movies available on Netflix as of 2019.
Choosing the right plot to represent your data
Choosing the right kind of plot depends on several factors, including the type of data you are working with, the questions you want to answer, and the audience you are presenting to. The key is to choose a plot that effectively communicates your data and insights to your audience. It may be helpful to experiment with different types of plots and to seek feedback from others to ensure that your plot is clear and understandable. Here are some guidelines for choosing the right kind of plot for your data:
Line charts: Line charts are useful for showing trends over time or for comparing trends between different groups. They work well for data that is continuous and evenly spaced, such as stock prices or weather data.
Bar charts: Bar charts are useful for comparing values between different categories. They work well for data that is categorical or discrete, such as survey responses or sales data.
Scatter plots: Scatter plots are useful for showing the relationship between two variables. They work well for data that is continuous and can show patterns or correlations in the data.
Histograms: Histograms are useful for showing the distribution of a single variable. They work well for data that is continuous and can show the range, frequency, and shape of the data.
Box plots: Box plots are useful for showing the distribution of a single variable across different groups. They work well for data that is continuous and can show the median, quartiles, and outliers in the data.
Heatmaps: Heatmaps are useful for showing the relationship between two variables using color. They work well for data that is categorical or continuous and can show patterns or clusters in the data.
Choropleth maps: Choropleth maps are useful for showing regional differences in data. They work well for data that is geographic in nature, such as population density or election results.
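To make the first two guidelines concrete, here is a small matplotlib sketch (with made-up numbers) that renders a line chart for a time trend next to a bar chart for categories:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
rainfall = [12, 30, 55, 41]        # continuous trend over time -> line chart
regions = ["North", "South", "East", "West"]
sales = [250, 180, 320, 210]       # discrete categories -> bar chart

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3.5))
ax1.plot(months, rainfall, marker="o")
ax1.set_title("Trend over time: line chart")
ax2.bar(regions, sales)
ax2.set_title("Category comparison: bar chart")
fig.tight_layout()
fig.savefig("chart_choice.png")
```

Swapping the two (bars for the trend, a line across regions) would still render, but would mislead the reader about continuity in the data.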
Here is a snapshot of the visualizations I created on a dataset detailing events across all 40 seasons of the US Survivor, including castaway information, vote history, immunity and reward challenge winners and jury votes.
Choosing the right package for building a particular plot
Here are some examples of plots that are specific to a particular Python package or easily built using a particular Python package:
Violin plots: Violin plots are a type of plot that combine a box plot with a kernel density plot. They are commonly used for visualizing the distribution of a dataset. Violin plots can be easily built using the seaborn package, which provides tools for statistical visualization in Python.
Sankey diagrams: Sankey diagrams are a type of flow diagram that show the flow of values between different categories. They can be easily built using the plotly package, which provides tools for creating interactive, web-based visualizations in Python.
3D surface plots: 3D surface plots are a type of plot that show the relationship between three variables. They can be easily built using the plotly or matplotlib package, which both provide tools for creating 3D visualizations in Python.
Heatmaps: Heatmaps are a type of plot that show the relationship between two variables using color. They can be easily built using the seaborn package, which provides tools for statistical visualization in Python.
Treemaps: Treemaps are a type of plot that show hierarchical data using nested rectangles. They can be easily built using the squarify package, which provides tools for creating treemaps in Python.
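Seaborn's violinplot is the usual one-liner for the first item above, but plain matplotlib also ships a violin plot; here is a minimal sketch using synthetic data:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np

# three synthetic groups with different centres
rng = np.random.default_rng(42)
groups = [rng.normal(loc=m, scale=1.0, size=200) for m in (0, 2, 4)]

fig, ax = plt.subplots()
ax.violinplot(groups, showmedians=True)
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(["A", "B", "C"])
fig.savefig("violin.png")
```

Seaborn's version adds the kernel-density styling and grouping semantics on top of essentially this same plot.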
Here is a snapshot of the visualizations I created on a dataset detailing salaries and other information of 24,000+ survey participants.
Benefits of interacting with other data visualization folks
Interacting with other data visualization folks can offer several benefits, including:
Learning new techniques: By interacting with other data visualization professionals, you can learn new techniques for creating effective visualizations. This can help you to improve your skills and create more compelling visualizations.
Getting feedback: When you share your visualizations with other professionals, they can offer feedback on how to improve your work. This can help you to identify areas for improvement and create more effective visualizations.
Discovering new tools: There are many different tools and libraries available for creating data visualizations in Python and other programming languages. By interacting with other professionals, you can discover new tools and libraries that can help you to create better visualizations.
Finding inspiration: Sometimes, seeing how other professionals have approached data visualization can provide inspiration for your own work. By interacting with other professionals, you can see a variety of different approaches and ideas, which can help you to create more interesting and innovative visualizations.
Building your network: Interacting with other data visualization professionals can help you to build your network and connect with others in your field. This can lead to new opportunities for collaboration and career advancement.
Here is a snapshot of the visualizations I created on a dataset about homeless shelters across Toronto, Canada.
Overall, interacting with other data visualization professionals can help you to improve your skills, create more effective visualizations, and build your network. Whether through online communities, meetups, or conferences, there are many opportunities to connect with other professionals and benefit from their insights and expertise.
I would highly encourage others to participate in the #TidyTuesday weekly visualization challenge.
Comments welcome!
-
Perspective: A Lesson from The Kite Runner
Have you ever looked back on a moment in your life and realized you saw it completely differently at the time? Our perspective shapes the way we understand events, people, and even ourselves. Khaled Hosseini’s The Kite Runner masterfully explores the power of perspective through its protagonist, Amir, and his journey of redemption. The novel provides several poignant moments where a shift in perspective redefines reality, reminding us of the importance of seeing beyond our own biases and assumptions.
A Child’s Perspective: The Privilege of Innocence
In the beginning, Amir enjoys a privileged life in Kabul, unaware of the deep societal divides that separate him from Hassan, his Hazara servant and best friend. To Amir, their friendship is pure and unaffected by status. However, Hassan, though younger, understands the weight of their differences. One of the most heartbreaking moments occurs when Amir fails to stand up for Hassan in the alley. From Amir’s limited perspective, his silence is self-preservation, but with time, he realizes it was cowardice—a realization that haunts him into adulthood.
“I ran because I was a coward. I was afraid of Assef and what he would do to me.” This self-awareness only develops later, demonstrating how perspective matures with experience.
The Father-Son Lens: Misunderstood Love
Baba, Amir’s father, is another character whose perspective is misunderstood. Amir believes Baba favors strength and physical courage over intellect, leading to deep insecurities. However, as the novel unfolds, Amir learns of Baba’s sacrifices and hidden struggles—his illegitimate son, his moral dilemmas, and the burden of expectations.
A key moment of realization comes when Baba tells Amir, “There is only one sin, only one. And that is theft… When you tell a lie, you steal someone’s right to the truth.” This lesson, initially abstract to Amir, takes on a new meaning as he matures and understands the gravity of deception—not just in others, but within himself.
Redemption and a Shift in Perspective
Perspective is often best understood in hindsight. Amir’s journey to atone for his past mistakes brings him back to Afghanistan, where he sees his homeland through the eyes of suffering. The Taliban’s rule has reshaped the Kabul of his childhood into an unrecognizable and brutal landscape. His perception of Hassan also shifts dramatically when he discovers the truth about their relationship—that they were brothers.
His final act—rescuing Sohrab—is not just a physical redemption but a transformation of his worldview. He finally understands what it means to be truly selfless, to take action rather than remain passive.
Final Thoughts: Expanding Our Own Perspective
Amir’s journey reminds us that perspective is ever-changing, molded by experience, knowledge, and time. Whether in literature or in life, understanding different perspectives fosters empathy and growth. Just like Amir, we must be willing to look beyond our immediate view and challenge our own biases.
After all, true transformation begins when we allow ourselves to see the world through another’s eyes. How has a shift in perspective changed the way you see a person or situation in your own life?