You already know that Pandas is a powerful tool for data munging. In this tutorial, I will show you how to explore a data set using Pandas, NumPy and Matplotlib.
My goal for this project is to determine whether the income gap between Africa/Latin America/Asia and Europe/North America has increased, decreased or stayed the same during the last two decades.
So let’s get started.
Loading Files Into IPython Notebook
Using the list of countries by continent from World Atlas data, I load the countries.csv file into a Pandas DataFrame with pd.read_csv and name it count_df.
I then load the gapminder.xlsx file into a second Pandas DataFrame, complete_excel.
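A minimal sketch of the two loading steps follows. The helper names, the `Country` and `Continent` column names, and the in-memory sample are assumptions made for illustration; they are not the actual contents of countries.csv.

```python
import io

import pandas as pd


def load_countries(source):
    """Read the country-to-continent table from a path or file-like object."""
    return pd.read_csv(source)


def load_gapminder(path):
    """Read the Gapminder spreadsheet (reading .xlsx needs the openpyxl package)."""
    return pd.read_excel(path)


# Stand-in for countries.csv so the sketch runs without the real file.
# The column names are assumed, not taken from the original data.
sample_csv = io.StringIO(
    "Country,Continent\n"
    "Kenya,Africa\n"
    "Japan,Asia\n"
    "France,Europe\n"
)
count_df = load_countries(sample_csv)
print(count_df.head())

# With the real files in the working directory, the calls would look like:
# count_df = load_countries("countries.csv")
# complete_excel = load_gapminder("gapminder.xlsx")
```

Calling head() right after loading is a quick way to confirm that the columns were parsed the way you expect.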
Transforming the data
In this section, I transform the complete_excel data frame so that the years are the rows and the countries are the columns.
I will explain what is happening in the code line by line:
complete_excel[complete_excel.columns[0]] returns the first column of the complete_excel data frame, and I set that column (gdpc2011) as the index of my data frame. I don't want the index and the first column to hold the same values, so I delete the column with the drop command.
transfrom = complete_excel.drop(complete_excel.columns[0], axis=1)
After deleting the gdp pc column, I convert the year values from floats to integers. If you want to know how the map statement applies to a data frame, you can read my detailed explanation here.
transfrom.columns = list(map(lambda x: int(x), transfrom.columns))
Now I transpose this data frame so that the years become the rows.
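The whole transformation, including the transpose, can be sketched on a toy frame. The values below are made up, and the name `data` for the transposed result is an assumption, since the original name is not shown.

```python
import pandas as pd

# Toy stand-in for complete_excel: the first column holds country names and
# the remaining column labels are years stored as floats (values are made up).
complete_excel = pd.DataFrame(
    {
        "gdpc2011": ["Kenya", "Japan", "France"],
        1990.0: [1000.0, 30000.0, 28000.0],
        2000.0: [1100.0, 33000.0, 32000.0],
    }
)

first_col = complete_excel.columns[0]

# Use the country names as the index, then drop the now-redundant column.
complete_excel.index = complete_excel[first_col]
transfrom = complete_excel.drop(first_col, axis=1)

# Convert the float year labels to integers.
transfrom.columns = [int(year) for year in transfrom.columns]

# Transpose so that years become rows and countries become columns.
# "data" is a hypothetical name for the result.
data = transfrom.T
print(data)
```

The list comprehension does the same job as the map call; either form works, as long as the result is a list of integers.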
Plotting a Histogram
I plot a histogram for the year 2000, using dropna to exclude missing values for that year. The .ix indexer lets me select a subset of the rows and columns of a DataFrame; note that .ix was removed in pandas 1.0, and .loc is the label-based replacement.
I use a log scale to plot the values.
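Here is a hedged sketch of the histogram step on made-up data. It uses .loc for the row selection, so it runs on current pandas, and it assumes that "log scale" means plotting log10 of the values. The frame `data` and the country "Atlantis" are placeholders.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy years-by-countries frame (made-up values, NaN marks a missing entry).
data = pd.DataFrame(
    {
        "Kenya": [1000.0, 1100.0],
        "Japan": [30000.0, 33000.0],
        "France": [28000.0, 32000.0],
        "Atlantis": [np.nan, np.nan],  # placeholder country with no data
    },
    index=[1990, 2000],
)

# Select the row for 2000 by label and drop countries with no value.
income_2000 = data.loc[2000].dropna()

# Plot log10 of income so low- and high-income countries share one axis.
fig, ax = plt.subplots()
ax.hist(np.log10(income_2000), bins=10)
ax.set_xlabel("log10(income per person)")
ax.set_ylabel("Number of countries")
ax.set_title("Income per person, 2000")
# In a notebook, plt.show() would display the figure.
```

Taking log10 before plotting compresses the long right tail, which makes the low-income group much easier to see.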
Merging data frames
I use the merge function to merge two data frames (data1 and count_df), which attaches each country's continent to its data.
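A small sketch of the merge, assuming both frames share a `Country` key column. The key name, the `Income` column and the values are assumptions for illustration.

```python
import pandas as pd

# Income for one year, one row per country (made-up values).
data1 = pd.DataFrame(
    {
        "Country": ["Kenya", "Japan", "France"],
        "Income": [1100.0, 33000.0, 32000.0],
    }
)

# Country-to-continent lookup table (assumed layout of count_df).
count_df = pd.DataFrame(
    {
        "Country": ["Kenya", "Japan", "France", "Brazil"],
        "Continent": ["Africa", "Asia", "Europe", "South America"],
    }
)

# An inner join on the shared key keeps only countries present in both frames.
merged = pd.merge(data1, count_df, on="Country", how="inner")
print(merged)
```

With how="inner", a country that is missing from either frame (Brazil here) is dropped silently, so it is worth comparing row counts before and after the merge.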
Using Box plot for further exploration
I generate box plots to explore the trends for the years 1900, 1990 and 2003. I encourage you to explore the trends for 1950, 1960, 1970, 1980, 1990, 2000 and 2010 as well; you can use years = np.arange(1950, 2011, 10) to do that (the stop value is exclusive, so 2011 is needed to include 2010).
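One way to draw a box plot per continent for each decade is sketched below. The incomes are randomly generated, not real data, and the `Country`, `Continent` and `Income` column names are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

years = np.arange(1950, 2011, 10)  # stop is exclusive, so 2011 includes 2010

# Assumed country-to-continent table (stand-in for count_df).
count_df = pd.DataFrame(
    {
        "Country": ["Kenya", "Ghana", "Japan", "India", "France", "Spain"],
        "Continent": ["Africa", "Africa", "Asia", "Asia", "Europe", "Europe"],
    }
)

# Randomly generated incomes, NOT real data: rows are years, columns are countries.
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.uniform(500, 40000, size=(len(years), len(count_df))),
    index=years,
    columns=list(count_df["Country"]),
)

fig, axes = plt.subplots(1, len(years), figsize=(3 * len(years), 4), sharey=True)
for ax, year in zip(axes, years):
    # One row per country for this year, with its continent attached.
    year_df = pd.DataFrame(
        {"Country": list(data.columns), "Income": data.loc[year].to_numpy()}
    ).dropna()
    merged = year_df.merge(count_df, on="Country")

    # Group log10(income) by continent and draw one box per group.
    grouped = np.log10(merged["Income"]).groupby(merged["Continent"])
    names = [name for name, _ in grouped]
    ax.boxplot([values.to_numpy() for _, values in grouped])
    ax.set_xticks(list(range(1, len(names) + 1)))
    ax.set_xticklabels(names, rotation=45)
    ax.set_title(str(year))

axes[0].set_ylabel("log10(income per person)")
fig.tight_layout()
```

Sharing the y-axis across panels makes it easier to compare the spread of each continent from one decade to the next.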
If you explore the changes from 1950 to 2010, you can see that in most continents (especially Africa and Asia) the distribution of incomes is highly skewed: most countries sit in a group of low-income states, with a fat tail of high-income countries that remains approximately constant throughout the 20th century.
Now that you know how to explore data using Python, you are ready to start. You know how to load data into Python, clean and visualize it, and draw insights from it.
Here is a simple exercise for you to improve your data exploration skills.
Consider the distribution of income per person in two regions: Asia and South America. Estimate the average income per person across the countries in each region, using the year 2012. Which region has the larger average? Also create box plots to see the income distribution of the two continents on the dollar scale and the log10(dollar) scale.
If you have any additional questions, please let me know.