This project began with background research on food security, both to understand the issue better and to decide how best to approach an exploratory analysis of related data on a global scale. The United Nations and other organizations are well versed in identifying the leading causes behind food insecurity. A few of these factors are explored in this report. There are also regional factors outside of the datasets used that are discussed later on. Besides confirming existing theories about food insecurity, the visualizations presented here help to show the magnitude of the problem for select areas, especially in the Global South.
To build this report, research was conducted first, followed by an outline of which types of visualizations would suit each EDA component given the selected datasets and the topic of global food insecurity. The report was generated in Jupyter Notebooks with Python code.
This task involved some data cleaning to get started, but the primary goal was to take a preliminary look at the data. This would aid in asking questions that could be answered by other visualizations.
There are a lot of countries in the dataset, but one way to group them that could relate to food insecurity is geography, since neighboring countries tend to share environmental and trade issues. The original dataset did not include a region column, so I needed a way to assign regions that was not tedious. First, I organized the countries into geographic groups as Python lists. Then, I wrote a program to add the new "Region" column using a lookup dictionary.
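The list-and-dictionary approach described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: the region lists are abbreviated, the "Country" column name is an assumption, and the data values are placeholders.

```python
import pandas as pd

# Hypothetical, abbreviated region lists; the real lists would cover every country.
africa = ["Kenya", "Nigeria"]
asia = ["Pakistan", "Bangladesh"]
europe = ["France", "Poland"]

region_lists = {"Africa": africa, "Asia": asia, "Europe": europe}

# Invert the lists into a single country -> region lookup dictionary.
region_lookup = {
    country: region
    for region, countries in region_lists.items()
    for country in countries
}

# Placeholder data; the "Country" column name is assumed, and "Atlantis"
# stands in for a country that was left out of every list.
df = pd.DataFrame({
    "Country": ["Kenya", "France", "Pakistan", "Atlantis"],
    "Value": [1.0, 2.0, 3.0, 4.0],
})

# map() returns NaN for any country missing from the lookup.
df["Region"] = df["Country"].map(region_lookup)

# Listing unmatched countries makes gaps in the lists easy to spot.
unmatched = df.loc[df["Region"].isna(), "Country"].tolist()
```

Checking the unmatched list after mapping is a cheap way to catch spelling differences between the lists and the dataset.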
Once the dataset was adjusted, box-and-whisker plots were created using seaborn to compare the distribution of food insecurity across regions. Food insecurity is measured as the three-year average (2020-2022) percentage of a given nation's population identified as food insecure by the Food and Agriculture Organization of the United Nations. According to the charts, Africa has the widest distribution of food insecure populations, with 50% of African nations being far more food insecure than nations from any other region. The exceptions are some nations in Asia and in Central America and the Caribbean, the region with the second widest distribution. The outliers indicate special cases that deserve their own attention, but the regions besides Africa and Central America and the Caribbean show relatively tight ranges.
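A regional box-and-whisker plot of this kind could be produced with seaborn along these lines. The sketch assumes the "Region" and "Value" columns described above and uses placeholder numbers, not FAO figures.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder values for illustration only, not FAO figures.
df = pd.DataFrame({
    "Region": ["Africa"] * 4 + ["Europe"] * 4,
    "Value": [20.0, 35.0, 50.0, 65.0, 3.0, 5.0, 6.0, 8.0],
})

fig, ax = plt.subplots(figsize=(8, 5))
# One box per region; the line inside each box marks the regional median.
sns.boxplot(data=df, x="Region", y="Value", ax=ax)
ax.set_ylabel("Food insecure population (%), 2020-2022 average")
ax.set_title("Food insecurity by region")

# The same medians as numbers, as a check on what the boxes show.
medians = df.groupby("Region")["Value"].median()
```

Comparing the medians is what supports statements about "50% of nations" in a region, since half of each region's countries sit above that line.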
Given that most data wrangling for these datasets was performed using pandas, it seemed appropriate to make a visualization using the pandas library as well. This process began by joining five datasets to the existing one based on the country name and the three-year range values. Each of these datasets was chosen after preliminary research into values of interest related to food insecurity. Because the datasets came from the same source, the country names followed the same naming conventions across datasets. Prior to merging, the datasets were reviewed for major anomalies, and each "Value" column was renamed to match its dataset. For example, the food imports dataset had its "Value" column renamed "Food Imports." Any null values present in the datasets would be handled later depending on the visualization.
The GDP dataset required extra handling since it did not have three-year rolling averages like the others. Once a column was created to address this, the datasets were merged with a left join onto the original food insecurity dataset. This means that the countries included in the food insecurity dataset are the only ones considered for analysis moving forward. That said, some of them might not have any data in the other datasets. These will remain as null values, as imputation does not seem necessary at this time.
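The rename, GDP adjustment, and left join described above might look roughly like the sketch below. The column names "Country" and "Year", the year-range label format, and the choice to compute a rolling mean for GDP are all assumptions made for illustration; the notebook's actual handling of GDP may differ. All values are placeholders.

```python
import pandas as pd

# Placeholder rows; "Country", "Year", and the range labels are assumed.
insecurity = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Year": ["2018-2020", "2019-2021", "2019-2021"],
    "Value": [10.0, 12.0, 40.0],
})
imports = pd.DataFrame({
    "Country": ["A"],
    "Year": ["2018-2020"],
    "Value": [15.0],
})
gdp = pd.DataFrame({
    "Country": ["A"] * 4,
    "Year": [2018, 2019, 2020, 2021],
    "Value": [100.0, 110.0, 120.0, 130.0],
})

# Rename "Value" so each dataset keeps its own column after merging.
imports = imports.rename(columns={"Value": "Food Imports"})

# One possible GDP adjustment: a three-year rolling mean per country,
# labeled with the same "start-end" range as the other datasets.
# This assumes consecutive annual rows with no gaps.
gdp = gdp.sort_values(["Country", "Year"])
gdp["GDP"] = gdp.groupby("Country")["Value"].transform(
    lambda s: s.rolling(window=3).mean()
)
gdp["Year"] = (gdp["Year"] - 2).astype(str) + "-" + gdp["Year"].astype(str)
gdp = gdp.dropna(subset=["GDP"])[["Country", "Year", "GDP"]]

# Left joins keep every food insecurity row; unmatched rows get NaN.
merged = (
    insecurity
    .merge(imports, on=["Country", "Year"], how="left")
    .merge(gdp, on=["Country", "Year"], how="left")
)
```

Because both joins are left joins, the merged frame always has the same number of rows as the food insecurity dataset, and countries absent from the other datasets simply carry null values.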
Note that the units in each of the datasets are percentages, with the exception of GDP. This particular GDP dataset was chosen because it reports GDP per capita at purchasing power parity (PPP). In other words, it provides some insight into the economic output of the average person while accounting for currency differences.
Having a high percentage of food imports (in relation to total exports) is not by itself an indicator of food insecurity. On the other hand, a high percentage of cereals, tubers, and roots in a diet can lead to food insecurity. To see just how often a nation with massive amounts of food imports could be relying on cereals, the visualizations below were created. In general, they seem to indicate a lack of a strong relationship between food imports and cereal imports. They also confirm that high levels of food insecurity are not necessarily tied to high food importation. The visualizations also suggest that nations with high food importation values and low cereal importation could potentially have a more nutritionally diverse diet, since they are not reliant on importing low-cost, calorie-dense foods. If they are able to import that much food, they likely have significant purchasing power. Nations with low cereal importation values could still have a high percentage of cereals and tubers in their diet. This would just mean that they rely more heavily on local agricultural production. By itself, this is not a problem. It only becomes an issue if environmental factors constantly disrupt that production.
The first visualization was created strictly using pandas. The second was created using matplotlib in order to allow the layering of data points to highlight the most food insecure nations. The visualizations only cover 2018-2020 since that was the only range with data available for both food and cereal imports.
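The two approaches can be sketched as follows: a quick pandas scatter plot, then a matplotlib version that draws the most food insecure nations as a second layer on top. The "Cereal Imports" column name and the 40% cutoff for "most food insecure" are assumptions for illustration, and the values are placeholders.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder values; "Cereal Imports" and the cutoff below are assumed.
df = pd.DataFrame({
    "Food Imports": [5.0, 20.0, 60.0, 150.0],
    "Cereal Imports": [10.0, -5.0, 40.0, 80.0],
    "Value": [8.0, 45.0, 12.0, 70.0],
})

# First pass: a quick scatter plot straight from pandas.
ax1 = df.plot.scatter(x="Food Imports", y="Cereal Imports")

# Second pass: split the data so each group can be drawn as its own layer.
threshold = 40.0  # hypothetical cutoff for "most food insecure"
high = df[df["Value"] >= threshold]
rest = df[df["Value"] < threshold]

fig, ax2 = plt.subplots()
ax2.scatter(rest["Food Imports"], rest["Cereal Imports"],
            color="lightgray", label="Other nations")
# A higher zorder keeps the highlighted points on top of the base layer.
ax2.scatter(high["Food Imports"], high["Cereal Imports"],
            color="crimson", zorder=3,
            label=f"Food insecurity >= {threshold:.0f}%")
ax2.set_xlabel("Food Imports")
ax2.set_ylabel("Cereal Imports")
ax2.legend()
```

Drawing the groups with separate scatter calls is what makes the layering possible, since each call can have its own color, label, and draw order.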
When deciding what type of visualization to present for this section of the report, I considered that a box-and-whisker plot could have worked here instead of at the beginning of the report. On the other hand, comparing the previous box-and-whisker plot with this histogram helps demonstrate how looking at the data without extra context such as regionality could lead to missing key insights.
The histogram with the KDE line below shows what percentage ranges for food insecurity are most frequent in the dataset on average from 2020-2022. Thankfully, most countries do not have high food insecurity percentages. However, the length of the tail is still concerning. To think that any country could be facing such drastic levels of food insecurity compared to the rest of the world seems unfathomable until you see the numbers.
Below is a Pearson correlation heatmap created using seaborn to show the relationship between different features in the dataset. The data used covers every year range available in the dataset (2014-2022). Based on the linked datasets related to food insecurity from the UN Food and Agriculture Organization, each of the features below may play a role in a given nation's food insecurity problem, with "Value" representing the food insecure population percentage. When looking at all countries together, we want to understand whether some factors play a larger role than others in causing or perhaps predicting food insecurity. Some features, like "Cereals in Diet," are evidence of food insecurity, while features like "Irrigated Arable Land" are of interest because they could indicate a nation at risk of major food insecurity.
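A heatmap of this kind can be sketched by computing the Pearson correlation matrix in pandas and passing it to seaborn. The "GDP" column name is an assumption, the values are placeholders, and the categorical region feature is left out here because it would first need a numeric encoding.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder values, not FAO figures; "GDP" is an assumed column name.
df = pd.DataFrame({
    "Value": [5.0, 10.0, 20.0, 40.0, 60.0],
    "Cereals in Diet": [25.0, 30.0, 45.0, 60.0, 70.0],
    "GDP": [50000.0, 30000.0, 12000.0, 4000.0, 1500.0],
    "Irrigated Arable Land": [10.0, 40.0, 5.0, 2.0, 1.0],
})

cols = ["Value", "Cereals in Diet", "GDP", "Irrigated Arable Land"]
# Pairwise Pearson correlation between the numeric features.
corr = df[cols].corr(method="pearson")

fig, ax = plt.subplots(figsize=(6, 5))
# Fixing the color scale to [-1, 1] keeps positive and negative
# correlations visually comparable.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax)
ax.set_title("Pearson correlation between features")
```

Pearson correlation only captures linear relationships, so a weak value in the matrix does not rule out a nonlinear one.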
Most of the relationships in the resulting heatmap are not particularly strong, with a few exceptions. Unsurprisingly, food insecurity seems to have a somewhat strong and positive relationship with the percentage of cereals, tubers, and roots in the average diet. Food insecurity also has a somewhat strong and negative relationship with GDP. There is also a relationship with the region, but that is harder to interpret since region is categorical. At the very least, it could indicate that some regions are more prone to food insecurity than others. The strongest relationship present is between GDP and the presence of cereals in the diet. This is not entirely surprising either, since cereals, tubers, and roots can be produced and purchased at a lower cost, thus becoming "staple foods." Even if a country with a high dependence on cereals in its diet is not food insecure at present, it could be at risk of becoming so if there are major disruptions to its source of food.
Note that some of these correlations need to be taken with a grain of salt, like those based on regions, since some regions have more countries to account for than others. However, these values are all in percentages to make this sort of comparison easier. It is also important to note that the values for each feature are three-year rolling averages.
The report has shown so far that GDP appears to play an important role in food insecurity, in particular when it comes to diversifying a nation's diet. Specific regions also seem to be more prone to food insecurity. Some of this may have to do with geography and the environment, but that is outside of the scope of this dataset. If regionality and GDP both play a role in food insecurity, it seems reasonable to theorize that there is a similar relationship worth observing between these two features.
Below is a matplotlib violin plot showing the distribution of GDP PPP per capita across each major geographic region. The regions with the widest and most concerning food insecurity distributions (Africa and Central America and the Caribbean) also have some of the narrowest GDP distribution ranges, with the lowest numbers across the dataset for 2019-2021. Frankly, none of this is too surprising, and there are definitely other factors to account for in each region. For example, the Caribbean has a long history of natural disasters that will only get worse as climate change progresses. Much of Europe, and later the United States, built its wealth and standing on the backs of the colonies it established around the world, destabilizing these nations into the present day. These last few facts are not represented in the data, but they are well documented in various studies and literature, one of my favorites being Walter Rodney's How Europe Underdeveloped Africa.
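The violin plot described at the start of this section could be built with matplotlib's violinplot roughly as follows. The "GDP" column name is an assumption and the values are placeholders, not actual GDP figures.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder GDP PPP per capita values; "GDP" is an assumed column name.
df = pd.DataFrame({
    "Region": ["Africa"] * 4 + ["Europe"] * 4,
    "GDP": [1500.0, 2500.0, 4000.0, 9000.0,
            30000.0, 42000.0, 55000.0, 70000.0],
})

regions = sorted(df["Region"].unique())
# violinplot expects one array per group and cannot handle NaN,
# so missing values are dropped before plotting.
groups = [
    df.loc[df["Region"] == region, "GDP"].dropna().to_numpy()
    for region in regions
]

fig, ax = plt.subplots()
parts = ax.violinplot(groups, showmedians=True)
# Violins are positioned at 1..N by default, so the ticks follow suit.
ax.set_xticks(range(1, len(regions) + 1))
ax.set_xticklabels(regions)
ax.set_ylabel("GDP PPP per capita, 2019-2021")
```

The width of each violin shows where a region's countries are concentrated, which makes narrow, low distributions easy to compare against wide, high ones.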
Since there are a total of six different datasets from the UN being used in this report, having certain data imbalances such as missing values is expected. Not all of the datasets cover more recent years nor is every country represented in all of the datasets if their governments did not report information for that timeframe. In this case, I wanted to visualize how often countries were not present in the data and for what timeframes since these missing values were more likely to skew any conclusions about the data.
Fortunately, the plot shows that more recent years have the most complete data reporting, even if the datasets presented here have not yet been updated for the most recent timeframes.
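The missing-value check described above can be sketched by counting nulls per feature within each timeframe and plotting the counts. The column names and year-range labels are assumptions, and the data is a placeholder rather than the merged dataset itself.

```python
import numpy as np
import pandas as pd

# Placeholder merged data; NaN marks a value a country did not report.
merged = pd.DataFrame({
    "Country": ["A", "B", "C", "A", "B", "C"],
    "Year": ["2014-2016"] * 3 + ["2019-2021"] * 3,
    "Food Imports": [np.nan, np.nan, 12.0, 10.0, 30.0, 14.0],
    "GDP": [np.nan, 900.0, 5000.0, 1000.0, 950.0, np.nan],
})

features = ["Food Imports", "GDP"]
# isna() gives True/False per cell; summing within each timeframe
# counts how many countries are missing each feature.
missing_by_year = merged[features].isna().groupby(merged["Year"]).sum()

ax = missing_by_year.plot.bar(rot=0)
ax.set_xlabel("Timeframe")
ax.set_ylabel("Countries with no reported value")
```

Grouping by timeframe rather than by country keeps the plot readable while still showing which periods are most affected by gaps.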
As the final portion of the analysis, I wanted to look at one feature that was not explored as much as the others: the percentage of arable land equipped for irrigation. Taking the time to train a regression model would be ideal, but it would be difficult to build anything with decent accuracy without more data. The data is from every year available in the dataset for this feature (2014-2020).
The percentage of arable land equipped with irrigation is included as part of this dataset because land with a man-made watering system in place is less vulnerable to environmental issues and thus the destruction of crops. This becomes especially important if a country is not heavily importing food and is dependent on the production of calorie-dense staple foods like cereals, tubers, and roots. Should disaster strike, the country can quickly become food insecure.
Based on the plot, irrigated arable land has a negative relationship with the percentage of food insecure people, which confirms what the heatmap earlier in the report indicated. The addition of color for the different geographic regions adds more insight into the relationship between food insecurity and agricultural production. Unsurprisingly, most nations are clustered at the lower end of the percentage of irrigated arable land. This could be because even if land is arable (fertile ground capable of agricultural growth), that does not mean it is actually used for growing crops. Even so, African nations are notably close to zero, with a few exceptions. A lot of European nations are also close to zero, but most of them have the purchasing power to import just about anything they want. The two major outliers on the higher end of irrigated arable land percentages are Pakistan and Bangladesh. Not every nation has data to report related to arable land, but South Sudan, which has 0% irrigated arable land and food insecurity levels above 60%, appears to be cut off in the visualization.
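A region-colored scatter plot like the one discussed above can be sketched in matplotlib as follows. The values are placeholders, not FAO figures. The explicit axis limits are a suggestion for keeping points at 0% irrigation or at very high food insecurity inside the frame; whether that is why a point appears cut off in the original plot is an assumption.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder values for illustration only.
df = pd.DataFrame({
    "Region": ["Africa", "Africa", "Asia", "Asia", "Europe"],
    "Irrigated Arable Land": [0.0, 2.0, 60.0, 90.0, 5.0],
    "Value": [62.0, 45.0, 20.0, 10.0, 6.0],
})

fig, ax = plt.subplots()
# One scatter call per region gives each region its own color and label.
for region, group in df.groupby("Region"):
    ax.scatter(group["Irrigated Arable Land"], group["Value"], label=region)

# Pad the limits so points sitting on an extreme are not clipped.
ax.set_xlim(-5, 105)
ax.set_ylim(0, df["Value"].max() + 10)
ax.set_xlabel("Arable land equipped for irrigation (%)")
ax.set_ylabel("Food insecure population (%)")
ax.legend(title="Region")

# Pearson r as a numeric check on the direction of the relationship.
r = df["Irrigated Arable Land"].corr(df["Value"])
```

A negative r agrees with the downward pattern in the plot, though with so few countries reporting this feature it should be read as a rough indication only.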
There are two additional ways to observe trends in global food security that are not included directly in this report. These include multivariate analysis and time series. Take a look at this additional report built with R for more information.
The visualizations in this report show clear trends in food insecurity. With more data and additional features, more observations could be made. However, one major takeaway that is visually obvious is that specific regions face far greater challenges with food insecurity than others. Certain plots made comparing the different regions the primary goal. Using color also proved helpful when comparing multiple features, including the region. No new relationships between features were discovered in this report, and existing theories were confirmed visually. Hunger in the Global South is not unheard of, but visualizing the primary causes and evidence of the issue helps highlight its severity.