The goal of this assignment is to explore the given dataset and combine it with a public dataset of our choice to formulate a hypothesis and test it.
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
Note: One problem encountered was that Korean characters appear garbled when pandas.read_csv() is used with its default settings. This is because read_csv() assumes UTF-8 by default, while the file is encoded in EUC-KR, so the encoding must be supplied explicitly for Korean characters to display properly. After displaying them using the following code, 'data wrangling' (a.k.a. 'data munging') was required to properly use district names in the later part of this assignment.
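As a simpler alternative to the manual decoding below, the encoding can be passed directly to pandas.read_csv(). A minimal self-contained sketch; the sample file and district names here are made up for illustration, and the dataset's EUC-KR encoding is inferred from the codecs.open call below:

```python
import pandas as pd

# Write a tiny sample CSV in EUC-KR, mimicking the dataset's encoding.
sample = 'district,total\n강남,28000\n서초,25000\n'
with open('sample_euckr.csv', 'wb') as f:
    f.write(sample.encode('euc-kr'))

# Passing the encoding explicitly lets read_csv decode Korean correctly.
df = pd.read_csv('sample_euckr.csv', encoding='euc-kr')
print(df['district'].tolist())
```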
# reference
# http://freeprog.tistory.com/10
import codecs
import csv
# unicode_csv_reader(), utf_8_encoder()
# ---> provided at https://docs.python.org/2/library/csv.html
def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
data = []
with codecs.open('1_total.csv', 'r', encoding='euc-kr') as csvf:
    rd = unicode_csv_reader(csvf)
    t = next(rd)
    for fieldname in t:
        print fieldname,
    print
    for idx, data in enumerate(rd):
        print idx, data[0], data[1], data[2]
# Read data
df_total = pd.read_csv('1_total.csv')
df_top_business = pd.read_csv('1_top_business.csv')
df_medical = pd.read_csv('1_medical.csv')
df_education = pd.read_csv('1_education.csv')
df_food = pd.read_csv('1_food.csv')
df_wholesale = pd.read_csv('1_wholesale.csv')
df_service = pd.read_csv('1_service.csv')
df_total.head()
total_lst = df_total['total'].tolist()
top_business_lst = df_top_business['total'].tolist()
# Individual sectors are as follows
medical_lst = df_medical['total'].tolist()
education_lst = df_education['total'].tolist()
food_lst = df_food['total'].tolist()
wholesale_lst = df_wholesale['total'].tolist()
service_lst = df_service['total'].tolist()
# Brief statistics
print 'Overview\n'
print 'Max: ' + str(df_total['total'].max())
print 'Min: ' + str(df_total['total'].min())
print 'Max - Min: ' + str( df_total['total'].max() - df_total['total'].min() )
print 'Average: ' + str(int(df_total['total'].mean()))
print 'Standard deviation: ' + str( int(df_total['total'].std()) )
one_sd_range = [int(df_total['total'].mean() - df_total['total'].std()), int(df_total['total'].mean() + df_total['total'].std())]
two_sd_range = [int(df_total['total'].mean() - 2*df_total['total'].std()), int(df_total['total'].mean() + 2*df_total['total'].std())]
print '68% range: ' + str(one_sd_range) + ' (one standard deviation)'
print '95% range: ' + str(two_sd_range) + ' (two standard deviations)'
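As a sanity check on the one-standard-deviation range: for roughly normal data, about 68% of values fall within one standard deviation of the mean (and about 95% within two). A sketch with synthetic data drawn to mirror the mean and standard deviation printed above:

```python
import numpy as np

# Synthetic sample mirroring the mean (~14787) and std (~5857) printed above.
rng = np.random.default_rng(0)
values = rng.normal(loc=14787, scale=5857, size=100000)

mean, std = values.mean(), values.std()
within_one_sd = ((values >= mean - std) & (values <= mean + std)).mean()
print(round(within_one_sd, 2))  # close to 0.68 for normal data
```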
plt.hist(total_lst,bins=100)
plt.xlabel('Revenue (Won)')
plt.ylabel('Frequency')
plt.title(r'$\mathrm{Total\ Revenue\ Distribution:}\ \mu=14787,\ \sigma=5857$')
plt.axis([5000, 40000, 0, 4])
plt.show()
print 'Total number of districts: ' + str(len(total_lst))
top_6_dists = 0
bottom_19_dists = 0
for i in range(0, 6):
    top_6_dists = top_6_dists + top_business_lst[i]
for j in range(6, 25):
    bottom_19_dists = bottom_19_dists + top_business_lst[j]
print 'Revenue by top 6 districts: ' + str(top_6_dists)
print 'Revenue by bottom 19 districts: ' + str(bottom_19_dists)
print
print
print 'Get data before visualizing to identify which business sector contributes the most\n'
print 'Education Max: ' + str(df_education['total'].max())
print 'Education Min: ' + str(df_education['total'].min())
print
print 'Medical Max: ' + str(df_medical['total'].max())
print 'Medical Min: ' + str(df_medical['total'].min())
print
print 'Food Max: ' + str(df_food['total'].max())
print 'Food Min: ' + str(df_food['total'].min())
print
print 'Wholesale Max: ' + str(df_wholesale['total'].max())
print 'Wholesale Min: ' + str(df_wholesale['total'].min())
# create a multi-dimensional array, the format needed for the grouped histogram
multi_lst = [education_lst, medical_lst, food_lst, wholesale_lst, service_lst]
arr = np.transpose(np.array(multi_lst))
n_bins = 10
fig, axes = plt.subplots(nrows=1, ncols=2,figsize=(10,5))
ax0, ax1 = axes.flat
colors = ['red', 'grey', 'lime','blue','yellow']
labels = ['education','medical','food','wholesale','service']
ax0.hist(arr, n_bins, normed=0, histtype='bar', color=colors, label=labels)
ax0.set_xlabel('Revenue (won)')
ax0.set_ylabel('Frequency')
ax0.legend(prop={'size': 10})
ax0.set_title('Sector distribution')
ax1.hist(arr, n_bins, normed=0, histtype='bar', stacked=True)
ax1.set_xlabel('Revenue (won)')
ax1.set_ylabel('Frequency')
ax1.set_title('Stacked bar (extension from \'Sector distribution\')')
plt.tight_layout(pad=0.8, w_pad=0.8, h_pad=1.0)
plt.show()
Note that we have not analyzed individual districts yet. Let's see what kind of insights we can get from them.
# since the ordering of districts is different for each business sector, I hard-coded the lists of districts
dist_total_section = ['kangnam','seocho','songpa','yeongdeungpo','jung','mapo','jongro','kangdong','kangseo','yangcheon','nowon','kwangjin','guro','kwanak','eunpyeong','yongsan','seongdong','dongdaemun','dongjak','seongbuk','joongrang','seodaemun','geumcheon','dobong','kangbuk']
dist_service_section = ['kangnam','seocho','songpa','mapo','kangseo','yeongdeungpo','kwanak','kwangjin','kangdong','jung','nowon','guro','yangcheon','seodaemun','eunpyeong','dongjak','seongbuk','joongrang','jongro','dongdaemun','kangbuk','seongdong','yongsan','geumcheon','dobong']
dist_food_section = ['kangnam','seocho','jung','jongro','mapo','yeongdeungpo','songpa','yongsan','kangseo','kangdong','kwanak','kwangjin','yangcheon','nowon','guro','dongjak','dongdaemun','seodaemun','eunpyeong','seongbuk','seongdong','geumcheon','joongrang','kangbuk','dobong']
dist_wholesale_section = ['kangnam','seocho','yeongdeungpo','jung','jongro','seongdong','mapo','yangcheon','kangseo','kangdong','guro','nowon','kwangjin','eunpyeong','dongdaemun','yongsan','dongjak','joongrang','geumcheon','seongbuk','kwanak','dobong','seodaemun','kangbuk']
dist_medical_section = ['kangnam','seocho','songpa','yeongdeungpo','kangdong','mapo','nowon','kwanak','kangseo','yangcheon','eunpyeong','jung','dongjak','guro','dongdaemun','jongro','seongbuk','kwangjin','seodaemun','seongdong','joongrang','yongsan','kangbuk','geumcheon','dobong']
dist_education_section = ['kangnam','seocho','yangcheon','songpa','nowon','kangseo','kangdong','mapo','dongjak','eunpyeong','yeongdeungpo','seodaemun','dongdaemun','kwangjin','kwanak','seongbuk','guro','jongro','dobong','yongsan','joongrang','seongdong','kangbuk','geumcheon','jung']
Let's transform this into matrix format, which makes it easier for us to manipulate.
# put lists into list
districts = [dist_service_section, dist_food_section, dist_wholesale_section, dist_medical_section, dist_education_section]
# transform list into numpy.array
np_dists = np.array(districts)
Let's define a function that shows the top n districts in terms of total revenue and each sector's revenue.
def print_top_districts(num):
    print 'Total Revenue \n' + str(dist_total_section[0:num])
    print
    print 'Service \n' + str(np_dists[0][0:num])
    print 'Food \n' + str(np_dists[1][0:num])
    print 'Wholesale \n' + str(np_dists[2][0:num])
    print 'Medical \n' + str(np_dists[3][0:num])
    print 'Education \n' + str(np_dists[4][0:num])
# check the top n districts in terms of total revenue and each sector's revenue
print_top_districts(7)
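The pattern visible in the printout, the same few districts leading every sector, can also be checked programmatically with a set intersection. A sketch using a small hypothetical subset of the rankings (in the notebook, the full dist_*_section lists above would be used):

```python
# Hypothetical shortened sector rankings, for illustration only.
sectors = [
    ['kangnam', 'seocho', 'songpa', 'mapo'],
    ['kangnam', 'seocho', 'jung', 'jongro'],
    ['kangnam', 'seocho', 'yangcheon', 'songpa'],
]

# Intersect the top-3 of every ranking to find districts leading all sectors.
common_top3 = set(sectors[0][:3])
for ranking in sectors[1:]:
    common_top3 &= set(ranking[:3])
print(sorted(common_top3))  # ['kangnam', 'seocho']
```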
One piece of information we have not yet looked at is the top businesses (and their associated areas). Let's examine it to finally formulate a hypothesis from the insights gathered so far.
# reference
# http://freeprog.tistory.com/10
# codecs, csv, unicode_csv_reader() and utf_8_encoder() are already imported/defined above
data = []
with codecs.open('1_top_business.csv', 'r', encoding='euc-kr') as csvf:
    rd = unicode_csv_reader(csvf)
    t = next(rd)
    for fieldname in t:
        print fieldname,
    print
    for idx, data in enumerate(rd):
        print idx, data[0], data[1], data[2], data[3], data[4], data[5]
top_business = pd.read_csv('1_top_business.csv')
top_business.head()
top_business_revenue = top_business['total']
print 'Top business statistics'
print
print 'Mean: ' + str(int(top_business_revenue.mean()))
print 'Max: ' + str(top_business_revenue.max())
print 'Min: ' + str(top_business_revenue.min())
print 'Standard deviation: ' + str(int(top_business_revenue.std()))
print '68%: ' + str([0, 1530+1369])
print '95%: ' + str([0, 1530+1369*2])
print '99.7%: ' + str([0, 1530+1369*3])
print 'Total businesses: ' + str(len(top_business_revenue))
print
print 'Top 10 business revenue \n' + str(top_business_revenue[0:10])
Let's see whether these business revenues are polarized.
top_1_percent_revenue = 0
bottom_99_percent_revenue = 0
for i in range(0, 6):
    top_1_percent_revenue = top_1_percent_revenue + top_business_revenue[i]
for j in range(6, 250):
    bottom_99_percent_revenue = bottom_99_percent_revenue + top_business_revenue[j]
print
print
print 'Revenue by top 1% (6 businesses): ' + str(top_1_percent_revenue)
print 'Revenue by bottom 99% (244 businesses): ' + str(bottom_99_percent_revenue)
print
print
top_10_percent_revenue = 0
bottom_90_percent_revenue = 0
for i in range(0, 25):
    top_10_percent_revenue = top_10_percent_revenue + top_business_revenue[i]
for j in range(25, 250):
    bottom_90_percent_revenue = bottom_90_percent_revenue + top_business_revenue[j]
print 'Revenue by top 10% (25 businesses): ' + str(top_10_percent_revenue)
print 'Revenue by bottom 90% (225 businesses): ' + str(bottom_90_percent_revenue)
print
print
top_28_percent_revenue = 0
bottom_72_percent_revenue = 0
for i in range(0, 70):
    top_28_percent_revenue = top_28_percent_revenue + top_business_revenue[i]
for j in range(70, 250):
    bottom_72_percent_revenue = bottom_72_percent_revenue + top_business_revenue[j]
print 'Revenue by top 28% (70 businesses): ' + str(top_28_percent_revenue)
print 'Revenue by bottom 72% (180 businesses): ' + str(bottom_72_percent_revenue)
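The top-k versus bottom split above can also be computed without manual index bookkeeping, which avoids easy off-by-one mistakes. A sketch with a made-up revenue vector (in the notebook, top_business_revenue.values would be used instead):

```python
import numpy as np

# Made-up revenues, sorted in descending order like the dataset.
revenue = np.array([100.0, 50.0, 30.0, 10.0, 5.0, 3.0, 2.0])

def top_share(rev, k):
    """Fraction of total revenue earned by the top k entries."""
    rev = np.sort(np.asarray(rev, dtype=float))[::-1]
    return rev[:k].sum() / rev.sum()

print(top_share(revenue, 2))  # 150 / 200 = 0.75
```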
Note: Can this data be used to argue that businesses are polarized?
plt.hist(top_business_revenue, bins=50)
plt.xlabel('Revenue (won)')
plt.ylabel('Frequency')
plt.title(r'$\mathrm{Top\ Business:}\ \mu=1530,\ \sigma=1369$')
plt.show()
Now that we have analyzed our dataset in some depth, let's move on to formulating a hypothesis.
We approach this process by first enumerating some of the insights and information gathered so far, because these pieces of information collectively point us toward a topic. They are as follows.
Since 'medical' and 'wholesale' are the top two sectors that contribute the most to the overall revenue, if they are not doing well, the overall revenue will be heavily influenced. And vice versa.
Even in times of economic hardship, top sectors will generate revenue. On the other hand, bottom sectors, such as 'education' and 'service', will be heavily influenced. (Not tested)
In this section, we set out to explore and test whether the formulated hypothesis holds. For this, we refer to the data visualization system provided by geovision. (http://datablog.geovision.co.kr/index.php/analysis02/)
To check whether these two districts do well even in times of economic trouble, I chose October, the month with the largest decrease in overall revenue from the previous month (-12.42%), as shown in the image below. However, the decrease may look larger than in other months simply because revenue in the previous month was unusually high, and this was indeed the case for October. As a result, I checked February (-4.59%) as well. (January is skipped because data for December of the previous year is not available, so the same problem could apply.)
First, let's look at 'October' (-12.42%)
Well, let's check 'September'! (+6.0%)
Now, let's take a look at 'February' (-4.59%)
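The month-over-month percent changes cited above can be reproduced with pandas' pct_change(). A sketch with made-up monthly figures chosen to mirror the cited +6.0% and -12.42% (the actual monthly revenue values are not part of this dataset):

```python
import pandas as pd

# Made-up monthly revenue figures, chosen to mirror the cited percentages.
monthly = pd.Series([100.00, 106.00, 92.83], index=['Aug', 'Sep', 'Oct'])

# pct_change() computes (current - previous) / previous; the first entry is NaN
# because there is no previous month -- the same reason January is skipped above.
pct = monthly.pct_change() * 100
print(pct.round(2))
```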
In conclusion, while more convincing cases would certainly strengthen the hypothesis, the analysis so far suggests that it is not too far-fetched.
To check this hypothesis, I first looked at the histogram of the 'medical' sector, which presents the percent change in revenue for each month, as shown below. October shows -2.25% and February -0.60%. However, the month prior to October (September) saw a 7.56% increase in revenue, while January saw a 2.91% increase. As with the first hypothesis, this needs to be taken into account for normalization. For now, I take the simpler approach of testing both February and October.
While attempting to test this hypothesis, I discovered that looking at the 'medical' sector alone would result in a misleading analysis, because in October all the other sectors also saw a decrease in revenue. This is illustrated in the image below.
For a proper test of this hypothesis, only the 'medical' and 'wholesale' sectors should see a decrease in revenue while all the other sectors see an increase, not to mention the additional complexities involved. Since the data does not contain this combination, it became clear that examining this hypothesis was no longer feasible.
This trial taught me the importance of the art of narrowing down: my failure with this experiment resulted from not carefully dissecting the hypothesis during formulation and making it as detailed as possible.
From this assignment, I realized how much effort and thought proper analysis requires. It also reconfirmed the importance of the dataset itself. Overall, the assignment was a challenging yet rewarding experience.