# Playing with Kobe Bryant's Shot Data

[TOC]

## The Problem

Kobe Bryant marked his retirement from the NBA by scoring 60 points in his final game as a Los Angeles Laker on Wednesday, April 12, 2016. Drafted into the NBA at the age of 17, Kobe earned the sport’s highest accolades throughout his long career.

Using 20 years of data on Kobe’s swishes and misses, can you predict which shots will find the bottom of the net? This competition is well suited for practicing classification basics, feature engineering, and time series analysis. Practice got Kobe an eight-figure contract and 5 championship rings. What will it get you?

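| Field | My notes |
|---|---|
| `action_type` | Type of shot, e.g. Jump Shot |
| `combined_shot_type` | A coarser classification than `action_type` |
| `game_event_id` | An ID; its exact meaning is unclear |
| `game_id` | ID of the game |
| `lat` | Latitude of the shot location (geographic latitude) |
| `loc_x` | Horizontal coordinate of the shot relative to the basket |
| `loc_y` | Vertical coordinate of the shot relative to the basket |
| `lon` | Longitude of the shot location (this really is geographic longitude) |
| `minutes_remaining` | Minutes remaining in the period |
| `period` | Which period of the game |
| `playoffs` | Whether it is a playoff game |
| `season` | Season |
| `seconds_remaining` | Seconds remaining (on top of `minutes_remaining`) |
| `shot_distance` | Distance from the shot location to the basket |
| `shot_made_flag` (this is what you are predicting) | Whether the shot went in |
| `shot_type` | 3-pointer or 2-pointer |
| `shot_zone_area` | Shooting zone; the details are shown later |
| `shot_zone_basic` | Shooting zone, partitioned differently from the field above; see the figure below |
| `shot_zone_range` | |
| `team_id` | A single value: the team Kobe played for |
| `team_name` | Team name, also a constant: Los Angeles Lakers |
| `game_date` | Date of the game |
| `matchup` | Shows the opponent and whether the game was home or away |
| `opponent` | Opponent |
| `shot_id` | ID of the shot |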

## A First Look at the Data

### Hit Rate by Shot Location

```python
import matplotlib.pyplot as plt


def plot_shot_basic_range():
    """Plot shot locations per shot_zone_basic, labelled with the hit rate."""
    df = load_data()  # the post's own loader, not shown here
    # Keep only labelled rows; shot_made_flag is missing for the shots to predict.
    train = df.loc[df['shot_made_flag'].isin([0, 1])]
    basic_zone_list = list(set(train.shot_zone_basic))
    rate_data = {}
    plot_data = {}
    for zone in basic_zone_list:
        zone_data = train.loc[train.shot_zone_basic == zone]
        size = len(zone_data)
        made = (zone_data['shot_made_flag'] == 1).sum()
        rate_data[zone] = {'size': size, 'rate': 100.0 * made / size}
        plot_data[zone] = {'x': zone_data.loc_x, 'y': zone_data.loc_y}

    plt.figure()
    color_list = ['r', 'g', 'b', 'c', 'm', 'k', 'y']
    for index, zone in enumerate(basic_zone_list):
        plt.plot(plot_data[zone]['x'], plot_data[zone]['y'],
                 c=color_list[index], linestyle='', marker='o',
                 label='%.2f, %s' % (rate_data[zone]['rate'], zone))

    plt.legend()
    ax = plt.gca()
    ax.set_ylim([-50, 900])
    plt.show()
```

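Figure: shot locations grouped by `shot_zone_basic`.

Figure: shot locations grouped by `shot_zone_range`.

The two figures show how these fields partition the court differently. The plots themselves add little to the analysis, though. It is tempting to read off where Kobe likes to shoot and where he is most accurate, but that does not help much with this problem: almost every player is more accurate closer to the basket, and without other players to compare against the plots tell us nothing specific. Shot location will certainly go into the model as a feature that affects the hit rate.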
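The per-zone hit rate that the plotting function builds up in a loop can also be computed with a group-by. The sketch below is only an illustration: the `train` frame is a tiny made-up stand-in for the labelled rows, and `hit_rate_by` is a hypothetical helper name, not something from the post.

```python
import pandas as pd

# Toy stand-in for the labelled training rows; the real data would come from
# the post's load_data(), which is not shown.
train = pd.DataFrame({
    "shot_zone_basic": ["Restricted Area", "Restricted Area", "Mid-Range",
                        "Mid-Range", "Mid-Range", "Above the Break 3"],
    "shot_distance": [1, 2, 15, 17, 18, 25],
    "shot_made_flag": [1, 1, 1, 0, 0, 0],
})


def hit_rate_by(frame, column):
    """Return attempts, makes and hit rate (in percent) per value of `column`."""
    grouped = frame.groupby(column)["shot_made_flag"]
    summary = grouped.agg(["count", "sum"]).rename(
        columns={"count": "attempts", "sum": "made"})
    summary["rate"] = 100.0 * summary["made"] / summary["attempts"]
    # Most accurate zones first
    return summary.sort_values("rate", ascending=False)


zone_rates = hit_rate_by(train, "shot_zone_basic")
print(zone_rates)
```

The same helper works for any categorical column, for example `shot_zone_range` or `shot_zone_area`.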

## Predicting Whether a Shot Goes In

```python
values_col = [
    'action_type',
    'loc_x',
    'loc_y',
    'minutes_remaining',
    'period',
    'playoffs',
    'season',
    'seconds_remaining',
    'opponent',
    'shot_id',
    'shot_made_flag',
]
```
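The post does not show how these columns are turned into a numeric array. One possible way is sketched below, purely as an assumption: `encode_features` is a hypothetical helper (it is not the post's `hand_data`), the toy frame is made up, and replacing categories with integer codes is just the simplest encoding that keeps `shot_id` and `shot_made_flag` as the last two columns.

```python
import numpy as np
import pandas as pd

values_col = [
    'action_type', 'loc_x', 'loc_y', 'minutes_remaining', 'period',
    'playoffs', 'season', 'seconds_remaining', 'opponent',
    'shot_id', 'shot_made_flag',
]


def encode_features(df, columns):
    """Hypothetical preprocessing: select `columns` and return a float array.

    Non-numeric columns are replaced by integer category codes. The column
    order is preserved, so shot_id and shot_made_flag stay last.
    """
    data = df[columns].copy()
    for name in data.columns:
        if not pd.api.types.is_numeric_dtype(data[name]):
            data[name] = data[name].astype("category").cat.codes
    return data.to_numpy(dtype=float)


# Made-up rows; the second one has no label, like the shots to be predicted.
toy = pd.DataFrame({
    "action_type": ["Jump Shot", "Layup Shot", "Jump Shot"],
    "loc_x": [-100, 5, 120],
    "loc_y": [150, 10, 80],
    "minutes_remaining": [10, 3, 0],
    "period": [1, 2, 4],
    "playoffs": [0, 0, 1],
    "season": ["2000-01", "2000-01", "2001-02"],
    "seconds_remaining": [27, 5, 59],
    "opponent": ["POR", "UTA", "POR"],
    "shot_id": [1, 2, 3],
    "shot_made_flag": [1.0, np.nan, 0.0],
})

data_array = encode_features(toy, values_col)
print(data_array.shape)
```

Integer codes impose an arbitrary order on categories such as `opponent`; one-hot encoding (for example with `pd.get_dummies`) may suit an SVM better, at the cost of many more columns.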
The modeling code is below; the data-processing part (`hand_data`) is omitted.

```python
import numpy as np
import pandas as pd
from sklearn import svm


def train_model_svm():
    df = load_data()            # the post's own loader, not shown
    data_array = hand_data(df)  # the post's preprocessing, not shown;
                                # the last two columns are shot_id and shot_made_flag
    # Rows whose label is NaN are the shots we have to predict.
    unlabelled = np.isnan(data_array[:, -1].astype(float))
    pre_data = data_array[unlabelled]
    train_data = data_array[~unlabelled]
    np.random.shuffle(train_data)
    test_data = train_data[:5000]   # hold out 5000 labelled shots for testing
    train_data = train_data[5000:]

    X = train_data[:, :-2]
    Y = train_data[:, -1].astype(int)
    clf = svm.SVC()  # default parameters, no tuning
    clf.fit(X, Y)

    actual = test_data[:, -1].astype(int)
    predict_label = clf.predict(test_data[:, :-2])
    P = actual.sum()  # number of shots actually made
    min_label = predict_label - actual
    fp = np.sum(min_label == 1)    # predicted made, actually missed
    fn = np.sum(min_label == -1)   # predicted missed, actually made
    tp = P - fn
    tn = len(test_data) - P - fp
    print('tp:', tp)
    print('fn:', fn)
    print('fp:', fp)
    print('tn:', tn)
    print('P:', P)

    # Predict the unlabelled shots and write them out alongside their shot_id.
    result_label = clf.predict(pre_data[:, :-2]).reshape(-1, 1)
    index_ = pre_data[:, -2].reshape(-1, 1)
    result_data = np.hstack((index_, result_label))

    result_df = pd.DataFrame(result_data, columns=['shot_id', 'shot_made_flag'])
    result_df.to_csv('result.csv', index=False)
```

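That is the core of the code. The SVM is used with its default parameters, with no tuning at all. The confusion matrix below shows how it performs: not great, but enough to suggest that Kobe's shots follow some pattern. The next step is to experiment, for example by tuning the parameters or switching to other methods; many approaches work for this problem.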

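The counts reported for this run were 2065 / 157 (made) and 1919 / 859 (missed), but they were derived by treating "prediction minus actual equals 1" as a false negative. That case is a false positive (predicted made, actually missed), so the off-diagonal counts were swapped before the diagonal was computed. With 2222 made and 2778 missed shots in the 5000-shot test set, the matrix is:

| Actual \ Predicted | Made | Missed |
|---|---|---|
| Made | 303 | 1919 |
| Missed | 157 | 2621 |

Overall accuracy is about 58% ((303 + 2621) / 5000), the same figure either way.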
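As a rough sketch of those next steps, the example below scales the features, searches a small parameter grid for the SVM, and builds the confusion matrix with scikit-learn instead of by hand. Everything specific here is an assumption for illustration: the data is synthetic (two columns standing in for `loc_x` and `loc_y`), the grid values are arbitrary, and feature scaling is an addition that the post itself does not use.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the encoded shot features: closer shots are more
# likely to go in. This is not the competition data.
rng = np.random.default_rng(0)
X = rng.uniform(-250, 250, size=(600, 2))
distance = np.hypot(X[:, 0], X[:, 1])
y = (rng.uniform(size=600) < 0.7 - distance / 700).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Scale the features, then try a small, arbitrary grid of C and gamma.
model = make_pipeline(StandardScaler(), SVC())
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1]}
search = GridSearchCV(model, param_grid, cv=3)
search.fit(X_train, y_train)

y_pred = search.predict(X_test)

# labels=[1, 0] lists "made" first; rows are actual, columns are predicted.
matrix = confusion_matrix(y_test, y_pred, labels=[1, 0])
tp, fn = matrix[0]
fp, tn = matrix[1]
accuracy = accuracy_score(y_test, y_pred)

print(search.best_params_)
print(matrix)
print("accuracy: %.3f" % accuracy)
```

Letting `confusion_matrix` assign the cells avoids mixing up false positives and false negatives, and comparing the accuracy against the share of the majority class shows how much the model really adds.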