Python data preprocessing with pandas and sklearn

October 18, 2016

import pandas as pd
import numpy as np
from sklearn import preprocessing,linear_model,metrics

data = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
                    'children': [4., 6, 3, 3, 2, 3, 5, 4],
                    'salary': [90, 24, 44, 27, 32, 59, 36, 27]})

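lb = preprocessing.LabelBinarizer()
# Label binarization
# LabelBinarizer takes a list of multi-class labels and builds a label indicator matrix (one 0/1 column per class)
# lb.classes_ holds the sorted class names, so the columns do not need to be hard-coded
pdlb = pd.DataFrame(lb.fit_transform(data['pet']), columns=lb.classes_)
# Concatenate horizontally (column-wise)
data = pd.concat([data, pdlb], axis=1)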
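
To make the indicator matrix concrete, here is a minimal sketch that fits a LabelBinarizer on a small label list. The list `pets` and the names `lb_demo` and `indicator` are made up for illustration and are not part of the script above.

```python
from sklearn import preprocessing

# Hypothetical labels, chosen only for illustration
pets = ['cat', 'dog', 'fish', 'dog']

lb_demo = preprocessing.LabelBinarizer()
indicator = lb_demo.fit_transform(pets)

# classes_ holds the sorted unique labels; column i of the matrix marks class i
print(list(lb_demo.classes_))
print(indicator)

# inverse_transform maps the indicator rows back to the original labels
recovered = list(lb_demo.inverse_transform(indicator))
print(recovered)
```

Each row contains a single 1 in the column of its class, which is why the columns can be named directly from `classes_`.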
# Standardization (mean removal and variance scaling)
# scale() returns data with zero mean and unit variance
data['children'] = preprocessing.scale(data['children'])
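
As a sanity check on the standardization step, the sketch below scales the same `children` values and compares the result with a manual computation. The names `scaled` and `manual` are illustrative only.

```python
import numpy as np
from sklearn import preprocessing

# Same values as the 'children' column in the post
children = np.array([4., 6, 3, 3, 2, 3, 5, 4])

# scale() subtracts the mean and divides by the population standard deviation (ddof=0)
scaled = preprocessing.scale(children)
manual = (children - children.mean()) / children.std()

print(scaled.mean(), scaled.std())
print(np.allclose(scaled, manual))
```

Note that NumPy's `std()` defaults to `ddof=0`, while pandas' `Series.std()` defaults to `ddof=1`, so a pandas-based manual check would give slightly different numbers.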
# Normalization
# Rescales each sample (row) independently to unit norm, so every value ends up in [-1, 1]
# X_normalized = preprocessing.normalize(X, norm='l2')
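
The commented-out call above can be tried on a small matrix. This is a minimal sketch with a hypothetical two-row array `X_demo`; it only illustrates that each row is divided by its own L2 norm.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical samples: one row per sample
X_demo = np.array([[3.0, 4.0],
                   [1.0, 0.0]])

# Each row is divided by its own L2 norm, e.g. [3, 4] has norm 5
X_normalized = preprocessing.normalize(X_demo, norm='l2')
row_norms = np.linalg.norm(X_normalized, axis=1)

print(X_normalized)
print(row_norms)
```

Unlike `scale`, which works column by column, `normalize` works row by row, so it does not depend on the other samples in the data set.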
# Binarization
# Converts numeric data into binary 0/1 values according to a configurable threshold
# binarizer = preprocessing.Binarizer(threshold=1.1)  # set the threshold to 1.1
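
A short sketch of the binarizer in use, assuming a hypothetical array `X_demo`. Values strictly greater than the threshold become 1 and everything else becomes 0.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical values, chosen to sit on both sides of the threshold
X_demo = np.array([[0.5, 1.1, 2.0],
                   [1.2, -3.0, 1.0]])

binarizer = preprocessing.Binarizer(threshold=1.1)
# Binarizer is stateless, so fit() learns nothing; fit_transform is used for consistency
X_binary = binarizer.fit_transform(X_demo)

print(X_binary)
```

A value equal to the threshold (1.1 here) maps to 0, because the comparison is strict.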

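# scikit-learn expects X to be a feature matrix and y to be a NumPy vector
# X can be a pandas DataFrame and y a pandas Series; scikit-learn understands both
# 'salary' is the target, so it is left out of the features to avoid leaking y into X
X = data[['children', 'cat', 'dog', 'fish']]
y = data['salary']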

mlr = linear_model.LinearRegression()
mlr.fit(X,y)
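
The post imports `metrics` but stops at `fit`. The following is a minimal sketch of how the fitted model could be inspected and scored, assuming `salary` is the target and the remaining columns are the features. It evaluates on the training data only, so the scores say nothing about generalization.

```python
import pandas as pd
from sklearn import preprocessing, linear_model, metrics

data = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
                     'children': [4., 6, 3, 3, 2, 3, 5, 4],
                     'salary': [90, 24, 44, 27, 32, 59, 36, 27]})

# Same preprocessing as in the post: indicator columns for 'pet', standardized 'children'
lb = preprocessing.LabelBinarizer()
dummies = pd.DataFrame(lb.fit_transform(data['pet']), columns=lb.classes_)
data = pd.concat([data, dummies], axis=1)
data['children'] = preprocessing.scale(data['children'])

# Assumption: salary is the target, everything else is a feature
X = data[['children', 'cat', 'dog', 'fish']]
y = data['salary']

mlr = linear_model.LinearRegression()
mlr.fit(X, y)

# Predictions and two common regression metrics, computed on the training data
y_pred = mlr.predict(X)
r2 = metrics.r2_score(y, y_pred)
mse = metrics.mean_squared_error(y, y_pred)

print(mlr.coef_, mlr.intercept_)
print(r2, mse)
```

With only eight rows this is purely illustrative; for a real data set, a held-out split (for example via `sklearn.model_selection.train_test_split`) would give a more honest estimate.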

xuefliang
