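Data Analysis: NumPy
- NumPy basics
NumPy basics
Why NumPy was introduced
- A Python list ([]) can hold elements of any data type, so Python has to check the type of each element on every operation, which slows execution.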
[0, 1, 2, 3, 4, 'machine learning', 6, 7, 8, 9]
- Python's array type stores its data as an array (or a two-dimensional array), but it does not treat the data as a vector or a matrix.
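The difference between a mixed-type list and a single-dtype NumPy array can be sketched as follows. This is a minimal illustration; the variable names (mixed, ints, coerced) are made up for this example and do not appear in the original notes.

```python
import numpy as np

# A plain list may mix types freely.
mixed = [0, 1, 2, 'machine learning', 4]

# An ndarray stores every element with one shared dtype.
ints = np.array([0, 1, 2, 3, 4])
print(ints.dtype)            # an integer dtype; the exact width depends on the platform

# Given mixed input, NumPy picks one common dtype (a Unicode string type here).
coerced = np.array(mixed)
print(coerced.dtype.kind)    # 'U'
```

Because every element has the same type, NumPy can skip per-element type checks when it operates on the whole array.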
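Creating arrays and matrices
Creating with array
import numpy as np  # the alias np is used throughout these notes
# np.array(...) builds an ndarray from a Python sequence
np.array([i for i in range(10)])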
dtype: checking the element type
nparr = np.array([i for i in range(10)])
nparr.dtype  # an integer dtype; the exact width depends on the platform
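Since an array has a single dtype, values assigned into it are converted to that dtype. A small sketch of this behaviour, assuming the same nparr as above (the floats array is an extra illustration, not part of the original notes):

```python
import numpy as np

nparr = np.array([i for i in range(10)])
nparr[5] = 3.14              # the float is truncated to fit the integer dtype
print(nparr[5])              # 3

# A single float in the input makes the whole array float64.
floats = np.array([1, 2, 3.0])
print(floats.dtype)          # float64
```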
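zeros: creating arrays of zeros
np.zeros(10) # a 1-D vector of zeros; the default dtype is float
np.zeros(10, dtype='int') # a 1-D vector of integer zeros
np.zeros((3,5)) # a 3-by-5 matrix of zeros; float by default
np.zeros(shape=(3,5), dtype='int') # a 3-by-5 matrix of integer zeros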
# result
array([[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]])
ones: creating arrays of ones
Works the same way as zeros:
np.ones((3,5))
#result
array([[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.]])
full: creating an array filled with a specified value
np.full(shape=(3,5), fill_value=666)
#result
array([[666, 666, 666, 666, 666],
[666, 666, 666, 666, 666],
[666, 666, 666, 666, 666]])
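arange
In plain Python, a list comprehension over range builds a list:
[i for i in range(0, 20, 2)]
NumPy provides an equivalent:
np.arange(0, 20, 2)
np.arange(0, 10)
np.arange(10)
# result of np.arange(10); np.arange(0, 10) gives the same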
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
linspace
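Returns evenly spaced values over the given interval, with both endpoints included; the third argument is the number of points, not the step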
np.linspace(0, 20, 5)
# result
array([ 0., 5., 10., 15., 20.])
np.linspace(0, 20, 6)
#result
array([ 0., 4., 8., 12., 16., 20.])
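The relationship between the number of points and the spacing can be checked directly. A minimal sketch; the names pts, step and stepped are illustrative:

```python
import numpy as np

# linspace: 5 points from 0 to 20, both endpoints included.
pts = np.linspace(0, 20, 5)
step = (20 - 0) / (5 - 1)        # spacing implied by 5 points
print(pts, step)

# arange takes the step directly and excludes the stop value.
stepped = np.arange(0, 20, 5)
print(stepped)                   # [ 0  5 10 15]
```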
random
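randint(x, y) returns a random integer in the half-open interval [x, y), that is, from x to y-1
np.random.randint(0, 10)
randint(x, y, size=n) returns an array of n such integers; size may also be a tuple that specifies a matrix shape
np.random.randint(0, 1, 10) # the upper bound 1 is excluded, so every value is 0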
#result
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
np.random.randint(4, 8, size = 10)
#result
array([7, 5, 4, 6, 7, 6, 6, 5, 6, 5])
np.random.randint(0, 10, size=(3,5))
#result
array([[6, 4, 4, 2, 0],
[7, 0, 3, 0, 5],
[1, 9, 0, 6, 7]])
np.random.seed(n) sets the random seed, which makes the generated values reproducible
np.random.seed(666)
np.random.randint(0, 10, size=(3,5))
#result
array([[2, 6, 9, 4, 3],
[1, 0, 8, 7, 5],
[2, 5, 5, 4, 8]])
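Reproducibility can be verified by reseeding and drawing again. The sketch below also shows np.random.default_rng, the Generator-based interface that the NumPy help text recommends for new code; using it here is an addition to these notes, not something the original does.

```python
import numpy as np

# Legacy global-state API: the same seed gives the same draws.
np.random.seed(666)
first = np.random.randint(0, 10, size=(3, 5))
np.random.seed(666)
second = np.random.randint(0, 10, size=(3, 5))
print(np.array_equal(first, second))    # True

# Generator API: each generator carries its own state.
rng_a = np.random.default_rng(666)
rng_b = np.random.default_rng(666)
draw_a = rng_a.integers(0, 10, size=5)
draw_b = rng_b.integers(0, 10, size=5)
print(np.array_equal(draw_a, draw_b))   # True
```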
Use np.random.random to create a random float in [0, 1), or an array or matrix of such floats
np.random.random()
#result
0.8888993137763092
np.random.random(10)
#result
array([0.62640453, 0.81887369, 0.54734542, 0.41671201, 0.74304719,
0.36959638, 0.07516654, 0.77519298, 0.21940924, 0.07934213])
np.random.random((3,5))
#result
array([[0.48678052, 0.1536739 , 0.82846513, 0.19136857, 0.27040895],
[0.56103442, 0.90238039, 0.85178834, 0.41808196, 0.39347627],
[0.01622051, 0.29921337, 0.35377822, 0.89350267, 0.78613657]])
np.random.normal() returns random samples drawn from a normal (Gaussian) distribution
np.random.normal(10, 100) # mean 10, standard deviation 100
#result
154.77350573628146
np.random.normal(0, 1, (3,5))
#result
array([[-0.1963061 , 1.51814514, 0.07722188, -0.06399132, 0.94592341],
[ 1.20409101, -0.45124074, -1.58744651, -1.86885548, 0.10037737],
[-3.09487059, 3.39351678, -0.12666878, -0.93713026, 0.56552529]])
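One way to see what the two positional arguments mean is to draw many samples and compare their mean and standard deviation with loc and scale. A rough sketch; the seed 0 and the sample count are arbitrary choices for this illustration:

```python
import numpy as np

np.random.seed(0)
samples = np.random.normal(10, 100, size=100_000)   # loc=10, scale=100

# With this many samples both statistics should land close to loc and scale.
print(samples.mean(), samples.std())
```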
Viewing the official documentation with help
Syntax: help(function)
help(np.random.normal)
Help on built-in function normal:
normal(...) method of numpy.random.mtrand.RandomState instance
normal(loc=0.0, scale=1.0, size=None)
Draw random samples from a normal (Gaussian) distribution.
The probability density function of the normal distribution, first
derived by De Moivre and 200 years later by both Gauss and Laplace
independently [2]_, is often called the bell curve because of
its characteristic shape (see the example below).
The normal distributions occurs often in nature. For example, it
describes the commonly occurring distribution of samples influenced
by a large number of tiny, random disturbances, each with its own
unique distribution [2]_.
.. note::
New code should use the ``normal`` method of a ``default_rng()``
instance instead; please see the :ref:`random-quick-start`.
Parameters
----------
loc : float or array_like of floats
Mean ("centre") of the distribution.
scale : float or array_like of floats
Standard deviation (spread or "width") of the distribution. Must be
non-negative.
size : int or tuple of ints, optional
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn. If size is ``None`` (default),
a single value is returned if ``loc`` and ``scale`` are both scalars.
Otherwise, ``np.broadcast(loc, scale).size`` samples are drawn.
Returns
-------
out : ndarray or scalar
Drawn samples from the parameterized normal distribution.
See Also
--------
Basic attributes
Setup:
x = np.arange(10)
x
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
X = np.arange(15).reshape(3, 5)
X
# array([[ 0, 1, 2, 3, 4],
# [ 5, 6, 7, 8, 9],
# [10, 11, 12, 13, 14]])
ndim: the number of dimensions of the array
x.ndim
# 1
X.ndim
# 2
shape: a tuple giving the size of each dimension
x.shape
# (10,)
X.shape
# (3, 5), i.e. 3 rows and 5 columns
size: the number of elements
x.size
# 10
X.size
# 15
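The three attributes are related: ndim is the length of shape, and size is the product of the entries of shape. A short check, using the same X as above:

```python
import numpy as np

X = np.arange(15).reshape(3, 5)

ndim_matches = X.ndim == len(X.shape)             # 2 == len((3, 5))
size_matches = X.size == int(np.prod(X.shape))    # 15 == 3 * 5
print(ndim_matches, size_matches)                 # True True

# itemsize is the number of bytes per element, so nbytes = size * itemsize.
print(X.nbytes == X.size * X.itemsize)            # True
```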
Accessing data
Setup:
x = np.arange(10)
x
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
X = np.arange(15).reshape(3, 5)
X
# array([[ 0, 1, 2, 3, 4],
# [ 5, 6, 7, 8, 9],
# [10, 11, 12, 13, 14]])
Basic access
Accessing a 1-D array:
x[0]
# 0
x[-1]
# 9
x[0:5]
# array([0, 1, 2, 3, 4])
x[:5]
# array([0, 1, 2, 3, 4])
x[5:]
# array([5, 6, 7, 8, 9])
x[::2] # step of 2
# array([0, 2, 4, 6, 8])
x[::-1]
# array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
Accessing a 2-D array (matrix):
X[0][0] # not recommended
# 0
X[2, 2] # recommended
# 12
X[:2, :3]
# array([[0, 1, 2],
# [5, 6, 7]])
X[::-1, ::-1]
# array([[14, 13, 12, 11, 10],
# [ 9, 8, 7, 6, 5],
# [ 4, 3, 2, 1, 0]])
X[0]
# array([0, 1, 2, 3, 4])
X[0, :]
# array([0, 1, 2, 3, 4])
X[:, 1]
# array([ 1, 6, 11])
X[:, 2]
# array([ 2, 7, 12])
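Why chained indexing is not recommended:
X[:2][:3] # the steps below show why
# array([[0, 1, 2, 3, 4],
# [5, 6, 7, 8, 9]])
X[:2] # the first two rows
# array([[0, 1, 2, 3, 4],
# [5, 6, 7, 8, 9]])
X[:2][:3] # takes the first 3 rows of X[:2], which has only 2 rows, so no columns are selected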
# array([[0, 1, 2, 3, 4],
# [5, 6, 7, 8, 9]])
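Comparing the shapes of the two forms makes the difference explicit. A minimal sketch; chained and direct are names chosen for this illustration:

```python
import numpy as np

X = np.arange(15).reshape(3, 5)

chained = X[:2][:3]    # second [] slices rows again, not columns
direct = X[:2, :3]     # one index expression: rows first, then columns

print(chained.shape)   # (2, 5)
print(direct.shape)    # (2, 3)
```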
Modifying values
Modifying a slice without copy (the slice is a view, so X changes too):
subX = X[:2, :3]
# array([[0, 1, 2],
# [5, 6, 7]])
subX[0,0] = 100
# array([[100, 1, 2],
# [ 5, 6, 7]])
# X afterwards (changed through the view)
# array([[100, 1, 2, 3, 4],
# [ 5, 6, 7, 8, 9],
# [ 10, 11, 12, 13, 14]])
With copy(), the original matrix is left unchanged:
subX = X[:2, :3].copy()
# array([[0, 1, 2],
# [5, 6, 7]])
subX[0, 0] = 100
# array([[100, 1, 2],
# [ 5, 6, 7]])
# X afterwards (unchanged)
# array([[ 0, 1, 2, 3, 4],
# [ 5, 6, 7, 8, 9],
# [10, 11, 12, 13, 14]])
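Whether a sub-array is a view or an independent copy can be checked with np.shares_memory. A small sketch, assuming a fresh X; the names view and independent are illustrative:

```python
import numpy as np

X = np.arange(15).reshape(3, 5)

view = X[:2, :3]                  # basic slicing returns a view
independent = X[:2, :3].copy()    # copy() allocates new memory

print(np.shares_memory(X, view))          # True
print(np.shares_memory(X, independent))   # False

independent[0, 0] = -1            # does not touch X
view[0, 0] = 100                  # writes through to X
print(X[0, 0])                    # 100
```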
The reshape function
Setup:
x = np.arange(10)
x
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Using reshape:
x.reshape(2,5)
# array([[0, 1, 2, 3, 4],
# [5, 6, 7, 8, 9]])
B = x.reshape(1, 10)
# array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
B.ndim
# 2
B.shape
# (1, 10)
x.shape
# (10,)
Specifying only the number of rows or only the number of columns:
x.reshape(10, -1) # fix 10 rows; the number of columns is inferred
# array([[0],
# [1],
# [2],
# [3],
# [4],
# [5],
# [6],
# [7],
# [8],
# [9]])
x.reshape(-1, 10)
# array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
x.reshape(3, -1) # 10 is not divisible by 3
# raises ValueError
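The inferred dimension and the failure case can both be exercised in a script. A minimal sketch; the flag name failed is made up for this example:

```python
import numpy as np

x = np.arange(10)

# -1 asks NumPy to infer that dimension from the total size.
inferred = x.reshape(2, -1)
print(inferred.shape)          # (2, 5)

# 10 elements cannot be arranged in 3 equal rows.
failed = False
try:
    x.reshape(3, -1)
except ValueError as err:
    failed = True
    print("reshape failed:", err)
```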
Merging and splitting
Merging
Setup:
x = np.array([1,2,3])
y = np.array([3,2,1])
z = np.array([666,666,666])
# x --- array([1, 2, 3])
# y --- array([3, 2, 1])
Merging with concatenate
Merging 1-D arrays:
np.concatenate([x,y])
# array([1, 2, 3, 3, 2, 1])
np.concatenate([x,y,z])
# array([ 1, 2, 3, 3, 2, 1, 666, 666, 666])
Merging matrices:
Ⅰ. Merging along the vertical direction
A = np.array([[1,2,3],
[4,5,6]])
np.concatenate([A, A])
# array([[1, 2, 3],
# [4, 5, 6],
# [1, 2, 3],
# [4, 5, 6]])
Ⅱ. Merging along the horizontal direction
np.concatenate([A,A], axis=1) # join along the column direction
# array([[1, 2, 3, 1, 2, 3],
# [4, 5, 6, 4, 5, 6]])
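Ⅲ. Arrays with different numbers of dimensions cannot be concatenated
np.concatenate([A, z]) # ValueError: A is 2-D but z is 1-D
Ⅳ. Use reshape with -1 to change the dimensions automatically, then concatenate
np.concatenate([A, z.reshape(1, -1)]) # reshape z to 1 row; the column count is inferred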
# array([[ 1, 2, 3],
# [ 4, 5, 6],
# [666, 666, 666]])
Merging with the stack functions
vstack
A = np.array([[1,2,3],
[4,5,6]])
z = np.array([666,666,666])
np.vstack([A, z]) # vstack accepts the 1-D z and stacks it vertically as a new row
# array([[ 1, 2, 3],
# [ 4, 5, 6],
# [666, 666, 666]])
hstack
B = np.full((2,2), 100)
# array([[100, 100],
# [100, 100]])
np.hstack([A, B]) # joins the two matrices horizontally
# array([[ 1, 2, 3, 100, 100],
# [ 4, 5, 6, 100, 100]])
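With the 2-D A and the 1-D z, stacking vertically (vstack) works but stacking horizontally (hstack) does not
np.hstack([A, z]) # ValueError: z is 1-D, so hstack cannot join it to the 2-D A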
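The failure, and one way to append a column instead, can be sketched as follows. The column values 7 and 8 and the names hstack_ok, col and widened are hypothetical, chosen only for this illustration:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])
z = np.array([666, 666, 666])

# hstack needs inputs with matching numbers of dimensions.
try:
    np.hstack([A, z])
    hstack_ok = True
except ValueError as err:
    hstack_ok = False
    print("hstack failed:", err)

# A column with one entry per row of A, reshaped to shape (2, 1), can be appended.
col = np.array([7, 8]).reshape(-1, 1)
widened = np.hstack([A, col])
print(widened)
```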
Splitting
Setup:
x = np.arange(10)
# array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Splitting with split
Splitting a 1-D array (the list gives the indices at which to cut):
np.split(x, [3, 7])
# [array([0, 1, 2]), array([3, 4, 5, 6]), array([7, 8, 9])]
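Besides a list of cut indices, np.split also accepts an integer number of equal sections, and np.array_split allows sections of unequal length. This goes slightly beyond the original notes; a minimal sketch:

```python
import numpy as np

x = np.arange(10)

halves = np.split(x, 2)          # two equal sections
thirds = np.array_split(x, 3)    # 10 is not divisible by 3, so lengths differ

print([len(p) for p in halves])  # [5, 5]
print([len(p) for p in thirds])  # [4, 3, 3]
```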
Splitting a matrix
Setup:
A = np.arange(16).reshape(4, 4)
# array([[ 0, 1, 2, 3],
# [ 4, 5, 6, 7],
# [ 8, 9, 10, 11],
# [12, 13, 14, 15]])
Splitting into two halves (by rows):
A1, A2 = np.split(A, [2])
# A1
# array([[0, 1, 2, 3],
# [4, 5, 6, 7]])
# A2
# array([[ 8, 9, 10, 11],
# [12, 13, 14, 15]])
Splitting by columns:
A1, A2 = np.split(A, [2], axis=1) # split along the column direction
# A1
# array([[ 0, 1],
# [ 4, 5],
# [ 8, 9],
# [12, 13]])
# A2
# array([[ 2, 3],
# [ 6, 7],
# [10, 11],
# [14, 15]])
Splitting with vsplit and hsplit
np.vsplit(A, [2])
# [array([[0, 1, 2, 3],
# [4, 5, 6, 7]]),
# array([[ 8, 9, 10, 11],
# [12, 13, 14, 15]])]
left, right = np.hsplit(A, [2])
# left
# array([[ 0, 1],
# [ 4, 5],
# [ 8, 9],
# [12, 13]])
# right
# array([[ 2, 3],
# [ 6, 7],
# [10, 11],
# [14, 15]])
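Why this matters
In the data below, for example, the first 3 columns might hold three features of the same kind, while the last column holds something different from the first three. The two parts then have to be separated before use.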
data = np.arange(16).reshape((4,4))
# array([[ 0, 1, 2, 3],
# [ 4, 5, 6, 7],
# [ 8, 9, 10, 11],
# [12, 13, 14, 15]])
x, y = np.hsplit(data, [-1])
# x
# array([[ 0, 1, 2],
# [ 4, 5, 6],
# [ 8, 9, 10],
# [12, 13, 14]])
# y
# array([[ 3],
# [ 7],
# [11],
# [15]])
y[:, 0]
# array([ 3, 7, 11, 15])
Matrix operations
Operating on plain lists
Multiplying a list by an integer does not multiply its elements; it just repeats the list:
n = 10
L = [i for i in range(n)]
2 * L
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
To operate on the data itself, multiply each element and append the result:
A = []
for e in L:
    A.append(2 * e)
# A is [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
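In this experiment, the list-based versions took the following times (%%time is an IPython/Jupyter cell magic):
n = 1000000
L = [i for i in range(n)] # rebuild L with the new n
%%time
A = []
for e in L:
    A.append(e * 2)
# Wall time: 118 ms
%%time
A = [2*e for e in L]
# Wall time: 60.4 ms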
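Solving the problem with NumPy
L = np.arange(n) # n = 1000000
%%time
# Caution: np.array does not iterate a generator expression; it wraps the generator in a 0-d object array.
# Pass a list, np.array([2*e for e in L]), to actually compute the values.
A = np.array(2*e for e in L)
# Wall time: 7.97 ms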
%%time
A = 2 * L
# Wall time: 998 µs
In NumPy, 2 * array multiplies every element instead of appending a second copy of the array:
n = 10
L = np.arange(n)
2 * L
# array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
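Outside a notebook, where %%time is not available, the standard-library timeit module can make a similar comparison. A minimal sketch, assuming a smaller n than the original to keep the run short; the names py_list, np_arr, t_list and t_numpy are illustrative. Timings vary by machine, so none are shown.

```python
import timeit
import numpy as np

n = 100_000                      # smaller than the 1,000,000 used above
py_list = list(range(n))
np_arr = np.arange(n)

# Time 10 repetitions of each approach.
t_list = timeit.timeit(lambda: [2 * e for e in py_list], number=10)
t_numpy = timeit.timeit(lambda: 2 * np_arr, number=10)
print(f"list comprehension: {t_list:.4f} s, NumPy: {t_numpy:.4f} s")

# Both approaches produce the same values.
same = [2 * e for e in py_list] == (2 * np_arr).tolist()
print(same)                      # True
```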
Universal Functions
Setup:
X = np.arange(1, 16).reshape((3,5))
# array([[ 1, 2, 3, 4, 5],
# [ 6, 7, 8, 9, 10],
# [11, 12, 13, 14, 15]])
Common element-wise operations on a matrix:
- +
- -
- *
- /
- //
- %
- 1 /
- abs
- sin
- cos
- tan
- exp
- power
- log
- log2
- log10
For example:
np.log10(X)
# array([[0. , 0.30103 , 0.47712125, 0.60205999, 0.69897 ],
# [0.77815125, 0.84509804, 0.90308999, 0.95424251, 1. ],
# [1.04139269, 1.07918125, 1.11394335, 1.14612804, 1.17609126]])
np.power(3, X)
# array([[ 3, 9, 27, 81, 243],
# [ 729, 2187, 6561, 19683, 59049],
# [ 177147, 531441, 1594323, 4782969, 14348907]], dtype=int32)
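The arithmetic operators in the list above are shorthand for NumPy functions that act element by element. The pairing below is a small sketch of that correspondence, using the same X; the result names are illustrative:

```python
import numpy as np

X = np.arange(1, 16).reshape((3, 5))

add_same = np.array_equal(X + 1, np.add(X, 1))
floordiv_same = np.array_equal(X // 2, np.floor_divide(X, 2))
mod_same = np.array_equal(X % 2, np.mod(X, 2))
power_same = np.array_equal(3 ** X, np.power(3, X))
recip_same = np.allclose(1 / X, np.reciprocal(X.astype(float)))

print(add_same, floordiv_same, mod_same, power_same, recip_same)
```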
Operations between matrices
Setup:
A = np.arange(4).reshape(2, 2)
# array([[0, 1],
# [2, 3]])
B = np.full((2, 2), 10)
# array([[10, 10],
# [10, 10]])
Element-wise operations
The operators +, -, *, and / act on corresponding elements of the two matrices. For example:
A + B # just 0 + 10, 1 + 10, 2 + 10, ...
# array([[10, 11],
# [12, 13]])
Matrix multiplication
A.dot(B) # matrix product
# array([[10, 10],
# [50, 50]])
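Since * is element-wise, it gives a different result from the matrix product. A short comparison using the same A and B; the @ operator is Python's matrix-multiplication operator and is shown here as an alternative spelling that the original notes do not use:

```python
import numpy as np

A = np.arange(4).reshape(2, 2)
B = np.full((2, 2), 10)

elementwise = A * B      # multiplies matching entries
product = A @ B          # matrix product, same as A.dot(B)

print(elementwise)       # [[ 0 10] [20 30]]
print(product)           # [[10 10] [50 50]]
```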
Matrix transpose
A.T # transpose
# array([[0, 2],
# [1, 3]])
Invalid operations
Two matrices with incompatible shapes cannot be combined directly. For example, create a (3,3) matrix:
C = np.full((3,3), 666)
# array([[666, 666, 666],
# [666, 666, 666],
# [666, 666, 666]])
A + C # A is (2,2) and C is (3,3), so this raises ValueError
A.dot(C) # matrix multiplication fails as well: the inner dimensions do not match
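Both failures raise ValueError, which a script can catch. A minimal sketch; the errors list is a name invented for this example:

```python
import numpy as np

A = np.arange(4).reshape(2, 2)
C = np.full((3, 3), 666)

errors = []
try:
    A + C                      # shapes (2, 2) and (3, 3) cannot be combined
except ValueError as err:
    errors.append(("add", str(err)))

try:
    A.dot(C)                   # inner dimensions 2 and 3 do not match
except ValueError as err:
    errors.append(("dot", str(err)))

for name, message in errors:
    print(name, "failed:", message)
```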
Operations between a vector and a matrix
Setup:
v = np.array([1, 2])
# array([1, 2])
# A
# array([[0, 1],
# [2, 3]])
Adding directly
v + A # fine once you are comfortable with NumPy: v is added to every row of A
# array([[1, 3],
# [3, 5]])
Using a stack function instead
np.vstack([v] * A.shape[0]) # stack v twice to get a matrix with the same shape as A
# the stacked v is
# array([[1, 2],
# [1, 2]])
np.vstack([v] * A.shape[0]) + A
# array([[1, 3],
# [3, 5]])
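The two approaches above give the same result, which can be confirmed in code. The sketch also shows np.tile as another way to repeat v row by row; np.tile is an addition to these notes and is offered only as an alternative.

```python
import numpy as np

v = np.array([1, 2])
A = np.arange(4).reshape(2, 2)

direct = v + A                                # v is applied to each row of A
stacked = np.vstack([v] * A.shape[0]) + A     # explicit stacking, as above
tiled = np.tile(v, (A.shape[0], 1)) + A       # repeat v once per row of A

print(np.array_equal(direct, stacked))        # True
print(np.array_equal(direct, tiled))          # True
```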