Based on Python 3. I only record the parts I use most often, and list further reference links.
Introduction
Below is a diagram of commonly used Python libraries.
Importing modules
Installation (Python packaging tools include easy_install, pip, setuptools, and distribute):

```bash
# Taking pip as an example, type at the cmd prompt (the package name is illustrative):
pip install numpy
```

(If you use the Anaconda distribution directly, the common libraries are installed automatically.)
Usage (More):

```python
# In a .py file, analogous to C's #include <...> (the module name is illustrative)
import numpy
```

Importing a library has time and space costs, so import judiciously.
STL
sys
```python
import sys
```
References:
- `sys.argv`: passes arguments from outside the program into it.
- `sys.exit([arg])`: exits from the middle of a program; arg = 0 means a normal exit.
- `sys.executable`: path of the Python interpreter (e.g. C:/Anaconda/python.exe).
- `sys.getdefaultencoding()`: gets the current default encoding ('ascii' by default on Python 2, 'utf-8' on Python 3).
- `sys.setdefaultencoding()`: sets the default encoding (Python 2 only). It does not show up in `dir(sys)` and fails in the interpreter unless you first run `reload(sys)`; after that, `sys.setdefaultencoding('utf8')` switches the default encoding to UTF-8 (see 设置系统默认编码). In Python 3 this function no longer exists, since the default is already UTF-8.
- `sys.getfilesystemencoding()`: gets the encoding used by the file system; returns 'mbcs' on Windows and 'utf-8' on Mac.
- `sys.path`: the list of paths searched for modules; place a module in one of these directories and `import` will find it.
- `sys.platform`: the current platform.
- `sys.stdin`, `sys.stdout`, `sys.stderr`: stream objects corresponding to the standard I/O streams. If you need finer control over output than `print` offers, these are what you want; you can also replace them to redirect input and output to other devices, or to handle the streams in non-standard ways.
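A quick sketch of the calls above (the printed values depend on your system):

```python
import sys

print(sys.argv)                  # command-line arguments; sys.argv[0] is the script name
print(sys.executable)            # path of the running interpreter
print(sys.platform)              # e.g. 'win32', 'linux', 'darwin'
print(sys.getdefaultencoding())  # 'utf-8' on Python 3
sys.stderr.write('diagnostics go to stderr\n')
sys.exit(0)                      # terminate with a normal exit status
```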
time
```python
import time as ti
```
References:
Time handling is spread over several modules; the basics in time:
- `time.time()`: the timestamp (seconds since midnight, January 1, 1970).
- `time.localtime(time.time())`: converts a timestamp to local time.
- `time.asctime(time.localtime(time.time()))`: formats it as a readable string.
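For example:

```python
import time as ti

stamp = ti.time()            # seconds since 1970-01-01 00:00:00 UTC
local = ti.localtime(stamp)  # struct_time in the local time zone
print(ti.asctime(local))     # e.g. 'Wed Nov 15 06:13:20 2023'
```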
os
```python
import os
```
References:
The os module provides a rich set of functions for working with files and directories; see the reference links for details.
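A few of the most common calls, as a sketch:

```python
import os

print(os.getcwd())                       # current working directory
for name in os.listdir('.'):             # entries in the current directory
    print(name, os.path.isdir(name))
os.makedirs('tmp_dir', exist_ok=True)    # create a directory; no error if it exists
print(os.path.join('tmp_dir', 'a.txt'))  # portable path concatenation
```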
Numpy
```python
import numpy as np
```
References:
如何系统地学习Python 中 matplotlib, numpy, scipy, pandas?
jupyter code和markdown转换(含常用快捷键)
Where NumPy gets used:
- Machine-learning models: matrix computation, storing training data, model parameters.
- Image processing and computer graphics: fast image operations (mirroring an image, rotating it by a given angle, etc.).
- Mathematical tasks: numerical integration, differentiation, interpolation, extrapolation, and so on.
Creating arrays (tensors)
NumPy's main object is the homogeneous multidimensional array: a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy, dimensions are called axes; the number of axes is the rank.
```python
import numpy as np
```
NumPy's array class is called `ndarray`, with the alias `array`. Note that `numpy.array` is not the same as the standard Python library class `array.array`, which handles only one-dimensional arrays and offers less functionality. The key attributes of an `ndarray` object are:
- `ndarray.ndim`: the number of axes (dimensions) of the array, also known as its rank.
- `ndarray.shape`: the dimensions of the array, as a tuple of integers giving the size along each dimension. For a matrix with n rows and m columns, shape is (n, m); the length of the shape tuple is therefore the rank, `ndim`.
- `ndarray.size`: the total number of elements, equal to the product of the elements of shape.
- `ndarray.dtype`: an object describing the element type. You can create or specify a dtype using standard Python types, and NumPy also provides its own, e.g. `numpy.int32`, `numpy.int16`, `numpy.float64`.
- `ndarray.itemsize`: the size in bytes of each element. For example, an array of `float64` has itemsize 8 (= 64/8), while one of `complex128` has itemsize 16 (= 128/8); it equals `ndarray.dtype.itemsize`.
- `ndarray.data`: the buffer containing the actual elements. Normally you do not need this attribute, since elements are accessed by indexing.
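A quick look at these attributes:

```python
import numpy as np

a = np.arange(15).reshape(3, 5)  # 2-D array: 3 rows, 5 columns
print(a.ndim)      # 2  (number of axes, i.e. rank)
print(a.shape)     # (3, 5)
print(a.size)      # 15 = 3 * 5
print(a.dtype)     # e.g. int64 (platform dependent)
print(a.itemsize)  # 8 bytes for int64
```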
Matrix operations

```python
import numpy as np
```
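A short sketch of everyday operations (values chosen for illustration):

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
B = np.eye(2)               # 2x2 identity matrix
print(A * B)                # elementwise product
print(A @ B)                # matrix product (equivalently np.dot(A, B))
print(A.T)                  # transpose
print(np.linalg.inv(A))     # inverse
print(A.sum(axis=0))        # column sums
```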
Matplotlib
```python
import numpy as np
import matplotlib.pyplot as plt  # standard plotting interface
```
References:
pylab is an interface to matplotlib's object-oriented plotting library. Its syntax is very close to Matlab's: its main plotting commands take parameters similar to the corresponding Matlab commands. (Matlab and Octave syntax are almost identical, so Octave references apply as well.)
Line plots: plot()

```python
import numpy as np
import matplotlib.pyplot as plt
```

```python
# Import everything from matplotlib's pylab interface (numpy is then usable as np)
from pylab import *
```
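A minimal sketch in the same spirit (functions and ranges are my choice):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, 256)
plt.plot(x, np.cos(x), label='cos(x)')
plt.plot(x, np.sin(x), label='sin(x)')
plt.legend()
plt.show()
```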
There are further features for annotating particular points, adjusting axis ticks, and so on; see the Matplotlib 教程 reference for details.
Shaded regions: fill_between()

```python
import numpy as np
```
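A possible sketch (the shading condition is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x)
plt.plot(x, y)
plt.fill_between(x, 0, y, where=(y > 0), alpha=0.3)  # shade where sin(x) > 0
plt.show()
```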
Bar charts: bar()

```python
import numpy as np
```
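A possible sketch (bar heights are random for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(5)
heights = np.random.uniform(0.5, 1.0, 5)  # random bar heights
plt.bar(x, heights)
plt.show()
```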
Pie charts: pie()

```python
import numpy as np
```
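A possible sketch (shares and labels are made up):

```python
import matplotlib.pyplot as plt

plt.pie([15, 30, 45, 10], labels=['A', 'B', 'C', 'D'], autopct='%1.1f%%')
plt.axis('equal')  # keep the pie circular
plt.show()
```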
Scatter plots: scatter()

```python
import numpy as np
```
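A possible sketch (random data for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

x, y = np.random.normal(0, 1, (2, 500))  # 500 random points
plt.scatter(x, y, s=10, alpha=0.5)
plt.show()
```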
Grayscale images: imshow()

```python
import numpy as np
```
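A possible sketch (the "image" is random noise):

```python
import numpy as np
import matplotlib.pyplot as plt

z = np.random.rand(32, 32)  # a random 32x32 "image"
plt.imshow(z, cmap='gray')
plt.colorbar()
plt.show()
```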
3D plots*

```python
import numpy as np
```
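A surface-plot sketch; the `projection='3d'` style assumes a reasonably recent matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection

x = np.linspace(-3, 3, 50)
x, y = np.meshgrid(x, x)
z = np.sin(np.sqrt(x ** 2 + y ** 2))
ax = plt.figure().add_subplot(111, projection='3d')
ax.plot_surface(x, y, z, cmap='viridis')
plt.show()
```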
Contour plots: contourf()

```python
import numpy as np
```
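A possible sketch (the surface is a Gaussian bump chosen for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 100)
x, y = np.meshgrid(x, x)
z = np.exp(-x ** 2 - y ** 2)
plt.contourf(x, y, z, levels=8, cmap='hot')                      # filled contours
plt.contour(x, y, z, levels=8, colors='black', linewidths=0.5)   # contour lines on top
plt.show()
```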
Vector fields: quiver()

```python
import numpy as np
```
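A possible sketch (the field components are my choice):

```python
import numpy as np
import matplotlib.pyplot as plt

x, y = np.meshgrid(np.linspace(0, 2 * np.pi, 12), np.linspace(0, 2 * np.pi, 12))
u, v = np.cos(x), np.sin(y)  # vector components at each grid point
plt.quiver(x, y, u, v)
plt.show()
```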
Grids: grid()

```python
import numpy as np
```
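A possible sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

plt.plot(np.arange(10))
plt.grid(True, linestyle='--', alpha=0.5)  # dashed background grid
plt.show()
```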
Multiple subplot grids

```python
import numpy as np
```
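A possible sketch using a 2x2 grid of axes:

```python
import numpy as np
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2)  # a 2x2 grid of axes
x = np.linspace(0, 2 * np.pi, 100)
for i, ax in enumerate(axes.flat):
    ax.plot(x, np.sin((i + 1) * x))
    ax.set_title('sin(%dx)' % (i + 1))
fig.tight_layout()
plt.show()
```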
Polar plots

```python
import numpy as np
```
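A sketch, assuming the standard `projection='polar'` axes:

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2 * np.pi, 200)
ax = plt.subplot(111, projection='polar')
ax.plot(theta, 1 + np.sin(3 * theta))  # a three-petal curve
plt.show()
```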
Text annotations: text()

```python
import numpy as np
```
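A possible sketch (the annotation point is my choice):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))
plt.text(np.pi, 0, r'$\sin(\pi)=0$', ha='center')  # text at data coordinates
plt.show()
```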
Scipy
```python
import scipy as sp
```
References:
Scipy-Lecture-Notes (Chinese edition)
Scientific computing: statistics, optimization, integration, linear algebra, Fourier transforms, signal and image processing, ODE solvers, and more.
| Module | Task |
| --- | --- |
| `scipy.cluster` | Vector quantization / k-means |
| `scipy.constants` | Physical and mathematical constants |
| `scipy.fftpack` | Fourier transforms |
| `scipy.integrate` | Integration routines |
| `scipy.interpolate` | Interpolation |
| `scipy.io` | Data input and output |
| `scipy.linalg` | Linear algebra routines |
| `scipy.ndimage` | n-dimensional image package |
| `scipy.odr` | Orthogonal distance regression |
| `scipy.optimize` | Optimization |
| `scipy.signal` | Signal processing |
| `scipy.sparse` | Sparse matrices |
| `scipy.spatial` | Spatial data structures and algorithms |
| `scipy.special` | Special mathematical functions |
| `scipy.stats` | Statistics |
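As a taste, a sketch using two of these modules (the integrand and objective are chosen for illustration):

```python
import numpy as np
from scipy import integrate, optimize

# Integrate sin(x) over [0, pi]; the exact answer is 2.
value, abs_err = integrate.quad(np.sin, 0, np.pi)
print(value)

# Minimize (x - 2)^2 starting from x = 0; the minimum is at x = 2.
result = optimize.minimize(lambda x: (x[0] - 2.0) ** 2, x0=0.0)
print(result.x)
```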
(to be continued)
Pandas
```python
import pandas as pd
```
References:
Pandas Cheat Sheet - Dataquest
Main Features
Here are just a few of the things that pandas does well:
- Easy handling of missing data (represented as `NaN`) in floating point as well as non-floating point data
- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let `Series`, `DataFrame`, etc. automatically align the data for you in computations
- Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
- Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
- Intuitive merging and joining data sets
- Flexible reshaping and pivoting of data sets
- Hierarchical labeling of axes (possible to have multiple labels per tick)
- Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format
- Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
In short: data preprocessing, data I/O management, robust group operations, time-series handling, and more.
Importing Data
| Command | Description |
| --- | --- |
| `pd.read_csv(filename)` | From a CSV file (e.g. to read training data) |
| `pd.read_table(filename)` | From a delimited text file (like TSV) |
| `pd.read_excel(filename)` | From an Excel file |
| `pd.read_sql(query, connection_object)` | Read from a SQL table/database |
| `pd.read_json(json_string)` | Read from a JSON formatted string, URL or file |
| `pd.read_html(url)` | Parses an html URL, string or file and extracts tables to a list of dataframes |
| `pd.read_clipboard()` | Takes the contents of your clipboard and passes it to `read_table()` |
| `pd.DataFrame(dict)` | From a dict; keys for column names, values for data as lists |
Exporting Data
| Command | Description |
| --- | --- |
| `df.to_csv(filename)` | Write to a CSV file (e.g. to output results) |
| `df.to_excel(filename)` | Write to an Excel file |
| `df.to_sql(table_name, connection_object)` | Write to a SQL table |
| `df.to_json(filename)` | Write to a file in JSON format (e.g. to save a model) |
Create Test Objects
Useful for testing code segments:

| Command | Description |
| --- | --- |
| `pd.DataFrame(np.random.rand(20,5))` | 5 columns and 20 rows of random floats |
| `pd.Series(my_list)` | Create a series from an iterable `my_list` |
| `df.index = pd.date_range('1900/1/30', periods=df.shape[0])` | Add a date index |
Viewing/Inspecting Data
| Command | Description |
| --- | --- |
| `df.head(n)` | First n rows of the DataFrame |
| `df.tail(n)` | Last n rows of the DataFrame |
| `df.shape` | Number of rows and columns |
| `df.info()` | Index, datatype and memory information |
| `df.describe()` | Summary statistics for numerical columns |
| `s.value_counts(dropna=False)` | View unique values and counts |
| `df.apply(pd.Series.value_counts)` | Unique values and counts for all columns |
Selection
| Command | Description |
| --- | --- |
| `df[col]` | Returns the column with label col as a Series |
| `df[[col1, col2]]` | Returns columns as a new DataFrame |
| `s.iloc[0]` | Selection by position |
| `s.loc['index_one']` | Selection by index |
| `df.iloc[0,:]` | First row |
| `df.iloc[0,0]` | First element of first column |
Data Cleaning
| Command | Description |
| --- | --- |
| `df.columns = ['a','b','c']` | Rename columns |
| `pd.isnull()` | Checks for null values; returns a Boolean array |
| `pd.notnull()` | Opposite of `pd.isnull()` |
| `df.dropna()` | Drop all rows that contain null values |
| `df.dropna(axis=1)` | Drop all columns that contain null values |
| `df.dropna(axis=1,thresh=n)` | Drop all columns that have fewer than n non-null values |
| `df.fillna(x)` | Replace all null values with x |
| `s.fillna(s.mean())` | Replace all null values with the mean (mean can be replaced with almost any function from the statistics section) |
| `s.astype(float)` | Convert the datatype of the series to float |
| `s.replace(1,'one')` | Replace all values equal to 1 with 'one' |
| `s.replace([1,3],['one','three'])` | Replace all 1 with 'one' and 3 with 'three' |
| `df.rename(columns=lambda x: x + 1)` | Mass renaming of columns |
| `df.rename(columns={'old_name': 'new_name'})` | Selective renaming |
| `df.set_index('column_one')` | Change the index |
| `df.rename(index=lambda x: x + 1)` | Mass renaming of index |
Filter, Sort, and Groupby
| Command | Description |
| --- | --- |
| `df[df[col] > 0.5]` | Rows where the column col is greater than 0.5 |
| `df[(df[col] > 0.5) & (df[col] < 0.7)]` | Rows where 0.5 < col < 0.7 |
| `df.sort_values(col1)` | Sort values by col1 in ascending order |
| `df.sort_values(col2,ascending=False)` | Sort values by col2 in descending order |
| `df.sort_values([col1,col2],ascending=[True,False])` | Sort values by col1 in ascending order, then col2 in descending order |
| `df.groupby(col)` | Returns a groupby object for values from one column |
| `df.groupby([col1,col2])` | Returns a groupby object for values from multiple columns |
| `df.groupby(col1)[col2].mean()` | Returns the mean of the values in col2, grouped by the values in col1 (mean can be replaced with almost any function from the statistics section) |
| `df.pivot_table(index=col1,values=[col2,col3],aggfunc='mean')` | Create a pivot table that groups by col1 and calculates the mean of col2 and col3 |
| `df.groupby(col1).agg(np.mean)` | Find the average across all columns for every unique col1 group |
| `df.apply(np.mean)` | Apply the function np.mean() across each column |
| `df.apply(np.max,axis=1)` | Apply the function np.max() across each row |
Join/Combine
| Command | Description |
| --- | --- |
| `df1.append(df2)` | Append the rows of df2 to the end of df1 (columns should be identical; newer pandas prefers `pd.concat([df1, df2])`) |
| `pd.concat([df1, df2],axis=1)` | Add the columns of df2 to the end of df1 (rows should be identical) |
| `df1.join(df2,on=col1,how='inner')` | SQL-style join of the columns of df1 with the columns of df2 where the rows of col1 have identical values; `how` can be one of 'left', 'right', 'outer', 'inner' |
Statistics
These can all be applied to a series as well.
| Command | Description |
| --- | --- |
| `df.describe()` | Summary statistics for numerical columns |
| `df.mean()` | Returns the mean of all columns |
| `df.corr()` | Returns the correlation between columns in a DataFrame |
| `df.count()` | Returns the number of non-null values in each DataFrame column |
| `df.max()` | Returns the highest value in each column |
| `df.min()` | Returns the lowest value in each column |
| `df.median()` | Returns the median of each column |
| `df.std()` | Returns the standard deviation of each column |
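Tying a few of these together, a minimal sketch (column names and values are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(20, 5), columns=list('abcde'))
df.index = pd.date_range('1900/1/30', periods=df.shape[0])

print(df.head(3))       # first three rows
print(df.describe())    # summary statistics
print(df[df['a'] > 0.5]                      # filter rows ...
        .sort_values('b', ascending=False)   # ... sort them ...
        .mean())                             # ... and aggregate
```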
Sklearn (scikit-learn)
```python
# Regression
```
Source: yhat
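A minimal linear-regression sketch (synthetic data, illustrative only; not the snippet this section originally showed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2x + 1 plus a little noise.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1 + np.random.normal(0, 0.1, 10)

model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)  # close to 2 and 1
print(model.predict([[12.0]]))        # extrapolate to x = 12
```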
(to be continued)
TensorFlow
```python
import tensorflow as tf
```
See the TensorFlow documentation. (More)
TensorFlow is Google's open-source deep-learning framework; as the name "tensor flow" suggests, its strength is processing data that flows through the graph as tensors.
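A tiny sketch, assuming TensorFlow 2.x with eager execution:

```python
import tensorflow as tf

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])  # a 2x2 tensor
b = tf.ones((2, 2))
print(tf.matmul(a, b))                     # tensor ops run eagerly in TF 2.x
```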
(to be continued)
Keras
```python
from tensorflow import keras
```
Reference: Keras 中文文档
Source: datacamp
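A minimal sketch of the Sequential API (layer sizes and shapes are arbitrary; assumes tf.keras):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(32, activation='relu', input_shape=(4,)),  # hidden layer
    keras.layers.Dense(3, activation='softmax'),                  # 3-class output
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```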
(to be continued)
PyTorch
Reference: PyTorch中文文档
(to be continued)
OpenCV
References:
(to be continued)
Scrapy
References:
(to be continued)
Pyspider
References:
(to be continued)