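1. Convenient data acquisition
1.1 Local data: opening, reading, writing, and closing files (covered in a separate chapter)
1.2 Network data:
1.2.1 urllib, urllib2, httplib, httplib2 (in Python 3: urllib.request, http.client)
Regular expressions (covered in a separate chapter)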
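As a rough sketch of the Python 3 side of the list above, the snippet below wraps urllib.request.urlopen in a small helper. The helper name fetch_text and its timeout default are illustrative assumptions, not part of these notes. A data: URL is used only so that the example runs without network access.

```python
import urllib.request


def fetch_text(url, timeout=10):
    """Fetch a URL and decode the body as text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # Fall back to UTF-8 when the response declares no charset.
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)


# A data: URL keeps the demo offline; an http(s) URL is passed the same way.
sample = fetch_text("data:text/plain;charset=utf-8,hello%20world")
print(sample)
```

For real pages the returned text would then be handed to a parser or to regular expressions.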
1.2.2 Fetching Yahoo Finance data with the matplotlib.finance module
In [7]: from matplotlib.finance import quotes_historical_yahoo_ochl
In [8]: from datetime import date
In [9]: from datetime import datetime
In [10]: import pandas as pd
In [11]: today = date.today()
In [12]: start = (today.year-1, today.month, today.day)
In [14]: quotes = quotes_historical_yahoo_ochl('AXP', start, today)  # fetch the data
In [15]: fields = ['date', 'open', 'close', 'high', 'low', 'volume']
In [16]: list1 = []
In [18]: for i in range(0,len(quotes)):
    ...:     x = date.fromordinal(int(quotes[i][0]))  # take the first column of each row and convert the ordinal to a date
    ...:     y = datetime.strftime(x,'%Y-%m-%d')  # format the date as a YYYY-MM-DD string
    ...:     list1.append(y)  # collect the date strings in a list
    ...:
In [19]: quotesdf = pd.DataFrame(quotes,index=list1,columns=fields)  # dates as the index, fields as the column names
In [20]: quotesdf = quotesdf.drop(['date'],axis=1)  # drop the date column
In [21]: print quotesdf
                 open      close       high        low      volume
2016-01-20  60.374146  61.835916  62.336256  60.128882   9043800.0
2016-01-21  61.806486  61.453305  63.101479  61.325767   8992300.0
2016-01-22  57.283819  54.016907  57.774347  53.114334  43783400.0
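The index-building loop can be written more compactly with a list comprehension. The sketch below is in Python 3 and assumes no live download: as far as I know, matplotlib.finance is no longer shipped with current matplotlib releases, so three rows copied from the output above stand in for the result of quotes_historical_yahoo_ochl.

```python
from datetime import date

import pandas as pd

# Stand-in for the downloaded quotes: three rows copied from the output above.
quotes = [
    (735983.0, 60.374146, 61.835916, 62.336256, 60.128882, 9043800.0),
    (735984.0, 61.806486, 61.453305, 63.101479, 61.325767, 8992300.0),
    (735985.0, 57.283819, 54.016907, 57.774347, 53.114334, 43783400.0),
]
fields = ['date', 'open', 'close', 'high', 'low', 'volume']

# Convert each ordinal in the first column to a 'YYYY-MM-DD' label.
labels = [date.fromordinal(int(row[0])).strftime('%Y-%m-%d') for row in quotes]

# Build the frame with date labels as the index, then drop the raw ordinal column.
quotesdf = pd.DataFrame(quotes, index=labels, columns=fields).drop(columns=['date'])
print(quotesdf)
```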
1.2.3 Getting corpora and other data with the Natural Language Toolkit (NLTK)
1. Install nltk: pip install nltk
2. Download a corpus:
In [1]: import nltk
In [2]: nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d
Download which package (l=list; x=cancel)?
  Identifier> gutenberg
    Downloading package gutenberg to /root/nltk_data...
      Package gutenberg is already up-to-date!
3. Load the data:
In [3]: from nltk.corpus import gutenberg
In [4]: print gutenberg.fileids()
[u'austen-emma.txt', u'austen-persuasion.txt', u'austen-sense.txt', u'bible-kjv.txt',
 u'blake-poems.txt', u'bryant-stories.txt', u'burgess-busterbrown.txt', u'carroll-alice.txt',
 u'chesterton-ball.txt', u'chesterton-brown.txt', u'chesterton-thursday.txt', u'edgeworth-parents.txt',
 u'melville-moby_dick.txt', u'milton-paradise.txt', u'shakespeare-caesar.txt', u'shakespeare-hamlet.txt',
 u'shakespeare-macbeth.txt', u'whitman-leaves.txt']
In [5]: texts = gutenberg.words('shakespeare-hamlet.txt')
In [6]: texts
Out[6]: [u'[', u'The', u'Tragedie', u'of', u'Hamlet', u'by', ...]
2. Data preparation and cleanup
2.1 Adding column names to the quotes data
In [79]: quotesdf = pd.DataFrame(quotes)
In [80]: quotesdf
Out[80]:
          0          1          2          3          4           5
0  735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
1  735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
2  735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
3  735988.0  53.428272  53.977664  54.713455  53.114334  18498300.0
[253 rows x 6 columns]
In [81]: fields = ['date','open','close','high','low','volume']
In [82]: quotesdf = pd.DataFrame(quotes,columns=fields)  # set the column names
In [83]: quotesdf
Out[83]:
       date       open      close       high        low      volume
0  735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
1  735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
2  735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
3  735988.0  53.428272  53.977664  54.713455  53.114334  18498300.0
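Column names can also be attached to an existing DataFrame without rebuilding it. The following is a minimal sketch in Python 3, using two rows copied from the output above. The short name 'vol' is purely illustrative.

```python
import pandas as pd

# Two rows copied from the output above.
quotes = [
    (735983.0, 60.374146, 61.835916, 62.336256, 60.128882, 9043800.0),
    (735984.0, 61.806486, 61.453305, 63.101479, 61.325767, 8992300.0),
]
fields = ['date', 'open', 'close', 'high', 'low', 'volume']

quotesdf = pd.DataFrame(quotes)   # columns default to 0..5
quotesdf.columns = fields         # relabel all columns in place

# rename() changes selected columns only and returns a new frame.
renamed = quotesdf.rename(columns={'volume': 'vol'})
print(renamed.columns.tolist())
```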
2.2 Adding index labels to the quotes data
In [84]: quotesdf
Out[84]:
       date       open      close       high        low      volume
0  735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
1  735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
2  735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
[253 rows x 6 columns]
In [85]: quotesdf = pd.DataFrame(quotes, index=range(1,len(quotes)+1),columns=fields)  # change the index from 0,1,2... to 1,2,3...
In [86]: quotesdf
Out[86]:
       date       open      close       high        low      volume
1  735983.0  60.374146  61.835916  62.336256  60.128882   9043800.0
2  735984.0  61.806486  61.453305  63.101479  61.325767   8992300.0
3  735985.0  57.283819  54.016907  57.774347  53.114334  43783400.0
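The index can likewise be replaced on an existing DataFrame. This is a small Python 3 sketch under the same assumptions as the session above, with three rows copied from its output. The set_index call at the end is an additional option not used in these notes.

```python
import pandas as pd

# Three rows copied from the output above.
quotes = [
    (735983.0, 60.374146, 61.835916, 62.336256, 60.128882, 9043800.0),
    (735984.0, 61.806486, 61.453305, 63.101479, 61.325767, 8992300.0),
    (735985.0, 57.283819, 54.016907, 57.774347, 53.114334, 43783400.0),
]
fields = ['date', 'open', 'close', 'high', 'low', 'volume']
quotesdf = pd.DataFrame(quotes, columns=fields)

# Replace the default 0-based index with 1-based labels.
quotesdf.index = range(1, len(quotesdf) + 1)
print(quotesdf.index.tolist())

# Alternatively, promote an existing column to be the index (returns a new frame).
by_date = quotesdf.set_index('date')
```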
2.3 Date conversion: Gregorian ordinal => ordinary date string
In [88]: from datetime import date
In [89]: firstday = date.fromordinal(735190)
In [93]: firstday
Out[93]: datetime.date(2013, 11, 18)
In [95]: firstday = datetime.strftime(firstday,'%Y-%m-%d')
In [96]: firstday
Out[96]: '2013-11-18'
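The conversion also works in the other direction. The sketch below, in Python 3, round-trips the same ordinal through a date object and a formatted string. The variable names are illustrative.

```python
from datetime import date, datetime

ordinal = 735190

# Ordinal -> date object -> 'YYYY-MM-DD' string.
d = date.fromordinal(ordinal)
text = d.strftime('%Y-%m-%d')

# String -> date object -> ordinal again.
back = datetime.strptime(text, '%Y-%m-%d').date().toordinal()

print(d, text, back)
```

The ordinal counts days from 0001-01-01, which is day 1 in the proleptic Gregorian calendar.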
2.4 Creating a time series:
In [120]: import pandas as pd
In [121]: dates = pd.date_range('20170101', periods=7)  # build a date sequence from a start date and a length
In [122]: dates
Out[122]:
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
               '2017-01-05', '2017-01-06', '2017-01-07'],
              dtype='datetime64[ns]', freq='D')
In [123]: import numpy as np
In [124]: dates = pd.DataFrame(np.random.randn(7,3), index=dates, columns=list('ABC'))  # date sequence as the index, A/B/C as the column names, contents are 7x3 random numbers
In [125]: dates
Out[125]:
                   A         B         C
2017-01-01  0.705927  0.311453  1.455362
2017-01-02 -0.331531 -0.358449  0.175375
2017-01-03 -0.284583 -1.760700 -0.582880
2017-01-04 -0.759392 -2.080658 -2.015328
2017-01-05 -0.517370  0.906072 -0.106568
2017-01-06 -0.252802 -2.135604 -0.692153
2017-01-07 -0.275184  0.142973 -1.262126
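Once dates form the index, rows can be picked by date label. The following sketch in Python 3 builds the same kind of 7x3 frame. The fixed seed and the name df are assumptions added here so that runs are repeatable and the dates variable is not overwritten.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170101', periods=7)

# Seeded generator: an assumption for reproducibility, not part of the session above.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((7, 3)), index=dates, columns=list('ABC'))

# One row by its date label (returns a Series indexed by A, B, C).
row = df.loc[pd.Timestamp('2017-01-03')]

# A label slice on dates includes both endpoints.
window = df.loc['2017-01-02':'2017-01-04']
print(window.shape)
```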
2.5 Exercises
In [101]: datetime.now()  # show the current date and time
Out[101]: datetime.datetime(2017, 1, 20, 16, 11, 50, 43258)
=========================================
In [108]: datetime.now().month  # show the current month
Out[108]: 1
=========================================
In [126]: import pandas as pd
In [127]: dates = pd.date_range('2015-02-01',periods=10)
In [128]: dates
Out[128]:
DatetimeIndex(['2015-02-01', '2015-02-02', '2015-02-03', '2015-02-04',
               '2015-02-05', '2015-02-06', '2015-02-07', '2015-02-08',
               '2015-02-09', '2015-02-10'],
              dtype='datetime64[ns]', freq='D')
In [133]: res = pd.DataFrame(range(1,11),index=dates,columns=['value'])
In [134]: res
Out[134]:
            value
2015-02-01      1
2015-02-02      2
2015-02-03      3
2015-02-04      4
2015-02-05      5
2015-02-06      6
2015-02-07      7
2015-02-08      8
2015-02-09      9
2015-02-10     10
3. Displaying data
3.1 Ways to display:
In [180]: quotesdf2.index  # show the index
Out[180]:
Index([u'2016-01-20', u'2016-01-21', u'2016-01-22', u'2016-01-25',
       ...
       u'2017-01-11', u'2017-01-12', u'2017-01-13', u'2017-01-17',
       u'2017-01-18', u'2017-01-19'],
      dtype='object', length=253)
In [181]: quotesdf2.columns  # show the column names
Out[181]: Index([u'open', u'close', u'high', u'low', u'volume'], dtype='object')
In [182]: quotesdf2.values  # show the underlying values
Out[182]:
array([[ 6.03741455e+01, 6.18359160e+01, 6.23362562e+01,
         6.01288817e+01, 9.04380000e+06],
       ...,
       [ 7.76100010e+01, 7.66900020e+01, 7.77799990e+01,
         7.66100010e+01, 7.79110000e+06]])
In [183]: quotesdf2.describe  # without parentheses this prints the bound method, not the summary; call quotesdf2.describe() for statistics
Out[183]:
<bound method DataFrame.describe of                  open      close       high        low      volume
2016-01-20  60.374146  61.835916  62.336256  60.128882   9043800.0
2016-01-21  61.806486  61.453305  63.101479  61.325767   8992300.0
2016-01-22  57.283819  54.016907  57.774347  53.114334  43783400.0
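Note that describe is a method: written without parentheses, IPython only echoes the bound method together with the frame. A minimal Python 3 sketch of the call form follows, using three rows copied from the output above. The name df stands in for quotesdf2.

```python
import pandas as pd

# Three rows copied from the output above; df stands in for quotesdf2.
rows = [
    (60.374146, 61.835916, 62.336256, 60.128882, 9043800.0),
    (61.806486, 61.453305, 63.101479, 61.325767, 8992300.0),
    (57.283819, 54.016907, 57.774347, 53.114334, 43783400.0),
]
df = pd.DataFrame(
    rows,
    index=['2016-01-20', '2016-01-21', '2016-01-22'],
    columns=['open', 'close', 'high', 'low', 'volume'],
)

# Calling the method returns a frame of summary statistics per column.
stats = df.describe()
print(stats.loc[['count', 'mean', 'min', 'max'], 'open'])
```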
3.2 Index format: the u prefix marks a unicode string
3.3 Displaying rows:
In [193]: quotesdf.head(2)  # show the first two rows with the dedicated method
Out[193]:
       date       open      close       high        low     volume
1  735983.0  60.374146  61.835916  62.336256  60.128882  9043800.0
2  735984.0  61.806486  61.453305  63.101479  61.325767  8992300.0
In [194]: quotesdf.tail(2)  # show the last two rows with the dedicated method
Out[194]:
         date       open      close       high        low     volume
252  736347.0  77.110001  77.489998  77.610001  76.510002  5988400.0
253  736348.0  77.610001  76.690002  77.779999  76.610001  7791100.0
In [195]: quotesdf[:2]  # show the first two rows by slicing
Out[195]:
       date       open      close       high        low     volume
1  735983.0  60.374146  61.835916  62.336256  60.128882  9043800.0
2  735984.0  61.806486  61.453305  63.101479  61.325767  8992300.0
In [197]: quotesdf[251:]  # show the last two rows by slicing
Out[197]:
         date       open      close       high        low     volume
252  736347.0  77.110001  77.489998  77.610001  76.510002  5988400.0
253  736348.0  77.610001  76.690002  77.779999  76.610001  7791100.0
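The slice quotesdf[251:] works by position and requires knowing the row count. A small Python 3 sketch of the alternatives follows. The six-row frame and its values are made up for illustration and only mimic the 1-based index of quotesdf.

```python
import pandas as pd

# Hypothetical six-row frame with a 1-based index, like quotesdf.
df = pd.DataFrame({'value': range(10, 16)}, index=range(1, 7))

first_two = df[:2]       # plain slices count positions, not labels
last_two = df.iloc[-2:]  # negative positions avoid hard-coding the length
by_label = df.loc[1:2]   # .loc slices use labels and include both ends

print(last_two)
```

With this index, position 0 holds label 1, which is why quotesdf[251:] above returns the rows labelled 252 and 253.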
4. Data selection
5. Simple statistics and processing
6. Grouping
7. Merge