作者:何炘柱_549 | 来源:互联网 | 2022-10-19 05:23
我在带有351837(110 MB大小)记录的配置单元中有一个表,我正在使用python读取此表并写入sql服务器。
在此过程中,将数据从蜂巢读取到熊猫数据帧时需要花费很长时间。当我加载全部记录(351k)时,需要90分钟。
为了改进,我采用了以下方法,例如从蜂巢中读取1万行并写入sql server。但是,仅从配置单元读取1万行并将其分配给Dataframe仅需要4-5分钟的时间。
def execute_hadoop_export():
"""
This will run the steps required for a Hadoop Export.
Return Values is boolean for success fail
"""
try:
hql='select * from db.table '
# Open Hive ODBC Connection
src_cOnn= pyodbc.connect("DSN=****",autocommit=True)
cursor=src_conn.cursor()
#tgt_cOnn= pyodbc.connect(target_connection)
# Using SQLAlchemy to dynamically generate query and leverage dataframe.to_sql to write to sql server...
sql_conn_url = urllib.quote_plus('DRIVER={ODBC Driver 13 for SQL Server};SERVER=Xyz;DATABASE=Db2;UID=ee;PWD=*****')
sql_conn_str = "mssql+pyodbc:///?odbc_cOnnect={0}".format(sql_conn_url)
engine = sqlalchemy.create_engine(sql_conn_str)
# read source table.
vstart=datetime.datetime.now()
for df in pandas.read_sql(hql, src_conn,chunksize=10000):
vfinish=datetime.datetime.now()
print 'Finished 10k rows reading from hive and it took', (vfinish-vstart).seconds/60.0,' minutes'
# Get connection string for target from Ctrl.Connnection
df.to_sql(name='table', schema='dbo', con=engine, chunksize=10000, if_exists="append", index=False)
print 'Finished 10k rows writing into sql server and it took', (datetime.datetime.now()-vfinish).seconds/60.0, ' minutes'
vstart=datetime.datetime.now()
cursor.Close()
except Exception, e:
print str(e)
输出:
在python中读取配置单元表数据的最快方法是什么?
更新配置单元表结构
CREATE TABLE `table1`(
`policynumber` varchar(15),
`unitidentifier` int,
`unitvin` varchar(150),
`unitdescription` varchar(100),
`unitmodelyear` varchar(4),
`unitpremium` decimal(18,2),
`garagelocation` varchar(150),
`garagestate` varchar(50),
`bodilyinjuryoccurrence` decimal(18,2),
`bodilyinjuryaggregate` decimal(18,2),
`bodilyinjurypremium` decimal(18,2),
`propertydamagelimits` decimal(18,2),
`propertydamagepremium` decimal(18,2),
`medicallimits` decimal(18,2),
`medicalpremium` decimal(18,2),
`uninsuredmotoristoccurrence` decimal(18,2),
`uninsuredmotoristaggregate` decimal(18,2),
`uninsuredmotoristpremium` decimal(18,2),
`underinsuredmotoristoccurrence` decimal(18,2),
`underinsuredmotoristaggregate` decimal(18,2),
`underinsuredmotoristpremium` decimal(18,2),
`umpdoccurrence` decimal(18,2),
`umpddeductible` decimal(18,2),
`umpdpremium` decimal(18,2),
`comprehensivedeductible` decimal(18,2),
`comprehensivepremium` decimal(18,2),
`collisiondeductible` decimal(18,2),
`collisionpremium` decimal(18,2),
`emergencyroadservicepremium` decimal(18,2),
`autohomecredit` tinyint,
`lossfreecredit` tinyint,
`multipleautopoliciescredit` tinyint,
`hybridcredit` tinyint,
`goodstudentcredit` tinyint,
`multipleautocredit` tinyint,
`fortyfivepluscredit` tinyint,
`passiverestraintcredit` tinyint,
`defensivedrivercredit` tinyint,
`antitheftcredit` tinyint,
`antilockbrakescredit` tinyint,
`perkcredit` tinyint,
`plantype` varchar(100),
`costnew` decimal(18,2),
`isnocontinuousinsurancesurcharge` tinyint)
CLUSTERED BY (
policynumber,
unitidentifier)
INTO 50 BUCKETS
注意:我也尝试了sqoop导出选项,但是我的配置单元表已经处于存储桶格式。