During processing, field A needs updating in the morning; in the afternoon, once the crawler team has finished scraping the spec sheets or images, the image and spec-sheet fields need updating as well. Because the table holds tens of millions of rows, deep pagination makes processing slower and slower.
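To illustrate, deep pagination means offset-based reads like the one below (using the db.tb example table from the next section): MySQL must scan and discard every skipped row before returning the batch, so the larger the offset grows, the slower each batch becomes.

select a, b, c from db.tb order by id limit 9000000, 10;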
Is there a key that lets us update the data without deep pagination? Yes: the auto-increment id column.
This table has an auto-increment id column, which is also its primary key, so querying and updating rows through this indexed column is the ideal approach.
select a, b, c from db.tb where id = 9999999;
update db.tb set a = x where id = 9999999;
Each process handles the rows within a fixed id range, which both avoids deep pagination and lets multiple processes work on the data at once, so query speed and overall processing speed improve together. Below is the task-allocation function I wrote, for reference:
def mission_handler(all_missions, worker_mission_size):
    """
    Compute the task list from the total number of tasks and the number of
    tasks per worker. Each list element is (task start id, task end id).
    Example: with 100 tasks in total and 40 tasks per worker, the task list is:
    [(1, 40), (41, 80), (81, 100)]
    :param all_missions: total number of tasks
    :param worker_mission_size: maximum number of tasks per worker
    :return: [(start_id, end_id), (start_id, end_id), ...]
    """
    worker_mission_ids = []
    current_id = 0
    while current_id <= all_missions:
        start_id = all_missions if current_id + 1 >= all_missions else current_id + 1
        end_id = all_missions if current_id + worker_mission_size >= all_missions else current_id + worker_mission_size
        if start_id == end_id:
            # the final range has already been emitted; avoid a duplicate slice
            if worker_mission_ids[-1][1] == start_id:
                break
        worker_mission_ids.append((start_id, end_id))
        current_id += worker_mission_size
    return worker_mission_ids

Suppose the table's maximum id is 100 and we want each process to handle 40 ids; the task list will then be:
>>> mission_handler(100, 40)
[(1, 40), (41, 80), (81, 100)]

Process 1 then only needs to handle rows with id between 1 and 40, process 2 only handles id between 41 and 80, and process 3 only handles id between 81 and 100.
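Each worker can then read its slice with an index-range scan instead of an offset; for example, worker 2's query against the db.tb example table would look like this:

select a, b, c from db.tb where id between 41 and 80;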
from concurrent.futures import ProcessPoolExecutor

def main():
    # maximum value of the auto-increment id
    max_id = 30000000
    # number of rows handled by a single worker
    worker_mission_size = 1000000
    # process the data with multiple processes
    missions = mission_handler(max_id, worker_mission_size)
    workers = []
    executor = ProcessPoolExecutor()
    for idx, mission in enumerate(missions):
        start_id, end_id = mission
        workers.append(executor.submit(data_handler, start_id, end_id, idx))

def data_handler(start_id, end_id, worker_id):
    pass
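The submit() calls return concurrent.futures.Future objects, but main() above never waits on them or reads their results. One way to do both is as_completed; a minimal sketch, with run_all as a hypothetical helper name:

from concurrent.futures import ProcessPoolExecutor, as_completed

def run_all(missions):
    # submit one data_handler job per id range, then block until all of them finish
    with ProcessPoolExecutor() as executor:
        workers = [
            executor.submit(data_handler, start_id, end_id, idx)
            for idx, (start_id, end_id) in enumerate(missions)
        ]
        for future in as_completed(workers):
            # each data_handler returns a status string such as 'worker_id(0): result(True)'
            print(future.result())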
Record the ids of the rows that were processed successfully and of those that failed, so they can be followed up on later.
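A possible schema for this status table, assuming row_id references db.tb.id and success is a 0/1 flag:

create table db.tb_handle_status (
    id bigint unsigned auto_increment primary key,
    row_id bigint unsigned not null comment 'id of the processed row in db.tb',
    success tinyint(1) not null comment '1 = success, 0 = failure',
    key idx_row_id (row_id)
) engine = InnoDB;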
# record the processing status in a separate table
insert into db.tb_handle_status (row_id, success) values (999, 0);

Catch exceptions inside the loop body so that the program does not exit unexpectedly:
def data_handler(start_id, end_id, worker_id):
    # database connection
    conn, cursor = mysql()
    current_id = start_id
    try:
        while current_id <= end_id:
            try:
                # TODO data-processing code
                pass
            except Exception as e:
                # TODO record the processing result
                # move on to the next row
                current_id += 1
                continue
            else:
                # no exception; continue with the next row
                current_id += 1
    except Exception as e:
        return 'worker_id({}): result({})'.format(worker_id, False)
    finally:
        # release database resources
        cursor.close()
        conn.close()
    return 'worker_id({}): result({})'.format(worker_id, True)
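The mysql() helper is not shown here; a minimal sketch using PyMySQL, where the library choice and every connection setting are placeholder assumptions:

import pymysql

def mysql():
    # placeholder credentials; swap in the real connection settings
    conn = pymysql.connect(host='127.0.0.1', user='user', password='password',
                           database='db', charset='utf8mb4', autocommit=True)
    return conn, conn.cursor()

Inside the while loop, the per-row work would then be the indexed select/update pair shown earlier, e.g. cursor.execute('select a, b, c from tb where id = %s', (current_id,)).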