As language models have scaled both their number of parameters and pretraining dataset sizes, the computational cost of pretraining has become intractable except for the most well-resourced teams. This increasing cost makes it ever more important to be able to reuse a model after it has completed pretraining, allowing a model's abilities to further improve without needing to train from scratch. In this work, we detail a set of guidelines that cover how to design efficacious data distributions and learning rate schedules for continued pretraining of language models. When applying these findings within a continued pretraining run on top of a well-trained 15B parameter model, we show an improvement of 9% in average model accuracy compared to the baseline of continued training on the pretraining set. The resulting recipe provides a practical starting point with which to begin developing language models through reuse rather than retraining.
Reuse, Don’t Retrain: A Recipe for Continued Pretraining of Language Models
These findings culminate in a recipe that can be used to perform continued pretraining to improve the capabilities of an existing LM. We demonstrate that this recipe is beneficial at continued training scales from 100B to 1 trillion tokens, illustrating its flexibility and robustness across a wide variety of settings. We hope that this recipe will allow model providers to forgo regularly retraining models from scratch, as it makes it possible to reuse a trained model to attain improved capabilities.
The continued pretraining process is as follows: a model is first pretrained, then a data distribution and learning rate schedule are chosen, a continued pretraining run takes place, and finally the (hopefully improved) model is returned. Before delving into the experiments that define the continued training recipe, we detail the datasets and model architecture that are used.
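To make the data-distribution step of this process concrete, the toy sketch below samples training documents from several sources in proportion to a chosen blend. The source names and weights are hypothetical placeholders for illustration, not the distribution used in our experiments.

```python
import random

# Hypothetical data sources and blend weights, for illustration only;
# the actual continued pretraining distribution is defined by the
# experiments in this work.
blend_weights = {
    "web_crawl": 0.6,
    "code": 0.2,
    "question_answering": 0.2,
}

def sample_source(weights: dict, rng: random.Random) -> str:
    """Pick a data source with probability proportional to its blend weight."""
    sources = list(weights)
    return rng.choices(sources, weights=[weights[s] for s in sources], k=1)[0]

# Drawing many samples shows that the empirical source frequencies
# approximate the chosen blend.
rng = random.Random(0)
counts = {source: 0 for source in blend_weights}
for _ in range(10_000):
    counts[sample_source(blend_weights, rng)] += 1
print(counts)
```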
We pretrain the model for 8T tokens. Given that current state-of-the-art LMs are pretrained for trillions of tokens, we want to experiment on top of a pretrained model that is emblematic of the type of models which the continued pretraining recipe would be used for.
The experimental findings which constitute our continued pretraining recipe are shared below:
A crucial component of any training run is the data distribution – it defines the information which a model sees and directly impacts the model's capabilities. As continued pretraining builds on top of a model which has already seen a given pretraining distribution, it is important to define a data distribution which allows the model to learn new concepts without deviating so far from the pretraining distribution that the model begins to experience training instability and accuracy regression. Through a series of runs which tackle what compositions of data distributions best improve the abilities of a pretrained model, we identify general characteristics that can be applied across most continued pretraining scenarios. In these experiments, we use a learning rate schedule that starts from $\eta_{min}$ and decays to 0 with cosine annealing.
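As a concrete reference for this schedule, the sketch below computes a learning rate that starts at $\eta_{min}$ (the minimum learning rate reached at the end of pretraining) and is cosine-annealed to 0 over the continued training run; the numeric values in the usage example are hypothetical placeholders, not the hyperparameters from our experiments.

```python
import math

def cosine_decay_to_zero(step: int, total_steps: int, eta_min: float) -> float:
    """Learning rate that starts at eta_min and decays to 0 via cosine annealing."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * eta_min * (1.0 + math.cos(math.pi * progress))

# Hypothetical values for illustration only.
eta_min, total_steps = 1e-5, 100_000
print(cosine_decay_to_zero(0, total_steps, eta_min))                 # eta_min at the start
print(cosine_decay_to_zero(total_steps // 2, total_steps, eta_min))  # half of eta_min midway
print(cosine_decay_to_zero(total_steps, total_steps, eta_min))       # 0 at the end of the run
```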
We investigate how to effectively continue training LMs to improve upon their existing capabilities. Our experiments show that it is especially important to carefully define the data distribution and learning rate decay schedule used during continued pretraining so that the model is able to smoothly transition away from the pretraining distribution and better learn the newly emphasized data sources. With these findings, we propose a general recipe that model developers can use in order to perform continued pretraining on top of their own LMs, and show that, for our base model, we are able to improve cumulative accuracy by over 18%. We hope that this will be a starting point to enable future LMs to be developed through the reuse of existing models rather than retraining from scratch.
In the development of our continued pretraining recipe, we only experiment along the axes of data distributions and hyperparameter configurations. Although we did not include them within our study, there may be added benefit in exploring other aspects, such as altering the learning algorithm. Additionally, given that our study is conducted on top of a model with a given configuration and which was pretrained using a certain data distribution, the results that we highlight are unlikely to extrapolate well when used in settings highly divergent from the one utilized in the study. Finally, we limited our goal within continued pretraining to improving the general-purpose capabilities of the pretrained model; however, there are many additional angles when considering model reuse, such as domain specialization and the efficient addition of new knowledge into existing models.
The 53 languages contained within the multilingual portion of the pretraining set are: AR, AZ, BG, BN, CA, CS, DA, DE, EL, ES, ET, FA, FI, FR, GL, HE, HI, HR, HU, HY, ID, IS, IT, JA, KA, KK, KN, KO, LT, LV, MK, ML, MR, NE, NL, NO, PL, PT, RO, RU, SK, SL, SQ, SR, SV, TA, TE, TH, TR, UK, UR, VI, and ZH.
The 43 programming languages contained within our pretraining set are: assembly, c, c-sharp, common-lisp, cpp, css, cuda, dart, dockerfile, fortran, go, haskell, html, java, javascript, json, julia, jupyter-scripts, lua, makefile, markdown, mathematica, omniverse, pascal, perl, php, python, R, restructuredtext, ruby, rust, scala, shell, sql, swift, systemverilog, tex, typescript, verilog, vhdl, visual-basic, xml, and yaml.
The evaluation results across all considered tasks are shared below for each of our experiments.