As language models have scaled both their number of parameters and pretraining dataset sizes, the computational cost of pretraining has become intractable except for the most well-resourced teams. This increasing cost makes it ever more important to be able to reuse a model after it has completed pretraining, allowing a model's abilities to further improve without needing to train from scratch. In this work, we detail a set of guidelines that cover how to design efficacious data distributions and learning rate schedules for continued pretraining of language models. When applying these findings within a continued pretraining run on top of a well-trained 15B parameter model, we show an improvement of 9% in average model accuracy compared to the baseline of continued training on the pretraining set. The resulting recipe provides a practical starting point with which to begin developing language models through reuse rather than retraining.
Reuse, Don’t Retrain: A Recipe for Continued Pretraining of Language Models
These findings culminate in a recipe that can be used to perform continued pretraining to improve the capabilities of an existing LM. We demonstrate that this recipe is beneficial at continued training scales from 100B to 1 trillion tokens, illustrating its flexibility and robustness across a wide variety of settings. We hope that this recipe will allow model providers to forgo regularly retraining models from scratch, as it makes it possible to reuse a trained model to attain improved capabilities.
The continued pretraining process is as follows: a model is first pretrained, then a data distribution and learning rate schedule are chosen, a continued pretraining run takes place, and finally the (hopefully improved) model is returned. Before delving into the experiments that define the continued training recipe, we detail the datasets and model architecture that are used.
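To make the data-distribution step of this process concrete, the toy sketch below samples training documents from several sources in proportion to a chosen blend. The source names and weights are hypothetical placeholders for illustration, not the distribution used in our experiments.

```python
import random

# Hypothetical data sources and blend weights, for illustration only;
# the actual continued pretraining distribution is defined by the
# experiments in this work.
blend_weights = {
    "web_crawl": 0.6,
    "code": 0.2,
    "question_answering": 0.2,
}

def sample_source(weights: dict, rng: random.Random) -> str:
    """Pick a data source with probability proportional to its blend weight."""
    sources = list(weights)
    return rng.choices(sources, weights=[weights[s] for s in sources], k=1)[0]

# Drawing many samples shows that the empirical source frequencies
# approximate the chosen blend.
rng = random.Random(0)
counts = {source: 0 for source in blend_weights}
for _ in range(10_000):
    counts[sample_source(blend_weights, rng)] += 1
print(counts)
```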
We pretrain the model for 8T tokens. Given that current state-of-the-art LMs are pretrained for trillions of tokens, we want to experiment on top of a pretrained model that is emblematic of the type of models which the continued pretraining recipe would be used for.
The experimental findings which constitute our continued pretraining recipe are shared below:
A crucial component of any training run is the data distribution – it defines the information which a model sees and directly impacts the model's capabilities. As continued pretraining builds on top of a model which has already seen a given pretraining distribution, it is important to define a data distribution which allows the model to learn new concepts without deviating so far from the pretraining distribution that the model begins to experience training instability and accuracy regression. Through a series of runs which tackle what compositions of data distributions best improve the abilities of a pretrained model, we identify general characteristics that can be applied across most continued pretraining scenarios. In these experiments, we use a learning rate schedule that starts from $\eta_{min}$ and decays to 0 with cosine annealing.
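As a concrete reference for this schedule, the sketch below computes a learning rate that starts at $\eta_{min}$ (the minimum learning rate reached at the end of pretraining) and is cosine-annealed to 0 over the continued training run; the numeric values in the usage example are hypothetical placeholders, not the hyperparameters from our experiments.

```python
import math

def cosine_decay_to_zero(step: int, total_steps: int, eta_min: float) -> float:
    """Learning rate that starts at eta_min and decays to 0 via cosine annealing."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * eta_min * (1.0 + math.cos(math.pi * progress))

# Hypothetical values for illustration only.
eta_min, total_steps = 1e-5, 100_000
print(cosine_decay_to_zero(0, total_steps, eta_min))                 # eta_min at the start
print(cosine_decay_to_zero(total_steps // 2, total_steps, eta_min))  # half of eta_min midway
print(cosine_decay_to_zero(total_steps, total_steps, eta_min))       # 0 at the end of the run
```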
We investigate how to effectively continue training LMs to improve upon their existing capabilities. Our experiments show that it is especially important to carefully define the data distribution and learning rate decay schedule used during continued pretraining so that the model is able to smoothly transition away from the pretraining distribution and better learn the newly emphasized data sources. With these findings, we propose a general recipe that model developers can use in order to perform continued pretraining on top of their own LMs, and show that, for our base model, we are able to improve cumulative accuracy by over 18%. We hope that this will be a starting point to enable future LMs to be developed through the reuse of existing models rather than retraining from scratch.
In the development of our continued pretraining recipe, we only experiment along the axes of data distributions and hyperparameter configurations. Although we did not include them within our study, there may be added benefit in exploring other aspects, such as altering the learning algorithm. Additionally, given that our study is conducted on top of a model with a given configuration and which was pretrained using a certain data distribution, the results that we highlight are unlikely to extrapolate well when used in settings highly divergent from the one utilized in the study. Finally, we limited our goal within continued pretraining to improving the general-purpose capabilities of the pretrained model; however, there are many additional angles when considering model reuse, such as domain specialization and the efficient addition of new knowledge into existing models.
The 53 languages contained within the multilingual portion of the pretraining set are: AR, AZ, BG, BN, CA, CS, DA, DE, EL, ES, ET, FA, FI, FR, GL, HE, HI, HR, HU, HY, ID, IS, IT, JA, KA, KK, KN, KO, LT, LV, MK, ML, MR, NE, NL, NO, PL, PT, RO, RU, SK, SL, SQ, SR, SV, TA, TE, TH, TR, UK, UR, VI, and ZH.
The 43 programming languages contained within our pretraining set are: assembly, c, c-sharp, common-lisp, cpp, css, cuda, dart, dockerfile, fortran, go, haskell, html, java, javascript, json, julia, jupyter-scripts, lua, makefile, markdown, mathematica, omniverse, pascal, perl, php, python, R, restructuredtext, ruby, rust, scala, shell, sql, swift, systemverilog, tex, typescript, verilog, vhdl, visual-basic, xml, and yaml.
The evaluation results across all considered tasks are shared below for each of our experiments.