The unrivaled threat of Android malware is the root cause of various security problems on the internet. The Android malware industry is becoming increasingly disruptive, with almost 12,000 new Android malware instances every day. Detecting Android malware in smartphones is an essential target for the cyber community to get rid of menacing malware samples.
Android malware is one of the most serious threats on the internet and has witnessed an unprecedented upsurge in recent years. It is an open challenge for cybersecurity experts. There are many techniques available to identify and classify Android malware based on machine learning, but recently, deep learning has emerged as a prominent classification method for such samples.
This research work proposes a new, comprehensive, and large Android malware dataset named CCCS-CIC-AndMal-2020. The dataset includes 200K benign and 200K malware samples, totalling 400K Android apps, with 14 prominent malware categories and 191 eminent malware families.
A complete taxonomy of all the malware families of the captured malware apps is created by dividing them into eight categories: sensitive data collection, media, hardware, actions/activities, internet connection, C&C, antivirus, and storage & settings. The taxonomy is presented in the research paper mentioned under license (Section 5).
CCCS supported us in capturing real-world Android malware apps for analysis. We used VirusTotal to specify the malware family and labeled the dataset by following a consensus of 70% of the anti-virus engines, so that the labels are reliable. We searched for similar malware samples in order to categorize samples in the dataset that share similar characteristics. Table 1 presents the details of the 14 Android malware categories along with the number of respective families and samples in the dataset.
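The 70% consensus rule can be pictured with a minimal sketch. The function name `consensus_label`, the input shape (one family label per anti-virus engine, `None` when an engine reports nothing), and the choice to count every listed engine in the denominator are all assumptions made for illustration; the exact procedure and report format used to build the dataset are not specified here.

```python
from collections import Counter


def consensus_label(engine_labels, threshold=0.70):
    """Return the family name that at least `threshold` of all listed
    engines agree on, or None when no family reaches that share.

    `engine_labels` is a hypothetical list with one entry per engine:
    a family string, or None if the engine gave no verdict.
    """
    total = len(engine_labels)
    if total == 0:
        return None

    # Normalize case and whitespace so "Ransom" and "ransom " count together.
    votes = Counter(label.strip().lower() for label in engine_labels if label)
    if not votes:
        return None

    family, count = votes.most_common(1)[0]
    # Assumption: engines without a verdict still count in the denominator.
    return family if count / total >= threshold else None
```

With this convention, a sample flagged as the same family by 8 of 10 engines keeps that label, while a 6-of-10 split is left unlabeled.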
Table 1: Dataset details
The families of each malware category in Table 1, along with the numbers of captured samples, are presented below:
For benign Android apps, we used the Androzoo dataset, which currently contains more than eight million unique Android apps, and the number is still growing. The Androzoo architecture collects apps from different sources, including the official Android market, Google Play, Anshi, AppChina, 1mobile, and the Genome project dataset. A weekly updated list containing detailed information about all the apps is created. An HTTP API is provided to allow the full download of the unaltered APKs from the Androzoo dataset.
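As a rough sketch of how such an HTTP download might look from Python, the snippet below builds a request URL from an API key and an APK's SHA-256 hash. The endpoint `https://androzoo.uni.lu/api/download` and the `apikey` / `sha256` query parameters are assumptions based on the public Androzoo documentation and should be checked against the current version; the helper names are hypothetical.

```python
import string
import urllib.parse
import urllib.request

# Assumed endpoint; verify against the current Androzoo documentation.
ANDROZOO_DOWNLOAD = "https://androzoo.uni.lu/api/download"


def build_download_url(api_key, sha256):
    """Build the download URL for one APK identified by its SHA-256 hash."""
    if len(sha256) != 64 or any(c not in string.hexdigits for c in sha256):
        raise ValueError("sha256 must be a 64-character hexadecimal string")
    query = urllib.parse.urlencode({"apikey": api_key, "sha256": sha256})
    return f"{ANDROZOO_DOWNLOAD}?{query}"


def download_apk(api_key, sha256, dest_path):
    """Fetch the APK and write the bytes to dest_path unchanged."""
    url = build_download_url(api_key, sha256)
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        out.write(response.read())
```

Validating the hash before issuing the request keeps malformed entries from the weekly list from turning into failed downloads.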
AndroidManifest.xml contains a lot of features that can be used for static analysis. The main extracted features include:
Table 2 presents examples of the static features extracted from the captured dataset.
Table 2: List of static features
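To illustrate the kind of static information a manifest carries, here is a minimal sketch that reads a few element types from an AndroidManifest.xml. It assumes the manifest has already been decoded to plain-text XML (inside an APK it is stored in a binary format, which a tool such as apktool can decode). The function name and the particular elements chosen are illustrative only and are not the dataset's actual feature list.

```python
import xml.etree.ElementTree as ET

# Namespace used for android:* attributes in a decoded manifest.
ANDROID_NS = "{http://schemas.android.com/apk/res/android}"


def extract_manifest_features(manifest_xml):
    """Collect a few static features from decoded manifest text."""
    root = ET.fromstring(manifest_xml)

    def names(path):
        # Gather the android:name attribute of every element matching path.
        found = []
        for element in root.iterfind(path):
            name = element.get(ANDROID_NS + "name")
            if name:
                found.append(name)
        return found

    return {
        "package": root.get("package"),
        "permissions": names("uses-permission"),
        "activities": names("application/activity"),
        "services": names("application/service"),
        "receivers": names("application/receiver"),
    }
```

Each returned list can then be turned into binary or count features for a classifier, for example one column per requested permission.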
For understanding the behavioral changes of these malware categories and families, six categories of features are extracted after executing the malware in an emulated environment. The main extracted features include:
Table 3 presents the complete list of dynamic features extracted from the dynamic execution of the malware.
Table 3: List of dynamic features
You may redistribute, republish and mirror the CCCS-CIC-AndMal-2020 dataset in any form. However, any use or redistribution of the data must include a citation to the CCCS-CIC-AndMal-2020 dataset and the following papers.
We thank the Mitacs Globalink Program for providing the Research Internship (GRI) opportunity and the Harrison McCain Young Scholar Foundation funds from the University of New Brunswick (UNB) for supporting this project. We also thank CCCS for sharing the malware samples of this dataset with us.