
A Complete Tutorial to Learn Data Science with Python from Scratch

2017-10-02 20:20 · 513 views

Introduction

It happened a few years back. After working on SAS for more than 5 years, I decided to move out of my comfort zone. Being a data scientist, my hunt for other useful tools was ON! Fortunately, it didn’t take me long to decide: Python was my appetizer.
I always had an inclination towards coding. This was the time to do what I really loved. Code. Turned out, coding was so easy! I learned the basics of Python within a week. And, since then, I’ve not only explored this language in depth, but have also helped many others learn it.
Python was originally a general purpose language. But, over the years, with strong community support, this language got dedicated libraries for data analysis and predictive modeling.
Due to the lack of resources on Python for data science, I decided to create this tutorial to help many others learn Python faster. In this tutorial, we will take bite sized information about how to use Python for data analysis, chew it till we are comfortable and practice it at our own end.

Table of Contents

1. Basics of Python for Data Analysis
    Why learn Python for data analysis?
    Python 2.7 v/s 3.4
    How to install Python?
    Running a few simple programs in Python
2. Python libraries and data structures
    Python Data Structures
    Python Iteration and Conditional Constructs
    Python Libraries
3. Exploratory analysis in Python using Pandas
    Introduction to series and dataframes
    Analytics Vidhya dataset - Loan Prediction Problem
4. Data Munging in Python using Pandas
5. Building a Predictive Model in Python
    Logistic Regression
    Decision Tree
    Random Forest

Let’s get started!

1. Basics of Python for Data Analysis

Why learn Python for data analysis?
Python has gathered a lot of interest recently as a choice of language for data analysis. I had compared it against SAS & R some time back. Here are some reasons which go in favour of learning Python:

Open Source – free to install
Awesome online community
Very easy to learn
Can become a common language for data science and production of web based analytics products.

Needless to say, it still has a few drawbacks too:

It is an interpreted language rather than a compiled language – hence it might take up more CPU time. However, given the savings in programmer time (due to ease of learning), it might still be a good choice.

Python 2.7 v/s 3.4
This is one of the most debated topics in Python. You will invariably cross paths with it, specially if you are a beginner. There is no right/wrong choice here. It totally depends on the situation and your needs. I will try to give you some pointers to help you make an informed choice.
Why Python 2.7?

Awesome community support! This is something you’d need in your early days. Python 2 was released in late 2000 and has been in use for more than 15 years.
Plethora of third-party libraries! Though many libraries have provided 3.x support, a large number of modules still work only on 2.x versions. If you plan to use Python for specific applications like web-development with high reliance on external modules, you might be better off with 2.7.
Some of the features of 3.x versions have backward compatibility and can work with the 2.7 version.

Why Python 3.4?

Cleaner and faster! Python developers have fixed some inherent glitches and minor drawbacks in order to set a stronger foundation for the future. These might not be very relevant initially, but will matter eventually.
It is the future! 2.7 is the last release for the 2.x family and eventually everyone has to shift to 3.x versions. Python 3 has released stable versions for the past 5 years and will continue to do so.

There is no clear winner, but I suppose the bottom line is that you should focus on learning Python as a language. Shifting between versions should just be a matter of time. Stay tuned for a dedicated article on Python 2.x vs 3.x in the near future!
How to install Python?
There are 2 approaches to install Python:

You can download Python directly from its project site and install the individual components and libraries you want
Alternately, you can download and install a package, which comes with pre-installed libraries. I would recommend downloading Anaconda. Another option could be Enthought Canopy Express.

The second method provides a hassle free installation and hence I’ll recommend that to beginners. The limitation of this approach is that you have to wait for the entire package to be upgraded, even if you are interested in the latest version of a single library. It should not matter unless you are doing cutting edge statistical research.
Choosing a development environment
Once you have installed Python, there are various options for choosing an environment. Here are the 3 most common options:

Terminal / Shell based
IDLE (default environment)
iPython notebook – similar to markdown in R


IDLE editor for Python
While the right environment depends on your need, I personally prefer iPython Notebooks a lot. It provides a lot of good features for documenting while writing the code itself and you can choose to run the code in blocks (rather than line by line execution).
We will use the iPython environment for this complete tutorial.
Warming up: Running your first Python program
You can use Python as a simple calculator to start with:

Few things to note

You can start iPython notebook by writing “ipython notebook” on your terminal / cmd, depending on the OS you are working on
You can name an iPython notebook by simply clicking on the name – Untitled0 in the above screenshot
The interface shows In [*] for inputs and Out[*] for output.
You can execute code by pressing “Shift + Enter”, or “ALT + Enter” if you want to insert an additional row after.

Before we deep dive into problem solving, let’s take a step back and understand the basics of Python. Data structures and iteration and conditional constructs form the crux of any language. In Python, these include lists, strings, tuples, dictionaries, for-loop, while-loop, if-else, etc. Let’s take a look at some of these.

2. Python libraries and Data Structures

Python Data Structures
Following are some data structures which are used in Python. You should be familiar with them in order to use them as appropriate.
Lists – Lists are one of the most versatile data structures in Python. A list can simply be defined by writing a list of comma separated values in square brackets. Lists might contain items of different types, but usually the items all have the same type. Python lists are mutable and individual elements of a list can be changed.
Here is a quick example to define a list and then access it:
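What follows is a minimal sketch of defining, indexing and modifying a list. The values and the variable name squares_list are made up for illustration and are not taken from the original article.

```python
# A list is written as comma separated values in square brackets
squares_list = [0, 1, 4, 9, 16, 25]

first = squares_list[0]        # indexing starts at 0
middle = squares_list[2:4]     # slicing returns a new list
last = squares_list[-1]        # negative indices count from the end

# Lists are mutable: individual elements can be changed in place
squares_list[0] = 100
squares_list.append(36)
```

Slicing copies the selected elements, so middle is unaffected by the later in-place changes to squares_list.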

Strings – Strings can simply be defined by use of single ( ' ), double ( " ) or triple ( ''' ) inverted commas. Strings enclosed in triple quotes ( ''' ) can span over multiple lines and are used frequently in docstrings (Python’s way of documenting functions). \ is used as an escape character. Please note that Python strings are immutable, so you can not change part of strings.
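The points above can be sketched as follows. All names and values here are hypothetical. The try/except is only there to show that item assignment on a string fails.

```python
greeting = 'Hello'
name = "World"

# Triple quotes let a string span multiple lines
multiline = '''first line
second line'''

path = 'C:\\data'          # backslash escapes the next character
raw_path = r'C:\data'      # a raw string leaves backslashes untouched

# Strings are immutable: item assignment raises TypeError
try:
    greeting[0] = 'J'
    changed = True
except TypeError:
    changed = False

# Build a new string instead of modifying the old one
new_greeting = 'J' + greeting[1:]
```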


Tuples – A tuple is represented by a number of values separated by commas. Tuples are immutable and the output is surrounded by parentheses so that nested tuples are processed correctly. Additionally, even though tuples are immutable, they can hold mutable data if needed.
Since tuples are immutable and can not change, they are faster in processing as compared to lists. Hence, if your list is unlikely to change, you should use tuples instead of lists.
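A short sketch of these tuple properties, using made-up values. It shows that a tuple rejects item assignment but can still contain a mutable list.

```python
point = 3, 4                 # parentheses are optional when defining
nested = (point, (5, 6))     # tuples can be nested

try:
    point[0] = 10            # tuples do not support item assignment
    changed = True
except TypeError:
    changed = False

# A tuple can still hold a mutable object, which can itself change
record = ('loan', [100, 200])
record[1].append(300)
```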

Dictionary – A dictionary is an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary). A pair of braces creates an empty dictionary: {}.
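As an illustration, here is a small dictionary built from an empty pair of braces. The keys and numbers are invented for this sketch.

```python
extensions = {}                      # empty dictionary
extensions['alice'] = 9073
extensions['bob'] = 9128
extensions['alice'] = 9999           # keys are unique: this overwrites

has_carol = 'carol' in extensions    # membership test works on keys
alice_ext = extensions['alice']
names = sorted(extensions.keys())
```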


Python Iteration and Conditional Constructs
Like most languages, Python also has a FOR-loop which is the most widely used method for iteration. It has a simple syntax:
for i in [Python Iterable]:
    expression(i)

Here “Python Iterable” can be a list, tuple or other advanced data structures which we will explore in later sections. Let’s take a look at a simple example, determining the factorial of a number.
fact = 1
for i in range(1, N+1):
    fact *= i

Coming to conditional statements, these are used to execute code fragments based on a condition. The most commonly used construct is if-else, with the following syntax:
if [condition]:
    __execution if true__
else:
    __execution if false__

For instance, if we want to print whether the number N is even or odd:
if N%2 == 0:
    print('Even')
else:
    print('Odd')

Now that you are familiar with Python fundamentals, let’s take a step further. What if you have to perform the following tasks:

Multiply 2 matrices
Find the root of a quadratic equation
Plot bar charts and histograms
Make statistical models
Access web-pages

If you try to write code from scratch, it’s going to be a nightmare and you won’t stay on Python for more than 2 days! But let’s not worry about that. Thankfully, there are many libraries with predefined functions which we can directly import into our code and make our life easy.
For example, consider the factorial example we just saw. We can do that in a single step as:
math.factorial(N)

Of course we need to import the math library for that. Let’s explore the various libraries next.
Python Libraries
Let’s take one step ahead in our journey to learn Python by getting acquainted with some useful libraries. The first step is obviously to learn to import them into our environment. There are several ways of doing so in Python:
import math as m
from math import *

In the first manner, we have defined an alias m to library math. We can now use various functions from the math library (e.g. factorial) by referencing them using the alias: m.factorial().
In the second manner, you have imported the entire namespace in math, i.e. you can directly use factorial() without referring to math.
Tip: Google recommends that you use the first style of importing libraries, as you will know where the functions have come from.
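To tie the import styles back to the factorial loop, here is a minimal sketch with N set to 5 as an arbitrary example. It imports a single name rather than the whole namespace, a common middle ground that the text above does not itself describe.

```python
import math as m             # first style: the alias keeps the origin visible
from math import factorial   # imports one name instead of everything

N = 5
fact = 1
for i in range(1, N + 1):    # the hand-written loop from earlier
    fact *= i

via_alias = m.factorial(N)
via_name = factorial(N)
```

All three results agree, which is a quick way to check the hand-written loop against the library function.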
Following is a list of libraries you will need for any scientific computations and data analysis:

NumPy stands for Numerical Python. The most powerful feature of NumPy is the n-dimensional array. This library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with other low level languages like Fortran, C and C++
SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful libraries for a variety of high level science and engineering modules like discrete Fourier transform, Linear Algebra, Optimization and Sparse matrices.
Matplotlib for plotting a vast variety of graphs, starting from histograms to line plots to heat plots. You can use the Pylab feature in ipython notebook (ipython notebook --pylab=inline) to use these plotting features inline. If you ignore the inline option, then pylab converts the ipython environment to an environment very similar to Matlab. You can also use Latex commands to add math to your plot.
Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas was added relatively recently to Python and has been instrumental in boosting Python’s usage in the data scientist community.
Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction.
Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics are available for different types of data and each estimator.
Seaborn for statistical data visualization. Seaborn is a library for making attractive and informative statistical graphics in Python. It is based on matplotlib. Seaborn aims to make visualization a central part of exploring and understanding data.
Bokeh for creating interactive plots, dashboards and data applications on modern web-browsers. It empowers the user to generate elegant and concise graphics in the style of D3.js. Moreover, it has the capability of high-performance interactivity over very large or streaming datasets.
Blaze for extending the capability of Numpy and Pandas to distributed and streaming datasets. It can be used to access data from a multitude of sources including Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables, etc. Together with Bokeh, Blaze can act as a very powerful tool for creating effective visualizations and dashboards on huge chunks of data.
Scrapy for web crawling. It is a very useful framework for getting specific patterns of data. It has the capability to start at a website home url and then dig through web-pages within the website to gather information.
SymPy for symbolic computation. It has wide-ranging capabilities from basic symbolic arithmetic to calculus, algebra, discrete mathematics and quantum physics. Another useful feature is the capability of formatting the result of the computations as LaTeX code.
Requests for accessing the web. It works similar to the standard python library urllib2 but is much easier to code. You will find subtle differences with urllib2 but for beginners, Requests might be more convenient.

Additional libraries you might need:

os for Operating system and file operations
networkx and igraph for graph based data manipulations
regular expressions for finding patterns in text data
BeautifulSoup for scraping the web. It is inferior to Scrapy as it will extract information from just a single webpage in a run.

Now that we are familiar with Python fundamentals and additional libraries, let’s take a deep dive into problem solving through Python. Yes, I mean making a predictive model! In the process, we use some powerful libraries and also come across the next level of data structures. We will take you through the 3 key phases:

Data Exploration – finding out more about the data we have
Data Munging – cleaning the data and playing with it to make it better suit statistical modeling
Predictive Modeling – running the actual algorithms and having fun

3. Exploratory analysis in Python using Pandas

In order to explore our data further, let me introduce you to another animal (as if Python was not enough!) – Pandas



Image Source: Wikipedia
Pandas is one of the most useful data analysis libraries in Python (I know these names sound weird, but hang on!). It has been instrumental in increasing the use of Python in the data science community. We will now use Pandas to read a data set from an Analytics Vidhya competition, perform exploratory analysis and build our first basic categorization algorithm for solving this problem.
Before loading the data, let’s understand the 2 key data structures in Pandas – Series and DataFrames
Introduction to Series and Dataframes
A Series can be understood as a 1 dimensional labelled / indexed array. You can access individual elements of this series through these labels.
A dataframe is similar to an Excel workbook – you have column names referring to columns and you have rows, which can be accessed with the use of row numbers. The essential difference is that column names and row numbers are known as the column and row index in the case of dataframes.
Series and dataframes form the core data model for Pandas in Python. The data sets are first read into these dataframes and then various operations (e.g. group by, aggregation etc.) can be applied very easily to its columns.
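A minimal sketch of both structures follows. The values, the index labels and the name df_toy are invented for illustration; only the column names echo the loan dataset used later.

```python
import pandas as pd

# A Series is a 1-dimensional array whose elements are accessed by label
income = pd.Series([5849, 4583, 3000], index=['a', 'b', 'c'])
b_income = income['b']

# A DataFrame is a table with a column index and a row index
df_toy = pd.DataFrame({
    'Education': ['Graduate', 'Graduate', 'Not Graduate'],
    'LoanAmount': [128, 66, 120],
})
one_column = df_toy['LoanAmount']     # a single column is itself a Series

# Group by one column and aggregate another
mean_by_edu = df_toy.groupby('Education')['LoanAmount'].mean()
```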
More: 10 Minutes to Pandas
Practice data set – Loan Prediction Problem
You can download the dataset from here. Here is the description of variables:
VARIABLE DESCRIPTIONS:
Variable            Description
Loan_ID             Unique Loan ID
Gender              Male / Female
Married             Applicant married (Y/N)
Dependents          Number of dependents
Education           Applicant Education (Graduate / Under Graduate)
Self_Employed       Self employed (Y/N)
ApplicantIncome     Applicant income
CoapplicantIncome   Coapplicant income
LoanAmount          Loan amount in thousands
Loan_Amount_Term    Term of loan in months
Credit_History      credit history meets guidelines
Property_Area       Urban / Semi Urban / Rural
Loan_Status         Loan approved (Y/N)

Let’s begin with exploration
To begin, start the iPython interface in Inline Pylab mode by typing the following on your terminal / windows command prompt:
ipython notebook --pylab=inline

This opens up iPython notebook in the pylab environment, which has a few useful libraries already imported. Also, you will be able to plot your data inline, which makes this a really good environment for interactive data analysis. You can check whether the environment has loaded correctly by typing the following command (and getting the output as seen in the figure below):
plot(arange(5))




I am currently working in Linux, and have stored the dataset in the following location:
/home/kunal/Downloads/Loan_Prediction/train.csv
Importing libraries and the data set:
Following are the libraries we will use during this tutorial:

numpy
matplotlib
pandas

Please note that you do not need to import matplotlib and numpy because of the Pylab environment. I have still kept them in the code, in case you use the code in a different environment.
After importing the libraries, you read the dataset using the function read_csv(). This is how the code looks till this stage:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # the plotting interface lives in the pyplot submodule

df = pd.read_csv("/home/kunal/Downloads/Loan_Prediction/train.csv")  # Reading the dataset in a dataframe using Pandas

Quick Data Exploration
Once you have read the dataset, you can have a look at a few top rows by using the function head()
df.head(10)




This should print 10 rows. Alternately, you can also look at more rows by printing the dataset.
Next, you can look at a summary of numerical fields by using the describe() function
df.describe()




The describe() function provides count, mean, standard deviation (std), min, quartiles and max in its output (Read this article to refresh basic statistics to understand population distribution)
Here are a few inferences you can draw by looking at the output of the describe() function:

LoanAmount has (614 – 592) 22 missing values.
Loan_Amount_Term has (614 – 600) 14 missing values.
Credit_History has (614 – 564) 50 missing values.
We can also see that about 84% of applicants have a credit_history. How? The mean of the Credit_History field is 0.84 (Remember, Credit_History has value 1 for those who have a credit history and 0 otherwise)
The ApplicantIncome distribution seems to be in line with expectation. Same with CoapplicantIncome

Please note that we can get an idea of a possible skew in the data by comparing the mean to the median, i.e. the 50% figure.
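The mean versus median comparison can be sketched on a few hypothetical incomes (invented numbers, not rows from the loan dataset). One very large value pulls the mean up while the median barely moves.

```python
import pandas as pd

# Hypothetical incomes: mostly moderate values plus one very large one
incomes = pd.Series([2500, 3000, 3500, 4000, 4500, 81000])

summary = incomes.describe()
mean_income = summary['mean']
median_income = summary['50%']      # the 50% row of describe() is the median

# A mean far above the median points to a long right tail
right_skewed = mean_income > median_income
```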
For the non-numerical values (e.g. Property_Area, Credit_History etc.), we can look at the frequency distribution to understand whether they make sense or not. The frequency table can be printed by the following command:
df['Property_Area'].value_counts()

Similarly, we can look at the unique values of credit history. Note that dfname['column_name'] is a basic indexing technique to access a particular column of the dataframe. It can be a list of columns as well. For more information, refer to the “10 Minutes to Pandas” resource shared above.
Distribution analysis
Now that we are familiar with basic data characteristics, let us study the distribution of various variables. Let us start with numeric variables – namely ApplicantIncome and LoanAmount
Let’s start by plotting the histogram of ApplicantIncome using the following command:
df['ApplicantIncome'].hist(bins=50)




Here we observe that there are a few extreme values. This is also the reason why 50 bins are required to depict the distribution clearly.
Next, we look at box plots to understand the distributions. The box plot for ApplicantIncome can be plotted by:
df.boxplot(column='ApplicantIncome')




This confirms the presence of a lot of outliers / extreme values. This can be attributed to the income disparity in society. Part of this can be driven by the fact that we are looking at people with different education levels. Let us segregate them by Education:
df.boxplot(column='ApplicantIncome', by='Education')




We can see that there is no substantial difference between the mean income of graduates and non-graduates. But there are a higher number of graduates with very high incomes, which appear to be the outliers.
Now, let’s look at the histogram and boxplot of LoanAmount using the following commands:
df['LoanAmount'].hist(bins=50)


df.boxplot(column='LoanAmount')




Again, there are some extreme values. Clearly, both ApplicantIncome and LoanAmount require some amount of data munging. LoanAmount has missing as well as extreme values, while ApplicantIncome has a few extreme values, which demand deeper understanding. We will take this up in coming sections.
Categorical variable analysis
Now that we understand distributions for ApplicantIncome and LoanAmount, let us understand categorical variables in more detail. We will use Excel style pivot tables and cross-tabulation. For instance, let us look at the chances of getting a loan based on credit history. This can be achieved in MS Excel using a pivot table as:



Note: here loan status has been coded as 1 for Yes and 0 for No. So the mean represents the probability of getting a loan.
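Why the mean of a 0/1 column equals an approval rate can be seen on a tiny example. The six applicants below are hypothetical, and groupby is used here simply as a compact stand-in for the pivot table.

```python
import pandas as pd

# Hypothetical applicants, not taken from the loan dataset
toy = pd.DataFrame({
    'Credit_History': [1, 1, 1, 1, 0, 0],
    'Loan_Status':    ['Y', 'Y', 'Y', 'N', 'N', 'Y'],
})

# Code Y as 1 and N as 0, then average within each credit history class
approved = toy['Loan_Status'].map({'Y': 1, 'N': 0})
prob_by_history = approved.groupby(toy['Credit_History']).mean()
```

With three approvals out of four for class 1 and one out of two for class 0, the group means are exactly the approval fractions.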
Now we will look at the steps required to generate a similar insight using Python. Please refer to this article for getting a hang of the different data manipulation techniques in Pandas.
temp1 = df['Credit_History'].value_counts(ascending=True)
temp2 = df.pivot_table(values='Loan_Status', index=['Credit_History'], aggfunc=lambda x: x.map({'Y':1, 'N':0}).mean())
print('Frequency Table for Credit History:')
print(temp1)
print('\nProbability of getting loan for each Credit History class:')
print(temp2)




Now we can observe that we get a pivot_table similar to the MS Excel one. This can be plotted as a bar chart using the “matplotlib” library with the following code:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8,4))
ax1 = fig.add_subplot(121)
ax1.set_xlabel('Credit_History')
ax1.set_ylabel('Count of Applicants')
ax1.set_title("Applicants by Credit_History")
temp1.plot(kind='bar', ax=ax1)  # draw on the left subplot explicitly

ax2 = fig.add_subplot(122)
temp2.plot(kind='bar', ax=ax2)  # draw on the right subplot explicitly
ax2.set_xlabel('Credit_History')
ax2.set_ylabel('Probability of getting loan')
ax2.set_title("Probability of getting loan by credit history")




This shows that the chances of getting a loan are eight-fold if the applicant has a valid credit history. You can plot similar graphs by Married, Self-Employed, Property_Area, etc.
Alternately, these two plots can also be visualized by combining them in a stacked chart:
temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])
temp3.plot(kind='bar', stacked=True, color=['red','blue'], grid=False)




You can also add gender into the mix (similar to the pivot table in Excel):



If you have not realized already, we have just created two basic classification algorithms here, one based on credit history, while the other is based on 2 categorical variables (including gender). You can quickly code this to create your first submission on AV Datahacks.
We just saw how we can do exploratory analysis in Python using Pandas. I hope your love for pandas (the animal) would have increased by now – given the amount of help the library can provide you in analyzing datasets.
Next let’s explore the ApplicantIncome and LoanStatus variables further, perform data munging and create a dataset for applying various modeling techniques. I would strongly urge that you take another dataset and problem and go through an independent example before reading further.

4. Data Munging in Python: Using Pandas

For those who have been following, here are your must wear shoes to start running.

Data munging – recap of the need
During our exploration of the data, we found a few problems in the data set, which need to be solved before the data is ready for a good model. This exercise is typically referred to as “Data Munging”. Here are the problems we are already aware of:

There are missing values in some variables. We should estimate those values wisely depending on the amount of missing values and the expected importance of variables.
While looking at the distributions, we saw that ApplicantIncome and LoanAmount seemed to contain extreme values at either end. Though they might make intuitive sense, they should be treated appropriately.

In addition to these problems with numerical fields, we should also look at the non-numerical fields, i.e. Gender, Property_Area, Married, Education and Dependents, to see if they contain any useful information.
If you are new to Pandas, I would recommend reading this article before moving on. It details some useful techniques of data manipulation.
Check missing values in the dataset
Let us look at missing values in all the variables because most of the models don’t work with missing data and even if they do, imputing them helps more often than not. So, let us check the number of nulls / NaNs in the dataset
df.apply(lambda x: sum(x.isnull()), axis=0)

This command should tell us the number of missing values in each column, as isnull() returns True (counted as 1) if the value is null.
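Here is a small, self-contained sketch of the same count on a made-up frame. The name toy and its values are assumptions for illustration. It also shows isnull().sum(), a shorter spelling that gives the same per-column counts.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'LoanAmount': [128.0, np.nan, 66.0, np.nan],
    'Self_Employed': ['No', 'Yes', None, 'No'],
    'Education': ['Graduate', 'Graduate', 'Not Graduate', 'Graduate'],
})

# isnull() marks missing cells as True; summing counts them per column
missing_apply = toy.apply(lambda x: sum(x.isnull()), axis=0)
missing_short = toy.isnull().sum()      # equivalent, shorter form
```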



Though the missing values are not very high in number, many variables have them and each one of these should be estimated and added to the data. Get a detailed view on different imputation techniques through this article.
Note: Remember that missing values may not always be NaNs. For instance, if the Loan_Amount_Term is 0, does it make sense or would you consider that missing? I suppose your answer is missing and you’re right. So we should check for values which are impractical.
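One way to handle such disguised missing values, sketched on hypothetical loan terms, is to count the impractical entries and then convert them to NaN so that later imputation treats them like any other gap. Whether 0 is truly impractical is a judgment about the data, assumed here.

```python
import numpy as np
import pandas as pd

# Hypothetical loan terms in months; 0 is assumed not to be a practical term
terms = pd.Series([360.0, 0.0, 180.0, np.nan, 0.0], name='Loan_Amount_Term')

n_zero = int((terms == 0).sum())         # impractical values hiding as data
cleaned = terms.replace(0, np.nan)       # treat them as missing
n_missing = int(cleaned.isnull().sum())
```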
How to fill missing values in LoanAmount?
There are numerous ways to fill the missing values of loan amount – the simplest being replacement by mean, which can be done by the following code:
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())

The other extreme could be to build a supervised learning model to predict loan amount on the basis of other variables and then use the predicted loan amount along with other variables to predict loan status.
Since the purpose now is to bring out the steps in data munging, I’ll rather take an approach which lies somewhere in between these 2 extremes. A key hypothesis is that whether a person is educated or self-employed can combine to give a good estimate of loan amount.
First, let’s look at the boxplot to see if a trend exists:



Thus we see some variations in the median of loan amount for each group and this can be used to impute the values. But first, we have to ensure that neither the Self_Employed nor the Education variable has missing values.
As we saw earlier, Self_Employed has some missing values. Let’s look at the frequency table:



Since ~86% of values are “No”, it is safe to impute the missing values as “No” as there is a high probability of success. This can be done using the following code:
df['Self_Employed'] = df['Self_Employed'].fillna('No')

Now, we will create a pivot table, which provides us median values for all the groups of unique values of Self_Employed and Education features. Next, we define a function, which returns the values of these cells and apply it to fill the missing values of loan amount:
table = df.pivot_table(values='LoanAmount', index='Self_Employed', columns='Education', aggfunc=np.median)
# Define function to return value of this pivot_table
def fage(x):
    return table.loc[x['Self_Employed'], x['Education']]
# Replace missing values
df['LoanAmount'] = df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1))

This should provide you a good way to impute missing values of loan amount.
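The same group-median idea can also be written with groupby and transform instead of a pivot table plus a lookup function. This is an alternative sketch, not the article’s method, and the toy frame below is invented so the result can be checked by hand.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Self_Employed': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes'],
    'Education': ['Graduate'] * 6,
    'LoanAmount': [100.0, 120.0, np.nan, 150.0, 170.0, np.nan],
})

# Median LoanAmount of each (Self_Employed, Education) group, aligned row by row
group_median = toy.groupby(['Self_Employed', 'Education'])['LoanAmount'].transform('median')

# Fill only the missing cells with the median of their own group
toy['LoanAmount'] = toy['LoanAmount'].fillna(group_median)
```

Because transform returns a Series aligned with the original rows, fillna can consume it directly and no row-wise apply is needed.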
How to treat for extreme values in distribution of LoanAmount and ApplicantIncome?
Let’s analyze LoanAmount first. The extreme values are practically possible, i.e. some people might apply for high value loans due to specific needs. So instead of treating them as outliers, let’s try a log transformation to nullify their effect:
df['LoanAmount_log'] = np.log(df['LoanAmount'])
df['LoanAmount_log'].hist(bins=20)

Looking at the histogram again:



Now the distribution looks much closer to normal and the effect of extreme values has significantly subsided.
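The compressing effect of the log can be illustrated on a handful of hypothetical loan amounts with one extreme value. The ratio of the largest value to the median is just one rough, assumed way to express how far the extreme sits from the bulk of the data.

```python
import numpy as np
import pandas as pd

# Hypothetical loan amounts with one extreme value
amounts = pd.Series([100.0, 120.0, 130.0, 150.0, 700.0])
amounts_log = np.log(amounts)

# Ratio of largest to median value, before and after the transform
ratio_raw = amounts.max() / amounts.median()
ratio_log = amounts_log.max() / amounts_log.median()
```

After the transform the extreme value is still the largest, but it sits much closer to the rest, which is why the histogram looks more symmetric.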
Coming to ApplicantIncome. One intuition can be that some applicants have lower income but strong support from co-applicants. So it might be a good idea to combine both incomes as total income and take a log transformation of the same.
df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df['TotalIncome_log'] = np.log(df['TotalIncome'])
df['TotalIncome_log'].hist(bins=20)




Now we see that the distribution is much better than before. I will leave it up to you to impute the missing values for Gender, Married, Dependents, Loan_Amount_Term, Credit_History. Also, I encourage you to think about possible additional information which can be derived from the data. For example, creating a column for LoanAmount/TotalIncome might make sense as it gives an idea of how well the applicant is suited to pay back the loan.
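As a starting point for that exercise, here is one possible sketch on invented rows: filling a categorical field with its most frequent value, then deriving the ratio feature. The column name Loan_to_Income and the choice of mode imputation are assumptions, not something the article prescribes.

```python
import pandas as pd

toy = pd.DataFrame({
    'Gender': ['Male', 'Male', None, 'Female'],
    'ApplicantIncome': [5000, 3000, 4000, 2500],
    'CoapplicantIncome': [0.0, 1500.0, 1000.0, 2500.0],
    'LoanAmount': [130.0, 90.0, 100.0, 200.0],
})

# One simple option for a categorical field: fill with the most frequent value
toy['Gender'] = toy['Gender'].fillna(toy['Gender'].mode()[0])

# Derived feature: loan amount relative to combined income
toy['TotalIncome'] = toy['ApplicantIncome'] + toy['CoapplicantIncome']
toy['Loan_to_Income'] = toy['LoanAmount'] / toy['TotalIncome']
```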
Next, we will look at making predictive models.

5. Building a Predictive Model in Python

After we have made the data useful for modeling, let’s now look at the python code to create a predictive model on our data set. Scikit-Learn (sklearn) is the most commonly used library in Python for this purpose and we will follow the trail. I encourage you to get a refresher on sklearn through this article.
Since sklearn requires all inputs to be numeric, we should convert all our categorical variables into numeric by encoding the categories. This can be done using the following code:
from sklearn.preprocessing import LabelEncoder
var_mod = ['Gender','Married','Dependents','Education','Self_Employed','Property_Area','Loan_Status']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i])
df.dtypes
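To see what the encoder does to a single column, here is a minimal sketch on a hypothetical list of property areas. LabelEncoder sorts the distinct values and numbers them from 0, so the codes carry no meaning beyond identity.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values for illustration
areas = ['Urban', 'Rural', 'Semiurban', 'Urban', 'Rural']

le = LabelEncoder()
codes = le.fit_transform(areas)          # classes are sorted, then numbered
mapping = dict(zip(le.classes_, range(len(le.classes_))))
decoded = list(le.inverse_transform(codes))
```

Keeping the fitted encoder around lets you map predictions back to the original labels with inverse_transform.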

Next, we will import the required modules. Then we will define a generic classification function, which takes a model as input and determines the Accuracy and Cross-Validation scores. Since this is an introductory article, I will not go into the details of coding. Please refer to this article for getting details of the algorithms with R and Python codes. Also, it’ll be good to get a refresher on cross-validation through this article, as it is a very important measure of model performance.
# Import models from scikit learn module:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold  # For K-fold cross validation (replaces sklearn.cross_validation, removed in scikit-learn 0.20)
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

# Generic function for making a classification model and assessing performance:
def classification_model(model, data, predictors, outcome):
    # Fit the model:
    model.fit(data[predictors], data[outcome])

    # Make predictions on training set:
    predictions = model.predict(data[predictors])

    # Print accuracy
    accuracy = metrics.accuracy_score(predictions, data[outcome])
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))

    # Perform k-fold cross-validation with 5 folds
    kf = KFold(n_splits=5)
    error = []
    for train, test in kf.split(data[predictors]):
        # Filter training data
        train_predictors = (data[predictors].iloc[train, :])

        # The target we're using to train the algorithm.
        train_target = data[outcome].iloc[train]

        # Training the algorithm using the predictors and target.
        model.fit(train_predictors, train_target)

        # Record accuracy from each cross-validation run
        error.append(model.score(data[predictors].iloc[test, :], data[outcome].iloc[test]))

    print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))

    # Fit the model again so that it can be referred to outside the function:
    model.fit(data[predictors], data[outcome])

Logistic Regression
Let’s make our first Logistic Regression model. One way would be to take all the variables into the model but this might result in overfitting (don’t worry if you’re unaware of this terminology yet). In simple words, taking all variables might result in the model understanding complex relations specific to the data and will not generalize well. Read more about Logistic Regression.
We can easily make some intuitive hypotheses to set the ball rolling. The chances of getting a loan will be higher for:

Applicants having a credit history (remember we observed this in exploration?)
Applicants with higher applicant and co-applicant incomes
Applicants with higher education level
Properties in urban areas with high growth perspectives

So let’s make our first model with ‘Credit_History’.
outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']
classification_model(model, df, predictor_var, outcome_var)

Accuracy : 80.945% Cross-Validation Score : 80.946%
# We can try different combination of variables:
predictor_var = ['Credit_History', 'Education', 'Married', 'Self_Employed', 'Property_Area']
classification_model(model, df, predictor_var, outcome_var)

Accuracy : 80.945% Cross-Validation Score : 80.946%
Generally we expect the accuracy to increase on adding variables. But this is a more challenging case. The accuracy and cross-validation score are not getting impacted by less important variables. Credit_History is dominating the model. We have two options now:

Feature Engineering: derive new information and try to predict those. I will leave this to your creativity.
Better modeling techniques. Let’s explore this next.

Decision Tree
Decision tree is another method for making a predictive model. It is known to provide higher accuracy than the logistic regression model. Read more about Decision Trees.
model = DecisionTreeClassifier()
predictor_var = ['Credit_History', 'Gender', 'Married', 'Education']
classification_model(model, df, predictor_var, outcome_var)

Accuracy : 81.930% Cross-Validation Score : 76.656%
Here the model based on categorical variables is unable to have an impact because Credit History is dominating over them. Let’s try a few numerical variables:
# We can try different combination of variables:
predictor_var = ['Credit_History', 'Loan_Amount_Term', 'LoanAmount_log']
classification_model(model, df, predictor_var, outcome_var)

Accuracy : 92.345% Cross-Validation Score : 71.009%
Here we observed that although the accuracy went up on adding variables, the cross-validation score went down. This is the result of the model over-fitting the data. Let’s try an even more sophisticated algorithm and see if it helps.
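The gap between training accuracy and cross-validation accuracy can be reproduced on purely synthetic data. In this sketch the features are random noise and the labels are unrelated to them, an assumption chosen to make the effect obvious. An unrestricted decision tree still memorises the training rows perfectly.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 200 rows of random features with labels unrelated to them
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = rng.randint(0, 2, size=200)

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

train_accuracy = model.score(X, y)                       # scored on data it has seen
cv_accuracy = cross_val_score(model, X, y, cv=5).mean()  # scored on held-out folds
```

Training accuracy is perfect because every row can be isolated in its own leaf, while the held-out score stays far lower since there is nothing real to learn.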
Random Forest
Random forest is another algorithm for solving the classification problem. Read more about Random Forest.
An advantage with Random Forest is that we can make it work with all the features and it returns a feature importance matrix which can be used to select features.
model = RandomForestClassifier(n_estimators=100)
predictor_var = ['Gender', 'Married', 'Dependents', 'Education',
                 'Self_Employed', 'Loan_Amount_Term', 'Credit_History', 'Property_Area',
                 'LoanAmount_log', 'TotalIncome_log']
classification_model(model, df, predictor_var, outcome_var)

Accuracy : 100.000% Cross-Validation Score : 78.179%
Here we see that the accuracy is 100% for the training set. This is the ultimate case of overfitting and can be resolved in two ways:

Reducing the number of predictors
Tuning the model parameters

Let’s try both of these. First we see the feature importance matrix from which we’ll take the most important features.
# Create a series with feature importances:
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)




Let’s use the top 5 variables for creating a model. Also, we will modify the parameters of the random forest model a little bit:
model = RandomForestClassifier(n_estimators=25, min_samples_split=25, max_depth=7, max_features=1)
predictor_var = ['TotalIncome_log', 'LoanAmount_log', 'Credit_History', 'Dependents', 'Property_Area']
classification_model(model, df, predictor_var, outcome_var)

Accuracy : 82.899% Cross-Validation Score : 81.461%
Notice that although accuracy reduced, the cross-validation score is improving, showing that the model is generalizing well. Remember that random forest models are not exactly repeatable. Different runs will result in slight variations because of randomization. But the output should stay in the ballpark.
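If repeatable runs are needed, for example to compare parameter settings fairly, scikit-learn estimators accept a random_state seed. The sketch below uses synthetic data and arbitrary seed values as assumptions. It trains two forests with the same seed and checks that they agree.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only
rng = np.random.RandomState(42)
X = rng.rand(100, 4)
y = (X[:, 0] + 0.3 * rng.rand(100) > 0.6).astype(int)

# Same data, same parameters, same seed: the two forests are identical
model_a = RandomForestClassifier(n_estimators=25, random_state=1).fit(X, y)
model_b = RandomForestClassifier(n_estimators=25, random_state=1).fit(X, y)

same_importances = np.allclose(model_a.feature_importances_,
                               model_b.feature_importances_)
same_predictions = (model_a.predict(X) == model_b.predict(X)).all()
```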
You would have noticed that even after some basic parameter tuning on random forest, we have reached a cross-validation accuracy only slightly better than the original logistic regression model. This exercise gives us some very interesting and unique learning:

Using a more sophisticated model does not guarantee better results.
Avoid using complex modeling techniques as a black box without understanding the underlying concepts. Doing so would increase the tendency of overfitting thus making your models less interpretable
Feature Engineering is the key to success. Everyone can use an Xgboost model but the real art and creativity lies in enhancing your features to better suit the model.

So are you ready to take on the challenge? Start your data science journey with the Loan Prediction Problem.

End Notes

I hope this tutorial will help you maximize your efficiency when starting with data science in Python. I am sure this not only gave you an idea about basic data analysis methods but it also showed you how to implement some of the more sophisticated techniques available today.
Python is really a great tool, and is becoming an increasingly popular language among data scientists. The reason being, it’s easy to learn and integrates well with other databases and tools like Spark and Hadoop. Majorly, it has great computational intensity and has powerful data analytics libraries.
So, learn Python to perform the full life-cycle of any data science project. It includes reading, analyzing, visualizing and finally making predictions.
If you come across any difficulty while practicing Python, or you have any thoughts / suggestions / feedback on the post, please feel free to post them through comments below.