
A Complete Tutorial to Learn Data Science with Python from Scratch

2017-10-02 20:20


Introduction

It happened a few years back. After working on SAS for more than 5 years, I decided to move out of my comfort zone. Being a data scientist, my hunt for other useful tools was ON! Fortunately, it didn't take me long to decide: Python was my appetizer.

I always had an inclination towards coding. This was the time to do what I really loved. Code. Turned out, coding was so easy! I learned the basics of Python within a week. And, since then, I've not only explored this language in depth, but have also helped many others learn it.

Python was originally a general-purpose language. But, over the years, with strong community support, the language gained dedicated libraries for data analysis and predictive modeling. Due to the lack of resources on Python for data science, I decided to create this tutorial to help many others learn Python faster. In this tutorial, we will take bite-sized information about how to use Python for data analysis, chew it till we are comfortable and practice it at our own end.

Table of Contents

1. Basics of Python for Data Analysis
    Why learn Python for data analysis?
    Python 2.7 v/s 3.4
    How to install Python?
    Running a few simple programs in Python
2. Python libraries and data structures
    Python Data Structures
    Python Iteration and Conditional Constructs
    Python Libraries
3. Exploratory analysis in Python using Pandas
    Introduction to series and dataframes
    Analytics Vidhya dataset - Loan Prediction Problem
4. Data Munging in Python using Pandas
5. Building a Predictive Model in Python
    Logistic Regression
    Decision Tree
    Random Forest

Let's get started!

1. Basics of Python for Data Analysis

Why learn Python for data analysis?
Python has gathered a lot of interest recently as a choice of language for data analysis. I had compared it against SAS & R some time back. Here are some reasons which go in favour of learning Python:

Open Source – free to install
Awesome online community
Very easy to learn
Can become a common language for data science and production of web-based analytics products.

Needless to say, it still has a few drawbacks too:

It is an interpreted language rather than a compiled language – hence it might take up more CPU time. However, given the savings in programmer time (due to ease of learning), it might still be a good choice.
Python 2.7 v/s 3.4
This is one of the most debated topics in Python. You will invariably cross paths with it, especially if you are a beginner. There is no right/wrong choice here. It totally depends on the situation and your needs. I will try to give you some pointers to help you make an informed choice.

Why Python 2.7?

Awesome community support! This is something you'd need in your early days. Python 2 was released in late 2000 and has been in use for more than 15 years.
Plethora of third-party libraries! Though many libraries have provided 3.x support, a large number of modules still work only on 2.x versions. If you plan to use Python for specific applications like web development with high reliance on external modules, you might be better off with 2.7.
Some of the features of 3.x versions have backward compatibility and can work with the 2.7 version.

Why Python 3.4?

Cleaner and faster! Python developers have fixed some inherent glitches and minor drawbacks in order to set a stronger foundation for the future. These might not be very relevant initially, but will matter eventually.
It is the future! 2.7 is the last release for the 2.x family and eventually everyone has to shift to 3.x versions. Python 3 has released stable versions for the past 5 years and will continue the same.

There is no clear winner, but I suppose the bottom line is that you should focus on learning Python as a language. Shifting between versions should just be a matter of time. Stay tuned for a dedicated article on Python 2.x vs 3.x in the near future!

How to install Python?
There are 2 approaches to install Python:

You can download Python directly from its project site and install the individual components and libraries you want
Alternately, you can download and install a package, which comes with pre-installed libraries. I would recommend downloading Anaconda. Another option could be Enthought Canopy Express.
The second method provides a hassle-free installation and hence I'll recommend that to beginners. The limitation of this approach is that you have to wait for the entire package to be upgraded, even if you are interested in the latest version of a single library. It should not matter unless you are doing cutting-edge statistical research.

Choosing a development environment
Once you have installed Python, there are various options for choosing an environment. Here are the 3 most common options:

Terminal / Shell based
IDLE (default environment)
iPython notebook – similar to markdown in R

[Image: IDLE editor for Python]

While the right environment depends on your need, I personally prefer iPython Notebooks a lot. They provide a lot of good features for documenting while writing the code itself, and you can choose to run the code in blocks (rather than line-by-line execution). We will use the iPython environment for this complete tutorial.

Warming up: Running your first Python program
You can use Python as a simple calculator to start with:
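The original screenshot is not reproduced here, so the following is only a minimal sketch of what "Python as a calculator" can look like. The numbers are illustrative and the snippet assumes Python 3.

```python
# Basic arithmetic, as you might type it into a notebook cell.
total = 2 + 3 * 4        # multiplication binds tighter than addition
power = 2 ** 10          # exponentiation
quotient = 17 // 5       # floor division
remainder = 17 % 5       # modulo

print(total, power, quotient, remainder)
```

In a notebook you can also type a bare expression such as 2 + 3 into a cell, and its value is echoed as the cell output.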

A few things to note:

You can start iPython notebook by writing "ipython notebook" on your terminal / cmd, depending on the OS you are working on
You can name an iPython notebook by simply clicking on the name – Untitled0 in the above screenshot
The interface shows In [*] for inputs and Out[*] for output.
You can execute code by pressing "Shift + Enter", or "ALT + Enter" if you want to insert an additional row after.

Before we deep dive into problem solving, let's take a step back and understand the basics of Python. Data structures and iteration and conditional constructs form the crux of any language. In Python, these include lists, strings, tuples, dictionaries, for-loop, while-loop, if-else, etc. Let's take a look at some of these.

2. Python libraries and Data Structures

Python Data Structures
Following are some data structures which are used in Python. You should be familiar with them in order to use them as appropriate.

Lists – Lists are one of the most versatile data structures in Python. A list can simply be defined by writing comma-separated values in square brackets. Lists might contain items of different types, but usually the items all have the same type. Python lists are mutable and individual elements of a list can be changed.
Here is a quick example to define a list and then access it:
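The screenshot for this example did not survive, so here is a small stand-in sketch (Python 3, illustrative values) that defines a list, reads from it and then changes it:

```python
squares = [0, 1, 4, 9, 16, 25]   # illustrative values

first = squares[0]        # indexing starts at 0
last = squares[-1]        # negative indices count from the end
middle = squares[2:4]     # slicing returns a new list

squares[1] = 100          # lists are mutable: replace one element
squares.append(36)        # ...or grow the list in place

print(first, last, middle, squares)
```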

Strings – Strings can simply be defined by use of single ( ' ), double ( " ) or triple ( ''' ) inverted commas. Strings enclosed in triple quotes ( ''' ) can span multiple lines and are used frequently in docstrings (Python's way of documenting functions). \ is used as an escape character. Please note that Python strings are immutable, so you cannot change part of a string.
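As a rough illustration of these points (Python 3, made-up values), the sketch below defines strings with different quote styles and shows what happens when you try to modify one in place:

```python
greeting = 'Hello'
multiline = """This string
spans two lines."""
escaped = 'It\'s escaped'       # backslash escapes the quote

first_char = greeting[0]        # strings can be indexed like lists

try:
    greeting[0] = 'J'           # strings are immutable, so this fails
except TypeError as exc:
    error_message = str(exc)

changed = 'J' + greeting[1:]    # build a new string instead

print(first_char, changed)
print(error_message)
```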

Tuples – A tuple is represented by a number of values separated by commas. Tuples are immutable and the output is surrounded by parentheses so that nested tuples are processed correctly. Additionally, even though tuples are immutable, they can hold mutable data if needed.
Since tuples are immutable and cannot change, they are faster to process than lists. Hence, if your list is unlikely to change, you should use a tuple instead of a list.
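A short sketch of both properties (Python 3, illustrative values): the tuple itself cannot be reassigned, but a mutable object stored inside it can still change.

```python
point = (3, 4, ['a'])       # a tuple holding two ints and a list

try:
    point[0] = 10           # tuples do not support item assignment
except TypeError:
    reassigned = False
else:
    reassigned = True

point[2].append('b')        # the list inside the tuple is still mutable

x, y, tags = point          # tuple unpacking
print(reassigned, x, y, tags)
```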

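Dictionary – A dictionary is a collection of key: value pairs, with the requirement that the keys are unique (within one dictionary). It is unordered in older Python versions; since Python 3.7 it preserves insertion order. A pair of braces creates an empty dictionary: {}.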
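The sketch below (Python 3) shows the basic dictionary operations. The names and numbers are hypothetical and only serve as an illustration.

```python
extensions = {}                     # empty dictionary
extensions['alice'] = 9073          # hypothetical entries
extensions['bob'] = 9128
extensions['alice'] = 9000          # assigning to an existing key overwrites it

has_carol = 'carol' in extensions   # membership test looks at the keys
names = sorted(extensions.keys())

print(extensions, has_carol, names)
```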

Python Iteration and Conditional Constructs
Like most languages, Python also has a FOR-loop, which is the most widely used method for iteration. It has a simple syntax:

for i in [Python Iterable]:
    expression(i)

Here "Python Iterable" can be a list, tuple or other advanced data structures which we will explore in later sections. Let's take a look at a simple example, determining the factorial of a number.

fact = 1
for i in range(1, N + 1):
    fact *= i

Coming to conditional statements, these are used to execute code fragments based on a condition. The most commonly used construct is if-else, with the following syntax:

if [condition]:
    __execution if true__
else:
    __execution if false__

For instance, if we want to print whether the number N is even or odd:

if N % 2 == 0:
    print('Even')
else:
    print('Odd')

Now that you are familiar with Python fundamentals, let's take a step further. What if you have to perform the following tasks:

Multiply 2 matrices
Find the root of a quadratic equation
Plot bar charts and histograms
Make statistical models
Access web-pages

If you try to write code from scratch, it's going to be a nightmare and you won't stay on Python for more than 2 days! But let's not worry about that. Thankfully, there are many libraries with predefined functions which we can directly import into our code and make our life easy.
For example, consider the factorial example we just saw. We can do that in a single step as:

math.factorial(N)

Of course we need to import the math library for that. Let's explore the various libraries next.
Python Libraries
Let's take one step ahead in our journey to learn Python by getting acquainted with some useful libraries. The first step is obviously to learn to import them into our environment. There are several ways of doing so in Python:

import math as m

from math import *

In the first manner, we have defined an alias m to library math. We can now use various functions from the math library (e.g. factorial) by referencing them using the alias: m.factorial().
In the second manner, you have imported the entire namespace in math, i.e. you can directly use factorial() without referring to math.
Tip: Google recommends that you use the first style of importing libraries, as you will know where the functions have come from.
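To tie the import styles back to the factorial example, here is a small sketch (Python 3, with an illustrative N). It imports only the one name it needs rather than using the star import, which is an assumption made to keep the namespace small.

```python
import math as m              # first style: alias the whole module
from math import factorial    # import a single name into the namespace

N = 5                         # illustrative value

# Hand-written loop from the earlier section
fact = 1
for i in range(1, N + 1):
    fact *= i

via_alias = m.factorial(N)    # call through the alias
via_direct = factorial(N)     # call the imported name directly

print(fact, via_alias, via_direct)
```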
Following is a list of libraries you will need for any scientific computations and data analysis:

NumPy stands for Numerical Python. The most powerful feature of NumPy is the n-dimensional array. This library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with other low-level languages like Fortran, C and C++
SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful libraries for a variety of high-level science and engineering modules like discrete Fourier transform, Linear Algebra, Optimization and Sparse matrices.
Matplotlib for plotting a vast variety of graphs, from histograms to line plots to heat plots. You can use the Pylab feature in ipython notebook (ipython notebook --pylab=inline) to use these plotting features inline. If you ignore the inline option, then pylab converts the ipython environment to an environment very similar to Matlab. You can also use LaTeX commands to add math to your plot.
Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas was added relatively recently to Python and has been instrumental in boosting Python's usage in the data scientist community.
Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction.
Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator.
Seaborn for statistical data visualization. Seaborn is a library for making attractive and informative statistical graphics in Python. It is based on matplotlib. Seaborn aims to make visualization a central part of exploring and understanding data.
Bokeh for creating interactive plots, dashboards and data applications on modern web browsers. It empowers the user to generate elegant and concise graphics in the style of D3.js. Moreover, it has the capability of high-performance interactivity over very large or streaming datasets.
Blaze for extending the capability of NumPy and Pandas to distributed and streaming datasets. It can be used to access data from a multitude of sources including Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables, etc. Together with Bokeh, Blaze can act as a very powerful tool for creating effective visualizations and dashboards on huge chunks of data.
Scrapy for web crawling. It is a very useful framework for getting specific patterns of data. It has the capability to start at a website home URL and then dig through web pages within the website to gather information.
SymPy for symbolic computation. It has wide-ranging capabilities from basic symbolic arithmetic to calculus, algebra, discrete mathematics and quantum physics. Another useful feature is the capability of formatting the result of the computations as LaTeX code.
Requests for accessing the web. It works similarly to the standard Python library urllib2 but is much easier to code. You will find subtle differences with urllib2, but for beginners Requests might be more convenient.

Additional libraries you might need:

os for operating system and file operations
networkx and igraph for graph-based data manipulations
regular expressions for finding patterns in text data
BeautifulSoup for scraping the web. It is inferior to Scrapy as it will extract information from just a single webpage in a run.

Now that we are familiar with Python fundamentals and additional libraries, let's take a deep dive into problem solving through Python. Yes, I mean making a predictive model! In the process, we use some powerful libraries and also come across the next level of data structures. We will take you through the 3 key phases:

Data Exploration – finding out more about the data we have
Data Munging – cleaning the data and playing with it to make it better suit statistical modeling
Predictive Modeling – running the actual algorithms and having fun

3. Exploratory analysis in Python using Pandas

In order to explore our data further, let me introduce you to another animal (as if Python was not enough!) – Pandas

Image Source: Wikipedia
Pandas is one of the most useful data analysis libraries in Python (I know these names sound weird, but hang on!). It has been instrumental in increasing the use of Python in the data science community. We will now use Pandas to read a dataset from an Analytics Vidhya competition, perform exploratory analysis and build our first basic categorization algorithm for solving this problem.
Before loading the data, let's understand the 2 key data structures in Pandas – Series and DataFrames

Introduction to Series and Dataframes
A Series can be understood as a 1-dimensional labelled / indexed array. You can access individual elements of this series through these labels.
A dataframe is similar to an Excel workbook – you have column names referring to columns and you have rows, which can be accessed with the use of row numbers. The essential difference is that column names and row numbers are known as the column and row index in the case of dataframes.
Series and dataframes form the core data model for Pandas in Python. The datasets are first read into these dataframes and then various operations (e.g. group by, aggregation etc.) can be applied very easily to their columns.
More: 10 Minutes to Pandas
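To make the two structures concrete, here is a minimal sketch using a tiny hand-made dataset. The values are invented for illustration and are not taken from the competition data. It builds a Series with labels, builds a DataFrame, and applies a group-by aggregation to one column.

```python
import pandas as pd

# A Series: a 1-dimensional array whose elements are accessed by label
income = pd.Series([5849, 4583, 3000], index=['a', 'b', 'c'])
b_income = income['b']

# A DataFrame: columns are accessed by name, rows by position or index
frame = pd.DataFrame({
    'Education': ['Graduate', 'Graduate', 'Not Graduate'],
    'ApplicantIncome': [5849, 4583, 3000],
})
first_row = frame.iloc[0]                      # first row, by position
mean_by_edu = frame.groupby('Education')['ApplicantIncome'].mean()

print(b_income)
print(mean_by_edu)
```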
Practice dataset – Loan Prediction Problem
You can download the dataset from here. Here is the description of variables:

VARIABLE DESCRIPTIONS:

Variable            Description
Loan_ID             Unique Loan ID
Gender              Male / Female
Married             Applicant married (Y/N)
Dependents          Number of dependents
Education           Applicant Education (Graduate / Under Graduate)
Self_Employed       Self employed (Y/N)
ApplicantIncome     Applicant income
CoapplicantIncome   Coapplicant income
LoanAmount          Loan amount in thousands
Loan_Amount_Term    Term of loan in months
Credit_History      Credit history meets guidelines
Property_Area       Urban / Semi Urban / Rural
Loan_Status         Loan approved (Y/N)

Let's begin with exploration
To begin, start the iPython interface in Inline Pylab mode by typing the following on your terminal / Windows command prompt:

ipython notebook --pylab=inline

This opens up iPython notebook in the pylab environment, which has a few useful libraries already imported. Also, you will be able to plot your data inline, which makes this a really good environment for interactive data analysis. You can check whether the environment has loaded correctly by typing the following command (and getting the output as seen in the figure below):

plot(arange(5))

I am currently working in Linux, and have stored the dataset in the following location:
/home/kunal/Downloads/Loan_Prediction/train.csv

Importing libraries and the dataset:
Following are the libraries we will use during this tutorial:

numpy
matplotlib
pandas

Please note that you do not need to import matplotlib and numpy because of the Pylab environment. I have still kept them in the code, in case you use the code in a different environment.
After importing the libraries, you read the dataset using the function read_csv(). This is how the code looks up to this stage:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Reading the dataset in a dataframe using Pandas
df = pd.read_csv("/home/kunal/Downloads/Loan_Prediction/train.csv")

Quick Data Exploration
Once you have read the dataset, you can have a look at a few top rows by using the function head()

df.head(10)

This should print 10 rows. Alternately, you can also look at more rows by printing the dataset.
Next, you can look at a summary of the numerical fields by using the describe() function

df.describe()

The describe() function provides count, mean, standard deviation (std), min, quartiles and max in its output (Read this article to refresh basic statistics to understand population distribution)
Here are a few inferences you can draw by looking at the output of the describe() function:

LoanAmount has (614 – 592) 22 missing values.
Loan_Amount_Term has (614 – 600) 14 missing values.
Credit_History has (614 – 564) 50 missing values.
We can also see that about 84% of applicants have a credit history. How? The mean of the Credit_History field is 0.84 (Remember, Credit_History has value 1 for those who have a credit history and 0 otherwise)
The ApplicantIncome distribution seems to be in line with expectation. Same with CoapplicantIncome

Please note that we can get an idea of a possible skew in the data by comparing the mean to the median, i.e. the 50% figure.
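As a rough sketch of that mean-versus-median check, the snippet below uses a small invented income series with one very large value. The numbers are illustrative only and do not come from the loan dataset.

```python
import pandas as pd

# Illustrative incomes with a single extreme value at the top end
incomes = pd.Series([2500, 3000, 3200, 3800, 4100, 81000])

summary = incomes.describe()
mean_income = summary['mean']
median_income = summary['50%']      # the 50% row of describe() is the median

# A mean far above the median hints at a right-skewed distribution
right_skewed = mean_income > median_income

print(summary)
print(right_skewed)
```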
For the non-numerical values (e.g. Property_Area, Credit_History etc.), we can look at the frequency distribution to understand whether they make sense or not. The frequency table can be printed by the following command:

df['Property_Area'].value_counts()

Similarly, we can look at the unique values of Credit_History. Note that dfname['column_name'] is a basic indexing technique to access a particular column of the dataframe. It can be a list of columns as well. For more information, refer to the "10 Minutes to Pandas" resource shared above.
Distribution analysis
Now that we are familiar with basic data characteristics, let us study the distribution of various variables. Let us start with numeric variables – namely ApplicantIncome and LoanAmount
Let's start by plotting the histogram of ApplicantIncome using the following command:

df['ApplicantIncome'].hist(bins=50)

Here we observe that there are a few extreme values. This is also the reason why 50 bins are required to depict the distribution clearly.
Next, we look at box plots to understand the distributions. The box plot for ApplicantIncome can be plotted by:

df.boxplot(column='ApplicantIncome')

This confirms the presence of a lot of outliers/extreme values. This can be attributed to the income disparity in society. Part of this can be driven by the fact that we are looking at people with different education levels. Let us segregate them by Education:

df.boxplot(column='ApplicantIncome', by='Education')

We can see that there is no substantial difference between the mean income of graduates and non-graduates. But there are a higher number of graduates with very high incomes, which appear as the outliers.
Now, let's look at the histogram and boxplot of LoanAmount using the following commands:

df['LoanAmount'].hist(bins=50)

df.boxplot(column='LoanAmount')

Again, there are some extreme values. Clearly, both ApplicantIncome and LoanAmount require some amount of data munging. LoanAmount has missing as well as extreme values, while ApplicantIncome has a few extreme values, which demand deeper understanding. We will take this up in coming sections.
Categorical variable analysis
Now that we understand distributions for ApplicantIncome and LoanAmount, let us understand categorical variables in more detail. We will use Excel-style pivot tables and cross-tabulation. For instance, let us look at the chances of getting a loan based on credit history. This can be achieved in MS Excel using a pivot table as:

Note: here loan status has been coded as 1 for Yes and 0 for No. So the mean represents the probability of getting a loan.
Now we will look at the steps required to generate a similar insight using Python. Please refer to this article for getting the hang of the different data manipulation techniques in Pandas.

temp1 = df['Credit_History'].value_counts(ascending=True)

# Map Y/N to 1/0 so that the mean is the share of approved loans
temp2 = df.pivot_table(values='Loan_Status', index=['Credit_History'],
                       aggfunc=lambda x: x.map({'Y': 1, 'N': 0}).mean())

print('Frequency Table for Credit History:')
print(temp1)

print('\nProbability of getting loan for each Credit History class:')
print(temp2)

Now we can observe that we get a pivot table similar to the MS Excel one. This can be plotted as a bar chart using the "matplotlib" library with the following code:

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 4))

# Left panel: number of applicants per Credit_History class
ax1 = fig.add_subplot(121)
temp1.plot(kind='bar', ax=ax1)
ax1.set_xlabel('Credit_History')
ax1.set_ylabel('Count of Applicants')
ax1.set_title("Applicants by Credit_History")

# Right panel: pass ax explicitly so the DataFrame plot does not open a new figure
ax2 = fig.add_subplot(122)
temp2.plot(kind='bar', ax=ax2)
ax2.set_xlabel('Credit_History')
ax2.set_ylabel('Probability of getting loan')
ax2.set_title("Probability of getting loan by credit history")

This shows that the chances of getting a loan are eight-fold if the applicant has a valid credit history. You can plot similar graphs by Married, Self_Employed, Property_Area, etc.
Alternately, these two plots can also be visualized by combining them in a stacked chart:

temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])

temp3.plot(kind='bar', stacked=True, color=['red', 'blue'], grid=False)

You can also add gender into the mix (similar to the pivot table in Excel):
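The chart for this step is not reproduced here. One way to get a comparable table is to pass two row variables to pd.crosstab, as in the sketch below. The toy frame and the name temp4 are assumptions made for illustration; on the real data you would pass the columns of df instead.

```python
import pandas as pd

# Toy stand-in for df, with invented rows
toy = pd.DataFrame({
    'Credit_History': [1.0, 1.0, 1.0, 0.0, 0.0, 1.0],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'Loan_Status': ['Y', 'Y', 'N', 'N', 'N', 'Y'],
})

# Rows: every (Credit_History, Gender) pair; columns: Loan_Status counts
temp4 = pd.crosstab([toy['Credit_History'], toy['Gender']], toy['Loan_Status'])

print(temp4)
```

The resulting table can then be drawn in the same way as temp3, for example with temp4.plot(kind='bar', stacked=True).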

If you have not realized already, we have just created two basic classification algorithms here, one based on credit history, while the other is based on 2 categorical variables (including gender). You can quickly code this to create your first submission on AV Datahacks.
We just saw how we can do exploratory analysis in Python using Pandas. I hope your love for pandas (the animal) would have increased by now, given the amount of help the library can provide you in analyzing datasets.
Next let's explore the ApplicantIncome and LoanStatus variables further, perform data munging and create a dataset for applying various modeling techniques. I would strongly urge that you take another dataset and problem and go through an independent example before reading further.

4. Data Munging in Python: Using Pandas

For those who have been following, here are your must-wear shoes to start running.

Data munging – recap of the need
During our exploration of the data, we found a few problems in the dataset, which need to be solved before the data is ready for a good model. This exercise is typically referred to as "Data Munging". Here are the problems we are already aware of:

There are missing values in some variables. We should estimate those values wisely depending on the amount of missing values and the expected importance of variables.
While looking at the distributions, we saw that ApplicantIncome and LoanAmount seemed to contain extreme values at either end. Though they might make intuitive sense, they should be treated appropriately.

In addition to these problems with numerical fields, we should also look at the non-numerical fields, i.e. Gender, Property_Area, Married, Education and Dependents, to see if they contain any useful information.
If you are new to Pandas, I would recommend reading this article before moving on. It details some useful techniques of data manipulation.

Check missing values in the dataset
Let us look at missing values in all the variables because most of the models don't work with missing data and even if they do, imputing them helps more often than not. So, let us check the number of nulls / NaNs in the dataset

df.apply(lambda x: sum(x.isnull()), axis=0)

This command should tell us the number of missing values in each column, as isnull() returns True (counted as 1) if the value is null.

Though the missing values are not very high in number, many variables have them and each one of these should be estimated and added in the data. Get a detailed view on different imputation techniques through this article.
Note: Remember that missing values may not always be NaNs. For instance, if the Loan_Amount_Term is 0, does it make sense or would you consider that missing? I suppose your answer is missing and you're right. So we should check for values which are impractical.
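One possible way to act on that note is sketched below: count the impractical zeros and then turn them into NaN so that they are handled together with the other missing values. The toy column is invented, and treating 0 as missing is the assumption discussed above, not something the dataset documents.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Loan_Amount_Term column
toy = pd.DataFrame({'Loan_Amount_Term': [360.0, 0.0, 180.0, np.nan, 360.0]})

# How many terms are zero, i.e. present but impractical?
n_zero_terms = int((toy['Loan_Amount_Term'] == 0).sum())

# Recode zeros as NaN so later imputation treats them as missing
toy['Loan_Amount_Term'] = toy['Loan_Amount_Term'].replace(0, np.nan)
n_missing = int(toy['Loan_Amount_Term'].isnull().sum())

print(n_zero_terms, n_missing)
```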
How to fill missing values in LoanAmount?
There are numerous ways to fill the missing values of loan amount – the simplest being replacement by the mean, which can be done by the following code:

df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())

The other extreme could be to build a supervised learning model to predict loan amount on the basis of other variables and then use the imputed loan amount along with other variables to predict the loan status.
Since the purpose now is to bring out the steps in data munging, I'll rather take an approach which lies somewhere in between these 2 extremes. A key hypothesis is that whether a person is educated or self-employed can combine to give a good estimate of loan amount.
First, let's look at the boxplot to see if a trend exists:

Thus we see some variations in the median of loan amount for each group and this can be used to impute the values. But first, we have to ensure that neither the Self_Employed nor the Education variable has missing values.
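The boxplot itself is not reproduced here. As a numeric stand-in, the sketch below computes the per-group medians that such a boxplot would display, using a toy frame with invented values.

```python
import pandas as pd

# Toy stand-in for df with the three relevant columns
toy = pd.DataFrame({
    'Education': ['Graduate', 'Graduate', 'Not Graduate',
                  'Not Graduate', 'Graduate', 'Graduate'],
    'Self_Employed': ['No', 'No', 'No', 'No', 'Yes', 'Yes'],
    'LoanAmount': [120.0, 140.0, 100.0, 110.0, 150.0, 170.0],
})

# Median loan amount for each (Education, Self_Employed) group
group_medians = toy.groupby(['Education', 'Self_Employed'])['LoanAmount'].median()

print(group_medians)
```

On the real data, the plot can likely be produced with df.boxplot(column='LoanAmount', by=['Education', 'Self_Employed']), since the by argument also accepts a list of columns.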
As we saw earlier, Self_Employed has some missing values. Let's look at the frequency table:
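The frequency table image is missing, so here is a sketch of how such a table can be produced. The toy series is invented; on the real data the same calls would be made on df['Self_Employed'].

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Self_Employed column, including one missing value
self_employed = pd.Series(['No', 'No', 'No', 'Yes', 'No', np.nan, 'No', 'No'])

counts = self_employed.value_counts()                 # NaN is excluded by default
shares = self_employed.value_counts(normalize=True)   # fractions of non-missing rows

print(counts)
print(shares)
```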

Since ~86% of the values are "No", it is safe to impute the missing values as "No", as there is a high probability of being right. This can be done using the following code:

df['Self_Employed'] = df['Self_Employed'].fillna('No')

Now, we will create a pivot table, which provides us median values for all the groups of unique values of the Self_Employed and Education features. Next, we define a function which returns the values of these cells and apply it to fill the missing values of loan amount:

table = df.pivot_table(values='LoanAmount', index='Self_Employed', columns='Education', aggfunc='median')

# Define function to return value of this pivot_table
def fage(x):
    return table.loc[x['Self_Employed'], x['Education']]

# Replace missing values
df['LoanAmount'] = df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1))

This should provide you a good way to impute missing values of loan amount.
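For comparison, the same group-median idea can also be written with groupby and transform instead of a pivot table and a helper function. This is an alternative sketch on a toy frame with invented values, not the method used in this tutorial.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df with two missing loan amounts
toy = pd.DataFrame({
    'Self_Employed': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes'],
    'Education': ['Graduate'] * 6,
    'LoanAmount': [100.0, 120.0, np.nan, 200.0, np.nan, 240.0],
})

# Median of each (Self_Employed, Education) group, aligned row by row
group_median = (toy.groupby(['Self_Employed', 'Education'])['LoanAmount']
                   .transform('median'))

# Fill only the missing entries with their group's median
toy['LoanAmount'] = toy['LoanAmount'].fillna(group_median)

print(toy)
```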
How to treat extreme values in the distribution of LoanAmount and ApplicantIncome?
Let's analyze LoanAmount first. The extreme values are practically possible, i.e. some people might apply for high-value loans due to specific needs. So instead of treating them as outliers, let's try a log transformation to nullify their effect:

df['LoanAmount_log'] = np.log(df['LoanAmount'])

df['LoanAmount_log'].hist(bins=20)

Looking at the histogram again:

Now the distribution looks much closer to normal and the effect of extreme values has significantly subsided.
Coming to ApplicantIncome. One intuition can be that some applicants have lower income but strong supporting co-applicants. So it might be a good idea to combine both incomes as total income and take a log transformation of the same.

df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']

df['TotalIncome_log'] = np.log(df['TotalIncome'])

df['TotalIncome_log'].hist(bins=20)

Now we see that the distribution is much better than before. I will leave it up to you to impute the missing values for Gender, Married, Dependents, Loan_Amount_Term and Credit_History. Also, I encourage you to think about possible additional information which can be derived from the data. For example, creating a column for LoanAmount/TotalIncome might make sense as it gives an idea of how well the applicant is suited to pay back the loan.
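A minimal sketch of that derived column is shown below. The toy rows are invented and the column name LoanAmount_to_Income is hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Toy stand-in for df
toy = pd.DataFrame({
    'ApplicantIncome': [5000, 3000],
    'CoapplicantIncome': [0.0, 2000.0],
    'LoanAmount': [100.0, 250.0],
})

toy['TotalIncome'] = toy['ApplicantIncome'] + toy['CoapplicantIncome']

# Hypothetical feature: loan amount relative to the combined income
toy['LoanAmount_to_Income'] = toy['LoanAmount'] / toy['TotalIncome']

print(toy)
```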
Next, we will look at making predictive models.

5. Building a Predictive Model in Python

Now that we have made the data useful for modeling, let's look at the Python code to create a predictive model on our dataset. Scikit-Learn (sklearn) is the most commonly used library in Python for this purpose and we will follow the trail. I encourage you to get a refresher on sklearn through this article.
Since sklearn requires all inputs to be numeric, we should convert all our categorical variables into numeric ones by encoding the categories. This can be done using the following code:

from sklearn.preprocessing import LabelEncoder

# Assumes missing values in these columns have already been imputed
var_mod = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i])

df.dtypes

Next, we will import the required modules. Then we will define a generic classification function, which takes a model as input and determines the Accuracy and Cross-Validation scores. Since this is an introductory article, I will not go into the details of coding. Please refer to this article for getting details of the algorithms with R and Python codes. Also, it'll be good to get a refresher on cross-validation through this article, as it is a very important measure of model performance.

# Import models from scikit-learn module:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold   # For K-fold cross-validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

# Generic function for making a classification model and assessing performance:
def classification_model(model, data, predictors, outcome):
    # Fit the model:
    model.fit(data[predictors], data[outcome])

    # Make predictions on training set:
    predictions = model.predict(data[predictors])

    # Print accuracy
    accuracy = metrics.accuracy_score(data[outcome], predictions)
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))

    # Perform k-fold cross-validation with 5 folds
    kf = KFold(n_splits=5)
    error = []
    for train, test in kf.split(data):
        # Filter training data
        train_predictors = data[predictors].iloc[train, :]

        # The target we're using to train the algorithm.
        train_target = data[outcome].iloc[train]

        # Training the algorithm using the predictors and target.
        model.fit(train_predictors, train_target)

        # Record the score (accuracy) from each cross-validation run
        error.append(model.score(data[predictors].iloc[test, :], data[outcome].iloc[test]))

    print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(error)))

    # Fit the model again so that it can be referred to outside the function:
    model.fit(data[predictors], data[outcome])

Logistic Regression
Let's make our first Logistic Regression model. One way would be to take all the variables into the model, but this might result in overfitting (don't worry if you're unaware of this terminology yet). In simple words, taking all variables might result in the model understanding complex relations specific to the data, and it will not generalize well. Read more about Logistic Regression.
We can easily make some intuitive hypotheses to set the ball rolling. The chances of getting a loan will be higher for:

Applicants having a credit history (remember we observed this in exploration?)
Applicants with higher applicant and co-applicant incomes
Applicants with higher education level
Properties in urban areas with high growth perspectives

So let's make our first model with 'Credit_History'.

outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']
classification_model(model, df, predictor_var, outcome_var)

Accuracy : 80.945% Cross-Validation Score : 80.946%

# We can try different combinations of variables:
predictor_var = ['Credit_History', 'Education', 'Married', 'Self_Employed', 'Property_Area']
classification_model(model, df, predictor_var, outcome_var)

Accuracy : 80.945% Cross-Validation Score : 80.946%
Generally we expect the accuracy to increase on adding variables. But this is a more challenging case. The accuracy and cross-validation score are not getting impacted by less important variables. Credit_History is dominating the model. We have two options now:

Feature Engineering: derive new information and try to predict using it. I will leave this to your creativity.
Better modeling techniques. Let's explore this next.

Decision Tree
Decision tree is another method for making a predictive model. It often provides higher accuracy than a logistic regression model, though this is not guaranteed. Read more about Decision Trees.

model = DecisionTreeClassifier()
predictor_var = ['Credit_History', 'Gender', 'Married', 'Education']
classification_model(model, df, predictor_var, outcome_var)

Accuracy : 81.930% Cross-Validation Score : 76.656%
Here the model based on categorical variables is unable to have an impact because Credit_History is dominating over them. Let's try a few numerical variables:

# We can try different combinations of variables:
predictor_var = ['Credit_History', 'Loan_Amount_Term', 'LoanAmount_log']
classification_model(model, df, predictor_var, outcome_var)

Accuracy : 92.345% Cross-Validation Score : 71.009%
Here we observe that although the accuracy went up on adding variables, the cross-validation score went down. This is the result of the model over-fitting the data. Let's try an even more sophisticated algorithm and see if it helps:
Random Forest
Random forest is another algorithm for solving the classification problem. Read more about Random Forest.
An advantage with Random Forest is that we can make it work with all the features and it returns a feature importance matrix which can be used to select features.

model = RandomForestClassifier(n_estimators=100)
predictor_var = ['Gender', 'Married', 'Dependents', 'Education',
                 'Self_Employed', 'Loan_Amount_Term', 'Credit_History', 'Property_Area',
                 'LoanAmount_log', 'TotalIncome_log']
classification_model(model, df, predictor_var, outcome_var)

Accuracy : 100.000% Cross-Validation Score : 78.179%
Here we see that the accuracy is 100% for the training set. This is the ultimate case of overfitting and can be resolved in two ways:

Reducing the number of predictors
Tuning the model parameters

Let's try both of these. First we see the feature importance matrix, from which we'll take the most important features.

# Create a series with feature importances:
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)

Let's use the top 5 variables for creating a model. Also, we will modify the parameters of the random forest model a little bit:

model = RandomForestClassifier(n_estimators=25, min_samples_split=25, max_depth=7, max_features=1)
predictor_var = ['TotalIncome_log', 'LoanAmount_log', 'Credit_History', 'Dependents', 'Property_Area']
classification_model(model, df, predictor_var, outcome_var)

Accuracy : 82.899% Cross-Validation Score : 81.461%
Notice that although the accuracy reduced, the cross-validation score improved, showing that the model is generalizing well. Remember that random forest models are not exactly repeatable. Different runs will result in slight variations because of randomization. But the output should stay in the ballpark.
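If you do want repeatable runs, scikit-learn's RandomForestClassifier accepts a random_state argument that fixes the seed. The sketch below demonstrates this on synthetic data generated with NumPy, not on the loan dataset; the seed values are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 60 rows, 3 features, label depends on the first feature
rng = np.random.RandomState(0)
X = rng.rand(60, 3)
y = (X[:, 0] > 0.5).astype(int)

# Two forests built with the same seed should be identical
model_a = RandomForestClassifier(n_estimators=25, random_state=42).fit(X, y)
model_b = RandomForestClassifier(n_estimators=25, random_state=42).fit(X, y)

same_predictions = bool((model_a.predict(X) == model_b.predict(X)).all())
same_importances = bool(np.allclose(model_a.feature_importances_,
                                    model_b.feature_importances_))

print(same_predictions, same_importances)
```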
You would have noticed that even after some basic parameter tuning on random forest, we have reached a cross-validation accuracy only slightly better than the original logistic regression model. This exercise gives us some very interesting and unique learning:

Using a more sophisticated model does not guarantee better results.
Avoid using complex modeling techniques as a black box without understanding the underlying concepts. Doing so would increase the tendency of overfitting, thus making your models less interpretable
Feature Engineering is the key to success. Everyone can use an Xgboost model, but the real art and creativity lies in enhancing your features to better suit the model.

So are you ready to take on the challenge? Start your data science journey with the Loan Prediction Problem.

End Notes

I hope this tutorial will help you maximize your efficiency when starting with data science in Python. I am sure this not only gave you an idea about basic data analysis methods but also showed you how to implement some of the more sophisticated techniques available today.
Python is really a great tool, and is becoming an increasingly popular language among data scientists. The reason is that it's easy to learn and integrates well with other databases and tools like Spark and Hadoop. Above all, it offers strong computational capabilities through its powerful data analytics libraries.
So, learn Python to perform the full life-cycle of any data science project. It includes reading, analyzing, visualizing and finally making predictions.
If you come across any difficulty while practicing Python, or you have any thoughts / suggestions / feedback on the post, please feel free to post them through the comments below.
