A Complete Tutorial to Learn Data Science with Python from Scratch
2017-10-02 20:20
Seaborn for statistical data visualization. Seaborn is a library for making attractive and informative statistical graphics in Python. It is based on matplotlib. Seaborn aims to make visualization a central part of exploring and understanding data.
Bokeh for creating interactive plots, dashboards and data applications on modern web browsers. It empowers the user to generate elegant and concise graphics in the style of D3.js. Moreover, it has the capability of high-performance interactivity over very large or streaming datasets.
Blaze for extending the capability of NumPy and Pandas to distributed and streaming datasets. It can be used to access data from a multitude of sources including Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables, etc. Together with Bokeh, Blaze can act as a very powerful tool for creating effective visualizations and dashboards on huge chunks of data.
Scrapy for web crawling. It is a very useful framework for getting specific patterns of data. It has the capability to start at a website home URL and then dig through web pages within the website to gather information.
SymPy for symbolic computation. It has wide-ranging capabilities from basic symbolic arithmetic to calculus, algebra, discrete mathematics and quantum physics. Another useful feature is the capability of formatting the result of the computations as LaTeX code.
Requests for accessing the web. It works similarly to the standard Python library urllib2 but is much easier to code. You will find subtle differences with urllib2 but for beginners, Requests might be more convenient.
Additional libraries you might need:
os for operating system and file operations
networkx and igraph for graph-based data manipulations
regular expressions for finding patterns in text data
BeautifulSoup for scraping the web. It is inferior to Scrapy as it will extract information from just a single web page in a run.
Now that we are familiar with Python fundamentals and additional libraries, let's take a deep dive into problem solving through Python. Yes, I mean making a predictive model! In the process, we use some powerful libraries and also come across the next level of data structures. We will take you through the 3 key phases:
Data Exploration: finding out more about the data we have
Data Munging: cleaning the data and playing with it to make it better suit statistical modeling
Predictive Modeling: running the actual algorithms and having fun
Image Source: Wikipedia
Pandas is one of the most useful data analysis libraries in Python (I know these names sound weird, but hang on!). It has been instrumental in increasing the use of Python in the data science community. We will now use Pandas to read a data set from an Analytics Vidhya competition, perform exploratory analysis and build our first basic categorization algorithm for solving this problem.
Before loading the data, let's understand the 2 key data structures in Pandas: Series and DataFrames.
Introduction to Series and Dataframes
A Series can be understood as a 1-dimensional labelled/indexed array. You can access individual elements of this series through these labels.
A dataframe is similar to an Excel workbook: you have column names referring to columns and you have rows, which can be accessed with the use of row numbers. The essential difference is that column names and row numbers are known as the column and row index in the case of dataframes.
Series and dataframes form the core data model for Pandas in Python. The data sets are first read into these dataframes and then various operations (e.g. group by, aggregation etc.) can be applied very easily to their columns.
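To make the two structures concrete, here is a small sketch. The labels and numbers are made up for illustration and are not taken from the loan data set.

```python
import pandas as pd

# A Series: a 1-dimensional array whose elements are reached through labels.
incomes = pd.Series([5849, 4583, 3000], index=["a", "b", "c"])
second_income = incomes["b"]

# A DataFrame: columns are reached by name, rows by index label.
loans = pd.DataFrame(
    {"ApplicantIncome": [5849, 4583, 3000], "LoanAmount": [128.0, 66.0, 120.0]},
    index=["a", "b", "c"],
)
income_column = loans["ApplicantIncome"]   # selecting a column gives a Series
row_b = loans.loc["b"]                     # selecting a row also gives a Series
mean_loan = loans["LoanAmount"].mean()     # aggregation over one column
```

Selecting a column by name and a row by label is the pattern the rest of the tutorial relies on.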
More: 10 Minutes to Pandas
Practice data set: Loan Prediction Problem
You can download the dataset from here. Here is the description of variables:
Let's begin with exploration
To begin, start the iPython interface in Inline Pylab mode by typing the following on your terminal / Windows command prompt:
ipython notebook --pylab=inline
This opens up iPython notebook in the pylab environment, which has a few useful libraries already imported. Also, you will be able to plot your data inline, which makes this a really good environment for interactive data analysis. You can check whether the environment has loaded correctly by typing the following command (and getting the output as seen in the figure below):
I am currently working in Linux, and have stored the dataset in the following location:
/home/kunal/Downloads/Loan_Prediction/train.csv
Importing libraries and the data set:
Following are the libraries we will use during this tutorial:
numpy
matplotlib
pandas
Please note that you do not need to import matplotlib and numpy because of the Pylab environment. I have still kept them in the code, in case you use the code in a different environment.
After importing the library, you read the dataset using the function read_csv(). This is how the code looks till this stage:
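A minimal sketch of this step follows. So that it runs anywhere, it reads a tiny inline stand-in for train.csv; the three rows and the Loan_ID values are invented, and only the column names follow the ones used in this tutorial.

```python
import io

import pandas as pd

# Stand-in for train.csv; with the real file you would pass its path instead,
# e.g. pd.read_csv("/home/kunal/Downloads/Loan_Prediction/train.csv").
sample_csv = io.StringIO(
    "Loan_ID,Gender,ApplicantIncome,LoanAmount,Credit_History,Loan_Status\n"
    "LP001,Male,5849,,1,Y\n"
    "LP002,Male,4583,128,1,N\n"
    "LP003,Female,3000,66,0,N\n"
)
df = pd.read_csv(sample_csv)

first_rows = df.head(2)   # head(n) returns the first n rows
```

Note that the empty LoanAmount cell in the first row is read as NaN, which is how missing values show up later.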
Quick Data Exploration
Once you have read the dataset, you can have a look at a few top rows by using the function head().
Called as head(10), this should print 10 rows. Alternately, you can also look at more rows by printing the dataset.
Next, you can look at a summary of numerical fields by using the describe() function.
The describe() function would provide count, mean, standard deviation (std), min, quartiles and max in its output (read this article to refresh basic statistics to understand population distribution).
Here are a few inferences you can draw by looking at the output of the describe() function:
LoanAmount has 22 missing values (614 - 592).
Loan_Amount_Term has 14 missing values (614 - 600).
Credit_History has 50 missing values (614 - 564).
We can also see that about 84% of applicants have a credit history. How? The mean of the Credit_History field is 0.84 (remember, Credit_History has value 1 for those who have a credit history and 0 otherwise).
The ApplicantIncome distribution seems to be in line with expectation. Same with CoapplicantIncome.
Please note that we can get an idea of a possible skew in the data by comparing the mean to the median, i.e. the 50% figure.
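The inferences above can be read off describe() programmatically. The sketch below uses five invented rows rather than the 614 rows of the real data, so the numbers differ, but the reasoning is the same.

```python
import numpy as np
import pandas as pd

# Invented values standing in for three columns of the loan data.
df = pd.DataFrame({
    "ApplicantIncome": [2500, 3000, 3500, 4000, 81000],
    "LoanAmount": [100.0, np.nan, 120.0, 130.0, 600.0],
    "Credit_History": [1.0, 1.0, 0.0, 1.0, 1.0],
})

summary = df.describe()

# describe() counts only non-missing entries, so rows minus count = missing.
missing_loan_amount = len(df) - int(summary.loc["count", "LoanAmount"])

# The mean of a 0/1 field is the share of ones.
share_with_history = summary.loc["mean", "Credit_History"]

# A mean far above the median (the 50% row) hints at a right skew.
income_mean = summary.loc["mean", "ApplicantIncome"]
income_median = summary.loc["50%", "ApplicantIncome"]
```

Here the single very large income pulls the mean well above the median, which is the kind of skew to look for.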
For the non-numerical values (e.g. Property_Area, Credit_History etc.), we can look at the frequency distribution to understand whether they make sense or not. The frequency table can be printed by the following command:
Similarly, we can look at the unique values of credit history. Note that dfname['column_name'] is a basic indexing technique to access a particular column of the dataframe. It can be a list of columns as well. For more information, refer to the "10 Minutes to Pandas" resource shared above.
Distribution analysis
Now that we are familiar with basic data characteristics, let us study the distribution of various variables. Let us start with numeric variables, namely ApplicantIncome and LoanAmount.
Let's start by plotting the histogram of ApplicantIncome using the following commands:
Here we observe that there are a few extreme values. This is also the reason why 50 bins are required to depict the distribution clearly.
Next, we look at box plots to understand the distributions. The box plot for ApplicantIncome can be plotted by:
This confirms the presence of a lot of outliers/extreme values. This can be attributed to the income disparity in society. Part of this can be driven by the fact that we are looking at people with different education levels. Let us segregate them by Education:
We can see that there is no substantial difference between the mean income of graduates and non-graduates. But there are a higher number of graduates with very high incomes, which appear to be the outliers.
Now, let's look at the histogram and box plot of LoanAmount using the following command:
Again, there are some extreme values. Clearly, both ApplicantIncome and LoanAmount require some amount of data munging. LoanAmount has missing as well as extreme values, while ApplicantIncome has a few extreme values, which demand deeper understanding. We will take this up in coming sections.
Categorical variable analysis
Now that we understand distributions for ApplicantIncome and LoanAmount, let us understand categorical variables in more detail. We will use Excel-style pivot tables and cross-tabulation. For instance, let us look at the chances of getting a loan based on credit history. This can be achieved in MS Excel using a pivot table as:
Note: here loan status has been coded as 1 for Yes and 0 for No. So the mean represents the probability of getting a loan.
Now we will look at the steps required to generate a similar insight using Python. Please refer to this article for getting the hang of the different data manipulation techniques in Pandas.
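One way to get this insight with Pandas is sketched below. The six rows are invented and the helper column name Loan_Status_Num is my own choice for illustration; the idea is the 1/0 coding described in the note above.

```python
import pandas as pd

# Invented rows; the real data has 614 of them.
df = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    "Loan_Status": ["Y", "Y", "Y", "N", "N", "N"],
})

# Frequency table of credit history.
counts = df["Credit_History"].value_counts()

# Code Y/N as 1/0 so that the mean is the share of approved loans.
df["Loan_Status_Num"] = df["Loan_Status"].map({"Y": 1, "N": 0})

approval_rate = df.pivot_table(
    values="Loan_Status_Num",
    index="Credit_History",
    aggfunc="mean",
)
```

Each row of approval_rate is one credit-history group, and its value is the fraction of loans approved in that group.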
Now we can observe that we get a pivot_table similar to the MS Excel one. This can be plotted as a bar chart using the "matplotlib" library with the following code:
This shows that the chances of getting a loan are eight-fold if the applicant has a valid credit history. You can plot similar graphs by Married, Self-Employed, Property_Area, etc.
Alternately, these two plots can also be visualized by combining them in a stacked chart:
You can also add gender into the mix (similar to the pivot table in Excel):
If you have not realized already, we have just created two basic classification algorithms here, one based on credit history, while the other is based on 2 categorical variables (including gender). You can quickly code this to create your first submission on AV Datahacks.
We just saw how we can do exploratory analysis in Python using Pandas. I hope your love for pandas (the animal) would have increased by now, given the amount of help the library can provide you in analyzing datasets.
Next let's explore ApplicantIncome and LoanStatus variables further, perform data munging and create a dataset for applying various modeling techniques. I would strongly urge that you take another dataset and problem and go through an independent example before reading further.
Data munging: recap of the need
During our exploration of the data, we found a few problems in the data set, which need to be solved before the data is ready for a good model. This exercise is typically referred to as "Data Munging". Here are the problems we are already aware of:
There are missing values in some variables. We should estimate those values wisely depending on the amount of missing values and the expected importance of variables.
While looking at the distributions, we saw that ApplicantIncome and LoanAmount seemed to contain extreme values at either end. Though they might make intuitive sense, they should be treated appropriately.
In addition to these problems with numerical fields, we should also look at the non-numerical fields, i.e. Gender, Property_Area, Married, Education and Dependents, to see if they contain any useful information.
If you are new to Pandas, I would recommend reading this article before moving on. It details some useful techniques of data manipulation.
Check missing values in the dataset
Let us look at missing values in all the variables because most of the models don't work with missing data and even if they do, imputing them helps more often than not. So, let us check the number of nulls/NaNs in the dataset.
This command should tell us the number of missing values in each column, as isnull() returns True (counted as 1 when summed) if the value is null.
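A count of this kind can be sketched as follows, on a small invented frame with a few gaps in it:

```python
import numpy as np
import pandas as pd

# Invented rows with a few gaps, standing in for the loan data.
df = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "LoanAmount": [128.0, np.nan, np.nan, 66.0],
    "Credit_History": [1.0, 0.0, 1.0, 1.0],
})

# isnull() marks each missing cell as True; summing a boolean column
# counts the True values, giving the missing entries per column.
missing_per_column = df.isnull().sum()
```

Both None in a text column and NaN in a numeric column are treated as missing here.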
Though the missing values are not very high in number, many variables have them and each one of these should be estimated and added to the data. Get a detailed view on different imputation techniques through this article.
Note: Remember that missing values may not always be NaNs. For instance, if the Loan_Amount_Term is 0, does it make sense or would you consider that missing? I suppose your answer is missing and you're right. So we should check for values which are impractical.
How to fill missing values in LoanAmount?
There are numerous ways to fill the missing values of loan amount, the simplest being replacement by the mean, which can be done by the following code:
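Replacement by the mean might look like this in Pandas. The values are invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"LoanAmount": [100.0, np.nan, 140.0, np.nan, 120.0]})

# mean() skips NaN by default, so this is the mean of the known values.
known_mean = df["LoanAmount"].mean()

# Assign the result back to the column so the change is kept.
df["LoanAmount"] = df["LoanAmount"].fillna(known_mean)
```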
The other extreme could be to build a supervised learning model to predict loan amount on the basis of other variables.
Since the purpose now is to bring out the steps in data munging, I'll rather take an approach which lies somewhere in between these 2 extremes. A key hypothesis is that whether a person is educated or self-employed can combine to give a good estimate of loan amount.
First, let's look at the box plot to see if a trend exists:
Thus we see some variations in the median of loan amount for each group and this can be used to impute the values. But first, we have to ensure that neither the Self_Employed nor the Education variable has missing values.
As we saw earlier, Self_Employed has some missing values. Let's look at the frequency table:
Since ~86% of values are "No", it is safe to impute the missing values as "No" as there is a high probability of success. This can be done using the following code:
Now, we will create a pivot table, which provides us median values for all the groups of unique values of Self_Employed and Education features. Next, we define a function which returns the values of these cells and apply it to fill the missing values of loan amount:
This should provide you a good way to impute missing values of loan amount.
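The steps described above can be sketched as below. The rows are invented and group_median is a hypothetical helper name, so treat this as one possible way to do it rather than the exact code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Self_Employed": ["No", "No", None, "Yes", "Yes", "No"],
    "Education": ["Graduate", "Graduate", "Graduate",
                  "Not Graduate", "Not Graduate", "Graduate"],
    "LoanAmount": [100.0, 140.0, np.nan, 200.0, np.nan, 120.0],
})

# Step 1: fill Self_Employed with the dominant category.
df["Self_Employed"] = df["Self_Employed"].fillna("No")

# Step 2: median loan amount for every Self_Employed x Education cell.
medians = df.pivot_table(
    values="LoanAmount",
    index="Self_Employed",
    columns="Education",
    aggfunc="median",
)


def group_median(row):
    """Look up the median for the group this row belongs to."""
    return medians.loc[row["Self_Employed"], row["Education"]]


# Step 3: fill only the rows where LoanAmount is missing.
missing = df["LoanAmount"].isnull()
df.loc[missing, "LoanAmount"] = df[missing].apply(group_median, axis=1)
```

Filling Self_Employed first matters: a row with a missing group label could not be looked up in the pivot table.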
How to treat extreme values in the distribution of LoanAmount and ApplicantIncome?
Let's analyze LoanAmount first. The extreme values are practically possible, i.e. some people might apply for high-value loans due to specific needs. So instead of treating them as outliers, let's try a log transformation to nullify their effect:
Looking at the histogram again:
Now the distribution looks much closer to normal and the effect of extreme values has significantly subsided.
Coming to ApplicantIncome. One intuition can be that some applicants have lower income but strong support from co-applicants. So it might be a good idea to combine both incomes as total income and take a log transformation of the same.
Now we see that the distribution is much better than before. I will leave it up to you to impute the missing values for Gender, Married, Dependents, Loan_Amount_Term, Credit_History. Also, I encourage you to think about possible additional information which can be derived from the data. For example, creating a column for LoanAmount/TotalIncome might make sense as it gives an idea of how well the applicant is suited to pay back the loan.
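Both transformations can be sketched in a few lines. The values are invented and the new column names (LoanAmount_log, TotalIncome, TotalIncome_log) are picked here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ApplicantIncome": [2500, 4000, 81000],
    "CoapplicantIncome": [1500.0, 0.0, 0.0],
    "LoanAmount": [100.0, 120.0, 700.0],
})

# The natural log compresses large values far more than small ones.
df["LoanAmount_log"] = np.log(df["LoanAmount"])

# Combine both incomes first, then take the log of the total.
df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]
df["TotalIncome_log"] = np.log(df["TotalIncome"])

# Largest-to-smallest ratio before and after the transformation.
raw_ratio = df["LoanAmount"].max() / df["LoanAmount"].min()
log_ratio = df["LoanAmount_log"].max() / df["LoanAmount_log"].min()
```

Comparing raw_ratio with log_ratio shows how much closer the extreme value sits to the rest after the transformation.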
Next, we will look at making predictive models.
Since sklearn requires all inputs to be numeric, we should convert all our categorical variables into numeric by encoding the categories. This can be done using the following code:
Next, we will import the required modules. Then we will define a generic classification function, which takes a model as input and determines the Accuracy and Cross-Validation scores. Since this is an introductory article, I will not go into the details of coding. Please refer to this article for getting details of the algorithms with R and Python codes. Also, it'll be good to get a refresher on cross-validation through this article, as it is a very important measure of model performance.
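Here is a rough sketch of what such a generic function could look like, assuming a reasonably recent scikit-learn. The function name classification_model, its parameters and the synthetic data are my own choices for illustration, and the scores it produces are not the ones reported in this tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder


def classification_model(model, data, predictors, outcome, folds=5):
    """Fit on all rows, then return (training accuracy, mean CV accuracy)."""
    X, y = data[predictors], data[outcome]
    model.fit(X, y)
    accuracy = model.score(X, y)
    cv_score = cross_val_score(model, X, y, cv=folds).mean()
    return accuracy, cv_score


# Synthetic stand-in data: approval follows credit history exactly.
rng = np.random.RandomState(0)
history = rng.randint(0, 2, size=100)
df = pd.DataFrame({
    "Credit_History": history,
    "Loan_Status": np.where(history == 1, "Y", "N"),
})

# Encode the categorical outcome as integers (N becomes 0, Y becomes 1).
df["Loan_Status"] = LabelEncoder().fit_transform(df["Loan_Status"])

accuracy, cv_score = classification_model(
    LogisticRegression(), df, ["Credit_History"], "Loan_Status"
)
```

Training accuracy is measured on the same rows the model was fitted on, while the cross-validation score is averaged over held-out folds, which is why the two can differ so much in the sections below.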
Logistic Regression
Let's make our first Logistic Regression model. One way would be to take all the variables into the model but this might result in overfitting (don't worry if you're unaware of this terminology yet). In simple words, taking all variables might result in the model understanding complex relations specific to the data and will not generalize well. Read more about Logistic Regression.
We can easily make some intuitive hypotheses to set the ball rolling. The chances of getting a loan will be higher for:
Applicants having a credit history (remember we observed this in exploration?)
Applicants with higher applicant and co-applicant incomes
Applicants with higher education level
Properties in urban areas with high growth prospects
So let's make our first model with 'Credit_History'.
Accuracy: 80.945% Cross-Validation Score: 80.946%
Accuracy: 80.945% Cross-Validation Score: 80.946%
Generally we expect the accuracy to increase on adding variables. But this is a more challenging case. The accuracy and cross-validation score are not getting impacted by less important variables. Credit_History is dominating the model. We have two options now:
Feature Engineering: derive new information and try to predict using it. I will leave this to your creativity.
Better modeling techniques. Let's explore this next.
Decision Tree
A decision tree is another method for making a predictive model. It can often provide higher accuracy than a logistic regression model. Read more about Decision Trees.
Accuracy: 81.930% Cross-Validation Score: 76.656%
Here the model based on categorical variables is unable to have an impact because Credit_History is dominating over them. Let's try a few numerical variables:
Accuracy: 92.345% Cross-Validation Score: 71.009%
Here we observed that although the accuracy went up on adding variables, the cross-validation score went down. This is the result of the model over-fitting the data. Let's try an even more sophisticated algorithm and see if it helps:
Random Forest
Random forest is another algorithm for solving the classification problem. Read more about Random Forest.
An advantage with Random Forest is that we can make it work with all the features and it returns a feature importance matrix which can be used to select features.
Accuracy: 100.000% Cross-Validation Score: 78.179%
Here we see that the accuracy is 100% for the training set. This is the ultimate case of overfitting and can be resolved in two ways:
Reducing the number of predictors
Tuning the model parameters
Let's try both of these. First we see the feature importance matrix from which we'll take the most important features.
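Reading feature importances from a fitted random forest could look like the sketch below. The data is synthetic: the outcome is driven entirely by Credit_History, and the Noise column is an invented, unrelated predictor, so the ranking is known in advance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
n = 200
df = pd.DataFrame({
    "Credit_History": rng.randint(0, 2, size=n),
    "Noise": rng.normal(size=n),   # unrelated to the outcome by construction
})
df["Loan_Status"] = df["Credit_History"]   # outcome driven by one feature

predictors = ["Credit_History", "Noise"]
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(df[predictors], df["Loan_Status"])

# One importance per predictor, summing to 1; sort to rank them.
importance = pd.Series(
    model.feature_importances_, index=predictors
).sort_values(ascending=False)
top_feature = importance.index[0]
```

Fixing random_state makes this particular run repeatable; without it, the importances vary slightly from run to run.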
Let's use the top 5 variables for creating a model. Also, we will modify the parameters of the random forest model a little bit:
Accuracy: 82.899% Cross-Validation Score: 81.461%
Notice that although accuracy reduced, the cross-validation score is improving, showing that the model is generalizing well. Remember that random forest models are not exactly repeatable. Different runs will result in slight variations because of randomization. But the output should stay in the ballpark.
You would have noticed that even after some basic parameter tuning on random forest, we have reached a cross-validation accuracy only slightly better than the original logistic regression model. This exercise gives us some very interesting and unique learning:
Using a more sophisticated model does not guarantee better results.
Avoid using complex modeling techniques as a black box without understanding the underlying concepts. Doing so would increase the tendency of overfitting, thus making your models less interpretable.
Feature Engineering is the key to success. Everyone can use an Xgboost model but the real art and creativity lies in enhancing your features to better suit the model.
So are you ready to take on the challenge? Start your data science journey with the Loan Prediction Problem.
Python is really a great tool, and is becoming an increasingly popular language among data scientists. The reason being, it's easy to learn and integrates well with other databases and tools like Spark and Hadoop. Above all, it copes well with computationally intensive work and has powerful data analytics libraries.
So, learn Python to perform the full life-cycle of any data science project. It includes reading, analyzing, visualizing and finally making predictions.
If you come across any difficulty while practicing Python, or you have any thoughts / suggestions / feedback on the post, please feel free to post them through comments below.
Introduction
It happened a few years back. After working on SAS for more than 5 years, I decided to move out of my comfort zone. Being a data scientist, my hunt for other useful tools was ON! Fortunately, it didn't take me long to decide: Python was my appetizer. I always had an inclination towards coding. This was the time to do what I really loved. Code. Turned out, coding was so easy! I learned the basics of Python within a week. And, since then, I've not only explored this language in depth, but also have helped many others to learn this language. Python was originally a general purpose language. But, over the years, with strong community support, this language got dedicated libraries for data analysis and predictive modeling. Due to the lack of resources on Python for data science, I decided to create this tutorial to help many others learn Python faster. In this tutorial, we will take bite-sized information about how to use Python for Data Analysis, chew it till we are comfortable and practice it at our own end.
Table of Contents
1. Basics of Python for Data Analysis
    Why learn Python for data analysis?
    Python 2.7 v/s 3.4
    How to install Python?
    Running a few simple programs in Python
2. Python libraries and data structures
    Python Data Structures
    Python Iteration and Conditional Constructs
    Python Libraries
3. Exploratory analysis in Python using Pandas
    Introduction to series and dataframes
    Analytics Vidhya dataset - Loan Prediction Problem
4. Data Munging in Python using Pandas
5. Building a Predictive Model in Python
    Logistic Regression
    Decision Tree
    Random Forest
Let's get started!
1. Basics of Python for Data Analysis
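Why learn Python for data analysis?
Python has gathered a lot of interest recently as a choice of language for data analysis. I had compared it against SAS & R some time back. Here are some reasons which go in favour of learning Python:
Open Source: free to install
Awesome online community
Very easy to learn
Can become a common language for data science and production of web-based analytics products.
Needless to say, it still has a few drawbacks too:
It is an interpreted language rather than a compiled language, hence might take up more CPU time. However, given the savings in programmer time (due to ease of learning), it might still be a good choice.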
Python 2.7 v/s 3.4
This is one of the most debated topics in Python. You will invariably cross paths with it, specially if you are a beginner. There is no right/wrong choice here. It totally depends on the situation and your need to use. I will try to give you some pointers to help you make an informed choice.
Why Python 2.7?
Awesome community support! This is something you'd need in your early days. Python 2 was released in late 2000 and has been in use for more than 15 years.
Plethora of third-party libraries! Though many libraries have provided 3.x support, a large number of modules still work only on 2.x versions. If you plan to use Python for specific applications like web development with high reliance on external modules, you might be better off with 2.7.
Some of the features of 3.x versions have backward compatibility and can work with the 2.7 version.
Why Python 3.4?
Cleaner and faster! Python developers have fixed some inherent glitches and minor drawbacks in order to set a stronger foundation for the future. These might not be very relevant initially, but will matter eventually.
It is the future! 2.7 is the last release for the 2.x family and eventually everyone has to shift to 3.x versions. Python 3 has released stable versions for the past 5 years and will continue the same.
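There is no clear winner but I suppose the bottom line is that you should focus on learning Python as a language. Shifting between versions should just be a matter of time. Stay tuned for a dedicated article on Python 2.x vs 3.x in the near future!
How to install Python?
There are 2 approaches to install Python:
You can download Python directly from its project site and install the individual components and libraries you want.
Alternately, you can download and install a package, which comes with pre-installed libraries. I would recommend downloading Anaconda. Another option could be Enthought Canopy Express.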
The second method provides a hassle-free installation and hence I'll recommend that to beginners. The limitation of this approach is that you have to wait for the entire package to be upgraded, even if you are interested in the latest version of a single library. It should not matter unless you are doing cutting-edge statistical research.
Choosing a development environment
Once you have installed Python, there are various options for choosing an environment. Here are the 3 most common options:
Terminal / Shell based
IDLE (default environment)
iPython notebook: similar to markdown in R
IDLE editor for Python
While the right environment depends on your need, I personally prefer iPython Notebooks a lot. It provides a lot of good features for documenting while writing the code itself and you can choose to run the code in blocks (rather than line-by-line execution).
We will use the iPython environment for this complete tutorial.
Warming up: Running your first Python program
You can use Python as a simple calculator to start with:
Few things to note
You can start iPython notebook by writing "ipython notebook" on your terminal / cmd, depending on the OS you are working on.
You can name an iPython notebook by simply clicking on the name (Untitled0 in the above screenshot).
The interface shows In [*] for inputs and Out[*] for output.
You can execute a code by pressing "Shift + Enter", or "ALT + Enter" if you want to insert an additional row after.
Before we deep dive into problem solving, let's take a step back and understand the basics of Python. As we know, data structures and iteration and conditional constructs form the crux of any language. In Python, these include lists, strings, tuples, dictionaries, for-loop, while-loop, if-else, etc. Let's take a look at some of these.
2. Python libraries and Data Structures
Python Data Structures
Following are some data structures which are used in Python. You should be familiar with them in order to use them as appropriate.
Lists: Lists are one of the most versatile data structures in Python. A list can simply be defined by writing a list of comma-separated values in square brackets. Lists might contain items of different types, but usually the items all have the same type. Python lists are mutable and individual elements of a list can be changed.
Here is a quick example to define a list and then access it:
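A minimal sketch of such an example, with made-up values:

```python
# Define a list of squares and read from it.
squares = [0, 1, 4, 9, 16, 25]

first = squares[0]        # indexing starts at 0
last = squares[-1]        # negative indexes count from the end
middle = squares[2:4]     # slicing returns a new list

# Lists are mutable: elements can be replaced or appended in place.
squares[2] = 40
squares.append(36)
```

The slice middle was taken before the change, so it still holds the original values.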
Strings: Strings can simply be defined by use of single ('), double (") or triple (''') inverted commas. Strings enclosed in triple quotes (''') can span over multiple lines and are used frequently in docstrings (Python's way of documenting functions). \ is used as an escape character. Please note that Python strings are immutable, so you cannot change part of a string.
Tuples: A tuple is represented by a number of values separated by commas. Tuples are immutable and the output is surrounded by parentheses so that nested tuples are processed correctly. Additionally, even though tuples are immutable, they can hold mutable data if needed.
Since tuples are immutable and cannot change, they are faster in processing as compared to lists. Hence, if your list is unlikely to change, you should use tuples instead of lists.
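The immutability of strings and tuples is easy to check for yourself. A small sketch, with illustrative values:

```python
# Strings are immutable: item assignment raises TypeError.
greeting = "Hello"
try:
    greeting[0] = "J"
    string_changed = True
except TypeError:
    string_changed = False
new_greeting = "J" + greeting[1:]   # build a new string instead

# A tuple is written as comma-separated values; parentheses are optional.
point = 3, 4
nested = (point, [1, 2])

# The tuple itself cannot be reassigned, but a list inside it can change.
nested[1].append(3)
try:
    point[0] = 10
    tuple_changed = True
except TypeError:
    tuple_changed = False
```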
Dictionary: A dictionary is an unordered set of key: value pairs, with the requirement that the keys are unique (within one dictionary). From Python 3.7 onwards, dictionaries preserve insertion order. A pair of braces creates an empty dictionary: {}.
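A short sketch of working with a dictionary; the names and numbers are invented:

```python
# Keys are unique; assigning to an existing key overwrites its value.
extensions = {}                        # empty dictionary
extensions["Kunal"] = 9073
extensions["Tavish"] = 9128
extensions["Kunal"] = 9999             # same key, so no new entry is added

has_tavish = "Tavish" in extensions    # membership tests look at keys
missing = extensions.get("Sunil", 0)   # default when the key is absent
names = sorted(extensions.keys())
```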
Python Iteration and Conditional Constructs
Like most languages, Python also has a FOR-loop which is the most widely used method for iteration. It has a simple syntax:
for i in [Python Iterable]:
    expression(i)
Here "Python Iterable" can be a list, tuple or other advanced data structures which we will explore in later sections. Let's take a look at a simple example, determining the factorial of a number.
fact = 1
for i in range(1, N+1):
    fact *= i
Coming to conditional statements, these are used to execute code fragments based on a condition. The most commonly used construct is if-else, with the following syntax:
if [condition]:
    __execution if true__
else:
    __execution if false__
For instance, if we want to print whether the number N is even or odd:
if N % 2 == 0:
    print('Even')
else:
    print('Odd')
NowthatyouarefamiliarwithPythonfundamentals,let’stakeastepfurther.Whatifyouhavetoperformthefollowingtasks:
Multiply2matrices
Findtherootofaquadraticequation
Plotbarchartsandhistograms
Makestatisticalmodels
Accessweb-pages
Ifyoutrytowritecodefromscratch,itsgoingtobeanightmareandyouwon’tstayonPythonformorethan2days!Butletsnotworryaboutthat.Thankfully,therearemanylibrarieswithpredefinedwhichwecandirectlyimportintoourcodeandmakeourlifeeasy.
Forexample,considerthefactorialexamplewejustsaw.Wecandothatinasinglestepas:
math.factorial(N)
Off-courseweneedtoimportthemathlibraryforthat.Letsexplorethevariouslibrariesnext.
Python Libraries
Let's take one step ahead in our journey to learn Python by getting acquainted with some useful libraries. The first step is obviously to learn to import them into our environment. There are several ways of doing so in Python:
import math as m
from math import *
In the first manner, we have defined an alias m for the library math. We can now use various functions from the math library (e.g. factorial) by referencing them through the alias: m.factorial().
In the second manner, you have imported the entire namespace of math, i.e. you can directly use factorial() without referring to math.
Tip: Google recommends that you use the first style of importing libraries, as you will know where the functions have come from.
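One practical reason behind that tip, sketched below: a star import brings in every public name of the module, and some of them (such as math.pow) share a name with a built-in that behaves differently. The example imports a single name explicitly instead of using the star form, which is my own substitution to keep the namespace clean.

```python
import math as m             # first style: module alias
from math import factorial   # explicit import of a single name

via_alias = m.factorial(5)
via_name = factorial(5)

# The built-in pow keeps integers as integers, while math.pow always
# returns a float. After "from math import *", the bare name pow would
# silently refer to the math version.
builtin_result = pow(2, 3)
math_result = m.pow(2, 3)
print(via_alias, via_name, builtin_result, math_result)
```

With the alias style, it stays obvious at every call site which pow you are using.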
Following is a list of libraries you will need for any scientific computations and data analysis:
NumPy stands for Numerical Python. The most powerful feature of NumPy is the n-dimensional array. This library also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities and tools for integration with other low level languages like Fortran, C and C++.
SciPy stands for Scientific Python. SciPy is built on NumPy. It is one of the most useful libraries for a variety of high level science and engineering modules like discrete Fourier transform, Linear Algebra, Optimization and Sparse matrices.
Matplotlib for plotting a vast variety of graphs, starting from histograms to line plots to heat plots. You can use the Pylab feature in ipython notebook (ipython notebook --pylab=inline) to use these plotting features inline. If you ignore the inline option, then pylab converts the ipython environment to an environment very similar to Matlab. You can also use LaTeX commands to add math to your plot.
Pandas for structured data operations and manipulations. It is extensively used for data munging and preparation. Pandas was added relatively recently to Python and has been instrumental in boosting Python's usage in the data science community.
Scikit Learn for machine learning. Built on NumPy, SciPy and matplotlib, this library contains a lot of efficient tools for machine learning and statistical modeling including classification, regression, clustering and dimensionality reduction.
Statsmodels for statistical modeling. Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests. An extensive list of descriptive statistics, statistical tests, plotting functions, and result statistics is available for different types of data and each estimator.
Seaborn for statistical data visualization. Seaborn is a library for making attractive and informative statistical graphics in Python. It is based on matplotlib. Seaborn aims to make visualization a central part of exploring and understanding data.
Bokeh for creating interactive plots, dashboards and data applications on modern web-browsers. It empowers the user to generate elegant and concise graphics in the style of D3.js. Moreover, it has the capability of high-performance interactivity over very large or streaming datasets.
Blaze for extending the capability of NumPy and Pandas to distributed and streaming datasets. It can be used to access data from a multitude of sources including Bcolz, MongoDB, SQLAlchemy, Apache Spark, PyTables, etc. Together with Bokeh, Blaze can act as a very powerful tool for creating effective visualizations and dashboards on huge chunks of data.
Scrapy for web crawling. It is a very useful framework for getting specific patterns of data. It has the capability to start at a website home url and then dig through web-pages within the website to gather information.
SymPy for symbolic computation. It has wide-ranging capabilities from basic symbolic arithmetic to calculus, algebra, discrete mathematics and quantum physics. Another useful feature is the capability of formatting the result of the computations as LaTeX code.
Requests for accessing the web. It works similar to the standard Python library urllib2 but is much easier to code. You will find subtle differences with urllib2 but for beginners, Requests might be more convenient.
Additional libraries you might need:
os for operating system and file operations
networkx and igraph for graph based data manipulations
regular expressions for finding patterns in text data
BeautifulSoup for scraping the web. It is inferior to Scrapy as it will extract information from just a single webpage in a run.
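To see how these libraries take care of the tasks listed earlier, here is a small sketch of two of them (multiplying 2 matrices and finding the roots of a quadratic equation) using NumPy. The matrices and the equation x^2 - 3x + 2 = 0 are arbitrary examples of my own.

```python
import numpy as np

# Multiply two 2x2 matrices.
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
product = A.dot(B)

# Roots of x^2 - 3x + 2 = 0; coefficients are listed from the
# highest power of x down to the constant term.
roots = np.roots([1, -3, 2])

print(product)
print(roots)
```

Each task is a single library call, which is exactly the kind of saving the list above is meant to highlight.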
Now that we are familiar with Python fundamentals and additional libraries, let's take a deep dive into problem solving through Python. Yes, I mean making a predictive model! In the process, we use some powerful libraries and also come across the next level of data structures. We will take you through the 3 key phases:
Data Exploration – finding out more about the data we have
Data Munging – cleaning the data and playing with it to make it better suit statistical modeling
Predictive Modeling – running the actual algorithms and having fun
3. Exploratory analysis in Python using Pandas
In order to explore our data further, let me introduce you to another animal (as if Python was not enough!) – Pandas
Pandas is one of the most useful data analysis libraries in Python (I know these names sound weird, but hang on!). It has been instrumental in increasing the use of Python in the data science community. We will now use Pandas to read a data set from an Analytics Vidhya competition, perform exploratory analysis and build our first basic categorization algorithm for solving this problem.
Before loading the data, let's understand the 2 key data structures in Pandas – Series and DataFrames
Introduction to Series and Dataframes
A Series can be understood as a 1 dimensional labelled / indexed array. You can access individual elements of this series through these labels.
A dataframe is similar to an Excel workbook – you have column names referring to columns and you have rows, which can be accessed with the use of row numbers. The essential difference is that column names and row numbers are known as the column and row index in the case of dataframes.
Series and dataframes form the core data model for Pandas in Python. The data sets are first read into these dataframes and then various operations (e.g. group by, aggregation etc.) can be applied very easily to their columns.
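A minimal sketch of both structures follows. The labels and numbers are made up for illustration; they only borrow column names from the loan dataset introduced below.

```python
import pandas as pd

# A Series: a 1-dimensional array whose elements are accessed by label.
income = pd.Series([5849, 4583, 3000],
                   index=["LP001002", "LP001003", "LP001005"])
first_income = income["LP001002"]

# A DataFrame: named columns plus a row index (0, 1, 2, ... by default).
loans = pd.DataFrame({
    "Education": ["Graduate", "Graduate", "Not Graduate", "Graduate"],
    "LoanAmount": [120, 128, 66, 141],
})
second_row = loans.iloc[1]          # access a row by its position

# A group-by followed by an aggregation on one column.
median_by_education = loans.groupby("Education")["LoanAmount"].median()
print(median_by_education)
```

Selecting a single column of a DataFrame, as in loans["LoanAmount"], gives you back a Series.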
More: 10 Minutes to Pandas
Practice data set – Loan Prediction Problem
You can download the dataset from
VARIABLE DESCRIPTIONS:
Variable            Description
Loan_ID             Unique Loan ID
Gender              Male / Female
Married             Applicant married (Y/N)
Dependents          Number of dependents
Education           Applicant Education (Graduate / Under Graduate)
Self_Employed       Self employed (Y/N)
ApplicantIncome     Applicant income
CoapplicantIncome   Coapplicant income
LoanAmount          Loan amount in thousands
Loan_Amount_Term    Term of loan in months
Credit_History      Credit history meets guidelines
Property_Area       Urban / Semi Urban / Rural
Loan_Status         Loan approved (Y/N)
Let's begin with exploration
To begin, start the iPython interface in Inline Pylab mode by typing the following on your terminal / Windows command prompt:
ipython notebook --pylab=inline
This opens up iPython notebook in the pylab environment, which has a few useful libraries already imported. Also, you will be able to plot your data inline, which makes this a really good environment for interactive data analysis. (On recent Jupyter installations, where this command-line flag is no longer available, running %pylab inline in a notebook cell sets up a similar environment.) You can check whether the environment has loaded correctly by typing the following command (and getting a line plot as output):
plot(arange(5))
I am currently working in Linux, and have stored the dataset in the following location:
/home/kunal/Downloads/Loan_Prediction/train.csv
Importing libraries and the data set:
Following are the libraries we will use during this tutorial:
numpy
matplotlib
pandas
Please note that you do not need to import matplotlib and numpy because of the Pylab environment. I have still kept them in the code, in case you use the code in a different environment.
After importing the library, you read the dataset using the function read_csv(). This is how the code looks till this stage:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("/home/kunal/Downloads/Loan_Prediction/train.csv")  # Reading the dataset in a dataframe using Pandas
Quick Data Exploration
Once you have read the dataset, you can have a look at a few top rows by using the function head()
df.head(10)
This should print 10 rows. Alternately, you can also look at more rows by printing the dataset.
Next, you can look at a summary of numerical fields by using the describe() function
df.describe()
describe() function would provide count, mean, standard deviation (std), min, quartiles and max in its output (Read
Here are a few inferences you can draw by looking at the output of the describe() function:
LoanAmount has (614 – 592) 22 missing values.
Loan_Amount_Term has (614 – 600) 14 missing values.
Credit_History has (614 – 564) 50 missing values.
We can also see that about 84% of applicants have a credit_history. How? The mean of the Credit_History field is 0.84 (Remember, Credit_History has value 1 for those who have a credit history and 0 otherwise)
The ApplicantIncome distribution seems to be in line with expectation. Same with CoapplicantIncome.
Please note that we can get an idea of a possible skew in the data by comparing the mean to the median, i.e. the 50% figure.
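The same reasoning can be reproduced on a tiny, made-up table, which may help if you do not have the dataset at hand. The five rows below are my own toy values, not figures from the competition data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LoanAmount": [100.0, 120.0, np.nan, 130.0, 700.0],
    "Credit_History": [1.0, 1.0, 0.0, np.nan, 1.0],
})
summary = toy.describe()

# count only includes non-missing values, so rows minus count = missing.
missing = len(toy) - summary.loc["count"]

# The mean of a 0/1 column is the share of rows with value 1.
share_with_history = summary.loc["mean", "Credit_History"]

# A mean far above the median (the 50% row) hints at a right skew.
mean_loan = summary.loc["mean", "LoanAmount"]
median_loan = summary.loc["50%", "LoanAmount"]
print(missing, share_with_history, mean_loan, median_loan)
```

Here the single large loan of 700 pulls the mean well above the median, which is the kind of skew the comparison is meant to reveal.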
For the non-numerical values (e.g. Property_Area, Credit_History etc.), we can look at the frequency distribution to understand whether they make sense or not. The frequency table can be printed by the following command:
df['Property_Area'].value_counts()
Similarly, we can look at the unique values of credit history. Note that dfname['column_name'] is a basic indexing technique to access a particular column of the dataframe. It can be a list of columns as well. For more information, refer to the "10 Minutes to Pandas" resource shared above.
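Here is a short sketch of both indexing forms on a made-up frame (the five rows are illustrative values of my own):

```python
import pandas as pd

toy = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Urban", "Semiurban", "Urban"],
    "Credit_History": [1.0, 0.0, 1.0, 1.0, 1.0],
    "LoanAmount": [120, 66, 141, 95, 128],
})

# A single column name returns a Series; value_counts() tabulates it.
area_counts = toy["Property_Area"].value_counts()
history_counts = toy["Credit_History"].value_counts()

# A list of column names returns a smaller DataFrame.
subset = toy[["Property_Area", "LoanAmount"]]
print(area_counts)
```

Note the double brackets in the last selection: the outer pair is the indexing operator and the inner pair is the Python list of column names.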
Distribution analysis
Now that we are familiar with basic data characteristics, let us study the distribution of various variables. Let us start with numeric variables – namely ApplicantIncome and LoanAmount
Let's start by plotting the histogram of ApplicantIncome using the following commands:
df['ApplicantIncome'].hist(bins=50)
Here we observe that there are a few extreme values. This is also the reason why 50 bins are required to depict the distribution clearly.
Next, we look at box plots to understand the distributions. The box plot for ApplicantIncome can be plotted by:
df.boxplot(column='ApplicantIncome')
This confirms the presence of a lot of outliers / extreme values. This can be attributed to the income disparity in society. Part of this can be driven by the fact that we are looking at people with different education levels. Let us segregate them by Education:
df.boxplot(column='ApplicantIncome', by='Education')
We can see that there is no substantial difference between the mean income of graduates and non-graduates. But there are a higher number of graduates with very high incomes, which appear as the outliers.
Now, let's look at the histogram and boxplot of LoanAmount using the following commands:
df['LoanAmount'].hist(bins=50)
df.boxplot(column='LoanAmount')
Again, there are some extreme values. Clearly, both ApplicantIncome and LoanAmount require some amount of data munging. LoanAmount has missing as well as extreme values, while ApplicantIncome has a few extreme values, which demand deeper understanding. We will take this up in coming sections.
Categorical variable analysis
Now that we understand distributions for ApplicantIncome and LoanAmount, let us understand categorical variables in more detail. We will use Excel style pivot tables and cross-tabulation. For instance, let us look at the chances of getting a loan based on credit history. This can be achieved in MS Excel using a pivot table as:
Note: here loan status has been coded as 1 for Yes and 0 for No. So the mean represents the probability of getting a loan.
Now we will look at the steps required to generate a similar insight using Python. Please refer to
temp1 = df['Credit_History'].value_counts(ascending=True)
temp2 = df.pivot_table(values='Loan_Status', index=['Credit_History'], aggfunc=lambda x: x.map({'Y': 1, 'N': 0}).mean())
print('Frequency Table for Credit History:')
print(temp1)
print('\nProbability of getting loan for each Credit History class:')
print(temp2)
Now we can observe that we get a pivot_table similar to the MS Excel one. This can be plotted as a bar chart using the "matplotlib" library with the following code:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(8, 4))
ax1 = fig.add_subplot(121)
ax1.set_xlabel('Credit_History')
ax1.set_ylabel('Count of Applicants')
ax1.set_title("Applicants by Credit_History")
temp1.plot(kind='bar', ax=ax1)  # draw on the left subplot
ax2 = fig.add_subplot(122)
temp2.plot(kind='bar', ax=ax2)  # draw on the right subplot
ax2.set_xlabel('Credit_History')
ax2.set_ylabel('Probability of getting loan')
ax2.set_title("Probability of getting loan by credit history")
This shows that the chances of getting a loan are eight-fold if the applicant has a valid credit history. You can plot similar graphs by Married, Self-Employed, Property_Area, etc.
Alternately, these two plots can also be visualized by combining them in a stacked chart:
temp3 = pd.crosstab(df['Credit_History'], df['Loan_Status'])
temp3.plot(kind='bar', stacked=True, color=['red', 'blue'], grid=False)
You can also add gender into the mix (similar to the pivot table in Excel):
If you have not realized already, we have just created two basic classification algorithms here, one based on credit history, while the other is based on 2 categorical variables (including gender). You can quickly code this to create your first submission on AV Datahacks.
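To show what such a basic classifier might look like in code, here is a rough sketch of the credit-history rule: predict approval exactly when Credit_History is 1. The six rows are made-up values, and the rule itself is my own reading of the idea above rather than a prescribed method.

```python
import pandas as pd

toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    "Loan_Status": ["Y", "Y", "Y", "N", "N", "Y"],
})

# Counts of approved / rejected loans for each credit history class.
counts = pd.crosstab(toy["Credit_History"], toy["Loan_Status"])

# Approval rate per class: the mean of a 0/1 coding of Loan_Status.
approval_rate = (toy["Loan_Status"].map({"Y": 1, "N": 0})
                 .groupby(toy["Credit_History"]).mean())

# The rule-based "classifier": approve only when a credit history exists.
predicted = toy["Credit_History"].map({1.0: "Y", 0.0: "N"})
accuracy = (predicted == toy["Loan_Status"]).mean()
print(counts)
print(approval_rate)
print(accuracy)
```

On the real data, rows with a missing Credit_History would need a decision of their own, since map() leaves them as NaN.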
We just saw how we can do exploratory analysis in Python using Pandas. I hope your love for pandas (the animal) would have increased by now – given the amount of help the library can provide you in analyzing datasets.
Next, let's explore the ApplicantIncome and LoanStatus variables further.
4. Data Munging in Python: Using Pandas
For those who have been following, here are your must wear shoes to start running.
Data munging – recap of the need
During our exploration of the data, we found a few problems in the data set, which need to be solved before the data is ready for a good model. This exercise is typically referred to as "Data Munging". Here are the problems we are already aware of:
There are missing values in some variables. We should estimate those values wisely depending on the amount of missing values and the expected importance of variables.
While looking at the distributions, we saw that ApplicantIncome and LoanAmount seemed to contain extreme values at either end. Though they might make intuitive sense, they should be treated appropriately.
In addition to these problems with numerical fields, we should also look at the non-numerical fields i.e. Gender, Property_Area, Married, Education and Dependents to see if they contain any useful information.
If you are new to Pandas, I would recommend reading
Check missing values in the dataset
Let us look at missing values in all the variables because most of the models don't work with missing data and even if they do, imputing them helps more often than not. So, let us check the number of nulls / NaNs in the dataset
df.apply(lambda x: sum(x.isnull()), axis=0)
This command should tell us the number of missing values in each column, as isnull() returns True (counted as 1) if the value is null.
Though the missing values are not very high in number, many variables have them and each one of these should be estimated and added to the data. Get a detailed view on different imputation techniques through
Note: Remember that missing values may not always be NaNs. For instance, if the Loan_Amount_Term is 0, does it make sense or would you consider that missing? I suppose your answer is missing and you're right. So we should check for values which are impractical.
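A small sketch of both checks, on made-up values: counting the NaNs per column, and flagging a placeholder such as a loan term of 0. Treating 0 as missing is the assumption from the note above, not something the dataset documentation states.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LoanAmount": [120.0, np.nan, 141.0, 95.0],
    "Loan_Amount_Term": [360.0, 0.0, np.nan, 180.0],
})

# Number of NaNs per column (equivalent to the apply/lambda form).
nan_counts = toy.isnull().sum()

# Values that are present but impractical, e.g. a term of 0 months.
zero_terms = (toy["Loan_Amount_Term"] == 0).sum()

# Convert the placeholder to NaN so it is imputed like other gaps.
cleaned = toy.copy()
cleaned["Loan_Amount_Term"] = cleaned["Loan_Amount_Term"].replace(0, np.nan)
print(nan_counts)
print(cleaned["Loan_Amount_Term"].isnull().sum())
```

After the replacement, the zero shows up in the missing-value count together with the genuine NaN.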
How to fill missing values in LoanAmount?
There are numerous ways to fill the missing values of loan amount – the simplest being replacement by mean, which can be done by the following code:
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].mean())
The other extreme could be to build a supervised learning model to predict loan amount on the basis of other variables.
Since the purpose now is to bring out the steps in data munging, I'll rather take an approach which lies somewhere in between these 2 extremes. A key hypothesis is that whether a person is educated or self-employed can combine to give a good estimate of loan amount.
First, let's look at the boxplot to see if a trend exists:
Thus we see some variations in the median of loan amount for each group and this can be used to impute the values. But first, we have to ensure that neither the Self_Employed nor the Education variable has any missing values.
As we saw earlier, Self_Employed has some missing values. Let's look at the frequency table:
Since ~86% of values are "No", it is safe to impute the missing values as "No" as there is a high probability of success. This can be done using the following code:
df['Self_Employed'] = df['Self_Employed'].fillna('No')
Now, we will create a pivot table, which provides us median values for all the groups of unique values of the Self_Employed and Education features. Next, we define a function which returns the values of these cells and apply it to fill the missing values of loan amount:
table = df.pivot_table(values='LoanAmount', index='Self_Employed', columns='Education', aggfunc=np.median)
# Define function to return value of this pivot_table
def fage(x):
    return table.loc[x['Self_Employed'], x['Education']]
# Replace missing values
df['LoanAmount'] = df['LoanAmount'].fillna(df[df['LoanAmount'].isnull()].apply(fage, axis=1))
This should provide you a good way to impute missing values of loan amount.
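If you prefer to avoid the helper function, the same group-wise median imputation can be sketched with groupby and transform. This is an alternative formulation of my own, shown on made-up rows, not the approach used above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Self_Employed": ["No", "No", "No", "Yes", "Yes", "Yes"],
    "Education": ["Graduate"] * 6,
    "LoanAmount": [100.0, 120.0, np.nan, 200.0, 260.0, np.nan],
})

# For every row, the median LoanAmount of its (Self_Employed, Education)
# group; missing values are ignored when the median is computed.
group_median = (toy.groupby(["Self_Employed", "Education"])["LoanAmount"]
                .transform("median"))

# Fill only the gaps; existing values stay untouched.
toy["LoanAmount"] = toy["LoanAmount"].fillna(group_median)
print(toy)
```

The two missing loans are filled with 110 and 230, the medians of their respective groups, instead of one overall figure.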
How to treat extreme values in the distribution of LoanAmount and ApplicantIncome?
Let's analyze LoanAmount first. The extreme values are practically possible, i.e. some people might apply for high value loans due to specific needs. So instead of treating them as outliers, let's try a log transformation to nullify their effect:
df['LoanAmount_log'] = np.log(df['LoanAmount'])
df['LoanAmount_log'].hist(bins=20)
Looking at the histogram again:
Now the distribution looks much closer to normal and the effect of extreme values has significantly subsided.
Coming to ApplicantIncome. One intuition can be that some applicants have lower income but strong support from co-applicants. So it might be a good idea to combine both incomes as total income and take a log transformation of the same.
df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df['TotalIncome_log'] = np.log(df['TotalIncome'])
df['TotalIncome_log'].hist(bins=20)
Now we see that the distribution is much better than before. I will leave it up to you to impute the missing values for Gender, Married, Dependents, Loan_Amount_Term and Credit_History. Also, I encourage you to think about possible additional information which can be derived from the data. For example, creating a column for LoanAmount/TotalIncome might make sense as it gives an idea of how well the applicant is suited to pay back the loan.
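The derived column suggested above could be sketched as follows. The column name Loan_to_Income and the three rows are my own illustrative choices; note also that LoanAmount is recorded in thousands, so the ratio is only meaningful for comparing applicants with each other.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ApplicantIncome": [5000, 3000, 20000],
    "CoapplicantIncome": [0.0, 1500.0, 0.0],
    "LoanAmount": [100.0, 135.0, 200.0],
})

toy["TotalIncome"] = toy["ApplicantIncome"] + toy["CoapplicantIncome"]
toy["TotalIncome_log"] = np.log(toy["TotalIncome"])

# Hypothetical derived feature: loan size relative to total income.
toy["Loan_to_Income"] = toy["LoanAmount"] / toy["TotalIncome"]
print(toy[["TotalIncome", "Loan_to_Income"]])
```

A higher ratio means the loan is large compared to what the applicant and co-applicant earn together.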
Next, we will look at making predictive models.
5. Building a Predictive Model in Python
After we have made the data useful for modeling, let's now look at the Python code to create a predictive model on our data set. Scikit-Learn (sklearn) is the most commonly used library in Python for this purpose and we will follow the trail. I encourage you to get a refresher on sklearn through
Since sklearn requires all inputs to be numeric, we should convert all our categorical variables into numeric by encoding the categories. This can be done using the following code:
from sklearn.preprocessing import LabelEncoder
var_mod = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']
le = LabelEncoder()
for i in var_mod:
    df[i] = le.fit_transform(df[i])
df.dtypes
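To see what the encoder actually does to a column, here is a tiny sketch on made-up Property_Area values. It assumes scikit-learn is installed.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

area = pd.Series(["Urban", "Rural", "Semiurban", "Urban"])

le = LabelEncoder()
codes = le.fit_transform(area)

# classes_ holds the original categories in sorted order; the integer
# code of a category is its position in this array.
print(list(le.classes_))
print(codes.tolist())

# inverse_transform maps codes back to the original labels.
decoded = le.inverse_transform(codes)
```

Keep in mind that the integer codes follow alphabetical order, so they carry no meaning of their own (Urban is not "more" than Rural).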
Next, we will import the required modules. Then we will define a generic classification function, which takes a model as input and determines the Accuracy and Cross-Validation scores. Since this is an introductory article, I will not go into the details of coding. Please refer to
# Import models from scikit learn module:
from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; KFold now lives in model_selection
from sklearn.model_selection import KFold  # For K-fold cross validation
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics
# Generic function for making a classification model and assessing performance:
def classification_model(model, data, predictors, outcome):
    # Fit the model:
    model.fit(data[predictors], data[outcome])
    # Make predictions on training set:
    predictions = model.predict(data[predictors])
    # Print accuracy
    accuracy = metrics.accuracy_score(predictions, data[outcome])
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))
    # Perform k-fold cross-validation with 5 folds
    kf = KFold(n_splits=5)
    scores = []
    for train, test in kf.split(data[predictors]):
        # Filter training data
        train_predictors = data[predictors].iloc[train, :]
        # The target we're using to train the algorithm.
        train_target = data[outcome].iloc[train]
        # Training the algorithm using the predictors and target.
        model.fit(train_predictors, train_target)
        # Record the accuracy score from each cross-validation run
        scores.append(model.score(data[predictors].iloc[test, :], data[outcome].iloc[test]))
    print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(scores)))
    # Fit the model again so that it can be referred to outside the function:
    model.fit(data[predictors], data[outcome])
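For reference, scikit-learn also offers cross_val_score, which runs the same fit-and-score loop in one call. The sketch below uses synthetic data of my own (one binary predictor, with about 10% of the labels flipped) purely to have something to run; it is not a substitute for the loan data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: the target equals the predictor except for ~10% noise.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 1)).astype(float)
y = X[:, 0].astype(int)
flip = rng.rand(200) < 0.1
y[flip] = 1 - y[flip]

model = LogisticRegression()
# One accuracy score per fold, using the same 5-fold split strategy.
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5))
mean_score = scores.mean()
print(scores, mean_score)
```

Writing the loop by hand, as in the function above, is still useful when you want to see exactly what happens in each fold.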
Logistic Regression
Let's make our first Logistic Regression model. One way would be to take all the variables into the model but this might result in overfitting (don't worry if you're unaware of this terminology yet). In simple words, taking all variables might result in the model understanding complex relations specific to the data, so that it will not generalize well. Read more about
We can easily make some intuitive hypotheses to set the ball rolling. The chances of getting a loan will be higher for:
Applicants having a credit history (remember we observed this in exploration?)
Applicants with higher applicant and co-applicant incomes
Applicants with higher education level
Properties in urban areas with high growth perspectives
So let's make our first model with 'Credit_History'.
outcome_var = 'Loan_Status'
model = LogisticRegression()
predictor_var = ['Credit_History']
classification_model(model, df, predictor_var, outcome_var)
Accuracy : 80.945%
Cross-Validation Score : 80.946%
# We can try different combinations of variables:
predictor_var = ['Credit_History', 'Education', 'Married', 'Self_Employed', 'Property_Area']
classification_model(model, df, predictor_var, outcome_var)
Accuracy : 80.945%
Cross-Validation Score : 80.946%
Generally we expect the accuracy to increase on adding variables. But this is a more challenging case. The accuracy and cross-validation score are not getting impacted by less important variables. Credit_History is dominating the model. We have two options now:
Feature Engineering: derive new information and try to predict with it. I will leave this to your creativity.
Better modeling techniques. Let's explore this next.
Decision Tree
Decision tree is another method for making a predictive model. It is often reported to provide higher accuracy than a logistic regression model, although this depends on the data. Read more about
model = DecisionTreeClassifier()
predictor_var = ['Credit_History', 'Gender', 'Married', 'Education']
classification_model(model, df, predictor_var, outcome_var)
Accuracy : 81.930%
Cross-Validation Score : 76.656%
Here the model based on categorical variables is unable to have an impact because Credit History is dominating over them. Let's try a few numerical variables:
# We can try different combinations of variables:
predictor_var = ['Credit_History', 'Loan_Amount_Term', 'LoanAmount_log']
classification_model(model, df, predictor_var, outcome_var)
Accuracy : 92.345%
Cross-Validation Score : 71.009%
Here we observed that although the accuracy went up on adding variables, the cross-validation score went down. This is the result of the model over-fitting the data. Let's try an even more sophisticated algorithm and see if it helps:
Random Forest
Random forest is another algorithm for solving the classification problem. Read more about
An advantage with Random Forest is that we can make it work with all the features and it returns a feature importance matrix which can be used to select features.
model = RandomForestClassifier(n_estimators=100)
predictor_var = ['Gender', 'Married', 'Dependents', 'Education',
                 'Self_Employed', 'Loan_Amount_Term', 'Credit_History', 'Property_Area',
                 'LoanAmount_log', 'TotalIncome_log']
classification_model(model, df, predictor_var, outcome_var)
Accuracy : 100.000%
Cross-Validation Score : 78.179%
Here we see that the accuracy is 100% for the training set. This is the ultimate case of overfitting and can be resolved in two ways:
Reducing the number of predictors
Tuning the model parameters
Let's try both of these. First we see the feature importance matrix from which we'll take the most important features.
# Create a series with feature importances:
featimp = pd.Series(model.feature_importances_, index=predictor_var).sort_values(ascending=False)
print(featimp)
Let's use the top 5 variables for creating a model. Also, we will modify the parameters of the random forest model a little bit:
model = RandomForestClassifier(n_estimators=25, min_samples_split=25, max_depth=7, max_features=1)
predictor_var = ['TotalIncome_log', 'LoanAmount_log', 'Credit_History', 'Dependents', 'Property_Area']
classification_model(model, df, predictor_var, outcome_var)
Accuracy : 82.899%
Cross-Validation Score : 81.461%
Notice that although the accuracy reduced, the cross-validation score improved, showing that the model is generalizing well. Remember that random forest models are not exactly repeatable. Different runs will result in slight variations because of randomization. But the output should stay in the ballpark.
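If you do want repeatable runs, one option is to fix the random_state parameter of the forest. The sketch below demonstrates this on synthetic data of my own, where only the first of three columns determines the target; the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the target depends on the first column only.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] > 0.5).astype(int)

# Two forests with the same seed are built in exactly the same way.
model_a = RandomForestClassifier(n_estimators=25, random_state=42).fit(X, y)
model_b = RandomForestClassifier(n_estimators=25, random_state=42).fit(X, y)

# Importances are normalized to sum to 1; the informative column
# should clearly dominate the two noise columns.
importances = model_a.feature_importances_
print(importances)
```

Fixing the seed makes results reproducible, but it does not make the model any better, so it is still worth checking that the scores hold up across a few different seeds.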
You would have noticed that even after some basic parameter tuning on random forest, we have reached a cross-validation accuracy only slightly better than the original logistic regression model. This exercise gives us some very interesting and unique learning:
Using a more sophisticated model does not guarantee better results.
Avoid using complex modeling techniques as a black box without understanding the underlying concepts. Doing so would increase the tendency to overfit and make your models less interpretable.
So are you ready to take on the challenge? Start your data science journey with
End Notes
I hope this tutorial will help you maximize your efficiency when starting with data science in Python. I am sure this not only gave you an idea about basic data analysis methods but it also showed you how to implement some of the more sophisticated techniques available today. Python is really a great tool, and is becoming an increasingly popular language among data scientists. The reason being, it's easy to learn and integrates well with other databases and tools like Spark and Hadoop. Most importantly, it handles computationally intensive work well and has powerful data analytics libraries.
So, learn Python to perform the full life-cycle of any data science project. It includes reading, analyzing, visualizing and finally making predictions.
If you come across any difficulty while practicing Python, or you have any thoughts / suggestions / feedback on the post, please feel free to post them through comments below.