
[Udacity] 3, 2, 13: Factor Variables

2018-03-31 15:56

Factor Variables

This section uses the data file reddit.csv ("who is reddit").

* Import the data

> getwd()
[1] "C:/Users/Administrator/Documents"
> setwd('C:/Users/Administrator/Downloads')
> reggit <- read.csv('reddit.csv')
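
The str output later in this post shows Factor columns, which is what read.csv produced by default in R versions before 4.0.0. From R 4.0.0 on, the default is stringsAsFactors = FALSE, so text columns load as character instead. A minimal sketch of how to get factors either way, using a small temporary CSV with made-up rows as a stand-in for reddit.csv (the names tmp, toy and toy_chr are illustrative only):

```r
# Toy stand-in for reddit.csv (made-up rows, not the real survey data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,employment.status",
             "1,Student",
             "2,Employed full time",
             "3,Student"), tmp)

# Ask for factors explicitly so text columns become Factor columns
toy <- read.csv(tmp, stringsAsFactors = TRUE)
str(toy)

# Alternatively, load as character and convert a single column afterwards
toy_chr <- read.csv(tmp, stringsAsFactors = FALSE)
toy_chr$employment.status <- factor(toy_chr$employment.status)
```

With the real file, the same argument would go into the read.csv('reddit.csv') call shown above.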


Use the str command, str(data), to inspect the structure of the data.

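Running str shows that the data set is large (32754 observations of 14 variables) and that most variables are factors. A factor is a categorical variable whose values fall into a fixed set of distinct levels.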

One example is employment status, which has several levels, such as employed full time, freelance, or student.
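
To make the idea of levels concrete, here is a small sketch with made-up employment answers (not the survey data). It shows the two parts of a factor: the set of levels, and the integer codes that point into that set. These codes are the numbers that str prints after each Factor line.

```r
# Made-up employment answers for illustration
status <- factor(c("Student", "Employed full time", "Student", "Retired"))

levels(status)      # the distinct categories, sorted alphabetically by default
nlevels(status)     # how many categories there are
as.integer(status)  # the integer codes R stores, one per observation
```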

> str(data)   # generic form only: with no object named data, R describes the built-in data() function
function (..., list = character(), package = NULL, lib.loc = NULL, verbose = getOption("verbose"), envir = .GlobalEnv)
> str(reggit)
'data.frame':   32754 obs. of  14 variables:
$ id               : int  1 2 3 4 5 6 7 8 9 10 ...
$ gender           : int  0 0 1 0 1 0 0 0 0 0 ...
$ age.range        : Factor w/ 7 levels "18-24","25-34",..: 2 2 1 2 2 2 2 1 3 2 ...
$ marital.status   : Factor w/ 6 levels "Engaged","Forever Alone",..: NA NA NA NA NA 4 3 4 4 3 ...
$ employment.status: Factor w/ 6 levels "Employed full time",..: 1 1 2 2 1 1 1 4 1 2 ...
$ military.service : Factor w/ 2 levels "No","Yes": NA NA NA NA NA 1 1 1 1 1 ...
$ children         : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
$ education        : Factor w/ 7 levels "Associate degree",..: 2 2 5 2 2 2 5 2 2 5 ...
$ country          : Factor w/ 439 levels " Canada"," Canada eh",..: 394 394 394 394 394 394 125 394 394 125 ...
$ state            : Factor w/ 53 levels "","Alabama","Alaska",..: 33 33 48 33 6 33 1 6 33 1 ...
$ income.range     : Factor w/ 8 levels "$100,000 - $149,999",..: 2 2 8 2 7 2 NA 7 2 7 ...
$ fav.reddit       : Factor w/ 1834 levels "","'home' page (or front page if you prefer)",..: 720 691 1511 1528 188 691 1318 571 1629 1 ...
$ dog.cat          : Factor w/ 3 levels "I like cats.",..: NA NA NA NA NA 2 2 2 1 1 ...
$ cheese           : Factor w/ 11 levels "American","Brie",..: NA NA NA NA NA 3 3 1 10 7 ...


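Taking employment status as an example of these factors, we may want to know how many people fall into each employment group. We can tabulate the variable with table to count the people in each group.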

table(reggit$employment.status)

                   Employed full time                             Freelance Not employed and not looking for work
                                14814                                  1948                                   682
   Not employed, but looking for work                               Retired                               Student
                                 2087                                    85                                 12987
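
Note that the six counts above add up to 32603, not 32754: by default table leaves out missing values, and summary reports 151 NA's for this column. A short sketch on made-up data of how to include the NA count, turn counts into proportions, and sort the groups by size:

```r
# Made-up answers, including one missing value
status <- factor(c("Student", "Retired", "Student", NA, "Student"))

counts    <- table(status)                    # NA is dropped by default
counts_na <- table(status, useNA = "ifany")   # adds a count for NA when present
shares    <- prop.table(counts)               # proportions of the non-missing answers
by_size   <- sort(counts, decreasing = TRUE)  # largest group first

counts_na
shares
by_size
```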


Running the summary command on our data frame gives these counts along with other summary statistics.

summary(reggit)
id            gender          age.range                                      marital.status
Min.   :    1   Min.   :0.0000   18-24   :15802   Engaged                                 : 1109
1st Qu.: 8189   1st Qu.:0.0000   25-34   :11575   Forever Alone                           : 5850
Median :16380   Median :0.0000   Under 18: 2330   In a relationship                       : 9828
Mean   :16379   Mean   :0.1885   35-44   : 2257   Married/civil union/domestic partnership: 5490
3rd Qu.:24568   3rd Qu.:0.0000   45-54   :  502   Single                                  :10428
Max.   :32756   Max.   :1.0000   (Other) :  200   Widowed                                 :   44
NA's   :201      NA's    :   88   NA's                                    :    5
employment.status military.service children                                  education
Employed full time                   :14814   No  :30526       No  :27488   Bachelor's degree                 :11046
Freelance                            : 1948   Yes : 2223       Yes : 5047   Some college                      : 9600
Not employed and not looking for work:  682   NA's:    5       NA's:  219   Graduate or professional degree   : 4722
Not employed, but looking for work   : 2087                                 High school graduate or equivalent: 3272
Retired                              :   85                                 Some high school                  : 1924
Student                              :12987                                 (Other)                           : 2046
NA's                                 :  151                                 NA's                              :  144
country             state                    income.range                fav.reddit               dog.cat
United States :20967             :11908   Under $20,000      :7892                      : 4335   I like cats.   :11156
Canada        : 2888   California: 3401   $50,000 - $69,999  :4133   askreddit          : 2123   I like dogs.   :17151
United Kingdom: 1782   Texas     : 1541   $70,000 - $99,999  :4101   fffffffuuuuuuuuuuuu: 1746   I like turtles.: 4442
Australia     : 1051   New York  : 1418   $100,000 - $149,999:3522   pics               : 1651   NA's           :    5
Germany       :  407   Illinois  :  976   $20,000 - $29,999  :3206   trees              : 1311
(Other)       : 5482   Washington:  910   (Other)            :8285   (Other)            :21562
NA's          :  177   (Other)   :12600   NA's               :1615   NA's               :   26
cheese
Other    :6563
Cheddar  :6102
Brie     :3742
Provolone:3456
Swiss    :3214
(Other)  :9672
NA's     :   5


Note: besides factors, R has other data types, such as lists and matrices. For a detailed introduction to the data types, see https://www.statmethods.net/input/datatypes.html

Ordered Factors

Looking more closely at these factor variables, we focus on age.range. Note that str reports it as a factor with 7 different levels.

$ age.range        : Factor w/ 7 levels "18-24","25-34",..: 2 2 1 2 2 2 2 1 3 2 ...


To inspect the levels of the variable, use the levels command.

levels(reggit$age.range)
[1] "18-24"       "25-34"       "35-44"       "45-54"       "55-64"       "65 or Above" "Under 18"


To see the number of users in each level, we use the ggplot2 package and its qplot function to create a plot.

library(ggplot2)
Warning message:
package ‘ggplot2’ was built under R version 3.4.4
> qplot(data=reggit,x=age.range)
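
qplot is deprecated as of ggplot2 3.4.0, so on a newer installation the call above prints a deprecation warning. A sketch of the equivalent bar chart with ggplot and geom_bar, on a made-up data frame (toy is a hypothetical stand-in for reggit) and assuming ggplot2 is installed:

```r
# Made-up age groups standing in for reggit$age.range
toy <- data.frame(age.range = factor(c("18-24", "25-34", "18-24", "Under 18")))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # geom_bar() counts the rows in each factor level and draws one bar per level
  p <- ggplot(toy, aes(x = age.range)) + geom_bar()
  print(p)
}
```

With the real data, the first argument would be reggit instead of toy.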




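Note: the age groups in the plot are not in their natural order ("Under 18" comes last, matching the levels output above). The factor levels need to be put in order to fix this.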
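
The reason for the odd order is that factor, when no levels are given, uses the sorted unique values, and labels that start with a digit sort before "Under 18". A small sketch on made-up values that reproduces the effect:

```r
# Made-up age answers for illustration
ages <- c("Under 18", "18-24", "25-34", "18-24")

# With no levels argument, the levels are the sorted unique values
default_f <- factor(ages)

levels(default_f)  # "Under 18" ends up last
table(default_f)   # counts (and plots) follow the level order, not the age order
```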

Setting the Levels of an Ordered Factor

Here we learn how to set and order factor levels.

Fixing the order by setting the levels with ordered

# Set the levels in their natural order
reggit$age.range <- ordered(reggit$age.range,levels = c("Under 18","18-24","25-34","35-44","45-54","55-64","65 or Above"))
# Plot
qplot(data=reggit,x=age.range)




Fixing the order with the factor function

> reggit$age.range <- factor(reggit$age.range,levels = c("Under 18","18-24","25-34","35-44","45-54","55-64","65 or Above"),ordered = TRUE)
> qplot(data=reggit,x=age.range)


The output is the same as with ordered above.
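
The two approaches give the same result because ordered(x, ...) is a shorthand for factor(x, ..., ordered = TRUE). A sketch on made-up values (ages and lv are illustrative names) that checks this and shows what an ordered factor adds over a plain one:

```r
# Made-up age answers and the desired level order
ages <- c("25-34", "Under 18", "18-24", "25-34")
lv   <- c("Under 18", "18-24", "25-34")

a <- ordered(ages, levels = lv)
b <- factor(ages, levels = lv, ordered = TRUE)

identical(a, b)  # both calls build the same ordered factor
is.ordered(a)    # TRUE for ordered factors
a < "25-34"      # comparisons use the position in the level order
table(a)         # counts are reported in level order
```

If only the display order matters, a plain factor with explicit levels is enough; ordered = TRUE additionally enables comparisons such as the one above.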