
5: Which Domains Were Submitted Most Often? (Guided Project: Transforming data with Python)

2016-09-15 00:00

You can now move on to our second question and explore which domains were submitted most often. We'll want to make a separate script, called domains.py, for this.

Instructions

Here are the steps:

Make a file called domains.py, using the file browser or the command line.

Add in the code to read the file hn_stories.csv, and add column names.

You can think of each domain name as a "word". A domain will look like scala-lang.org or blog.iweb.com.

You can use the value_counts method in pandas to count the number of occurrences of each value in a column; see the pandas documentation for details.

Print the 100 most submitted domains.

By default, pandas truncates its printed output and shows only 10 rows of a long DataFrame or Series. There is a pandas option to make it print more rows (see the related thread on Stack Overflow), but it has bugs when used with Series. Instead, just loop through the series and print each index value and its total. Here's some sample code:

for name, row in domains.items():
    print("{0}: {1}".format(name, row))

The above code assumes that the result of running value_counts is assigned to domains.
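Putting the steps above together, a minimal sketch of domains.py might look like the following. Only submission_time is named in this project text, so the other column names (upvotes, url, headline) are assumptions, as is the idea that the file has no header row. The in-memory sample rows are made up and only stand in for hn_stories.csv so the sketch runs on its own.

```python
import io

import pandas as pd

# Assumed column layout: only "submission_time" appears in the project text;
# the other three names are illustrative guesses.
COLUMNS = ["submission_time", "upvotes", "url", "headline"]

# Made-up rows standing in for hn_stories.csv (assumed to have no header row).
SAMPLE_CSV = io.StringIO(
    "2011-11-09T21:56:22Z,1,scala-lang.org,Example headline A\n"
    "2011-11-09T22:10:05Z,3,blog.iweb.com,Example headline B\n"
    "2011-11-10T08:30:00Z,2,scala-lang.org,Example headline C\n"
)


def count_domains(csv_source, top_n=100):
    """Return the top_n most frequent values in the url column."""
    stories = pd.read_csv(csv_source, header=None, names=COLUMNS)
    # value_counts sorts by count in descending order,
    # so head() keeps the most submitted domains.
    return stories["url"].value_counts().head(top_n)


domains = count_domains(SAMPLE_CSV)
for name, count in domains.items():
    print("{0}: {1}".format(name, count))
```

To run this against the real data, pass the path "hn_stories.csv" to count_domains instead of the sample buffer.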

You can extend this analysis and make it a bit more robust by removing subdomains. For example, blog.iweb.com and iweb.com are counted as separate domains at the moment, even though they belong to the same site. By removing the subdomain, you can turn blog.iweb.com into iweb.com. You can remove the subdomain using the apply method, which is available on pandas Series and DataFrames; see the pandas documentation for apply.
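One possible way to do this is sketched below. The helper name strip_subdomain and the sample Series are hypothetical, and the rule it uses (keep the last two dot-separated labels) is a deliberate simplification rather than a complete solution.

```python
import pandas as pd


def strip_subdomain(domain):
    """Keep only the last two dot-separated labels of a domain.

    This is a naive rule: it would turn "bbc.co.uk" into "co.uk".
    """
    if not isinstance(domain, str):
        # Leave missing values (NaN) untouched.
        return domain
    labels = domain.split(".")
    return ".".join(labels[-2:])


# Hypothetical stand-in for the domain column of the stories DataFrame.
urls = pd.Series(["blog.iweb.com", "iweb.com", "scala-lang.org"])

# apply calls strip_subdomain once per value and returns a new Series.
base_domains = urls.apply(strip_subdomain)
print(base_domains.value_counts())
```

Country-code suffixes such as co.uk need a list of public suffixes to handle correctly, which is beyond what this sketch attempts.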

##################################################################

6: When Are The Most Articles Submitted?

We want to know when the most articles are submitted. One easy way to reframe this is to look at the hour in which articles are submitted. To figure this out, we'll need to use the submission_time column.

The submission_time column contains timestamps, which look like this: 2011-11-09T21:56:22Z. These times are expressed in UTC, a universal time standard used by most software for consistency (imagine a database populated with times all having different time zones; it would be a huge pain to work with).

To get the hour from a timestamp, we can use the dateutil library. The parser module in dateutil contains the parse function, which takes in a timestamp and returns a datetime object (see the dateutil documentation). After parsing the timestamp, the hour attribute of the resulting datetime object will tell you the hour the article was submitted.
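As a quick check of how this behaves, the snippet below parses the example timestamp shown earlier and reads its hour. The variable names are only illustrative.

```python
from dateutil import parser

# The example timestamp from the submission_time column.
timestamp = "2011-11-09T21:56:22Z"

# parse returns a datetime object; the trailing "Z" marks the time as UTC.
parsed = parser.parse(timestamp)

print(parsed.hour)  # 21
```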

Instructions

Make a file called times.py to find the submission times.

Write a function to extract the hour from a timestamp. This function should first use dateutil.parser.parse to parse the timestamp, then extract the hour from the resulting datetime object, then return the hour.

Use the pandas apply method to make a column of submission hours.

Use the value_counts method to find the number of occurrences of each hour.

Print out the results.
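The steps above could be sketched roughly as follows. The function name extract_hour and the column name submission_hour are hypothetical, and the small DataFrame uses made-up timestamps in place of the data you would read from hn_stories.csv.

```python
import pandas as pd
from dateutil import parser


def extract_hour(timestamp):
    """Parse a timestamp string and return its hour (0 to 23)."""
    return parser.parse(timestamp).hour


# Made-up timestamps standing in for the real submission_time column.
stories = pd.DataFrame({
    "submission_time": [
        "2011-11-09T21:56:22Z",
        "2011-11-09T21:10:05Z",
        "2011-11-10T08:30:00Z",
    ]
})

# Build a new column of hours, then count how often each hour occurs.
stories["submission_hour"] = stories["submission_time"].apply(extract_hour)
hour_counts = stories["submission_hour"].value_counts()
print(hour_counts)
```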

You can repeat this procedure to find how many articles were submitted in each year, on each day of the month, on each day of the week, in each minute, and so on.
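One way to generalize the procedure, shown here only as a sketch, is to keep a small table of extractor functions and pick one by name. EXTRACTORS and count_by are hypothetical names, and the two timestamps are made up for illustration.

```python
import pandas as pd
from dateutil import parser

# Each entry pulls one part out of a parsed datetime object.
EXTRACTORS = {
    "year": lambda dt: dt.year,
    "day": lambda dt: dt.day,
    "minute": lambda dt: dt.minute,
    "weekday": lambda dt: dt.weekday(),  # Monday is 0, Sunday is 6
}


def count_by(timestamps, part):
    """Count how many timestamps fall on each value of the chosen part."""
    extract = EXTRACTORS[part]
    parts = timestamps.apply(lambda ts: extract(parser.parse(ts)))
    return parts.value_counts()


# Made-up sample timestamps in the same format as submission_time.
timestamps = pd.Series(["2011-11-09T21:56:22Z", "2011-11-10T08:30:00Z"])
print(count_by(timestamps, "weekday"))
```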

###################################################################

7: Next Steps

That's all for the guided steps, but feel free to keep going through the data and answering questions. We encourage you to think of your own questions, and to be creative in exploring the dataset!

If you can't think of any questions, some interesting ones are:

What headline length leads to the most upvotes?

What submission time leads to the most upvotes?

How are the total numbers of upvotes changing over time?

You can write scripts and explore here, or download the code to your computer using the download icon to the right. You'll then be able to run the scripts on your own computer.

Hope this guided project has been a good experience, and please email us at hello@dataquest.io if you want to share your work -- we'd love to see it!