
5: Which Domains Were Submitted Most Often? (Guided Project: Transforming data with Python)

2016-09-15 00:00


You can now move on to our second question and explore which domains were submitted most often. We'll want to make a separate script, called domains.py, for this.

Instructions

Here are the steps:

Make a file called domains.py, using the file browser or the command line.

Add in the code to read the file hn_stories.csv, and add column names.

You can think of each domain name as a "word". A domain will look like scala-lang.org, or blog.iweb.com.

You can use the value_counts method in pandas to count the number of occurrences of each value in a column. Here are the docs.

Print the most submitted domains.

By default, pandas only prints a limited number of rows of a DataFrame or Series. There is a pandas option to make it print more rows (see this thread on Stack Overflow), but it can be buggy when used with a Series. Instead, just loop through the Series and print each index value along with its total. Here's some sample code:

# domains is the Series returned by value_counts
for name, row in domains.items():
    print("{0}: {1}".format(name, row))

The above code assumes that the result of running value_counts is assigned to domains.
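The steps above could be put together roughly as follows. This is only a sketch: it reads a few made-up rows from memory instead of the real hn_stories.csv, the column names other than submission_time are assumptions (the text does not list them), and the helper name count_domains and its top_n cutoff are hypothetical.

```python
import io

import pandas as pd

# Made-up rows standing in for hn_stories.csv. The real file is assumed to
# have no header row, since the instructions say to add column names.
SAMPLE_CSV = """2011-11-09T21:56:22Z,3,scala-lang.org,Hypothetical headline one
2011-11-09T22:10:00Z,5,blog.iweb.com,Hypothetical headline two
2011-11-10T08:30:15Z,1,scala-lang.org,Hypothetical headline three
"""

# Assumed column names; only submission_time is named in the text.
COLUMNS = ["submission_time", "upvotes", "url", "headline"]


def count_domains(csv_source, top_n=10):
    """Count how often each domain appears and keep the top_n most frequent."""
    stories = pd.read_csv(csv_source, header=None, names=COLUMNS)
    # value_counts sorts by frequency, most common first.
    return stories["url"].value_counts().head(top_n)


domains = count_domains(io.StringIO(SAMPLE_CSV))
for name, total in domains.items():
    print("{0}: {1}".format(name, total))
```

To run this against the real data, pass the path "hn_stories.csv" to count_domains in place of the in-memory sample.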

You can extend this analysis and make it a bit more robust by removing subdomains. For example, blog.iweb.com and iweb.com are counted as separate domains at the moment, even though they belong to the same site. By removing the subdomain, you can turn blog.iweb.com into iweb.com. You can remove the subdomain using the apply method on pandas Series and DataFrames. Here's the documentation.
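One simple way to do this, shown here only as a sketch, is to keep the last two dot-separated parts of each domain. The function name strip_subdomain is hypothetical, and the rule is naive: it would also shorten a domain such as example.co.uk to co.uk, so treat it as a starting point rather than a complete solution.

```python
import pandas as pd


def strip_subdomain(domain):
    """Keep only the last two dot-separated labels of a domain (naive rule)."""
    if not isinstance(domain, str):
        # Leave missing values (NaN) untouched.
        return domain
    parts = domain.split(".")
    return ".".join(parts[-2:])


# Made-up values standing in for the domain column of the dataset.
urls = pd.Series(["blog.iweb.com", "iweb.com", "scala-lang.org"])

# apply calls strip_subdomain once per value and returns a new Series.
base_domains = urls.apply(strip_subdomain)
domain_counts = base_domains.value_counts()

for name, total in domain_counts.items():
    print("{0}: {1}".format(name, total))
```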

##################################################################

6: When Are The Most Articles Submitted?

We want to know when the most articles are submitted. One easy way to reframe this is to look at what hour articles are submitted. To figure this out, we'll need to use the submission_time column.

The submission_time column contains timestamps, which look like this: 2011-11-09T21:56:22Z. These times are expressed in UTC, which is a universal time zone used by most software for consistency (imagine a database populated with times all having different time zones; it would be a huge pain to work with).

To get the hour from a timestamp, we can use the dateutil library. The parser module in dateutil contains the parse function, which takes in a timestamp and returns a datetime object. Here's a link to the documentation. After parsing the timestamp, the hour property of the resulting datetime object will tell you the hour the article was submitted.
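As a quick illustration, here is what parsing the example timestamp from the text looks like. This assumes the dateutil package is installed (it normally comes along with pandas).

```python
from dateutil.parser import parse

# The example timestamp from the text; the trailing Z marks it as UTC.
stamp = parse("2011-11-09T21:56:22Z")

print(stamp.hour)  # 21
# Other parts of the date are available the same way.
print(stamp.day)
print(stamp.weekday())  # Monday is 0, Sunday is 6
```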

Instructions

Make a file called times.py to find the submission times.

Write a function to extract the hour from a timestamp. This function should first use dateutil.parser.parse to parse the timestamp, then extract the hour from the resulting datetime object, then return the hour.

Use the pandas apply method to make a column of submission hours.

Use the value_counts method to find the number of occurrences of each hour.

Print out the results.

You can repeat this procedure to find how many articles were submitted in each year, on each day of the month or day of the week, in each minute, and so on.
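A minimal sketch of times.py following these steps might look like the code below. It uses a tiny made-up DataFrame in place of the real data, and the names extract_hour and submission_hour are hypothetical.

```python
import pandas as pd
from dateutil.parser import parse


def extract_hour(timestamp):
    """Parse a timestamp string and return the hour it falls in (0-23)."""
    return parse(timestamp).hour


# Made-up timestamps standing in for the submission_time column.
stories = pd.DataFrame({
    "submission_time": [
        "2011-11-09T21:56:22Z",
        "2011-11-09T21:03:10Z",
        "2011-11-10T08:30:15Z",
    ]
})

# Build a column of submission hours, then count each hour.
stories["submission_hour"] = stories["submission_time"].apply(extract_hour)
hours = stories["submission_hour"].value_counts()

for hour, total in hours.items():
    print("{0}: {1}".format(hour, total))
```

Swapping .hour for another datetime attribute, such as .day or .minute, gives the other breakdowns mentioned above.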

###################################################################

7: Next Steps

That's all for the guided steps, but feel free to keep going through the data and answering questions. We encourage you to think of your own questions, and to be creative in exploring the dataset!

If you can't think of any questions, some interesting ones are:

What headline length leads to the most upvotes?

What submission time leads to the most upvotes?

How are the total numbers of upvotes changing over time?

You can write scripts and explore here, or download the code to your computer using the download icon to the right. You'll then be able to run the scripts on your own computer.

Hope this guided project has been a good experience, and please email us at hello@dataquest.io if you want to share your work -- we'd love to see it!
