5: Which Domains Were Submitted Most Often? (Guided Project: Transforming Data with Python)
2016-09-15 00:00
You can now move on to our second question, and explore which domains were submitted most often. We'll want to make a separate script, called domains.py, for this.
Instructions
Here are the steps:
Make a file called domains.py, using the file browser, or the command line.
Add in the code to read the file hn_stories.csv, and add column names.
You can think of each domain name as a "word". A domain will look like scala-lang.org, or blog.iweb.com.
You can use the value_counts method in pandas to count the number of occurrences of each value in a column. Here are the docs.
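As a minimal sketch of what value_counts does (using a made-up three-element Series, not the real dataset):

```python
import pandas as pd

# A tiny made-up Series standing in for a column of domains.
domains = pd.Series(["iweb.com", "scala-lang.org", "iweb.com"])

# value_counts tallies how many times each value occurs,
# sorted from most to least frequent.
counts = domains.value_counts()
print(counts)
```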
Print the 100 most submitted domains.
By default, Pandas only prints 10 rows of a DataFrame or Series. There is a pandas option to make it print more rows (see this thread on Stack Overflow), but there are bugs with it and Series. Instead, just loop through the series and print the index value and the total. Here's some sample code:

for name, row in domains.items():
    print("{0}: {1}".format(name, row))
The above code assumes that the result of running value_counts is assigned to domains.
You can extend this analysis and make it a bit more robust by removing subdomains. For example, blog.iweb.com and iweb.com would be separate domains at the moment, but they are the same. By removing the subdomain, you can turn blog.iweb.com into iweb.com. You can remove the subdomain using the apply method on Pandas Series and DataFrames. Here's the documentation.
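Putting the steps together, one possible sketch of domains.py might look like the following. Note the hedges: the subdomain-stripping heuristic is an assumption (the project leaves the method open), and a small inline Series stands in for the url column you would normally get from reading hn_stories.csv, so the example runs on its own:

```python
import pandas as pd

def strip_subdomain(domain):
    # Rough heuristic (an assumption, not from the project): keep only
    # the last two dot-separated labels, so blog.iweb.com -> iweb.com.
    # It mishandles multi-part suffixes like .co.uk.
    return ".".join(domain.split(".")[-2:])

# In the real script you would read hn_stories.csv with pd.read_csv
# and select its domain column; a small inline stand-in is used here.
urls = pd.Series(["blog.iweb.com", "iweb.com", "scala-lang.org"])

# Strip subdomains, then count occurrences of each remaining domain.
domains = urls.apply(strip_subdomain).value_counts()

# Print each domain and its total, one per line.
for name, count in domains.items():
    print("{0}: {1}".format(name, count))
```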
##################################################################
6: When Are The Most Articles Submitted?
We want to know when the most articles are submitted. One easy way to reframe this is to look at what hour articles are submitted. To figure this out, we'll need to use the submission_time column.
The submission_time column contains timestamps, which look like this: 2011-11-09T21:56:22Z. These times are expressed in UTC, which is a universal time zone used by most software for consistency (imagine a database populated with times all having different time zones; it would be a huge pain to work with).
To get the hour from a timestamp, we can use the dateutil library. The parser module in dateutil contains the parse function, which can take in a timestamp and return a datetime object. Here's a link to the documentation. After parsing the timestamp, the hour property of the resulting datetime object will tell you the hour the article was submitted.
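A quick sketch of parsing a single timestamp with dateutil (using the example value shown above):

```python
from dateutil import parser

# parse() accepts many timestamp formats, including the ISO 8601
# strings found in the submission_time column.
dt = parser.parse("2011-11-09T21:56:22Z")
print(dt.hour)  # 21
```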
Instructions
Make a file called times.py to find the submission times.
Write a function to extract the hour from a timestamp. This function should first use dateutil.parser.parse to parse the timestamp, then extract the hour from the resulting datetime object, then return the hour.
Use the pandas apply method to make a column of submission hours.
Use the value_counts method to find the number of occurrences of each hour.
Print out the results.
You can repeat this procedure to find how many articles were submitted on each day of the month, year, minute, day of the week, and so on.
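A self-contained sketch of the times.py steps above (the three timestamps are made up; the real script would apply the function to the submission_time column read from hn_stories.csv):

```python
import pandas as pd
from dateutil import parser

def extract_hour(timestamp):
    # Parse the timestamp string, then return just the hour (0-23, UTC).
    return parser.parse(timestamp).hour

# Made-up stand-ins for the submission_time column.
times = pd.Series([
    "2011-11-09T21:56:22Z",
    "2011-11-09T21:00:00Z",
    "2012-01-01T05:12:00Z",
])

hours = times.apply(extract_hour)   # column of submission hours
counts = hours.value_counts()       # occurrences of each hour
print(counts)
```

Swapping extract_hour for a function that returns, say, the parsed datetime's day or weekday() gives the other groupings mentioned above.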
###################################################################
7: Next Steps
That's all for the guided steps, but feel free to keep going through the data and answering questions. We encourage you to think of your own questions, and to be creative in exploring the dataset! If you can't think of any questions, some interesting ones are:
What headline length leads to the most upvotes?
What submission time leads to the most upvotes?
How are the total numbers of upvotes changing over time?
You can write scripts and explore here, or download the code to your computer using the download icon to the right. You'll then be able to run the scripts on your own computer.
Hope this guided project has been a good experience, and please email us at hello@dataquest.io if you want to share your work -- we'd love to see it!