
FileSystem.getFileBlockLocations


Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.22.0
Component/s: hdfs client, name-node
Labels: None
Hadoop Flags: Incompatible change, Reviewed

Description

Currently map-reduce applications (specifically file-based input-formats) use FileSystem.getFileBlockLocations to compute splits. However, they are forced to call it once per file.

The downsides are multiple:

- Even with a few thousand files to process, the number of RPCs quickly starts getting noticeable.
- The current implementation of getFileBlockLocations is too slow, since each call results in a 'search' in the namesystem. Assuming a few thousand input files, it results in that many RPCs and 'searches'.

It would be nice to have a FileSystem.getFileBlockLocations which can take in a directory and return the block-locations for all files in that directory. We could eliminate the per-file RPC and also replace the 'search' with a 'scan'.

When I tested this for terasort, a moderate job with 8000 input files, the runtime halved from the current 8s to 4s. Clearly this is much more important for latency-sensitive applications...
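For context, a minimal sketch of the per-file pattern the description refers to, using the existing FileSystem API (FileStatus, BlockLocation and getFileBlockLocations from org.apache.hadoop.fs); the bulk variant at the end is only the shape being proposed in this issue, not an existing method:

// Sketch: today's split computation does one getFileBlockLocations RPC per file.
void locatePerFile(FileSystem fs, Path inputDir) throws IOException {
  FileStatus[] files = fs.listStatus(inputDir);               // 1 RPC
  for (FileStatus file : files) {
    if (!file.isDir()) {
      BlockLocation[] blocks =
          fs.getFileBlockLocations(file, 0, file.getLen());   // 1 RPC per file
      // ... build splits for this file from blocks ...
    }
  }
}

// Proposed (hypothetical) bulk variant: one call covering a whole directory, e.g.
// Map<FileStatus, BlockLocation[]> getFileBlockLocations(Path dir);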


Attachments






hdfsListFiles.patch     23/Jul/10 19:03     43 kB     Hairong Kuang
hdfsListFiles1.patch    27/Jul/10 04:42     40 kB     Hairong Kuang
hdfsListFiles2.patch    31/Jul/10 00:07     42 kB     Hairong Kuang
hdfsListFiles3.patch    02/Aug/10 22:30     54 kB     Hairong Kuang
hdfsListFiles4.patch    11/Aug/10 18:20     47 kB     Hairong Kuang
hdfsListFiles5.patch    11/Aug/10 19:32     48 kB     Hairong Kuang


Issue Links

This issue blocks:
MAPREDUCE-1981 - Improve getSplits performance by using listFiles, the new FileSystem API

This issue relates to:
HADOOP-6870 - Add FileSystem#listLocatedStatus to list a directory's content together with each file's block locations

Activity


Doug Cutting added a comment - 08/May/09 17:27

An alternative to passing directories might be to pass a list of files. The request might get larger, but this is more precise: e.g., when only a subset of files in a directory will be used, only that subset need be passed. Since globbing is client-side, this requires two round trips, one to list files and one to list their blocks, but that would still be a huge improvement over per-file RPC.


Doug Cutting added a comment - 08/May/09 17:48

How about adding something like:

Map<FileStatus, BlockLocation[]> listBlockLocations(Path[]);

This would permit a glob-free job to get everything it needs in a single RPC, and a globbing job to do so with two RPCs.


Arun C Murthy added a comment - 08/May/09 17:49

Map<FileStatus, BlockLocation[]> listBlockLocations(Path[]);

+1


Konstantin Shvachko added a comment - 08/May/09 18:14

Currently getBlockLocations(src, offset, length) returns a class called LocatedBlocks, which contains a list of LocatedBlock belonging to the file.

public class LocatedBlocks implements Writable {
  private long fileLength;
  private List<LocatedBlock> blocks; // array of blocks with prioritized locations
}


The question is whether we should modify LocatedBlocks, which would include the map proposed by Doug, and extend the semantics of getBlockLocations() to handle directories, or should we introduce a new method (RPC) getBlockLocations(srcDir) returning LocatedBlockMap.

Is there a reason to keep current per file getBlockLocations() if we had a more generic method?


Doug Cutting added a comment - 08/May/09 18:28

> Is there a reason to keep current per file getBlockLocations() if we had a more generic method?

Not that I can think of. +1 for replacing it.


dhruba borthakur added a comment - 10/May/09 09:40

If we adopt the approach that Doug has suggested, then the namenode still has to search for each input path in the file system namespace. This approach still has the advantage that the number of RPC calls is reduced. If we adopt Arun's proposal, which specifies a directory and whose RPC call returns the splits of all the files in that directory, then it reduces the number of searches in the FS namespace as well as the number of RPC calls. I was kind of leaning towards Arun's proposal, but Doug's approach is a little more flexible in nature, isn't it?


Arun C Murthy added a comment - 11/May/09 17:50

Dhruba, I was thinking it was implicit in Doug's proposal that if one of the paths in the Path[] is a directory, then the new API would return block-locations of all its children (non-recursively?), which would satisfy the original requirement. Doug, can you please confirm?


Doug Cutting added a comment - 11/May/09 22:39

> Doug, can you please confirm?

Yes, I had assumed that any directories in the request would be expanded. The goal is to have something we can call from FileInputFormat, which takes a list of patterns. When the patterns contain no wildcards, we should be able to create splits with a single
RPC to the NameNode. So the semantics should match those of FileInputFormat in this case.


Konstantin Shvachko added a comment - 12/May/09 01:16

> (non-recursively?)

I think the RPC call itself should not be recursive. It is like with ls: the getListing() call is non-recursive, but the client recursively calls getListing() on sub-directories.

The idea is to prevent people from making the mistake of calling getBlockLocation("/") recursively on large directory trees, which may freeze the name-node for a long period of time.

A non-recursive variant should be sufficient to cover Arun's use case.


dhruba borthakur added a comment - 12/May/09 13:35

Ok, so from what I can understand, here is the proposal:

Map<FileStatus, BlockLocation[]> listBlockLocations(Path[] inputPaths);

The "inputPaths" can be a set of files and/or directories. If one of the inputPaths is a directory, then all items inside that directory (only one level, not recursive) are listed and their block locations are returned by this call. if one of the inputPaths
is a file, then its block locations are returned by this call. The FileStatus returned by this call should have the absolulte path of the object being returned.
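To make these semantics concrete, a minimal usage sketch; listBlockLocations is the proposed (hypothetical) bulk method, not an existing FileSystem call, and fs stands for a FileSystem instance:

// Hypothetical usage of the proposed bulk call (sketch only).
Path[] inputPaths = { new Path("/data/dir1"), new Path("/data/file2") };
Map<FileStatus, BlockLocation[]> located = fs.listBlockLocations(inputPaths);

for (Map.Entry<FileStatus, BlockLocation[]> e : located.entrySet()) {
  FileStatus file = e.getKey();           // always a plain file, with an absolute path
  BlockLocation[] blocks = e.getValue();  // its block locations
  // /data/dir1 was expanded one level; /data/file2 appears directly.
  // ... build one or more splits per file from its block locations ...
}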


Doug Cutting added a comment - 12/May/09 20:09

Dhruba: yes, that sounds right to me.

A further clarification: should subdirectories be included, with empty block lists, or elided? My hunch is to eliminate them, so that every FileStatus returned is for a plain file--no directories. Does that sound right to others?


dhruba borthakur added a comment - 12/May/09 22:20

> every FileStatus returned is for a plain file--no directories

Sounds good to me.


Raghu Angadi added a comment - 28/May/09 21:56

I see why the interface takes an array of paths. But I am not sure why it returns a map (not that there is anything wrong with it). This is probably the only RPC returning a map in Hadoop.

How does a user figure out which were valid and which were invalid/non-existent/empty paths? Maybe the user does not care?

'getBlockLocations()' returns the blocks (sort of) sorted w.r.t. the client. Should this interface do that too? The M/R use case does not need them sorted.

Is there any other interface that resembles this?


dhruba borthakur added a comment - 29/May/09 00:37

> Is there any other interface that resembles this?

The only thing that comes relatively close to this one is the READDIRPLUS operation in NFS. This call is more like getFileStatusBulk() for HDFS.


Doug Cutting added a comment - 01/Jun/09 23:02

> But not sure why it returns a map

It could perhaps instead return an array of two-element structs, each containing a FileStatus/block-locations pair, but a Map seems simpler.

> How does a user figure out which were valid and which were invalid/non-existent/empty paths?

Non-existent paths should be ignored. Paths whose URIs are for different filesystems or are somehow unparseable should cause an exception.


Jakob Homan added a comment - 16/Jun/09 00:52

The current API implementation of getBlockLocations includes parameters for the byte offset within the file and the number of bytes within the file for which to return blocks. These parameters aren't currently provided in the specification for the new API. Would it be better to pass in an array of BlockLocationRequests, each of which would consist of the path, start and length?

The other option would be to add the start and length specifications and for them to apply to each of the paths within the array, which doesn't seem particularly useful.
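A sketch of what such a request type might look like; BlockLocationRequest is hypothetical here (it does not exist in the Hadoop code base), and the field meanings simply restate the comment above:

// Hypothetical request type for the per-range variant discussed above (sketch only).
class BlockLocationRequest {
  private final Path path;   // file to locate blocks for
  private final long start;  // byte offset of the first byte of interest
  private final long length; // number of bytes of interest

  BlockLocationRequest(Path path, long start, long length) {
    this.path = path;
    this.start = start;
    this.length = length;
  }

  Path getPath()   { return path; }
  long getStart()  { return start; }
  long getLength() { return length; }
}

// One possible bulk signature built on it (also hypothetical):
// BlockLocation[][] getBlockLocations(BlockLocationRequest[] requests);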


dhruba borthakur added a comment - 16/Jun/09 05:09

> pass in an array of BlockLocationRequests, each of which would consist of the path, start and length

+1. This sounds better than assuming that we need to send back all blocks for the specified path(s).


Doug Cutting added a comment - 16/Jun/09 16:48

> Would it be better to pass in an array of BlockLocationRequests, each of which would consist of the path, start and length?

That is more general, but what is the use case? The motivating use case for listBlockLocations() is map reduce split construction, which typically takes a list of files as input, not a list of sections of files. Adding a feature that won't be used will just
make this new API harder to use. -1 without a compelling use case.


Jakob Homan added a comment - 16/Jun/09 17:37

> That is more general, but what is the use case?

The original motivation was Arun and Owen noticing during the terasort work that a large number of RPC calls were made during task scheduling and that a bulk method could ameliorate that. That seems reasonable to me. I'll let Arun lobby further.

One point that came up in discussions is that it would be a good idea to have a maximum number of files that can be returned at once in order to not overwhelm the namenode. Whether this is hard-coded or configurable was not decided.


Doug Cutting added a comment - 16/Jun/09 19:05

I meant, what is the use case for passing in start/end positions per file? I support the idea of a bulk call, but don't see the need to pass start/end positions per file.


dhruba borthakur added a comment - 16/Jun/09 19:56

I think the extended version of the API would help in doing incremental distcp when hdfs-append is supported. We use "distcp -update" to do an incremental copy of files that have changed in length, but having this proposed extended API (and more) allows
distcp to copy only changed portions of a file.


Arun C Murthy added a comment - 17/Jun/09 01:42

Quick note: making the length mandatory (i.e. part of the API) has the unfortunate side-effect of forcing a stat on each file prior to the call to listBlockLocations. So, from a Map-Reduce perspective it is important to have an API which does not force InputFormats to pass in the lengths.

OTOH if we really need the more general version of the api I'd like to pass in "-1" to imply the whole file.


Doug Cutting added a comment - 17/Jun/09 19:16

> I think the extended version of the API would help in doing incremental distcp when hdfs-append is supported.

Thanks for the use case! An append-savvy incremental distcp might first use listStatus to get all file lengths and dates from both filesystems, then figure out which had grown longer but whose creation dates had not changed, indicating they'd been appended
to. Then a batch call could be made to fetch block locations of just newly appended sections, and these would be used to construct splits that can be localized well. Does that sound right?

In this case we would not list directories, but rather always pass in a list of individual files. The mapping from inputs to outputs would be 1:1 so it could take the form:

BlockLocation[] getBlockLocations(BlockLocationRequest[])

A corollary is that it does not make sense to pass start/end positions for a directory, although these could be ignored.

Do we want to try to develop a single swiss-army-knife batch call, or add operation-optimized calls as we go?


Sanjay Radia added a comment - 08/Sep/09 21:14

Is the optimization for sending only partial block reports really necessary? Most files have very few blocks ...

Also Arun's point about the extra call needed for getFileStatus() is valid.

Why not create a class called DetailedFileStatus which contains both the file status and block locations:

DetailedFileStatus[] = getBlockLocations(Path[] paths); // 1:1 mapping between the two arrays as Doug suggested.

We can add the range one later if we really need that optimization.


Doug Cutting added a comment - 09/Sep/09 19:08

> Is the optimization for sending only partial block reports really necessary?

It may help the append-savvy distcp use case, but is not needed in the mapred job submission use case. Even in the append-savvy distcp use case, it's not clear that it's required. Maybe we should punt that until someone develops an append-savvy distcp?

> Why not create a class called DetailedFileStatus which contains both the file status and block locations:

Why is DetailedFileStatus[] better than Map<FileStatus,BlockLocation[]>? The latter seems more transparent.

> DetailedFileStatus[] = getBlockLocations(Path[] paths); // 1:1 mapping between the two arrays as Doug suggested.

That was intended for the append-savvy distcp use case. The original use case was for mapred job submission, where we typically have a list of directories. With directories there is not a 1:1 mapping.


Sanjay Radia added a comment - 15/Sep/09 16:45

> Maybe we should punt that until someone develops an append-savvy distcp?

+1

>Why is DetailedFileStatus[] better than Map<FileStatus,BlockLocation[]>? The latter seems more transparent.

I was holding out against a file system interface returning a map. But that is old school.

Fine, I am convinced.

I suspect you also want the RPC signature to return a map (that makes me more nervous because most RPCs do not support that - but ours does, I guess).

Wrt the new FileContext API, my proposal is that it provides a single getBlockLocations method:

Map<FileStatus,BlockLocation[]> getBlockLocations(Path[] path)

and abandon the BlockLocation[] getBlockLocations(path, start, end).

(of course FileSystem will continue to support the old getBlockLocations.)


Doug Cutting added a comment - 15/Sep/09 17:56

+1


dhruba borthakur added a comment - 15/Sep/09 18:23

+1


Hairong Kuang added a comment - 12/Jul/10 23:58

I am quite bothered that the proposed API returns a map. Is the reason for returning a map because the API does one-level listPath? Is there a use case that needs only one-level expansion?

If we eventually need to get the block locations of all files recursively under the input paths, is the following API a better choice?

/**
* @return the block locations of all files recursively under the input paths
*/
Iterator<BlockLocation> getBlockLocations(Path[] paths)


When implementing this in HDFS, we might need to issue multiple RPCs and be very careful to limit the size of each RPC request and response.


Hairong Kuang added a comment - 13/Jul/10 00:26

I read FileInputFormat and understand the use case much better. So the client needs to know FileStatus for filtering, and there is a configuration parameter to specify whether the input paths need to be traversed recursively. In this case, how about the following revised API?

class FileStatusAndBlockLocations {
  FileStatus fileStatus;
  BlockLocation[] blocks;
}

Iterator<FileStatusAndBlockLocations> getBlockLocations(Path[] paths, boolean isRecursive);




Hairong Kuang added a comment - 15/Jul/10 00:08

The above proposed method is an API in FileSystem.

Internally in HDFS, I plan to add two new client-to-namenode RPCs:

class HdfsFileStatusAndBlockLocations {
  HdfsFileStatus fileStatus;
  BlockLocation[] blocks;
}

/**
 * Given an array of input paths, return an array of file status and block locations.
 * The input array and the output array have the same size.
 * The ith item in the output array is the file status and block locations of the ith path in the input array.
 * If an input path is a directory, its block locations are empty.
 */
HdfsFileStatusAndBlockLocations[] getFileStatusAndBlockLocations(Path[] paths);

/**
 * Given an input directory, return the file status and block locations of its children.
 */
HdfsFileStatusAndBlockLocations[] listFileStatusAndBlockLocations(Path path);

Suppose the subtrees that represent a job's input paths contain N directories; the two APIs allow a DFS client to issue N+1 RPCs to the NameNode to implement the above proposed file system API.
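To illustrate where the N+1 figure comes from, a rough client-side sketch under the proposal above; the RPC names and the HdfsFileStatusAndBlockLocations fields are taken from the comment, while namenode (a ClientProtocol proxy) and fullPathOf() are assumed helpers for the sketch:

// Sketch only: one RPC for the input paths, then one RPC per directory in the subtrees.
List<HdfsFileStatusAndBlockLocations> files = new ArrayList<HdfsFileStatusAndBlockLocations>();
Deque<Path> dirs = new ArrayDeque<Path>();

// RPC 1: resolve all of the job's input paths in a single call.
for (HdfsFileStatusAndBlockLocations item : namenode.getFileStatusAndBlockLocations(inputPaths)) {
  if (item.fileStatus.isDir()) {
    dirs.add(fullPathOf(item));   // assumed helper: rebuild the full path of the entry
  } else {
    files.add(item);              // plain file: block locations already piggybacked
  }
}

// RPCs 2..N+1: list each of the N directories once.
while (!dirs.isEmpty()) {
  for (HdfsFileStatusAndBlockLocations child : namenode.listFileStatusAndBlockLocations(dirs.poll())) {
    if (child.fileStatus.isDir()) {
      dirs.add(fullPathOf(child));
    } else {
      files.add(child);
    }
  }
}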


Hairong Kuang added a comment - 15/Jul/10 00:27

I also plan to use the same idea of iterative listing (HDFS-985) to limit the size of the response when listing the FileStatusAndBlockLocations of a directory.


Hairong Kuang added a comment - 16/Jul/10 18:08

I want to explain the difference between my proposal and the previous proposal.

1. For the FileSystem API, the user can specify whether the input paths need to be recursively traversed or not. The return result is an iterator, which allows the input files to be fetched from the server one batch at a time, so as to avoid OOM exceptions when the input paths are huge.

2. The design of new RPCs allows us to return HdfsFileStatus (local file name) instead of FileStatus (full path name), saving CPU processing time. It also allows us to easily limit the response size.

If nobody is against it, I will go ahead with the implementation.


dhruba borthakur added a comment - 16/Jul/10 23:34

+1 to this proposal.

> The return result is an iterator, which allows the input files to be fetched from

However, if the number of files in a directory is small (say 500), then we can still fetch everything in one RPC, can't we?


Hairong Kuang added a comment - 17/Jul/10 00:21

> if the number of files in a directory is small (say 500), then we can still fetch everything in one RPC, can't we?

I will reuse DFS_LIST_LIMIT introduced in HDFS-985. Its default value is 1000. So by default, 500 files will be fetched in one RPC.


Hairong Kuang added a comment - 21/Jul/10 21:42

Taking multiple paths as an input to a FileContext API and an HDFS client-NN RPC seems to be a bad idea. It adds quite a lot of complexity for grouping paths by file systems and for resolving symbolic links. It does not sound clean and I'd like to avoid it. So here is the revised proposal:

class LocatedFileStatus extends FileStatus {
  BlockLocation[] blocks;
}


FileSystem and FileContext will have a new API

public Iterator<FileStatusAndBlockLocations> listLocatedFileStatus(Path path, boolean isRecursive);


This new API is similar to FileContext#listStatus in many ways, except that the returned LocatedFileStatus contains its block locations and, if isRecursive is true, all the files in the subtree rooted at the input path will be returned.

Similarly in HDFS, we will have

class HdfsLocatedFileStatus extends HdfsFileStatus {
  BlockLocation[] blocks;
}


ClientProtocol will add one more parameter "boolean withLocation" to the existing getListing RPC.

public DirectoryListing getListing(String src,
    byte[] startAfter,
    boolean withLocation)
    throws AccessControlException, FileNotFoundException,
           UnresolvedLinkException, IOException;


If withLocation is false, the semantics are the same as before. When withLocation is true, DirectoryListing will contain LocatedFileStatus.
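A sketch of how a client might page through a large directory with the extended getListing RPC; the DirectoryListing accessors (getPartialListing, getLastName, hasMore) and the empty start name are assumed to match the existing iterative listing from HDFS-985, and namenode stands for a ClientProtocol proxy:

// Sketch only: iterate a directory listing with block locations piggybacked.
byte[] startAfter = new byte[0];        // assumed convention: empty name = start of listing
DirectoryListing batch;
do {
  batch = namenode.getListing(src, startAfter, true);   // true: withLocation
  for (HdfsFileStatus stat : batch.getPartialListing()) {
    // With withLocation == true each entry is an HdfsLocatedFileStatus,
    // so the block locations come back in the same response.
    // ... hand the entry and its locations to the caller ...
  }
  startAfter = batch.getLastName();     // resume after the last entry returned
} while (batch.hasMore());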


Hairong Kuang added a comment - 21/Jul/10 21:45

Sorry, the new FileSystem/FileContext API should be

public Iterator<LocatedFileStatus> listLocatedFileStatus(Path path, boolean isRecursive);
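A minimal usage sketch of the corrected API from the caller's point of view; listLocatedFileStatus is the proposed method, and getBlockLocations() on LocatedFileStatus is an assumed accessor for the piggybacked locations:

// Sketch only: recursive listing with block locations for split computation.
Iterator<LocatedFileStatus> it = fs.listLocatedFileStatus(new Path("/user/joe/input"), true);
while (it.hasNext()) {
  LocatedFileStatus file = it.next();
  BlockLocation[] blocks = file.getBlockLocations();  // assumed accessor
  // ... apply the job's PathFilter, then build splits from blocks ...
}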




Hairong Kuang added a comment - 23/Jul/10 19:03

This is an initial patch for the HDFS implementation of the proposed FileSystem API.


Hairong Kuang added a comment - 23/Jul/10 19:12

I am not sure what we should do if a child of the input directory is a symbolic link. Whether the symbolic link should be resolved or not is better decided by applications.

It seems cleaner if the new API changes to listLocatedFileStatus(Path path), so it does not traverse the subtree recursively and it returns all the content of the directory. BlockLocations are piggybacked if a child is a file. This design decision leaves questions like how to deal with a child that is a symbolic link or a directory to be answered by applications.


Doug Cutting added a comment - 23/Jul/10 19:50

> I am not sure what should we do if a child of the input directory is a symbolic link.

Handling of symlinks should be addressed in HADOOP-6870, no?


Hairong Kuang added a comment - 23/Jul/10 21:42

Hi Doug, thanks for your review comments.

Yes, handling of symlinks should be addressed in FileContext in HADOOP-6870. HDFS-202 serves as the discussion board for this issue, so I posted the question here.

My question is whether this new API should handle recursive traversal and symbolic link resolution. Is it cleaner if it does not do any of these and leaves the decisions to applications?


Doug Cutting added a comment - 23/Jul/10 22:19

> My question is whether this new API should handle recursive traversal and symbolic resolution.

My intuition is that recursive file listings for open should follow symbolic links, since open follows symbolic links. Recursive traversal for remove should not follow symbolic links, but should just remove the symbolic link, like remove does on a symbolic
link.


Hairong Kuang added a comment - 27/Jul/10 04:42

hdfsListFiles1.patch supports the listFiles API in both DistributedFileSystem and Hdfs.

It has two unit tests. In particular, TestListFilesInFileContext includes a test on an input directory that contains two symbolic links.


Suresh Srinivas added a comment - 29/Jul/10 21:33

HDFS.java
- Not sure about "NB:" in the comment.

DirListingIterator
- Make f, src, needLocation final. Add javadoc to the class.
- getNext() should call hasNext(). A caller calling next() without calling hasNext() could result in not fetching a new partial list and also in an ArrayIndexOutOfBoundsException.

DistributedFileSystem
- listLocatedStatus(): make src final.
- listFiles(): make itor, curFile private and dirStats final.

HDFSFileLocatedStatus.java - missing banner.
FSNamesystem.java - make getBlockLocationsInternal() private.
NameNode.java - change the variable hasLocation to needLocation. Is the getListing() variant with the boolean flag just used by fsck? Do we need that variant?
General - as in the hadoop common implementation, on IOException, should next() and hasNext() throw RuntimeException instead of returning false? Please note the comment in HADOOP-6870 about whether this is the right way to handle FileNotFoundException.
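A minimal sketch of the getNext()/hasNext() pattern asked for above: hasNext() is the only place that fetches the next partial listing, and next() always goes through it, so callers that skip hasNext() are still safe. This is illustrative only, not the patch's actual DirListingIterator code; wrapping IOException in a RuntimeException, as discussed in the "General" point above, is omitted.

// Illustrative sketch, not the patch's code.
private HdfsFileStatus[] batch = new HdfsFileStatus[0]; // current partial listing
private int cursor = 0;                                 // position within batch
private byte[] lastName = new byte[0];                  // resume token for the next RPC
private boolean hasMore = true;                         // server reported more entries

public boolean hasNext() throws IOException {
  while (cursor >= batch.length && hasMore) {           // fetch the next batch on demand
    DirectoryListing listing = namenode.getListing(src, lastName, needLocation);
    batch = listing.getPartialListing();
    lastName = listing.getLastName();
    hasMore = listing.hasMore();
    cursor = 0;
  }
  return cursor < batch.length;
}

public HdfsFileStatus next() throws IOException {
  if (!hasNext()) {                                     // protects callers that never call hasNext()
    throw new NoSuchElementException("No more entries under " + src);
  }
  return batch[cursor++];
}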


Doug Cutting added a comment - 30/Jul/10 00:28

By default I think exceptions should be thrown. This is like the return status of the unix 'ls' command, which is non-zero if, e.g., a directory is unreadable. But perhaps an option to force enumeration in light of exception would be useful.


Suresh Srinivas added a comment - 30/Jul/10 00:42

Unix 'ls' returns all the results in one shot. However, when getting the response iteratively the behavior is different:

- When listing a single directory, if some ls results have been returned and the directory is deleted, we should throw FileNotFoundException, to indicate that the directory is no longer available.
- When recursively listing under a directory, if a subdirectory is deleted, the more appropriate response is to ignore FileNotFound for that directory and return the remaining results. This would be consistent with what the result would be if the command were repeated. Further, if an application is recursively listing a large directory whose state keeps changing, the application may have to try many times to list it.


Hairong Kuang added a comment - 30/Jul/10 16:41

As I commented in HADOOP-6890, I would prefer throwing exceptions when a file/directory is deleted during listing. This is because getFiles is used by the MapReduce job client to calculate splits, so the expectation is that the input directories remain unchanged during job execution. It is good to fail the job earlier rather than later.


Hairong Kuang added a comment - 31/Jul/10 00:07

This patch addressed Suresh's review comments.


Hairong Kuang added a comment - 02/Aug/10 22:30

hdfsListFiles3.patch adds a fault-injection test to make sure that the iterator returned by listLocatedStatus throws RuntimeException in case of an I/O error.


Hadoop QA added a comment - 02/Aug/10 22:45

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12451084/hdfsListFiles3.patch

against trunk revision 981289.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 15 new or modified tests.

+1 javadoc. The javadoc tool did not generate any warning messages.

-1 javac. The patch appears to cause tar ant target to fail.

-1 findbugs. The patch appears to cause Findbugs to fail.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

-1 core tests. The patch failed core unit tests.

-1 contrib tests. The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/226/testReport/
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/226/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/226/console
This message is automatically generated.


Hairong Kuang added a comment - 04/Aug/10 04:39

There seems to be a problem with Hudson. Let me try it one more time.


Hadoop QA added a comment - 04/Aug/10 04:54

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12451084/hdfsListFiles3.patch

against trunk revision 982091.

+1 @author. The patch does not contain any @author tags.

+1 tests included. The patch appears to include 15 new or modified tests.

-1 patch. The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/228/console
This message is automatically generated.


Hairong Kuang added a comment - 11/Aug/10 18:00

This patch incorporated changes made in HADOOP-6900.


Suresh Srinivas added a comment - 11/Aug/10 18:44

Comments:
- ListPathAspects.aj - the callGetListing() method has a description which says "rename".
- HDFSFileLocatedStatus.java - missing banner.


Suresh Srinivas added a comment - 11/Aug/10 18:46

+1 for the patch if the above comments are taken care of.


Hairong Kuang added a comment - 11/Aug/10 19:32

This patch addresses Suresh's comments.


Hairong Kuang added a comment - 11/Aug/10 19:41

I was not able to run ant test-patch because the trunk does not compile. But I checked that this patch does not introduce new Javadoc warnings and that it adds new tests. There were quite a few unit tests failing, but they seem not related to this patch.


Konstantin Shvachko added a comment - 11/Aug/10 20:21

> the trunk does not compile

See here.


Hairong Kuang added a comment - 11/Aug/10 20:45

Konstantin, the HDFS trunk should be able to compile because I've committed this patch. HDFS-202 is the HDFS side of HADOOP-6900!

Thanks Suresh for reviewing this patch at full speed!




Hairong Kuang added a comment - 11/Aug/10 20:46

I've committed this!


Hudson added a comment - 20/Aug/10 08:00

Integrated in Hadoop-Hdfs-trunk-Commit #370 (See https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/370/)

Amareshwari Sriramadasu added a comment - 20/Aug/10 09:10

Shouldn't we mark this feature as an Incompatible change? It changed the signature of getListing() and broke the MapReduce build, MAPREDUCE-2022.


Tsz Wo (Nicholas), SZE added a comment - 20/Aug/10 16:02

Yes, this is an Incompatible change.