Sunday, June 15, 2014

What Are Short-Circuit Local Reads?

Default (Hadoop 2.0)

The client connects to the DataNode over a TCP socket, and the data is transferred via DataTransferProtocol, even when the client and the DataNode run on the same machine.

HDFS-2246: Short-Circuit Read

The client opens and reads the block file directly, bypassing the DataNode. The client only needs the file path from the DataNode.
Feature: a block path cache, which allows the client to reopen a file that it has read recently.
Downside: security concern, because the client needs direct read access to the DataNode's block files.
Configuration: hdfs-site.xml
<property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
    <description>This configuration parameter turns on short-circuit local reads.</description>
</property>

<property>
    <name>dfs.block.local-path-access.user</name>
    <value>gpadmin,hdfs,mapred,yarn,hbase,hive</value>
    <description>Comma-separated list of users that are allowed to use short-circuit local reads.</description>
</property>

<property>
    <name>dfs.client.read.shortcircuit.skip.checksum</name>
    <value>false</value>
    <description>If this configuration parameter is set, short-circuit local reads will skip checksums. This is normally not recommended, but it may be useful for special setups. You might consider using this if you are doing your own checksumming outside of HDFS.</description>
</property>

HDFS-347: Secure Short-Circuit Read

The DataNode opens the file and passes the file descriptors to the client.
Feature: FileInputStreamCache (a file descriptor cache), so the client does not need to reopen the same files repeatedly. This is the fastest option.
Downside: N/A.
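
The descriptor hand-off described above can be illustrated outside of Hadoop. The following is a minimal Python sketch (Python 3.9+ on a Unix system), not Hadoop's actual implementation: one side opens a file and sends the open file descriptor over a UNIX domain socket, and the other side reads from the received descriptor without ever needing permission to open the path itself. The function names send_block_fd, recv_block_fd and demo are hypothetical and exist only for this illustration.

```python
import os
import socket
import tempfile


def send_block_fd(sock, path):
    """'DataNode' side (hypothetical): open the file and pass its descriptor."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # send_fds attaches the descriptor as SCM_RIGHTS ancillary data.
        socket.send_fds(sock, [b"block"], [fd])
    finally:
        # The receiver gets its own duplicate, so the sender can close its copy.
        os.close(fd)


def recv_block_fd(sock):
    """'Client' side (hypothetical): receive one descriptor from the socket."""
    _msg, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
    return fds[0]


def demo():
    # A connected pair of UNIX domain sockets stands in for the
    # DataNode/client connection on dfs.domain.socket.path.
    datanode_sock, client_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"block data")
        path = f.name
    try:
        send_block_fd(datanode_sock, path)
        fd = recv_block_fd(client_sock)
        # The client reads through the received descriptor; it never opens the path.
        with os.fdopen(fd, "rb") as block:
            return block.read()
    finally:
        datanode_sock.close()
        client_sock.close()
        os.unlink(path)


if __name__ == "__main__":
    print(demo())
```

Because only an already-open descriptor crosses the socket, the reading side does not need file system permissions on the block directory, which is the reason this approach avoids the security concern of HDFS-2246.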
Configuration: hdfs-site.xml
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
  <description>
    This configuration parameter turns on short-circuit local reads.
  </description>
</property>

<property>
  <name>dfs.domain.socket.path</name>
  <value>/home/stack/sockets/short_circuit_read_socket_PORT</value>
  <description>
    Optional.  This is a path to a UNIX domain socket that will be used for
    communication between the DataNode and local HDFS clients.
    If the string "_PORT" is present in this path, it will be replaced by the
    TCP port of the DataNode.
  </description>
</property>

<property>
  <name>dfs.client.read.shortcircuit.streams.cache.size</name>
  <value>256</value>
  <description>
    The DFSClient maintains a cache of recently opened file descriptors. This parameter controls the size of that cache. Setting this higher will use more file descriptors, but potentially provide better performance on workloads involving lots of seeks.
  </description>
</property>

<property>
  <name>dfs.client.read.shortcircuit.streams.cache.expiry.ms</name>
  <value>300000</value>
  <description>
    This controls the minimum amount of time file descriptors need to sit in the FileInputStreamCache before they can be closed for being inactive for too long.
  </description>
</property>
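
A small script can sanity-check these settings before the file is deployed. The sketch below is a Python example under a few assumptions: the properties are wrapped in the usual <configuration> root element of hdfs-site.xml, and HDFS-347 style short-circuit reads need both dfs.client.read.shortcircuit set to true and a non-empty dfs.domain.socket.path. The helper names load_properties and check_short_circuit are hypothetical. Since the file is parsed as XML, an unclosed <property> element is rejected with a parse error.

```python
import xml.etree.ElementTree as ET

# Sample fragment, wrapped in the standard <configuration> root element.
SAMPLE = """
<configuration>
  <property>
    <name>dfs.client.read.shortcircuit</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.domain.socket.path</name>
    <value>/home/stack/sockets/short_circuit_read_socket_PORT</value>
  </property>
</configuration>
"""


def load_properties(xml_text):
    """Parse hdfs-site.xml text into a {name: value} dict.

    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed.
    """
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is None:
            continue  # skip entries without a <name>
        props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def check_short_circuit(props):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    if props.get("dfs.client.read.shortcircuit", "false").lower() != "true":
        problems.append("dfs.client.read.shortcircuit is not true")
    if not props.get("dfs.domain.socket.path"):
        problems.append("dfs.domain.socket.path is not set")
    return problems


if __name__ == "__main__":
    print(check_short_circuit(load_properties(SAMPLE)))
```

The check only looks at the two properties that switch the feature on; the cache size and expiry settings are tuning values and are left alone.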

Reference:
How Improved Short-Circuit Local Reads Bring Better Performance and Security to Hadoop
HDFS Short-Circuit Local Reads
