

Additional configuration parameters for HDFSSink

Review Request #10606 - Created April 18, 2013 and updated

Submitter: Thiruvalluvan M. G.
Bugs: FLUME-2003
Reviewers: Flume
Repository: flume-git
This patch adds additional configuration parameters to the HDFS Sink, for choosing the HDFS block size, the HDFS replication factor and the stream buffer size. These can now be chosen on a per-sink basis.
I've tested this as follows:

(1) Without specifying these parameters, things work as before.
(2) With the new parameters specified, the new HDFS files have the specified block size, replication factor, or both.
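
As a minimal sketch of how a sink might use the new parameters (the agent, channel and sink names and the values below are illustrative only; the three ``hdfs.hdfsDfs*`` property names are the ones documented in the FlumeUserGuide.rst change below):

.. code-block:: properties

    # Hypothetical agent "a1" with one channel "c1" and one HDFS sink "k1".
    # Only the three hdfs.hdfsDfs* keys come from this patch; a complete agent
    # would also need a source feeding channel c1.
    a1.channels = c1
    a1.sinks = k1
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://namenode/flume/webdata/

    # New per-sink overrides; if omitted, the HDFS client defaults on the host apply.
    a1.sinks.k1.hdfs.hdfsDfsBlockSize = 134217728
    a1.sinks.k1.hdfs.hdfsDfsReplication = 2
    a1.sinks.k1.hdfs.hdfsDfsStreamBufferSize = 8192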
flume-ng-doc/sphinx/FlumeUserGuide.rst
Revision 693c0d7 New Change
Changed excerpt (new revision shown). The patch widens the parameter table and adds the ``hdfs.hdfsDfsBlockSize``, ``hdfs.hdfsDfsReplication`` and ``hdfs.hdfsDfsStreamBufferSize`` rows:

          "timestamp" must exist among the headers of the event (unless ``hdfs.useLocalTimeStamp`` is set to ``true``). One way to add
          this automatically is to use the TimestampInterceptor.

============================  ============  ======================================================================
Name                          Default       Description
============================  ============  ======================================================================
**channel**                   --
**type**                      --            The component type name, needs to be ``hdfs``
**hdfs.path**                 --            HDFS directory path (eg hdfs://namenode/flume/webdata/)
hdfs.filePrefix               FlumeData     Name prefixed to files created by Flume in hdfs directory
hdfs.fileSuffix               --            Suffix to append to file (eg ``.avro`` - *NOTE: period is not automatically added*)
hdfs.inUsePrefix              --            Prefix that is used for temporal files that flume actively writes into
hdfs.inUseSuffix              ``.tmp``      Suffix that is used for temporal files that flume actively writes into
hdfs.rollInterval             30            Number of seconds to wait before rolling current file
                                            (0 = never roll based on time interval)
hdfs.rollSize                 1024          File size to trigger roll, in bytes (0: never roll based on file size)
hdfs.rollCount                10            Number of events written to file before it rolled
                                            (0 = never roll based on number of events)
hdfs.idleTimeout              0             Timeout after which inactive files get closed
                                            (0 = disable automatic closing of idle files)
hdfs.batchSize                100           number of events written to file before it is flushed to HDFS
hdfs.codeC                    --            Compression codec. one of following : gzip, bzip2, lzo, lzop, snappy
hdfs.fileType                 SequenceFile  File format: currently ``SequenceFile``, ``DataStream`` or ``CompressedStream``
                                            (1)DataStream will not compress output file and please don't set codeC
                                            (2)CompressedStream requires set hdfs.codeC with an available codeC
hdfs.maxOpenFiles             5000          Allow only this number of open files. If this number is exceeded, the oldest file is closed.
hdfs.minBlockReplicas         --            Specify minimum number of replicas per HDFS block. If not specified, it comes from the default Hadoop config in the classpath.
hdfs.writeFormat              --            "Text" or "Writable"
hdfs.callTimeout              10000         Number of milliseconds allowed for HDFS operations, such as open, write, flush, close.
                                            This number should be increased if many HDFS timeout operations are occurring.
hdfs.threadsPoolSize          10            Number of threads per HDFS sink for HDFS IO ops (open, write, etc.)
hdfs.rollTimerPoolSize        1             Number of threads per HDFS sink for scheduling timed file rolling
hdfs.kerberosPrincipal        --            Kerberos user principal for accessing secure HDFS
hdfs.kerberosKeytab           --            Kerberos keytab for accessing secure HDFS
hdfs.proxyUser
hdfs.round                    false         Should the timestamp be rounded down (if true, affects all time based escape sequences except %t)
hdfs.roundValue               1             Rounded down to the highest multiple of this (in the unit configured using ``hdfs.roundUnit``), less than current time.
hdfs.roundUnit                second        The unit of the round down value - ``second``, ``minute`` or ``hour``.
hdfs.timeZone                 Local Time    Name of the timezone that should be used for resolving the directory path, e.g. America/Los_Angeles.
hdfs.useLocalTimeStamp        false         Use the local time (instead of the timestamp from the event header) while replacing the escape sequences.
hdfs.hdfsDfsBlockSize         --            HDFS block-size for the files created. If omitted, the default block size for the HDFS client on the host will be used.
hdfs.hdfsDfsReplication       --            HDFS replication-factor for the files created. If omitted, the default replication for the HDFS client on the host will be used.
hdfs.hdfsDfsStreamBufferSize  --            Size of buffer to stream files. If omitted, the default value for the HDFS client on the host will be used.
serializer                    ``TEXT``      Other possible options include ``avro_event`` or the
                                            fully-qualified class name of an implementation of the
                                            ``EventSerializer.Builder`` interface.
serializer.*
============================  ============  ======================================================================

Example for agent named a1:

.. code-block:: properties
flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/AbstractHDFSWriter.java
Revision ff4f223 New Change
 
flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSCompressedDataStream.java
Revision 0c618b5 New Change
 
flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSDataStream.java
Revision c87fafe New Change
 
flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSSequenceFile.java
Revision 1a401d6 New Change
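
For reviewers less familiar with the HDFS client, a rough sketch of the underlying API the changed writers would need to drive (this is not the patch's code; it only shows that Hadoop's stock ``FileSystem.create()`` overload accepts block size, replication and buffer size per file, which is what the new per-sink settings map onto):

.. code-block:: java

    // Illustration only (not from this patch): per-file overrides on the
    // Hadoop FileSystem API, corresponding to the new hdfs.hdfsDfs* settings.
    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PerFileHdfsSettingsSketch {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode/"), conf);

        long blockSize = 128L * 1024 * 1024; // would come from hdfs.hdfsDfsBlockSize
        short replication = 2;               // would come from hdfs.hdfsDfsReplication
        int bufferSize = 8192;               // would come from hdfs.hdfsDfsStreamBufferSize

        // create(path, overwrite, bufferSize, replication, blockSize) applies the
        // values to this file only; when they are not configured, the client-side
        // defaults from the Hadoop configuration are used instead.
        FSDataOutputStream out = fs.create(new Path("/flume/webdata/FlumeData.example"),
            true, bufferSize, replication, blockSize);
        out.close();
      }
    }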
 