Review Board 1.7.22


FLUME-1507: Have "Topology Design Considerations" in User Guide

Review Request #6889 - Created Sept. 1, 2012 and updated

Patrick Wendell
FLUME-1507
Reviewers
Flume
jarcec
flume-git
This patch adds a topology design consideration section to the user guide.
Documentation only.

Diff revision 2 (Latest)

flume-ng-doc/sphinx/FlumeUserGuide.rst
Revision 2d64249
1941
=======================  =======  ========================================
1941
=======================  =======  ========================================
1942
**type**                 --       The component type name, has to be FQCN
1942
**type**                 --       The component type name, has to be FQCN
1943
=======================  =======  ========================================
1943
=======================  =======  ========================================

Topology Design Considerations
==============================

Flume is very flexible and allows a large range of possible deployment
scenarios. If you plan to use Flume in a large, production deployment, it is
prudent to spend some time thinking about how to express your problem in
terms of a Flume topology. This section covers a few considerations.

Is Flume a good fit for your problem?
-------------------------------------

If you need to ingest textual log data into Hadoop/HDFS, then Flume is the
right fit for your problem, full stop. For other use cases, here are some
guidelines:

Flume is designed to transport and ingest regularly generated event data over
relatively stable, potentially complex topologies. The notion of "event data"
is very broadly defined. To Flume, an event is just a generic blob of bytes.
There are some limits on how large an event can be -- for instance, it cannot
be larger than what you can store in memory or on disk on a single machine --
but in practice, Flume events can be everything from textual log entries to
image files. The key property of an event is that it is generated in a
continuous, streaming fashion. If your data is not regularly generated
(i.e. you are trying to do a single bulk load of data into a Hadoop cluster),
then Flume will still work, but it is probably overkill for your situation.
Flume likes relatively stable topologies. Your topologies do not need to be
immutable, because Flume can deal with changes in topology without losing data
and can also tolerate periodic reconfiguration due to fail-over or
provisioning. It probably won't work well if you plan to change topologies
every day, because reconfiguration takes some thought and overhead.

Flow reliability in Flume
-------------------------

The reliability of a Flume flow depends on several factors. By adjusting these
factors, you can achieve a wide array of reliability options with Flume.

**What type of channel you use.** Flume has both durable channels (those which
will persist data to disk) and non-durable channels (those which will lose
data if a machine fails). Durable channels use disk-based storage, and data
stored in such channels will persist across machine restarts or
non-disk-related failures.

**Whether your channels are sufficiently provisioned for the workload.**
Channels in Flume act as buffers at various hops. These buffers have a fixed
capacity, and once that capacity is full you will create back pressure on
earlier points in the flow. If this pressure propagates to the source of the
flow, Flume will become unavailable and may lose data.

**Whether you use redundant topologies.** Flume lets you replicate flows
across redundant topologies. This can provide a very easy source of fault
tolerance, one which overcomes both disk and machine failures.

*The best way to think about reliability in a Flume topology is to consider
various failure scenarios and their outcomes.* What happens if a disk fails?
What happens if a machine fails? What happens if your terminal sink
(e.g. HDFS) goes down for some time and you have back pressure? The space of
possible designs is huge, but the underlying questions you need to ask are
just a handful.
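
For example, a non-durable memory channel and a durable file channel might be
configured side by side as follows (the agent and channel names, directories,
and capacity values here are purely illustrative, not recommendations)::

  # Non-durable: events are lost if the agent's machine fails
  a1.channels.c1.type = memory
  a1.channels.c1.capacity = 10000

  # Durable: events survive agent restarts and non-disk failures
  a1.channels.c2.type = file
  a1.channels.c2.checkpointDir = /var/flume/checkpoint
  a1.channels.c2.dataDirs = /var/flume/data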

    
   

Flume topology design
---------------------

The first step in designing a Flume topology is to enumerate all sources
and destinations (terminal sinks) for your data. These will define the edge
points of your topology. The next consideration is whether to introduce
intermediate aggregation tiers or event routing. If you are collecting data
from a large number of sources, it can be helpful to aggregate the data in
order to simplify ingestion at the terminal sink. An aggregation tier can
also smooth out burstiness from sources or unavailability at sinks, by
acting as a buffer. If you are routing data between different locations,
you may also want to split flows at various points: this creates
sub-topologies which may themselves include aggregation points.
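
A minimal two-tier sketch of such an aggregation point: first-tier agents
forward events over Avro to a single aggregating agent (the hostnames, ports,
and component names below are illustrative)::

  # First-tier agent: sends events to the aggregation tier
  agent1.sinks.k1.type = avro
  agent1.sinks.k1.hostname = collector.example.com
  agent1.sinks.k1.port = 4545

  # Aggregation-tier agent: receives events from many first-tier agents
  collector.sources.r1.type = avro
  collector.sources.r1.bind = 0.0.0.0
  collector.sources.r1.port = 4545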

    
   

Sizing a Flume deployment
-------------------------

Once you have an idea of what your topology will look like, the next question
is how much hardware and networking capacity is needed. This starts by
quantifying how much data you generate. That is not always a simple task!
Most data streams are bursty (for instance, due to diurnal patterns) and
potentially unpredictable. A good starting point is to think about the
maximum throughput you'll have in each tier of the topology, both in terms
of *events per second* and *bytes per second*. Once you know the required
throughput of a given tier, you can calculate a lower bound on how many
nodes you require for that tier. To determine attainable throughput, it's
best to experiment with Flume on your hardware, using synthetic or sampled
event data. In general, disk-based channels should get tens of MB/s and
memory-based channels should get hundreds of MB/s or more. Performance will
vary widely, however, depending on hardware and operating environment.

Sizing aggregate throughput gives you a lower bound on the number of nodes
you will need for each tier. There are several reasons to have additional
nodes, such as increased redundancy and better ability to absorb bursts in
load.
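
The lower-bound arithmetic above is just a ceiling division; the throughput
figures in this sketch are made-up placeholders, not benchmarks, and the
``nodes_needed`` helper is illustrative:

```python
import math

def nodes_needed(peak_mb_per_sec, per_node_mb_per_sec):
    # Lower bound on node count for a tier: divide the tier's peak
    # aggregate throughput by the measured per-node throughput, rounding
    # up because a fractional node still means a whole machine.
    return math.ceil(peak_mb_per_sec / per_node_mb_per_sec)

# Suppose a tier must absorb a 500 MB/s peak, and experiments show one
# node with a disk-based channel sustains roughly 40 MB/s:
print(nodes_needed(500, 40))  # -> 13
```

Remember this is only a floor; redundancy and burst headroom argue for more
nodes than the calculation yields.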

Troubleshooting
===============

Handling agent failures