

Review request for SQOOP-1192 Add option "--skip-dist-cache" to allow Sqoop not copying jars in %SQOOP_HOME%\lib folder when launched by Oozie and use Oozie share lib

Review Request #14085 - Created Sept. 11, 2013 and updated

Shuaishuai Nie
trunk
SQOOP-1192
Reviewers
Sqoop
sqoop-trunk
Currently, Sqoop copies the jar files in the %SQOOP_HOME%\lib folder to the job cache every time a Sqoop job is launched. When Oozie launches a Sqoop job, this behavior can be optimized by adding these jars to the Oozie Sqoop sharelib. In that case, the jar files in the share lib only need to be localized to each worker node once and are then reused by all Sqoop jobs launched by Oozie. This can avoid a large amount of disk I/O on the worker nodes when Sqoop is run through Oozie. To enable this, Sqoop needs an option that lets the job skip adding the lib jars to the job cache. For now, this option should only be used for Sqoop jobs started by Oozie. The attached patch introduces the "--skip-dist-cache" option to enable this feature.
Tested the new option with an Oozie-Sqoop workflow to ensure it doesn't break Sqoop library dependencies when launched by Oozie.
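The behavior described above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the code from the attached patch: the class DistCacheSketch, its nested Options class, and the method jarsToCache are hypothetical names, and the jar names are made up. Only the "--skip-dist-cache" flag itself comes from this review request.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: how a skip flag could gate copying lib jars to the job cache.
final class DistCacheSketch {

    // Hypothetical stand-in for the options object; not the real SqoopOptions API.
    static final class Options {
        final boolean skipDistCache;

        Options(boolean skipDistCache) {
            this.skipDistCache = skipDistCache;
        }

        // Records whether --skip-dist-cache appears anywhere in the argument list.
        static Options parse(String... args) {
            return new Options(Arrays.asList(args).contains("--skip-dist-cache"));
        }
    }

    // Returns the jars that would be added to the job cache for this launch.
    static List<String> jarsToCache(Options opts, List<String> libJars) {
        if (opts.skipDistCache) {
            // Assumption: the Oozie share lib already provides these jars.
            return new ArrayList<>();
        }
        return new ArrayList<>(libJars);
    }

    public static void main(String[] args) {
        // Made-up jar names standing in for the contents of the lib folder.
        List<String> libJars = Arrays.asList("avro.jar", "commons-io.jar");

        // Default launch: every lib jar is scheduled for the job cache.
        System.out.println(jarsToCache(Options.parse("import", "--table", "employees"), libJars));

        // Launch with the new flag: nothing is copied.
        System.out.println(jarsToCache(
                Options.parse("import", "--skip-dist-cache", "--table", "employees"), libJars));
    }
}
```

The sketch keeps the default unchanged, so a job started outside Oozie still gets its dependencies, and only a launch that passes the flag explicitly skips the copy.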
src/docs/user/import.txt
Revision 71b50d8 New Change
@@ -206,10 +206,23 @@
 For example, +\--split-by employee_id+. Sqoop cannot currently split on
 multi-column indices. If your table has no index column, or has a
 multi-column key, then you must also manually choose a splitting
 column.
 
+Controlling Distributed Cache
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Sqoop will copy the jars in $SQOOP_HOME/lib folder to job cache every
+time when start a Sqoop job. When launched by Oozie this is unnecessary
+since Oozie use its own Sqoop share lib which keeps Sqoop dependencies
+in the distributed cache. Oozie will do the localization on each
+worker node for the Sqoop dependencies only once during the first Sqoop
+job and reuse the jars on worker node for subsquencial jobs. Using
+option +--skip-dist-cache+ in Sqoop command when launched by Oozie will
+skip the step which Sqoop copies its dependencies to job cache and save
+massive I/O.
+
 Controlling the Import Process
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 By default, the import process will use JDBC which provides a
 reasonable cross-vendor import channel. Some databases can perform
src/java/org/apache/sqoop/SqoopOptions.java
Revision 01805f9 New Change
 
src/java/org/apache/sqoop/mapreduce/JobBase.java
Revision 322df1c New Change
 
src/java/org/apache/sqoop/mapreduce/hcat/SqoopHCatUtilities.java
Revision b05f587 New Change
 
src/java/org/apache/sqoop/tool/BaseSqoopTool.java
Revision ebb1857 New Change
 
src/test/com/cloudera/sqoop/TestSqoopOptions.java
Revision 03e2504 New Change
 