

Sqoop 2 documentation for connector development

Review Request #15220 - Created Nov. 5, 2013 and updated

Masatake Iwasaki
sqoop2
SQOOP-1225
Reviewers
Sqoop
sqoop-sqoop2
added contents of connector developers guide.

 

Diff revision 2 (Latest)

docs/src/site/sphinx/ConnectorDevelopment.rst
Revision 918ca00 New Change
[20] 15 lines
[+20]
16

    
   
16

   
17
=============================
17
=============================
18
Sqoop 2 Connector Development
18
Sqoop 2 Connector Development
19
=============================
19
=============================
20

    
   
20

   
21
This document describes how to implement a connector for Sqoop 2.
21
This document describes how to implement a connector for Sqoop 2

    
   
22
using the code of the built-in connector ( ``GenericJdbcConnector`` ) as an example.
22

    
   
23

   

    
   
24
.. contents::
23

    
   
25

   
24
What is Connector?
26
What is Connector?
25
++++++++++++++++++
27
++++++++++++++++++
26

    
   
28

   
27
Connector provides interaction with external databases.
29
Connector provides interaction with external databases.
28
Connector reads data from databases for import,
30
Connector reads data from databases for import,
29
and writes data to databases for export.
31
and writes data to databases for export.
30
Interaction with Hadoop is taken care of by the common modules of the Sqoop 2 framework.
32
Interaction with Hadoop is taken care of by the common modules of the Sqoop 2 framework.
31

    
   
33

   
32

    
   
34

   
33
Connector Implementation
35
Connector Implementation
34
++++++++++++++++++++++++
36
++++++++++++++++++++++++
35

    
   
37

   
36
The SqoopConnector class defines functionality
38
The ``SqoopConnector`` class defines functionality
37
which must be provided by Connectors.
39
which must be provided by Connectors.
38
Each Connector must extend SqoopConnector and override the methods shown below.
40
Each Connector must extend ``SqoopConnector`` and override the methods shown below.
39
::
41
::
40

    
   
42

   
41
  public abstract String getVersion();
43
  public abstract String getVersion();
42
  public abstract ResourceBundle getBundle(Locale locale);
44
  public abstract ResourceBundle getBundle(Locale locale);
43
  public abstract Class getConnectionConfigurationClass();
45
  public abstract Class getConnectionConfigurationClass();
44
  public abstract Class getJobConfigurationClass(MJob.Type jobType);
46
  public abstract Class getJobConfigurationClass(MJob.Type jobType);
45
  public abstract Importer getImporter();
47
  public abstract Importer getImporter();
46
  public abstract Exporter getExporter();
48
  public abstract Exporter getExporter();
47
  public abstract Validator getValidator();
49
  public abstract Validator getValidator();
48
  public abstract MetadataUpgrader getMetadataUpgrader();
50
  public abstract MetadataUpgrader getMetadataUpgrader();
49

    
   
51

   
50
The getImporter method returns an Importer_ instance
52
The ``getImporter`` method returns an Importer_ instance
51
which is a placeholder for the modules needed for import.
53
which is a placeholder for the modules needed for import.
52

    
   
54

   
53
The getExporter method returns an Exporter_ instance
55
The ``getExporter`` method returns an Exporter_ instance
54
which is a placeholder for the modules needed for export.
56
which is a placeholder for the modules needed for export.
55

    
   
57

   
56
Methods such as getBundle, getConnectionConfigurationClass,
58
Methods such as ``getBundle`` , ``getConnectionConfigurationClass`` ,
57
getJobConfigurationClass and getValidator
59
``getJobConfigurationClass`` and ``getValidator``
58
relate to `Connector configurations`_ .
60
relate to `Connector configurations`_ .
59

    
   
61

   
60

    
   
62

   
61
Importer
63
Importer
62
========
64
========
63

    
   
65

   
64
Connector#getImporter method returns Importer instance
66
Connector's ``getImporter`` method returns ``Importer`` instance
65
which is a placeholder for the modules needed for import
67
which is a placeholder for the modules needed for import
66
such as Partitioner_ and Extractor_ .
68
such as Partitioner_ and Extractor_ .
67
Built-in GenericJdbcConnector defines Importer like this.
69
Built-in ``GenericJdbcConnector`` defines ``Importer`` like this.
68
::
70
::
69

    
   
71

   
70
  private static final Importer IMPORTER = new Importer(
72
  private static final Importer IMPORTER = new Importer(
71
      GenericJdbcImportInitializer.class,
73
      GenericJdbcImportInitializer.class,
72
      GenericJdbcImportPartitioner.class,
74
      GenericJdbcImportPartitioner.class,
[+20] [20] 12 lines
[+20]
85
---------
87
---------
86

    
   
88

   
87
Extractor (E for ETL) extracts data from external database and
89
Extractor (E for ETL) extracts data from external database and
88
writes it to Sqoop framework for import.
90
writes it to Sqoop framework for import.
89

    
   
91

   
90
Extractor must override the extract method.
92
Extractor must override the ``extract`` method.
91
::
93
::
92

    
   
94

   
93
  public abstract void extract(ExtractorContext context,
95
  public abstract void extract(ExtractorContext context,
94
                               ConnectionConfiguration connectionConfiguration,
96
                               ConnectionConfiguration connectionConfiguration,
95
                               JobConfiguration jobConfiguration,
97
                               JobConfiguration jobConfiguration,
96
                               Partition partition);
98
                               Partition partition);
97

    
   
99

   
98
The extract method extracts data from database in some way and
100
The ``extract`` method extracts data from database in some way and
99
writes it to DataWriter (provided by context) as `Intermediate representation`_ .
101
writes it to ``DataWriter`` (provided by context) as `Intermediate representation`_ .
100

    
   
102

   
101
Extractor must iterate in the extract method until the data from the database is exhausted.
103
Extractor must iterate in the ``extract`` method until the data from the database is exhausted.
102
::
104
::
103

    
   
105

   
104
  while (resultSet.next()) {
106
  while (resultSet.next()) {
105
    ...
107
    ...
106
    context.getDataWriter().writeArrayRecord(array);
108
    context.getDataWriter().writeArrayRecord(array);
107
    ...
109
    ...
108
  }
110
  }
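
The loop above can be tried outside of Sqoop with stand-in types.
The sketch below is only an illustration: ``RecordWriter`` and ``extractAll``
are hypothetical names that are not part of the Sqoop API, and an ``Iterator``
plays the role of the JDBC ``ResultSet``.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ExtractLoopSketch {
    /** Hypothetical stand-in for the writer that the context provides. */
    interface RecordWriter {
        void writeArrayRecord(Object[] record);
    }

    /** Drains the row source completely, writing one array record per row. */
    static int extractAll(Iterator<Object[]> rows, RecordWriter writer) {
        int count = 0;
        while (rows.hasNext()) {
            writer.writeArrayRecord(rows.next());
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Two in-memory rows stand in for the rows of a database table.
        List<Object[]> source = Arrays.asList(
            new Object[] {1, "alice"},
            new Object[] {2, "bob"});
        List<Object[]> sink = new ArrayList<>();
        int written = extractAll(source.iterator(), sink::add);
        System.out.println(written + " records written");
    }
}
```

The point of the sketch is that the extractor does not return until its source is empty; every row goes to the writer exactly once.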
109

    
   
111

   
110

    
   
112

   
111
Partitioner
113
Partitioner
112
-----------
114
-----------
113

    
   
115

   
114
Partitioner creates Partition instances based on configurations.
116
Partitioner creates ``Partition`` instances based on configurations.
115
The number of Partition instances is interpreted as the number of map tasks.
117
The number of ``Partition`` instances is decided
116
Partition instances are passed to Extractor_ as the argument of extract method.
118
based on the value users specified as the number of extractors

    
   
119
in job configuration.

    
   
120

   

    
   
121
``Partition`` instances are passed to Extractor_ as the argument of ``extract`` method.
117
Extractor_ determines which portion of the data to extract by Partition.
122
Extractor_ determines which portion of the data to extract by Partition.
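
As an illustration of how a partitioner might divide the work, the sketch below
splits a numeric key range into contiguous pieces, one per extractor.
It is not the algorithm used by ``GenericJdbcImportPartitioner``;
the class and method names are hypothetical, and overflow for very large ranges is ignored.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeSplitSketch {
    /**
     * Splits the half-open range [min, max) into at most count
     * contiguous, non-empty ranges that together cover the whole range.
     */
    static List<long[]> split(long min, long max, int count) {
        List<long[]> ranges = new ArrayList<>();
        long total = max - min;
        long lower = min;
        for (int i = 1; i <= count && lower < max; i++) {
            // Upper bound of the i-th piece; the last piece always ends at max.
            long upper = min + total * i / count;
            if (upper > lower) {
                ranges.add(new long[] {lower, upper});
                lower = upper;
            }
        }
        return ranges;
    }

    public static void main(String[] args) {
        for (long[] range : split(0, 10, 3)) {
            System.out.println(range[0] + " <= id < " + range[1]);
        }
    }
}
```

Each resulting range would become one partition, and the extractor that receives it reads only the rows inside that range.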
118

    
   
123

   
119
There is no actual convention for Partition classes
124
There is no actual convention for Partition classes
120
other than being actually Writable and toString()-able.
125
other than being actually ``Writable`` and ``toString()`` -able.
121
::
126
::
122

    
   
127

   
123
  public abstract class Partition {
128
  public abstract class Partition {
124
    public abstract void readFields(DataInput in) throws IOException;
129
    public abstract void readFields(DataInput in) throws IOException;
125
    public abstract void write(DataOutput out) throws IOException;
130
    public abstract void write(DataOutput out) throws IOException;
126
    public abstract String toString();
131
    public abstract String toString();
127
  }
132
  }
128

    
   
133

   
129
Connectors can define the design of Partition on their own.
134
Connectors can define the design of ``Partition`` on their own.
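
A minimal sketch of one possible design follows. ``RangePartition`` and ``roundTrip``
are hypothetical names, and the abstract class is repeated locally so that the example
compiles without Sqoop on the classpath. The round trip mimics what happens when the
framework serializes a partition and reads it back.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class PartitionSketch {
    /** Local copy of the abstract class shown above. */
    abstract static class Partition {
        public abstract void readFields(DataInput in) throws IOException;
        public abstract void write(DataOutput out) throws IOException;
        public abstract String toString();
    }

    /** Hypothetical partition covering lower <= id < upper. */
    static class RangePartition extends Partition {
        long lower;
        long upper;

        RangePartition() { }

        RangePartition(long lower, long upper) {
            this.lower = lower;
            this.upper = upper;
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // Fields must be read in the same order they were written.
            lower = in.readLong();
            upper = in.readLong();
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeLong(lower);
            out.writeLong(upper);
        }

        @Override
        public String toString() {
            return lower + " <= id < " + upper;
        }
    }

    /** Serializes a partition to bytes and rebuilds it from those bytes. */
    static RangePartition roundTrip(RangePartition original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            RangePartition copy = new RangePartition();
            copy.readFields(new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new RangePartition(0, 100)));
    }
}
```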
130

    
   
135

   
131

    
   
136

   
132
Initializer and Destroyer
137
Initializer and Destroyer
133
-------------------------
138
-------------------------
134

    
   
139

   
[+20] [20] 4 lines
[+20]
139

    
   
144

   
140

    
   
145

   
141
Exporter
146
Exporter
142
========
147
========
143

    
   
148

   
144
Connector#getExporter method returns Exporter instance
149
Connector's ``getExporter`` method returns ``Exporter`` instance
145
which is a placeholder for the modules needed for export
150
which is a placeholder for the modules needed for export
146
such as Loader_ .
151
such as Loader_ .
147
Built-in GenericJdbcConnector defines Exporter like this.
152
Built-in ``GenericJdbcConnector`` defines ``Exporter`` like this.
148
::
153
::
149

    
   
154

   
150
  private static final Exporter EXPORTER = new Exporter(
155
  private static final Exporter EXPORTER = new Exporter(
151
      GenericJdbcExportInitializer.class,
156
      GenericJdbcExportInitializer.class,
152
      GenericJdbcExportLoader.class,
157
      GenericJdbcExportLoader.class,
[+20] [20] 11 lines
[+20]
164
------
169
------
165

    
   
170

   
166
Loader (L for ETL) receives data from Sqoop framework and
171
Loader (L for ETL) receives data from Sqoop framework and
167
loads it to external database.
172
loads it to external database.
168

    
   
173

   
169
Loader must override the load method.
174
Loader must override the ``load`` method.
170
::
175
::
171

    
   
176

   
172
  public abstract void load(LoaderContext context,
177
  public abstract void load(LoaderContext context,
173
                            ConnectionConfiguration connectionConfiguration,
178
                            ConnectionConfiguration connectionConfiguration,
174
                            JobConfiguration jobConfiguration) throws Exception;
179
                            JobConfiguration jobConfiguration) throws Exception;
175

    
   
180

   
176
The load method reads data from DataReader (provided by context)
181
The ``load`` method reads data from ``DataReader`` (provided by context)
177
in `Intermediate representation`_ and loads it to database in some way.
182
in `Intermediate representation`_ and loads it to database in some way.
178

    
   
183

   
179
Loader must iterate in the load method until the data from DataReader is exhausted.
184
Loader must iterate in the ``load`` method until the data from ``DataReader`` is exhausted.
180
::
185
::
181

    
   
186

   
182
  while ((array = context.getDataReader().readArrayRecord()) != null) {
187
  while ((array = context.getDataReader().readArrayRecord()) != null) {
183
    ...
188
    ...
184
  }
189
  }
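
The same read-until-null pattern can be exercised with stand-in types.
In the sketch below, ``RecordReader`` and ``loadAll`` are hypothetical names
that are not part of the Sqoop API, and a ``List`` of strings stands in for the target table.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;

public class LoadLoopSketch {
    /** Hypothetical stand-in for the reader that the context provides. */
    interface RecordReader {
        /** Returns the next record, or null when no data is left. */
        Object[] readArrayRecord();
    }

    /** Reads records until the reader returns null and stores each one. */
    static int loadAll(RecordReader reader, List<String> target) {
        Object[] array;
        int count = 0;
        while ((array = reader.readArrayRecord()) != null) {
            target.add(Arrays.toString(array));
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Queue<Object[]> pending = new ArrayDeque<>();
        pending.add(new Object[] {1, "alice"});
        pending.add(new Object[] {2, "bob"});
        List<String> table = new ArrayList<>();
        // Queue.poll returns null on an empty queue, which ends the loop.
        int loaded = loadAll(pending::poll, table);
        System.out.println(loaded + " rows loaded: " + table);
    }
}
```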
[+20] [20] 9 lines
[+20]
194

    
   
199

   
195

    
   
200

   
196
Connector Configurations
201
Connector Configurations
197
++++++++++++++++++++++++
202
++++++++++++++++++++++++
198

    
   
203

   

    
   
204
Connector specifications

    
   
205
========================

    
   
206

   

    
   
207
Framework of the Sqoop loads definitions of connectors

    
   
208
from the file named ``sqoopconnector.properties``

    
   
209
which each connector implementation provides.

    
   
210
::

    
   
211

   

    
   
212
  # Generic JDBC Connector Properties

    
   
213
  org.apache.sqoop.connector.class = org.apache.sqoop.connector.jdbc.GenericJdbcConnector

    
   
214
  org.apache.sqoop.connector.name = generic-jdbc-connector
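
To show how the two keys can be read, the sketch below parses the same text with
``java.util.Properties``. It does not reproduce how the Sqoop framework locates or
loads the file; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class ConnectorPropertiesSketch {
    static final String CLASS_KEY = "org.apache.sqoop.connector.class";
    static final String NAME_KEY = "org.apache.sqoop.connector.name";

    /** Parses properties text such as the contents of sqoopconnector.properties. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String text =
            "# Generic JDBC Connector Properties\n"
            + "org.apache.sqoop.connector.class = "
            + "org.apache.sqoop.connector.jdbc.GenericJdbcConnector\n"
            + "org.apache.sqoop.connector.name = generic-jdbc-connector\n";
        Properties props = parse(text);
        System.out.println(props.getProperty(NAME_KEY)
            + " -> " + props.getProperty(CLASS_KEY));
    }
}
```

Whitespace around the ``=`` separator is ignored by ``Properties.load``, so the values come back without leading spaces.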

    
   
215

   

    
   
216

   
199
Configurations
217
Configurations
200
==============
218
==============
201

    
   
219

   
202
The definitions of the configurations are represented
220
Implementation of ``SqoopConnector`` overrides methods such as
203
by models defined in org.apache.sqoop.model package.
221
``getConnectionConfigurationClass`` and ``getJobConfigurationClass``

    
   
222
that return configuration classes.

    
   
223
::

    
   
224

   

    
   
225
  @Override

    
   
226
  public Class getConnectionConfigurationClass() {

    
   
227
    return ConnectionConfiguration.class;

    
   
228
  }

    
   
229

   

    
   
230
  @Override

    
   
231
  public Class getJobConfigurationClass(MJob.Type jobType) {

    
   
232
    switch (jobType) {

    
   
233
      case IMPORT:

    
   
234
        return ImportJobConfiguration.class;

    
   
235
      case EXPORT:

    
   
236
        return ExportJobConfiguration.class;

    
   
237
      default:

    
   
238
        return null;

    
   
239
    }

    
   
240
  }

    
   
241

   

    
   
242
Configurations are represented

    
   
243
by models defined in ``org.apache.sqoop.model`` package.

    
   
244
Annotations such as

    
   
245
``ConfigurationClass`` , ``FormClass`` , ``Form`` and ``Input``

    
   
246
are provided for defining the configurations of each connector

    
   
247
using these models.

    
   
248

   

    
   
249
``ConfigurationClass`` is a placeholder for ``FormClasses`` .

    
   
250
::

    
   
251

   

    
   
252
  @ConfigurationClass

    
   
253
  public class ConnectionConfiguration {
204

    
   
254

   

    
   
255
    @Form public ConnectionForm connection;
205

    
   
256

   
206
ConnectionConfigurationClass
257
    public ConnectionConfiguration() {
207
----------------------------
258
      connection = new ConnectionForm();

    
   
259
    }

    
   
260
  }
208

    
   
261

   

    
   
262
Each ``FormClass`` defines names and types of configs.

    
   
263
::
209

    
   
264

   
210
JobConfigurationClass
265
  @FormClass
211
---------------------
266
  public class ConnectionForm {

    
   
267
    @Input(size = 128) public String jdbcDriver;

    
   
268
    @Input(size = 128) public String connectionString;

    
   
269
    @Input(size = 40)  public String username;

    
   
270
    @Input(size = 40, sensitive = true) public String password;

    
   
271
    @Input public Map<String, String> jdbcProperties;

    
   
272
  }
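
Annotated fields like these can be discovered through reflection.
The sketch below only illustrates the idea and is not the Sqoop implementation:
it declares its own ``Input`` annotation with assumed defaults
(``size`` of -1, ``sensitive`` false) instead of using the one from
``org.apache.sqoop.model``, and ``describe`` is a hypothetical helper.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Map;
import java.util.TreeMap;

public class FormInspectionSketch {
    /** Stand-in annotation; the default values are assumptions. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Input {
        int size() default -1;
        boolean sensitive() default false;
    }

    /** Same fields as the form shown above. */
    static class ConnectionForm {
        @Input(size = 128) public String jdbcDriver;
        @Input(size = 128) public String connectionString;
        @Input(size = 40) public String username;
        @Input(size = 40, sensitive = true) public String password;
        @Input public Map<String, String> jdbcProperties;
    }

    /** Maps each annotated field name to a short description of the input. */
    static Map<String, String> describe(Class<?> formClass) {
        // TreeMap gives a stable order; reflection does not guarantee one.
        Map<String, String> result = new TreeMap<>();
        for (Field field : formClass.getDeclaredFields()) {
            Input input = field.getAnnotation(Input.class);
            if (input == null) {
                continue; // not a configuration input
            }
            String description = field.getType().getSimpleName()
                + " size=" + input.size()
                + (input.sensitive() ? " sensitive" : "");
            result.put(field.getName(), description);
        }
        return result;
    }

    public static void main(String[] args) {
        describe(ConnectionForm.class).forEach(
            (name, description) -> System.out.println(name + ": " + description));
    }
}
```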
212

    
   
273

   
213

    
   
274

   
214
ResourceBundle
275
ResourceBundle
215
==============
276
==============
216

    
   
277

   
217
Resources for Configurations_ are stored in properties file
278
Resources used by client user interfaces are defined in a properties file.
218
accessed by getBundle method of the Connector.
279
::

    
   
280

   

    
   
281
  # jdbc driver

    
   
282
  connection.jdbcDriver.label = JDBC Driver Class

    
   
283
  connection.jdbcDriver.help = Enter the fully qualified class name of the JDBC \

    
   
284
                     driver that will be used for establishing this connection.

    
   
285

   

    
   
286
  # connect string

    
   
287
  connection.connectionString.label = JDBC Connection String

    
   
288
  connection.connectionString.help = Enter the value of JDBC connection string to be \

    
   
289
                     used by this connector for creating connections.

    
   
290

   

    
   
291
  ...

    
   
292

   

    
   
293
Those resources are loaded by the ``getBundle`` method of the connector.

    
   
294
::

    
   
295

   

    
   
296
  @Override

    
   
297
  public ResourceBundle getBundle(Locale locale) {

    
   
298
    return ResourceBundle.getBundle(

    
   
299
    GenericJdbcConnectorConstants.RESOURCE_BUNDLE_NAME, locale);

    
   
300
  }
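
The keys in the properties file follow the pattern ``<form>.<input>.label`` and
``<form>.<input>.help``. The sketch below looks up a label that way. It builds the
bundle from a string with ``PropertyResourceBundle`` so that it runs without a file
on the classpath; ``fromText`` and ``label`` are hypothetical helper names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class BundleLookupSketch {
    /** Builds a bundle from properties text instead of a classpath resource. */
    static ResourceBundle fromText(String text) {
        try {
            return new PropertyResourceBundle(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Looks up the label of one input using the key pattern form.input.label. */
    static String label(ResourceBundle bundle, String form, String input) {
        return bundle.getString(form + "." + input + ".label");
    }

    public static void main(String[] args) {
        ResourceBundle bundle = fromText(
            "connection.jdbcDriver.label = JDBC Driver Class\n"
            + "connection.connectionString.label = JDBC Connection String\n");
        System.out.println(label(bundle, "connection", "jdbcDriver"));
    }
}
```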
219

    
   
301

   
220

    
   
302

   
221
Validator
303
Validator
222
=========
304
=========
223

    
   
305

   
224
Validator validates configurations set by users.
306
Validator validates configurations set by users.
225

    
   
307

   
226

    
   
308

   
227
Internals of Sqoop2 MapReduce Job
309
Internals of Sqoop2 MapReduce Job
228
+++++++++++++++++++++++++++++++++
310
+++++++++++++++++++++++++++++++++
229

    
   
311

   
230
Sqoop 2 provides common MapReduce modules such as SqoopMapper and SqoopReducer
312
Sqoop 2 provides common MapReduce modules such as ``SqoopMapper`` and ``SqoopReducer``
231
for both import and export.
313
for both import and export.
232

    
   
314

   
233
- InputFormat creates splits using Partitioner.
315
- For import, ``Extractor`` provided by connector extracts data from databases,

    
   
316
  and ``Loader`` provided by Sqoop2 loads data into Hadoop.
234

    
   
317

   
235
- SqoopMapper invokes Extractor's extract method.
318
- For export, ``Extractor`` provided by Sqoop2 extracts data from Hadoop,

    
   
319
  and ``Loader`` provided by connector loads data into databases.
236

    
   
320

   
237
- SqoopReducer does no actual work.
321
The diagram below describes the initialization phase of IMPORT job.

    
   
322
``SqoopInputFormat`` creates splits using ``Partitioner`` .

    
   
323
::
238

    
   
324

   
239
- OutputFormat invokes Loader's load method (via SqoopOutputFormatLoadExecutor).
325
      ,----------------.          ,-----------.

    
   
326
      |SqoopInputFormat|          |Partitioner|

    
   
327
      `-------+--------'          `-----+-----'

    
   
328
   getSplits  |                         |

    
   
329
  ----------->|                         |

    
   
330
              |      getPartitions      |

    
   
331
              |------------------------>|

    
   
332
              |                         |         ,---------.

    
   
333
              |                         |-------> |Partition|

    
   
334
              |                         |         `----+----'

    
   
335
              |<- - - - - - - - - - - - |              |

    
   
336
              |                         |              |          ,----------.

    
   
337
              |-------------------------------------------------->|SqoopSplit|

    
   
338
              |                         |              |          `----+-----'
240

    
   
339

   
241
.. todo: sequence diagram like figure.
340
The diagram below describes the map phase of IMPORT job.

    
   
341
``SqoopMapper`` invokes extractor's ``extract`` method.

    
   
342
::

    
   
343

   

    
   
344
      ,-----------.

    
   
345
      |SqoopMapper|

    
   
346
      `-----+-----'

    
   
347
     run    |

    
   
348
  --------->|                                   ,-------------.

    
   
349
            |---------------------------------->|MapDataWriter|

    
   
350
            |                                   `------+------'

    
   
351
            |                ,---------.               |

    
   
352
            |--------------> |Extractor|               |

    
   
353
            |                `----+----'               |

    
   
354
            |      extract        |                    |

    
   
355
            |-------------------->|                    |

    
   
356
            |                     |                    |

    
   
357
           read from DB           |                    |

    
   
358
  <-------------------------------|      write*        |

    
   
359
            |                     |------------------->|

    
   
360
            |                     |                    |           ,----.

    
   
361
            |                     |                    |---------->|Data|

    
   
362
            |                     |                    |           `-+--'

    
   
363
            |                     |                    |

    
   
364
            |                     |                    |      context.write

    
   
365
            |                     |                    |-------------------------->

    
   
366

   

    
   
367
The diagram below describes the reduce phase of EXPORT job.

    
   
368
``OutputFormat`` invokes loader's ``load`` method (via ``SqoopOutputFormatLoadExecutor`` ).

    
   
369
::
242

    
   
370

   
243
For import, Extractor provided by Connector extracts data from databases,
371
    ,-------.  ,---------------------.
244
and Loader provided by Sqoop2 loads data into Hadoop.
372
    |Reducer|  |SqoopNullOutputFormat|

    
   
373
    `---+---'  `----------+----------'

    
   
374
        |                 |   ,-----------------------------.

    
   
375
        |                 |-> |SqoopOutputFormatLoadExecutor|

    
   
376
        |                 |   `--------------+--------------'        ,----.

    
   
377
        |                 |                  |---------------------> |Data|

    
   
378
        |                 |                  |                       `-+--'

    
   
379
        |                 |                  |   ,-----------------.   |

    
   
380
        |                 |                  |-> |SqoopRecordWriter|   |

    
   
381
      getRecordWriter     |                  |   `--------+--------'   |

    
   
382
  ----------------------->| getRecordWriter  |            |            |

    
   
383
        |                 |----------------->|            |            |     ,--------------.

    
   
384
        |                 |                  |-----------------------------> |ConsumerThread|

    
   
385
        |                 |                  |            |            |     `------+-------'

    
   
386
        |                 |<- - - - - - - - -|            |            |            |    ,------.

    
   
387
  <- - - - - - - - - - - -|                  |            |            |            |--->|Loader|

    
   
388
        |                 |                  |            |            |            |    `--+---'

    
   
389
        |                 |                  |            |            |            |       |

    
   
390
        |                 |                  |            |            |            | load  |

    
   
391
   run  |                 |                  |            |            |            |------>|

    
   
392
  ----->|                 |     write        |            |            |            |       |

    
   
393
        |------------------------------------------------>| setContent |            | read* |

    
   
394
        |                 |                  |            |----------->| getContent |<------|

    
   
395
        |                 |                  |            |            |<-----------|       |

    
   
396
        |                 |                  |            |            |            | - - ->|

    
   
397
        |                 |                  |            |            |            |       | write into DB

    
   
398
        |                 |                  |            |            |            |       |-------------->
245

    
   
399

   
246
For export, Extractor provided by Sqoop2 extracts data from Hadoop,

   
247
and Loader provided by Connector loads data into databases.

   
248

    
   
400

   
249

    
   
401

   
250
.. _`Intermediate representation`: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation
402
.. _`Intermediate representation`: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation