Review Board 1.7.22


SQOOP-777. Sqoop2: Pluggable Intermediate Data Format

Review Request #12936 - Created July 25, 2013 and updated

Hari Shreedharan
SQOOP-777
Reviewers
Sqoop
sqoop-sqoop2
Implemented a pluggable intermediate data format that decouples the internal representation of the data from the connector and the output formats. Connectors can choose to implement and support a format that is more efficient for them. Also separated the SqoopWritable so that we can use the intermediate data format independent of (current) Hadoop. 

I ran a full build - all tests including integration tests pass. I have not added any new tests, yet. I will add unit tests for the new classes. Also, I have not tried running this on an actual cluster - so things may be broken. I'd like some initial feedback based on the current patch. 

I also implemented escaping of characters. There is some work remaining to support binary format, but it is mostly integration, the basic implementation is in place.

 
Total:
55
Open:
45
Resolved:
10
Dropped:
0
Status:
From:
Description From Last Updated Status
Do you think this should be writeContent (or conversely the method in DataReader should be changed to readRecord instead of ... Venkat Ranganathan Aug. 1, 2013, 4:42 a.m. Open
Extreme nit: Read data from the execution framework... (not necessary MR) Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Extreme nit: Read data from execution framework... Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Nit: Read data from execution framework as a native format. This should be independent native format right? Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Extreme nit: Write an array of objects into the execution framework... Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Extreme nit: Write data into execution framework.... Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Can we add descriptive java doc describing what exactly are expecting to validate? Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Those imports seems to be unused. Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
We should also take into account exportJobConfiguration.table.schema when building the query. Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Let's clean up unused imports after this refactoring. Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
This method seems to be generated automatically, but it seems to be removing the fail() call, is it intentional? Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
The wiki [1] is also saying that the 0x5C should be substituted. 1: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Similarly as above, it seems that we are missing the \\ replacement for 0x5C. Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Is it wise to consume the exception here without any counters or other reporting except of log message? It also ... Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
The setData() method is conditionally running the validations, are they intentionally skipped here in setTextData()? Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
I'm afraid that String.split() won't work correctly for following example input: 1,"Hi, here is Jarcec" That contains two columns, but ... Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Nit: Please pass the correct array size instead of the zero. Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
I would suggest to be strict here and accept only upper case variant of NULL. Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
I would suggest to die quickly for unsupported data types rather than just silently pass them as a un-escaped strings. Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Is there a reason why to do our own join rather than StringUtils.join()? Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Is expected and desirable that user can't reset schema to NULL? Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Nit: Would it make more sense to have IntermediateDataFormat (without type here) or CSVIntermediateDataFormat since we are not allowing nothing ... Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
This will convert the binary string to an array of 0 and 1, right? The intermediate data format is suggesting ... Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Since the regular expressions are pre-build and therefore the PatternSyntaxException should not be thrown, is there any other exception that ... Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
The conditional escaped argument seems quite dangerous to me - what if the user has saved the data on disk ... Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Since the regular expressions are pre-build and therefore the PatternSyntaxException should not be thrown, is there any other exception that ... Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Since couple of the methods, such as setSchema(), getSchema(), validateData() will have to be implemented by every IDF in a ... Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Can we javadoc for those two methods? Also it seems that they are not used, so I'm wondering if we ... Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Can we make this a Class instead of String? Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Nit: Can we keep the style from previous lines and put this on single line? Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
The length variable seems to be used only readFields method, so I'm wondering if local variable wouldn't be better fit? Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Casting the data to UTF might be quite dangerous, especially for binary values as they will be interpreted and possibly ... Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
This import do not seem to be relevant any more. Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Nit: Please keep the line on a single line. Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Nit: This change do not seem necessary. Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
This import do not seem to be relevant any more. Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Nit: Can we please put this on single line? Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
This import do not seem to be relevant any more. Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Nit: Please put the line on a single line. Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
This import do not seem relevant any more. Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Nit: Please put this on a single line. Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
It seems that we are building the same schema in multiple test cases. Would it make sense to create a ... Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Hari got 20 experience points for using awesome text messages in the tests! Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
I'm not sure whether we want to make all connectors depending on the SDK. The SDK should be light library ... Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Nit: Please put this on a single line. Jarek Cecho Aug. 26, 2013, 11:54 p.m. Open
Review request changed
Updated (Aug. 1, 2013, 3:41 a.m.)
Incorporate review feedback
Posted (Aug. 1, 2013, 4:42 a.m.)
Thanks for working on this and looks good.   The ability to have an intermediate format is a good thing (I am mimicking somewhat similar targeted work for Sqoop 1 for some new changes).
Do you think this should be writeContent (or conversely the method in DataReader should be changed to readRecord instead of Content?)
  1. Venkat:
    
    Maybe I can make the javadoc clearer, but the idea of having readContent and writeContent in the DataReader/DataWriter and the IntermediateDataFormat is that if we allow the connector to choose a IntermediateDataFormat - the connector can read/write in the native format and (once we make the serialization part in the OutputFormat pluggable) we would be able to have the serializer also read/write in the connector's native format. In that case, it is possible that the native format might be efficiently able to put in several records in one call itself - which is why I named it as such (so all others will be record oriented while this method is not). Does that make the intent clearer?
  2. Hari:
    
    I guessed your intent, but was commenting that one is called readContent and another is called writeRecord.  Do you think they both should be xxxxContent?
    
    Thanks
  3. Ah, yes. Will update in the next patch.
Posted (Aug. 26, 2013, 11:55 p.m.)
Hi Hari,
thank you again for working on this JIRA and please accept my apologies for the late response. I've deep dived into the patch and I do have couple of comments and questions.
Extreme nit: Read data from the execution framework...

(not necessary MR)
Extreme nit: Read data from execution framework...
Nit: Read data from execution framework as a native format.

This should be independent native format right?
Extreme nit: Write an array of objects into the execution framework...
Extreme nit: Write data into execution framework....
Can we add descriptive java doc describing what exactly are expecting to validate?
connector/connector-generic-jdbc/src/main/java/org/apache/sqoop/connector/jdbc/GenericJdbcExportInitializer.java (Diff revision 5)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Those imports seems to be unused.
We should also take into account exportJobConfiguration.table.schema when building the query.
Let's clean up unused imports after this refactoring.
This method seems to be generated automatically, but it seems to be removing the fail() call, is it intentional?
The wiki [1] is also saying that the 0x5C should be substituted.

1: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Intermediate+representation
Similarly as above, it seems that we are missing the \\ replacement for 0x5C.
Is it wise to consume the exception here without any counters or other reporting except of log message?

It also seems that other parts of the code are not null safe, so this error handling will most likely just cause NPE somewhere else.
The setData() method is conditionally running the validations, are they intentionally skipped here in setTextData()?
I'm afraid that String.split() won't work correctly for following example input:

1,"Hi, here is Jarcec"

That contains two columns, but three columns will be created when splitting by comma.
Nit: Please pass the correct array size instead of the zero.
I would suggest to be strict here and accept only upper case variant of NULL.
I would suggest to die quickly for unsupported data types rather than just silently pass them as a un-escaped strings.
Is there a reason why to do our own join rather than StringUtils.join()?
Is expected and desirable that user can't reset schema to NULL?
Nit: Would it make more sense to have IntermediateDataFormat (without type here) or CSVIntermediateDataFormat since we are not allowing nothing else?
This will convert the binary string to an array of 0 and 1, right? The intermediate data format is suggesting the MySQLdump approach that is serializing the bytes as they are though. The 0 and 1 seems to be quite inefficient.
Since the regular expressions are pre-build and therefore the PatternSyntaxException should not be thrown, is there any other exception that we are expecting?
The conditional escaped argument seems quite dangerous to me - what if the user has saved the data on disk with escaped = true and is using this class to read the data from disk? The default value is false, so this will remove some actual data, right?
Since the regular expressions are pre-build and therefore the PatternSyntaxException should not be thrown, is there any other exception that we are expecting?
Since couple of the methods, such as setSchema(), getSchema(), validateData() will have to be implemented by every IDF in a very similar way, I'm wondering if it would make sense to convert the IDF to abstract class and provide the base implementation?
Can we javadoc for those two methods?

Also it seems that they are not used, so I'm wondering if we are planning to use them in the future or if they are artifact from previous development?
Can we make this a Class instead of String?
Nit: Can we keep the style from previous lines and put this on single line?
The length variable seems to be used only readFields method, so I'm wondering if local variable wouldn't be better fit?
Casting the data to UTF might be quite dangerous, especially for binary values as they will be interpreted and possibly corrupted. This won't be an issue with current implementation that stores binary data as a stream of 0s and 1s, but will become an issue when (if) we switch to the format designed on the wiki.
This import do not seem to be relevant any more.
Nit: Please keep the line on a single line.
Nit: This change do not seem necessary.
This import do not seem to be relevant any more.
Nit: Can we please put this on single line?
This import do not seem to be relevant any more.
Nit: Please put the line on a single line.
This import do not seem relevant any more.
Nit: Please put this on a single line.
It seems that we are building the same schema in multiple test cases. Would it make sense to create a helper method for this rather than copy&pasting the code?
Hari got 20 experience points for using awesome text messages in the tests!
spi/pom.xml (Diff revision 5)
 
 
 
 
 
I'm not sure whether we want to make all connectors depending on the SDK. The SDK should be light library of code for connectors, We should not force the connector to reuse anything. Is the dependency required?
Nit: Please put this on a single line.
Jarcec