Incorrect query results for array-type data after changing a table's format to ORC: troubleshooting and solution
2017-03-30 14:35
Description:
We create a partitioned text-format table with one partition. After the table's file format is changed to ORC, queries may return incorrect values for the array-type field. The steps to reproduce the problem are as follows.
First, create a text-format table with an array-type field in Hive.
create table test_text_orc (
  col_int bigint,
  col_text string,
  col_array array<string>,
  col_map map<string, string>
)
PARTITIONED BY (day string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  collection items TERMINATED BY ']'
  map keys TERMINATED BY ':';
Create a new text file named hive-orc-text-file-array-error-test.txt with the following content.
1,text_value1,array_value1]array_value2]array_value3, map_key1:map_value1,map_key2:map_value2
2,text_value2,array_value4, map_key1:map_value3
,text_value3,, map_key1:]map_key3:map_value3
Load the data into one partition.
LOAD DATA local INPATH './hive-orc-text-file-array-error-test.txt' overwrite into table test_text_orc partition(day=20170329);
Select the data to verify the result.
hive> select * from test.test_text_orc;
OK
1 text_value1 ["array_value1","array_value2","array_value3"] {" map_key1":"map_value1","map_key2":"map_value2"} 20170329
2 text_value2 ["array_value4"] {"map_key1":"map_value3"} 20170329
NULL text_value3 [] {" map_key1":"","map_key3":"map_value3"} 20170329
Change the file format of the table to ORC.
alter table test_text_orc set fileformat orc;
Query the table again. The array column of the second and third rows now contains values left over from the first row.
hive> select * from test.test_text_orc;
OK
1 text_value1 ["array_value1","array_value2","array_value3"] {" map_key1":"map_value1","map_key2":"map_value2"} 20170329
2 text_value2 ["array_value4","array_value2","array_value3"] {"map_key1":"map_value3"} 20170329
NULL text_value3 ["array_value4","array_value2","array_value3"] {"map_key3":"map_value3"," map_key1":""} 20170329
Reason Analysis
The ObjectInspectorConverters$ListConverter instance does not clear the data of the previous record. When the array in the current row is shorter than the array in the previous row, the reused list is not fully overwritten, and the elements that were not overwritten are output along with the current row's data.
Code Analysis
In FetchOperator.nextRow, the value is first deserialized with currSerDe, which is the SerDe of the partition. The result is then passed to ObjectConverter, which is an instance of ObjectInspectorConverters$StructConverter.
Object deserialized = currSerDe.deserialize(value);
if (ObjectConverter != null) {
  deserialized = ObjectConverter.convert(deserialized);
}
The convert method reads each field value in turn and uses the corresponding converter to convert it.
After the table format is changed to ORC, the converter for the array-type field is ObjectInspectorConverters$ListConverter.
@Override
public Object convert(Object input) {
  if (input == null) {
    return null;
  }
  int minFields = Math.min(inputFields.size(), outputFields.size());
  // Convert the fields
  for (int f = 0; f < minFields; f++) {
    Object inputFieldValue = inputOI.getStructFieldData(input, inputFields.get(f));
    Object outputFieldValue = fieldConverters.get(f).convert(inputFieldValue);
    outputOI.setStructFieldData(output, outputFields.get(f), outputFieldValue);
  }
  // set the extra fields to null
  for (int f = minFields; f < outputFields.size(); f++) {
    outputOI.setStructFieldData(output, outputFields.get(f), null);
  }
  return output;
}
The method ObjectInspectorConverters$ListConverter.convert first creates a separate element converter for each element.
Then it calls outputOI.resize(output, size).
Finally, it stores every converted element in the output list through outputOI.set.
@Override
public Object convert(Object input) {
  if (input == null) {
    return null;
  }
  // Create enough elementConverters
  // NOTE: we have to have a separate elementConverter for each element,
  // because the elementConverters can reuse the internal object.
  // So it's not safe to use the same elementConverter to convert multiple
  // elements.
  int size = inputOI.getListLength(input);
  while (elementConverters.size() < size) {
    elementConverters.add(getConverter(inputElementOI, outputElementOI));
  }
  // Convert the elements
  outputOI.resize(output, size);
  for (int index = 0; index < size; index++) {
    Object inputElement = inputOI.getListElement(input, index);
    Object outputElement = elementConverters.get(index).convert(inputElement);
    outputOI.set(output, index, outputElement);
  }
  return output;
}
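The NOTE in the code above explains why each element gets its own converter. A minimal sketch of that aliasing problem follows. `Holder` and `ReusingConverter` are hypothetical stand-ins invented for this illustration, not Hive classes; the only assumption carried over from the source is that a converter may hand back the same internal object on every call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ElementConverterDemo {
    // Hypothetical mutable value object, standing in for a reusable output object.
    static final class Holder {
        String value;
    }

    // Hypothetical converter that returns the same internal Holder on every call.
    static final class ReusingConverter {
        private final Holder out = new Holder();

        Holder convert(String input) {
            out.value = input; // overwrites whatever the previous call stored
            return out;
        }
    }

    // One shared converter: every list slot refers to the same Holder.
    static List<String> convertShared(List<String> input) {
        ReusingConverter shared = new ReusingConverter();
        List<Holder> result = new ArrayList<>();
        for (String s : input) {
            result.add(shared.convert(s));
        }
        return render(result);
    }

    // One converter per element: each slot gets its own Holder.
    static List<String> convertSeparate(List<String> input) {
        List<Holder> result = new ArrayList<>();
        for (String s : input) {
            result.add(new ReusingConverter().convert(s));
        }
        return render(result);
    }

    // Reads the values out of the holders after all conversions are done.
    static List<String> render(List<Holder> holders) {
        List<String> values = new ArrayList<>();
        for (Holder h : holders) {
            values.add(h.value);
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "c");
        System.out.println(convertShared(input));   // [c, c, c]
        System.out.println(convertSeparate(input)); // [a, b, c]
    }
}
```

With a shared converter only the last element survives, which is why ListConverter keeps growing elementConverters until there is one per index.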
The problem is in the resize method: it does not clear the data of the previous row and simply calls ensureCapacity, which only reserves capacity and changes neither the size nor the contents of the list.
When the array in the current row is shorter than the array in the previous row, the list is not fully overwritten and the elements that were not overwritten are output.
public Object resize(Object list, int newSize) {
  ((ArrayList) list).ensureCapacity(newSize);
  return list;
}
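The effect can be reproduced outside Hive with a small sketch. `StaleListDemo` is a hypothetical class written for this illustration; its `resize` mirrors the ensureCapacity call quoted above, while its `set`, which pads the list with nulls up to the requested index, is an assumption about how the output inspector stores elements.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a list converter that reuses its output list.
class StaleListDemo {
    // The same list instance is reused for every row.
    private final ArrayList<String> output = new ArrayList<>();

    // Mirrors the quoted resize: reserves capacity, leaves size and contents alone.
    private void resize(int newSize) {
        output.ensureCapacity(newSize);
    }

    // Assumption: pad with nulls up to index, then overwrite that slot.
    private void set(int index, String element) {
        while (output.size() <= index) {
            output.add(null);
        }
        output.set(index, element);
    }

    // Converts one row and returns a snapshot of the reused output list.
    List<String> convert(List<String> input) {
        resize(input.size());
        for (int i = 0; i < input.size(); i++) {
            set(i, input.get(i));
        }
        return new ArrayList<>(output);
    }

    public static void main(String[] args) {
        StaleListDemo converter = new StaleListDemo();
        // Same array values as the three rows of the test file.
        System.out.println(converter.convert(
                Arrays.asList("array_value1", "array_value2", "array_value3")));
        System.out.println(converter.convert(Arrays.asList("array_value4")));
        System.out.println(converter.convert(new ArrayList<String>()));
    }
}
```

Under these assumptions the second and third calls return `[array_value4, array_value2, array_value3]`, the same stale values seen in the query output.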
## Solution
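Replace the resize method with the following code, which clears the list so that no element of the previous row remains when the current row is written.
public Object resize(Object list, int newSize) {
  ((ArrayList) list).clear();
  return list;
}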
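A quick way to check the fix is to run the three array values from the test file through a reused list. The sketch below is illustrative only: `ResizeFixDemo` is a hypothetical class, and its null-padding `set` is an assumption about how elements are stored after the list has been emptied. It also includes a truncate-only variant that removes just the stale tail; that variant is an alternative added here for comparison, not something proposed in the original fix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a list converter that reuses its output list.
class ResizeFixDemo {
    private final ArrayList<String> output = new ArrayList<>();
    private final boolean truncateOnly;

    ResizeFixDemo(boolean truncateOnly) {
        this.truncateOnly = truncateOnly;
    }

    private void resize(int newSize) {
        if (truncateOnly) {
            // Alternative (not from the original fix): drop only the stale tail.
            if (output.size() > newSize) {
                output.subList(newSize, output.size()).clear();
            }
        } else {
            // The fix shown above: discard everything left from the previous row.
            output.clear();
        }
    }

    // Assumption: pad with nulls up to index, then overwrite that slot.
    private void set(int index, String element) {
        while (output.size() <= index) {
            output.add(null);
        }
        output.set(index, element);
    }

    // Converts one row and returns a snapshot of the reused output list.
    List<String> convert(List<String> input) {
        resize(input.size());
        for (int i = 0; i < input.size(); i++) {
            set(i, input.get(i));
        }
        return new ArrayList<>(output);
    }

    public static void main(String[] args) {
        for (boolean truncate : new boolean[] {false, true}) {
            ResizeFixDemo converter = new ResizeFixDemo(truncate);
            System.out.println(converter.convert(
                    Arrays.asList("array_value1", "array_value2", "array_value3")));
            System.out.println(converter.convert(Arrays.asList("array_value4")));
            System.out.println(converter.convert(new ArrayList<String>()));
        }
    }
}
```

Both variants yield `[array_value4]` for the second row and `[]` for the third. Note that clear() leaves the list empty, so it only works if `set` can grow the list again, which is why the padding assumption matters here.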