Apache Drill and the lack of support for nested arrays
Apache Drill is efficient and fast, until you use it against a single huge file (a few GB, say) or query a complex structure with nested data. That is exactly what I am trying to do right now: querying large segments of data with a dynamic structure and a nested schema.
I can construct a Parquet data source from a nested array, as below:
create table dfs.tmp.camic as ( select camic.geometry.coordinates[0][0] as geo_coordinates from dfs.`/home/pradeeban/programs/apache-drill-1.6.0/camic.json` camic);
Here, I am giving the indices of the array to pick a single element.
Then I can query the data efficiently. For example,
select * from dfs.tmp.camic;
However, hard-coding indices won't work for my use case, as I don't need just the first element. I need all the elements of a large, dynamic array representing the GeoJSON coordinates.
create table dfs.tmp.camic as ( select camic.geometry.coordinates[0] as geo_coordinates from dfs.`/home/pradeeban/programs/apache-drill-1.6.0/camic.json` camic);
Error: SYSTEM ERROR: UnsupportedOperationException: Unsupported type LIST
Fragment 0:0
[Error Id: a6d68a6c-50ea-437b-b1db-f1c8ace0e11d on llovizna:31010]
(java.lang.UnsupportedOperationException) Unsupported type LIST
org.apache.drill.exec.store.parquet.ParquetRecordWriter.getType():225
org.apache.drill.exec.store.parquet.ParquetRecordWriter.newSchema():187
org.apache.drill.exec.store.parquet.ParquetRecordWriter.updateSchema():172
org.apache.drill.exec.physical.impl.WriterRecordBatch.setupNewSchema():155
org.apache.drill.exec.physical.impl.WriterRecordBatch.innerNext():103
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.record.AbstractRecordBatch.next():119
org.apache.drill.exec.record.AbstractRecordBatch.next():109
org.apache.drill.exec.record.AbstractSingleRecordBatch.innerNext():51
org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.innerNext():129
org.apache.drill.exec.record.AbstractRecordBatch.next():162
org.apache.drill.exec.physical.impl.BaseRootExec.next():104
org.apache.drill.exec.physical.impl.ScreenCreator$ScreenRoot.innerNext():81
org.apache.drill.exec.physical.impl.BaseRootExec.next():94
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():257
org.apache.drill.exec.work.fragment.FragmentExecutor$1.run():251
java.security.AccessController.doPrivileged():-2
javax.security.auth.Subject.doAs():422
org.apache.hadoop.security.UserGroupInformation.doAs():1657
org.apache.drill.exec.work.fragment.FragmentExecutor.run():251
org.apache.drill.common.SelfCleaningRunnable.run():38
java.util.concurrent.ThreadPoolExecutor.runWorker():1142
java.util.concurrent.ThreadPoolExecutor$Worker.run():617
java.lang.Thread.run():744 (state=,code=0)
Here, I am trying to query a multi-dimensional array, which is not straightforward: judging by the error, the Parquet writer in this Drill version can handle a repeated list of primitive values, but not the nested LIST type that coordinates[0] produces. (I set the error messages to be verbose using SET `exec.errors.verbose` = true; above.)
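For reference, the option is a session-level setting; SET is a shorthand for ALTER SESSION SET, so the full form is:

-- Enable verbose error messages for the current Drill session.
ALTER SESSION SET `exec.errors.verbose` = true;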
The commonly suggested options to query multi-dimensional arrays are:
1. Using the array indices in the select query: This is impractical, as I do not know in advance how many coordinate elements the GeoJSON will have. It may be millions, or as few as 3.
2. The FLATTEN keyword: I am using Drill on top of Mongo, and I have found an interesting case where Drill's distributed execution outperforms plain Mongo for certain queries. Using FLATTEN basically kills all the performance benefits I otherwise get from Drill; it is simply too expensive an operation at the scale of my data (around 48 GB, though I can split it into chunks of a few GB each). A sketch of this approach follows below.
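As a rough sketch of option 2 (assuming the same file and a polygon-style coordinates layout), FLATTEN unwraps one level of nesting per call, so each additional level needs a subquery:

-- Sketch: unnest the outer rings, then the coordinate pairs within each ring.
-- Every FLATTEN multiplies the row count, which is what makes this expensive.
select flatten(t.ring) as point
from (
  select flatten(camic.geometry.coordinates) as ring
  from dfs.`/home/pradeeban/programs/apache-drill-1.6.0/camic.json` camic
) t;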
This is a known limitation of Drill, and it significantly reduces Drill's usability, as the proposed workarounds are either impractical or inefficient.