Suppose we have the following Avro schema:
[
  { "type": "enum",
    "name": "POSTag",
    "namespace": "com.nitro.nlp",
    "symbols": ["TO", "VB", "CC", "RB"]
  },
  { "type": "record",
    "name": "POS",
    "namespace": "com.nitro.nlp",
    "doc": "Part of speech tag",
    "fields": [
      {"name": "id", "type": "long"},
      {"name": "token", "type": "string"},
      {"name": "tag", "type": "POSTag"}
    ]
  },
  { "type": "record",
    "name": "ParsedPDF",
    "namespace": "com.nitro.nlp",
    "doc": "Parsed PDF",
    "fields": [
      {"name": "text", "type": "string"},
      {"name": "pos", "type": { "type": "array", "items": "POS" } }
    ]
  }
]
Then, let's create an RDD with a projection on the "text" field, and try to build a predicate on the "pos" array of the ParsedPDF record:
ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[ParsedPDF]])
// Set the projection before creating the RDD, so that the configuration
// passed to Spark already contains it
val projection = Projection[ParsedPDF](_.getText)
AvroParquetInputFormat.setRequestedProjection(job, projection)
val rdd = sc.newAPIHadoopFile(path, classOf[ParquetInputFormat[ParsedPDF]],
  classOf[Void], classOf[ParsedPDF], job.getConfiguration
).map(_._2) // drop Void key
val results = rdd.map(_.toString).collect().mkString("\n")
println(s"Projection of text field only: $results")
The projection is fine, but building the predicate fails at compile time with the following error:
Error:(100, 41) exception during macro expansion:
java.lang.RuntimeException: Unsupported value type: ARRAY
at me.lyh.parquet.avro.Predicate$.applyToPredicate$1(Predicate.scala:137)
at me.lyh.parquet.avro.Predicate$.parse$1(Predicate.scala:162)
at me.lyh.parquet.avro.Predicate$.buildFilterPredicate(Predicate.scala:169)
at me.lyh.parquet.avro.Predicate$.applyImpl(Predicate.scala:17)
val predicate = Predicate[ParsedPDF](x => x.getPos.exists(i => i.getTag == POSTag.CC))
^
Is array support not available? If it isn't, which Avro types are currently supported in predicates?
Is there a plan to add array support to predicates?
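In the meantime, the condition I'm after can be applied after the read, on the records themselves, without any pushdown. Below is a minimal sketch of those semantics using plain case classes as stand-ins for the Avro-generated classes. The names PosTag, Pos, Parsed and hasTag are hypothetical and are not part of the library or of my generated code:

```scala
// Hypothetical stand-ins for the Avro-generated POSTag, POS and ParsedPDF classes.
object PosTag extends Enumeration {
  val TO, VB, CC, RB = Value
}

final case class Pos(id: Long, token: String, tag: PosTag.Value)
final case class Parsed(text: String, pos: Seq[Pos])

// Same condition as the failing predicate, evaluated in memory:
// keep a document if any element of its "pos" array carries the given tag.
def hasTag(tag: PosTag.Value)(doc: Parsed): Boolean =
  doc.pos.exists(_.tag == tag)

val docs = Seq(
  Parsed("to run", Seq(Pos(1L, "to", PosTag.TO), Pos(2L, "run", PosTag.VB))),
  Parsed("run and hide", Seq(
    Pos(1L, "run", PosTag.VB),
    Pos(2L, "and", PosTag.CC),
    Pos(3L, "hide", PosTag.VB)
  ))
)

// Only the second document contains a CC tag.
val withCC = docs.filter(hasTag(PosTag.CC))
```

On the real RDD this would presumably be an rdd.filter with the same exists check, assuming getPos returns a java.util.List that has to be converted to a Scala collection first. The catch is that every row gets read and deserialized, which is exactly what a Parquet predicate would avoid.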
Thanks!
Marek Kolodziej