I'm running Spark 1.3.0 and want to read a number of Parquet files selected by pattern matching. The Parquet files are the underlying files of a Hive DB, and I want to read only some of them (across different folders). The folder structure is:
hdfs://myhost:8020/user/hive/warehouse/db/blogs/some/meta/files/
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=20160101/01/file1.parq
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=20160101/02/file2.parq
hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd=20160103/01/file3.parq
I have something like this in mind:
val v1 = sqlContext.parquetFile("hdfs://myhost:8020/user/hive/warehouse/db/blogs/yymmdd={[0-9]*}")
I want to ignore the meta files and load only the parquet files inside the date folders. Is this possible?