
Comments (6)

alamb commented on August 23, 2024

The more I think about this, the more I think it makes sense to make SessionState a container that doesn't include all the optional features (like parquet support) by default.

So like

let mut session_state = SessionState::new();
// no table providers, etc
// install standard built in table providers
SessionContext::install_built_in(&mut session_state);
// now session_state has them here
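
A minimal sketch of that idea, using toy types rather than DataFusion's real API. `MiniSessionState`, `TableFactory`, and `install_built_ins` are hypothetical names invented for illustration; the only point is that construction yields an empty container and the built-ins arrive through a separate, explicit call.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a table provider factory (not DataFusion's trait).
#[derive(Debug, Clone, PartialEq)]
pub struct TableFactory {
    pub format: &'static str,
}

/// Hypothetical minimal session state: nothing is registered at construction.
#[derive(Debug, Default)]
pub struct MiniSessionState {
    table_factories: HashMap<String, TableFactory>,
}

impl MiniSessionState {
    /// Creates an empty state with no table providers.
    pub fn new() -> Self {
        Self::default()
    }

    /// Registers a factory under an upper-cased name.
    pub fn register_table_factory(&mut self, name: &str, factory: TableFactory) {
        self.table_factories.insert(name.to_uppercase(), factory);
    }

    /// Case-insensitive lookup of a registered factory name.
    pub fn has_table_factory(&self, name: &str) -> bool {
        self.table_factories.contains_key(&name.to_uppercase())
    }

    pub fn table_factory_count(&self) -> usize {
        self.table_factories.len()
    }
}

/// Formats this sketch treats as "built in" (an assumption for illustration).
const BUILT_IN_FORMATS: &[&str] = &["CSV", "JSON", "NDJSON", "AVRO", "ARROW"];

/// Hypothetical counterpart of the proposed `install_built_in`:
/// adds the standard factories to an existing state.
pub fn install_built_ins(state: &mut MiniSessionState) {
    for &format in BUILT_IN_FORMATS {
        state.register_table_factory(format, TableFactory { format });
    }
}
```

With this shape, a caller who wants a minimal engine simply never calls `install_built_ins`, while a "batteries included" entry point calls it right after `MiniSessionState::new()`.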

from datafusion.

alamb commented on August 23, 2024

I made #11183 to start breaking apart the API and implementation -- there is still a ways to go


devinjdangelo commented on August 23, 2024

@alamb I took a stab at moving parquet functionality into a datafusion-parquet crate (#11188), and I ran into challenges similar to the ones you highlight here. I think that to accomplish these goals, core will need to be refactored into a number of different crates to avoid circular dependencies while still allowing core to offer a batteries-included experience.
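
One common way to avoid such a cycle is for a small shared crate to define only the extension trait and a registry, for each format crate to implement that trait, and for a thin facade to wire them together. The sketch below keeps everything in one file and marks in comments which crate each item would live in. All names (`FormatFactory`, `Registry`, `ParquetFactory`, `CsvFactory`, `default_registry`) are hypothetical and do not reflect DataFusion's actual crate layout.

```rust
use std::collections::BTreeMap;

// --- Would live in a small shared crate: the extension point only. ---

/// Hypothetical trait that every optional file format implements.
pub trait FormatFactory {
    fn name(&self) -> &'static str;
}

/// Hypothetical registry that knows nothing about concrete formats.
#[derive(Default)]
pub struct Registry {
    formats: BTreeMap<&'static str, Box<dyn FormatFactory>>,
}

impl Registry {
    pub fn register(&mut self, factory: Box<dyn FormatFactory>) {
        self.formats.insert(factory.name(), factory);
    }

    /// Registered format names in sorted order.
    pub fn format_names(&self) -> Vec<&'static str> {
        self.formats.keys().copied().collect()
    }
}

// --- Would live in optional format crates: depend on the shared crate only. ---

pub struct ParquetFactory;

impl FormatFactory for ParquetFactory {
    fn name(&self) -> &'static str {
        "parquet"
    }
}

pub struct CsvFactory;

impl FormatFactory for CsvFactory {
    fn name(&self) -> &'static str {
        "csv"
    }
}

// --- Would live in the "batteries included" facade: depends on everything,
// --- but nothing depends on it, so no dependency cycle arises.

pub fn default_registry() -> Registry {
    let mut registry = Registry::default();
    registry.register(Box::new(CsvFactory));
    registry.register(Box::new(ParquetFactory));
    registry
}
```

The dependency arrows all point toward the shared crate, which is what lets a format such as parquet be dropped without touching the registry itself.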


alamb commented on August 23, 2024

The more I think about this, the more I think it makes sense to make SessionState a container that doesn't include all the optional features (like parquet support) by default.

So like

let mut session_state = SessionState::new();
// no table providers, etc
// install standard built in table providers
SessionContext::install_built_in(&mut session_state);
// now session_state has them here

I filed #11320 to track this idea


alamb commented on August 23, 2024

I looked into this --

One of the major challenges is that SessionState's constructor basically installs the "pre-provided" functionality (like data sources, etc.):

pub fn new_with_config_rt_and_catalog_list(
    config: SessionConfig,
    runtime: Arc<RuntimeEnv>,
    catalog_list: Arc<dyn CatalogProviderList>,
) -> Self {
    let session_id = Uuid::new_v4().to_string();
    // Create table_factories for all default formats
    let mut table_factories: HashMap<String, Arc<dyn TableProviderFactory>> =
        HashMap::new();
    #[cfg(feature = "parquet")]
    table_factories.insert("PARQUET".into(), Arc::new(DefaultTableFactory::new()));
    table_factories.insert("CSV".into(), Arc::new(DefaultTableFactory::new()));
    table_factories.insert("JSON".into(), Arc::new(DefaultTableFactory::new()));
    table_factories.insert("NDJSON".into(), Arc::new(DefaultTableFactory::new()));
    table_factories.insert("AVRO".into(), Arc::new(DefaultTableFactory::new()));
    table_factories.insert("ARROW".into(), Arc::new(DefaultTableFactory::new()));
    if config.create_default_catalog_and_schema() {
        let default_catalog = MemoryCatalogProvider::new();
        default_catalog
            .register_schema(
                &config.options().catalog.default_schema,
                Arc::new(MemorySchemaProvider::new()),
            )
            .expect("memory catalog provider can register schema");
        Self::register_default_schema(
            &config,
            &table_factories,
            &runtime,
            &default_catalog,
        );
        catalog_list.register_catalog(
            config.options().catalog.default_catalog.clone(),
            Arc::new(default_catalog),
        );
    }
    let mut new_self = SessionState {
        session_id,
        analyzer: Analyzer::new(),
        optimizer: Optimizer::new(),
        physical_optimizers: PhysicalOptimizer::new(),
        query_planner: Arc::new(DefaultQueryPlanner {}),
        catalog_list,
        table_functions: HashMap::new(),
        scalar_functions: HashMap::new(),
        aggregate_functions: HashMap::new(),
        window_functions: HashMap::new(),
        serializer_registry: Arc::new(EmptySerializerRegistry),
        file_formats: HashMap::new(),
        table_options: TableOptions::default_from_session_config(config.options()),
        config,
        execution_props: ExecutionProps::new(),
        runtime_env: runtime,
        table_factories,
        function_factory: None,
    };
    #[cfg(feature = "parquet")]
    if let Err(e) =
        new_self.register_file_format(Arc::new(ParquetFormatFactory::new()), false)
    {
        log::info!("Unable to register default ParquetFormat: {e}")
    };
    if let Err(e) =
        new_self.register_file_format(Arc::new(JsonFormatFactory::new()), false)
    {
        log::info!("Unable to register default JsonFormat: {e}")
    };
    if let Err(e) =
        new_self.register_file_format(Arc::new(CsvFormatFactory::new()), false)
    {
        log::info!("Unable to register default CsvFormat: {e}")
    };
    if let Err(e) =
        new_self.register_file_format(Arc::new(ArrowFormatFactory::new()), false)
    {
        log::info!("Unable to register default ArrowFormat: {e}")
    };
    if let Err(e) =
        new_self.register_file_format(Arc::new(AvroFormatFactory::new()), false)
    {
        log::info!("Unable to register default AvroFormat: {e}")
    };
    // register built in functions
    functions::register_all(&mut new_self)
        .expect("can not register built in functions");
    // register crate of array expressions (if enabled)
    #[cfg(feature = "array_expressions")]
    functions_array::register_all(&mut new_self)
        .expect("can not register array expressions");
    functions_aggregate::register_all(&mut new_self)
        .expect("can not register aggregate functions");
    new_self
}

One way to handle this would be to allow constructing SessionState with only the minimal built-in features, and to add a function on SessionContext, such as SessionContext::register_built_ins, that registers things like listing table, information schema, etc.

That way it would still be easy to use DataFusion with a minimal SessionState, but also easy to register all the built-in extensions 🤔
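
One consequence of registering built-ins after construction is that anything the user registered first can be left in place. The following is a rough sketch with toy types: `MinimalState`, `provider_of`, and this `register_built_ins` are hypothetical, and the meaning given to the `overwrite` flag is an assumption modeled on the `false` argument passed to `register_file_format` in the constructor quoted above.

```rust
use std::collections::HashMap;

/// Hypothetical minimal state: maps a file format name to its provider label.
#[derive(Debug, Default)]
pub struct MinimalState {
    file_formats: HashMap<String, String>,
}

impl MinimalState {
    /// Creates a state with no file formats registered.
    pub fn new() -> Self {
        Self::default()
    }

    /// Registers a format. Assumed semantics: when `overwrite` is false and
    /// the name is already taken, the existing entry is kept and an error
    /// is returned.
    pub fn register_file_format(
        &mut self,
        name: &str,
        provider: &str,
        overwrite: bool,
    ) -> Result<(), String> {
        if !overwrite && self.file_formats.contains_key(name) {
            return Err(format!("file format '{name}' is already registered"));
        }
        self.file_formats
            .insert(name.to_string(), provider.to_string());
        Ok(())
    }

    /// Returns the provider label registered for `name`, if any.
    pub fn provider_of(&self, name: &str) -> Option<&str> {
        self.file_formats.get(name).map(String::as_str)
    }
}

/// Hypothetical `register_built_ins`: registers defaults without overwriting
/// and returns the names that were skipped because they already existed.
pub fn register_built_ins(state: &mut MinimalState) -> Vec<String> {
    let mut skipped = Vec::new();
    for name in ["csv", "json", "arrow"] {
        if state.register_file_format(name, "built-in", false).is_err() {
            skipped.push(name.to_string());
        }
    }
    skipped
}
```

If a caller registers a custom `"csv"` provider and then calls `register_built_ins`, the custom entry survives and `"csv"` is reported as skipped, instead of the outcome being logged and dropped inside a constructor.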


alamb commented on August 23, 2024

> I think to accomplish these goals core will need to be refactored into a number of different crates to avoid circular dependencies and allow core to still offer a batteries included experience.

I agree with this entirely

The center of the knot is SessionState I think -- figuring out how to get that out of the core is likely key to breaking things up reasonably
