Git Product home page Git Product logo

Comments (4)

laughingman7743 avatar laughingman7743 commented on May 25, 2024

I think that it is a problem related to quotation escaping.
I would like to know the definition of ccindex table.
Thanks,

from pyathena.

mhlr avatar mhlr commented on May 25, 2024

I cut and pasted i into the Atena web interfacet from the CommonCrawl article:
http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
It is:

CREATE EXTERNAL TABLE IF NOT EXISTS ccindex (
  url_surtkey                   STRING,
  url                           STRING,
  url_host_name                 STRING,
  url_host_tld                  STRING,
  url_host_2nd_last_part        STRING,
  url_host_3rd_last_part        STRING,
  url_host_4th_last_part        STRING,
  url_host_5th_last_part        STRING,
  url_host_registry_suffix      STRING,
  url_host_registered_domain    STRING,
  url_host_private_suffix       STRING,
  url_host_private_domain       STRING,
  url_protocol                  STRING,
  url_port                      INT,
  url_path                      STRING,
  url_query                     STRING,
  fetch_time                    TIMESTAMP,
  fetch_status                  SMALLINT,
  content_digest                STRING,
  content_mime_type             STRING,
  content_mime_detected         STRING,
  warc_filename                 STRING,
  warc_record_offset            INT,
  warc_record_length            INT,
  warc_segment                  STRING)
PARTITIONED BY (
  crawl                         STRING,
  subset                        STRING)
STORED AS parquet
LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/';

The same query runs in the Athena web interface without complaint.

from pyathena.

laughingman7743 avatar laughingman7743 commented on May 25, 2024

SQLAlchemy seems to call the get_columns method before executing the query.
https://github.com/laughingman7743/PyAthena/blob/master/pyathena/sqlalchemy_athena.py#L134

With read_sql method of pandas, it seems that the table_name argument of get_columns method is not the table name, but the query to be executed is getting passed.
The get_columns method gets an error when trying to execute the following query.

SELECT
  table_schema,
  table_name,
  column_name,
  data_type,
  is_nullable,
  column_default,
  ordinal_position,
  comment
FROM information_schema.columns
WHERE table_schema = 'ccindex'
AND table_name = '
SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY  url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY  count DESC
'

Even if an error occurs in the get_columns method, it seems that the query execution ends normally.
I feel like a problem with the implementation of pandas's read_sql method.

As a solution, it is better to pass the DB-API connection instead of the SQLAlchemy engine to the read_sql method.

df = pd.read_sql("""
SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY  url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY  count DESC
""", engine.connect().connection)

If it is a read_sql_query method, it seems to be ok to pass SQLAlchemy engine.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_query.html#pandas.read_sql_query

df = pd.read_sql_query("""
SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY  url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY  count DESC
""", engine)

from pyathena.

mhlr avatar mhlr commented on May 25, 2024

@laughingman7743 Thanks!!

from pyathena.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.