#I am using PyAthena to query the recently released CommonCrawl parquet archives as de


Query succeeds with correct answer, but a botocore.errorfactory.InvalidRequestException error message is logged

laughingman7743 commented on May 25, 2024

Query succeeds with correct answer, but a botocore.errorfactory.InvalidRequestException error message is logged

from pyathena.

Comments (4)

laughingman7743 commented on May 25, 2024

I think this is a problem with quote escaping.
Could you share the definition of the ccindex table?
Thanks,


mhlr commented on May 25, 2024

I cut and pasted it into the Athena web interface from the CommonCrawl article:
http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/
It is:

CREATE EXTERNAL TABLE IF NOT EXISTS ccindex (
  url_surtkey                   STRING,
  url                           STRING,
  url_host_name                 STRING,
  url_host_tld                  STRING,
  url_host_2nd_last_part        STRING,
  url_host_3rd_last_part        STRING,
  url_host_4th_last_part        STRING,
  url_host_5th_last_part        STRING,
  url_host_registry_suffix      STRING,
  url_host_registered_domain    STRING,
  url_host_private_suffix       STRING,
  url_host_private_domain       STRING,
  url_protocol                  STRING,
  url_port                      INT,
  url_path                      STRING,
  url_query                     STRING,
  fetch_time                    TIMESTAMP,
  fetch_status                  SMALLINT,
  content_digest                STRING,
  content_mime_type             STRING,
  content_mime_detected         STRING,
  warc_filename                 STRING,
  warc_record_offset            INT,
  warc_record_length            INT,
  warc_segment                  STRING)
PARTITIONED BY (
  crawl                         STRING,
  subset                        STRING)
STORED AS parquet
LOCATION 's3://commoncrawl/cc-index/table/cc-main/warc/';

The same query runs in the Athena web interface without complaint.


laughingman7743 commented on May 25, 2024

SQLAlchemy seems to call the get_columns method before executing the query.
https://github.com/laughingman7743/PyAthena/blob/master/pyathena/sqlalchemy_athena.py#L134

With the read_sql method of pandas, it seems that the table_name argument of the get_columns method receives the query to be executed rather than a table name.
As a result, the get_columns method fails when it tries to execute the following query.

SELECT
  table_schema,
  table_name,
  column_name,
  data_type,
  is_nullable,
  column_default,
  ordinal_position,
  comment
FROM information_schema.columns
WHERE table_schema = 'ccindex'
AND table_name = '
SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY  url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY  count DESC
'
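
One plausible reason Athena rejects the statement above is that the embedded query contains its own single quotes, which end the table_name literal early. The sketch below illustrates that with a hypothetical template (COLUMNS_TEMPLATE, render and literal_after are names made up for this example, not PyAthena's actual code), and the scanner deliberately ignores doubled-quote escapes.

```python
# Hypothetical template for illustration; PyAthena's real query text may differ.
COLUMNS_TEMPLATE = (
    "SELECT column_name FROM information_schema.columns "
    "WHERE table_schema = '{schema}' AND table_name = '{table}'"
)


def render(schema, table):
    """Substitute the values into the template without any escaping."""
    return COLUMNS_TEMPLATE.format(schema=schema, table=table)


def literal_after(sql, marker):
    """Return the single-quoted literal that starts right after marker.

    Simplification: the literal is taken to end at the next single quote,
    so doubled-quote escapes are not handled.
    """
    start = sql.index(marker) + len(marker)
    if sql[start] != "'":
        raise ValueError("expected a quoted literal after the marker")
    end = sql.index("'", start + 1)
    return sql[start + 1:end]


# The whole user query is passed where a table name is expected.
user_sql = "SELECT COUNT(*) FROM ccindex WHERE crawl = 'CC-MAIN-2018-05'"
rendered = render("ccindex", user_sql)

# The literal stops at the first quote inside the user query.
table_literal = literal_after(rendered, "table_name = ")
print(repr(table_literal))
```

The text after the truncated literal (CC-MAIN-2018-05 and so on) is left dangling outside any string, so the metadata statement is no longer well formed.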

Even when the get_columns method raises an error, the query itself still seems to execute normally.
This looks like a problem with the implementation of pandas's read_sql method.
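
The behavior described above can be modeled roughly as follows. This is a simplified sketch, not the pandas or PyAthena source: FakeDialect and this read_sql are hypothetical stand-ins that only mimic the sequence "check whether the string is a table name, then fall back to running it as a query".

```python
class FakeDialect:
    """Hypothetical stand-in for a SQLAlchemy dialect that records lookups."""

    def __init__(self, tables):
        self.tables = set(tables)
        self.lookups = []  # every string that reached get_columns

    def get_columns(self, table_name):
        # A real dialect would query information_schema.columns here.
        self.lookups.append(table_name)
        if table_name not in self.tables:
            raise LookupError("metadata lookup failed")
        return ["placeholder_column"]

    def has_table(self, table_name):
        # The failure is swallowed, so the caller only sees "not a table".
        try:
            return bool(self.get_columns(table_name))
        except LookupError as exc:
            print(f"logged error: {exc}")
            return False


def read_sql(sql, dialect):
    """Model of a dispatcher that accepts either a table name or a query."""
    if dialect.has_table(sql):
        return ("read_table", sql)
    return ("read_query", sql)


dialect = FakeDialect(["ccindex"])
query = "SELECT COUNT(*) FROM ccindex WHERE url_host_tld = 'no'"
mode, _ = read_sql(query, dialect)
print(mode)
```

In this model the full query text reaches get_columns, the lookup fails and is logged, and the call still proceeds on the query path, which matches the symptom of a correct result plus a logged error.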

As a workaround, pass the DB-API connection instead of the SQLAlchemy engine to the read_sql method.

df = pd.read_sql("""
SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY  url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY  count DESC
""", engine.connect().connection)

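What the raw DB-API path amounts to can be sketched with the standard-library sqlite3 module standing in for Athena: the SQL goes straight to cursor.execute, and the column names come from cursor.description, with no table metadata lookup. The table contents are made-up illustration rows, run_query is a hypothetical helper, and the HAVING threshold is lowered to 2 so the tiny data set returns something.

```python
import sqlite3

# In-memory stand-in for Athena; the rows are made-up illustration data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ccindex ("
    "url_host_registered_domain TEXT, url_host_tld TEXT, "
    "crawl TEXT, subset TEXT)"
)
conn.executemany(
    "INSERT INTO ccindex VALUES (?, ?, ?, ?)",
    [
        ("example.no", "no", "CC-MAIN-2018-05", "warc"),
        ("example.no", "no", "CC-MAIN-2018-05", "warc"),
        ("other.no", "no", "CC-MAIN-2018-05", "warc"),
        ("example.com", "com", "CC-MAIN-2018-05", "warc"),
    ],
)

QUERY = """
SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM ccindex
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY url_host_registered_domain
HAVING (COUNT(*) >= 2)
ORDER BY count DESC
"""


def run_query(connection, sql):
    """Execute sql through a plain DB-API cursor and return (columns, rows)."""
    cursor = connection.cursor()
    cursor.execute(sql)
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()


columns, rows = run_query(conn, QUERY)
print(columns, rows)
```

Because nothing in this path asks the database whether the string names a table, the extra information_schema query is never issued.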
With the read_sql_query method, passing the SQLAlchemy engine seems to be fine.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_query.html#pandas.read_sql_query

df = pd.read_sql_query("""
SELECT COUNT(*) AS count,
       url_host_registered_domain
FROM "ccindex"."ccindex"
WHERE crawl = 'CC-MAIN-2018-05'
  AND subset = 'warc'
  AND url_host_tld = 'no'
GROUP BY  url_host_registered_domain
HAVING (COUNT(*) >= 100)
ORDER BY  count DESC
""", engine)


mhlr commented on May 25, 2024

@laughingman7743 Thanks!!
