
Comments (17)

joshsinger commented on May 26, 2024

Proxy support for NcbiImporter was added in version 0.1.122.

To enable it, configure three properties in your conf/gluetools-config.xml, adding them under the <properties> XML element.

Here's an example of how the config properties would look:

		<!-- HTTP proxy config example -->
		<property>
			<name>gluetools.core.http.proxy.enabled</name>
			<value>true</value>
		</property>
		<property>
			<name>gluetools.core.http.proxy.host</name>
			<value>45.232.52.23</value>
		</property>
		<property>
			<name>gluetools.core.http.proxy.port</name>
			<value>3128</value>
		</property>

from gluetools.

daveuu commented on May 26, 2024

Unfortunately, no luck here. I downloaded the newer versions (BTW, with two versions of the .jar in /lib, gluetools.sh uses the older one). The end of my conf/gluetools-config.xml looks like:

                <!-- Cayenne -->
                <property>
                        <name>cayenne.querycache.size</name>
                        <value>30000</value>
                </property>
                <!-- HTTP proxy config -->
                <property>
                        <name>gluetools.core.http.proxy.enabled</name>
                        <value>true</value>
                </property>
                <property>
                        <name>gluetools.core.http.proxy.host</name>
                        <value>158.119.150.18</value>
                </property>
                <property>
                        <name>gluetools.core.http.proxy.port</name>
                        <value>8080</value>
                </property>
        </properties>
</gluetools>

These correspond to my environment variables and browser settings for our proxy: 158.119.150.18:8080. The GLUE commands that don't work as expected:

GLUE version 0.1.122
Mode path: /
GLUE> project hev
OK
Mode path: /project/hev
GLUE> module ncbiHevImporter
OK
Mode path: /project/hev/module/ncbiHevImporter
GLUE> preview --detailed
Error: I/O error during eSearch: Connection timed out (Connection timed out)
Cause: Connection timed out (Connection timed out)
Mode path: /project/hev/module/ncbiHevImporter

The report from tcptrack:

 158.119.178.158:36688 158.119.150.18:8080   ESTABLISHED  2m     0 B/s
 158.119.178.158:36828 158.119.150.18:8080   ESTABLISHED  22s    0 B/s
 158.119.178.158:36140 158.119.150.18:8080   ESTABLISHED  59s    0 B/s

WRT debugging, if I deliberately use an incorrect proxy address:

                <property>
                        <name>gluetools.core.http.proxy.host</name>
                        <value>158.119.150.19</value>
                </property>

I get:

GLUE> preview --detailed
Error: I/O error during eSearch: Connect to 158.119.150.19:8080 [/158.119.150.19] failed: Connection refused (Connection refused)
Cause: Connect to 158.119.150.19:8080 [/158.119.150.19] failed: Connection refused (Connection refused)
Cause: Connection refused (Connection refused)
Mode path: /project/hev/module/ncbiHevImporter

and

 158.119.178.158:36140 158.119.150.18:8080   RESET  0s    0 B/s

And if I switch the proxy off in the config:

                <property>
                        <name>gluetools.core.http.proxy.enabled</name>
                        <value>false</value>
                </property>

I eventually get (trying import instead of preview):

GLUE> import --detailed
Error: I/O error during eSearch: Connection timed out (Connection timed out)
Cause: Connection timed out (Connection timed out)
Mode path: /project/hev/module/ncbiHevImporter

and

 158.119.178.158:56684 130.14.29.110:443     SYN_SENT     2s     0 B/s

So it looks like your modification is behaving as expected WRT redirection to the proxy, but the objective still isn't achieved . . . is it also listening for responses via the proxy?


daveuu commented on May 26, 2024

I just tested Python's BioPython

from Bio import Entrez  # BioPython's client for the NCBI E-utilities
handle = Entrez.esearch(db="nuccore", term="Hepatitis E", field="title", rettype='xml')
print(Entrez.read(handle)[u'QueryTranslation'])

gives

hepatitis e[Title]

and proxy address activity:

 158.119.178.158:37124 158.119.150.18:8080   ESTABLISHED  13s    0 B/s
 158.119.178.158:36140 158.119.150.18:8080   ESTABLISHED  27s    0 B/s
 158.119.178.158:37212 158.119.150.18:8080   CLOSING      20s    0 B/s

I believe Python looks to environment variables http_proxy and https_proxy which are:

~/VRD/gluetools$ echo $http_proxy
http://158.119.150.18:8080/
~/VRD/gluetools$ echo $https_proxy
https://158.119.150.18:8080/


joshsinger commented on May 26, 2024

OK, I am guessing the situation is this:

NCBI requires HTTPS to be used to access its API, although this was not always the case, so if you investigate this issue you may see people talking about accessing NCBI via plain HTTP.

BioPython Entrez functionality uses Python's urllib which will pick up proxy settings from environment variables. The endpoint (NCBI) is HTTPS, therefore urllib will use the proxy defined in https_proxy.

You could test this hypothesis by messing with the https_proxy environment variable -- this should break BioPython Entrez functionality. Conversely, messing with http_proxy should have no effect on BioPython Entrez.
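The environment lookup described above can be checked without BioPython. The following is only a sketch, not BioPython's own code: `env_proxy_for` is a hypothetical helper, the eutils URL is assumed to be the eSearch endpoint, and any `no_proxy` exclusions are ignored.

```python
import os
import urllib.request
from urllib.parse import urlsplit


def env_proxy_for(url):
    """Return the proxy URL the environment defines for url's scheme, or None.

    urllib.request.getproxies_environment() scans variables named
    <scheme>_proxy (e.g. http_proxy, https_proxy); lowercase names win.
    """
    scheme = urlsplit(url).scheme
    return urllib.request.getproxies_environment().get(scheme)


if __name__ == "__main__":
    # Assumed eSearch endpoint; only its scheme matters for the lookup.
    endpoint = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    print("https_proxy =", os.environ.get("https_proxy"))
    print("proxy chosen for", endpoint, "->", env_proxy_for(endpoint))
```

If the hypothesis holds, changing https_proxy should change the value reported for the HTTPS endpoint, while changing http_proxy should not.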

In your case (and this is not universally true) the HTTPS proxy at your local site itself uses HTTPS with the client, which is why the https_proxy value starts with https://.

So, in the latest release 0.1.123, I've changed GLUE so that you set the HTTPS proxy in gluetools-config.xml like this:

	<!-- HTTPS proxy config example -->
	<property>
		<name>gluetools.core.https.proxy.enabled</name>
		<value>true</value>
	</property>
	<property>
		<name>gluetools.core.https.proxy.url</name>
		<value>https://123.45.67.89:8080</value>
	</property>

So now you can set this up in a similar way to the https_proxy environment variable. The Ncbi importer will use the protocol you configure in the URL.
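To double-check what a config file actually declares, the property list can be read back with a few lines of Python. This is a rough sketch under assumptions: `read_properties` and `https_proxy_setting` are hypothetical helpers, the `<gluetools>`/`<properties>` nesting is inferred from the config excerpts in this thread, and treating only the literal value "true" as enabled is a guess at how GLUE interprets the flag.

```python
import xml.etree.ElementTree as ET

# Sample config; the nesting is an assumption based on the excerpts in this thread.
SAMPLE = """<gluetools>
  <properties>
    <property>
      <name>gluetools.core.https.proxy.enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>gluetools.core.https.proxy.url</name>
      <value>https://123.45.67.89:8080</value>
    </property>
  </properties>
</gluetools>"""


def read_properties(config_xml):
    """Collect every <property> name/value pair in the document into a dict."""
    root = ET.fromstring(config_xml)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def https_proxy_setting(props):
    """Return the configured HTTPS proxy URL, or None if the proxy is not enabled."""
    enabled = props.get("gluetools.core.https.proxy.enabled", "false").lower() == "true"
    return props.get("gluetools.core.https.proxy.url") if enabled else None


if __name__ == "__main__":
    print(https_proxy_setting(read_properties(SAMPLE)))
```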

PS I have also updated gluetools.sh -- if it finds multiple jars in the lib directory it throws an error. Updated script version here:

https://github.com/giffordlabcvr/gluetools/blob/master/gluetools-core/gluetools/bin/gluetools.sh


daveuu commented on May 26, 2024

Success! Thanks.


daveuu commented on May 26, 2024

Almost . . . the first search returned the table of GI numbers, status etc. But a subsequent call of the same command has problems:

GLUE> preview --detailed
Error: I/O error during eSearch: Remote host closed connection during handshake
Cause: Remote host closed connection during handshake
Cause: SSL peer shut down incorrectly
Mode path: /project/enterovirus/module/enterovirusCuratedNcbiImporter
GLUE> preview --detailed
Error: I/O error during eSearch: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
Cause: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
Cause: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
Cause: unable to find valid certification path to requested target
Mode path: /project/enterovirus/module/enterovirusCuratedNcbiImporter

Same result for preview and import --preview.


daveuu commented on May 26, 2024

Quitting GLUE, then rerunning it and reopening the project etc., returns the second error above.


joshsinger commented on May 26, 2024

This is because we're trying to connect to your proxy via SSL, and Java's SSL layer does not recognise the certificate of your proxy server, probably because it is self-signed. To fix this, you would have to install your proxy server's certificate in the JRE using something like this method:

https://www.grim.se/guide/jre-cert


daveuu commented on May 26, 2024

Grabbing the HTTPS proxy's certificate with

echo -n | openssl s_client -connect 158.119.150.18:443 | sed -ne '/-BEGIN CERTIFICATE-/,/-END CERTIFICATE-/p' > /tmp/local_https_proxy.cert

and examining it with

openssl x509 -in /tmp/local_https_proxy.cert -text

gives Issuer: C=US, O=DigiCert Inc, OU=www.digicert.com. I also did not have to add any certificates to the browser to get it working through the proxy, so I don't think the HTTPS proxy is using a self-signed certificate.

With my GLUE (ver. 0.1.131) config settings as

                <property>
                        <name>gluetools.core.https.proxy.enabled</name>
                        <value>true</value>
                </property>
                <property>
                        <name>gluetools.core.https.proxy.url</name>
                        <value>https://158.119.150.18:443</value>
                </property>

with and without the above certificate in the key store, manually added with:

CERT="/tmp/local_https_proxy.cert"
CERTALIAS="local_https_proxy"

sudo keytool -import \
-trustcacerts \
-keystore /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/security/cacerts \
-storepass changeit \
-noprompt \
-alias $CERTALIAS \
-file $CERT

I get this error message from within GLUE

Error: I/O error during eSearch: Host name '158.119.150.18' does not match the certificate subject provided by the peer (CN=*.phe.gov.uk, O=Public Health England, L=London, ST=England, C=GB)
Cause: Host name '158.119.150.18' does not match the certificate subject provided by the peer (CN=*.phe.gov.uk, O=Public Health England, L=London, ST=England, C=GB)

I tried an independent test of the local Java's ability to deal with the proxy server over SSL/HTTPS using the "SSLPoke" test class, which returned:

java -cp ./ SSLPoke 158.119.150.18 443
Successfully connected

So it looks like the HTTPS certificate on my computer, my Java install and this HTTPS proxy server are all in order. Is GLUE applying an overly stringent check on matching the domain name *.phe.gov.uk against the proxy's IP address? Do I need to supply a full domain name for this proxy server, instead of an IP address, to match the certificate? I suspect that the internal domain name for this proxy, if there is one, doesn't end in .phe.gov.uk . . .


joshsinger commented on May 26, 2024

Yes, this proves that the GLUE stack is establishing the correct SSL certificate, and the verification step is failing at the hostname match step.

This is something which the Apache HttpComponents library (which GLUE uses) applies strictly by default. However it seems to be quite configurable so we can probably get it to skip this step if necessary.

It is possible that your proxy server has an internal domain name which ends in .phe.gov.uk. If so you can find it out using this unix command:

% host 158.119.150.18

If that's the case then you can just apply this in the GLUE config. If not then I can investigate switching off the strict hostname match step.
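To illustrate why an IP address fails against a wildcard certificate name, here is a simplified sketch of the usual matching rule. It is not the Apache HttpComponents verifier, which also consults subjectAltName entries and applies further rules; `matches_cert_name` is a hypothetical helper and `proxy.phe.gov.uk` is a made-up host name.

```python
import ipaddress


def is_ip_address(host):
    """True if host parses as an IPv4 or IPv6 literal."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def matches_cert_name(host, pattern):
    """Simplified check of a host against a certificate DNS name like '*.phe.gov.uk'.

    A '*' is honoured only as the whole left-most label and stands for exactly
    one label. IP addresses never match a DNS name pattern.
    """
    if is_ip_address(host):
        return False
    host_labels = host.lower().rstrip(".").split(".")
    pattern_labels = pattern.lower().rstrip(".").split(".")
    if len(host_labels) != len(pattern_labels):
        return False
    first, rest = pattern_labels[0], pattern_labels[1:]
    if first != "*" and first != host_labels[0]:
        return False
    return rest == host_labels[1:]


if __name__ == "__main__":
    for candidate in ("158.119.150.18", "proxy.phe.gov.uk"):
        print(candidate, matches_cert_name(candidate, "*.phe.gov.uk"))
```

Under this rule only a name with exactly one label in front of .phe.gov.uk can satisfy the certificate, which is why configuring the proxy by host name rather than by IP address is worth trying first.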


daveuu commented on May 26, 2024

Indeed there is an appropriate looking host name. I'll test shortly.


daveuu commented on May 26, 2024

This could be progress!
With https://tmgcol001.phe.gov.uk:443 as value for gluetools.core.https.proxy.url
I get:

GLUE> preview --detailed
Error: Protocol error during eSearch: HTTP/1.1 400 Bad Request ( The data is invalid.  )
Mode path: /project/enterovirus/module/enterovirusCuratedNcbiImporter

Maybe we've actually made contact with NCBI at this point?


joshsinger commented on May 26, 2024

Yes, we may have made contact. Something, possibly the proxy or eSearch, is responding with the 400 error code, maybe because the URL or the search query is wrong?

I think we will have to dig into the response to find out more, possibly by adding some more logging in GLUE.


joshsinger commented on May 26, 2024

I strongly suspect this is an interaction between the PHE proxy and Apache HttpClient rather than an NCBI issue, but let's try to confirm that.

I've created a minimal Java program (minimalProxyTest.zip) which attempts to connect to an endpoint, optionally via an SSL proxy.

To run it, unzip the attached file and do
% java -jar minimalProxyTest.jar request.properties

It will try to connect to an HTTP endpoint, optionally via an SSL proxy, and if that works, it will output the request details to stdout. It uses the same HTTP java libraries which GLUE uses.
You can twiddle where it tries to connect to and other details using the request.properties file (the '#' character comments out a line).

So from your end it will be informative to first confirm that you can reproduce the HTTP 400 error connecting to NCBI via the proxy. If there's no 400 error then there's some difference between the GLUE setup and the minimal program, which we will need to identify.

Assuming the error is reproduced, we could then test, for example, whether the program can GET https://www.google.com via the proxy. If so then the NCBI request is a factor. If not then it's purely something to do with Apache HttpClient and the proxy. If that does turn out to be the case, I found this tool which supposedly helps debug proxy issues: https://www.charlesproxy.com/.

I have also included the Java source for reference, if you want to build it I can help with that.


joshsinger commented on May 26, 2024

BTW I also released GLUE 0.1.132, which adds some logging at FINEST level, concerning the request sent to NCBI, and the response, if there's an error.


joshsinger commented on May 26, 2024

I believe this is finally fixed in GLUE 1.1.38.

The issue was actually this:
-- connection to NCBI / https via proxy is enabled by these settings in the gluetools-config.xml file:

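	<property>
		<name>gluetools.core.https.proxy.enabled</name>
		<value>true</value>
	</property>
	<property>
		<name>gluetools.core.https.proxy.url</name>
		<value>-your HTTPS proxy URL -</value>
	</property>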

-- This worked in terms of retrieving data from NCBI
-- However, the XML document returned from NCBI contains a DTD reference with its own URL.
-- The XML parser within GLUE then tries to resolve this DTD reference via the network connection. This network connection is unaware of the need to use a web proxy. Hence, the "connection refused" behaviour.

So the fix was to disable the remote lookup of the DTD reference in GLUE's XML parser.
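The same idea can be shown in Python, purely as an illustration; GLUE itself is Java and this is not its code. The sample document and its DTD URL are made up, and `parse_counts_offline` is a hypothetical helper. The sketch parses a response that carries a DOCTYPE reference while telling the parser not to load external entities.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges

# Hypothetical response: the DTD URL is deliberately unresolvable, so parsing
# can only succeed if the parser makes no attempt to fetch it.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE eSearchResult SYSTEM "https://dtd.example.invalid/esearch.dtd">
<eSearchResult><Count>3</Count></eSearchResult>
"""


class CountHandler(ContentHandler):
    """Collects the text content of <Count> elements."""

    def __init__(self):
        super().__init__()
        self.counts = []
        self._buffer = None

    def startElement(self, name, attrs):
        if name == "Count":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "Count" and self._buffer is not None:
            self.counts.append("".join(self._buffer))
            self._buffer = None


def parse_counts_offline(xml_bytes):
    """Parse the document without resolving anything over the network."""
    parser = xml.sax.make_parser()
    # With this feature off, Python's expat-based reader skips external
    # entities, and in practice the DTD named in the DOCTYPE as well.
    parser.setFeature(feature_external_ges, False)
    handler = CountHandler()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.counts


if __name__ == "__main__":
    print(parse_counts_offline(SAMPLE))
```

The element content is still read as usual; only the lookup of the referenced DTD is skipped, so no second network connection (proxied or not) is needed.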


daveuu commented on May 26, 2024

Thanks Josh - I hope to have time to revisit this later in the summer
