Git Product home page Git Product logo

Comments (22)

davidben avatar davidben commented on July 20, 2024 6

I don't think this is a false positive. Thread sanitizer is correct that this code is a data race. It's a data race where you ignore the result of the race, but C says that data races are undefined behavior whether or not you ignore the result.

The issue isn't that TSan doesn't understand the object is read-only. It's not read-only. It is logically read-only, but the next and prev pointers are legitimately not read-only. TSan just doesn't understand that you then ignore the bad reads:

openssl/ssl/ssl_sess.c

Lines 163 to 166 in 1977c00

/* As the copy is not in the cache, we remove the associated pointers */
dest->prev = NULL;
dest->next = NULL;
dest->owner = NULL;

But TSan is correct to not understand that because C's rules don't care.

from openssl.

davidben avatar davidben commented on July 20, 2024 3

While that will silence the report, it won't fix the bug in OpenSSL. The data race is still undefined behavior in the language.

from openssl.

Sashan avatar Sashan commented on July 20, 2024 1

Thanks for the detailed report (link to log helped). I think this is kind of false positive. I must admit I don't all details how thread sanitizer works internally. So my analysis here might be quite wrong.

The mutex M2 is rw-lock whcih belongs to SSL context (according to this T-SAN output here):

E             Mutex M2 (0x721000275940) created at:
E               #0 pthread_rwlock_init <null> (clickhouse+0x723e9bd) (BuildId: a38098db2d267457b06b8d4a8b29e4d4d8b98a0a)
E               #1 CRYPTO_THREAD_lock_new build_docker/./contrib/openssl/crypto/threads_pthread.c:54:9 (clickhouse+0x2044f271) (BuildId: a38098db2d267457b06b8d4a8b29e4d4d8b98a0a)
E               #2 SSL_CTX_new_ex build_docker/./contrib/openssl/ssl/ssl_lib.c:3883:17 (clickhouse+0x20244a35) (BuildId: a38098db2d267457b06b8d4a8b29e4d4d8b98a0a)
E               #3 SSL_CTX_new build_docker/./contrib/openssl/ssl/ssl_lib.c:4120:12 (clickhouse+0x2024533d) (BuildId: a38098db2d267457b06b8d4a8b29e4d4d8b98a0a)
E               #4 Poco::Net::Context::createSSLContext() build_docker/./base/poco/NetSSL_OpenSSL/src/Context.cpp (clickhouse+0x1d240abc) (BuildId: a38098db2d267457b06b8d4a8b29e4d4d8b98a0a)
</snip>

Mutexes M0 and M1 don't belong to OpenSSL they seem to be created by DB application.

E             Mutex M1 (0x722c0000a800) created at:
E               #0 pthread_mutex_init <null> (clickhouse+0x723d7bf) (BuildId: a38098db2d267457b06b8d4a8b29e4d4d8b98a0a)
E               #1 std::__1::__libcpp_recursive_mutex_init[abi:v15000](pthread_mutex_t*) build_docker/./contrib/llvm-project/libcxx/include/__threading_support:269:10 (clickhouse+0x2082e9a4) (BuildId: a38098db2d267457b06b8d4a8b29e4d4d8b98a0a)
E               #2 std::__1::recursive_mutex::recursive_mutex() build_docker/./contrib/llvm-project/libcxx/src/mutex.cpp:61:14 (clickhouse+0x2082e9a4)
E               #3 Poco::Net::SecureSocketImpl::SecureSocketImpl(Poco::AutoPtr<Poco::Net::SocketImpl>, Poco::AutoPtr<Poco::Net::Context>) build_docker/./base/poco/NetSSL_OpenSSL/src/SecureSocketImpl.cpp:64:19 (clickhouse+0x1d2591a3) (BuildId: a38098db2d267457b06b8d4a8b29e4d4d8b98a0a)
E               #4 Poco::Net::SecureStreamSocketImpl::SecureStreamSocketImpl(Poco::AutoPtr<Poco::Net::Context>) build_docker/./base/poco/NetSSL_OpenSSL/src/SecureStreamSocketImpl.cpp:26:2 (clickhouse+0x1d25f8db) (BuildId: a38098db2d267457b06b8d4a8b29e4d4d8b98a0a)
E               #5 Poco::Net::SecureStreamSocket::SecureStreamSocket() build_docker/./base/poco/NetSSL_OpenSSL/src/SecureStreamSocket.cpp:30:19 (clickhouse+0x1d25c987) (BuildId: a38098db2d267457b06b8d4a8b29e4d4d8b98a0a)
E               #6 std::__1::__unique_if<Poco::Net::SecureStreamSocket>::__unique_single std::__1::make_unique[abi:v15000]<Poco::Net::SecureStreamSocket>() build_docker/./contrib/llvm-project/libcxx/include/__memory/unique_ptr.h:714:32 (clickhouse+0x195c81ee) (BuildId: a38098db2d267457b06b8d4a8b29e4d4d8b98a0a)
E               #7 DB::Connection::connect(DB::ConnectionTimeouts const&) build_docker/./src/Client/Connection.cpp:128:26 (clickhouse+0x195c81ee)

and stack here belongs to M0:

E             Mutex M0 (0x722c0016c010) created at:
E               #0 pthread_mutex_init <null> (clickhouse+0x723d7bf) (BuildId: a38098db2d267457b06b8d4a8b29e4d4d8b98a0a)
E               #1 std::__1::__libcpp_recursive_mutex_init[abi:v15000](pthread_mutex_t*) build_docker/./contrib/llvm-project/libcxx/include/__threading_support:269:10 (clickhouse+0x2082e9a4) (BuildId: a38098db2d267457b06b8d4a8b29e4d4d8b98a0a)
E               #2 std::__1::recursive_mutex::recursive_mutex() build_docker/./contrib/llvm-project/libcxx/src/mutex.cpp:61:14 (clickhouse+0x2082e9a4)
E               #3 Poco::Net::SecureSocketImpl::SecureSocketImpl(Poco::AutoPtr<Poco::Net::SocketImpl>, Poco::AutoPtr<Poco::Net::Context>) build_docker/./base/poco/NetSSL_OpenSSL/src/SecureSocketImpl.cpp:64:19 (clickhouse+0x1d2591a3) (BuildId: a38098db2d267457b06b8d4a8b29e4d4d8b98a0a)
E               #4 Poco::Net::SecureStreamSocketImpl::SecureStreamSocketImpl(Poco::AutoPtr<Poco::Net::Context>) build_docker/./base/poco/NetSSL_OpenSSL/src/SecureStreamSocketImpl.cpp:26:2 (clickhouse+0x1d25f8db) (BuildId: a38098db2d267457b06b8d4a8b29e4d4d8b98a0a)
E               #5 Poco::Net::SecureStreamSocket::SecureStreamSocket() build_docker/./base/poco/NetSSL_OpenSSL/src/SecureStreamSocket.cpp:30:19 (clickhouse+0x1d25c987) (BuildId: a38098db2d267457b06b8d4a8b29e4d4d8b98a0a)
E               #6 std::__1::__unique_if<Poco::N
</snip>

So what really matters to OpenSSL is M2 (our mutex). The mutex is held as we insert session to cache created by thread T672. The matching code which does that can be found here:

 712 int SSL_CTX_add_session(SSL_CTX *ctx, SSL_SESSION *c)
 713 {
 714     int ret = 0;
 715     SSL_SESSION *s;
 716
 717     /*
 718      * add just 1 reference count for the SSL_CTX's session cache even though
 719      * it has two ways of access: each session is in a doubly linked list and
 720      * an lhash
 721      */
 722     SSL_SESSION_up_ref(c);
 723     /*
 724      * if session c is in already in cache, we take back the increment later
 725      */
 726
 727     if (!CRYPTO_THREAD_write_lock(ctx->lock)) {
 728         SSL_SESSION_free(c);
 729         return 0;
 730     }
 731     s = lh_SSL_SESSION_insert(ctx->sessions, c);
 732
 733     /*
 734      * s != NULL iff we already had a session with the given PID. In this
 735      * case, s == c should hold (then we did not really modify
 736      * ctx->sessions), or we're in trouble.
 737      */

So this where we grab the lock for SSL context (a.k.a. M2) and insert a chunk of memory (session) into global list (a.k.a. session cache). What's worth to note is that function lh_SSL_SESSION_insert() modifies link members (prev, next pointers) of session which is inserted to cache. The modification happens under M2 protection and thread sanitizer makes a note of it (I think).

Later there is a thread T674 which calls a dup on session created by T672. I'm sure the look up operation did happen after the protection M2. However as soon as T674 finds the session it grabs the reference and drops M2. Because the session is immutable. The memcpy() operation copies the whole session including prev,next pointers which were modified under protection of M2 when T672 was calling lh_SSL_SESSION_insert(). And that's the why the T-san assert is triggered.

So thread sanitizer is right we are reading memory without holding M2 but this is harmful here because thread sanitizer does not understand the object is read-only here. May be we can copy the object member by member avoiding to copy {prev, next} members instead of doing memcpy(), to make thread sanitizer happy. don't know if it would help. At least that's my understanding of the report.

from openssl.

davidben avatar davidben commented on July 20, 2024 1

Be careful also with struct padding. Since actually quite a lot of fields that can't be copied (e.g. all the allocated buffers), it might make sense to just move the copyable fields into a smaller embedded struct and memcpy that. Or just copy field by field if there aren't many of them.

from openssl.

nhorman avatar nhorman commented on July 20, 2024 1

@rschu1ze thank you for testing.

@mattcaswell thanks I had forgotten about the refcount. Also to @davidben's comment, can you elaborate on which buffers may be uncopyable? From my read that might include (but not be limited to):

  • psk_identity_hint
  • psk_identity
  • srp_username
  • ticket_appdata

Do those need to be ignored? Also do we need to up_ref things like cipher, peer_chain, peer_rpk, etc to avoid having those values freed early when a duplicate session holds them?

@rschu1ze there may be a subsequent test I will ask you to do based on the answer to the above questions

from openssl.

rschu1ze avatar rschu1ze commented on July 20, 2024 1

Yepp, I'll push #24629 (comment) tomorrow (too tired now). But since we (ClickHouse) are mere users of OpenSSL you or other OpenSSL experts might want to double-check.

from openssl.

nhorman avatar nhorman commented on July 20, 2024

I'm not sure if this helps you, but since you appear to be using the aws python sdk, we recently closed this issue:
#24480
Which is in a completely different location, but also related to the aws python sdk, in which we found that parts of the sdk were creating SSL_CTX objects that it was sharing among SSL objects (which is fine) but mutating the shared objects while they were in use in other threads (which is not).

Not sure if it relates here, but it might be worth looking at.

from openssl.

Sashan avatar Sashan commented on July 20, 2024

We might try to provide thread sanitizer with a hint to ignore ssl_session_dup build(). this might help us with clang's sanitizer.

from openssl.

mattcaswell avatar mattcaswell commented on July 20, 2024

Could we move the next/prev members to the end of the structure and memcpy everything but those pointers?

from openssl.

davidben avatar davidben commented on July 20, 2024

Best to avoid the refcount too, since that's also synchronized.

from openssl.

nhorman avatar nhorman commented on July 20, 2024

FWIW, I'm not sure of the feasibility, but RCU might be a solution here. the lock/unlock functions for rcu are already annotated to convince tsan that there is no race condition here. The addition of a read side lock should be effectively no performance impact (ideally), but it sounds like a reasonably easy replacement.

from openssl.

davidben avatar davidben commented on July 20, 2024

I think RCU is way, way overkill here. There is no complex synchronization or anything going on here. Just an overbroad convenience memcpy that never needed to copy the offending fields in the first place.

from openssl.

nhorman avatar nhorman commented on July 20, 2024

If the memcpy can be constrained to only touch fields that aren't protected on the write side, then yes, I would agree, but I'm not currently looking at the code, so I'm not sure of the feasibility there.

from openssl.

mattcaswell avatar mattcaswell commented on July 20, 2024

If the memcpy can be constrained to only touch fields that aren't protected on the write side, then yes, I would agree, but I'm not currently looking at the code, so I'm not sure of the feasibility there.

As noted above I think this can be done simply by moving the offending fields to the end of the structure.

from openssl.

rschu1ze avatar rschu1ze commented on July 20, 2024

I made a temporary fix (since this blocks the next release of ClickHouse), and would replace it by a more comprehensive fix once available.

from openssl.

nhorman avatar nhorman commented on July 20, 2024
diff --git a/ssl/ssl_local.h b/ssl/ssl_local.h
index 9083ec2f3b..3a96c6a136 100644
--- a/ssl/ssl_local.h
+++ b/ssl/ssl_local.h
@@ -540,11 +540,6 @@ struct ssl_session_st {
                                  * load the 'cipher' structure */
     unsigned int kex_group;      /* TLS group from key exchange */
     CRYPTO_EX_DATA ex_data;     /* application specific data */
-    /*
-     * These are used to make removal of session-ids more efficient and to
-     * implement a maximum cache size.
-     */
-    struct ssl_session_st *prev, *next;
 
     struct {
         char *hostname;
@@ -574,6 +569,12 @@ struct ssl_session_st {
     size_t ticket_appdata_len;
     uint32_t flags;
     SSL_CTX *owner;
+
+    /*
+     * These are used to make removal of session-ids more efficient and to
+     * implement a maximum cache size.
+     */
+    struct ssl_session_st *prev, *next;
 };
 
 /* Extended master secret support */
diff --git a/ssl/ssl_sess.c b/ssl/ssl_sess.c
index d326eadb04..67b342df49 100644
--- a/ssl/ssl_sess.c
+++ b/ssl/ssl_sess.c
@@ -138,7 +138,7 @@ static SSL_SESSION *ssl_session_dup_intern(const SSL_SESSION *src, int ticket)
     dest = OPENSSL_malloc(sizeof(*dest));
     if (dest == NULL)
         return NULL;
-    memcpy(dest, src, sizeof(*dest));
+    memcpy(dest, src, sizeof(*dest) - sizeof(SSL_SESSION *) * 2);
 
     /*
      * Set the various pointers to NULL so that we can call SSL_SESSION_free in

@rschu1ze would you be willing to test the above patch? Its based on the suggestion @mattcaswell made above to move the prev/next pointers to the end of the struct and avoid copying them in dup_intern.

from openssl.

rschu1ze avatar rschu1ze commented on July 20, 2024

Thanks. Yes, I will test in a moment.

from openssl.

mattcaswell avatar mattcaswell commented on July 20, 2024

@nhorman - @davidben also suggested moving the refcount too

from openssl.

rschu1ze avatar rschu1ze commented on July 20, 2024

The patch in #24629 (comment) resolves the problem.

from openssl.

rschu1ze avatar rschu1ze commented on July 20, 2024

Side note: This is slightly ridiculous but I just discovered that my colleague @alesapin noticed and fixed the same issue four years ago --> ClickHouse#6 I guess the issue was not reported upstream back then. This was shortly before ClickHouse migrated to boringssl and recently back to OpenSSL.

from openssl.

nhorman avatar nhorman commented on July 20, 2024

ok, @rschu1ze is the implication there that you have a patch for this locally? Care to push it to a PR here so we can complete this?

from openssl.

nhorman avatar nhorman commented on July 20, 2024

associated pr complete, closing. Thank you!

from openssl.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.