
duolicious-backend's Introduction

Duolicious Backend

There are screenshots of the app at https://github.com/duolicious.

Contributing

There are three ways you can contribute:

  1. Tell your friends about Duolicious and share on social media! This is the best way to make it grow.
  2. Donate on Ko-fi: https://ko-fi.com/duolicious
  3. Raise a pull request. Developer instructions can be found at DEVELOPER.md.

duolicious-backend's People

Contributors

duogenesis


duolicious-backend's Issues

Deal with old chats

Nobody should be able to participate in a chat where any of the following is true (a rough check is sketched after this list):

  • One person is blocked by the other
  • One person deleted their account
  • One person deactivated their account
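
A rough sketch of the kind of check this implies. The blocked table and the activated column are assumptions, not the actual schema:

-- Sketch only: `blocked` and `person.activated` are assumed, not actual schema.
-- True means the chat between :person_a and :person_b should be closed.
SELECT
  -- one person is blocked by the other
  EXISTS (
    SELECT 1
    FROM blocked
    WHERE (blocker_id = :person_a AND blocked_id = :person_b)
       OR (blocker_id = :person_b AND blocked_id = :person_a)
  )
  -- one person deactivated their account
  OR EXISTS (
    SELECT 1
    FROM person
    WHERE id IN (:person_a, :person_b) AND NOT activated
  )
  -- one person deleted their account (their person row no longer exists)
  OR (SELECT count(*) FROM person WHERE id IN (:person_a, :person_b)) < 2
  AS chat_is_closed;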

Filter explicit images automatically

During my pilot deployment, users uploaded more explicit images than I expected: about 4% of all uploads were explicit. That doesn't sound like a lot, but even one explicit image can be enough to ruin the experience for most users. Less than 0.1% seems like a good goal. There must be a tiny, free neural net I can run each time an image is uploaded.

Deactivate accounts of people who have never been online

If someone's never been online before, they'll have no entry in the duo_chat.last table. The account deactivator requires them to be online at least once to figure out when they were last online. One solution would be to insert a row into the last table when someone signs up and set it to now().

Related: #60.
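
A sketch of that fix. The column names here are assumptions, not the actual duo_chat.last schema:

-- Sketch only: column names are assumed. Run at sign-up so the account
-- deactivator always has a last-online time to compare against.
INSERT INTO last (username, seconds, state)
VALUES (:person_uuid, EXTRACT(EPOCH FROM NOW())::BIGINT, '');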

Feature: More specific bio tags about kids

Include options which communicate something to the effect of:

  • I'm open to becoming a step parent
  • I'm on the fence but maybe I can be convinced

In general, a dictionary disambiguating the meaning of each bio option would be nice.

Here are some suggestions from ChatGPT:

  1. Definitely want kids: I'm looking to start or expand a family in the future.
  2. Do not want kids: I'm certain I don't want children.
  3. Open to being a step-parent: I'd welcome the opportunity to be a part of a pre-existing family.
  4. Undecided about biological kids: I'm uncertain about having my own children, but I could be open to the idea.
  5. Open to adoption or fostering: I'd consider adopting or fostering children.
  6. Depends on partner: My decision largely depends on my future partner's wishes.
  7. It's complicated: My feelings on this are complex and are best discussed in person.

Use ejabberd

MongooseIM is using 12% CPU with 34 open connections. But ejabberd doesn't have mod_inbox. Someone on Stack Exchange said it should take about a day for someone who knows Elixir to port the module, though.

Traits' orders are only partially sorted

Traits for which the app doesn't have enough data tend to appear near the middle of the list for some users. Not sure why this is happening. I think I might've assigned those traits a numerical value of 0 instead of -1, which is what it used to be.

Photo deletion strategy

Consider having a "photo graveyard" table. In the same transaction where photos are deleted from their usual tables, add uuids to the graveyard. Delete old photos in a batch job. Make sure also to handle updated photos.
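
A sketch of that idea. The table and column names (photo, photo_graveyard, uuid, position) are assumptions, not the actual schema:

BEGIN;

-- Sketch only: move the photo's uuid into the graveyard in the same
-- transaction that deletes it from its usual table, so nothing is lost
-- between the two steps.
INSERT INTO photo_graveyard (uuid)
SELECT uuid FROM photo WHERE person_id = :person_id AND position = :position;

DELETE FROM photo WHERE person_id = :person_id AND position = :position;

COMMIT;

-- A batch job can later delete the underlying image files for every uuid in
-- photo_graveyard and remove the rows it has processed. Photo updates need
-- the same treatment: the replaced uuid goes to the graveyard too.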

Search: Bigger `LIMIT` during initial pass, smaller `LIMIT` during final pass

Change this to something like 2500, then add a limit here of something like 250 (a rough sketch of where the two limits sit follows this list). Why? Because:

  • Having a bigger limit in the initial pass makes the approximation less wrong.
  • Bigger limits make the query take more time, but inserting into search_cache is the slowest part of the query. So we can maintain a similar query speed while improving search result accuracy by making the initial pass much bigger and the final pass only slightly smaller.
  • More concretely, when there are 1000 profiles, the entire query takes about 120 ms. Just the selection, without inserting into search_cache, takes about 30 ms, so insertion takes about 90 ms. If those two parts of the query scale linearly with the sizes of the limits, the limits suggested above would make the query take about 2500 * (30 / 1000) + 250 * (90 / 1000) = 97.5 ms.
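
The sketch below only shows the shape being described. Apart from search_cache, the table, column, and score names are hypothetical stand-ins for whatever the real query uses:

-- Sketch only: `candidate`, `approximate_score` and `final_score` are
-- hypothetical stand-ins for the real search query's internals.
WITH initial_pass AS (
  SELECT prospect_person_id, final_score
  FROM candidate
  ORDER BY approximate_score DESC
  LIMIT 2500  -- bigger limit: selection is cheap and the approximation gets less wrong
)
INSERT INTO search_cache (searcher_person_id, prospect_person_id)
SELECT :searcher_person_id, prospect_person_id
FROM initial_pass
ORDER BY final_score DESC
LIMIT 250;    -- smaller limit: inserting into search_cache is the slow part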

Do something about the reply rate

The reply rate sucks. Consider matching people who have a high reply rate. It'll compound the issue for people who never talk, but they never talk anyway.

Ideally, the app would get the tight-lipped folks speaking too.
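
One way to estimate a reply rate from data the backend already has is sketched below, using the mam_message table that appears later on this page. Treat it as an approximation rather than the app's actual metric:

-- Sketch only: approximates a user's reply rate as the fraction of
-- conversations containing an incoming message that also contain at least
-- one outgoing message.
SELECT
  user_id,
  AVG(CASE WHEN has_outgoing THEN 1.0 ELSE 0.0 END) AS reply_rate
FROM (
  SELECT
    user_id,
    remote_bare_jid,
    BOOL_OR(direction = 'O') AS has_outgoing
  FROM mam_message
  GROUP BY user_id, remote_bare_jid
  HAVING BOOL_OR(direction = 'I')
) AS conversation
GROUP BY user_id;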

`begin; drop table mam_message_backup; commit;`

I might've fixed the root cause of #82. Either way, I cleaned up the DB a little bit with this:

begin;

CREATE TABLE mam_message_backup(
  -- Message UID (64 bits)
  -- A server-assigned UID that MUST be unique within the archive.
  id BIGINT NOT NULL,
  user_id INT NOT NULL,
  -- FromJID used to form a message without looking into stanza.
  -- This value will be sent to the client "as is".
  from_jid varchar(250) NOT NULL,
  -- The remote JID that the stanza is to (for an outgoing message) or from (for an incoming message).
  -- This field is for sorting and filtering.
  remote_bare_jid varchar(250) NOT NULL,
  remote_resource varchar(250) NOT NULL,
  -- I - incoming, remote_jid is a value from From.
  -- O - outgoing, remote_jid is a value from To.
  -- Has no meaning for MUC-rooms.
  direction mam_direction NOT NULL,
  -- Term-encoded message packet
  message bytea NOT NULL,
  search_body text,
  origin_id varchar,
  PRIMARY KEY(user_id, id)
);

WITH t1 AS (
  -- Number each group of duplicates: rows with the same sender, recipient,
  -- direction and body get consecutive row numbers, ordered by id.
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY
        user_id, from_jid, remote_bare_jid, remote_resource, direction, search_body
      ORDER BY
        user_id, from_jid, remote_bare_jid, remote_resource, direction, search_body, id
    ) AS rn
  FROM mam_message
  where search_body <> ''
), t2 AS (
  -- Keep only the duplicates (rn > 1) among incoming messages, preserving
  -- the earliest copy of each.
  SELECT
      *
  FROM t1
  where rn > 1 and direction = 'I'
  order by search_body, rn
), t3 AS (
  -- Delete the duplicates from mam_message...
  delete from mam_message where id in (select id from t2) returning *
)
-- ...and stash them in the backup table in case this turns out to be wrong.
insert into mam_message_backup
select * from t3;

commit;

Now the situation's like this:

duo_chat=# select count(*), direction
from mam_message
group by from_jid, remote_bare_jid, direction, search_body
having count(*) > 1
order by direction, count desc;
 count | direction 
-------+-----------
     5 | I
     5 | I
     4 | I
     4 | I
     3 | I
     3 | I
     3 | I
     3 | I
     3 | I
     3 | I
     2 | I
     2 | I
     2 | I
     2 | I
     2 | I
     2 | I
     2 | I
     2 | I
     2 | I
     2 | I
     2 | I
     2 | I
     2 | I
     2 | I
     2 | I
     2 | I
     2 | I
     2 | I
     2 | I
     2 | I
     2 | I
     2 | I
     2 | I
     5 | O
     4 | O
     4 | O
     2 | O
     2 | O
     2 | O
     2 | O
     2 | O
     2 | O
     2 | O
     2 | O
     2 | O
     2 | O
(46 rows)

That query once returned 593 rows.

Anywho, I'm gonna let that cook for a little bit and keep monitoring the situation. At some point I should hopefully be able to give it one of these:

begin;
drop table mam_message_backup;
commit;

Start shuffling questions after 100, not 250

The question bank has been ordered so that questions of better quality come first. Consequently, users would have the best experience if they answered them in order, given the current matching algorithm.

But I'd like to improve the algorithm in the future. I plan to do that with an autoencoder, but I need training data for that. If only a very small minority of users ever answer the later questions, I wouldn't have enough training data to predict answers to those later questions, given users' initial answers. So I'm currently randomising the order of the questions, but only after users have answered 250 questions. Having the order of the initial questions be fixed and the later questions be random strikes a balance between (1) optimising the performance of the current algorithm, and (2) optimising the size of the training set I can use to build a better model.
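
As a rough sketch of what that ordering might look like in SQL (person.count_answers is real, per the query further down, but the question and answer tables and their columns are assumed):

-- Sketch only: serve unanswered questions in their curated order until the
-- user has answered :threshold of them, then shuffle the remainder.
SELECT q.id, q.question
FROM question AS q
LEFT JOIN answer AS a
  ON a.question_id = q.id AND a.person_id = :person_id
WHERE a.question_id IS NULL
ORDER BY
  CASE
    WHEN (SELECT count_answers FROM person WHERE id = :person_id) < :threshold
    THEN q.position  -- below the threshold: fixed, curated order
  END,
  RANDOM();          -- at or past the threshold: random order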

The "250" threshold was chosen by feel, after I tested the personality assessment on myself, to check how few answers I needed to give before the system accurately determined my personality. But while 250 questions was better, 100 worked well too. I've also gotten more data since my initial self-test. I had 50 other users sign up by posting the app on a social media website. Based on their usage, a threshold of 100 would increase the percentage of users who get past the initial, fixed portion of the question bank from about 3% to about 35%:

postgres=# select row_number() over (order by count_answers desc), percent_rank() over (order by count_answers desc), count_answers from person;
 row_number | percent_rank | count_answers 
------------+--------------+---------------
          1 |            0 |          1200
          2 |         0.02 |           564 <- Only 2 out of 51 people answered 250 questions or more
          3 |         0.04 |           242    (and I suspect the person who answered 1200 answered randomly)
          4 |         0.06 |           241
          5 |         0.08 |           226
          6 |          0.1 |           217
          7 |         0.12 |           203
          8 |         0.14 |           186
          9 |         0.16 |           183
         10 |         0.18 |           158
         11 |          0.2 |           148
         12 |         0.22 |           132
         13 |         0.24 |           115
         14 |         0.26 |           114
         15 |         0.28 |           113
         16 |          0.3 |           111
         17 |         0.32 |           110
         18 |         0.34 |           105  <- 18 out of 51 people answered at least 100 questions
         19 |         0.36 |            90
         20 |         0.38 |            87
         21 |          0.4 |            84
         22 |         0.42 |            83
         23 |         0.44 |            72
         24 |         0.46 |            62
         25 |         0.48 |            53
         26 |          0.5 |            50
         27 |         0.52 |            45
         28 |         0.54 |            42
         29 |         0.56 |            41
         30 |         0.56 |            41
         31 |          0.6 |            40
         32 |         0.62 |            35
         33 |         0.64 |            30
         34 |         0.66 |            25
         35 |         0.68 |            24
         36 |         0.68 |            24
         37 |         0.72 |            21
         38 |         0.74 |            16
         39 |         0.76 |            14
         40 |         0.76 |            14
         41 |          0.8 |            12
         42 |         0.82 |             9
         43 |         0.84 |             8
         44 |         0.86 |             5
         45 |         0.88 |             3
         46 |         0.88 |             3
         47 |         0.92 |             1
         48 |         0.94 |             0
         49 |         0.94 |             0
         50 |         0.94 |             0
         51 |         0.94 |             0

Some notifications are doubled

Immediately upon deploying #55, I noticed that 59 emails had been sent to 58 distinct recipients. That means one recipient received two notification emails. That person had at least one unread chat and intro. The two notifications were sent at 05-09-2023 04:10:45 and 05-09-2023 04:08:04.

Edit: I've been looking at the notifications sent. I've noticed that at least five other people had unread chats and intros, yet they only got one notification email each.

Implement moveToChats using the XMPP proxy

This logic should be in the XMPP proxy. That should make it more reliable.

The motivation for this ticket is that some conversations remain in the "intros" box of the person who sent them. They should have been in the "chats" box.

Traits approach 50% as people answer more questions

Someone who's answered a few questions tends to have quite extreme values for traits (near 0% or 100%). As they answer more questions, the value tends towards 50%. This shouldn't affect the order of matches too much, but it makes the in-depth screen less useful. Consider re-introducing the information > 0.2 threshold as a fix.

Related: #148
