PL/Java kills unicode? about pljava HOT 6 CLOSED

tada commented on July 19, 2024

PL/Java kills unicode?

from pljava.

Comments (6)

vatsan commented on July 19, 2024

FWIW the original report was on Greenplum DB (Postgres 8.2 compliant). But I also went ahead and tested it out on Postgres 8.4 and the problem persists - although the input string is mangled differently.

Here is a screenshot of the results in Postgres 8.4

Any pointers/workarounds?
Thank you for your attention.

from pljava.

skelter commented on July 19, 2024

All those times that findbugs slaps my wrists for converting byte streams to Strings without specifying an encoding come to mind.

-s

On Aug 14, 2013, at 2:01 AM, Srivatsan Ramanujam wrote:

I believe PL/java is killing unicode characters (it is probably converting text to a byte stream and reading them as single byte characters - perhaps Latin-1 and not as UTF-8). I don't observe this happening with PL/Python or PL/R.

I basically have a record which looks like the attached image

When I invoke a PL/Java function to simply read this input text field and return it as is, i noticed that the unicode characters are lost. The attached image shows the result.

Here is my PL/java function and the corresponding java snippet.
PL/Java Function
drop function if exists demo.returnString(text) cascade;
create function demo.returnString(text)
returns text
as
'demopkg.Example.returnString'
immutable language pljavau;
Java Snippet (in class demopkg.Example)

public static String returnString(String tweet) {
if (tweet == null) {
    return null;
}
Writer writer = null;

try {
    writer = new BufferedWriter(new OutputStreamWriter(
          new FileOutputStream("/tmp/pljava_out.txt"), "utf-8"));
    writer.write("Tweet\n");
    writer.write(tweet);
} catch (IOException ex){
  // report
} finally {
   try {writer.close();} catch (Exception ex) {}
}
return tweet;
}

Here is how I am invoking the SQL

select tweet_body,
demo.returnString(tweet_body) pljava_result,
demo.dummy(tweet_body) plpython_result
from demo.training_data
where id = 'tag:search.twitter.com,2005:356830788370706433'
The file that I am writing out to in the java code (/tmp/pljava_out.txt) shows that the unicode chars have already been lost (i don't see the emoticons in the file). So the error is occurring even before the function returns - perhaps during the postgres type to java type conversion.

Things I have tried to debug

My database is UTF-8

The "file.encoding" property also returns UTF-8 (when I invoke the PL/java function to return the property).

The "LOCALE" settings on my machine is also UTF-8.

The problem only occurs with PL/java (PL/Python and PL/R return the string alright).

Exhausted Google to search for this, but there is only one other user who has reported it and there is no resolution. Here is the related thread: http://lists.pgfoundry.org/pipermail/pljava-dev/2008/001385.html

—
Reply to this email directly or view it on GitHub.

from pljava.

vatsan commented on July 19, 2024

The bug might be stemming from the invocation of JNIEnv->NewStringUTF() in JNICalls.c.
The mess-up seems to happen only while dealing with non BMP chars (Unicode highpoints - surrogate pairs). This might be related to : http://bugs.sun.com/view_bug.do?bug_id=5030776

from pljava.

vatsan commented on July 19, 2024

Work Around-1

For those who may stumble on this thread (until this bug gets fixed), here is one possible workaround.
Replace unicode highpoints with a character of your choice before passing the string into PL/Java. Since the bug is in PL/Java (the JNICalls.c) this replacement has to be performed outside PL/java. I'm using PL/Python for this (you may use regexp in postgres if you will).

drop function if exists remove_high_points(text) cascade;
create or replace function remove_high_points(tweet text) 
     returns text
as
$$           
     import re,sys
     #We'll use the std replacement character (Question Mark inside a diamond)
     replacement_char = u'\ufffd'
     #Check if python on this system is a narrow or a wide-build
     highpoints = re.compile(u'[\U00010000-\U0010ffff]',re.UNICODE) if(sys.maxunicode==1114111) else re.compile(u'[\uD800-\uDBFF\uDC00-\uDFFF]',re.UNICODE)
     cleaned_tweet=re.sub(highpoints,replacement_char,tweet.decode('utf-8'),re.UNICODE)
     return cleaned_tweet        
$$ language plpythonu;

I then invoke it so:
select YOUR_PL_JAVA_FUNCTION(remove_high_points(input_string))
from YOUR_TABLE_NAME

The reason why this work-around is okay for me (although it is lossy) is because I don't want this bug to make me run into "invalid byte sequence for character 0xbla bla bla" errors later on.

from pljava.

jcflack commented on July 19, 2024

I think #21 (comment) is probably spot on. Here is a tester to confirm the problem (and, eventually, the fix):

import java.sql.ResultSet;
import java.sql.SQLException;

import org.postgresql.pljava.annotation.SQLAction;
import org.postgresql.pljava.annotation.SQLActions;
import org.postgresql.pljava.annotation.Function;

@SQLActions({
  @SQLAction(requires="ctest fn", install=
"   with " +
"    usable_codepoints ( cp ) as ( " +
"     select generate_series(1,x'd7ff'::int) " +
"     union all " +
"     select generate_series(x'e000'::int,x'10ffff'::int) " +
"    ), " +
"    test_inputs ( groupnum, cparray, s ) as ( " +
"     select " +
"       cp / 1024 as groupnum, " +
"       array_agg(cp order by cp), string_agg(chr(cp), '' order by cp) " +
"     from usable_codepoints " +
"     group by groupnum " +
"    ), " +
"    test_outputs as ( " +
"     select groupnum, cparray, s, ctest(s, cparray) as roundtrip " +
"     from test_inputs " +
"    ), " +
"    test_failures as ( " +
"     select * " +
"     from test_outputs " +
"     where " +
"      cparray != (roundtrip).cparray or s != (roundtrip).s " +
"      or not (roundtrip).matched " +
"    ), " +
"    test_summary ( n_failing_groups, first_failing_group ) as ( " +
"     select count(*), min(groupnum) from test_failures " +
"    ) " +
"   select " +
"    case when n_failing_groups > 0 then " +
"     javatest.logmessage('WARNING', format( " +
"      '%s 1k codepoint ranges had mismatches, first is block starting 0x%s', " +
"      n_failing_groups, to_hex(1024 * first_failing_group))) " +
"    end " +
"    from test_summary "
  ),
  @SQLAction(
    install=
      "CREATE TYPE ctestrow AS (matched boolean, cparray integer[], s text)",
    remove="DROP TYPE ctestrow",
    provides="ctest type"
  )
})
public class ctest {
  @Function(requires="ctest type", provides="ctest fn", complexType="ctestrow")
  public static boolean ctest(String s, int[] ints, ResultSet rs)
  throws SQLException {
    boolean ok = true;

    int cpc = s.codePointCount(0, s.length());
    Integer[] myInts = new Integer[cpc];
    int ci = 0;
    for ( int cpi = 0; cpi < cpc; cpi++ )
    {
      myInts[cpi] = s.codePointAt(ci);
      ci = s.offsetByCodePoints(ci, 1);
    }

    String myS = new String(ints, 0, ints.length);

    if ( ints.length != myInts.length )
      ok = false;
    else
      for ( int i = 0; i < ints.length; ++ i )
        if ( ints[i] != myInts[i] )
          ok = false;

    rs.updateBoolean("matched", true);
    rs.updateObject("cparray", myInts);
    rs.updateString("s", myS);
    return true;
  }
}

(requires the pljava-examples jar already installed, for logmessage. we need a nicer testing story....)

# select sqlj.replace_jar('file:///tmp/ctest-0.0.1-SNAPSHOT.jar', 'ctest', true);
WARNING:  19 Sep 15 22:55:30 org.postgresql.pljava.example.LoggerTest 1024 1k codepoint ranges had mismatches, first is block starting 0x10000
 replace_jar 
-------------

So everything goes great through the basic multilingual plane, all the higher planes mess up. Sounds exactly like using the JNI fooStringUTF functions when we need to use something else.

from pljava.

jcflack commented on July 19, 2024

I notice that even though the Java API spec only requires a small handful of encodings to be supported, an actual vanilla Oracle JVM install includes many more than that, and with a little easy munging the names of at least twenty PG encodings can be mapped to them. So we may (in many cases at least) be able to use a solution that skips the pg_do_encoding_conversion step entirely (and the allocating and copying that goes with it), and apply the JVM's conversion straight to the bytes in server encoding.

This little snippet shows what PG encodings can be mapped to ones the hosting JVM knows:

import java.nio.charset.Charset;

import java.util.regex.Pattern;
import java.util.regex.Matcher;

import org.postgresql.pljava.annotation.SQLAction;
import org.postgresql.pljava.annotation.Function;

@SQLAction(requires="jvmcsname fn", install=
" with recursive " +
"   encodings (num, name) as ( " +
"     values (0, pg_encoding_to_char(0)::text) " +
"     union distinct ( " +
"      select num+1, pg_encoding_to_char(num+1)::text " +
"      from encodings " +
"      where '' != pg_encoding_to_char(num+1) " +
"    ) " +
"   ), " +
"   mapped (num, name, jvmname) as ( " +
"     select num, name, jvmCharsetName(name) " +
"    from encodings " +
"   ) " +
" select javatest.logmessage('INFO', name || '(' || num || ') -> ' || " +
"   coalesce(jvmname, '')) " +
" from mapped"
)
public class csmap {

  static final Pattern munges = Pattern.compile(
    "(_|(?<=^KOI8)(?=[^-_]))|(^WIN(?=\\d))"
  );

  @Function(provides="jvmcsname fn")
  public static String jvmCharsetName(String pgname) {
    Charset c = null;

    try {
      c = Charset.forName( pgname);
    }
    catch ( IllegalArgumentException iae ) { }

    if ( null == c ) {
      Matcher m = munges.matcher( pgname);
      StringBuffer sb = new StringBuffer();
      while ( m.find() ) {
        String s;
        if ( null != ( s = m.group( 1) ) )
          m.appendReplacement( sb, "-");
        else if ( null != ( s = m.group( 2) ) )
          m.appendReplacement( sb, "windows-");
      }
      m.appendTail( sb);
      try {
        c = Charset.forName( sb.toString());
      }
      catch ( IllegalArgumentException iae ) { }
    }

    if ( null == c )
      return null;

    return c.name();
  }
}

from pljava.

PL/Java kills unicode? about pljava HOT 6 CLOSED

Comments (6)

Work Around-1

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent