Comments (6)
FWIW the original report was on Greenplum DB (Postgres 8.2 compliant). But I also went ahead and tested it out on Postgres 8.4 and the problem persists - although the input string is mangled differently.
Here is a screenshot of the results in Postgres 8.4
Any pointers/workarounds?
Thank you for your attention.
from pljava.
All those times that findbugs slaps my wrists for converting byte streams to Strings without specifying an encoding come to mind.
-s
On Aug 14, 2013, at 2:01 AM, Srivatsan Ramanujam wrote:
I believe PL/java is killing unicode characters (it is probably converting text to a byte stream and reading them as single byte characters - perhaps Latin-1 and not as UTF-8). I don't observe this happening with PL/Python or PL/R.
I basically have a record which looks like the attached image
When I invoke a PL/Java function to simply read this input text field and return it as is, i noticed that the unicode characters are lost. The attached image shows the result.
Here is my PL/java function and the corresponding java snippet.
PL/Java Function
drop function if exists demo.returnString(text) cascade;
create function demo.returnString(text)
returns text
as
'demopkg.Example.returnString'
immutable language pljavau;Java Snippet (in class demopkg.Example) public static String returnString(String tweet) { if (tweet == null) { return null; } Writer writer = null; try { writer = new BufferedWriter(new OutputStreamWriter( new FileOutputStream("/tmp/pljava_out.txt"), "utf-8")); writer.write("Tweet\n"); writer.write(tweet); } catch (IOException ex){ // report } finally { try {writer.close();} catch (Exception ex) {} } return tweet;
}
Here is how I am invoking the SQL
select tweet_body,
demo.returnString(tweet_body) pljava_result,
demo.dummy(tweet_body) plpython_result
from demo.training_data
where id = 'tag:search.twitter.com,2005:356830788370706433'
The file that I am writing out to in the java code (/tmp/pljava_out.txt) shows that the unicode chars have already been lost (i don't see the emoticons in the file). So the error is occurring even before the function returns - perhaps during the postgres type to java type conversion.Things I have tried to debug
- My database is UTF-8
- The "file.encoding" property also returns UTF-8 (when I invoke the PL/java function to return the property).
- The "LOCALE" settings on my machine is also UTF-8.
- The problem only occurs with PL/java (PL/Python and PL/R return the string alright).
- Exhausted Google to search for this, but there is only one other user who has reported it and there is no resolution. Here is the related thread: http://lists.pgfoundry.org/pipermail/pljava-dev/2008/001385.html
—
Reply to this email directly or view it on GitHub.
from pljava.
The bug might be stemming from the invocation of JNIEnv->NewStringUTF() in JNICalls.c.
The mess-up seems to happen only while dealing with non BMP chars (Unicode highpoints - surrogate pairs). This might be related to : http://bugs.sun.com/view_bug.do?bug_id=5030776
from pljava.
Work Around-1
For those who may stumble on this thread (until this bug gets fixed), here is one possible workaround.
Replace unicode highpoints with a character of your choice before passing the string into PL/Java. Since the bug is in PL/Java (the JNICalls.c) this replacement has to be performed outside PL/java. I'm using PL/Python for this (you may use regexp in postgres if you will).
drop function if exists remove_high_points(text) cascade;
create or replace function remove_high_points(tweet text)
returns text
as
$$
import re,sys
#We'll use the std replacement character (Question Mark inside a diamond)
replacement_char = u'\ufffd'
#Check if python on this system is a narrow or a wide-build
highpoints = re.compile(u'[\U00010000-\U0010ffff]',re.UNICODE) if(sys.maxunicode==1114111) else re.compile(u'[\uD800-\uDBFF\uDC00-\uDFFF]',re.UNICODE)
cleaned_tweet=re.sub(highpoints,replacement_char,tweet.decode('utf-8'),re.UNICODE)
return cleaned_tweet
$$ language plpythonu;
I then invoke it so:
select YOUR_PL_JAVA_FUNCTION(remove_high_points(input_string))
from YOUR_TABLE_NAME
The reason why this work-around is okay for me (although it is lossy) is because I don't want this bug to make me run into "invalid byte sequence for character 0xbla bla bla" errors later on.
from pljava.
I think #21 (comment) is probably spot on. Here is a tester to confirm the problem (and, eventually, the fix):
import java.sql.ResultSet;
import java.sql.SQLException;
import org.postgresql.pljava.annotation.SQLAction;
import org.postgresql.pljava.annotation.SQLActions;
import org.postgresql.pljava.annotation.Function;
@SQLActions({
@SQLAction(requires="ctest fn", install=
" with " +
" usable_codepoints ( cp ) as ( " +
" select generate_series(1,x'd7ff'::int) " +
" union all " +
" select generate_series(x'e000'::int,x'10ffff'::int) " +
" ), " +
" test_inputs ( groupnum, cparray, s ) as ( " +
" select " +
" cp / 1024 as groupnum, " +
" array_agg(cp order by cp), string_agg(chr(cp), '' order by cp) " +
" from usable_codepoints " +
" group by groupnum " +
" ), " +
" test_outputs as ( " +
" select groupnum, cparray, s, ctest(s, cparray) as roundtrip " +
" from test_inputs " +
" ), " +
" test_failures as ( " +
" select * " +
" from test_outputs " +
" where " +
" cparray != (roundtrip).cparray or s != (roundtrip).s " +
" or not (roundtrip).matched " +
" ), " +
" test_summary ( n_failing_groups, first_failing_group ) as ( " +
" select count(*), min(groupnum) from test_failures " +
" ) " +
" select " +
" case when n_failing_groups > 0 then " +
" javatest.logmessage('WARNING', format( " +
" '%s 1k codepoint ranges had mismatches, first is block starting 0x%s', " +
" n_failing_groups, to_hex(1024 * first_failing_group))) " +
" end " +
" from test_summary "
),
@SQLAction(
install=
"CREATE TYPE ctestrow AS (matched boolean, cparray integer[], s text)",
remove="DROP TYPE ctestrow",
provides="ctest type"
)
})
public class ctest {
@Function(requires="ctest type", provides="ctest fn", complexType="ctestrow")
public static boolean ctest(String s, int[] ints, ResultSet rs)
throws SQLException {
boolean ok = true;
int cpc = s.codePointCount(0, s.length());
Integer[] myInts = new Integer[cpc];
int ci = 0;
for ( int cpi = 0; cpi < cpc; cpi++ )
{
myInts[cpi] = s.codePointAt(ci);
ci = s.offsetByCodePoints(ci, 1);
}
String myS = new String(ints, 0, ints.length);
if ( ints.length != myInts.length )
ok = false;
else
for ( int i = 0; i < ints.length; ++ i )
if ( ints[i] != myInts[i] )
ok = false;
rs.updateBoolean("matched", true);
rs.updateObject("cparray", myInts);
rs.updateString("s", myS);
return true;
}
}
(requires the pljava-examples jar already installed, for logmessage
. we need a nicer testing story....)
# select sqlj.replace_jar('file:///tmp/ctest-0.0.1-SNAPSHOT.jar', 'ctest', true);
WARNING: 19 Sep 15 22:55:30 org.postgresql.pljava.example.LoggerTest 1024 1k codepoint ranges had mismatches, first is block starting 0x10000
replace_jar
-------------
So everything goes great through the basic multilingual plane, all the higher planes mess up. Sounds exactly like using the JNI fooStringUTF
functions when we need to use something else.
from pljava.
I notice that even though the Java API spec only requires a small handful of encodings to be supported, an actual vanilla Oracle JVM install includes many more than that, and with a little easy munging the names of at least twenty PG encodings can be mapped to them. So we may (in many cases at least) be able to use a solution that skips the pg_do_encoding_conversion
step entirely (and the allocating and copying that goes with it), and apply the JVM's conversion straight to the bytes in server encoding.
This little snippet shows what PG encodings can be mapped to ones the hosting JVM knows:
import java.nio.charset.Charset;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import org.postgresql.pljava.annotation.SQLAction;
import org.postgresql.pljava.annotation.Function;
@SQLAction(requires="jvmcsname fn", install=
" with recursive " +
" encodings (num, name) as ( " +
" values (0, pg_encoding_to_char(0)::text) " +
" union distinct ( " +
" select num+1, pg_encoding_to_char(num+1)::text " +
" from encodings " +
" where '' != pg_encoding_to_char(num+1) " +
" ) " +
" ), " +
" mapped (num, name, jvmname) as ( " +
" select num, name, jvmCharsetName(name) " +
" from encodings " +
" ) " +
" select javatest.logmessage('INFO', name || '(' || num || ') -> ' || " +
" coalesce(jvmname, '')) " +
" from mapped"
)
public class csmap {
static final Pattern munges = Pattern.compile(
"(_|(?<=^KOI8)(?=[^-_]))|(^WIN(?=\\d))"
);
@Function(provides="jvmcsname fn")
public static String jvmCharsetName(String pgname) {
Charset c = null;
try {
c = Charset.forName( pgname);
}
catch ( IllegalArgumentException iae ) { }
if ( null == c ) {
Matcher m = munges.matcher( pgname);
StringBuffer sb = new StringBuffer();
while ( m.find() ) {
String s;
if ( null != ( s = m.group( 1) ) )
m.appendReplacement( sb, "-");
else if ( null != ( s = m.group( 2) ) )
m.appendReplacement( sb, "windows-");
}
m.appendTail( sb);
try {
c = Charset.forName( sb.toString());
}
catch ( IllegalArgumentException iae ) { }
}
if ( null == c )
return null;
return c.name();
}
}
from pljava.
Related Issues (20)
- JSONB type mapping HOT 2
- wiki update: prebuilt docker images HOT 2
- Vulnerability scan: multiple issues HOT 12
- `NEWLINE` pattern can fail to match HOT 1
- wiki update: testcontainers magic HOT 2
- Error building against PostgreSQL 16 HOT 4
- Fails on s390x HOT 1
- Always prompts that the language pljava does not exist HOT 4
- Class Loading issue HOT 31
- Question: How to install Pl/Java into AWS RDS Postgres HOT 2
- Compilen error occuring while running mvn clean install HOT 2
- dependency not adding..issue in plugin HOT 1
- fatal error: postgres.h: No such file or directory HOT 2
- XML parsing errors reported as `XX000` when DOM API is used HOT 1
- Documentation comments for sandboxed and unsandboxed grants in pljava.policy inverted HOT 6
- Publish PlJava to a public Maven Repository HOT 2
- Unhelpful output when build fails because no platform rules matched HOT 10
- undefined symbol: GetMemoryChunkContext HOT 2
- pljava.ddr This script file related issues HOT 13
- Execute Java Code at PostgreSQL startup to "Get Things Ready" HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from pljava.