Comments (22)

metaskills commented on May 22, 2024

Could you please provide more information about what you are proposing or had to do? For instance, do you have a patch? What methods are you patching?

Also, have you looked at the section in the README pertaining to native string types and unicode column configuration? Not sure if this ties into your issue, but it does detail the encoding handling the adapter does at the "column" level, not the connection level.

Lastly, the adapter is tested on a non-unicode DB too, and because of the things it does for "national/unicode" column types, it passes all the UTF8 tests we have thrown at it. So it is critical that you be detailed in your explanation and proposed changes.

Excelan commented on May 22, 2024

Thanks for the quick response!
I tried adding an encoding option to my database.yml:

  adapter: sqlserver
  mode: odbc
  dsn: excel-sctest
  encoding: cp1251

But it had no effect.
I also didn't find anything matching the 'encod' pattern in the gem directory.
So I concluded there is no setting for encoding.

I found a possible solution (in Russian: http://drazdsoft.com/articles.php?article_id=6),
which proposes adding Iconv conversions right inside the "raw" methods:

  require 'iconv'

  ENC_EXEC   = Iconv.new("cp1251", "utf-8")  # app (UTF-8) -> DB (cp1251)
  ENC_SELECT = Iconv.new("utf-8", "cp1251")  # DB (cp1251) -> app (UTF-8)

  def raw_connection_run(sql)
    sql = ENC_EXEC.iconv(sql)
    # ...
  end

  def raw_connection_do(sql)
    sql = ENC_EXEC.iconv(sql)
    # ...
  end

  def handle_to_fields_and_rows_odbc(handle)
    # ...
    rows = results.inject([]) do |rows, row|
      row.each_with_index do |value, i|
        row[i] = ENC_SELECT.iconv(value) if value.is_a?(String)
      end
      rows << row
    end
    # ...
  end
This works, but I haven't run it against any of the adapter's tests.

Though I guess there is a better way to set the encoding in which data goes to and comes from the database?
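
For what it's worth, the same conversions could probably be hooked in from an initializer instead of editing the gem. A rough, untested sketch, assuming the 2.x adapter's ODBC mode and the method names quoted above:

  # config/initializers/sqlserver_cp1251.rb -- sketch only, untested
  require 'iconv'

  ActiveRecord::ConnectionAdapters::SQLServerAdapter.class_eval do
    CP1251_OUT = Iconv.new("cp1251", "utf-8")  # app (UTF-8) -> DB (cp1251)

    alias_method :raw_connection_run_without_cp1251, :raw_connection_run

    # Re-encode outgoing SQL before handing it to the original method.
    def raw_connection_run(sql)
      raw_connection_run_without_cp1251(CP1251_OUT.iconv(sql))
    end
  end

raw_connection_do and handle_to_fields_and_rows_odbc could be wrapped the same way, which keeps the patch out of the gem and easy to remove once the adapter grows a proper setting.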

Excelan commented on May 22, 2024

"the adapter is tested on a non-unicode DB too"

Ken, I didn't understand what you meant. Was it tested to work with a non-unicode database from a non-unicode app?
Or is there a way to use non-unicode database columns from a UTF-8 application?

metaskills commented on May 22, 2024

Well, the test coverage explains this pretty well; let me see if I can recap it better. The adapter has tests that cover unicode data coming in and out of AR objects' attributes (columns) in a mixed schema. That means that most tables have columns that are not national and some tables have national columns. We specifically test that schema reflection correctly casts/converts national columns coming out of and going back into the DB. Simple use cases like a Post#body when setting attributes on instances.

There have been a few tickets opened up, like #44, that cover an edge case in quoting when using unicode strings as conditions/predicates. There is also a ticket that came up when the column name itself is unicode, and I cannot remember now if that one was applied.

In general all this assumes that you are working with an ODBC (or other) low-level connection that handles UTF8. If you can give me more info on your specific issue and/or tell me if it lines up with the ones I mentioned, I can help. Tested patches are the way to go. This one should be applied on 2-3-stable too.

Excelan commented on May 22, 2024

I have a large, complex database used by a heavily loaded application, where strings are stored in the cp1251 encoding and thus cannot be converted. However, I decided my new Rails app should use its native UTF-8 to avoid possible pitfalls.
So the adapter seemed to be the best place to convert data between the app encoding and CP1251.
I convert queries to cp1251 and results to UTF-8.
But to do this I had to change the adapter code.
I don't like that much, and I wonder whether there is another 'legal'/better way to convert data between the application and the database.

KDGundermann commented on May 22, 2024

I have the same problem here, and I am currently trying to write some tests covering these problems. One part of the problem is ruby-odbc, which delivers the data as ASCII-8BIT.
I hope I will have some tests (and maybe a solution for 2.3.x) next week.

KDGundermann commented on May 22, 2024

So, I am able to read the codepage from the database, but:

  • currently the ActiveRecord adapter returns all strings with the encoding "ASCII-8BIT"
    under Ruby 1.9
  • should the adapter convert the data to the database encoding (e.g. CP-1251)?
  • or should the adapter convert all data to UTF-8,
    as Yehuda recommends in http://yehudakatz.com/2010/05/17/encodings-unabridged/ ?

The difference between the two options is sketched below.
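
To make the two options concrete, here is a small Ruby 1.9 sketch (the byte values and variable names are just for illustration):

  raw = "\xCF\xF0\xE8\xE2\xE5\xF2"        # bytes exactly as ruby-odbc delivers them
  raw.encoding                            # => #<Encoding:ASCII-8BIT>

  # Option 1: tag the bytes with the database codepage; the bytes stay untouched.
  db_value = raw.force_encoding("Windows-1251")

  # Option 2: additionally transcode to UTF-8 for the application.
  utf_value = db_value.encode("UTF-8")    # => "Привет"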

Excelan commented on May 22, 2024

It should convert data to the user-defined codepage (CP-1251), since the whole idea is to work with a database that cannot be converted to UTF-8 (e.g. because it is used by other non-UTF applications).

metaskills commented on May 22, 2024

OK, this is good stuff. Klaus, do you have any recommendations? I just read that article and I want to pick your brain on a few things. For instance, I am a bit unclear on the "utf8" version of the ODBC binding, but I will test that today for the new adapter in my RVM setup and report back. That still leaves what should be done in the adapter to support the possibility of all strings coming back from the DB as UTF8.

I see a few issues, or things that confuse me. I believe SQL Server is in an odd place compared to other DBs. First, our data types are either UTF8-friendly or not, i.e. nvarchar or varchar. So let's say we recommend using an ODBC layer that returns all strings encoded as UTF8. That would fix column names that are non-ASCII and would seem to work just fine. But would bad things happen on the way back into the DB for those that did not declare their columns as "national"? From what I read in that article, this should only be a problem for those that expect national characters to be persisted to the DB and did not do the legwork of choosing the right data types, which is a user issue.

But will the adapter quote all string conditions with an N'xxx' prefix and be cool with that? Will we need to enhance the case-insensitive equality operator? I'm not sure.
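
For illustration only, the quoting question boils down to something like the hypothetical helper below (this is not the adapter's actual quote method, which lives in its own quoting code):

  def quote_sqlserver_string(value, national = false)
    escaped = value.gsub("'", "''")
    national ? "N'#{escaped}'" : "'#{escaped}'"
  end

  quote_sqlserver_string("café", true)   # => "N'café'"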

@Excelan, sorry I did not understand your ticket at first. I think it is important to solve the bigger issue Yehuda raises first, and then see if we can thread your expectations in as well. I'm not sure what we can do if the ODBC transport returns either ASCII-8BIT or UTF8. Is CP1251 a superset of either of those, or are we just going to have to provide solid hook points for the iconv conversions to happen?

metaskills commented on May 22, 2024

I just finished a project called TinyTds and updated both the 2-3 and 3 branches/gem to allow a :dblib connection option which would use TinyTds.

http://github.com/rails-sqlserver/tiny_tds

Basically, this is more than likely going to be the only client/connection mode that supports user/client encodings. It even tags them correctly in Ruby 1.9. Please take a look and file any issues under that project if not.

Excelan commented on May 22, 2024

Many thanks, Ken.
I'll try it in my configuration a little later, after coming back to the project.
But I think you still don't understand my initial question.
CP1251 is not a superset of UTF8, but rather vice versa: UTF8 contains all the characters that are in cp1251.
I read Yehuda's article, and it mostly deals with Asian languages. In my case, as with most other encodings, there is no such problem: data can be converted from cp1251 to UTF8 and back without loss. All I wanted was a configuration-file parameter where one can set the encoding in which data is stored in the DB. Yehuda, on the other hand, wrote about situations where you cannot convert data from some encoding to UTF8 and back without losing information.
Do we understand each other?

metaskills commented on May 22, 2024

Probably not :( Encodings are really not my strong point. From my perspective, if I can get binary and unicode data in and out of the DB, with a configuration option to re-encode that to a different client encoding, then success.

So what I hear you saying is that you need another lower-level setting. I wonder if the charset option of FreeTDS would help here. From my previous understanding, SQL Server always stores strings in non-national columns as ASCII-8BIT and in national columns as UCS-2. Here are a few links to some tests and notes I had while writing the new client. I still think there is a low-level config for the TDS lib that handles this. If you have the time and can articulate a test string that I can manually insert and test for in TinyTds, I would do so.

http://github.com/rails-sqlserver/tiny_tds/blob/master/test/result_test.rb#L112
http://github.com/rails-sqlserver/tiny_tds/commit/fc253452ddfe54bf181e550909d34332076ccdab#L0L46
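
In case it helps, here is a rough sketch of that kind of manual check with TinyTds (connection details and the table name are invented; :encoding is one of the documented client options):

  require 'tiny_tds'

  client = TinyTds::Client.new(
    :username => 'sa', :password => 'secret', :host => 'mssql.example.com',
    :database => 'test', :encoding => 'UTF-8'
  )

  test_string = 'Привет, мир'  # Cyrillic sample, representable in CP1251
  client.execute("INSERT INTO encoding_tests (name) VALUES (N'#{test_string}')").do
  client.execute("SELECT name FROM encoding_tests").each do |row|
    puts row['name']           # expect the same string back, tagged as UTF-8
  end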

KDGundermann commented on May 22, 2024

I have tried to solve the problem in my monkeypatch for the sqlserver-adapter 2.3.8.
I have written some tests for ruby-odbc trying to tackle this problem.
I will try to write some tests for TinyTds too, but as I am really busy atm, it will take some days...

KDGundermann commented on May 22, 2024

High level summary:

When using nchar/nvarchar/ntext fields in a SQL Server database, ODBC is happy to accept UTF-8 strings as long as they are quoted with N''.
ODBC will somehow convert the UTF-8 string to UCS-2 and send it over the wire to the database. So as long as the RoR application works only with UTF-8 strings, everybody is happy.

When you are using char/varchar/text columns in your database, that data will be stored as 8-bit chars which are assumed to have an encoding corresponding to a specific codepage. The codepage is configured in the options of the database as a "Collation", which contains not only the codepage but also a sorting order.
(Sample collations are: "Latin1_General_CI_AS", "Chinese_Hong_Kong_Stroke_90_CI_AI_KS", "Germany_Phone_Book_BIN".)

So when the RoR application is using UTF-8 strings, these strings have to be somehow converted to the encoding used in the database, and vice versa.
When the RoR app is NOT using UTF-8 but ASCII/BINARY (in Ruby 1.9), or no encoding at all (in Ruby 1.8), you will have a hard time finding the correct conversion.

Technical details

It's a long way for your data to get in and out of the database. The following stack may apply to a RoR application running on Unix:

  • RoR application
  • ActiveRecord
  • activerecord-sqlserver-adapter
  • ruby-odbc
  • unixODBC
  • FreeTDS
  • TCP/IP
  • MS SQL Server

So where will the data be converted ???

  • In activerecord-sqlserver-adapter
    I am able to read the collation from the database with:

      collation = select_value("SELECT SERVERPROPERTY('Collation') AS Collation")
      # COLLATIONPROPERTY may come back as a number, hence the to_s
      server_codepage = "CP" + select_value("SELECT COLLATIONPROPERTY('#{collation}', 'CodePage') AS CodePage").to_s

and when reading data from the database I can encode it to UTF-8:

      value = value.force_encoding(server_codepage)
      if Encoding.default_internal
        row[i] = value.encode(Encoding.default_internal)
      else
        row[i] = value.encode("UTF-8")
      end
  • ruby-odbc
    will convert between UTF-8 and UCS when using the odbc_utf8 version
  • unixODBC
    has a field "ClientCharset" in its odbc.ini ?!?
  • FreeTDS
    has a field "client charset" in its freetds.conf

So depending on the configuration of these components, you may get the right conversion, or you may get a double conversion resulting in wrong characters.

Remarks:
All this conversion stuff only matters when dealing with text that contains extended/national characters. As long as you only have plain ASCII characters you won't notice any problem, so whether the problem shows up depends on the data your users enter into the database.
So when creating tests it is absolutely necessary to have data with the correct encodings in the database, and not to test only with ASCII data (see the sketch below).

To make it worse, starting with SQL Server 2005 (??) it is possible to configure different encodings for each column in the database!
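
As a sketch of the kind of test meant here (the model and column names are invented placeholders for a model backed by an nvarchar column):

  class EncodingRoundTripTest < ActiveSupport::TestCase
    def test_national_column_round_trips_non_ascii_text
      text = "Grüße aus Köln, Привет из Москвы"
      post = Post.create!(:n_body => text)          # Post/n_body are placeholders
      assert_equal text, Post.find(post.id).n_body  # data must survive the round trip
    end
  end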

metaskills commented on May 22, 2024

So helpful, and it hurts my head too. I'll make some notes to run TinyTds through some more paces in regard to this. Also, now that I think about it, the adapter may need a bit more work so it does not work so hard when TinyTds is being used. ODBC, I have no idea.

Excelan commented on May 22, 2024

If Ken is crying, I probably don't understand anything at all :).
But I have a working solution that I mentioned at the beginning of the thread.
Can't we just use it?

BTW, taking SERVERPROPERTY('Collation') is not a universal solution, because databases, tables, and even columns might have their own collations.

metaskills commented on May 22, 2024

It might work for you, but not for the :adonet and/or :dblib connection modes. I also have a strong feeling that dumping iconv in at this low level is going to be a huge performance loss too. I'm not saying let's not solve this issue, I'm just saying I do not want to do anything half-assed and without tests that cover it.

metaskills commented on May 22, 2024

Reopened this ticket.

metaskills commented on May 22, 2024

I am also of the opinion that this problem is a concern of the low-level client. That is one of the reasons I started the TinyTds project, so we could be in control of this stuff and handle it at the C layer, the lowest level, as the Mysql2 gem did.

Please consider reviewing the tests for that library and making sure it does its job. The goal of the adapter should be to trust the low-level connection mode to do the right thing, not to monkey-patch every client that does it half-assed.

metaskills commented on May 22, 2024

Closing this; it is a concern of TinyTds or ODBC. Open tickets there if the tests do not pass, or if they do pass and the adapter is not trusting the low-level connection mode properly.

drobazko commented on May 22, 2024

Hi, colleagues.
I have a problem. I have been googling for 8 hours but with no result )
I have:
Windows XP SP3
MSSQL 2005
ruby 1.9.2
rails 3.0.9
activerecord-sqlserver-adapter 3.0.15
ruby-odbc 0.99994

I have a table with a column whose data is in Russian (cp1251).
In the Rails console:

  irb(main):001:0> Adddocs.where(["code=1203"])
  => [#<Adddocs id: 4, ordnung: 3, Code: 1203, Name: "\xCA\xF0\xE5\xF1\xEB\xE5\xED\xED\xFF">]

In the field named "Name" I get strange bytes: "\xCA\xF0\xE5\xF1\xEB\xE5\xED\xED\xFF".
In my driver settings "Perform translation for character data" is OFF and "Use regional settings when outputting currency, numbers, dates and times" is OFF, but that gave no result.

Help, please!
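
For reference, a quick Ruby 1.9 check (not a fix) suggests those bytes are plain CP1251 that was never transcoded:

  s = "\xCA\xF0\xE5\xF1\xEB\xE5\xED\xED\xFF"
  s.force_encoding("Windows-1251").encode("UTF-8")  # => "Креслення"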

KDGundermann commented on May 22, 2024

Please don't use ruby-odbc; go for TinyTds: https://github.com/rails-sqlserver/tiny_tds
