FIX: Inconsistent retrieval of CP1252 encoded data in VARCHAR columns - Windows vs. Linux #468 #495
subrata-ms wants to merge 3 commits into main
Conversation
📊 Code Coverage Report
Diff Coverage (main...HEAD, staged and unstaged changes)
Summary
mssql_python/pybind/ddbc_bindings.cpp
Lines 3249-3262
3249 "length=%lu",
3250 i, (unsigned long)numCharsInData);
3251 } else {
3252 // Buffer too small, fallback to streaming
! 3253 LOG("SQLGetData: CHAR column %d (WCHAR path) data "
! 3254 "truncated, using streaming LOB",
! 3255 i);
! 3256 row.append(FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false,
! 3257 "utf-16le"));
! 3258 }
3259 } else if (dataLen == SQL_NULL_DATA) {
3260 LOG("SQLGetData: Column %d is NULL (CHAR via WCHAR)", i);
3261 row.append(py::none());
3262 } else if (dataLen == 0) {
Lines 3261-3286
3261 row.append(py::none());
3262 } else if (dataLen == 0) {
3263 row.append(py::str(""));
3264 } else if (dataLen == SQL_NO_TOTAL) {
! 3265 LOG("SQLGetData: Cannot determine data length "
! 3266 "(SQL_NO_TOTAL) for column %d (CHAR via WCHAR), "
! 3267 "returning NULL",
! 3268 i);
! 3269 row.append(py::none());
! 3270 } else if (dataLen < 0) {
! 3271 LOG("SQLGetData: Unexpected negative data length "
! 3272 "for column %d - dataType=%d, dataLen=%ld",
! 3273 i, dataType, (long)dataLen);
! 3274 ThrowStdException("SQLGetData returned an unexpected negative "
! 3275 "data length");
! 3276 }
3277 } else {
! 3278 LOG("SQLGetData: Error retrieving data for column %d "
! 3279 "(CHAR via WCHAR) - SQLRETURN=%d, returning NULL",
! 3280 i, ret);
! 3281 row.append(py::none());
! 3282 }
3283 } else {
3284 // Allocate columnSize * 4 + 1 on ALL platforms (no #if guard).
3285 //
3286 // Why this differs from SQLBindColums / FetchBatchData:
Lines 4791-4801
4791 arrowColumnProducer->ptrValueBuffer = arrowColumnProducer->bitVal.get();
4792 break;
4793 default:
4794 std::ostringstream errorString;
! 4795 errorString << "Unsupported data type for Arrow batch fetch for column - "
! 4796 << columnName.c_str() << ", Type - " << dataType << ", column ID - "
! 4797 << (i + 1);
4798 LOG(errorString.str().c_str());
4799 ThrowStdException(errorString.str());
4800 break;
4801 }Lines 5049-5058 5049 buffers.datetimeoffsetBuffers[idxCol].data(),
5050 sizeof(DateTimeOffset),
5051 buffers.indicators[idxCol].data());
5052 if (!SQL_SUCCEEDED(ret)) {
! 5053 LOG("Error fetching SS_TIMESTAMPOFFSET data for column %d",
! 5054 idxCol + 1);
5055 return ret;
5056 }
5057 break;
5058 }Lines 5100-5110 5100 nullCounts[idxCol] += 1;
5101 continue;
5102 } else if (indicator < 0) {
5103 // Negative value is unexpected, log column index, SQL type & raise exception
! 5104 LOG("Unexpected negative data length. Column ID - %d, SQL Type - %d, Data "
! 5105 "Length - %lld",
! 5106 idxCol + 1, dataType, (long long)indicator);
5107 ThrowStdException("Unexpected negative data length.");
5108 }
5109 auto dataLen = static_cast<uint64_t>(indicator);
Lines 5406-5415
5406 arrowSchemaBatchCapsule =
5407 py::capsule(arrowSchemaBatch.get(), "arrow_schema", [](void* ptr) {
5408 auto arrowSchema = static_cast<ArrowSchema*>(ptr);
5409 if (arrowSchema->release) {
! 5410 arrowSchema->release(arrowSchema);
! 5411 }
5412 delete arrowSchema;
5413 });
5414 } catch (...) {
5415 arrowSchemaBatch->release(arrowSchemaBatch.get());
mssql_python/pybind/ddbc_bindings.h
📋 Files Needing Attention
📉 Files with overall lowest coverage
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 67.7%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 73.5%
mssql_python.pybind.connection.connection.cpp: 75.3%
mssql_python.__init__.py: 77.1%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 85.2%
Pull request overview
This PR changes the default decoding path for SQL CHAR/VARCHAR (ODBC SQL_CHAR) columns to fetch as wide characters (SQL_C_WCHAR) and decode as UTF-16LE, eliminating Windows-vs-Linux inconsistencies when server code pages contain bytes that are invalid UTF-8.
Changes:
- Update connection defaults so
SQL_CHARdecoding usesencoding="utf-16le"withctype=SQL_WCHAR. - Plumb
charCtypethrough cursor fetch APIs into the C++ bindings, enabling wide-char fetch forSQL_CHARcolumns. - Expand/adjust encoding tests to validate new defaults and reproduce the CP1252 byte behavior.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| mssql_python/connection.py | Changes default decoding settings for SQL_CHAR to UTF-16LE/SQL_WCHAR. |
| mssql_python/cursor.py | Passes updated decoding settings (encoding + ctype) into the native fetch functions. |
| mssql_python/pybind/ddbc_bindings.cpp | Adds charCtype plumb-through and implements SQL_C_WCHAR paths for SQLGetData and bound-column fetching. |
| mssql_python/pybind/ddbc_bindings.h | Extends per-column metadata and adds a wide-char branch in ProcessChar. |
| tests/test_013_encoding_decoding.py | Updates default expectations and adds Windows-focused regression tests for the CP1252 byte case. |
| self._decoding_settings = {
|     ConstantsDDBC.SQL_CHAR.value: {
| -       "encoding": "utf-8",
| -       "ctype": ConstantsDDBC.SQL_CHAR.value,
| +       "encoding": "utf-16le",
| +       "ctype": ConstantsDDBC.SQL_WCHAR.value,
|     },
Connection.init now sets SQL_CHAR decoding defaults to encoding='utf-16le' with ctype=SQL_WCHAR, but setdecoding() still defaults SQL_CHAR to 'utf-8' when encoding is None. This makes it impossible to reliably “reset to defaults” via setdecoding(SQL_CHAR) and creates inconsistent documented behavior. Update setdecoding’s default-encoding branch (and its docstring/examples) so SQL_CHAR defaults match the new connection defaults (utf-16le + SQL_WCHAR).
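A minimal sketch of the alignment this comment asks for. The constant values and helper name below are illustrative stand-ins, not the library's actual API; the point is that `__init__` and `setdecoding` should share one source of defaults:

```python
# Illustrative stand-ins for ConstantsDDBC.SQL_CHAR.value / SQL_WCHAR.value.
SQL_CHAR, SQL_WCHAR = 1, -8

# Single source of truth shared by Connection.__init__ and setdecoding,
# so setdecoding(SQL_CHAR) with encoding=None restores the real default.
DEFAULT_DECODING = {
    SQL_CHAR: {"encoding": "utf-16le", "ctype": SQL_WCHAR},
    SQL_WCHAR: {"encoding": "utf-16le", "ctype": SQL_WCHAR},
}

def resolve_decoding(sqltype, encoding=None, ctype=None):
    """Fall back to the shared defaults instead of a hard-coded 'utf-8'."""
    base = DEFAULT_DECODING[sqltype]
    return {
        "encoding": encoding if encoding is not None else base["encoding"],
        "ctype": ctype if ctype is not None else base["ctype"],
    }

assert resolve_decoding(SQL_CHAR) == {"encoding": "utf-16le", "ctype": SQL_WCHAR}
```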
| std::string utf8str = WideToUTF8(wstr);
| row.append(py::str(utf8str));
| #else
| std::wstring wstr(reinterpret_cast<wchar_t*>(dataBuffer.data()));
In the SQL_C_WCHAR path, this constructs a std::wstring from a null-terminated buffer (wstring(ptr)) even though SQLGetData provides an explicit length (numCharsInData). If the data contains embedded NULs, or if the driver fails to null-terminate as expected, the value may be truncated or read past the valid range. Prefer constructing wstr with the explicit length (e.g., wstring(ptr, numCharsInData)) and avoid relying on terminators.
Suggested change:
| - std::wstring wstr(reinterpret_cast<wchar_t*>(dataBuffer.data()));
| + std::wstring wstr(
| +     reinterpret_cast<wchar_t*>(dataBuffer.data()),
| +     static_cast<size_t>(numCharsInData));
| Py_INCREF(Py_None);
| PyList_SET_ITEM(row, col - 1, Py_None);
ProcessChar’s wide-char decode failure path clears the exception and stores None in the result. This silently drops data and is inconsistent with the narrow-char path (which falls back to bytes on decode failure) and ProcessWChar (which falls back to empty string). Consider aligning behavior (e.g., return raw UTF-16 bytes or a deterministic fallback) rather than returning None.
Suggested change:
| - Py_INCREF(Py_None);
| - PyList_SET_ITEM(row, col - 1, Py_None);
| + PyObject* pyBytes = PyBytes_FromStringAndSize(
| +     reinterpret_cast<const char*>(wcharData), numCharsInData * sizeof(SQLWCHAR));
| + if (!pyBytes) {
| +     Py_INCREF(Py_None);
| +     PyList_SET_ITEM(row, col - 1, Py_None);
| + } else {
| +     PyList_SET_ITEM(row, col - 1, pyBytes);
| + }
+1 to Copilot's comment here. Worth adding a bit more context on why this matters.
The three decode-failure paths across closely related functions currently behave differently:
- ProcessChar narrow-char → falls back to raw bytes (data preserved)
- ProcessWChar → falls back to empty str (data lost but visible)
- ProcessChar wide-char → stores None (data lost, looks like SQL NULL)
None is the most harmful choice because it is indistinguishable from a genuine SQL NULL to the caller — there is no way to detect the difference. That said, a PyUnicode_FromWideChar / PyUnicode_DecodeUTF16 failure on data that the ODBC driver has already converted to valid UTF-16 almost exclusively means OOM, so this branch is very unlikely to fire in practice. Still, it seems worth aligning with the narrow-char fallback (PyBytes_FromStringAndSize over the raw SQLWCHAR buffer) so that if it ever does fire, no data is silently lost and the behavior is consistent across the function.
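A Python analogue of the proposed alignment (a toy function, not the binding code): on decode failure, return the raw buffer so the caller can still distinguish the cell from SQL NULL and recover the data.

```python
def decode_cell(buf: bytes):
    """Decode a UTF-16LE cell; preserve raw bytes on failure instead of None."""
    try:
        return buf.decode("utf-16le")
    except UnicodeDecodeError:
        return bytes(buf)  # data preserved, and clearly not a SQL NULL

assert decode_cell("caf\u00e9".encode("utf-16le")) == "caf\u00e9"
# A lone high surrogate cannot be decoded, but the bytes survive:
assert decode_cell(b"\x00\xd8") == b"\x00\xd8"
```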
| for b in defined_bytes:
|     cursor.execute(f"INSERT INTO {self.TABLE_NAME} (id, data) VALUES ({b}, CHAR({b}))")
This test inserts 123 rows using a per-row execute() loop, which can add noticeable runtime on Windows CI (many round-trips). Consider batching these inserts (executemany, a single INSERT ... VALUES list, or inserting into a temp table) to keep the regression coverage without the per-row overhead.
Suggested change:
| - for b in defined_bytes:
| -     cursor.execute(f"INSERT INTO {self.TABLE_NAME} (id, data) VALUES ({b}, CHAR({b}))")
| + cursor.executemany(
| +     f"INSERT INTO {self.TABLE_NAME} (id, data) VALUES (?, CHAR(?))",
| +     [(b, b) for b in defined_bytes],
| + )
| # ====================================================================================
|
| @pytest.mark.skipif(
One gap worth considering: the default ctype for SQL_CHAR is now SQL_C_WCHAR / utf-16le on all platforms, not just Windows. On Linux/macOS, ProcessChar now takes the PyUnicode_DecodeUTF16 branch for every VARCHAR fetch instead of the previous PyUnicode_FromStringAndSize path. There are no platform-neutral tests guarding this new code path.
A small non-skipif test class with two cases would help:
- A plain ASCII VARCHAR value — the most common case, ensures the new default doesn't regress the happy path on Linux/macOS.
- A non-ASCII Unicode value in a VARCHAR column (e.g. a UTF-8 collation or an explicit CONVERT) — ensures multi-byte characters round-trip correctly through the new PyUnicode_DecodeUTF16 path on Linux/macOS.
Without this, a bug in the wide-char path on Linux/macOS (e.g. a byte-order or null-terminator issue) would not be caught by CI.
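The round-trip those two cases would exercise can be previewed without a database. In this sketch the encode step stands in for the driver's CHAR-to-UTF-16 conversion and the decode step stands in for what the new wide-char branch does; the helper name is illustrative:

```python
def roundtrip_via_wchar(value: str) -> str:
    # Driver converts the column to UTF-16LE; the binding decodes it back.
    return value.encode("utf-16le").decode("utf-16le")

# Case 1: plain ASCII VARCHAR (the happy path on Linux/macOS).
assert roundtrip_via_wchar("hello") == "hello"
# Case 2: non-ASCII multi-byte value, e.g. from a UTF-8 collation column.
assert roundtrip_via_wchar("h\u00e9llo \u20ac") == "h\u00e9llo \u20ac"
```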
| - ret = SQLBindColums(hStmt, buffers, columnNames, numCols, fetchSize);
| + // Bind columns — Arrow always uses SQL_C_CHAR for VARCHAR because
| + // it processes raw byte buffers directly, not via Python codecs.
| + ret = SQLBindColums(hStmt, buffers, columnNames, numCols, fetchSize, SQL_C_CHAR);
This PR fixes issue #468 for fetchone/fetchmany/fetchall, but the identical bug remains for fetch_arrow_batch() on Windows. A user migrating to Arrow-based fetching to improve performance would still get garbled or raw-byte data for any CP-1252 high byte (≥ 0x80) in a VARCHAR column.
The comment says "not via Python codecs" — but the Arrow path already has the infrastructure to handle SQL_C_WCHAR for SQL_WCHAR/NVARCHAR columns: it converts the UTF-16 buffer to UTF-8 via WideCharToMultiByte(CP_UTF8, ...) on Windows (and WideToUTF8(SQLWCHARToWString(...)) on Linux/macOS) and writes the result into the Arrow varData buffer. The same approach would work for VARCHAR columns bound as SQL_C_WCHAR — the SQL_CHAR/SQL_VARCHAR/SQL_LONGVARCHAR case in the Arrow data loop could be extended with an #ifdef _WIN32 branch that reads from wcharBuffers and converts to UTF-8, identical to the existing NVARCHAR path.
If fixing this is out of scope for this PR, it would be helpful to document the limitation in code. We can then track this as a follow-up improvement.
cc: @ffelixg would appreciate your thoughts here.
@gargsaumya @subrata-ms Fundamentally I would prefer if it was impossible for the user to configure the driver such that corrupt data is returned. If we want that, we can only allow two scenarios:
- Fetch as WCHAR & decode utf16
- Fetch as CHAR & decode utf8 IF locale is set to utf8 at connection time
Even if the locale is set to cp1252 and the decoding is also correctly set to cp1252, we may still return corrupt data if the column we are querying has any other locale. From what I can tell there isn't even an easy way for us to know the actual column collation, so this corruption would be silent (I think the driver simply replaces characters with a question mark).
I get that for fetchall etc we may want to preserve other decoding scenarios for backwards compatibility, but since the arrow fetch path is new, I think the best course of action would be to either throw exceptions if we can't guarantee correct data for the current configuration or silently use WCHAR to ensure correct data.
So the simple variant is what Saumya says: Always pass SQL_C_WCHAR to SQLBindColums and integrate the SQL_VARCHAR cases with the SQL_WVARCHAR cases. Optionally raise a NotImplementedError if the user deviated from the default utf16 configuration.
A more complex, but also potentially ~2x more performant option would be to keep the existing path, but only allow it if the decoding is set to utf8 and the connection-time locale is also utf8. The scope of this performance boost would be limited though, since the majority of SQL Server users already default to nvarchars for string data, so for now it's probably best to keep it safe and simple.
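The "safe and simple" gating described above could be sketched as follows. The function name and constant values are illustrative, not existing APIs; it only encodes the decision rule from the comment:

```python
SQL_C_CHAR, SQL_C_WCHAR = 1, -8  # illustrative stand-ins for the ODBC constants

def arrow_char_ctype(encoding: str, locale_is_utf8: bool) -> int:
    """Pick a C type for VARCHAR in the Arrow path, allowing only the two
    configurations that cannot return corrupt data."""
    if encoding == "utf-16le":
        return SQL_C_WCHAR  # always safe: driver converts to UTF-16
    if encoding == "utf-8" and locale_is_utf8:
        return SQL_C_CHAR   # faster narrow path, safe only under a UTF-8 locale
    raise NotImplementedError(
        "cannot guarantee correct data for this encoding/locale combination")

assert arrow_char_ctype("utf-16le", locale_is_utf8=False) == SQL_C_WCHAR
assert arrow_char_ctype("utf-8", locale_is_utf8=True) == SQL_C_CHAR
```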
Work Item / Issue Reference
Summary
This pull request updates the default handling of SQL CHAR/VARCHAR columns to use UTF-16 (wide character) encoding instead of UTF-8, primarily to address encoding mismatches on Windows and ensure consistent Unicode decoding. The changes span the connection, cursor, and C++ binding layers, and update related tests to reflect the new default behavior.

Default Encoding and Decoding Changes:
- The default decoding for SQL CHAR columns is now set to use "utf-16le" encoding and the SQL_WCHAR ctype, replacing the previous "utf-8"/SQL_CHAR defaults. This avoids issues where Windows ODBC drivers return raw bytes in the server's native code page, which may not decode as UTF-8. (mssql_python/connection.py, R264-R271)
- All cursor fetch methods (fetchone, fetchmany, fetchall) are updated to request UTF-16 decoding and pass the correct ctype when fetching CHAR data, ensuring consistent behavior across platforms. (mssql_python/cursor.py, [1] [2] [3])

C++ Binding and Processing Updates:
- The ColumnInfoExt struct now tracks whether wide character (UTF-16) fetching is used for a column, and the ProcessChar function is updated to handle both wide and narrow character paths, decoding appropriately based on the new setting. (mssql_python/pybind/ddbc_bindings.h, [1] [2] [3])

Test Adjustments:
- Tests are updated to expect "utf-16le" and SQL_WCHAR as the default decoding settings for SQL_CHAR columns, and to validate the new default behavior. (tests/test_013_encoding_decoding.py, [1] [2] [3])