
FIX: Inconsistent retrieval of CP1252-encoded data in VARCHAR columns - Windows vs. Linux #468 (#495)

Open
subrata-ms wants to merge 3 commits into main from subrata-ms/CP1252Encoding

Conversation

Contributor

@subrata-ms subrata-ms commented Apr 3, 2026

Work Item / Issue Reference

AB#43177

GitHub Issue: #468


Summary

This pull request updates the default handling of SQL CHAR/VARCHAR columns to use UTF-16 (wide character) encoding instead of UTF-8, primarily to address encoding mismatches on Windows and ensure consistent Unicode decoding. The changes span the connection, cursor, and C++ binding layers, and update related tests to reflect the new default behavior.

Default Encoding and Decoding Changes:

  • The default decoding for SQL CHAR columns is now set to use "utf-16le" encoding and the SQL_WCHAR ctype, replacing the previous "utf-8"/SQL_CHAR defaults. This avoids issues where Windows ODBC drivers return raw bytes in the server's native code page, which may not decode as UTF-8. (mssql_python/connection.py, lines 264-271)
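A minimal Python illustration of the failure mode described above (the byte values are illustrative, not taken from the driver): a CP-1252 curly-quote byte decodes fine in the server's code page but is invalid UTF-8, whereas the new UTF-16 path is code-page independent.

```python
# 0x93/0x94 are CP-1252 curly quotes; as raw bytes from a Windows driver
# they are not valid UTF-8.
raw = b"\x93hello\x94"

print(raw.decode("cp1252"))      # the intended string, with curly quotes

try:
    raw.decode("utf-8")          # the old default path
except UnicodeDecodeError:
    print("not valid UTF-8")     # the inconsistency this PR removes

# With the new default the driver converts to UTF-16 first, so Python
# always decodes a well-defined UTF-16LE buffer:
wide = raw.decode("cp1252").encode("utf-16-le")
print(wide.decode("utf-16-le"))  # same curly-quoted string on any platform
```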

  • All cursor fetch methods (fetchone, fetchmany, fetchall) are updated to request UTF-16 decoding and pass the correct ctype when fetching CHAR data, ensuring consistent behavior across platforms. (mssql_python/cursor.py)

C++ Binding and Processing Updates:

  • The ColumnInfoExt struct now tracks whether wide character (UTF-16) fetching is used for a column, and the ProcessChar function is updated to handle both wide and narrow character paths, decoding appropriately based on the new setting. (mssql_python/pybind/ddbc_bindings.h)

Test Adjustments:

  • Tests are updated to expect "utf-16le" and SQL_WCHAR as the default decoding settings for SQL_CHAR columns, and to validate the new default behavior. (tests/test_013_encoding_decoding.py)

github-actions bot added the pr-size: large (Substantial code update) label Apr 3, 2026

github-actions bot commented Apr 3, 2026

📊 Code Coverage Report

🔥 Diff Coverage

85%


🎯 Overall Coverage

78%


📈 Total Lines Covered: 6636 out of 8461
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/pybind/ddbc_bindings.cpp (87.4%): Missing lines 3253-3258,3265-3276,3278-3282,4795-4797,5053-5054,5104-5106,5410-5411
  • mssql_python/pybind/ddbc_bindings.h (65.2%): Missing lines 830-832,838-842

Summary

  • Total: 285 lines
  • Missing: 41 lines
  • Coverage: 85%

mssql_python/pybind/ddbc_bindings.cpp

Lines 3249-3262

  3249                                     "length=%lu",
  3250                                     i, (unsigned long)numCharsInData);
  3251                             } else {
  3252                                 // Buffer too small, fallback to streaming
! 3253                                 LOG("SQLGetData: CHAR column %d (WCHAR path) data "
! 3254                                     "truncated, using streaming LOB",
! 3255                                     i);
! 3256                                 row.append(FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false,
! 3257                                                               "utf-16le"));
! 3258                             }
  3259                         } else if (dataLen == SQL_NULL_DATA) {
  3260                             LOG("SQLGetData: Column %d is NULL (CHAR via WCHAR)", i);
  3261                             row.append(py::none());
  3262                         } else if (dataLen == 0) {

Lines 3261-3286

  3261                             row.append(py::none());
  3262                         } else if (dataLen == 0) {
  3263                             row.append(py::str(""));
  3264                         } else if (dataLen == SQL_NO_TOTAL) {
! 3265                             LOG("SQLGetData: Cannot determine data length "
! 3266                                 "(SQL_NO_TOTAL) for column %d (CHAR via WCHAR), "
! 3267                                 "returning NULL",
! 3268                                 i);
! 3269                             row.append(py::none());
! 3270                         } else if (dataLen < 0) {
! 3271                             LOG("SQLGetData: Unexpected negative data length "
! 3272                                 "for column %d - dataType=%d, dataLen=%ld",
! 3273                                 i, dataType, (long)dataLen);
! 3274                             ThrowStdException("SQLGetData returned an unexpected negative "
! 3275                                               "data length");
! 3276                         }
  3277                     } else {
! 3278                         LOG("SQLGetData: Error retrieving data for column %d "
! 3279                             "(CHAR via WCHAR) - SQLRETURN=%d, returning NULL",
! 3280                             i, ret);
! 3281                         row.append(py::none());
! 3282                     }
  3283                 } else {
  3284                     // Allocate columnSize * 4 + 1 on ALL platforms (no #if guard).
  3285                     //
  3286                     // Why this differs from SQLBindColums / FetchBatchData:

Lines 4791-4801

  4791                 arrowColumnProducer->ptrValueBuffer = arrowColumnProducer->bitVal.get();
  4792                 break;
  4793             default:
  4794                 std::ostringstream errorString;
! 4795                 errorString << "Unsupported data type for Arrow batch fetch for column - "
! 4796                             << columnName.c_str() << ", Type - " << dataType << ", column ID - "
! 4797                             << (i + 1);
  4798                 LOG(errorString.str().c_str());
  4799                 ThrowStdException(errorString.str());
  4800                 break;
  4801         }

Lines 5049-5058

  5049                                                  buffers.datetimeoffsetBuffers[idxCol].data(),
  5050                                                  sizeof(DateTimeOffset),
  5051                                                  buffers.indicators[idxCol].data());
  5052                             if (!SQL_SUCCEEDED(ret)) {
! 5053                                 LOG("Error fetching SS_TIMESTAMPOFFSET data for column %d",
! 5054                                     idxCol + 1);
  5055                                 return ret;
  5056                             }
  5057                             break;
  5058                         }

Lines 5100-5110

  5100                     nullCounts[idxCol] += 1;
  5101                     continue;
  5102                 } else if (indicator < 0) {
  5103                     // Negative value is unexpected, log column index, SQL type & raise exception
! 5104                     LOG("Unexpected negative data length. Column ID - %d, SQL Type - %d, Data "
! 5105                         "Length - %lld",
! 5106                         idxCol + 1, dataType, (long long)indicator);
  5107                     ThrowStdException("Unexpected negative data length.");
  5108                 }
  5109                 auto dataLen = static_cast<uint64_t>(indicator);

Lines 5406-5415

  5406         arrowSchemaBatchCapsule =
  5407             py::capsule(arrowSchemaBatch.get(), "arrow_schema", [](void* ptr) {
  5408                 auto arrowSchema = static_cast<ArrowSchema*>(ptr);
  5409                 if (arrowSchema->release) {
! 5410                     arrowSchema->release(arrowSchema);
! 5411                 }
  5412                 delete arrowSchema;
  5413             });
  5414     } catch (...) {
  5415         arrowSchemaBatch->release(arrowSchemaBatch.get());

mssql_python/pybind/ddbc_bindings.h

  826             PyObject* pyStr =
  827                 PyUnicode_FromWideChar(reinterpret_cast<const wchar_t*>(wcharData), numCharsInData);
  828 #endif
  829             if (!pyStr) {
! 830                 PyErr_Clear();
! 831                 Py_INCREF(Py_None);
! 832                 PyList_SET_ITEM(row, col - 1, Py_None);
  833             } else {
  834                 PyList_SET_ITEM(row, col - 1, pyStr);
  835             }
  836         } else {

  834                 PyList_SET_ITEM(row, col - 1, pyStr);
  835             }
  836         } else {
  837             // LOB / truncated: stream with SQL_C_WCHAR
! 838             PyList_SET_ITEM(row, col - 1,
! 839                             FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")
! 840                                 .release()
! 841                                 .ptr());
! 842         }
  843         return;
  844     }
  845 
  846     // Original narrow-char path (charCtype == SQL_C_CHAR)


📋 Files Needing Attention

📉 Files with overall lowest coverage
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 67.7%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 73.5%
mssql_python.pybind.connection.connection.cpp: 75.3%
mssql_python.__init__.py: 77.1%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 85.2%


@subrata-ms subrata-ms marked this pull request as ready for review April 3, 2026 08:10

Copilot AI left a comment


Pull request overview

This PR changes the default decoding path for SQL CHAR/VARCHAR (ODBC SQL_CHAR) columns to fetch as wide characters (SQL_C_WCHAR) and decode as UTF-16LE, eliminating Windows-vs-Linux inconsistencies when server code pages contain bytes that are invalid UTF-8.

Changes:

  • Update connection defaults so SQL_CHAR decoding uses encoding="utf-16le" with ctype=SQL_WCHAR.
  • Plumb charCtype through cursor fetch APIs into the C++ bindings, enabling wide-char fetch for SQL_CHAR columns.
  • Expand/adjust encoding tests to validate new defaults and reproduce the CP1252 byte behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Summary per file:

  • mssql_python/connection.py: Changes default decoding settings for SQL_CHAR to UTF-16LE/SQL_WCHAR.
  • mssql_python/cursor.py: Passes updated decoding settings (encoding + ctype) into the native fetch functions.
  • mssql_python/pybind/ddbc_bindings.cpp: Adds charCtype plumb-through and implements SQL_C_WCHAR paths for SQLGetData and bound-column fetching.
  • mssql_python/pybind/ddbc_bindings.h: Extends per-column metadata and adds a wide-char branch in ProcessChar.
  • tests/test_013_encoding_decoding.py: Updates default expectations and adds Windows-focused regression tests for the CP1252 byte case.


Comment on lines 268 to 272

      self._decoding_settings = {
          ConstantsDDBC.SQL_CHAR.value: {
    -         "encoding": "utf-8",
    -         "ctype": ConstantsDDBC.SQL_CHAR.value,
    +         "encoding": "utf-16le",
    +         "ctype": ConstantsDDBC.SQL_WCHAR.value,
          },

Copilot AI Apr 3, 2026


Connection.__init__ now sets SQL_CHAR decoding defaults to encoding='utf-16le' with ctype=SQL_WCHAR, but setdecoding() still defaults SQL_CHAR to 'utf-8' when encoding is None. This makes it impossible to reliably "reset to defaults" via setdecoding(SQL_CHAR) and creates inconsistent documented behavior. Update setdecoding's default-encoding branch (and its docstring/examples) so SQL_CHAR defaults match the new connection defaults (utf-16le + SQL_WCHAR).
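A minimal sketch of the alignment being asked for here; the constant values follow ODBC, but the function shape and dict layout are assumptions for illustration, not the driver's actual code.

```python
SQL_CHAR, SQL_WCHAR = 1, -8  # standard ODBC type codes

# New connection-level defaults introduced by this PR.
NEW_SQL_CHAR_DEFAULTS = {"encoding": "utf-16le", "ctype": SQL_WCHAR}

def setdecoding(settings, sqltype, encoding=None, ctype=None):
    """Hypothetical setdecoding: with no arguments, reset to the NEW defaults."""
    if sqltype == SQL_CHAR:
        if encoding is None:
            encoding = NEW_SQL_CHAR_DEFAULTS["encoding"]  # not "utf-8" anymore
        if ctype is None:
            ctype = NEW_SQL_CHAR_DEFAULTS["ctype"]
    settings[sqltype] = {"encoding": encoding, "ctype": ctype}
    return settings

# setdecoding(SQL_CHAR) with no encoding now reliably resets to defaults:
cfg = setdecoding({}, SQL_CHAR)
print(cfg[SQL_CHAR])  # {'encoding': 'utf-16le', 'ctype': -8}
```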

      std::string utf8str = WideToUTF8(wstr);
      row.append(py::str(utf8str));
  #else
      std::wstring wstr(reinterpret_cast<wchar_t*>(dataBuffer.data()));

Copilot AI Apr 3, 2026


In the SQL_C_WCHAR path, this constructs a std::wstring from a null-terminated buffer (wstring(ptr)) even though SQLGetData provides an explicit length (numCharsInData). If the data contains embedded NULs, or if the driver fails to null-terminate as expected, the value may be truncated or read past the valid range. Prefer constructing wstr with the explicit length (e.g., wstring(ptr, numCharsInData)) and avoid relying on terminators.

Suggested change

  -     std::wstring wstr(reinterpret_cast<wchar_t*>(dataBuffer.data()));
  +     std::wstring wstr(
  +         reinterpret_cast<wchar_t*>(dataBuffer.data()),
  +         static_cast<size_t>(numCharsInData));
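A Python analogue of the issue (names and data are illustrative, not the C++ code): reading a UTF-16LE buffer up to the first NUL, as wstring(ptr) does, loses everything after an embedded NUL, while an explicit-length read does not.

```python
value = "AB\x00CD"                      # string with an embedded NUL
buf = value.encode("utf-16-le")         # what the driver writes into the buffer

# Length-based read (the suggested fix) recovers the full value:
assert buf.decode("utf-16-le") == "AB\x00CD"

# Terminator-based read stops at the first NUL code unit:
units = []
for i in range(0, len(buf), 2):
    unit = buf[i:i + 2]
    if unit == b"\x00\x00":
        break
    units.append(unit)
print(b"".join(units).decode("utf-16-le"))  # 'AB': the tail is silently lost
```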

Comment on lines +831 to +832
      Py_INCREF(Py_None);
      PyList_SET_ITEM(row, col - 1, Py_None);

Copilot AI Apr 3, 2026


ProcessChar’s wide-char decode failure path clears the exception and stores None in the result. This silently drops data and is inconsistent with the narrow-char path (which falls back to bytes on decode failure) and ProcessWChar (which falls back to empty string). Consider aligning behavior (e.g., return raw UTF-16 bytes or a deterministic fallback) rather than returning None.

Suggested change

  -     Py_INCREF(Py_None);
  -     PyList_SET_ITEM(row, col - 1, Py_None);
  +     PyObject* pyBytes = PyBytes_FromStringAndSize(
  +         reinterpret_cast<const char*>(wcharData), numCharsInData * sizeof(SQLWCHAR));
  +     if (!pyBytes) {
  +         Py_INCREF(Py_None);
  +         PyList_SET_ITEM(row, col - 1, Py_None);
  +     } else {
  +         PyList_SET_ITEM(row, col - 1, pyBytes);
  +     }

Contributor


+1 to Copilot's comment here. Worth adding a bit more context on why this matters.

The three decode-failure paths across closely related functions currently behave differently:

  • ProcessChar narrow-char → falls back to raw bytes (data preserved)
  • ProcessWChar → falls back to empty str (data lost but visible)
  • ProcessChar wide-char → stores None (data lost, looks like SQL NULL)

None is the most harmful choice because it is indistinguishable from a genuine SQL NULL to the caller — there is no way to detect the difference. That said, a PyUnicode_FromWideChar / PyUnicode_DecodeUTF16 failure on data that the ODBC driver has already converted to valid UTF-16 almost exclusively means OOM, so this branch is very unlikely to fire in practice. Still, it seems worth aligning with the narrow-char fallback (PyBytes_FromStringAndSize over the raw SQLWCHAR buffer) so that if it ever does fire, no data is silently lost and the behavior is consistent across the function.
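A Python sketch of the aligned fallback proposed above (the function name is invented; the real change would be in ProcessChar's C++ wide-char branch): on a decode failure, return the raw bytes rather than None, so data is preserved and the result is distinguishable from a SQL NULL.

```python
def decode_wide_column(buf: bytes):
    """Decode a UTF-16LE column buffer, falling back to raw bytes."""
    try:
        return buf.decode("utf-16-le")
    except UnicodeDecodeError:
        return buf  # data preserved; a bytes result is never mistaken for NULL

good = "héllo".encode("utf-16-le")
bad = b"\x00\xd8"  # a lone UTF-16 high surrogate: undecodable in strict mode

print(decode_wide_column(good))        # 'héllo'
print(type(decode_wide_column(bad)))   # bytes, not None
```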

Comment on lines +7420 to +7421
      for b in defined_bytes:
          cursor.execute(f"INSERT INTO {self.TABLE_NAME} (id, data) VALUES ({b}, CHAR({b}))")

Copilot AI Apr 3, 2026


This test inserts 123 rows using a per-row execute() loop, which can add noticeable runtime on Windows CI (many round-trips). Consider batching these inserts (executemany, a single INSERT ... VALUES list, or inserting into a temp table) to keep the regression coverage without the per-row overhead.

Suggested change

  -     for b in defined_bytes:
  -         cursor.execute(f"INSERT INTO {self.TABLE_NAME} (id, data) VALUES ({b}, CHAR({b}))")
  +     cursor.executemany(
  +         f"INSERT INTO {self.TABLE_NAME} (id, data) VALUES (?, CHAR(?))",
  +         [(b, b) for b in defined_bytes],
  +     )

      # ====================================================================================


      @pytest.mark.skipif(
Contributor


One gap worth considering: the default ctype for SQL_CHAR is now SQL_C_WCHAR / utf-16le on all platforms, not just Windows. On Linux/macOS, ProcessChar now takes the PyUnicode_DecodeUTF16 branch for every VARCHAR fetch instead of the previous PyUnicode_FromStringAndSize path. There are no platform-neutral tests guarding this new code path.

A small non-skipif test class with two cases would help:

  1. A plain ASCII VARCHAR value — the most common case, ensures the new default doesn't regress the happy path on Linux/macOS.
  2. A non-ASCII Unicode value in a VARCHAR column (e.g. a UTF-8 collation or an explicit CONVERT) — ensures multi-byte characters round-trip correctly through the new PyUnicode_DecodeUTF16 path on Linux/macOS.

Without this, a bug in the wide-char path on Linux/macOS (e.g. a byte-order or null-terminator issue) would not be caught by CI.
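The two proposed cases can be sketched against the codecs alone (no live server; cursor fixtures omitted), showing what the platform-neutral tests would assert:

```python
cases = [
    "plain ascii value",      # case 1: the happy path must not regress
    "höchstpreis 東京 ü",      # case 2: multi-byte characters must round-trip
]
for expected in cases:
    wide = expected.encode("utf-16-le")          # driver output for SQL_C_WCHAR
    assert wide.decode("utf-16-le") == expected  # new default decode path
print("round-trips OK")
```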

  -     ret = SQLBindColums(hStmt, buffers, columnNames, numCols, fetchSize);
  +     // Bind columns — Arrow always uses SQL_C_CHAR for VARCHAR because
  +     // it processes raw byte buffers directly, not via Python codecs.
  +     ret = SQLBindColums(hStmt, buffers, columnNames, numCols, fetchSize, SQL_C_CHAR);
Contributor


This PR fixes issue #468 for fetchone/fetchmany/fetchall, but the identical bug remains for fetch_arrow_batch() on Windows. A user migrating to Arrow-based fetching to improve performance would still get garbled or raw-byte data for any CP-1252 high byte (≥ 0x80) in a VARCHAR column.

The comment says "not via Python codecs" — but the Arrow path already has the infrastructure to handle SQL_C_WCHAR for SQL_WCHAR/NVARCHAR columns: it converts the UTF-16 buffer to UTF-8 via WideCharToMultiByte(CP_UTF8, ...) on Windows (and WideToUTF8(SQLWCHARToWString(...)) on Linux/macOS) and writes the result into the Arrow varData buffer. The same approach would work for VARCHAR columns bound as SQL_C_WCHAR — the SQL_CHAR/SQL_VARCHAR/SQL_LONGVARCHAR case in the Arrow data loop could be extended with an #ifdef _WIN32 branch that reads from wcharBuffers and converts to UTF-8, identical to the existing NVARCHAR path.

If fixing this is out of scope for this PR, it would be helpful to document the limitation in code. We can then track this as a follow-up improvement.

cc: @ffelixg would appreciate your thoughts here.

Contributor


@gargsaumya @subrata-ms Fundamentally I would prefer if it was impossible for the user to configure the driver such that corrupt data is returned. If we want that, we can only allow two scenarios:

  • Fetch as WCHAR & decode utf16
  • Fetch as CHAR & decode utf8 IF locale is set to utf8 at connection time

Even if the locale is set to cp1252 and the decoding is also correctly set to cp1252, we may still return corrupt data if the column we are querying has any other locale. From what I can tell there isn't even any easy way for us to know the actual column collation, so this corruption would be silent (I think the driver simply replaces characters with a question mark).

I get that for fetchall etc we may want to preserve other decoding scenarios for backwards compatibility, but since the arrow fetch path is new, I think the best course of action would be to either throw exceptions if we can't guarantee correct data for the current configuration or silently use WCHAR to ensure correct data.

So the simple variant is what Saumya says: Always pass SQL_C_WCHAR to SQLBindColums and integrate the SQL_VARCHAR cases with the SQL_WVARCHAR cases. Optionally raise a NotImplementedError if the user deviated from the default utf16 configuration.

A more complex, but also potentially ~2x more performant option would be to keep the existing path, but only allow it if the decoding is set to utf8 and the connection-time locale is also utf8. The scope of this performance boost would be limited though, since the majority of SQL Server users already default to nvarchars for string data, so for now it's probably best to keep it safe and simple.
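A hypothetical guard sketching the "two safe scenarios" policy described above; the constant values follow ODBC C-type codes, but the function name and parameters are invented for illustration.

```python
SQL_C_CHAR, SQL_C_WCHAR = 1, -8  # standard ODBC C type codes

def check_arrow_fetch_config(ctype, encoding, locale_is_utf8):
    """Allow only configurations that cannot silently return corrupt data."""
    if ctype == SQL_C_WCHAR and encoding == "utf-16le":
        return "wide"    # driver converts to UTF-16: always safe
    if ctype == SQL_C_CHAR and encoding == "utf-8" and locale_is_utf8:
        return "narrow"  # safe only when the connection-time locale is UTF-8
    raise NotImplementedError(
        "Arrow fetch cannot guarantee correct data for this configuration")

print(check_arrow_fetch_config(SQL_C_WCHAR, "utf-16le", locale_is_utf8=False))
```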

