
FIX: Inconsistent retrieval of CP1252-encoded data in VARCHAR columns - Windows vs. Linux #468 (#495)

Open
subrata-ms wants to merge 3 commits into main from subrata-ms/CP1252Encoding

Conversation

Contributor

@subrata-ms subrata-ms commented Apr 3, 2026

Work Item / Issue Reference

AB#43177

GitHub Issue: #468


Summary

This pull request updates the default handling of SQL CHAR/VARCHAR columns to use UTF-16 (wide character) encoding instead of UTF-8, primarily to address encoding mismatches on Windows and ensure consistent Unicode decoding. The changes span the connection, cursor, and C++ binding layers, and update related tests to reflect the new default behavior.

Default Encoding and Decoding Changes:

  • The default decoding for SQL CHAR columns is now set to use "utf-16le" encoding and the SQL_WCHAR ctype, replacing the previous "utf-8"/SQL_CHAR defaults. This avoids issues where Windows ODBC drivers return raw bytes in the server's native code page, which may not decode as UTF-8. (mssql_python/connection.py, lines 264-271)
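A minimal Python illustration of the failure mode described above (the byte values are illustrative, not taken from the driver): a CP-1252 curly-quote byte decodes fine in the server's code page but is invalid UTF-8, whereas the new UTF-16 path is code-page independent.

```python
# 0x93/0x94 are CP-1252 curly quotes; as raw bytes from a Windows driver
# they are not valid UTF-8.
raw = b"\x93hello\x94"

print(raw.decode("cp1252"))      # the intended string, with curly quotes

try:
    raw.decode("utf-8")          # the old default path
except UnicodeDecodeError:
    print("not valid UTF-8")     # the inconsistency this PR removes

# With the new default the driver converts to UTF-16 first, so Python
# always decodes a well-defined UTF-16LE buffer:
wide = raw.decode("cp1252").encode("utf-16-le")
print(wide.decode("utf-16-le"))  # same curly-quoted string on any platform
```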

  • All cursor fetch methods (fetchone, fetchmany, fetchall) are updated to request UTF-16 decoding and pass the correct ctype when fetching CHAR data, ensuring consistent behavior across platforms. (mssql_python/cursor.py)

C++ Binding and Processing Updates:

  • The ColumnInfoExt struct now tracks whether wide character (UTF-16) fetching is used for a column, and the ProcessChar function is updated to handle both wide and narrow character paths, decoding appropriately based on the new setting. (mssql_python/pybind/ddbc_bindings.h)

Test Adjustments:

  • Tests are updated to expect "utf-16le" and SQL_WCHAR as the default decoding settings for SQL_CHAR columns, and to validate the new default behavior. (tests/test_013_encoding_decoding.py)

github-actions bot added the pr-size: large (Substantial code update) label Apr 3, 2026

github-actions bot commented Apr 3, 2026

📊 Code Coverage Report

🔥 Diff Coverage

85%


🎯 Overall Coverage

78%


📈 Total Lines Covered: 6636 out of 8461
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/pybind/ddbc_bindings.cpp (87.4%): Missing lines 3253-3258,3265-3276,3278-3282,4795-4797,5053-5054,5104-5106,5410-5411
  • mssql_python/pybind/ddbc_bindings.h (65.2%): Missing lines 830-832,838-842

Summary

  • Total: 285 lines
  • Missing: 41 lines
  • Coverage: 85%

mssql_python/pybind/ddbc_bindings.cpp

Lines 3249-3262

  3249                                     "length=%lu",
  3250                                     i, (unsigned long)numCharsInData);
  3251                             } else {
  3252                                 // Buffer too small, fallback to streaming
! 3253                                 LOG("SQLGetData: CHAR column %d (WCHAR path) data "
! 3254                                     "truncated, using streaming LOB",
! 3255                                     i);
! 3256                                 row.append(FetchLobColumnData(hStmt, i, SQL_C_WCHAR, true, false,
! 3257                                                               "utf-16le"));
! 3258                             }
  3259                         } else if (dataLen == SQL_NULL_DATA) {
  3260                             LOG("SQLGetData: Column %d is NULL (CHAR via WCHAR)", i);
  3261                             row.append(py::none());
  3262                         } else if (dataLen == 0) {

Lines 3261-3286

  3261                             row.append(py::none());
  3262                         } else if (dataLen == 0) {
  3263                             row.append(py::str(""));
  3264                         } else if (dataLen == SQL_NO_TOTAL) {
! 3265                             LOG("SQLGetData: Cannot determine data length "
! 3266                                 "(SQL_NO_TOTAL) for column %d (CHAR via WCHAR), "
! 3267                                 "returning NULL",
! 3268                                 i);
! 3269                             row.append(py::none());
! 3270                         } else if (dataLen < 0) {
! 3271                             LOG("SQLGetData: Unexpected negative data length "
! 3272                                 "for column %d - dataType=%d, dataLen=%ld",
! 3273                                 i, dataType, (long)dataLen);
! 3274                             ThrowStdException("SQLGetData returned an unexpected negative "
! 3275                                               "data length");
! 3276                         }
  3277                     } else {
! 3278                         LOG("SQLGetData: Error retrieving data for column %d "
! 3279                             "(CHAR via WCHAR) - SQLRETURN=%d, returning NULL",
! 3280                             i, ret);
! 3281                         row.append(py::none());
! 3282                     }
  3283                 } else {
  3284                     // Allocate columnSize * 4 + 1 on ALL platforms (no #if guard).
  3285                     //
  3286                     // Why this differs from SQLBindColums / FetchBatchData:

Lines 4791-4801

  4791                 arrowColumnProducer->ptrValueBuffer = arrowColumnProducer->bitVal.get();
  4792                 break;
  4793             default:
  4794                 std::ostringstream errorString;
! 4795                 errorString << "Unsupported data type for Arrow batch fetch for column - "
! 4796                             << columnName.c_str() << ", Type - " << dataType << ", column ID - "
! 4797                             << (i + 1);
  4798                 LOG(errorString.str().c_str());
  4799                 ThrowStdException(errorString.str());
  4800                 break;
  4801         }

Lines 5049-5058

  5049                                                  buffers.datetimeoffsetBuffers[idxCol].data(),
  5050                                                  sizeof(DateTimeOffset),
  5051                                                  buffers.indicators[idxCol].data());
  5052                             if (!SQL_SUCCEEDED(ret)) {
! 5053                                 LOG("Error fetching SS_TIMESTAMPOFFSET data for column %d",
! 5054                                     idxCol + 1);
  5055                                 return ret;
  5056                             }
  5057                             break;
  5058                         }

Lines 5100-5110

  5100                     nullCounts[idxCol] += 1;
  5101                     continue;
  5102                 } else if (indicator < 0) {
  5103                     // Negative value is unexpected, log column index, SQL type & raise exception
! 5104                     LOG("Unexpected negative data length. Column ID - %d, SQL Type - %d, Data "
! 5105                         "Length - %lld",
! 5106                         idxCol + 1, dataType, (long long)indicator);
  5107                     ThrowStdException("Unexpected negative data length.");
  5108                 }
  5109                 auto dataLen = static_cast<uint64_t>(indicator);

Lines 5406-5415

  5406         arrowSchemaBatchCapsule =
  5407             py::capsule(arrowSchemaBatch.get(), "arrow_schema", [](void* ptr) {
  5408                 auto arrowSchema = static_cast<ArrowSchema*>(ptr);
  5409                 if (arrowSchema->release) {
! 5410                     arrowSchema->release(arrowSchema);
! 5411                 }
  5412                 delete arrowSchema;
  5413             });
  5414     } catch (...) {
  5415         arrowSchemaBatch->release(arrowSchemaBatch.get());

mssql_python/pybind/ddbc_bindings.h

  826             PyObject* pyStr =
  827                 PyUnicode_FromWideChar(reinterpret_cast<const wchar_t*>(wcharData), numCharsInData);
  828 #endif
  829             if (!pyStr) {
! 830                 PyErr_Clear();
! 831                 Py_INCREF(Py_None);
! 832                 PyList_SET_ITEM(row, col - 1, Py_None);
  833             } else {
  834                 PyList_SET_ITEM(row, col - 1, pyStr);
  835             }
  836         } else {

  834                 PyList_SET_ITEM(row, col - 1, pyStr);
  835             }
  836         } else {
  837             // LOB / truncated: stream with SQL_C_WCHAR
! 838             PyList_SET_ITEM(row, col - 1,
! 839                             FetchLobColumnData(hStmt, col, SQL_C_WCHAR, true, false, "utf-16le")
! 840                                 .release()
! 841                                 .ptr());
! 842         }
  843         return;
  844     }
  845 
  846     // Original narrow-char path (charCtype == SQL_C_CHAR)


📋 Files Needing Attention

📉 Files with overall lowest coverage
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 67.7%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 73.5%
mssql_python.pybind.connection.connection.cpp: 75.3%
mssql_python.__init__.py: 77.1%
mssql_python.ddbc_bindings.py: 79.6%
mssql_python.pybind.connection.connection_pool.cpp: 79.6%
mssql_python.connection.py: 85.2%


@subrata-ms subrata-ms marked this pull request as ready for review April 3, 2026 08:10

Copilot AI left a comment


Pull request overview

This PR changes the default decoding path for SQL CHAR/VARCHAR (ODBC SQL_CHAR) columns to fetch as wide characters (SQL_C_WCHAR) and decode as UTF-16LE, eliminating Windows-vs-Linux inconsistencies when server code pages contain bytes that are invalid UTF-8.

Changes:

  • Update connection defaults so SQL_CHAR decoding uses encoding="utf-16le" with ctype=SQL_WCHAR.
  • Plumb charCtype through cursor fetch APIs into the C++ bindings, enabling wide-char fetch for SQL_CHAR columns.
  • Expand/adjust encoding tests to validate new defaults and reproduce the CP1252 byte behavior.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Summary per file:

  • mssql_python/connection.py: Changes default decoding settings for SQL_CHAR to UTF-16LE/SQL_WCHAR.
  • mssql_python/cursor.py: Passes updated decoding settings (encoding + ctype) into the native fetch functions.
  • mssql_python/pybind/ddbc_bindings.cpp: Adds charCtype plumb-through and implements SQL_C_WCHAR paths for SQLGetData and bound-column fetching.
  • mssql_python/pybind/ddbc_bindings.h: Extends per-column metadata and adds a wide-char branch in ProcessChar.
  • tests/test_013_encoding_decoding.py: Updates default expectations and adds Windows-focused regression tests for the CP1252 byte case.


Comment on lines 268 to 272

      self._decoding_settings = {
          ConstantsDDBC.SQL_CHAR.value: {
    -         "encoding": "utf-8",
    -         "ctype": ConstantsDDBC.SQL_CHAR.value,
    +         "encoding": "utf-16le",
    +         "ctype": ConstantsDDBC.SQL_WCHAR.value,
          },

Copilot AI Apr 3, 2026


Connection.__init__ now sets SQL_CHAR decoding defaults to encoding='utf-16le' with ctype=SQL_WCHAR, but setdecoding() still defaults SQL_CHAR to 'utf-8' when encoding is None. This makes it impossible to reliably "reset to defaults" via setdecoding(SQL_CHAR) and creates inconsistent documented behavior. Update setdecoding's default-encoding branch (and its docstring/examples) so SQL_CHAR defaults match the new connection defaults (utf-16le + SQL_WCHAR).
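A minimal sketch of the alignment being asked for here; the constant values follow ODBC, but the function shape and dict layout are assumptions for illustration, not the driver's actual code.

```python
SQL_CHAR, SQL_WCHAR = 1, -8  # standard ODBC type codes

# New connection-level defaults introduced by this PR.
NEW_SQL_CHAR_DEFAULTS = {"encoding": "utf-16le", "ctype": SQL_WCHAR}

def setdecoding(settings, sqltype, encoding=None, ctype=None):
    """Hypothetical setdecoding: with no arguments, reset to the NEW defaults."""
    if sqltype == SQL_CHAR:
        if encoding is None:
            encoding = NEW_SQL_CHAR_DEFAULTS["encoding"]  # not "utf-8" anymore
        if ctype is None:
            ctype = NEW_SQL_CHAR_DEFAULTS["ctype"]
    settings[sqltype] = {"encoding": encoding, "ctype": ctype}
    return settings

# setdecoding(SQL_CHAR) with no encoding now reliably resets to defaults:
cfg = setdecoding({}, SQL_CHAR)
print(cfg[SQL_CHAR])  # {'encoding': 'utf-16le', 'ctype': -8}
```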

      std::string utf8str = WideToUTF8(wstr);
      row.append(py::str(utf8str));
  #else
      std::wstring wstr(reinterpret_cast<wchar_t*>(dataBuffer.data()));

Copilot AI Apr 3, 2026


In the SQL_C_WCHAR path, this constructs a std::wstring from a null-terminated buffer (wstring(ptr)) even though SQLGetData provides an explicit length (numCharsInData). If the data contains embedded NULs, or if the driver fails to null-terminate as expected, the value may be truncated or read past the valid range. Prefer constructing wstr with the explicit length (e.g., wstring(ptr, numCharsInData)) and avoid relying on terminators.

Suggested change

  -     std::wstring wstr(reinterpret_cast<wchar_t*>(dataBuffer.data()));
  +     std::wstring wstr(
  +         reinterpret_cast<wchar_t*>(dataBuffer.data()),
  +         static_cast<size_t>(numCharsInData));
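A Python analogue of the issue (names and data are illustrative, not the C++ code): reading a UTF-16LE buffer up to the first NUL, as wstring(ptr) does, loses everything after an embedded NUL, while an explicit-length read does not.

```python
value = "AB\x00CD"                      # string with an embedded NUL
buf = value.encode("utf-16-le")         # what the driver writes into the buffer

# Length-based read (the suggested fix) recovers the full value:
assert buf.decode("utf-16-le") == "AB\x00CD"

# Terminator-based read stops at the first NUL code unit:
units = []
for i in range(0, len(buf), 2):
    unit = buf[i:i + 2]
    if unit == b"\x00\x00":
        break
    units.append(unit)
print(b"".join(units).decode("utf-16-le"))  # 'AB': the tail is silently lost
```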

Comment on lines +831 to +832
      Py_INCREF(Py_None);
      PyList_SET_ITEM(row, col - 1, Py_None);

Copilot AI Apr 3, 2026


ProcessChar’s wide-char decode failure path clears the exception and stores None in the result. This silently drops data and is inconsistent with the narrow-char path (which falls back to bytes on decode failure) and ProcessWChar (which falls back to empty string). Consider aligning behavior (e.g., return raw UTF-16 bytes or a deterministic fallback) rather than returning None.

Suggested change

  -     Py_INCREF(Py_None);
  -     PyList_SET_ITEM(row, col - 1, Py_None);
  +     PyObject* pyBytes = PyBytes_FromStringAndSize(
  +         reinterpret_cast<const char*>(wcharData), numCharsInData * sizeof(SQLWCHAR));
  +     if (!pyBytes) {
  +         Py_INCREF(Py_None);
  +         PyList_SET_ITEM(row, col - 1, Py_None);
  +     } else {
  +         PyList_SET_ITEM(row, col - 1, pyBytes);
  +     }

Contributor


+1 to Copilot's comment here. Worth adding a bit more context on why this matters.

The three decode-failure paths across closely related functions currently behave differently:

  • ProcessChar narrow-char → falls back to raw bytes (data preserved)
  • ProcessWChar → falls back to empty str (data lost but visible)
  • ProcessChar wide-char → stores None (data lost, looks like SQL NULL)

None is the most harmful choice because it is indistinguishable from a genuine SQL NULL to the caller — there is no way to detect the difference. That said, a PyUnicode_FromWideChar / PyUnicode_DecodeUTF16 failure on data that the ODBC driver has already converted to valid UTF-16 almost exclusively means OOM, so this branch is very unlikely to fire in practice. Still, it seems worth aligning with the narrow-char fallback (PyBytes_FromStringAndSize over the raw SQLWCHAR buffer) so that if it ever does fire, no data is silently lost and the behavior is consistent across the function.
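A Python sketch of the aligned fallback proposed above (the function name is invented; the real change would be in ProcessChar's C++ wide-char branch): on a decode failure, return the raw bytes rather than None, so data is preserved and the result is distinguishable from a SQL NULL.

```python
def decode_wide_column(buf: bytes):
    """Decode a UTF-16LE column buffer, falling back to raw bytes."""
    try:
        return buf.decode("utf-16-le")
    except UnicodeDecodeError:
        return buf  # data preserved; a bytes result is never mistaken for NULL

good = "héllo".encode("utf-16-le")
bad = b"\x00\xd8"  # a lone UTF-16 high surrogate: undecodable in strict mode

print(decode_wide_column(good))        # 'héllo'
print(type(decode_wide_column(bad)))   # bytes, not None
```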

Comment on lines +7420 to +7421
      for b in defined_bytes:
          cursor.execute(f"INSERT INTO {self.TABLE_NAME} (id, data) VALUES ({b}, CHAR({b}))")

Copilot AI Apr 3, 2026


This test inserts 123 rows using a per-row execute() loop, which can add noticeable runtime on Windows CI (many round-trips). Consider batching these inserts (executemany, a single INSERT ... VALUES list, or inserting into a temp table) to keep the regression coverage without the per-row overhead.

Suggested change

  -     for b in defined_bytes:
  -         cursor.execute(f"INSERT INTO {self.TABLE_NAME} (id, data) VALUES ({b}, CHAR({b}))")
  +     cursor.executemany(
  +         f"INSERT INTO {self.TABLE_NAME} (id, data) VALUES (?, CHAR(?))",
  +         [(b, b) for b in defined_bytes],
  +     )

      # ====================================================================================


      @pytest.mark.skipif(
Contributor


One gap worth considering: the default ctype for SQL_CHAR is now SQL_C_WCHAR / utf-16le on all platforms, not just Windows. On Linux/macOS, ProcessChar now takes the PyUnicode_DecodeUTF16 branch for every VARCHAR fetch instead of the previous PyUnicode_FromStringAndSize path. There are no platform-neutral tests guarding this new code path.

A small non-skipif test class with two cases would help:

  1. A plain ASCII VARCHAR value — the most common case, ensures the new default doesn't regress the happy path on Linux/macOS.
  2. A non-ASCII Unicode value in a VARCHAR column (e.g. a UTF-8 collation or an explicit CONVERT) — ensures multi-byte characters round-trip correctly through the new PyUnicode_DecodeUTF16 path on Linux/macOS.

Without this, a bug in the wide-char path on Linux/macOS (e.g. a byte-order or null-terminator issue) would not be caught by CI.
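The two proposed cases can be sketched against the codecs alone (no live server; cursor fixtures omitted), showing what the platform-neutral tests would assert:

```python
cases = [
    "plain ascii value",      # case 1: the happy path must not regress
    "höchstpreis 東京 ü",      # case 2: multi-byte characters must round-trip
]
for expected in cases:
    wide = expected.encode("utf-16-le")          # driver output for SQL_C_WCHAR
    assert wide.decode("utf-16-le") == expected  # new default decode path
print("round-trips OK")
```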

  -     ret = SQLBindColums(hStmt, buffers, columnNames, numCols, fetchSize);
  +     // Bind columns — Arrow always uses SQL_C_CHAR for VARCHAR because
  +     // it processes raw byte buffers directly, not via Python codecs.
  +     ret = SQLBindColums(hStmt, buffers, columnNames, numCols, fetchSize, SQL_C_CHAR);
Contributor


This PR fixes issue #468 for fetchone/fetchmany/fetchall, but the identical bug remains for fetch_arrow_batch() on Windows. A user migrating to Arrow-based fetching to improve performance would still get garbled or raw-byte data for any CP-1252 high byte (≥ 0x80) in a VARCHAR column.

The comment says "not via Python codecs" — but the Arrow path already has the infrastructure to handle SQL_C_WCHAR for SQL_WCHAR/NVARCHAR columns: it converts the UTF-16 buffer to UTF-8 via WideCharToMultiByte(CP_UTF8, ...) on Windows (and WideToUTF8(SQLWCHARToWString(...)) on Linux/macOS) and writes the result into the Arrow varData buffer. The same approach would work for VARCHAR columns bound as SQL_C_WCHAR — the SQL_CHAR/SQL_VARCHAR/SQL_LONGVARCHAR case in the Arrow data loop could be extended with an #ifdef _WIN32 branch that reads from wcharBuffers and converts to UTF-8, identical to the existing NVARCHAR path.

If fixing this is out of scope for this PR, it would be helpful to document the limitation in code. We can then track this as a follow-up improvement.

cc: @ffelixg would appreciate your thoughts here.

Contributor


@gargsaumya @subrata-ms Fundamentally I would prefer if it was impossible for the user to configure the driver such that corrupt data is returned. If we want that, we can only allow two scenarios:

  • Fetch as WCHAR & decode utf16
  • Fetch as CHAR & decode utf8 IF locale is set to utf8 at connection time

Even if the locale is set to cp1252 and the decoding is also correctly set to cp1252, we may still return corrupt data if the column we are querying has any other locale. From what I can tell there isn't even any easy way for us to know the actual column collation, so this corruption would be silent (I think the driver simply replaces characters with a question mark).

I get that for fetchall etc we may want to preserve other decoding scenarios for backwards compatibility, but since the arrow fetch path is new, I think the best course of action would be to either throw exceptions if we can't guarantee correct data for the current configuration or silently use WCHAR to ensure correct data.

So the simple variant is what Saumya says: Always pass SQL_C_WCHAR to SQLBindColums and integrate the SQL_VARCHAR cases with the SQL_WVARCHAR cases. Optionally raise a NotImplementedError if the user deviated from the default utf16 configuration.

A more complex, but also potentially ~2x more performant option would be to keep the existing path, but only allow it if the decoding is set to utf8 and the connection-time locale is also utf8. The scope of this performance boost would be limited though, since the majority of SQL Server users already default to nvarchars for string data, so for now it's probably best to keep it safe and simple.
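A hypothetical guard sketching the "two safe scenarios" policy described above; the constant values follow ODBC C-type codes, but the function name and parameters are invented for illustration.

```python
SQL_C_CHAR, SQL_C_WCHAR = 1, -8  # standard ODBC C type codes

def check_arrow_fetch_config(ctype, encoding, locale_is_utf8):
    """Allow only configurations that cannot silently return corrupt data."""
    if ctype == SQL_C_WCHAR and encoding == "utf-16le":
        return "wide"    # driver converts to UTF-16: always safe
    if ctype == SQL_C_CHAR and encoding == "utf-8" and locale_is_utf8:
        return "narrow"  # safe only when the connection-time locale is UTF-8
    raise NotImplementedError(
        "Arrow fetch cannot guarantee correct data for this configuration")

print(check_arrow_fetch_config(SQL_C_WCHAR, "utf-16le", locale_is_utf8=False))
```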

