From bd5e882cf6e0def3dd1bc106075d59a303fe0d1e Mon Sep 17 00:00:00 2001
From: David Malcolm <dmalcolm@redhat.com>
Date: Mon, 18 Oct 2021 18:55:31 -0400
Subject: [PATCH] diagnostics: escape non-ASCII source bytes for certain
 diagnostics
MIME-Version: 1.0
Content-Type: text/plain; charset=utf8
Content-Transfer-Encoding: 8bit

This patch adds support to GCC's diagnostic subsystem for escaping certain
bytes and Unicode characters when quoting source code.

Specifically, this patch adds a new flag rich_location::m_escape_on_output
which is a hint from a diagnostic that non-ASCII bytes in the pertinent
lines of the user's source code should be escaped when printed.

The patch sets this for the following diagnostics:
- when complaining about stray bytes in the program (when these
  are non-printable)
- when complaining about "null character(s) ignored"
- for -Wnormalized= (also generating source ranges for such warnings)

The escaping is controlled by a new option:
  -fdiagnostics-escape-format=[unicode|bytes]

For example, consider a diagnostic involving a source line containing the
string "before" followed by the Unicode character U+03C0 ("GREEK SMALL
LETTER PI", with UTF-8 encoding 0xCF 0x80) followed by the byte 0xBF
(a stray UTF-8 trailing byte), followed by the string "after", where the
diagnostic highlights the U+03C0 character.

By default, this line will be printed verbatim to the user when
reporting a diagnostic at it, as:

    beforeπXafter
          ^

(using X for the stray byte to avoid putting invalid UTF-8 in this
commit message)

If the diagnostic sets the "escape" flag, it will be printed as:

    before<U+03C0><BF>after
          ^~~~~~~~

with -fdiagnostics-escape-format=unicode (the default), or as:

    before<CF><80><BF>after
          ^~~~~~~~

if the user supplies -fdiagnostics-escape-format=bytes.
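
The two escaping modes described above can be illustrated with a small
standalone sketch. It is not the code from this patch: EscapeFormat,
decode_utf8, append_byte and escape_line are hypothetical names, and the
UTF-8 decoder is deliberately simplified. The sketch copies the "<%02x>"
and "<U+%04X>" format strings from the diff, so raw bytes come out in
lowercase hex.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for the patch's diagnostics_escape_format enum.
enum class EscapeFormat { Unicode, Bytes };

// Decode one UTF-8 sequence starting at s[i].  On success store the code
// point in *cp and return its length in bytes; return 0 if the bytes are
// not valid UTF-8.  Simplified: surrogates and some overlong forms are
// not rejected.
static std::size_t decode_utf8(const std::string &s, std::size_t i,
                               unsigned *cp) {
  const unsigned char b0 = s[i];
  std::size_t len;
  if (b0 < 0x80) { *cp = b0; return 1; }
  else if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; *cp = b0 & 0x1F; }
  else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; *cp = b0 & 0x0F; }
  else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; *cp = b0 & 0x07; }
  else return 0;  // stray continuation byte or invalid lead byte
  if (i + len > s.size()) return 0;
  for (std::size_t k = 1; k < len; ++k) {
    const unsigned char b = s[i + k];
    if ((b & 0xC0) != 0x80) return 0;
    *cp = (*cp << 6) | (b & 0x3F);
  }
  return len;
}

// Append a single byte in the "<xx>" form.
static void append_byte(std::string &out, unsigned char b) {
  char buf[16];
  std::snprintf(buf, sizeof buf, "<%02x>", static_cast<unsigned>(b));
  out += buf;
}

// Escape everything in LINE that is not printable ASCII.
std::string escape_line(const std::string &line, EscapeFormat fmt) {
  std::string out;
  std::size_t i = 0;
  while (i < line.size()) {
    unsigned cp = 0;
    const std::size_t len = decode_utf8(line, i, &cp);
    if (len == 0) {
      // Invalid UTF-8: shown as a raw byte in both modes.
      append_byte(out, line[i]);
      ++i;
      continue;
    }
    if (cp >= 0x20 && cp <= 0x7E) {
      out += static_cast<char>(cp);  // printable ASCII passes through
    } else if (fmt == EscapeFormat::Unicode) {
      char buf[16];
      std::snprintf(buf, sizeof buf, "<U+%04X>", cp);
      out += buf;
    } else {
      for (std::size_t k = 0; k < len; ++k)
        append_byte(out, line[i + k]);
    }
    i += len;
  }
  return out;
}
```

Calling escape_line on the example line (the bytes "before", 0xCF 0x80,
0xBF, "after") with EscapeFormat::Unicode escapes the pi character as one
unit and the stray byte separately; with EscapeFormat::Bytes every
non-ASCII byte is escaped on its own.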

This only affects how the source is printed; it does not affect
how column numbers are printed (as per -fdiagnostics-column-unit=
and -fdiagnostics-column-origin=).
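
The underline lengths in the escaped examples come from the display
widths assigned to escaped characters when the quoted line is laid out.
The following is a minimal sketch that mirrors the thresholds of
escape_as_unicode_width and escape_as_bytes_width in the diff. Two things
are assumptions: is_printable_ascii is a hypothetical stand-in for the
patch's "ch < 0x80 && ISPRINT (ch)" test, and printable ASCII is taken to
be one column wide, where the patch defers to cpp_wcwidth.

```cpp
// Display columns used by one escaped byte such as "<cf>".
static const int width_per_escaped_byte = 4;

// Hypothetical stand-in for the patch's "ch < 0x80 && ISPRINT (ch)" test.
static bool is_printable_ascii(unsigned ch) {
  return ch >= 0x20 && ch <= 0x7E;
}

// Columns used when a code point is printed as "<U+XXXX>".
int escape_as_unicode_width(unsigned ch) {
  if (is_printable_ascii(ch))
    return 1;   // assumption: one column for printable ASCII
  if (ch > 0xFFFFF)
    return 10;  // six hex digits
  if (ch > 0xFFFF)
    return 9;   // five hex digits
  return 8;     // four hex digits, zero-padded
}

// Columns used when each UTF-8 byte of a code point is printed as "<xx>".
int escape_as_bytes_width(unsigned ch) {
  if (is_printable_ascii(ch))
    return 1;   // assumption: one column for printable ASCII
  if (ch <= 0x7F)
    return 1 * width_per_escaped_byte;
  if (ch <= 0x7FF)
    return 2 * width_per_escaped_byte;
  if (ch <= 0xFFFF)
    return 3 * width_per_escaped_byte;
  return 4 * width_per_escaped_byte;
}
```

For U+03C0 both functions return 8, which matches the eight-column
underline shown for "<U+03C0>" and for "<CF><80>" in the examples.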
56
57gcc/c-family/ChangeLog:
58 * c-lex.c (c_lex_with_flags): When complaining about non-printable
59 CPP_OTHER tokens, set the "escape on output" flag.
60
61gcc/ChangeLog:
62 * common.opt (fdiagnostics-escape-format=): New.
63 (diagnostics_escape_format): New enum.
64 (DIAGNOSTICS_ESCAPE_FORMAT_UNICODE): New enum value.
65 (DIAGNOSTICS_ESCAPE_FORMAT_BYTES): Likewise.
66 * diagnostic-format-json.cc (json_end_diagnostic): Add
67 "escape-source" attribute.
68 * diagnostic-show-locus.c
69 (exploc_with_display_col::exploc_with_display_col): Replace
70 "tabstop" param with a cpp_char_column_policy and add an "aspect"
71 param. Use these to compute m_display_col accordingly.
72 (struct char_display_policy): New struct.
73 (layout::m_policy): New field.
74 (layout::m_escape_on_output): New field.
75 (def_policy): New function.
76 (make_range): Update for changes to exploc_with_display_col ctor.
77 (default_print_decoded_ch): New.
78 (width_per_escaped_byte): New.
79 (escape_as_bytes_width): New.
80 (escape_as_bytes_print): New.
81 (escape_as_unicode_width): New.
82 (escape_as_unicode_print): New.
83 (make_policy): New.
84 (layout::layout): Initialize new fields. Update m_exploc ctor
85 call for above change to ctor.
86 (layout::maybe_add_location_range): Update for changes to
87 exploc_with_display_col ctor.
88 (layout::calculate_x_offset_display): Update for change to
89 cpp_display_width.
90 (layout::print_source_line): Pass policy
91 to cpp_display_width_computation. Capture cpp_decoded_char when
92 calling process_next_codepoint. Move printing of source code to
93 m_policy.m_print_cb.
94 (line_label::line_label): Pass in policy rather than context.
95 (layout::print_any_labels): Update for change to line_label ctor.
96 (get_affected_range): Pass in policy rather than context, updating
97 calls to location_compute_display_column accordingly.
98 (get_printed_columns): Likewise, also for cpp_display_width.
99 (correction::correction): Pass in policy rather than tabstop.
100 (correction::compute_display_cols): Pass m_policy rather than
101 m_tabstop to cpp_display_width.
102 (correction::m_tabstop): Replace with...
103 (correction::m_policy): ...this.
104 (line_corrections::line_corrections): Pass in policy rather than
105 context.
106 (line_corrections::m_context): Replace with...
107 (line_corrections::m_policy): ...this.
108 (line_corrections::add_hint): Update to use m_policy rather than
109 m_context.
110 (line_corrections::add_hint): Likewise.
111 (layout::print_trailing_fixits): Likewise.
112 (selftest::test_display_widths): New.
113 (selftest::test_layout_x_offset_display_utf8): Update to use
114 policy rather than tabstop.
115 (selftest::test_one_liner_labels_utf8): Add test of escaping
116 source lines.
117 (selftest::test_diagnostic_show_locus_one_liner_utf8): Update to
118 use policy rather than tabstop.
119 (selftest::test_overlapped_fixit_printing): Likewise.
120 (selftest::test_overlapped_fixit_printing_utf8): Likewise.
121 (selftest::test_overlapped_fixit_printing_2): Likewise.
122 (selftest::test_tab_expansion): Likewise.
123 (selftest::test_escaping_bytes_1): New.
124 (selftest::test_escaping_bytes_2): New.
125 (selftest::diagnostic_show_locus_c_tests): Call the new tests.
126 * diagnostic.c (diagnostic_initialize): Initialize
127 context->escape_format.
128 (convert_column_unit): Update to use default character width policy.
129 (selftest::test_diagnostic_get_location_text): Likewise.
130 * diagnostic.h (enum diagnostics_escape_format): New enum.
131 (diagnostic_context::escape_format): New field.
132 * doc/invoke.texi (-fdiagnostics-escape-format=): New option.
133 (-fdiagnostics-format=): Add "escape-source" attribute to examples
134 of JSON output, and document it.
135 * input.c (location_compute_display_column): Pass in "policy"
136 rather than "tabstop", passing to
137 cpp_byte_column_to_display_column.
138 (selftest::test_cpp_utf8): Update to use cpp_char_column_policy.
139 * input.h (class cpp_char_column_policy): New forward decl.
140 (location_compute_display_column): Pass in "policy" rather than
141 "tabstop".
142 * opts.c (common_handle_option): Handle
143 OPT_fdiagnostics_escape_format_.
144 * selftest.c (temp_source_file::temp_source_file): New ctor
145 overload taking a size_t.
146 * selftest.h (temp_source_file::temp_source_file): Likewise.
147
148gcc/testsuite/ChangeLog:
149 * c-c++-common/diagnostic-format-json-1.c: Add regexp to consume
150 "escape-source" attribute.
151 * c-c++-common/diagnostic-format-json-2.c: Likewise.
152 * c-c++-common/diagnostic-format-json-3.c: Likewise.
153 * c-c++-common/diagnostic-format-json-4.c: Likewise, twice.
154 * c-c++-common/diagnostic-format-json-5.c: Likewise.
155 * gcc.dg/cpp/warn-normalized-4-bytes.c: New test.
156 * gcc.dg/cpp/warn-normalized-4-unicode.c: New test.
157 * gcc.dg/encoding-issues-bytes.c: New test.
158 * gcc.dg/encoding-issues-unicode.c: New test.
159 * gfortran.dg/diagnostic-format-json-1.F90: Add regexp to consume
160 "escape-source" attribute.
161 * gfortran.dg/diagnostic-format-json-2.F90: Likewise.
162 * gfortran.dg/diagnostic-format-json-3.F90: Likewise.
163
libcpp/ChangeLog:
	* charset.c (convert_escape): Use encoding_rich_location when
	complaining about nonprintable unknown escape sequences.
	(cpp_display_width_computation::cpp_display_width_computation):
	Pass in policy rather than tabstop.
	(cpp_display_width_computation::process_next_codepoint): Add "out"
	param and populate *out if non-NULL.
	(cpp_display_width_computation::advance_display_cols): Pass NULL
	to process_next_codepoint.
	(cpp_byte_column_to_display_column): Pass in policy rather than
	tabstop.  Pass NULL to process_next_codepoint.
	(cpp_display_column_to_byte_column): Pass in policy rather than
	tabstop.
177 * errors.c (cpp_diagnostic_get_current_location): New function,
178 splitting out the logic from...
179 (cpp_diagnostic): ...here.
180 (cpp_warning_at): New function.
181 (cpp_pedwarning_at): New function.
182 * include/cpplib.h (cpp_warning_at): New decl for rich_location.
183 (cpp_pedwarning_at): Likewise.
184 (struct cpp_decoded_char): New.
185 (struct cpp_char_column_policy): New.
186 (cpp_display_width_computation::cpp_display_width_computation):
187 Replace "tabstop" param with "policy".
188 (cpp_display_width_computation::process_next_codepoint): Add "out"
189 param.
190 (cpp_display_width_computation::m_tabstop): Replace with...
191 (cpp_display_width_computation::m_policy): ...this.
192 (cpp_byte_column_to_display_column): Replace "tabstop" param with
193 "policy".
194 (cpp_display_width): Likewise.
195 (cpp_display_column_to_byte_column): Likewise.
196 * include/line-map.h (rich_location::escape_on_output_p): New.
197 (rich_location::set_escape_on_output): New.
198 (rich_location::m_escape_on_output): New.
199 * internal.h (cpp_diagnostic_get_current_location): New decl.
200 (class encoding_rich_location): New.
201 * lex.c (skip_whitespace): Use encoding_rich_location when
202 complaining about null characters.
203 (warn_about_normalization): Generate a source range when
204 complaining about improperly normalized tokens, rather than just a
205 point, and use encoding_rich_location so that the source code
206 is escaped on printing.
207 * line-map.c (rich_location::rich_location): Initialize
208 m_escape_on_output.
209
210Signed-off-by: David Malcolm <dmalcolm@redhat.com>
211
212CVE: CVE-2021-42574
213Upstream-Status: Backport [https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=bd5e882cf6e0def3dd1bc106075d59a303fe0d1e]
214Signed-off-by: Pgowda <pgowda.cve@gmail.com>
215
216---
217 gcc/c-family/c-lex.c | 6 +-
218 gcc/common.opt | 13 +
219 gcc/diagnostic-format-json.cc | 3 +
220 gcc/diagnostic-show-locus.c | 580 +++++++++++++++---
221 gcc/diagnostic.c | 10 +-
222 gcc/diagnostic.h | 18 +
223 gcc/doc/invoke.texi | 43 +-
224 gcc/input.c | 62 +-
225 gcc/input.h | 7 +-
226 gcc/opts.c | 4 +
227 gcc/selftest.c | 15 +
228 gcc/selftest.h | 2 +
229 .../c-c++-common/diagnostic-format-json-1.c | 1 +
230 .../c-c++-common/diagnostic-format-json-2.c | 1 +
231 .../c-c++-common/diagnostic-format-json-3.c | 1 +
232 .../c-c++-common/diagnostic-format-json-4.c | 2 +
233 .../c-c++-common/diagnostic-format-json-5.c | 1 +
234 .../gcc.dg/cpp/warn-normalized-4-bytes.c | 21 +
235 .../gcc.dg/cpp/warn-normalized-4-unicode.c | 19 +
236 gcc/testsuite/gcc.dg/encoding-issues-bytes.c | Bin 0 -> 595 bytes
237 .../gcc.dg/encoding-issues-unicode.c | Bin 0 -> 613 bytes
238 .../gfortran.dg/diagnostic-format-json-1.F90 | 1 +
239 .../gfortran.dg/diagnostic-format-json-2.F90 | 1 +
240 .../gfortran.dg/diagnostic-format-json-3.F90 | 1 +
241 libcpp/charset.c | 63 +-
242 libcpp/errors.c | 82 ++-
243 libcpp/include/cpplib.h | 76 ++-
244 libcpp/include/line-map.h | 13 +
245 libcpp/internal.h | 23 +
246 libcpp/lex.c | 38 +-
247 libcpp/line-map.c | 3 +-
248 31 files changed, 942 insertions(+), 168 deletions(-)
249 create mode 100644 gcc/testsuite/gcc.dg/cpp/warn-normalized-4-bytes.c
250 create mode 100644 gcc/testsuite/gcc.dg/cpp/warn-normalized-4-unicode.c
251 create mode 100644 gcc/testsuite/gcc.dg/encoding-issues-bytes.c
252 create mode 100644 gcc/testsuite/gcc.dg/encoding-issues-unicode.c
253
254diff --git a/gcc/c-family/c-lex.c b/gcc/c-family/c-lex.c
255--- a/gcc/c-family/c-lex.c 2021-07-27 23:55:06.980283060 -0700
256+++ b/gcc/c-family/c-lex.c 2021-12-14 01:16:01.541943272 -0800
257@@ -603,7 +603,11 @@ c_lex_with_flags (tree *value, location_
258 else if (ISGRAPH (c))
259 error_at (*loc, "stray %qc in program", (int) c);
260 else
261- error_at (*loc, "stray %<\\%o%> in program", (int) c);
262+ {
263+ rich_location rich_loc (line_table, *loc);
264+ rich_loc.set_escape_on_output (true);
265+ error_at (&rich_loc, "stray %<\\%o%> in program", (int) c);
266+ }
267 }
268 goto retry;
269
270diff --git a/gcc/common.opt b/gcc/common.opt
271--- a/gcc/common.opt 2021-12-13 22:08:44.939137107 -0800
272+++ b/gcc/common.opt 2021-12-14 01:16:01.541943272 -0800
273@@ -1348,6 +1348,10 @@ fdiagnostics-format=
274 Common Joined RejectNegative Enum(diagnostics_output_format)
275 -fdiagnostics-format=[text|json] Select output format.
276
277+fdiagnostics-escape-format=
278+Common Joined RejectNegative Enum(diagnostics_escape_format)
279+-fdiagnostics-escape-format=[unicode|bytes] Select how to escape non-printable-ASCII bytes in the source for diagnostics that suggest it.
280+
281 ; Required for these enum values.
282 SourceInclude
283 diagnostic.h
284@@ -1362,6 +1366,15 @@ EnumValue
285 Enum(diagnostics_column_unit) String(byte) Value(DIAGNOSTICS_COLUMN_UNIT_BYTE)
286
287 Enum
288+Name(diagnostics_escape_format) Type(int)
289+
290+EnumValue
291+Enum(diagnostics_escape_format) String(unicode) Value(DIAGNOSTICS_ESCAPE_FORMAT_UNICODE)
292+
293+EnumValue
294+Enum(diagnostics_escape_format) String(bytes) Value(DIAGNOSTICS_ESCAPE_FORMAT_BYTES)
295+
296+Enum
297 Name(diagnostics_output_format) Type(int)
298
299 EnumValue
300diff --git a/gcc/diagnostic.c b/gcc/diagnostic.c
301--- a/gcc/diagnostic.c 2021-07-27 23:55:07.232286576 -0700
302+++ b/gcc/diagnostic.c 2021-12-14 01:16:01.545943202 -0800
303@@ -230,6 +230,7 @@ diagnostic_initialize (diagnostic_contex
304 context->column_unit = DIAGNOSTICS_COLUMN_UNIT_DISPLAY;
305 context->column_origin = 1;
306 context->tabstop = 8;
307+ context->escape_format = DIAGNOSTICS_ESCAPE_FORMAT_UNICODE;
308 context->edit_context_ptr = NULL;
309 context->diagnostic_group_nesting_depth = 0;
310 context->diagnostic_group_emission_count = 0;
311@@ -382,7 +383,10 @@ convert_column_unit (enum diagnostics_co
312 gcc_unreachable ();
313
314 case DIAGNOSTICS_COLUMN_UNIT_DISPLAY:
315- return location_compute_display_column (s, tabstop);
316+ {
317+ cpp_char_column_policy policy (tabstop, cpp_wcwidth);
318+ return location_compute_display_column (s, policy);
319+ }
320
321 case DIAGNOSTICS_COLUMN_UNIT_BYTE:
322 return s.column;
323@@ -2275,8 +2279,8 @@ test_diagnostic_get_location_text ()
324 const char *const content = "smile \xf0\x9f\x98\x82\n";
325 const int line_bytes = strlen (content) - 1;
326 const int def_tabstop = 8;
327- const int display_width = cpp_display_width (content, line_bytes,
328- def_tabstop);
329+ const cpp_char_column_policy policy (def_tabstop, cpp_wcwidth);
330+ const int display_width = cpp_display_width (content, line_bytes, policy);
331 ASSERT_EQ (line_bytes - 2, display_width);
332 temp_source_file tmp (SELFTEST_LOCATION, ".c", content);
333 const char *const fname = tmp.get_filename ();
334diff --git a/gcc/diagnostic-format-json.cc b/gcc/diagnostic-format-json.cc
335--- a/gcc/diagnostic-format-json.cc 2021-07-27 23:55:07.232286576 -0700
336+++ b/gcc/diagnostic-format-json.cc 2021-12-14 01:16:01.541943272 -0800
337@@ -264,6 +264,9 @@ json_end_diagnostic (diagnostic_context
338 json::value *path_value = context->make_json_for_path (context, path);
339 diag_obj->set ("path", path_value);
340 }
341+
342+ diag_obj->set ("escape-source",
343+ new json::literal (richloc->escape_on_output_p ()));
344 }
345
346 /* No-op implementation of "begin_group_cb" for JSON output. */
347diff --git a/gcc/diagnostic.h b/gcc/diagnostic.h
348--- a/gcc/diagnostic.h 2021-07-27 23:55:07.236286632 -0700
349+++ b/gcc/diagnostic.h 2021-12-14 01:16:01.545943202 -0800
350@@ -38,6 +38,20 @@ enum diagnostics_column_unit
351 DIAGNOSTICS_COLUMN_UNIT_BYTE
352 };
353
354+/* An enum for controlling how to print non-ASCII characters/bytes when
355+ a diagnostic suggests escaping the source code on output. */
356+
357+enum diagnostics_escape_format
358+{
359+ /* Escape non-ASCII Unicode characters in the form <U+XXXX> and
360+ non-UTF-8 bytes in the form <XX>. */
361+ DIAGNOSTICS_ESCAPE_FORMAT_UNICODE,
362+
363+ /* Escape non-ASCII bytes in the form <XX> (thus showing the underlying
364+ encoding of non-ASCII Unicode characters). */
365+ DIAGNOSTICS_ESCAPE_FORMAT_BYTES
366+};
367+
368 /* Enum for overriding the standard output format. */
369
370 enum diagnostics_output_format
371@@ -320,6 +334,10 @@ struct diagnostic_context
372 /* The size of the tabstop for tab expansion. */
373 int tabstop;
374
375+ /* How should non-ASCII/non-printable bytes be escaped when
376+ a diagnostic suggests escaping the source code on output. */
377+ enum diagnostics_escape_format escape_format;
378+
379 /* If non-NULL, an edit_context to which fix-it hints should be
380 applied, for generating patches. */
381 edit_context *edit_context_ptr;
382diff --git a/gcc/diagnostic-show-locus.c b/gcc/diagnostic-show-locus.c
383--- a/gcc/diagnostic-show-locus.c 2021-07-27 23:55:07.232286576 -0700
384+++ b/gcc/diagnostic-show-locus.c 2021-12-14 01:16:01.545943202 -0800
385@@ -175,10 +175,26 @@ enum column_unit {
386 class exploc_with_display_col : public expanded_location
387 {
388 public:
389- exploc_with_display_col (const expanded_location &exploc, int tabstop)
390- : expanded_location (exploc),
391- m_display_col (location_compute_display_column (exploc, tabstop))
392- {}
393+ exploc_with_display_col (const expanded_location &exploc,
394+ const cpp_char_column_policy &policy,
395+ enum location_aspect aspect)
396+ : expanded_location (exploc),
397+ m_display_col (location_compute_display_column (exploc, policy))
398+ {
399+ if (exploc.column > 0)
400+ {
401+ /* m_display_col is now the final column of the byte.
402+ If escaping has happened, we may want the first column instead. */
403+ if (aspect != LOCATION_ASPECT_FINISH)
404+ {
405+ expanded_location prev_exploc (exploc);
406+ prev_exploc.column--;
407+ int prev_display_col
408+ = (location_compute_display_column (prev_exploc, policy));
409+ m_display_col = prev_display_col + 1;
410+ }
411+ }
412+ }
413
414 int m_display_col;
415 };
416@@ -313,6 +329,31 @@ test_line_span ()
417
418 #endif /* #if CHECKING_P */
419
420+/* A bundle of information containing how to print unicode
421+ characters and bytes when quoting source code.
422+
423+ Provides a unified place to support escaping some subset
424+ of characters to some format.
425+
426+ Extends char_column_policy; printing is split out to avoid
427+ libcpp having to know about pretty_printer. */
428+
429+struct char_display_policy : public cpp_char_column_policy
430+{
431+ public:
432+ char_display_policy (int tabstop,
433+ int (*width_cb) (cppchar_t c),
434+ void (*print_cb) (pretty_printer *pp,
435+ const cpp_decoded_char &cp))
436+ : cpp_char_column_policy (tabstop, width_cb),
437+ m_print_cb (print_cb)
438+ {
439+ }
440+
441+ void (*m_print_cb) (pretty_printer *pp,
442+ const cpp_decoded_char &cp);
443+};
444+
445 /* A class to control the overall layout when printing a diagnostic.
446
447 The layout is determined within the constructor.
448@@ -345,6 +386,8 @@ class layout
449
450 void print_line (linenum_type row);
451
452+ void on_bad_codepoint (const char *ptr, cppchar_t ch, size_t ch_sz);
453+
454 private:
455 bool will_show_line_p (linenum_type row) const;
456 void print_leading_fixits (linenum_type row);
457@@ -386,6 +429,7 @@ class layout
458 private:
459 diagnostic_context *m_context;
460 pretty_printer *m_pp;
461+ char_display_policy m_policy;
462 location_t m_primary_loc;
463 exploc_with_display_col m_exploc;
464 colorizer m_colorizer;
465@@ -398,6 +442,7 @@ class layout
466 auto_vec <line_span> m_line_spans;
467 int m_linenum_width;
468 int m_x_offset_display;
469+ bool m_escape_on_output;
470 };
471
472 /* Implementation of "class colorizer". */
473@@ -646,6 +691,11 @@ layout_range::intersects_line_p (linenum
474 /* Default for when we don't care what the tab expansion is set to. */
475 static const int def_tabstop = 8;
476
477+static cpp_char_column_policy def_policy ()
478+{
479+ return cpp_char_column_policy (8, cpp_wcwidth);
480+}
481+
482 /* Create some expanded locations for testing layout_range. The filename
483 member of the explocs is set to the empty string. This member will only be
484 inspected by the calls to location_compute_display_column() made from the
485@@ -662,10 +712,13 @@ make_range (int start_line, int start_co
486 = {"", start_line, start_col, NULL, false};
487 const expanded_location finish_exploc
488 = {"", end_line, end_col, NULL, false};
489- return layout_range (exploc_with_display_col (start_exploc, def_tabstop),
490- exploc_with_display_col (finish_exploc, def_tabstop),
491+ return layout_range (exploc_with_display_col (start_exploc, def_policy (),
492+ LOCATION_ASPECT_START),
493+ exploc_with_display_col (finish_exploc, def_policy (),
494+ LOCATION_ASPECT_FINISH),
495 SHOW_RANGE_WITHOUT_CARET,
496- exploc_with_display_col (start_exploc, def_tabstop),
497+ exploc_with_display_col (start_exploc, def_policy (),
498+ LOCATION_ASPECT_CARET),
499 0, NULL);
500 }
501
502@@ -959,6 +1012,164 @@ fixit_cmp (const void *p_a, const void *
503 return hint_a->get_start_loc () - hint_b->get_start_loc ();
504 }
505
506+/* Callbacks for use when not escaping the source. */
507+
508+/* The default callback for char_column_policy::m_width_cb is cpp_wcwidth. */
509+
510+/* Callback for char_display_policy::m_print_cb for printing source chars
511+ when not escaping the source. */
512+
513+static void
514+default_print_decoded_ch (pretty_printer *pp,
515+ const cpp_decoded_char &decoded_ch)
516+{
517+ for (const char *ptr = decoded_ch.m_start_byte;
518+ ptr != decoded_ch.m_next_byte; ptr++)
519+ {
520+ if (*ptr == '\0' || *ptr == '\r')
521+ {
522+ pp_space (pp);
523+ continue;
524+ }
525+
526+ pp_character (pp, *ptr);
527+ }
528+}
529+
530+/* Callbacks for use with DIAGNOSTICS_ESCAPE_FORMAT_BYTES. */
531+
532+static const int width_per_escaped_byte = 4;
533+
534+/* Callback for char_column_policy::m_width_cb for determining the
535+ display width when escaping with DIAGNOSTICS_ESCAPE_FORMAT_BYTES. */
536+
537+static int
538+escape_as_bytes_width (cppchar_t ch)
539+{
540+ if (ch < 0x80 && ISPRINT (ch))
541+ return cpp_wcwidth (ch);
542+ else
543+ {
544+ if (ch <= 0x7F) return 1 * width_per_escaped_byte;
545+ if (ch <= 0x7FF) return 2 * width_per_escaped_byte;
546+ if (ch <= 0xFFFF) return 3 * width_per_escaped_byte;
547+ return 4 * width_per_escaped_byte;
548+ }
549+}
550+
551+/* Callback for char_display_policy::m_print_cb for printing source chars
552+ when escaping with DIAGNOSTICS_ESCAPE_FORMAT_BYTES. */
553+
554+static void
555+escape_as_bytes_print (pretty_printer *pp,
556+ const cpp_decoded_char &decoded_ch)
557+{
558+ if (!decoded_ch.m_valid_ch)
559+ {
560+ for (const char *iter = decoded_ch.m_start_byte;
561+ iter != decoded_ch.m_next_byte; ++iter)
562+ {
563+ char buf[16];
564+ sprintf (buf, "<%02x>", (unsigned char)*iter);
565+ pp_string (pp, buf);
566+ }
567+ return;
568+ }
569+
570+ cppchar_t ch = decoded_ch.m_ch;
571+ if (ch < 0x80 && ISPRINT (ch))
572+ pp_character (pp, ch);
573+ else
574+ {
575+ for (const char *iter = decoded_ch.m_start_byte;
576+ iter < decoded_ch.m_next_byte; ++iter)
577+ {
578+ char buf[16];
579+ sprintf (buf, "<%02x>", (unsigned char)*iter);
580+ pp_string (pp, buf);
581+ }
582+ }
583+}
584+
585+/* Callbacks for use with DIAGNOSTICS_ESCAPE_FORMAT_UNICODE. */
586+
587+/* Callback for char_column_policy::m_width_cb for determining the
588+ display width when escaping with DIAGNOSTICS_ESCAPE_FORMAT_UNICODE. */
589+
590+static int
591+escape_as_unicode_width (cppchar_t ch)
592+{
593+ if (ch < 0x80 && ISPRINT (ch))
594+ return cpp_wcwidth (ch);
595+ else
596+ {
597+ // Width of "<U+%04x>"
598+ if (ch > 0xfffff)
599+ return 10;
600+ else if (ch > 0xffff)
601+ return 9;
602+ else
603+ return 8;
604+ }
605+}
606+
607+/* Callback for char_display_policy::m_print_cb for printing source chars
608+ when escaping with DIAGNOSTICS_ESCAPE_FORMAT_UNICODE. */
609+
610+static void
611+escape_as_unicode_print (pretty_printer *pp,
612+ const cpp_decoded_char &decoded_ch)
613+{
614+ if (!decoded_ch.m_valid_ch)
615+ {
616+ escape_as_bytes_print (pp, decoded_ch);
617+ return;
618+ }
619+
620+ cppchar_t ch = decoded_ch.m_ch;
621+ if (ch < 0x80 && ISPRINT (ch))
622+ pp_character (pp, ch);
623+ else
624+ {
625+ char buf[16];
626+ sprintf (buf, "<U+%04X>", ch);
627+ pp_string (pp, buf);
628+ }
629+}
630+
631+/* Populate a char_display_policy based on DC and RICHLOC. */
632+
633+static char_display_policy
634+make_policy (const diagnostic_context &dc,
635+ const rich_location &richloc)
636+{
637+ /* The default is to not escape non-ASCII bytes. */
638+ char_display_policy result
639+ (dc.tabstop, cpp_wcwidth, default_print_decoded_ch);
640+
641+ /* If the diagnostic suggests escaping non-ASCII bytes, then
642+ use policy from user-supplied options. */
643+ if (richloc.escape_on_output_p ())
644+ {
645+ result.m_undecoded_byte_width = width_per_escaped_byte;
646+ switch (dc.escape_format)
647+ {
648+ default:
649+ gcc_unreachable ();
650+ case DIAGNOSTICS_ESCAPE_FORMAT_UNICODE:
651+ result.m_width_cb = escape_as_unicode_width;
652+ result.m_print_cb = escape_as_unicode_print;
653+ break;
654+ case DIAGNOSTICS_ESCAPE_FORMAT_BYTES:
655+ result.m_width_cb = escape_as_bytes_width;
656+ result.m_print_cb = escape_as_bytes_print;
657+ break;
658+ }
659+ }
660+
661+ return result;
662+}
663+
664 /* Implementation of class layout. */
665
666 /* Constructor for class layout.
667@@ -975,8 +1186,10 @@ layout::layout (diagnostic_context * con
668 diagnostic_t diagnostic_kind)
669 : m_context (context),
670 m_pp (context->printer),
671+ m_policy (make_policy (*context, *richloc)),
672 m_primary_loc (richloc->get_range (0)->m_loc),
673- m_exploc (richloc->get_expanded_location (0), context->tabstop),
674+ m_exploc (richloc->get_expanded_location (0), m_policy,
675+ LOCATION_ASPECT_CARET),
676 m_colorizer (context, diagnostic_kind),
677 m_colorize_source_p (context->colorize_source_p),
678 m_show_labels_p (context->show_labels_p),
679@@ -986,7 +1199,8 @@ layout::layout (diagnostic_context * con
680 m_fixit_hints (richloc->get_num_fixit_hints ()),
681 m_line_spans (1 + richloc->get_num_locations ()),
682 m_linenum_width (0),
683- m_x_offset_display (0)
684+ m_x_offset_display (0),
685+ m_escape_on_output (richloc->escape_on_output_p ())
686 {
687 for (unsigned int idx = 0; idx < richloc->get_num_locations (); idx++)
688 {
689@@ -1072,10 +1286,13 @@ layout::maybe_add_location_range (const
690
691 /* Everything is now known to be in the correct source file,
692 but it may require further sanitization. */
693- layout_range ri (exploc_with_display_col (start, m_context->tabstop),
694- exploc_with_display_col (finish, m_context->tabstop),
695+ layout_range ri (exploc_with_display_col (start, m_policy,
696+ LOCATION_ASPECT_START),
697+ exploc_with_display_col (finish, m_policy,
698+ LOCATION_ASPECT_FINISH),
699 loc_range->m_range_display_kind,
700- exploc_with_display_col (caret, m_context->tabstop),
701+ exploc_with_display_col (caret, m_policy,
702+ LOCATION_ASPECT_CARET),
703 original_idx, loc_range->m_label);
704
705 /* If we have a range that finishes before it starts (perhaps
706@@ -1409,7 +1626,7 @@ layout::calculate_x_offset_display ()
707 = get_line_bytes_without_trailing_whitespace (line.get_buffer (),
708 line.length ());
709 int eol_display_column
710- = cpp_display_width (line.get_buffer (), line_bytes, m_context->tabstop);
711+ = cpp_display_width (line.get_buffer (), line_bytes, m_policy);
712 if (caret_display_column > eol_display_column
713 || !caret_display_column)
714 {
715@@ -1488,7 +1705,7 @@ layout::print_source_line (linenum_type
716 /* This object helps to keep track of which display column we are at, which is
717 necessary for computing the line bounds in display units, for doing
718 tab expansion, and for implementing m_x_offset_display. */
719- cpp_display_width_computation dw (line, line_bytes, m_context->tabstop);
720+ cpp_display_width_computation dw (line, line_bytes, m_policy);
721
722 /* Skip the first m_x_offset_display display columns. In case the leading
723 portion that will be skipped ends with a character with wcwidth > 1, then
724@@ -1536,7 +1753,8 @@ layout::print_source_line (linenum_type
725 tabs and replacing some control bytes with spaces as necessary. */
726 const char *c = dw.next_byte ();
727 const int start_disp_col = dw.display_cols_processed () + 1;
728- const int this_display_width = dw.process_next_codepoint ();
729+ cpp_decoded_char cp;
730+ const int this_display_width = dw.process_next_codepoint (&cp);
731 if (*c == '\t')
732 {
733 /* The returned display width is the number of spaces into which the
734@@ -1545,15 +1763,6 @@ layout::print_source_line (linenum_type
735 pp_space (m_pp);
736 continue;
737 }
738- if (*c == '\0' || *c == '\r')
739- {
740- /* cpp_wcwidth() promises to return 1 for all control bytes, and we
741- want to output these as a single space too, so this case is
742- actually the same as the '\t' case. */
743- gcc_assert (this_display_width == 1);
744- pp_space (m_pp);
745- continue;
746- }
747
748 /* We have a (possibly multibyte) character to output; update the line
749 bounds if it is not whitespace. */
750@@ -1565,7 +1774,8 @@ layout::print_source_line (linenum_type
751 }
752
753 /* Output the character. */
754- while (c != dw.next_byte ()) pp_character (m_pp, *c++);
755+ m_policy.m_print_cb (m_pp, cp);
756+ c = dw.next_byte ();
757 }
758 print_newline ();
759 return lbounds;
760@@ -1664,14 +1874,14 @@ layout::print_annotation_line (linenum_t
761 class line_label
762 {
763 public:
764- line_label (diagnostic_context *context, int state_idx, int column,
765+ line_label (const cpp_char_column_policy &policy,
766+ int state_idx, int column,
767 label_text text)
768 : m_state_idx (state_idx), m_column (column),
769 m_text (text), m_label_line (0), m_has_vbar (true)
770 {
771 const int bytes = strlen (text.m_buffer);
772- m_display_width
773- = cpp_display_width (text.m_buffer, bytes, context->tabstop);
774+ m_display_width = cpp_display_width (text.m_buffer, bytes, policy);
775 }
776
777 /* Sorting is primarily by column, then by state index. */
778@@ -1731,7 +1941,7 @@ layout::print_any_labels (linenum_type r
779 if (text.m_buffer == NULL)
780 continue;
781
782- labels.safe_push (line_label (m_context, i, disp_col, text));
783+ labels.safe_push (line_label (m_policy, i, disp_col, text));
784 }
785 }
786
787@@ -2011,7 +2221,7 @@ public:
788
789 /* Get the range of bytes or display columns that HINT would affect. */
790 static column_range
791-get_affected_range (diagnostic_context *context,
792+get_affected_range (const cpp_char_column_policy &policy,
793 const fixit_hint *hint, enum column_unit col_unit)
794 {
795 expanded_location exploc_start = expand_location (hint->get_start_loc ());
796@@ -2022,13 +2232,11 @@ get_affected_range (diagnostic_context *
797 int finish_column;
798 if (col_unit == CU_DISPLAY_COLS)
799 {
800- start_column
801- = location_compute_display_column (exploc_start, context->tabstop);
802+ start_column = location_compute_display_column (exploc_start, policy);
803 if (hint->insertion_p ())
804 finish_column = start_column - 1;
805 else
806- finish_column
807- = location_compute_display_column (exploc_finish, context->tabstop);
808+ finish_column = location_compute_display_column (exploc_finish, policy);
809 }
810 else
811 {
812@@ -2041,12 +2249,13 @@ get_affected_range (diagnostic_context *
813 /* Get the range of display columns that would be printed for HINT. */
814
815 static column_range
816-get_printed_columns (diagnostic_context *context, const fixit_hint *hint)
817+get_printed_columns (const cpp_char_column_policy &policy,
818+ const fixit_hint *hint)
819 {
820 expanded_location exploc = expand_location (hint->get_start_loc ());
821- int start_column = location_compute_display_column (exploc, context->tabstop);
822+ int start_column = location_compute_display_column (exploc, policy);
823 int hint_width = cpp_display_width (hint->get_string (), hint->get_length (),
824- context->tabstop);
825+ policy);
826 int final_hint_column = start_column + hint_width - 1;
827 if (hint->insertion_p ())
828 {
829@@ -2056,8 +2265,7 @@ get_printed_columns (diagnostic_context
830 {
831 exploc = expand_location (hint->get_next_loc ());
832 --exploc.column;
833- int finish_column
834- = location_compute_display_column (exploc, context->tabstop);
835+ int finish_column = location_compute_display_column (exploc, policy);
836 return column_range (start_column,
837 MAX (finish_column, final_hint_column));
838 }
839@@ -2075,13 +2283,13 @@ public:
840 column_range affected_columns,
841 column_range printed_columns,
842 const char *new_text, size_t new_text_len,
843- int tabstop)
844+ const cpp_char_column_policy &policy)
845 : m_affected_bytes (affected_bytes),
846 m_affected_columns (affected_columns),
847 m_printed_columns (printed_columns),
848 m_text (xstrdup (new_text)),
849 m_byte_length (new_text_len),
850- m_tabstop (tabstop),
851+ m_policy (policy),
852 m_alloc_sz (new_text_len + 1)
853 {
854 compute_display_cols ();
855@@ -2099,7 +2307,7 @@ public:
856
857 void compute_display_cols ()
858 {
859- m_display_cols = cpp_display_width (m_text, m_byte_length, m_tabstop);
860+ m_display_cols = cpp_display_width (m_text, m_byte_length, m_policy);
861 }
862
863 void overwrite (int dst_offset, const char_span &src_span)
864@@ -2127,7 +2335,7 @@ public:
865 char *m_text;
866 size_t m_byte_length; /* Not including null-terminator. */
867 int m_display_cols;
868- int m_tabstop;
869+ const cpp_char_column_policy &m_policy;
870 size_t m_alloc_sz;
871 };
872
873@@ -2163,15 +2371,16 @@ correction::ensure_terminated ()
874 class line_corrections
875 {
876 public:
877- line_corrections (diagnostic_context *context, const char *filename,
878+ line_corrections (const char_display_policy &policy,
879+ const char *filename,
880 linenum_type row)
881- : m_context (context), m_filename (filename), m_row (row)
882+ : m_policy (policy), m_filename (filename), m_row (row)
883 {}
884 ~line_corrections ();
885
886 void add_hint (const fixit_hint *hint);
887
888- diagnostic_context *m_context;
889+ const char_display_policy &m_policy;
890 const char *m_filename;
891 linenum_type m_row;
892 auto_vec <correction *> m_corrections;
893@@ -2217,10 +2426,10 @@ source_line::source_line (const char *fi
894 void
895 line_corrections::add_hint (const fixit_hint *hint)
896 {
897- column_range affected_bytes = get_affected_range (m_context, hint, CU_BYTES);
898- column_range affected_columns = get_affected_range (m_context, hint,
899+ column_range affected_bytes = get_affected_range (m_policy, hint, CU_BYTES);
900+ column_range affected_columns = get_affected_range (m_policy, hint,
901 CU_DISPLAY_COLS);
902- column_range printed_columns = get_printed_columns (m_context, hint);
903+ column_range printed_columns = get_printed_columns (m_policy, hint);
904
905 /* Potentially consolidate. */
906 if (!m_corrections.is_empty ())
907@@ -2289,7 +2498,7 @@ line_corrections::add_hint (const fixit_
908 printed_columns,
909 hint->get_string (),
910 hint->get_length (),
911- m_context->tabstop));
912+ m_policy));
913 }
914
915 /* If there are any fixit hints on source line ROW, print them.
916@@ -2303,7 +2512,7 @@ layout::print_trailing_fixits (linenum_t
917 {
918 /* Build a list of correction instances for the line,
919 potentially consolidating hints (for the sake of readability). */
920- line_corrections corrections (m_context, m_exploc.file, row);
921+ line_corrections corrections (m_policy, m_exploc.file, row);
922 for (unsigned int i = 0; i < m_fixit_hints.length (); i++)
923 {
924 const fixit_hint *hint = m_fixit_hints[i];
925@@ -2646,6 +2855,59 @@ namespace selftest {
926
927 /* Selftests for diagnostic_show_locus. */
928
929+/* Verify that cpp_display_width correctly handles escaping. */
930+
931+static void
932+test_display_widths ()
933+{
934+ gcc_rich_location richloc (UNKNOWN_LOCATION);
935+
936+ /* U+03C0 "GREEK SMALL LETTER PI". */
937+ const char *pi = "\xCF\x80";
938+ /* U+1F642 "SLIGHTLY SMILING FACE". */
939+ const char *emoji = "\xF0\x9F\x99\x82";
940+ /* Stray trailing byte of a UTF-8 character. */
941+ const char *stray = "\xBF";
942+ /* U+10FFFF. */
943+ const char *max_codepoint = "\xF4\x8F\xBF\xBF";
944+
945+ /* No escaping. */
946+ {
947+ test_diagnostic_context dc;
948+ char_display_policy policy (make_policy (dc, richloc));
949+ ASSERT_EQ (cpp_display_width (pi, strlen (pi), policy), 1);
950+ ASSERT_EQ (cpp_display_width (emoji, strlen (emoji), policy), 2);
951+ ASSERT_EQ (cpp_display_width (stray, strlen (stray), policy), 1);
952+ /* Don't check width of U+10FFFF; it's in a private use plane. */
953+ }
954+
955+ richloc.set_escape_on_output (true);
956+
957+ {
958+ test_diagnostic_context dc;
959+ dc.escape_format = DIAGNOSTICS_ESCAPE_FORMAT_UNICODE;
960+ char_display_policy policy (make_policy (dc, richloc));
961+ ASSERT_EQ (cpp_display_width (pi, strlen (pi), policy), 8);
962+ ASSERT_EQ (cpp_display_width (emoji, strlen (emoji), policy), 9);
963+ ASSERT_EQ (cpp_display_width (stray, strlen (stray), policy), 4);
964+ ASSERT_EQ (cpp_display_width (max_codepoint, strlen (max_codepoint),
965+ policy),
966+ strlen ("<U+10FFFF>"));
967+ }
968+
969+ {
970+ test_diagnostic_context dc;
971+ dc.escape_format = DIAGNOSTICS_ESCAPE_FORMAT_BYTES;
972+ char_display_policy policy (make_policy (dc, richloc));
973+ ASSERT_EQ (cpp_display_width (pi, strlen (pi), policy), 8);
974+ ASSERT_EQ (cpp_display_width (emoji, strlen (emoji), policy), 16);
975+ ASSERT_EQ (cpp_display_width (stray, strlen (stray), policy), 4);
976+ ASSERT_EQ (cpp_display_width (max_codepoint, strlen (max_codepoint),
977+ policy),
978+ 16);
979+ }
980+}
981+
982 /* For precise tests of the layout, make clear where the source line will
983 start. test_left_margin sets the total byte count from the left side of the
984 screen to the start of source lines, after the line number and the separator,
985@@ -2715,10 +2977,10 @@ test_layout_x_offset_display_utf8 (const
986 char_span lspan = location_get_source_line (tmp.get_filename (), 1);
987 ASSERT_EQ (line_display_cols,
988 cpp_display_width (lspan.get_buffer (), lspan.length (),
989- def_tabstop));
990+ def_policy ()));
991 ASSERT_EQ (line_display_cols,
992 location_compute_display_column (expand_location (line_end),
993- def_tabstop));
994+ def_policy ()));
995 ASSERT_EQ (0, memcmp (lspan.get_buffer () + (emoji_col - 1),
996 "\xf0\x9f\x98\x82\xf0\x9f\x98\x82", 8));
997
998@@ -2866,12 +3128,13 @@ test_layout_x_offset_display_tab (const
999 ASSERT_EQ ('\t', *(lspan.get_buffer () + (tab_col - 1)));
1000 for (int tabstop = 1; tabstop != num_tabstops; ++tabstop)
1001 {
1002+ cpp_char_column_policy policy (tabstop, cpp_wcwidth);
1003 ASSERT_EQ (line_bytes + extra_width[tabstop],
1004 cpp_display_width (lspan.get_buffer (), lspan.length (),
1005- tabstop));
1006+ policy));
1007 ASSERT_EQ (line_bytes + extra_width[tabstop],
1008 location_compute_display_column (expand_location (line_end),
1009- tabstop));
1010+ policy));
1011 }
1012
1013 /* Check that the tab is expanded to the expected number of spaces. */
1014@@ -4003,6 +4266,43 @@ test_one_liner_labels_utf8 ()
1015 " bb\xf0\x9f\x98\x82\xf0\x9f\x98\x82\n",
1016 pp_formatted_text (dc.printer));
1017 }
1018+
1019+ /* Example of escaping the source lines. */
1020+ {
1021+ text_range_label label0 ("label 0\xf0\x9f\x98\x82");
1022+ text_range_label label1 ("label 1\xcf\x80");
1023+ text_range_label label2 ("label 2\xcf\x80");
1024+ gcc_rich_location richloc (foo, &label0);
1025+ richloc.add_range (bar, SHOW_RANGE_WITHOUT_CARET, &label1);
1026+ richloc.add_range (field, SHOW_RANGE_WITHOUT_CARET, &label2);
1027+ richloc.set_escape_on_output (true);
1028+
1029+ {
1030+ test_diagnostic_context dc;
1031+ dc.escape_format = DIAGNOSTICS_ESCAPE_FORMAT_UNICODE;
1032+ diagnostic_show_locus (&dc, &richloc, DK_ERROR);
1033+ ASSERT_STREQ (" <U+1F602>_foo = <U+03C0>_bar.<U+1F602>_field<U+03C0>;\n"
1034+ " ^~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~\n"
1035+ " | | |\n"
1036+ " | | label 2\xcf\x80\n"
1037+ " | label 1\xcf\x80\n"
1038+ " label 0\xf0\x9f\x98\x82\n",
1039+ pp_formatted_text (dc.printer));
1040+ }
1041+ {
1042+ test_diagnostic_context dc;
1043+ dc.escape_format = DIAGNOSTICS_ESCAPE_FORMAT_BYTES;
1044+ diagnostic_show_locus (&dc, &richloc, DK_ERROR);
1045+ ASSERT_STREQ
1046+ (" <f0><9f><98><82>_foo = <cf><80>_bar.<f0><9f><98><82>_field<cf><80>;\n"
1047+ " ^~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n"
1048+ " | | |\n"
1049+ " | | label 2\xcf\x80\n"
1050+ " | label 1\xcf\x80\n"
1051+ " label 0\xf0\x9f\x98\x82\n",
1052+ pp_formatted_text (dc.printer));
1053+ }
1054+ }
1055 }
1056
1057 /* Make sure that colorization codes don't interrupt a multibyte
1058@@ -4057,9 +4357,9 @@ test_diagnostic_show_locus_one_liner_utf
1059
1060 char_span lspan = location_get_source_line (tmp.get_filename (), 1);
1061 ASSERT_EQ (25, cpp_display_width (lspan.get_buffer (), lspan.length (),
1062- def_tabstop));
1063+ def_policy ()));
1064 ASSERT_EQ (25, location_compute_display_column (expand_location (line_end),
1065- def_tabstop));
1066+ def_policy ()));
1067
1068 test_one_liner_simple_caret_utf8 ();
1069 test_one_liner_caret_and_range_utf8 ();
1070@@ -4445,30 +4745,31 @@ test_overlapped_fixit_printing (const li
1071 pp_formatted_text (dc.printer));
1072
1073 /* Unit-test the line_corrections machinery. */
1074+ char_display_policy policy (make_policy (dc, richloc));
1075 ASSERT_EQ (3, richloc.get_num_fixit_hints ());
1076 const fixit_hint *hint_0 = richloc.get_fixit_hint (0);
1077 ASSERT_EQ (column_range (12, 12),
1078- get_affected_range (&dc, hint_0, CU_BYTES));
1079+ get_affected_range (policy, hint_0, CU_BYTES));
1080 ASSERT_EQ (column_range (12, 12),
1081- get_affected_range (&dc, hint_0, CU_DISPLAY_COLS));
1082- ASSERT_EQ (column_range (12, 22), get_printed_columns (&dc, hint_0));
1083+ get_affected_range (policy, hint_0, CU_DISPLAY_COLS));
1084+ ASSERT_EQ (column_range (12, 22), get_printed_columns (policy, hint_0));
1085 const fixit_hint *hint_1 = richloc.get_fixit_hint (1);
1086 ASSERT_EQ (column_range (18, 18),
1087- get_affected_range (&dc, hint_1, CU_BYTES));
1088+ get_affected_range (policy, hint_1, CU_BYTES));
1089 ASSERT_EQ (column_range (18, 18),
1090- get_affected_range (&dc, hint_1, CU_DISPLAY_COLS));
1091- ASSERT_EQ (column_range (18, 20), get_printed_columns (&dc, hint_1));
1092+ get_affected_range (policy, hint_1, CU_DISPLAY_COLS));
1093+ ASSERT_EQ (column_range (18, 20), get_printed_columns (policy, hint_1));
1094 const fixit_hint *hint_2 = richloc.get_fixit_hint (2);
1095 ASSERT_EQ (column_range (29, 28),
1096- get_affected_range (&dc, hint_2, CU_BYTES));
1097+ get_affected_range (policy, hint_2, CU_BYTES));
1098 ASSERT_EQ (column_range (29, 28),
1099- get_affected_range (&dc, hint_2, CU_DISPLAY_COLS));
1100- ASSERT_EQ (column_range (29, 29), get_printed_columns (&dc, hint_2));
1101+ get_affected_range (policy, hint_2, CU_DISPLAY_COLS));
1102+ ASSERT_EQ (column_range (29, 29), get_printed_columns (policy, hint_2));
1103
1104 /* Add each hint in turn to a line_corrections instance,
1105 and verify that they are consolidated into one correction instance
1106 as expected. */
1107- line_corrections lc (&dc, tmp.get_filename (), 1);
1108+ line_corrections lc (policy, tmp.get_filename (), 1);
1109
1110 /* The first replace hint by itself. */
1111 lc.add_hint (hint_0);
1112@@ -4660,30 +4961,31 @@ test_overlapped_fixit_printing_utf8 (con
1113 pp_formatted_text (dc.printer));
1114
1115 /* Unit-test the line_corrections machinery. */
1116+ char_display_policy policy (make_policy (dc, richloc));
1117 ASSERT_EQ (3, richloc.get_num_fixit_hints ());
1118 const fixit_hint *hint_0 = richloc.get_fixit_hint (0);
1119 ASSERT_EQ (column_range (14, 14),
1120- get_affected_range (&dc, hint_0, CU_BYTES));
1121+ get_affected_range (policy, hint_0, CU_BYTES));
1122 ASSERT_EQ (column_range (12, 12),
1123- get_affected_range (&dc, hint_0, CU_DISPLAY_COLS));
1124- ASSERT_EQ (column_range (12, 22), get_printed_columns (&dc, hint_0));
1125+ get_affected_range (policy, hint_0, CU_DISPLAY_COLS));
1126+ ASSERT_EQ (column_range (12, 22), get_printed_columns (policy, hint_0));
1127 const fixit_hint *hint_1 = richloc.get_fixit_hint (1);
1128 ASSERT_EQ (column_range (22, 22),
1129- get_affected_range (&dc, hint_1, CU_BYTES));
1130+ get_affected_range (policy, hint_1, CU_BYTES));
1131 ASSERT_EQ (column_range (18, 18),
1132- get_affected_range (&dc, hint_1, CU_DISPLAY_COLS));
1133- ASSERT_EQ (column_range (18, 20), get_printed_columns (&dc, hint_1));
1134+ get_affected_range (policy, hint_1, CU_DISPLAY_COLS));
1135+ ASSERT_EQ (column_range (18, 20), get_printed_columns (policy, hint_1));
1136 const fixit_hint *hint_2 = richloc.get_fixit_hint (2);
1137 ASSERT_EQ (column_range (35, 34),
1138- get_affected_range (&dc, hint_2, CU_BYTES));
1139+ get_affected_range (policy, hint_2, CU_BYTES));
1140 ASSERT_EQ (column_range (30, 29),
1141- get_affected_range (&dc, hint_2, CU_DISPLAY_COLS));
1142- ASSERT_EQ (column_range (30, 30), get_printed_columns (&dc, hint_2));
1143+ get_affected_range (policy, hint_2, CU_DISPLAY_COLS));
1144+ ASSERT_EQ (column_range (30, 30), get_printed_columns (policy, hint_2));
1145
1146 /* Add each hint in turn to a line_corrections instance,
1147 and verify that they are consolidated into one correction instance
1148 as expected. */
1149- line_corrections lc (&dc, tmp.get_filename (), 1);
1150+ line_corrections lc (policy, tmp.get_filename (), 1);
1151
1152 /* The first replace hint by itself. */
1153 lc.add_hint (hint_0);
1154@@ -4877,15 +5179,16 @@ test_overlapped_fixit_printing_2 (const
1155 richloc.add_fixit_insert_before (col_21, "}");
1156
1157 /* These fixits should be accepted; they can't be consolidated. */
1158+ char_display_policy policy (make_policy (dc, richloc));
1159 ASSERT_EQ (2, richloc.get_num_fixit_hints ());
1160 const fixit_hint *hint_0 = richloc.get_fixit_hint (0);
1161 ASSERT_EQ (column_range (23, 22),
1162- get_affected_range (&dc, hint_0, CU_BYTES));
1163- ASSERT_EQ (column_range (23, 23), get_printed_columns (&dc, hint_0));
1164+ get_affected_range (policy, hint_0, CU_BYTES));
1165+ ASSERT_EQ (column_range (23, 23), get_printed_columns (policy, hint_0));
1166 const fixit_hint *hint_1 = richloc.get_fixit_hint (1);
1167 ASSERT_EQ (column_range (21, 20),
1168- get_affected_range (&dc, hint_1, CU_BYTES));
1169- ASSERT_EQ (column_range (21, 21), get_printed_columns (&dc, hint_1));
1170+ get_affected_range (policy, hint_1, CU_BYTES));
1171+ ASSERT_EQ (column_range (21, 21), get_printed_columns (policy, hint_1));
1172
1173 /* Verify that they're printed correctly. */
1174 diagnostic_show_locus (&dc, &richloc, DK_ERROR);
1175@@ -5152,10 +5455,11 @@ test_tab_expansion (const line_table_cas
1176 ....................123 45678901234 56789012345 columns */
1177
1178 const int tabstop = 8;
1179+ cpp_char_column_policy policy (tabstop, cpp_wcwidth);
1180 const int first_non_ws_byte_col = 7;
1181 const int right_quote_byte_col = 15;
1182 const int last_byte_col = 25;
1183- ASSERT_EQ (35, cpp_display_width (content, last_byte_col, tabstop));
1184+ ASSERT_EQ (35, cpp_display_width (content, last_byte_col, policy));
1185
1186 temp_source_file tmp (SELFTEST_LOCATION, ".c", content);
1187 line_table_test ltt (case_);
1188@@ -5198,6 +5502,114 @@ test_tab_expansion (const line_table_cas
1189 }
1190 }
1191
1192+/* Verify that the escaping machinery can cope with a variety of different
1193+ invalid bytes. */
1194+
1195+static void
1196+test_escaping_bytes_1 (const line_table_case &case_)
1197+{
1198+ const char content[] = "before\0\1\2\3\r\x80\xff""after\n";
1199+ const size_t sz = sizeof (content);
1200+ temp_source_file tmp (SELFTEST_LOCATION, ".c", content, sz);
1201+ line_table_test ltt (case_);
1202+ const line_map_ordinary *ord_map = linemap_check_ordinary
1203+ (linemap_add (line_table, LC_ENTER, false, tmp.get_filename (), 0));
1204+ linemap_line_start (line_table, 1, 100);
1205+
1206+ location_t finish
1207+ = linemap_position_for_line_and_column (line_table, ord_map, 1,
1208+ strlen (content));
1209+
1210+ if (finish > LINE_MAP_MAX_LOCATION_WITH_COLS)
1211+ return;
1212+
1213+ /* Locations of the NUL and \r bytes. */
1214+ location_t nul_loc
1215+ = linemap_position_for_line_and_column (line_table, ord_map, 1, 7);
1216+ location_t r_loc
1217+ = linemap_position_for_line_and_column (line_table, ord_map, 1, 11);
1218+ gcc_rich_location richloc (nul_loc);
1219+ richloc.add_range (r_loc);
1220+
1221+ {
1222+ test_diagnostic_context dc;
1223+ diagnostic_show_locus (&dc, &richloc, DK_ERROR);
1224+ ASSERT_STREQ (" before \1\2\3 \x80\xff""after\n"
1225+ " ^ ~\n",
1226+ pp_formatted_text (dc.printer));
1227+ }
1228+ richloc.set_escape_on_output (true);
1229+ {
1230+ test_diagnostic_context dc;
1231+ dc.escape_format = DIAGNOSTICS_ESCAPE_FORMAT_UNICODE;
1232+ diagnostic_show_locus (&dc, &richloc, DK_ERROR);
1233+ ASSERT_STREQ
1234+ (" before<U+0000><U+0001><U+0002><U+0003><U+000D><80><ff>after\n"
1235+ " ^~~~~~~~ ~~~~~~~~\n",
1236+ pp_formatted_text (dc.printer));
1237+ }
1238+ {
1239+ test_diagnostic_context dc;
1240+ dc.escape_format = DIAGNOSTICS_ESCAPE_FORMAT_BYTES;
1241+ diagnostic_show_locus (&dc, &richloc, DK_ERROR);
1242+ ASSERT_STREQ (" before<00><01><02><03><0d><80><ff>after\n"
1243+ " ^~~~ ~~~~\n",
1244+ pp_formatted_text (dc.printer));
1245+ }
1246+}
1247+
1248+/* As above, but verify that we handle the initial byte of a line
1249+ correctly. */
1250+
1251+static void
1252+test_escaping_bytes_2 (const line_table_case &case_)
1253+{
1254+ const char content[] = "\0after\n";
1255+ const size_t sz = sizeof (content);
1256+ temp_source_file tmp (SELFTEST_LOCATION, ".c", content, sz);
1257+ line_table_test ltt (case_);
1258+ const line_map_ordinary *ord_map = linemap_check_ordinary
1259+ (linemap_add (line_table, LC_ENTER, false, tmp.get_filename (), 0));
1260+ linemap_line_start (line_table, 1, 100);
1261+
1262+ location_t finish
1263+ = linemap_position_for_line_and_column (line_table, ord_map, 1,
1264+ strlen (content));
1265+
1266+ if (finish > LINE_MAP_MAX_LOCATION_WITH_COLS)
1267+ return;
1268+
1269+ /* Location of the NUL byte. */
1270+ location_t nul_loc
1271+ = linemap_position_for_line_and_column (line_table, ord_map, 1, 1);
1272+ gcc_rich_location richloc (nul_loc);
1273+
1274+ {
1275+ test_diagnostic_context dc;
1276+ diagnostic_show_locus (&dc, &richloc, DK_ERROR);
1277+ ASSERT_STREQ (" after\n"
1278+ " ^\n",
1279+ pp_formatted_text (dc.printer));
1280+ }
1281+ richloc.set_escape_on_output (true);
1282+ {
1283+ test_diagnostic_context dc;
1284+ dc.escape_format = DIAGNOSTICS_ESCAPE_FORMAT_UNICODE;
1285+ diagnostic_show_locus (&dc, &richloc, DK_ERROR);
1286+ ASSERT_STREQ (" <U+0000>after\n"
1287+ " ^~~~~~~~\n",
1288+ pp_formatted_text (dc.printer));
1289+ }
1290+ {
1291+ test_diagnostic_context dc;
1292+ dc.escape_format = DIAGNOSTICS_ESCAPE_FORMAT_BYTES;
1293+ diagnostic_show_locus (&dc, &richloc, DK_ERROR);
1294+ ASSERT_STREQ (" <00>after\n"
1295+ " ^~~~\n",
1296+ pp_formatted_text (dc.printer));
1297+ }
1298+}
1299+
1300 /* Verify that line numbers are correctly printed for the case of
1301 a multiline range in which the width of the line numbers changes
1302 (e.g. from "9" to "10"). */
1303@@ -5254,6 +5666,8 @@ diagnostic_show_locus_c_tests ()
1304 test_layout_range_for_single_line ();
1305 test_layout_range_for_multiple_lines ();
1306
1307+ test_display_widths ();
1308+
1309 for_each_line_table_case (test_layout_x_offset_display_utf8);
1310 for_each_line_table_case (test_layout_x_offset_display_tab);
1311
1312@@ -5274,6 +5688,8 @@ diagnostic_show_locus_c_tests ()
1313 for_each_line_table_case (test_fixit_replace_containing_newline);
1314 for_each_line_table_case (test_fixit_deletion_affecting_newline);
1315 for_each_line_table_case (test_tab_expansion);
1316+ for_each_line_table_case (test_escaping_bytes_1);
1317+ for_each_line_table_case (test_escaping_bytes_2);
1318
1319 test_line_numbers_multiline_range ();
1320 }
1321diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
1322--- a/gcc/doc/invoke.texi 2021-12-13 23:23:05.764437151 -0800
1323+++ b/gcc/doc/invoke.texi 2021-12-14 01:16:01.553943061 -0800
1324@@ -312,7 +312,8 @@ Objective-C and Objective-C++ Dialects}.
1325 -fdiagnostics-show-path-depths @gol
1326 -fno-show-column @gol
1327 -fdiagnostics-column-unit=@r{[}display@r{|}byte@r{]} @gol
1328--fdiagnostics-column-origin=@var{origin}}
1329+-fdiagnostics-column-origin=@var{origin} @gol
1330+-fdiagnostics-escape-format=@r{[}unicode@r{|}bytes@r{]}}
1331
1332 @item Warning Options
1333 @xref{Warning Options,,Options to Request or Suppress Warnings}.
1334@@ -5083,6 +5084,38 @@ first column. The default value of 1 co
1335 behavior and to the GNU style guide. Some utilities may perform better with an
1336 origin of 0; any non-negative value may be specified.
1337
1338+@item -fdiagnostics-escape-format=@var{FORMAT}
1339+@opindex fdiagnostics-escape-format
1340+When GCC prints pertinent source lines for a diagnostic, it normally attempts
1341+to print the source bytes directly. However, some diagnostics relate to encoding
1342+issues in the source file, such as malformed UTF-8, or issues with Unicode
1343+normalization. These diagnostics are flagged so that GCC will escape bytes
1344+that are not printable ASCII when printing their pertinent source lines.
1345+
1346+This option controls how such bytes should be escaped.
1347+
1348+The default @var{FORMAT}, @samp{unicode}, displays Unicode characters that
1349+are not printable ASCII in the form @samp{<U+XXXX>}; bytes that do not
1350+correspond to a Unicode character validly encoded in UTF-8 are
1351+displayed as hexadecimal in the form @samp{<XX>}.
1352+
1353+For example, a source line containing the string @samp{before} followed by the
1354+Unicode character U+03C0 (``GREEK SMALL LETTER PI'', with UTF-8 encoding
1355+0xCF 0x80) followed by the byte 0xBF (a stray UTF-8 trailing byte), followed by
1356+the string @samp{after} will be printed for such a diagnostic as:
1357+
1358+@smallexample
1359+ before<U+03C0><BF>after
1360+@end smallexample
1361+
1362+Setting @var{FORMAT} to @samp{bytes} will display all non-printable-ASCII bytes
1363+in the form @samp{<XX>}, thus showing the underlying encoding of non-ASCII
1364+Unicode characters. For the example above, the following will be printed:
1365+
1366+@smallexample
1367+ before<CF><80><BF>after
1368+@end smallexample
1369+
1370 @item -fdiagnostics-format=@var{FORMAT}
1371 @opindex fdiagnostics-format
1372 Select a different format for printing diagnostics.
1373@@ -5150,9 +5183,11 @@ might be printed in JSON form (after for
1374 @}
1375 @}
1376 ],
1377+ "escape-source": false,
1378 "message": "...this statement, but the latter is @dots{}"
1379 @}
1380 ]
1381+ "escape-source": false,
1382 "column-origin": 1,
1383 @},
1384 @dots{}
1385@@ -5239,6 +5274,7 @@ of the expression, which have labels. I
1386 "label": "T @{aka struct t@}"
1387 @}
1388 ],
1389+ "escape-source": false,
1390 "message": "invalid operands to binary + @dots{}"
1391 @}
1392 @end smallexample
1393@@ -5292,6 +5328,7 @@ might be printed in JSON form as:
1394 @}
1395 @}
1396 ],
1397+ "escape-source": false,
1398 "message": "\u2018struct s\u2019 has no member named @dots{}"
1399 @}
1400 @end smallexample
1401@@ -5349,6 +5386,10 @@ For example, the intraprocedural example
1402 ]
1403 @end smallexample
1404
1405+Diagnostics have a boolean attribute @code{escape-source}, hinting whether
1406+non-ASCII bytes should be escaped when printing the pertinent lines of
1407+source code (@code{true} for diagnostics involving source encoding issues).
1408+
1409 @end table
1410
1411 @node Warning Options
1412diff --git a/gcc/input.c b/gcc/input.c
1413--- a/gcc/input.c 2021-07-27 23:55:07.328287915 -0700
1414+++ b/gcc/input.c 2021-12-14 01:16:01.553943061 -0800
1415@@ -913,7 +913,8 @@ make_location (location_t caret, source_
1416 source line in order to calculate the display width. If that cannot be done
1417 for any reason, then returns the byte column as a fallback. */
1418 int
1419-location_compute_display_column (expanded_location exploc, int tabstop)
1420+location_compute_display_column (expanded_location exploc,
1421+ const cpp_char_column_policy &policy)
1422 {
1423 if (!(exploc.file && *exploc.file && exploc.line && exploc.column))
1424 return exploc.column;
1425@@ -921,7 +922,7 @@ location_compute_display_column (expande
1426 /* If line is NULL, this function returns exploc.column which is the
1427 desired fallback. */
1428 return cpp_byte_column_to_display_column (line.get_buffer (), line.length (),
1429- exploc.column, tabstop);
1430+ exploc.column, policy);
1431 }
1432
1433 /* Dump statistics to stderr about the memory usage of the line_table
1434@@ -3611,43 +3612,50 @@ test_line_offset_overflow ()
1435 void test_cpp_utf8 ()
1436 {
1437 const int def_tabstop = 8;
1438+ cpp_char_column_policy policy (def_tabstop, cpp_wcwidth);
1439+
1440 /* Verify that wcwidth of invalid UTF-8 or control bytes is 1. */
1441 {
1442- int w_bad = cpp_display_width ("\xf0!\x9f!\x98!\x82!", 8, def_tabstop);
1443+ int w_bad = cpp_display_width ("\xf0!\x9f!\x98!\x82!", 8, policy);
1444 ASSERT_EQ (8, w_bad);
1445- int w_ctrl = cpp_display_width ("\r\n\v\0\1", 5, def_tabstop);
1446+ int w_ctrl = cpp_display_width ("\r\n\v\0\1", 5, policy);
1447 ASSERT_EQ (5, w_ctrl);
1448 }
1449
1450 /* Verify that wcwidth of valid UTF-8 is as expected. */
1451 {
1452- const int w_pi = cpp_display_width ("\xcf\x80", 2, def_tabstop);
1453+ const int w_pi = cpp_display_width ("\xcf\x80", 2, policy);
1454 ASSERT_EQ (1, w_pi);
1455- const int w_emoji = cpp_display_width ("\xf0\x9f\x98\x82", 4, def_tabstop);
1456+ const int w_emoji = cpp_display_width ("\xf0\x9f\x98\x82", 4, policy);
1457 ASSERT_EQ (2, w_emoji);
1458 const int w_umlaut_precomposed = cpp_display_width ("\xc3\xbf", 2,
1459- def_tabstop);
1460+ policy);
1461 ASSERT_EQ (1, w_umlaut_precomposed);
1462 const int w_umlaut_combining = cpp_display_width ("y\xcc\x88", 3,
1463- def_tabstop);
1464+ policy);
1465 ASSERT_EQ (1, w_umlaut_combining);
1466- const int w_han = cpp_display_width ("\xe4\xb8\xba", 3, def_tabstop);
1467+ const int w_han = cpp_display_width ("\xe4\xb8\xba", 3, policy);
1468 ASSERT_EQ (2, w_han);
1469- const int w_ascii = cpp_display_width ("GCC", 3, def_tabstop);
1470+ const int w_ascii = cpp_display_width ("GCC", 3, policy);
1471 ASSERT_EQ (3, w_ascii);
1472 const int w_mixed = cpp_display_width ("\xcf\x80 = 3.14 \xf0\x9f\x98\x82"
1473 "\x9f! \xe4\xb8\xba y\xcc\x88",
1474- 24, def_tabstop);
1475+ 24, policy);
1476 ASSERT_EQ (18, w_mixed);
1477 }
1478
1479 /* Verify that display width properly expands tabs. */
1480 {
1481 const char *tstr = "\tabc\td";
1482- ASSERT_EQ (6, cpp_display_width (tstr, 6, 1));
1483- ASSERT_EQ (10, cpp_display_width (tstr, 6, 3));
1484- ASSERT_EQ (17, cpp_display_width (tstr, 6, 8));
1485- ASSERT_EQ (1, cpp_display_column_to_byte_column (tstr, 6, 7, 8));
1486+ ASSERT_EQ (6, cpp_display_width (tstr, 6,
1487+ cpp_char_column_policy (1, cpp_wcwidth)));
1488+ ASSERT_EQ (10, cpp_display_width (tstr, 6,
1489+ cpp_char_column_policy (3, cpp_wcwidth)));
1490+ ASSERT_EQ (17, cpp_display_width (tstr, 6,
1491+ cpp_char_column_policy (8, cpp_wcwidth)));
1492+ ASSERT_EQ (1,
1493+ cpp_display_column_to_byte_column
1494+ (tstr, 6, 7, cpp_char_column_policy (8, cpp_wcwidth)));
1495 }
1496
1497 /* Verify that cpp_byte_column_to_display_column can go past the end,
1498@@ -3660,13 +3668,13 @@ void test_cpp_utf8 ()
1499 /* 111122223456
1500 Byte columns. */
1501
1502- ASSERT_EQ (5, cpp_display_width (str, 6, def_tabstop));
1503+ ASSERT_EQ (5, cpp_display_width (str, 6, policy));
1504 ASSERT_EQ (105,
1505- cpp_byte_column_to_display_column (str, 6, 106, def_tabstop));
1506+ cpp_byte_column_to_display_column (str, 6, 106, policy));
1507 ASSERT_EQ (10000,
1508- cpp_byte_column_to_display_column (NULL, 0, 10000, def_tabstop));
1509+ cpp_byte_column_to_display_column (NULL, 0, 10000, policy));
1510 ASSERT_EQ (0,
1511- cpp_byte_column_to_display_column (NULL, 10000, 0, def_tabstop));
1512+ cpp_byte_column_to_display_column (NULL, 10000, 0, policy));
1513 }
1514
1515 /* Verify that cpp_display_column_to_byte_column can go past the end,
1516@@ -3680,25 +3688,25 @@ void test_cpp_utf8 ()
1517 /* 000000000000000000000000000000000111111
1518 111122223333444456666777788889999012345
1519 Byte columns. */
1520- ASSERT_EQ (4, cpp_display_column_to_byte_column (str, 15, 2, def_tabstop));
1521+ ASSERT_EQ (4, cpp_display_column_to_byte_column (str, 15, 2, policy));
1522 ASSERT_EQ (15,
1523- cpp_display_column_to_byte_column (str, 15, 11, def_tabstop));
1524+ cpp_display_column_to_byte_column (str, 15, 11, policy));
1525 ASSERT_EQ (115,
1526- cpp_display_column_to_byte_column (str, 15, 111, def_tabstop));
1527+ cpp_display_column_to_byte_column (str, 15, 111, policy));
1528 ASSERT_EQ (10000,
1529- cpp_display_column_to_byte_column (NULL, 0, 10000, def_tabstop));
1530+ cpp_display_column_to_byte_column (NULL, 0, 10000, policy));
1531 ASSERT_EQ (0,
1532- cpp_display_column_to_byte_column (NULL, 10000, 0, def_tabstop));
1533+ cpp_display_column_to_byte_column (NULL, 10000, 0, policy));
1534
1535 /* Verify that we do not interrupt a UTF-8 sequence. */
1536- ASSERT_EQ (4, cpp_display_column_to_byte_column (str, 15, 1, def_tabstop));
1537+ ASSERT_EQ (4, cpp_display_column_to_byte_column (str, 15, 1, policy));
1538
1539 for (int byte_col = 1; byte_col <= 15; ++byte_col)
1540 {
1541 const int disp_col
1542- = cpp_byte_column_to_display_column (str, 15, byte_col, def_tabstop);
1543+ = cpp_byte_column_to_display_column (str, 15, byte_col, policy);
1544 const int byte_col2
1545- = cpp_display_column_to_byte_column (str, 15, disp_col, def_tabstop);
1546+ = cpp_display_column_to_byte_column (str, 15, disp_col, policy);
1547
1548 /* If we ask for the display column in the middle of a UTF-8
1549 sequence, it will return the length of the partial sequence,
1550diff --git a/gcc/input.h b/gcc/input.h
1551--- a/gcc/input.h 2021-07-27 23:55:07.328287915 -0700
1552+++ b/gcc/input.h 2021-12-14 01:16:01.553943061 -0800
1553@@ -39,8 +39,11 @@ STATIC_ASSERT (BUILTINS_LOCATION < RESER
1554 extern bool is_location_from_builtin_token (location_t);
1555 extern expanded_location expand_location (location_t);
1556
1557-extern int location_compute_display_column (expanded_location exploc,
1558- int tabstop);
1559+class cpp_char_column_policy;
1560+
1561+extern int
1562+location_compute_display_column (expanded_location exploc,
1563+ const cpp_char_column_policy &policy);
1564
1565 /* A class capturing the bounds of a buffer, to allow for run-time
1566 bounds-checking in a checked build. */
1567diff --git a/gcc/opts.c b/gcc/opts.c
1568--- a/gcc/opts.c 2021-07-27 23:55:07.364288417 -0700
1569+++ b/gcc/opts.c 2021-12-14 01:16:01.553943061 -0800
1570@@ -2573,6 +2573,10 @@ common_handle_option (struct gcc_options
1571 dc->column_origin = value;
1572 break;
1573
1574+ case OPT_fdiagnostics_escape_format_:
1575+ dc->escape_format = (enum diagnostics_escape_format)value;
1576+ break;
1577+
1578 case OPT_fdiagnostics_show_cwe:
1579 dc->show_cwe = value;
1580 break;
1581diff --git a/gcc/selftest.c b/gcc/selftest.c
1582--- a/gcc/selftest.c 2021-07-27 23:55:07.500290315 -0700
1583+++ b/gcc/selftest.c 2021-12-14 01:16:01.557942991 -0800
1584@@ -193,6 +193,21 @@ temp_source_file::temp_source_file (cons
1585 fclose (out);
1586 }
1587
1588+/* As above, but with a size, to allow for NUL bytes in CONTENT. */
1589+
1590+temp_source_file::temp_source_file (const location &loc,
1591+ const char *suffix,
1592+ const char *content,
1593+ size_t sz)
1594+: named_temp_file (suffix)
1595+{
1596+ FILE *out = fopen (get_filename (), "w");
1597+ if (!out)
1598+ fail_formatted (loc, "unable to open tempfile: %s", get_filename ());
1599+ fwrite (content, sz, 1, out);
1600+ fclose (out);
1601+}
1602+
1603 /* Avoid introducing locale-specific differences in the results
1604 by hardcoding open_quote and close_quote. */
1605
1606diff --git a/gcc/selftest.h b/gcc/selftest.h
1607--- a/gcc/selftest.h 2021-07-27 23:55:07.500290315 -0700
1608+++ b/gcc/selftest.h 2021-12-14 01:16:01.557942991 -0800
1609@@ -112,6 +112,8 @@ class temp_source_file : public named_te
1610 public:
1611 temp_source_file (const location &loc, const char *suffix,
1612 const char *content);
1613+ temp_source_file (const location &loc, const char *suffix,
1614+ const char *content, size_t sz);
1615 };
1616
1617 /* RAII-style class for avoiding introducing locale-specific differences
1618diff --git a/gcc/testsuite/c-c++-common/diagnostic-format-json-1.c b/gcc/testsuite/c-c++-common/diagnostic-format-json-1.c
1619--- a/gcc/testsuite/c-c++-common/diagnostic-format-json-1.c 2021-07-27 23:55:07.596291654 -0700
1620+++ b/gcc/testsuite/c-c++-common/diagnostic-format-json-1.c 2021-12-14 01:16:01.557942991 -0800
1621@@ -9,6 +9,7 @@
1622
1623 /* { dg-regexp "\"kind\": \"error\"" } */
1624 /* { dg-regexp "\"column-origin\": 1" } */
1625+/* { dg-regexp "\"escape-source\": false" } */
1626 /* { dg-regexp "\"message\": \"#error message\"" } */
1627
1628 /* { dg-regexp "\"caret\": \{" } */
1629diff --git a/gcc/testsuite/c-c++-common/diagnostic-format-json-2.c b/gcc/testsuite/c-c++-common/diagnostic-format-json-2.c
1630--- a/gcc/testsuite/c-c++-common/diagnostic-format-json-2.c 2021-07-27 23:55:07.596291654 -0700
1631+++ b/gcc/testsuite/c-c++-common/diagnostic-format-json-2.c 2021-12-14 01:16:01.557942991 -0800
1632@@ -9,6 +9,7 @@
1633
1634 /* { dg-regexp "\"kind\": \"warning\"" } */
1635 /* { dg-regexp "\"column-origin\": 1" } */
1636+/* { dg-regexp "\"escape-source\": false" } */
1637 /* { dg-regexp "\"message\": \"#warning message\"" } */
1638 /* { dg-regexp "\"option\": \"-Wcpp\"" } */
1639 /* { dg-regexp "\"option_url\": \"https:\[^\n\r\"\]*#index-Wcpp\"" } */
1640diff --git a/gcc/testsuite/c-c++-common/diagnostic-format-json-3.c b/gcc/testsuite/c-c++-common/diagnostic-format-json-3.c
1641--- a/gcc/testsuite/c-c++-common/diagnostic-format-json-3.c 2021-07-27 23:55:07.596291654 -0700
1642+++ b/gcc/testsuite/c-c++-common/diagnostic-format-json-3.c 2021-12-14 01:16:01.557942991 -0800
1643@@ -9,6 +9,7 @@
1644
1645 /* { dg-regexp "\"kind\": \"error\"" } */
1646 /* { dg-regexp "\"column-origin\": 1" } */
1647+/* { dg-regexp "\"escape-source\": false" } */
1648 /* { dg-regexp "\"message\": \"#warning message\"" } */
1649 /* { dg-regexp "\"option\": \"-Werror=cpp\"" } */
1650 /* { dg-regexp "\"option_url\": \"https:\[^\n\r\"\]*#index-Wcpp\"" } */
1651diff --git a/gcc/testsuite/c-c++-common/diagnostic-format-json-4.c b/gcc/testsuite/c-c++-common/diagnostic-format-json-4.c
1652--- a/gcc/testsuite/c-c++-common/diagnostic-format-json-4.c 2021-07-27 23:55:07.596291654 -0700
1653+++ b/gcc/testsuite/c-c++-common/diagnostic-format-json-4.c 2021-12-14 01:16:01.557942991 -0800
1654@@ -19,6 +19,7 @@ int test (void)
1655
1656 /* { dg-regexp "\"kind\": \"note\"" } */
1657 /* { dg-regexp "\"message\": \"...this statement, but the latter is misleadingly indented as if it were guarded by the 'if'\"" } */
1658+/* { dg-regexp "\"escape-source\": false" } */
1659
1660 /* { dg-regexp "\"caret\": \{" } */
1661 /* { dg-regexp "\"file\": \"\[^\n\r\"\]*diagnostic-format-json-4.c\"" } */
1662@@ -39,6 +40,7 @@ int test (void)
1663 /* { dg-regexp "\"kind\": \"warning\"" } */
1664 /* { dg-regexp "\"column-origin\": 1" } */
1665 /* { dg-regexp "\"message\": \"this 'if' clause does not guard...\"" } */
1666+/* { dg-regexp "\"escape-source\": false" } */
1667 /* { dg-regexp "\"option\": \"-Wmisleading-indentation\"" } */
1668 /* { dg-regexp "\"option_url\": \"https:\[^\n\r\"\]*#index-Wmisleading-indentation\"" } */
1669
1670diff --git a/gcc/testsuite/c-c++-common/diagnostic-format-json-5.c b/gcc/testsuite/c-c++-common/diagnostic-format-json-5.c
1671--- a/gcc/testsuite/c-c++-common/diagnostic-format-json-5.c 2021-07-27 23:55:07.596291654 -0700
1672+++ b/gcc/testsuite/c-c++-common/diagnostic-format-json-5.c 2021-12-14 01:16:01.557942991 -0800
1673@@ -14,6 +14,7 @@ int test (struct s *ptr)
1674
1675 /* { dg-regexp "\"kind\": \"error\"" } */
1676 /* { dg-regexp "\"column-origin\": 1" } */
1677+/* { dg-regexp "\"escape-source\": false" } */
1678 /* { dg-regexp "\"message\": \".*\"" } */
1679
1680 /* Verify fix-it hints. */
1681diff --git a/gcc/testsuite/gcc.dg/cpp/warn-normalized-4-bytes.c b/gcc/testsuite/gcc.dg/cpp/warn-normalized-4-bytes.c
1682--- a/gcc/testsuite/gcc.dg/cpp/warn-normalized-4-bytes.c 1969-12-31 16:00:00.000000000 -0800
1683+++ b/gcc/testsuite/gcc.dg/cpp/warn-normalized-4-bytes.c 2021-12-14 01:16:01.557942991 -0800
1684@@ -0,0 +1,21 @@
1685+// { dg-do preprocess }
1686+// { dg-options "-std=gnu99 -Werror=normalized=nfc -fdiagnostics-show-caret -fdiagnostics-escape-format=bytes" }
1687+/* { dg-message "some warnings being treated as errors" "" {target "*-*-*"} 0 } */
1688+
1689+/* གྷ = U+0F43 TIBETAN LETTER GHA, which has decomposition "0F42 0FB7" i.e.
1690+ U+0F42 TIBETAN LETTER GA: ག
1691+ U+0FB7 TIBETAN SUBJOINED LETTER HA: ྷ
1692+
1693+ The UTF-8 encoding of U+0F43 TIBETAN LETTER GHA is: E0 BD 83. */
1694+
1695+foo before_\u0F43_after bar // { dg-error "`before_.U00000f43_after' is not in NFC .-Werror=normalized=." }
1696+/* { dg-begin-multiline-output "" }
1697+ foo before_\u0F43_after bar
1698+ ^~~~~~~~~~~~~~~~~~~
1699+ { dg-end-multiline-output "" } */
1700+
1701+foo before_གྷ_after bar // { dg-error "`before_.U00000f43_after' is not in NFC .-Werror=normalized=." }
1702+/* { dg-begin-multiline-output "" }
1703+ foo before_<e0><bd><83>_after bar
1704+ ^~~~~~~~~~~~~~~~~~~~~~~~~
1705+ { dg-end-multiline-output "" } */
1706diff --git a/gcc/testsuite/gcc.dg/cpp/warn-normalized-4-unicode.c b/gcc/testsuite/gcc.dg/cpp/warn-normalized-4-unicode.c
1707--- a/gcc/testsuite/gcc.dg/cpp/warn-normalized-4-unicode.c 1969-12-31 16:00:00.000000000 -0800
1708+++ b/gcc/testsuite/gcc.dg/cpp/warn-normalized-4-unicode.c 2021-12-14 01:16:01.557942991 -0800
1709@@ -0,0 +1,19 @@
1710+// { dg-do preprocess }
1711+// { dg-options "-std=gnu99 -Werror=normalized=nfc -fdiagnostics-show-caret -fdiagnostics-escape-format=unicode" }
1712+/* { dg-message "some warnings being treated as errors" "" {target "*-*-*"} 0 } */
1713+
1714+/* གྷ = U+0F43 TIBETAN LETTER GHA, which has decomposition "0F42 0FB7" i.e.
1715+ U+0F42 TIBETAN LETTER GA: ག
1716+ U+0FB7 TIBETAN SUBJOINED LETTER HA: ྷ */
1717+
1718+foo before_\u0F43_after bar // { dg-error "`before_.U00000f43_after' is not in NFC .-Werror=normalized=." }
1719+/* { dg-begin-multiline-output "" }
1720+ foo before_\u0F43_after bar
1721+ ^~~~~~~~~~~~~~~~~~~
1722+ { dg-end-multiline-output "" } */
1723+
1724+foo before_གྷ_after bar // { dg-error "`before_.U00000f43_after' is not in NFC .-Werror=normalized=." }
1725+/* { dg-begin-multiline-output "" }
1726+ foo before_<U+0F43>_after bar
1727+ ^~~~~~~~~~~~~~~~~~~~~
1728+ { dg-end-multiline-output "" } */
1729diff --git a/gcc/testsuite/gfortran.dg/diagnostic-format-json-1.F90 b/gcc/testsuite/gfortran.dg/diagnostic-format-json-1.F90
1730--- a/gcc/testsuite/gfortran.dg/diagnostic-format-json-1.F90 2021-07-27 23:55:08.472303878 -0700
1731+++ b/gcc/testsuite/gfortran.dg/diagnostic-format-json-1.F90 2021-12-14 01:16:01.557942991 -0800
1732@@ -9,6 +9,7 @@
1733
1734 ! { dg-regexp "\"kind\": \"error\"" }
1735 ! { dg-regexp "\"column-origin\": 1" }
1736+! { dg-regexp "\"escape-source\": false" }
1737 ! { dg-regexp "\"message\": \"#error message\"" }
1738
1739 ! { dg-regexp "\"caret\": \{" }
1740diff --git a/gcc/testsuite/gfortran.dg/diagnostic-format-json-2.F90 b/gcc/testsuite/gfortran.dg/diagnostic-format-json-2.F90
1741--- a/gcc/testsuite/gfortran.dg/diagnostic-format-json-2.F90 2021-07-27 23:55:08.472303878 -0700
1742+++ b/gcc/testsuite/gfortran.dg/diagnostic-format-json-2.F90 2021-12-14 01:16:01.557942991 -0800
1743@@ -9,6 +9,7 @@
1744
1745 ! { dg-regexp "\"kind\": \"warning\"" }
1746 ! { dg-regexp "\"column-origin\": 1" }
1747+! { dg-regexp "\"escape-source\": false" }
1748 ! { dg-regexp "\"message\": \"#warning message\"" }
1749 ! { dg-regexp "\"option\": \"-Wcpp\"" }
1750 ! { dg-regexp "\"option_url\": \"\[^\n\r\"\]*#index-Wcpp\"" }
1751diff --git a/gcc/testsuite/gfortran.dg/diagnostic-format-json-3.F90 b/gcc/testsuite/gfortran.dg/diagnostic-format-json-3.F90
1752--- a/gcc/testsuite/gfortran.dg/diagnostic-format-json-3.F90 2021-07-27 23:55:08.472303878 -0700
1753+++ b/gcc/testsuite/gfortran.dg/diagnostic-format-json-3.F90 2021-12-14 01:16:01.557942991 -0800
1754@@ -9,6 +9,7 @@
1755
1756 ! { dg-regexp "\"kind\": \"error\"" }
1757 ! { dg-regexp "\"column-origin\": 1" }
1758+! { dg-regexp "\"escape-source\": false" }
1759 ! { dg-regexp "\"message\": \"#warning message\"" }
1760 ! { dg-regexp "\"option\": \"-Werror=cpp\"" }
1761 ! { dg-regexp "\"option_url\": \"\[^\n\r\"\]*#index-Wcpp\"" }
1762diff --git a/libcpp/charset.c b/libcpp/charset.c
1763--- a/libcpp/charset.c 2021-07-27 23:55:08.712307227 -0700
1764+++ b/libcpp/charset.c 2021-12-14 01:16:01.557942991 -0800
1765@@ -1552,12 +1552,14 @@ convert_escape (cpp_reader *pfile, const
1766 "unknown escape sequence: '\\%c'", (int) c);
1767 else
1768 {
1769+ encoding_rich_location rich_loc (pfile);
1770+
1771 /* diagnostic.c does not support "%03o". When it does, this
1772 code can use %03o directly in the diagnostic again. */
1773 char buf[32];
1774 sprintf(buf, "%03o", (int) c);
1775- cpp_error (pfile, CPP_DL_PEDWARN,
1776- "unknown escape sequence: '\\%s'", buf);
1777+ cpp_error_at (pfile, CPP_DL_PEDWARN, &rich_loc,
1778+ "unknown escape sequence: '\\%s'", buf);
1779 }
1780 }
1781
1782@@ -2280,14 +2282,16 @@ cpp_string_location_reader::get_next ()
1783 }
1784
1785 cpp_display_width_computation::
1786-cpp_display_width_computation (const char *data, int data_length, int tabstop) :
1787+cpp_display_width_computation (const char *data, int data_length,
1788+ const cpp_char_column_policy &policy) :
1789 m_begin (data),
1790 m_next (m_begin),
1791 m_bytes_left (data_length),
1792- m_tabstop (tabstop),
1793+ m_policy (policy),
1794 m_display_cols (0)
1795 {
1796- gcc_assert (m_tabstop > 0);
1797+ gcc_assert (policy.m_tabstop > 0);
1798+ gcc_assert (policy.m_width_cb);
1799 }
1800
1801
1802@@ -2299,19 +2303,28 @@ cpp_display_width_computation (const cha
1803 point to a valid UTF-8-encoded sequence, then it will be treated as a single
1804 byte with display width 1. m_cur_display_col is the current display column,
1805 relative to which tab stops should be expanded. Returns the display width of
1806- the codepoint just processed. */
1807+ the codepoint just processed.
1808+ If OUT is non-NULL, it is populated. */
1809
1810 int
1811-cpp_display_width_computation::process_next_codepoint ()
1812+cpp_display_width_computation::process_next_codepoint (cpp_decoded_char *out)
1813 {
1814 cppchar_t c;
1815 int next_width;
1816
1817+ if (out)
1818+ out->m_start_byte = m_next;
1819+
1820 if (*m_next == '\t')
1821 {
1822 ++m_next;
1823 --m_bytes_left;
1824- next_width = m_tabstop - (m_display_cols % m_tabstop);
1825+ next_width = m_policy.m_tabstop - (m_display_cols % m_policy.m_tabstop);
1826+ if (out)
1827+ {
1828+ out->m_ch = '\t';
1829+ out->m_valid_ch = true;
1830+ }
1831 }
1832 else if (one_utf8_to_cppchar ((const uchar **) &m_next, &m_bytes_left, &c)
1833 != 0)
1834@@ -2321,14 +2334,24 @@ cpp_display_width_computation::process_n
1835 of one. */
1836 ++m_next;
1837 --m_bytes_left;
1838- next_width = 1;
1839+ next_width = m_policy.m_undecoded_byte_width;
1840+ if (out)
1841+ out->m_valid_ch = false;
1842 }
1843 else
1844 {
1845 /* one_utf8_to_cppchar() has updated m_next and m_bytes_left for us. */
1846- next_width = cpp_wcwidth (c);
1847+ next_width = m_policy.m_width_cb (c);
1848+ if (out)
1849+ {
1850+ out->m_ch = c;
1851+ out->m_valid_ch = true;
1852+ }
1853 }
1854
1855+ if (out)
1856+ out->m_next_byte = m_next;
1857+
1858 m_display_cols += next_width;
1859 return next_width;
1860 }
1861@@ -2344,7 +2367,7 @@ cpp_display_width_computation::advance_d
1862 const int start = m_display_cols;
1863 const int target = start + n;
1864 while (m_display_cols < target && !done ())
1865- process_next_codepoint ();
1866+ process_next_codepoint (NULL);
1867 return m_display_cols - start;
1868 }
1869
1870@@ -2352,29 +2375,33 @@ cpp_display_width_computation::advance_d
1871 how many display columns are occupied by the first COLUMN bytes. COLUMN
1872 may exceed DATA_LENGTH, in which case the phantom bytes at the end are
1873 treated as if they have display width 1. Tabs are expanded to the next tab
1874- stop, relative to the start of DATA. */
1875+ stop, relative to the start of DATA, and non-printable-ASCII characters
1876+ will be escaped as per POLICY. */
1877
1878 int
1879 cpp_byte_column_to_display_column (const char *data, int data_length,
1880- int column, int tabstop)
1881+ int column,
1882+ const cpp_char_column_policy &policy)
1883 {
1884 const int offset = MAX (0, column - data_length);
1885- cpp_display_width_computation dw (data, column - offset, tabstop);
1886+ cpp_display_width_computation dw (data, column - offset, policy);
1887 while (!dw.done ())
1888- dw.process_next_codepoint ();
1889+ dw.process_next_codepoint (NULL);
1890 return dw.display_cols_processed () + offset;
1891 }
1892
1893 /* For the string of length DATA_LENGTH bytes that begins at DATA, compute
1894 the least number of bytes that will result in at least DISPLAY_COL display
1895 columns. The return value may exceed DATA_LENGTH if the entire string does
1896- not occupy enough display columns. */
1897+ not occupy enough display columns. Non-printable-ASCII characters
1898+ will be escaped as per POLICY. */
1899
1900 int
1901 cpp_display_column_to_byte_column (const char *data, int data_length,
1902- int display_col, int tabstop)
1903+ int display_col,
1904+ const cpp_char_column_policy &policy)
1905 {
1906- cpp_display_width_computation dw (data, data_length, tabstop);
1907+ cpp_display_width_computation dw (data, data_length, policy);
1908 const int avail_display = dw.advance_display_cols (display_col);
1909 return dw.bytes_processed () + MAX (0, display_col - avail_display);
1910 }
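
Note on the charset.c changes above: the column arithmetic in process_next_codepoint can be illustrated with a stand-alone sketch. It does not use libcpp. The names policy_sketch, decode_utf8, display_width_sketch and one_column are hypothetical, invented for illustration, and the decoder is deliberately minimal (it does not reject overlong forms or surrogates, unlike a full UTF-8 validator). The sketch assumes the same three cases as the patch: a tab advances to the next tab stop, an undecodable byte takes the policy's stray-byte width, and anything else takes the width returned by the callback.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for cpp_char_column_policy.
struct policy_sketch {
    int tabstop;
    int undecoded_byte_width;
    int (*width_cb)(uint32_t);
};

// Minimal UTF-8 decoder (illustrative only). Returns the sequence length
// in bytes (1..4) and stores the code point in *out, or returns 0 if the
// bytes at P do not start a well-formed sequence.
static int decode_utf8(const unsigned char *p, size_t left, uint32_t *out) {
    if (left == 0) return 0;
    unsigned char b = p[0];
    int len;
    uint32_t cp;
    if (b < 0x80) { len = 1; cp = b; }
    else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
    else return 0;  // stray continuation byte or invalid lead byte
    if (static_cast<size_t>(len) > left) return 0;
    for (int i = 1; i < len; ++i) {
        if ((p[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    *out = cp;
    return len;
}

// Total display columns occupied by LEN bytes at DATA under POLICY.
int display_width_sketch(const char *data, size_t len,
                         const policy_sketch &policy) {
    assert(policy.tabstop > 0);
    assert(policy.width_cb);
    const unsigned char *p = reinterpret_cast<const unsigned char *>(data);
    int cols = 0;
    while (len > 0) {
        uint32_t cp;
        int n;
        if (*p == '\t') {
            // Advance to the next tab stop relative to the start of DATA.
            cols += policy.tabstop - (cols % policy.tabstop);
            n = 1;
        } else if ((n = decode_utf8(p, len, &cp)) == 0) {
            // Undecodable byte: consume one byte at the policy's width.
            cols += policy.undecoded_byte_width;
            n = 1;
        } else {
            cols += policy.width_cb(cp);
        }
        p += n;
        len -= static_cast<size_t>(n);
    }
    return cols;
}

// Simplified width callback: one column per code point.
int one_column(uint32_t) { return 1; }
```

With a tab stop of 8, the three bytes "a", tab, "b" occupy 9 columns under this sketch: the tab starts at column 1 and advances 7 columns to reach the stop at 8. Raising undecoded_byte_width from 1 to 4 models a stray byte that is printed as an escape such as <BF> instead of verbatim.
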
1911diff --git a/libcpp/errors.c b/libcpp/errors.c
1912--- a/libcpp/errors.c 2021-07-27 23:55:08.712307227 -0700
1913+++ b/libcpp/errors.c 2021-12-14 01:16:01.557942991 -0800
1914@@ -27,6 +27,31 @@ along with this program; see the file CO
1915 #include "cpplib.h"
1916 #include "internal.h"
1917
1918+/* Get a location_t for the current location in PFILE,
1919+ generally that of the previously lexed token. */
1920+
1921+location_t
1922+cpp_diagnostic_get_current_location (cpp_reader *pfile)
1923+{
1924+ if (CPP_OPTION (pfile, traditional))
1925+ {
1926+ if (pfile->state.in_directive)
1927+ return pfile->directive_line;
1928+ else
1929+ return pfile->line_table->highest_line;
1930+ }
1931+ /* We don't want to refer to a token before the beginning of the
1932+ current run -- that is invalid. */
1933+ else if (pfile->cur_token == pfile->cur_run->base)
1934+ {
1935+ return 0;
1936+ }
1937+ else
1938+ {
1939+ return pfile->cur_token[-1].src_loc;
1940+ }
1941+}
1942+
1943 /* Print a diagnostic at the given location. */
1944
1945 ATTRIBUTE_FPTR_PRINTF(5,0)
1946@@ -52,25 +77,7 @@ cpp_diagnostic (cpp_reader * pfile, enum
1947 enum cpp_warning_reason reason,
1948 const char *msgid, va_list *ap)
1949 {
1950- location_t src_loc;
1951-
1952- if (CPP_OPTION (pfile, traditional))
1953- {
1954- if (pfile->state.in_directive)
1955- src_loc = pfile->directive_line;
1956- else
1957- src_loc = pfile->line_table->highest_line;
1958- }
1959- /* We don't want to refer to a token before the beginning of the
1960- current run -- that is invalid. */
1961- else if (pfile->cur_token == pfile->cur_run->base)
1962- {
1963- src_loc = 0;
1964- }
1965- else
1966- {
1967- src_loc = pfile->cur_token[-1].src_loc;
1968- }
1969+ location_t src_loc = cpp_diagnostic_get_current_location (pfile);
1970 rich_location richloc (pfile->line_table, src_loc);
1971 return cpp_diagnostic_at (pfile, level, reason, &richloc, msgid, ap);
1972 }
1973@@ -142,6 +149,43 @@ cpp_warning_syshdr (cpp_reader * pfile,
1974
1975 va_end (ap);
1976 return ret;
1977+}
1978+
1979+/* As cpp_warning above, but use RICHLOC as the location of the diagnostic. */
1980+
1981+bool cpp_warning_at (cpp_reader *pfile, enum cpp_warning_reason reason,
1982+ rich_location *richloc, const char *msgid, ...)
1983+{
1984+ va_list ap;
1985+ bool ret;
1986+
1987+ va_start (ap, msgid);
1988+
1989+ ret = cpp_diagnostic_at (pfile, CPP_DL_WARNING, reason, richloc,
1990+ msgid, &ap);
1991+
1992+ va_end (ap);
1993+ return ret;
1994+
1995+}
1996+
1997+/* As cpp_pedwarning above, but use RICHLOC as the location of the
1998+ diagnostic. */
1999+
2000+bool
2001+cpp_pedwarning_at (cpp_reader * pfile, enum cpp_warning_reason reason,
2002+ rich_location *richloc, const char *msgid, ...)
2003+{
2004+ va_list ap;
2005+ bool ret;
2006+
2007+ va_start (ap, msgid);
2008+
2009+ ret = cpp_diagnostic_at (pfile, CPP_DL_PEDWARN, reason, richloc,
2010+ msgid, &ap);
2011+
2012+ va_end (ap);
2013+ return ret;
2014 }
2015
2016 /* Print a diagnostic at a specific location. */
2017diff --git a/libcpp/include/cpplib.h b/libcpp/include/cpplib.h
2018--- a/libcpp/include/cpplib.h 2021-12-13 23:23:05.768437079 -0800
2019+++ b/libcpp/include/cpplib.h 2021-12-14 01:20:16.189507386 -0800
2020@@ -1275,6 +1275,14 @@ extern bool cpp_warning_syshdr (cpp_read
2021 const char *msgid, ...)
2022 ATTRIBUTE_PRINTF_3;
2023
2024+/* As their counterparts above, but use RICHLOC. */
2025+extern bool cpp_warning_at (cpp_reader *, enum cpp_warning_reason,
2026+ rich_location *richloc, const char *msgid, ...)
2027+ ATTRIBUTE_PRINTF_4;
2028+extern bool cpp_pedwarning_at (cpp_reader *, enum cpp_warning_reason,
2029+ rich_location *richloc, const char *msgid, ...)
2030+ ATTRIBUTE_PRINTF_4;
2031+
2032 /* Output a diagnostic with "MSGID: " preceding the
2033 error string of errno. No location is printed. */
2034 extern bool cpp_errno (cpp_reader *, enum cpp_diagnostic_level,
2035@@ -1435,42 +1443,95 @@ extern const char * cpp_get_userdef_suff
2036
2037 /* In charset.c */
2038
2039+/* The result of attempting to decode a run of UTF-8 bytes. */
2040+
2041+struct cpp_decoded_char
2042+{
2043+ const char *m_start_byte;
2044+ const char *m_next_byte;
2045+
2046+ bool m_valid_ch;
2047+ cppchar_t m_ch;
2048+};
2049+
2050+/* Information for mapping between code points and display columns.
2051+
2052+ This is a tabstop value, along with a callback for getting the
2053+ widths of characters. Normally this callback is cpp_wcwidth, but we
2054+ support other schemes for escaping non-ASCII Unicode as a series of
2055+ ASCII chars when printing the user's source code in diagnostic-show-locus.c.
2056+
2057+ For example, consider:
2058+ - the Unicode character U+03C0 "GREEK SMALL LETTER PI" (UTF-8: 0xCF 0x80)
2059+ - the Unicode character U+1F642 "SLIGHTLY SMILING FACE"
2060+ (UTF-8: 0xF0 0x9F 0x99 0x82)
2061+ - the byte 0xBF (a stray trailing byte of a UTF-8 character)
2062+ Normally U+03C0 would occupy one display column, U+1F642
2063+ would occupy two display columns, and the stray byte would be
2064+ printed verbatim as one display column.
2065+
2066+ However, when escaping them as Unicode code points as "<U+03C0>"
2067+ and "<U+1F642>" they occupy 8 and 9 display columns respectively,
2068+ and when escaping them as bytes as "<CF><80>" and "<F0><9F><99><82>"
2069+ they occupy 8 and 16 display columns respectively. In both cases
2070+ the stray byte is escaped to <BF> as 4 display columns. */
2071+
2072+struct cpp_char_column_policy
2073+{
2074+ cpp_char_column_policy (int tabstop,
2075+ int (*width_cb) (cppchar_t c))
2076+ : m_tabstop (tabstop),
2077+ m_undecoded_byte_width (1),
2078+ m_width_cb (width_cb)
2079+ {}
2080+
2081+ int m_tabstop;
2082+ /* Width in display columns of a stray byte that isn't decodable
2083+ as UTF-8. */
2084+ int m_undecoded_byte_width;
2085+ int (*m_width_cb) (cppchar_t c);
2086+};
2087+
2088 /* A class to manage the state while converting a UTF-8 sequence to cppchar_t
2089 and computing the display width one character at a time. */
2090 class cpp_display_width_computation {
2091 public:
2092 cpp_display_width_computation (const char *data, int data_length,
2093- int tabstop);
2094+ const cpp_char_column_policy &policy);
2095 const char *next_byte () const { return m_next; }
2096 int bytes_processed () const { return m_next - m_begin; }
2097 int bytes_left () const { return m_bytes_left; }
2098 bool done () const { return !bytes_left (); }
2099 int display_cols_processed () const { return m_display_cols; }
2100
2101- int process_next_codepoint ();
2102+ int process_next_codepoint (cpp_decoded_char *out);
2103 int advance_display_cols (int n);
2104
2105 private:
2106 const char *const m_begin;
2107 const char *m_next;
2108 size_t m_bytes_left;
2109- const int m_tabstop;
2110+ const cpp_char_column_policy &m_policy;
2111 int m_display_cols;
2112 };
2113
2114 /* Convenience functions that are simple use cases for class
2115 cpp_display_width_computation. Tab characters will be expanded to spaces
2116- as determined by TABSTOP. */
2117+ as determined by POLICY.m_tabstop, and non-printable-ASCII characters
2118+ will be escaped as per POLICY. */
2119+
2120 int cpp_byte_column_to_display_column (const char *data, int data_length,
2121- int column, int tabstop);
2122+ int column,
2123+ const cpp_char_column_policy &policy);
2124 inline int cpp_display_width (const char *data, int data_length,
2125- int tabstop)
2126+ const cpp_char_column_policy &policy)
2127 {
2128 return cpp_byte_column_to_display_column (data, data_length, data_length,
2129- tabstop);
2130+ policy);
2131 }
2132 int cpp_display_column_to_byte_column (const char *data, int data_length,
2133- int display_col, int tabstop);
2134+ int display_col,
2135+ const cpp_char_column_policy &policy);
2136 int cpp_wcwidth (cppchar_t c);
2137
2138 #endif /* ! LIBCPP_CPPLIB_H */
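
Note on the cpp_char_column_policy comment above: the column counts it quotes (8 and 9 for the Unicode form, 8 and 16 for the byte form) can be checked with a short stand-alone sketch. This is not the code the patch adds. The helpers escaped_width_unicode, utf8_length and escaped_width_bytes are hypothetical, and the sketch assumes that the Unicode form is "<U+" plus at least four hex digits plus ">", and that the byte form is one four-column "<XX>" group per UTF-8 byte, as the examples in the comment suggest.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: columns taken by the "<U+XXXX>" form of CP,
// assuming at least four hex digits are printed.
int escaped_width_unicode(uint32_t cp) {
    char buf[16];
    // snprintf returns the number of characters the output needs.
    return std::snprintf(buf, sizeof buf, "<U+%04X>",
                         static_cast<unsigned>(cp));
}

// Number of bytes in the UTF-8 encoding of CP.
int utf8_length(uint32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Hypothetical helper: columns taken by the "<XX><XX>..." form of CP,
// assuming each byte is printed as a four-column group.
int escaped_width_bytes(uint32_t cp) {
    return 4 * utf8_length(cp);
}
```

For U+03C0 both forms come to 8 columns, since the two-byte encoding gives two four-column groups. For U+1F642 the Unicode form needs 9 columns and the byte form 16. A stray byte such as 0xBF has no code point, so it is always shown as a single four-column group.
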
2139diff --git a/libcpp/include/line-map.h b/libcpp/include/line-map.h
2140--- a/libcpp/include/line-map.h 2021-07-27 23:55:08.716307283 -0700
2141+++ b/libcpp/include/line-map.h 2021-12-14 01:16:01.557942991 -0800
2142@@ -1781,6 +1781,18 @@ class rich_location
2143 const diagnostic_path *get_path () const { return m_path; }
2144 void set_path (const diagnostic_path *path) { m_path = path; }
2145
2146+ /* A flag for hinting that the diagnostic involves character encoding
2147+ issues, and thus that it will be helpful to the user if we show some
2148+ representation of how the characters in the pertinent source lines
2149+ are encoded.
2150+ The default is false (i.e. do not escape).
2151+ When set to true, non-ASCII bytes in the pertinent source lines will
2152+ be escaped in a manner controlled by the user-supplied option
2153+ -fdiagnostics-escape-format=, so that the user can better understand
2154+ what's going on with the encoding in their source file. */
2155+ bool escape_on_output_p () const { return m_escape_on_output; }
2156+ void set_escape_on_output (bool flag) { m_escape_on_output = flag; }
2157+
2158 private:
2159 bool reject_impossible_fixit (location_t where);
2160 void stop_supporting_fixits ();
2161@@ -1807,6 +1819,7 @@ protected:
2162 bool m_fixits_cannot_be_auto_applied;
2163
2164 const diagnostic_path *m_path;
2165+ bool m_escape_on_output;
2166 };
2167
2168 /* A struct for the result of range_label::get_text: a NUL-terminated buffer
2169diff --git a/libcpp/internal.h b/libcpp/internal.h
2170--- a/libcpp/internal.h 2021-12-13 23:23:05.768437079 -0800
2171+++ b/libcpp/internal.h 2021-12-14 01:16:01.557942991 -0800
2172@@ -776,6 +776,9 @@ extern void _cpp_do_file_change (cpp_rea
2173 extern void _cpp_pop_buffer (cpp_reader *);
2174 extern char *_cpp_bracket_include (cpp_reader *);
2175
2176+/* In errors.c */
2177+extern location_t cpp_diagnostic_get_current_location (cpp_reader *);
2178+
2179 /* In traditional.c. */
2180 extern bool _cpp_scan_out_logical_line (cpp_reader *, cpp_macro *, bool);
2181 extern bool _cpp_read_logical_line_trad (cpp_reader *);
2182@@ -942,6 +945,26 @@ int linemap_get_expansion_line (class li
2183 const char* linemap_get_expansion_filename (class line_maps *,
2184 location_t);
2185
2186+/* A subclass of rich_location for emitting a diagnostic
2187+ at the current location of the reader, but flagging
2188+ it with set_escape_on_output (true). */
2189+class encoding_rich_location : public rich_location
2190+{
2191+ public:
2192+ encoding_rich_location (cpp_reader *pfile)
2193+ : rich_location (pfile->line_table,
2194+ cpp_diagnostic_get_current_location (pfile))
2195+ {
2196+ set_escape_on_output (true);
2197+ }
2198+
2199+ encoding_rich_location (cpp_reader *pfile, location_t loc)
2200+ : rich_location (pfile->line_table, loc)
2201+ {
2202+ set_escape_on_output (true);
2203+ }
2204+};
2205+
2206 #ifdef __cplusplus
2207 }
2208 #endif
2209diff --git a/libcpp/lex.c b/libcpp/lex.c
2210--- a/libcpp/lex.c 2021-12-14 01:14:48.435225968 -0800
2211+++ b/libcpp/lex.c 2021-12-14 01:24:37.220995816 -0800
2212@@ -1774,7 +1774,11 @@ skip_whitespace (cpp_reader *pfile, cppc
2213 while (is_nvspace (c));
2214
2215 if (saw_NUL)
2216- cpp_error (pfile, CPP_DL_WARNING, "null character(s) ignored");
2217+ {
2218+ encoding_rich_location rich_loc (pfile);
2219+ cpp_error_at (pfile, CPP_DL_WARNING, &rich_loc,
2220+ "null character(s) ignored");
2221+ }
2222
2223 buffer->cur--;
2224 }
2225@@ -1803,6 +1807,28 @@ warn_about_normalization (cpp_reader *pf
2226 if (CPP_OPTION (pfile, warn_normalize) < NORMALIZE_STATE_RESULT (s)
2227 && !pfile->state.skipping)
2228 {
2229+ location_t loc = token->src_loc;
2230+
2231+ /* If possible, create a location range for the token. */
2232+ if (loc >= RESERVED_LOCATION_COUNT
2233+ && token->type != CPP_EOF
2234+ /* There must be no line notes to process. */
2235+ && (!(pfile->buffer->cur
2236+ >= pfile->buffer->notes[pfile->buffer->cur_note].pos
2237+ && !pfile->overlaid_buffer)))
2238+ {
2239+ source_range tok_range;
2240+ tok_range.m_start = loc;
2241+ tok_range.m_finish
2242+ = linemap_position_for_column (pfile->line_table,
2243+ CPP_BUF_COLUMN (pfile->buffer,
2244+ pfile->buffer->cur));
2245+ loc = COMBINE_LOCATION_DATA (pfile->line_table,
2246+ loc, tok_range, NULL);
2247+ }
2248+
2249+ encoding_rich_location rich_loc (pfile, loc);
2250+
2251 /* Make sure that the token is printed using UCNs, even
2252 if we'd otherwise happily print UTF-8. */
2253 unsigned char *buf = XNEWVEC (unsigned char, cpp_token_len (token));
2254@@ -1810,11 +1836,11 @@ warn_about_normalization (cpp_reader *pf
2255
2256 sz = cpp_spell_token (pfile, token, buf, false) - buf;
2257 if (NORMALIZE_STATE_RESULT (s) == normalized_C)
2258- cpp_warning_with_line (pfile, CPP_W_NORMALIZE, token->src_loc, 0,
2259- "`%.*s' is not in NFKC", (int) sz, buf);
2260+ cpp_warning_at (pfile, CPP_W_NORMALIZE, &rich_loc,
2261+ "`%.*s' is not in NFKC", (int) sz, buf);
2262 else
2263- cpp_warning_with_line (pfile, CPP_W_NORMALIZE, token->src_loc, 0,
2264- "`%.*s' is not in NFC", (int) sz, buf);
2265+ cpp_warning_at (pfile, CPP_W_NORMALIZE, &rich_loc,
2266+ "`%.*s' is not in NFC", (int) sz, buf);
2267 free (buf);
2268 }
2269 }
2270diff --git a/libcpp/line-map.c b/libcpp/line-map.c
2271--- a/libcpp/line-map.c 2021-07-27 23:55:08.716307283 -0700
2272+++ b/libcpp/line-map.c 2021-12-14 01:16:01.561942921 -0800
2273@@ -2086,7 +2086,8 @@ rich_location::rich_location (line_maps
2274 m_fixit_hints (),
2275 m_seen_impossible_fixit (false),
2276 m_fixits_cannot_be_auto_applied (false),
2277- m_path (NULL)
2278+ m_path (NULL),
2279+ m_escape_on_output (false)
2280 {
2281 add_range (loc, SHOW_RANGE_WITH_CARET, label);
2282 }