path: root/lib/libc/amd64
Commit message | Author | Age | Files | Lines
* libc, libthr: Ditch MD __pthread_distribute_static_tls helpers | Jessica Clarke | 2025-05-29 | 1 | -44/+0
  _libc_get_static_tls_base() is just _tcb_get() followed by adding (for Variant I) or subtracting (for Variant II) the offset, so just inline that as the implementation (like we do in rtld-elf) rather than having another copy (or equivalent) of _tcb_get()'s assembly. _get_static_tls_base() doesn't even have any MD assembly as it's reading thr->tcb, the only difference is whether to add or subtract, so again just inline that.

  Whilst here add some missing blank lines to comply with style(9) for elf_utils.c's includes, and use a pointer type rather than uintptr_t to reduce the need to cast, as is done in rtld-elf.

  Reviewed by: kib
  Differential Revision: https://reviews.freebsd.org/D50592
* lib/libc/amd64/string: fix overread condition in memccpy | Robert Clausecker | 2024-07-29 | 1 | -56/+57
  An overread condition in memccpy(dst, src, c, len) would occur if src does not cross a 16 byte boundary and there is no instance of c between *src and the next 16 byte boundary. This could cause a read fault if src is just before the end of a page and the next page is unmapped or unreadable.

  The bug is a consequence of basing memccpy() on the strlcpy() code: whereas strlcpy() assumes that src is a nul-terminated string and hence a terminator is always present, c may not be present at all in the source string. It was not caught earlier due to insufficient unit test design.

  As a part of the fix, the function is refactored such that the runt case (buffer length from last alignment boundary between 1 and 32 B) is handled separately. This reduces the number of conditional branches on all code paths and simplifies the handling of early matches in the non-runt case. Performance is improved slightly.

  os: FreeBSD
  arch: amd64
  cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
          │ memccpy.unfixed.out │          memccpy.fixed.out          │
          │       sec/op        │   sec/op     vs base                │
  Short           66.76µ ± 0%     62.45µ ± 1%  -6.44% (p=0.000 n=20)
  Mid             7.938µ ± 0%     7.967µ ± 0%  +0.36% (p=0.001 n=20)
  Long            3.577µ ± 0%     3.577µ ± 0%       ~ (p=0.429 n=20)
  geomean         12.38µ          12.12µ       -2.08%

          │ memccpy.unfixed.out │          memccpy.fixed.out          │
          │         B/s         │     B/s      vs base                │
  Short          1.744Gi ± 0%    1.864Gi ± 1%  +6.89% (p=0.000 n=20)
  Mid            14.67Gi ± 0%    14.61Gi ± 0%  -0.36% (p=0.001 n=20)
  Long           32.55Gi ± 0%    32.55Gi ± 0%       ~ (p=0.429 n=20)
  geomean        9.407Gi         9.606Gi       +2.12%

  Reported by: getz
  Reviewed by: getz
  Approved by: mjg (blanket, via IRC)
  See also: D46051
  MFC: stable/14
  Event: GSoC 2024
  Differential Revision: https://reviews.freebsd.org/D46052
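
The overread described above can be provoked with a guard page. The following is only a sketch of such a check, not the unit test from the FreeBSD tree: the function name `memccpy_guard_page_check` and the 64-byte limit are made up for illustration. It places a short source buffer at the very end of a readable page, makes the following page inaccessible, and calls memccpy() with a byte that does not occur in the source.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON
#endif

/*
 * Hypothetical regression check (name and limits are illustrative).
 * Returns 1 if memccpy() reported "not found" and copied all len bytes,
 * 0 on an unexpected result, -1 if the mappings could not be set up.
 * An implementation that reads past src + len faults instead of returning.
 */
int
memccpy_guard_page_check(size_t len)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	unsigned char dst[64];

	if (pagesize <= 0 || len == 0 || len > sizeof(dst) ||
	    len > (size_t)pagesize)
		return (-1);

	/* two adjacent pages; the second one becomes the guard page */
	unsigned char *map = mmap(NULL, 2 * (size_t)pagesize,
	    PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return (-1);
	if (mprotect(map + pagesize, (size_t)pagesize, PROT_NONE) != 0) {
		munmap(map, 2 * (size_t)pagesize);
		return (-1);
	}

	/* the source ends exactly at the page boundary and contains no 'z' */
	unsigned char *src = map + pagesize - len;
	memset(src, 'a', len);

	void *end = memccpy(dst, src, 'z', len);
	int ok = end == NULL && memcmp(dst, src, len) == 0;

	munmap(map, 2 * (size_t)pagesize);
	return (ok);
}
```

Calling the check with lengths below and above 16 and 32 exercises both the runt and the non-runt paths mentioned in the commit message.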
* Remove residual blank line at start of Makefile | Warner Losh | 2024-07-15 | 2 | -2/+0
  This is a residual of the $FreeBSD$ removal.

  MFC After: 3 days (though I'll just run the command on the branches)
  Sponsored by: Netflix
* include: ssp: round out fortification of current set of headers | Kyle Evans | 2024-07-13 | 1 | -0/+2
  ssp/ssp.h needed some improvements:
  - `len` isn't always a size_t, it may need to be cast
  - In some cases we may want to use a len that isn't specified as a parameter (e.g., L_ctermid), so __ssp_redirect() should be more flexible.
  - In other cases we may want additional checking, so pull all of the declaration bits out of __ssp_redirect_raw() so that some functions can implement the body themselves.

  strlcat/strlcpy should be the last of the fortified functions that get their own __*_chk symbols, and these cases are only done to be consistent with the rest of the str*() set.

  Reviewed by: markj
  Sponsored by: Klara, Inc.
  Sponsored by: Stormshield
  Differential Revision: https://reviews.freebsd.org/D45679
* Prepare the system for _FORTIFY_SOURCE | Kyle Evans | 2024-05-13 | 4 | -0/+8
  Notably:
  - libc needs to #undef some of the macros from ssp/* for underlying implementations
  - ssp/* wants a __RENAME() macro (snatched more or less from NetBSD)

  There's some extra hinkiness included for read(), since libc spells it as "_read" while the rest of the world spells it "read."

  Reviewed by: imp, ngie
  Sponsored by: Stormshield
  Sponsored by: Klara, Inc.
  Differential Revision: https://reviews.freebsd.org/D32307
* lib{c,sys}: return wrapped syscall APIs to libc | Brooks Davis | 2024-03-13 | 1 | -0/+3
  These provide standard APIs, but are implemented using another system call (e.g., pipe implemented in terms of pipe2) or are interposed by the threading library to support cancelation.

  After discussion with kib (see D44111), I've concluded that it is better to keep most public interfaces in libc with as little as possible in libsys.

  Reviewed by: kib
  Differential Revision: https://reviews.freebsd.org/D44241
* libc: move MD sys related symbols to libsys | Brooks Davis | 2024-02-05 | 1 | -19/+0
  This is a mix of genuine MD interfaces and compat symbols like _getlogin.

  Reviewed by: kib, emaste, imp
  Pull Request: https://github.com/freebsd/freebsd-src/pull/908
* libc: move rfork_thread(3) to libsys | Brooks Davis | 2024-02-05 | 2 | -92/+1
  rfork_thread(3) is assembly that makes syscalls directly and uses cerror so it belongs in libsys.

  Reviewed by: kib, emaste, imp
  Pull Request: https://github.com/freebsd/freebsd-src/pull/908
* libc: Move per-arch sys/Makefile.inc to libsys | Brooks Davis | 2024-02-05 | 1 | -7/+0
  libc/<arch>/sys/Makefile.inc -> libsys/<arch>/Makefile.sys.

  Require that libsys/<arch>/Makefile.sys exist. At least for current architectures, it's not possible for an architecture to not have any MD syscall bits.

  powerpcspe/Makefile.sys's structure means it had to be modified when moved so rename detection won't work, but it has trivial contents so the history is unimportant.

  Reviewed by: kib, emaste, imp
  Pull Request: https://github.com/freebsd/freebsd-src/pull/908
* libc: move remaining x86 sys bits to libsys | Brooks Davis | 2024-02-05 | 4 | -252/+0
  Reviewed by: kib, emaste, imp
  Pull Request: https://github.com/freebsd/freebsd-src/pull/908
* libsys: relocate implementations and manpages | Brooks Davis | 2024-02-05 | 4 | -205/+0
  Move core system call implementations and documentation to lib/libsys and lib/libsys/<arch> from lib/libc/sys and lib/libc/<arch>/<sys>. Update paths to allow libc to find them in their new home.

  Reviewed by: kib, emaste, imp
  Pull Request: https://github.com/freebsd/freebsd-src/pull/908
* libc/amd64: Disable ASAN for amd64_archlevel.c | Mark Johnston | 2024-01-28 | 1 | -0/+6
  The code in this file runs before the sanitizer can initialize its shadow map.

  Fixes: ad2fac552c3f ("lib/libc/amd64: add archlevel-based simd dispatch framework")
* lib/libc/amd64/string: add memrchr() scalar, baseline implementation | Robert Clausecker | 2023-12-25 | 2 | -0/+167
  The scalar implementation is fairly simplistic and only performs slightly better than the generic C implementation. It could be improved by using the same algorithm as for memchr, but it would have been a lot more complicated.

  The baseline implementation is similar to timingsafe_memcmp. It's slightly slower than memchr() due to the more complicated main loop, but I don't think that can be significantly improved.

  Tested by: developers@, exp-run
  Approved by: mjg
  MFC after: 1 month
  MFC to: stable/14
  PR: 275785
  Differential Revision: https://reviews.freebsd.org/D42925
* lib/libc/amd64/string: implement strncat() by calling strlen(), memccpy() | Robert Clausecker | 2023-12-25 | 2 | -0/+30
  This picks up the accelerated implementation of memccpy().

  Tested by: developers@, exp-run
  Approved by: mjg
  MFC after: 1 month
  MFC to: stable/14
  PR: 275785
  Differential Revision: https://reviews.freebsd.org/D42902
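
In C terms, the composition named in the subject line might look like the sketch below. The committed code is assembly; `sketch_strncat` is a made-up name used only for illustration.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>

/* Illustrative only: strncat() expressed through strlen() and memccpy(). */
char *
sketch_strncat(char *restrict dst, const char *restrict src, size_t n)
{
	char *tail = dst + strlen(dst);

	/*
	 * memccpy() copies up to and including the first NUL of src, but at
	 * most n bytes.  A NULL result means no NUL was among those n bytes,
	 * so the terminator has to be written explicitly.
	 */
	if (memccpy(tail, src, '\0', n) == NULL)
		tail[n] = '\0';

	return (dst);
}
```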
* lib/libc/amd64/string: add memccpy scalar, baseline implementation | Robert Clausecker | 2023-12-25 | 2 | -0/+260
  Based on the strlcpy code from D42863, this patch adds a SIMD-enhanced implementation of memccpy for amd64. A scalar implementation calling into memchr and memcpy to do the job is provided, too.

  Please note that this code does not behave exactly the same as the C implementation of memccpy for overlapping inputs. However, overlapping inputs are not allowed for this function by ISO/IEC 9899:1999 and neither has the C implementation any code to deal with the possibility. It just proceeds byte-by-byte, which may or may not do the expected thing for some overlaps. We do not document whether overlapping inputs are supported in memccpy(3).

  Tested by: developers@, exp-run
  Approved by: mjg
  MFC after: 1 month
  MFC to: stable/14
  PR: 275785
  Differential Revision: https://reviews.freebsd.org/D42902
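
The scalar strategy described above (find the byte with memchr, then copy with memcpy) can be written in C roughly as follows. This is a sketch under the assumption of non-overlapping buffers, with the made-up name `sketch_memccpy`; it is not the committed assembly.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative only: memccpy() expressed through memchr() and memcpy(). */
void *
sketch_memccpy(void *restrict dst, const void *restrict src, int c, size_t len)
{
	const unsigned char *hit = memchr(src, c, len);

	if (hit != NULL) {
		/* copy up to and including the first occurrence of c */
		size_t n = (size_t)(hit - (const unsigned char *)src) + 1;

		memcpy(dst, src, n);
		return ((unsigned char *)dst + n);
	}

	/* c does not occur: copy everything and report "not found" */
	memcpy(dst, src, len);
	return (NULL);
}
```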
* lib/libc/amd64/string: implement strlcat() through strlcpy() | Robert Clausecker | 2023-12-25 | 2 | -0/+26
  This should pick up our optimised memchr(), strlen(), and strlcpy() when strlcat() is called.

  Tested by: developers@, exp-run
  Approved by: mjg
  MFC after: 1 month
  MFC to: stable/14
  PR: 275785
  Differential Revision: https://reviews.freebsd.org/D42863
* lib/libc/amd64/string: add strlcpy scalar, baseline implementation | Robert Clausecker | 2023-12-25 | 2 | -0/+282
  Somewhat similar to stpncpy, but different in that we need to compute the full source length even if the buffer is shorter than the source.

  strlcat is implemented as a simple wrapper around strlcpy. The scalar implementation of strlcpy just calls into strlen() and memcpy() to do the job.

  Perf-wise we're very close to stpncpy. The code is slightly slower as it needs to carry on with finding the source string length even if the buffer ends before the string.

  Sponsored by: The FreeBSD Foundation
  Tested by: developers@, exp-run
  Approved by: mjg
  MFC after: 1 month
  MFC to: stable/14
  PR: 275785
  Differential Revision: https://reviews.freebsd.org/D42863
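
A C rendering of the two compositions mentioned here, strlcpy through strlen() and memcpy() and strlcat as a wrapper around strlcpy, could look like this. The `sketch_` names are invented for illustration, and the use of memchr() to find the end of the destination is an assumption based on the strlcat commit message above.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative only: strlcpy() through strlen() and memcpy(). */
size_t
sketch_strlcpy(char *restrict dst, const char *restrict src, size_t size)
{
	size_t len = strlen(src);	/* the full length is always needed */

	if (size != 0) {
		size_t n = len < size ? len : size - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}

	return (len);
}

/* Illustrative only: strlcat() as a wrapper around the function above. */
size_t
sketch_strlcat(char *restrict dst, const char *restrict src, size_t size)
{
	const char *end = memchr(dst, '\0', size);

	/* no terminator within size bytes: nothing can be appended */
	if (end == NULL)
		return (size + strlen(src));

	size_t dlen = (size_t)(end - dst);

	return (dlen + sketch_strlcpy(dst + dlen, src, size - dlen));
}
```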
* lib/libc/amd64/string/strcat.S: enable use of SIMD | Robert Clausecker | 2023-12-25 | 1 | -5/+42
  strcat has a bespoke scalar assembly implementation we inherited from NetBSD. While it performs well, it is better to call into our SIMD implementations if any SIMD features are available at all. So do that and implement strcat() by calling into strlen() and strcpy() if these are available.

  Sponsored by: The FreeBSD Foundation
  Tested by: developers@, exp-run
  Approved by: mjg
  MFC after: 1 month
  MFC to: stable/14
  PR: 275785
  Differential Revision: https://reviews.freebsd.org/D42600
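
Expressed in C, the strlen() plus strcpy() composition is a two-liner. The name `sketch_strcat` is made up, and the SIMD dispatch the commit adds is not modelled here.

```c
#include <string.h>

/* Illustrative only: strcat() through strlen() and strcpy(). */
char *
sketch_strcat(char *restrict dst, const char *restrict src)
{
	/* find the end of dst, then let strcpy() do the copying */
	strcpy(dst + strlen(dst), src);
	return (dst);
}
```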
* lib/libc/amd64/string: implement strncpy() by calling stpncpy() | Robert Clausecker | 2023-12-25 | 2 | -0/+42
  Sponsored by: The FreeBSD Foundation
  Tested by: developers@, exp-run
  Approved by: mjg
  MFC after: 1 month
  MFC to: stable/14
  PR: 275785
  Differential Revision: https://reviews.freebsd.org/D42519
* lib/libc/amd64/string: add stpncpy scalar, baseline implementation | Robert Clausecker | 2023-12-25 | 2 | -0/+284
  This was surprisingly annoying to get right, despite being such a simple function. A scalar implementation is also provided; it just calls into our optimised memchr(), memcpy(), and memset() routines to carry out its job.

  I'm quite happy with the performance. glibc only beats us for very long strings, likely due to the use of AVX-512. The scalar implementation just calls into our optimised memchr(), memcpy(), and memset() routines, so it has a high overhead to begin with but then performs ok for the amount of effort that went into it. Still beats the old C code, except for very short strings.

  Sponsored by: The FreeBSD Foundation
  Tested by: developers@, exp-run
  Approved by: mjg
  MFC after: 1 month
  MFC to: stable/14
  PR: 275785
  Differential Revision: https://reviews.freebsd.org/D42519
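
The scalar decomposition into memchr(), memcpy(), and memset() might be written in C as below, together with the strncpy() wrapper from the preceding commit. Both function names are hypothetical and the code is a sketch, not the committed assembly.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative only: stpncpy() through memchr(), memcpy(), and memset(). */
char *
sketch_stpncpy(char *restrict dst, const char *restrict src, size_t n)
{
	const char *nul = memchr(src, '\0', n);
	size_t len = nul != NULL ? (size_t)(nul - src) : n;

	memcpy(dst, src, len);			/* the string proper */
	memset(dst + len, '\0', n - len);	/* NUL padding, possibly empty */

	/* points at the first NUL written, or at dst + n if there is none */
	return (dst + len);
}

/* Illustrative only: strncpy() merely discards stpncpy()'s end pointer. */
char *
sketch_strncpy(char *restrict dst, const char *restrict src, size_t n)
{
	sketch_stpncpy(dst, src, n);
	return (dst);
}
```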
* lib/libc/amd64/string: implement strsep() through strcspn() | Robert Clausecker | 2023-12-25 | 2 | -0/+58
  The strsep() function is basically strcspn() with extra steps. On amd64, we now have an optimised implementation of strcspn(), so instead of implementing the inner loop manually, just call into the optimised routine.

  Sponsored by: The FreeBSD Foundation
  Tested by: developers@, exp-run
  Approved by: mjg
  MFC after: 1 month
  MFC to: stable/14
  PR: 275785
  Differential Revision: https://reviews.freebsd.org/D42346
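
The "extra steps" around strcspn() can be sketched in C as follows. The name `sketch_strsep` is invented and the code only illustrates the idea.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative only: strsep() through strcspn(). */
char *
sketch_strsep(char **stringp, const char *delim)
{
	char *s = *stringp;

	if (s == NULL)
		return (NULL);

	/* length of the token, i.e. the offset of the first delimiter */
	size_t n = strcspn(s, delim);

	if (s[n] == '\0') {
		*stringp = NULL;	/* last token, nothing left */
	} else {
		s[n] = '\0';		/* cut the token off */
		*stringp = s + n + 1;	/* continue after the delimiter */
	}

	return (s);
}
```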
* lib/libc/amd64/string: add strrchr scalar, baseline implementation | Robert Clausecker | 2023-12-25 | 2 | -0/+210
  The baseline implementation is very straightforward, while the scalar implementation suffers from register pressure and the need to use SWAR techniques similar to those used for strchr().

  Sponsored by: The FreeBSD Foundation
  Tested by: developers@, exp-run
  Approved by: mjg
  MFC after: 1 month
  MFC to: stable/14
  PR: 275785
  Differential Revision: https://reviews.freebsd.org/D42217
* lib/libc/amd64/string: add strncmp scalar, baseline implementation | Robert Clausecker | 2023-12-25 | 2 | -0/+489
  The scalar implementation is fairly straightforward and merely unrolled four times. The baseline implementation closely follows D41971 with appropriate extensions and extra code paths to pay attention to string length.

  Performance is quite good. We beat both glibc (except for very long strings, but they likely use AVX which we don't) and Bionic (except for medium-sized aligned strings, where we are still in the same ballpark).

  Sponsored by: The FreeBSD Foundation
  Tested by: developers@, exp-run
  Approved by: mjg
  MFC after: 1 month
  MFC to: stable/14
  PR: 275785
  Differential Revision: https://reviews.freebsd.org/D42122
* lib/libc/amd64/string: implement strpbrk() through strcspn() | Robert Clausecker | 2023-12-25 | 3 | -8/+54
  This lets us use our optimised strcspn() routine for strpbrk() calls.

  Sponsored by: The FreeBSD Foundation
  Tested by: developers@, exp-run
  Approved by: mjg
  MFC after: 1 month
  MFC to: stable/14
  PR: 275785
  Differential Revision: https://reviews.freebsd.org/D41980
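
In C, the reduction of strpbrk() to strcspn() is short. `sketch_strpbrk` is a made-up name for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative only: strpbrk() through strcspn(). */
char *
sketch_strpbrk(const char *s, const char *set)
{
	/* offset of the first character of s that is in set, or strlen(s) */
	size_t n = strcspn(s, set);

	return (s[n] != '\0' ? (char *)s + n : NULL);
}
```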
* lib/libc/amd64/string/strcmp.S: add baseline implementation | Robert Clausecker | 2023-12-25 | 1 | -7/+292
  This is the most complicated one so far. The basic idea is to process the bulk of the string in aligned blocks of 16 bytes such that one string runs ahead and the other runs behind. The string that runs ahead is checked for NUL bytes, the one that runs behind is compared with the corresponding chunk of the string that runs ahead. This trades an extra load per iteration for the very complicated block-reassembly needed in the other implementations (bionic, glibc). On the flip side, we need two code paths depending on the relative alignment of the two buffers.

  The initial part of the string is compared directly if it is known not to cross a page boundary. Otherwise, a complex slow path to avoid crossing into unmapped memory commences.

  Performance-wise we beat bionic for misaligned strings (i.e. the strings do not share an alignment offset) and reach comparable performance for aligned strings. glibc is a bit better as it has a special kernel for AVX-512, where this stuff is a bit easier to do.

  Sponsored by: The FreeBSD Foundation
  Tested by: developers@, exp-run
  Approved by: mjg
  MFC after: 1 month
  MFC to: stable/14
  PR: 275785
  Differential Revision: https://reviews.freebsd.org/D41971
* lib/libc/amd64/string/strcspn.S: always return earliest match in 17--32 char case | Robert Clausecker | 2023-12-21 | 1 | -3/+24
  When matching against a set of 17--32 characters, strcspn() uses two invocations of PCMPISTRI to match against the first 16 characters of the set and then the remaining characters. If a match was found in the first half of the set, the code originally immediately returned that match. However, it is possible for a match in the second half of the set to occur earlier in the vector, leading to that match being overlooked.

  Fix the code by checking if there is a match in the second half of the set and taking the earlier of the two matches.

  The correctness of the function has been verified with extended unit tests and test runs against the glibc test suite.

  Approved by: mjg (implicit, via IRC)
  MFC after: 1 week
  MFC to: stable/14
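
The fix can be modelled in scalar C. In the sketch below, `first_match16` is a hypothetical stand-in for one PCMPISTRI invocation that returns the index of the first matching byte of a 16-byte chunk, or 16 if there is none. Both helper names are made up, and the model assumes that the chunk contains no NUL byte.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical scalar stand-in for one PCMPISTRI invocation: index of the
 * first byte of chunk[0..15] that occurs in set[0..setlen-1], or 16 if no
 * byte matches.  Assumes the chunk contains no NUL byte.
 */
static size_t
first_match16(const unsigned char chunk[16], const unsigned char *set,
    size_t setlen)
{
	for (size_t i = 0; i < 16; i++)
		if (memchr(set, chunk[i], setlen) != NULL)
			return (i);

	return (16);
}

/*
 * Match one chunk against a set of 17 to 32 characters split into two
 * halves.  Returning lo as soon as it is below 16 would reproduce the bug:
 * a match against the second half may sit earlier in the chunk.
 */
size_t
first_match_split_set(const unsigned char chunk[16], const unsigned char *set,
    size_t setlen)
{
	size_t lo = first_match16(chunk, set, 16);
	size_t hi = first_match16(chunk, set + 16, setlen - 16);

	return (lo < hi ? lo : hi);
}
```

With the chunk "0123456789abcdef" and the 17-character set "abcdefghijklmnop3", the first half matches at index 10 ('a') while the second half matches at index 3 ('3'); the earlier of the two is the result strcspn() needs.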
* {amd64,i386}/SYS.h: add _SYSCALL and _SYSCALL_BODY | Brooks Davis | 2023-12-18 | 4 | -14/+16
  Add a _SYSCALL(name) which calls the SYS_name syscall. Use it to add a _SYSCALL_BODY() macro which invokes the syscall and calls cerror as required. Use the latter to implement PSEUDO() and RSYSCALL().

  Reviewed by: imp, markj
  Sponsored by: DARPA
  Differential Revision: https://reviews.freebsd.org/D43059
* libc: don't needlessly add vfork.o to NOASM | Brooks Davis | 2023-12-06 | 1 | -3/+0
  For architectures where vfork.S was named Ovfork.S this was needed, but it was always pointless here as an entry in either MDASM or NOASM is equivalent.

  Reviewed by: kib
  Sponsored by: DARPA
  Differential Revision: https://reviews.freebsd.org/D42914
* Remove never implemented sbrk and sstk syscalls | Brooks Davis | 2023-12-04 | 1 | -1/+1
  Both system calls were stubs returning EOPNOTSUPP and libc did not provide _ or __sys_ prefixed symbols. The actual implementation of sbrk(2) is on top of the undocumented break(2) system call.

  Technically this is a change in ABI, but no non-contrived program ever called these syscalls.

  Reviewed by: kib, emaste
  Sponsored by: DARPA
  Differential Revision: https://reviews.freebsd.org/D42872
* lib: Remove ancient SCCS tags. | Warner Losh | 2023-11-27 | 8 | -18/+0
  Remove ancient SCCS tags from the tree, automated scripting, with two minor fixups to keep things compiling. All the common forms in the tree were removed with a perl script.

  Sponsored by: Netflix
* libc: centralize a few numeric symbols | Brooks Davis | 2023-11-15 | 1 | -3/+0
  fabs, __infinity, and __nan are universally implemented so declare them in gen/Symbol.map.

  We would also include __flt_rounds, but it's under FBSD_1.3 on arm so until that's gone we're stuck with it. Likewise, everyone but i386 implements fp[gs]etmask.

  Reviewed by: imp, kib, emaste
  Differential Revision: https://reviews.freebsd.org/D42618
* libc: centralize makecontext symbols | Brooks Davis | 2023-11-15 | 1 | -2/+0
  Declare makecontext() and __makecontext() symbols centrally as they are always implemented.

  Reviewed by: imp, kib
  Differential Revision: https://reviews.freebsd.org/D42617
* libc: centralize {_,sig,}{set,long}jmp symbols | Brooks Davis | 2023-11-15 | 1 | -6/+0
  These symbols are universally exposed and documented so declare them centrally. Double- and triple-underscore versions exist on some platforms, but leave those alone for now.

  Reviewed by: imp, kib
  Differential Revision: https://reviews.freebsd.org/D42616
* libc: centralize ntoh symbols | Brooks Davis | 2023-11-15 | 1 | -4/+0
  These are implemented by net/ntoh.c via headers and compiler intrinsics so declare them in net/Symbol.map.

  Reviewed by: imp, kib, emaste
  Differential Revision: https://reviews.freebsd.org/D42615
* libc: further centralize syscall symbols | Brooks Davis | 2023-11-15 | 1 | -4/+0
  All architectures necessarily implement _exit(2) and vfork(2) so declare them in sys/Symbol.map.

  Reviewed by: imp, kib, emaste
  Differential Revision: https://reviews.freebsd.org/D42614
* libc: Remove empty comments in Symbol.map | Brooks Davis | 2023-11-15 | 1 | -3/+0
  These were left over from $FreeBSD$ removal.

  Reviewed by: emaste
  Differential Revision: https://reviews.freebsd.org/D42612
* libc/<arch>/sys/Makefile.inc: remove cruft | Brooks Davis | 2023-11-15 | 1 | -2/+0
  Remove stray blank lines left over from $FreeBSD$ removal as well as some CVS-era (perhaps pre-repocopy) version comments.

  Reviewed by: emaste
  Differential Revision: https://reviews.freebsd.org/D42611
* libc: Purge unneeded cdefs.h | Warner Losh | 2023-11-01 | 12 | -12/+0
  These sys/cdefs.h are not needed. Purge them. They are mostly left-over from the $FreeBSD$ removal. A few in libc are still required for macros that cdefs.h defines. Keep those.

  Sponsored by: Netflix
  Differential Revision: https://reviews.freebsd.org/D42385
* lib/libc/amd64/string: add timingsafe_memcmp() assembly implementation | Robert Clausecker | 2023-10-15 | 2 | -2/+147
  Conceptually very similar to timingsafe_bcmp(), but with comparison logic inspired by Elijah Stone's fancy memcmp. A baseline (SSE) implementation was omitted this time as I was not able to get it to perform adequately. Best I got was 8% over the scalar version for long inputs, but slower for short inputs.

  Sponsored by: The FreeBSD Foundation
  Approved by: security (cperciva)
  Inspired by: https://github.com/moon-chilled/fancy-memcmp
  Differential Revision: https://reviews.freebsd.org/D41696
* lib/libc/amd64/string: add timingsafe_bcmp(3) scalar, baseline implementations | Robert Clausecker | 2023-10-15 | 2 | -1/+234
  Very straightforward and similar to memcmp(3). The code has been written to use only instructions specified as having data operand independent timing by Intel.

  Sponsored by: The FreeBSD Foundation
  Approved by: security (cperciva)
  Differential Revision: https://reviews.freebsd.org/D41673
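
For readers unfamiliar with the idea, the usual portable shape of a timing-safe equality check is shown below. This is a generic sketch with the made-up name `sketch_timingsafe_bcmp`, not the committed assembly. A C compiler gives no guarantee about the timing of the generated instructions, which is one reason to write such routines in assembly.

```c
#include <stddef.h>

/*
 * Illustrative only: accumulate the differences of all byte pairs instead
 * of returning at the first mismatch, so that the amount of work does not
 * depend on where (or whether) the buffers differ.
 * Returns 0 if the buffers are equal and 1 otherwise.
 */
int
sketch_timingsafe_bcmp(const void *a, const void *b, size_t n)
{
	const unsigned char *p = a, *q = b;
	unsigned char acc = 0;

	for (size_t i = 0; i < n; i++)
		acc |= p[i] ^ q[i];

	return (acc != 0);
}
```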
* lib/libc/amd64/string/memcmp.S: harden against phony buffer lengths | Robert Clausecker | 2023-09-16 | 1 | -1/+16
  When memcmp(a, b, len) (or equally, bcmp) is called with a phony length such that a + len < a, the code would malfunction and not compare the two buffers correctly. While such arguments are illegal (buffers do not wrap around the end of the address space), it is nevertheless conceivable that people try things like memcmp(a, b, SIZE_MAX) to compare a and b until the first mismatch, in the knowledge that such a mismatch exists, expecting memcmp() to stop comparing somewhere around the mismatch. While memcmp() is usually written to conform to this assumption, no version of ISO/IEC 9899 guarantees this behaviour (in contrast to memchr() for which it is).

  Nevertheless it appears sensible to at least not grossly misbehave on phony lengths. This change hardens memcmp() against this case by comparing at least until the end of the address space if a + len overflows a 64 bit integer.

  Sponsored by: The FreeBSD Foundation
  Approved by: mjg (blanket, via IRC)
  See also: b2618b651b28fd29e62a4e285f5be09ea30a85d4
  MFC after: 1 week
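
The overflow handling described above amounts to a saturating addition on the end address. A minimal C sketch, with the hypothetical helper name `clamped_end` and the assumption that uintptr_t is as wide as size_t, is:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Illustrative only: compute start + len, saturating at the highest
 * address if the sum wraps around the end of the address space.
 */
uintptr_t
clamped_end(uintptr_t start, size_t len)
{
	uintptr_t end = start + (uintptr_t)len;	/* wraps on overflow */

	return (end < start ? UINTPTR_MAX : end);
}
```

A caller passing SIZE_MAX as the length then gets an end address at the top of the address space rather than one that lies before the start of the buffer.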
* lib/libc/amd64/string/strcspn.S: fix behaviour with sets of 17--32 | Robert Clausecker | 2023-09-12 | 1 | -10/+15
  When a string is matched against a set of 17--32 characters, each chunk of the string is matched first against the first 16 characters of the set and then against the remaining characters. We also check at the same time if the string has a nul byte in the current chunk, terminating the search if it does.

  Due to misconceived logic, the order of checks was "first half of set, nul byte, second half of set", meaning that a match with the second half of the set was ignored when the string ended in the same 16 bytes.

  Reverse the order of checks to fix this problem.

  Sponsored by: The FreeBSD Foundation
  Approved by: mjg (blanket, via IRC)
  MFC after: 1 week
  MFC to: stable/14
* lib/libc/amd64/string/memchr.S: fix behaviour with overly long buffers | Robert Clausecker | 2023-09-10 | 1 | -3/+6
  When memchr(buf, c, len) is called with a phony len (say, SIZE_MAX), buf + len overflows and we have buf + len < buf. This confuses the implementation and makes it return incorrect results. Nevertheless we must support this case as memchr() is guaranteed to work even with phony buffer lengths, as long as a match is found before the buffer actually ends.

  Sponsored by: The FreeBSD Foundation
  Reported by: yuri, des
  Tested by: des
  Approved by: mjg (blanket, via IRC)
  MFC after: 1 week
  MFC to: stable/14
  PR: 273652
* lib/libc/amd64/string: implement strnlen(3) through memchr(3) | Robert Clausecker | 2023-09-08 | 2 | -1/+44
  Now that we have an optimised memchr(3), we can use it to implement strnlen(3) with better performance.

  Sponsored by: The FreeBSD Foundation
  Approved by: mjg
  MFC after: 1 week
  MFC to: stable/14
  Differential Revision: https://reviews.freebsd.org/D41598
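
In C, the reduction of strnlen(3) to memchr(3) looks like this. `sketch_strnlen` is a made-up name for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative only: strnlen() through memchr(). */
size_t
sketch_strnlen(const char *s, size_t maxlen)
{
	const char *nul = memchr(s, '\0', maxlen);

	/* no terminator within maxlen bytes: the length is capped */
	return (nul != NULL ? (size_t)(nul - s) : maxlen);
}
```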
* lib/libc/amd64/string: add memchr(3) scalar, baseline implementation | Robert Clausecker | 2023-09-08 | 2 | -0/+205
  This is conceptually similar to strchr(3), but there are slight changes to account for the buffer having an explicit buffer length.

  Sponsored by: The FreeBSD Foundation
  Approved by: mjg
  MFC after: 1 week
  MFC to: stable/14
  Differential Revision: https://reviews.freebsd.org/D41598
* lib/libc/amd64/string: add strspn(3) scalar, x86-64-v2 implementation | Robert Clausecker | 2023-09-08 | 2 | -1/+360
  This is conceptually very similar to the strcspn(3) implementations from D41557, but we can't do the fast paths the same way.

  Sponsored by: The FreeBSD Foundation
  Approved by: mjg
  MFC after: 1 week
  MFC to: stable/14
  Differential Revision: https://reviews.freebsd.org/D41567
* lib/libc/amd64/string: add strcspn(3) scalar, x86-64-v2 implementation | Robert Clausecker | 2023-09-08 | 2 | -0/+369
  This changeset adds both a scalar and an x86-64-v2 implementation of the strcspn(3) function to libc. A baseline implementation does not appear to be feasible given the requirements of the function.

  The scalar implementation is similar to the generic libc implementation, but expands the bit set into a byte set to reduce latency, improving performance. This approach could probably be backported to the generic C version to benefit other platforms.

  The x86-64-v2 implementation is built around the infamous pcmpistri instruction. An alternative implementation based on the Muła/Langdale algorithm [1] was prototyped, but performed worse than the pcmpistri approach except for sets of more than 16 characters with long input strings.

  All implementations provide special cases for the empty set (reduces to strlen) as well as single-character sets (reduces to strchr). The x86-64-v2 kernel falls back to the scalar implementation for sets of more than 32 characters. This limit could be raised by additional multiples of 16 through the use of additional pcmpistri code paths, but I consider this case to be too rare to be of importance.

  [1]: http://0x80.pl/articles/simd-byte-lookup.html

  Sponsored by: The FreeBSD Foundation
  Approved by: mjg
  MFC after: 1 week
  MFC to: stable/14
  Differential Revision: https://reviews.freebsd.org/D41557
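
The byte-set idea mentioned for the scalar implementation can be illustrated in C: each possible byte value gets its own table entry, so membership is a single load without any shifting or masking. This is a sketch with the invented name `sketch_strcspn`; the special cases for empty and single-character sets are left out.

```c
#include <stddef.h>

/* Illustrative only: strcspn() with a 256-entry byte set. */
size_t
sketch_strcspn(const char *s, const char *set)
{
	unsigned char member[256] = { 0 };
	size_t i;

	/* NUL counts as a member so that the scan loop needs only one test */
	member[0] = 1;
	for (; *set != '\0'; set++)
		member[(unsigned char)*set] = 1;

	for (i = 0; !member[(unsigned char)s[i]]; i++)
		;

	return (i);
}
```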
* lib/libc/amd64/string/strchrnul.S: fix edge case in scalar code | Robert Clausecker | 2023-08-25 | 1 | -7/+10
  When the buffer is immediately preceded by the character we are looking for and begins with one higher than that character, and the buffer is misaligned, a match was erroneously detected in the first character. Fix this by changing the way we prevent matches before the buffer from being detected: instead of removing the corresponding bit from the 0x80..80 mask, set the LSB of bytes before the buffer after xoring with the character we look for.

  The bug only affects amd64 with ARCHLEVEL=scalar (cf. simd(7)). The change comes at a 2% performance impact for short strings if ARCHLEVEL is set to scalar. The default configuration is not affected.

  os: FreeBSD
  arch: amd64
  cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
          │ strchrnul.scalar.0.out │       strchrnul.scalar.2.out        │
          │         sec/op         │   sec/op     vs base                │
  Short              57.89µ ± 2%     59.08µ ± 1%  +2.07% (p=0.030 n=20)
  Mid                19.24µ ± 0%     19.73µ ± 0%  +2.53% (p=0.000 n=20)
  Long               11.03µ ± 0%     11.03µ ± 0%       ~ (p=0.547 n=20)
  geomean            23.07µ          23.43µ       +1.53%

          │ strchrnul.scalar.0.out │       strchrnul.scalar.2.out        │
          │          B/s           │     B/s      vs base                │
  Short             2.011Gi ± 2%    1.970Gi ± 1%  -2.02% (p=0.030 n=20)
  Mid               6.049Gi ± 0%    5.900Gi ± 0%  -2.47% (p=0.000 n=20)
  Long              10.56Gi ± 0%    10.56Gi ± 0%       ~ (p=0.547 n=20)
  geomean           5.045Gi         4.969Gi       -1.50%

  MFC to: stable/14
  MFC after: 3 days
  Approved by: mjg (blanket, via IRC)
  Sponsored by: The FreeBSD Foundation
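
The edge case follows from how the classic SWAR zero-byte test propagates borrows. The C sketch below models only the character match (not the NUL check) on one 64-bit word with byte 0 in the least significant bits. All names are made up, and it is a model of the two masking strategies described in the message, not a transcription of the assembly.

```c
#include <stdint.h>

#define ONES	0x0101010101010101ULL
#define HIGHS	0x8080808080808080ULL

/*
 * Classic SWAR test: the lowest zero byte of x is always flagged with its
 * bit 7.  Bytes above a zero byte can be flagged spuriously because the
 * subtraction borrows into them.
 */
static uint64_t
zero_byte_mask(uint64_t x)
{
	return ((x - ONES) & ~x & HIGHS);
}

/* mask covering the `head` (0..7) bytes that precede the buffer */
static uint64_t
before_mask(unsigned head)
{
	return (head != 0 ? ~0ULL >> (64 - 8 * head) : 0);
}

/* old strategy: drop the flags of the bytes before the buffer afterwards */
uint64_t
match_mask_old(uint64_t word, unsigned char c, unsigned head)
{
	uint64_t x = word ^ (ONES * c);

	return (zero_byte_mask(x) & ~before_mask(head));
}

/* fixed strategy: make the bytes before the buffer nonzero beforehand */
uint64_t
match_mask_fixed(uint64_t word, unsigned char c, unsigned head)
{
	uint64_t x = word ^ (ONES * c);

	x |= before_mask(head) & ONES;	/* set the LSB of each such byte */
	return (zero_byte_mask(x));
}
```

Take a word whose byte 0 (before the buffer) is 'b' and whose byte 1 (first byte of the buffer) is 'c', and search for 'b'. After the XOR the two bytes are 0x00 and 0x01; the borrow out of byte 0 turns byte 1 into a false positive that the old strategy does not remove, whereas the fixed strategy never produces the borrow in the first place.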
* lib/libc/amd64/string/memcmp.S: add baseline implementation | Robert Clausecker | 2023-08-21 | 1 | -6/+175
  This changeset adds a baseline implementation of memcmp and bcmp for amd64. The same code is used for both functions with conditional code where the behaviour differs (we need more precise output for the memcmp case).

  FreeBSD documents that memcmp returns the difference between the mismatching characters. Slightly faster code would be possible could we relax this requirement to the ISO/IEC 9899:1999 requirement of merely returning a negative/positive integer or zero.

  Performance is better than bionic and glibc, except for long strings where the two are 13% faster. This could be because they use SSE4 ptest which we cannot use in a baseline kernel.

  Sponsored by: The FreeBSD Foundation
  Approved by: mjg
  Differential Revision: https://reviews.freebsd.org/D41442
* lib/libc/amd64/string/stpcpy.S: add baseline implementation | Robert Clausecker | 2023-08-21 | 2 | -11/+135
  This commit adds a baseline implementation of stpcpy(3) for amd64. It performs quite well in comparison to the previous scalar implementation as well as against bionic and glibc (though glibc is faster for very long strings). Fiddle with the Makefile to also have strcpy(3) call into the optimised stpcpy(3) code, fixing an oversight from D9841.

  Sponsored by: The FreeBSD Foundation
  Reviewed by: imp ngie emaste
  Approved by: mjg kib
  Fixes: D9841
  Differential Revision: https://reviews.freebsd.org/D41349